
# Deep Learning: Model Design

Overfitting Problem

Overfitting: the algorithm performs extremely well on training data, but poorly on unfamiliar data; it can be caused by having too many features in the cost function.
Underfitting, in contrast, suffers from having too few features - not providing the flexibility necessary to fit the data.

Bias-variance tradeoff: the problem of minimizing error whereby lowering bias yields larger variance, and vice versa - preventing learning models from generalizing beyond the training set:
- Bias: error from erroneous assumptions in the learning model; high bias can cause a model to miss relevant relations between features and outputs (underfitting)
- Variance: error from model sensitivity to small fluctuations in the training set; high variance can cause a model to fit the random noise in the training data rather than the intended outputs (overfitting)

$$E\lrbra{\t{MSE}_{\t{test}}} = E[(y_0-\hat{f}(x_0))^2] = \t{var}(\hat{f}(x_0)) + [\t{Bias}(\hat{f}(x_0))]^2 + \t{var}(\epsilon)$$
$$\t{Bias}[\hat{f}(x_0)] = E[\hat{f}(x_0)] - f(x_0)$$
$$\t{var}(\hat{f}(x_0)) = E[(\hat{f}(x_0))^2] - E[\hat{f}(x_0)]^2$$
$$\t{var}(\epsilon) =$$ irreducible error, arising from data noise
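A quick sketch of where the decomposition comes from, assuming $$y_0 = f(x_0) + \epsilon$$ with $$E[\epsilon] = 0$$ and $$\epsilon$$ independent of $$\hat{f}$$ (the cross-terms vanish under these assumptions):
\begin{align} E[(y_0-\hat{f}(x_0))^2] &= E[(f(x_0)+\epsilon-\hat{f}(x_0))^2] = E[(f(x_0)-\hat{f}(x_0))^2] + \t{var}(\epsilon) \\ &= (f(x_0)-E[\hat{f}(x_0)])^2 + E[(\hat{f}(x_0)-E[\hat{f}(x_0)])^2] + \t{var}(\epsilon) \\ &= [\t{Bias}(\hat{f}(x_0))]^2 + \t{var}(\hat{f}(x_0)) + \t{var}(\epsilon) \end{align}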

In the accompanying figure (left to right):
• Left: data are fit with models of varying flexibility - orange (lowest; linear regression), black, and green (highest).
• Middle: at the lowest flexibility, the model is too inflexible to fit the data (underfitting); as flexibility increases, both the training & test MSE decrease; increasing further, the training MSE keeps dropping, but the test MSE rises sharply (overfitting).
• Right: at low flexibility, the MSE is mostly controlled by bias; at higher flexibility, by variance.
• Note that the test MSE is always above $$\t{var}(\epsilon)$$ - hence the error being 'irreducible'.

Learning Curves

$$J_{\t{CV}}$$ and $$J_{\t{train}}$$ vary with training set size $$m$$, both leveling off relative to the desired error depending on the model's bias & variance:

High bias:
$$\t{Low }m\t{:}\ J_{\t{train}}=\t{low},\ J_{\t{CV}}=\t{high}$$
$$\t{High }m\t{:}\ J_{\t{train}}\approx J_{\t{CV}}=\t{high}$$; both continue to level off, with a narrow gap between them

More training data is unlikely to help
High variance:
$$\t{Low }m\t{:}\ J_{\t{train}}=\t{low},\ J_{\t{CV}}=\t{high}$$
$$\t{High }m\t{:}$$ $$J_{\t{train}}$$ continues to increase, $$J_{\t{CV}}$$ continues to decrease, with the gap between them remaining significant

More training data is likely to help
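The curves above can be generated with a short sketch like the following (scikit-learn assumed; LinearRegression stands in for whichever hypothesis is being evaluated, and the variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def learning_curves(X_train, y_train, X_cv, y_cv, sizes):
    """For each i in `sizes`, train on the first i examples;
    J_train is measured on those i examples, J_CV on the full CV set."""
    j_train, j_cv = [], []
    for i in sizes:
        model = LinearRegression().fit(X_train[:i], y_train[:i])
        j_train.append(mean_squared_error(y_train[:i], model.predict(X_train[:i])) / 2)
        j_cv.append(mean_squared_error(y_cv, model.predict(X_cv)) / 2)
    return j_train, j_cv

# Plot j_train and j_cv against `sizes`: a narrow, high plateau suggests
# high bias; a persistent gap with j_cv still decreasing suggests high variance.
```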

Evaluating Hypothesis

A hypothesis must be evaluated for its ability to generalize to unfamiliar examples - which can be accomplished by dividing the data into training, cross-validation & testing sets:

• $$\sim 60\%$$ of data: Training set $$-$$ set for training the algorithm
• $$\sim 20\%$$ of data: Cross-validation set $$-$$ set for optimizing regularization
• $$\sim 20\%$$ of data: Testing set $$-$$ set for final testing of algorithm
- $$J_{\t{test}}$$ does not include the regularization term
- For the learning curves above, $$J_{\t{train}}$$ is evaluated on the first $$i$$ training examples, and $$J_{\t{CV}}$$ on all $$m_{\t{CV}}$$ cross-validation examples
Linear Reg: $$\ds J_{\t{test}}(\bn{\Theta}) = \sfrac{1}{2m_{\t{test}}}\sum_{i=1}^{m_{\t{test}}}(h_{\bn{\Theta}}(x_{\t{test}}^{(i)})-y_{\t{test}}^{(i)})^2$$
Classification: $$\ds J_{\t{test}}(\bn{\Theta}) = \sfrac{1}{m_{\t{test}}}\sum_{i=1}^{m_{\t{test}}}\t{err}(h_{\bn{\Theta}}(x_{\t{test}}^{(i)}),y_{\t{test}}^{(i)})$$
$$\ds \t{err}(h_{\bn{\Theta}}(x^{(i)}),y^{(i)}) = \left\{ \begin{align} & 1\ \ \t{if }h_{\bn{\Theta}}(x^{(i)})\geq 0.5,\ y=0 \\ &\hspace{9px}\t{ or }h_{\bn{\Theta}}(x^{(i)}) < 0.5,\ y=1 \\ & 0\ \ \t{otherwise} \end{align}\right.$$
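A sketch of the split and the two test-cost definitions above (NumPy assumed; `h` is an already-trained hypothesis returning predictions, and the helper names are illustrative):

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle, then split ~60/20/20 into training / cross-validation / test sets."""
    idx = np.random.default_rng(seed).permutation(len(y))
    X, y = X[idx], y[idx]
    i1, i2 = int(0.6 * len(y)), int(0.8 * len(y))
    return (X[:i1], y[:i1]), (X[i1:i2], y[i1:i2]), (X[i2:], y[i2:])

def j_test_linear(h, X_test, y_test):
    """Squared-error test cost, with no regularization term."""
    return np.sum((h(X_test) - y_test) ** 2) / (2 * len(y_test))

def j_test_classification(h, X_test, y_test):
    """Misclassification-rate test cost for a 0/1 classifier at threshold 0.5."""
    return np.mean((h(X_test) >= 0.5).astype(int) != y_test)
```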

Feature Mapping

Adding features to enable fitting non-linear data; for polynomial fitting, higher-order terms are added as products of existing terms - e.g., for a degree-6 fit of two features $$x_1, x_2$$:

$$\t{mapFeature}(x) = \lrbra{1 \ \ x_1 \ \ x_2 \ \ x_1^2 \ \ x_1x_2 \ \ x_2^2 \ \ x_1^3 \ \ \cdots \ \ x_1x_2^5 \ \ x_2^6} \Rightarrow$$
\begin{align} & 1,\\ & x_1,\ x_2,\\ & x_1^2,\ x_1x_2,\ x_2^2,\\ & x_1^3,\ x_1^2x_2,\ x_1x_2^2,\ x_2^3,\\ & x_1^4,\ x_1^3x_2,\ x_1^2x_2^2,\ x_1x_2^3,\ x_2^4,\\ & x_1^5,\ x_1^4x_2,\ x_1^3x_2^2,\ x_1^2x_2^3,\ x_1x_2^4,\ x_2^5,\\ & x_1^6,\ x_1^5x_2,\ x_1^4x_2^2,\ x_1^3x_2^3,\ x_1^2x_2^4,\ x_1x_2^5,\ x_2^6 \end{align}
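A small sketch of this mapping in code (illustrative; it reproduces the degree-6 expansion above, ordered by total degree):

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """All polynomial terms x1^i * x2^j with i + j <= degree:
    1, x1, x2, x1^2, x1*x2, x2^2, ..., x2^degree."""
    terms = []
    for total in range(degree + 1):
        for j in range(total + 1):
            terms.append(x1 ** (total - j) * x2 ** j)
    return np.array(terms)

# map_feature(x1, x2) has 28 terms for degree 6 (1 + 2 + ... + 7).
```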
Test Error Diagnostic [Bias/Variance]

A high $$J_{\t{test}}$$ can result from high model variance (overfitting) or high bias (underfitting):

Diagnostic:
$$\left\{\begin{align} \u{\t{Bias}}\t{:} &\ J_{\t{train}} = \t{high},\ J_{\t{CV}} \approx J_{\t{train}} \\ \u{\t{Var}}\t{:} &\ J_{\t{train}} = \t{low},\ J_{\t{CV}} \gg J_{\t{train}} \end{align} \right.$$
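In code, the diagnostic amounts to a comparison like the following (a rough sketch; the thresholds are illustrative and problem-dependent):

```python
def diagnose(j_train, j_cv, acceptable_error):
    """Rough bias/variance call from training and cross-validation costs."""
    if j_train > acceptable_error and j_cv <= 1.2 * j_train:
        return "high bias (underfitting)"
    if j_train <= acceptable_error and j_cv >= 2 * j_train:
        return "high variance (overfitting)"
    return "no clear diagnosis"
```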

Precision & Recall

A threshold-setting method for meeting an accuracy objective, especially useful for skewed classes.

Precision: fraction of predicted positives that are actual positives; higher $$\ra$$ greater % of predicted positives $$=$$ actual positives $$\ra$$ fewer false alarms
Recall: fraction of actual positives detected; higher $$\ra$$ greater % of actual positives predicted $$\ra$$ fewer actual positives overlooked
$$\ds \tb{Precision} = \frac{\t{True positives}}{\t{# Predicted positive}} = \frac{\t{True positives}}{\t{True pos + False pos}}$$
$$\ds \tb{Recall} = \frac{\t{True positives}}{\t{# Actual positive}} = \frac{\t{True positives}}{\t{True pos + False neg}}$$
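A minimal sketch computing both from 0/1 prediction and label vectors (NumPy assumed; names illustrative):

```python
import numpy as np

def precision_recall(y_pred, y_true):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN), for 0/1 arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# An "always predict y=0" classifier (as in the example below) gives
# tp = 0, so both precision and recall come out 0.
```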
Ex: Algorithm $$A$$ predicts cancer in patients with 99% 'accuracy'; however, only 0.5% of sampled patients have cancer. An algorithm $$B$$ predicting only $$y=0$$ (no cancer) would thus be 99.5% accurate - beating $$A$$. $$B$$, however, would have a Precision & Recall of $$0$$.

F1-Score

A single-value measure of combined Precision & Recall accuracy; also called the F-score for short:

$$\ds \t{F1} = \frac{2PR}{P+R} = \frac{\t{(Geometric mean)}^2}{\t{Arithmetic mean}}$$
Ex: If (A): $$P=.02$$, $$R=.99$$, and (B): $$P=.50$$, $$R=.40$$, the arithmetic mean $$(P+R)/2$$ suggests (A) is better - yet with (A) only 2% of predicted positives would be actual (thus 98% false alarms), which isn't reliable. $$F1$$, in contrast, scores (B) about 10x higher.
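A quick check of the numbers in that example:

```python
def f1(p, r):
    return 2 * p * r / (p + r)

# (A): f1(0.02, 0.99) -> ~0.039; arithmetic mean (0.02 + 0.99) / 2 = 0.505
# (B): f1(0.50, 0.40) -> ~0.444; arithmetic mean (0.50 + 0.40) / 2 = 0.450
# F1 ranks (B) roughly 11x higher, while the arithmetic mean favors (A).
```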

Design Pipeline

Outlining model design by assessing benefit vs. cost of a design component

Example: Photo OCR (optical character recognition)
Ceiling analysis: estimating the maximum error attributable to each component
What part of the pipeline should you spend the most time improving?
• Overall system, unimproved: $$72$$% accuracy
• Perfect algorithm for text detection $$\ra +17$$% accuracy
• Perfect algorithm for character segmentation $$\ra +1$$% accuracy
• Perfect algorithm for character recognition $$\ra +9$$% accuracy

| Component | Accuracy |
| --- | --- |
| Overall system | 72% |
| Text detection | 89% |
| Character segmentation | 90% |
| Character recognition | 99% |
Thus, improving character segmentation may not be worth the time/resources

(i.e., ceiling analysis answers "how much would you gain by making the best possible version of the component?")
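The arithmetic behind the table, as a tiny sketch (values from the example above):

```python
# Accuracy when each pipeline stage's output is replaced with ground truth,
# cumulatively from left to right.
ceiling = [
    ("overall system", 0.72),
    ("text detection", 0.89),
    ("character segmentation", 0.90),
    ("character recognition", 0.99),
]

# Maximum gain from perfecting each component = difference from the previous row.
for (_, prev), (name, acc) in zip(ceiling, ceiling[1:]):
    print(f"{name}: +{acc - prev:.0%}")
# text detection: +17%, character segmentation: +1%, character recognition: +9%
```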

Big Data Handling

Large datasets require appropriate methods for efficiently structuring a machine learning project:

• Should I train on full dataset? / Will my algorithm scale well?
$$\ra$$ Plot learning curves for smaller $$m$$, check for bias/variance
• The computer can't handle the full dataset
$$\ra$$ Map-reduce: split the workload among computers or CPU cores (see the sketch after this list)
• How do I get more data? (example: character recognition):
Artificial data synthesis:
$$\ra$$ compile examples likely to occur in test set (e.g. fonts library)
$$\ra$$ add noise / distortion to existing data that's likely to occur in test set (e.g. blur, background)
Labeling personally: estimate the time required to increase the dataset by 10x
Labeling service: can be done inexpensively
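A minimal map-reduce sketch of the gradient computation mentioned above (multiprocessing stands in for a cluster of machines, and a linear-regression gradient is assumed for concreteness):

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: gradient contribution summed over one shard of the data."""
    X_shard, y_shard, theta = args
    return X_shard.T @ (X_shard @ theta - y_shard)

def gradient(X, y, theta, n_workers=4):
    """Reduce step: sum the per-shard contributions and divide by m."""
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    with Pool(n_workers) as pool:
        parts = pool.map(partial_gradient, [(Xs, ys, theta) for Xs, ys in shards])
    return sum(parts) / len(y)
```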