
Dragon Notes


\( \newcommand{bvec}[1]{\overrightarrow{\boldsymbol{#1}}} \newcommand{bnvec}[1]{\overrightarrow{\boldsymbol{\mathrm{#1}}}} \newcommand{uvec}[1]{\widehat{\boldsymbol{#1}}} \newcommand{vec}[1]{\overrightarrow{#1}} \newcommand{\parallelsum}{\mathbin{\|}} \) \( \newcommand{s}[1]{\small{#1}} \newcommand{t}[1]{\text{#1}} \newcommand{tb}[1]{\textbf{#1}} \newcommand{ns}[1]{\normalsize{#1}} \newcommand{ss}[1]{\scriptsize{#1}} \newcommand{vpl}[]{\vphantom{\large{\int^{\int}}}} \newcommand{vplup}[]{\vphantom{A^{A^{A^A}}}} \newcommand{vplLup}[]{\vphantom{A^{A^{A^{A{^A{^A}}}}}}} \newcommand{vpLup}[]{\vphantom{A^{A^{A^{A^{A^{A^{A^A}}}}}}}} \newcommand{up}[]{\vplup} \newcommand{Up}[]{\vplLup} \newcommand{Uup}[]{\vpLup} \newcommand{vpL}[]{\vphantom{\Large{\int^{\int}}}} \newcommand{lrg}[1]{\class{lrg}{#1}} \newcommand{sml}[1]{\class{sml}{#1}} \newcommand{qq}[2]{{#1}_{\t{#2}}} \newcommand{ts}[2]{\t{#1}_{\t{#2}}} \) \( \newcommand{ds}[]{\displaystyle} \newcommand{dsup}[]{\displaystyle\vplup} \newcommand{u}[1]{\underline{#1}} \newcommand{tu}[1]{\underline{\text{#1}}} \newcommand{tbu}[1]{\underline{\bf{\text{#1}}}} \newcommand{bxred}[1]{\class{bxred}{#1}} \newcommand{Bxred}[1]{\class{bxred2}{#1}} \newcommand{lrpar}[1]{\left({#1}\right)} \newcommand{lrbra}[1]{\left[{#1}\right]} \newcommand{lrabs}[1]{\left|{#1}\right|} \newcommand{bnlr}[2]{\bn{#1}\left(\bn{#2}\right)} \newcommand{nblr}[2]{\bn{#1}(\bn{#2})} \newcommand{real}[1]{\Ree\{{#1}\}} \newcommand{Real}[1]{\Ree\left\{{#1}\right\}} \newcommand{abss}[1]{\|{#1}\|} \newcommand{umin}[1]{\underset{{#1}}{\t{min}}} \newcommand{umax}[1]{\underset{{#1}}{\t{max}}} \newcommand{und}[2]{\underset{{#1}}{{#2}}} \) \( \newcommand{bn}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{bns}[2]{\bn{#1}_{\t{#2}}} \newcommand{b}[1]{\boldsymbol{#1}} \newcommand{bb}[1]{[\bn{#1}]} \) \( \newcommand{abs}[1]{\left|{#1}\right|} \newcommand{ra}[]{\rightarrow} \newcommand{Ra}[]{\Rightarrow} \newcommand{Lra}[]{\Leftrightarrow} \newcommand{rai}[]{\rightarrow\infty} \newcommand{ub}[2]{\underbrace{{#1}}_{#2}} \newcommand{ob}[2]{\overbrace{{#1}}^{#2}} \newcommand{lfrac}[2]{\large{\frac{#1}{#2}}\normalsize{}} \newcommand{sfrac}[2]{\small{\frac{#1}{#2}}\normalsize{}} \newcommand{Cos}[1]{\cos{\left({#1}\right)}} \newcommand{Sin}[1]{\sin{\left({#1}\right)}} \newcommand{Frac}[2]{\left({\frac{#1}{#2}}\right)} \newcommand{LFrac}[2]{\large{{\left({\frac{#1}{#2}}\right)}}\normalsize{}} \newcommand{Sinf}[2]{\sin{\left(\frac{#1}{#2}\right)}} \newcommand{Cosf}[2]{\cos{\left(\frac{#1}{#2}\right)}} \newcommand{atan}[1]{\tan^{-1}({#1})} \newcommand{Atan}[1]{\tan^{-1}\left({#1}\right)} \newcommand{intlim}[2]{\int\limits_{#1}^{#2}} \newcommand{lmt}[2]{\lim_{{#1}\rightarrow{#2}}} \newcommand{ilim}[1]{\lim_{{#1}\rightarrow\infty}} \newcommand{zlim}[1]{\lim_{{#1}\rightarrow 0}} \newcommand{Pr}[]{\t{Pr}} \newcommand{prop}[]{\propto} \newcommand{ln}[1]{\t{ln}({#1})} \newcommand{Ln}[1]{\t{ln}\left({#1}\right)} \newcommand{min}[2]{\t{min}({#1},{#2})} \newcommand{Min}[2]{\t{min}\left({#1},{#2}\right)} \newcommand{max}[2]{\t{max}({#1},{#2})} \newcommand{Max}[2]{\t{max}\left({#1},{#2}\right)} \newcommand{pfrac}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{pd}[]{\partial} \newcommand{zisum}[1]{\sum_{{#1}=0}^{\infty}} \newcommand{iisum}[1]{\sum_{{#1}=-\infty}^{\infty}} \newcommand{var}[1]{\t{var}({#1})} \newcommand{exp}[1]{\t{exp}\left({#1}\right)} \newcommand{mtx}[2]{\left[\begin{matrix}{#1}\\{#2}\end{matrix}\right]} \newcommand{nmtx}[2]{\begin{matrix}{#1}\\{#2}\end{matrix}} 
\newcommand{nmttx}[3]{\begin{matrix}\begin{align} {#1}& \\ {#2}& \\ {#3}& \\ \end{align}\end{matrix}} \newcommand{amttx}[3]{\begin{matrix} {#1} \\ {#2} \\ {#3} \\ \end{matrix}} \newcommand{nmtttx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{mtxx}[4]{\left[\begin{matrix}\begin{align}&{#1}&\hspace{-20px}{#2}\\&{#3}&\hspace{-20px}{#4}\end{align}\end{matrix}\right]} \newcommand{mtxxx}[9]{\begin{matrix}\begin{align} &{#1}&\hspace{-20px}{#2}&&\hspace{-20px}{#3}\\ &{#4}&\hspace{-20px}{#5}&&\hspace{-20px}{#6}\\ &{#7}&\hspace{-20px}{#8}&&\hspace{-20px}{#9} \end{align}\end{matrix}} \newcommand{amtxxx}[9]{ \amttx{#1}{#4}{#7}\hspace{10px} \amttx{#2}{#5}{#8}\hspace{10px} \amttx{#3}{#6}{#9}} \) \( \newcommand{ph}[1]{\phantom{#1}} \newcommand{vph}[1]{\vphantom{#1}} \newcommand{mtxxxx}[8]{\begin{matrix}\begin{align} & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4} \\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \\ \mtxxxxCont} \newcommand{\mtxxxxCont}[8]{ & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4}\\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \end{align}\end{matrix}} \newcommand{mtXxxx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{cov}[1]{\t{cov}({#1})} \newcommand{Cov}[1]{\t{cov}\left({#1}\right)} \newcommand{var}[1]{\t{var}({#1})} \newcommand{Var}[1]{\t{var}\left({#1}\right)} \newcommand{pnint}[]{\int_{-\infty}^{\infty}} \newcommand{floor}[1]{\left\lfloor {#1} \right\rfloor} \) \( \newcommand{adeg}[1]{\angle{({#1}^{\t{o}})}} \newcommand{Ree}[]{\mathcal{Re}} \newcommand{Im}[]{\mathcal{Im}} \newcommand{deg}[1]{{#1}^{\t{o}}} \newcommand{adegg}[1]{\angle{{#1}^{\t{o}}}} \newcommand{ang}[1]{\angle{\left({#1}\right)}} \newcommand{bkt}[1]{\langle{#1}\rangle} \) \( \newcommand{\hs}[1]{\hspace{#1}} \)

  UNDER CONSTRUCTION

Deep Learning:
Model Design



Overfitting Problem

Overfitting: the algorithm performs extremely well on training data, but poorly on unfamiliar data; it can be caused by having too many features in the hypothesis.
Underfitting, in contrast, suffers from having too few features - not providing the flexibility necessary to fit the data.\(\up\)


Bias-Variance Tradeoff

The problem of minimizing error whereby minimal bias yields large variance, and vice-versa - preventing learning models from generalizing beyond training set.
- Bias: error from erroneous assumptions in the learning model; high bias can cause a model to miss relevant relations between features and outputs (underfitting)
- Variance: error from model sensitivity to small fluctuations in training set; high variance can cause a model to fit the random noise in training data, rather than the intended outputs (overfitting)
\(\begin{align} &E\lrbra{\t{MSE}_{\t{test}}} = E[(y_0-\hat{f}(x_0))^2] = \\ & \t{var}(\hat{f}(x_0)) + [\t{Bias}(\hat{f}(x_0))]^2 + \t{var}(\epsilon) \end{align}\)
\(\up\t{Bias}[\hat{f}(x_0)] = E[\hat{f}(x_0)] - f(x_0)\)
\(\dsup\t{var}(\hat{f}(x_0)) = E[(\hat{f}(x_0))^2] - E[\hat{f}(x_0)]^2\)
\(\ds\t{var}(\epsilon) = \) irreducible error, arising from data noise

  • Left: data is fit with models of varying flexibility: orange (lowest, linear reg.), black, and green (highest).
  • Middle: at lowest flexibility, the model is too inflexible to fit the data (underfitting); as flexibility increases, both the training & test MSE decrease; increasing it further, the training MSE keeps dropping, but the test MSE rises sharply (overfitting).
  • Right: for low flexibility, the MSE is mostly controlled by bias; for higher, by variance.
  • Note that the test MSE is always above \(\t{var}(\epsilon)\) - hence the error being 'irreducible'.
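The decomposition can be checked numerically. Below is a minimal simulation sketch, assuming Python/NumPy; the "true" function, noise level, and polynomial degrees are illustrative choices (not from these notes). It estimates \([\t{Bias}(\hat{f}(x_0))]^2\), \(\t{var}(\hat{f}(x_0))\), and the resulting expected test MSE at a fixed \(x_0\) by refitting polynomial models on many fresh training sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # hypothetical "true" function (illustrative)
    return np.sin(2 * x)

sigma = 0.3                     # noise std; var(eps) = sigma**2 is the irreducible error
x0, n, trials = 0.5, 50, 2000   # query point, training-set size, number of repetitions

def fit_predict(degree):
    """Fit a degree-`degree` polynomial on a fresh training set each trial; return f_hat(x0)."""
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x0)
    return preds

for degree in (1, 3, 9):                      # low, medium, high flexibility
    preds = fit_predict(degree)
    bias2 = (preds.mean() - f(x0)) ** 2       # [Bias(f_hat(x0))]^2
    var = preds.var()                         # var(f_hat(x0))
    # Expected test MSE at x0 ~ bias^2 + variance + var(eps)
    print(f"deg {degree}: bias^2={bias2:.4f}  var={var:.4f}  "
          f"E[MSE]~{bias2 + var + sigma**2:.4f}")
```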

Learning Curves

\(J_{\t{CV}}\) and \(J_{\t{train}}\) will vary with training size \(m\), both leveling off relative to desired error depending on model bias & variance:

High bias:
\(\up \t{Low }m\t{:}\ J_{\t{train}}=\t{low},\ J_{\t{CV}}=\t{high}\)
\(\t{High }m\t{:}\ J_{\t{train}}\approx J_{\t{CV}}=\t{high}\); both continue to level off, with narrow gap between

\(\up\)More training data is unlikely to help
High variance:
\(\up\t{Low }m\t{:}\ J_{\t{train}}=\t{low},\ J_{\t{CV}}=\t{high}\)
\(\t{High }m\t{:}\) \(J_{\t{train}}\) continues to increase, \(J_{\t{CV}}\) continues to decrease, with the gap between them remaining significant

\(\up\)More training data is likely to help

[Figure: learning curves, high bias]
[Figure: learning curves, high variance]
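A minimal sketch of how such curves can be generated, assuming Python/NumPy; the synthetic dataset and the plain linear-regression fit are illustrative (a linear model on nonlinear data tends to show the high-bias pattern). Note that \(J_{\t{train}}\) is computed only on the first \(m\) training examples, while \(J_{\t{CV}}\) is always computed on the full cross-validation set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: noisy nonlinear target, fit with plain linear regression
X = rng.uniform(-3, 3, (400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 400)
X_train, y_train, X_cv, y_cv = X[:300], y[:300], X[300:], y[300:]

def J(theta, X, y):
    """Unregularized squared-error cost: (1/(2m)) * sum((X theta - y)^2)."""
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)

def add_bias(X):
    return np.c_[np.ones(len(X)), X]

for m in (5, 10, 25, 50, 100, 200, 300):
    Xi, yi = add_bias(X_train[:m]), y_train[:m]       # J_train uses only the first m examples
    theta, *_ = np.linalg.lstsq(Xi, yi, rcond=None)   # normal-equation style fit
    j_train = J(theta, Xi, yi)
    j_cv = J(theta, add_bias(X_cv), y_cv)             # J_CV is evaluated on the full CV set
    print(f"m={m:3d}  J_train={j_train:.3f}  J_CV={j_cv:.3f}")
```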

Evaluating Hypothesis

A hypothesis must be evaluated for its ability to generalize to unfamiliar examples - which can be accomplished by dividing the data into training, cross-validation & testing sets:

  • \(\sim 60\%\) of data: Training set \(-\) set for training the algorithm
  • \(\sim 20\%\) of data: Cross-validation set \(-\) set for optimizing regularization
  • \(\sim 20\%\) of data: Testing set \(-\) set for final testing of algorithm
- \(J_{\t{test}}\) does not include the regularization term
- For learning curves, \(J_{\t{train}}\) is evaluated on the first \(i\) training examples, and \(J_{\t{CV}}\) on all \(m_{\t{CV}}\) examples
Linear Reg: \(\ds J_{\t{test}}(\bn{\Theta}) = \sfrac{1}{2m_{\t{test}}}\sum_{i=1}^{m_{\t{test}}}(h_{\bn{\Theta}}(x_{\t{test}}^{(i)})-y_{\t{test}}^{(i)})^2\)
Classification: \(\ds J_{\t{test}}(\bn{\Theta}) = \sfrac{1}{m_{\t{test}}}\sum_{i=1}^{m_{\t{test}}}\t{err}(h_{\bn{\Theta}}(x_{\t{test}}^{(i)}),y_{\t{test}}^{(i)})\)
\(\ds \t{err}(h_{\bn{\Theta}}(x^{(i)}),y^{(i)}) = \left\{ \begin{align} & 1\ \ \t{if }h_{\bn{\Theta}}(x^{(i)})\geq 0.5,\ y=0 \\ &\hspace{9px}\t{ or }h_{\bn{\Theta}}(x^{(i)}) < 0.5,\ y=1 \\ & 0\t{ otherwise} \end{align}\right. \)
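A sketch of the corresponding split and test costs, assuming Python/NumPy; function names and the 0.5 sigmoid threshold follow the formulas above, but the helper names themselves are illustrative:

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Shuffle, then split into ~60% train / ~20% cross-validation / ~20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    i1, i2 = int(0.6 * len(y)), int(0.8 * len(y))
    tr, cv, te = idx[:i1], idx[i1:i2], idx[i2:]
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])

def J_test_linear(theta, X_test, y_test):
    """Squared-error test cost; no regularization term."""
    m = len(y_test)
    return np.sum((X_test @ theta - y_test) ** 2) / (2 * m)

def J_test_classification(theta, X_test, y_test):
    """Mean 0/1 misclassification error with a 0.5 threshold on the sigmoid output."""
    h = 1.0 / (1.0 + np.exp(-(X_test @ theta)))
    predictions = (h >= 0.5).astype(int)
    return np.mean(predictions != y_test)
```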

Feature Mapping

Adding features to enable fitting non-linear data; for polynomial fitting, higher-order terms are added as products of the existing features - e.g., for a degree-6 fit:

\(\ds \begin{align} & \t{mapFeature}(x)= \\ & \lrbra{1 \ \ x_1 \ \ x_2 \ \ x_1^2 \ \ x_1x_2 \ \ x_2^2 \ \ x_1^3 \ \ \cdots \ \ x_1x_2^5 \ \ x_2^6} \Rightarrow \end{align}\)
\(\begin{align} & 1,\\ & x_1,\ x_2,\\ & x_1^2,\ x_1x_2,\ x_2^2,\\ & x_1^3,\ x_1^2x_2,\ x_1x_2^2,\ x_2^3,\\ & x_1^4,\ x_1^3x_2,\ x_1^2x_2^2,\ x_1x_2^3,\ x_2^4,\\ & x_1^5,\ x_1^4x_2,\ x_1^3x_2^2,\ x_1^2x_2^3,\ x_1x_2^4,\ x_2^5,\\ & x_1^6,\ x_1^5x_2,\ x_1^4x_2^2,\ x_1^3x_2^3,\ x_1^2x_2^4,\ x_1x_2^5,\ x_2^6 \end{align}\hspace{35px} \)
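A sketch of the mapping above, assuming Python/NumPy; the name map_feature is illustrative:

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Map two features to all polynomial terms of x1, x2 up to `degree`:
    [1, x1, x2, x1^2, x1*x2, x2^2, ..., x1*x2^5, x2^6]."""
    x1, x2 = np.atleast_1d(x1), np.atleast_1d(x2)
    cols = []
    for i in range(degree + 1):          # total degree of the term
        for j in range(i + 1):           # power given to x2 within that degree
            cols.append((x1 ** (i - j)) * (x2 ** j))
    return np.column_stack(cols)         # shape (m, 28) for degree 6

# e.g. map_feature(0.5, -1.0).shape -> (1, 28)
```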
Test Error Diagnostic [Bias/Variance]

A high \(J_{\t{test}}\) can result from high model variance (overfitting), or high bias (underfitting):

Diagnostic:
\(\ \ \left\{\begin{align} \up\u{\t{Bias}}\t{:} &\ J_{\t{train}} = \t{high},\ J_{\t{CV}} \approx \hspace{4px} J_{\t{train}} \\ \u{\t{Var}}\t{:} &\ J_{\t{train}} = \t{low},\hspace{12px} J_{\t{CV}} \gg J_{\t{train}} \end{align} \right.\)
[Figure: bias-variance diagnosis]
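As a sketch, the diagnostic can be written as a simple rule, assuming Python; the desired-error and gap thresholds are illustrative, not canonical:

```python
def diagnose(j_train, j_cv, desired_error, gap_factor=2.0):
    """Rough bias/variance diagnostic (thresholds are illustrative):
    high J_train with J_CV ~ J_train  -> bias (underfitting);
    low  J_train with J_CV >> J_train -> variance (overfitting)."""
    if j_train > desired_error and j_cv <= gap_factor * j_train:
        return "high bias (underfitting)"
    if j_train <= desired_error and j_cv > gap_factor * j_train:
        return "high variance (overfitting)"
    return "no clear diagnosis"
```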

Precision & Recall

Evaluation metrics for setting the prediction threshold and meeting an accuracy objective; especially useful for skewed classes.

Precision: fraction of predicted positives that are actual positives; higher \(\ra\) greater % of predicted positives \(=\) actual positives \(\ra\) fewer false alarms
Recall: fraction of actual positives detected; higher \(\ra\) greater % of actual positives predicted \(\ra\) fewer actual positives overlooked
\(\ds \tb{Precision} = \frac{\t{True positives}}{\t{\# Predicted positive}} = \frac{\t{True positives}}{\t{True pos + False pos}}\)
\(\ds \tb{Recall} = \frac{\t{True positives}}{\t{\# Actual positive}} = \frac{\t{True positives}}{\t{True pos + False neg}}\)
Ex: Algorithm \(A\) predicts cancer in patients with 99% 'accuracy'; however, only 0.5% of sampled patients have cancer. An algorithm \(B\) predicting only \(y=0\) (no cancer) would thus be 99.5% accurate - beating \(A\). \(B\), however, would have a Precision & Recall of \(0\).
[Figure: precision & recall]
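A sketch of the two metrics, applied to the skewed-class example above, assuming Python/NumPy:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP);  Recall = TP / (TP + FN)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0   # convention: 0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Skewed-class example: predicting y=0 for everyone scores high 'accuracy'
# when only 0.5% of patients have cancer, yet precision = recall = 0.
y_true = np.array([1] * 5 + [0] * 995)      # 0.5% positives
y_pred = np.zeros(1000, dtype=int)          # algorithm B: always predicts "no cancer"
print(np.mean(y_pred == y_true))            # 0.995 accuracy
print(precision_recall(y_true, y_pred))     # (0.0, 0.0)
```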

F1-Score

A single-value measure combining Precision & Recall; also called the F-score:

\(\ds \t{F1} = \frac{2PR}{P+R} = \frac{\t{(Geometric mean)}^2}{\t{Arithmetic mean}}\)
Ex: If (A): \(P=.02\), \(R=.99\), and (B): \(P=.50\), \(R=.40\), the arithmetic mean \((P+R)/2\) suggests (A) is better - even though only 2% of its predicted positives would be actual (thus 98% false alarms), which isn't reliable. \(F1\), in contrast, scores (B) about 10x higher.
[Figure: F1-score]
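A sketch of the score applied to the example above, assuming Python:

```python
def f1_score(p, r):
    """F1 = 2PR / (P + R); defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Numbers from the example above:
print(f1_score(0.02, 0.99))   # ~0.039  (A: mostly false alarms)
print(f1_score(0.50, 0.40))   # ~0.444  (B: roughly 10x higher F1)
```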

Design Pipeline

Outlining model design by assessing the benefit vs. cost of each design component

Example: Photo OCR (optical character recognition)
Ceiling analysis: estimating the maximum accuracy gain available from improving each component
    What part of the pipeline should you spend the most time improving?
  • Overall system, unimproved: \(72\)% accuracy
  • Perfect algorithm for text detection \(\ra +17\)% accuracy
  • Perfect algorithm for character segmentation \(\ra +1\)% accuracy
  • Perfect algorithm for character recognition \(\ra +9\)% accuracy
Component                 Accuracy
Overall system            72%
Text detection            89%
Character segmentation    90%
Character recognition     99%
Thus, improving character segmentation may not be worth the time/resources

(i.e., ceiling analysis answers "how much would you gain by making the best possible version of the component?")
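A sketch of the computation behind the table, assuming Python; the accuracies are taken from the table above:

```python
# Ceiling analysis on the Photo OCR pipeline: the gain for each stage is how much
# overall accuracy improves when that stage (and every stage before it) is replaced
# by perfect, hand-labeled output.
stages = [                         # (component, accuracy with everything up to it made perfect)
    ("Overall system",         0.72),
    ("Text detection",         0.89),
    ("Character segmentation", 0.90),
    ("Character recognition",  0.99),
]

for (_, acc_prev), (name, acc) in zip(stages, stages[1:]):
    print(f"Perfect {name:24s} -> +{100 * (acc - acc_prev):.0f}% accuracy")
# Perfect text detection -> +17%, segmentation -> +1%, recognition -> +9%
```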

Big Data Handling

Large datasets require appropriate methods for efficiently structuring a machine learning project

  • Should I train on full dataset? / Will my algorithm scale well?
     \(\ra\) Plot learning curves for smaller \(m\), check for bias/variance
  • Computer can't handle full dataset
     \(\ra\) Map-reduce: split the workload among computers or CPU cores (see the sketch after this list)
  • How do I get more data? (example: character recognition):
    Artificial data synthesis:
     \(\ra\) compile examples likely to occur in test set (e.g. fonts library)
     \(\ra\) add noise / distortion to existing data that's likely to occur in test set (e.g. blur, background)
    Labeling personally: estimate the time required to increase the dataset 10x
    Labeling service: can be done inexpensively
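A minimal map-reduce sketch for the gradient of a squared-error cost, assuming Python/NumPy; multiprocessing.Pool stands in for a cluster of machines, and the synthetic data is illustrative:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: gradient contribution of one shard of the data."""
    theta, X_shard, y_shard = args
    return X_shard.T @ (X_shard @ theta - y_shard)

def mapreduce_gradient(theta, X, y, n_workers=4):
    """Reduce step: sum the per-shard gradients and divide by the full m.
    (A sketch of the map-reduce idea; real systems distribute shards across machines.)"""
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    with Pool(n_workers) as pool:
        parts = pool.map(partial_gradient,
                         [(theta, Xs, ys) for Xs, ys in zip(X_shards, y_shards)])
    return sum(parts) / len(y)

if __name__ == "__main__":                      # guard required for multiprocessing
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(10_000), rng.normal(size=(10_000, 3))]
    y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 10_000)
    print(mapreduce_gradient(np.zeros(4), X, y))
```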





By OverLordGoldDragon