Dragon Notes


\( \newcommand{bvec}[1]{\overrightarrow{\boldsymbol{#1}}} \newcommand{bnvec}[1]{\overrightarrow{\boldsymbol{\mathrm{#1}}}} \newcommand{uvec}[1]{\widehat{\boldsymbol{#1}}} \newcommand{vec}[1]{\overrightarrow{#1}} \newcommand{\parallelsum}{\mathbin{\|}} \) \( \newcommand{s}[1]{\small{#1}} \newcommand{t}[1]{\text{#1}} \newcommand{tb}[1]{\textbf{#1}} \newcommand{ns}[1]{\normalsize{#1}} \newcommand{ss}[1]{\scriptsize{#1}} \newcommand{vpl}[]{\vphantom{\large{\int^{\int}}}} \newcommand{vplup}[]{\vphantom{A^{A^{A^A}}}} \newcommand{vplLup}[]{\vphantom{A^{A^{A^{A{^A{^A}}}}}}} \newcommand{vpLup}[]{\vphantom{A^{A^{A^{A^{A^{A^{A^A}}}}}}}} \newcommand{up}[]{\vplup} \newcommand{Up}[]{\vplLup} \newcommand{Uup}[]{\vpLup} \newcommand{vpL}[]{\vphantom{\Large{\int^{\int}}}} \newcommand{lrg}[1]{\class{lrg}{#1}} \newcommand{sml}[1]{\class{sml}{#1}} \newcommand{qq}[2]{{#1}_{\t{#2}}} \newcommand{ts}[2]{\t{#1}_{\t{#2}}} \) \( \newcommand{ds}[]{\displaystyle} \newcommand{dsup}[]{\displaystyle\vplup} \newcommand{u}[1]{\underline{#1}} \newcommand{tu}[1]{\underline{\text{#1}}} \newcommand{tbu}[1]{\underline{\bf{\text{#1}}}} \newcommand{bxred}[1]{\class{bxred}{#1}} \newcommand{Bxred}[1]{\class{bxred2}{#1}} \newcommand{lrpar}[1]{\left({#1}\right)} \newcommand{lrbra}[1]{\left[{#1}\right]} \newcommand{lrabs}[1]{\left|{#1}\right|} \newcommand{bnlr}[2]{\bn{#1}\left(\bn{#2}\right)} \newcommand{nblr}[2]{\bn{#1}(\bn{#2})} \newcommand{real}[1]{\Ree\{{#1}\}} \newcommand{Real}[1]{\Ree\left\{{#1}\right\}} \) \( \newcommand{bn}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{bns}[2]{\bn{#1}_{\t{#2}}} \newcommand{b}[1]{\boldsymbol{#1}} \newcommand{bb}[1]{[\bn{#1}]} \) \( \newcommand{abs}[1]{\left|{#1}\right|} \newcommand{ra}[]{\rightarrow} \newcommand{Ra}[]{\Rightarrow} \newcommand{Lra}[]{\Leftrightarrow} \newcommand{rai}[]{\rightarrow\infty} \newcommand{ub}[2]{\underbrace{{#1}}_{#2}} \newcommand{ob}[2]{\overbrace{{#1}}^{#2}} \newcommand{lfrac}[2]{\large{\frac{#1}{#2}}\normalsize{}} \newcommand{sfrac}[2]{\small{\frac{#1}{#2}}\normalsize{}} \newcommand{Cos}[1]{\cos{\left({#1}\right)}} \newcommand{Sin}[1]{\sin{\left({#1}\right)}} \newcommand{Frac}[2]{\left({\frac{#1}{#2}}\right)} \newcommand{LFrac}[2]{\large{{\left({\frac{#1}{#2}}\right)}}\normalsize{}} \newcommand{Sinf}[2]{\sin{\left(\frac{#1}{#2}\right)}} \newcommand{Cosf}[2]{\cos{\left(\frac{#1}{#2}\right)}} \newcommand{atan}[1]{\tan^{-1}({#1})} \newcommand{Atan}[1]{\tan^{-1}\left({#1}\right)} \newcommand{intlim}[2]{\int\limits_{#1}^{#2}} \newcommand{lmt}[2]{\lim_{{#1}\rightarrow{#2}}} \newcommand{ilim}[1]{\lim_{{#1}\rightarrow\infty}} \newcommand{zlim}[1]{\lim_{{#1}\rightarrow 0}} \newcommand{Pr}[]{\t{Pr}} \newcommand{prop}[]{\propto} \newcommand{ln}[1]{\t{ln}({#1})} \newcommand{Ln}[1]{\t{ln}\left({#1}\right)} \newcommand{min}[2]{\t{min}({#1},{#2})} \newcommand{Min}[2]{\t{min}\left({#1},{#2}\right)} \newcommand{max}[2]{\t{max}({#1},{#2})} \newcommand{Max}[2]{\t{max}\left({#1},{#2}\right)} \newcommand{pfrac}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{pd}[]{\partial} \newcommand{zisum}[1]{\sum_{{#1}=0}^{\infty}} \newcommand{iisum}[1]{\sum_{{#1}=-\infty}^{\infty}} \newcommand{var}[1]{\t{var}({#1})} \newcommand{exp}[1]{\t{exp}\left({#1}\right)} \newcommand{mtx}[2]{\left[\begin{matrix}{#1}\\{#2}\end{matrix}\right]} \newcommand{nmtx}[2]{\begin{matrix}{#1}\\{#2}\end{matrix}} \newcommand{nmttx}[3]{\begin{matrix}\begin{align} {#1}& \\ {#2}& \\ {#3}& \\ \end{align}\end{matrix}} \newcommand{amttx}[3]{\begin{matrix} {#1} \\ {#2} \\ {#3} \\ \end{matrix}} 
\newcommand{nmtttx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{mtxx}[4]{\left[\begin{matrix}\begin{align}&{#1}&\hspace{-20px}{#2}\\&{#3}&\hspace{-20px}{#4}\end{align}\end{matrix}\right]} \newcommand{mtxxx}[9]{\begin{matrix}\begin{align} &{#1}&\hspace{-20px}{#2}&&\hspace{-20px}{#3}\\ &{#4}&\hspace{-20px}{#5}&&\hspace{-20px}{#6}\\ &{#7}&\hspace{-20px}{#8}&&\hspace{-20px}{#9} \end{align}\end{matrix}} \newcommand{amtxxx}[9]{ \amttx{#1}{#4}{#7}\hspace{10px} \amttx{#2}{#5}{#8}\hspace{10px} \amttx{#3}{#6}{#9}} \) \( \newcommand{ph}[1]{\phantom{#1}} \newcommand{vph}[1]{\vphantom{#1}} \newcommand{mtxxxx}[8]{\begin{matrix}\begin{align} & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4} \\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \\ \mtxxxxCont} \newcommand{\mtxxxxCont}[8]{ & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4}\\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \end{align}\end{matrix}} \newcommand{mtXxxx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{cov}[1]{\t{cov}({#1})} \newcommand{Cov}[1]{\t{cov}\left({#1}\right)} \newcommand{var}[1]{\t{var}({#1})} \newcommand{Var}[1]{\t{var}\left({#1}\right)} \newcommand{pnint}[]{\int_{-\infty}^{\infty}} \newcommand{floor}[1]{\left\lfloor {#1} \right\rfloor} \) \( \newcommand{adeg}[1]{\angle{({#1}^{\t{o}})}} \newcommand{Ree}[]{\mathcal{Re}} \newcommand{Im}[]{\mathcal{Im}} \newcommand{deg}[1]{{#1}^{\t{o}}} \newcommand{adegg}[1]{\angle{{#1}^{\t{o}}}} \newcommand{ang}[1]{\angle{\left({#1}\right)}} \newcommand{bkt}[1]{\langle{#1}\rangle} \) \( \newcommand{\hs}[1]{\hspace{#1}} \)

  UNDER CONSTRUCTION

Machine Learning:
Classification



Decision Boundary

A line (or, more generally, a bounding manifold) separating the regions of positive and negative classification (\(y=1\) and \(y=0\)), defined as the set of all points for which \(\underline{\bn{\theta}^T\bn{x}=0}\).
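
A minimal sketch of the classification rule this implies, assuming a logistic-regression hypothesis with an already-trained parameter vector (the values of theta below are made up for illustration):

```python
import numpy as np

# Hypothetical trained parameters for h(x) = sigmoid(theta^T x);
# theta[0] multiplies the bias term x0 = 1. Values are illustrative only.
theta = np.array([-3.0, 1.0, 1.0])

def predict(X):
    """Classify each row of X (with a leading column of ones):
    y = 1 where theta^T x >= 0, i.e. h(x) >= 0.5; else y = 0."""
    return (X @ theta >= 0).astype(int)

# Here the boundary theta^T x = 0 is the line x1 + x2 = 3.
X = np.array([[1.0, 2.0, 2.0],   # x1 + x2 = 4.0 -> positive side
              [1.0, 0.5, 1.0]])  # x1 + x2 = 1.5 -> negative side
print(predict(X))  # [1 0]
```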

Feature Mapping

Adding features to enable fitting non-linear data; for polynomial fitting, this means adding higher-order terms as products of the existing features. For a degree-6 fit:

\(\ds \begin{align} & \t{mapFeature}(x_1,x_2)= \\ & \lrbra{1 \ \ x_1 \ \ x_2 \ \ x_1^2 \ \ x_1x_2 \ \ x_2^2 \ \ x_1^3 \ \ \cdots \ \ x_1x_2^5 \ \ x_2^6} \end{align}\)
The full list of generated terms is:
\(\begin{align} & 1,\\ & x_1,\ x_2,\\ & x_1^2,\ x_1x_2,\ x_2^2,\\ & x_1^3,\ x_1^2x_2,\ x_1x_2^2,\ x_2^3,\\ & x_1^4,\ x_1^3x_2,\ x_1^2x_2^2,\ x_1x_2^3,\ x_2^4,\\ & x_1^5,\ x_1^4x_2,\ x_1^3x_2^2,\ x_1^2x_2^3,\ x_1x_2^4,\ x_2^5,\\ & x_1^6,\ x_1^5x_2,\ x_1^4x_2^2,\ x_1^3x_2^3,\ x_1^2x_2^4,\ x_1x_2^5,\ x_2^6 \end{align} \)
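
A minimal sketch of this mapping (the function name and vectorization details are illustrative, not from the source):

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """All polynomial terms x1^(i-j) * x2^j with total degree i <= `degree`,
    in the order listed above (28 terms for degree 6, including the 1)."""
    return np.array([x1 ** (i - j) * x2 ** j
                     for i in range(degree + 1)
                     for j in range(i + 1)])

print(map_feature(2.0, 3.0).shape)  # (28,)
```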

Overfitting Problem

Overfitting: the algorithm performs extremely well on training data, but poorly on unfamiliar data; it can be caused by having too many features in the cost function.
Underfitting, in contrast, suffers from having too few features - not providing the flexibility necessary to fit the data.\(\up\)


Bias-Variance Tradeoff

The problem of minimizing error whereby minimal bias yields large variance, and vice versa - preventing learning models from generalizing beyond the training set.
- Bias: error from erroneous assumptions in the learning model; high bias can cause a model to miss relevant relations between features and outputs (underfitting)
- Variance: error from model sensitivity to small fluctuations in training set; high variance can cause a model to fit the random noise in training data, rather than the intended outputs (overfitting)
\(\begin{align} &E\lrbra{\t{MSE}_{\t{test}}} = E[(y_0-\hat{f}(x_0))^2] = \\ & \t{var}(\hat{f}(x_0)) + [\t{Bias}(\hat{f}(x_0))]^2 + \t{var}(\epsilon) \end{align}\)
\(\up\t{Bias}[\hat{f}(x_0)] = E[\hat{f}(x_0)] - f(x_0)\)
\(\dsup\t{var}(\hat{f}(x_0)) = E[(\hat{f}(x_0))^2] - E[\hat{f}(x_0)]^2\)
\(\ds\t{var}(\epsilon) = \) irreducible error, arising from data noise

  • Left: data is fit with models of varying flexibility: orange (lowest; linear regression), black, and green (highest).
  • Middle: at the lowest flexibility, the model is too inflexible to fit the data (underfitting); as flexibility increases, both the training & test MSE decrease; increasing further, the training MSE keeps dropping, but the test MSE rises sharply (overfitting).
  • Right: for low flexibility, the MSE is mostly controlled by bias - for higher, by variance.
  • Note that the test MSE is always above \(\t{var}(\epsilon)\) - hence the error being 'irreducible'.
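
A small synthetic experiment in the same spirit (the data, noise level, and degrees below are arbitrary choices, not from the source): training MSE falls monotonically with flexibility, while test MSE falls, then rises.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                      # true underlying function
x_tr, x_te = rng.uniform(0, 3, 30), rng.uniform(0, 3, 200)
y_tr = f(x_tr) + rng.normal(0, 0.3, 30)          # noisy training data
y_te = f(x_te) + rng.normal(0, 0.3, 200)         # noisy test data

for deg in (1, 3, 6, 12):                        # increasing flexibility
    coef = np.polyfit(x_tr, y_tr, deg)           # least-squares polynomial fit
    mse = lambda x, y: np.mean((np.polyval(coef, x) - y) ** 2)
    print(f"deg={deg:2d}  train MSE={mse(x_tr, y_tr):.3f}  "
          f"test MSE={mse(x_te, y_te):.3f}")
```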

Learning Curves

\(J_{\t{CV}}\) and \(J_{\t{train}}\) will vary with training-set size \(m\), both leveling off relative to the desired error, depending on model bias & variance:

High bias:
\(\up \t{Low }m\t{:}\ J_{\t{train}}=\t{low},\ J_{\t{CV}}=\t{high}\)
\(\t{High }m\t{:}\ J_{\t{train}}\approx J_{\t{CV}}=\t{high}\); both continue to level off, with a narrow gap between them

\(\up\)More training data is unlikely to help
High variance:
\(\up\t{Low }m\t{:}\ J_{\t{train}}=\t{low},\ J_{\t{CV}}=\t{high}\)
\(\t{High }m\t{:}\) \(J_{\t{train}}\) continues to increase, \(J_{\t{CV}}\) continues to decrease, with the gap between them remaining significant

\(\up\)More training data is likely to help

[Figure: learning curves, high bias]
[Figure: learning curves, high variance]
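
A hedged sketch of how such curves might be generated (linear regression fit by least squares on synthetic data; all values are illustrative). Per the note in the next section, the model is fit on only the first \(m\) training examples, but \(J_{\t{CV}}\) is always evaluated on the full cross-validation set:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.c_[np.ones(n), rng.uniform(0, 5, n)]      # bias column + one feature
y = 2 * X[:, 1] + 1 + rng.normal(0, 0.5, n)      # linear data + noise
X_tr, y_tr = X[:120], y[:120]                    # training pool
X_cv, y_cv = X[120:], y[120:]                    # cross-validation set

cost = lambda th, A, b: np.mean((A @ th - b) ** 2) / 2   # unregularized J

for m in (5, 20, 60, 120):                       # growing training size
    # Fit on only the first m training examples (least squares), but
    # always evaluate J_CV on the full cross-validation set.
    th, *_ = np.linalg.lstsq(X_tr[:m], y_tr[:m], rcond=None)
    print(f"m={m:3d}  J_train={cost(th, X_tr[:m], y_tr[:m]):.3f}  "
          f"J_CV={cost(th, X_cv, y_cv):.3f}")
```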

Evaluating a Hypothesis

A hypothesis must be evaluated for its ability to generalize to unfamiliar examples - which can be accomplished by dividing the data into training, cross-validation, & testing sets:

  • \(\sim 60\%\) of data: Training set \(-\) set for training the algorithm
  • \(\sim 20\%\) of data: Cross-validation set \(-\) set for optimizing regularization
  • \(\sim 20\%\) of data: Testing set \(-\) set for final testing of algorithm
- \(J_{\t{test}}\) does not include the regularization term
- \(J_{\t{train}}\) is evaluated on the training subset of the first \(i\) examples, and \(J_{\t{CV}}\) on all \(m_{\t{CV}}\) examples
Linear Reg: \(\ds J_{\t{test}}(\bn{\Theta}) = \sfrac{1}{2m_{\t{test}}}\sum_{i=1}^{m_{\t{test}}}(h_{\bn{\Theta}}(x_{\t{test}}^{(i)})-y_{\t{test}}^{(i)})^2\)
Classification: \(\ds J_{\t{test}}(\bn{\Theta}) = \sfrac{1}{m_{\t{test}}}\sum_{i=1}^{m_{\t{test}}}\t{err}(h_{\bn{\Theta}}(x_{\t{test}}^{(i)}),y_{\t{test}}^{(i)})\)
\(\ds \t{err}(h_{\bn{\Theta}}(x^{(i)}),y^{(i)}) = \left\{ \begin{align} & 1\ \ \t{if }h_{\bn{\Theta}}(x^{(i)})\geq 0.5,\ y=0 \\ &\hspace{9px}\t{ or }h_{\bn{\Theta}}(x^{(i)}) < 0.5,\ y=1 \\ & 0\t{ otherwise} \end{align}\right. \)
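
A direct sketch of this misclassification error in code (assuming \(h\) outputs probabilities in \([0,1]\); the inputs below are hypothetical):

```python
import numpy as np

def err(h, y):
    """0/1 error per example: a miss when h >= 0.5 but y = 0,
    or h < 0.5 but y = 1 (the two cases in the definition above)."""
    pred = (np.asarray(h) >= 0.5).astype(int)
    return (pred != np.asarray(y)).astype(int)

def J_test(h, y):
    return err(h, y).mean()      # average 0/1 loss over the test set

# Hypothetical hypothesis outputs and true labels:
print(J_test([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))  # 0.25 (one miss in four)
```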

Test Error Diagnostic [Bias/Variance]

A high \(J_{\t{test}}\) can result from high model variance (overfitting), or from high bias (underfitting):

Diagnostic:
\(\ \ \left\{\begin{align} \up\u{\t{Bias}}\t{:} &\ J_{\t{train}} = \t{high},\ J_{\t{CV}} \approx \hspace{4px} J_{\t{train}} \\ \u{\t{Var}}\t{:} &\ J_{\t{train}} = \t{low},\hspace{12px} J_{\t{CV}} \gg J_{\t{train}} \end{align} \right.\)
[Figure: bias-variance diagnosis]
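
The rule above as a rough code sketch; the 1.5x/2x thresholds and the target error are arbitrary illustrative choices, not from the source:

```python
def diagnose(j_train, j_cv, target):
    """Rough bias/variance diagnostic per the rule above; thresholds
    are illustrative assumptions only."""
    if j_train > target and j_cv <= 1.5 * j_train:
        return "high bias (underfitting)"
    if j_train <= target and j_cv > 2 * j_train:
        return "high variance (overfitting)"
    return "acceptable"

print(diagnose(j_train=0.05, j_cv=0.30, target=0.10))  # high variance
print(diagnose(j_train=0.40, j_cv=0.45, target=0.10))  # high bias
```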

One-vs-All: Multiclass Classification

For \((n+1)\) classes \(y=\{0,1,...,n\}\), dividing the problem into \((n+1)\) binary classification problems:

\(\ds \begin{align} & h_{\theta}^{(0)}(x) = P(y=0 | x; \theta) \\ & h_{\theta}^{(1)}(x) = P(y=1 | x;\theta) \\ & \vdots \\ & h_{\theta}^{(n)}(x)=P(y=n | x;\theta) \end{align}\) \(\ds \quad \t{prediction} = \underset{i}{\t{argmax}}\ h_{\theta}^{(i)}(x)\)

[1] Choosing one class and lumping all the others into a single second class
[2] In each classification problem, predicting the probability that \(y\) is a member of one of the classes
[3] Given an input \(x\), output the class \(i\) with the greatest \(h_{\theta}^{(i)}(x)\)
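
A minimal one-vs-all sketch using scikit-learn's LogisticRegression as the per-class binary classifier (sklearn and the synthetic data are assumptions; any binary classifier would do):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y, n_classes):
    # [1] + [2]: one binary problem per class (class i vs. all others).
    return [LogisticRegression().fit(X, (y == i).astype(int))
            for i in range(n_classes)]

def one_vs_all_predict(models, X):
    # [3]: each model's probability of "its" class; pick the largest.
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return probs.argmax(axis=1)

# Tiny synthetic example: three well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)
models = one_vs_all_fit(X, y, 3)
print((one_vs_all_predict(models, X) == y).mean())  # ~1.0 on this easy data
```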

Precision & Recall

A pair of metrics for assessing accuracy and setting the prediction threshold, especially useful for skewed classes.

Precision: fraction of predicted positives that are actual positives; higher \(\ra\) greater % of predicted positives \(=\) actual positives \(\ra\) fewer false alarms
Recall: fraction of actual positives detected; higher \(\ra\) greater % of actual positives predicted \(\ra\) fewer actual positives overlooked
\(\ds \tb{Precision} = \frac{\t{True positives}}{\t{\# Predicted positive}} = \frac{\t{True positives}}{\t{True pos + False pos}}\)
\(\ds \tb{Recall} = \frac{\t{True positives}}{\t{\# Actual positive}} = \frac{\t{True positives}}{\t{True pos + False neg}}\)
Ex: Algorithm \(A\) predicts cancer in patients with 99% 'accuracy'; however, only 0.5% of sampled patients have cancer. An algorithm \(B\) predicting only \(y=0\) (no cancer) would thus be 99.5% accurate - beating \(A\). \(B\), however, would have a Recall of \(0\) (and an undefined Precision, as it predicts no positives).
[Figure: precision & recall diagram]
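
A small sketch computing both from predictions, mirroring the skewed-class example (the 0.5%-positive dataset is synthetic):

```python
import numpy as np

def precision_recall(pred, y):
    pred, y = np.asarray(pred), np.asarray(y)
    tp = np.sum((pred == 1) & (y == 1))    # true positives
    fp = np.sum((pred == 1) & (y == 0))    # false positives (false alarms)
    fn = np.sum((pred == 0) & (y == 1))    # false negatives (overlooked)
    p = tp / (tp + fp) if tp + fp else float("nan")   # 0/0 -> undefined
    r = tp / (tp + fn) if tp + fn else float("nan")
    return p, r

y = np.r_[np.ones(5), np.zeros(995)]      # 0.5% positive: skewed classes
always_negative = np.zeros_like(y)        # algorithm B: always predicts y = 0
print(precision_recall(always_negative, y))  # (nan, 0.0): 99.5% 'accurate', useless
```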

F1-Score

A single-value measure combining Precision & Recall accuracy; also called simply the F-score:

\(\ds \t{F1} = \frac{2PR}{P+R} = \frac{\t{(Geometric mean)}^2}{\t{Arithmetic mean}}\)
Ex: If (A): \(P=.02\), \(R=.99\), and (B): \(P=.50\), \(R=.40\), the arithmetic mean \((P+R)/2\) suggests (A) to be better - yet only 2% of (A)'s predicted positives would be actual positives (thus 98% false alarms), which isn't reliable. \(F1\), in contrast, scores (B) about 10x higher.
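
Checking those numbers directly (a trivial sketch):

```python
f1 = lambda p, r: 2 * p * r / (p + r)     # = (geometric mean)^2 / arithmetic mean

p_a, r_a = 0.02, 0.99                     # algorithm (A)
p_b, r_b = 0.50, 0.40                     # algorithm (B)
print((p_a + r_a) / 2, (p_b + r_b) / 2)   # 0.505 vs 0.45: mean prefers (A)
print(f1(p_a, r_a), f1(p_b, r_b))         # ~0.039 vs ~0.444: F1 prefers (B), ~11x
```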
[Figure: F1-score visual]





By OverLordGoldDragon