
\( \newcommand{bvec}[1]{\overrightarrow{\boldsymbol{#1}}} \newcommand{bnvec}[1]{\overrightarrow{\boldsymbol{\mathrm{#1}}}} \newcommand{uvec}[1]{\widehat{\boldsymbol{#1}}} \newcommand{vec}[1]{\overrightarrow{#1}} \newcommand{\parallelsum}{\mathbin{\|}} \) \( \newcommand{s}[1]{\small{#1}} \newcommand{t}[1]{\text{#1}} \newcommand{tb}[1]{\textbf{#1}} \newcommand{ns}[1]{\normalsize{#1}} \newcommand{ss}[1]{\scriptsize{#1}} \newcommand{vpl}[]{\vphantom{\large{\int^{\int}}}} \newcommand{vplup}[]{\vphantom{A^{A^{A^A}}}} \newcommand{vplLup}[]{\vphantom{A^{A^{A^{A{^A{^A}}}}}}} \newcommand{vpLup}[]{\vphantom{A^{A^{A^{A^{A^{A^{A^A}}}}}}}} \newcommand{up}[]{\vplup} \newcommand{Up}[]{\vplLup} \newcommand{Uup}[]{\vpLup} \newcommand{vpL}[]{\vphantom{\Large{\int^{\int}}}} \newcommand{lrg}[1]{\class{lrg}{#1}} \newcommand{sml}[1]{\class{sml}{#1}} \newcommand{qq}[2]{{#1}_{\t{#2}}} \newcommand{ts}[2]{\t{#1}_{\t{#2}}} \) \( \newcommand{ds}[]{\displaystyle} \newcommand{dsup}[]{\displaystyle\vplup} \newcommand{u}[1]{\underline{#1}} \newcommand{tu}[1]{\underline{\text{#1}}} \newcommand{tbu}[1]{\underline{\bf{\text{#1}}}} \newcommand{bxred}[1]{\class{bxred}{#1}} \newcommand{Bxred}[1]{\class{bxred2}{#1}} \newcommand{lrpar}[1]{\left({#1}\right)} \newcommand{lrbra}[1]{\left[{#1}\right]} \newcommand{lrabs}[1]{\left|{#1}\right|} \newcommand{bnlr}[2]{\bn{#1}\left(\bn{#2}\right)} \newcommand{nblr}[2]{\bn{#1}(\bn{#2})} \newcommand{real}[1]{\Ree\{{#1}\}} \newcommand{Real}[1]{\Ree\left\{{#1}\right\}} \newcommand{abss}[1]{\|{#1}\|} \newcommand{umin}[1]{\underset{{#1}}{\t{min}}} \newcommand{umax}[1]{\underset{{#1}}{\t{max}}} \newcommand{und}[2]{\underset{{#1}}{{#2}}} \) \( \newcommand{bn}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{bns}[2]{\bn{#1}_{\t{#2}}} \newcommand{b}[1]{\boldsymbol{#1}} \newcommand{bb}[1]{[\bn{#1}]} \) \( \newcommand{abs}[1]{\left|{#1}\right|} \newcommand{ra}[]{\rightarrow} \newcommand{Ra}[]{\Rightarrow} \newcommand{Lra}[]{\Leftrightarrow} \newcommand{rai}[]{\rightarrow\infty} \newcommand{ub}[2]{\underbrace{{#1}}_{#2}} \newcommand{ob}[2]{\overbrace{{#1}}^{#2}} \newcommand{lfrac}[2]{\large{\frac{#1}{#2}}\normalsize{}} \newcommand{sfrac}[2]{\small{\frac{#1}{#2}}\normalsize{}} \newcommand{Cos}[1]{\cos{\left({#1}\right)}} \newcommand{Sin}[1]{\sin{\left({#1}\right)}} \newcommand{Frac}[2]{\left({\frac{#1}{#2}}\right)} \newcommand{LFrac}[2]{\large{{\left({\frac{#1}{#2}}\right)}}\normalsize{}} \newcommand{Sinf}[2]{\sin{\left(\frac{#1}{#2}\right)}} \newcommand{Cosf}[2]{\cos{\left(\frac{#1}{#2}\right)}} \newcommand{atan}[1]{\tan^{-1}({#1})} \newcommand{Atan}[1]{\tan^{-1}\left({#1}\right)} \newcommand{intlim}[2]{\int\limits_{#1}^{#2}} \newcommand{lmt}[2]{\lim_{{#1}\rightarrow{#2}}} \newcommand{ilim}[1]{\lim_{{#1}\rightarrow\infty}} \newcommand{zlim}[1]{\lim_{{#1}\rightarrow 0}} \newcommand{Pr}[]{\t{Pr}} \newcommand{prop}[]{\propto} \newcommand{ln}[1]{\t{ln}({#1})} \newcommand{Ln}[1]{\t{ln}\left({#1}\right)} \newcommand{min}[2]{\t{min}({#1},{#2})} \newcommand{Min}[2]{\t{min}\left({#1},{#2}\right)} \newcommand{max}[2]{\t{max}({#1},{#2})} \newcommand{Max}[2]{\t{max}\left({#1},{#2}\right)} \newcommand{pfrac}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{pd}[]{\partial} \newcommand{zisum}[1]{\sum_{{#1}=0}^{\infty}} \newcommand{iisum}[1]{\sum_{{#1}=-\infty}^{\infty}} \newcommand{var}[1]{\t{var}({#1})} \newcommand{exp}[1]{\t{exp}\left({#1}\right)} \newcommand{mtx}[2]{\left[\begin{matrix}{#1}\\{#2}\end{matrix}\right]} \newcommand{nmtx}[2]{\begin{matrix}{#1}\\{#2}\end{matrix}} 
\newcommand{nmttx}[3]{\begin{matrix}\begin{align} {#1}& \\ {#2}& \\ {#3}& \\ \end{align}\end{matrix}} \newcommand{amttx}[3]{\begin{matrix} {#1} \\ {#2} \\ {#3} \\ \end{matrix}} \newcommand{nmtttx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{mtxx}[4]{\left[\begin{matrix}\begin{align}&{#1}&\hspace{-20px}{#2}\\&{#3}&\hspace{-20px}{#4}\end{align}\end{matrix}\right]} \newcommand{mtxxx}[9]{\begin{matrix}\begin{align} &{#1}&\hspace{-20px}{#2}&&\hspace{-20px}{#3}\\ &{#4}&\hspace{-20px}{#5}&&\hspace{-20px}{#6}\\ &{#7}&\hspace{-20px}{#8}&&\hspace{-20px}{#9} \end{align}\end{matrix}} \newcommand{amtxxx}[9]{ \amttx{#1}{#4}{#7}\hspace{10px} \amttx{#2}{#5}{#8}\hspace{10px} \amttx{#3}{#6}{#9}} \) \( \newcommand{ph}[1]{\phantom{#1}} \newcommand{vph}[1]{\vphantom{#1}} \newcommand{mtxxxx}[8]{\begin{matrix}\begin{align} & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4} \\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \\ \mtxxxxCont} \newcommand{\mtxxxxCont}[8]{ & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4}\\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \end{align}\end{matrix}} \newcommand{mtXxxx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{cov}[1]{\t{cov}({#1})} \newcommand{Cov}[1]{\t{cov}\left({#1}\right)} \newcommand{var}[1]{\t{var}({#1})} \newcommand{Var}[1]{\t{var}\left({#1}\right)} \newcommand{pnint}[]{\int_{-\infty}^{\infty}} \newcommand{floor}[1]{\left\lfloor {#1} \right\rfloor} \) \( \newcommand{adeg}[1]{\angle{({#1}^{\t{o}})}} \newcommand{Ree}[]{\mathcal{Re}} \newcommand{Im}[]{\mathcal{Im}} \newcommand{deg}[1]{{#1}^{\t{o}}} \newcommand{adegg}[1]{\angle{{#1}^{\t{o}}}} \newcommand{ang}[1]{\angle{\left({#1}\right)}} \newcommand{bkt}[1]{\langle{#1}\rangle} \) \( \newcommand{\hs}[1]{\hspace{#1}} \)

  UNDER CONSTRUCTION

Machine Learning:
Optimization Methods



Methods Summary

[1] Feature scaling, \(x_i := x_i / s_i\). (scaling input features to a similar range, by dividing input values by input range)
[2] Mean normalization, \(x_i := (x_i-\mu_i)/s_i\). (shifting input features to a zero mean, and scaling, by subtracting input mean & dividing by input range or std. dev)
[3] Vectorization - optimizing computing efficiency by writing code that utilizes built-in optimized algorithms
[4] Regularization - increasing algorithm generality to unfamiliar data by minimizing overfitting
[5] Advanced libraries - algorithm libraries can be imported to apply advanced optimization methods to solve problems
[6] Test error minimization - balancing bias against variance to obtain optimal \(J_{\t{test}}\)
[7] Stochastic/mini-batch GD - computing gradient from a single/few training example(s) - trading speed for accuracy
[8] Map-reduce - distributing big data workload among multiple computers / cores
[1, 2]: \(\theta\) descends more quickly over small ranges than large ones, and oscillates inefficiently down to the optimum when features are scaled very unevenly (a normalization sketch follows this summary)
[6]: - Diagnosing hypothesis for bias & variance
- Varying model & regularization parameters
- Testing hypoth. on cross-validation & testing sets
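
A minimal Octave sketch of [1] & [2] combined, assuming a feature matrix X of size \(m\times n\) (one training example per row); using std(X) in place of the range is equally common:

mu    = mean(X);          % 1-by-n vector of feature means
s     = max(X) - min(X);  % 1-by-n feature ranges (std(X) also works)
Xnorm = (X - mu) ./ s;    % element-wise; Octave broadcasts the row vectors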

Gradient Descent (single feature)

Gradient descent (GD) locates function minima by iteratively stepping in the direction of steepest descent. For a single feature,

\(\ds [1]\quad \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\theta_0,\theta_1)\)
  \((\t{for }j=0\t{ and }j=1)\)
\(\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\theta_0,\theta_1) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\theta_0,\theta_1) \end{align} \)

\(\alpha =\) learning rate (step-size)
\(\theta_j =\) model parameter

\(\alpha\) too great \(\ra\) may overshoot the minimum and diverge
\(\alpha\) too small \(\ra\) may be too slow (require many steps)
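
For concreteness (assuming the squared-error cost \(J(\theta_0,\theta_1)=\sfrac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2\) of linear regression, with \(h_{\theta}(x)=\theta_0+\theta_1 x\)), the partial derivatives in \([1]\) evaluate to

\(\ds \begin{align} \frac{\pd}{\pd \theta_0}J(\theta_0,\theta_1) & = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)}) \\ \frac{\pd}{\pd \theta_1}J(\theta_0,\theta_1) & = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\,x^{(i)} \end{align}\)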

Correct:

\(\ds \t{temp0} := \theta_0 - \alpha \sfrac{\pd}{\pd\theta_0}J(\theta_0,\theta_1)\)
\(\ds \t{temp1} := \theta_1 - \alpha \sfrac{\pd}{\pd\theta_1}J(\theta_0,\theta_1)\)
\(\theta_0 := \t{temp0}\)
\(\theta_1 := \t{temp1}\)

Incorrect:

\(\ds \t{temp0} := \theta_0 - \alpha \sfrac{\pd}{\pd\theta_0}J(\theta_0,\theta_1)\)
\(\theta_0 := \t{temp0}\)
\(\ds \t{temp1} := \theta_1 - \alpha \sfrac{\pd}{\pd\theta_1}J(\theta_0,\theta_1)\)
\(\theta_1 := \t{temp1}\)


    Descent Algorithm (above)
  • \([1]\) is to be repeated until convergence. At each iteration, the parameters \(\theta_0, \theta_1\) must be updated simultaneously (see the sketch below)
  • Updating one parameter before computing the other within the same iteration yields an incorrect implementation
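
A minimal Octave sketch of the simultaneous update, assuming the squared-error cost above; x, y (\(m\times 1\) vectors), alpha, num_iters, and initial theta0, theta1 are assumed given:

m = length(y);
for iter = 1:num_iters
  h      = theta0 + theta1*x;                     % hypothesis on all m examples
  temp0  = theta0 - alpha*(1/m)*sum(h - y);       % dJ/d(theta0)
  temp1  = theta1 - alpha*(1/m)*sum((h - y).*x);  % dJ/d(theta1)
  theta0 = temp0;  theta1 = temp1;                % simultaneous update
end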

Gradient Descent (Multivariate)

Gradient descent generalizes to any # of features, and is given by

\(\ds \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\bn{\theta})\)
  \((\t{for }j=0,1,...,n)\)
\(\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\bn{\theta}) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\bn{\theta}) \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{\pd}{\pd \theta_n}J(\bn{\theta}) \\ \end{align} \)

- repeated until convergence. \(\bn{\theta} =\) model parameter vector, \((n+1)\times 1\)

Stochastic Gradient Descent

Descent with gradient computed from a single training example - trading speed for accuracy:

    Algorithm: (Stochastic)\(\up\)
  • Randomly reorder training examples
  • Repeat {
        for \(i = 1\hspace{2px}\t{:}\hspace{2px}m\):
        \(\quad\theta_j := \theta_j - \alpha(h_{\bn{\theta}}(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\quad\) (for every \(j\))
    }

    Algorithm: (Mini-batch)
  • Suppose \(m=1000\), \(b=10\):
  • Randomly reorder training examples
  • Repeat {
        for \(i = 1, 11, 21, ..., 991\):
        \(\ds\quad\theta_j := \theta_j - \alpha\frac{1}{10}\sum_{k=i}^{i+9}(h_{\bn{\theta}}(x^{(k)})-y^{(k)})\cdot x_j^{(k)}\)
    }
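
A minimal Octave sketch of one epoch of mini-batch GD for linear regression, vectorized over each batch; X (\(m\times(n+1)\)), y, theta, alpha, and b are assumed given, with \(m\) divisible by \(b\):

p = randperm(m);  X = X(p,:);  y = y(p);   % randomly reorder examples
for i = 1:b:m
  Xb = X(i:i+b-1, :);  yb = y(i:i+b-1);    % current mini-batch of b examples
  theta = theta - alpha*(1/b) * Xb' * (Xb*theta - yb);
end
% b = 1 recovers stochastic GD; b = m recovers batch GD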

Learning rate: can reduce \(\alpha\) over iterations to aid convergence
Stochastic gradient descent vs mini-batch vs batch
  • Batch GD can be slow for very large datasets \((m\approx 10^8)\)
  • Mini-batch computes the gradient over \(b\) examples
    \(\quad\ra\) Faster than batch, more accurate than stochastic
  • Stochastic & mini-batch oscillate about the global minimum
  • Convergence can be checked by plotting \(J\) vs. # iterations

Regularization

An optimization method aiming to generalize an algorithm by minimizing overfitting
(1): Adding a regularization term to penalize all features, hence reducing the weight of unnecessary ones
(2): Adding a regularization term to penalize features known to be less important

[1]: The \(\theta_0\) term isn't regularized by convention (\(\theta_0\) updated as usual)
[2]: \(\frac{\lambda}{m}\theta_j\) serves as the regularization term
[3]: [2] can be rewritten as [3]; the second term is the usual (unregularized) step, while the first term now shrinks \(\theta_j\) by a factor \(<1\)
[4]: Regularized logistic regression cost function
\(\ds \begin{align} & \theta_0 := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_0 \quad \bb{1}\\ & \theta_j := \theta_j -\alpha\lrbra{\lrpar{\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j}+\sfrac{\lambda}{m}\theta_j} \ \ \bb{2} \\ & \theta_j := \theta_j\lrpar{1-\alpha\sfrac{\lambda}{m}}-\sfrac{\alpha}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j \quad \bb{3} \end{align}\)
(above apply to both linear & logistic regression)

\(\ds \begin{align} & J(\theta) = -\sfrac{1}{m}\sum_{i=1}^{m}[y^{(i)}\t{log}(h_{\theta}(x^{(i)}))\ + \\ & \hspace{56px}(1-y^{(i)})\t{log}(1-h_{\theta}(x^{(i)}))] + \sfrac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^2 \quad \bb{4} \end{align} \)
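
A minimal Octave sketch of one step of \(\bb{1}\) & \(\bb{2}\) in vectorized form, assuming linear regression with X (\(m\times(n+1)\), leading column of ones), y, theta, alpha, and lambda given:

grad   = (1/m) * X' * (X*theta - y);    % unregularized gradient, all j at once
reg    = (lambda/m) * theta;
reg(1) = 0;                             % theta_0 (index 1) is not regularized
theta  = theta - alpha * (grad + reg);  % one descent step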

Vectorization: Linear Regression

Optimizing computing efficiency by writing code that utilizes built-in optimized algorithms. For linear regression, the vectorized implementation is:

\(\ds \bn{\theta} := \bn{\theta} - \alpha \bn{\delta}, \t{ where }\bn{\delta} = \sfrac{1}{m}\sum_{i=1}^{m}(h_{\bn{\theta}}(\bn{x}^{(i)})-\bn{y}^{(i)}) \bn{x}^{(i)},\)
\(\ds \bn{x}^{(i)}=\lrbra{\nmtttx{x^{(i)}_0}{x^{(i)}_1}{\vdots} {x^{(i)}_n}},\)\(\ds\ \ \bn{\theta}=\lrbra{\nmtttx{\theta_0}{\theta_1}{\vdots}{\theta_n}}\)
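
In Octave, with the design matrix X stacking the \(\bn{x}^{(i)}\) as rows (so the hypotheses over all examples are X*theta), the update above is a two-line sketch (X, y, theta, alpha assumed given):

delta = (1/m) * X' * (X*theta - y);   % (n+1)-by-1; equals the sum above
theta = theta - alpha * delta;
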
Gradient Checking

Gradient checking guards against bugs in the backpropagation implementation by comparing the analytically computed gradient with a numerical approximation:

\(\ds \begin{align} & \frac{\pd}{\pd\Theta_j}J(\bn{\Theta})\approx\frac{J(\Theta_1,...,\Theta_j+\epsilon,...,\Theta_n)-J(\Theta_1,...,\Theta_j-\epsilon,...,\Theta_n)}{2\epsilon}, \\ & \t{where }\epsilon=\t{small }(\t{e.g. }10^{-4})\end{align}\)

- Once \(\t{gradApprox}\approx\t{deltaVector}\) is verified, gradient checking should be turned off, as it is computationally expensive
EPSILON = 1e-4;
for i = 1:n                   % n = number of parameters
  thetaPlus = theta;
  thetaPlus(i) += EPSILON;    % perturb i-th parameter up...
  thetaMinus = theta;
  thetaMinus(i) -= EPSILON;   % ...and down
  gradApprox(i) = (J(thetaPlus)-J(thetaMinus))/(2*EPSILON);  % central difference
end
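
One way to quantify the agreement (a sketch; the \(10^{-9}\) figure is a rough rule of thumb):

relDiff = norm(gradApprox(:) - deltaVector(:)) / norm(gradApprox(:) + deltaVector(:));
% relDiff on the order of 1e-9 suggests backprop computes correct gradients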

Minimizing Test Error

Balancing bias against variance to obtain optimal \(J_{\t{test}}\):

  • [1]: Create a list of lambdas - e.g. \(\lambda\in\{0,0.01,0.02,0.04,0.08,...,10.24\}\)
  • [2]: Create a set of models w/ different polynomial degrees, hidden layers, neurons per layer, etc.
  • [3]: Iterate through all the combinations of \(\lambda\)'s & models to learn some \(\bn{\Theta}\)
  • [4]: Compute \(J_{\t{CV}}\) using the learned \(\bn{\Theta}\) (trained with \(\lambda\)), but with the regularization term omitted (\(\lambda=0\))
  • [5]: Select combination yielding the lowest \(J_{\t{CV}}\)
  • [6]: Apply the selected \(\bn{\Theta}\) & \(\lambda\) on \(J_{\t{test}}\) to assess success of generalization
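
A minimal Octave sketch of [1]-[6] for the \(\lambda\) sweep alone; trainFn & costFn are hypothetical helpers (learn \(\bn{\Theta}\) with regularization \(\lambda\); evaluate unregularized error):

lambdas = [0 0.01 0.02 0.04 0.08 0.16 0.32 0.64 1.28 2.56 5.12 10.24];
bestJ = Inf;
for k = 1:length(lambdas)
  Theta = trainFn(Xtrain, ytrain, lambdas(k));   % step [3]
  Jcv   = costFn(Xcv, ycv, Theta);               % step [4]: no lambda term here
  if Jcv < bestJ, bestJ = Jcv; bestLambda = lambdas(k); bestTheta = Theta; end
end
Jtest = costFn(Xtest, ytest, bestTheta);         % step [6]
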
Fixing High Bias:
- Add features
- Decrease \(\lambda\)

Fixing High Variance:
- Remove features
- Increase \(\lambda\)
- Add training examples
Unrolling Parameters

Advanced optimization libraries may expect vector input arguments; hence, to use them with neural networks, parameter matrices can be 'unrolled' into vectors:

Given matrices \(\bn{\Theta^{(1)}},\bn{\Theta^{(2)}},\bn{\Theta^{(3)}}\), and \(\bn{D^{(1)}},\bn{D^{(2)}},\bn{D^{(3)}}\), we can unroll all the elements into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:)]
deltaVector = [ D1(:); D2(:); D3(:)]

If Theta1, Theta2, Theta3 dims are 10x11, 10x11, and 1x11, respectively, the original matrices can be retrieved as:
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
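
A quick round-trip check (a sketch with random matrices) confirms that unrolling stacks columns in order:

Theta1 = rand(10,11);  Theta2 = rand(10,11);  Theta3 = rand(1,11);
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];             % 231-by-1
assert(isequal(reshape(thetaVector(1:110),10,11), Theta1));  % recovered exactly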

Advanced Libraries

Algorithm libraries can be imported to apply advanced optimization methods to solve problems.

\(\ds \begin{align} & \t{Let } \\ & J(\theta)= (\theta_1 - 5)^2 \\ & \hspace{37px} + (\theta_2 -5)^2 \\ & \t{Then,} \\ & \sfrac{\pd}{\pd\theta_1} J(\theta) = 2(\theta_1 -5) \\ & \sfrac{\pd}{\pd\theta_2}J(\theta)=2(\theta_2 -5) \end{align}\)
function [jVal,grad] = costFunction(theta)
   jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
   grad = zeros(2,1);
   grad(1) = 2*(theta(1)-5);
   grad(2) = 2*(theta(2)-5);
end

options = optimset('GradObj','on', ... 
                        'MaxIter',100);
initTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = ... 
 fminunc(@costFunction,initTheta,options);

% see help fminunc, options, optimset

Big Data Handling

Large datasets require appropriate methods for efficiently structuring a machine learning project

  • Should I train on full dataset? / Will my algorithm scale well?
     \(\ra\) Plot learning curves for smaller \(m\), check for bias/variance
  • Computer can't handle full dataset
     \(\ra\) Map-reduce: split workload among computers or CPU cores (see the sketch after this list)
  • How do I get more data? -- (example: character recognition):
    Artificial data synthesis:
     \(\ra\) compile examples likely to occur in test set (e.g. fonts library)
     \(\ra\) add noise / distortion to existing data that's likely to occur in test set (e.g. blur, background)
    Labeling personally: estimate the time required to grow the dataset 10x
    Labeling service: can be done inexpensively
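
A minimal sketch of the map-reduce idea for batch GD: partial gradient sums over 4 contiguous data splits, run sequentially here, though each chunk could go to a separate machine or core (X, y, theta, alpha assumed given):

splits = round(linspace(0, m, 5));     % boundaries of 4 chunks
delta  = zeros(size(theta));
for w = 1:4                            % "map": one partial sum per worker
  idx   = splits(w)+1 : splits(w+1);
  delta = delta + X(idx,:)' * (X(idx,:)*theta - y(idx));
end
theta = theta - alpha*(1/m)*delta;     % "reduce": combine sums & take one step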



