Dragon Notes


\( \newcommand{bvec}[1]{\overrightarrow{\boldsymbol{#1}}} \newcommand{bnvec}[1]{\overrightarrow{\boldsymbol{\mathrm{#1}}}} \newcommand{uvec}[1]{\widehat{\boldsymbol{#1}}} \newcommand{vec}[1]{\overrightarrow{#1}} \newcommand{\parallelsum}{\mathbin{\|}} \) \( \newcommand{s}[1]{\small{#1}} \newcommand{t}[1]{\text{#1}} \newcommand{tb}[1]{\textbf{#1}} \newcommand{ns}[1]{\normalsize{#1}} \newcommand{ss}[1]{\scriptsize{#1}} \newcommand{vpl}[]{\vphantom{\large{\int^{\int}}}} \newcommand{vplup}[]{\vphantom{A^{A^{A^A}}}} \newcommand{vplLup}[]{\vphantom{A^{A^{A^{A{^A{^A}}}}}}} \newcommand{vpLup}[]{\vphantom{A^{A^{A^{A^{A^{A^{A^A}}}}}}}} \newcommand{up}[]{\vplup} \newcommand{Up}[]{\vplLup} \newcommand{Uup}[]{\vpLup} \newcommand{vpL}[]{\vphantom{\Large{\int^{\int}}}} \newcommand{lrg}[1]{\class{lrg}{#1}} \newcommand{sml}[1]{\class{sml}{#1}} \newcommand{qq}[2]{{#1}_{\t{#2}}} \newcommand{ts}[2]{\t{#1}_{\t{#2}}} \) \( \newcommand{ds}[]{\displaystyle} \newcommand{dsup}[]{\displaystyle\vplup} \newcommand{u}[1]{\underline{#1}} \newcommand{tu}[1]{\underline{\text{#1}}} \newcommand{tbu}[1]{\underline{\bf{\text{#1}}}} \newcommand{bxred}[1]{\class{bxred}{#1}} \newcommand{Bxred}[1]{\class{bxred2}{#1}} \newcommand{lrpar}[1]{\left({#1}\right)} \newcommand{lrbra}[1]{\left[{#1}\right]} \newcommand{lrabs}[1]{\left|{#1}\right|} \newcommand{bnlr}[2]{\bn{#1}\left(\bn{#2}\right)} \newcommand{nblr}[2]{\bn{#1}(\bn{#2})} \newcommand{real}[1]{\Ree\{{#1}\}} \newcommand{Real}[1]{\Ree\left\{{#1}\right\}} \) \( \newcommand{bn}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{bns}[2]{\bn{#1}_{\t{#2}}} \newcommand{b}[1]{\boldsymbol{#1}} \newcommand{bb}[1]{[\bn{#1}]} \) \( \newcommand{abs}[1]{\left|{#1}\right|} \newcommand{ra}[]{\rightarrow} \newcommand{Ra}[]{\Rightarrow} \newcommand{Lra}[]{\Leftrightarrow} \newcommand{rai}[]{\rightarrow\infty} \newcommand{ub}[2]{\underbrace{{#1}}_{#2}} \newcommand{ob}[2]{\overbrace{{#1}}^{#2}} \newcommand{lfrac}[2]{\large{\frac{#1}{#2}}\normalsize{}} \newcommand{sfrac}[2]{\small{\frac{#1}{#2}}\normalsize{}} \newcommand{Cos}[1]{\cos{\left({#1}\right)}} \newcommand{Sin}[1]{\sin{\left({#1}\right)}} \newcommand{Frac}[2]{\left({\frac{#1}{#2}}\right)} \newcommand{LFrac}[2]{\large{{\left({\frac{#1}{#2}}\right)}}\normalsize{}} \newcommand{Sinf}[2]{\sin{\left(\frac{#1}{#2}\right)}} \newcommand{Cosf}[2]{\cos{\left(\frac{#1}{#2}\right)}} \newcommand{atan}[1]{\tan^{-1}({#1})} \newcommand{Atan}[1]{\tan^{-1}\left({#1}\right)} \newcommand{intlim}[2]{\int\limits_{#1}^{#2}} \newcommand{lmt}[2]{\lim_{{#1}\rightarrow{#2}}} \newcommand{ilim}[1]{\lim_{{#1}\rightarrow\infty}} \newcommand{zlim}[1]{\lim_{{#1}\rightarrow 0}} \newcommand{Pr}[]{\t{Pr}} \newcommand{prop}[]{\propto} \newcommand{ln}[1]{\t{ln}({#1})} \newcommand{Ln}[1]{\t{ln}\left({#1}\right)} \newcommand{min}[2]{\t{min}({#1},{#2})} \newcommand{Min}[2]{\t{min}\left({#1},{#2}\right)} \newcommand{max}[2]{\t{max}({#1},{#2})} \newcommand{Max}[2]{\t{max}\left({#1},{#2}\right)} \newcommand{pfrac}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{pd}[]{\partial} \newcommand{zisum}[1]{\sum_{{#1}=0}^{\infty}} \newcommand{iisum}[1]{\sum_{{#1}=-\infty}^{\infty}} \newcommand{var}[1]{\t{var}({#1})} \newcommand{exp}[1]{\t{exp}\left({#1}\right)} \newcommand{mtx}[2]{\left[\begin{matrix}{#1}\\{#2}\end{matrix}\right]} \newcommand{nmtx}[2]{\begin{matrix}{#1}\\{#2}\end{matrix}} \newcommand{nmttx}[3]{\begin{matrix}\begin{align} {#1}& \\ {#2}& \\ {#3}& \\ \end{align}\end{matrix}} \newcommand{amttx}[3]{\begin{matrix} {#1} \\ {#2} \\ {#3} \\ \end{matrix}} 
\newcommand{nmtttx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{mtxx}[4]{\left[\begin{matrix}\begin{align}&{#1}&\hspace{-20px}{#2}\\&{#3}&\hspace{-20px}{#4}\end{align}\end{matrix}\right]} \newcommand{mtxxx}[9]{\begin{matrix}\begin{align} &{#1}&\hspace{-20px}{#2}&&\hspace{-20px}{#3}\\ &{#4}&\hspace{-20px}{#5}&&\hspace{-20px}{#6}\\ &{#7}&\hspace{-20px}{#8}&&\hspace{-20px}{#9} \end{align}\end{matrix}} \newcommand{amtxxx}[9]{ \amttx{#1}{#4}{#7}\hspace{10px} \amttx{#2}{#5}{#8}\hspace{10px} \amttx{#3}{#6}{#9}} \) \( \newcommand{ph}[1]{\phantom{#1}} \newcommand{vph}[1]{\vphantom{#1}} \newcommand{mtxxxx}[8]{\begin{matrix}\begin{align} & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4} \\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \\ \mtxxxxCont} \newcommand{\mtxxxxCont}[8]{ & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4}\\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \end{align}\end{matrix}} \newcommand{mtXxxx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{cov}[1]{\t{cov}({#1})} \newcommand{Cov}[1]{\t{cov}\left({#1}\right)} \newcommand{var}[1]{\t{var}({#1})} \newcommand{Var}[1]{\t{var}\left({#1}\right)} \newcommand{pnint}[]{\int_{-\infty}^{\infty}} \newcommand{floor}[1]{\left\lfloor {#1} \right\rfloor} \) \( \newcommand{adeg}[1]{\angle{({#1}^{\t{o}})}} \newcommand{Ree}[]{\mathcal{Re}} \newcommand{Im}[]{\mathcal{Im}} \newcommand{deg}[1]{{#1}^{\t{o}}} \newcommand{adegg}[1]{\angle{{#1}^{\t{o}}}} \newcommand{ang}[1]{\angle{\left({#1}\right)}} \newcommand{bkt}[1]{\langle{#1}\rangle} \) \( \newcommand{\hs}[1]{\hspace{#1}} \)

  UNDER CONSTRUCTION

Machine Learning:
Optimization Methods



Optimization Methods

[1] Feature scaling, \(x_i := x_i / s_i\). (scaling input features to a similar range, by dividing input values by input range)
[2] Mean normalization, \(x_i := (x_i-\mu_i)/s_i\). (shifting input features to a zero mean, and scaling, by subtracting input mean & dividing by input range or std. dev)
[3] Vectorization - improving computational efficiency by writing code that utilizes built-in, optimized matrix routines
[4] Regularization - increasing algorithm generality to unfamiliar data by minimizing overfitting
[5] Advanced libraries - algorithm libraries can be imported to apply advanced optimization methods to solve problems
[6] Test error minimization - balancing bias against variance to obtain optimal \(J_{\t{test}}\)
[1, 2]: \(\theta\) descends more quickly on small ranges than on large ones, and oscillates inefficiently down to the optimum if features are scaled very unevenly (see the sketch after this list)
[6]: - Diagnosing the hypothesis for bias & variance
- Varying model & regularization parameters
- Testing the hypothesis on cross-validation & test sets
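
A minimal Octave sketch of [1] & [2], assuming X is an m x n feature matrix with one training example per row (all names here are illustrative; older MATLAB versions may need bsxfun for the element-wise broadcasting):
mu = mean(X);               % 1 x n vector of feature means
s  = max(X) - min(X);       % 1 x n vector of feature ranges (std(X) also works)
X_scaled = X ./ s;          % [1] feature scaling only
X_norm   = (X - mu) ./ s;   % [2] mean normalization + scaling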

Regularization

An optimization method aiming to generalize an algorithm by minimizing overfitting
(1): Adding a regularization term to penalize all features, hence reducing the weight of unnecessary ones
(2): Adding a regularization term to penalize features known to be less important

[1]: By convention, \(\theta_0\) (the parameter for \(x_0\)) isn't regularized and is updated as usual
[2]: \(\frac{\lambda}{m}\theta_j\) serves as the regularization term
[3]: [2] can be rewritten as [3]; the second term is that of the usual update, while the first term now scales \(\theta_j\) by a quantity \(<1\)
[4]: Regularized logistic regression cost function
\(\ds \begin{align} & \theta_0 := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_0 \quad \bb{1}\\ & \theta_j := \theta_j -\alpha\lrbra{\lrpar{\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j}+\sfrac{\lambda}{m}\theta_j} \ \ \bb{2} \\ & \theta_j := \theta_j\lrpar{1-\alpha\sfrac{\lambda}{m}}-\sfrac{\alpha}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j \quad \bb{3} \end{align}\)
(above apply to both linear & logistic regression)

\(\ds \begin{align} & J(\theta) = -\sfrac{1}{m}\sum_{i=1}^{m}[y^{(i)}\t{log}(h_{\theta}(x^{(i)})) \\ & \hspace{56px}+(1-y^{(i)})\t{log}(1-h_{\theta}(x^{(i)}))] + \sfrac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^2 \quad \bb{4} \end{align} \)
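
A minimal Octave sketch of [2] & [4] for regularized logistic regression (the function name and explicit sigmoid are illustrative; X is assumed to carry a leading column of ones):
function [jVal, gradient] = regLogCost(theta, X, y, lambda)
   % X: m x (n+1) design matrix, y: m x 1 labels in {0,1}
   m = length(y);
   h = 1 ./ (1 + exp(-X*theta));                 % sigmoid hypothesis h_theta(x)
   % [4]: cross-entropy cost + regularization term (theta(1), i.e. theta_0, excluded)
   jVal = -(1/m)*(y'*log(h) + (1-y)'*log(1-h)) ...
          + (lambda/(2*m))*sum(theta(2:end).^2);
   % [1], [2]: gradient; theta_0 is not regularized
   gradient = (1/m) * X' * (h - y);
   gradient(2:end) = gradient(2:end) + (lambda/m)*theta(2:end);
end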

Vectorization: Linear Regression

Improving computational efficiency by writing code that utilizes built-in, optimized matrix routines. For linear regression, the vectorized implementation is:

\(\ds \bn{\theta} := \bn{\theta} - \alpha \bn{\delta}, \t{ where}\)
\(\ds \bn{\delta} = \sfrac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\bn{x}^{(i)},\ \ \bn{x}^{(i)}=\lrbra{\nmtttx{x^{(i)}_0}{x^{(i)}_1}{\vdots} {x^{(i)}_n}},\)\(\ds\ \ \bn{\theta}=\lrbra{\nmtttx{\theta_0}{\theta_1}{\vdots}{\theta_n}}\)
The parameters are updated simultaneously.
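
A minimal Octave sketch of one update step, assuming X is the m x (n+1) design matrix (leading column of ones), y the m x 1 targets, and alpha the learning rate (names are illustrative):
h     = X * theta;              % m x 1 predictions h_theta(x^(i))
delta = (1/m) * X' * (h - y);   % (n+1) x 1 gradient vector
theta = theta - alpha * delta;  % simultaneous update of all parameters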

Advanced Libraries

Algorithm libraries can be imported to apply advanced optimization methods to solve problems.

\(\ds \begin{align} \t{Let } & J(\theta)=(\theta_1 - 5)^2 + (\theta_2 -5)^2. \t{ Then,} \\ & \sfrac{\pd}{\pd\theta_1} J(\theta) = 2(\theta_1 -5) \\ & \sfrac{\pd}{\pd\theta_2}J(\theta)=2(\theta_2 -5) \end{align}\)
% cost function returning both J(theta) and its gradient
function [jVal, gradient] = costFunction(theta)
   jVal = (theta(1)-5)^2 + (theta(2)-5)^2;   % J(theta)
   gradient = zeros(2,1);
   gradient(1) = 2*(theta(1)-5);             % dJ/d(theta_1)
   gradient(2) = 2*(theta(2)-5);             % dJ/d(theta_2)
end

% 'GradObj','on': a gradient is supplied; cap iterations at 100
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = ...
   fminunc(@costFunction, initialTheta, options);

% see: help fminunc, help optimset


Minimizing Test Error

Balancing bias against variance to obtain an optimal \(J_{\t{test}}\) (a selection-loop sketch follows the table below):

  • [1]: Create a list of lambdas - e.g. \(\lambda\in\{0,0.01,0.02,0.04,0.08,...,10.24\}\)
  • [2]: Create a set of models w/ different polynomial degrees, hidden layers, neurons per layer, etc.
  • [3]: Iterate through all the combinations of \(\lambda\)'s & models to learn some \(\bn{\Theta}\)
  • [4]: Compute \(J_{\t{CV}}\) using the learned \(\bn{\Theta}\) (trained w/ \(\lambda\)), but with no regularization in the error itself (i.e. \(\lambda=0\))
  • [5]: Select combination yielding the lowest \(J_{\t{CV}}\)
  • [6]: Apply the selected \(\bn{\Theta}\) & \(\lambda\) on \(J_{\t{test}}\) to assess success of generalization
Fixing High Bias:
- Add features
- Decrease \(\lambda\)
Fixing High Variance:
- Remove features
- Increase \(\lambda\)
- Add training examples
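
A minimal Octave sketch of steps [1]-[6], assuming hypothetical helpers trainModel(d, lambda, Xtrain, ytrain) (learns Theta for model complexity d) and computeCost(Theta, X, y) (unregularized error, i.e. \(\lambda=0\)), plus pre-split Xtrain/ytrain, Xcv/ycv, Xtest/ytest sets:
lambdas = [0 0.01 0.02 0.04 0.08 0.16 0.32 0.64 1.28 2.56 5.12 10.24];   % [1]
degrees = 1:10;                                       % [2] candidate model complexities
bestJcv = Inf;
for d = degrees
   for lambda = lambdas
      Theta = trainModel(d, lambda, Xtrain, ytrain);  % [3] learn Theta w/ this lambda
      Jcv   = computeCost(Theta, Xcv, ycv);           % [4] CV error, no regularization
      if Jcv < bestJcv                                % [5] keep the best combination
         bestJcv = Jcv; bestTheta = Theta; bestLambda = lambda; bestDegree = d;
      end
   end
end
Jtest = computeCost(bestTheta, Xtest, ytest);         % [6] assess generalization
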
Unrolling Parameters

Advanced optimization libraries may expect vector input arguments; hence, to use them with neural networks, parameter matrices can be 'unrolled' into vectors:

Given matrices \(\bn{\Theta^{(1)}},\bn{\Theta^{(2)}},\bn{\Theta^{(3)}}\), and \(\bn{D^{(1)}},\bn{D^{(2)}},\bn{D^{(3)}}\), we can unroll all the elements into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:)]
deltaVector = [ D1(:); D2(:); D3(:)]

If the dimensions of Theta1, Theta2, Theta3 are 10x11, 10x11, and 1x11, respectively, the original matrices can be retrieved as:
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
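
A minimal sketch of how the unrolled vector might then be fed to an optimizer such as fminunc, assuming a hypothetical nnCostFunction(thetaVector) that reshapes thetaVector internally (as above) and returns the cost together with the unrolled gradient:
initialTheta = [Theta1(:); Theta2(:); Theta3(:)];    % unroll initial parameters
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = ...
   fminunc(@nnCostFunction, initialTheta, options);
% optTheta is again a vector; reshape it as shown above to recover the matrices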





By OverLordGoldDragon