
\( \newcommand{bvec}[1]{\overrightarrow{\boldsymbol{#1}}} \newcommand{bnvec}[1]{\overrightarrow{\boldsymbol{\mathrm{#1}}}} \newcommand{uvec}[1]{\widehat{\boldsymbol{#1}}} \newcommand{vec}[1]{\overrightarrow{#1}} \newcommand{\parallelsum}{\mathbin{\|}} \) \( \newcommand{s}[1]{\small{#1}} \newcommand{t}[1]{\text{#1}} \newcommand{tb}[1]{\textbf{#1}} \newcommand{ns}[1]{\normalsize{#1}} \newcommand{ss}[1]{\scriptsize{#1}} \newcommand{vpl}[]{\vphantom{\large{\int^{\int}}}} \newcommand{vplup}[]{\vphantom{A^{A^{A^A}}}} \newcommand{vplLup}[]{\vphantom{A^{A^{A^{A{^A{^A}}}}}}} \newcommand{vpLup}[]{\vphantom{A^{A^{A^{A^{A^{A^{A^A}}}}}}}} \newcommand{up}[]{\vplup} \newcommand{Up}[]{\vplLup} \newcommand{Uup}[]{\vpLup} \newcommand{vpL}[]{\vphantom{\Large{\int^{\int}}}} \newcommand{lrg}[1]{\class{lrg}{#1}} \newcommand{sml}[1]{\class{sml}{#1}} \newcommand{qq}[2]{{#1}_{\t{#2}}} \newcommand{ts}[2]{\t{#1}_{\t{#2}}} \) \( \newcommand{ds}[]{\displaystyle} \newcommand{dsup}[]{\displaystyle\vplup} \newcommand{u}[1]{\underline{#1}} \newcommand{tu}[1]{\underline{\text{#1}}} \newcommand{tbu}[1]{\underline{\bf{\text{#1}}}} \newcommand{bxred}[1]{\class{bxred}{#1}} \newcommand{Bxred}[1]{\class{bxred2}{#1}} \newcommand{lrpar}[1]{\left({#1}\right)} \newcommand{lrbra}[1]{\left[{#1}\right]} \newcommand{lrabs}[1]{\left|{#1}\right|} \newcommand{bnlr}[2]{\bn{#1}\left(\bn{#2}\right)} \newcommand{nblr}[2]{\bn{#1}(\bn{#2})} \newcommand{real}[1]{\Ree\{{#1}\}} \newcommand{Real}[1]{\Ree\left\{{#1}\right\}} \newcommand{abss}[1]{\|{#1}\|} \newcommand{umin}[1]{\underset{{#1}}{\t{min}}} \newcommand{umax}[1]{\underset{{#1}}{\t{max}}} \newcommand{und}[2]{\underset{{#1}}{{#2}}} \) \( \newcommand{bn}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{bns}[2]{\bn{#1}_{\t{#2}}} \newcommand{b}[1]{\boldsymbol{#1}} \newcommand{bb}[1]{[\bn{#1}]} \) \( \newcommand{abs}[1]{\left|{#1}\right|} \newcommand{ra}[]{\rightarrow} \newcommand{Ra}[]{\Rightarrow} \newcommand{Lra}[]{\Leftrightarrow} \newcommand{rai}[]{\rightarrow\infty} \newcommand{ub}[2]{\underbrace{{#1}}_{#2}} \newcommand{ob}[2]{\overbrace{{#1}}^{#2}} \newcommand{lfrac}[2]{\large{\frac{#1}{#2}}\normalsize{}} \newcommand{sfrac}[2]{\small{\frac{#1}{#2}}\normalsize{}} \newcommand{Cos}[1]{\cos{\left({#1}\right)}} \newcommand{Sin}[1]{\sin{\left({#1}\right)}} \newcommand{Frac}[2]{\left({\frac{#1}{#2}}\right)} \newcommand{LFrac}[2]{\large{{\left({\frac{#1}{#2}}\right)}}\normalsize{}} \newcommand{Sinf}[2]{\sin{\left(\frac{#1}{#2}\right)}} \newcommand{Cosf}[2]{\cos{\left(\frac{#1}{#2}\right)}} \newcommand{atan}[1]{\tan^{-1}({#1})} \newcommand{Atan}[1]{\tan^{-1}\left({#1}\right)} \newcommand{intlim}[2]{\int\limits_{#1}^{#2}} \newcommand{lmt}[2]{\lim_{{#1}\rightarrow{#2}}} \newcommand{ilim}[1]{\lim_{{#1}\rightarrow\infty}} \newcommand{zlim}[1]{\lim_{{#1}\rightarrow 0}} \newcommand{Pr}[]{\t{Pr}} \newcommand{prop}[]{\propto} \newcommand{ln}[1]{\t{ln}({#1})} \newcommand{Ln}[1]{\t{ln}\left({#1}\right)} \newcommand{min}[2]{\t{min}({#1},{#2})} \newcommand{Min}[2]{\t{min}\left({#1},{#2}\right)} \newcommand{max}[2]{\t{max}({#1},{#2})} \newcommand{Max}[2]{\t{max}\left({#1},{#2}\right)} \newcommand{pfrac}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{pd}[]{\partial} \newcommand{zisum}[1]{\sum_{{#1}=0}^{\infty}} \newcommand{iisum}[1]{\sum_{{#1}=-\infty}^{\infty}} \newcommand{var}[1]{\t{var}({#1})} \newcommand{exp}[1]{\t{exp}\left({#1}\right)} \newcommand{mtx}[2]{\left[\begin{matrix}{#1}\\{#2}\end{matrix}\right]} \newcommand{nmtx}[2]{\begin{matrix}{#1}\\{#2}\end{matrix}} 
\newcommand{nmttx}[3]{\begin{matrix}\begin{align} {#1}& \\ {#2}& \\ {#3}& \\ \end{align}\end{matrix}} \newcommand{amttx}[3]{\begin{matrix} {#1} \\ {#2} \\ {#3} \\ \end{matrix}} \newcommand{nmtttx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{mtxx}[4]{\left[\begin{matrix}\begin{align}&{#1}&\hspace{-20px}{#2}\\&{#3}&\hspace{-20px}{#4}\end{align}\end{matrix}\right]} \newcommand{mtxxx}[9]{\begin{matrix}\begin{align} &{#1}&\hspace{-20px}{#2}&&\hspace{-20px}{#3}\\ &{#4}&\hspace{-20px}{#5}&&\hspace{-20px}{#6}\\ &{#7}&\hspace{-20px}{#8}&&\hspace{-20px}{#9} \end{align}\end{matrix}} \newcommand{amtxxx}[9]{ \amttx{#1}{#4}{#7}\hspace{10px} \amttx{#2}{#5}{#8}\hspace{10px} \amttx{#3}{#6}{#9}} \) \( \newcommand{ph}[1]{\phantom{#1}} \newcommand{vph}[1]{\vphantom{#1}} \newcommand{mtxxxx}[8]{\begin{matrix}\begin{align} & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4} \\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \\ \mtxxxxCont} \newcommand{\mtxxxxCont}[8]{ & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4}\\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \end{align}\end{matrix}} \newcommand{mtXxxx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{cov}[1]{\t{cov}({#1})} \newcommand{Cov}[1]{\t{cov}\left({#1}\right)} \newcommand{var}[1]{\t{var}({#1})} \newcommand{Var}[1]{\t{var}\left({#1}\right)} \newcommand{pnint}[]{\int_{-\infty}^{\infty}} \newcommand{floor}[1]{\left\lfloor {#1} \right\rfloor} \) \( \newcommand{adeg}[1]{\angle{({#1}^{\t{o}})}} \newcommand{Ree}[]{\mathcal{Re}} \newcommand{Im}[]{\mathcal{Im}} \newcommand{deg}[1]{{#1}^{\t{o}}} \newcommand{adegg}[1]{\angle{{#1}^{\t{o}}}} \newcommand{ang}[1]{\angle{\left({#1}\right)}} \newcommand{bkt}[1]{\langle{#1}\rangle} \) \( \newcommand{\hs}[1]{\hspace{#1}} \)

  UNDER CONSTRUCTION

Deep Learning:
Optimization Methods



Gradient Descent

Seeks to minimize the cost function by iteratively stepping against the gradient:

\(\ds \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\bn{\theta})\)
  \((\t{for }j=0,1,...,n)\)
\(\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\bn{\theta}) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\bn{\theta}) \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{\pd}{\pd \theta_n}J(\bn{\theta}) \\ \end{align} \)

- repeated until convergence; \(\bn{\theta} =\) model parameter vector, \(\alpha =\) learning rate
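
A minimal NumPy sketch of this update for a squared-error cost (the design matrix X, targets y, and the defaults here are illustrative assumptions, not from the notes):

    import numpy as np

    def gradient_descent(X, y, alpha=0.01, iters=1000):
        # X: m x (n+1) design matrix (ones column for theta_0); y: length-m targets
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iters):
            grad = X.T @ (X @ theta - y) / m    # dJ/d(theta) for squared-error J
            theta = theta - alpha * grad        # simultaneous update of every theta_j
        return theta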

Stochastic Gradient Descent

Descent with the gradient computed from a single training example - trading gradient accuracy for speed:

    Algorithm: (Stochastic)\(\up\)
  • - Randomly reorder training examples
  • - Repeat {\(\hspace{9px}\)for \(i = 1\hspace{2px}\t{:}\hspace{2px}m\)
    \(\theta_j \t{ := } \theta_j - \alpha(h_{\bn{\theta}}(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\)  \((\t{for }j=0,1,...,n)\) }

    Algorithm: (Mini-batch)
  • Suppose \(m=1000\), \(b=10\):
  • - Randomly reorder training examples
  • - Repeat {\(\hspace{9px}\)for \(i = 1, 11, 21, ..., 991\)
    \(\ds\theta_j \t{ := } \theta_j - \alpha\frac{1}{10}\sum_{k=i}^{i+9}(h_{\bn{\theta}}(x^{(k)})-y^{(k)})\cdot x_j^{(k)}\) }

Learning rate: can reduce \(\alpha\) over iterations to aid convergence
Stochastic gradient descent vs mini-batch vs batch
  • - Batch GD can be slow for very large datasets \((m\approx 10^8)\)
  • - Mini-batch computes gradient over \(b\) examples
    \(\quad\ra\) Faster than batch, more accurate than stochastic
  • - Stochastic & mini-batch oscillate about the global minimum rather than settling exactly
  • - Can check convergence by plotting \(J\) vs. # iterations
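
A minimal NumPy sketch of mini-batch SGD for a squared-error cost (b, the decay factor, and the data are illustrative assumptions; b=1 recovers the stochastic variant, b=m the batch variant):

    import numpy as np

    def minibatch_sgd(X, y, alpha=0.01, b=10, epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(epochs):
            order = rng.permutation(m)              # randomly reorder training examples
            for i in range(0, m, b):                # i = 0, b, 2b, ...
                idx = order[i:i + b]                # current mini-batch of (up to) b examples
                Xb, yb = X[idx], y[idx]
                theta -= alpha * Xb.T @ (Xb @ theta - yb) / len(idx)
            alpha *= 0.95                           # optionally reduce alpha to aid convergence
        return theta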

Dropout

Randomly eliminating neurons to regularize the network:

    Algorithm:\(\up\)
  • [1] Set dropout rate:
    1 - keep_prob
  • [2] Generate dropout vector:
    D2 = rand(n2,n1)
  • [3] Drop neurons, scale the rest to preserve expected \(z^{[l]}\):
    A2 = A2*(D2 < keep_prob)/keep_prob

  • - By dropping a portion of neurons from each layer, the remaining neurons are forced to make the best use of available inputs - reducing reliance on other neurons to "make up" for their shortcomings
  • - Withholding information, to an extent, mimics fitting on the CV set; with each iteration, neurons work with somewhat unfamiliar inputs - tuning them to work well over a greater range of examples
  • - Each hidden layer is effectively both a predictor (takes inputs from \(l-1\)) and an input (feeds outputs to \(l+1\)); hence, if a neuron is an excellent predictor but no neuron in the next layer finds it relevant as an input, its weights are diminished
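
A minimal NumPy sketch of the inverted-dropout steps [1]-[3] above, with hypothetical layer-2 activations (the names keep_prob, D2, A2 follow the notes; shapes are illustrative):

    import numpy as np

    np.random.seed(0)
    keep_prob = 0.8                                 # [1] dropout rate = 1 - keep_prob = 0.2
    A2 = np.random.randn(5, 100)                    # hypothetical layer-2 activations
    D2 = np.random.rand(*A2.shape) < keep_prob      # [2] mask: True with prob. keep_prob
    A2 = A2 * D2 / keep_prob                        # [3] drop neurons, rescale the rest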

Adam Algorithm

Adaptive moment estimation descends with past gradients accounted for - combining momentum & RMSprop:

    Algorithm:\(\up\)
  • [1] Initialize \(v_{dW^{[l]}} = 0\), \(s_{dW^{[l]}} = 0\) for all \(l\)
  • [2] Update via
    for \(l = 1\hspace{2px}\t{:}\hspace{2px}L\) {  \(\hspace{3px}\ds v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \sfrac{\pd J}{\pd W^{[l]}}\)
    \(\ds v_{dW^{[l]}}^{\t{corr}} = \frac{v_{dW^{[l]}}}{1 - \beta_1^t}\)
    \(\ds s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2)\lrpar{\sfrac{\pd J}{\pd W^{[l]}}}^{\hspace{-4px}2}\)
    \(\ds s_{dW^{[l]}}^{\t{corr}} = \frac{s_{dW^{[l]}}}{1 - \beta_2^t}\)
    \(\ds W^{[l]} = W^{[l]} - \alpha \frac{v_{dW^{[l]}}^{\t{corr}}}{\sqrt{\smash{s_{dW^{[l]}}^{\t{corr}}}} + \epsilon}\)  }
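
A minimal NumPy sketch of one Adam step for a single weight matrix (grad and the defaults \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\) are typical assumptions; biases update analogously):

    import numpy as np

    def adam_step(W, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # t: 1-based step count; v, s: running moment estimates (initialized to zeros)
        v = beta1 * v + (1 - beta1) * grad          # momentum: 1st-moment estimate
        s = beta2 * s + (1 - beta2) * grad ** 2     # RMSprop: 2nd-moment estimate
        v_corr = v / (1 - beta1 ** t)               # bias corrections
        s_corr = s / (1 - beta2 ** t)
        W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
        return W, v, s
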
Batch Normalization

Normalizing intermediate layer inputs \(z^{[l]}\) to speed up and stabilize training, just as with the input data:

\(\ds z_{\t{norm}}^{[l](i)} = \frac{z^{[l](i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}\)
\(\dsup\widetilde{z}^{[l](i)} = \gamma z_{\t{norm}}^{[l](i)} + \beta\)
\(\ds \mu = \frac{1}{m} \sum_i z^{[l](i)}\)
\(\ds \sigma^2 = \frac{1}{m} \sum_i (z^{[l](i)} - \mu)^2\)

\(\gamma\) and \(\beta\) are learnable parameters:
\(\ds \beta^{[l]} = \beta^{[l]} - \alpha d\beta^{[l]},\ \ \gamma^{[l]} = \gamma^{[l]} - \alpha d\gamma^{[l]}\)

  • - Speeds up learning
  • - Increases robustness: deeper layers less sensitive to weight changes in previous layers
  • - Keeps each layer's mean & variance stable - stabilizing features w.r.t. the ground-truth decision boundary
  • - Adds some noise to \(\widetilde{z}\), since each mini-batch's mean & variance differ - a slight regularizing effect
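
A minimal NumPy sketch of the forward step above for one layer's pre-activations Z (the units x examples layout is an assumption; \(\gamma\), \(\beta\) are the learnable parameters):

    import numpy as np

    def batchnorm_forward(Z, gamma, beta, eps=1e-8):
        # Z: units x examples pre-activations for one layer
        mu = Z.mean(axis=1, keepdims=True)          # per-unit mean over the mini-batch
        var = Z.var(axis=1, keepdims=True)          # per-unit variance over the mini-batch
        Z_norm = (Z - mu) / np.sqrt(var + eps)
        return gamma * Z_norm + beta                # z~ = gamma * z_norm + beta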

Backpropagation

Minimizes the neural network cost function by computing each layer's error from that of the succeeding layer, starting from the output - 'back-propagating' to every layer (but the input). The goal is to compute \(\underset{\Theta}{\t{min}}J(\Theta)\).

Given training set \(\s{}\{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})\}\),
[1]: Set \(\s{}\Delta_{i,j}^{(l)}:=0\) for all \(\s{}(i,j,l)\)
[2]: For each training example \(t = 1\hspace{2px}\t{:}\hspace{2px}m\), set \(a^{(1)}:=x^{(t)}\)
[3]: Compute \(a^{(l)}\) for \(\s{}l=2,3,...,L\) using forward propagation
[4]: Compute \(\s{}\delta^{(L)}=a^{(L)}-y^{(t)}\) using \(y^{(t)}\)

[5]: Compute \(\s{}\delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)}\) using \(\s{}\delta^{(l)}=((\Theta^{(l)})^T\delta^{(l+1)}).*a^{(l)}.*(1-a^{(l)})\)
[6]: \(\s{}\Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}\) - vectorized, \(\s{}\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\)

[7]:
\(\s{}\begin{align} D_{i,j}^{(l)} &:=\sfrac{1}{m}\s{}\lrpar{\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)}} & \hs{-10px}j\neq 0 \\ &:=\sfrac{1}{m}\s{}\Delta_{i,j}^{(l)} & \hs{-10px} j=0 \end{align}\)

[8]: \(\Delta\) acts as an 'accumulator', summing gradient contributions across the training examples; \(D\) then yields the partial derivative: \(\frac{\pd}{\pd\Theta_{i,j}^{(l)}}\s{}J(\Theta)=D_{i,j}^{(l)}\)
Forward propagation:
Given a training set \((\bn{x},\bn{y})\), and \(L=4\),

\(\begin{align} \bn{a}^{(1)} &= \bn{x} \\ \bn{z}^{(2)} &= \bn{\Theta}^{(1)}\bn{a}^{(1)} \\ \bn{a}^{(2)} &= g(\bn{z}^{(2)}),\quad \t{add } a_0^{(2)} \\ \bn{z}^{(3)} &= \bn{\Theta}^{(2)}\bn{a}^{(2)} \\ \bn{a}^{(3)} &= g(\bn{z}^{(3)}),\quad \t{add }a_0^{(3)} \\ \bn{z}^{(4)} &= \bn{\Theta}^{(3)}\bn{a}^{(3)} \\ \bn{a}^{(4)} &= h_{\bn{\Theta}}(\bn{x}) = g(\bn{z}^{(4)}) \end{align}\)

Notes: In [5],
\(\ds\delta_j^{(l)}=\frac{\pd}{\pd z_j^{(l)}}\t{cost}(t)=\Big(\sum_i \Theta_{i,j}^{(l)}\,\delta_i^{(l+1)}\Big)\,g'(z_j^{(l)})\)

(The backpropagated error is scaled by the connecting weights \(\Theta\), and multiplied by the activation's rate of change w.r.t. its input - here \(g'(z)=a\,(1-a)\) for the sigmoid - yielding the error for that neuron)

In [6], omitting proof,

\(a_j^{(l)}\delta_i^{(l+1)}=\frac{\pd}{\pd\Theta_{i,j}^{(l)}}J(\bn{\Theta})\)
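
A minimal NumPy sketch of steps [1]-[8], vectorized over all examples (a sigmoid network matching [5] is assumed; bias units are omitted for brevity, so the \(j\neq 0\) distinction in [7] does not appear):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def backprop(Thetas, X, Y, lam=0.0):
        # Thetas: weight matrices Theta^(1)..Theta^(L-1); X: n x m inputs; Y: K x m labels
        m = X.shape[1]
        A = [X]                                           # [2] a^(1) = x
        for Theta in Thetas:                              # [3] forward propagation
            A.append(sigmoid(Theta @ A[-1]))
        delta = A[-1] - Y                                 # [4] delta^(L) = a^(L) - y
        grads = [None] * len(Thetas)
        for l in range(len(Thetas) - 1, -1, -1):
            Delta = delta @ A[l].T                        # [6] Delta = delta^(l+1) (a^(l))^T
            grads[l] = (Delta + lam * Thetas[l]) / m      # [7] D (no bias column to skip here)
            if l > 0:                                     # [5] back-propagate the error
                delta = (Thetas[l].T @ delta) * A[l] * (1 - A[l])
        return grads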

L2-Regularization

Regularizing the NN cost function by summing squared weights - decaying weights toward zero, which pushes each layer toward the linear regime of its activations: \(J(\bn{\Theta}) = \)

\(\ds \begin{align} & -\sfrac{1}{m}\sum_{i=1}^m\sum_{k=1}^K \lrpar{y_k^{(i)}\t{log}(h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)})\t{log}(1-(h_{\Theta}(x^{(i)}))_k)} \\ & +\sfrac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2 \end{align}\)

- \(L =\) # of layers; \(s_l = \) # of units in layer \(l\) (excluding bias unit)
- \((h_{\Theta}(x))_i = i^{\t{th}}\) output; \(h_{\Theta}(x),y\in \mathbb{R}^K\)
- \(K=\) # of output units/classes
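
A minimal NumPy sketch of this cost (the shapes of H and Y, and the bias-in-column-0 layout of each \(\Theta^{(l)}\), are assumptions):

    import numpy as np

    def cost(H, Y, Thetas, lam):
        # H: K x m network outputs h_Theta(x); Y: K x m one-hot labels
        m = Y.shape[1]
        ce = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
        reg = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)  # skip bias col j=0
        return ce + reg
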
Minimizing Test Error

Balancing bias against variance to obtain optimal \(J_{\t{test}}\):

  • [1]: Create a list of lambdas - e.g. \(\lambda\in\{0,.01,.02,.04,.08,...,10.24\}\)
  • [2]: Create a set of models w/ different polynomial degrees, hidden layers, neurons per layer, etc.
  • [3]: Iterate through all the combinations of \(\lambda\)'s & models to learn some \(\bn{\Theta}\)
  • [4]: Compute \(J_{\t{CV}}\) (without the regularization term) using each learned \(\bn{\Theta}\)
  • [5]: Select combination yielding the lowest \(J_{\t{CV}}\)
  • [6]: Apply the selected \(\bn{\Theta}\) & \(\lambda\) on \(J_{\t{test}}\) to assess success of generalization
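
A runnable sketch of steps [1]-[6] on synthetic data, using squared-error polynomial models as stand-ins (all data, degrees, and helpers here are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 60)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(60)
    tr, cv, te = np.split(rng.permutation(60), [36, 48])     # train / CV / test indices

    def design(x, d):                                        # polynomial features, degree d
        return np.vander(x, d + 1)                           # bias column is last

    def fit(X, y, lam):                                      # regularized normal equation
        R = np.eye(X.shape[1])
        R[-1, -1] = 0                                        # don't regularize the bias term
        return np.linalg.solve(X.T @ X + lam * R, X.T @ y)

    def J(theta, X, y):                                      # unregularized squared-error cost
        return np.mean((X @ theta - y) ** 2) / 2

    best = None
    for d in range(1, 9):                                    # [2] candidate models
        for lam in [0] + [0.01 * 2 ** k for k in range(11)]: # [1] lambda list, up to 10.24
            theta = fit(design(x[tr], d), y[tr], lam)        # [3] learn Theta with lambda
            j_cv = J(theta, design(x[cv], d), y[cv])         # [4] J_CV without lambda
            if best is None or j_cv < best[0]:
                best = (j_cv, d, lam, theta)                 # [5] keep the lowest J_CV
    j_cv, d, lam, theta = best
    j_test = J(theta, design(x[te], d), y[te])               # [6] assess generalization
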
Fixing High Bias:
  • - Add features
  • - Decrease \(\lambda\)
Fixing High Variance:
  • - Remove features
  • - Increase \(\lambda\)
  • - Add training examples



