# Dragon Notes


# Deep Learning: Optimization Methods

Gradient descent seeks to minimize the cost function via iterative step-approximations:

$$\ds \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\bn{\theta})$$
$$(\t{for }j=0,1,...,n)$$
$$\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\bn{\theta}) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\bn{\theta}) \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{\pd}{\pd \theta_n}J(\bn{\theta}) \\ \end{align}$$

- Repeated until convergence; $$\bn{\theta} =$$ model parameter vector
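As a concrete illustration, here is a minimal numpy sketch of the batch update above for linear regression with squared-error cost (function and variable names are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iters=2000):
    """Batch gradient descent for linear regression (squared-error cost).

    X: (m, n+1) design matrix with a leading column of ones (the bias term).
    y: (m,) target vector.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # dJ/d(theta_j) for all j at once
        theta = theta - alpha * grad       # simultaneous update of every theta_j
    return theta

# Recover y = 1 + 2x from noiseless data
X = np.c_[np.ones(50), np.linspace(0, 1, 50)]
y = 1 + 2 * X[:, 1]
theta = gradient_descent(X, y)
```

The vectorized gradient `X.T @ (X @ theta - y) / m` computes every $$\frac{\pd}{\pd\theta_j}J$$ in one step, which is what makes the simultaneous update natural in numpy.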

Stochastic gradient descent computes the gradient from a single training example - trading accuracy for speed:

Algorithm (Stochastic):
- Randomly reorder training examples
- Repeat {
for $$i = 1\hspace{2px}\t{:}\hspace{2px}m$$:
$$\theta_j \t{ := } \theta_j - \alpha(h_{\bn{\theta}}(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$$
}
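The loop above can be sketched in numpy as follows (a hedged illustration for linear regression; names are not from the notes):

```python
import numpy as np

def sgd(X, y, alpha=0.1, epochs=100, seed=0):
    """Stochastic gradient descent for linear regression: one example per update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):      # randomly reorder training examples
            err = X[i] @ theta - y[i]     # h_theta(x^(i)) - y^(i)
            theta -= alpha * err * X[i]   # update from this single example
    return theta

X = np.c_[np.ones(100), np.linspace(0, 1, 100)]
y = 1 + 2 * X[:, 1]
theta = sgd(X, y)
```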

Algorithm (Mini-batch):
Suppose $$m=1000$$, $$b=10$$:
- Randomly reorder training examples
- Repeat {
for $$i = 1, 11, 21, ..., 991$$:
$$\ds\theta_j \t{ := } \theta_j - \alpha\frac{1}{10}\sum_{k=i}^{i+9}(h_{\bn{\theta}}(x^{(k)})-y^{(k)})\cdot x_j^{(k)}$$
}
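A numpy sketch of the mini-batch loop, again for linear regression with illustrative names:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, b=10, epochs=300, seed=0):
    """Mini-batch gradient descent: average the gradient over b examples per step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)             # randomly reorder training examples
        for start in range(0, m, b):           # i = 1, 1+b, 1+2b, ... in the notes
            batch = order[start:start + b]
            err = X[batch] @ theta - y[batch]  # h_theta(x^(k)) - y^(k), k in batch
            theta -= alpha * (X[batch].T @ err) / len(batch)
    return theta

X = np.c_[np.ones(100), np.linspace(0, 1, 100)]
y = 1 + 2 * X[:, 1]
theta = minibatch_gd(X, y)
```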

Learning rate: $$\alpha$$ can be reduced over iterations to aid convergence

- Batch GD can be slow for very large datasets $$(m\approx 10^8)$$
- Mini-batch computes the gradient over $$b$$ examples $$\ra$$ faster than batch, more accurate than stochastic
- Stochastic & mini-batch oscillate about the global minimum
- Can check convergence by plotting $$J$$ vs. # iterations

## Dropout

Randomly eliminating neurons to regularize the network:

Algorithm:
[1] Set dropout rate: 1 - keep_prob
[2] Generate dropout vector: D2 = rand(n2,n1)
[3] Drop neurons, scale the rest to preserve expected $$z^{[l]}$$: A2 = A2*(D2 < keep_prob)/keep_prob
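Steps [1]-[3] amount to "inverted dropout"; a minimal numpy sketch, with variable names following the D2/A2 convention above:

```python
import numpy as np

def inverted_dropout(A, keep_prob=0.8, rng=None):
    """Zero out each unit with probability 1 - keep_prob, then rescale the
    survivors by 1/keep_prob so the expected activations (and hence z^[l]
    of the next layer) are preserved."""
    rng = rng or np.random.default_rng(0)
    D = rng.random(A.shape)                  # dropout vector, uniform in [0, 1)
    return A * (D < keep_prob) / keep_prob   # drop, then scale the rest

A2 = np.ones((4, 5))                         # activations of layer 2 (toy values)
A2 = inverted_dropout(A2, keep_prob=0.8)
```

Dividing by `keep_prob` is what keeps $$\mathbb{E}[A2]$$ unchanged, so test-time code can skip dropout entirely.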

- By dropping a portion of neurons from each layer, the remaining neurons are forced to make the best use of available inputs - reducing reliance on other neurons to "make up" for their shortcomings
- Withholding information, to an extent, mimics fitting on the CV set; with each iteration, neurons work with some unfamiliar examples - tuning to work better with a greater range of examples
- Each hidden layer is effectively both a predictor (takes inputs from $$l-1$$) and an input (feeds outputs to $$l+1$$); hence, if a neuron is an excellent predictor but no neuron in the next layer finds it relevant as an input, its weight is diminished

## Adam

The adaptive moment estimation algorithm descends with past gradients accounted for – combining momentum & RMSprop:

Algorithm:
[1] Initialize $$v_{dW^{[l]}} = 0,\ s_{dW^{[l]}} = 0$$ for all $$l$$
[2] On iteration $$t$$, update via

for $$l = 1\hspace{2px}\t{:}\hspace{2px}L$$ {
$$\ds v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \sfrac{\pd J}{\pd W^{[l]}}$$
$$\ds v_{dW^{[l]}}^{\t{corr}} = \frac{v_{dW^{[l]}}}{1 - \beta_1^t}$$
$$\ds s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2)\lrpar{\sfrac{\pd J}{\pd W^{[l]}}}^{\hspace{-4px}2}$$
$$\ds s_{dW^{[l]}}^{\t{corr}} = \frac{s_{dW^{[l]}}}{1 - \beta_2^t}$$
$$\ds W^{[l]} = W^{[l]} - \alpha \frac{v_{dW^{[l]}}^{\t{corr}}}{\sqrt{\smash{s_{dW^{[l]}}^{\t{corr}}}} + \epsilon}$$
}
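The update for one layer can be sketched in numpy as a single Adam step (hyperparameter defaults follow common practice; the quadratic toy objective is an assumption for illustration):

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array W (iteration count t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dW        # momentum: EWMA of gradients
    s = beta2 * s + (1 - beta2) * dW ** 2   # RMSprop: EWMA of squared gradients
    v_corr = v / (1 - beta1 ** t)           # bias correction (v, s start at 0)
    s_corr = s / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v, s

# Minimize J(w) = w^2 (gradient 2w) starting from w = 3
w, v, s = 3.0, 0.0, 0.0
for t in range(1, 1001):
    w, v, s = adam_step(w, 2 * w, v, s, t)
```
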

## Batch Normalization

Normalizing intermediate layer inputs $$z^{[l]}$$ for numerical efficiency, as with input data:

$$\ds z_{\t{norm}}^{[l](i)} = \frac{z^{[l](i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
$$\dsup\widetilde{z}^{[l](i)} = \gamma z_{\t{norm}}^{[l](i)} + \beta$$
$$\ds \mu = \frac{1}{m} \sum_i z^{[l](i)}$$
$$\ds \sigma^2 = \frac{1}{m} \sum_i (z^{[l](i)} - \mu)^2$$

$$\gamma$$ and $$\beta$$ are learnable parameters:
$$\ds \beta^{[l]} = \beta^{[l]} - \alpha d\beta^{[l]},\ \ \gamma^{[l]} = \gamma^{[l]} - \alpha d\gamma^{[l]}$$
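The forward pass of the normalization above, as a minimal numpy sketch (one column per example; names are illustrative):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize z^[l] over the mini-batch (axis=1: one column per example),
    then rescale/shift with the learnable gamma, beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

Z = np.array([[1.0, 2.0, 3.0],      # 2 units, 3 examples (toy values)
              [10.0, 20.0, 30.0]])
Z_tilde = batchnorm_forward(Z, gamma=np.ones((2, 1)), beta=np.zeros((2, 1)))
```

With $$\gamma=1,\ \beta=0$$ each unit's outputs have mean 0 and variance ~1; learning $$\gamma,\beta$$ lets the network choose a different mean/variance when that helps.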

- Speeds up learning
- Increases robustness: deeper layers are less sensitive to weight changes in previous layers
- Forces the mean & variance of $$\widetilde{z}$$ to remain the same - stabilizing features w.r.t. the ground-truth decision boundary
- Adds some noise to $$\widetilde{z}$$, as the mean & variance computed on each mini-batch differ from mini-batch to mini-batch

## Backpropagation

Minimizing the neural network cost function by computing each layer's error from that of the succeeding layer, starting from the output - 'back-propagating' to every layer (but the input). The goal is to compute $$\underset{\Theta}{\t{min}}J(\Theta)$$.

Given training set $$\s{}\{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})\}$$,
[1]: Set $$\s{}\Delta_{i,j}^{(l)}:=0$$ for all $$\s{}(i,j,l)$$
For each training example $$t=1,...,m$$:
[2]: Set $$a^{(1)}:=x^{(t)}$$
[3]: Compute $$a^{(l)}$$ for $$\s{}l=2,3,...,L$$ using forward propagation
[4]: Compute $$\s{}\delta^{(L)}=a^{(L)}-y^{(t)}$$ using $$y^{(t)}$$

[5]: Compute $$\s{}\delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)}$$ using $$\s{}\delta^{(l)}=((\Theta^{(l)})^T\delta^{(l+1)}).*a^{(l)}.*(1-a^{(l)})$$
[6]: $$\s{}\Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$$ - vectorized, $$\s{}\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$$

[7]:
$$\s{}\begin{align} D_{i,j}^{(l)} &:=\sfrac{1}{m}\s{}\lrpar{\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)}} & \hs{-10px}j\neq 0 \\ &:=\sfrac{1}{m}\s{}\Delta_{i,j}^{(l)} & \hs{-10px} j=0 \end{align}$$

[8]: $$D$$ is used as an 'accumulator' to add up costs throughout the iterations - eventually yielding the partial derivative: $$\frac{\pd}{\pd\Theta_{i,j}^{(l)}}\s{}J(\Theta)=D_{i,j}^{(l)}$$

Forward propagation:
Given a training set $$(\bn{x},\bn{y})$$, and $$L=4$$,

$$\begin{align} \bn{a}^{(1)} &= \bn{x} \\ \bn{z}^{(2)} &= \bn{\Theta}^{(1)}\bn{a}^{(1)} \\ \bn{a}^{(2)} &= g(\bn{z}^{(2)}),\quad \t{add } a_0^{(2)} \\ \bn{z}^{(3)} &= \bn{\Theta}^{(2)}\bn{a}^{(2)} \\ \bn{a}^{(3)} &= g(\bn{z}^{(3)}),\quad \t{add }a_0^{(3)} \\ \bn{z}^{(4)} &= \bn{\Theta}^{(3)}\bn{a}^{(3)} \\ \bn{a}^{(4)} &= h_{\bn{\Theta}}(\bn{x}) = g(\bn{z}^{(4)}) \end{align}$$
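Steps [1]-[7] and the forward pass can be sketched together for a 3-layer sigmoid network ($$L=3$$, logistic cost, $$\lambda=0$$; a minimal illustration, not the notes' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Theta1, Theta2):
    """Steps [1]-[7] for a 3-layer network (L = 3), logistic cost, lambda = 0.

    X: (m, n) inputs; Y: (m, K) targets; Theta1/Theta2 include bias columns."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)                # [1] gradient accumulators
    Delta2 = np.zeros_like(Theta2)
    for t in range(m):
        a1 = np.r_[1.0, X[t]]                     # [2] a^(1) = x^(t), plus bias unit
        a2 = np.r_[1.0, sigmoid(Theta1 @ a1)]     # [3] forward propagation
        a3 = sigmoid(Theta2 @ a2)
        d3 = a3 - Y[t]                            # [4] delta^(L) = a^(L) - y^(t)
        d2 = (Theta2.T @ d3)[1:] * a2[1:] * (1 - a2[1:])   # [5] back-propagate
        Delta2 += np.outer(d3, a2)                # [6] Delta += delta^(l+1) (a^(l))^T
        Delta1 += np.outer(d2, a1)
    return Delta1 / m, Delta2 / m                 # [7] D = (1/m) Delta (lambda = 0)
```

Dropping the bias row with `[1:]` in step [5] mirrors the fact that $$\delta$$ is only defined for real (non-bias) units.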

Notes: In [5],
$$\ds\delta_j^{(l)}=\frac{\pd}{\pd z_j^{(l)}}\t{cost}(t)=\lrpar{\sum_i \Theta_{i,j}^{(l)}\delta_i^{(l+1)}}\cdot a_j^{(l)}(1-a_j^{(l)})$$

(The back-propagated error is scaled by the neural weights ($$\Theta$$) and multiplied by the activation's local derivative, yielding the error - the rate of change of cost w.r.t. input - for that neuron)

In [6], omitting proof,

$$a_j^{(l)}\delta_i^{(l+1)}=\frac{\pd}{\pd\Theta_{i,j}^{(l)}}J(\bn{\Theta})$$

## L2-Regularization

Regularizing the NN cost function by summing squared weights – decaying weights toward zero and pushing layers toward linearity: $$J(\bn{\Theta}) =$$

$$\ds \begin{align} & -\sfrac{1}{m}\sum_{i=1}^m\sum_{k=1}^K \lrpar{y_k^{(i)}\t{log}(h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)})\t{log}(1-(h_{\Theta}(x^{(i)}))_k)} \\ & +\sfrac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2 \end{align}$$

- $$L =$$ # of layers; $$s_l =$$ # of units in layer $$l$$ (excluding bias unit)
- $$(h_{\Theta}(x))_i = i^{\t{th}}$$ output; $$h_{\Theta}(x),y\in \mathbb{R}^K$$
- $$K=$$ # of output units/classes
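The penalty term can be computed directly; a sketch with illustrative names (note the bias column $$j=0$$ is excluded, matching the inner sums starting at 1):

```python
import numpy as np

def l2_penalty(Thetas, lam, m):
    """(lambda / 2m) * sum of squared weights over all layers,
    skipping each layer's bias column (j = 0)."""
    return lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)

Theta1 = np.array([[0.5, 1.0, -2.0]])           # the 0.5 bias weight is not penalized
penalty = l2_penalty([Theta1], lam=2.0, m=10)   # (2 / 20) * (1^2 + (-2)^2)
```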

## Minimizing Test Error

Balancing bias against variance to obtain optimal $$J_{\t{test}}$$:

[1]: Create a list of lambdas - e.g. $$\lambda\in\{0,.01,.02,.04,.08,...,10.24\}$$
[2]: Create a set of models w/ different polynomial degrees, hidden layers, neurons per layer, etc.
[3]: Iterate through all the combinations of $$\lambda$$'s & models to learn some $$\bn{\Theta}$$
[4]: Compute $$J_{\t{CV}}$$ using the learned $$\bn{\Theta}$$ (computed w/ $$\lambda$$) without $$\lambda$$
[5]: Select the combination yielding the lowest $$J_{\t{CV}}$$
[6]: Apply the selected $$\bn{\Theta}$$ & $$\lambda$$ on $$J_{\t{test}}$$ to assess success of generalization
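Steps [1]-[5] can be sketched with ridge-style polynomial regression standing in for the full NN setting (a hedged illustration; all names and the toy data are assumptions):

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def select_model(x_tr, y_tr, x_cv, y_cv, degrees, lambdas):
    """Grid-search over (degree, lambda): fit with regularization on the
    training set, then score J_CV *without* lambda, keeping the lowest."""
    best = (None, None, np.inf)
    for d in degrees:
        Xtr, Xcv = poly_features(x_tr, d), poly_features(x_cv, d)
        for lam in lambdas:
            reg = lam * np.eye(d + 1)
            reg[0, 0] = 0.0                                 # don't penalize the bias
            theta = np.linalg.solve(Xtr.T @ Xtr + reg, Xtr.T @ y_tr)
            J_cv = np.mean((Xcv @ theta - y_cv) ** 2) / 2   # unregularized cost
            if J_cv < best[2]:
                best = (d, lam, J_cv)
    return best

x_tr = np.linspace(-1, 1, 21); y_tr = 1 + x_tr ** 2         # noiseless quadratic
x_cv = np.linspace(-0.9, 0.9, 11); y_cv = 1 + x_cv ** 2
d, lam, J_cv = select_model(x_tr, y_tr, x_cv, y_cv, [1, 2, 3], [0.0, 0.01, 1.0])
```

On this toy data the search correctly rejects the degree-1 (high-bias) model.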
| Fixing High Bias | Fixing High Variance |
| --- | --- |
| Add features | Remove features |
| Decrease $$\lambda$$ | Increase $$\lambda$$ |
|  | Add training examples |