# Dragon Notes


# Machine Learning: Artificial Neural Networks

ANN Base Model

An artificial neural network (ANN) is a set of artificial neurons processing inputs via intermediate layers to yield an output.

- An artificial neuron (AN) is a node in an ANN that transforms its input via a mathematical function, then passes the result to other ANs
- Each layer includes a bias unit, $$a_0$$, analogous to the intercept term in logistic regression; bias units are often omitted from diagrams

ANN Base Model (II)

ANN model representation may utilize logistic regression functions for ANs:

\ds \begin{align} a_1^{(2)} &= g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \\ a_2^{(2)} &= g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \\ a_3^{(2)} &= g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \\ h_{\Theta}(x) = a_1^{(3)} &= g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \\ \end{align}
- Bias unit is added to layer $$j$$ after implementing $$\bb{1}$$
- In the last step, between layer $$\s{}j$$ and $$\s{}(j+1)$$, the final theta matrix $$\bn{\Theta}^{(j)}$$ has only one row, which is multiplied by the single column $$\bn{a}^{(j)}$$ to yield a scalar
- The output is passed to layer $$\s{}(j+1)$$ as $$h_{\bn{\Theta}}(\bn{x}) = \bn{a}^{(j+1)} = g(\bn{z}^{(j+1)})$$
- Vectorization: see $$\bb{1}$$, for an ANN w/ $$j$$ layers & $$n$$ inputs, where $$k=$$ node index
- $$\t{dim}(\bn{\Theta}) = s_{j+1}\s{\times}\ns{}(s_j+1)$$, to account for the bias unit in layer $$j$$
- $$\bn{\Theta}$$ rows $$\ra \s{}(j+1)$$'s activ. units, columns $$\ra$$ prev. layer units w/ bias (i.e. $$\s{}j$$)
\ds \begin{align} a_1^{(j)} &= g(z_1^{(j)}) \\ a_2^{(j)} &= g(z_2^{(j)}) \\ & \vdots \\ a_n^{(j)} &= g(z_n^{(j)}) \end{align} \quad \quad \bn{z}^{(j)} = \bn{\Theta}^{(j-1)}\bn{a}^{(j-1)}\ \bb{1}
$$\ds z_k^{(j)} = \Theta_{k,0}^{(j-1)}a_0^{(j-1)} + \Theta_{k,1}^{(j-1)}a_1^{(j-1)} + ... + \Theta_{k,n}^{(j-1)}a_n^{(j-1)}$$
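The vectorized layer step $$\bn{z}^{(j)} = \bn{\Theta}^{(j-1)}\bn{a}^{(j-1)}$$, $$\bn{a}^{(j)} = g(\bn{z}^{(j)})$$ can be sketched in plain Python (a minimal sketch; the layer sizes and weight values below are hypothetical, not from the notes):

```python
import math

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(theta, a_prev):
    # theta: s_{j+1} x (s_j + 1) matrix (list of rows); a_prev: layer (j-1) activations
    a_biased = [1.0] + a_prev                      # prepend bias unit a_0 = 1
    z = [sum(t * a for t, a in zip(row, a_biased)) for row in theta]
    return [sigmoid(zk) for zk in z]

# hypothetical layer: 3 inputs feeding 2 units, so Theta is 2 x (3 + 1)
theta1 = [[0.1, 0.2, -0.3, 0.5],
          [-0.4, 0.6, 0.1, 0.2]]
x = [1.0, 0.5, -1.0]
a2 = layer_forward(theta1, x)   # activations of layer 2, each in (0, 1)
```

Note that the theta matrix's column count is $$s_j + 1$$, matching the bias-augmented activation vector, as in the dimension rule above.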

Cost Function

The cost function for a regularized logistic regression neural network classifier is

\ds \begin{align} J(\Theta) = & -\sfrac{1}{m}\sum_{i=1}^m\sum_{k=1}^K \lrpar{y_k^{(i)}\t{log}(h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)})\t{log}(1-(h_{\Theta}(x^{(i)}))_k)} \\ & +\sfrac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2 \end{align}

- $$L =$$ # of layers; $$s_l =$$ # of units in layer $$l$$ (excluding bias unit)
- $$(h_{\Theta}(x))_i = i^{\t{th}}$$ output; $$h_{\Theta}(x),y\in \mathbb{R}^K$$
- $$K=$$ # of output units/classes
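The double sum over examples and output units, plus the regularization triple sum (which skips each row's bias weight), can be sketched as follows (a minimal sketch with hypothetical predictions and weights; real use would compute `H` via forward propagation):

```python
import math

def nn_cost(H, Y, thetas, lam):
    # H: list of m prediction vectors h_Theta(x^(i)) in R^K
    # Y: list of m one-hot label vectors in R^K
    # thetas: list of Theta matrices; each row's index-0 entry is the bias weight
    m = len(H)
    # cross-entropy term, summed over examples i and output units k
    J = -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
             for h_vec, y_vec in zip(H, Y)
             for h, y in zip(h_vec, y_vec)) / m
    # regularization term: squared weights, excluding bias entries (j = 0)
    J += lam / (2 * m) * sum(w * w
                             for theta in thetas
                             for row in theta
                             for w in row[1:])
    return J

H = [[0.9, 0.1], [0.2, 0.8]]        # hypothetical predictions, m = 2, K = 2
Y = [[1, 0], [0, 1]]
thetas = [[[0.5, 1.0, -1.0]]]       # one hypothetical 1 x 3 Theta matrix
J = nn_cost(H, Y, thetas, lam=1.0)
```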

Backpropagation Algorithm

Backpropagation minimizes the neural network cost function by computing each layer's error from that of the succeeding layer, starting from the output layer - 'back-propagating' the error to every layer except the input. The goal is to compute $$\underset{\Theta}{\t{min}}J(\Theta)$$.

Given training set $$\s{}\{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})\}$$,
[1]: Set $$\s{}\Delta_{i,j}^{(l)}:=0$$ for all $$\s{}(i,j,l)$$
[2]: For each training example $$\s{}t=1,...,m$$: set $$a^{(1)}:=x^{(t)}$$
[3]: Compute $$a^{(l)}$$ for $$\s{}l=2,3,...,L$$ using forward propagation
[4]: Compute $$\s{}\delta^{(L)}=a^{(L)}-y^{(t)}$$ using $$y^{(t)}$$

[5]: Compute $$\s{}\delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)}$$ using $$\s{}\delta^{(l)}=((\Theta^{(l)})^T\delta^{(l+1)}).*a^{(l)}.*(1-a^{(l)})$$
[6]: $$\s{}\Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$$ - vectorized, $$\s{}\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$$

[7]:
\s{}\begin{align} D_{i,j}^{(l)} &:=\sfrac{1}{m}\s{}\lrpar{\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)}} & \hs{-10px}j\neq 0 \\ &:=\sfrac{1}{m}\s{}\Delta_{i,j}^{(l)} & \hs{-10px} j=0 \end{align}

[8]: $$D$$ is used as an 'accumulator' to add up costs throughout iterations - eventually computing the partial derivative: $$\frac{\pd}{\pd\Theta_{i,j}^{(l)}}\s{}J(\Theta)=D_{i,j}^{(l)}$$
Forward propagation:
Given a training set $$(\bn{x},\bn{y})$$, and $$L=4$$,

\begin{align} \bn{a}^{(1)} &= \bn{x} \\ \bn{z}^{(2)} &= \bn{\Theta}^{(1)}\bn{a}^{(1)} \\ \bn{a}^{(2)} &= g(\bn{z}^{(2)}),\quad \t{add } a_0^{(2)} \\ \bn{z}^{(3)} &= \bn{\Theta}^{(2)}\bn{a}^{(2)} \\ \bn{a}^{(3)} &= g(\bn{z}^{(3)}),\quad \t{add }a_0^{(3)} \\ \bn{z}^{(4)} &= \bn{\Theta}^{(3)}\bn{a}^{(3)} \\ \bn{a}^{(4)} &= h_{\bn{\Theta}}(\bn{x}) = g(\bn{z}^{(4)}) \end{align}
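The forward pass above can be sketched as a loop over the theta matrices (a minimal sketch; the network sizes and weights are hypothetical, chosen so $$L=4$$ as in the equations):

```python
import math

def g(z):
    # logistic activation
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    # thetas: [Theta1, Theta2, Theta3] for an L = 4 network
    a = x
    for theta in thetas:
        a = [1.0] + a                              # add bias unit a_0
        a = [g(sum(t * v for t, v in zip(row, a))) for row in theta]
    return a                                       # a^(4) = h_Theta(x)

# hypothetical sizes: 2 inputs -> 2 units -> 2 units -> 1 output
thetas = [
    [[0.1, 0.4, -0.2], [0.3, -0.5, 0.6]],  # Theta1: 2 x 3
    [[-0.3, 0.2, 0.7], [0.5, -0.1, 0.4]],  # Theta2: 2 x 3
    [[0.2, 0.6, -0.4]],                    # Theta3: 1 x 3
]
h = forward(thetas, [1.0, -1.0])
```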

Notes: In [5],
$$\delta_j^{(l)}=\lrpar{\sum_i\Theta_{i,j}^{(l)}\delta_i^{(l+1)}}g'(z_j^{(l)})$$

(The backpropagated error is scaled by the neural weights (Theta) and multiplied by the rate of change of the activation w.r.t. its input, $$g'(z_j^{(l)})=a_j^{(l)}(1-a_j^{(l)})$$, to yield the error for that neuron)

In [6], omitting proof,

$$a_j^{(l)}\delta_i^{(l+1)}=\frac{\pd}{\pd\Theta_{i,j}^{(l)}}J(\bn{\Theta})$$
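Steps [2]-[6] for a single training example can be sketched as below (a minimal sketch on a hypothetical 2-2-1 network, $$L=3$$; `cost_one` is a helper added here only so the gradient can be checked numerically):

```python
import math

def g(z):
    # logistic activation
    return 1.0 / (1.0 + math.exp(-z))

def cost_one(thetas, x, y):
    # cross-entropy cost of a single training example (x, y)
    a = x
    for theta in thetas:
        a = [1.0] + a
        a = [g(sum(t * v for t, v in zip(row, a))) for row in theta]
    return -sum(yi * math.log(ai) + (1 - yi) * math.log(1 - ai)
                for ai, yi in zip(a, y))

def backprop_one(thetas, x, y):
    # forward pass, storing each layer's bias-augmented activation a^(l)
    acts = [[1.0] + x]
    for theta in thetas:
        a = [g(sum(t * v for t, v in zip(row, acts[-1]))) for row in theta]
        acts.append([1.0] + a)
    # step [4]: delta^(L) = a^(L) - y  (bias entry dropped)
    deltas = [ai - yi for ai, yi in zip(acts[-1][1:], y)]
    grads = []
    for l in range(len(thetas) - 1, -1, -1):
        # step [6], vectorized: Delta^(l) = delta^(l+1) (a^(l))^T
        grads.insert(0, [[d * a for a in acts[l]] for d in deltas])
        if l > 0:
            # step [5]: delta^(l) = ((Theta^(l))^T delta^(l+1)) .* a^(l) .* (1 - a^(l))
            back = [sum(thetas[l][i][j] * deltas[i] for i in range(len(deltas)))
                    for j in range(len(acts[l]))]
            deltas = [b * a * (1 - a) for b, a in zip(back[1:], acts[l][1:])]
    return grads  # grads[l][i][j] = dJ/dTheta^(l+1)_{i,j} for this example

# hypothetical network and one training example
thetas = [[[0.1, 0.4, -0.2], [0.3, -0.5, 0.6]],   # Theta1: 2 x 3
          [[0.2, 0.6, -0.4]]]                     # Theta2: 1 x 3
x, y = [1.0, -1.0], [1.0]
grads = backprop_one(thetas, x, y)
```

Averaging `grads` over all $$m$$ examples (and adding the $$\lambda\Theta$$ term for $$j\neq 0$$) yields $$D$$ as in step [7].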

Random Initialization

Breaking symmetry arising from setting all thetas to zero by initializing them to some small random values:

```matlab
Theta1 = INIT_EPSILON*(2*rand(10,11)-1);
Theta2 = INIT_EPSILON*(2*rand(10,11)-1);
Theta3 = INIT_EPSILON*(2*rand(1,11)-1);
```


Initializes each $$\Theta_{i,j}^{(l)}$$ to a random value in $$[-\epsilon,\epsilon]$$, where $$\epsilon$$ is a small constant - assuming dims of Theta1, Theta2, Theta3 $$=$$ 10x11, 10x11, and 1x11, respectively.
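The same initialization can be sketched in Python (a minimal sketch; the value $$\epsilon=0.12$$ is an assumed choice, not from the notes):

```python
import random

def rand_init(rows, cols, eps=0.12):
    # each entry uniform in [-eps, eps]; eps = 0.12 is a commonly used small value
    return [[(2 * random.random() - 1) * eps for _ in range(cols)]
            for _ in range(rows)]

# same shapes as the Octave snippet above
Theta1 = rand_init(10, 11)
Theta2 = rand_init(10, 11)
Theta3 = rand_init(1, 11)
```

Because each weight is drawn independently, no two units start with identical parameters, so their gradients differ and symmetry is broken.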