# Dragon Notes


# Machine Learning: Optimization Methods

Methods Summary

[1] Feature scaling, $$x_i := x_i / s_i$$. (scaling input features to a similar range, by dividing input values by input range)
[2] Mean normalization, $$x_i := (x_i-\mu_i)/s_i$$. (shifting input features to a zero mean, and scaling, by subtracting input mean & dividing by input range or std. dev)
[3] Vectorization - optimizing computing efficiency by writing code that utilizes built-in optimized algorithms
[4] Regularization - increasing algorithm generality to unfamiliar data by minimizing overfitting
[5] Advanced libraries - algorithm libraries can be imported to apply advanced optimization methods to solve problems
[6] Test error minimization - balancing bias against variance to obtain optimal $$J_{\t{test}}$$
[7] Stochastic/mini-batch GD - computing the gradient from a single/few training example(s) - trading speed for accuracy
[8] Map-reduce - distributing big data workload among multiple computers / cores
[1, 2]: $$\theta$$ descends more quickly on small ranges than on large ones, and oscillates inefficiently toward the optimum if variables are scaled very unevenly
[6]: - Diagnosing the hypothesis for bias & variance
- Varying model & regularization parameters
- Testing the hypothesis on cross-validation & test sets
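Methods [1] and [2] can be sketched in a few lines of Python; the sample values are illustrative:

```python
# Mean normalization [2]: shift each feature to zero mean, then divide
# by its range (max - min); the standard deviation also works as s_i.
def mean_normalize(values):
    mu = sum(values) / len(values)
    s = max(values) - min(values)  # input range
    return [(x - mu) / s for x in values]

sizes = [2104, 1416, 1534, 852]    # e.g. house sizes in sq. ft.
scaled = mean_normalize(sizes)
# scaled values now lie within roughly [-1, 1] and average to zero
```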

Gradient descent (GD) is a method for locating function optima by iteratively stepping along the negative gradient. For a single feature,

$$\ds [1]\quad \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\theta_0,\theta_1)$$
$$(\t{for }j=0\t{ and }j=1)$$
\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\theta_0,\theta_1) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\theta_0,\theta_1) \end{align}

$$\alpha =$$ learning rate (step-size)
$$\theta_j =$$ model parameter

$$\alpha$$ too great $$\ra$$ may overshoot the minimum and diverge
$$\alpha$$ too small $$\ra$$ may be too slow (require many steps)

Correct (simultaneous update):

$$\ds \t{temp0} := \theta_0 - \alpha \sfrac{\pd}{\pd\theta_0}J(\theta_0,\theta_1)$$
$$\ds \t{temp1} := \theta_1 - \alpha \sfrac{\pd}{\pd\theta_1}J(\theta_0,\theta_1)$$
$$\theta_0 := \t{temp0}$$
$$\theta_1 := \t{temp1}$$

Incorrect (sequential update):

$$\ds \t{temp0} := \theta_0 - \alpha \sfrac{\pd}{\pd\theta_0}J(\theta_0,\theta_1)$$
$$\theta_0 := \t{temp0}$$
$$\ds \t{temp1} := \theta_1 - \alpha \sfrac{\pd}{\pd\theta_1}J(\theta_0,\theta_1)$$
$$\theta_1 := \t{temp1}$$

Descent Algorithm (see left)
• $$[1]$$ is repeated until convergence. At each iteration, the parameters $$\theta_0, \theta_1$$ must be updated simultaneously
• Updating one parameter before computing the other within the same iteration yields a wrong implementation

The general gradient descent update works for any # of features and is given by

$$\ds \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\bn{\theta})$$
$$(\t{for }j=0,1,...,n)$$
\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\bn{\theta}) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\bn{\theta}) \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{\pd}{\pd \theta_n}J(\bn{\theta}) \\ \end{align}

- repeated until convergence. $$\bn{\theta} =$$ model parameter vector, $$(n+1)\times 1$$
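The update above can be sketched for a single feature in Python; the data, $$\alpha$$, and iteration count are illustrative, with values chosen so that $$y = 2x$$ fits exactly. Note the simultaneous update via temporaries:

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x,
# J = (1/2m) * sum((h(x_i) - y_i)^2); true fit is theta0 = 0, theta1 = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
m = len(xs)
theta0, theta1 = 0.0, 0.0
alpha = 0.05

for _ in range(2000):
    errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # simultaneous update: both gradients use the same (old) parameters
    temp0 = theta0 - alpha * sum(errs) / m
    temp1 = theta1 - alpha * sum(e * x for e, x in zip(errs, xs)) / m
    theta0, theta1 = temp0, temp1
# theta0 ≈ 0, theta1 ≈ 2
```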

Descent with gradient computed from a single training example - trading speed for accuracy:

Algorithm (Stochastic):
• Randomly reorder training examples
• Repeat { for $$i = 1\t{:}m$$
$$\quad\theta_j := \theta_j - \alpha(h_{\bn{\theta}}(x^{(i)})-y^{(i)})\cdot x_j^{(i)}$$ }

Algorithm (Mini-batch):
• Suppose $$m=1000$$, $$b=10$$:
• Randomly reorder training examples
• Repeat { for $$i = 1, 11, 21, ..., 991$$
$$\quad\ds\theta_j := \theta_j - \alpha\frac{1}{10}\sum_{k=i}^{i+9}(h_{\bn{\theta}}(x^{(k)})-y^{(k)})\cdot x_j^{(k)}$$ }

Learning rate: can reduce $$\alpha$$ over iterations to aid convergence

• Batch GD can be slow for very large datasets $$(m\approx 10^8)$$
• Mini-batch computes the gradient over $$b$$ examples
$$\quad\ra$$ faster than batch, more accurate than stochastic
• Stochastic & mini-batch GD oscillate about the global minimum
• Convergence can be checked by plotting $$J$$ vs. # iterations
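A mini-batch sketch in Python, assuming a one-parameter model $$h(x)=\theta x$$ and illustrative data with true slope 3:

```python
import random

# Mini-batch gradient descent for h(x) = theta * x, batch size b.
random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 101)]
random.shuffle(data)                # step 1: randomly reorder examples

theta, alpha, b = 0.0, 0.0001, 10
for _ in range(50):                 # repeat { ... }
    for i in range(0, len(data), b):
        batch = data[i:i + b]
        # average gradient over the b examples of this batch
        grad = sum((theta * x - y) * x for x, y in batch) / b
        theta -= alpha * grad
# theta ≈ 3, the true slope
```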

Regularization

An optimization method aiming to generalize an algorithm by minimizing overfitting
(1): Adding a regularization term to penalize all features, hence reducing the weight of unnecessary ones
(2): Adding a regularization term to penalize features known to be less important

[1]: The $$\theta_0$$ term isn't regularized by convention ($$\theta_0$$ is updated as usual)
[2]: $$\frac{\lambda}{m}\theta_j$$ serves as the regularization term
[3]: [2] can be rewritten as [3]; the second term is the usual update term, while the first term now scales $$\theta_j$$ by a quantity $$<1$$
[4]: Regularized logistic regression cost function
\ds \begin{align} & \theta_0 := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_0 \quad \bb{1}\\ & \theta_j := \theta_j -\alpha\lrbra{\lrpar{\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j}+\sfrac{\lambda}{m}\theta_j} \ \ \bb{2} \\ & \theta_j := \theta_j\lrpar{1-\alpha\sfrac{\lambda}{m}}-\sfrac{\alpha}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j \quad \bb{3} \end{align}
(above apply to both linear & logistic regression)

\ds \begin{align} & J(\theta) = -\sfrac{1}{m}\sum_{i=1}^{m}[y^{(i)}\t{log}(h_{\theta}(x^{(i)}))\ + \\ & \hspace{56px}+(1-y^{(i)})\t{log}(1-h_{\theta}(x^{i}))] + \sfrac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^2 \quad \bb{4} \end{align}
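The shrinkage form [3] can be sketched in Python; the values of $$m$$, $$\alpha$$, $$\lambda$$ and the function name are illustrative:

```python
# Regularized gradient-descent update, form [3]: theta_j is first
# scaled by (1 - alpha*lambda/m) < 1, then moved along the usual
# gradient term; theta_0 is exempt by convention.
m, alpha, lam = 100, 0.1, 10.0
shrink = 1 - alpha * lam / m        # = 0.99 with these values

def regularized_step(theta_j, grad_j, j):
    if j == 0:                      # theta_0: no regularization
        return theta_j - alpha * grad_j
    return theta_j * shrink - alpha * grad_j

# even with zero gradient, a regularized parameter decays toward 0
t = regularized_step(1.0, 0.0, j=1)
```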

Vectorization: Linear Regression

Optimizing computing efficiency by writing code that utilizes built-in optimized algorithms. For linear regression, the vectorized implementation is:

$$\ds \bn{\theta} := \bn{\theta} - \alpha \bn{\delta}, \t{ where }\bn{\delta} = \sfrac{1}{m}\sum_{i=1}^{m}(h_{\bn{\theta}}(\bn{x}^{(i)})-\bn{y}^{(i)}) \bn{x}^{(i)},$$
$$\ds \bn{x}^{(i)}=\lrbra{\nmtttx{x^{(i)}_0}{x^{(i)}_1}{\vdots} {x^{(i)}_n}},$$$$\ds\ \ \bn{\theta}=\lrbra{\nmtttx{\theta_0}{\theta_1}{\vdots}{\theta_n}}$$
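A NumPy sketch of the vectorized step against an explicit per-example loop (data illustrative):

```python
import numpy as np

# Vectorized gradient: delta = (1/m) * X^T (X theta - y),
# replacing a loop over training examples with one matrix product.
m, n = 4, 2
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)  # x_0 = 1 column
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = np.zeros(n)

delta = X.T @ (X @ theta - y) / m     # one line, no per-example loop

# loop version for comparison; same result, much slower at scale
delta_loop = np.zeros(n)
for i in range(m):
    delta_loop += (X[i] @ theta - y[i]) * X[i]
delta_loop /= m
```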

Gradient Checking

Provides a check on bugs and other faults, assuring proper functionality of backpropagation - implemented by comparing the computed gradient with its numerical approximation:

\ds \begin{align} & \frac{\pd}{\pd\Theta_j}J(\bn{\Theta})\approx\frac{J(\Theta_1,...,\Theta_j+\epsilon,...,\Theta_n)-J(\Theta_1,..., \Theta_j-\epsilon,\Theta_n)}{2\epsilon}, \\ & \t{where }\epsilon=\t{small }(\t{e.g. }10^{-4})\end{align}

- Once $$\s{}\t{gradApprox}\approx\t{deltaVector}$$ is verified, the verification algorithm should be turned off, as it is computationally expensive
EPSILON = 1e-4;
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) += EPSILON;
  thetaMinus = theta;
  thetaMinus(i) -= EPSILON;
  % two-sided difference approximation of the i-th partial derivative
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end
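The same check in Python for a toy cost $$J(\theta)=\sum_i \theta_i^2$$, whose analytic gradient is $$2\theta$$:

```python
# Numerical gradient check: compare the analytic gradient with the
# two-sided difference (J(theta + e) - J(theta - e)) / (2e).
EPSILON = 1e-4

def J(theta):
    return sum(t * t for t in theta)

theta = [1.0, -2.0, 3.0]
grad = [2 * t for t in theta]          # analytic gradient

grad_approx = []
for i in range(len(theta)):
    theta_plus = theta[:];  theta_plus[i] += EPSILON
    theta_minus = theta[:]; theta_minus[i] -= EPSILON
    grad_approx.append((J(theta_plus) - J(theta_minus)) / (2 * EPSILON))
# grad_approx ≈ grad componentwise
```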


Minimizing Test Error

Balancing bias against variance to obtain optimal $$J_{\t{test}}$$:

• [1]: Create a list of lambdas - e.g. $$\lambda\in\{0,0.01,0.02,0.04,0.08,...,10.24\}$$
• [2]: Create a set of models w/ different polynomial degrees, hidden layers, neurons per layer, etc.
• [3]: Iterate through all the combinations of $$\lambda$$'s & models to learn some $$\bn{\Theta}$$
• [4]: Compute $$J_{\t{CV}}$$ with the learned $$\bn{\Theta}$$ (trained with $$\lambda$$), but with the regularization term omitted (i.e. $$\lambda=0$$)
• [5]: Select combination yielding the lowest $$J_{\t{CV}}$$
• [6]: Apply the selected $$\bn{\Theta}$$ & $$\lambda$$ on $$J_{\t{test}}$$ to assess success of generalization
| Fixing High Bias | Fixing High Variance |
| --- | --- |
| Add features | Remove features |
| Decrease $$\lambda$$ | Increase $$\lambda$$ |
| | Add training examples |
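Steps [1]-[5] can be sketched in Python with regularized linear regression solved in closed form; the data, helper names, and $$\lambda$$ list are illustrative:

```python
import numpy as np

# For each lambda, fit on the training set (regularized normal
# equation), then pick the lambda with the lowest UNregularized
# cost on the cross-validation set.
rng = np.random.default_rng(0)
X_train = np.column_stack([np.ones(20), rng.normal(size=20)])
y_train = 2 * X_train[:, 1] + rng.normal(scale=0.1, size=20)
X_cv = np.column_stack([np.ones(10), rng.normal(size=10)])
y_cv = 2 * X_cv[:, 1] + rng.normal(scale=0.1, size=10)

def fit(X, y, lam):
    L = np.eye(X.shape[1]); L[0, 0] = 0   # theta_0 not penalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def J(X, y, theta):                        # cost without the lambda term
    return np.mean((X @ theta - y) ** 2) / 2

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 10.24]
best_lam = min(lambdas, key=lambda lam: J(X_cv, y_cv, fit(X_train, y_train, lam)))
```

With a true linear model and little noise, the heavily regularized fit ($$\lambda=10.24$$) underfits and loses on $$J_{\t{CV}}$$.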
Unrolling Parameters

Advanced optimization libraries may expect vector input arguments; hence, to utilize in neural networks, parameter matrices can be 'unrolled' into vectors:

Given matrices $$\bn{\Theta^{(1)}},\bn{\Theta^{(2)}},\bn{\Theta^{(3)}}$$, and $$\bn{D^{(1)}},\bn{D^{(2)}},\bn{D^{(3)}}$$, we can unroll all the elements into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:)]
deltaVector = [ D1(:); D2(:); D3(:)]


If the dimensions of Theta1, Theta2, Theta3 are 10x11, 10x11, and 1x11, respectively, the original matrices can be retrieved as:
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
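The same round-trip in Python (NumPy's default row-major order stands in for Octave's column-major `(:)`; the round-trip holds as long as unrolling and reshaping use the same order):

```python
import numpy as np

# Unroll three parameter matrices into one vector, then recover them,
# mirroring the Octave snippet above (dims 10x11, 10x11, 1x11).
Theta1 = np.arange(110, dtype=float).reshape(10, 11)
Theta2 = np.ones((10, 11))
Theta3 = np.zeros((1, 11))

theta_vector = np.concatenate([Theta1.ravel(), Theta2.ravel(), Theta3.ravel()])

T1 = theta_vector[:110].reshape(10, 11)
T2 = theta_vector[110:220].reshape(10, 11)
T3 = theta_vector[220:231].reshape(1, 11)
```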


Algorithm libraries can be imported to apply advanced optimization methods to solve problems.

\ds \begin{align} & \t{Let } \\ & J(\theta)= (\theta_1 - 5)^2 \\ & \hspace{37px} + (\theta_2 -5)^2 \\ & \t{Then,} \\ & \sfrac{\pd}{\pd\theta_1} J(\theta) = 2(\theta_1 -5) \\ & \sfrac{\pd}{\pd\theta_2}J(\theta)=2(\theta_2 -5) \end{align}
function [jVal,grad] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  grad = zeros(2,1);
  grad(1) = 2*(theta(1)-5);
  grad(2) = 2*(theta(2)-5);
end

options = optimset('GradObj','on','MaxIter',100);
initTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = ...
fminunc(@costFunction,initTheta,options);
% see help fminunc, options, optimset
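A Python sketch of the same problem, using plain gradient descent in place of fminunc (scipy.optimize.minimize would be the closer library analogue); the step size is illustrative:

```python
# Same J and gradient as the Octave example:
# J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2, minimized at (5, 5).
def cost_function(theta):
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    grad = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, grad

theta = [0.0, 0.0]                  # initTheta = zeros(2,1)
for _ in range(100):                # 'MaxIter', 100
    _, grad = cost_function(theta)
    theta = [t - 0.1 * g for t, g in zip(theta, grad)]
# theta ≈ [5, 5], the minimizer
```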

Big Data Handling

Large datasets require appropriate methods for efficiently structuring a machine learning project

• Should I train on full dataset? / Will my algorithm scale well?
$$\ra$$ Plot learning curves for smaller $$m$$, check for bias/variance
• Computer can't handle full dataset
$$\ra$$ Map-reduce: split workload among computers or CPU cores
• How do I get more data? (example: character recognition)
Artificial data synthesis:
$$\ra$$ compile examples likely to occur in the test set (e.g. a fonts library)
$$\ra$$ add noise/distortion likely to occur in the test set to existing data (e.g. blur, backgrounds)
Labeling personally: estimate the time required to grow the dataset tenfold
Labeling service: labeling can often be outsourced inexpensively
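A map-reduce sketch in Python: the batch-gradient sum splits across four simulated workers and the partial sums are added, matching the single-machine result (model and data are illustrative):

```python
# Map-reduce idea: the sum in the batch gradient is distributive, so
# each machine/core computes a partial sum over its shard of the data.
data = [(float(x), 3.0 * x) for x in range(1, 401)]
theta = 1.0

def partial_grad(shard, theta):
    # "map": gradient contribution of one shard for h(x) = theta * x
    return sum((theta * x - y) * x for x, y in shard)

shards = [data[i::4] for i in range(4)]              # split among 4 workers
total = sum(partial_grad(s, theta) for s in shards)  # "reduce": add partials
grad = total / len(data)

# identical to computing the whole sum on one machine
grad_single = partial_grad(data, theta) / len(data)
```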