# Dragon Notes


# Machine Learning: Optimization Methods

Optimization Methods

[1] Feature scaling, $$x_i := x_i / s_i$$. (scaling input features to a similar range, by dividing input values by input range)
[2] Mean normalization, $$x_i := (x_i-\mu_i)/s_i$$. (shifting input features to a zero mean, and scaling, by subtracting input mean & dividing by input range or std. dev)
[3] Vectorization - optimizing computing efficiency by writing code that utilizes built-in optimized algorithms
[4] Regularization - increasing algorithm generality to unfamiliar data by minimizing overfitting
[5] Advanced libraries - algorithm libraries can be imported to apply advanced optimization methods to solve problems
[6] Test error minimization - balancing bias against variance to obtain optimal $$J_{\t{test}}$$
[1, 2]: $$\theta$$ descends more quickly on small ranges than on large ones, and oscillates inefficiently on its way down to the optimum if the variables are scaled very unevenly
[6]: - Diagnosing the hypothesis for bias & variance
- Varying model & regularization parameters
- Testing the hypothesis on cross-validation & test sets
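Methods [1] and [2] can be sketched in a few lines of NumPy (the data and variable names here are illustrative; the notes' own snippets use MATLAB):

```python
import numpy as np

# Feature matrix: rows are examples, columns are features (illustrative data)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

# [1] Feature scaling: divide each feature by its range s_i
s = X.max(axis=0) - X.min(axis=0)
X_scaled = X / s

# [2] Mean normalization: subtract the mean mu_i, then divide by the range
# (the standard deviation may be used in place of the range)
mu = X.mean(axis=0)
X_norm = (X - mu) / s
```

After mean normalization each column of `X_norm` has zero mean, so gradient descent no longer sees wildly different scales across features.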

Regularization

An optimization method aiming to generalize an algorithm by minimizing overfitting
(1): Adding a regularization term that penalizes all features, thereby reducing the weight of the unnecessary ones
(2): Adding a regularization term that penalizes features known to be less important

[1]: The $$\theta_0$$ term isn't regularized by convention ($$\theta_0$$ is updated as usual)
[2]: $$\frac{\lambda}{m}\theta_j$$ serves as the regularization term
[3]: [2] can be rewritten as [3]; the second term is the usual gradient step, while the first term now multiplies $$\theta_j$$ by a quantity less than $$1$$, shrinking it on every iteration
[4]: Regularized logistic regression cost function
\ds \begin{align} & \theta_0 := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_0 \quad \bb{1}\\ & \theta_j := \theta_j -\alpha\lrbra{\lrpar{\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j}+\sfrac{\lambda}{m}\theta_j} \ \ \bb{2} \\ & \theta_j := \theta_j\lrpar{1-\alpha\sfrac{\lambda}{m}}-\sfrac{\alpha}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j \quad \bb{3} \end{align}
(above apply to both linear & logistic regression)

\ds \begin{align} & J(\theta) = -\sfrac{1}{m}\sum_{i=1}^{m}[y^{(i)}\t{log}(h_{\theta}(x^{(i)}))\ + \\ & \hspace{56px}+(1-y^{(i)})\t{log}(1-h_{\theta}(x^{(i)}))] + \sfrac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^2 \quad \bb{4} \end{align}
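Cost [4] and update [3] can be sketched in NumPy for logistic regression (the data, learning rate, and $$\lambda$$ below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    # Regularized logistic cost [4]; theta_0 is excluded from the penalty
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(1/m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))
    return J + (lam/(2*m)) * np.sum(theta[1:]**2)

def gradient_step(theta, X, y, alpha, lam):
    # Form [3]: shrink theta_j by (1 - alpha*lam/m), then take the usual
    # gradient step; theta_0 is updated without regularization, as in [1]
    m = len(y)
    grad = (1/m) * X.T @ (sigmoid(X @ theta) - y)
    shrink = np.full_like(theta, 1 - alpha*lam/m)
    shrink[0] = 1.0                      # no shrinkage for theta_0
    return theta * shrink - alpha * grad

# Illustrative data: x_0 = 1 bias column plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.5, lam=0.01)
```

The shrink factor keeps $$\theta_j$$ small while the usual gradient term fits the data; with $$\lambda>0$$ the parameters settle at a finite equilibrium even on separable data.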

Vectorization: Linear Regression

Optimizing computing efficiency by writing code that utilizes built-in optimized algorithms. For linear regression, the vectorized implementation is:

$$\ds \bn{\theta} := \bn{\theta} - \alpha \bn{\delta}, \t{ where}$$
$$\ds \bn{\delta} = \sfrac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\bn{x}^{(i)},\ \ \bn{x}^{(i)}=\lrbra{\nmtttx{x^{(i)}_0}{x^{(i)}_1}{\vdots} {x^{(i)}_n}},$$$$\ds\ \ \bn{\theta}=\lrbra{\nmtttx{\theta_0}{\theta_1}{\vdots}{\theta_n}}$$
The parameters are updated simultaneously.
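The vectorized update above, sketched in NumPy (training data and learning rate are illustrative):

```python
import numpy as np

# Illustrative data for h_theta(x) = theta_0 + theta_1*x, true line y = 2x + 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # x_0 = 1
y = np.array([1.0, 3.0, 5.0, 7.0])
m = len(y)

theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    delta = (1/m) * X.T @ (X @ theta - y)  # delta = (1/m) sum (h - y) x^(i)
    theta = theta - alpha * delta          # simultaneous update of all parameters
```

A single matrix-vector product replaces the per-feature loop, and every $$\theta_j$$ is updated from the same `delta`, so the update is simultaneous by construction.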

Advanced Optimization Libraries

Algorithm libraries can be imported to apply advanced optimization methods to solve problems.

\ds \begin{align} \t{Let } & J(\theta)=(\theta_1 - 5)^2 + (\theta_2 -5)^2. \t{ Then,} \\ & \sfrac{\pd}{\pd\theta_1} J(\theta) = 2(\theta_1 -5) \\ & \sfrac{\pd}{\pd\theta_2}J(\theta)=2(\theta_2 -5) \end{align}
function [jVal, gradient] = costFunction(theta)
jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
gradient = zeros(2,1);           % gradient of J, as derived above
gradient(1) = 2*(theta(1)-5);
gradient(2) = 2*(theta(2)-5);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = ...
fminunc(@costFunction, initialTheta, options);
% see help fminunc, options, optimset
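An analogous call in Python is a sketch with SciPy's `minimize`; `jac=True` tells the optimizer that the function returns both the cost and its gradient, mirroring `fminunc` with `'GradObj'` on:

```python
import numpy as np
from scipy.optimize import minimize

def cost_function(theta):
    # J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2, with its gradient
    jval = (theta[0] - 5)**2 + (theta[1] - 5)**2
    gradient = np.array([2*(theta[0] - 5), 2*(theta[1] - 5)])
    return jval, gradient

initial_theta = np.zeros(2)
res = minimize(cost_function, initial_theta, jac=True, method='BFGS')
```

`res.x` converges to the minimum at $$(5, 5)$$; supplying the analytic gradient spares the library from estimating it numerically.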


Minimizing Test Error

Balancing bias against variance to obtain optimal $$J_{\t{test}}$$:

• [1]: Create a list of lambdas - e.g. $$\lambda\in\{0,0.01,0.02,0.04,0.08,...,10.24\}$$
• [2]: Create a set of models w/ different polynomial degrees, hidden layers, neurons per layer, etc.
• [3]: Iterate through all the combinations of $$\lambda$$'s & models to learn some $$\bn{\Theta}$$
• [4]: Compute $$J_{\t{CV}}$$ using the learned $$\bn{\Theta}$$ (trained with its $$\lambda$$), but with the regularization term dropped (i.e. $$\lambda=0$$)
• [5]: Select combination yielding the lowest $$J_{\t{CV}}$$
• [6]: Apply the selected $$\bn{\Theta}$$ & $$\lambda$$ on $$J_{\t{test}}$$ to assess success of generalization
| Fixing High Bias | Fixing High Variance |
| --- | --- |
| Add features | Remove features |
| Decrease $$\lambda$$ | Increase $$\lambda$$ |
| | Add training examples |
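Steps [1]–[6] can be sketched with ridge-regularized polynomial regression in place of a neural network (the synthetic data, the closed-form ridge solve, and the grids are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1 + 2*x + 3*x**2 + 0.1*rng.standard_normal(60)    # quadratic + noise

def features(x, degree):
    return np.vander(x, degree + 1, increasing=True)  # [1, x, x^2, ...]

def fit(X, y, lam):
    # Closed-form regularized normal equation; theta_0 is unpenalized
    L = lam * np.eye(X.shape[1])
    L[0, 0] = 0
    return np.linalg.solve(X.T @ X + L, X.T @ y)

def J(theta, X, y):
    # Unregularized squared error, i.e. lambda = 0, as in step [4]
    return np.mean((X @ theta - y)**2) / 2

# Steps [1]-[5]: iterate over (lambda, degree) pairs, pick the lowest J_CV
x_tr, y_tr, x_cv, y_cv = x[:40], y[:40], x[40:], y[40:]
best = min(
    ((lam, d) for lam in [0, 0.01, 0.02, 0.04, 0.08] for d in [1, 2, 3, 4]),
    key=lambda p: J(fit(features(x_tr, p[1]), y_tr, p[0]),
                    features(x_cv, p[1]), y_cv))
```

With quadratic ground truth, the selected degree should be at least 2; step [6] would then evaluate the chosen pair once on a held-out test set.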
Unrolling Parameters

Advanced optimization libraries may expect vector input arguments; hence, to use them with neural networks, parameter matrices can be 'unrolled' into vectors:

Given matrices $$\bn{\Theta^{(1)}},\bn{\Theta^{(2)}},\bn{\Theta^{(3)}}$$, and $$\bn{D^{(1)}},\bn{D^{(2)}},\bn{D^{(3)}}$$, we can unroll all the elements into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:)]
deltaVector = [ D1(:); D2(:); D3(:)]


If Theta1, Theta2, and Theta3 have dimensions 10x11, 10x11, and 1x11, respectively, the original matrices can be retrieved as:
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
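The same unroll/reshape round trip in NumPy (matrix sizes taken from the example above; the matrix contents are illustrative):

```python
import numpy as np

Theta1 = np.arange(110.0).reshape(10, 11)        # 10x11
Theta2 = np.arange(110.0).reshape(10, 11) + 1.0  # 10x11
Theta3 = np.arange(11.0).reshape(1, 11)          # 1x11

# Unroll all parameters into one long vector (231 elements)
theta_vector = np.concatenate([Theta1.ravel(), Theta2.ravel(), Theta3.ravel()])

# Recover the original matrices from slices of the vector
T1 = theta_vector[:110].reshape(10, 11)
T2 = theta_vector[110:220].reshape(10, 11)
T3 = theta_vector[220:231].reshape(1, 11)
```

Note that MATLAB's `(:)` flattens in column-major order while NumPy's `ravel` defaults to row-major; the round trip recovers the matrices either way, as long as the unroll and the reshape use the same order (pass `order='F'` to both to match MATLAB's layout exactly).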