# Dragon Notes


# Machine Learning: Gradient Descent

Gradient Descent (single feature)

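Gradient descent (GD) is an iterative method for finding a minimum of a function: starting from an initial guess, each step moves the parameters in the direction of steepest descent (the negative gradient) of the cost function $$J$$. For a single feature,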

$$\ds [1]\quad \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\theta_0,\theta_1)$$
$$(\t{for }j=0\t{ and }j=1)$$
$$\ds \Rightarrow \quad \begin{aligned} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\theta_0,\theta_1) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\theta_0,\theta_1) \end{aligned}$$

$$\alpha =$$ learning rate (step-size)
$$\theta_j =$$ model parameter

$$\alpha$$ too large $$\ra$$ steps may overshoot the minimum and diverge
$$\alpha$$ too small $$\ra$$ convergence may be slow (many steps required)
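
The effect of the learning rate can be seen on a toy problem. The sketch below is only an illustration: the cost $$J(\theta)=\theta^2$$, the starting point, the step count and the three values of $$\alpha$$ are all assumptions chosen for the demo, not part of the notes above.

```octave
% J(theta) = theta^2 has derivative 2*theta and its minimum at theta = 0.
dJ = @(theta) 2 * theta;

alphas = [0.001, 0.1, 1.1];          % too small, reasonable, too large
final_theta = zeros(size(alphas));
for k = 1:numel(alphas)
  theta = 1;                         % same starting point for every run
  for step = 1:20
    theta = theta - alphas(k) * dJ(theta);
  end
  final_theta(k) = theta;
end
disp(final_theta)
```

After 20 steps the smallest rate has barely left the starting point, the middle one is close to the minimum, and the largest one has grown in magnitude at every step.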

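$$[1]$$ is repeated until convergence. At each iteration, the parameters $$\theta_0, \theta_1$$ must be updated simultaneously, i.e. both partial derivatives are evaluated at the old parameter values before either parameter is overwritten: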
Correct:

$$\ds \t{temp0} := \theta_0 - \alpha \sfrac{\pd}{\pd\theta_0}J(\theta_0,\theta_1)$$
$$\ds \t{temp1} := \theta_1 - \alpha \sfrac{\pd}{\pd\theta_1}J(\theta_0,\theta_1)$$
$$\theta_0 := \t{temp0}$$
$$\theta_1 := \t{temp1}$$

Incorrect:

$$\ds \t{temp0} := \theta_0 - \alpha \sfrac{\pd}{\pd \theta_0}J(\theta_0,\theta_1)$$
$$\theta_0 := \t{temp0}$$
$$\ds \t{temp1} := \theta_1 - \alpha \sfrac{\pd}{\pd \theta_1}J(\theta_0,\theta_1)$$
$$\theta_1 := \t{temp1}$$

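Updating one parameter before the other parameter's update has been computed within the same iteration yields an incorrect implementation: in the incorrect version, the partial derivative with respect to $$\theta_1$$ is evaluated at the already-updated $$\theta_0$$.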
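
A minimal Octave sketch of the simultaneous update follows. It assumes a straight-line hypothesis $$h(x)=\theta_0+\theta_1 x$$ and the squared-error cost $$J=\frac{1}{2m}\sum (h(x^{(i)})-y^{(i)})^2$$, neither of which is fixed by the text above; the data, learning rate and iteration count are made up for illustration.

```octave
% Toy data lying exactly on y = 1 + 2x (made up for illustration).
x = [0; 1; 2; 3];
y = [1; 3; 5; 7];
m = numel(x);

alpha = 0.1;
theta0 = 0;
theta1 = 0;
for iter = 1:5000
  err = theta0 + theta1 * x - y;               % h(x) - y with the OLD parameters
  temp0 = theta0 - alpha * sum(err) / m;       % dJ/dtheta0 = mean(err)
  temp1 = theta1 - alpha * sum(err .* x) / m;  % dJ/dtheta1 = mean(err .* x)
  theta0 = temp0;                              % assign only after both temps exist
  theta1 = temp1;
end
```

Because `err` is computed once per iteration from the old parameters, both `temp0` and `temp1` see the same $$(\theta_0,\theta_1)$$, which is exactly the "Correct" ordering.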

Linear Regression GD

For linear regression involving multiple variables, the GD algorithm is

$$\ds \theta_j := \theta_j -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_j$$
$$(\t{for }j=0,1,...,n)$$
$$\ds \Rightarrow \ \ \begin{aligned} \theta_0 & := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_0 \\ \theta_1 & := \theta_1 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_1 \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_n \end{aligned}$$

$$m =$$ number of training examples
$$n =$$ number of features
$$x^{(i)} =$$ input (features) of the $$i^{\t{th}}$$ training example
$$x^{(i)}_j =$$ value of feature $$j$$ in the $$i^{\t{th}}$$ training example, with $$x^{(i)}_0 = 1$$
$$h_{\theta}(x) = \bn{\theta}^T\bn{x}$$ (linear regression hypothesis)
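
The per-parameter update can be sketched in Octave as below. This is an illustration under stated assumptions: the hypothesis is taken to be $$h_{\theta}(x)=\bn{\theta}^T\bn{x}$$ with $$x_0=1$$, the matrix `X` stacks the training examples as rows, and the toy data, `alpha` and the iteration count are invented for the example.

```octave
% Toy data generated from y = 1 + 2*x1 + 3*x2 (made up for illustration).
X = [1 0 0; 1 1 0; 1 0 1; 1 1 1; 1 2 1];   % m x (n+1), first column is x_0 = 1
y = [1; 3; 4; 6; 8];
[m, num_params] = size(X);

alpha = 0.1;
theta = zeros(num_params, 1);
for iter = 1:5000
  err = X * theta - y;                 % h_theta(x^(i)) - y^(i) for every i
  new_theta = theta;
  for j = 1:num_params                 % Octave indices 1..n+1 map to j = 0..n
    new_theta(j) = theta(j) - alpha / m * sum(err .* X(:, j));
  end
  theta = new_theta;                   % simultaneous update
end
disp(theta')
```

The inner loop mirrors the summation formula one $$\theta_j$$ at a time; it could be collapsed into a single matrix product, but the loop keeps the correspondence with the equations visible.
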
Logistic Regression GD

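The update rule has the same form as for linear regression, but with a different hypothesis and cost function:

$$\ds h_{\theta}(x)=g(\bn{\theta}^T\bn{x}),\ \ \t{where } g(z)= \frac{1}{1+e^{-z}}$$

$$\ds J(\bn{\theta})= \sfrac{1}{m}\lrpar{-\bn{y}^T\t{log}(\bn{h})-(1-\bn{y})^T\t{log}(1-\bn{h})}$$

$$\dsup \bn{\theta} := \bn{\theta}-\sfrac{\alpha}{m}\bn{X}^T(g(\bn{X}\bn{\theta})-\bn{y})$$

$$\bn{X} =$$ $$m\times(n+1)$$ matrix whose $$i^{\t{th}}$$ row is $$(x^{(i)})^T$$
$$\bn{h} = g(\bn{X}\bn{\theta})$$, $$\bn{y} =$$ vector of the $$m$$ labels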
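
A hedged Octave sketch of the vectorized update is given below. Here `X` denotes the $$m\times(n+1)$$ matrix of training examples (one per row, first column all ones), and the tiny data set, `alpha` and the iteration count are assumptions made for illustration only.

```octave
g = @(z) 1 ./ (1 + exp(-z));           % sigmoid, applied elementwise

% Made-up, non-separable data: one feature plus the intercept column.
X = [1 0.5; 1 1.0; 1 1.5; 1 2.0; 1 2.5; 1 3.0];
y = [0; 0; 1; 0; 1; 1];
m = numel(y);

% Vectorized cost J(theta), as in the formula above.
cost = @(theta) (-y' * log(g(X * theta)) - (1 - y)' * log(1 - g(X * theta))) / m;

alpha = 0.5;
theta = zeros(2, 1);
J_start = cost(theta);                 % equals log(2) at theta = 0
for iter = 1:2000
  theta = theta - (alpha / m) * X' * (g(X * theta) - y);
end
J_end = cost(theta);
```

Comparing `J_end` with `J_start` is a quick sanity check that the updates are decreasing the cost.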

Hypothesis Function (Logistic Regression)

The hypothesis function for logistic regression is

$$\ds h_{\theta}(x)=g(\bn{\theta}^T\bn{x}),$$
$$\ds \t{where } g(z)= \frac{1}{1+e^{-z}}$$

- Used in classification problems
- Output is bounded between $$0$$ and $$1$$ (in contrast to linear regression, whose output is unbounded)
- $$h_{\theta}(x)$$ is interpreted as the probability that $$y=1$$ for input $$x$$.
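
As a small illustration of these properties, the hypothesis can be evaluated for a few inputs. The parameter vector and the inputs below are hypothetical values picked so that $$\bn{\theta}^T\bn{x}$$ is $$0$$, $$3$$ and $$-3$$; the first component of each input is assumed to be the constant $$x_0=1$$.

```octave
g = @(z) 1 ./ (1 + exp(-z));     % sigmoid
h = @(theta, x) g(theta' * x);   % hypothesis for one example (column vector x)

theta = [-3; 1];                 % hypothetical parameters
p_boundary = h(theta, [1; 3]);   % theta' * x = 0
p_high = h(theta, [1; 6]);       % theta' * x = 3
p_low = h(theta, [1; 0]);        % theta' * x = -3
```

An input with $$\bn{\theta}^T\bn{x}=0$$ gives exactly $$0.5$$; larger values push the output toward $$1$$, smaller ones toward $$0$$, and the output never leaves the interval $$(0,1)$$.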

Multivariate gradient descent, the extension of GD to multiple features, is

$$\ds \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\bn{\theta})$$
$$(\t{for }j=0,1,...,n)$$
$$\ds \Rightarrow \quad \begin{aligned} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\bn{\theta}) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\bn{\theta}) \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{\pd}{\pd \theta_n}J(\bn{\theta}) \end{aligned}$$

- All $$\theta_j$$ are updated simultaneously, and the update is repeated until convergence
- $$\bn{\theta} =$$ model parameter vector, $$(n+1)\times 1$$
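
The general loop, including a simple convergence test, might look like the sketch below. The cost $$J(\bn{\theta})=\sum_j(\theta_j-c_j)^2$$, the target vector `target`, the learning rate and the stopping tolerance are all assumptions introduced for the example; the notes do not prescribe a particular convergence criterion.

```octave
target = [1; -2; 3];
grad_J = @(theta) 2 * (theta - target);  % gradient of J = sum((theta - target).^2)

alpha = 0.1;
theta = zeros(3, 1);
for iter = 1:1000
  step = alpha * grad_J(theta);          % every partial uses the same old theta
  theta = theta - step;
  if norm(step) < 1e-10                  % stop once the update is negligible
    break;
  end
end
```

Computing the whole gradient vector before touching `theta` gives the simultaneous update for free.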

Cost Function (Logistic Regression)

The cost function for logistic regression is

$$\ds \begin{aligned} J(\theta) &= \sfrac{1}{m}\sum_{i=1}^{m}\t{Cost}(h_{\theta}(x^{(i)}),y^{(i)}) \\ &= -\sfrac{1}{m}\sum_{i=1}^{m}\lrpar{y^{(i)}\t{log}(h_{\theta}(x^{(i)}))+(1-y^{(i)})\t{log}(1-h_{\theta}(x^{(i)}))} \end{aligned}$$

$$\ds \begin{aligned} \t{Cost}(h_{\theta}(x),y) &= \left\{ \begin{array}{ll} -\t{log}(h_{\theta}(x)) & \t{if } y = 1 \\ -\t{log}(1-h_{\theta}(x)) & \t{if } y = 0 \end{array} \right. \\ &= -y\ \t{log}(h_{\theta}(x))-(1-y)\t{log}(1-h_{\theta}(x)) \end{aligned}$$

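- Makes $$J(\theta)$$ convex for the logistic regression hypothesis function (the squared-error cost of linear regression would not be convex with the sigmoid)
- Imposes a steep penalty on confident wrong predictions: the cost grows without bound as $$h_{\theta}(x)\ra 0$$ when $$y=1$$, or as $$h_{\theta}(x)\ra 1$$ when $$y=0$$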
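
The size of the penalty can be checked numerically. The sketch below evaluates the per-example cost for a true label $$y=1$$ at three hypothetical predictions (0.99, 0.5 and 0.01); these values are chosen only to illustrate the shape of the function.

```octave
% Per-example cost: -y*log(h) - (1 - y)*log(1 - h)
cost = @(h, y) -y .* log(h) - (1 - y) .* log(1 - h);

c_right = cost(0.99, 1);   % confident and correct
c_unsure = cost(0.5, 1);   % undecided
c_wrong = cost(0.01, 1);   % confident and wrong
fprintf('%.4f %.4f %.4f\n', c_right, c_unsure, c_wrong);
```

The three costs come out at roughly 0.01, 0.69 and 4.61, so a confident mistake costs several hundred times more than a confident correct prediction.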

Gradient checking guards against bugs and other faults in a backpropagation implementation. It compares the computed gradient with a numerical (two-sided difference) approximation:

$$\ds \begin{aligned} & \frac{\pd}{\pd\Theta_j}J(\bn{\Theta})\approx\frac{J(\Theta_1,...,\Theta_j+\epsilon,...,\Theta_n)-J(\Theta_1,...,\Theta_j-\epsilon,...,\Theta_n)}{2\epsilon}, \\ & \t{where }\epsilon\t{ is small }(\t{e.g. }10^{-4})\end{aligned}$$

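- Once $$\s{}\t{gradApprox}\approx\t{deltaVector}$$ is verified, the verification algorithm should be turned off, as it is computationally expensive.

In Octave, the approximation is computed one parameter at a time:

    EPSILON = 1e-4;
    for i = 1:n
      % perturb only the i-th parameter, up and down by EPSILON
      thetaPlus = theta;
      thetaPlus(i) += EPSILON;
      thetaMinus = theta;
      thetaMinus(i) -= EPSILON;
      % two-sided difference; J is assumed to be a handle to the cost function
      gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
    end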
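
A complete, self-contained check might look like the following sketch. The cost function `J` and its hand-derived gradient `analytic_grad` are hypothetical stand-ins for a real model's cost and backpropagation output; they are used only so that the comparison has a known answer.

```octave
% Hypothetical cost with a known analytic gradient.
J = @(t) t(1)^2 + 3 * t(1) * t(2) + t(2)^3;
analytic_grad = @(t) [2 * t(1) + 3 * t(2); 3 * t(1) + 3 * t(2)^2];

theta = [1; 2];
EPSILON = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:numel(theta)
  thetaPlus = theta;
  thetaPlus(i) += EPSILON;
  thetaMinus = theta;
  thetaMinus(i) -= EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end

% Largest disagreement between the numerical and analytic gradients.
max_diff = max(abs(gradApprox - analytic_grad(theta)));
```

A `max_diff` many orders of magnitude smaller than the gradient entries suggests the analytic gradient is implemented correctly; a large value points to a bug.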