Dragon Notes


\( \newcommand{bvec}[1]{\overrightarrow{\boldsymbol{#1}}} \newcommand{bnvec}[1]{\overrightarrow{\boldsymbol{\mathrm{#1}}}} \newcommand{uvec}[1]{\widehat{\boldsymbol{#1}}} \newcommand{vec}[1]{\overrightarrow{#1}} \newcommand{\parallelsum}{\mathbin{\|}} \) \( \newcommand{s}[1]{\small{#1}} \newcommand{t}[1]{\text{#1}} \newcommand{tb}[1]{\textbf{#1}} \newcommand{ns}[1]{\normalsize{#1}} \newcommand{ss}[1]{\scriptsize{#1}} \newcommand{vpl}[]{\vphantom{\large{\int^{\int}}}} \newcommand{vplup}[]{\vphantom{A^{A^{A^A}}}} \newcommand{vplLup}[]{\vphantom{A^{A^{A^{A{^A{^A}}}}}}} \newcommand{vpLup}[]{\vphantom{A^{A^{A^{A^{A^{A^{A^A}}}}}}}} \newcommand{up}[]{\vplup} \newcommand{Up}[]{\vplLup} \newcommand{Uup}[]{\vpLup} \newcommand{vpL}[]{\vphantom{\Large{\int^{\int}}}} \newcommand{lrg}[1]{\class{lrg}{#1}} \newcommand{sml}[1]{\class{sml}{#1}} \newcommand{qq}[2]{{#1}_{\t{#2}}} \newcommand{ts}[2]{\t{#1}_{\t{#2}}} \) \( \newcommand{ds}[]{\displaystyle} \newcommand{dsup}[]{\displaystyle\vplup} \newcommand{u}[1]{\underline{#1}} \newcommand{tu}[1]{\underline{\text{#1}}} \newcommand{tbu}[1]{\underline{\bf{\text{#1}}}} \newcommand{bxred}[1]{\class{bxred}{#1}} \newcommand{Bxred}[1]{\class{bxred2}{#1}} \newcommand{lrpar}[1]{\left({#1}\right)} \newcommand{lrbra}[1]{\left[{#1}\right]} \newcommand{lrabs}[1]{\left|{#1}\right|} \newcommand{bnlr}[2]{\bn{#1}\left(\bn{#2}\right)} \newcommand{nblr}[2]{\bn{#1}(\bn{#2})} \newcommand{real}[1]{\Ree\{{#1}\}} \newcommand{Real}[1]{\Ree\left\{{#1}\right\}} \) \( \newcommand{bn}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{bns}[2]{\bn{#1}_{\t{#2}}} \newcommand{b}[1]{\boldsymbol{#1}} \newcommand{bb}[1]{[\bn{#1}]} \) \( \newcommand{abs}[1]{\left|{#1}\right|} \newcommand{ra}[]{\rightarrow} \newcommand{Ra}[]{\Rightarrow} \newcommand{Lra}[]{\Leftrightarrow} \newcommand{rai}[]{\rightarrow\infty} \newcommand{ub}[2]{\underbrace{{#1}}_{#2}} \newcommand{ob}[2]{\overbrace{{#1}}^{#2}} \newcommand{lfrac}[2]{\large{\frac{#1}{#2}}\normalsize{}} \newcommand{sfrac}[2]{\small{\frac{#1}{#2}}\normalsize{}} \newcommand{Cos}[1]{\cos{\left({#1}\right)}} \newcommand{Sin}[1]{\sin{\left({#1}\right)}} \newcommand{Frac}[2]{\left({\frac{#1}{#2}}\right)} \newcommand{LFrac}[2]{\large{{\left({\frac{#1}{#2}}\right)}}\normalsize{}} \newcommand{Sinf}[2]{\sin{\left(\frac{#1}{#2}\right)}} \newcommand{Cosf}[2]{\cos{\left(\frac{#1}{#2}\right)}} \newcommand{atan}[1]{\tan^{-1}({#1})} \newcommand{Atan}[1]{\tan^{-1}\left({#1}\right)} \newcommand{intlim}[2]{\int\limits_{#1}^{#2}} \newcommand{lmt}[2]{\lim_{{#1}\rightarrow{#2}}} \newcommand{ilim}[1]{\lim_{{#1}\rightarrow\infty}} \newcommand{zlim}[1]{\lim_{{#1}\rightarrow 0}} \newcommand{Pr}[]{\t{Pr}} \newcommand{prop}[]{\propto} \newcommand{ln}[1]{\t{ln}({#1})} \newcommand{Ln}[1]{\t{ln}\left({#1}\right)} \newcommand{min}[2]{\t{min}({#1},{#2})} \newcommand{Min}[2]{\t{min}\left({#1},{#2}\right)} \newcommand{max}[2]{\t{max}({#1},{#2})} \newcommand{Max}[2]{\t{max}\left({#1},{#2}\right)} \newcommand{pfrac}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{pd}[]{\partial} \newcommand{zisum}[1]{\sum_{{#1}=0}^{\infty}} \newcommand{iisum}[1]{\sum_{{#1}=-\infty}^{\infty}} \newcommand{var}[1]{\t{var}({#1})} \newcommand{exp}[1]{\t{exp}\left({#1}\right)} \newcommand{mtx}[2]{\left[\begin{matrix}{#1}\\{#2}\end{matrix}\right]} \newcommand{nmtx}[2]{\begin{matrix}{#1}\\{#2}\end{matrix}} \newcommand{nmttx}[3]{\begin{matrix}\begin{align} {#1}& \\ {#2}& \\ {#3}& \\ \end{align}\end{matrix}} \newcommand{amttx}[3]{\begin{matrix} {#1} \\ {#2} \\ {#3} \\ \end{matrix}} 
\newcommand{nmtttx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{mtxx}[4]{\left[\begin{matrix}\begin{align}&{#1}&\hspace{-20px}{#2}\\&{#3}&\hspace{-20px}{#4}\end{align}\end{matrix}\right]} \newcommand{mtxxx}[9]{\begin{matrix}\begin{align} &{#1}&\hspace{-20px}{#2}&&\hspace{-20px}{#3}\\ &{#4}&\hspace{-20px}{#5}&&\hspace{-20px}{#6}\\ &{#7}&\hspace{-20px}{#8}&&\hspace{-20px}{#9} \end{align}\end{matrix}} \newcommand{amtxxx}[9]{ \amttx{#1}{#4}{#7}\hspace{10px} \amttx{#2}{#5}{#8}\hspace{10px} \amttx{#3}{#6}{#9}} \) \( \newcommand{ph}[1]{\phantom{#1}} \newcommand{vph}[1]{\vphantom{#1}} \newcommand{mtxxxx}[8]{\begin{matrix}\begin{align} & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4} \\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \\ \mtxxxxCont} \newcommand{\mtxxxxCont}[8]{ & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4}\\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \end{align}\end{matrix}} \newcommand{mtXxxx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{cov}[1]{\t{cov}({#1})} \newcommand{Cov}[1]{\t{cov}\left({#1}\right)} \newcommand{var}[1]{\t{var}({#1})} \newcommand{Var}[1]{\t{var}\left({#1}\right)} \newcommand{pnint}[]{\int_{-\infty}^{\infty}} \newcommand{floor}[1]{\left\lfloor {#1} \right\rfloor} \) \( \newcommand{adeg}[1]{\angle{({#1}^{\t{o}})}} \newcommand{Ree}[]{\mathcal{Re}} \newcommand{Im}[]{\mathcal{Im}} \newcommand{deg}[1]{{#1}^{\t{o}}} \newcommand{adegg}[1]{\angle{{#1}^{\t{o}}}} \newcommand{ang}[1]{\angle{\left({#1}\right)}} \newcommand{bkt}[1]{\langle{#1}\rangle} \) \( \newcommand{\hs}[1]{\hspace{#1}} \)

  UNDER CONSTRUCTION

Machine Learning:
Gradient Descent





Gradient Descent (single feature)

Gradient descent (GD) is an iterative method for locating function optima by repeatedly stepping along the negative gradient. For a single feature,

\(\ds [1]\quad \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\theta_0,\theta_1)\)
  \((\t{for }j=0\t{ and }j=1)\)
\(\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\theta_0,\theta_1) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\theta_0,\theta_1) \end{align} \)

\(\alpha =\) learning rate (step-size)
\(\theta_j =\) model parameter

\(\alpha\) too large \(\ra\) may overshoot the minimum and diverge
\(\alpha\) too small \(\ra\) convergence may be too slow (requires many steps)

\([1]\) is repeated until convergence. At each iteration, the parameters \(\theta_0,\theta_1\) must be updated simultaneously:
Correct:

\(\ds \begin{align} \t{temp0} &:= \theta_0 - \alpha \sfrac{\pd}{\pd\theta_0}J(\theta_0,\theta_1) \\ \t{temp1} &:= \theta_1 - \alpha \sfrac{\pd}{\pd\theta_1}J(\theta_0,\theta_1) \\ \theta_0 &:= \t{temp0} \\ \theta_1 &:= \t{temp1} \end{align}\)

Incorrect:

\(\ds \begin{align} \t{temp0} &:= \theta_0 - \alpha \sfrac{\pd}{\pd\theta_0}J(\theta_0,\theta_1) \\ \theta_0 &:= \t{temp0} \\ \t{temp1} &:= \theta_1 - \alpha \sfrac{\pd}{\pd\theta_1}J(\theta_0,\theta_1) \\ \theta_1 &:= \t{temp1} \end{align}\)

Updating one parameter before computing the remaining updates within the same iteration yields an incorrect implementation, since the later derivatives would then be evaluated at already-modified parameters.
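As a concrete illustration, a minimal Octave sketch of the correct (simultaneous) update, assuming the linear hypothesis \(h_{\theta}(x)=\theta_0+\theta_1x\) and squared-error cost of the next section (all variable names illustrative):

% Single-feature GD with simultaneous update (illustrative sketch)
% x, y: m-by-1 training data; alpha: learning rate;
% theta0, theta1, num_iters assumed initialized (e.g. 0, 0, 1500)
m = length(y);
for iter = 1:num_iters
  h     = theta0 + theta1*x;                     % uses old parameters only
  temp0 = theta0 - alpha*(1/m)*sum(h - y);       % dJ/dtheta0
  temp1 = theta1 - alpha*(1/m)*sum((h - y).*x);  % dJ/dtheta1
  theta0 = temp0;                                % assign only after both
  theta1 = temp1;                                %   updates are computed
end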

Linear Regression GD

For linear regression involving multiple variables, the GD algorithm is

\(\ds \theta_j := \theta_j -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_j\)
  \((\t{for }j=0,1,...,n)\)
\(\ds \Rightarrow \ \ \begin{align} \theta_0 & := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_0 \\ \theta_1 & := \theta_1 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_1 \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_n \\ \end{align} \)

\(x^{(i)}_j =\) value of feature \(j\) in the \(i^{\t{th}}\) training example
\(m =\) # of training examples
\(x^{(i)}=\) input (features) of the \(i^{\t{th}}\) training ex.\(\vph{x^{(i)}_j}\)
\(n =\) # of features
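In vectorized form, all \(n+1\) parameters are updated in one assignment, which makes the simultaneous update automatic. A minimal Octave sketch of one step, assuming X is the \(m\times(n+1)\) design matrix whose first column is all ones (names illustrative):

% One vectorized GD step for linear regression (illustrative sketch)
% X: m-by-(n+1), first column all ones; y: m-by-1; theta: (n+1)-by-1
h     = X*theta;                       % predictions for all m examples
theta = theta - (alpha/m)*X'*(h - y);  % all theta_j updated at once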
Logistic Regression GD

Same as linear regression's, except with different hypothesis and cost functions:

\(\ds h_{\theta}(x)=g(\bn{\theta}^T\bn{x}),\ \ \t{where } g(z)= \frac{1}{1+e^{-z}}\)

\(\ds J(\bn{\theta})= \sfrac{1}{m}\lrpar{-\bn{y}^T\t{log}(\bn{h})-(1-\bn{y})^T\t{log}(1-\bn{h})}\)

\(\dsup \bn{\theta}: = \bn{\theta}-\sfrac{\alpha}{m}\bn{x}^T(g(\bn{x}\bn{\theta})-\bn{y})\)
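A minimal Octave sketch of one such step, using the same X, y, theta, alpha conventions as the linear sketch above (illustrative, not this page's code):

% One vectorized GD step for logistic regression (illustrative sketch)
g     = @(z) 1 ./ (1 + exp(-z));       % sigmoid
h     = g(X*theta);                    % predicted probabilities in (0,1)
J     = (1/m)*(-y'*log(h) - (1 - y)'*log(1 - h));  % cost, for monitoring
theta = theta - (alpha/m)*X'*(h - y);  % same update form as linear GD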

Hypothesis Function (Logistic Regression)

The hypothesis function for logistic regression is

\(\ds h_{\theta}(x)=g(\bn{\theta}^T\bn{x}),\)
\(\ds \t{where } g(z)= \frac{1}{1+e^{-z}}\)

- Used in classification problems
- Output is bounded between \(0\) and \(1\) (in contrast to linear regression's unbounded output)
- \(h_{\theta}\) gives the probability that the output is positive (\(y=1\)); see the example below
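For example, \(h_{\theta}(x)=0.7\) means the model assigns a \(70\%\) probability to \(y=1\) (and \(30\%\) to \(y=0\)) for the input \(x\).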
Multivariate Gradient Descent

The multiple-feature extension of GD, multivariate gradient descent, is

\(\ds \theta_j := \theta_j -\alpha \frac{\pd}{\pd \theta_j}J(\bn{\theta})\)
  \((\t{for }j=0,1,...,n)\)
\(\ds \Rightarrow \quad \begin{align} \theta_0 & := \theta_0 -\alpha \frac{\pd}{\pd \theta_0}J(\bn{\theta}) \\ \theta_1 & := \theta_1 -\alpha \frac{\pd}{\pd \theta_1}J(\bn{\theta}) \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{\pd}{\pd \theta_n}J(\bn{\theta}) \\ \end{align} \)

- Repeated until convergence; \(\bn{\theta} =\) model parameter vector, \((n+1)\times 1\). A generic update loop is sketched below.
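A minimal Octave sketch of this generic loop, assuming a user-supplied function gradJ that returns the \((n+1)\times 1\) gradient of \(J\) (gradJ is hypothetical, not defined on this page):

% Generic multivariate GD loop (illustrative sketch)
for iter = 1:num_iters
  theta = theta - alpha*gradJ(theta);  % all components updated at once
end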

Cost Function (Logistic Regression)

The cost function for logistic regression is

\(\ds \begin{align} J(\theta) &= \sfrac{1}{m}\sum_{i=1}^{m}\t{Cost}(h_{\theta}(x^{(i)}),y^{(i)}) \\ &= -\sfrac{1}{m}\sum_{i=1}^{m}\lrpar{y^{(i)}\t{log}(h_{\theta}(x^{(i)}))+(1-y^{(i)})\t{log}(1-h_{\theta}(x^{(i)}))} \end{align}\)
\(\ds \t{Cost}(h_{\theta}(x),y)=\left\{ \begin{align} & -\t{log}(h_{\theta}(x)) && \t{if } y = 1 \\ & -\t{log}(1-h_{\theta}(x)) && \t{if } y = 0 \end{align} \right. \)
\(\dsup \hspace{160px}= -y\ \t{log}(h_{\theta}(x))-(1-y)\t{log}(1-h_{\theta}(x))\)

- Makes the cost function convex (with the sigmoid hypothesis, linear regression's squared-error cost would be non-convex)
- Imposes a steep penalty on confident false positives and false negatives, as illustrated below
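For example, if \(y=1\), \(\t{Cost}=-\t{log}(h_{\theta}(x))\ra\infty\) as \(h_{\theta}(x)\ra 0\): a confident wrong prediction is penalized without bound, while a confident correct one (\(h_{\theta}(x)\ra 1\)) costs nearly \(0\).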

Gradient Checking

Gradient checking guards against bugs and other faults in a backpropagation implementation by comparing each computed gradient with a numerical approximation:

\(\ds \begin{align} & \frac{\pd}{\pd\Theta_j}J(\bn{\Theta})\approx\frac{J(\Theta_1,...,\Theta_j+\epsilon,...,\Theta_n)-J(\Theta_1,...,\Theta_j-\epsilon,...,\Theta_n)}{2\epsilon}, \\ & \t{where }\epsilon\t{ is small (e.g. }10^{-4})\end{align}\)

- Once \(\s{}\t{gradApprox}\approx\t{deltaVector}\) is verified, gradient checking should be turned off, as it is computationally expensive
% Numerical gradient via central differences (J: cost function handle,
% theta: unrolled parameter vector, n: number of parameters)
EPSILON = 1e-4;
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) += EPSILON;    % perturb i-th parameter upward
  thetaMinus = theta;
  thetaMinus(i) -= EPSILON;   % perturb i-th parameter downward
  gradApprox(i) = (J(thetaPlus)-J(thetaMinus))/(2*EPSILON);
end
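One way to quantify the agreement (a sketch; assumes \(\s{}\t{deltaVector}\) holds backpropagation's unrolled gradients, as above):

relDiff = norm(gradApprox - deltaVector)/norm(gradApprox + deltaVector);
% values around 1e-9 or below typically indicate a correct gradient;
% the threshold is a common rule of thumb, not a strict bound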



