
Dragon Notes


\( \newcommand{bvec}[1]{\overrightarrow{\boldsymbol{#1}}} \newcommand{bnvec}[1]{\overrightarrow{\boldsymbol{\mathrm{#1}}}} \newcommand{uvec}[1]{\widehat{\boldsymbol{#1}}} \newcommand{vec}[1]{\overrightarrow{#1}} \newcommand{\parallelsum}{\mathbin{\|}} \) \( \newcommand{s}[1]{\small{#1}} \newcommand{t}[1]{\text{#1}} \newcommand{tb}[1]{\textbf{#1}} \newcommand{ns}[1]{\normalsize{#1}} \newcommand{ss}[1]{\scriptsize{#1}} \newcommand{vpl}[]{\vphantom{\large{\int^{\int}}}} \newcommand{vplup}[]{\vphantom{A^{A^{A^A}}}} \newcommand{vplLup}[]{\vphantom{A^{A^{A^{A{^A{^A}}}}}}} \newcommand{vpLup}[]{\vphantom{A^{A^{A^{A^{A^{A^{A^A}}}}}}}} \newcommand{up}[]{\vplup} \newcommand{Up}[]{\vplLup} \newcommand{Uup}[]{\vpLup} \newcommand{vpL}[]{\vphantom{\Large{\int^{\int}}}} \newcommand{lrg}[1]{\class{lrg}{#1}} \newcommand{sml}[1]{\class{sml}{#1}} \newcommand{qq}[2]{{#1}_{\t{#2}}} \newcommand{ts}[2]{\t{#1}_{\t{#2}}} \) \( \newcommand{ds}[]{\displaystyle} \newcommand{dsup}[]{\displaystyle\vplup} \newcommand{u}[1]{\underline{#1}} \newcommand{tu}[1]{\underline{\text{#1}}} \newcommand{tbu}[1]{\underline{\bf{\text{#1}}}} \newcommand{bxred}[1]{\class{bxred}{#1}} \newcommand{Bxred}[1]{\class{bxred2}{#1}} \newcommand{lrpar}[1]{\left({#1}\right)} \newcommand{lrbra}[1]{\left[{#1}\right]} \newcommand{lrabs}[1]{\left|{#1}\right|} \newcommand{bnlr}[2]{\bn{#1}\left(\bn{#2}\right)} \newcommand{nblr}[2]{\bn{#1}(\bn{#2})} \newcommand{real}[1]{\Ree\{{#1}\}} \newcommand{Real}[1]{\Ree\left\{{#1}\right\}} \newcommand{abss}[1]{\|{#1}\|} \newcommand{umin}[1]{\underset{{#1}}{\t{min}}} \newcommand{umax}[1]{\underset{{#1}}{\t{max}}} \newcommand{und}[2]{\underset{{#1}}{{#2}}} \) \( \newcommand{bn}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{bns}[2]{\bn{#1}_{\t{#2}}} \newcommand{b}[1]{\boldsymbol{#1}} \newcommand{bb}[1]{[\bn{#1}]} \) \( \newcommand{abs}[1]{\left|{#1}\right|} \newcommand{ra}[]{\rightarrow} \newcommand{Ra}[]{\Rightarrow} \newcommand{Lra}[]{\Leftrightarrow} \newcommand{rai}[]{\rightarrow\infty} \newcommand{ub}[2]{\underbrace{{#1}}_{#2}} \newcommand{ob}[2]{\overbrace{{#1}}^{#2}} \newcommand{lfrac}[2]{\large{\frac{#1}{#2}}\normalsize{}} \newcommand{sfrac}[2]{\small{\frac{#1}{#2}}\normalsize{}} \newcommand{Cos}[1]{\cos{\left({#1}\right)}} \newcommand{Sin}[1]{\sin{\left({#1}\right)}} \newcommand{Frac}[2]{\left({\frac{#1}{#2}}\right)} \newcommand{LFrac}[2]{\large{{\left({\frac{#1}{#2}}\right)}}\normalsize{}} \newcommand{Sinf}[2]{\sin{\left(\frac{#1}{#2}\right)}} \newcommand{Cosf}[2]{\cos{\left(\frac{#1}{#2}\right)}} \newcommand{atan}[1]{\tan^{-1}({#1})} \newcommand{Atan}[1]{\tan^{-1}\left({#1}\right)} \newcommand{intlim}[2]{\int\limits_{#1}^{#2}} \newcommand{lmt}[2]{\lim_{{#1}\rightarrow{#2}}} \newcommand{ilim}[1]{\lim_{{#1}\rightarrow\infty}} \newcommand{zlim}[1]{\lim_{{#1}\rightarrow 0}} \newcommand{Pr}[]{\t{Pr}} \newcommand{prop}[]{\propto} \newcommand{ln}[1]{\t{ln}({#1})} \newcommand{Ln}[1]{\t{ln}\left({#1}\right)} \newcommand{min}[2]{\t{min}({#1},{#2})} \newcommand{Min}[2]{\t{min}\left({#1},{#2}\right)} \newcommand{max}[2]{\t{max}({#1},{#2})} \newcommand{Max}[2]{\t{max}\left({#1},{#2}\right)} \newcommand{pfrac}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{pd}[]{\partial} \newcommand{zisum}[1]{\sum_{{#1}=0}^{\infty}} \newcommand{iisum}[1]{\sum_{{#1}=-\infty}^{\infty}} \newcommand{var}[1]{\t{var}({#1})} \newcommand{exp}[1]{\t{exp}\left({#1}\right)} \newcommand{mtx}[2]{\left[\begin{matrix}{#1}\\{#2}\end{matrix}\right]} \newcommand{nmtx}[2]{\begin{matrix}{#1}\\{#2}\end{matrix}} 
\newcommand{nmttx}[3]{\begin{matrix}\begin{align} {#1}& \\ {#2}& \\ {#3}& \\ \end{align}\end{matrix}} \newcommand{amttx}[3]{\begin{matrix} {#1} \\ {#2} \\ {#3} \\ \end{matrix}} \newcommand{nmtttx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{mtxx}[4]{\left[\begin{matrix}\begin{align}&{#1}&\hspace{-20px}{#2}\\&{#3}&\hspace{-20px}{#4}\end{align}\end{matrix}\right]} \newcommand{mtxxx}[9]{\begin{matrix}\begin{align} &{#1}&\hspace{-20px}{#2}&&\hspace{-20px}{#3}\\ &{#4}&\hspace{-20px}{#5}&&\hspace{-20px}{#6}\\ &{#7}&\hspace{-20px}{#8}&&\hspace{-20px}{#9} \end{align}\end{matrix}} \newcommand{amtxxx}[9]{ \amttx{#1}{#4}{#7}\hspace{10px} \amttx{#2}{#5}{#8}\hspace{10px} \amttx{#3}{#6}{#9}} \) \( \newcommand{ph}[1]{\phantom{#1}} \newcommand{vph}[1]{\vphantom{#1}} \newcommand{mtxxxx}[8]{\begin{matrix}\begin{align} & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4} \\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \\ \mtxxxxCont} \newcommand{\mtxxxxCont}[8]{ & {#1}&\hspace{-17px}{#2} &&\hspace{-20px}{#3} &&\hspace{-20px}{#4}\\ & {#5}&\hspace{-17px}{#6} &&\hspace{-20px}{#7} &&\hspace{-20px}{#8} \end{align}\end{matrix}} \newcommand{mtXxxx}[4]{\begin{matrix}{#1}\\{#2}\\{#3}\\{#4}\end{matrix}} \newcommand{cov}[1]{\t{cov}({#1})} \newcommand{Cov}[1]{\t{cov}\left({#1}\right)} \newcommand{var}[1]{\t{var}({#1})} \newcommand{Var}[1]{\t{var}\left({#1}\right)} \newcommand{pnint}[]{\int_{-\infty}^{\infty}} \newcommand{floor}[1]{\left\lfloor {#1} \right\rfloor} \) \( \newcommand{adeg}[1]{\angle{({#1}^{\t{o}})}} \newcommand{Ree}[]{\mathcal{Re}} \newcommand{Im}[]{\mathcal{Im}} \newcommand{deg}[1]{{#1}^{\t{o}}} \newcommand{adegg}[1]{\angle{{#1}^{\t{o}}}} \newcommand{ang}[1]{\angle{\left({#1}\right)}} \newcommand{bkt}[1]{\langle{#1}\rangle} \) \( \newcommand{\hs}[1]{\hspace{#1}} \)

  UNDER CONSTRUCTION

Machine Learning:
Support Vector Machines


Support Vector Machine

Separates data, mapped onto a higher-dimensional representation, so as to attain the largest possible margin, aiding regression & classification; for the latter (simplified):

\(\ds J(\bn{\theta})= C \sum_{i=1}^m \lrbra{y^{(i)}\t{cost}_1(\bn{\theta}^T x^{(i)}) + (1-y^{(i)})\t{cost}_0(\bn{\theta}^T x^{(i)})} + \sfrac{1}{2}\sum_{j=1}^n \theta_j^2\)
\(C=1/\lambda\) – plays the role of the regularization parameter
\(y=1 \ra \bn{\theta}^T\bn{x}\geq 1\) (not just \(\geq 0\))
\(y=0 \ra \bn{\theta}^T\bn{x}\leq -1\) (not just \( < 0 \))
Support Vector Machine cost function

[Animation] Code: simSVM.m
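A minimal NumPy sketch of the simplified cost above (not the simSVM.m script referenced above); the hinge-style surrogates \(\t{cost}_1(z)=\max{0}{1-z}\) and \(\t{cost}_0(z)=\max{0}{1+z}\) are an assumption, as the notes only name them:

    import numpy as np

    # Hinge-style surrogates (assumed; the notes name cost_1/cost_0 but do not define them)
    def cost1(z): return np.maximum(0, 1 - z)   # cost for y = 1; zero once z >= 1
    def cost0(z): return np.maximum(0, 1 + z)   # cost for y = 0; zero once z <= -1

    def svm_cost(theta, X, y, C):
        # J(theta) = C*sum[ y*cost1(theta'x) + (1-y)*cost0(theta'x) ] + 0.5*sum_{j>=1} theta_j^2
        # X: (m, n+1) with a leading column of ones; theta[0] (intercept) is not regularized
        z = X @ theta
        return C * np.sum(y * cost1(z) + (1 - y) * cost0(z)) + 0.5 * np.sum(theta[1:] ** 2)

    # made-up usage example
    X = np.array([[1.0,  2.0],
                  [1.0, -1.5],
                  [1.0,  0.5]])
    y = np.array([1, 0, 1])
    print(svm_cost(np.array([0.1, 0.8]), X, y, C=1.0))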

Support Vector Machines (II)

SVMs find support vectors that define the optimal margin separating data classes, acting as a large-margin classifier:

  • - Define the general classifier \(h_{\bn{\theta}}(\bn{x})=g(\bn{\theta}^T\bn{x})\) as \(h_{\bn{w},b}(x)=g(\bn{w}^T \bn{x}+b)\)
  • - Define \(g(z)=1\) if \(z\geq 0\), and \(g(z)=-1\) otherwise\({}^{1}\)
  • - Define the geometric margin as \({\gamma}^{(i)}\), shown in fig. right, invariant\({}^{2}\) to rescaling of \(\bn{w}\) & \(b\):
  • \(\gamma^{(i)}=y^{(i)}\lrpar{\lrpar{\frac{\bn{w}}{\abss{\bn{w}}}}^T x^{(i)} + \frac{b}{\abss{\bn{w}}}}\)
  • - Lastly, to find the largest geometric margin, we pose an optimization problem:
  • \(\ds \umax{\gamma, \bn{w},b}\ \ \gamma \)
    \(\ds \begin{align} & \t{s.t.}\ \ y^{(i)}(w^T x^{(i)}+b)\geq \gamma,\ i = 1, ... , m \\ & \abss{\bn{w}}=1 \end{align}\)
Thus, maximizing the margin amounts to maximizing the confidence of prediction for the greatest # of pts, with pts lying very far from the boundary being near-certain; accordingly, the support vectors that define the decision boundary \(g(z)=0\) are the vectors lying closest to it (least confident, most 'decisive' for the boundary)

  • 1 - \(h\) now predicts \(1\) or \(-1\) directly, without intermediate 'estimates'
  • 2 - Is a scaled form of the functional margin \(\hat{\gamma}^{(i)} = y^{(i)}(\bn{w}^T\bn{x}^{(i)}+b)\), invariant to rescaling of \(\bn{w}\) & \(b\)
    For \(\hat{\gamma}^{(i)}\) to be large (i.e. prediction confident & correct) with \(y^{(i)}=1\), we need \(\bn{w}^T\bn{x}^{(i)}+b\) to be large & positive; conversely, for \(y^{(i)}=-1\), large & negative
SVM large margin classifier
Consider point \(A\) denoting training ex \(x^{(i)}\), w/ \(y^{(i)}=1\). \(B\) is then given by \(x^{(i)}-\gamma^{(i)}\cdot \bn{w}/\abss{\bn{w}}\) – which lies on the decision boundary, satisfying \(\bn{w}^T\bn{x}+b=0\); to find \(\gamma^{(i)}\), plug in, solve for \(\gamma^{(i)}\), and multiply by \(y^{(i)}\) to handle negative exs
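A short NumPy sketch of the functional vs. geometric margin above, using made-up values for \(\bn{w}\), \(b\), and a training example (all hypothetical, for illustration only):

    import numpy as np

    def functional_margin(w, b, x, y):
        # gamma_hat^(i) = y^(i) * (w'x^(i) + b) -- grows if (w, b) are rescaled
        return y * (w @ x + b)

    def geometric_margin(w, b, x, y):
        # gamma^(i) = y^(i) * ((w/||w||)'x^(i) + b/||w||) -- invariant to rescaling
        nrm = np.linalg.norm(w)
        return y * ((w / nrm) @ x + b / nrm)

    w, b = np.array([2.0, 1.0]), -1.0      # hypothetical parameters
    x, y = np.array([1.5, 0.5]), 1         # training example 'A' with y = 1

    print(functional_margin(w, b, x, y), functional_margin(3*w, 3*b, x, y))   # 2.5, 7.5 -- scales
    print(geometric_margin(w, b, x, y), geometric_margin(3*w, 3*b, x, y))     # equal -- invariant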

Kernel 'Trick'

Enables algorithms to operate in high-/infinite-dimensional feature spaces without explicitly computing in those dimensions; given a feature mapping \(\phi\), a Kernel is defined to be

\(K(x,z) = \phi(x)^T\phi(z),\)
where \(\phi\) is some feature-mapping transformation
  • - Allows computing \(\bkt{\phi(x),\phi(z)} = K(x,z)\) without first finding \(\phi(x), \phi(z)\)
  • - Provides an immense computational advantage; see ex. shown right
  • - Assumption: the algorithm uses a vector inner-product term \(\bkt{A,B}\) (SVM does)
Kernel example, shown with \(\phi(x)\) for \(n=3\):
\(\ds \begin{align} K(x,z) & = (x^Tz)^2 \\ & = \sum_{i,j=1}^n (x_i x_j)(z_i z_j) \end{align}\ \ra\ \phi (x) = \begin{bmatrix} x_1x_1 \\ x_1x_2\\ x_1x_3\\ x_2x_1\\ x_2x_2\\ x_2x_3\\ x_3x_1\\ x_3x_2\\ x_3x_3 \end{bmatrix}\)
Computing time: \(\ds\begin{align} \phi(x) &\ra\ O(n^2)\\ K(x,z)&\ra\ O(n)\end{align}\)
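A quick NumPy check of the example above: the explicit \(O(n^2)\) mapping and the \(O(n)\) kernel give the same number (toy vectors, chosen arbitrarily):

    import numpy as np

    def phi(x):
        # explicit feature map for K(x,z) = (x'z)^2: all n^2 pairwise products x_i*x_j
        return np.outer(x, x).ravel()

    def K(x, z):
        # same quantity in O(n): square of the ordinary inner product
        return (x @ z) ** 2

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])
    print(phi(x) @ phi(z))   # O(n^2) route: 20.25
    print(K(x, z))           # O(n)   route: 20.25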

Using SVMs

SVMs w/ kernels can serve as a powerful tool for classification & regression; some guidelines follow:

  • \(\u{n > m}\ra\) Logistic regression, or SVM w/o a kernel
  • \(n =\) small, \(m = \) intermediate \(\ra\) SVM w/ Gaussian kernel
  • \(n =\) small, \(m = \) large \(\ra\) add more features, then use logistic regression or SVM w/o a kernel
(small \(\approx\) 1-1000; intermed \(\approx\) 10-10,000; large \(\approx\) 50,000+)
Linear kernel: (no kernel)
\(\up K(x,z) = x^T z\)
Gaussian kernel: (infinite-dimensional)
\(\ds K(x,z) = \t{exp}\lrpar{-\frac{\abss{x-z}^2}{2\sigma^2}}\)
- High \(\sigma \ra\) high bias, low variance
- Low \(\hspace{3px}\sigma \ra\) low bias, high variance
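A small NumPy sketch of the Gaussian kernel above; the \(\sigma\) values are arbitrary, chosen only to show the bias/variance trend:

    import numpy as np

    def gaussian_kernel(x, z, sigma):
        # K(x,z) = exp(-||x - z||^2 / (2*sigma^2)); similarity in [0, 1]
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    x = np.array([1.0, 2.0])
    z = np.array([0.0, 4.0])
    for sigma in (0.5, 1.0, 5.0):          # arbitrary values
        print(sigma, gaussian_kernel(x, z, sigma))
    # higher sigma -> similarity decays more slowly -> smoother boundary: higher bias, lower variance
    # (in scikit-learn's SVC(kernel='rbf'), the corresponding parameter is gamma = 1/(2*sigma**2))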

SVM Optimization

The complete form of the SVM optimization problem, with \(L_1\) regularization, involves solving a dual Lagrangian optimization problem subject to maximum-margin constraints:

\(\ds \mathcal{L}(w,b,\xi,\alpha,r)=\sfrac{1}{2}w^T w + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m\alpha_i[y^{(i)}(x^{(i)T} w + b) - 1 + \xi_i] - \sum_{i=1}^m r_i\xi_i\)
\(\ds \begin{align} \umin{\xi, w, b}\ \ & \sfrac{1}{2}\abss{w}^2 + C\sum_{i=1}^m \xi_i \\ \t{s.t.}\ \ & y^{(i)}(w^T x^{(i)} + b)\geq 1 - \xi_i,\ i = 1,...,m \\ & \xi_i \geq 0,\ i = 1,...,m \end{align}\ \ra \)
\(\ds \begin{align} \umax{\alpha}\ \ & W(\alpha) = \sum_{i=1}^m \alpha_i-\sfrac{1}{2}\sum_{i,j=1}^m y^{(i)} y^{(j)}\alpha_i \alpha_j \bkt{x^{(i)},x^{(j)}} \\ \t{s.t.}\ \ & 0 \leq \alpha_i \leq C,\ i = 1,...,m, \\ & \sum_{i=1}^m \alpha_i y^{(i)} = 0. \end{align}\)
  • - \(\alpha, r =\) Lagrange multipliers; \(\xi_i =\) slack variables, permitting margin violations at cost \(C\xi_i\)
  • - \((w^T x + b) = \theta^T x\), with the intercept \(b = \theta_0\) treated explicitly rather than folded into \(\theta\)
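For reference, the dual above follows from setting the derivatives of \(\mathcal{L}\) w.r.t. the primal variables to zero (standard KKT stationarity conditions; sketch only):
\(\ds \pfrac{\mathcal{L}}{w} = w - \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} = 0\ \ra\ w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}\)
\(\ds \pfrac{\mathcal{L}}{b} = -\sum_{i=1}^m \alpha_i y^{(i)} = 0\)
\(\ds \pfrac{\mathcal{L}}{\xi_i} = C - \alpha_i - r_i = 0\ \ra\ 0\leq\alpha_i\leq C\ \ (\t{since } \alpha_i, r_i \geq 0)\)
Substituting \(w\) back into \(\mathcal{L}\) eliminates \(w\), \(b\), \(\xi\), leaving \(W(\alpha)\) above.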





By OverLordGoldDragon