# Dragon Notes

UNDER CONSTRUCTION
Latest content:
 Apr 05 Deep Learning Mar 19 Anomaly Detection - ML Mar 13 +Data Tables Mar 08 Clustering - Machine Learning Feb 28 Support Vector Machines - ML Feb 20 Regression - Data Science

# Machine Learning:Support Vector Machines

Support Vector Machine

Separates data mapped onto a higher-dimensional representation to attain largest possible margin, aiding regression & classification; for latter, (simplified):

$$\ds J(\bn{\theta})= C \sum_{i=1}^m \lrbra{y^{(i)}\t{cost}_1(\bn{\theta}^T x^{(i)}) + (1-y^{(i)})\t{cost}_0(\bn{\theta}^T x^{(i)})} + \sfrac{1}{2}\sum_{j=1}^n \theta_j^2$$
$$C=1/\lambda$$ – plays role of regularization
$$y=1 \ra \bn{\theta}^T\bn{x}\geq 1$$ (not just $$\geq 0$$)
$$y=0 \ra \bn{\theta}^T\bn{x}\leq -1$$ (not just $$< 0$$)

gif Code: simSVM.m

Support Vector Machines (II)

SVMs find support vectors that find the optimal margin separating data classes, acting as a large margin classifier:

• - Define the general classifier $$h_{\bn{\theta}}(\bn{x})=g(\bn{\theta})$$ as $$h_{\bn{w},b}(x)=g(\bn{w}^T \bn{x}+b)$$
• - Define $$g(z)=1$$ if $$z\geq 0$$, and $$g(z)\leq -1$$ otherwise$${}^{1}$$
• - Define the geometric margin as $${\gamma}^{(i)}$$, shown in fig. right, adjusting$${}^{2}$$ w/ $$\bn{w}$$ & $$b$$:
• $$\gamma^{(i)}=y^{(i)}\lrpar{\lrpar{\frac{\bn{w}}{\abss{\bn{w}}}}^T x^{(i)} + \frac{b}{\abss{\bn{w}}}}$$
• - Lastly, to find the largest geometric margin, we pose an optimization problem:
• $$\ds \umax{\gamma, \bn{w},b}\ \ \gamma$$
\ds \begin{align} & \t{s.t.}\ \ y^{(i)}(w^T x^{(i)}+b)\geq \gamma,\ i = 1, ... , m \\ & \abss{\bn{w}}=1 \end{align}
Thus, maximizing the margin can be seen to be maximizing the confidence in prediction for the greatest # of pts, with pts lying very far away from the margin being near-certain; per latter, the support vectors that define the decision boundary $$g(z)=0$$ are thus the vectors lying closest to the margin (least confident, most 'decisive' to boundary)

• 1 - $$h$$ now predicts $$1$$ or $$-1$$ directly, without intermediate 'estimates'
• 2 - Is a scaled form of $$\hat{\gamma}^{(i)} = y^{(i)}(\bn{w}^T\bn{x}+b)$$, invariant to scaling w/ $$\bn{w}$$ & $$b$$
For $$\hat{\gamma}^{(i)}$$ to be large (i.e. prediction = confident & correct), we need $$\bn{w}^T\bn{x}+b$$ to be large (& positive); conversely, for $$y^{(i)}=-1$$, large & negative
Consider point $$A$$ denoting training ex $$x^{(i)}$$, w/ $$y^{(i)}=1$$. $$B$$ is then given by $$x^{(i)}-\gamma^{(i)}\cdot \bn{w}/\abss{\bn{w}}$$ – which lies on the decision boundary, satisfying $$\bn{w}^T\bn{x}+b=0$$; to find $$\gamma^{(i)}$$, plug in, solve for $$\gamma^{(i)}$$, and add $$y^{(i)}$$ for negative exs

Kernel 'Trick'

Enables high-/infinte-dimensional algorithms by reducing algorithm computing dimensions; given a feature mapping $$\phi$$, a Kernel is defined to be

$$K(x,z) = \phi(x)^T\phi(z),$$
where $$\phi$$ is some feature-mapping transformation
• - Allows computing $$\bkt{\phi(x),\phi(z)}\ra K(k,z)$$ without first finding $$\phi(x), \phi(z)$$
• - Provides an immense computational advantage; see ex. shown right
• -
asm
Algor. uses a vector inner-product term $$\bkt{A,B}$$ (SVM does)
Kernel example, shown with $$\phi(x)$$ for $$n=3$$:
\ds \begin{align} K(x,z) & = (x^Tz)^2 \\ & = \sum_{i,j=1}^n (x_i x_j)(z_i z_j) \end{align}\ \ra\ \phi (x) = \begin{bmatrix} x_1x_1 \\ x_1x_2\\ x_1x_3\\ x_2x_1\\ x_2x_2\\ x_2x_3\\ x_3x_1\\ x_3x_2\\ x_3x_3 \end{bmatrix}
Computing time: \ds\begin{align} \phi(x) &\ra\ O(n^2)\\ K(x,z)&\ra\ O(n)\end{align}

Using SVMs

SVMs w/ kernels can serve a powerful tool for classification & regression; some methods follow:

• $$\u{n > m}\ra$$ Logistic regression, or SVM w/o a kernel
• $$n =$$ small, $$m =$$ intermediate $$\ra$$ SVM w/ Gaussian kernel
• $$n =$$ small, $$m =$$ large $$\ra$$ add more features, then use logistic regression or SVM w/o a kernel
(small $$\approx$$ 1-1000; intermed $$\approx$$ 10-10,000; large $$\approx$$ 50,000+)
Linear kernel: (no kernel)
$$\up K(x) = x$$
Gaussian kernel: (infinite-dimensional)
$$\ds K(x,z) = \t{exp}\lrpar{-\frac{\abss{x-z}^2}{2\sigma^2}}$$
- High $$\sigma \ra$$ high bias, low low variance
- Low $$\hspace{3px}\sigma \ra$$ low bias, high variance

SVM Optimization

The complete form of the SVM optimization problem, with $$L_1$$-regulatization, involves solving a dual Lagrangian optimization problem subject to maximum margin constraints:

$$\ds \mathcal{L}(w,b,\xi,\alpha,r)=\sfrac{1}{2}w^T w + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m\alpha_i[y^{(i)}(x^T w + b) - 1 + \xi_i] - \sum_{i=1}^m r_i\xi_i$$
\ds \begin{align} \umax{\xi, w, b}\ \ & \sfrac{1}{2}\abss{w}^2 + C\sum_{i=1}^m \xi_i \\ \t{s.t.}\ \ & y^{(i)}(w^T x^{(i)} + b)\geq 1 - \xi_i,\ i = 1,...,m \\ & \xi_i \geq 0,\ i = 1,...,m \end{align}\ \ra
\ds \begin{align} \umax{\alpha}\ \ & W(\alpha) = \sum_{i=1}^m(\alpha_i)-\sfrac{1}{2}\sum_{i,j=1}^m y^{(i)} y^{(j)}\alpha_i \alpha_j \bkt{x^{(i)},x^{(j)}} \\ \t{s.t.}\ \ & 0 \leq \alpha \leq C,\ i = 1,...,m, \\ & \sum_{i=1}^m \alpha_i y^{(i)} = 0. \end{align}
• - $$\alpha, r =$$ Lagrange multipliers; $$\xi =$$ regularization term
• - $$(w^T x + b) = \theta^T x$$, treating the intercept term $$\theta_0$$ explicitly     [Read more...]