 # Dragon Notes UNDER CONSTRUCTION

# Machine Learning Subtopics

Basics

Supervised vs. Unsupervised Learning

Supervised
- Correct output is known
- Goal is to derive the relationship between input and output:
  - Regression: mapping inputs to a continuous output function
  - Classification: mapping inputs to representative discrete outputs
- Examples:
  - Regression: given a picture of a person, predict their age
  - Classification: given a patient with a tumor, predict whether the tumor is malignant or benign

Unsupervised
- Correct output is not known
- Goal is to derive structure from data where the effects of variables aren't necessarily known
- Works by clustering data based on relationships among variables in the data
- No feedback based on prediction results
- Examples:
  - Clustering: given a collection of 1,000,000 different genes, find a way to automatically group these genes so that they are somehow similar or related by different variables, such as lifespan, location, roles, etc.
  - Non-clustering: given a mesh of sounds at a cocktail party, identify individual voices and music (finding structure in a chaotic environment)

Cost Function

A measure of the accuracy of the hypothesis function. For linear regression it is the mean squared error, halved so that the factor of 2 cancels when differentiating:
$$\ds J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^2=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_i)-y_i)^2$$
$$\hat{y}_i =$$ predicted output
$$y_i =$$ true output
$$m =$$ # of training samples
$$x_i =$$ input
$$h_{\theta}(x_i) =$$ hypothesis function
$$\theta_0, \theta_1 =$$ model parameters
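
As a minimal sketch of how the cost above might be evaluated in Octave/MATLAB: the toy data and the variable names (`x`, `y`, `theta0`, `theta1`, `y_hat`) are assumptions made for illustration and do not come from these notes.

```octave
% Toy data (illustrative values only): generated from y = 1 + 2*x
x = [1; 2; 3; 4];          % inputs x_i, m-by-1
y = [3; 5; 7; 9];          % true outputs y_i, m-by-1
m = length(y);             % number of training samples

% Candidate model parameters (deliberately not the best fit)
theta0 = 0;
theta1 = 2;

% Hypothesis h_theta(x_i) = theta0 + theta1 * x_i for every sample at once
y_hat = theta0 + theta1 * x;

% Cost J(theta0, theta1) = 1/(2m) * sum of squared errors
J = (1 / (2 * m)) * sum((y_hat - y) .^ 2);

fprintf('J = %g\n', J);    % prints J = 0.5 for these values
```

Every prediction here is off by exactly 1, so the squared errors sum to 4 and the cost is 4 / (2 * 4) = 0.5. Setting `theta0 = 1` would drive the cost to zero.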

Gradient Descent

A method for finding a minimum of a function by taking repeated steps against its gradient, updating all parameters simultaneously at each step:
$$\ds \theta_j := \theta_j -\alpha \frac{\partial}{\partial \theta_j}J(\theta_0,\theta_1)$$
$$(\t{for }j=0\t{ and }j=1)$$

$$\alpha =$$ learning rate (step-size)
$$\theta_j =$$ model parameter
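
The update rule can be sketched in Octave/MATLAB for the two-parameter case. This is an illustration under stated assumptions, not a reference implementation: the toy data, the learning rate `alpha = 0.1`, and the fixed iteration count are all hypothetical choices, and the two partial derivatives are obtained by differentiating the cost function $$J(\theta_0,\theta_1)$$ defined earlier.

```octave
% Toy data (illustrative values only): generated from y = 1 + 2*x
x = [1; 2; 3; 4];
y = [3; 5; 7; 9];
m = length(y);

alpha = 0.1;               % learning rate (assumed; must be small enough to converge)
theta0 = 0;                % initial guesses
theta1 = 0;

for iter = 1:2000          % fixed iteration count, chosen for this toy problem
    err = (theta0 + theta1 * x) - y;      % h_theta(x_i) - y_i for all samples

    grad0 = (1 / m) * sum(err);           % dJ/dtheta0
    grad1 = (1 / m) * sum(err .* x);      % dJ/dtheta1

    % Simultaneous update: both gradients are computed before either
    % parameter changes.
    theta0 = theta0 - alpha * grad0;
    theta1 = theta1 - alpha * grad1;
end

fprintf('theta0 = %.4f, theta1 = %.4f\n', theta0, theta1);
```

On this data the parameters approach `theta0 = 1` and `theta1 = 2`. Computing `grad0` and `grad1` before changing either parameter is what makes the update simultaneous.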

Hypothesis Function

A learned function that maps inputs to outputs with the least error. For multiple features, the linear hypothesis function is:
$$\ds h_\theta(x) = \theta_0 +\theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$$
$$\ds h_\theta(x) = \bn{\theta}^T \bn{x}$$

$$x^{(i)}_j =$$ value of feature $$j$$ in the $$i^{\t{th}}$$ training example
$$m =$$ # of training examples
$$\bn{\theta} =$$ model parameter vector, $$(n+1)\times 1$$
$$x^{(i)}=$$ input (features) of the $$i^{\t{th}}$$ training example$$\vph{x^{(i)}_j}$$
$$n =$$ # of features
$$\bn{x} =$$ input vector, $$(n+1)\times 1$$, $$x_0 = 1$$
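
A short sketch showing that the expanded sum and the inner-product form give the same value. The numbers and the names `features`, `h_sum`, and `h_vec` are assumptions for illustration. Note that Octave/MATLAB indexing starts at 1, so `theta(1)` holds $$\theta_0$$.

```octave
% Illustrative values only: n = 2 features
theta = [1; 2; 3];         % parameter vector, (n+1)-by-1: theta_0, theta_1, theta_2
features = [4; 5];         % x_1, x_2

x = [1; features];         % prepend the constant x_0 = 1, giving (n+1)-by-1

% Expanded form: theta_0 + theta_1*x_1 + theta_2*x_2
h_sum = theta(1) + theta(2) * x(2) + theta(3) * x(3);

% Inner-product form: theta^T x
h_vec = theta' * x;

fprintf('h_sum = %g, h_vec = %g\n', h_sum, h_vec);   % both are 24
```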

Linear Regression GD

For linear regression with multiple variables, the gradient descent (GD) update is

$$\ds \theta_j := \theta_j -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_j$$
$$(\t{for }j=0,1,...,n)$$
$$\ds \Rightarrow \ \ \begin{aligned} \theta_0 & := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_0 \\ \theta_1 & := \theta_1 -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_1 \\ & \vdots \\ \theta_n & := \theta_n -\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}_n \end{aligned}$$

$$x^{(i)}_j =$$ value of feature $$j$$ in the $$i^{\t{th}}$$ training example
$$m =$$ # of training examples
$$x^{(i)}=$$ input (features) of the $$i^{\t{th}}$$ training example$$\vph{x^{(i)}_j}$$
$$n =$$ # of features
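
The per-parameter updates above can be written as a single matrix expression. The following is a hedged sketch: the design matrix `X` (one training example per row, with a leading column of ones for $$x_0$$), the toy data, `alpha`, and the iteration count are assumptions for illustration and are not specified in these notes.

```octave
% Toy data (illustrative values only): generated from y = 1 + 2*x1 + 3*x2
features = [0 0; 1 0; 0 1; 1 1; 2 1];   % m-by-n, one example per row
y = [1; 3; 4; 6; 8];                    % m-by-1
m = size(features, 1);

X = [ones(m, 1), features];             % m-by-(n+1), first column is x_0 = 1
theta = zeros(size(X, 2), 1);           % (n+1)-by-1 initial parameters
alpha = 0.1;                            % learning rate (assumed)

for iter = 1:5000                       % fixed count, chosen for this toy problem
    err = X * theta - y;                % h_theta(x^(i)) - y^(i) for all i
    grad = (1 / m) * (X' * err);        % entry j is (1/m) * sum_i err_i * x_j^(i)
    theta = theta - alpha * grad;       % updates every theta_j simultaneously
end

disp(theta');                           % approaches [1 2 3] on this data
```

Entry `j` of `X' * err` is exactly the sum over examples that appears in the update for $$\theta_j$$, so one vector assignment replaces the `n+1` separate updates.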

Normal Equation

The normal equation minimizes $$J$$ analytically by setting its partial derivatives with respect to each parameter to zero and solving for $$\bn{\theta}$$; in matrix form,
$$\ds \bn{\theta} = (\bn{X}^T\bn{X})^{-1}\bn{X}^T\bn{y}$$

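$$\bn{X} =$$ design matrix, $$m\times (n+1)$$, whose $$i^{\t{th}}$$ row is the input vector of the $$i^{\t{th}}$$ training example
$$\bn{y} =$$ output vector, $$m\times 1$$

Computational cost grows as $$\sim n^3$$, since the $$(n+1)\times (n+1)$$ matrix $$\bn{X}^T\bn{X}$$ must be inverted
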
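
A minimal sketch of the normal equation in Octave/MATLAB, assuming `X` holds one training example per row with a leading column of ones. The toy data is illustrative only. Two variants are shown: a literal transcription using `pinv` (the pseudo-inverse, which still returns a result when $$\bn{X}^T\bn{X}$$ is singular), and the backslash operator, which solves the same linear system without forming an explicit inverse.

```octave
% Toy data (illustrative values only): generated from y = 1 + 2*x1 + 3*x2
features = [0 0; 1 0; 0 1; 1 1; 2 1];
y = [1; 3; 4; 6; 8];
m = size(features, 1);

X = [ones(m, 1), features];             % m-by-(n+1) design matrix

% Literal form of theta = (X^T X)^(-1) X^T y, using the pseudo-inverse
theta_pinv = pinv(X' * X) * X' * y;

% Same system solved with the backslash operator (no explicit inverse)
theta_solve = (X' * X) \ (X' * y);

disp(theta_pinv');                      % approximately [1 2 3]
disp(theta_solve');                     % approximately [1 2 3]
```

No learning rate or iteration count is needed here, in contrast to gradient descent.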
Vectorization

Improving computational efficiency by replacing explicit loops with built-in, optimized array operations
Unvectorized implementation
prediction = 0.0;
for j = 1:n+1   % theta and x are (n+1)-by-1 vectors
    prediction = prediction + theta(j) * x(j);   % accumulate theta_j * x_j
end

Vectorized implementation
prediction = theta' * x;
The vectorized implementation uses the environment's (MATLAB, Octave, etc.) highly optimized numerical linear algebra routines to compute the vector inner product
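
To see the difference in practice, the two implementations can be timed side by side. This is a rough sketch under assumptions: the vector length `n = 100000` and the random data are arbitrary choices, and the timings reported by `tic`/`toc` vary by machine and environment, so no particular speedup is claimed.

```octave
n = 100000;                      % number of features (arbitrary, for illustration)
theta = rand(n + 1, 1);          % random parameters, (n+1)-by-1
x = [1; rand(n, 1)];             % random input with x_0 = 1, (n+1)-by-1

% Unvectorized: explicit loop over every element
tic;
prediction_loop = 0.0;
for j = 1:n+1
    prediction_loop = prediction_loop + theta(j) * x(j);
end
t_loop = toc;

% Vectorized: one inner product
tic;
prediction_vec = theta' * x;
t_vec = toc;

fprintf('loop: %.6f s, vectorized: %.6f s\n', t_loop, t_vec);
fprintf('absolute difference: %g\n', abs(prediction_loop - prediction_vec));
```

The two results agree up to floating-point rounding, because the summation order inside the library routine may differ from the loop's.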