# Dragon Notes

UNDER CONSTRUCTION

Latest content:
- Feb 20: Regression - Data Science
- Feb 19: +Classification - ML
- Feb 18: Control Design - Control Sys.
- Feb 16: Solved Problems - Data Sci.
- Feb 12: +Control Systems - MATLAB
- Feb 10: Image Recognition - Digits: ML Algorithms

# Machine Learning: Why's & How's

Why squared error?

Useful mathematical properties:
- If $$X$$ is a random variable, the constant estimator of $$X$$ that minimizes expected squared error is the mean, $$E(X)$$; the one minimizing expected absolute error is the median.
- Mean adds linearly: $$E(X+Y)=E(X)+E(Y)$$, median doesn't.
- For independent random variables, expected squared errors (variances) add: $$\mathrm{var}(X+Y)=\mathrm{var}(X)+\mathrm{var}(Y)$$; expected absolute errors don't.
- Differentiability: squared error is differentiable everywhere, absolute error isn't (e.g. at $$0$$), which makes the latter's optimization more complex
- More useful when greater errors matter more and small errors are negligible: squaring shrinks errors smaller than $$1$$ in magnitude and amplifies larger ones
- Squared error components add: for a vector $$\vec{X}=(X_1,X_2)$$ estimated by $$\vec{x}=(x_1,x_2)$$, it doesn't matter whether components are considered separately or together: $$\|\vec{X}-\vec{x}\|^2 = (X_1 - x_1)^2 + (X_2 - x_2)^2$$; the errors add. This doesn't hold for absolute errors.
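The first property above can be checked numerically; the following is a minimal sketch (the skewed sample data is made up, so mean and median differ) that scans candidate constant estimators and confirms squared error is minimized near the mean, absolute error near the median:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000)  # skewed sample: mean != median

# try every candidate constant estimator on a fine grid
candidates = np.linspace(x.min(), x.max(), 10_001)
sq_err = [np.mean((x - c) ** 2) for c in candidates]
abs_err = [np.mean(np.abs(x - c)) for c in candidates]

best_sq = candidates[np.argmin(sq_err)]    # minimizer of mean squared error
best_abs = candidates[np.argmin(abs_err)]  # minimizer of mean absolute error

print(best_sq, x.mean())       # ~ equal: the mean minimizes squared error
print(best_abs, np.median(x))  # ~ equal: the median minimizes absolute error
```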

Why the 1/2m?
(in [1])
$$\ds J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_i)-y_i)^2\quad [1]$$
- $$1/m$$ averages the error over the $$m$$ training examples
- $$1/2$$ is for mathematical convenience: when differentiating, it cancels the $$2$$ brought down from $$(...)^2$$. Since only the location of the minimum of the cost function is of interest, $$J$$ can be scaled by any positive factor without changing the minimizing $$\theta$$
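The cancellation can be verified numerically; this is a sketch on toy linear-regression data (the data and starting $$\theta$$ are assumed for illustration), comparing the analytic gradient $$\frac{1}{m}X^\top(X\theta - y)$$ — no leftover factor of $$2$$ — against a finite-difference gradient of $$J$$ as defined in [1]:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # rows [1, x_i]
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=m)

def J(theta):
    # cost with the 1/(2m) factor, as in [1]
    r = X @ theta - y
    return (r @ r) / (2 * m)

theta = np.array([0.5, -1.0])
# analytic gradient: the 1/2 cancelled the 2 from differentiating (...)^2
analytic = (X.T @ (X @ theta - y)) / m

# central-difference numerical gradient of J for comparison
eps = 1e-6
numeric = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```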

Why do thetas differ greatly for gradient descent & normal equation?
(in Linear Regression)
- Poor learning rate ($$\alpha$$): GD may not have converged to the minimum
- Feature scaling: if GD used feature scaling, the produced thetas fit the scaled dataset, whereas the normal equation yields thetas for the unscaled dataset. Both, however, yield the same predictions (assuming GD has converged; also, input features must be scaled the same way before predicting with GD's thetas)
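The second point can be demonstrated with a sketch (toy single-feature data assumed): gradient descent on a standardized feature and the normal equation on the raw feature produce very different thetas, yet agree on predictions once the new input is standardized for GD:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100
x = rng.uniform(0, 1000, size=m)            # unscaled feature
y = 5.0 + 0.03 * x + rng.normal(size=m)

# normal equation on the raw design matrix
X_raw = np.column_stack([np.ones(m), x])
theta_ne = np.linalg.solve(X_raw.T @ X_raw, X_raw.T @ y)

# gradient descent on the standardized feature
mu, sigma = x.mean(), x.std()
X_sc = np.column_stack([np.ones(m), (x - mu) / sigma])
theta_gd = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    theta_gd -= alpha * (X_sc.T @ (X_sc @ theta_gd - y)) / m

x_new = 250.0
pred_ne = theta_ne @ np.array([1.0, x_new])
pred_gd = theta_gd @ np.array([1.0, (x_new - mu) / sigma])  # scale input first!

print(theta_ne, theta_gd)            # very different thetas
print(np.isclose(pred_ne, pred_gd))  # True: same prediction
```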