# Dragon Notes


# Data Science: Classification

## Overfitting

Overfitting: the model performs extremely well on training data but poorly on unfamiliar data; it can be caused by having too many features in the cost function.
Underfitting, in contrast, suffers from having too few features, which deprives the model of the flexibility needed to fit the data.
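A minimal numpy sketch of the training/test gap: polynomials of increasing degree are fit to noisy samples of a hypothetical true function (the function, noise level, and degrees are illustrative assumptions, not from the notes). Higher degree always lowers training MSE, while test MSE eventually rises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true function plus Gaussian noise (illustrative choice).
f = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 15)
y_train = f(x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.2, 200)

results = {}
for degree in (1, 4, 12):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Degree 1 underfits (both errors high), degree 12 overfits (near-zero training error, inflated test error); an intermediate degree balances the two.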

Bias-variance tradeoff: the problem of minimizing error whereby minimal bias yields large variance, and vice versa - preventing learning models from generalizing beyond the training set.
- Bias: error from erroneous assumptions in the learning model; high bias can cause a model to miss relevant relations between features and outputs (underfitting)
- Variance: error from model sensitivity to small fluctuations in training set; high variance can cause a model to fit the random noise in training data, rather than the intended outputs (overfitting)
\begin{align} E[\t{MSE}_{\t{test}}] &= E[(y_0-\hat{f}(x_0))^2] \\ &= \t{var}(\hat{f}(x_0)) + [\t{Bias}(\hat{f}(x_0))]^2 + \t{var}(\epsilon) \end{align}
$$\t{Bias}[\hat{f}(x_0)] = E[\hat{f}(x_0)] - f(x_0)$$
$$\t{var}(\hat{f}(x_0)) = E[(\hat{f}(x_0))^2] - E[\hat{f}(x_0)]^2$$
$$\t{var}(\epsilon) =$$ irreducible error, arising from data noise

The following describes a three-panel figure (not reproduced here):
- Left: the data are fit with models of varying flexibility: orange (lowest, linear regression), black, and green (highest).
- Middle: at the lowest flexibility, the model is too inflexible to fit the data (underfitting); as flexibility increases, both the training and test MSE decrease; increasing it further, the training MSE keeps dropping, but the test MSE rises sharply (overfitting).
- Right: for low flexibility, the MSE is mostly controlled by bias; for higher flexibility, by variance.
- Note that the test MSE always stays above $$\t{var}(\epsilon)$$ - hence the error being 'irreducible'.

## One-vs-All: Multiclass Classification

For $$(n+1)$$ classes $$y\in\{0,1,\dots,n\}$$, divide the problem into $$(n+1)$$ binary classification problems:

$$\begin{align} & h_{\theta}^{(0)}(x) = P(y=0 \mid x; \theta) \\ & h_{\theta}^{(1)}(x) = P(y=1 \mid x;\theta) \\ & \vdots \\ & h_{\theta}^{(n)}(x)=P(y=n \mid x;\theta) \end{align}$$ $$\quad \t{prediction} = \underset{i}{\t{argmax}}\, h_{\theta}^{(i)}(x)$$

[1] Choose one class and lump all the others into a single second class.
[2] In each classification problem, predict the probability that $$y$$ is a member of the chosen class.
[3] Given an input $$x$$, output the class $$i$$ whose $$h_{\theta}^{(i)}(x)$$ is greatest.
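The three steps above can be sketched with plain numpy: one logistic-regression classifier per class, trained by gradient descent, with prediction by argmax over the per-class probabilities. The toy three-blob dataset, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))  # clip avoids overflow

def train_one_vs_all(X, y, n_classes, lr=0.5, iters=2000):
    """Fit one logistic-regression classifier per class (class i vs the rest)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend bias column
    thetas = np.zeros((n_classes, Xb.shape[1]))
    for i in range(n_classes):
        t = np.zeros(Xb.shape[1])
        yi = (y == i).astype(float)                     # relabel: class i vs all others
        for _ in range(iters):
            grad = Xb.T @ (sigmoid(Xb @ t) - yi) / len(yi)
            t -= lr * grad                              # gradient-descent step
        thetas[i] = t
    return thetas

def predict(thetas, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(sigmoid(Xb @ thetas.T), axis=1)    # class with largest h_theta

# Hypothetical toy data: three well-separated Gaussian blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 40)
thetas = train_one_vs_all(X, y, 3)
print("training accuracy:", np.mean(predict(thetas, X) == y))
```

Each row of `thetas` is one binary classifier's parameters; `predict` evaluates all $$(n+1)$$ hypotheses and returns the index of the largest, matching the $$\t{argmax}$$ rule.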