# Dragon Notes


# Machine Learning: Anomaly Detection

Anomaly Detection

Identifying data points that differ significantly from the majority of the data and therefore indicate unusual observations

Algorithm:$$\up$$
• [1] Choose features $$x_i$$ that may indicate anomalous examples
• [2] Fit parameters $$(\bn{\mu},\bn{\sigma}^2)=([\mu_1,...,\mu_n],[\sigma_1^2,...,\sigma_n^2])$$:
$$\ds \mu_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)},\ \ \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m (x_j^{(i)}-\mu_j)^2$$
• [3] Given a new example $$x$$, compute $$p(x)$$:
$$\ds p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi}\sigma_j}\exp\lrpar{-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}}$$

• - Semi-supervised learning algorithm: trained on unlabeled data, tested on labeled data
• - Useful in discovering hidden patterns, extracting features, detecting outliers
• - Applied in: detecting fraud, device defects; signal noise filtering; medical diagnosis ...
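
Steps [2] and [3] of the algorithm above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `fit_gaussian` and `density` and the tiny training set are assumptions made for the example.

```python
import math

def fit_gaussian(X):
    """Step [2]: per-feature mean and (1/m) variance from the rows of X."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    var = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, var

def density(x, mu, var):
    """Step [3]: p(x) as the product of univariate Gaussian densities."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-(xj - mj) ** 2 / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

# Hypothetical training set: 4 'normal' examples with 2 features each.
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [2.0, 12.0]]
mu, var = fit_gaussian(X_train)

p_typical = density([2.0, 12.0], mu, var)   # example at the mean
p_unusual = density([6.0, 20.0], mu, var)   # example far from the mean
```

An example far from the fitted means receives a much smaller `p(x)` than one at the means; comparing `p(x)` against a threshold then decides whether it is flagged.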

Evaluating Detection Systems

Anomaly detection systems that learn in a semi-supervised manner are trained on unlabeled data (assumed 'normal') and evaluated on labeled data. A typical split:

• $$\sim 60\%$$ of 'normal' examples:  Training set (unlabeled)
• $$\sim 20\%$$ of 'normal' $$+\ 50\%$$ of anomalous examples:  Cross-validation set (labeled)
• $$\sim 20\%$$ of 'normal' $$+$$ the other $$50\%$$ of anomalous examples:  Testing set (labeled)
Assessing accuracy: (on CV/test sets)
• $$\up p(x) < \epsilon \ra y = 1$$ (anomaly)
• $$p(x) \geq \epsilon \ra y = 0$$ (normal)
• Measures: Precision/Recall; $$\Up\t{F1}$$-score (plain classification accuracy is misleading because anomalies are rare)
• (use CV set to choose the best $$\epsilon$$)
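
The threshold selection described above can be sketched as a simple search over candidate values of $$\epsilon$$, scoring each by F1 on the cross-validation set. The helper names `f1_score` and `select_epsilon`, the candidate list, and the CV densities below are all hypothetical, chosen only to illustrate the procedure.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class y = 1 (anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, candidates):
    """Return the candidate epsilon with the highest F1 on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]   # p(x) < eps -> anomaly
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical CV set: densities p(x) and true labels (1 = anomaly).
p_cv = [0.20, 0.15, 0.18, 0.002, 0.12, 0.0005, 0.03]
y_cv = [0, 0, 0, 1, 0, 1, 0]
candidates = [0.001, 0.01, 0.05, 0.2]

best_eps, best_f1 = select_epsilon(p_cv, y_cv, candidates)
```

In practice the candidates would typically span the range of observed `p(x)` values; the test set is kept aside for the final performance estimate.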

Choosing Features

Features used in anomaly detection determine which types of unusual behavior or properties the resulting algorithm can identify

• (1) Handpicking: relating features in a meaningful manner
• (2) Transforming: normalizing features for workability with Gaussian
• (3) MVG: automatically extracting features via Multivariate Gaussian
Examples
• (1) Let $$X_1 =$$ network traffic, $$X_2 =$$ CPU load. On computers dedicated to server hosting, higher network traffic $$\ra$$ higher CPU load, so one being disproportionately large relative to the other is anomalous; a ratio feature $$X_2/X_1$$ captures this
• (2)
\begin{align} & \t{log}(X) && \sqrt{X} \\ & \t{log}(X+c) && X^{1/3} \end{align}

Detecting anomalies with the Gaussian PDF works well only if the features are approximately Gaussian; a feature that isn't normally distributed can often be made closer to normal via one of the transformations above
• (3) MVG captures correlations between features via the covariance matrix - automatically detecting linear relationships
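
Example (1) can be made concrete with a small sketch. The data are invented, and using a z-score (distance from the mean in standard deviations) to compare features is an assumption made for the illustration, not part of the notes: the new machine looks unremarkable on each raw feature, but its ratio $$X_2/X_1$$ is far outside the normal range.

```python
import math

def z_score(value, sample):
    """Distance of value from the sample mean, in (1/m) standard deviations."""
    m = len(sample)
    mu = sum(sample) / m
    sigma = math.sqrt(sum((v - mu) ** 2 for v in sample) / m)
    return (value - mu) / sigma

# Hypothetical 'normal' servers: CPU load is roughly 2x network traffic.
traffic = [10.0, 20.0, 30.0, 40.0, 50.0]      # X1
cpu = [21.0, 39.0, 62.0, 78.0, 100.0]         # X2
ratio = [c / t for c, t in zip(cpu, traffic)]  # handpicked feature X2/X1

# New machine: low traffic but high CPU load.
new_traffic, new_cpu = 15.0, 90.0

z_traffic = z_score(new_traffic, traffic)
z_cpu = z_score(new_cpu, cpu)
z_ratio = z_score(new_cpu / new_traffic, ratio)
```

Each raw feature is within about one standard deviation of its mean, while the ratio feature is dozens of standard deviations away, so a per-feature Gaussian model only flags this machine once the ratio is included.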

Collaborative Filtering

A recommender-system algorithm that uses the ratings users have already given to predict their preferences for items they have not yet rated, learning movie features and user parameters jointly; the algorithm's cost function is

$$\ds J(\bn{x},\bn{\theta}) = \frac{1}{2}\hspace{-40px} \sum_{\hspace{40px}(i,j):r(i,j)=1} \hspace{-35px}[(\bn{\theta}^{(j)})^T \bn{x}^{(i)} - y^{(i,j)}]^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n (x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2$$
Algorithm:$$\up$$
• [1] Initialize $$(\bn{x},\bn{\theta})=([x^{(1)},...,x^{(n_m)}],[\theta^{(1)},...,\theta^{(n_u)}])$$ to small random values (symmetry-breaking)
• [2] Minimize $$J(\bn{x},\bn{\theta})$$ over $$(\bn{x},\bn{\theta})$$:
(note: no pre-set intercept term $$\theta_0,x_0$$)

$$\ds x_k^{(i)} \t{:= } x^{(i)}_k - \alpha\lrpar{\sum_{j:r(i,j)=1}[(\bn{\theta}^{(j)})^T\bn{x}^{(i)} - y^{(i,j)}]\theta_k^{(j)} + \lambda x_k^{(i)}}$$
$$\ds \theta_k^{(j)} \t{:= } \theta^{(j)}_k - \alpha\lrpar{\sum_{i:r(i,j)=1}[(\bn{\theta}^{(j)})^T\bn{x}^{(i)} - y^{(i,j)}]x_k^{(i)} + \lambda\theta_k^{(j)}}$$
• [3] For a user w/ parameters $$\bn{\theta}$$ and a movie w/ (learned) features $$\bn{x}$$, predict a preference of $$\bn{\theta}^T\bn{x}$$

- $$n_u =$$ # of users
- $$n_m =$$ # of movies
- $$r(i,j) = 1$$ if user $$j$$ has rated movie $$i$$
- $$y^{(i,j)} =$$ rating given by user $$j$$ to movie $$i$$ (defined only if $$r(i,j)=1$$)
- $$n =$$ # of features per movie
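
The cost function and update rules above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names, the hyperparameters (`lam`, `alpha`, two features, 500 iterations) and the small ratings matrix are chosen for the example, and unrated entries are stored as 0 in `Y` and masked out by `R`.

```python
import random

def predict(theta_j, x_i):
    """Predicted rating: inner product of user parameters and movie features."""
    return sum(t * x for t, x in zip(theta_j, x_i))

def cost(X, Theta, Y, R, lam):
    """Regularized squared-error cost J, summed over rated (i, j) pairs only."""
    err = sum((predict(Theta[j], X[i]) - Y[i][j]) ** 2
              for i in range(len(X)) for j in range(len(Theta)) if R[i][j] == 1)
    reg = (sum(v * v for row in X for v in row)
           + sum(v * v for row in Theta for v in row))
    return 0.5 * err + 0.5 * lam * reg

def gradient_step(X, Theta, Y, R, lam, alpha):
    """One simultaneous gradient-descent update of every x and theta."""
    n_m, n_u, n = len(X), len(Theta), len(X[0])
    # Prediction errors on rated entries, computed once from the old values.
    E = [[(predict(Theta[j], X[i]) - Y[i][j]) if R[i][j] == 1 else 0.0
          for j in range(n_u)] for i in range(n_m)]
    X_new = [[X[i][k] - alpha * (sum(E[i][j] * Theta[j][k] for j in range(n_u))
                                 + lam * X[i][k])
              for k in range(n)] for i in range(n_m)]
    Theta_new = [[Theta[j][k] - alpha * (sum(E[i][j] * X[i][k] for i in range(n_m))
                                         + lam * Theta[j][k])
                  for k in range(n)] for j in range(n_u)]
    return X_new, Theta_new

# Hypothetical ratings: rows = movies, columns = users; R marks rated entries.
Y = [[5, 4, 0, 0], [4, 0, 1, 0], [0, 4, 0, 0], [0, 0, 3, 0]]
R = [[1, 1, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 1, 0]]
n_features, lam, alpha = 2, 0.1, 0.01

random.seed(0)  # small random initialization (symmetry-breaking)
X = [[random.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in Y]
Theta = [[random.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in Y[0]]

cost_before = cost(X, Theta, Y, R, lam)
for _ in range(500):
    X, Theta = gradient_step(X, Theta, Y, R, lam, alpha)
cost_after = cost(X, Theta, Y, R, lam)
```

The errors `E` are computed before either update so that $$\bn{x}$$ and $$\bn{\theta}$$ are updated simultaneously, as the update rules require. The last user has no ratings, so only the regularization term acts on that user's parameters.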
Mean Normalization:
$$\ds \bn{y} = \begin{bmatrix} 5 & 4 & 0 & ? \\ 4 & ? & 1 & ? \\ ? & 4 & 0 & ? \\ 0 & 0 & 3 & ? \end{bmatrix}\hspace{-7px},\ \bn{\mu} = \begin{bmatrix} 3 \\ 2.5 \\ 2 \\ 1 \end{bmatrix}$$
$$\ds \Ra \bn{y} - \bn{\mu} = \begin{bmatrix} 2 & 1 & -3 & ? \\ 1.5 & ? & -1.5 & ? \\ ? & 2 & -2 & ? \\ -1 & -1 & 2 & ? \end{bmatrix}$$

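$$\ra$$ learn $$\bn{\theta}^{(j)},\bn{x}^{(i)}$$ on the mean-normalized ratings $$\ra$$ predict $$(\bn{\theta}^{(j)})^T\bn{x}^{(i)}+\mu_i$$. For user 4, who has rated nothing, only the regularization term involves $$\bn{\theta}^{(4)}$$, so it is driven to $$\bn{0}$$ and the prediction falls back to the movie's mean rating:
$$\require{cancel}\ds\bn{\theta}^{(4)} = \bn{0},\ y^{(i,4)} = \cancelto{0}{(\bn{\theta}^{(4)})^T(\bn{x}^{(i)})}+\mu_i$$
$$\ds \Rightarrow \bn{y} = \begin{bmatrix} 5 & 4 & 0 & 3 \\ 4 & (\bn{\theta}^{(2)})^T\bn{x}^{(2)}+2.5 & 1 & 2.5 \\ (\bn{\theta}^{(1)})^T\bn{x}^{(3)}+2 & 4 & 0 & 2 \\ 0 & 0 & 3 & 1 \end{bmatrix}$$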
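
The mean-normalization arithmetic above can be checked with a short sketch. Representing a missing rating ('?') as `None` and the helper names `row_means` and `normalize` are assumptions made for this illustration.

```python
def row_means(Y):
    """Mean of each movie's observed ratings (0.0 if a movie has none)."""
    mu = []
    for row in Y:
        rated = [y for y in row if y is not None]
        mu.append(sum(rated) / len(rated) if rated else 0.0)
    return mu

def normalize(Y, mu):
    """Subtract each movie's mean from its observed ratings; keep gaps as None."""
    return [[(y - m) if y is not None else None for y in row]
            for row, m in zip(Y, mu)]

# Ratings matrix from the example: rows = movies, columns = users, None = '?'.
Y = [
    [5, 4, 0, None],
    [4, None, 1, None],
    [None, 4, 0, None],
    [0, 0, 3, None],
]

mu = row_means(Y)
Y_norm = normalize(Y, mu)

# A user with no ratings ends up with theta = 0, so each prediction is
# theta^T x + mu_i = 0 + mu_i.
theta_new_user = [0.0, 0.0]
x_any_movie = [0.7, -0.3]   # arbitrary features; irrelevant when theta = 0
predictions_new_user = [
    sum(t * x for t, x in zip(theta_new_user, x_any_movie)) + m for m in mu
]
```

Without mean normalization, the same user would be predicted to rate every movie 0, which is why the means are added back at prediction time.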
Vectorization:

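% X: n_m x n, Theta: n_u x n, Y and R: n_m x n_u (R(i,j) = 1 where rated), lambda: regularization weight
J = sum(sum(((X*Theta' - Y).*R).^2))/2 + (lambda/2)*(sum(sum(X.^2)) + sum(sum(Theta.^2)))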


Multivariate Gaussian

A Gaussian random vector, $$\bn{X}\sim\mathcal{N}(\bn{\mu},\bn{\Sigma})$$, is an $$(N\times 1)$$ random vector $$[X_1\ X_2\ ...\ X_N]^T$$ whose joint PDF is the multivariate Gaussian PDF
$$p_{\bn{X}}(\bn{x})=\lfrac{1}{(2\pi)^{N/2}\t{det}^{1/2}(\bn{\Sigma})}\t{exp}\lrbra{-\frac{1}{2}(\bn{x}-\bn{\mu})^T\bn{\Sigma}^{-1}(\bn{x}-\bn{\mu})},$$

$$\ds\bn{\Sigma}=\left[\mtxxxx{\var{X_1}}{\cov{X_1,X_2}}{...}{\cov{X_1,X_N}}{\cov{X_2,X_1}}{\var{X_2}}{...}{\cov{X_2,X_N}}{\vdots}{\vdots} {\ddots}{\vdots}{\cov{X_N,X_1}}{\cov{X_N,X_2}}{...}{\cov{X_N,X_N}}\right]\hspace{-7px},$$ $$\ds\ \bn{\mu}=\lrbra{\nmtttx{\mu_1}{\mu_2}{\vdots}{\mu_N}}=\lrbra{\nmtttx{E_{X_1}[X_1]}{E_{X_2}[X_2]}{\vdots}{E_{X_N}[X_N]}}$$

• [1]: Only the first two moments, $$\bn{\mu}$$ and $$\bn{\Sigma}$$, are required to specify the entire PDF
• [2]: Uncorrelated $$\ra$$ independent: if all of the component random variables are mutually uncorrelated ($$\bn{\Sigma}$$ diagonal), they are also independent
• [3]: A linear transformation of $$\bn{X}$$ produces another Gaussian vector: if $$\bn{Y}=\bn{GX}$$, then $$\bn{Y}\sim\mathcal{N}(\bn{G\mu}, \bn{G\Sigma G}^T)$$, where $$\bn{G}$$ is an $$M\times N$$ matrix with $$M\leq N$$
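
The multivariate Gaussian PDF can be evaluated directly for $$N=2$$, where the inverse and determinant of $$\bn{\Sigma}$$ have closed forms. The sketch below is an illustration only; the function names and numeric values are assumptions. It checks two things: with a diagonal $$\bn{\Sigma}$$ the joint PDF equals the product of univariate Gaussians (the uncorrelated case), and with correlated features the density depends on whether a point agrees with the correlation.

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mvg_pdf_2d(x, mu, Sigma):
    """Multivariate Gaussian density for N = 2, using the closed-form 2x2 inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# Diagonal covariance: joint PDF factors into univariate PDFs.
mu = [2.0, 12.0]
Sigma_diag = [[0.5, 0.0], [0.0, 2.0]]
x = [2.5, 11.0]
p_joint = mvg_pdf_2d(x, mu, Sigma_diag)
p_product = gaussian_pdf(x[0], mu[0], 0.5) * gaussian_pdf(x[1], mu[1], 2.0)

# Positively correlated features: equal distance from the mean, different density.
Sigma_corr = [[1.0, 0.8], [0.8, 1.0]]
p_along = mvg_pdf_2d([1.0, 1.0], [0.0, 0.0], Sigma_corr)     # agrees with correlation
p_against = mvg_pdf_2d([1.0, -1.0], [0.0, 0.0], Sigma_corr)  # violates correlation
```

The second comparison is what lets a multivariate Gaussian flag points that violate a linear relationship between features without a handpicked ratio feature.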