# Dragon Notes


# Data Science: Regression

Simple Linear Regression

Estimates response based on single predictor:
$$\ds \boxed{y\approx \beta_0 + \beta_1 x = \hat{y}}$$
$$\ds\hat{\beta_1}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2},\ \hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}$$

• 'hats' on variables denote estimates (estimators)
• Residual Sum of Squares (RSS) is a measure of predictive error
• $$i^{\t{th}}$$ residual $$=$$ difference between $$i^{\t{th}}$$ observed & predicted responses: $$e_i = y_i - \hat{y}_i$$
$$\t{RSS} = e_1^2 + e_2^2 + ... +e_n^2$$
$$\t{RSS} = (y_1 -\hat{\beta}_0 - \hat{\beta}_1x_1)^2 + ... + (y_n -\hat{\beta}_0 - \hat{\beta}_1x_n)^2\ \t{ [Simp Lin Reg]}$$
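As a sketch, the closed-form estimators and the RSS above can be computed directly with NumPy; the data here is a made-up toy example, not from any real dataset:

```python
import numpy as np

# Hypothetical toy data (assumed for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimators from the closed-form equations above.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual Sum of Squares: squared gaps between observed and fitted values.
y_hat = beta0_hat + beta1_hat * x
rss = np.sum((y - y_hat) ** 2)

print(beta1_hat, beta0_hat, rss)
```

For this toy data the fit comes out to $$\hat{\beta}_1\approx 1.96$$, $$\hat{\beta}_0\approx 0.14$$, $$\t{RSS}\approx 0.092$$.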

Assessing Regression Accuracy

• $$\bb{1}$$ Standard Error (SE) of $$\hat{\mu}$$ measures how well the sample mean $$\hat{\mu}$$ estimates $$\mu$$
Assumption: the $$n$$ observations are uncorrelated
• $$\bb{2}$$ Residual Standard Error (RSE) estimates the standard deviation of $$\epsilon$$; measures 'lack of fit' of model to data
Ex: $$\s{}\t{RSE}=50 \ra$$ actual responses differ from true reg. line predictions by $$50$$, on average
• $$\bb{3}$$ $$\b{R}^2 =$$ proportion of variance explained by linear regression model
TSS $$\s{}\ra$$ total variance in the response $$\s{Y}$$ (variability inherent in $$\s{Y}$$);
RSS $$\s{}\ra$$ amount of variability left unexplained w/ regression;
$$\s{}(\t{TSS}-\t{RSS})\ra$$ amount of variability in the response that is explained (or removed) by performing regression, and $$\s{}R^2\ra$$ proportion of variability in $$\s{}Y$$ that can be explained using $$\s{}X$$
• $$\bb{4}$$ Confidence Interval indicates how well the mean is captured
• Interpretation: Suppose CI = 95%. Then, if the distribution is sampled many times, and a CI is computed for each sample, we'd expect about 95% of those intervals to include the true population mean.
• $$\bb{5}$$ Prediction Interval indicates where to expect the next sampled datapoint
• Interpretation: Suppose PI = 95%. We sample the distribution, compute a PI for the sample, and sample another datapoint. If we do this many times, we'd expect the next datapoint to lie within the PI for 95% of the samples.
• $$\ds \bb{1}\ \t{SE}(\hat{\mu})=\frac{\sigma}{\sqrt{n}} = \sqrt{\t{var}(\hat{\mu})}$$
• $$\ds \bb{2}\ \t{RSE} = \sqrt{\sfrac{1}{n-2}\t{RSS}},\ \t{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
• $$\ds \bb{3}\ R^2 = \frac{\t{TSS}-\t{RSS}}{\t{TSS}} = 1 - \frac{\t{RSS}}{\t{TSS}},\ \t{TSS}=\sum_{i=1}^n (y_i - \bar{y})^2$$
• $$\ds \bb{4}\ \bar{x}\pm z_{\alpha/2}\sfrac{\sigma}{\sqrt{n}}$$
$$\bar{x}=$$ sample mean
$$\alpha=$$ significance level ($$1-\alpha=$$ confidence level)
$$\sigma=$$ population std.
$$n=$$ sample size
• $$\ds \bb{5}\ \mu \pm z\sigma$$
Assumption: $$\mu$$ & $$\sigma$$ of the distribution are known
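A minimal sketch of computing $$\bb{2}$$–$$\bb{4}$$ on the same made-up toy data; $$z_{\alpha/2}=1.96$$ for a 95% interval, and $$\sigma$$ is approximated by the sample standard deviation (an assumption for illustration):

```python
import numpy as np

# Same hypothetical toy data as above (assumed for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Fit the simple linear regression.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

rse = np.sqrt(rss / (n - 2))   # (2) residual standard error
r2 = 1 - rss / tss             # (3) proportion of variance explained

# (4) 95% CI for the mean response, z_{alpha/2} = 1.96;
# sigma approximated by the sample std (assumption).
se_mean = y.std(ddof=1) / np.sqrt(n)
ci = (y.mean() - 1.96 * se_mean, y.mean() + 1.96 * se_mean)

print(rse, r2, ci)
```

Here $$R^2$$ comes out close to $$1$$, as expected for nearly-linear toy data.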

Null Hypothesis

Tests whether there exists a relationship between variables or association between groups:
Null hypothesis, $$H_0$$: There is no relationship between $$X$$ and $$Y$$

Alt. Hypothesis, $$H_a$$: There is some relationship between $$X$$ and $$Y$$
For a coefficient $$\beta_1$$ predicting $$Y$$ from $$X$$, this corresponds to
$$H_0\t{: } \beta_1 = 0,\ \ H_a\t{: } \beta_1 \neq 0$$

$$\ds t = \sfrac{\hat{\beta_1}-0}{\t{SE}(\hat{\beta_1})}$$
$$t$$-statistic $$=$$ # of standard errors the predictor is away from zero; used as a test of strength of null hypothesis by indicating whether $$\s{}\hat{\beta_1}$$ is 'sufficiently far' from zero so we can be confident that $$\beta_1$$ is non-zero.
$$p$$-value $$=$$ probability of observing a $$t$$-statistic at least as extreme, assuming the null hypothesis is true; by convention, $$p<0.05$$ or $$p<0.01 \ra$$ reject null hypothesis
Interpretation: small $$p$$-value indicates that it is unlikely to observe a substantial association between predictor & response due to chance, in absence of any actual association between the predictor & the response
NOTE:
$$\ds \begin{align} & \t{Pr}(\t{observation | hypothesis}) \neq \\ & \t{Pr}(\t{hypothesis | observation}) \end{align}$$
Probability of observing a result given that some hypothesis is true is not equivalent to the probability that a hypothesis is true given that some result has been observed.

($$p$$-values indicate the 'incompatibility' of a given dataset with the null hypothesis; they do not validate the research hypothesis)
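In simple linear regression, $$\t{SE}(\hat{\beta}_1)=\t{RSE}/\sqrt{\sum_i(x_i-\bar{x})^2}$$; a sketch of the slope $$t$$-test on the same made-up toy data, using SciPy for the $$t$$ distribution:

```python
import numpy as np
from scipy import stats

# Same hypothetical toy data as before (assumed for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))

# SE of the slope: RSE / sqrt(S_xx).
se_beta1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

t = (beta1 - 0) / se_beta1            # H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
print(t, p)
```

The slope sits many standard errors from zero, so $$p$$ is far below $$0.05$$ and $$H_0$$ is rejected.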

Null Hypothesis [Multivariate]

Is there a relationship between response and any of ($$p$$) predictors?

$$H_0\t{: }\beta_1 = \beta_2 = ... = \beta_p = 0$$

$$H_a$$: at least one of $$\beta_j$$ is non-zero

$$\ra \ds F = \frac{(\t{TSS} - \t{RSS})/p}{\t{RSS}/(n-p-1)}$$
Is there a relationship between response and a subset of ($$q$$) predictors?

$$H_0\t{: }\beta_{p-q+1} = \beta_{p-q+2} = ... = \beta_p = 0$$

$$H_a$$: at least one of $$\beta_j$$ selected is non-zero

$$\ra \ds F = \frac{(\t{RSS}_0 - \t{RSS})/q}{\t{RSS}/(n-p-1)}$$

• $$F$$-statistic $$=$$ measures multivariate strength of association
• Higher $$F \ra$$ stronger evidence against $$H_0$$
• Need $$F>1$$ to reject $$H_0$$; 'by how much' depends on $$n$$ & $$p$$
• Larger $$n \ra$$ an $$F$$ only slightly greater than $$1$$ may suffice
• Small $$p$$-value associated w/ $$F$$ $$\ra$$ sufficient $$F$$
• $$\t{RSS}_0$$ is computed on a model fit using all variables except the selected $$q$$ predictors
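A sketch of the overall $$F$$-test on simulated data; the setup is an assumption for illustration (three predictors, only the first truly related to the response):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Hypothetical simulation: only the first predictor matters.
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# Multiple regression with an intercept, fit by least squares.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

rss = float(np.sum((y - y_hat) ** 2))
tss = float(np.sum((y - y.mean()) ** 2))

# Overall F-statistic for H0: beta_1 = ... = beta_p = 0.
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(F)
```

Because one coefficient is genuinely non-zero, $$F$$ comes out far above $$1$$ and $$H_0$$ is rejected.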
NOTE: small individual $$p$$-values $$\neq$$ reject $$H_0$$
It seems that if any one of the per-predictor $$p$$-values is very small, then at least one predictor must be related to the response - so why use the $$F$$-statistic? This logic fails when the number of predictors $$p$$ is large. Ex: Suppose $$p=100$$ and $$H_0\t{: }\beta_1 = \beta_2 =...=\beta_p = 0$$ is true, so no predictor is truly associated with the response. Then about 5% of the per-predictor $$p$$-values will fall below $$0.05$$ simply by chance.
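This multiple-testing point can be illustrated by simulation; the setup is an assumption for illustration (100 noise predictors, response independent of all of them, one correlation $$t$$-test per predictor via SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, alpha = 200, 100, 0.05

# Simulate the H0 world: y ignores every predictor.
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# One simple-regression t-test per predictor (pearsonr returns (r, p-value)).
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])
n_sig = int(np.sum(pvals < alpha))
print(n_sig)
```

Under $$H_0$$ the $$p$$-values are roughly uniform, so around $$\alpha\cdot p = 5$$ predictors appear 'significant' despite no predictor being truly associated with the response; the $$F$$-statistic does not suffer from this problem.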