# Definitions

### Collinearity

When two explanatory variables in a regression analysis are highly correlated with each other, they are said to be collinear.

### Confidence Interval

A confidence interval is a range of numbers that serves as an estimate of a population parameter. The 95% confidence interval is defined as $\displaystyle b_0 \pm 2*SE(b_0)$, where SE is the standard error of the parameter. For the slope, the standard error is calculated as:

$\displaystyle SE(b_1) = \frac{RMSE}{\sqrt{n}*s_x}$

The prediction interval for an individual observation is defined as:

$\displaystyle \hat{y} \pm 2*RMSE$ where $\displaystyle \hat{y}$ is the y value calculated from the given $\displaystyle x$ using $\displaystyle b_0, b_1$
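
As a sketch, the interval formulas above can be checked numerically; the data below are hypothetical.

```python
import numpy as np

# Hypothetical data for illustrating the formulas above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.9])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # least squares slope
b0 = y.mean() - b1 * x.mean()                         # least squares intercept
resid = y - (b0 + b1 * x)
rmse = np.sqrt((resid ** 2).sum() / (n - 2))          # RMSE with n-2 df

# SE(b1) = RMSE / (sqrt(n) * s_x); with the population s_x this equals
# RMSE / sqrt(sum((x - xbar)^2))
se_b1 = rmse / (np.sqrt(n) * x.std(ddof=0))
ci_b1 = (b1 - 2 * se_b1, b1 + 2 * se_b1)              # ~95% CI for the slope

# Prediction interval for a new observation at x0
x0 = 3.5
y_hat = b0 + b1 * x0
pi = (y_hat - 2 * rmse, y_hat + 2 * rmse)
```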

### Correlation Coefficient

The correlation coefficient, frequently denoted as r, measures the linear association between two numerical variables. The higher the absolute value of r, the more of the variation is explained by the regression model being used. The value of r is always between -1 and 1. Its square, $\displaystyle R^2$, is the fraction of the variation in y explained by the model:

$\displaystyle R^2 \approx \frac{s_y^2-RMSE^2}{s_y^2}$

### Root Mean Squared Error

The root mean squared error of a regression model is the typical distance between the data points and the least squares regression line and is defined as $\displaystyle RMSE = \sqrt{\frac{1}{n-2}*\sum{(y_i-\hat{y_i})^2}}$

Root mean squared error can be approximated as $\displaystyle RMSE \approx s_y * \sqrt {1-R^2}$

The RMSE is an estimate of the standard deviation of the residuals for a simple regression model.
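
A short numeric sketch of this approximation on hypothetical data; for simple regression the exact relation is $\displaystyle RMSE = s_y\sqrt{1-R^2}*\sqrt{(n-1)/(n-2)}$, so the approximation improves as n grows.

```python
import numpy as np

# Hypothetical data to illustrate RMSE ≈ s_y * sqrt(1 - R^2)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.3, 4.9, 6.2])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope from r and the sd ratio
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
rmse = np.sqrt((resid ** 2).sum() / (n - 2))

approx = y.std(ddof=1) * np.sqrt(1 - r ** 2)
# The two differ only by the factor sqrt((n-1)/(n-2))
```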

### Simple Linear Regression

The simple linear regression model tries to fit the data with an equation of the form: $\displaystyle E(y_i|x_i) = \beta_0 + \beta_1*x_i$ .

The following three assumptions must also apply:

1. The data points are independent of each other.
2. The residuals are normally distributed.
3. The residuals have equal variance.

$\displaystyle \beta_1$ is referred to as the marginal regression coefficient.
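
The residual assumptions can be checked informally after fitting; a minimal sketch on simulated data (the data and names below are illustrative):

```python
import numpy as np

# Simulated data that satisfy the model E(y|x) = b0 + b1*x
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Least squares residuals always average to zero when an intercept is fit.
# Equal-variance check: compare residual spread in the low-x vs. high-x half.
s_low = resid[:50].std(ddof=1)
s_high = resid[50:].std(ddof=1)
```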

# Fixing Violations of Assumptions in Linear Regression

## Non-Linear Data

### Log of the Explanatory Variable


When the plot of X versus Y rises quickly and then levels off (a logarithmic shape), the linear regression likely will take the form of $\displaystyle f(x) = b_0+b_1*log(x)$ . In this case, an increase of 1% in the explanatory variable is associated with an increase of about $\displaystyle b_1/100$ in the response variable (using the natural log).
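
A sketch of this fit on simulated data (the true slope here is 2):

```python
import numpy as np

# Simulated data following y = 1 + 2*log(x) plus small errors
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
noise = np.array([0.05, -0.10, 0.08, -0.03, 0.02, -0.02])
y = 1.0 + 2.0 * np.log(x) + noise

# Regress y on log(x) to estimate f(x) = b0 + b1*log(x)
b1, b0 = np.polyfit(np.log(x), y, 1)
# A 1% increase in x raises the predicted y by about b1/100
```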

### Log of Explanatory Variable and Response Variable


When the plot of the explanatory variable versus the response variable curves like a power function, it is likely that the proper fit will require taking the log of both the explanatory and response variables. The equation will look like: $\displaystyle log(f(x)) = b_0 + b_1*log(x)$ . In this case, the slope can be interpreted as the percentage change in the response variable when the explanatory variable is changed by 1%.
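
A sketch on exact power-law data, where the slope recovered in the log-log fit is the elasticity:

```python
import numpy as np

# Simulated power-law data: y = 3 * x^0.5, so log(y) = log(3) + 0.5*log(x)
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = 3.0 * x ** 0.5

b1, b0 = np.polyfit(np.log(x), np.log(y), 1)
# b1 is the elasticity: a 1% increase in x goes with about a b1% increase in y
```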

### Increasing Variability in the Residuals

An assumption of linear regression is violated when there is heteroscedasticity, or in other words, when the variance of the residuals changes with the level of the explanatory variable. If the variance increases with x, dividing the model through by $\displaystyle x_i$ transforms the least squares regression to: $\displaystyle (y_i/x_i) = \beta_0*(\frac{1}{x_i}) + \beta_1$
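
A sketch of this transformation on simulated data whose error spread grows with x (true $\displaystyle \beta_0 = 2$, $\displaystyle \beta_1 = 3$):

```python
import numpy as np

# Errors whose sd grows proportionally to x (heteroscedasticity)
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
y = 2.0 + 3.0 * x + x * rng.normal(0.0, 0.5, size=x.size)

# Dividing through by x: y/x = beta0*(1/x) + beta1 + constant-variance error
slope, intercept = np.polyfit(1.0 / x, y / x, 1)
# slope estimates beta0, intercept estimates beta1
```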

# Multiple Regression

## Partial F Test

The partial F statistic is equal to: $\displaystyle \frac{\frac{(R_{complete}^2 - R_{reduced}^2)}{r}}{\frac{1-R_{complete}^2}{n-K-1}}$ where r is the number of parameters dropped from the complete model and K is the number of explanatory variables in the complete model.
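
As a worked sketch with hypothetical $R^2$ values (the function name and numbers are illustrative):

```python
# Partial F statistic from the R^2 of the complete and reduced models.
# r = number of parameters dropped, K = explanatory variables in the
# complete model; the numbers below are hypothetical.
def partial_f(r2_complete, r2_reduced, r, n, K):
    return ((r2_complete - r2_reduced) / r) / ((1.0 - r2_complete) / (n - K - 1))

f_stat = partial_f(0.85, 0.80, r=2, n=50, K=5)  # ≈ 7.33
```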

# Interpreting JMP Data

## Leveraged Residual Plots

The leverage residual plots show the residuals of the data versus one of the explanatory variables after the model has been adjusted for the other explanatory variables.

## Parameter p-values

If a parameter estimate B has a p-value associated with it of x%, this means that if the true parameter were 0 and 100 random samples were taken of the population, then about x of those samples would produce an estimate at least as extreme as B.
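
The p-value can be sketched from the estimate and its standard error; JMP uses the t distribution, but for large samples the normal approximation below is close (the numbers are hypothetical):

```python
import math

def p_value_normal(estimate, se):
    """Two-sided p-value under H0: parameter = 0 (normal approximation)."""
    z = abs(estimate / se)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

p = p_value_normal(1.9, 0.95)  # z = 2.0, p ≈ 0.046
```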