Regression Analysis
Definitions
Collinearity
When two explanatory variables in a regression analysis are highly correlated with each other, the two explanatory variables are said to be collinear.
Confidence Interval
A confidence interval is a range of numbers that serves as an estimate of a population parameter. The 95% confidence interval for a coefficient such as the intercept $b_0$ is approximately $b_0 \pm 2 \cdot SE(b_0)$, where SE is the standard error of the parameter estimate. For the slope $b_1$, the standard error is calculated as:
$SE(b_1) = \frac{RMSE}{\sqrt{n} \cdot s_x}$
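The standard error and the interval can be worked through numerically. The following is a minimal sketch using only the Python standard library; the data values are hypothetical and chosen only for illustration, and the interval uses the rough multiplier of 2 from the definition rather than an exact t quantile.

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least squares slope (b1) and intercept (b0).
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# RMSE with n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
rmse = math.sqrt(sse / (n - 2))

# Standard error of the slope: RMSE / (sqrt(n) * s_x).
s_x = statistics.stdev(x)  # sample standard deviation of x
se_b1 = rmse / (math.sqrt(n) * s_x)

# Approximate 95% confidence interval for the slope.
ci_low, ci_high = b1 - 2 * se_b1, b1 + 2 * se_b1
print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.4f}, CI = ({ci_low:.3f}, {ci_high:.3f})")
```

If the resulting interval does not contain 0, the data are consistent with a nonzero slope at roughly the 95% level.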
The approximate 95% prediction interval for a new observation is defined as:
$\hat{y} \pm 2 \cdot RMSE$, where $\hat{y}$ is the y value calculated from the given $x$, $b_0$, and $b_1$.
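As an illustration, the prediction interval can be computed for a new x value once the line has been fitted. This sketch assumes hypothetical data and a hypothetical new point `x_new`; it uses the rough "plus or minus 2 RMSE" rule and ignores the extra uncertainty in the fitted coefficients.

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Fit the least squares line.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# RMSE of the fitted model.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
rmse = math.sqrt(sse / (n - 2))

# Predicted value and approximate 95% prediction interval at x_new.
x_new = 6.0  # hypothetical new observation
y_hat = b0 + b1 * x_new
pred_low, pred_high = y_hat - 2 * rmse, y_hat + 2 * rmse
print(f"prediction at x = {x_new}: {y_hat:.2f} ({pred_low:.2f}, {pred_high:.2f})")
```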
Correlation Coefficient
The correlation coefficient, frequently denoted as r, measures the linear association between two numerical variables. The higher the absolute value of r, the larger the amount of variation that is explained by the regression model being used. The value of r is always between -1 and 1.
In simple regression, the square of r, written $R^2$, is the proportion of variation explained by the model and can be approximated as $R^2 \approx \frac{s_y^2 - RMSE^2}{s_y^2}$, where $s_y$ is the standard deviation of the response variable.
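To see how close the approximation is, r and both versions of $R^2$ can be computed side by side. This is a sketch on hypothetical data; with only five points the approximation differs slightly from the exact value because RMSE divides by n - 2 while $s_y^2$ divides by n - 1.

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Correlation coefficient r, always between -1 and 1.
r = s_xy / math.sqrt(s_xx * s_yy)

# Exact R^2 from the fitted line: 1 - SSE / total variation.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r2_exact = 1 - sse / s_yy

# Approximation: (s_y^2 - RMSE^2) / s_y^2.
rmse_sq = sse / (n - 2)
var_y = statistics.variance(y)  # sample variance of y, s_y^2
r2_approx = (var_y - rmse_sq) / var_y
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, approximate R^2 = {r2_approx:.4f}")
```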
Root Mean Squared Error
The root mean squared error (RMSE) of a regression model is the typical distance between the data points and the least squares regression line, and is defined as $RMSE = \sqrt{\frac{1}{n-2} \sum{(y_i - \hat{y}_i)^2}}$
Root mean squared error can be approximated as $RMSE \approx s_y \cdot \sqrt{1 - R^2}$
The RMSE is an estimate of the standard deviation of the residuals for a simple regression model.
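A minimal sketch of both calculations, on hypothetical data, is given here. The exact RMSE comes from the residuals; the approximation comes from $s_y$ and $R^2$. For small samples the approximation is a little low, because it effectively divides by n - 1 instead of n - 2.

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Fit the least squares line.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# Exact RMSE from the residuals, with n - 2 degrees of freedom.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
rmse = math.sqrt(sse / (n - 2))

# Approximation: s_y * sqrt(1 - R^2).
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / s_yy
rmse_approx = statistics.stdev(y) * math.sqrt(1 - r_squared)
print(f"RMSE = {rmse:.4f}, approximation = {rmse_approx:.4f}")
```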
Simple Linear Regression
The simple regression model tries to fit data to an equation of the form: $E(y_i \mid x_i) = \beta_0 + \beta_1 x_i$.
The following three assumptions must also apply:
- The data points are independent from each other.
- The residuals are normally distributed.
- The residuals have equal variance.
$\beta_1$ is referred to as the marginal regression coefficient.
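The estimates of $\beta_0$ and $\beta_1$ can be obtained with the usual least squares formulas. The function name `fit_simple_regression` and the data in this sketch are hypothetical; the residuals it produces are what would be inspected when checking the assumptions listed above.

```python
import statistics

def fit_simple_regression(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_regression(x, y)

# Residuals: observed minus fitted values. For a least squares fit
# with an intercept they sum to zero (up to rounding).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```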
Fixing for Violations of Assumptions in Linear Regression
Non-Linear Data
Log of the Explanatory Variable
When the plot of X versus Y looks like the image above, the regression will likely take the form $f(x) = b_0 + b_1 \log(x)$. In this case, when the natural log is used, $b_1/100$ can be interpreted as the approximate amount by which the response variable would increase if the explanatory variable were increased by 1%.
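The effect of a 1% change in the explanatory variable can be checked numerically. This sketch assumes the natural log and uses hypothetical, noise-free data generated from $y = 3 + 2\log(x)$, so the fit recovers the coefficients exactly; a 1% increase in x then changes y by $b_1 \log(1.01)$, which is close to $b_1/100$.

```python
import math
import statistics

# Hypothetical noise-free data following y = 3 + 2 * ln(x).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 + 2.0 * math.log(xi) for xi in x]

# Regress y on log(x) with ordinary least squares.
log_x = [math.log(xi) for xi in x]
lx_bar, y_bar = statistics.mean(log_x), statistics.mean(y)
s_xx = sum((li - lx_bar) ** 2 for li in log_x)
b1 = sum((li - lx_bar) * (yi - y_bar) for li, yi in zip(log_x, y)) / s_xx
b0 = y_bar - b1 * lx_bar

# Change in the response when x increases by 1%.
change_for_one_percent = b1 * math.log(1.01)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, change per 1% = {change_for_one_percent:.4f}")
```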
Log of Explanatory Variable and Response Variable
When the plot of the explanatory variable versus the response variable looks like the image above, it is likely that the proper fit will require taking the log of both the explanatory and response variables. The equation will look like: $\log(f(x)) = b_0 + b_1 \log(x)$. In this case, the slope can be interpreted as the approximate percentage by which the response variable changes when the explanatory variable changes by 1%.
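A short numerical check of this interpretation follows. The data are hypothetical and noise-free, generated from $y = 5x^{1.5}$, so the log-log fit recovers a slope of 1.5; increasing x by 1% then changes y by roughly 1.5%.

```python
import math
import statistics

# Hypothetical noise-free data following y = 5 * x ** 1.5.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [5.0 * xi ** 1.5 for xi in x]

# Regress log(y) on log(x) with ordinary least squares.
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]
lx_bar, ly_bar = statistics.mean(log_x), statistics.mean(log_y)
s_xx = sum((li - lx_bar) ** 2 for li in log_x)
b1 = sum((li - lx_bar) * (lj - ly_bar) for li, lj in zip(log_x, log_y)) / s_xx
b0 = ly_bar - b1 * lx_bar

# Percentage change in y when x increases by 1%.
pct_change = (1.01 ** b1 - 1) * 100
print(f"slope = {b1:.3f}, percent change in y per 1% change in x = {pct_change:.3f}")
```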
Increasing Variability in the Residuals
An assumption of linear regression is violated when there is heteroscedasticity, or in other words, when the variance of the residuals is not constant but changes across the range of the explanatory variable (for example, over time). If the variance increases with $x_i$, the least squares regression model can be transformed by dividing both sides by $x_i$: $\frac{y_i}{x_i} = \beta_0 \cdot \frac{1}{x_i} + \beta_1$. In the transformed model, the slope on $1/x_i$ estimates $\beta_0$ and the intercept estimates $\beta_1$.
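The transformation can be sketched as a regression of $y_i/x_i$ on $1/x_i$. The data here are hypothetical: they follow $y = 4 + 3x$ plus a fixed, made-up error pattern whose size grows in proportion to x, which is the situation in which dividing by $x_i$ evens out the residual spread. Note that the roles of slope and intercept swap in the transformed fit.

```python
import statistics

# Hypothetical data: y = 4 + 3x plus an error proportional to x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
errors = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2]  # made-up error pattern
y = [4.0 + 3.0 * xi + xi * ei for xi, ei in zip(x, errors)]

# Transformed variables: u = 1 / x and v = y / x.
u = [1.0 / xi for xi in x]
v = [yi / xi for xi, yi in zip(x, y)]

# Ordinary least squares of v on u.
u_bar, v_bar = statistics.mean(u), statistics.mean(v)
s_uu = sum((ui - u_bar) ** 2 for ui in u)
slope = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / s_uu
intercept = v_bar - slope * u_bar

# In the transformed model the slope estimates beta_0
# and the intercept estimates beta_1.
beta0_hat, beta1_hat = slope, intercept
print(f"beta_0 estimate = {beta0_hat:.3f}, beta_1 estimate = {beta1_hat:.3f}")
```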
Multiple Regression
Partial F Test
The partial F statistic is equal to: $F = \frac{(R_{complete}^2 - R_{reduced}^2) / r}{(1 - R_{complete}^2) / (n - K - 1)}$, where $n$ is the number of observations, $K$ is the number of explanatory variables in the complete model, and $r$ is the number of explanatory variables dropped to form the reduced model.
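The statistic is simple to evaluate once both models have been fitted. In this sketch the function name `partial_f` and the example values are hypothetical, and the arguments are assumed to mean: `n` observations, `k` explanatory variables in the complete model, and `r` variables dropped in the reduced model.

```python
def partial_f(r2_complete, r2_reduced, n, k, r):
    """Partial F statistic comparing a complete and a reduced model.

    n: number of observations
    k: number of explanatory variables in the complete model
    r: number of explanatory variables dropped in the reduced model
    """
    numerator = (r2_complete - r2_reduced) / r
    denominator = (1 - r2_complete) / (n - k - 1)
    return numerator / denominator

# Hypothetical example: dropping 2 of 3 variables lowers R^2 from 0.80 to 0.75.
f_stat = partial_f(r2_complete=0.80, r2_reduced=0.75, n=54, k=3, r=2)
print(f"partial F = {f_stat:.2f}")
```

A large value suggests that the dropped variables, taken together, explain more variation than would be expected by chance; the value is compared against an F distribution with r and n - K - 1 degrees of freedom.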
Interpreting JMP Data
Leveraged Residual Plots
The leverage residual plots show the residuals of the data versus one of the explanatory variables after the model has been adjusted for the other explanatory variables.
Parameter p-values
If a parameter estimate B has a p-value of x%, this means that if the true value of the parameter were 0 and 100 samples were taken from the population, then about x of those samples would produce an estimate at least as far from 0 as B.
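A two-sided p-value can be approximated from a coefficient estimate and its standard error. This sketch is an illustration only, not JMP's calculation: it assumes a normal approximation rather than the t distribution that statistical software normally uses, and the function name and example numbers are hypothetical.

```python
import math

def two_sided_p_value(estimate, standard_error):
    """Approximate two-sided p-value for the hypothesis that the
    true coefficient is 0, using a standard normal approximation."""
    test_stat = estimate / standard_error
    # For a standard normal Z, P(|Z| > t) = erfc(t / sqrt(2)).
    return math.erfc(abs(test_stat) / math.sqrt(2))

# Hypothetical coefficient of 0.5 with a standard error of 0.25.
p = two_sided_p_value(0.5, 0.25)
print(f"approximate p-value = {p:.4f}")
```

With small samples the normal approximation gives p-values that are somewhat too small compared with the t distribution.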