Introduction To Assumptions Of Linear Regression
To understand the assumptions of linear regression, we first need to understand what linear regression is. Linear regression is a statistical method that models a dependent variable with continuous values as a function of independent variables, which can be either continuous or categorical. In other words, linear regression is a method to predict the dependent variable (Y) based on the values of the independent variables (X). It can be used whenever we want to predict a continuous quantity.
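As a rough illustration of predicting Y from X, here is a minimal sketch of a one-predictor least-squares fit in plain Python. The data and the helper name fit_simple_ols are made up for this example and are not part of the Advertising dataset used later.

```python
# Minimal sketch of simple linear regression (one predictor) by least squares.
# The data and the function name are illustrative assumptions.

def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimise the squared prediction error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated as y = 1 + 2x

intercept, slope = fit_simple_ols(x, y)
predicted = [intercept + slope * xi for xi in x]
print(intercept, slope)
```

With several predictors the same idea is usually handled by a library such as statsmodels, as in the implementation at the end of this post.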
There are five assumptions in linear regression.
- Linear Relationship
- No Autocorrelation
- Multivariate Normality
- Homoscedasticity
- No or low Multicollinearity
First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots.
The graph shows that as the money invested in TV advertising increases, sales also increase linearly, which means there is a linear relationship between TV advertising and sales.
Test to check linear relationship
First:- Plot the predicted values against the actual values.
Second:- Plot the predicted values against the residuals.
Rainbow test:- This test checks the linearity of the fitted regression. Here we define two hypotheses.
Null Hypo – Regression is Linear
Alternate hypo – Regression is non-Linear
We check the linearity of the regression using the statsmodels.api library.
import statsmodels.api as sm
sm.stats.diagnostic.linear_rainbow(res=lin_reg)
It returns the test statistic and the p-value, and the p-value is then compared to the significance level (α), which is 0.05. If the p-value is greater than the significance level, we fail to reject the null hypothesis, i.e. the regression is linear; if it is smaller, we reject the null hypothesis, i.e. the regression is not linear.
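The predicted-versus-residual check can also be illustrated without a plot. The sketch below, on made-up data, fits a straight line to a truly linear series and to a curved one. For the curved series the residuals show a systematic pattern (positive at the ends, negative in the middle), which is the kind of structure the residual plot is meant to reveal. The helper name line_residuals is hypothetical.

```python
# Sketch: residuals of a straight-line fit on linear vs. curved data.
# Data and helper names are illustrative assumptions.

def line_residuals(x, y):
    """Fit y = a + b*x by least squares and return the residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_y = [1 + 2 * xi for xi in x]   # truly linear relationship
curved_y = [xi ** 2 for xi in x]      # quadratic relationship

res_linear = line_residuals(x, linear_y)  # all residuals are zero
res_curved = line_residuals(x, curved_y)  # U-shaped pattern in the residuals
print(res_linear)
print(res_curved)
```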
Little or No autocorrelation
No or low autocorrelation is the second assumption of linear regression. The linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other, in other words when the value of y(x+1) is not independent of the value of y(x).
If the values of a column or feature are correlated with other values of that same column, it is said to be autocorrelated. In other words, autocorrelation is correlation within a column.
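To make "correlation within a column" concrete, here is a small sketch that computes the lag-1 sample autocorrelation of a series in plain Python. The two series and the function name lag_autocorrelation are made up for illustration.

```python
# Sketch: sample autocorrelation of a series at a given lag.
# The series below are illustrative, not real residuals.

def lag_autocorrelation(values, lag=1):
    """Correlation between the series and itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum((values[t] - mean) * (values[t - lag] - mean)
                    for t in range(lag, n))
    return numerator / denominator

trending = [1, 2, 3, 4, 5, 6, 7, 8]           # neighbours are similar
alternating = [1, -1, 1, -1, 1, -1, 1, -1]    # neighbours have opposite sign

r_trending = lag_autocorrelation(trending)        # positive autocorrelation
r_alternating = lag_autocorrelation(alternating)  # negative autocorrelation
print(r_trending, r_alternating)
```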
Test to check autocorrelation
Plot the ACF of the residuals to check for autocorrelation in the data.
If the graph looks cyclic, the residuals contain positive autocorrelation; if the graph alternates, the residuals contain negative autocorrelation.
The graph shows positive autocorrelation because it looks cyclic.
The Durbin-Watson (DW) test is generally used to check for autocorrelation.
The Durbin-Watson test can be defined as:-
The Durbin-Watson test statistic is approximately equal to 2*(1-r), where r is the sample autocorrelation of the residuals in the model. Thus, for r = 0, indicating no serial correlation, the test statistic equals 2.
The Durbin-Watson statistic ranges from 0 to 4: values between 0 and 2 indicate positive autocorrelation, 2 means no autocorrelation, and values between 2 and 4 indicate negative autocorrelation.
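The ranges above can be checked by hand. The sketch below computes the statistic from its usual definition (the sum of squared differences between consecutive residuals divided by the sum of squared residuals) on two made-up residual series. The series are assumptions chosen to make the sign of the autocorrelation obvious.

```python
# Sketch: Durbin-Watson statistic computed directly from residuals.
# The residual series are made up for illustration.

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

positively_correlated = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # long runs of one sign
negatively_correlated = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step

dw_positive = durbin_watson(positively_correlated)  # below 2
dw_negative = durbin_watson(negatively_correlated)  # above 2
print(dw_positive, dw_negative)
```

For a fitted statsmodels model, the same statistic appears in the summary output and is also available through durbin_watson in statsmodels.stats.stattools.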
Multivariate Normality is the third assumption of linear regression. The linear regression analysis requires all variables to be multivariate normal, meaning the data should be normally distributed; in practice this is checked on the residuals of the model. As the sample size increases, normality of the residuals becomes less important. If we repeatedly sample from our population, then for large sample sizes the distribution (across repeated samples) of the ordinary least squares estimates of the regression coefficients follows a normal distribution. So for moderate to large sample sizes, non-normality of the residuals should not adversely affect the usual inferential procedures. This is a consequence of an extremely important result in statistics, known as the central limit theorem.
Test to check multivariate normality
To check normality, use a Q-Q plot, from which we can infer whether the data comes from a normal distribution. If the data is normally distributed, the points fall on a fairly straight line. If it is not normal, the points deviate from the straight line.
Example code:-
import statsmodels.api as sm
sm.qqplot(lin_reg.resid, fit=True, line='45')
Jarque-Bera test:- This is a goodness-of-fit test of whether the sample data has skewness and kurtosis matching a normal distribution.
Here we define null and alternative hypotheses.
Null Hypothesis – Error terms are normally distributed.
Alternate Hypothesis – Error terms are not normally distributed.
In this test, we compute the test statistic and its probability value (p-value) from the residuals. The p-value is compared with the significance level (0.05), or equivalently the test statistic is compared with the critical value 5.99 (the 5% critical value of a chi-square distribution with 2 degrees of freedom).
from scipy import stats
stats.jarque_bera(lin_reg.resid)
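To show where the statistic comes from, here is a sketch that computes it by hand from the sample skewness S and kurtosis K as n/6 * (S² + (K - 3)²/4). The two small samples are made up, and with so few points the chi-square comparison is only indicative. In practice the scipy call above is the one to use.

```python
# Sketch: Jarque-Bera statistic from sample skewness and kurtosis.
# Samples are illustrative; the 5.99 cut-off is the chi-square(2) 5% critical value.

def jarque_bera_statistic(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)

symmetric_sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed_sample = [0.0] * 9 + [10.0]  # one extreme value creates heavy skew

jb_symmetric = jarque_bera_statistic(symmetric_sample)
jb_skewed = jarque_bera_statistic(skewed_sample)
print(jb_symmetric, jb_skewed)
```

The symmetric sample gives a statistic well below 5.99, while the skewed sample exceeds it, so only the second would lead to rejecting normality under this rule.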
Homoscedasticity is the fourth assumption of linear regression. Homoscedasticity describes a situation in which the variance of the error term (the "noise" or random disturbance in the relationship between the independent variables and the target) is the same across all values of the independent variables. A scatter plot of residual values vs predicted values is a good way to check for homoscedasticity.
If the residuals are spread evenly and symmetrically around the residual line, the data is said to be homoscedastic.
If the variance of the residuals is unequal across the residual line, the data is said to be heteroscedastic. In this case, the residuals can form a bow-tie, an arrow, or any other non-symmetric shape.
Test to check homoscedasticity
Draw a regplot of the predicted values against the residuals to check homoscedasticity.
Example:-
sns.regplot(x=lin_reg.predict(X_constant), y=lin_reg.resid)
plt.show()
Goldfeld-Quandt test or Breusch-Pagan test:- These tests are used to check homoscedasticity. Here we define the null and alternative hypotheses.
Null hypothesis:- variance is constant across the range of data (i.e. homoscedasticity)
Alternate hypothesis:- variance is not constant across the data (i.e. heteroscedasticity)
Example code:-
import statsmodels.stats.api as sms
sms.het_goldfeldquandt(lin_reg.resid, lin_reg.model.exog)  # exog is the matrix of input variables
It returns the p-value, which is then compared to the significance level (α), which is 0.05. If the p-value is greater than the significance level, we fail to reject the null hypothesis; if it is smaller, we reject the null hypothesis.
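The idea behind this kind of test can be sketched as a simple variance ratio: order the residuals by the predicted value, split them into a lower and an upper half, and compare the residual variance of the two halves. This is a simplified illustration in the spirit of the Goldfeld-Quandt test, not the statsmodels implementation. The real test refits the regression on each subsample and compares the ratio against an F distribution, which this sketch omits. The residuals and the function name are made up.

```python
# Simplified sketch of a variance-ratio check for heteroscedasticity.
# Residuals are assumed to be already ordered by predicted value.

def variance_ratio(ordered_residuals):
    """Mean squared residual of the upper half divided by that of the lower half."""
    half = len(ordered_residuals) // 2
    lower = ordered_residuals[:half]
    upper = ordered_residuals[-half:]
    var_lower = sum(e ** 2 for e in lower) / len(lower)
    var_upper = sum(e ** 2 for e in upper) / len(upper)
    return var_upper / var_lower

constant_spread = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
growing_spread = [0.5, -0.5, 0.5, -0.5, 2.0, -2.0, 2.0, -2.0]

ratio_constant = variance_ratio(constant_spread)  # close to 1: similar variance
ratio_growing = variance_ratio(growing_spread)    # far above 1: variance increases
print(ratio_constant, ratio_growing)
```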
No or low Multicollinearity
No or low Multicollinearity is the fifth assumption in assumptions of linear regression. It refers to a situation where a number of independent variables in a multiple regression model are closely correlated to one another. Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model.
Test to check multicollinearity
Correlation coefficients:- An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the model if at all possible.
Variance Inflation Factor (VIF):- Another way to check multicollinearity is the Variance Inflation Factor, which is the ratio of the variance of a coefficient in a model with multiple terms to the variance of that coefficient in a model with that term alone. It measures how much multicollinearity inflates the variance of the estimate.
VIF = 1/T
Where T is the tolerance, which measures the influence of one independent variable on all other independent variables. The tolerance is calculated with an initial regression analysis, in which one independent variable is regressed on all the others, and is defined as T = 1 – R² for this first-step regression. If the tolerance is less than 0.1 (T < 0.1) there might be multicollinearity in the data, and if the tolerance is less than 0.01 (T < 0.01) there certainly is.
If the VIF is 1, the variable is not correlated with the others; if it is between 1 and 5, it is moderately correlated; and if it is greater than 5, it is highly correlated.
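The relation VIF = 1/T with T = 1 – R² can be worked through by hand. The sketch below uses NumPy to regress each column on the remaining ones and reports the resulting VIF. The three columns are made up so that the first and third are nearly identical while the second is unrelated to both. The function name vif is an assumption for this example and is separate from the statsmodels helper used later.

```python
# Sketch: VIF computed as 1 / (1 - R^2) from an auxiliary regression.
# The data matrix is illustrative only.
import numpy as np

def vif(X, j):
    """Regress column j on the other columns (with intercept) and return 1 / tolerance."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot
    tolerance = 1 - r_squared
    return 1 / tolerance

x1 = np.array([-1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0])   # unrelated to x1
x3 = np.array([-0.9, -1.1, 0.9, 1.1])   # almost a copy of x1
X = np.column_stack([x1, x2, x3])

vif_values = [vif(X, j) for j in range(X.shape[1])]
print(vif_values)  # x1 and x3 get a very large VIF, x2 stays near 1
```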
Why is removing highly correlated features important?
The stronger the correlation between independent variables, the more difficult it is to change one feature without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the target variable independently, because the features tend to change in unison.
How to remove multicollinearity in data?
Suppose we have two features that are highly correlated. We can either drop one of them and keep the other, or combine the two features to form a new feature.
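The "drop one of the pair" strategy could look like the sketch below. It keeps a feature only if its absolute correlation with every feature already kept is below a threshold. The feature values, the 0.9 threshold, and the function name drop_highly_correlated are all assumptions made for illustration.

```python
# Sketch: drop one feature from each highly correlated pair.
# Feature values and the threshold are illustrative assumptions.
import numpy as np

def drop_highly_correlated(features, threshold=0.9):
    """Return the names of features to keep, scanning in insertion order."""
    names = list(features)
    corr = np.corrcoef([features[name] for name in names])  # rows are variables
    kept = []
    for i, name in enumerate(names):
        # Keep the feature only if it is not strongly correlated with a kept one.
        if all(abs(corr[i, names.index(k)]) < threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "tv": [1.0, 2.0, 3.0, 4.0, 5.0],
    "tv_copy": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly proportional to "tv"
    "radio": [5.0, 1.0, 4.0, 2.0, 3.0],
}

kept = drop_highly_correlated(features)
print(kept)
```

Which member of a pair to keep is a modelling decision. This sketch simply keeps whichever comes first.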
Example code:-
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
# To check correlation, use the code below
sns.heatmap(ad_data.corr(), annot=True)
Implementation of the assumption tests using the statsmodels library.
We used the Advertising dataset to test the assumptions of linear regression. The dataset contains the TV, Radio, and Newspaper advertising investments and the corresponding sales.
# Import required libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns  # for visualization
import matplotlib.pyplot as plt
sns.set(context="notebook", palette="Spectral", style='darkgrid', font_scale=1.5, color_codes=True)
import warnings
warnings.filterwarnings('ignore')
# Load dataset
ad_data = pd.read_csv('Advertising.csv', index_col='Unnamed: 0')
ad_data.head()  # check the head of the data
# Take independent and dependent variables
x = ad_data.drop(["Sales"], axis=1)
y = ad_data.Sales
import statsmodels.api as sm  # import this library to apply the OLS model
# Apply OLS model
X_constant = sm.add_constant(x)  # add a constant column to the independent variables
lin_reg = sm.OLS(y, X_constant).fit()  # fit the data with the OLS model
lin_reg.summary()  # check the summary of the model
Output:-
|Date:||Mon, 06 Jul 2020||Prob (F-statistic):||1.58e-96|
# It gives us all the statistical measures of the independent variables, the dependent variable, and the residuals of the model
# Assumptions of Linear Regression
# No Autocorrelation
# Multivariate Normality
# Linear Relationship
# Homoscedasticity
# No or low Multicollinearity
# Autocorrelation
import statsmodels.tsa.api as smt  # load library to draw the ACF plot to check autocorrelation
acf = smt.graphics.plot_acf(lin_reg.resid)  # pass the residuals to the ACF function
acf.show()
# The value of the Durbin-Watson test is also close to 2 (i.e. DW = 2.084), so there is no autocorrelation in the data
# Multivariate Normality
# Draw a distribution plot to check normality
sns.histplot(lin_reg.resid, kde=True)  # histplot replaces the deprecated sns.distplot
plt.show()
# Q-Q plot to check normality
import statsmodels.api as sm
Q_Qplot = sm.qqplot(lin_reg.resid, fit=True)
# Check normality using the Jarque-Bera test
from scipy import stats
stats.jarque_bera(lin_reg.resid)  # the second value is the p-value
Output:- (151.2414204760376, 0.0)
# Here we reject the null hypothesis because the test statistic is greater than the critical value (151.24 > 5.99) and the p-value is below 0.05, so the residuals are not normally distributed.
# Linear Relationship
# Plot predicted vs actual to check linearity
sns.regplot(x=lin_reg.predict(X_constant), y=y)
plt.show()
# Here we check whether the points lie on the line; if they do, the relationship is linear
# Plot predicted vs residual to check linearity
sns.regplot(x=lin_reg.predict(X_constant), y=lin_reg.resid)
plt.show()
# From the plot above, most of the points lie on a straight line, so the relationship is linear
# Rainbow test for linearity
import statsmodels.api as sm
sm.stats.diagnostic.linear_rainbow(res=lin_reg)  # the second value is the p-value
Output:- (0.8896886584728811, 0.7185004116483391)
# Here we fail to reject the null hypothesis because p-value > 0.05
# Homoscedasticity test
# Draw a regplot of predicted vs residual
sns.regplot(x=lin_reg.predict(X_constant), y=lin_reg.resid)
# Goldfeld-Quandt test or Breusch-Pagan test
import statsmodels.stats.api as sms
sms.het_goldfeldquandt(lin_reg.resid, lin_reg.model.exog)  # exog is the matrix of input variables
# The middle value is the p-value, which is > 0.05, hence we fail to reject the null hypothesis
# Multicollinearity
# Calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
# X_constant.shape[1] gives the number of columns in X_constant, which is the number of times the loop runs.
# Each column is regressed against all the other columns.
vif
Output:-
from pandas import Series, DataFrame
df = DataFrame(vif, index=X_constant.columns, columns=['vif'])
df
Output:-
# Find the correlation of the independent features
plt.figure(figsize=(10, 8))
sns.heatmap(ad_data.corr(), annot=True)
plt.show()
The features have low correlation with each other, so there is no multicollinearity in our data.
Conclusion:– This blog should give you a better understanding of the assumptions of linear regression and of the statistical tests used to check each one, which will give you more confidence when solving regression problems.