How do I run linear regression with Python?

We’ll be using the oldie-but-goldie Boston Housing dataset (attached here as BostonHousing) to try to predict the median value of a home, indicated by the MEDV variable.

The other variables defined in the dataset are:

CRIM – per capita crime rate by town
ZN – proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS – proportion of non-retail business acres per town
CHAS – Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX – nitric oxides concentration (parts per 10 million)
RM – average number of rooms per dwelling
AGE – proportion of owner-occupied units built prior to 1940
DIS – weighted distances to five Boston employment centres
RAD – index of accessibility to radial highways
TAX – full-value property-tax rate per $10,000
PTRATIO – pupil-teacher ratio by town
B – 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT – % lower status of the population
MEDV – median value of owner-occupied homes in $1000s

As they all seem potentially reasonable predictors, I’m going to throw them all into the code snippet below. Not only will this estimate the OLS linear regression model (i.e. R-squared, adjusted R-squared, the F-ratio and the regression coefficients), but it will also calculate:

  1. the Breusch-Pagan test, to check for heteroscedasticity
  2. the VIF (variance inflation factor), to check for multicollinearity
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import os
path ='C:/Users/hp/PycharmProjects/datascience/'
os.chdir(path)
os.getcwd()
dataset = pd.read_csv('BostonHousing.csv')
#print (dataset)
#--------------Function definitions------------------------------------------------------
# CRIM     per capita crime rate by town 
# ZN       proportion of residential land zoned for lots over 25,000 sq.ft. 
# INDUS    proportion of non-retail business acres per town. 
# CHAS     Charles River dummy variable (1 if tract bounds river; 0 otherwise) 
# NOX      nitric oxides concentration (parts per 10 million) 
# RM       average number of rooms per dwelling 
# AGE      proportion of owner-occupied units built prior to 1940 
# DIS      weighted distances to five Boston employment centres 
# RAD      index of accessibility to radial highways 
# TAX      full-value property-tax rate per $10,000 
# PTRATIO pupil-teacher ratio by town 
# B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town 
# LSTAT   % lower status of the population 
# MEDV    Median value of owner-occupied homes in $1000
# -------------Script execution------------------------------------------------------------
# define the data/predictors as the pre-set feature names
df = pd.DataFrame(dataset,columns=['CRIM', 'ZN', 'INDUS', 'CHAS','NOX','RM','AGE','DIS', 'RAD', 'PTRATIO','B', 'LSTAT', 'TAX'])
# setting the target — the variable we’re trying to predict/estimate.
target = pd.DataFrame(dataset, columns=['MEDV'])
#-------------Running a linear regression with a constant term------------------------------
x = df
x = sm.add_constant(x)
y = target
# Note the argument order: OLS takes the dependent variable first, then the predictors
model = sm.OLS(y, x).fit()
predictions = model.predict(x) # make the predictions by the model
# Print out the statistics
print(model.summary())
# -------------1. Is the assumption of homoscedasticity violated ? -------------------------
# NB! Test for heteroscedasticity (i.e. uneven variance of the error term),
# e.g. higher TAX values being associated with larger error variances
name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
bp = het_breuschpagan(model.resid, model.model.exog)
print(pd.Series(bp, index=name))
# the Breusch-Pagan test indicates whether the results are being influenced by heteroscedasticity
# and are therefore unreliable; if the p-value is less than 0.05, heteroscedasticity is present
# --------------2. Is the assumption of multicollinearity violated ? ------------------------
# Multicollinearity denotes the case when two x variables are highly correlated.
# Not only is it redundant to include both related variables in a multiple regression
# model, but it's also problematic. The bottom line is this: if two x variables
# are highly correlated, only include one of them in the regression
# model, not both. If you include both, the coefficients for each of the two variables
# will be unreliable because they share their contribution to determining the value of y.
# for the multicollinearity check to work, columns need to be declared as variables
age = dataset['AGE']
rm = dataset['RM']
tax = dataset['TAX']
crim = dataset['CRIM']
zn = dataset['ZN']
indus = dataset['INDUS']
chas = dataset['CHAS']
nox = dataset['NOX']
dis = dataset['DIS']
rad = dataset['RAD']
pratio = dataset['PTRATIO']
lstat = dataset['LSTAT']
# --------------------------------calculating VIF---------------------------------------------------------
z = np.column_stack((age, rm, crim, zn, indus, chas, nox, dis, rad, pratio, lstat))
z = sm.add_constant(z, prepend=True)
from sklearn.metrics import r2_score
from sklearn import linear_model
lm = linear_model.LinearRegression()
model = lm.fit(z,tax)
predictions = lm.predict(z)
#print(predictions)
rsquared_t = r2_score(tax, predictions)
vif =1/(1-(rsquared_t))
print('VIF =', vif)
# We are defining our predictor variables as a matrix z.
# Once we have done this, we regress the tax variable on z (our auxiliary regression).
# Once we have the predictions, we obtain the R-squared value and then calculate the VIF statistic.
# If VIF > 10, there is cause for concern
#Myers (1990) suggests that a value of 10 is a good value at which to worry
#if the average VIF is greater than 1, then multicollinearity may be biasing the regression model (Bowerman & O’Connell, 1990).
#To calculate the average VIF we simply add the VIF values for each predictor and divide by the number of predictors (k):
# To check for the others, we simply select the relevant variable as the dependent one, and regress against the others
# rm, tax vs. age = 1.362
# age, tax vs. rm = 0.168
# rm, age vs. tax = 0.006
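# --- Added sketch (not part of the original walk-through): the same VIF check for every
# --- predictor at once, using statsmodels' built-in variance_inflation_factor.
# --- It should reproduce the auxiliary-regression numbers above.
from statsmodels.stats.outliers_influence import variance_inflation_factor
exog = sm.add_constant(df)  # same predictor set as the main model (constant in column 0)
vif_table = pd.Series([variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
                      index=exog.columns[1:])  # skip the constant column
print(vif_table)
print('Average VIF =', vif_table.mean())  # an average well above 1 may signal bias (Bowerman & O'Connell, 1990)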
# -------------------Printing correlation matrix-----------------------------------------
#age, rm, crim, zn, indus, chas, nox, dis, rad, pratio, lstat
d = {'age': age,
     'tax': tax,
     'rm': rm,
     'crim': crim,
     'zn': zn,
     'indus': indus,
     'chas': chas,
     'nox': nox,
     'dis': dis,
     'rad': rad,
     'pratio': pratio,
     'lstat': lstat}
cr = pd.DataFrame(data=d)
print("Data Frame")
print("Correlation Matrix")
print(cr.corr())
print()
def get_redundant_pairs(cr):
    '''Get diagonal and lower triangular pairs of the correlation matrix'''
    pairs_to_drop = set()
    cols = cr.columns
    for i in range(0, cr.shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(cr, n=5):
    '''Return the n largest absolute pairwise correlations'''
    au_corr = cr.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(cr)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]
print("Top Absolute Correlations")
print(get_top_abs_correlations(cr, 10))

What do they actually mean?

(1) R-squared, the ratio indicating how much of the variability in MEDV the model explains, stands at a reasonable 0.741, or about 74%

(2) The adjusted R-squared, which penalises the use of multiple predictors, is estimated at 0.734, or about 73%. This is a good sign, as it suggests that we haven’t overcomplicated the model by throwing in too many predictors

(3) The F-ratio is significant, which indicates that the model predicts the outcome better than simply using the mean

The t-tests for CRIM, ZN, CHAS, NOX, RM, DIS, RAD, PTRATIO, LSTAT and TAX are all significant (p < 0.05), which indicates that these are strong predictors of the median price of a house
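As a quick, hedged sanity check, these headline figures can also be pulled off the fitted statsmodels results object individually rather than read from the printed table (this assumes model still refers to the OLS fit from the snippet above, before the name is reused for the auxiliary sklearn regression):

# pull the headline statistics straight off the fitted statsmodels results object
print('R-squared:', model.rsquared)
print('Adjusted R-squared:', model.rsquared_adj)
print('F-statistic:', model.fvalue, 'with p-value:', model.f_pvalue)
# per-coefficient t-test p-values; anything below 0.05 counts as significant here
print(model.pvalues[model.pvalues < 0.05])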

Great! So can we apply the significant coefficients and calculate our predicted values?

(1) The Breusch-Pagan test is significant, which indicates the presence of heteroscedasticity in the data.

Heteroscedasticity is a problem because it:

(a) it can deflate p-values, which may lead you to think that a predictor, or the model, is significant when it actually is not

(b) it makes the coefficient estimates, and hence the predicted values, less precise

(2) Whilst still below the worry threshold of 10, the VIF (when TAX is regressed on the other predictors) is quite high, suggesting that some variables correlate strongly with TAX

Therefore, we print the detailed correlation matrix, and indeed it turns out that the RAD and TAX variables are highly correlated (r > 0.9)

As both assumptions have been violated, the current model is far from reliable. One way to address both issues is to re-run the model with a different combination of variables.

I’m going to start by removing the correlated variable RAD, along with the predictors that were estimated as insignificant in model 1 (i.e. INDUS and AGE), as in the sketch below.
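Here is a minimal sketch of that second model (assuming the imports and dataset from the snippet above are still in scope; the exact column list is my reading of the text, so adjust it if your own summary flags a different set of predictors):

# model 2: drop RAD (highly correlated with TAX) plus the insignificant INDUS and AGE
x2 = dataset[['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
x2 = sm.add_constant(x2)
model2 = sm.OLS(dataset['MEDV'], x2).fit()
print(model2.summary())
# re-check heteroscedasticity on the new residuals
bp2 = het_breuschpagan(model2.resid, model2.model.exog)
print(pd.Series(bp2, index=['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']))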

Whilst this did solve the issue with multicollinearity, it did nothing for heteroscedasticity.

Therefore, we’ll either have to:

(1) build the model with a completely new set of predictors

(2) transform the predicted (dependent) variable using the Box-Cox approach

(3) use a weighted least squares (WLS) regression model to correct for the heteroscedasticity (a rough sketch of options 2 and 3 follows below)
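For reference, here is a rough sketch of what options (2) and (3) might look like with scipy and statsmodels, reusing model2 and x2 from the snippet above; the weighting scheme in the WLS part (error variance assumed proportional to the squared fitted values) is just one common assumption, not the only valid choice:

from scipy import stats

# option (2): Box-Cox transform the dependent variable, then refit OLS
medv_bc, lam = stats.boxcox(dataset['MEDV'])  # MEDV is strictly positive, so Box-Cox applies
model_bc = sm.OLS(medv_bc, x2).fit()
print('Box-Cox lambda:', lam)
print(model_bc.summary())

# option (3): weighted least squares, assuming the error variance grows with the fitted values
fitted = model2.fittedvalues
model_wls = sm.WLS(dataset['MEDV'], x2, weights=1.0 / fitted ** 2).fit()
print(model_wls.summary())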