How do I run linear regression with Python?

We’ll be using the oldie-but-goldie Boston Housing dataset, which you can find attached here – BostonHousing – to try to predict the median value of a home, indicated by the MEDV variable.

The other variables defined in the dataset are:

# CRIM     per capita crime rate by town
# ZN       proportion of residential land zoned for lots over 25,000 sq. ft.
# INDUS    proportion of non-retail business acres per town
# CHAS     Charles River dummy variable (1 if tract bounds river; 0 otherwise)
# NOX      nitric oxides concentration (parts per 10 million)
# RM       average number of rooms per dwelling
# AGE      proportion of owner-occupied units built prior to 1940
# DIS      weighted distances to five Boston employment centres
# RAD      index of accessibility to radial highways
# TAX      full-value property-tax rate per $10,000
# PTRATIO  pupil-teacher ratio by town
# LSTAT    % lower status of the population
# MEDV     median value of owner-occupied homes in $1000s

As they all seem potentially reasonable predictors, I’m going to throw them all into the code snippet below. Not only will this estimate the OLS linear regression model (R-squared, adjusted R-squared, F-ratio and the coefficient estimates with their t-tests), but it will also run:

  1. the Breusch-Pagan test, to check for heteroscedasticity
  2. the variance inflation factor (VIF), to check for multicollinearity
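
A minimal sketch of that snippet using pandas and statsmodels is below; the file name BostonHousing.csv is an assumption based on the attachment above, so adjust the path to wherever you saved the data.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load the data (file name assumed from the attachment above)
df = pd.read_csv("BostonHousing.csv")

# All predictors in, MEDV out; OLS needs an explicit intercept column
y = df["MEDV"]
X = sm.add_constant(df.drop(columns=["MEDV"]))

# Fit and print the full summary: R-squared, adjusted R-squared,
# F-statistic, and the coefficient estimates with their t-tests
model = sm.OLS(y, X).fit()
print(model.summary())

# 1. Breusch-Pagan test on the residuals (small p-value => heteroscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")

# 2. VIF for each predictor (values above ~10 usually signal multicollinearity)
vif = pd.DataFrame({
    "variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)
```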

What do they actually mean?

(1) R-squared, the proportion of variability in MEDV that the model explains, stands at a reasonable 0.741, or about 74%

(2) The adjusted R-squared, which penalises the use of additional predictors, is estimated at 0.734, or about 73%. This is a good sign, as it suggests that we haven’t overcomplicated the model by throwing in too many predictors

(3) The F-ratio is significant, which indicates that the model does a better job at predicting MEDV than simply using its mean

The t-tests for CRIM, ZN, CHAS, NOX, RM, DIS, RAD, TAX, PTRATIO and LSTAT are all significant (p < 0.05), which indicates that all of these are strong predictors of the median price of a house

Great! So can we just apply the significant coefficients and calculate our predicted values? Not quite – the two diagnostic checks suggest otherwise:

(1) The Breusch-Pagan test is significant, which indicates the presence of heteroscedasticity in the data.

Heteroscedasticity is a problem because it:

(a) causes deflated p-values, which may lead you to think that a model is significant when it is actually not

(b) makes the coefficient estimates, and hence the predicted values, less precise
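
A quick way to see the problem for yourself, assuming the `model` object from the snippet above: plot the residuals against the fitted values and look for a fan shape, where the spread of the residuals grows with the fitted value.

```python
import matplotlib.pyplot as plt

# Residuals fanning out as fitted values grow is the classic
# visual signature of heteroscedasticity
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values (model 1)")
plt.show()
```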

(2) Whilst still below the usual worry threshold of 10, the VIF for the TAX variable is quite high, suggesting that there might be variables that highly correlate with TAX. (A predictor’s VIF is 1 / (1 − R²) from regressing that predictor on all the others, so a high value means the others largely explain it.)

Therefore, we print the detailed correlation matrix and, indeed, it turns out that the RAD and TAX variables are highly correlated (r > 0.9)
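
A sketch of that check, reusing `df` from the first snippet:

```python
# Pairwise correlations between the predictors, rounded for readability
corr = df.drop(columns=["MEDV"]).corr()
print(corr.round(2))

# Zoom in on the suspect pair
print(f"corr(RAD, TAX) = {df['RAD'].corr(df['TAX']):.2f}")
```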

As both assumptions have been violated, the current model is far from reliable. One way to address both issues is to re-run the regression with a different combination of variables.

I’m going to start by removing the correlated variable RAD and the ones that were estimated as insignificant in model 1 (i.e. INDUS and AGE), as sketched below.
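
Continuing the sketch above (same imports, `df` and `y`), model 2 might look like this:

```python
# Model 2: drop RAD (collinear with TAX) plus the insignificant INDUS and AGE
X2 = sm.add_constant(df.drop(columns=["MEDV", "RAD", "INDUS", "AGE"]))
model2 = sm.OLS(y, X2).fit()
print(model2.summary())

# Re-run both diagnostics on the reduced model
print(f"Breusch-Pagan LM p-value: {het_breuschpagan(model2.resid, model2.model.exog)[1]:.4f}")
vif2 = pd.DataFrame({
    "variable": X2.columns,
    "VIF": [variance_inflation_factor(X2.values, i) for i in range(X2.shape[1])],
})
print(vif2)
```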

Whilst this did solve the issue with multicollinearity, it did nothing for the heteroscedasticity.

Therefore, we’ll have to do one of the following:

(1) build the model with a completely new set of predictors

(2) transform the dependent variable (MEDV) using the Box-Cox approach

(3) use a weighted least squares (WLS) regression model to correct for the heteroscedasticity
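
To make options (2) and (3) concrete, here is a minimal sketch that reuses `y`, `X2` and `model2` from the snippets above; note that the 1/residual² weights in the WLS fit are just one rough choice of weights, not the only sensible one.

```python
from scipy import stats

# Option 2: Box-Cox transform the response (MEDV is strictly positive,
# so the transform applies), then refit OLS on the transformed values
y_bc, lam = stats.boxcox(y)
model_bc = sm.OLS(y_bc, X2).fit()
print(f"Box-Cox lambda: {lam:.3f}")
print(f"Breusch-Pagan LM p-value: {het_breuschpagan(model_bc.resid, model_bc.model.exog)[1]:.4f}")

# Option 3: WLS, down-weighting observations with large error variance;
# here the variance is crudely approximated by the squared OLS residuals
weights = 1.0 / (model2.resid ** 2)
model_wls = sm.WLS(y, X2, weights=weights).fit()
print(model_wls.summary())
```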