We’ll be using the oldie-but-goldie Boston Housing dataset, which you can find attached here – BostonHousing – to try to predict the median value of a home, captured by the MEDV variable.

The other variables defined in the dataset are:

| Variable | Description |
| --- | --- |
| CRIM | per capita crime rate by town |
| ZN | proportion of residential land zoned for lots over 25,000 sq. ft. |
| INDUS | proportion of non-retail business acres per town |
| CHAS | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| NOX | nitric oxides concentration (parts per 10 million) |
| RM | average number of rooms per dwelling |
| AGE | proportion of owner-occupied units built prior to 1940 |
| DIS | weighted distances to five Boston employment centres |
| RAD | index of accessibility to radial highways |
| TAX | full-value property-tax rate per $10,000 |
| PTRATIO | pupil–teacher ratio by town |
| LSTAT | % lower status of the population |
| MEDV | median value of owner-occupied homes in $1000s |

As they all seem potentially reasonable predictors, I’m going to throw them all into the code snippet below. Not only will this estimate the OLS linear regression model *(e.g. R-squared, adjusted R-squared, F-ratio and correlation coefficients)*, but it will also calculate:

- the Breusch-Pagan test, to check for heteroscedasticity
- the VIF (variance inflation factor), to check for multicollinearity

**(1) R-squared**, the ratio indicating how much of the variability can be explained by the model, stands at a reasonable 0.741, or 74%.

**(2) The adjusted R-squared**, which penalises the use of multiple predictors, is estimated at 0.734, or 73%. This is a good sign, as it suggests that we haven’t overcomplicated the model by throwing in too many predictors.

**(3) The F-ratio is significant**, which indicates that the model does a better job of predicting than the plain average.

**The t-tests** for CRIM, ZN, CHAS, NOX, RM, DIS, RAD, PTRATIO, LSTAT and TAX **are all significant (p < 0.05),** which indicates that all of these are strong predictors of the median price of a house.

Great! So, can we apply the significant coefficients and calculate our predicted values? Not so fast – the diagnostics flag two problems.

**(1) The Breusch-Pagan test** is significant, which indicates the presence of heteroscedasticity in the data.

**Heteroscedasticity is a problem because it:**

(a) it deflates p-values (the standard errors are underestimated), which may lead you to think a predictor is significant when it actually is not

(b) it makes coefficient estimates, and hence predicted values, less precise

**(2) Whilst still below the worry threshold of 10**, the **VIF** for the TAX variable (obtained by regressing TAX against the other predictors) is quite high, suggesting that there might be variables that correlate strongly with TAX.

Therefore, we print the detailed correlation matrix and, indeed, it turns out that the **RAD** and **TAX** variables are highly correlated **(r > 0.9)**.
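With the real data this is simply `df.corr()` on the full frame; the sketch below uses a synthetic stand-in in which RAD and TAX are deliberately built to be strongly correlated, mirroring the pattern found above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: TAX is constructed as a noisy linear function of RAD,
# so the two end up highly correlated; RM is independent for contrast.
rng = np.random.default_rng(1)
n = 300
rad = rng.uniform(1, 24, n)
df = pd.DataFrame({"RAD": rad,
                   "TAX": 20 * rad + rng.normal(0, 15, n),
                   "RM": rng.normal(6, 0.7, n)})

# Pairwise Pearson correlations between all variables
corr = df.corr()
print(corr.round(2))
print("r(RAD, TAX) =", round(corr.loc["RAD", "TAX"], 3))
```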

As both assumptions have been violated, the current model is far from reliable. One way to address both issues is to re-fit the model with a different combination of variables.

I’m going to start by removing the correlated variable RAD and the ones that were estimated as insignificant in model 1 (i.e. INDUS and AGE).

Whilst this did solve the multicollinearity issue, it did nothing for the heteroscedasticity.

**Therefore, we’ll either have to:**

(1) build the model with a completely new set of predictors

(2) transform the predicted variable using the Box-Cox approach

(3) use a weighted least squares regression model to correct for the heteroscedasticity