Contrary to what you might think, running a linear regression (be it in Python, R, SPSS or Excel) is not rocket science! What’s trickier is interpreting the output, so let’s first review what linear regression is all about.

Regression analysis is a method of supervised learning used to **predict an outcome** *(dependent variable)* from **one** or **several predictors** *(independent variables)*.

Classical statistics holds that four assumptions need to be met for a regression model to be reliable:

1. The noise ε (or equivalently, the dependent variable) follows a normal distribution

2. The relationship between the predictors and the outcome is actually linear

3. The observations are independent of each other, and the predictors are not strongly correlated with one another (no multicollinearity)

4. The variability in Y values for a given set of predictors is the same regardless of the values of the predictors (homoscedasticity)

The normal distribution assumption is paramount in academic research where data is scarce and the same dataset is used to both fit the model and assess its reliability.

Business environments, however, offer abundant data, which makes it possible to split it into separate training and validation sets.

The validation set enables us to estimate the error of our predictions directly; therefore, we can drop the assumption of normal distribution when using linear regression models in data mining.
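To make the train/validation idea concrete, here is a minimal numpy-only sketch on synthetic data (all variable names are illustrative): fit OLS on a training subset, then estimate the prediction error on the held-out observations.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, n)  # true intercept 3, slope 2, noise sd 1.5

idx = rng.permutation(n)                    # 70/30 train/validation split
train, valid = idx[:140], idx[140:]

X_train = np.column_stack([np.ones(140), x[train]])  # add an intercept column
coef, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

pred = coef[0] + coef[1] * x[valid]         # predictions on unseen data
rmse = np.sqrt(np.mean((y[valid] - pred) ** 2))
```

The validation RMSE should land near the true noise level (here, around 1.5), which is exactly the error estimate the normality assumption would otherwise have to provide.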

Nevertheless, it is essential to make sure that the **assumptions of homoscedasticity and no multicollinearity are not violated**.

Failing to do so will result in **unreliable predictor coefficients**.
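As a quick sketch of the multicollinearity check, here is a numpy-only version of the variance inflation factor (VIF) on synthetic data (names are illustrative; libraries such as statsmodels ship a ready-made `variance_inflation_factor`). Homoscedasticity, by contrast, is usually checked visually with a residual-vs-fitted plot.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                         # independent predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the rest."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(3)]  # common rule of thumb: VIF > 10 is a red flag
```

In this example the first two predictors produce very large VIFs (they are nearly copies of each other), while the independent one stays close to 1.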

**How do I assess the reliability of my model?**

A linear (OLS) regression model will produce the following output:

**(1) Multiple R** (Pearson R) = indicates the strength of the relationship (*correlation*) between the predictor and the predicted variables.

*The closer the correlation coefficient is to 1, the more the predicted variable depends on the predictor, and vice versa.*

In other words, **large values of multiple R indicate strong correlation** between predicted and observed values.
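As a quick illustration, multiple R can be computed as the Pearson correlation between the observed and the fitted values; a numpy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 100)
y = 1.0 + 0.8 * x + rng.normal(0, 0.5, 100)

X = np.column_stack([np.ones(100), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

multiple_r = np.corrcoef(y, fitted)[0, 1]  # Pearson R of observed vs. fitted values
```

With only one predictor, this is simply the (absolute) correlation between x and y.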

**(2) R squared** = indicates how much variance the model explains relative to how much there is to explain in the first place:

(a) R-squared > 0.8 (great)

(b) R-squared between 0.5 and 0.8 (decent)

(c) R-squared < 0.5 (bad)
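For reference, R-squared is just one minus the ratio of the residual to the total sum of squares; a minimal numpy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=120)
y = 2.0 * x + rng.normal(size=120)

X = np.column_stack([np.ones(120), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

ss_res = np.sum(resid ** 2)           # variation the model leaves unexplained
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation there is to explain
r2 = 1.0 - ss_res / ss_tot
```

Here the true signal variance is about four times the noise variance, so R-squared lands near 0.8 ("great" by the rule of thumb above).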

**NB!** Whilst a high R-squared does suggest better predictive capacity, it can also be an indication of overfitting.

Overfitting typically occurs in highly complex models that rely on too many predictor variables and too few observations. The regression coefficients then capture the random error in the data rather than the genuine relationships between the variables.

It is therefore essential that high R-squared results always be cross-validated on observations that were not used to estimate the model (the validation set).
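A small simulation makes the point (synthetic data; a sketch, not a recipe): with 40 pure-noise predictors and only 50 training observations, the training R-squared looks impressive while the validation R-squared collapses.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_valid, p = 50, 50, 40
X_train = rng.normal(size=(n_train, p))  # 40 pure-noise predictors
X_valid = rng.normal(size=(n_valid, p))
y_train = rng.normal(size=n_train)       # no real relationship at all
y_valid = rng.normal(size=n_valid)

A = np.column_stack([np.ones(n_train), X_train])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def r2(X, y):
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

r2_train, r2_valid = r2(X_train, y_train), r2(X_valid, y_valid)
```

The coefficients have memorised the noise: the training R-squared comes out high while the validation R-squared drops to roughly zero or below.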

**(3) Adjusted R-squared** is especially important in multiple regression models. Adding a new predictor never decreases the R-squared. Thus, models with many independent variables can seemingly provide a better fit than models with fewer predictors. **Adjusted R-squared** remedies this **by imposing a penalty** for the **number of predictors** that were used in the model.

Hence, it’s critical that you monitor how the adjusted R-squared varies across model versions with different numbers of predictors.
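The penalty is easy to see from the formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors; a tiny sketch:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same raw R² of 0.80, but the penalty grows with the predictor count:
few = adjusted_r2(0.80, n=50, k=2)    # ≈ 0.791
many = adjusted_r2(0.80, n=50, k=20)  # ≈ 0.662
```

Two models with identical raw R-squared can thus rank very differently once the number of predictors is taken into account.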

**(4) F-ratio** = how much variability the model can explain vs. how much it can’t *(i.e. the ratio of how good the model is to how bad it is)*. As such, the F-ratio indicates **how much the model improves the predictive capacity compared to the regular average**. If significant **(p < 0.05)**, the model does a better job at predicting the values than a regular average.

*A good model should have a high F-ratio (at least higher than 1).* The **higher the value of the F-ratio**, the **better the model** *(e.g. if two models both have significant F-ratios, the one with the higher F-statistic does a better job at predicting the dependent variable)*.
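The F-ratio can also be written in terms of R-squared, F = (R²/k) / ((1 − R²)/(n − k − 1)); a tiny sketch (the p-value would then come from the F distribution with k and n − k − 1 degrees of freedom, e.g. via scipy.stats.f.sf):

```python
def f_ratio(r2, n, k):
    """F = (R² / k) / ((1 - R²) / (n - k - 1)): explained vs. unexplained variability per degree of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# R² = 0.6 with n = 30 observations and k = 3 predictors:
f = f_ratio(0.6, n=30, k=3)  # = 13.0
```

Note how the same R-squared yields a smaller F-ratio as the number of predictors grows: the explained variability is spread over more degrees of freedom.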

**(5) p-value(s)** = represent the significance of the t-tests on the coefficient estimates. In other words, coefficient estimates that are insignificant do little to predict the dependent variable.
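A numpy-only sketch of where those t-tests come from (synthetic data, illustrative names): each t-statistic is the coefficient divided by its standard error, taken from σ̂² · diag((XᵀX)⁻¹); the p-values would then follow from the t distribution with n − k − 1 degrees of freedom (e.g. scipy.stats.t.sf).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # unrelated to y on purpose
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

sigma2 = resid @ resid / (n - X.shape[1])  # residual variance estimate
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = coef / se                        # large |t| => significant coefficient
```

Here the genuine predictor x1 gets a large t-statistic, while the irrelevant x2 stays near zero, exactly the pattern an insignificant p-value flags.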
