Predicting using Linear Regression in R
This article was published as a part of the Data Science Blogathon.
Introduction
Can you predict a company's revenue by analyzing the amount of budget it allocates to its marketing team? Yes, you can, using one of the simplest machine learning techniques: linear regression. Regression is an almost 200-year-old tool that is still effective in predictive analysis, and it is one of the oldest statistical tools still used in machine learning.
Table of contents

What is Linear regression?
Significance of linear regression in predictive analysis.
Practical application of linear regression using R.
Application on blood pressure and age dataset.
Multiple linear regression using R.
Application on wine dataset.
Conclusion
What is a Linear Regression?
Simple linear regression analysis is a technique for finding the association between two variables: a dependent variable, which responds to the change, and an independent variable. Note that we are not calculating the dependency of the dependent variable on the independent variable, just the association.
For example, a firm invests some amount of money in the marketing of a product and has also collected sales data over the years. By analyzing the correlation between the marketing budget and the sales data, we can predict next year's sales if the company allocates a certain amount of money to the marketing department. The idea of prediction sounds magical, but it's pure statistics: linear regression is basically fitting a straight line to our dataset so that we can predict future events.
The best fit line would be of the form:
Y = B0 + B1X
Where, Y – Dependent variable
X – Independent variable
B0 and B1 – Regression parameters
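The lm() function used below chooses B0 and B1 so as to minimize the sum of squared residuals, and for simple regression these estimates have a closed form. A minimal sketch on made-up data (the numbers here are illustrative, not the article's blood pressure data):

```r
# Closed-form least-squares estimates for Y = B0 + B1*X,
# checked against R's built-in lm() on made-up data.
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)  # true B0 = 3, B1 = 2, plus noise

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)
# b0 and b1 match coef(fit) up to floating-point error
```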
Predicting Blood pressure using Age by Regression in R
Now we take a dataset of blood pressure and age, and with the help of this data we train a linear regression model in R that will be able to predict blood pressure at ages not present in our dataset.
Equation of the regression line in our dataset.
BP = 98.7147 + 0.9709 Age
Importing dataset
Import the Age vs Blood Pressure dataset, a CSV file, using the function read.csv() in R, and store it in a data frame bp.
bp <- read.csv("bp.csv")
Creating data frame for predicting values
Create a data frame that stores Age 53; it will be used to predict blood pressure at age 53 once the linear regression model is built.
p <- as.data.frame(53)
colnames(p) <- "Age"
Creating a scatter plot using ggplot2 library
With the help of the ggplot2 library in R, we can see that there is a correlation between blood pressure and age: an increase in age is accompanied by an increase in blood pressure.
It is quite evident from the graph that the points are scattered in a manner that lets us fit a straight line through them.
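The plot itself takes only a few lines of ggplot2. A sketch using a small made-up stand-in with the same column names (Age, BP), so the snippet runs on its own; with the real data you would pass the bp data frame read in above instead:

```r
library(ggplot2)

# Small made-up stand-in with the same column names; with the real
# data, use the bp data frame read in from bp.csv instead.
bp <- data.frame(
  Age = c(39, 47, 45, 47, 65, 46, 67, 42, 67, 56),
  BP  = c(144, 120, 138, 145, 162, 142, 170, 124, 158, 154)
)

# Scatter plot of Age vs Blood Pressure, with the least-squares
# line overlaid by geom_smooth(method = "lm")
plt <- ggplot(bp, aes(x = Age, y = BP)) +
  geom_point() +
  geom_smooth(method = "lm")
plt
```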
Calculating the correlation between Age and Blood pressure
We can also verify the above analysis that blood pressure and age are correlated with the help of the cor() function in R, which calculates the correlation between two variables.
cor(bp$BP, bp$Age)
## [1] 0.6575673
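Pearson correlation, which cor() computes by default, is simply the covariance of the two variables scaled by their standard deviations; a quick sketch on made-up data:

```r
# Pearson correlation is the covariance divided by the product of the
# two standard deviations; verified against cor() on made-up data.
set.seed(42)
x <- rnorm(50)
y <- 0.7 * x + rnorm(50, sd = 0.5)

r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
```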
Creating a Linear regression model
Now, with the help of the lm() function, we build the linear model. lm() takes two arguments here. The first is a formula, "BP ~ Age", because Age is the independent variable and blood pressure the dependent variable; the second is data, the data frame containing the data, which in this case is bp.
model <- lm(BP ~ Age, data = bp)
Summary of our linear regression model
summary(model)
Output:
##
## Call:
## lm(formula = BP ~ Age, data = bp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.724  -6.994  -0.520   2.931  75.654
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared:  0.4324, Adjusted R-squared:  0.4121
## F-statistic: 21.33 on 1 and 28 DF,  p-value: 7.867e-05
Interpretation of the model
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

B0 = 98.7147 (Y intercept)
B1 = 0.9709 (Age coefficient)
BP = 98.7147 + 0.9709 Age
It means that a one-unit change in Age brings a 0.9709-unit change in blood pressure.
The standard error is the variability to expect in a coefficient; it captures sampling variability. So the variation in the intercept can be up to 10.0005 and the variation in the Age coefficient up to 0.2102, not more than that.
t value: the t value is the coefficient divided by its standard error; it is basically how big the estimate is relative to the error. The bigger the coefficient relative to the Std. Error, the bigger the t score. The t score comes with a p-value because it follows a distribution, and the p-value tells us how statistically significant the variable is to the model. For a confidence level of 95%, we compare the p-value with alpha = 0.05. In our case, the p-values of both the intercept and Age are less than alpha, which implies that both are statistically significant to our model.
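Both quantities can be recomputed from the coefficient table by hand, which makes the definitions concrete. A sketch on made-up data (the variable names are illustrative):

```r
# Recompute the t values and p-values in a coefficient table by hand:
# t = Estimate / Std. Error, and the p-value is the two-sided tail
# probability of that t under a t distribution with the residual df.
set.seed(7)
x <- rnorm(40)
y <- 1 + 0.8 * x + rnorm(40)
fit <- lm(y ~ x)

ctab <- summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)
t_manual <- ctab[, "Estimate"] / ctab[, "Std. Error"]
p_manual <- 2 * pt(abs(t_manual), df = fit$df.residual, lower.tail = FALSE)
```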
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared:  0.4324, Adjusted R-squared:  0.4121
## F-statistic: 21.33 on 1 and 28 DF,  p-value: 7.867e-05
The residual standard error, or the standard error of the model, is basically the average error of the model, which is 17.31 in our case; it means that on average the model can be off by 17.31 while predicting blood pressure. The lesser the error, the better the model is at predicting.
Multiple R-squared is 1 − (sum of squared errors / total sum of squares), i.e. the proportion of variance in the dependent variable that the model explains.
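That formula can be verified directly against what summary() reports; a sketch on made-up data:

```r
# R-squared = 1 - SSE/SST, checked against summary() on made-up data.
set.seed(3)
x <- rnorm(30)
y <- 2 + 1.5 * x + rnorm(30)
fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # sum of squared errors
sst <- sum((y - mean(y))^2)    # total sum of squares
r2_manual <- 1 - sse / sst
```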
Adjusted Rsquared:
If we add variables, whether or not they are significant for prediction, the value of R-squared will increase. This is the reason Adjusted R-squared is used: if an added variable isn't significant for the model's predictions, the value of Adjusted R-squared will decrease. It is one of the most helpful tools to avoid overfitting the model.
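Adjusted R-squared follows the formula 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors. A sketch on made-up data with a deliberately useless extra variable:

```r
# Adjusted R-squared penalizes extra predictors:
# adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1).
set.seed(9)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)                       # pure noise, unrelated to y
y  <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

k  <- 2                              # number of predictors
r2 <- summary(fit)$r.squared
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
```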
The F-statistic is the ratio of the mean square of the model to the mean square of the error; in other words, it is the ratio of how well the model is doing to what the error is doing, and the higher the F value, the better the model is doing compared to the error.
One is the degrees of freedom of the numerator of the F-statistic and 28 is the degrees of freedom of the errors.
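The F-statistic can likewise be rebuilt from the sums of squares; a sketch on made-up data:

```r
# F = (model sum of squares / model df) / (error sum of squares / error df),
# checked against the F value that summary() reports.
set.seed(5)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

df_model <- 1                            # one predictor
df_error <- fit$df.residual              # 28 here
ssm <- sum((fitted(fit) - mean(y))^2)    # model sum of squares
sse <- sum(residuals(fit)^2)             # error sum of squares
f_manual <- (ssm / df_model) / (sse / df_error)
```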
Predict the value of blood pressure at Age 53
BP = 98.7147 + 0.9709 Age
The above formula will be used to calculate the blood pressure at age 53. We achieve this with the predict() function: we pass the name of the linear regression model and, separated by a comma, the new data via newdata = p, since age 53 was earlier saved in the data frame p.
predict(model, newdata = p)
## 1
## 150.1708
So, the predicted value of blood pressure at age 53 is 150.17.
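predict() here is doing nothing more than plugging Age = 53 into the fitted equation BP = B0 + B1 × Age. A sketch with made-up data standing in for the article's CSV:

```r
# predict() plugs the new Age into BP = B0 + B1 * Age; made-up data
# stand in for the article's bp.csv here.
set.seed(11)
sim <- data.frame(Age = 30:59)
sim$BP <- 100 + sim$Age + rnorm(30, sd = 5)  # invented BP values
model <- lm(BP ~ Age, data = sim)

p <- data.frame(Age = 53)
by_predict <- unname(predict(model, newdata = p))
by_hand    <- unname(coef(model)[1] + coef(model)[2] * 53)
```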
We have predicted blood pressure through its association with age. However, there can be more than one independent variable correlated with the dependent variable, which is where multiple regression comes in.
Multiple Linear Regression Model
Multiple linear regression analysis is a statistical technique for finding the association of multiple independent variables with a dependent variable. For example, the revenue generated by a company depends on various factors, including market size, price, promotion, and the competitor's price. Basically, a multiple linear regression model establishes a linear relationship between a dependent variable and multiple independent variables.
Equation of Multiple Linear Regression is as follows:
Y = B0 + B1X1 + B2X2 + … + BkXk + E
Where,
Y – Dependent variable
X1, X2, …, Xk – Independent variables
B0, B1, …, Bk – Multiple linear regression coefficients
E – Error
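The coefficients of a multiple regression come from the same least-squares idea, written in matrix form as B = (XᵀX)⁻¹XᵀY. A sketch on made-up data with two predictors:

```r
# B = (X'X)^-1 X'Y with an explicit design matrix, checked against lm().
set.seed(2)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                    # column of ones for the intercept
beta_manual <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ x1 + x2)
```

In practice lm() is preferred: it solves the same problem through a QR decomposition, which is numerically more stable than forming XᵀX explicitly.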
Taking another example, the wine dataset: with the help of AGST and HarvestRain we are going to predict the price of wine.
Importing the dataset
Using the function read.csv(), import both datasets, wine.csv and wine_test.csv, into the data frames wine and wine_test respectively.
wine <- read.csv("wine.csv")
wine_test <- read.csv("wine_test.csv")
Finding the correlation between the different variables
Using the cor() function together with the round() function, we can round off the correlations between all variables of the wine dataset to two decimal places.
round(cor(wine),2)
Output:
##              Year Price WinterRain  AGST HarvestRain   Age FrancePop
## Year         1.00 -0.45       0.02 -0.25        0.03 -1.00      0.99
## Price       -0.45  1.00       0.14  0.66       -0.56  0.45     -0.47
## WinterRain   0.02  0.14       1.00 -0.32       -0.28 -0.02      0.00
## AGST        -0.25  0.66      -0.32  1.00       -0.06  0.25     -0.26
## HarvestRain  0.03 -0.56      -0.28 -0.06        1.00 -0.03     -0.04
## Age         -1.00  0.45      -0.02  0.25       -0.03  1.00     -0.99
## FrancePop    0.99 -0.47       0.00 -0.26       -0.04 -0.99      1.00
Scattered plots
Using the library ggplot2 in R, create a scatter plot, which clearly shows that AGST and the price of wine are highly correlated. Similarly, the scatter plot between HarvestRain and the price of wine shows their (negative) correlation.
ggplot(wine,aes(x = AGST, y = Price)) + geom_point() +geom_smooth(method = "lm")
ggplot(wine,aes(x = HarvestRain, y = Price)) + geom_point() +geom_smooth(method = "lm")
Creating a Multilinear regression model
model1 <- lm(Price ~ AGST + HarvestRain, data = wine)
summary(model1)
Output:
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88321 -0.19600  0.06178  0.15379  0.59722
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06
Interpretation of the Model
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

B0 = -2.20265 (Y intercept)
B1 = 0.60262 (AGST coefficient)
B2 = -0.00457 (HarvestRain coefficient)
Price = -2.20265 + 0.60262 AGST - 0.00457 HarvestRain
It means that a one-unit change in AGST brings a 0.60262-unit change in price, while a one-unit increase in HarvestRain lowers the price by 0.00457 units.
The standard error is the variability to expect in a coefficient; it captures sampling variability. So the variation in the intercept can be up to 1.85443, the variation in the AGST coefficient up to 0.11128, and the variation in the HarvestRain coefficient up to 0.00101, not more than that.
t value: as before, the t value is the coefficient divided by its standard error, and each t score comes with a p-value. For a confidence level of 95%, we compare the p-values with alpha = 0.05. In our case, the p-values of AGST and HarvestRain are less than alpha, so both predictors are statistically significant to our model; the intercept's p-value (0.247585) is greater than alpha and is therefore not significant.
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06
The residual standard error, or the standard error of the model, is basically the average error of the model, which is 0.3674 in our case; it means that on average the model can be off by 0.3674 while predicting the price of wines. The lesser the error, the better the model is at predicting.
Multiple R-squared is again 1 − (sum of squared errors / total sum of squares).
Adjusted Rsquared:
As before, if we add variables, whether or not they are significant for prediction, the value of R-squared will increase, whereas Adjusted R-squared will decrease if an added variable isn't significant for the model's predictions; it is one of the most helpful tools to avoid overfitting the model.
The F-statistic is the ratio of the mean square of the model to the mean square of the error; the higher the F value, the better the model is doing compared to the error.
Two is the degrees of freedom of the numerator of the F-statistic and 22 is the degrees of freedom of the errors.
Predicting values for our test set
prediction <- predict(model1, newdata = wine_test)
Predicted values with the test data set
wine_test
##   Year  Price WinterRain    AGST HarvestRain Age FrancePop
## 1 1979 6.9541        717 16.1667         122   4  54835.83
## 2 1980 6.4979        578 16.0000          74   3  55110.24
prediction
##        1        2
## 6.982126 7.101033
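A natural next step is to score these predictions against the actual test-set prices, for example with the test-set RMSE or an out-of-sample R-squared. A sketch with a made-up train/test split standing in for wine.csv and wine_test.csv (the coefficients used to generate the data below are invented):

```r
# Score predictions on held-out data with RMSE and out-of-sample R-squared;
# the data frame and coefficients below are invented stand-ins for the
# wine.csv / wine_test.csv files.
set.seed(4)
n <- 30
dat <- data.frame(AGST = rnorm(n, 16, 1), HarvestRain = rnorm(n, 150, 60))
dat$Price <- -2 + 0.6 * dat$AGST - 0.004 * dat$HarvestRain + rnorm(n, sd = 0.3)

train <- dat[1:25, ]
test_ <- dat[26:30, ]                    # held-out rows play the test set
fit  <- lm(Price ~ AGST + HarvestRain, data = train)
pred <- predict(fit, newdata = test_)

sse <- sum((test_$Price - pred)^2)
sst <- sum((test_$Price - mean(train$Price))^2)  # baseline: training mean
r2_out <- 1 - sse / sst                  # out-of-sample R-squared
rmse   <- sqrt(mean((test_$Price - pred)^2))
```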
Conclusion
As we can see, from the available dataset we can create and train a linear regression model; if enough data is available, we can accurately predict new events, or in other words, future outcomes.