In the post Simple Linear Regression With Example we saw how to create a simple linear regression model using the scikit-learn library in Python. In this post we'll see how to create a multiple linear regression model using the same library. We'll also go through the steps for data pre-processing and cleaning, feature transformation, and encoding categorical data.
Multiple Linear Regression
In simple linear regression we model the relationship between one independent variable (predictor) and one dependent variable. Multiple Linear Regression is a fundamental statistical technique used to model the relationship between one dependent variable and multiple independent variables. So, we'll create a model to analyse how multiple features affect the outcome.
Multiple Linear Regression equation
In the context of machine learning, where we have sample data and use it to build a regression model, the multiple linear regression equation is given below.
$$ \hat{y}=b_0 + b_1x_1 + b_2x_2 + b_3x_3 + \dots + b_nx_n $$
Here \(\hat{y}\) is the predicted value (the output).
b0 is the intercept, which tells you where the regression line intercepts the Y-axis. In other words, it is the predicted value when all the predictors x1, x2, x3, .., xn are zero.
b1, b2, .., bn are the slopes (coefficients). Each slope tells you how much the dependent variable changes for a one unit change in the corresponding independent variable when all the other independent variables are held constant. For example, b1 represents the estimated change in \(\hat {y}\) per unit increase in x1 when x2, x3, .., xn are held constant. To put it another way, if you want to interpret b1, you imagine increasing x1 by 1 unit while keeping x2, x3, .., xn unchanged; the predicted change in \(\hat {y}\) is then exactly b1. The same logic applies to b2, b3 and so on.
The residual (the difference between the actual value and the predicted value) for observation i is \(e_i = y_i - \hat{y}_i\).
In the model the intercept and slopes (b0, b1, b2, …) are chosen to minimize the mean of the squared residuals (the Mean Squared Error).
$$ L=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2 $$
The goal of the model is to find the best-fit line that has the minimum Mean Squared Error.
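As a quick numerical illustration, the sketch below plugs some made-up coefficients and sample data (all values here are hypothetical, purely for illustration) into the regression equation using NumPy and computes the resulting Mean Squared Error.
import numpy as np

# hypothetical intercept b0 and slopes b1, b2
b0 = 2.0
b = np.array([3.0, -1.5])

# hypothetical sample data: 4 observations with 2 features each, plus actual values
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([2.2, 7.1, 9.4, 9.0])

y_hat = b0 + X @ b              # predictions: b0 + b1*x1 + b2*x2
residuals = y - y_hat           # e_i = y_i - y_hat_i
mse = np.mean(residuals ** 2)   # L = (1/n) * sum of squared residuals
print(y_hat, mse)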
Multiple linear regression using scikit-learn Python library
Dataset used here can be downloaded from- https://www.kaggle.com/datasets/hellbuoy/car-price-prediction
Goal is to predict the car price based on the given features.
In the implementation, the code is broken into several smaller units, with explanations of the data pre-processing steps in between.
1. Importing libraries and reading CSV file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('./CarPrice_Assignment.csv')
2. Getting info about the data. The parameter include='all' includes all the columns; otherwise only columns with numerical values are summarized.
df.describe(include='all')
3. Removing columns
There are 26 columns and 205 rows in the dataset. On analyzing the data you can observe that "car_ID" is inconsequential and the "CarName" column has 147 unique values; encoding that many unique categorical values would be a problem, so we'll drop these 2 columns. Also, the "enginelocation" column has 'front' as its value for 202 out of 205 rows, making it as good as a constant, so we can drop this column too.
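Before dropping them, you can verify these observations with a few quick checks (a small sketch using standard pandas calls):
# number of unique car names (147 for this dataset)
print(df['CarName'].nunique())
# distribution of engine location values ('front' vs 'rear')
print(df['enginelocation'].value_counts())
# car_ID is just a running identifier
print(df['car_ID'].is_unique)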
# dropping column car_ID (just a unique id) and
# CarName as there are 147 unique car names, encoding it would add a lot of columns
df = df.drop(columns=['car_ID', 'CarName'])
# dropping column enginelocation (as 202 entries have the value 'front')
df = df.drop("enginelocation", axis=1)
4. Removing outliers
We can also check for outliers in the dependent and independent variables because extreme values can disproportionately affect the regression line.
To do that we can plot a distribution; here is what it looks like for 'price'.
g = sns.displot(df['price'], kde=True, bins=30)
g.set(xlim=(0, None))
As you can see, there is a positive skew. Let's say we want to treat the top 1% of prices as outliers and drop them; that can be done using the quantile() function in Pandas, which returns the 99th percentile value.
# taking only the 99% records
qnt = df['price'].quantile(0.99)
data_re = df[df['price'] < qnt]
After that, if you run the data_re.describe(include='all') line, you can see that 3 rows have been removed and the max price is now 37028. That way we have reduced some of the skewness in the price data.
In the same way you can check some of the independent variables and remove outliers from them if needed.
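For example, the same quantile-based trim could be applied to an independent variable such as 'horsepower'. A minimal sketch (the name data_trimmed is hypothetical and this step is not used in the rest of the post):
# trimming the top 1% of horsepower values, mirroring the price trim above
hp_qnt = data_re['horsepower'].quantile(0.99)
data_trimmed = data_re[data_re['horsepower'] < hp_qnt]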
5. Resetting index
If you have removed a few records based on quantile values, you can use reset_index() to rearrange the index of the DataFrame back to the default integer index (0, 1, 2, …), since removing records leaves gaps in the default index.
# resetting indices
data_processed = data_re.reset_index(drop=True)
6. Checking for linear relationship between variables.
You can also plot the independent variables against price and examine the scatter plots; if a scatter plot looks roughly like a straight line, that suggests a likely linear relationship. In case a relationship doesn't look linear, we may have to apply a logarithmic or square root transformation to the data.
f, (p1,p2,p3) = plt.subplots(1,3, sharey=True, figsize=(15,3))
p1.scatter(data_processed['enginesize'], data_processed['price'])
p1.set_title('EngineSize and Price')
p2.scatter(data_processed['horsepower'], data_processed['price'])
p2.set_title('HorsePower and Price')
p3.scatter(data_processed['highwaympg'], data_processed['price'])
p3.set_title('HighwayMpg and Price')
At least for these variables the relationship looks linear.
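If one of these relationships had looked curved rather than linear, a logarithmic or square root transformation could be applied to the feature before fitting. A minimal sketch, assuming hypothetically that 'horsepower' needed such a transformation (it is not applied in the rest of this post, so it works on a copy):
# work on a hypothetical copy so the pipeline data is untouched
transformed = data_processed.copy()
# log transformation of a skewed feature (log1p handles zero values safely)
transformed['log_horsepower'] = np.log1p(transformed['horsepower'])
# alternatively, a square root transformation
transformed['sqrt_horsepower'] = np.sqrt(transformed['horsepower'])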
7. Feature and label selection
y = data_processed['price'] # dependent variable
X = data_processed.drop(columns=['price']) # independent variables
8. Checking for multicollinearity
Multicollinearity in linear regression occurs when two or more independent variables are highly correlated, meaning they provide redundant information. That makes it difficult to quantify the individual contribution of each independent variable to the dependent variable.
For detecting multicollinearity, two of the most commonly used options are-
Correlation Matrix
A correlation matrix is a matrix displaying the correlation coefficients for all possible pairs of predictors. It helps to find relationships between independent variables. Look for high correlation coefficients (e.g., > 0.7 or 0.8) between predictors.
Variance Inflation Factor (VIF)
The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is increased due to collinearity (correlation) among the predictor variables. A VIF greater than 10 generally indicates multicollinearity.
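For reference, the VIF of a predictor \(x_j\) is calculated from the R-squared value \(R_j^2\) obtained by regressing \(x_j\) on all the other predictors.
$$ VIF_j=\frac{1}{1-R_j^2} $$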
If you want to use the correlation matrix to find highly correlated features (here the threshold is kept at 0.85), then the following code will drop them.
# select columns with numerical values
v = X.select_dtypes(include ='number')
corr_matrix = v.corr().abs() # absolute correlations
#corr_matrix
#print(corr_matrix)
upper = corr_matrix.where(
#upper triangular part of an array
np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
# get the columns having any corr value > .85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
print(to_drop)
X_reduced = X.drop(columns=to_drop)
If you want to use VIF (with the upper limit kept at 10), then the following code will drop the features. Note that the statsmodels library provides the variance_inflation_factor function to compute the VIF for each variable in a regression model.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# select columns with numerical values
v = X.select_dtypes(include='number')
# select categorical features
categorical_features = X.select_dtypes(exclude='number').columns
# create a new dataframe holding the VIF of each numeric feature
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(v.values, i) for i in range(v.shape[1])]
vif["Features"] = v.columns
# get the columns where VIF is less than or equal to 10
valid_numeric = vif.loc[vif["VIF"] <= 10, "Features"]
final_features = list(valid_numeric) + list(categorical_features)
X_reduced = X[final_features]
I have used the correlation matrix code in this example, which drops the following columns-
['carlength', 'curbweight', 'enginesize', 'highwaympg']
9. Splitting and encoding data
Splitting is done using train_test_split, where test_size is passed as 0.2, meaning 20% of the data is used as test data while 80% is used to train the model. OneHotEncoder is used to encode the categorical data. With OneHotEncoder, the drop='first' parameter is used so that new columns are created for only n-1 unique values, which helps avoid the dummy variable trap. The parameter handle_unknown='ignore' deals with a mismatch between fitting and transforming: the encoder learns its categories from the training set, so any new categories that appear in the test set are "unknown" to it. With handle_unknown='ignore', unseen categories don't break the transform step; they are simply encoded as all zeros.
Note that both fitting and transformation (using fit_transform) are done on the training data, whereas only the transform() method is used on the test data. That's how it should be done: when you split the data into training and test sets, there is a chance that some category value appears only in the test data. Since the encoder has learned the categories from the training data, such a value is unknown to it at transform time, which is exactly where handle_unknown='ignore' helps.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score, mean_squared_error
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=0)
ct = ColumnTransformer([
    ('encoder', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'),
     X_reduced.select_dtypes(exclude='number').columns)
], remainder='passthrough')
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)
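To see concretely how drop='first' and handle_unknown='ignore' behave, here is a small standalone sketch on a toy column, using the pandas and OneHotEncoder imports already made above (the values 'gas', 'diesel' and 'electric' are made up for illustration and are not part of the pipeline above):
train_toy = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas']})
test_toy = pd.DataFrame({'fueltype': ['diesel', 'electric']})  # 'electric' was never seen in training

enc = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
# 'diesel' is the dropped first category, so only one column (for 'gas') is created
print(enc.fit_transform(train_toy))
# the unseen 'electric' value does not break transform, it is encoded as all zeros
print(enc.transform(test_toy))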
10. Training the model
From sklearn you import the LinearRegression class. You then create an object of this class and call its fit method to train the model; the parameters passed to the fit method are the training data (X_train_enc in our case) and the target values (y_train in our case).
reg = LinearRegression()
reg.fit(X_train_enc, y_train)
11. Once the model is trained, predictions can be made using the test data, which can then be compared with the actual test values (y_test).
y_pred = reg.predict(X_test_enc)
12. Comparing test and predicted data
df_results = pd.DataFrame({'Target':y_test, 'Predictions':y_pred})
df_results['Residual'] = df_results['Target'] - df_results['Predictions']
df_results['Difference%'] = np.abs((df_results['Residual'] * 100)/df_results['Target'])
print(df_results.head())
      Target   Predictions      Residual  Difference%
18    6295.0   5691.859375    603.140625     9.581265
171  10698.0  12380.265625  -1682.265625    15.725048
107  13860.0  18371.781250  -4511.781250    32.552534
98   13499.0  15935.093750  -2436.093750    18.046476
178  15750.0  22500.046875  -6750.046875    42.857440
print(df_results.describe())
             Target   Predictions     Residual  Difference%
count     41.000000     41.000000    41.000000    41.000000
mean   13564.524390  13802.394436  -237.870046    15.610925
std     7463.439157   6884.600274  2839.126221    11.662503
min     6189.000000   5691.859375 -7951.265625     0.230753
25%     8495.000000   9159.765625 -1665.515625     7.594619
50%    11694.000000  12293.171875  -448.171875    13.121822
75%    15750.000000  16523.468750  1010.000000    21.949032
max    37028.000000  33051.968750  5494.531250    48.020689
As you can see, the minimum difference percentage is 0.23 while the maximum is 48.02.
13. Checking the model metrics such as R-squared, mean squared error and root mean squared error.
print("R2 score", r2_score(y_test, y_pred))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error", mse)
print("Root Mean Squared Error", np.sqrt(mse))
That's all for this topic Multiple Linear Regression With Example. If you have any doubt or any suggestions to make please drop a comment. Thanks!