In this post we'll see how to use decision tree regression, which uses decision trees for regression tasks, i.e. to predict continuous values. Decision trees can also be used to build classification models that predict a category or class label.
How does decision tree regression work
In a decision tree regressor, the decision tree splits the data using features and threshold values, which enables it to capture complex, non-linear relationships.
The decision tree regression model has a binary tree-like structure consisting of-
- Root node- Starting point which represents the whole dataset.
- Decision nodes- A decision point where the algorithm chooses a feature and a threshold to split the data into
subsets.
- Branches- From each decision node there are branches to child nodes, representing the outcome of the rule (tested decision). For example, if you have a housing dataset, one of the features is square footage and the threshold value is 1500, then at a decision node the algorithm asks- Is square footage ≤ 1500?
- If yes, go left (houses with square footage of 1500 sq ft or less)
- If no, go right (houses larger than 1500 sq ft)
- Leaf node- Contains the final predicted value. Also known as the terminal node.
Decision Tree Structure
How is a feature selected
If there are multiple features, at each node only one of them is selected for the decision rule, but that feature is not picked arbitrarily by the algorithm. All of the features are evaluated, using the steps given below-
- For each feature, the algorithm considers possible split points (threshold values).
- For each candidate split, the algorithm computes the decrease in impurity after splitting. For a decision tree regressor, impurity is measured using one of the following metrics-
- Mean Squared Error (MSE) which is the default
- Friedman MSE
- Mean Absolute Error (MAE)
- Poisson deviance
When splitting a parent node into left (L) and right (R) child nodes:
$$C(\mathrm{split})=C(L)+C(R)$$
The algorithm evaluates all possible features and thresholds and chooses the split that minimizes this total error across the child nodes, where each child's contribution is weighted by the fraction of samples it receives.
For a parent node with N samples split into-
- Left child with N_L samples
- Right child with N_R samples
The cost of the split is-
$$C(\mathrm{split})=\frac{N_L}{N}\cdot MSE(L)\; +\; \frac{N_R}{N}\cdot MSE(R)$$
where-
$$MSE(L)=\frac{1}{N_L}\sum _{i\in L}(y_i-\bar {y}_L)^2$$
$$MSE(R)=\frac{1}{N_R}\sum _{i\in R}(y_i-\bar {y}_R)^2$$
- \(y_i\) = target value of sample i
- \(\bar {y}_L, \bar {y}_R\) = mean target values in left and right child nodes
- \(N=N_L+N_R\)
- The same procedure is repeated recursively for each child node until stopping criteria are met (e.g., max depth, min samples per leaf, max_leaf_nodes, or no further improvement). This means that at each node:
- Compute the cost of split C(split) for all candidate features and thresholds.
- Choose the split with the minimum cost of split.
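The steps above can be sketched in plain Python for a single feature. The square-footage and price numbers below are made up purely for illustration:

```python
# Find the best split point for one feature by minimizing the weighted MSE
# across child nodes, exactly as in the C(split) formula above.

def mse(values):
    """Node impurity: mean squared error of values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Try the midpoint between each pair of consecutive feature values
    and return (threshold, cost) with the lowest weighted cost."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    n = len(xs)
    best = (None, float("inf"))
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical values
        threshold = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        # C(split) = N_L/N * MSE(L) + N_R/N * MSE(R)
        cost = len(left) / n * mse(left) + len(right) / n * mse(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Square footage vs price (hypothetical numbers)
sqft  = [1100, 1300, 1450, 1600, 1800, 2000]
price = [150, 160, 165, 240, 250, 260]
threshold, cost = best_split(sqft, price)
print(threshold, cost)  # best threshold is 1525.0, between 1450 and 1600
```

A real implementation evaluates this for every feature at every node, but the cost calculation is the same.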
Scikit-learn uses the Classification and Regression Tree (CART) algorithm to train decision trees.
Here is the decision tree structure (with a max depth of 3) for the laptop data used in the example in this post.
How is the value predicted
Before getting to how prediction is done in a decision tree regressor, note that by following the decision rules at each node, samples ultimately fall into one of the leaf nodes. The value you see in each leaf node in the above image is the average of the target values of all the training samples that ended up in that specific leaf node.
To make a prediction for a new data point, you traverse the tree from the root to a leaf node by following the decision rules.
If you follow the above image, at the root node the algorithm has chosen the "TypeName" feature (the TypeName feature is one-hot encoded, which is why you see the feature name as "encoder_type_notebook") with a threshold value of 0.5. In the same way, each of the other decision nodes has a rule that is evaluated for the new data point, so the new data point also falls into one of the leaf nodes. The predicted value for the new data point is then the value in that leaf node (the average of the target values of all the training samples that ended up in that specific leaf node).
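This traversal can be sketched with a toy, hand-built tree. The split points and leaf values below are hypothetical, not taken from the laptop model:

```python
# Each decision node: (feature_index, threshold, left_child, right_child);
# each leaf: {"value": mean of the training targets that reached it}.
tree = (
    0, 1500,                     # root: is square footage <= 1500?
    {"value": 155.0},            # yes -> leaf for smaller houses
    (0, 2000,                    # no  -> split again on square footage
     {"value": 245.0},
     {"value": 310.0}),
)

def predict(node, sample):
    """Walk from the root to a leaf following the decision rules."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if sample[feature] <= threshold else right
    return node["value"]

print(predict(tree, [1200]))  # 155.0
print(predict(tree, [1800]))  # 245.0
print(predict(tree, [2600]))  # 310.0
```

The prediction for any new data point is simply the stored average in whichever leaf it lands in.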
Decision tree regression using scikit-learn Python library
Dataset used here can be downloaded from-
https://www.kaggle.com/datasets/illiyask/laptop-dataset
The goal is to predict the price of a laptop based on the given features.
In the implementation, the code is broken into several smaller units with some explanation of the steps in between.
1. Importing libraries and reading CSV file
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('./laptop_eda.csv')
The laptop_eda.csv file is in the current directory.
2. Getting info about the data.
df.describe(include='all')
Analyzing the data shows there are 1300 rows. Price has a lot of variance; the minimum value is 9270.72 whereas the maximum value is 324954.72.
3. Check for duplicates and missing values
#for duplicates
df.duplicated().value_counts()
#for missing values
df.isnull().sum()
Output (for duplicates)
False 1270
True 30
Name: count, dtype: int64
There are duplicates which can be removed.
df.drop_duplicates(inplace=True)
4. Plotting pairwise relationship in the dataset
sns.pairplot(df[["Company", "Ram", "Weight", "SSD", "Price"]], kind="reg")
plt.show()
This helps in understanding the relationships between features as well as with the dependent variable.
If you analyse the plots, the relationship between Price and RAM looks roughly linear; otherwise the relationships among the pairs are non-linear.
5. Checking for outliers
To check for extreme values the IQR method is used. The IQR (Interquartile Range) method detects outliers by finding data points falling below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
Here IQR = Q3 - Q1 (middle 50% of data)
- Q1 is the 25th percentile
- Q3 is the 75th percentile
for label, content in df.select_dtypes(include='number').items():
    q1 = content.quantile(0.25)
    q3 = content.quantile(0.75)
    iqr = q3 - q1
    outl = content[(content <= q1 - 1.5 * iqr) | (content >= q3 + 1.5 * iqr)]
    perc = len(outl) * 100.0 / df.shape[0]
    print("Column %s outliers = %.2f%%" % (label, perc))
Output
Column Ram outliers = 17.24%
Column Weight outliers = 3.54%
Column Touchscreen outliers = 100.00%
Column ClockSpeed outliers = 0.16%
Column HDD outliers = 0.00%
Column SSD outliers = 1.42%
Column PPI outliers = 9.37%
Column Price outliers = 2.20%
Going back to where the data info was displayed, RAM values range from 2 GB to 64 GB, which looks fine in the context of a laptop dataset and doesn't require dropping any rows.
Touchscreen has only 2 values, 0 and 1 (a binary column), so its IQR collapses to 0 and the method flags every value; it doesn't require deleting any rows as outliers either.
Price also has a lot of variance. You can plot its distribution with a displot.
dp = sns.displot(df['Price'], kde=True, bins=30)
dp.set(xlim=(0, None))
The plot shows positive skewness, but in this example no Price outliers are deleted. You can compare the final result with all rows kept versus with outliers removed.
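The positive skewness seen in the plot can also be quantified with the third standardized moment. The sketch below uses synthetic log-normal data as a stand-in for Price (the real values would come from df['Price']):

```python
import numpy as np

rng = np.random.default_rng(0)
price_like = rng.lognormal(mean=10.5, sigma=0.6, size=1000)  # stand-in for Price

def skewness(a):
    """Third standardized moment: > 0 means a long right tail."""
    a = np.asarray(a, dtype=float)
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

before = skewness(price_like)
after = skewness(np.log1p(price_like))  # a log transform tames the right tail
print(before, after)
```

Tree-based models don't require a symmetric target, so this is just a diagnostic here rather than a required preprocessing step.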
6. Feature and label selection
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
7. Checking for multicollinearity
Multicollinearity check is generally not required for decision tree regression. Decision trees split the data based on thresholds
of individual features. They don't estimate coefficients like linear regression does, so correlated predictors don't distort
parameter estimates.
8. Splitting and encoding data
Splitting is done using train_test_split where test_size is passed as 0.2, meaning 20% of the data is used as test data whereas
80% of the data is used to train the model.
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
ct = ColumnTransformer([
    ('encoder', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'),
     X.select_dtypes(exclude='number').columns)
], remainder='passthrough')
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)
9. Training the model and predicting values
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor.fit(X_train_enc, y_train)
y_pred = regressor.predict(X_test_enc)
df_result = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df_result.head(10)
Output (side-by-side comparison of actual values and predicted values)
Actual Predicted
1176 114731.5536 90147.685886
1117 69929.4672 60902.414400
427 106506.7200 60902.414400
351 75071.5200 60902.414400
364 20725.9200 27029.100706
853 41931.3600 55405.469087
1018 118761.1200 55405.469087
762 60153.1200 83108.137838
461 39906.7200 55405.469087
883 19660.3200 20666.681974
10. Checking model metrics such as R squared to see whether the model is overfitting or not.
from sklearn.metrics import r2_score, mean_squared_error
# for training data
print(regressor.score(X_train_enc, y_train))
#for predicted values
print(r2_score(y_test, y_pred))
Output
0.8003266191317047
0.7163190370541831
The R2 score for training data is 0.80 whereas for test data it is 0.71.
If the training score were very high (close to 1.0) and the test score much lower (like 0.3–0.4), that would
indicate overfitting.
If both scores were low (say <0.5), that would indicate the model is too simple and not capturing the patterns.
The gap between 0.80 and 0.71 is modest. This indicates slight overfitting, but nothing extreme; the model generalizes reasonably well.
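To see how max_depth drives the train/test gap, here is a small sketch on synthetic data (not the laptop dataset); an unrestricted tree memorizes the training set while a shallow one underfits:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)  # noisy non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until leaves are pure
    reg = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (reg.score(X_tr, y_tr), reg.score(X_te, y_te))
    print(depth, round(scores[depth][0], 3), round(scores[depth][1], 3))
```

The unrestricted tree scores essentially 1.0 on training data but noticeably lower on test data because it has fit the noise; tuning max_depth (e.g. via cross-validation) is how you trade that gap off.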
11. Plotting the tree
If you want to check how decision nodes were created by the algorithm you can plot the decision tree.
from sklearn.tree import plot_tree
plt.figure(figsize=(20,15))
plot_tree(regressor, filled=True, fontsize=10)
plt.show()
Another way to do it is by using the graphviz library. But that means downloading Graphviz from this location-
https://graphviz.org/download/
It also requires setting the path to the bin directory, which can be done programmatically.
# get the names of the features used (after encoding)
feature_names = ct.get_feature_names_out()
X_train_final = pd.DataFrame(X_train_enc, columns=feature_names, index=X_train.index)
from sklearn.tree import export_graphviz
dot_data = export_graphviz(
    regressor,
    out_file=None,
    feature_names=X_train_final.columns,
    rounded=True,
    filled=True
)
import os
#setting path
os.environ["PATH"] += os.pathsep + r"D:\Softwares\Graphviz-14.1.1-win64\bin"
from graphviz import Source
Source(dot_data)
That's all for this topic Decision Tree Regression With Example. If you have any doubt or any suggestions to make please
drop a comment. Thanks!
Related Topics
- Simple Linear Regression With Example
- Multiple Linear Regression With Example
- Polynomial Regression With Example
- Support Vector Regression With Example
- Mean, Median and Mode With Python Examples