How to create a decision tree for Regression in Python

A decision tree can be used for regression as well as classification; more information about it can be found here. The code in this post creates a decision tree model for regression on a small sample of gym data.
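
To make the regression case concrete: with a squared-error criterion, a regression tree picks the split that minimizes the total squared error, and each leaf then predicts the mean of the training targets that fall into it. The following is a minimal sketch of a single split on the Hours and Weight columns of the gym data used in this post. The helper `best_split` is hypothetical, written only for illustration, and is not how scikit-learn implements its trees internally.

```python
import numpy as np

def best_split(x, y):
    """Hypothetical helper: try every midpoint between sorted feature values
    and return the split with the lowest total squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        threshold = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        # Each side predicts its own mean; sum the squared errors of both sides
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best

# Hours and Weight columns from the gym data in this post
hours = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0, 6.5])
weight = np.array([95, 85, 83, 81, 80, 78, 77, 80, 75, 70], dtype=float)

sse, threshold, left_mean, right_mean = best_split(hours, weight)
print('Split at Hours <=', threshold)
print('Prediction on the left side:', left_mean)
print('Prediction on the right side:', right_mean)
```

A full tree repeats this search on each side of the split, over all predictors, until a stopping rule such as max_depth is reached.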

import pandas as pd
import numpy as np
ColumnNames=['Hours','Calories', 'Weight']
DataValues=[[  1.0,   2500,   95],
             [  2.0,   2000,   85],
             [  2.5,   1900,   83],
             [  3.0,   1850,   81],
             [  3.5,   1600,   80],
             [  4.0,   1500,   78],
             [  5.0,   1500,   77],
             [  5.5,   1600,   80],
             [  6.0,   1700,   75],
             [  6.5,   1500,   70]]
#Create the Data Frame
GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)
GymData.head()

#Separate Target Variable and Predictor Variables
TargetVariable='Weight'
Predictors=['Hours','Calories']
X=GymData[Predictors].values
y=GymData[TargetVariable].values

#Split the data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


########################################################

###### Single Decision Tree Regression in Python #######
from sklearn import tree
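#Choose from different tunable hyperparameters
#Note: criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
RegModel = tree.DecisionTreeRegressor(max_depth=3,criterion='squared_error')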

#Printing all the parameters of Decision Tree
print(RegModel)

#Creating the model on Training Data
DTree=RegModel.fit(X_train,y_train)
prediction=DTree.predict(X_test)

#Measuring Goodness of fit in Training data
from sklearn import metrics
print('R2 Value:',metrics.r2_score(y_train, DTree.predict(X_train)))

#Measuring accuracy on Testing Data (100 minus the mean absolute percentage error)
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns
#(%matplotlib inline is a Jupyter notebook magic; remove it when running as a plain script)
%matplotlib inline
feature_importances = pd.Series(DTree.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()
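
The accuracy figure printed above is 100 minus the mean absolute percentage error (MAPE) on the test rows. As a minimal sketch, assuming the same gym data and the same train/test split, the snippet below wraps that calculation in a helper and prints the split rules the tree has learned using `sklearn.tree.export_text`. The helper name `mape_accuracy` is hypothetical and not part of scikit-learn, and the criterion is left at its default so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Same gym data as above: Hours, Calories, Weight
data = np.array([
    [1.0, 2500, 95], [2.0, 2000, 85], [2.5, 1900, 83], [3.0, 1850, 81],
    [3.5, 1600, 80], [4.0, 1500, 78], [5.0, 1500, 77], [5.5, 1600, 80],
    [6.0, 1700, 75], [6.5, 1500, 70],
])
X, y = data[:, :2], data[:, 2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

def mape_accuracy(y_true, y_pred):
    """Hypothetical helper: 100 minus the mean absolute percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100 - np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Default criterion (squared error); random_state fixed for repeatable ties
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Text view of the learned splits and the value predicted at each leaf
rules = export_text(model, feature_names=['Hours', 'Calories'])
print(rules)
print('Accuracy:', mape_accuracy(y_test, model.predict(X_test)))
```

With only two test rows, this accuracy number is a rough indication at best; on such a small data set it can change noticeably with a different random_state in the split.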

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains such as Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
