How to treat outliers in data in Python

Outliers are treated either by deleting them or by replacing them with a logical value, chosen based on business knowledge and on similar records in the data.

Consider the scenario below, where the SALARY column contains an outlier.

# Creating a sample balanced data frame
import pandas as pd
ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']
DataValues=[[480, 28, 610000, 'No'],
             [480, 42, 140000, 'No'],
             [480, 29, 420000, 'No'],
             [490, 30, 420000, 'No'],
             [500, 27, 420000, 'No'],
             [510, 34, 190000, 'No'],
             [550, 24, 330000, 'No'],
             [560, 34, 160000, 'No'],
             [560, 25, 300000, 'No'],
             [570, 34, 450000, 'No'],
             [590, 30, 140000, 'Yes'],
             [600, 33, 600000, 'Yes'],
             [600, 22, 400000, 'Yes'],
             [600, 25, 490000, 'Yes'],
             [610, 32, 120000, 'Yes'],
             [630, 29, 360000, 'Yes'],
             [630, 30, 480000, 'Yes'],
             [660, 29, 460000, 'Yes'],
             [700, 32, 470000, 'Yes'],
             [740, 28, 4500000, 'Yes']]

#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#########################################################
# Histogram for SALARY column
LoanData['SALARY'].hist(figsize=(8,3))

# Box-plot for SALARY column
LoanData.boxplot(['SALARY'],figsize=(8,3),vert=False)

Sample Output

Based on the above charts, you can easily spot the outlier point located beyond 4000000.
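
The visual check can also be backed by a numeric rule. The sketch below is an addition to the approach shown here, not part of it: it assumes the common 1.5 x IQR rule (the default that box-plot whiskers use) and rebuilds the SALARY values as a standalone Series so it runs on its own. The names SalaryValues, LowerBound and UpperBound are illustrative.

```python
import pandas as pd

# Same SALARY values as the sample data frame above
SalaryValues = pd.Series(
    [610000, 140000, 420000, 420000, 420000, 190000, 330000,
     160000, 300000, 450000, 140000, 600000, 400000, 490000,
     120000, 360000, 480000, 460000, 470000, 4500000],
    name='SALARY')

# First and third quartiles, and the interquartile range
Q1 = SalaryValues.quantile(0.25)
Q3 = SalaryValues.quantile(0.75)
IQR = Q3 - Q1

# Assumed rule: values more than 1.5 * IQR outside the quartiles are outliers
LowerBound = Q1 - 1.5 * IQR
UpperBound = Q3 + 1.5 * IQR

Outliers = SalaryValues[(SalaryValues < LowerBound) | (SalaryValues > UpperBound)]
print(Outliers)
```

On this data the rule flags only the 4500000 salary, which agrees with what the charts show. The 1.5 multiplier is a convention, so it may need adjusting for heavily skewed columns.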

Treating the outlier values

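You can sort the data on the column containing the outlier and look for the closest logical value, that is, the nearest value that is still consistent with the rest of the data.

Once you find the closest logical value, replace all the outlier points with that value. Here, the highest salary after the outlier is 610000, so that value is used.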

# Finding the closest logical salary value in data
LoanData.sort_values(by='SALARY',ascending=False).head()

# Replacing the outlier value with the closest logical value
outlierFilter=LoanData['SALARY']>4000000
LoanData.loc[outlierFilter,'SALARY']=610000

# Plotting the data again after outlier treatment
LoanData.boxplot(['SALARY'],figsize=(8,3),vert=False)

Sample Output
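
The replacement value does not have to be typed in by hand. As a minimal sketch, assuming the 4000000 cut-off read from the charts is still the right threshold, the closest logical value can be computed as the largest salary that is not an outlier. LoanSample, isOutlier and closestValue are illustrative names, and the small data frame is a cut-down stand-in for LoanData.

```python
import pandas as pd

# Cut-down stand-in for the SALARY column of LoanData
LoanSample = pd.DataFrame({'SALARY': [610000, 140000, 420000, 600000, 4500000]})

# Cut-off taken from the charts (an assumption for this sketch)
threshold = 4000000
isOutlier = LoanSample['SALARY'] > threshold

# Largest salary among the rows that are not outliers
closestValue = LoanSample.loc[~isOutlier, 'SALARY'].max()

# Replace every outlier with that value
LoanSample.loc[isOutlier, 'SALARY'] = closestValue
print(LoanSample)
```

Computing the value this way keeps the code correct if the data changes, whereas a hard-coded 610000 would silently go stale.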

Removing the outlier values

Remove outliers only when the number of outlier rows is very small compared to the total number of rows in the data, so that little information is lost.

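# Deleting the outlier rows from the data
# Note: run this on the original data. If the outlier was already
# replaced in the previous step, no rows will be removed here.
# The filter keeps the rows that are NOT outliers
nonOutlierFilter=LoanData['SALARY'] < 4000000
LoanData = LoanData[nonOutlierFilter]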

# Plotting the data again after outlier treatment
LoanData.boxplot(['SALARY'],figsize=(8,3),vert=False)

Sample Output
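
Whether the outlier rows are "much less" than the total can be checked before deleting anything. The sketch below is one possible way to do that: it measures the share of outlier rows and drops them only if the share is under a limit. The 10% limit (maxShare) is a hypothetical choice for illustration, not a rule from this article, and LoanSample and CleanData are illustrative names.

```python
import pandas as pd

# Same SALARY values as the sample data frame above
LoanSample = pd.DataFrame({'SALARY': [
    610000, 140000, 420000, 420000, 420000, 190000, 330000,
    160000, 300000, 450000, 140000, 600000, 400000, 490000,
    120000, 360000, 480000, 460000, 470000, 4500000]})

isOutlier = LoanSample['SALARY'] > 4000000

# Fraction of rows flagged as outliers (mean of a boolean column)
outlierShare = isOutlier.mean()

# Hypothetical limit: drop rows only if outliers are at most 10% of the data
maxShare = 0.10

if outlierShare <= maxShare:
    # .copy() gives an independent data frame for later edits
    CleanData = LoanSample[~isOutlier].copy()
else:
    CleanData = LoanSample.copy()

print(outlierShare, len(CleanData))
```

Here one row out of twenty is an outlier, so it is dropped and 19 rows remain.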


Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
