Outliers are treated either by deleting the affected rows or by replacing the outlier values with a logical value, chosen based on business knowledge and on similar records in the data.
Consider the scenario below, where the SALARY column contains an outlier.
    # Creating a sample balanced data frame
    import pandas as pd

    ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']
    DataValues=[[480, 28, 610000, 'No'],
                [480, 42, 140000, 'No'],
                [480, 29, 420000, 'No'],
                [490, 30, 420000, 'No'],
                [500, 27, 420000, 'No'],
                [510, 34, 190000, 'No'],
                [550, 24, 330000, 'No'],
                [560, 34, 160000, 'No'],
                [560, 25, 300000, 'No'],
                [570, 34, 450000, 'No'],
                [590, 30, 140000, 'Yes'],
                [600, 33, 600000, 'Yes'],
                [600, 22, 400000, 'Yes'],
                [600, 25, 490000, 'Yes'],
                [610, 32, 120000, 'Yes'],
                [630, 29, 360000, 'Yes'],
                [630, 30, 480000, 'Yes'],
                [660, 29, 460000, 'Yes'],
                [700, 32, 470000, 'Yes'],
                [740, 28, 4500000, 'Yes']]

    #Create the Data Frame
    LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
    print(LoanData.head())

    #########################################################
    # Histogram for SALARY column
    LoanData['SALARY'].hist(figsize=(8,3))

    # Box-plot for SALARY column
    LoanData.boxplot(['SALARY'],figsize=(8,3),vert=False)
Sample Output

Based on the above charts, you can easily spot the outlier point located beyond 4000000.
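The same point can also be flagged numerically instead of by eye. The sketch below is not part of the original walkthrough: it assumes the common 1.5 × IQR rule (the rule box-plot whiskers are usually drawn with), and the helper name `iqr_bounds` and the multiplier `k` are illustrative choices.

```python
import pandas as pd

# The 20 SALARY values from the sample data frame above
salaries = pd.Series(
    [610000, 140000, 420000, 420000, 420000, 190000, 330000, 160000,
     300000, 450000, 140000, 600000, 400000, 490000, 120000, 360000,
     480000, 460000, 470000, 4500000],
    name="SALARY",
)

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(salaries)

# Keep only the values that fall outside the fences
outliers = salaries[(salaries < lower) | (salaries > upper)]
print("Fences:", lower, upper)
print(outliers)
```

With this data only the 4500000 salary falls outside the fences, which agrees with what the charts show.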
Treating the outlier values
You can sort the data by the affected column and look at the top values to see which is the closest logical value to the rest of the data.
Once you find the closest logical value, replace all the outlier points with that value.
    # Finding the closest logical salary value in data
    print(LoanData.sort_values(by='SALARY',ascending=False).head())

    # Replacing the outlier value with the closest point
    outlierFilter=LoanData['SALARY']>4000000
    LoanData.loc[outlierFilter ,'SALARY']=610000

    # Plotting the data again after outlier treatment
    LoanData.boxplot(['SALARY'],figsize=(8,3),vert=False)
Sample Output

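The replacement value does not have to be typed in by hand. As a minimal sketch, assuming the "closest logical value" is simply the largest salary below the outlier cut-off, it can be looked up from the data. The small `demo` frame and the names `threshold` and `closest_value` are illustrative and not part of the original code.

```python
import pandas as pd

# A small illustrative frame with a few of the salaries from the sample data
demo = pd.DataFrame({"SALARY": [610000, 140000, 420000, 600000, 4500000]})

threshold = 4000000  # same cut-off as used above
is_outlier = demo["SALARY"] > threshold

# Assumption: the closest logical value is the largest non-outlier salary
closest_value = demo.loc[~is_outlier, "SALARY"].max()

# Replace every outlier with that value
demo.loc[is_outlier, "SALARY"] = closest_value
print(demo)
```

This keeps the treatment consistent if the data changes, at the cost of trusting that the largest remaining value is itself reasonable.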
Removing the outlier values
This is done only when the number of outlier rows is much smaller than the total number of rows in the data, so that little information is lost.
    # Deleting the outlier values from the data
    # (the filter keeps the rows below the threshold, i.e. the non-outliers)
    outlierFilter=LoanData['SALARY'] < 4000000
    LoanData = LoanData[outlierFilter]

    # Plotting the data again after outlier treatment
    LoanData.boxplot(['SALARY'],figsize=(8,3),vert=False)
Sample Output

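The "only when outliers are few" condition can be made explicit in code. The following is a sketch under stated assumptions: the function name `drop_outliers_if_few` and the 10% limit (`max_fraction`) are hypothetical choices for illustration, and the right limit depends on the data and the business context.

```python
import pandas as pd

def drop_outliers_if_few(df, column, threshold, max_fraction=0.10):
    """Drop rows above `threshold` only if they make up at most
    `max_fraction` of all rows; otherwise return the data unchanged."""
    is_outlier = df[column] > threshold
    if is_outlier.mean() <= max_fraction:
        return df[~is_outlier]
    return df

# The 20 SALARY values from the sample data
salaries = pd.DataFrame({"SALARY": [
    610000, 140000, 420000, 420000, 420000, 190000, 330000, 160000,
    300000, 450000, 140000, 600000, 400000, 490000, 120000, 360000,
    480000, 460000, 470000, 4500000]})

# One row out of 20 exceeds the cut-off, so it is dropped
cleaned = drop_outliers_if_few(salaries, "SALARY", threshold=4000000)
print(len(salaries), "->", len(cleaned))

# With a cut-off that flags more than half the rows, nothing is dropped
unchanged = drop_outliers_if_few(salaries, "SALARY", threshold=400000)
print(len(salaries), "->", len(unchanged))
```

When the check fails, replacing the values as in the previous section is usually the safer option than deleting a large share of the rows.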
Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!
