How to remove duplicate rows from a pandas DataFrame

Duplicate rows can be deleted from a pandas DataFrame using the drop_duplicates() function.

By default (subset=None), two rows count as duplicates only when they have the same values in every column.

Alternatively, you can pass a list of columns to subset. If two rows have the same values in those columns, they are treated as duplicates and the later row is dropped in full, even if its other columns differ.

The option keep='first' keeps the first occurrence of each duplicated row and deletes every later occurrence. Use keep='last' to keep the last occurrence instead, or keep=False to drop all occurrences.

# Defining Employee Data
import pandas as pd
EmpData = pd.DataFrame({'Name': ['ram', 'ravi', 'sham', 'sita', 'sita'],
                        'id': [101, 102, 103, 104, 104],
                        'Gender': ['M', 'M', 'M', 'F', 'F'],
                        'Age': [21, 25, 24, 25, 25],
                        'ExpMonths': [1.5, 2, 3, 7, 7]
                        })
# Printing data before removing duplicates
print(EmpData)

# Removing duplicate rows based on all columns
# inplace=True modifies EmpData itself and returns None
EmpData.drop_duplicates(subset=None, keep='first', inplace=True)

# Printing data after removing duplicates (the second 'sita' row is gone)
print(EmpData)
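To compare the three values of keep side by side, here is a minimal sketch on the same employee data. The variable names (emp, kept_first, kept_last, kept_none) are only illustrative, and inplace is left at its default so that each call returns a new DataFrame and the original stays untouched.

```python
import pandas as pd

# Same employee data as above; rows 3 and 4 are identical
emp = pd.DataFrame({
    'Name': ['ram', 'ravi', 'sham', 'sita', 'sita'],
    'id': [101, 102, 103, 104, 104],
    'Gender': ['M', 'M', 'M', 'F', 'F'],
    'Age': [21, 25, 24, 25, 25],
    'ExpMonths': [1.5, 2, 3, 7, 7],
})

# Without inplace=True, drop_duplicates() returns a new DataFrame
kept_first = emp.drop_duplicates(keep='first')  # keeps row 3, drops row 4
kept_last = emp.drop_duplicates(keep='last')    # keeps row 4, drops row 3
kept_none = emp.drop_duplicates(keep=False)     # drops both row 3 and row 4

print(kept_first.index.tolist())  # [0, 1, 2, 3]
print(kept_last.index.tolist())   # [0, 1, 2, 4]
print(kept_none.index.tolist())   # [0, 1, 2]
```

Note that the surviving rows keep their original index labels, which is why the three results differ only in which of the labels 3 and 4 remain.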

Sample Data (figure): Dropping duplicate rows from the pandas data frame

Dropping duplicate rows based on a few columns

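You can also drop rows based on only a few selected columns by passing the list of columns to the subset parameter of drop_duplicates(). In the example below, 'sita' and 'gita' have the same Gender and Age, so with keep='first' the 'gita' row is dropped even though its other columns differ.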

# Defining Employee Data
import pandas as pd
EmpData = pd.DataFrame({'Name': ['ram', 'ravi', 'sham', 'sita', 'gita'],
                        'id': [101, 102, 103, 104, 105],
                        'Gender': ['M', 'M', 'M', 'F', 'F'],
                        'Age': [21, 25, 24, 25, 25],
                        'ExpMonths': [1.5, 2, 3, 12, 7]
                        })
# Printing data before removing duplicates
print(EmpData)

# Removing duplicate rows based on Gender and Age only
EmpData.drop_duplicates(subset=['Gender', 'Age'], keep='first', inplace=True)

# Printing data after removing duplicates
print(EmpData)
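Because a subset-based drop can discard rows that differ in other columns, it can help to preview what will be removed first. The sketch below is one possible way to do that with duplicated(), which takes the same subset and keep arguments. The names emp, dup_mask and deduped are illustrative, and ignore_index=True assumes pandas 1.0 or newer.

```python
import pandas as pd

# Same employee data as above; 'sita' and 'gita' share Gender and Age
emp = pd.DataFrame({
    'Name': ['ram', 'ravi', 'sham', 'sita', 'gita'],
    'id': [101, 102, 103, 104, 105],
    'Gender': ['M', 'M', 'M', 'F', 'F'],
    'Age': [21, 25, 24, 25, 25],
    'ExpMonths': [1.5, 2, 3, 12, 7],
})

# Boolean mask: True for each row that drop_duplicates() would remove
dup_mask = emp.duplicated(subset=['Gender', 'Age'], keep='first')

# Preview the rows that are about to be dropped
print(emp[dup_mask])

# Drop them and renumber the index from 0 (assumes pandas >= 1.0)
deduped = emp.drop_duplicates(subset=['Gender', 'Age'], keep='first',
                              ignore_index=True)
print(deduped)
```

Here only the 'gita' row is flagged, since it is the second row with Gender 'F' and Age 25.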

Sample Output (figure): Dropping duplicate rows based on a few columns

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
