What to do after reading data in Pandas?

Once you have read data into Python, the next step is to identify and reject the useless columns. The commands below help you understand the data and decide which category each column belongs to.

Every column in the data can be classified into one of the categories below.

Anatomy of Data

There are 3 major types of data listed below:

Quantitative (Numbers which cannot be grouped naturally) e.g. sales, prices, age
Qualitative (Strings which cannot be grouped naturally) e.g. address, names, reviews
Categorical (Strings or Numbers which CAN be grouped) e.g. gender, color, sentiment, size
- Nominal Categorical (Categorical data which has no natural order) e.g. gender, color
- Ordinal Categorical (Categorical data which HAS a natural order) e.g. size (S<M<L)
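
The difference between nominal and ordinal categories can be made explicit in pandas. The sketch below is only an illustration: the DataFrame `df` and all of its values are invented, and `pd.Categorical` is used here as one possible way to record that size has a natural order while color does not.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "price": [10.5, 20.0, 15.25, 30.0],                            # Quantitative
    "review": ["good fit", "too tight", "love it", "runs large"],  # Qualitative
    "color": ["red", "blue", "red", "green"],                      # Nominal categorical
    "size": ["S", "L", "M", "S"],                                  # Ordinal categorical
})

# Nominal categorical: the values can be grouped but have no order
df["color"] = pd.Categorical(df["color"])

# Ordinal categorical: declare the natural order S < M < L explicitly
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L"], ordered=True)

print(df.dtypes)

# Ordered categories support min/max and comparisons
print(df["size"].min(), df["size"].max())  # S L
print((df["size"] > "S").tolist())         # [False, True, True, False]
```

Declaring the order matters only for ordinal columns. Asking for the minimum of an unordered column such as `color` has no meaningful answer.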

For supervised machine learning we use Quantitative and Categorical data. Raw Qualitative data usually has a new string value in every row, so it holds no repeating pattern for a machine learning algorithm to learn.

That doesn't mean Qualitative data is completely useless! We can extract features from Qualitative data and create new variables, for example, zip code from address or sentiment from reviews. These new columns can be used in machine learning. The process of creating new columns from existing columns is known as feature engineering.
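
As a rough illustration of feature engineering, the sketch below pulls a zip code out of a free-text address column. The `customers` DataFrame, the addresses, and the assumption that every address ends in a 5-digit zip code are all made up for this example. Real address data would usually need more careful parsing.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration only
customers = pd.DataFrame({
    "address": [
        "12 Oak Street, Springfield 62704",
        "98 Pine Avenue, Portland 97205",
        "7 Lake Road, Springfield 62704",
    ]
})

# Assumption: the zip code is the 5-digit number at the end of the address
customers["zip_code"] = customers["address"].str.extract(r"(\d{5})\s*$", expand=False)

# Every address is different, but the extracted zip codes repeat,
# so the new column can be treated as categorical
print(customers["address"].nunique())   # 3
print(customers["zip_code"].nunique())  # 2
```

The original `address` column is still qualitative, while the derived `zip_code` column groups rows together and can be used as a predictor.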

The commands below are illustrated using a sample DataFrame, “SimpleDataFrame”. You can run the same commands on any DataFrame variable you have.

# Data Exploration commands
#############################

# Creating a sample dataframe
import pandas as pd

DataSample = [[101, 'CS', 'M', 22],
              [102, 'IT', 'F', 24],
              [103, 'IT', 'M', 29],
              [104, 'INFRA', 'F', 23],
              [105, 'CS', 'M', 25]]

SimpleDataFrame = pd.DataFrame(data=DataSample, columns=['ID', 'Dept', 'Gender', 'Age'])
print(SimpleDataFrame)

# Printing Top few records
SimpleDataFrame.head(3)

# printing Bottom few records
SimpleDataFrame.tail(3)

# Printing number of unique values in each column
SimpleDataFrame.nunique()

# Print summary information about DataFrame
SimpleDataFrame.info()

# Print dimension information (ROWS, COLS) about DataFrame
SimpleDataFrame.shape

# Finding the list of columns in the data
SimpleDataFrame.columns

By this point you should have a fair idea of what the DataFrame contains, so you can start dropping the qualitative variables, since in their raw form they are not useful for supervised machine learning.
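
One possible way to carry out this step is sketched below. The sample rows are the same as in SimpleDataFrame, but the `Name` column and its values are hypothetical, added only so there is a qualitative column to drop. The rule used to spot qualitative columns (a non-numeric column whose values are all distinct) is a simple heuristic assumed for this example, not a fixed rule.

```python
import pandas as pd

# Same sample rows as above, plus a hypothetical 'Name' column
DataSample = [[101, 'Asha', 'CS', 'M', 22],
              [102, 'Bela', 'IT', 'F', 24],
              [103, 'Chen', 'IT', 'M', 29],
              [104, 'Dina', 'INFRA', 'F', 23],
              [105, 'Evan', 'CS', 'M', 25]]

SimpleDataFrame = pd.DataFrame(data=DataSample,
                               columns=['ID', 'Name', 'Dept', 'Gender', 'Age'])

# Heuristic (an assumption): a non-numeric column with a different
# value in every row is treated as qualitative
TotalRows = len(SimpleDataFrame)
QualitativeCols = [
    col for col in SimpleDataFrame.columns
    if not pd.api.types.is_numeric_dtype(SimpleDataFrame[col])
    and SimpleDataFrame[col].nunique() == TotalRows
]
print(QualitativeCols)  # ['Name']

# Dropping the qualitative columns
CleanDataFrame = SimpleDataFrame.drop(columns=QualitativeCols)
print(CleanDataFrame.columns.tolist())  # ['ID', 'Dept', 'Gender', 'Age']
```

On small samples this heuristic can misfire, for example when a genuine categorical column happens to have no repeated values, so the flagged columns are worth reviewing by eye before dropping them.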

The commands below produce a statistical summary of every variable in the data, which helps you understand its spread and quality.

# Print statistical information about the DataFrame
# 1. Measures of Central Tendency -- Mean, Median (the 50% row), Mode (the 'top' row for non-numeric columns)
# 2. Measures of Spread -- Min, Max, StandardDeviation, Quartiles

# Statistical information for the numeric variables only
SimpleDataFrame.describe()

# Statistical information for both numeric and character variables
SimpleDataFrame.describe(include='all')
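
The same measures can also be computed one column at a time, which is handy when you need a single number and not the whole summary table. The sketch below rebuilds the sample DataFrame so that it runs on its own. The variable names such as `AgeMean` are chosen only for this example.

```python
import pandas as pd

DataSample = [[101, 'CS', 'M', 22],
              [102, 'IT', 'F', 24],
              [103, 'IT', 'M', 29],
              [104, 'INFRA', 'F', 23],
              [105, 'CS', 'M', 25]]

SimpleDataFrame = pd.DataFrame(data=DataSample,
                               columns=['ID', 'Dept', 'Gender', 'Age'])

# Measures of central tendency for a numeric column
AgeMean = SimpleDataFrame['Age'].mean()
AgeMedian = SimpleDataFrame['Age'].median()
print(AgeMean, AgeMedian)  # 24.6 24.0

# Measure of spread for a numeric column
AgeStd = SimpleDataFrame['Age'].std()
print(AgeStd)

# Mode of a categorical column; mode() returns every value tied for first place
DeptMode = SimpleDataFrame['Dept'].mode().tolist()
print(DeptMode)  # ['CS', 'IT']

# Frequency of each category
GenderCounts = SimpleDataFrame['Gender'].value_counts()
print(GenderCounts)
```

Note that `describe(include='all')` reports only one `top` value per column, so a tie like the one between CS and IT in `Dept` is visible only through `mode()` or `value_counts()`.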

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
