What to do after reading data in Pandas?

Once you have read data into Python, the next step is to start rejecting the useless columns. Run the commands below to understand the data and to decide which category each column belongs to.

Every column in the data can be classified into one of the categories below.

Anatomy of Data

There are 3 major types of data listed below:

  1. Quantitative (numbers which cannot be grouped naturally), e.g. sales, prices, age
  2. Qualitative (strings which cannot be grouped naturally), e.g. address, names, reviews
  3. Categorical (strings or numbers which CAN be grouped), e.g. gender, color, sentiment, size
    • Nominal Categorical (categorical data which has no natural order), e.g. gender, color
    • Ordinal Categorical (categorical data which HAS a natural order), e.g. size (S<M<L)
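
The difference between nominal and ordinal categorical data can be sketched in pandas as follows. The sample values are invented for illustration, and using `pd.CategoricalDtype` is only one possible way to represent the order; it is not necessarily how the original article handled it.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
sizes = pd.Series(["M", "S", "L", "M"])

# Ordinal categorical: declare the natural order S < M < L explicitly
size_dtype = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
ordinal_sizes = sizes.astype(size_dtype)

# Nominal categorical: no order is implied between the categories
colors = pd.Series(["red", "blue", "red", "green"]).astype("category")

# Ordered categories support min/max and comparisons
print(ordinal_sizes.min(), ordinal_sizes.max())
print(ordinal_sizes > "S")

# Nominal categories report that they are unordered
print(colors.cat.ordered)
```

Declaring the order explicitly matters because pandas would otherwise sort the labels alphabetically (L, M, S), which does not match the real-world order.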

For supervised machine learning, we use Quantitative and Categorical data. Raw Qualitative data usually has a different string value in almost every row, so it holds no repeating pattern for a machine learning algorithm to learn.

That doesn’t mean Qualitative data is completely useless! We can extract features from Qualitative data and create new variables, for example, the zip code from an address or the sentiment from a review. These new columns can be used in machine learning. The process of creating new columns from existing columns is known as feature engineering.
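
As a minimal sketch of feature engineering, the snippet below pulls a zip code out of a free-text address column. The addresses, the column names, and the assumption that the zip code is a 5-digit number at the end of the string are all hypothetical; real address data would likely need a more careful pattern.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration
df = pd.DataFrame({
    "address": [
        "12 Main St, Springfield, 62704",
        "9 Oak Ave, Portland, 97205",
        "No zip code here",
    ]
})

# Feature engineering: extract a trailing 5-digit zip code from the text.
# expand=False returns a Series; rows without a match become NaN.
df["zip_code"] = df["address"].str.extract(r"(\d{5})\s*$", expand=False)

print(df)
```

The new `zip_code` column is categorical (many rows can share the same value), so it can carry a pattern that the raw address string could not.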

The commands in this section are illustrated using a sample DataFrame named “SimpleDataFrame”. Run the same commands on whichever DataFrame variable you have.
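
A minimal sketch of such a first inspection is shown here. The contents of `SimpleDataFrame` are made up for illustration, and the selection of commands (`head`, `shape`, `info`, `nunique`) is an assumption about what a typical first look includes.

```python
import pandas as pd

# Hypothetical sample data standing in for "SimpleDataFrame"
SimpleDataFrame = pd.DataFrame({
    "name": ["Ram", "Sita", "John", "Maria"],
    "age": [25, 32, 41, 32],
    "gender": ["M", "F", "M", "F"],
    "salary": [50000.0, 62000.0, 58000.0, 71000.0],
})

# Look at the first few rows to see what the values look like
print(SimpleDataFrame.head())

# Number of rows and columns
print(SimpleDataFrame.shape)

# Column names, data types, and non-null counts
SimpleDataFrame.info()

# Number of unique values per column: a column with as many unique
# values as rows (here "name") is a candidate qualitative column
unique_counts = SimpleDataFrame.nunique()
print(unique_counts)
```

Columns with only a few unique values, such as `gender` here, are likely categorical, while numeric columns with many distinct values are likely quantitative.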

At this point, you will have a fair idea about the DataFrame, so you can start dropping the qualitative variables, as they will not be useful in supervised ML in their raw form.
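
Dropping qualitative columns could look like the sketch below. The sample data and the choice of `name` as the qualitative column are hypothetical; which columns to drop depends on your own inspection of the data.

```python
import pandas as pd

# Hypothetical sample data standing in for "SimpleDataFrame"
SimpleDataFrame = pd.DataFrame({
    "name": ["Ram", "Sita", "John", "Maria"],
    "age": [25, 32, 41, 32],
    "gender": ["M", "F", "M", "F"],
    "salary": [50000.0, 62000.0, 58000.0, 71000.0],
})

# Columns judged to be qualitative for this hypothetical data
qualitative_columns = ["name"]

# drop() returns a new DataFrame; the original is left unchanged
CleanedDataFrame = SimpleDataFrame.drop(columns=qualitative_columns)

print(CleanedDataFrame.columns.tolist())
```

Keeping the result in a new variable, instead of overwriting the original, makes it easy to go back and engineer features from a dropped column later.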

The next set of commands produces a statistical summary of every variable in the data. This helps you understand the spread and quality of the data.
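
A statistical summary of this kind can be sketched with `describe()` and `isnull()`. The sample data, including the deliberately missing salary value, is hypothetical and only meant to show what the output reveals.

```python
import pandas as pd

# Hypothetical sample data with one missing salary value
SimpleDataFrame = pd.DataFrame({
    "age": [25, 32, 41, 32],
    "gender": ["M", "F", "M", "F"],
    "salary": [50000.0, None, 58000.0, 71000.0],
})

# Summary of numeric columns only: count, mean, std, min, quartiles, max
summary_numeric = SimpleDataFrame.describe()
print(summary_numeric)

# Summary of all columns, including unique/top/freq for non-numeric ones
summary_all = SimpleDataFrame.describe(include="all")
print(summary_all)

# Missing values per column, a quick check on data quality
missing_counts = SimpleDataFrame.isnull().sum()
print(missing_counts)
```

The `count` row of `describe()` excludes missing values, so a count lower than the number of rows is an early signal of a data quality problem in that column.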

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!
