Once you have read the data into Python, the next step is to reject the useless columns. Run the commands below to understand the data and to identify which category each column belongs to.
Every column in the data falls into one of the categories below.
Anatomy of Data
There are 3 major types of data listed below:
- Quantitative (numbers which cannot be grouped naturally), e.g. sales, prices, age
- Qualitative (strings which cannot be grouped naturally), e.g. address, names, reviews
- Categorical (strings or numbers which CAN be grouped), e.g. gender, color, sentiment, size
  - Nominal Categorical (categorical data which has no natural order), e.g. gender, color
  - Ordinal Categorical (categorical data which HAS a natural order), e.g. size (S<M<L)
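The difference between nominal and ordinal data can be made explicit in pandas. The following is a minimal sketch, assuming hypothetical `sizes` and `colors` values invented for illustration; it uses `pd.Categorical` with `ordered=True` to declare the order S<M<L.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration
sizes = pd.Series(['M', 'S', 'L', 'S', 'M'])
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue'])

# Ordinal: declare the order explicitly so that S < M < L
size_cat = pd.Categorical(sizes, categories=['S', 'M', 'L'], ordered=True)

# Nominal: the values form groups, but no order is implied
color_cat = pd.Categorical(colors)

# Integer codes give each value's position in ['S', 'M', 'L']
size_codes = size_cat.codes.tolist()
print(size_codes)

# min() and max() are meaningful only because the order was declared
print(size_cat.min(), size_cat.max())

# The nominal column reports that it has no order
print(color_cat.ordered)
```

Declaring the order lets comparisons and sorting respect S<M<L instead of falling back to alphabetical order.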
For supervised machine learning we use Quantitative and Categorical data. Raw Qualitative data has a new string value in almost every row, so it holds no pattern for a machine learning algorithm to learn.
That doesn’t mean Qualitative data is completely useless! We can extract features from Qualitative data and create new variables, for example, zip code from address or sentiment from reviews. These new columns can be used in machine learning. The process of creating new columns from existing columns is known as feature engineering.
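As an illustration of feature engineering, here is a minimal sketch that derives a zip code column from an address column. The `people` DataFrame and its addresses are hypothetical, and the pattern assumes the zip code is a 5-digit number at the end of each address; real addresses will need a more careful rule.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration
people = pd.DataFrame({
    'Address': [
        '12 Main St, Springfield, 62704',
        '98 Lake Rd, Madison, 53703',
        '7 Hill Ave, Austin, 73301',
    ]
})

# Assumption: the zip code is the 5-digit run at the end of the string
people['ZipCode'] = people['Address'].str.extract(r'(\d{5})\s*$', expand=False)

print(people)
```

The raw `Address` column is different in every row, while the derived `ZipCode` column can repeat across rows and so can be treated as categorical.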
The commands below are illustrated using a sample DataFrame, “SimpleDataFrame”. Run the same commands on any DataFrame variable of your own.
```python
# Data Exploration commands
#############################
import pandas as pd

# Creating a sample dataframe
DataSample = [[101, 'CS', 'M', 22],
              [102, 'IT', 'F', 24],
              [103, 'IT', 'M', 29],
              [104, 'INFRA', 'F', 23],
              [105, 'CS', 'M', 25],
              ]

SimpleDataFrame = pd.DataFrame(data=DataSample, columns=['ID', 'Dept', 'Gender', 'Age'])
print(SimpleDataFrame)

# Printing top few records
SimpleDataFrame.head(3)

# Printing bottom few records
SimpleDataFrame.tail(3)

# Printing number of unique values in each column
SimpleDataFrame.nunique()

# Print summary information about DataFrame
SimpleDataFrame.info()

# Print dimension information (ROWS, COLS) about DataFrame
SimpleDataFrame.shape

# Finding the list of columns in the data
SimpleDataFrame.columns
```
At this point you will have a fair idea of what the DataFrame contains, so you can start dropping the qualitative variables, since in their raw form they are not useful in supervised ML.
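Dropping qualitative columns could look like the sketch below. It rebuilds the sample data with a hypothetical `Name` column added for illustration, and it relies on a heuristic that is an assumption rather than a rule: a string column whose value is different in every row is treated as qualitative. On small samples this heuristic can misfire, so the flagged columns should be reviewed before dropping.

```python
import pandas as pd

# Sample data with a hypothetical 'Name' column added for illustration
employees = pd.DataFrame(
    data=[[101, 'Alex', 'CS', 'M', 22],
          [102, 'Bina', 'IT', 'F', 24],
          [103, 'Carl', 'IT', 'M', 29],
          [104, 'Dina', 'INFRA', 'F', 23],
          [105, 'Evan', 'CS', 'M', 25]],
    columns=['ID', 'Name', 'Dept', 'Gender', 'Age'])

# Heuristic (assumption): a string column with a distinct value
# in every row is treated as qualitative
qualitative_cols = [
    col for col in employees.columns
    if pd.api.types.is_string_dtype(employees[col])
    and employees[col].nunique() == len(employees)
]
print(qualitative_cols)

# drop() returns a new DataFrame; the original is left unchanged
reduced = employees.drop(columns=qualitative_cols)
print(reduced.columns.tolist())
```

`Dept` and `Gender` are kept because their values repeat across rows, which is what makes them categorical.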
The commands below produce a statistical summary for all the variables in the data, which helps in understanding the spread and quality of the data.
```python
# Print statistical information about DataFrame
# 1. Measures of Central Tendency -- Mean, Median, Mode
# 2. Measures of Spread -- Min, Max, StandardDeviation, Quartiles

# Statistical information for all numeric variables
SimpleDataFrame.describe()

# Statistical information for numeric and character variables both
SimpleDataFrame.describe(include='all')
```
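The measures of central tendency named in the comments above can also be computed one at a time. This is a minimal sketch that recreates the same sample data so it runs on its own; the variable names `age_mean`, `age_median`, and `gender_mode` are chosen here for illustration.

```python
import pandas as pd

# Recreate the sample data so this block is self-contained
SimpleDataFrame = pd.DataFrame(
    data=[[101, 'CS', 'M', 22],
          [102, 'IT', 'F', 24],
          [103, 'IT', 'M', 29],
          [104, 'INFRA', 'F', 23],
          [105, 'CS', 'M', 25]],
    columns=['ID', 'Dept', 'Gender', 'Age'])

# Mean and median apply to quantitative columns
age_mean = SimpleDataFrame['Age'].mean()
age_median = SimpleDataFrame['Age'].median()

# Mode also applies to categorical columns; mode() returns a Series
# because more than one value can be tied for most frequent
gender_mode = SimpleDataFrame['Gender'].mode().tolist()

print(age_mean, age_median, gender_mode)
```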
