Identifying the correct columns (features/variables) in the data is the most important step in machine learning. If this goes wrong, everything that follows goes wrong!
The most common question is: how do I know which columns in my data are important for machine learning?
Feature selection is an iterative process: you keep rejecting bad columns using the various techniques available. The steps used in supervised machine learning are listed below.
- Identify the data type of each column: Quantitative, Qualitative, or Categorical. Qualitative data cannot be used directly, so we derive new columns from it (Feature Engineering) and use those new columns, which can be quantitative or categorical.
- Look at the data distribution of the Target variable, using either a histogram (quantitative variable) or a bar plot (categorical variable). If the target variable is highly skewed, this often indicates outliers, which must be treated before machine learning.
- Understand the meaning of each available predictor.
- Look at the data distribution of each available predictor, using either a histogram (quantitative variable) or a bar plot (categorical variable). If a column has outliers, they must be treated before that column is used in machine learning.
- Measure the correlation between each Predictor and the Target variable.
- Look at the variable importance charts from algorithms such as Random Forest, AdaBoost, and XGBoost.
- Select the features that rank high in the variable importance charts of multiple algorithms.
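Two of the steps above, checking skewness and measuring the correlation between each predictor and the target, can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the column names (`area_sqft`, `num_rooms`, `row_id`), the target values, and the helper functions `pearson`, `skewness`, and `rank_by_correlation` are all made up for this example, and Pearson correlation is only one of several possible measures for a quantitative predictor and a quantitative target.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def skewness(xs):
    """Population skewness (third standardized moment) of a numeric list."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def rank_by_correlation(columns, target):
    """Sort predictors by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data, invented purely for illustration.
predictors = {
    "area_sqft": [500, 750, 1000, 1250, 1500, 1750, 2000, 2250],
    "num_rooms": [1, 2, 2, 3, 3, 4, 4, 5],
    "row_id": [3, 7, 1, 8, 2, 6, 4, 5],  # an identifier, not a real signal
}
price = [50, 76, 99, 127, 148, 176, 201, 224]

# Step: check the target distribution. A large absolute skewness is a hint
# (not proof) that outliers may be present.
target_skew = skewness(price)
skew_with_outlier = skewness([1, 2, 3, 4, 100])

# Step: measure the correlation between each predictor and the target.
ranking = rank_by_correlation(predictors, price)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data the identifier column `row_id` lands at the bottom of the ranking, which makes it a candidate for rejection. The variable importance steps need a modelling library; in scikit-learn, for instance, fitted tree ensembles such as `RandomForestClassifier` expose a `feature_importances_` attribute that can be compared across algorithms in the same way.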
Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial Intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains such as Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!