How to convert text data into numeric data in Python

When you need to build a classification model on free-text input such as user comments or reviews, the text must first be represented as numeric columns. This process is known as vectorization of text: simply put, representing text by a set of numeric columns.

There are two major approaches to do this.

  • Count Vectorization
  • TF-IDF Vectorization

Count Vectorization

Consider the data below, where the requirement is to create a predictive model that classifies a support ticket as P1/P2/P3 based on the ticket description. Before using a supervised ML algorithm, we need to convert the description text into numeric form.

Sample Output:

Ticket classification data

Using count vectorization, the text can be converted into a numeric format. The result is known as a Document Term Matrix. Each column represents one of the unique important words found across all the text, each row represents one sentence (document), and each cell holds the frequency of that word in that sentence.
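A minimal sketch of how such a matrix can be built, using only the Python standard library so the mechanics stay visible. The ticket texts and the `tokenize` helper are hypothetical and are not the article's sample data. For simplicity the sketch keeps every word; filtering out unimportant words (stop words) would be an extra step. In practice, scikit-learn's `CountVectorizer` is commonly used for this.

```python
import re
from collections import Counter

# Hypothetical ticket descriptions (assumed for illustration only).
tickets = [
    "server is down",
    "password reset request",
    "server password expired",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every unique word across all documents, sorted for a stable column order.
vocabulary = sorted({word for doc in tickets for word in tokenize(doc)})

# Document-term matrix: one row per document, one column per vocabulary word.
dtm = []
for doc in tickets:
    counts = Counter(tokenize(doc))
    dtm.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each printed row lines up with the printed vocabulary, so the first row has a 1 under "down", "is" and "server" and 0 everywhere else.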

Sample Output

Document Term Matrix in Python using Term Counts


TF-IDF Vectorization

  • TF-IDF is a composite score representing the power of a given word to uniquely identify the document
  • It is computed by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF):
      TF: (number of times a word occurs in a document) / (total words in that document)
      IDF: log(total number of documents / number of documents containing the given word)

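  • If a word appears in almost every document, its IDF is near zero; the rarer the word is across documents, the larger its IDF becomes (it is not capped at 1)
  • The higher the tf-idf value of a word, the more distinctive it is: it occurs often in that document but rarely in the others
  • If the tf-idf is close to zero, the word is either very commonly used across documents or barely occurs in that document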
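The TF and IDF definitions above can be sketched directly in plain Python. The tokenized documents and the `tf_idf` function name are assumptions made for illustration. Note that libraries such as scikit-learn's `TfidfVectorizer` apply a smoothed IDF and normalize each row by default, so their numbers will differ from this unsmoothed version.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents (assumed for illustration only).
docs = [
    ["server", "is", "down"],
    ["server", "password", "expired"],
    ["server", "reset", "request"],
]

def tf_idf(docs):
    """Return one {word: tf-idf} dict per document, using TF * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total_words = len(doc)
        scores.append({
            word: (count / total_words) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
for doc_scores in scores:
    print(doc_scores)
```

Here "server" occurs in every document, so its IDF is log(3/3) = 0 and its tf-idf is 0 everywhere, while a word such as "down" that occurs in a single document gets an IDF of log(3), which is greater than 1.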

Sample Output

TF-IDF Document Term Matrix


What to do with vectorized text?

  • This data can further be used in machine learning
  • If the text data also has a target variable, e.g. sentiment (positive/negative) or Support Ticket Priority (P1/P2/P3), then these word columns act as predictors and we can fit a classification ML algorithm

Sample Output

Vectorized data for Machine Learning

Here the Target variable is “Priority” and other columns are predictors.
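As a rough illustration of using word columns as predictors, the sketch below fits a small multinomial Naive Bayes classifier written in plain Python. The article does not prescribe a particular algorithm, so Naive Bayes is only one possible choice; the data, column names, and the `fit_naive_bayes` and `predict` helpers are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical vectorized data: word-count columns are the predictors,
# and the priority label of each ticket is the target.
columns = ["down", "outage", "password", "reset", "server"]
X = [
    [1, 1, 0, 0, 1],  # "server down outage"
    [1, 0, 0, 0, 1],  # "server down"
    [0, 0, 1, 1, 0],  # "password reset"
    [0, 0, 1, 0, 0],  # "password"
]
y = ["P1", "P1", "P3", "P3"]

def fit_naive_bayes(X, y, alpha=1.0):
    """Fit multinomial Naive Bayes with Laplace smoothing on count vectors."""
    n_features = len(X[0])
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        log_prior = math.log(len(rows) / len(X))
        # Total count of each word within this class.
        word_totals = [sum(col) for col in zip(*rows)]
        denom = sum(word_totals) + alpha * n_features
        log_likelihood = [math.log((t + alpha) / denom) for t in word_totals]
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, row):
    """Return the label with the highest log-posterior score for one count vector."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
    return max(model, key=score)

model = fit_naive_bayes(X, y)
print(predict(model, [1, 0, 0, 0, 1]))  # counts for "server down"
print(predict(model, [0, 0, 1, 1, 0]))  # counts for "password reset"
```

With real data, a new ticket must be vectorized with the same vocabulary and column order as the training rows before it is passed to the classifier.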

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!
