When you need to build a classification model from free-text input such as user comments or reviews, the text must first be represented as numeric columns. This process is known as text vectorization: simply put, representing text as a set of numeric columns.
There are two major approaches to doing this:
- Count Vectorization
- TF-IDF Vectorization
Count Vectorization
Consider the data below, where the requirement is to create a predictive model that classifies a support ticket as P1/P2/P3 based on the ticket description. Before using a supervised ML algorithm, we need to convert the description text into numeric form.
```python
# Creating a sample ticket data set
import pandas as pd

TicketData = pd.DataFrame(data=[['Hi Please reset my password, i am not able to reset it', 'P3'],
                                ['Hi Please reset my password', 'P3'],
                                ['Hi The system is down please restart it', 'P1'],
                                ['Not able to login can you check?', 'P3'],
                                ['The data is not getting exported', 'P2']],
                          columns=['Text', 'Priority'])

# Printing the data
TicketData
```
Sample Output:

Using count vectorization, the text can be converted into numeric format. The result is known as a Document Term Matrix. Each column represents one of the unique important words found across all the texts, each row represents one sentence, and each cell holds the number of times that word occurs in that sentence.
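To make the cell values concrete before running scikit-learn, here is a tiny hand count for a single ticket description (a minimal sketch with ad-hoc tokenization; the sentence is taken from the sample data above):

```python
# Counting word occurrences by hand for one ticket description
# (a rough sketch of what each cell of the Document Term Matrix holds)
from collections import Counter

sentence = "Hi Please reset my password, i am not able to reset it"

# Lowercase and strip punctuation, roughly what CountVectorizer's default tokenizer does
tokens = [word.strip(',').lower() for word in sentence.split()]
counts = Counter(tokens)

print(counts['reset'])  # 2 -> the value that ends up in the 'reset' column for this row
```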
```python
# Count vectorization of text
import pandas as pd
# ENGLISH_STOP_WORDS now lives in sklearn.feature_extraction.text in newer scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

corpus = TicketData['Text'].values

# Pass stop_words=None to keep the stop words as columns
# vectorizer = CountVectorizer(stop_words=None)
vectorizer = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() replaces the deprecated get_feature_names() in newer scikit-learn
print(vectorizer.get_feature_names_out())
print(X.shape)

# Visualizing the Document Term Matrix
VectorizedText = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
VectorizedText['originalText'] = pd.Series(corpus)
VectorizedText
```
Sample Output

TF-IDF Vectorization
- TF-IDF is a composite score representing the power of a given word to uniquely identify the document
- It is computed by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF)
- TF: (number of times a word occurs in a document) / (total words in that document)
- IDF: log(total number of documents / number of documents containing the given word)
- If a word is very common and appears in almost every document, its IDF is near zero; the rarer the word, the larger its IDF
- The higher the TF-IDF value of a word, the more unique/rarely occurring that word is
- If the TF-IDF is close to zero, it means the word is very commonly used (a small worked example of the calculation follows this list)
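To tie the two formulas together, here is a small hand calculation (an illustrative sketch with made-up mini-documents; note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its numbers will differ slightly):

```python
# Hand-computing TF-IDF for one word using the raw formulas above (natural log)
import math

docs = [
    "please reset my password",
    "the system is down please restart it",
    "not able to login",
]

word = "please"
first_doc = docs[0].split()

tf = first_doc.count(word) / len(first_doc)              # 1 / 4 = 0.25
docs_with_word = sum(word in d.split() for d in docs)    # 'please' appears in 2 of 3 documents
idf = math.log(len(docs) / docs_with_word)               # log(3 / 2) ≈ 0.405

print("TF-IDF of '%s' in document 1: %.3f" % (word, tf * idf))  # ≈ 0.101
```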
```python
# TF-IDF vectorization of text
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

corpus = TicketData['Text'].values

# Pass stop_words=None to keep the stop words as columns
# vectorizer = TfidfVectorizer(stop_words=None)
vectorizer = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS)
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() replaces the deprecated get_feature_names() in newer scikit-learn
print(vectorizer.get_feature_names_out())

# Visualizing the Document Term Matrix using TF-IDF
VectorizedText = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
VectorizedText['originalText'] = pd.Series(corpus)
VectorizedText
```
Sample Output

What to do with vectorized text?
- This data can further be used in machine learning
- If the text data also has a target variable, e.g. sentiment (positive/negative) or support ticket priority (P1/P2/P3), then these word columns act as predictors and we can fit a classification ML algorithm
```python
# Example data frame for machine learning
# The Priority column acts as the target variable and the other word columns as predictors
DataForML = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
DataForML['Priority'] = TicketData['Priority']
DataForML.head()
```
Sample Output

Here the target variable is “Priority” and the other columns are predictors.
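As a minimal sketch of that next step (the choice of Multinomial Naive Bayes and the train/test split size are illustrative, not something fixed by the steps above):

```python
# Fitting a simple classifier on the vectorized text
# (Multinomial Naive Bayes is just one reasonable choice for count/TF-IDF features;
# with only five tickets this is purely illustrative)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

Predictors = DataForML.drop(columns=['Priority'])
Target = DataForML['Priority']

X_train, X_test, y_train, y_test = train_test_split(
    Predictors, Target, test_size=0.4, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
print(model.predict(X_test))
```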