When you need to build a classification model from free-text input such as user comments or reviews, the text must first be represented as numeric columns. This process is known as text vectorization: simply put, representing text as a set of numeric columns.
There are two major approaches to doing this:
- Count Vectorization
- TF-IDF Vectorization
Count Vectorization
Consider the data below, where the requirement is to create a predictive model that classifies a support ticket as P1/P2/P3 based on the ticket description. Before using a supervised ML algorithm, we need to convert the description text into numeric form.
```python
# Creating a sample ticket data set
import pandas as pd

TicketData = pd.DataFrame(data=[['Hi Please reset my password, i am not able to reset it', 'P3'],
                                ['Hi Please reset my password', 'P3'],
                                ['Hi The system is down please restart it', 'P1'],
                                ['Not able to login can you check?', 'P3'],
                                ['The data is not getting exported', 'P2']],
                          columns=['Text', 'Priority'])

# Printing the data
TicketData
```
Sample Output:

Using count vectorization, the text can be converted into numeric format. The result is known as a Document Term Matrix. Each column represents one of the unique important words found across all the texts, each row represents one sentence, and each cell holds the number of times that word occurs in that sentence.
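To make the cell values concrete before running scikit-learn, here is a tiny hand count for a single ticket description (a minimal sketch with ad-hoc tokenization; the sentence is taken from the sample data above):

```python
# Counting word occurrences by hand for one ticket description
# (a rough sketch of what each cell of the Document Term Matrix holds)
from collections import Counter

sentence = "Hi Please reset my password, i am not able to reset it"

# Lowercase and strip punctuation, roughly what CountVectorizer's default tokenizer does
tokens = [word.strip(',').lower() for word in sentence.split()]
counts = Counter(tokens)

print(counts['reset'])  # 2 -> the value that ends up in the 'reset' column for this row
```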
```python
# Count vectorization of text
import pandas as pd
# ENGLISH_STOP_WORDS now lives in sklearn.feature_extraction.text in newer scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

corpus = TicketData['Text'].values

# Pass stop_words=None to keep the stop words as columns
# vectorizer = CountVectorizer(stop_words=None)
vectorizer = CountVectorizer(stop_words=ENGLISH_STOP_WORDS)
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() replaces the deprecated get_feature_names() in newer scikit-learn
print(vectorizer.get_feature_names_out())
print(X.shape)

# Visualizing the Document Term Matrix
VectorizedText = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
VectorizedText['originalText'] = pd.Series(corpus)
VectorizedText
```
Sample Output

TF-IDF Vectorization
- TF-IDF is a composite score representing the power of a given word to uniquely identify the document
- It is computed by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF)
- TF: (number of times a word occurs in a document) / (total words in that document)
- IDF: log(total number of documents / number of documents containing the given word)
- If a word is very common and appears in almost every document, its IDF is near zero; the rarer the word, the larger its IDF
- The higher the TF-IDF value of a word, the more unique/rarely occurring that word is
- If the TF-IDF is close to zero, it means the word is very commonly used (a small worked example of the calculation follows this list)
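To tie the two formulas together, here is a small hand calculation (an illustrative sketch with made-up mini-documents; note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its numbers will differ slightly):

```python
# Hand-computing TF-IDF for one word using the raw formulas above (natural log)
import math

docs = [
    "please reset my password",
    "the system is down please restart it",
    "not able to login",
]

word = "please"
first_doc = docs[0].split()

tf = first_doc.count(word) / len(first_doc)              # 1 / 4 = 0.25
docs_with_word = sum(word in d.split() for d in docs)    # 'please' appears in 2 of 3 documents
idf = math.log(len(docs) / docs_with_word)               # log(3 / 2) ≈ 0.405

print("TF-IDF of '%s' in document 1: %.3f" % (word, tf * idf))  # ≈ 0.101
```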
```python
# TF-IDF vectorization of text
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

corpus = TicketData['Text'].values

# Pass stop_words=None to keep the stop words as columns
# vectorizer = TfidfVectorizer(stop_words=None)
vectorizer = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS)
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() replaces the deprecated get_feature_names() in newer scikit-learn
print(vectorizer.get_feature_names_out())

# Visualizing the Document Term Matrix using TF-IDF
VectorizedText = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
VectorizedText['originalText'] = pd.Series(corpus)
VectorizedText
```
Sample Output

What to do with vectorized text?
- This data can further be used in machine learning
- If the text data also has a target variable, e.g. sentiment (positive/negative) or support ticket priority (P1/P2/P3), then these word columns act as predictors and we can fit a classification ML algorithm
```python
# Example data frame for machine learning
# The Priority column acts as the target variable and the other word columns as predictors
DataForML = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
DataForML['Priority'] = TicketData['Priority']
DataForML.head()
```
Sample Output

Here the target variable is “Priority” and the other columns are predictors.
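As a minimal sketch of that next step (the choice of Multinomial Naive Bayes and the train/test split size are illustrative, not something fixed by the steps above):

```python
# Fitting a simple classifier on the vectorized text
# (Multinomial Naive Bayes is just one reasonable choice for count/TF-IDF features;
# with only five tickets this is purely illustrative)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

Predictors = DataForML.drop(columns=['Priority'])
Target = DataForML['Priority']

X_train, X_test, y_train, y_test = train_test_split(
    Predictors, Target, test_size=0.4, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
print(model.predict(X_test))
```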