Support Ticket Classification using TF-IDF Vectorization


This is probably one project which every organisation can benefit from!

A lot of human effort is spent unnecessarily every day just to re-prioritize the incoming support tickets to their deserving priority, because everyone just creates them either as Priority-2 or Priority-1.

This issue can be solved to some extent with a predictive model that classifies incoming tickets into P1, P2, P3, etc. based on the text they contain.

Sample data for such a scenario looks like this.

But, classification algorithms like logistic regression, Naive Bayes, Decision Trees etc. work on numeric data! So how will they learn text data?

Converting text to numeric form is a common requirement whenever the input data is text that needs to be classified into groups.

Reviews, emails, ticket descriptions, etc. are common forms of input text that need to be converted to a numeric format before they can be analyzed further.

Converting any such free form of text like ticket description, reviews, mails into a set of numbers is known as Vectorization.

There are many ways to do this. The famous ones are TF-IDF, Word2Vec, Doc2Vec, GloVe, BERT, etc. In this post, I will show you how to use TF-IDF vectorization for ticket classification.

Vectorization converts each row of text into one row of numbers. The resulting table is known as a Document Term Matrix, and its columns are the important words from the text.

These rows of numbers can then be learned against the Target variable. In this scenario, it is the priority of the ticket.


Let us understand what TF-IDF does basically.

What is TF-IDF?

TF-IDF is a composite score representing the power of a given word to uniquely identify the document. It is computed by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF).

TF: (The number of times a word occurs in a document/ total words in that document)

IDF: log (total number of documents/number of documents containing the given word).

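  • If a word is very common, like “is”, “and”, “the”, etc., it appears in almost every document, so its IDF is close to zero. The rarer the word, the larger its IDF.
  • The higher the TF-IDF value of a word in a document, the more distinctive that word is for that document: it occurs often there but rarely elsewhere.
  • If the TF-IDF is close to zero, the word is either very commonly used across documents or barely present in this one.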
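To make the two formulas concrete, here is a minimal sketch that computes TF, IDF and TF-IDF by hand. The three-sentence corpus and the helper names tf, idf and tf_idf are made up for illustration; they are not part of any library.

```python
import math

# Hypothetical toy corpus: each string plays the role of one document (ticket).
docs = [
    "the server is down",
    "the printer is broken",
    "the server is slow",
]
tokenized = [d.split() for d in docs]

def tf(word, tokens):
    # Number of times the word occurs in the document / total words in it.
    return tokens.count(word) / len(tokens)

def idf(word, all_tokens):
    # log(total number of documents / number of documents containing the word).
    # Assumes the word occurs in at least one document.
    containing = sum(1 for tokens in all_tokens if word in tokens)
    return math.log(len(all_tokens) / containing)

def tf_idf(word, tokens, all_tokens):
    return tf(word, tokens) * idf(word, all_tokens)

# "the" occurs in every document, "printer" in only one of them.
common = tf_idf("the", tokenized[1], tokenized)
rare = tf_idf("printer", tokenized[1], tokenized)
print("the:", common, "printer:", rare)
```

The common word ends up with a score of zero because log(3/3) is zero, while the rare word gets a positive score.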


Creating TF-IDF scores using sklearn

sklearn.feature_extraction.text has a class called TfidfVectorizer which performs this calculation of TF-IDF scores for us very easily.

I am showing the output here for the above sample data so that you can easily visualize it. After this, we will perform the same activity on actual ticket data.

TF-IDF vectorization

You can now compare the words with the original text side by side.

If a word is not present in a sentence, its score is 0. If a word is present in only a few of the sentences, its score is higher. If a word is present in almost every sentence, its score is low: zero under the plain formula above, and small but non-zero with sklearn's default smoothed IDF.

This document term matrix with the TF-IDF scores now represents the information present as text!

What to do with Vectorized text?

  • This data can further be used in machine learning.
  • If the text data also has a target variable e.g. sentiment(positive/negative) or Support Ticket Priority (P1/P2/P3) then these word columns act as predictors and we can fit a classification/regression ML algorithm on this data.

Below snippet shows how we can add the Target variable to the TF-IDF matrix and get the data ready for ML.

Now in this data, the word columns are predictors and the Priority column is the Target variable.

Case study: IT support ticket classification on Microsoft data

Now, let us use our understanding of TF-IDF to convert the text data from the Microsoft IT support desk. You can download the required data for this case study here.

Problem Statement: Use the support ticket text description to classify a new ticket into P1/P2/P3.

Reading the support ticket data

This data contains 19,796 rows and 2 columns. The column “body” represents the ticket description and the column “urgency” represents the Priority.

Visualising the distribution of the Target variable

Now we check whether the Target variable has a balanced distribution, that is, whether each priority type has enough rows to be learned.

If the data had been imbalanced, for example with very few rows for the P1 category, you would need to balance it using one of the popular techniques like over-sampling, under-sampling, or SMOTE.
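A quick numeric version of this check is sketched here. The tiny DataFrame is a hypothetical stand-in for the real ticket data, and imbalance_ratio is just an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the "urgency" column of the ticket data.
tickets = pd.DataFrame(
    {"urgency": ["P1", "P2", "P3", "P2", "P1", "P3", "P2", "P3"]}
)

# Number of rows and share of rows for each priority.
counts = tickets["urgency"].value_counts()
shares = tickets["urgency"].value_counts(normalize=True)

# Ratio of the largest class to the smallest; values far above 1 hint at imbalance.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print("largest/smallest class ratio:", imbalance_ratio)
```

Calling counts.plot(kind="bar") would draw the same information as a bar plot if matplotlib is installed.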

The above bar plot shows that there are enough rows for each ticket type. Hence, this is balanced data for classification.

TF-IDF Vectorization: converting text data to numeric

Now we will convert the text column “body” into a TF-IDF matrix of numbers using TfidfVectorizer from the sklearn library. This gets the text data into numeric form, which machine learning algorithms can learn from.

TF-IDF matrix data for classification

The above output is just a sample of the 19796 rows and 9100 columns!

Notice that the number of rows is the same as in the original data, but the number of columns has exploded to 9100!

This is because there were so many unique words in the support ticket texts. Even after removing the stop-words, 9099 unique words still remain. Add the one Target variable “Priority” and you get 9100 columns in total.

The Curse of High Dimensionality

If we pass the above data to a machine learning algorithm, training can become extremely slow or even appear to hang, especially with tree-based algorithms. This is because of the sheer number of columns to process!

This problem is common in TF-IDF vectorization because of the way it builds the representation for each sentence. The overall number of columns is bound to be very high because the columns are the unique words from all the text data!

This is exactly when we use Dimension Reduction: representing a high number of columns with a lower number of columns.


Dimension Reduction

There are too many predictor columns(9099!), hence we use PCA to reduce the number of columns.

To select the best number of principal components, we would ideally run PCA once with the number of components equal to the number of columns, in this case 9099.

But in the below snippet I try a maximum of 5000 principal components. Why 5000? Because running PCA with all 9099 columns takes a long time, so I am first checking whether the optimum number of components can be found below 5000. Luckily, the saturation was found near 2100 components.

Based on the cumulative variance explained chart, I select the smallest number of principal components that explains most of the data variance. This is the point where the graph becomes nearly horizontal.

Warning: Please run all the code below on Google Colab or any other cloud platform, so that your laptop does not hang due to the high amount of processing required!
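The selection procedure can be sketched on a small synthetic matrix. Everything in this example is an assumption for illustration: the random data stands in for the TF-IDF matrix, 20 plays the role of the 5000 cap, and the 0.97 threshold mirrors the roughly 97% variance mentioned in this post.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the TF-IDF matrix: 200 rows, 50 columns that are
# really driven by only 5 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# Fit PCA with a generous upper limit on the number of components.
pca = PCA(n_components=20)
pca.fit(X)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold.
threshold = 0.97
n_components = int(np.argmax(cumulative >= threshold)) + 1
print("components needed:", n_components)
```

Plotting cumulative against the component index gives the cumulative variance chart, and the chosen value is where the curve flattens out.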

Variance explained by PCA

Based on the above chart, we can see that saturation happens around 2100 principal components. They explain around 97% of the total data variance.

Hence I choose 2100 principal components. With this, we reduce the total number of columns significantly compared to the original 9099 predictor columns.

Using Principal Components as predictors

Now we combine the Target variable with the principal components to prepare the data for machine learning.
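A minimal sketch of that combination, assuming a small random matrix in place of the TF-IDF data and 3 components in place of 2100; the column names PC1, PC2, ... are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical numeric predictors and matching priorities.
rng = np.random.default_rng(1)
X = rng.random((10, 6))
priority = ["P1", "P2", "P3", "P2", "P1", "P3", "P2", "P1", "P3", "P2"]

# Reduce the predictors to a few principal components.
n_components = 3
components = PCA(n_components=n_components).fit_transform(X)

# Name the new columns PC1, PC2, ... and attach the Target variable.
pc_columns = [f"PC{i + 1}" for i in range(n_components)]
data_for_ml = pd.DataFrame(components, columns=pc_columns)
data_for_ml["Priority"] = priority
print(data_for_ml.head())
```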

Principal components as predictors

Standardization/Normalization of the data

This is an optional step. It can speed up the processing of the model training.
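Both options can be sketched with scikit-learn's scalers. The tiny array is made up, and which of the two scalers suits this case study is left open here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical predictor matrix with columns on very different scales.
X = np.array([
    [-2.0, 10.0],
    [0.0, 20.0],
    [2.0, 30.0],
])

# Normalization: rescale every column to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: rescale every column to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```

In a real project the scaler should be fitted on the training rows only and then applied to the test rows with transform.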

Train Test Split for TF-IDF data
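A sketch of the split, assuming a 75/25 ratio and stratified sampling. Both choices, the random_state and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical predictors (20 rows, 2 columns) and priorities.
X = np.arange(40).reshape(20, 2)
y = np.array(["P1", "P2", "P3", "P2"] * 5)

# Hold out 25% of the rows for testing; stratify keeps the share of each
# priority roughly the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```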

Training ML classification models

Normally, while training classification models, we can use all of the famous classification algorithms like Random Forest, XGBoost, Adaboost, ANN etc.

But this data is high dimensional even after dimension reduction! If you pass these 2100 columns to any of these algorithms, it will take a while to finish training, and on a humble laptop CPU it may even hang!

So, keeping the training speed in mind, I select the algorithms below.

  • Naive Bayes
  • Logistic Regression
  • Decision Trees

Naive Bayes and Logistic Regression run faster on such high dimensional data. I have kept Decision Trees just to help you see how slow the training is for tree-based models on such datasets.
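To get a feel for the speed difference, here is a rough sketch that times the three algorithms on synthetic data. The data comes from make_classification rather than the real tickets, and GaussianNB is an assumption: it is used because principal components can be negative, which MultinomialNB does not accept. Timings and accuracies on the real data will differ.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the reduced ticket data.
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # Time only the training step.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start

    accuracy = accuracy_score(y_test, model.predict(X_test))
    results[name] = {"seconds": elapsed, "accuracy": accuracy}
    print(f"{name}: trained in {elapsed:.3f}s, test accuracy {accuracy:.2f}")

# Optional and slower: 10-fold cross validation for one model.
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(models["Naive Bayes"], X, y, cv=10, scoring="f1_weighted")
```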

Naive Bayes

This algorithm trains very fast! The accuracy may not always be very high, but it is reliably quick. Hence you can even skip the dimension reduction step for Naive Bayes and get better accuracy by using the original data with 9099 columns.

Here the output is shown based on 2100 principal components, hence the lower accuracy. If you use the original data, the accuracy goes up to 70%.

I have commented out the cross validation section just to save computing time. You can uncomment and execute those commands as well.

Logistic Regression

This algorithm also trains very fast, but not as fast as Naive Bayes! However, the slower speed comes with a nice tradeoff: it produces more accurate results. In the below snippet you can see the accuracy is 73%.

Decision Trees

This algorithm trains painfully slowly on such data! From this you can imagine how slowly a RandomForest or XGBoost would train! Hence, I have not included them in this case study.

Training the best model on Full Data

Based on the above outputs, we select Logistic Regression as the final model.

Making predictions for New Cases

This final model is deployed in production to classify the new incoming tickets. To do this, we write a function which can generate predictions either one at a time OR for multiple cases input as a data frame.
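Such a function could be sketched as follows. The six training tickets, the function name predict_priority, the output column PredictedPriority and the use of 3 principal components are all hypothetical. The key point is that new text must pass through the same fitted vectorizer and PCA as the training data.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training tickets standing in for the full dataset.
train = pd.DataFrame({
    "body": [
        "server down production outage urgent",
        "database outage urgent customers blocked",
        "password reset request",
        "request new mouse",
        "email sync slow sometimes",
        "laptop slow after update",
    ],
    "urgency": ["P1", "P1", "P3", "P3", "P2", "P2"],
})

# Fit every step on the training data once.
vectorizer = TfidfVectorizer(stop_words="english")
pca = PCA(n_components=3)
model = LogisticRegression(max_iter=1000)

X_tfidf = vectorizer.fit_transform(train["body"]).toarray()
X_pca = pca.fit_transform(X_tfidf)
model.fit(X_pca, train["urgency"])

def predict_priority(new_tickets):
    """Accept one ticket as a string or many as a DataFrame with a 'body' column."""
    if isinstance(new_tickets, str):
        new_tickets = pd.DataFrame({"body": [new_tickets]})
    # Reuse the fitted vectorizer and PCA; never refit them on new tickets.
    tfidf = vectorizer.transform(new_tickets["body"]).toarray()
    features = pca.transform(tfidf)
    result = new_tickets.copy()
    result["PredictedPriority"] = model.predict(features)
    return result

single_result = predict_priority("production server outage")
batch_result = predict_priority(
    pd.DataFrame({"body": ["reset my password", "laptop is slow"]})
)
print(single_result)
print(batch_result)
```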

New ticket classification using TF-IDF

Saving the output as a file

You can write the PredictionResults dataframe as a csv or excel file. From there, it can be loaded into the database.
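A minimal sketch of the csv option, assuming PredictionResults holds the ticket text and the predicted priority; the two rows and the file name are made up.

```python
import pandas as pd

# Hypothetical prediction output.
PredictionResults = pd.DataFrame({
    "body": ["production server outage", "reset my password"],
    "PredictedPriority": ["P1", "P3"],
})

# Write to csv without the row index.
PredictionResults.to_csv("PredictionResults.csv", index=False)

# Read it back to confirm the file round-trips.
reloaded = pd.read_csv("PredictionResults.csv")
print(reloaded)

# For Excel, to_excel works the same way but needs an engine such as openpyxl:
# PredictionResults.to_excel("PredictionResults.xlsx", index=False)
```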


I hope this post was helpful for you to get a practical flow of Text Vectorization using TF-IDF and you will be able to apply this in your projects. Consider sharing this post with your friends to spread the knowledge and help me grow as well!

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!
