Wordclouds are commonly used in text analysis. They are a great tool for visualizing and highlighting the most important information in text data.
The concept is simple: remove the commonly used words (stopwords) from the text and highlight the words that occur most frequently.
The code snippet below uses a string as sample text; you can fetch text from any source and reuse the same code.
Unigram wordcloud
A unigram is a single word. This type of wordcloud focuses on the frequency of each unique word in the text.
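To make the idea concrete, here is a minimal sketch of counting unigram frequencies after dropping common words, using only Python's standard library. The sample sentence and the mini stopword list are made up for the example:

```python
from collections import Counter

text = "trump supporters follow trump news while critics follow other news"
stopwords = {"while", "other"}  # hypothetical mini stopword list

# Split the text into unigrams, drop stopwords, and count each unique word
words = [w for w in text.lower().split() if w not in stopwords]
counts = Counter(words)
print(counts.most_common(3))
```

A wordcloud library does essentially this counting internally and then scales each word's font size by its frequency.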
# Installing the wordcloud library
# !pip install wordcloud

# A sample text snippet from a web article about Donald Trump
Article = '''Trump-critical media do continue to find elite audiences Their investigations still win Pulitzer Prizes; their reporters accept invitations to anxious conferences about corruption, digital-journalism standards, the end of nato, and the rise of populist authoritarianism. Yet somehow all of this earnest effort feels less and less relevant to American politics. President Trump communicates with the people directly via his Twitter account, ushering his supporters toward favorable information at Fox News or Breitbart. Despite the hand-wringing, the country has in many ways changed much less than some feared or hoped four years ago. Ambitious Republican plans notwithstanding, the American social-welfare system, as most people encounter it, has remained largely intact during Trump’s first term. The predicted wave of mass deportations of illegal immigrants never materialized. A large illegal workforce remains in the country, with the tacit understanding that so long as these immigrants avoid politics, keeping their heads down and their mouths shut, nobody will look very hard for them.
'''

########################################################################
# Cleaning the text data to remove punctuation, numbers and special characters
import re

# Removing selected special characters
# (just a template to show how characters can be removed selectively)
cleanedArticle = re.sub(r'[?$.!]', r' ', Article)

# Removing everything that is not an alphabet character
cleanedArticle = re.sub(r'[^a-z A-Z]', r' ', cleanedArticle)

# Converting the whole text to lowercase
cleanedArticle = cleanedArticle.lower()

# Deleting every word of three characters or fewer; these are mostly stopwords
cleanedArticle = re.sub(r'\b\w{1,3}\b', ' ', cleanedArticle)

# Stripping extra spaces in the text
cleanedArticle = re.sub(r' +', ' ', cleanedArticle)

########################################################################
# Plotting the wordcloud (the %matplotlib magic works inside Jupyter notebooks)
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Creating a custom list of stopwords
# (kept lowercase because the text was lowercased above)
customStopwords = list(STOPWORDS) + ['less', 'trump', 'american', 'politics', 'country']

wordcloudimage = WordCloud(
    max_words=50,
    font_step=2,
    max_font_size=500,
    stopwords=customStopwords,
    background_color='black',
    width=1000,
    height=720
).generate(cleanedArticle)

plt.figure(figsize=(15, 7))
plt.imshow(wordcloudimage)
plt.axis("off")
plt.show()
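One step in the cleaning code above can be surprising: the pattern \b\w{1,3}\b drops every word of three characters or fewer, not only words shorter than three. A quick check on a made-up sentence shows the effect:

```python
import re

sentence = "the cat sat near a very large window"

# Remove every word of 1-3 characters, then collapse the leftover spaces
filtered = re.sub(r'\b\w{1,3}\b', ' ', sentence)
filtered = re.sub(r' +', ' ', filtered).strip()
print(filtered)  # "the", "cat", "sat" and "a" are all removed
```

If you want to keep meaningful short words (e.g. "nato" is four letters and survives, but "war" would not), adjust the quantifier in the pattern accordingly.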
Sample Output

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial Intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across domains like Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion to teach inspired him to create this website!