Wordclouds are commonly used in text analysis. They are a great tool for visualizing and highlighting the most important information in text data.
The concept is simple: remove the commonly used words (stopwords) from the text and highlight the words that occur most frequently.
The code snippet below uses a string as sample text; you can fetch text from any source and reuse the same code.
Unigram wordcloud
A unigram is a single word. This type of wordcloud focuses on the frequency of each unique word in the text.
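To make the idea concrete, here is a minimal sketch of counting unigram frequencies after dropping common words, using only Python's standard library. The sample sentence and the mini stopword list are made up for the example:

```python
from collections import Counter

text = "trump supporters follow trump news while critics follow other news"
stopwords = {"while", "other"}  # hypothetical mini stopword list

# Split the text into unigrams, drop stopwords, and count each unique word
words = [w for w in text.lower().split() if w not in stopwords]
counts = Counter(words)
print(counts.most_common(3))
```

A wordcloud library does essentially this counting internally and then scales each word's font size by its frequency.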
# Installing the wordcloud library
# !pip install wordcloud

# A sample text snippet from a web article about Donald Trump
Article = '''Trump-critical media do continue to find elite audiences Their investigations still win Pulitzer Prizes; their reporters accept invitations to anxious conferences about corruption, digital-journalism standards, the end of nato, and the rise of populist authoritarianism. Yet somehow all of this earnest effort feels less and less relevant to American politics. President Trump communicates with the people directly via his Twitter account, ushering his supporters toward favorable information at Fox News or Breitbart. Despite the hand-wringing, the country has in many ways changed much less than some feared or hoped four years ago. Ambitious Republican plans notwithstanding, the American social-welfare system, as most people encounter it, has remained largely intact during Trump’s first term. The predicted wave of mass deportations of illegal immigrants never materialized. A large illegal workforce remains in the country, with the tacit understanding that so long as these immigrants avoid politics, keeping their heads down and their mouths shut, nobody will look very hard for them.
'''

########################################################################
# Cleaning the text data to remove punctuation, numbers and special characters
import re

# Removing selected special characters
# (just a template to show how characters can be removed selectively)
cleanedArticle = re.sub(r'[?$.!]', r' ', Article)

# Removing everything that is not an alphabet character
cleanedArticle = re.sub(r'[^a-z A-Z]', r' ', cleanedArticle)

# Converting the whole text to lowercase
cleanedArticle = cleanedArticle.lower()

# Deleting every word of three characters or fewer; these are mostly stopwords
cleanedArticle = re.sub(r'\b\w{1,3}\b', ' ', cleanedArticle)

# Stripping extra spaces in the text
cleanedArticle = re.sub(r' +', ' ', cleanedArticle)

########################################################################
# Plotting the wordcloud (the %matplotlib magic works inside Jupyter notebooks)
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Creating a custom list of stopwords
# (kept lowercase because the text was lowercased above)
customStopwords = list(STOPWORDS) + ['less', 'trump', 'american', 'politics', 'country']

wordcloudimage = WordCloud(
    max_words=50,
    font_step=2,
    max_font_size=500,
    stopwords=customStopwords,
    background_color='black',
    width=1000,
    height=720
).generate(cleanedArticle)

plt.figure(figsize=(15, 7))
plt.imshow(wordcloudimage)
plt.axis("off")
plt.show()
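One step in the cleaning code above can be surprising: the pattern \b\w{1,3}\b drops every word of three characters or fewer, not only words shorter than three. A quick check on a made-up sentence shows the effect:

```python
import re

sentence = "the cat sat near a very large window"

# Remove every word of 1-3 characters, then collapse the leftover spaces
filtered = re.sub(r'\b\w{1,3}\b', ' ', sentence)
filtered = re.sub(r' +', ' ', filtered).strip()
print(filtered)  # "the", "cat", "sat" and "a" are all removed
```

If you want to keep meaningful short words (e.g. "nato" is four letters and survives, but "war" would not), adjust the quantifier in the pattern accordingly.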
Sample Output

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial Intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across domains like Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion to teach inspired him to create this website!