A token is a piece of text produced by splitting a larger text according to some rule. For example, each word is a token when a sentence is "tokenized" into words, and each sentence is a token when a paragraph is tokenized into sentences.
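To make the splitting rules concrete, here is a minimal sketch of rule-based tokenization using only Python's standard re module (the patterns and sample text below are simplified illustrations, not what a production tokenizer like NLTK's uses):

```python
import re

text = "Hello there. How are you? I am fine."

# Rule 1: a sentence token ends at ., ! or ? followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)   # ['Hello there.', 'How are you?', 'I am fine.']

# Rule 2: a word token is a maximal run of word characters.
words = re.findall(r"\w+", sentences[0])
print(words)       # ['Hello', 'there']
```

Rules this simple break quickly (e.g. the period in "Mr." would end a sentence), which is why a dedicated library is usually preferred.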
The nltk library in Python is a popular choice for tokenization, as well as for many other common NLP tasks.
# importing the library
from nltk.tokenize import sent_tokenize, word_tokenize

# The tokenizers need the 'punkt' model; download it once with:
# import nltk; nltk.download('punkt')

EXAMPLE_TEXT = '''Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard.'''

# Sentence tokenization
print(sent_tokenize(EXAMPLE_TEXT))

# Word tokenization
print(word_tokenize(EXAMPLE_TEXT))
Sample Output (note how the sentence tokenizer correctly handles the period in "Mr.", and the word tokenizer splits "shouldn't" into "should" and "n't"):

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']
Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial Intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains such as Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion to teach inspired him to create this website!
