The first thing you need to do in any NLP project is text preprocessing. Preprocessing simply means putting the input text into a predictable and analyzable form. It’s a crucial step in building a great NLP application.
There are different ways to preprocess text:
- stop word removal,
- stemming,
- lemmatization,
- lowercasing,
- tokenization.
Among these, the most important step is tokenization. It’s the process of breaking a stream of textual data into words, terms, sentences, symbols, or some other meaningful elements called tokens. A lot of open-source tools are available to perform tokenization.
In this article, we’ll dig further into the importance of tokenization and the different types of it, explore some tools that implement tokenization, and discuss the challenges.
Why do we need tokenization?
Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered as discrete elements. The token occurrences in a document can be used directly as a vector representing that document.
This immediately turns an unstructured string (a text document) into a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer to trigger useful actions and responses, or serve as features in a machine learning pipeline that trigger more complex decisions or behavior.
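To make this concrete, here is a minimal pure-Python sketch (the sample document is our own, not from the article) of turning token counts into a numerical vector:

```python
from collections import Counter

# A toy document; whitespace tokenization is enough for the illustration.
doc = "the cat sat on the mat"
tokens = doc.split()
counts = Counter(tokens)  # token occurrences in the document

# Fixing a vocabulary order turns the counts into a vector.
vocab = sorted(counts)
vector = [counts[w] for w in vocab]
print(vocab)   # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)  # [1, 1, 1, 1, 2]
```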
Tokenization can separate sentences, words, characters, or subwords. When we split the text into sentences, we call it sentence tokenization. For words, we call it word tokenization.
Example of sentence tokenization
Example of word tokenization
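The two examples above can be sketched without any library at all; the sample text here is our own stand-in, and the naive `". "` split is only for illustration:

```python
text = "Hello world. This is NLP. Tokenization is the first step."

# Sentence tokenization: naively split the text into sentences.
sentences = [s for s in text.split(". ") if s]

# Word tokenization: split a sentence into words on whitespace.
words = sentences[0].split()

print(sentences)  # ['Hello world', 'This is NLP', 'Tokenization is the first step.']
print(words)      # ['Hello', 'world']
```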
Although tokenization in Python may be simple, we know that it’s the foundation to develop good models and help us understand the text corpus. This section will list a few tools available for tokenizing text content like NLTK, TextBlob, spacy, Gensim, and Keras.
White Space Tokenization
The simplest way to tokenize text is to use whitespace within a string as the “delimiter” of words. This can be accomplished with Python’s split function, which is available on all string object instances as well as on the string built-in class itself. You can change the separator any way you need.
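As a minimal sketch (the sample sentence is our own stand-in for the article’s original example):

```python
# Whitespace tokenization with str.split(), which splits on any run of whitespace.
sentence = "The first known version of this format appeared in 1995."
tokens = sentence.split()
print(tokens)
# ['The', 'first', 'known', 'version', 'of', 'this', 'format',
#  'appeared', 'in', '1995.']
```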
As you can see, this built-in Python method already does a good job of tokenizing a simple sentence. Its one “mistake” is on the last word, where it keeps the sentence-ending punctuation attached to the token “1995.”. We need tokens to be separated from neighboring punctuation and other significant tokens in a sentence.
In the example below, we’ll perform sentence tokenization using the comma as a separator.
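A minimal sketch of this (the sample text is our own):

```python
# "Sentence" tokenization using a comma followed by a space as the separator.
text = "First we load the data, then we clean it, finally we tokenize it"
parts = text.split(", ")
print(parts)
# ['First we load the data', 'then we clean it', 'finally we tokenize it']
```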
NLTK Word Tokenize
NLTK (Natural Language Toolkit) is an open-source Python library for Natural Language Processing. It has easy-to-use interfaces for over 50 corpora and lexical resources such as WordNet, along with a set of text processing libraries for classification, tokenization, stemming, and tagging.
You can easily tokenize the sentences and words of the text with the tokenize module of NLTK.
First, we’re going to import the relevant functions from the NLTK library:
- Word and Sentence tokenizer
N.B.: sent_tokenize uses a pre-trained Punkt model stored in tokenizers/punkt/english.pickle.
- Punctuation-based tokenizer
This tokenizer splits sentences into words based on whitespace and punctuation.
Notice the difference: word_tokenize treats “Amal.M” as one word, while wordpunct_tokenize splits it at the period.
- Treebank Word tokenizer
This tokenizer incorporates a variety of common rules for English word tokenization. It separates phrase-terminating punctuation such as (?!.;,) from adjacent tokens and keeps decimal numbers as single tokens. It also contains rules for English contractions.
For example “don’t” is tokenized as [“do”, “n’t”]. You can find all the rules for the Treebank Tokenizer at this link.
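A minimal sketch (the sample sentence is our own) showing both the contraction rule and the decimal-number rule:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Don't stop at 3.14!")
print(tokens)
# ['Do', "n't", 'stop', 'at', '3.14', '!']
```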
- Tweet tokenizer
When we apply tokenization to text such as tweets, the tokenizers mentioned above can’t produce practical tokens. To handle this, NLTK provides a rule-based tokenizer designed specifically for tweets. It can also split emojis into separate tokens, which is useful for tasks like sentiment analysis.
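A small sketch (the sample tweet is our own) of how TweetTokenizer keeps handles, hashtags, and emoticons intact:

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tweet = "@user this is #awesome!!! :-)"
tokens = tokenizer.tokenize(tweet)
print(tokens)
# ['@user', 'this', 'is', '#awesome', '!', '!', '!', ':-)']
```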
- MWE tokenizer
NLTK’s multi-word expression tokenizer (MWETokenizer) provides a function add_mwe() that allows the user to enter multiple word expressions before using the tokenizer on the text. More simply, it can merge multi-word expressions into single tokens.
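A minimal sketch (the expression “New York” is our own example); by default the merged tokens are joined with an underscore:

```python
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
tokenizer.add_mwe(("New", "York"))  # register the multi-word expression

# MWETokenizer operates on an already-tokenized list of words.
tokens = tokenizer.tokenize("I love New York".split())
print(tokens)
# ['I', 'love', 'New_York']
```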
TextBlob Word Tokenize
TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Let’s start by installing TextBlob and the NLTK corpora:
$ pip install -U textblob
$ python3 -m textblob.download_corpora
In the code below, we perform word tokenization using the TextBlob library:
Notice that the TextBlob tokenizer removes the punctuation. In addition, it has rules for English contractions.
spaCy Word Tokenize
spaCy is an open-source Python library for parsing and understanding large volumes of text. With models available for specific languages (English, French, German, etc.), it handles NLP tasks with efficient implementations of common algorithms.
The spaCy tokenizer lets you specify special tokens that don’t need to be segmented, or add segmentation rules specific to each language. For example, punctuation at the end of a sentence should be split off, whereas “U.K.” should remain one token.
Before you can use spaCy, you need to install it and download the data and model for the English language.
$ pip install spacy
$ python3 -m spacy download en_core_web_sm
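A minimal sketch (the sample sentence is our own). For tokenization alone, a blank English pipeline is enough; loading the downloaded en_core_web_sm model tokenizes the same way and adds tagging, parsing, etc.:

```python
import spacy

# A blank English pipeline still carries the English tokenizer rules.
nlp = spacy.blank("en")

doc = nlp("We're visiting the U.K. next week.")
tokens = [token.text for token in doc]
print(tokens)  # "We're" is split into "We" and "'re", but "U.K." stays whole
```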
Gensim word tokenizer
Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community. It offers utility functions for tokenization.
Tokenization with Keras
The open-source Keras library is one of the most reliable deep learning frameworks. To perform tokenization, we use the text_to_word_sequence method from the keras.preprocessing.text module. A nice touch is that Keras lowercases the text before tokenizing it, which can be quite a time-saver.
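A minimal sketch (the sample sentence is our own); note that the import path varies by version, with older standalone Keras exposing it as keras.preprocessing.text while TensorFlow 2.x ships it under tensorflow.keras.preprocessing.text:

```python
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Lowercases the text and strips punctuation before splitting into words.
tokens = text_to_word_sequence("The Quick Brown Fox jumped!")
print(tokens)
# ['the', 'quick', 'brown', 'fox', 'jumped']
```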
N.B.: You can find all the code examples here.
Challenges and limitations
Let’s discuss the challenges and limitations of the tokenization task.
In general, tokenizers are built for text written in languages such as English or French, which separate words with white spaces and use punctuation marks to mark sentence boundaries. Unfortunately, this approach is not applicable to languages such as Chinese, Japanese, Korean, Thai, Hindi, Urdu, or Tamil. This problem creates the need for a common tokenization tool that covers all languages.
Another limitation is the tokenization of Arabic texts, since Arabic has a complicated morphology. For example, a single Arabic word may correspond to up to six different tokens, like the word “عقد” (eaqad).
There’s A LOT of research going on in Natural Language Processing. You need to pick one challenge or a problem and start searching for a solution.
In this article, we learned about different tokenizers from various libraries and tools.
We saw the importance of this task in any NLP project, and we also implemented it in Python. Tokenization might seem like a simple topic, but once you get into the finer details of each tokenizer, you’ll notice that it’s actually quite complex.
Start practicing with the examples above and try them on any text dataset. The more you practice, the better you’ll understand how tokenization works.
If you stayed with me until the end – thank you for reading!