Tokenisation


For computer systems to interpret the subject of a text in a way similar to how humans do, they use natural language processing (NLP), an area of AI concerned with understanding written or spoken language and responding in kind.

Text analysis describes NLP processes that extract information from unstructured text.

Natural language processing might be used to create:

A social media feed analyser that detects sentiment for a product marketing campaign.

A document search application that summarises documents in a catalog.

An application that extracts brands and company names from text.

Azure AI Language is a cloud-based service that includes features for understanding and analysing text.

Azure AI Language includes various features that support sentiment analysis, key phrase identification, text summarisation, and conversational language understanding.


Frequency analysis

After tokenising the words, you can perform some analysis to count the number of occurrences of each token. The most commonly used words (other than stop words such as “a”, “the”, and so on) can often provide a clue as to the main subject of a text corpus. For example, the most common words in the entire text of the “go to the moon” speech we considered previously include “new”, “go”, “space”, and “moon”. If we were to tokenise the text as bi-grams (word pairs), the most common bi-gram in the speech is “the moon”. From this information, we can easily surmise that the text is primarily concerned with space travel and going to the moon.
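As a minimal sketch of this kind of frequency analysis, the following uses Python's `collections.Counter` to count unigrams and bi-grams. The sample text is a short illustrative snippet, not the actual speech, and the stop-word list is a hypothetical minimal example.

```python
from collections import Counter

# Illustrative snippet (not the full speech) to demonstrate the technique.
text = "we choose to go to the moon we choose to go to the moon in this decade"

# Simple whitespace tokenisation, lowercased.
tokens = text.lower().split()

# Remove common stop words before counting unigrams.
stop_words = {"a", "an", "the", "to", "in", "we", "this"}
content_tokens = [t for t in tokens if t not in stop_words]

# Count single-token frequencies.
unigram_counts = Counter(content_tokens)
print(unigram_counts.most_common(3))

# Count bi-grams (adjacent word pairs) over the raw token stream.
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(2))
```

In a real pipeline you would typically use a proper tokeniser and a fuller stop-word list, but the counting step itself is exactly this simple.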

Tip:

Simple frequency analysis, in which you count the number of occurrences of each token, can be an effective way to analyse a single document; but when you need to differentiate across multiple documents within the same corpus, you need a way to determine which tokens are most relevant in each document. Term frequency-inverse document frequency (TF-IDF) is a common technique in which a score is calculated based on how often a word or term appears in one document compared to its more general frequency across the entire collection of documents. Using this technique, a high degree of relevance is assumed for words that appear frequently in a particular document, but relatively infrequently across a wide range of other documents.
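The TF-IDF calculation can be sketched in a few lines of plain Python. The three-document corpus below is a made-up example, and this uses one common variant of the formula (proportional term frequency and a natural-log IDF); libraries such as scikit-learn apply additional smoothing and normalisation.

```python
import math

# A tiny hypothetical corpus of three documents.
corpus = [
    "the moon landing was a moon mission",
    "the rocket launch was delayed",
    "the mission control room was quiet",
]
docs = [doc.split() for doc in corpus]

def tf_idf(term, doc_tokens, all_docs):
    # Term frequency: proportion of this document made up of the term.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Inverse document frequency: rarer across the corpus => higher weight.
    docs_with_term = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / docs_with_term)
    return tf * idf

# "moon" appears often in doc 0 but nowhere else: high relevance there.
print(tf_idf("moon", docs[0], docs))
# "the" appears in every document: idf = log(1) = 0, so its score is 0.
print(tf_idf("the", docs[0], docs))
```

Note how the common word “the” is scored as irrelevant everywhere, while “moon” scores highly in the one document where it is concentrated.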


Machine learning for text classification

Another useful text analysis technique is to use a classification algorithm, such as logistic regression, to train a machine learning model that classifies text based on a known set of categorisations.

A common application of this technique is to train a model that classifies text as positive or negative in order to perform sentiment analysis or opinion mining.

For example, consider the following restaurant reviews, which are already labeled as 0 (negative) or 1 (positive):

The food and service were both great: 1

A really terrible experience: 0

Mmm! tasty food and a fun vibe: 1

Slow service and substandard food: 0

With enough labeled reviews, you can train a classification model using the tokenised text as features and the sentiment (0 or 1) as the label. The model will encapsulate a relationship between tokens and sentiment: for example, reviews with tokens for words like “great”, “tasty”, or “fun” are more likely to return a sentiment of 1 (positive), while reviews with words like “terrible”, “slow”, and “substandard” are more likely to return 0 (negative).


Semantic language models

As the state of the art for NLP has advanced, the ability to train models that encapsulate the semantic relationship between tokens has led to the emergence of powerful language models. At the heart of these models is the encoding of language tokens as vectors (multi-valued arrays of numbers) known as embeddings.

It can be useful to think of the elements in a token embedding vector as coordinates in multidimensional space, so that each token occupies a specific “location.” The closer tokens are to one another along a particular dimension, the more semantically related they are. In other words, related words are grouped closer together. As a simple example, suppose the embeddings for our tokens consist of vectors with three elements, for example:

4 (“dog”): [10,3,2]

5 (“bark”): [10,2,2]

8 (“cat”): [10,3,1]

9 (“meow”): [10,2,1]

10 (“skateboard”): [3,3,1]

We can plot the location of tokens based on these vectors in three-dimensional space, like this:

A diagram of tokens plotted on a three-dimensional space.

The locations of the tokens in the embeddings space include some information about how closely the tokens are related to one another.

For example, the token for “dog” is close to “cat” and also to “bark.” The tokens for “cat” and “bark” are close to “meow.”

The token for “skateboard” is further away from the other tokens.
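These relationships can be checked numerically with a short sketch using the example vectors above and Euclidean distance (cosine similarity is another common measure for comparing embeddings).

```python
import math

# The three-element example embeddings from above.
embeddings = {
    "dog":        [10, 3, 2],
    "bark":       [10, 2, 2],
    "cat":        [10, 3, 1],
    "meow":       [10, 2, 1],
    "skateboard": [3, 3, 1],
}

def distance(a, b):
    # Euclidean distance between two embedding vectors.
    return math.dist(embeddings[a], embeddings[b])

print(distance("dog", "cat"))         # 1.0, closely related
print(distance("cat", "meow"))        # 1.0, closely related
print(distance("dog", "skateboard"))  # about 7.07, far apart
```

As the text notes, “dog”, “cat”, “bark”, and “meow” cluster together, while “skateboard” sits much further away.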

The language models we use in industry are based on these principles but have greater complexity.

For example, the vectors used generally have many more dimensions.

There are also multiple ways you can calculate appropriate embeddings for a given set of tokens.

Different methods result in different predictions from natural language processing models.

A generalised view of most modern natural language processing solutions is shown in the following diagram.

A large corpus of raw text is tokenised and used to train language models, which can support many different types of natural language processing task.

A diagram of the process to tokenize text and train a language model that supports natural language processing tasks.

Common NLP tasks supported by language models include:

Text analysis, such as extracting key terms or identifying named entities in text.

Sentiment analysis and opinion mining to categorise text as positive or negative.

Machine translation, in which text is automatically translated from one language to another.

Summarisation, in which the main points of a large body of text are summarised.

Conversational AI solutions such as bots or digital assistants in which the language model can interpret natural language input and return an appropriate response.

These capabilities and more are supported by the models in the Azure AI Language service, which we’ll explore next.
