Anyword’s Language Optimization Platform is a product designed to help marketers articulate their marketing copy more easily and effectively to their target audiences. To do this successfully, our underlying Natural Language Processing (NLP) technology has to excel both at understanding and generating natural language. In this post, I review the incredible progress in the field of NLP that has made this product (and others) possible, and then share some tips based on our experience with these methods.

A quick layperson intro to supervised machine learning

Since supervised machine learning (ML) is the underlying technique behind much of the latest progress in NLP, I will start with a very quick intro for readers with no ML background. Plainly put, in supervised machine learning we try to teach models to perform a target task: producing the desired output for a given input. There are many types of tasks and input/output specifications. Here are a few examples:

  • Image classification: input is a raw image; output is the type of animal that appears in the image
  • Part-of-speech tagging: input is a text; output is the part of speech of every word in the text
  • Question answering: input is the text of a question; output is the text of its answer
  • Text engagement prediction: input is a marketing text plus a target audience; output is an engagement potential score between 0 and 100

To achieve this goal, the model is first trained on a supervision dataset that comprises pairs of inputs and their corresponding expected outputs. During training, the model learns by updating its internal parameters, which determine how it maps inputs to outputs. Once training is over, the model is evaluated on its ability to predict outputs for new inputs it has not seen before. The success of such supervised models depends largely on:

  • Their capacity to learn and generalize to unseen data
  • The size and quality of the supervision datasets used to train them. Unfortunately, such datasets (commonly referred to as ‘labeled data’) often require expensive manual annotation of the desired outputs.
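To make the input→output setup above concrete, here is a deliberately tiny sketch of supervised learning: a 1-nearest-neighbor classifier. All of the data, labels, and the choice of algorithm are invented for illustration; real systems use far richer models and datasets.

```python
# Toy supervised learning: a 1-nearest-neighbor classifier.
# "Training" here is trivial (memorize the labeled pairs), but the
# structure -- inputs paired with expected outputs, then prediction
# on unseen inputs -- is the same as in real ML systems.

def train(inputs, labels):
    """'Training' for 1-NN just stores the labeled examples."""
    return list(zip(inputs, labels))

def predict(model, x):
    """Predict the label of the closest stored training input."""
    def distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest_input, nearest_label = min(model, key=lambda pair: distance(pair[0], x))
    return nearest_label

# Invented 2-d inputs with labels for two classes.
X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
y = ["cat", "cat", "dog", "dog"]

model = train(X, y)
print(predict(model, (0.2, 0.1)))  # an unseen input near the "cat" cluster
print(predict(model, (4.9, 5.1)))  # an unseen input near the "dog" cluster
```

The evaluation step in the text corresponds to the last two lines: the model is judged on inputs that did not appear in the training pairs.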

Early days in ML approaches for NLP

Similar to other fields, many early applications of machine learning in NLP involved a great deal of attention to feature engineering, which deals with converting the inputs into the feature representation most conducive to learning. NLP was particularly known for its ‘Curse of Dimensionality’ due to the high number of features (typically ~100K) you had to deal with if you wanted to treat each unique word as a feature. Things got even worse if, to capture more linguistic nuance, one also included multi-word features (n-grams) like ‘not good’. In addition to classic feature selection techniques, early linguistically-motivated approaches to this challenge included stemming and lemmatization, which merge different surface forms of the same lemma or stem into a single feature (e.g. mapping both ‘airplane’ and ‘airplanes’ to the same feature). Helpful as these techniques were, there was still much room for improvement.
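To illustrate how stemming shrinks the feature space, here is a deliberately naive suffix-stripping "stemmer". It is a toy stand-in for real stemmers like Porter's, and its suffix list is made up for the example:

```python
# A naive suffix-stripping "stemmer" (a toy, not a real algorithm like
# Porter's). Merging surface forms into one stem reduces the number of
# distinct features the downstream model has to learn about.

def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        # only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["airplane", "airplanes", "fly", "flying"]
features = {toy_stem(t) for t in tokens}
print(sorted(features))  # 4 surface forms collapse into 2 features
```

Four distinct word features collapse into two stems, which is exactly the dimensionality reduction the paragraph describes, scaled down to a toy vocabulary.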

NLP models learn about the meaning of words

A major breakthrough in NLP came in the form of self-supervised pre-training. The basic idea is: “First teach the model about English in general and only then train it on your target task.” (Note: I occasionally refer to English in this post but the same principles pretty much apply to any other human language). If you care about machine learning terminology, this is called a ‘transfer learning’ technique as knowledge learned in one domain (general English) is transferred to another specific task like text classification or information extraction. 

The concept of self-supervised pre-training in NLP dates back many years, to techniques such as Latent Semantic Analysis (LSA), but it really took off in the early 2010s with the introduction of self-supervised word embeddings. Arguably, Collobert’s work “Natural Language Processing (almost) from Scratch” and Mikolov’s word2vec embeddings mark the beginning of this phase. These works and others presented very efficient and effective techniques for teaching models about the semantics of words, using no manually labeled data, as part of a ‘pre-training’ phase. They do so largely by ingesting large amounts of text and simply observing word co-occurrences. This type of training is called ‘self-supervised’ because you don’t need to supply labeled supervision, just lots of (usually publicly available) plain text, which serves as both the inputs and the outputs. During this training, these models learn a mapping of every English word to a vector in a low-dimensional embedding space (~1K dimensions), and their language ‘understanding’ properties stem from the fact that words with similar meanings or functions are typically mapped to similar vectors in this space. Figure 1 shows a 2D illustration of such a learned embedding space. With a pre-trained model that ‘understands’ words in hand, task-specific models built on these word representations were able to achieve better results with less task-specific training data.

Figure 1: A 2-dimensional illustration of a word embedding space [source]
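The "similar words get similar vectors" property is usually measured with cosine similarity. The sketch below uses invented 3-d vectors for readability; real embeddings have on the order of 1K learned dimensions, but the computation is identical:

```python
import math

# Cosine similarity between word vectors. The 3-d vectors below are
# invented for illustration; real embedding spaces have ~1K dimensions
# learned from word co-occurrence statistics.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

embeddings = {
    "airplane": (0.9, 0.8, 0.1),
    "jet":      (0.8, 0.9, 0.2),
    "banana":   (0.1, 0.2, 0.9),
}

print(cosine(embeddings["airplane"], embeddings["jet"]))     # high: related words
print(cosine(embeddings["airplane"], embeddings["banana"]))  # low: unrelated words
```

A downstream model that receives these vectors as features gets the relatedness of ‘airplane’ and ‘jet’ for free, which is precisely the transfer-learning benefit described above.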

Language models put words together

Learning meaningful word representations was a significant step forward for NLP models, but language is much more than the simple sum of its words. Understanding how words are put together in sequence is key to comprehending language phenomena like word senses (as in ‘book a ticket’ vs. ‘read a book’) and multi-word expressions (as in being ‘under the weather’), not to mention larger language units like entire sentences and even longer texts. Luckily, with the constant increase in compute power, the abundance of unlabeled text, and sophisticated neural network architectures, pre-training techniques progressed to teaching models exactly that. It started with context-aware representations like context2vec and ELMo, which were based on sequence-aware neural models called Recurrent Neural Networks (RNNs). Then it continued with even more powerful models like BERT, GPT-2, T5, and many of their successors. Like their simpler predecessors, these self-supervised pre-trained models are first trained on lots of plain text and then fine-tuned to particular target tasks with task-specific supervision. However, with larger sizes and a more powerful neural architecture that can better attend to interrelations between different words in a sequence (called ‘Transformers’), they have come to dominate pretty much all of the NLP benchmark leaderboards today, from sentiment analysis to question answering. Knowing how to put words together is the essence of being skilled in any language. Hence, it is only appropriate that these pre-trained models are called language models.

How language models work

While pre-trained language models come in many variants, let’s look at the classic autoregressive language modeling task formulation to get a general feel for how they work. Simply put, the task is to predict the next word given any sequence of preceding words. Let’s start with a simple concrete example:

  1. Biden flew _____

In this case, the model is required to predict the most likely words that could fill in the blank right after the sequence “Biden flew”. It’s instructive to imagine what kind of knowledge the model needs to learn in order to do well on this task. Here, for example, understanding English grammar and common usage patterns is very useful. It would help rule out simple present tense verb options (“Biden flew watch” doesn’t work) or singular noun options (“Biden flew airplane” doesn’t work either), while preferring prepositions or verb particles that are often used with the verb “fly”, like “to” or “off”.

One of the interesting ways in which models trained on this language modeling task can showcase their linguistic capabilities is when they are used to generate text. Arguably, the introduction of GPT-2 in 2019 marked the beginning of a new era in NLP by showing that an LM can master very impressive linguistic knowledge, as exemplified by its ability to generate long pieces of text that are almost indistinguishable from human-written ones. Figure 2 shows an excerpt of a popular text that was generated by GPT-2 by simply predicting, word by word, which word is likely to appear after the previous ones.

Figure 2: Text generated by GPT-2 [source]
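The generation loop itself is simple. The sketch below uses a hand-built table of next-word probabilities (all numbers invented) in place of a neural network; GPT-2 runs essentially the same loop, except the distribution comes from a Transformer and words are usually sampled rather than always taking the single most likely one:

```python
# Greedy word-by-word generation with a toy next-word table.
# Real models like GPT-2 run the same loop, but the next-word
# distribution comes from a large neural network, and sampling is
# typically used instead of always taking the argmax.

next_word_probs = {  # invented probabilities, for illustration only
    "<s>":   {"Biden": 1.0},
    "Biden": {"flew": 0.6, "spoke": 0.4},
    "flew":  {"to": 0.7, "off": 0.3},
    "to":    {"Israel": 0.6, "Rome": 0.4},
}

def generate(start="<s>", max_words=5):
    words, current = [], start
    for _ in range(max_words):
        if current not in next_word_probs:
            break  # no known continuation; stop generating
        # greedy decoding: pick the single most likely next word
        current = max(next_word_probs[current], key=next_word_probs[current].get)
        words.append(current)
    return " ".join(words)

print(generate())  # "Biden flew to Israel"
```

Each iteration appends one word and feeds it back as the new context, which is exactly the iterative prediction described above.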

Next, let’s look at this example:

  2. Biden flew to _____

To do well here, the model needs to know that names of countries or cities are some of the more likely options to follow this text. Furthermore, since some destinations are more likely than others, this is also an example of a case where wider world knowledge rather than mere linguistic knowledge comes into play. To dwell on this important point a little longer, consider the following:

  3. Biden flew to Israel to meet with Prime Minister _____

To be successful here, the model needs to know who the prime minister of Israel is. 

Interestingly, it turns out that language models are indeed successful in acquiring both linguistic and world knowledge. But how do they do it? LM implementations were initially based on fairly straightforward n-gram counting techniques and were limited to predicting the next word based only on the preceding few words. The more recent neural network LMs that I mentioned in the previous section have outperformed them both in accuracy and in their ability to take very long contexts of up to thousands of words into account. I will not discuss here the technical details of how they achieved that. By and large, however, their powerful properties, which will be discussed next, are due to the combination of the following elements:

  1. As seen, doing language modeling well requires both linguistic and world knowledge.
  2. An abundance of training data freely available for this task (plain text) serves as an adequate source for capable models to acquire such knowledge.
  3. Powerful neural network model architectures, combined with massive compute power to train huge models on big data, make it feasible to train such capable models on lots of data.
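For a feel of what the early n-gram counting approach looked like, here is a minimal bigram model: it estimates which word follows another purely from counts of adjacent word pairs in a corpus. The corpus below is invented and absurdly small; real n-gram LMs were trained on millions of sentences, but the counting idea is the same:

```python
from collections import Counter, defaultdict

# A minimal bigram language model: estimate the most likely next word
# by counting adjacent word pairs in a (tiny, invented) corpus.

corpus = [
    "biden flew to israel",
    "biden flew to rome",
    "biden flew to israel today",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Most frequent continuation of `word` seen in the corpus."""
    return bigram_counts[word].most_common(1)[0][0]

print(predict_next("flew"))  # "to": the only observed continuation
print(predict_next("to"))    # "israel": seen twice vs. once for "rome"
```

The limitation mentioned above is visible here: the prediction depends on only one preceding word, whereas neural LMs can condition on thousands of words of context.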

Interacting with language models

Much like web services, machine learning models typically have a well-defined API, i.e. a strict format of the expected input/output, which varies from model to model and task to task. Recent language models support a more relaxed interface where both inputs and outputs are plain text. This makes it easier to train a single model on multiple different tasks and, as we’ll see later, sometimes even to interact with it in ways that were not envisioned during training. T5 is an example of such a model that was trained on several tasks, including translation, summarization, and question answering. Figure 3 shows possible inputs to T5 on the left and their respective outputs on the right.

Figure 3: T5’s text-to-text multi-task interface [source]
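The text-to-text idea can be sketched mechanically: the task is named by a prefix inside the input string, and the output is again a plain string. To be clear, the "handlers" below are hypothetical stand-ins for what a single trained T5 model learns to do end-to-end; a real model has no such explicit dispatch table:

```python
# A toy sketch of a T5-style text-to-text interface: inputs and outputs
# are both plain strings, and a task prefix inside the input selects the
# task. The handlers here are hypothetical stand-ins -- in T5 itself,
# one neural model learns all tasks and there is no explicit routing.

def count_words(text):
    return str(len(text.split()))

TASKS = {
    "summarize:":   lambda text: text.split(".")[0] + ".",  # fake "summary": first sentence
    "count words:": count_words,                            # made-up task, for illustration
}

def text_to_text(prompt):
    for prefix, handler in TASKS.items():
        if prompt.startswith(prefix):
            return handler(prompt[len(prefix):].strip())
    return "unknown task"

print(text_to_text("summarize: The model was trained on many tasks. It uses one interface."))
print(text_to_text("count words: language models put words together"))
```

The practical appeal is that adding a task changes only the training data, not the model's input/output format.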

How much labeled data do you need to train a good model?

Arguably, one of the advantages human intelligence has over artificial intelligence is its ability to learn a wide variety of tasks from relatively few examples. For instance, we only need to see a couple of pictures of a dog breed we’ve never encountered before we’re able to recognize it with high accuracy in an unseen dog image dataset. Machine learning models, on the other hand, have usually needed a lot of labeled training instances (typically on the order of hundreds or more) to reach good accuracy. Obtaining lots of high-quality labeled training data and going through long, and sometimes expensive, training cycles for every specific task is a pain. Luckily, transfer learning from self-supervised pre-trained models is rapidly changing this state of affairs for the better.

Not only does pre-training help achieve better accuracies, it has been shown to do so with less labeled data. ULMFiT is one of the early works that used language model pre-training to boost the performance of text classifiers. Figure 4 shows how pre-training achieves a considerable error reduction with as few as 100 training examples.

Figure 4: Text classification error rate as a function of training size. The ‘From scratch’ line describes the performance of a model without pre-training, while the ‘ULMFiT semi-supervised’ line describes a model that is based on extensive pre-training. [Source]

While doing well with 100 training examples is impressive, the holy grail in AI is to match or even surpass human performance with few-shot (a handful of labeled examples) or even zero-shot (no labeled examples at all) learning. While there have been various attempts at this in the past, arguably the most impressive of them all in the NLP domain is the recent GPT-3 model from OpenAI.

Few-shot and zero-shot learning with GPT-3

GPT-3 is based on a similar model architecture to its little brother GPT-2, but is about 100 times larger in terms of parameter count. Figure 5 shows how, over the span of as little as 2-3 years, parameter counts of language models skyrocketed from hundreds of millions to over three orders of magnitude more than that! The 1.6-trillion-parameter Switch model is the largest known LM at the time of this writing. However, since it has not been released to the public, I will focus for now on GPT-3. Training a 175-billion-parameter model on huge amounts of text is not a feat that can be accomplished by small organizations, as the required compute cost has been estimated at around a few million dollars. However, OpenAI has made GPT-3 available via a commercial API-based service. To the best of my knowledge, they are the first to turn a few-shot learning NLP model into a practical solution that seems to be well on its way to becoming a commercial success.

Figure 5: Language model sizes in billion parameters

Here’s an example from OpenAI of how to train a sentiment classifier with just a few examples. The following prompt is given as input to GPT-3, which is then simply asked to predict which words are most likely to come next.

This is a tweet sentiment classifier
Tweet: “I loved the new Batman movie!”
Sentiment: Positive
###
Tweet: “I hate it when my phone battery dies 💢”
Sentiment: Negative
###
Tweet: “My day has been 👍”
Sentiment: Positive
###
Tweet: “This is the link to the article”
Sentiment: Neutral
###
Tweet: “Few-shot learning is awesome!”
Sentiment:

GPT-3 correctly outputs ‘Positive’ in this case. The perplexing thing is that the exact same model is very good at predicting the correct completions in various other unrelated tasks, like text summarization, language translation, or even solving simple arithmetic questions – as long as it is given an appropriate short prompt. This is despite the fact that these tasks were not explicitly included in its training data (recall that it was just trained to predict likely next words on a huge corpus of plain text). So what’s happening here? Traditionally, a sentiment classification model would have to see many more examples at train time so it could update its internal parameters accordingly, and then at inference time, it would be available for use only for that particular task. In contrast, GPT-3 uses the same pre-trained parameters for various tasks. There’s no parameter updating based on the examples in the prompt. Rather, it’s as if the model already ‘magically’ learned all those different tasks during its pre-training, and the few examples in the prompt are used merely to define the API (the expected relation between input and output), rather than teach the model how to perform the task.
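Since the examples live entirely in the prompt, "adding training data" just means string assembly. The helper below is hypothetical (it only builds the text; in practice the result would be sent to a completion API such as GPT-3's):

```python
# Build a few-shot prompt of the kind shown above. This helper is
# hypothetical -- it only assembles the string. In practice the result
# would be sent to a text-completion API (e.g. GPT-3), whose next-word
# prediction then serves as the classification.

def build_prompt(task_description, examples, query):
    parts = [task_description]
    for tweet, sentiment in examples:
        parts.append(f'Tweet: "{tweet}"\nSentiment: {sentiment}')
    parts.append(f'Tweet: "{query}"\nSentiment:')  # left open for the model
    return "\n###\n".join(parts)

examples = [
    ("I loved the new Batman movie!", "Positive"),
    ("I hate it when my phone battery dies", "Negative"),
]
prompt = build_prompt("This is a tweet sentiment classifier", examples,
                      "Few-shot learning is awesome!")
print(prompt)
```

Note how this inverts the traditional workflow: no parameters are updated anywhere; the "training set" is rebuilt on every request.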

While few-shot learning is impressive enough, the ability of large LMs to learn specific tasks based solely on their simple pre-training objective becomes even more pronounced when exploring GPT-3’s zero-shot abilities. Evidently, on some tasks, GPT-3 does very well even without a single example in its input prompt. Here’s one example from OpenAI of a zero-shot prompt:

Student question: What’s the best way to divide negative numbers?
Question type (Math, English, History, Science, Other):

GPT-3 correctly predicts that the most likely next word for this text sequence is ‘Math’.

Which type of language model should you use?

With so many types of models and sizes, it’s sometimes hard to choose the best approach for the task at hand. At Anyword, we tackle various NLP tasks such as:

  • Generating effective marketing texts in various formats that promote products, services, publisher content, websites, etc.
  • Predicting the effectiveness and relevance of marketing texts to their target audience.
  • Improving existing customer texts to make them more effective.
  • Detecting grammatical errors in texts.

As experienced NLP and ML practitioners on one hand and an OpenAI beta partner on the other, we use both task-specific homegrown language models trained on huge amounts of our proprietary marketing data, as well as the few-shot GPT-3 model. Here are some tips on how to choose what might work best for your task based on our experience.

Use the few-shot GPT-3 model (or equivalent) when:

  • You don’t have a lot of good training data for your task.
    • If you can’t or don’t want to invest the effort in obtaining more than a few training examples, then a few-shot model is your no-brainer choice.
  • You want to develop new capabilities fast.
    • Prompt design requires only a handful of examples.
    • There’s no need to collect a lot of data and invest in lengthier model training cycles and associated infrastructure.
    • We have found that we can add new capabilities in as little as a few hours this way.
  • GPT-3 excels at the task you’re interested in.
    • Some tasks like summarization are apparently easy to learn solely based on self-supervised data (plain text). For those tasks, the size advantage of GPT-3 over smaller models like GPT-2 might translate into more accurate results.
  • You don’t have NLP and ML expertise.
    • Prompt design can be done by domain experts; no NLP/ML expertise is required.

Use task-specific fine-tuned language models like BERT or GPT-2 when:

  • GPT-3 doesn’t do well on your task.
    • Its ‘magical’ pre-training doesn’t prepare GPT-3 for every type of task. It might not do well on your task either because it’s too hard or because the information needed to perform it well is not available in plain text.
    • For example, it’s hard to learn how effective different texts are for different types of audiences just by reading a lot of public domain texts because click-through and conversion rates information is arguably missing there.
  • You have good training data and are willing to invest an expert’s time in fine-tuning and optimizing a model for your task.
    • Since we have this expertise in-house at Anyword, we have found investing the time worthwhile on many of our tasks.
  • You are more particular about the expected model outputs.
    • There is a limit to how specific you can be when specifying the expected model output with just a few examples, especially when the task at hand is not well known. When fine-tuning with more examples you generally have more control.
    • For example, we have found that it’s easier to control the length of the desired output text when fine-tuning a homegrown model.
  • You handle high volumes of traffic in production and want to save on costs.
    • The compute power required to host and operate a huge 175-billion-parameter model is costly, and that is reflected in the GPT-3 API pricing. Fine-tuning a homegrown model will cost more during research and training, but may cost much less in production if it turns out that a smaller model can do the same job.
  • You are sensitive to response times.
    • Smaller models are typically faster. If you can get a 110M or 340M parameter BERT model to do the job, it would most likely do it quicker than a 175B parameter one.