Anyword’s Language Optimization Platform is a product designed to help marketers articulate their marketing copy more easily and effectively to their target audiences. To do this successfully, our underlying Natural Language Processing (NLP) technology has to excel both at understanding and generating natural language. In this post, I review the incredible progress in the field of NLP that has made this product (and others) possible, and then share some tips based on our experience with these methods.
Since supervised machine learning (ML) is the underlying technique behind much of the latest progress in NLP, I will start with a very quick intro for readers with no ML background. Plainly put, in supervised machine learning we try to teach models to perform some target task, i.e. to produce the desired output for a given input. There are many types of tasks and input/output specifications. Here are a few examples:
| Task | Input | Output |
| --- | --- | --- |
| Image classification | Raw image | Which type of animal appears in the image |
| Part-of-speech tagging | Text | Part of speech for every word in the text |
| Question answering | Text of a question | Text of the answer to the question |
| Text engagement prediction | Marketing text + target audience | Engagement potential between 0 and 100 |
To achieve this goal, the model is first trained on a supervision dataset that comprises pairs of inputs and their corresponding expected outputs. During training, the model learns by updating its internal parameters, which determine how it maps inputs to outputs. Once training is over, the model is evaluated on its ability to predict outputs for new inputs that it has not seen before. The success of such supervised models depends heavily on the quantity and quality of the supervision data, and on how the inputs are represented to the model.
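To make this cycle concrete, here is a minimal sketch of the supervised train/evaluate loop using scikit-learn on a synthetic toy dataset. It is purely illustrative; none of the model or data choices below come from Anyword's actual systems.

```python
# A minimal sketch of the supervised train/evaluate cycle (toy data, not a real Anyword model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy supervision dataset: each row of X is an input, each entry of y its expected output.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # training: the model updates its internal parameters

predictions = model.predict(X_test)    # inference on inputs the model has not seen before
print("accuracy on unseen inputs:", accuracy_score(y_test, predictions))
```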
As in other fields, many early applications of machine learning in NLP involved a great deal of attention to feature engineering, which deals with how to convert the inputs into a feature representation that is most conducive to learning. NLP was particularly known for its ‘Curse of Dimensionality’ due to the high number of features (typically ~100K) that you had to deal with if you wanted to consider each unique word as a feature. Things got even worse if, in order to capture more linguistic nuance, one also included multi-word features (n-grams) like ‘not good’. In addition to classic feature selection techniques, early linguistically-motivated approaches to this challenge included stemming and lemmatization techniques that merged different surface forms of the same stem or lemma into a single feature (e.g. mapping both ‘airplane’ and ‘airplanes’ to the same feature). Helpful as they were, there was still much room for improvement.
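The following toy snippet illustrates this dimensionality problem and how stemming mitigates it, using scikit-learn and NLTK as stand-ins for the classic tooling. The two example sentences are made up for illustration.

```python
# Illustration of classic bag-of-words feature engineering and why dimensionality balloons.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "the airplane was not good",
    "the airplanes were good",
]

# Unigram features: every unique word becomes its own dimension.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(texts)
print(len(unigrams.vocabulary_), "unigram features")

# Adding bigrams such as 'not good' captures more nuance but multiplies the feature count.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(texts)
print(len(bigrams.vocabulary_), "unigram + bigram features")

# Stemming merges surface forms of the same stem into a single feature.
stemmer = PorterStemmer()
print(stemmer.stem("airplane"), stemmer.stem("airplanes"))  # both map to the same stem
```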
A major breakthrough in NLP came in the form of self-supervised pre-training. The basic idea is: “First teach the model about English in general and only then train it on your target task.” (Note: I occasionally refer to English in this post but the same principles pretty much apply to any other human language). If you care about machine learning terminology, this is called a ‘transfer learning’ technique as knowledge learned in one domain (general English) is transferred to another specific task like text classification or information extraction.
The concept of self-supervised pre-training in NLP dates back a number of years, with techniques such as Latent Semantic Analysis (LSA), but it really started to take off at the beginning of the 2010s with the introduction of self-supervised word embeddings. Arguably, Collobert’s work “Natural Language Processing (almost) from Scratch” and Mikolov’s word2vec embeddings mark the beginning of this phase. These works and others presented very efficient and effective techniques for teaching models about the semantics of words, using no manually labeled data, as part of a ‘pre-training’ phase. They do so largely by ingesting large amounts of text and simply observing word co-occurrences. This type of training is called ‘self-supervised’ because you don’t need to supply labeled supervision, just lots of (usually publicly available) plain text, which provides both the inputs and the outputs. During this training, these models learn a mapping of every English word to a vector in a low-dimensional embedding space (a few hundred to ~1K dimensions), and their language ‘understanding’ properties stem from the fact that words with similar meanings or functions are typically mapped to similar vectors in this space. Figure 1 shows a 2D illustration of such a learned embedding space. With a pre-trained model that ‘understands’ words in hand, task-specific models built on these word representations were able to achieve better results with less task-specific training data.
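As a quick illustration of what such embeddings capture, here is a short sketch using gensim to load publicly available pre-trained vectors. GloVe vectors are used only because they are small and easy to download; word2vec embeddings behave similarly.

```python
# Pre-trained word embeddings: similar words end up close together in vector space.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # downloads 50-dimensional GloVe vectors

print(vectors.most_similar("airplane", topn=3))     # nearest neighbors are aviation terms
print(vectors.similarity("airplane", "aircraft"))   # high similarity
print(vectors.similarity("airplane", "banana"))     # much lower similarity
```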
Learning meaningful word representations was a significant step forward for NLP models, but language is much more than the simple sum of its words. Understanding how words are put together in sequence is key to comprehending language phenomena like word senses (as in ‘book a ticket’ vs. ‘read a book’) and multi-word expressions (as in being ‘under the weather’), not to mention larger language units like entire sentences and even longer texts. Luckily, with the constant increase in compute power, the abundance of unlabeled text, and more sophisticated neural network architectures, pre-training techniques progressed to teach models exactly that. It started with learning word context and context-aware representations like context2vec and ELMo, which were based on sequence-aware neural models called Recurrent Neural Networks (RNNs). Then it continued with even more powerful models like BERT, GPT-2, T5 and many of their successors. Like their simpler predecessors, these self-supervised pre-trained models are first trained on lots of plain text and then fine-tuned to particular target tasks with task-specific supervision. However, with larger sizes and a more powerful neural architecture that can better attend to the interrelations between different words in a sequence (called ‘Transformers’), they have come to dominate pretty much all of the NLP benchmark leaderboards today, from sentiment analysis to question answering. Knowing how to put words together is the essence of being skilled in any language. Hence, it is only appropriate that these pre-trained models are called language models.
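To see what ‘context-aware’ means in practice, here is a small sketch that extracts the representation a pre-trained BERT model assigns to the word ‘book’ in two different contexts. The model name and sentences are illustrative, and the snippet assumes each probed word stays a single WordPiece token.

```python
# Sketch: the same word gets different context-aware vectors from a pre-trained BERT model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual vector BERT assigns to `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # shape: (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                           # assumes `word` is one token

v_book_verb = embedding_of("i want to book a ticket", "book")
v_book_noun = embedding_of("i want to read a book", "book")
v_reserve = embedding_of("i want to reserve a ticket", "reserve")

cos = torch.nn.functional.cosine_similarity
# Typically, 'book' in the reservation sense lands closer to 'reserve' than to 'book' the noun.
print(cos(v_book_verb, v_reserve, dim=0).item(), cos(v_book_verb, v_book_noun, dim=0).item())
```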
While there are many variants of pre-trained language models, let’s look at the classic autoregressive language modeling task formulation to get a general feel for it. Simply put, the task is to predict the next word given any sequence of preceding words. Let’s start by considering a simple concrete example:
In this case, the model is required to predict the most likely words that could fill in the blank right after the sequence “Biden flew”. It’s instructive to imagine what kind of knowledge the model needs to learn in order to do well on this task. In this case, for example, understanding English grammar and common usage patterns is very useful. It would help rule out simple present tense verb options (“Biden flew watch” doesn’t work) or singular noun options (“Biden flew airplane” doesn’t work either), while preferring prepositions or verb particles that are often used with the verb “fly”, like “to” or “off”.
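Here is a hedged sketch of how you could query a publicly available autoregressive LM for its most likely next words after this prefix. GPT-2 via the Hugging Face transformers library is used purely as an accessible example.

```python
# Asking a pre-trained autoregressive LM (GPT-2) which words are most likely to follow a prefix.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Biden flew", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # shape: (1, tokens, vocab_size)

next_token_probs = logits[0, -1].softmax(dim=-1)    # distribution over the next token
top = torch.topk(next_token_probs, k=5)
# Expect prepositions/particles such as ' to', ' back' or ' into' to rank highly.
print([tokenizer.decode(int(i)) for i in top.indices])
```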
One of the interesting ways in which models trained to perform this language modeling task can showcase their linguistic capabilities is when they are used to generate text. Arguably, the introduction of GPT-2 in 2019 marked the beginning of a new era in NLP by showing that an LM can master very impressive linguistic knowledge, as exemplified by its ability to generate long pieces of text that are almost indistinguishable from human-written ones. Figure 2 shows an excerpt of a popular text that was generated by GPT-2 by simply and iteratively predicting, word by word, which word is likely to appear after the previous ones.
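Generating such text yourself takes only a few lines of code with the publicly released GPT-2 weights. The prompt below echoes OpenAI’s well-known ‘unicorns’ demo; the exact wording and sampling settings here are illustrative.

```python
# Iterative word-by-word generation with GPT-2 via the Hugging Face pipeline.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation reproducible
generator = pipeline("text-generation", model="gpt2")

prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote valley."
print(generator(prompt, max_length=80, num_return_sequences=1)[0]["generated_text"])
```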
Next, let’s look at this example:
To do well here, the model needs to know that names of countries or cities are some of the more likely options to follow this text. Furthermore, since some destinations are more likely than others, this is also an example of a case where wider world knowledge rather than mere linguistic knowledge comes into play. To dwell on this important point a little longer, consider the following:
To be successful here, the model needs to know who the prime minister of Israel is.
Interestingly, it turns out that language models are indeed successful in acquiring both linguistic and world knowledge. But how do they do it? LM implementations were initially based on fairly straightforward n-gram counting techniques and were limited to predicting the next word based on a context of no more than a few words. The more recent neural network LMs that I mentioned in the previous section have outperformed them both in accuracy and in their ability to take very long contexts of up to thousands of words into account. I will not discuss here the technical details of how they achieved that. However, by and large, their powerful properties, which will be discussed next, are due to the combination of a more expressive neural architecture (the Transformer), much larger model sizes, and pre-training on massive amounts of text.
Much like web services, machine learning models typically have a well-defined API, i.e. a strict format of the expected input/output which varies from model to model and task to task. Recent language-aware models now support a more relaxed interface where both inputs and outputs are plain text. This makes it easier to train a single model on multiple different tasks and, as we’ll later see, sometimes even interact with it in ways that were not envisioned during training. T5 is an example of such a model that was trained on several tasks, including translation, summarization, and question answering. Figure 3 shows possible inputs to T5 on the left and their respective outputs on the right.
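Because the interface is just text in and text out, trying T5 on several different tasks only requires changing the task prefix in the input string. Here is a short sketch using the publicly released t5-small checkpoint; the prompts follow the conventions shown in Figure 3, and the summarization input is an invented example.

```python
# T5's text-to-text interface: the task is named inside the input string itself.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Anyword's platform helps marketers write copy that resonates with their target audiences.",
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```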
Arguably, one of the advantages human intelligence has over artificial intelligence is its ability to learn a wide variety of tasks from relatively few examples. For instance, we only need to see a couple of pictures of a dog breed we’ve never encountered before we’re able to recognize it with high accuracy in unseen dog images. Machine learning models, on the other hand, have usually needed a lot of labeled training instances (typically on the order of hundreds or more) to reach good accuracy. Obtaining lots of high-quality labeled training data and going through long, and sometimes expensive, training cycles for every specific task is a pain. Luckily, transfer learning from self-supervised pre-trained models is rapidly changing this state of affairs for the better.
Not only does pre-training help achieve better accuracy, it has been shown to do so with less labeled data. ULMFiT is one of the early works that used language model pre-training to boost the performance of text classifiers. Figure 4 shows how pre-training achieves a considerable error reduction with as few as 100 training examples.
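In today’s tooling, this ‘pre-train once, fine-tune with a small labeled set’ recipe typically looks something like the sketch below. Hugging Face Transformers is used here as a stand-in for ULMFiT’s original fastai implementation, and the eight labeled examples are invented for illustration.

```python
# Sketch: fine-tuning a pre-trained model on a tiny labeled set.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["great product", "loved it", "amazing service", "works perfectly",
         "terrible quality", "waste of money", "very disappointing", "broke in a day"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = positive, 0 = negative (made-up data)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                           padding="max_length", max_length=32))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tiny-sentiment", num_train_epochs=5,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
)
trainer.train()   # only the small labeled set is needed; the heavy lifting was done in pre-training
```

Even with so few labeled examples, starting from pre-trained weights typically beats training the same architecture from scratch, which is the point Figure 4 makes quantitatively.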
While doing well with 100 training examples is impressive, the holy grail in AI is to match or even surpass human performance with few-shot (a handful of labeled examples) or even zero-shot (no labeled examples at all) learning. While there have been various attempts at this in the past, arguably the most impressive of them all in the NLP domain is the recent GPT-3 model from OpenAI.
GPT-3 is based on a similar model architecture to its little brother GPT-2, but is about 100 times larger in terms of parameter count. Figure 5 shows how, over a span of just 2-3 years, parameter counts of language models skyrocketed from hundreds of millions to over three orders of magnitude more than that! The 1.6-trillion parameter Switch model is the largest known LM at the time of this writing. However, since it has not been released to the public, I will focus for now on GPT-3. Training a 175-billion parameter model on huge amounts of text is not a feat that can be accomplished by small organizations, as the required compute cost has been estimated at around a few million dollars. However, OpenAI has made GPT-3 available via a commercial API-based service. To the best of my knowledge, they are the first to turn a few-shot learning NLP model into a practical solution that seems to be well on its way to becoming a commercial success.
Here’s an example from OpenAI of how to train a sentiment classifier with just a few examples. The following prompt is given as input to GPT-3, which is then simply asked to predict which words are most likely to come next.
This is a tweet sentiment classifier
Tweet: "I loved the new Batman movie!"
Sentiment: Positive
###
Tweet: "I hate it when my phone battery dies 💢"
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "Few-shot learning is awesome!"
Sentiment:
GPT-3 correctly outputs 'Positive’ in this case. The perplexing thing is that the exact same model is very good at predicting the correct completions in various other unrelated tasks, like text summarization, language translation, or even solving simple arithmetic questions - as long as it is given an appropriate short prompt. This is despite the fact that these tasks were not explicitly included in its training data (recall that it was just trained to predict likely next words on a huge corpus of plain text). So what’s happening here? Traditionally, a sentiment classification model would have to see many more examples at train time so it could update its internal parameters accordingly, and then at inference time, it would be available for use only for that particular task. In contrast, GPT-3 uses the same pre-trained parameters for various tasks. There’s no parameter updating based on the examples in the prompt. Rather, it’s as if the model already ‘magically’ learned all those different tasks during its pre-training, and the few examples in the prompt are used merely to define the API (the expected relation between input and output), rather than teach the model how to perform the task.
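For completeness, here is roughly what sending that prompt to the GPT-3 API looked like, assuming the original v0.x openai Python bindings and the legacy Completions endpoint; the engine name and decoding settings are illustrative.

```python
# Sketch of a few-shot GPT-3 call using the legacy v0.x openai bindings (illustrative settings).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = """This is a tweet sentiment classifier
Tweet: "I loved the new Batman movie!"
Sentiment: Positive
###
Tweet: "I hate it when my phone battery dies 💢"
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "Few-shot learning is awesome!"
Sentiment:"""

response = openai.Completion.create(
    engine="davinci",   # the base GPT-3 model exposed by the API at the time
    prompt=prompt,
    max_tokens=3,       # a few tokens suffice for the label
    temperature=0,      # deterministic: pick the most likely completion
    stop="\n",
)
print(response.choices[0].text.strip())   # expected: 'Positive'
```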
While few-shot learning is impressive enough, the ability of large LMs to learn specific tasks from nothing more than their simple pre-training objective becomes even more pronounced when exploring GPT-3’s zero-shot abilities. Evidently, on some tasks, GPT-3 does very well without a single example in its input prompt. Here’s one example from OpenAI of a zero-shot prompt:
Student question: What's the best way to divide negative numbers?
Question type (Math, English, History, Science, Other):
GPT-3 correctly predicts that the next most likely word for this text sequence is Math.
With so many types of models and sizes, it’s sometimes hard to choose the best approach for the task at hand. At Anyword, we tackle a variety of NLP tasks, from generating marketing copy to predicting how well a text will engage a given target audience.
As experienced NLP and ML practitioners on one hand and an OpenAI beta partner on the other, we use both task-specific homegrown language models trained on huge amounts of our proprietary marketing data, as well as the few-shot GPT-3 model. Here are some tips on how to choose what might work best for your task based on our experience.
Anyword is an AI copywriting tool that empowers marketers to optimize and generate copy at scale. Trained on $250M worth of Facebook ads, Anyword harnesses the power of AI copywriting to generate variations that are made to convert, saving marketers time and money.