OpenAI & DeepSeek: How They Stack Up Against Models Trained with A/B Testing Data 

Oren Melamud
Reading time 4 min
AI for Enterprise

Models are getting stronger and faster - but the feedback loop remains broken, which is why AI copywriting underperforms.

Foundational reasoning models, such as OpenAI’s O1 and the recently introduced DeepSeek R1, have shown extremely impressive capabilities, reasoning through and solving a wide variety of problems and often even surpassing the level of human experts. Yet, in many cases foundational models are incapable of powering mission-critical business functions because they lack the specialized knowledge needed to get the job done. Specifically, at Anyword, we know that writing content that performs really well is hard and requires extensive A/B testing data. Why is that, and how big is the gap?

Last month, the world woke up to breaking news about R1, a reasoning model from the Chinese company DeepSeek that is on par with the latest and greatest OpenAI O1 model in terms of competency, yet is available at a fraction of the price, and – maybe even more importantly – is released as open source with open weights and a permissive license allowing anyone to use it for commercial purposes. The markets are still digesting this event, with pundits trying to wrap their heads around the business and geopolitical ramifications and technology leaders reconsidering their AI strategy.

At Anyword, we were interested in seeing whether the advent of stronger reasoning models, like O1 and R1, could disrupt the ability to generate better-performing copy. Could they reason about the potential of a piece of copy to engage a target audience the way they reason about the solution to a math problem?

The assumption we use at Anyword is that a model that can predict the performance of generated copy can, by definition, generate better-performing copy and keep optimizing itself.
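To make that assumption concrete, here is a minimal sketch of the generate-then-rank idea: produce several candidate variants with any foundation model, score each with a performance predictor, and keep the highest-scoring one. All function names and the scoring logic below are hypothetical placeholders for illustration, not Anyword’s or any vendor’s actual API.

```python
# Minimal sketch of "predict performance, then select". All names and the
# scoring logic are hypothetical placeholders, not an actual vendor API.

def generate_variants(brief: str, n: int = 10) -> list[str]:
    """Stand-in for asking a foundation model for n candidate copy variants."""
    return [f"{brief} – variant {i}" for i in range(n)]

def predict_performance(copy: str) -> float:
    """Stand-in for a performance prediction score on a 1-100 scale."""
    return float(len(copy) % 100)  # placeholder logic, not a real predictor

def best_copy(brief: str) -> str:
    """Generate candidates, score each, and keep the highest-predicted variant."""
    candidates = generate_variants(brief)
    return max(candidates, key=predict_performance)

print(best_copy("Spring sale on running shoes"))
```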

For years, we’ve been telling our customers, highly innovative marketers and product leaders, that foundational models will make generating lots of copy content (“just words”) increasingly easier, faster and cheaper – yet generating meaningful, business-impacting copy content (“words that matter”) remains a tough challenge that requires special means.

And so, we put it to the test:

Methodology

Anyword’s feedback platform allows us to actually “close the loop” and inform our models of what really works. In other words, we feed our models with A/B testing data on which marketing copy variants produced more ad engagement than others, all other factors being equal. This is the kind of data that isn’t available to foundational models, and that can truly make the difference between “AI guessing” and an AI that is actually able to predict what would work in a business context. The question was clear: Foundational models are smarter than ever - but do they know what copy actually works?

We used 917 real Facebook campaigns from real companies that Anyword powered in 2024. In each of these campaigns, we compared two ads that were A/B tested against each other. To focus on what we care about – the engagement of the copy – the only difference between the A and B versions of the ad was the primary text copy.

As part of the experiment, we asked the compared models to score each Facebook primary text for its potential to achieve a high click-through rate (CTR) for the ad, on a scale of 1-100. This was done using a simple prompt presented to the O1 and R1 models. Next, as another reference model, we examined the Anyword Performance Prediction Score for each of these texts. Lastly, we evaluated all of the above models based on the actual performance of the campaigns while accounting for some variability in the campaign data.

The way to determine success was very simple: Did the model score the copy that performed better higher than the one that performed worse?
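For clarity, here is an illustrative sketch of that pairwise evaluation. The prompt wording and the callable model interface are assumptions made for this example; the actual prompts and API calls used in the experiment are not reproduced here.

```python
# Illustrative sketch of the pairwise evaluation described above. The prompt
# wording and the callable `model` interface are assumptions for this example.

def score_copy(model, primary_text: str) -> float:
    """Ask a model to rate a Facebook primary text's CTR potential, 1-100."""
    prompt = (
        "On a scale of 1-100, how likely is this Facebook ad primary text "
        f"to achieve a high click-through rate?\n\n{primary_text}"
    )
    return model(prompt)  # assumed to return a numeric score

def pairwise_accuracy(model, campaigns: list[tuple[str, str, str]]) -> float:
    """campaigns: (copy_a, copy_b, winner) triples, where winner is 'a' or 'b'."""
    correct = 0
    for copy_a, copy_b, winner in campaigns:
        predicted = "a" if score_copy(model, copy_a) > score_copy(model, copy_b) else "b"
        correct += predicted == winner
    return correct / len(campaigns)

# Toy usage with a trivial stand-in "model" that scores by prompt length:
dummy_model = lambda prompt: float(len(prompt))
print(pairwise_accuracy(dummy_model, [("short copy", "a much longer copy", "b")]))
```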

Results

To get a sense of the range of possible results, we note that a ‘random’ baseline (i.e., a model that just guesses which of the compared A/B copies is better for each campaign) would be correct half of the time and hence get an accuracy score of about 50%, while a perfect model that always makes the correct prediction would score 100% accuracy. The results of the compared models — Anyword’s Prediction model, OpenAI O1 model, and DeepSeek R1 model — are presented in the following diagram:

As can be clearly seen, both O1 and R1 only modestly surpass the 50% random baseline, with accuracies of 62% and 56%, respectively. We hypothesize that the small advantage they do manage to obtain over random selection might be due to their ability to capture some copywriting best practices as they appear in different sources such as the web, books, etc. This is an interesting topic in itself that we plan to shed more light on in a future post.

On the other hand, despite being much smaller and faster, the Anyword Performance Prediction model scores significantly higher, with an impressive 82% accuracy. The secret? As mentioned above, it was trained on A/B testing data, which allows it to outperform both expert copywriters and foundational models trying to imitate them.

What does this mean for your business?

The bottom line here is rather clear: while OpenAI’s O1 and the new DeepSeek R1 model have many benefits and can be applied to various tasks, they fall short in their ability to predict marketing copy engagement, which might be crucial in a business setting. The inability to significantly outperform random guessing is actually slightly alarming and suggests a potential negative business impact for companies that choose to exclusively rely on such models for content generation tasks.

And this isn’t likely to be solved soon either. We have been following the development of OpenAI models for years, and our research suggests that there has been very limited progress in their ability to predict actual copy performance despite significant improvements in other problem domains. It all boils down to the simple principle that machine learning models can only be as good as their data — and performance data is typically not publicly available. It also isn’t easily digested in text form and needs to go through special preprocessing pipelines to be useful.

The good news: There are specialized models that can do a much better job for your business and augment rather than replace foundational models if you’re using them elsewhere. Anyword is such a solution — providing the much-needed performance layer in the AI writing assistants space, based on its ability to “close the performance loop.” With model costs constantly decreasing and performance layers and applications improving, the ROI of AI-generated content for your business is expected to grow significantly, keeping you ahead of the competition.
