A new research paper from Cohere, Stanford, MIT, and AI2 claims that Chatbot Arena, a widely used AI benchmarking platform, has favored top tech companies like Meta, Google, OpenAI, and Amazon, giving them unfair advantages in leaderboard rankings. The paper accuses LM Arena, the organization behind Chatbot Arena, of allowing certain AI labs to privately test multiple models without publishing the scores of underperforming ones.
According to the researchers, this selective access helped big tech firms achieve higher placements on the Chatbot Arena leaderboard, while smaller companies and newer AI developers were left out of this opportunity. The authors argue that this unequal testing process undermines the platform’s fairness and transparency.
Preferential Testing for Top AI Companies
The study, which began in November 2024, analyzed over 2.8 million model matchups on Chatbot Arena during a five-month period. Chatbot Arena is known for allowing users to vote between two anonymized AI model responses in a side-by-side format. Over time, these votes contribute to each model’s public score and leaderboard position.
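The article doesn't detail the rating math, but the core mechanism of turning pairwise votes into a ranking can be illustrated with a simple Elo-style update. The sketch below is for illustration only and is not LM Arena's implementation (its published scores come from statistical models fit over all votes, such as Bradley-Terry); the K-factor, base rating, and model names here are assumptions.

```python
# Illustrative sketch: turning pairwise "battles" into a leaderboard with an
# Elo-style update. NOT LM Arena's actual code; constants and names are assumed.
from collections import defaultdict

K = 32               # update step size (assumed for illustration)
BASE_RATING = 1000.0 # every model starts at the same rating

ratings = defaultdict(lambda: BASE_RATING)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_battle(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after a user votes for one of two anonymized responses."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = 1.0 if winner == model_a else 0.0
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Example: three hypothetical battles feeding the leaderboard.
record_battle("model-x", "model-y", winner="model-x")
record_battle("model-x", "model-z", winner="model-z")
record_battle("model-y", "model-z", winner="model-z")

for name, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rating:.1f}")
```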
But according to the research, industry leaders like Meta were allowed to test multiple unreleased models on the platform. In one instance, Meta reportedly tested 27 different variants of its AI models privately between January and March 2025, just before launching Llama 4. However, only the score of a single high-performing variant was publicly revealed at launch, a move that allegedly helped the company land a top spot on the leaderboard.
“This is gamification,” said Sara Hooker, Cohere’s VP of AI research and a co-author of the paper, in an interview with TechCrunch. “Only a handful of companies were told that private testing was available, and the volume of private testing some labs had far exceeded others.”
LM Arena’s co-founder, UC Berkeley professor Ion Stoica, pushed back on the findings. Stoica labeled the study’s claims “inaccurate” and based on “questionable analysis.” LM Arena insisted it remains committed to community-driven and fair evaluation practices, adding that if one lab conducts more tests than another, it doesn’t mean there’s bias involved.
Still, the paper alleges that the data tells a different story. Researchers found that companies like Meta, OpenAI, and Google had significantly more opportunities to collect data by having their models appear in more user matchups, or “battles.” The authors say this increased sampling rate provided an edge in refining models and optimizing performance specifically for the Chatbot Arena leaderboard.
The researchers noted that access to additional Arena battle data can improve a model's score on Arena Hard, another benchmark maintained by LM Arena, by as much as 112%. LM Arena responded on X (formerly Twitter), stating that Arena Hard performance doesn't directly influence Chatbot Arena rankings.
The paper also points to a lack of transparency in how models are identified. To determine which models were undergoing private testing, the authors used a technique called “self-identification,” prompting the models to reveal their origin. While this method isn’t foolproof, Hooker stated that LM Arena didn’t dispute their initial findings when approached.
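The paper's exact probes aren't reproduced here, but the general shape of self-identification is straightforward: ask the anonymized model who built it and match the reply against a list of known labs. In the sketch below, query_model is a hypothetical stand-in for whatever interface returns a model's response, and the prompts and lab list are illustrative assumptions rather than the authors' actual setup.

```python
# Illustrative sketch of "self-identification": ask an anonymized model who made
# it and match the answer against known lab names. This is an assumption about
# the general approach, not the authors' code; `query_model` is a hypothetical
# callable that takes a prompt string and returns the model's text response.
import re

KNOWN_LABS = ["Meta", "OpenAI", "Google", "Anthropic", "Amazon", "Cohere", "Mistral"]

PROBE_PROMPTS = [
    "Which company created you?",
    "What is the name of the organization that trained you?",
]

def guess_provider(query_model) -> str | None:
    """Return the first known lab name the model mentions about itself, if any."""
    for prompt in PROBE_PROMPTS:
        answer = query_model(prompt)
        for lab in KNOWN_LABS:
            if re.search(rf"\b{re.escape(lab)}\b", answer, flags=re.IGNORECASE):
                return lab
    return None  # model declined to identify itself or gave no recognizable name

# Example with a stubbed model response:
if __name__ == "__main__":
    fake_model = lambda prompt: "I am a large language model trained by OpenAI."
    print(guess_provider(fake_model))  # -> "OpenAI"
```

As the article notes, this method isn't foolproof: models can refuse to answer or name the wrong provider, which is why the authors treated it as evidence rather than proof.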
Researchers Demand Greater Transparency from LM Arena
In response to their findings, the researchers called for several changes to improve the integrity of Chatbot Arena. Their suggestions include:
- Limiting private testing for AI labs and requiring public disclosure of all scores, including underperformers.
- Equalizing sampling rates so all models participate in a similar number of user battles.
- Increasing transparency in model testing procedures, especially for unreleased models.
LM Arena rejected the call to publish scores of unreleased models, arguing that such models aren’t accessible to the broader AI community for validation. Still, the organization acknowledged the suggestion to improve the fairness of model exposure and stated it is working on a new sampling algorithm to ensure balanced visibility among all entries.
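LM Arena hasn't published how the new sampler will work. One plausible approach, sketched below purely as an assumption, is to bias pair selection toward the models that have appeared in the fewest battles so far, so that every entry accumulates comparable exposure over time.

```python
# Illustrative sketch of exposure-balanced sampling: prefer pairing the models
# that have been shown to users least often. An assumption for illustration,
# not LM Arena's actual sampling algorithm.
import random
from collections import Counter

def sample_battle(models: list[str], battle_counts: Counter, jitter: float = 1.0) -> tuple[str, str]:
    """Pick two distinct models, biased toward those with the fewest battles."""
    def priority(model: str) -> float:
        # Lower battle count -> picked first; random jitter avoids deterministic pairings.
        return battle_counts[model] + random.uniform(0.0, jitter)

    first, second = sorted(models, key=priority)[:2]
    battle_counts[first] += 1
    battle_counts[second] += 1
    return first, second

# Example: one model starts with far more exposure than the others,
# so the sampler keeps routing battles to the under-sampled entries.
models = ["model-a", "model-b", "model-c", "model-d"]
counts = Counter({"model-a": 500, "model-b": 20, "model-c": 25, "model-d": 18})
for _ in range(5):
    print(sample_battle(models, counts))
```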
The issue of fairness in AI benchmarking was already under scrutiny after Meta was accused of manipulating Chatbot Arena scores for its Llama 4 launch. According to the researchers, Meta optimized a version of Llama 4 to perform well in conversational settings and submitted only that variant for ranking, without ever releasing it publicly. The version Meta did release later showed much weaker performance on the benchmark.
LM Arena admitted Meta should have been more transparent during that process.
This latest controversy arrives as LM Arena transitions into a for-profit company and begins raising investment capital. Critics argue that this evolution makes it even more important to hold benchmarking organizations accountable. As AI plays an increasingly central role in technology and decision-making, ensuring transparent and equitable model evaluations will remain a critical concern for researchers and developers worldwide.