Andrej Karpathy, the AI researcher and founder of Eureka Labs, recently shared an experiment called “LLM-Council,” which sends a user query to multiple language models, lets them anonymously judge each other’s answers, and then produces a final response based on their rankings.
In these runs, the model the council consistently ranked highest was OpenAI's GPT-5.1. The result is notable given that recent benchmarks had suggested Google's Gemini 3.0 overtook OpenAI's models in overall capability and reasoning tests.
“Quite often, the models are surprisingly willing to select another LLM’s response as superior to their own, making this an interesting model evaluation strategy more generally,” said Karpathy.
“For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between.”
Karpathy’s experiment setup is a three-step loop.
First, the user’s query is sent to all models separately, and their answers are shown side-by-side without revealing who wrote what.
Next, each model sees the others’ responses, still anonymised, and ranks them on accuracy and insight. Finally, a “chairman model” produces the final answer by combining the council’s outputs and critiques, turning the response into a consensus built through competition.
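The three-step loop can be sketched in a few lines of Python. This is a simplified illustration, not code from Karpathy's repository: the lambda "models", the length-based ranking heuristic, and the `chairman` vote-counter are all placeholders standing in for real LLM API calls and model-generated rankings.

```python
from collections import Counter

def llm_council(query, models, chairman):
    """Run one round of the council loop described above."""
    # Step 1: send the query to every model and anonymise the answers
    # behind neutral IDs so no ranker knows who wrote what.
    answers = {f"model_{i}": m(query) for i, m in enumerate(models)}

    # Step 2: each council member ranks the anonymised responses, best
    # first. A length heuristic stands in here for a real model's
    # judgment of accuracy and insight.
    rankings = [
        sorted(answers, key=lambda rid: -len(answers[rid]))
        for _ in models
    ]

    # Step 3: the chairman combines answers and rankings into one reply.
    return chairman(query, answers, rankings)

def chairman(query, answers, rankings):
    """Toy chairman: return the response ranked first most often."""
    top = Counter(r[0] for r in rankings).most_common(1)[0][0]
    return answers[top]

# Placeholder "models": plain functions returning canned answers.
models = [
    lambda q: "A long, detailed answer to: " + q,
    lambda q: "Short answer.",
    lambda q: "A medium answer to: " + q,
]

print(llm_council("What is attention?", models, chairman))
```

In the real system, each placeholder would be an API call to a different provider, and step 2 would prompt each model with the anonymised transcripts rather than compare lengths.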
However, Karpathy also noted that these rankings are subjective and don’t necessarily match his own judgment.
As he put it, “I’m not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively, I find GPT 5.1 a little too wordy and sprawled and Gemini 3 a bit more condensed and processed. Claude is too terse in this domain.”
He revealed that he built this project over the weekend using a ‘vibe coding’ tool and shared the repository on GitHub.
Reacting to Karpathy’s post on X, Vasuman M, founder and CEO of Varick AI Agents, said he had built something similar months ago and observed the same pattern from OpenAI’s models.
“Even after plugging in Gemini 3.0, the winner was GPT 5.1, every single time,” he said. “Even funnier, if you tell other models (Claude, Gemini, Grok) that the answer they are reading came from GPT (un-anonymise), they fold immediately and start correcting themselves based on GPT’s output.”
