Andrej Karpathy, the AI researcher and founder of Eureka Labs, recently shared an experiment called “LLM-Council,” which sends a user query to multiple language models, lets them anonymously judge each other’s answers, and then produces a final response based on their rankings.
In these runs, the model the council consistently ranked highest was OpenAI's GPT-5.1. The result is notable given that recent benchmarks had suggested Google's Gemini 3.0 overtook OpenAI's models in overall capability and reasoning tests.
“Quite often, the models are surprisingly willing to select another LLM’s response as superior to their own, making this an interesting model evaluation strategy more generally,” said Karpathy.
“For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between.”
Karpathy’s experiment setup is a three-step loop.
First, the user’s query is sent to all models separately, and their answers are shown side-by-side without revealing who wrote what.
Next, each model sees the others’ responses, still anonymised, and ranks them on accuracy and insight. Finally, a “chairman model” produces the final answer by combining the council’s outputs and critiques, turning the response into a consensus built through competition.
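The three-step loop can be sketched in a few lines of Python. This is a simplified illustration, not code from Karpathy's repository: the lambda "models", the length-based ranking heuristic, and the `chairman` vote-counter are all placeholders standing in for real LLM API calls and model-generated rankings.

```python
from collections import Counter

def llm_council(query, models, chairman):
    """Run one round of the council loop described above."""
    # Step 1: send the query to every model and anonymise the answers
    # behind neutral IDs so no ranker knows who wrote what.
    answers = {f"model_{i}": m(query) for i, m in enumerate(models)}

    # Step 2: each council member ranks the anonymised responses, best
    # first. A length heuristic stands in here for a real model's
    # judgment of accuracy and insight.
    rankings = [
        sorted(answers, key=lambda rid: -len(answers[rid]))
        for _ in models
    ]

    # Step 3: the chairman combines answers and rankings into one reply.
    return chairman(query, answers, rankings)

def chairman(query, answers, rankings):
    """Toy chairman: return the response ranked first most often."""
    top = Counter(r[0] for r in rankings).most_common(1)[0][0]
    return answers[top]

# Placeholder "models": plain functions returning canned answers.
models = [
    lambda q: "A long, detailed answer to: " + q,
    lambda q: "Short answer.",
    lambda q: "A medium answer to: " + q,
]

print(llm_council("What is attention?", models, chairman))
```

In the real system, each placeholder would be an API call to a different provider, and step 2 would prompt each model with the anonymised transcripts rather than compare lengths.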
However, Karpathy also noted that these rankings are subjective and don’t necessarily match his own judgment.
As he put it, “I’m not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively, I find GPT 5.1 a little too wordy and sprawled and Gemini 3 a bit more condensed and processed. Claude is too terse in this domain.”
He revealed that he built this project over the weekend using a ‘vibe coding’ tool and shared the repository on GitHub.
Reacting to Karpathy’s post on X, Vasuman M, founder and CEO of Varick AI Agents, said he had built something similar months ago and observed the same pattern from OpenAI’s models.
“Even after plugging in Gemini 3.0, the winner was GPT 5.1, every single time,” he said. “Even funnier, if you tell other models (Claude, Gemini, Grok) that the answer they are reading came from GPT (un-anonymise), they fold immediately and start correcting themselves based on GPT’s output.”
