Andrej Karpathy Builds AI Research System Using 630 Lines of Code

The repository contains a simplified training implementation derived from Karpathy’s nanochat project and is designed to run on a single NVIDIA GPU.

Andrej Karpathy, the former OpenAI researcher, has released a minimal open-source project that allows AI agents to autonomously run and iterate on large language model training experiments using a compact codebase of roughly 630 lines.

The project, called “autoresearch”, was published on GitHub and is designed to let AI agents modify model training code, run short experiments, evaluate the results, and repeat the process in an automated research loop.

Karpathy said the goal of the project is to create a small yet functional research setup in which an AI agent can test architectural changes, hyperparameters, and optimisation strategies without continuous human supervision. In the repository documentation, he described the concept as giving “an AI agent a small but real LLM training setup” and letting it “experiment autonomously overnight.”

The system separates the roles of humans and AI agents: researchers modify a Markdown file that defines the agent's research instructions, while the agent edits a single Python file containing the model architecture, optimiser, and training loop.

Each training run operates under a fixed five-minute wall-clock time budget, which ensures experiments remain comparable even when the agent changes model size, batch size, or architecture. According to the repository, this constraint allows roughly 12 experiments per hour and about 100 experiments overnight.
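The budgeting idea can be sketched in a few lines. This is a minimal illustration of a fixed wall-clock cutoff, not the project's actual code; `run_experiment` and `train_step` are hypothetical names introduced here for clarity.

```python
import time

TIME_BUDGET_S = 5 * 60  # fixed five-minute wall-clock budget per run


def run_experiment(train_step, budget_s: float = TIME_BUDGET_S) -> int:
    """Run training steps until the wall-clock budget expires.

    Hypothetical sketch: `train_step` stands in for one optimiser step.
    Because every run gets the same wall-clock allowance, a larger model
    simply completes fewer steps, keeping runs comparable.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        train_step()
        steps += 1
    return steps
```

Under this scheme, changing model size or batch size changes how many steps fit inside the budget, but never the cost of a single experiment, which is what makes roughly 12 runs per hour a fixed rate.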

The agent evaluates each run using a validation metric called bits-per-byte and iteratively modifies the training script to improve results.
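Bits-per-byte normalises a language model's cross-entropy loss by the number of raw bytes in the text, which makes scores comparable across tokenisers. A minimal sketch of the conversion, assuming the loss is a mean per-token negative log-likelihood in nats (the helper name `bits_per_byte` is an illustration, not the repository's API):

```python
import math


def bits_per_byte(mean_nll_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits-per-byte.

    Hypothetical helper: total nats over the dataset, divided by ln(2)
    to convert nats to bits, divided by the byte count of the raw text.
    Lower is better.
    """
    total_nats = mean_nll_nats * num_tokens
    return total_nats / (math.log(2) * num_bytes)
```

For example, a mean loss of ln(2) nats per token on a byte-level model (one token per byte) corresponds to exactly 1.0 bits-per-byte.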

The project drew praise from developers on social media for demonstrating how AI agents could automate parts of the machine-learning experimentation process by continuously generating and testing new model configurations.

Among those reacting publicly was Tobi Lütke, CEO of Shopify, who said he used the system overnight to run experiments on a query-expansion model. Lütke wrote that he woke up to “+19% score on a 0.8b model after 8 hours and 37 experiments,” describing improvements discovered through the automated experiment loop.

He added that watching the system iterate through training adjustments provided unexpected insight into how models improve, writing that he “learned more from that than months of following ML researchers.”

Responding to Lütke, Karpathy said the improvements discovered through the automated experimentation process were already transferring to larger models, noting that changes identified after roughly 650 experiments on a smaller 12-layer model “transfer well to depth 24,” referring to a larger model with 24 transformer layers.

Karpathy said the result suggests that training strategies identified through automated experimentation may scale beyond the smaller models used during the initial runs.

Staff Writer
The AI & Data Insider team works with a staff of in-house writers and industry experts.