SLDAgent + OpenEvolve: Can Language Models Discover Their Own Scaling Laws?
SLDAgent is an evolution-based coding agent built on OpenEvolve that automatically discovers scaling laws for large language models. On a new benchmark called SLDBench (5,000+ experiments across seven tasks spanning pre-training, fine-tuning, and Mixture-of-Experts (MoE) scaling), SLDAgent consistently discovers laws that extrapolate more accurately than human-designed ones. Crucially, the agent doesn't just curve-fit better; it often discovers more principled functional forms that fix limitations in prior human-designed scaling laws. Full details are in the paper and code.
Modern AI models live and die by their scaling laws: the simple formulas that predict how performance improves as we scale model size, data, and compute. These laws act like the "physics" of AI development — they determine whether we train a 7B or 70B model, how many tokens we budget, and which checkpoint we fine-tune.
Yet discovering these laws is still a manual, intuition-driven process. Researchers usually start by assuming a functional form — something like $L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$ — and then iterate: fit it to experimental data with some optimization method, inspect how well it matches reality, modify or replace the hypothesis, and repeat this trial-and-error cycle over and over, often for weeks. The paper asks: Can an AI agent automatically discover scaling laws that are better than the ones humans use today?
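To make that loop concrete, here is a minimal sketch of one "fit and inspect" step, fitting the Chinchilla-style form above with scipy on made-up small-scale runs. The data, coefficients, and initial guess are purely illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_form(ND, E, A, alpha, B, beta):
    """Hypothesized form: L(N, D) = E + A / N**alpha + B / D**beta."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Made-up small-scale runs (model size, tokens, observed loss); in practice
# these come from dozens of real training runs. Here the losses are generated
# from the form itself so the example is self-contained.
N = np.array([1e8, 2e8, 4e8, 1e9, 2e9, 4e9])
D = np.array([2e9, 4e9, 8e9, 2e10, 4e10, 8e10])
loss = chinchilla_form((N, D), 1.69, 406.4, 0.34, 410.7, 0.28)

# One iteration of the manual loop: fit the coefficients, inspect the fit,
# then (by hand) revise the functional form and repeat.
theta, _ = curve_fit(chinchilla_form, (N, D), loss,
                     p0=[2.0, 300.0, 0.3, 300.0, 0.3], maxfev=20_000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], np.round(theta, 3))))
```

Each pass through this loop is cheap to run but expensive to design: the hard part is choosing the next functional form to try, and that is the step SLDAgent automates.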
Why Scaling Laws are Hard (The Extrapolation Problem)
Scaling laws answer critical resource allocation questions:
- If I double model size at fixed compute, what happens to loss?
- How much extra data justifies the cost of a larger model?
Classically, laws like Chinchilla model loss as a power law in parameters ($N$) and dataset size ($D$). While successful, extending these laws to complex scenarios (like MoE routing or domain mixing) is difficult for two reasons:
- Huge Symbolic Search Space: We aren't just fitting coefficients; we are searching for the functional form itself (logarithms, exponents, offsets, interaction terms).
- The Extrapolation Trap: A law might fit small-scale experiments perfectly but fail badly when predicting the performance of a massive run.
Human experts rely on trial-and-error to navigate this space. SLDAgent automates it.
SLDBench: A Sandbox for Scientific Discovery
To rigorously evaluate automated discovery, we built SLDBench, a dataset comprising 5,000+ experiments collected from recent literature.
Each task in SLDBench provides the agent with:
- Features: Model configurations (size, layers, experts), training hyperparameters (LR, batch size), and data metrics.
- Targets: Training loss or validation error.
- The Objective: Write a Python program (law.py) that defines a function whose predictions on a held-out extrapolation set achieve high $R^2$ and low error.
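For intuition, a stripped-down version of that objective on a toy single-feature task might look like the sketch below; the exact interface and evaluator SLDBench uses may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

# A toy candidate law for a single-feature task: loss vs. dataset size D.
def law(D, a, b, c):
    return c + a / D**b

# Synthetic runs; the largest-scale points are held out for extrapolation.
D = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6])
loss = 1.2 + 50.0 / D**0.4
train, test = slice(0, 5), slice(5, None)

# Fit only on the small-scale runs.
theta, _ = curve_fit(law, D[train], loss[train], p0=[10.0, 0.5, 1.0])
pred = law(D[test], *theta)

# Score on the held-out extrapolation split: R^2 and mean absolute error.
ss_res = np.sum((loss[test] - pred) ** 2)
ss_tot = np.sum((loss[test] - loss[test].mean()) ** 2)
print("R^2:", 1 - ss_res / ss_tot, " MAE:", np.abs(loss[test] - pred).mean())
```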
The benchmark covers seven distinct scaling tasks:
| Task | Description |
|---|---|
| Parallel | Scaling with parallelism ($P$) and model size ($N$) |
| Vocab_size | Scaling with model parameters, vocabulary size ($V$), and dataset size ($D$) |
| SFT | Supervised fine-tuning loss vs. dataset size |
| Domain_mix | Loss variations based on domain mixture proportions |
| MoE | Scaling with parameters ($N$) and number of experts ($E$) |
| D_constrain | Scaling with model size ($N$), dataset size ($D$), and unique tokens ($U$) in data-constrained settings |
| LR_and_BSZ | Joint laws over learning rate, batch size, model size, and data |
The Engine: OpenEvolve and SLDAgent
OpenEvolve is an evolutionary coding framework that improves programs through iterative mutation and evaluation. SLDAgent specializes this framework for scientific discovery by treating scaling laws as a program search problem.
Instead of searching for numbers, SLDAgent searches for code. Each candidate solution is a pair of Python subroutines:
- Expression(x, θ): Defines the symbolic law $f_\theta(x) \to \hat{y}$.
- Optimization(X, y): Defines how to fit $\theta$ to the observed data.
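In code, one candidate might look like the following hypothetical pair; the actual subroutine signatures and modeling choices inside SLDAgent may differ:

```python
import numpy as np
from scipy.optimize import minimize

def expression(x, theta):
    """Expression(x, θ): the symbolic law f_θ(x) -> ŷ.
    Here x has columns (N, D) and the form is a Chinchilla-style sum,
    a hypothetical starting point the agent is free to mutate."""
    E, A, alpha, B, beta = theta
    N, D = x[:, 0], x[:, 1]
    return E + A / N**alpha + B / D**beta

def optimization(X, y):
    """Optimization(X, y): how θ is fit to the observed data.
    Here: a Huber loss minimized with L-BFGS from a fixed initialization;
    the objective, optimizer, and initialization are all mutable code."""
    def objective(theta):
        r = expression(X, theta) - y
        delta = 0.1
        return np.mean(np.where(np.abs(r) <= delta,
                                0.5 * r**2,
                                delta * (np.abs(r) - 0.5 * delta)))
    res = minimize(objective, x0=np.array([2.0, 100.0, 0.3, 100.0, 0.3]),
                   method="L-BFGS-B")
    return res.x
```

Because both halves are ordinary code, the LLM can evolve the functional form, the fitting objective, and the optimizer together rather than tuning coefficients alone.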
Figure 1: The SLDAgent System. Left: Each candidate program consists of two subroutines: an Expression that defines the symbolic model $f_\theta(x)$, and an Optimization routine that fits its parameters $\theta$ to data. Right: The evolutionary loop. An LLM mutates a parent program sampled from a database. The resulting child is evaluated and inserted back into the database, continually improving the population. (Adapted from Figure-2 in the paper)
The agent never sees the test set during evolution; it must find laws that are robust enough to generalize naturally.
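In outline, the evolutionary loop can be sketched as below; this is a simplified stand-in, not OpenEvolve's real API, and the selection and database logic in the actual system are richer.

```python
import random

def evolve(seed_program, llm_mutate, evaluate, iterations=1000):
    """Simplified OpenEvolve-style loop (illustrative, not the real API).

    seed_program: initial program text (an Expression + Optimization pair).
    llm_mutate:   callable(source) -> new source, produced by prompting an LLM.
    evaluate:     callable(source) -> fitness, e.g. R^2 on a validation split
                  (never on the held-out extrapolation set).
    """
    database = [(evaluate(seed_program), seed_program)]
    for _ in range(iterations):
        # Tournament selection: sample a few programs, keep the fittest parent.
        parent = max(random.sample(database, k=min(3, len(database))))[1]
        child = llm_mutate(parent)        # the LLM rewrites the program
        try:
            fitness = evaluate(child)     # run its Optimization, score its Expression
        except Exception:
            continue                      # discard programs that crash
        database.append((fitness, child))
    return max(database)[1]               # best program found so far
```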
Results: Achieving Superhuman Extrapolation
We benchmarked SLDAgent against general coding agents (OpenHands, Aider), provider-specific CLIs (Claude Code, Gemini-CLI), and the human baseline (the specific laws proposed in the original research papers).
Key Findings:
- Beating the Baseline: On 6 out of 7 tasks, SLDAgent outperforms the human-designed laws even when run with standard base LLMs.
- Superhuman Performance: When paired with a stronger LLM, SLDAgent achieves superior extrapolation accuracy on all seven tasks.
- Specialization Wins: Evolutionary specialization proved more effective than simply using a "smarter" raw LLM.
Figure 2a: Performance on SLDBench for agents using o4-mini. Scores are $R^2$ averaged over three runs. The best and second-best scores for each task are highlighted. "NA" indicates no valid output. (Adapted from Table-2 in the paper)
Figure 2b: SLDBench performance of provider-specific agents together with reference rows for SLDAgent paired with the corresponding provider models. Grok-4 was not evaluated due to cost. Scores report the coefficient of determination ($R^2$), averaged over three runs. Bold denotes the best value within each model family. (Adapted from Table-3 in the paper)
What Did SLDAgent Discover? (The Physics of AI)
The most exciting results are qualitative. SLDAgent discovered mathematical forms that are more conceptually sound than the formulas proposed by human researchers.
In the SFT task, the goal is to predict loss $L$ based on fine-tuning samples $D$.
Human Proposal: Treated pre-training as an external additive term:
$$L = \theta_2 + \frac{\theta_0}{D^{\theta_1} + \theta_3}$$
SLDAgent Discovery: Unified pre-training and SFT into a single data budget:
$$L = \theta_2 + \frac{\theta_0}{(D + \theta_3)^{\theta_1}}$$
Why it matters: SLDAgent realized that pre-training contributes an effective data offset ($\theta_3$). By placing $\theta_3$ inside the power law, it correctly models the diminishing returns of data across both training phases.
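A quick way to see the difference (our own derivation from the two forms above, not taken from the paper) is to compare the marginal return of one additional fine-tuning sample. For the human-proposed form,
$$\frac{\partial L}{\partial D} = -\frac{\theta_0 \theta_1 D^{\theta_1 - 1}}{\left(D^{\theta_1} + \theta_3\right)^2}$$
which behaves badly as $D \to 0$ (diverging for $\theta_1 < 1$, vanishing for $\theta_1 > 1$). For the discovered form,
$$\frac{\partial L}{\partial D} = -\frac{\theta_0 \theta_1}{\left(D + \theta_3\right)^{\theta_1 + 1}}$$
which stays finite at $D = 0$ and decays smoothly: the first fine-tuning sample acts like the $(\theta_3 + 1)$-th sample of a single combined data budget.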
Practical Applications
These better laws have immediate utility for researchers and engineers.
1. Analytic Hyperparameter Tuning
Instead of running expensive sweeps for learning rate ($lr$) and batch size ($bsz$), SLDAgent finds a joint law $L = f(N, D, lr, bsz)$. Because this form is analytic, we can compute derivatives $\frac{\partial L}{\partial lr}$ and solve for the optimal hyperparameters mathematically.
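As a sketch of the idea, here is the workflow with sympy and a made-up joint law; the form below is illustrative only and is not the law SLDAgent actually discovered:

```python
import sympy as sp

# Symbols: model size N, training tokens D, learning rate lr, batch size bsz.
N, D, lr, bsz = sp.symbols("N D lr bsz", positive=True)

# Illustrative joint law (NOT the form SLDAgent found): a power law in N and D
# plus convex penalties in log(lr) and log(bsz) whose minima drift with scale.
L = (1.7 + 400 / N**sp.Rational(1, 3) + 300 / D**sp.Rational(1, 4)
     + sp.Rational(3, 100) * (sp.log(lr) - sp.log(D) / 2 + sp.log(N) / 2) ** 2
     + sp.Rational(2, 100) * (sp.log(bsz) - sp.log(D) / 2) ** 2)

# Because the law is analytic, the optimal hyperparameters follow from the
# first-order conditions, with no sweep required.
lr_star = sp.solve(sp.diff(L, lr), lr)[0]
bsz_star = sp.solve(sp.diff(L, bsz), bsz)[0]
print(sp.simplify(lr_star))   # -> sqrt(D)/sqrt(N) for this toy form
print(sp.simplify(bsz_star))  # -> sqrt(D)
```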
On a 1B parameter model trained on 100B tokens (far outside the training set), the analytically derived hyperparameters achieved a validation loss within 0.067% relative error of the ground-truth optimum.
Figure 3: Ground-truth validation-loss heatmap for a 1B-parameter LLM trained on 100B tokens. (Adapted from Figure-3 in the paper)
2. Efficient Model Selection
We used SLDAgent to solve a common problem: Which pre-trained model should I fine-tune?
By fine-tuning candidate models on just 6.25% of the data and extrapolating with the SLDAgent SFT law, we achieved 100% Relative Accuracy in ranking the top models. This allows practitioners to select the best base model without wasting compute on full fine-tuning runs.
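A sketch of that workflow is below, with synthetic loss curves and a hypothetical full-data budget; only the functional form comes from the discovered SFT law above:

```python
import numpy as np
from scipy.optimize import curve_fit

def sft_law(D, t0, t1, t2, t3):
    """Discovered SFT form: L = t2 + t0 / (D + t3)**t1."""
    return t2 + t0 / (D + t3) ** t1

# Hypothetical scenario: two candidate base models, each fine-tuned only on
# small subsets of the data. Losses are synthetic, generated from the law
# itself so the example is self-contained.
D_small = np.array([500.0, 1_000.0, 2_000.0, 4_000.0, 8_000.0])
true_params = {"model_a": (6.0, 0.35, 1.40, 900.0),
               "model_b": (9.0, 0.30, 1.20, 400.0)}
observed = {name: sft_law(D_small, *p) for name, p in true_params.items()}

# Fit the law per candidate on the cheap small-data runs, then extrapolate
# to the full fine-tuning budget and rank.
D_full = 128_000.0
predicted = {}
for name, losses in observed.items():
    theta, _ = curve_fit(sft_law, D_small, losses,
                         p0=[5.0, 0.3, 1.0, 500.0], maxfev=50_000)
    predicted[name] = float(sft_law(D_full, *theta))

# Rank candidates by predicted loss at the full budget. With these synthetic
# curves, model_a looks better at small scale but model_b wins at full scale,
# which is exactly why extrapolation (not small-scale loss) should drive the choice.
print(sorted(predicted, key=predicted.get))
```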
Toward Agentic Science
SLDAgent + OpenEvolve offers an existence proof for a new mode of research. We are moving from AI that applies formulas to AI that discovers them.
This is a step toward agentic scientific discovery, where AI systems propose, test, and refine hypotheses, returning publishable insights to the community. We are excited to extend this to multimodal scaling, alignment physics, and variable discovery—asking the agent not just "what is the law," but "what should we measure?"
Guest Authors: This blog post was written in collaboration with Haowei Lin and Haotian Ye.