OpenEvolve for AI-Driven Research for Systems (ADRS)
UC Berkeley's Sky Computing Lab used OpenEvolve as an open-source engine for AI-Driven Research for Systems (ADRS) to automatically discover and refine systems algorithms across multiple domains (for instance, a MoE expert-parallelism load balancer, multi-region spot scheduling, LLM-SQL preprocessing, and transaction scheduling, among others). Reported results include up to 5× runtime speedups and double-digit percentage-point cost reductions, often found within hours and on sub-$20 evaluation budgets. The Berkeley team has documented its experience with OpenEvolve, and the results, in a paper and accompanying blog post.
Algorithms improved with OpenEvolve (ADRS case studies)
OpenEvolve started as an attempt to replicate Google DeepMind's AlphaEvolve system and has since become the leading open-source evolutionary agent for algorithmic discovery and optimization. It is a natural fit for ADRS because problems in systems research often come with robust, well-defined evaluators and metrics.
Figure 1: Generic OpenEvolve process with inner and orchestrator loops
The paper describes several case studies in which OpenEvolve discovered state-of-the-art (SOTA) algorithms. Let's look at one case study in detail and then summarize all of the algorithms:
Case Study: Expert-Parallelism Load Balancer (MoE inference)
Problem
Balance computational load across GPUs during Mixture-of-Experts (MoE) inference via a three-stage Expert Parallelism Load Balancing (EPLB) procedure: (i) distribute expert groups across nodes, (ii) replicate "hot" experts, and (iii) assign replicas to GPUs to minimize imbalance for a given workload/model/GPU set. Objectives are twofold: maximize balancedness (the ratio of average to maximum tokens per GPU, where 1.0 means perfectly balanced) and minimize the runtime of the rebalancing algorithm so it can react quickly to shifting loads.
Evaluator & Traces
A compact PyTorch simulator models distributed MoE inference; evaluation traces reflect load shifts from ShareGPT and GSM8K. The evaluator reports (a) balancedness and (b) rebalancing runtime. Because OpenEvolve requires a single scalar score, the study uses an equal-weight average of these two metrics.
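To make that scalar concrete, here is a rough sketch of how the two metrics could be folded into one score. The 540 ms runtime normalization and the exact speed term are assumptions for illustration, not necessarily the study's precise formula.

```python
import torch

def balancedness(tokens_per_gpu: torch.Tensor) -> float:
    """Balancedness = average / maximum tokens per GPU (1.0 = perfectly balanced)."""
    per_gpu = tokens_per_gpu.float()
    return (per_gpu.mean() / per_gpu.max()).item()

def combined_score(tokens_per_gpu: torch.Tensor, runtime_ms: float,
                   runtime_ref_ms: float = 540.0) -> float:
    """Equal-weight average of the two objectives, as the study describes.
    Runtime is normalized against a reference budget so that faster is better;
    the reference value here is an illustrative assumption."""
    speed = max(0.0, 1.0 - runtime_ms / runtime_ref_ms)  # 1.0 = instant, 0.0 = at/over reference
    return 0.5 * balancedness(tokens_per_gpu) + 0.5 * speed
```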
Baselines
- Public greedy baseline (DeepSeek-style): sort experts by load and assign each to the least-loaded GPU with remaining capacity. Due to Python loops and linear search, this baseline takes ~540 ms per rebalancing, with balancedness ≈ 0.66.
- Frontier (internal) tensorized reference: replaces explicit bin-packing with tensor reshapes/transposes, yielding a "zigzag/snake" assignment that interleaves high-load and low-load experts across GPUs. This removes the Python loops and cuts runtime to ~19.6 ms while matching the greedy baseline's balancedness.
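To make the "zigzag/snake" idea concrete, here is a minimal PyTorch sketch of one-replica-per-expert snake placement: experts are sorted by load and dealt to GPUs in alternating left-to-right and right-to-left rounds, so heavy and light experts interleave. This is illustrative only; it is neither the internal reference nor the program OpenEvolve evolved.

```python
import torch

def zigzag_assign(expert_loads: torch.Tensor, num_gpus: int) -> torch.Tensor:
    """Map each expert to a GPU in zigzag/snake order so heavily and lightly
    loaded experts are interleaved. Assumes len(expert_loads) is a multiple of
    num_gpus and one replica per expert; returns expert index -> GPU id."""
    order = torch.argsort(expert_loads, descending=True)  # heaviest experts first
    rounds = order.view(-1, num_gpus)                      # one row per assignment round
    rounds[1::2] = rounds[1::2].flip(dims=[1])             # reverse every other round (the "snake")
    gpu_ids = torch.arange(num_gpus).repeat(rounds.shape[0])
    assignment = torch.empty_like(order)
    assignment[rounds.reshape(-1)] = gpu_ids               # expert -> GPU mapping
    return assignment

# Example: 8 experts on 4 GPUs -> heavy and light experts pair up on each GPU.
loads = torch.tensor([90., 80., 70., 60., 50., 40., 30., 20.])
print(zigzag_assign(loads, num_gpus=4))  # tensor([0, 1, 2, 3, 3, 2, 1, 0])
```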
Results
OpenEvolve independently rediscovers the tensorized zigzag placement and adds refinements of its own (ordering logic, adaptive reshaping). It matches the balancedness of both baselines while reducing rebalancing runtime to ~3.7 ms, approximately 5× faster than the 19.6 ms internal reference and ~145× faster than the public greedy baseline. Notably, because the internal reference implementation is not public, the authors argue memorization is unlikely: OpenEvolve arrived at a semantically similar design independently.
Evolution Trajectory & Insights
- Major gains came from replacing Python loops with batched tensor operations and then discovering/reusing the zigzag layout beyond the initial placement stage.
- Replication heuristics were unstable across attempts; the winning strategy was to replicate only overloaded experts (sketched after this list).
- Overall improvements stem from global reorganization with tensor ops plus the zigzag layout, not from aggressive replication.
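The replication insight can be sketched as follows; the proportional-to-overload split and the names here are illustrative choices, not the evolved program's exact logic.

```python
import torch

def choose_replica_counts(expert_loads: torch.Tensor, num_slots: int) -> torch.Tensor:
    """Give extra replicas only to experts whose load exceeds the mean, roughly
    in proportion to their overload, until the replica budget (num_slots minus
    one base copy per expert) is spent."""
    replicas = torch.ones(expert_loads.shape[0], dtype=torch.long)  # one base copy per expert
    spare = num_slots - replicas.sum().item()                       # remaining replica budget
    overload = (expert_loads.float() - expert_loads.float().mean()).clamp(min=0)
    if spare > 0 and overload.sum() > 0:
        extra = torch.floor(spare * overload / overload.sum()).long()
        replicas += extra   # flooring may leave a few slots unused; fine for a sketch
    return replicas
```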
| # | Task | Source | Objective & Evaluator | Result vs. Baseline/SOTA | Iterations · Time | Budget |
|---|---|---|---|---|---|---|
| 1 | Expert-Parallelism Load Balancer (MoE) | In-progress work | Balance expert-parallel load across GPUs. | Same balancedness, 5× faster runtime vs. internal implementation. | 300 iters · 5h | ≤ $10 |
| 2 | Telemetry Repair | Krentsel et al. (2024, HotNets '24) | Repair buggy network telemetry counters and calibration. | +9% better counter-repair score, +30% higher confidence calibration score than published solution. | 300 iters · 8h | ≤ $10 |
| 3 | LLM-SQL Preprocessing | Liu et al. (2024b, MLSys '25) | Reorder tabular data to improve prefix hit rate. | Comparable hit rate, 3.9× faster runtime. | 100 iters · 1h | ≤ $7 |
| 4 | Transaction Scheduling | Cheng et al. (2024, VLDB '24) | Minimize makespan in transaction scheduling. | 20% better than greedy (offline). | 100 iters · <2h | ≤ $20 |
| 5 | Global Model Placement | Yu et al. (2025, arXiv) | Optimize model→GPU placement cost. | 18.5% cheaper than published solution. | 70 iters · 40m | ≤ $5 |
| 6 | Can't Be Late (Single-Region) | Wu et al. (2024, NSDI '24) | Schedule deadline-driven jobs on single-region spot instances. | Up to 16% (avg 7%) higher cost savings vs SOTA. | 400 iters · 5h | ≤ $20 |
| 7 | Can't Be Late (Multi-Region) | In-progress work | Schedule deadline-driven jobs on multi-region spot instances. | 26% lower cost vs single-region baseline. | 100 iters · 1h | ≤ $5 |
| 8 | Sparse Attention Design | Desai et al. (2025, NeurIPS '25) | Balance attention sparsity and accuracy. | 7% average error and density improvement vs SOTA. | 100 iters · 4h | ≤ $15 |
| 9 | Multi-Agent System Optimization | ICLR '24 | Improve multi-agent collaboration using MAST taxonomy and rubric feedback. | 7% improvement on ProgramDev. | 100 iters · <2h | ≤ $15 |
| 10 | Cloudcast (Multi-Cloud Transfer) | Wooders et al. (2024, NSDI '24) | Optimize multi-cloud data transfer cost. | Matches SOTA cost. | 100 iters · 1h | ≤ $5 |
| 11 | Adaptive Weight Compression | In-progress work | Assign bitrate per column to minimize bits/elem while preserving accuracy. | Similar bits/elem, 14.2% worse PPL. | 200 iters · 12h | ≤ $20 |
Table 1 (from the paper): Summary of each task's objective, its corresponding SOTA publication, performance improvements relative to SOTA and baseline solutions, and overall time/cost efficiency.
OpenEvolve's Role in ADRS: Solution + Evaluation
User provides the problem setup:
- Objectives & metrics: e.g., p99 latency, cost under SLO, throughput.
- Workloads/traces: representative inputs and hold-outs.
- Evaluator: deterministic system or simulator that scores candidates (a minimal skeleton is sketched after this list).
- Budget & stop rules: time/iteration cap, spend cap, plateau criteria.
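For reference, a minimal evaluator skeleton might look like the following. It assumes the evaluate(program_path) -> dict-of-metrics convention used in OpenEvolve's examples (verify against the project docs); the workloads, the candidate's run() entry point, and the scoring formula are placeholders for your own deterministic harness.

```python
# evaluator.py - sketch of the evaluator contract OpenEvolve calls into.
import importlib.util

def _load_candidate(program_path: str):
    """Import the candidate program that OpenEvolve wrote to program_path."""
    spec = importlib.util.spec_from_file_location("candidate", program_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def load_workloads():
    """Placeholder for representative traces; keep separate hold-outs aside."""
    return [list(range(8)), list(range(64)), list(range(512))]

def evaluate(program_path: str) -> dict:
    candidate = _load_candidate(program_path)
    # The candidate is assumed to expose run(workload) -> cost; rename to match
    # whatever entry point your seed program defines.
    costs = [candidate.run(w) for w in load_workloads()]
    mean_cost = sum(costs) / len(costs)
    return {
        "mean_cost": mean_cost,
        "combined_score": 1.0 / (1.0 + mean_cost),  # single scalar, higher is better
    }
```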
OpenEvolve automates the solution + evaluation loop, much as a human researcher would:
- Code-level search: edits policy/algorithm code (not just knobs) to produce executable candidates.
- Execution & scoring: runs each candidate through the evaluator; logs metrics and artifacts.
- Selection & refinement: retains "best" candidates, updates prompts/seeds, and iterates.
- Export & reproduce: returns the best program(s) plus a complete reproducer (config, prompts, workloads, seeds).
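Put together, the automated loop looks roughly like this greedy caricature. OpenEvolve's real search is population-based (a program database with island-style sampling, prompt construction from past attempts, and checkpointing), and the function names below are illustrative rather than the actual OpenEvolve API.

```python
def evolution_loop(seed_program: str, evaluate, propose_edit,
                   max_iterations: int = 100, patience: int = 20):
    """Simplified solution + evaluation loop: propose a code edit, score it
    with the evaluator, keep the best candidate, and stop on budget or plateau."""
    best_program = seed_program
    best_score = evaluate(seed_program)["combined_score"]
    stale = 0
    for _ in range(max_iterations):
        candidate = propose_edit(best_program)         # LLM edits algorithm code, not just knobs
        score = evaluate(candidate)["combined_score"]  # deterministic evaluator scores it
        if score > best_score:
            best_program, best_score, stale = candidate, score, 0
        else:
            stale += 1
        if stale >= patience:                          # plateau-based stop rule
            break
    return best_program, best_score
```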
Summary: User defines what to optimize and how to measure it; OpenEvolve searches over how to implement it and returns measured, reproducible improvements.
Caveats & mitigations: Results can overfit to specific traces or hardware configs. Re-score winners on new workloads/hardware and validate on held-outs; OpenEvolve supports this via config-driven re-runs, fully reproducible runs (prompts/seeds/configs/workloads), and early-stop with exportable reproducers for external verification.
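As a minimal illustration of that re-scoring step (assuming an evaluator that can be pointed at a specific trace; adapt the signature to your own harness):

```python
def validate_winner(best_program_path: str, evaluate_on, holdout_traces: dict) -> dict:
    """Re-score the evolved winner on held-out or shifted workloads before
    trusting its gains; repeat on different hardware configs where possible.
    evaluate_on(program_path, trace) is an assumed interface, not OpenEvolve's."""
    return {name: evaluate_on(best_program_path, trace)
            for name, trace in holdout_traces.items()}
```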
What this means for teams adopting OpenEvolve
Fit & prerequisites
Works best when you have objective, verifiable metrics (throughput, cost, deadlines), representative workloads with hold-outs, and a deterministic evaluator (system or simulator).
Budget & runtime expectations
Plan for hours-scale searches with evaluation spend typically in the single to low double-digit USD range; costs scale with evaluator complexity and model choice.
Human responsibilities
You still own problem framing, evaluator fidelity (correctness/constraints), anti–reward-hacking safeguards, and validation on held-out/shifted workloads for generalization.