How AMD Used OpenEvolve to Optimize GPU Kernels

AMD-AGI Group adapted OpenEvolve to automatically optimize GPU kernels written in Triton for AMD Instinct™ GPUs. Our system, GEAK-OpenEvolve, extends OpenEvolve's population-based evolutionary framework with Quality-Diversity search, hardware-specific prompt engineering, and a cascade evaluation pipeline. Averaged over the kernels it successfully optimized, it achieves a 3.42x speedup over reference kernels on the TritonBench-modified benchmark and a 7.02x speedup on our ROCm-bench suite.
Code: GEAK-OpenEvolve on GitHub | Full blog: GEAK-Triton v2 Family

The Problem: GPU Kernel Optimization Is Hard

Writing high-performance GPU kernels is one of the most labor-intensive tasks in systems programming. A single kernel may need dozens of tuning passes across block sizes, warp counts, memory access patterns, and pipeline stages, all of which interact in non-obvious ways with the target hardware. For AMD Instinct GPUs, the optimization surface is especially rich: CU topology, LDS limits, wavefront sizing, and HBM3/HBM3E memory hierarchies all matter.

Traditional LLM-based optimization agents tend to converge on a single solution path and get stuck. We needed something that could maintain diversity across the search space while still exploiting the best solutions found so far.

That's exactly what OpenEvolve provides.

Why OpenEvolve?

OpenEvolve offered us three things that were critical for our use case:

Population-based evolution: Instead of a single agent refining one kernel, OpenEvolve maintains a population of diverse candidates. This is essential for GPU kernel optimization, where the performance landscape is rugged. A kernel that is 2x faster on one input shape may be 3x slower on another.
Island-based exploration: OpenEvolve's island model lets us balance exploration and exploitation. We tuned this to keep diverse optimization strategies alive in the population rather than prematurely converging.
A clean, extensible architecture: OpenEvolve's modular design made it straightforward to plug in our domain-specific components: a GPU-aware evaluator, hardware-targeted prompts, and a custom fitness function based on kernel execution time and correctness.
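To make the island idea concrete, here is a toy sketch in plain Python. It is not OpenEvolve's implementation: the function name `evolve_islands`, the ring migration policy, and every parameter value are assumptions chosen for illustration, and plain floats stand in for kernels.

```python
import random


def evolve_islands(fitness, mutate, num_islands=3, pop_size=4,
                   generations=20, migration_interval=5, seed=0):
    """Toy island model: islands evolve separately and migrate periodically."""
    rng = random.Random(seed)
    islands = [[rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
               for _ in range(num_islands)]

    def replace_worst(pop, newcomer):
        # Elitist replacement: the newcomer only displaces a worse member.
        worst = min(range(len(pop)), key=lambda i: fitness(pop[i]))
        if fitness(newcomer) > fitness(pop[worst]):
            pop[worst] = newcomer

    for gen in range(1, generations + 1):
        for pop in islands:
            # Exploitation inside an island: binary tournament, then mutate.
            parent = max(rng.sample(pop, 2), key=fitness)
            replace_worst(pop, mutate(parent, rng))
        if gen % migration_interval == 0:
            # Ring migration: island i receives the best of island i - 1.
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                replace_worst(pop, bests[i - 1])
    return islands


def toy_fitness(x):
    return -(x - 3.0) ** 2  # hypothetical objective with its peak at x = 3


def toy_mutate(x, rng):
    return x + rng.gauss(0.0, 1.0)


initial = evolve_islands(toy_fitness, toy_mutate, generations=0)
final = evolve_islands(toy_fitness, toy_mutate, generations=20)
```

A longer `migration_interval` keeps islands isolated for more generations, which favors exploration; a shorter one spreads good candidates faster, which favors exploitation.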

What We Built: GEAK-OpenEvolve

GEAK-OpenEvolve adapts OpenEvolve to the Triton-to-Triton kernel optimization task: given an existing Triton kernel, evolve it into a faster version that remains functionally correct.

Here's how our system works:

Quality-Diversity Search with MAP-Elites

We use a MAP-Elites-based Quality-Diversity (QD) approach on top of OpenEvolve's evolutionary engine. Rather than simply ranking candidates by speed, we maintain a structured "map" of kernel variants across 9 feature dimensions:

Fusion intelligence
Autotuning coverage
Memory access efficiency
Algorithmic complexity
Warp/wavefront utilization
Software pipelining
Numerical stability
Correctness and portability
Optimization scope

An LLM-based evaluator scores each candidate on these dimensions (0–1 scale), and MAP-Elites places them into a feature grid. This preserves diverse optimization strategies. A kernel excelling at memory coalescing coexists with one excelling at compute fusion, enabling the evolutionary process to combine strengths from different niches.
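The archive logic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GEAK-OpenEvolve code: the grid resolution (`BINS_PER_DIM`), the class and function names, and the use of speedup as the per-cell fitness are all hypothetical. Only the nine feature names and the 0–1 score range come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# The nine feature dimensions named in the post.
FEATURES = (
    "fusion_intelligence",
    "autotuning_coverage",
    "memory_access_efficiency",
    "algorithmic_complexity",
    "wavefront_utilization",
    "software_pipelining",
    "numerical_stability",
    "correctness_portability",
    "optimization_scope",
)

BINS_PER_DIM = 3  # assumption: the real grid resolution is not stated


@dataclass
class Candidate:
    name: str
    speedup: float            # fitness used inside a cell (assumed)
    scores: Dict[str, float]  # one 0-1 score per feature


def cell_of(scores: Dict[str, float]) -> Tuple[int, ...]:
    """Map 0-1 feature scores to a tuple of bin indices (the grid cell)."""
    cell = []
    for feature in FEATURES:
        s = min(max(scores[feature], 0.0), 1.0)
        cell.append(min(int(s * BINS_PER_DIM), BINS_PER_DIM - 1))
    return tuple(cell)


class MapElitesArchive:
    def __init__(self) -> None:
        self.grid: Dict[Tuple[int, ...], Candidate] = {}

    def add(self, cand: Candidate) -> bool:
        """Keep cand if its cell is empty or it beats the current elite."""
        key = cell_of(cand.scores)
        elite = self.grid.get(key)
        if elite is None or cand.speedup > elite.speedup:
            self.grid[key] = cand
            return True
        return False


def make(name: str, speedup: float, **overrides: float) -> Candidate:
    scores = {feature: 0.5 for feature in FEATURES}
    scores.update(overrides)
    return Candidate(name, speedup, scores)


archive = MapElitesArchive()
archive.add(make("coalesced", 2.0, memory_access_efficiency=0.9))
archive.add(make("fused", 1.5, fusion_intelligence=0.9))
archive.add(make("coalesced_v2", 2.4, memory_access_efficiency=0.95))
```

In this made-up run, `fused` survives even though it is slower than both coalesced variants, because it occupies a different cell; `coalesced_v2` replaces `coalesced` because the two share a cell.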

Hardware-Aware Prompt Engineering

We enriched OpenEvolve's prompt templates with AMD GPU-specific optimization cues:

Memory coalescing patterns and LDS usage strategies for the MI300X/MI325X GPUs
Wavefront occupancy and register pressure guidelines
Block size, warp size, and launch configuration constraints from the AMD workload tuning guide
Kernel-agnostic guidance on autotuning and algorithmic refinements

This grounds the LLM's mutations in real hardware constraints rather than generic optimization heuristics.
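As a rough sketch of what injecting such cues might look like, the snippet below assembles a prompt from a list of hints. The function `build_prompt`, the hint wording, and the placeholder kernel are hypothetical; the actual GEAK-OpenEvolve templates are not reproduced here.

```python
# Hint categories follow the list above; the wording of each hint is a
# placeholder, not text from the real prompt templates.
HARDWARE_HINTS = [
    "Prefer coalesced global memory access and consider LDS for data reuse.",
    "Keep register pressure low enough to sustain wavefront occupancy.",
    "Respect the block-size and launch-configuration limits of the target GPU.",
    "Consider autotuning and algorithmic refinements.",
]


def build_prompt(kernel_source: str, target_gpu: str, hints: list) -> str:
    """Combine hardware guidance and the kernel into a single prompt string."""
    hint_lines = "\n".join(f"- {hint}" for hint in hints)
    return (
        f"You are optimizing a Triton kernel for {target_gpu}.\n"
        f"Hardware guidance:\n{hint_lines}\n\n"
        "Rewrite the kernel below so that it runs faster and still "
        "produces the same results.\n\n"
        f"{kernel_source}\n"
    )


# Placeholder kernel text; the body is intentionally elided.
example_kernel = (
    "@triton.jit\n"
    "def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):\n"
    "    ..."
)
prompt = build_prompt(example_kernel, "AMD Instinct MI300X", HARDWARE_HINTS)
```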

Cascade Evaluation Pipeline

Evaluating GPU kernels is expensive since each candidate requires compilation and execution on actual hardware. We built a multi-stage cascade filter:

Small inputs: Quick correctness check, filters out ~60% of broken candidates
Medium inputs: Functional validation at realistic sizes
Full-scale inputs: Performance benchmarking on production-sized tensors

Multiple offspring run concurrently on separate GPUs, dramatically increasing iteration throughput. Only candidates that survive all stages enter the MAP-Elites archive.
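The early-exit structure of such a cascade can be sketched as follows. This is a CPU-only illustration in plain Python, with list-processing functions standing in for compiled kernels; the stage sizes, tolerance, and function names are assumptions, and the real pipeline compiles and runs candidates on GPUs.

```python
import time
from typing import Callable, List, Optional

Kernel = Callable[[List[float]], List[float]]

SMALL, MEDIUM, FULL = 16, 1024, 65536  # hypothetical stage sizes


def matches(candidate: Kernel, reference: Kernel, n: int,
            tol: float = 1e-6) -> bool:
    """Run both kernels on n inputs and compare outputs elementwise."""
    xs = [float(i) for i in range(n)]
    try:
        got = candidate(xs)
    except Exception:
        return False  # a crash counts as a failed stage
    want = reference(xs)
    return len(got) == len(want) and all(
        abs(g - w) <= tol for g, w in zip(got, want))


def cascade_evaluate(candidate: Kernel, reference: Kernel) -> Optional[float]:
    """Return the full-size runtime in seconds, or None if a stage fails."""
    # Stage 1: cheap correctness check on small inputs.
    if not matches(candidate, reference, SMALL):
        return None
    # Stage 2: functional validation at a more realistic size.
    if not matches(candidate, reference, MEDIUM):
        return None
    # Stage 3: time the candidate on full-scale inputs.
    xs = [float(i) for i in range(FULL)]
    start = time.perf_counter()
    candidate(xs)
    return time.perf_counter() - start


def reference(xs: List[float]) -> List[float]:
    return [2.0 * x for x in xs]


def good(xs: List[float]) -> List[float]:
    return [x + x for x in xs]


seen_sizes: List[int] = []


def broken(xs: List[float]) -> List[float]:
    seen_sizes.append(len(xs))
    return [2.0 * x for x in xs[:-1]]  # bug: drops the last element


good_time = cascade_evaluate(good, reference)
broken_time = cascade_evaluate(broken, reference)
```

The broken candidate is rejected at the first stage and never sees the medium or full-scale inputs, which is where the cost savings come from.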

Results

We evaluated GEAK-OpenEvolve on two benchmarks from GEAK-eval: the TritonBench-modified suite (184 Triton kernels) and ROCm-bench (31 Triton kernels). All experiments used Claude 4 Sonnet on AMD Instinct MI300X.

Benchmark	Success Rate	Avg. Speedup
TritonBench-modified	56.01%	3.42x
ROCm-bench	56.67%	7.02x

"Success rate" is the percentage of kernels where the evolved variant achieved a speedup >1x over the original. The average speedup is computed only over successfully optimized kernels.
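The two metrics, as defined above, can be computed like this. The sample speedups are made up for illustration, and the convention that `None` marks a kernel with no valid evolved variant is an assumption.

```python
from typing import List, Optional, Tuple


def summarize(speedups: List[Optional[float]]) -> Tuple[float, float]:
    """Return (success rate in percent, average speedup over successes)."""
    wins = [s for s in speedups if s is not None and s > 1.0]
    success_rate = 100.0 * len(wins) / len(speedups)
    avg_speedup = sum(wins) / len(wins) if wins else float("nan")
    return success_rate, avg_speedup


# Made-up per-kernel results; None means no valid evolved kernel.
results = [2.0, 0.9, None, 4.0, 1.0]
rate, avg = summarize(results)
```

Because the average excludes kernels that were not improved, it describes the typical gain when optimization succeeds rather than the expected gain across the whole benchmark.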

Case Study: RMS Norm, 6.58x Speedup

The RMS LayerNorm kernel is a memory-bandwidth-bound operation central to LLaMA-style architectures. The reference implementation had a fundamental limitation: it assumed all input data fit within a single thread block, causing crashes for large hidden dimensions (N > 4096).

Through evolutionary search, GEAK-OpenEvolve discovered three key optimizations:

Triton autotuning: Replaced the naive block-size calculation with 22 autotuning configurations spanning block sizes 32–8192, warp counts 1–16, and pipeline stages 1–4
Adaptive single-pass/two-pass algorithm: For small sequences, the evolved kernel fuses variance computation and normalization into a single pass, saving ~50% memory bandwidth. For larger sequences, it gracefully falls back to a blocked two-pass approach.
Numerical optimizations: Replaced 1 / tl.sqrt(var + eps) with the dedicated tl.math.rsqrt GPU instruction wrapped in tl.maximum() for denormal safety

The most impressive aspect: the adaptive branching strategy was discovered by the evolutionary process, not hand-engineered. This is the kind of non-obvious optimization that population-based search excels at finding.
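The adaptive single-pass/two-pass idea and the clamped reciprocal square root can be illustrated in plain Python. This is a reference-style sketch, not the evolved Triton kernel: the switching threshold `max_block`, the clamp floor, and the exact placement of the clamp are assumptions, since the post does not state them.

```python
import math
from typing import List


def rsqrt_safe(mean_sq: float, eps: float, floor: float = 1e-30) -> float:
    # Clamp the argument before taking the reciprocal square root.
    # Where the evolved kernel applies tl.maximum() is not stated,
    # so this placement is an assumption.
    return 1.0 / math.sqrt(max(mean_sq + eps, floor))


def rms_norm_single_pass(x: List[float], w: List[float],
                         eps: float) -> List[float]:
    # The row fits in one block: reduce and scale while the data is resident.
    r = rsqrt_safe(sum(v * v for v in x) / len(x), eps)
    return [v * r * wi for v, wi in zip(x, w)]


def rms_norm_two_pass(x: List[float], w: List[float], block: int,
                      eps: float) -> List[float]:
    n = len(x)
    # Pass 1: accumulate the sum of squares one block at a time.
    total = 0.0
    for start in range(0, n, block):
        total += sum(v * v for v in x[start:start + block])
    r = rsqrt_safe(total / n, eps)
    # Pass 2: re-read each block and write the normalized values.
    out: List[float] = []
    for start in range(0, n, block):
        xs, ws = x[start:start + block], w[start:start + block]
        out.extend(v * r * wi for v, wi in zip(xs, ws))
    return out


def rms_norm(x: List[float], w: List[float], max_block: int = 4096,
             eps: float = 1e-6) -> List[float]:
    # Adaptive branch: single pass when the row fits in one block.
    if len(x) <= max_block:
        return rms_norm_single_pass(x, w, eps)
    return rms_norm_two_pass(x, w, max_block, eps)


x = [float(i % 7) - 3.0 for i in range(10)]
w = [1.0] * 10
single = rms_norm(x, w, max_block=16)  # takes the single-pass branch
blocked = rms_norm(x, w, max_block=4)  # takes the two-pass branch
```

Both branches compute the same result; the two-pass version reads the input twice, which is the bandwidth cost the single-pass path avoids when the row fits in one block.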

Beyond Benchmarks: Production-Grade Kernels

The impact of GEAK-OpenEvolve extends beyond benchmark suites. The evolutionary search has helped us produce production-grade kernels that are being integrated into real workloads on AMD Instinct GPUs. By evolving kernels that are not only fast but also numerically stable, portable across input shapes, and robust under diverse operating conditions, GEAK-OpenEvolve has proven to be a practical tool for shipping optimized code, not just a research prototype for leaderboard numbers.

What OpenEvolve Got Right for This Domain

Having built and shipped GEAK-OpenEvolve, here's our perspective on what makes OpenEvolve well-suited for GPU kernel optimization:

Diversity preservation matters: GPU optimization has many local optima. OpenEvolve's population-based approach avoids the single-path convergence trap that plagues standard LLM agents. Our MAP-Elites extension amplifies this further.
The evolutionary loop is a natural fit: Kernel optimization is inherently iterative (profile, hypothesize, mutate, test), and OpenEvolve's generate-evaluate-select cycle maps directly onto this workflow.
Extensibility enabled rapid prototyping: We went from "let's try OpenEvolve for kernels" to a working system in weeks, not months. The clean separation between evolution strategy, prompt construction, and evaluation made it easy to inject our domain-specific components.
LLM-as-mutator works: Using an LLM to propose kernel mutations (rather than random code perturbations) means the search operates in a semantically meaningful space. The LLM "understands" what a block size change or a loop tiling transformation does, making each evolutionary step far more productive than blind search.

Notes

GEAK-OpenEvolve is part of the broader GEAK family of AI agents for GPU kernel development at AMD, which includes GEAK-OptimAgentv2 (instruction-to-Triton generation with hardware-aware profiler feedback) and GEAK-HIP (native HIP code optimization).

GEAK-OpenEvolve is fully open source:

GEAK-OpenEvolve (branch: geak-openevolve)
Evaluation suite
OpenEvolve

If you're working on GPU kernel optimization, OpenEvolve's evolutionary framework provides a strong foundation to build on.

System Configuration

All performance benchmarks were conducted using the following hardware and software configuration:

Component	Specification
GPU	AMD Instinct™ MI300X (192GB HBM3)
	AMD Instinct™ MI325X (256GB HBM3E)
ROCm	6.4.3
Host OS	Ubuntu 24.04.3 LTS
Python	3.12+
PyTorch	2.4+ (ROCm)
Triton	3.3.0

Guest Author: This blog post was written in collaboration with Umang Pandey.