The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction

The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction

Bijective maps are a key tool in an area of mathematics called combinatorics, which studies discrete structures like partitions, permutations, and graphs. They allow mathematicians to draw deep connections between families of objects that might seem different at first glance but ultimately encode a similar combinatorics. For example, when $X$ and $Y$ are finite sets, a bijection $f: X \rightarrow Y$ is a constructive proof that $|X| = |Y|$. This makes bijections a powerful tool for counting the size of a complicated set $Y$, using a set $X$ that we understand better. Given that code is a natural medium in which to write bijections, we conjecture that evolutionary program synthesis frameworks like OpenEvolve have the potential to lead to the discovery of new combinatorics. To explore this, we developed an OpenEvolve-based bijection search system and presented it with three challenging bijection construction problems involving Dyck paths, two with known solutions and one open problem. The system produced tantalizing results including a correct solution to one of the known (but very non-trivial) bijections. We describe some of the common failure modes we observed in our system and what we believe is required to make progress moving forward. We provide full details in the paper.

In the field of combinatorics, one of the most elegant and powerful tools is the bijection: a one-to-one and onto mapping $f:X \rightarrow Y$ between two sets of objects. You can think of $f$ as establishing a correspondence between $X$ and $Y$. The existence of $f$ implicitly proves that $|X| = |Y|$. For example, the Catalan numbers count hundreds of seemingly unrelated families of combinatorial objects such as triangulations of convex polygons or the number of certain lattice paths called Dyck paths (both illustrated in the figure below along with a sketch of the corresponding bijection). Once we can prove that one of these sets $X_0$ is counted by the Catalan numbers, $|X_0| = C_n$, we can prove that another set $X_i$ is counted by the Catalan numbers by constructing a bijective function $f_i: X_0 \rightarrow X_i$.1

Bijection between triangulations and Dyck paths

Figure 1: The bijective map from triangulations of a 5-gon to Dyck paths from (0,0) to (3,3): For each vertex i in {3, 4, 5}: (1) Draw an N step, (2) Draw an E step for each diagonal (i, j) with i < j.

Bijections can do more than prove two sets are the same size: if the set $X_0$ has some special structure like a symmetry (e.g., reflection symmetries among sets of lattice paths) or an important statistic (e.g., the length of a permutation), the bijection $f$ pushes this structure to $Y$. This often reveals new aspects of $Y$ that we may not have previously recognized. While some bijections may be obvious, others can be highly non-trivial, surfacing deep combinatorial structure and drawing connections between seemingly unrelated areas of mathematics.

Most combinatorial bijections are concrete (at least by the standards of modern mathematics). They also frequently relate sets of discrete objects that are well represented by mainstream programming languages such as permutations, integer partitions, trees, etc. The task of constructing bijections can be challenging, requiring creativity, a deep understanding and intuition for the objects of study, and in most cases, extensive trial and error. Together, these properties make the task of discovering novel bijections a great testbed for evolutionary program synthesis systems. Below we describe our experience evaluating a bijection-focused version of OpenEvolve on three different bijection discovery tasks.


What Makes Bijections Easy or Difficult?

The difficulty of constructing a bijection between two combinatorial sets varies widely depending on the problem. Here's how we categorize bijection construction problems:

Bijections that are straightforward

Sometimes bijections between combinatorial objects seem almost designed to be discovered. A classic example is the bijection between Dyck paths and sequences of matched pairs of parentheses. These types of bijections are valuable testbeds for evaluating LLM reasoning, but do not require a sophisticated evolutionary search.

Known but nontrivial bijections

These bijections require several key insights and a fair amount of creativity, but ultimately arise from standard combinatorial techniques like recursive decompositions or symmetry arguments. While an LLM alone might struggle to produce such bijections, we hypothesize that an evolutionary framework like OpenEvolve should be capable of discovering them with some design effort.

Examples from this category:

  • Bijection between odd-diagonal avoiding paths and Dyck paths
  • Bijection between symmetric Dyck paths and lattice paths staying below the diagonal

Non-constructive bijections

Finally, there exist combinatorial results where we know two sets have the same size through a counting argument, but no explicit bijection is known. These correspond to significant open problems in enumerative combinatorics, and pushing a system to discover them would be the true holy grail. As a proof of concept, we designed one case study around this type of challenge, selecting an original open problem posed by combinatorist William Dugan in 2019.


Setting up OpenEvolve for Bijection Discovery

To facilitate robust evaluations of bijection discovery as a testbed for evolutionary algorithms, we implemented custom systems tailored to the needs of evaluating bijection construction programs, with data structures and helper functions for generating paths, filtering paths by combinatorial constraints, and visualizing paths. On top of this, we built a bijection-focused OpenEvolve interface.

Integrating a flexible, customizable LLM judge was crucial for tracking nuances of the programs being generated that would have been challenging to capture in a simple scoring function, e.g. whether the program description matched its implementation, or how interpretable the code was from the perspective of a mathematician. This helped us understand failure modes in the programs, and we frequently fed these findings back to OpenEvolve as updates to the core prompt.

Here, we will discuss the successful case study, leaving descriptions of system performance on the other examples to the paper.


Case Study: Bijection between Dyck paths and odd-diagonal-avoiding lattice paths

Problem. Construct a bijective map between North-East lattice paths that avoid odd points on the diagonal $(1, 1)$, $(3, 3), \ldots$ (which we call "odd-diagonal avoiding paths") and the classic Dyck paths.

Can base models produce this bijection? Before applying OpenEvolve, we prompted a variety of LLMs to produce the bijection directly. None were successful, making it an appropriate problem for OpenEvolve.

Evaluation Procedure. We evaluated each program produced by our version of OpenEvolve by applying it to all odd-diagonal avoiding paths from $(0,0)$ to $(4, 4)$ and measuring statistics quantifying how close it was to a bijection. Specifically, we measured the surjectivity which we defined as the ratio $|f(X(n))/Y(n)|$, the injectivity, the ratio of unique elements in $f(X(n))$ to the total elements in $f(X(n))$, and the validity, the fraction of $f(X(n))$ that actually belong to $Y(n)$.

We also used an LLM to evaluate the program in order to score the program on qualities that are harder to quantify. The LLM judge evaluated the simplicity of the program: it was instructed to give high scores to programs that are simple, intuitive, and reveal a structural correspondence between the two families of paths. The LLM judge also evaluated whether the code implemented the algorithm described in the program's docstring, in order to differentiate between programs that scored low due to a flawed algorithm and those that scored low due to implementation errors, and additionally it checked for cheating.

Results. OpenEvolve discovered a program implementing a bijection after 60 iterations. The evolving population of programs is shown in the top left plot of the figure below. The program defining the valid bijection is the same as the known bijection.

Particularly notable was the quality of the documentation produced by the system. The final program included a docstring of over 600 words containing a detailed sketch of an invertibility proof. This illustrates how the natural language output of such a system could aid a human mathematician when producing a formal proof of the bijectivity of the map.

Analysis of program variability. Surprisingly, we found that the metrics we used to quantify how close a program was to being a bijection took a relatively limited number of values. For example, many programs achieved the same "surjectivity" score, even when they were functionally distinct mappings. To better understand this, we plot the text embeddings of all programs after dimensionality reduction in the figure below (bottom left) giving programs defining the same mapping the same color. We see that programs defining the same mapping can have very different text embeddings. For example, the two orange upside down triangles that are the exact same program are distant on the plot. This highlights an axis of variation that can be easy to overlook: a significant amount of variation introduced by LLMs can relate to program implementation rather than program functionality. While we want simpler bijections since these are easier for a human to gain insight from, we do not generally care about code-level implementation details. Would limiting LLM exploration of implementation modifications lead to more focus on mathematically salient aspects of the program?

Progression of program metrics over iterations
Text embedding of all programs using PCA

Figure 2: Top: Progression of the best program metrics over 70 iterations for different terms in the scoring function. Bottom: Text embedding of all programs, after applying dimensionality reduction using PCA. Each functionally distinct mapping is given a different color (the bijection is black). The surjectivity score is indicated by the relative size of the points.

The valid bijection produced after 60 iterations:

def avoiding_path_to_dyck_path(path, m):
  # Find the first return to the diagonal: smallest t>0 with height h(t) = 0.
  h = 0
  t = -1
  for i, step in enumerate(path, start=1):
    h += 1 if step == 1 else -1
    if h == 0:
      t = i
      break
  if t == -1:
    raise ValueError("No return to diagonal found; invalid input.")

  U = path[:t]
  V = path[t:]

  # Determine the Dyck interior X of the primitive excursion U.
  if U[0] == 1:
    X = U[1:-1]  # interior is a Dyck path
    R = avoiding_path_to_dyck_path(V, m-t //2)
    return [1] + X + [0] + R
  else:
    # U = 0 Y 1 with Y a co-Dyck interior
    # take X = complement(Y) which is Dyck.
    Y = U[1:-1]
    X = [1 - s for s in Y]
    R = avoiding_path_to_dyck_path(V, m-t //2)
    return [1] + R + [0] + X

Lessons learned

Several recent papers have successfully applied evolutionary program synthesis tools to problems in research-level mathematics. Many of these problems involve establishing bounds on a numerical quantity. So, what is different about bijection discovery? Here are the three biggest challenges we faced.

Challenge #1: Designing an objective function

A core challenge in any evolutionary search is designing a good scoring system (or objective function). The ability to construct granular scoring functions (ideally continuous) was identified as a crucial ingredient to finding solutions with AlphaEvolve in a recent paper by Georgiev et al.

The problem is, combinatorial bijections are all-or-nothing constructions: either a map is a bijection or it is not. In our case studies, we found that programs that defined distinct mappings yielded identical scores on our metrics. This created a flat "plateau" in the search space, giving the evolutionary algorithm no clear direction to "hill-climb" towards.

It may be that the problem of bijection discovery is simply less amenable to incremental improvement than problems like sphere packing, as the discovery of a correct mapping can require a single critical insight that resolves multiple structural flaws at once. Alternatively, program evaluation may simply need refinement with more granular objective functions.

Challenge #2: Aligning the objective with mathematical intent

Reward hacking is a familiar problem with LLMs: the model produces a solution that satisfies the literal constraints but violates the spirit of the problem. In our case we found that models discovered a variety of loopholes in the objective function, producing programs that achieve high scores but are not useful from the perspective of mathematical discovery. For example, several LLMs would exhaustively enumerate all objects from both sets in order to construct a trivial index-based mapping.

We prevented this kind of solution by adding explicit instructions to the prompt not to enumerate all of the objects and instructing the LLM evaluator to check for this type of cheating. However, we found that it was difficult to anticipate all of the different ways a model might reward-hack.

Challenge #3: LLMs favor solutions to well-known problems

In our case studies, we found that models had a tendency to get stuck on famous mathematical results that are associated with key words in the problem, even when they repeatedly prove to be unhelpful. This suggests that in problem selection and prompt writing, one should consider potential "gravitational pulls" and make efforts to steer models away from unproductive paths.


We believe that for creative mathematical tasks like bijection discovery, an expert-in-the-loop is essential. Their role is not just to define the initial problem and objective, but also to steer the discovery process by periodically inspecting the top-performing solutions for pathological behavior and evaluating whether the system is producing solutions that align with the intention of mathematicians.

1 A famous exercise in Richard Stanley's seminal combinatorics textbook Enumerative Combinatorics II does exactly this, having the reader construct 66 bijections between different sets counted by the Catalan numbers.

Guest Author: This blog post was written in collaboration with Henry Kvinge.