The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction

Bijective maps are a key tool in an area of mathematics called combinatorics, which studies discrete structures like partitions, permutations, and graphs. They allow mathematicians to draw deep connections between families of objects that might seem different at first glance but ultimately encode a similar combinatorics. For example, when $X$ and $Y$ are finite sets, a bijection $f: X \rightarrow Y$ is a constructive proof that $|X| = |Y|$. This makes bijections a powerful tool for counting the size of a complicated set $Y$, using a set $X$ that we understand better. Given that code is a natural medium in which to write bijections, we conjecture that evolutionary program synthesis frameworks like OpenEvolve have the potential to lead to the discovery of new combinatorics. To explore this, we developed an OpenEvolve-based bijection search system and presented it with three challenging bijection construction problems involving Dyck paths, two with known solutions and one open problem. The system produced tantalizing results including a correct solution to one of the known (but very non-trivial) bijections. We describe some of the common failure modes we observed in our system and what we believe is required to make progress moving forward. We provide full details in the paper.

In the field of combinatorics, one of the most elegant and powerful tools is the bijection: a one-to-one and onto mapping $f:X \rightarrow Y$ between two sets of objects. You can think of $f$ as establishing a correspondence between $X$ and $Y$. The existence of $f$ implicitly proves that $|X| = |Y|$. For example, the Catalan numbers count hundreds of seemingly unrelated families of combinatorial objects such as triangulations of convex polygons or the number of certain lattice paths called Dyck paths (both illustrated in the figure below along with a sketch of the corresponding bijection). Once we can prove that one of these sets $X_0$ is counted by the Catalan numbers, $|X_0| = C_n$, we can prove that another set $X_i$ is counted by the Catalan numbers by constructing a bijective function $f_i: X_0 \rightarrow X_i$¹.

Bijection between triangulations and Dyck paths

Figure 1: The bijective map from triangulations of a 5-gon to Dyck paths from (0,0) to (3,3): For each vertex i in {3, 4, 5}: (1) Draw an N step, (2) Draw an E step for each diagonal (i, j) with i < j.

Bijections can do more than prove two sets are the same size: if the set $X_0$ has some special structure like a symmetry (e.g., reflection symmetries among sets of lattice paths) or an important statistic (e.g., the length of a permutation), the bijection $f$ pushes this structure to $Y$. This often reveals new aspects of $Y$ that we may not have previously recognized. While some bijections may be obvious, others can be highly non-trivial, surfacing deep combinatorial structure and drawing connections between seemingly unrelated areas of mathematics.

Most combinatorial bijections are concrete (at least by the standards of modern mathematics). They also frequently relate sets of discrete objects that are well represented by mainstream programming languages such as permutations, integer partitions, trees, etc. The task of constructing bijections can be challenging, requiring creativity, a deep understanding and intuition for the objects of study, and in most cases, extensive trial and error. Together, these properties make the task of discovering novel bijections a great testbed for evolutionary program synthesis systems. Below we describe our experience evaluating a bijection-focused version of OpenEvolve on three different bijection discovery tasks.

What Makes Bijections Easy or Difficult?

The difficulty of constructing a bijection between two combinatorial sets varies widely depending on the problem. We illustrate this with some bijective maps from Dyck paths to other combinatorial sets counted by the Catalan numbers. Dyck paths are lattice paths that travel from (0,0) to (n,n) using a sequence of single unit steps to the north and east and never crossing the diagonal y=x.

On one end of the spectrum, there are trivial bijections, such as the map from Dyck paths to balanced parentheses, which simply replaces each N step with a “(“ and each E step with a “)”. There are also many well-known, straightforward bijections from Dyck paths to combinatorial sets that require using the N and E labels in a way that does not require much creativity.

On the other end of the spectrum are bijections that require using other combinatorial sets or statistics in a creative way, such as the bijection between Dyck paths and 321-avoiding permutations. Finally, there exist combinatorial results where we know two sets have the same size through a counting argument, but no explicit bijection is known.

Setting up OpenEvolve for Bijection Discovery

We designed three case studies to evaluate our OpenEvolve-based system's ability to discover bijections. All three problems relate to Dyck paths. We selected two solved but challenging problems based on difficulty ratings from Richard Stanley's Catalan Numbers monograph, which contains 214 combinatorial interpretations of the Catalan numbers. We selected an open problem (discovering an "area-bounce exchanging bijection on Dyck paths") based on a recent research paper that contains a partial result.

To apply OpenEvolve for bijection discovery, we wrote custom evaluation functions and prompts tailored to evaluating bijection construction programs. We used an LLM judge to track nuances of the programs that would have been challenging to capture in a simple scoring function, e.g. whether the program description in the docstring matched its implementation. We fed these findings back to OpenEvolve as updates to the core prompt.

Here, we will discuss the successful case study, leaving descriptions of system performance on the other examples to the paper.

Case Study: Bijection between Dyck paths and odd-diagonal-avoiding lattice paths

Problem. Construct a bijective map between North-East lattice paths that avoid odd points on the diagonal $(1,1), (3,3), \ldots$ (which we call "odd-diagonal avoiding paths") and the classic Dyck paths.

Can base models produce this bijection? Before applying OpenEvolve, we prompted a variety of LLMs to produce the bijection directly. None were successful, making it an appropriate problem for OpenEvolve.

Evaluation Procedure. We evaluated each program produced by our version of OpenEvolve by applying it to all odd-diagonal avoiding paths from $(0,0)$ to $(4,4)$ and measuring statistics quantifying how close it was to a bijection. Specifically, we measured the surjectivity which we defined as the ratio $|f(X(n))/Y(n)|$, the injectivity, the ratio of unique elements in $f(X(n))$ to the total elements in $f(X(n))$, and the validity, the fraction of $f(X(n))$ that actually belong to $Y(n)$.

We also used an LLM to evaluate the program in order to score the program on qualities that are harder to quantify. The LLM judge evaluated the simplicity of the program: it was instructed to give high scores to programs that are simple, intuitive, and reveal a structural correspondence between the two families of paths. The LLM judge also evaluated whether the code implemented the algorithm described in the program's docstring, in order to differentiate between programs that scored low due to a flawed algorithm and those that scored low due to implementation errors, and additionally it checked for cheating.

Results. OpenEvolve discovered a program implementing a bijection after 60 iterations. The evolving population of programs is shown in the figure below. The program defining the valid bijection is the same as the known bijection.

Particularly notable was the quality of the documentation produced by the system. The final program included a docstring of over 600 words containing a detailed sketch of an invertibility proof. This illustrates how the natural language output of such a system could aid a human mathematician when producing a formal proof of the bijectivity of the map.

Analysis of program variability. Surprisingly, we found that the metrics we used to quantify how close a program was to being a bijection took a relatively limited number of values. For example, many programs achieved the same "surjectivity" score, even when they were functionally distinct mappings. To better understand this, we plot the text embeddings of all programs after dimensionality reduction in the figure below giving programs defining the same mapping the same color. We see that programs defining the same mapping can have very different text embeddings. For example, the two orange upside down triangles that are the exact same program are distant on the plot. This highlights an axis of variation that can be easy to overlook: a significant amount of variation introduced by LLMs can relate to program implementation rather than program functionality. While we want simpler bijections since these are easier for a human to gain insight from, we do not generally care about code-level implementation details. Would limiting LLM exploration of implementation modifications lead to more focus on mathematically salient aspects of the program?

Progression of program metrics over iterations

Text embedding of all programs using PCA

Figure 2: Top: Progression of the best program metrics over 70 iterations for different terms in the scoring function. Bottom: Text embedding of all programs, after applying dimensionality reduction using PCA. Each functionally distinct mapping is given a different color (the bijection is black). The surjectivity score is indicated by the relative size of the points.

def avoiding_path_to_dyck_path(path, m):
  # Find the first return to the diagonal: smallest t>0 with height h(t) = 0.
  h = 0
  t = -1
  for i, step in enumerate(path, start=1):
    h += 1 if step == 1 else -1
    if h == 0:
      t = i
      break
  if t == -1:
    raise ValueError("No return to diagonal found; invalid input.")

  U = path[:t]
  V = path[t:]

  # Determine the Dyck interior X of the primitive excursion U.
  if U[0] == 1:
    X = U[1:-1]  # interior is a Dyck path
    R = avoiding_path_to_dyck_path(V, m-t //2)
    return [1] + X + [0] + R
  else:
    # U = 0 Y 1 with Y a co-Dyck interior
    # take X = complement(Y) which is Dyck.
    Y = U[1:-1]
    X = [1 - s for s in Y]
    R = avoiding_path_to_dyck_path(V, m-t //2)
    return [1] + R + [0] + X

Lessons learned

Several recent papers have successfully applied evolutionary program synthesis tools to problems in research-level mathematics. Many of these problems involve establishing bounds on a numerical quantity. So, what is different about bijection discovery? Here are the three biggest challenges we faced.

Challenge #1: Designing an objective function

A core challenge in any evolutionary search is designing a good scoring system (or objective function). The ability to construct granular scoring functions (ideally continuous) was identified as a crucial ingredient to finding solutions with AlphaEvolve in a recent paper by Georgiev et al.

The problem is, combinatorial bijections are all-or-nothing constructions: either a map is a bijection or it is not. In our case studies, we found that programs that defined distinct mappings yielded identical scores on our metrics. This created a flat "plateau" in the search space, giving the evolutionary algorithm no clear direction to "hill-climb" towards.

It may be that the problem of bijection discovery is simply less amenable to incremental improvement than problems like sphere packing, as the discovery of a correct mapping can require a single critical insight that resolves multiple structural flaws at once. Alternatively, program evaluation may simply need refinement with more granular objective functions.

Challenge #2: Aligning the objective with mathematical intent

Reward hacking is a familiar problem with LLMs: the model produces a solution that satisfies the literal constraints but violates the spirit of the problem. In our case we found that models discovered a variety of loopholes in the objective function, producing programs that achieve high scores but are not useful from the perspective of mathematical discovery. For example, several LLMs would exhaustively enumerate all objects from both sets in order to construct a trivial index-based mapping.

We prevented this kind of solution by adding explicit instructions to the prompt not to enumerate all of the objects and instructing the LLM evaluator to check for this type of cheating. However, we found that it was difficult to anticipate all of the different ways a model might reward-hack.

Challenge #3: LLMs favor solutions to well-known problems

In our case studies, we found that models had a tendency to get stuck on famous mathematical results that are associated with key words in the problem, even when they repeatedly prove to be unhelpful. This suggests that in problem selection and prompt writing, one should consider potential "gravitational pulls" and make efforts to steer models away from unproductive paths.

We believe that for creative mathematical tasks like bijection discovery, an expert-in-the-loop is essential. Their role is not just to define the initial problem and objective, but also to steer the discovery process by periodically inspecting the top-performing solutions for pathological behavior and evaluating whether the system is producing solutions that align with the intention of mathematicians.

Guest Author: This blog post was written in collaboration with Helen Jenne, Henry Kvinge and Davis Brown.

A famous exercise in Richard Stanley's seminal combinatorics textbook Enumerative Combinatorics II does exactly this, having the reader construct 66 bijections between different sets counted by the Catalan numbers. ↩