Solving an advent of code problem using a compiler

January 27, 2023

I've never tried advent of code, but my friend showed me an old problem he thought I'd be interested in. Here's a fun and possibly unexpected take on how to solve it.

The problem provided a small program written in a made-up assembly language, and asked: what input to this program produces a particular output? The language is simple enough (there's not even an "add" primitive, for example) that it's not immediately obvious what the program does, and the particular snippet has some tricky control flow with jumps crossing each other.

It's reasonable you could study the code to understand it, or perhaps make a fast runner to brute force it, but here's my sideways take on how to solve it: let a compiler optimize the code (to find structure from the assembly), then a decompiler to recover human-readable higher-level structure from that output, ultimately converting the assembly into something easier to understand.

The input program was 30 lines, starting with this. The language has a few instructions that modify things and some relative jumps:

cpy a d
cpy 4 c
cpy 633 b
inc d
dec b
jnz b -2
[...]

We can translate this into a small equivalent C snippet by using goto for the jumps. Just for funsies here an awk program to do that (though it would also be just a slightly longer Python etc. program). (Edit: after writing this post I discovered someone on reddit had a strikingly similar awk program, ha!)

{ printf "L" NR ": "  }
/cpy/ { print $3 "=" $2 ";" }
/inc/ { print $2 "++;" }
/dec/ { print $2 "--;" }
/jnz/ { print "if (" $2 "!=0) goto L" (NR+$3) ";" }
/out/ { print "out(" $2 ");" }

From the above input, this produces output like:

L1: d=a;
L2: c=4;
L3: b=633;
L4: d++;
L5: b--;
L6: if (b!=0) goto L4;

Next, we take that snippet and paste it into a .c file that looks like:

void run(int a) {
    // The other registers as used by the input language:
    int b = 0, c = 0, d = 0;
    // ...and then the generated code from above:
L1: d=a;
L2: c=4;
// ...
}

And, for the central trick, we compile this using the C optimizer. Importantly, a C optimizer knows to look for structures like loops and replace them with simpler code that does the same thing.

For example, if you look at just the small snippet I pasted above, you can recognize that L4-L6 are effectively adding b to d, and a compiler can figure that out too.

$ cc -O2 -o cpy cpy.c

Finally, we load this output into Ghidra, which will not only show us the assembly but also attempt to further infer structure to decompile it into higher-level C code.

It produces a snippet that opens with:

param_1 = param_1 + 0x9e4;
iVar2 = param_1;
do {
...

On one hand, this is already pretty neat — the compiler managed to figure out that the initial block of code (involving the constant 633 in the input) is actually just adding 0x9e4! But on the other, it's still no obvious in part because there are no variable names.

So next we turn to ChatGPT. Prompt: "Can you rewrite this code to use better variable names:" plus the code. It produces (disclaimer: I reran this to get better output for this post):

void run(int inputNumber)
{
  int output;
  int currentNumber;
  
  currentNumber = inputNumber + 0x9e4;
  do {
    while (currentNumber == 0) {
      out(0);
      currentNumber = inputNumber;
    }
    int iteration = 0;
    do {
      if (currentNumber == 1) {
        output = 1;
        goto endLoop;
      }
      iteration++;
      currentNumber -= 2;
    } while (currentNumber != 0);
    output = 0;
    endLoop:
    out(output);
    currentNumber = inputNumber;
    if (iteration != 0) {
      currentNumber = iteration;
    }
  } while( true );
}

This code still requires thought to understand. The middle do-while loop is adding one to iteration for each time it subtracts 2 from currentNumber, which you have to realize is a divide by two. But from here it is not a huge leap to see that this program adds 0x9e4 to the input and then prints it as binary. And from there, solving the original problem is a relatively small step.

In practice, I don't think this is the approach I would have actually taken to solve this problem. But since I've been staring at disassembly lately these tools are top of mind, and I think it's kind of a fun angle to take on it.