Compiling Python syntax to x86-64 assembly for fun and (zero) profit

June 2017

Summary: I used Python’s built-in AST module to parse a subset of Python syntax and turn it into an x86-64 assembly program. It’s basically a toy, but it shows how easy it is to use the ast module to co-opt Python’s lovely syntax for your own ends.

One of the reasons people like Python is because of its readable, low-punctuation syntax. You can write an algorithm in pseudocode, and then a minute later realize you’ve just written valid Python. So why not borrow Python’s syntax – and its parser – for your own evil schemes?

I got onto this recently when I was doing some performance work on Python bytecode (a story for another day), and I wondered, “Hmmm, what if Python bytecode was just x86 code?” That is, why have a bytecode interpreter when you have a super-speedy native interpreter right there in your CPU? I’m sure that idea isn’t original with me, and it has probably been tried and found wanting. It’s obviously not portable, and Python’s data and memory model is much different and higher-level than the x86’s.

I still think that’s an interesting idea, but I decided to try my hand at a much smaller and more fun problem: using the Python ast (abstract syntax tree) module to parse Python syntax, then recursively visit the AST nodes to turn them into another language. I thought it’d be easiest to generate C, but then I decided that going straight to (very unoptimized) assembly would be more fun.

For example, we’ll be turning this:

        65 + i

Into something like this:

        movq    8(%rbp), %rdx
        movq    $65, %rax
        addq    %rdx, %rax

Enter pyast64

I’ve written a bunch of 8086 and 80386 assembly back in the day, but have never actually had the need to write 64-bit x86 code. (“What’s this rax thing? I’ve only heard of ax and eax.”) These days almost everything is 64-bit, and I spend most of my time (alas) on macOS, so I’m generating x86-64 code for macOS. It wouldn’t be hard to port it to Linux, though Windows might be a bit more work.

Outputting assembly source is easier than outputting machine code and an executable directly, so that’s what I did, letting as and ld take care of assembly and linking. I’m not a fan of AT&T assembly syntax, but that’s what’s built in, so I used it.

I chose to keep it really simple: my language looks like Python, but it’s definitely not. The only data type is integers, and the only output mechanism is a predefined putc function (adding getc for input is left as an exercise for the reader).

The compiler uses a simple %rbp-based stack frame to handle an arbitrary number of local variables. It supports while loops, as well as for i in range() loops – for is implemented in terms of while. It supports if/else, comparisons, logical and and or, the four basic math operators, and recursion.

If I felt the urge to make it more than a toy, I’d use Python 3’s type annotations to support different data types like floating point and strings, and I’d add some features to allocate and use memory arrays. Oh, and things like *args and default arguments.

To keep it simple, the assembly output is very dumb, basically ignoring the fact that the x86-64 has a whole slew of registers, and just using %rax and %rdx for calculations and the stack for the rest. However, there’s a small peephole optimizer which turns push-then-pop sequences into mov instructions.

A quick look at the implementation

The full source is on GitHub at benhoyt/pyast64, but here’s a quick look at the implementation.

When writing AST-handling code, you typically write a visitor class that implements visit_* methods to visit each AST node type, for example visit_FunctionDef or visit_Add. There’s a simple standard library ast.NodeVisitor class that you can subclass to do the lookups, but I implemented my own because I wanted it to fail hard on unknown node types instead of calling the generic_visit fallback.

Python’s dynamic attributes and getattr() function makes this trivial:

class Compiler:
    def visit(self, node):
        name = node.__class__.__name__
        visit_func = getattr(self, 'visit_' + name, None)
        assert visit_func is not None, '{} not supported'.format(name)
        visit_func(node)

    def visit_Module(self, node):
        for statement in node.body:
            self.visit(statement)

    def visit_FunctionDef(self, node):
        ...

And here is how the meat of the node types are implemented (for a selection of simple node types):

    def visit_Num(self, node):
        # A constant, just push it on the stack
        self.asm.instr('pushq', '${}'.format(node.n))

    def local_offset(self, name):
        # Calculate the offset of the given local variable
        index = self.locals[name]
        return (len(self.locals) - index) * 8 + 8

    def visit_Name(self, node):
        # Push the value of a local on the stack
        offset = self.local_offset(node.id)
        self.asm.instr('pushq', '{}(%rbp)'.format(offset))

    def visit_Assign(self, node):
        # Assign (set) the value of a local variable
        assert len(node.targets) == 1, \
            'can only assign one variable at a time'
        self.visit(node.value)
        offset = self.local_offset(node.targets[0].id)
        self.asm.instr('popq', '{}(%rbp)'.format(offset))

    def simple_binop(self, op):
        self.asm.instr('popq', '%rdx')
        self.asm.instr('popq', '%rax')
        self.asm.instr(op, '%rdx', '%rax')
        self.asm.instr('pushq', '%rax')

    def visit_Add(self, node):
        self.simple_binop('addq')

    def visit_Sub(self, node):
        self.simple_binop('subq')

    def visit_Call(self, node):
        assert not node.keywords, 'keyword args not supported'
        for arg in node.args:
            self.visit(arg)
        self.asm.instr('call', node.func.id)
        if node.args:
            # Caller cleans up the arguments from the stack
            self.asm.instr('addq', '${}'.format(
                    8 * len(node.args)), '%rsp')
        # Return value is in rax, so push it on the stack now
        self.asm.instr('pushq', '%rax')

When a FunctionDef node is encounted, we use another visitor class (this time an ast.NodeVisitor subclass) to whip through the AST of that function and find the names and number of local variables. We store a dict of variable name to index that’s used to calculate stack offsets when fetching and storing locals.

In my toy language, the only node types that can “create” locals are assignment and for loops, so here’s that visitor class in its entirety:

class LocalsVisitor(ast.NodeVisitor):
    def __init__(self):
        self.local_names = []

    def add(self, name):
        if name not in self.local_names:
            self.local_names.append(name)

    def visit_Assign(self, node):
        assert len(node.targets) == 1, \
            'can only assign one variable at a time'
        self.visit(node.value)
        self.add(node.targets[0].id)

    def visit_For(self, node):
        self.add(node.target.id)
        for statement in node.body:
            self.visit(statement)

In addition to the Compiler class, there’s an Assembler class that actually outputs the assembly instructions, labels, etc. This class also implements the peephole optimizer to combine sequences of pushes and pops into moves. Here’s the structure of that:

class Assembler:
    def __init__(self, output_file=sys.stdout, peephole=True):
        self.output_file = output_file
        self.peephole = peephole
        # Current batch of instructions, flushed on label and
        # end of function
        self.batch = []

    def instr(self, opcode, *args):
        # Output a single instruction with given args
        self.batch.append((opcode, args))

    def flush(self):
        if self.peephole:
            self.optimize_pushes_pops()
        for opcode, args in self.batch:
            args_str = ', '.join(str(a) for a in args)
            print('\t{}\t{}'.format(opcode, args_str),
                  file=self.output_file)
        self.batch = []

    def optimize_pushes_pops(self):
        """This finds runs of push(es) followed by pop(s) and combines
        them into simpler, faster mov instructions. For example:

        pushq   8(%rbp)
        pushq   $100
        popq    %rdx
        popq    %rax

        Will be turned into:

        movq    $100, %rdx
        movq    8(%rbp), %rax
        """
        ...

Again, go to the GitHub repo to browse the full source – it’s only 500 lines of code including blanks and comments.

Example of output

There are a couple of examples (the *.p64 files) in the source tree. Below is the simplest of them: forloop.p64, which simply prints the letters A through J using a for loop:

def loop():
    for i in range(10):
        putc(65 + i)            # 65 is 'A'

def main():
    loop()

Note that for i in range(10) is not compiled directly, but expanded to a while loop:

i = 0
while i < 10:
    putc(65 + i)
    i = i + 1

To give you a taste of the output, the loop function compiles to the following assembly (with some comments added afterward):

loop:
        pushq   $0              # allocate stack space for "i"
        pushq   %rbp            # save and setup frame pointer
        movq    %rsp, %rbp
        movq    $0, 8(%rbp)     # i = 0
loop_1_while:
        movq    $10, %rdx       # rax = 1 if i < 10 else 0
        movq    8(%rbp), %rax
        cmpq    %rdx, %rax
        movq    $0, %rax
        jnl     loop_3_less
        incq    %rax
loop_3_less:
        cmpq    $0, %rax        # if bool is zero, break
        jz      loop_2_break
        movq    8(%rbp), %rdx   # 65 + i
        movq    $65, %rax
        addq    %rdx, %rax
        pushq   %rax            # putc()
        call    putc
        addq    $8, %rsp
        movq    $1, %rdx        # i = i + 1
        movq    8(%rbp), %rax
        addq    %rdx, %rax
        movq    %rax, 8(%rbp)
        jmp     loop_1_while
loop_2_break:
        popq    %rbp            # restore frame pointer
        leaq    8(%rsp),%rsp    # deallocate stack space for "i"
        ret                     # return to caller

A quick benchmark

Again, this is very much a toy, and the assembly output is pretty terrible. But benchmarks are fun (and Python is slow at integer math), so I did a quick benchmark that sums the first 100,000,000 integers:

def main():
    sum = 0
    for i in range(100000000):
        sum += i

I also compared to a slightly more idiomatic version that only works in actual Python using sum(range(100000000)), as well as a C version of the same loop for reference. On my MacBook Pro 2.5GHz i7, here are the results:

version	time (s)	ratio
python 3.5	7.0	1.0
python 3.5 with sum()	2.2	3.2
pyast64 without peephole	0.55	12.7
pyast64 with peephole	0.24	29.7
C version (`gcc -O0`)	0.22	31.8

So there you go: my 500-line toy compiler executes an integer-summing loop 30x as fast as Python, and on a par with gcc with optimizations disabled (gcc -O2 optimizes the entire loop away to a constant).

Enjoy! Just don’t try to use it for real projects anytime soon. :-)

Please write your comments on Hacker News or programming reddit.

I’d love it if you sponsored me on GitHub – it will motivate me to work on my open source projects and write more good content. Thanks!