Compiler architecture#

This document goes over the Pymich compiler internals. We first go over its high-level architecture, before diving it deeper to provide the reader with the information necessary start hacking on the codebase.

High level architecture#

The Pymich compiler is articulated over three components:

  • a frontend, that compiles from Pymich code to its intermediate-representation (Pymich-IR).

  • a middlend, that transforms Pymich-IR into a more optimized Pymich-IR.

  • a backend, that compiles from the Pymich-IR to Michelson instructions ready

The compiler source is organized as follows:

Pymich Intermediate-Representation#

Let us start by going over the Pymich-IR, heavily inspired from Albert since it is the only language involved in all three components of the compiler. However, it is not quite as low level and leaves variable dealocations to the backend compiler.

The main goal of this IR is to provide the middle-end with a simpler representation to optimize over, and allow the backend to generate Michelson code in a simpler way. The IR provides the following grammar:

python_types     ::=
                   | python_number
                   | python_string

base_types       ::=
                   | Unit
                   | String
                   | Nat
                   | Int
                   | Boolean
                   | Address

type             ::=
                   | base_type
                   | Operation
                   | Contract
                   | Map[type, type]
                   | BigMap[type, type]
                   | List[type]
                   | Set[type]
                   | Record[attr1=type, ..., attrn=type]

base_object      ::=
                   | base_type(python_type)

atom             ::=
                   | var
                   | base_object

lhs              ::=
                   | var
                   | record.attribute
                   | map[atom]
                   | big_map[atom]

rhs              ::=
                   | atom
                   | f(atom)
                   | record.attribute
                   | Map[type, type]()
                   | BigMap[type, type]()
                   | List[type]()
                   | Set[type]()
                   | var.unpack[type]()

instruction      ::=
                   | rhs
                   | lhs = rhs
block            ::=
                   | instruction
                   | instruction
                     instruction

loops            ::=
                   | for var in rhs:
                         block

 condition       ::=
                   | if var:
                         block
                     else:
                         block

 exception       ::=
                   | raise Exception(python_string)

Entrypoints are factored in a class inheriting from pymich.michelson_types.Contract class and all of this class’s methods are compiled to entrypoints. Entrypoints take exactly one argument and return the new storage. Although the operations are currently returned implicitely by the compiler at the end of every entrypoints, this will most likely be updated in the near future and the Pymich-IR will require functions returning a Pair(List[Operation], Storage). Finally, in Pymich-IR, all contracts must implement a :python`get_storage_type()` method that returns the storage type.

All functions take exactly one argument.

Here is an example of the FA1.2 mint function that was implemented in the getting-started section:

class AllowanceKey:
    owner: Address
    spender: Address

class Storage:
    tokens: BigMap[Address, Nat]
    allowances: BigMap[AllowanceKey, Nat]
    total_supply: Nat
    owner: Address

class mintParam:
    _to: Address
    value: Nat

class FA12(Contract):

    def get_storage_type():
        return Storage

    def mint(mint__param: mintParam) -> Storage:
        _to = mint__param._to
        value = mint__param.value
        tmp_var_6 = __STORAGE__.owner
        tmp_var_7 = SENDER != tmp_var_6
        if tmp_var_7:
            raise Exception('Only owner can mint')
        tmp_var_8 = __STORAGE__.total_supply
        __STORAGE__.total_supply = tmp_var_8 + value
        tmp_var_0 = __STORAGE__.tokens
        tmp_var_9 = __STORAGE__.tokens
        tmp_var_10 = tmp_var_9.get
        tmp_var_11 = tmp_var_10(_to, Nat(0))
        tmp_var_0[_to] = tmp_var_11 + value
        __STORAGE__.tokens = tmp_var_0
        return __STORAGE__

Frontend#

The goal of the frontend compiler is to go from the subset of python that Pymich supports to the Pymich-IR such that the middle end can then optimize the IR before the backend emits the Michelson.

The passes go as follow (see the frontend compiler code:

  1. rewrite pre-hangzhou view functions;

  2. factor out the storage into its own dataclass and introduce the get_storage_type method;

  3. remove self from contract method arguments;

  4. tuplify function arguments;

  5. handle entrypoint and functions with no arguments (convert then to single argument functions of type Unit;

  6. three address code the python code according to the Pymich-IR grammar.

Supporting a new Python syntax such as the currently unsupported += operator is then as easy as adding a frontend compiler pass that transforms a += b into a = a + b and letting the middlend and backend do the rest.

Note that the IR is not yet in static-single assignment form due to the difficulty to compile the phi function in the backend compiler. Work is currently undergoing to introduce it. We are currently discussing with Nomadics Labs how to best introduce it (or the equivalent continuous passing style representation).

Finally, the Python typechecking occurs at the Pymich-IR after all passes have been executed. This allows to easily reuse the typechecking routine after each middlend passes.

Middlend#

The goal for the middlend is to convert, through successive passes, Pymich-IR to a more optimized Pymich-IR.

Although the Middlend is currently not implemented, the main goal in the next few months is to introduce one, since we have simple enough Pymich-IR that optimizations are possible. The optimizations that will be first implemented are the following:

  • constant folding

  • constant propagation

  • dead code elimination

  • inlining function that are only use a single time

  • common subexpression elimination

  • loop invariant code motion

In order to ensure that the program still typechecks after the optimizations, the Pymich-IR typechecker is ran after all optimization passes have successfully executed.

Backend#

Finally, the compiler backend emits Michelson code from Pymich-IR to Michelson. Besides the obvious conversions from Pymich-IR data structure to the Michelson ones, the compiler backend is responsible for allocating and deallocating variables when necessary.

Here are some links to the implementations:

Note that although the typechecker is currently used which compiling from the IR to Michelson (link from the first bullet above), it is planned to refactor it into a separate pass that can be called after the frontend compiler, between each optimization pass, and before the backend compiler.

Error Handling#

Exceptions raised by the compiler stages are available here. They each take a line number that is preserved from Pymich code to the IR via ast.fix_missing_locations(transformed_ast).

In the following example, we use an undefined variable g:

from pymich.michelson_types import *
from typing import Callable


class Contract:
    counter: Nat
    f: Callable[[Nat], Nat]

    def update_f(self, f: Callable[[Nat], Nat]):
        self.f = g  # Error, `g` is undefined

    def update_counter(self, x: Nat):
        self.counter = self.f(x)

Which returns as expected the undefined variable error with the correct line number:

Function 'g' does not exist at line 10