CS 6983 Special Topics: The Technology of Lambda -*- Outline -*- Olin Shivers Homework 4: Closure conversion part I: analysis Due 2025/3/22 midnight by email Closure conversion is the heart of our compiler, where we incarnate higher-order functions (a lovely, language-level fantasy) as closures made of code, memory blocks and registers (a gritty, machine reality). In this assignment, you will do the analysis to compute the information needed for the actual conversion. The results of your analysis will be recorded in annotations on the code; in the next assignment, you will write the actual conversion as a simple tree-walk of the AST that is driven by these annotations. This all has two consequences: - This is the difficult half of closure conversion -- figuring out what we know about the program and making all the important decisions. The actual transform will be fairly straightforward. So be prepared for some work, both conceptual and hacking. - Our program terms are going to get quite annotation-heavy. ------------------------------------------------------------------------------- * 0. Preliminaries ------------------ ** Sets everywhere ================== This assignment does a fair amount of set manipulation -- union, subtraction, membership, subset. You may want to track down or implement a basic set-algebra package to use. SRFI-1 has a set of functions that might be useful; you may find more sophisticated packages if you look around. Clearly, you can get big speedups by representing your sets as bit vectors, but this requires more engineering than is merited in your prototype compiler. I recommend you keep it simple. ** Lambdas, LETs, CONTs and FUNs ================================ When we say "lambda," we mean both FUN and CONT terms in a program. When we say a variable is "LET-bound" to some argument right-hand side (RHS), we mean this as a convenient short-hand way to speak of the corresponding beta redex. Bear in mind that there are two such redexes: one where the binding lambda is an open FUN term, and one where the binding lambda is an open CONT term: ((cont (f) ... f ... f ...) ; (let ((f )) ) ; ... f ... f ...) ((fun (f k) ; (let ((f ) ... f ... k ... f ... k) ; (k )) ; ... f ... k ) ; ... f ... k) (Remember that only FUN forms can bind a continuation. CONTs don't take continuation arguments.) ** Scope and free vars ====================== Before you start this assignment, reflect for a moment on the relationship between the set of variables in-scope at program point p, InScope_p, and the set of variables in the free-variable set at program point p, FreeVars_p: the latter is a subset of the former. That is, there's a producer/consumer relation between InScope_p and FreeVars_p: InScope_p is the set of variables the program *makes available* to point p, while FreeVars_p is the subset of these variables that p *actually uses*. The *language semantics* say the term at point p can, if it wishes, access anything in InScope_p. But the *compiler* is only obligated to make the bindings of FreeVars_p available to p -- which might be quite a savings. For example, suppose we have a pure combinator, say, the function-doubler (fun (f x k) (f x (cont (temp) (f temp k)))) This piece of code might appear in a *context* with a thousand variables in scope. But it doesn't *need* any of them. Keep this InScope / FreeVar distinction in mind when you get to the extended-variables analysis in step N of this assignment. ** Annotating forms =================== You'll be collecting, using, and sprinkling lots of information on various pieces of your program. That, in fact, is the entire point of this assignment. If we were programming in OCaml or SML, we'd simply make these annotations mutable fields in the records that we use to represent nodes in the CPS AST. But s-expressions are nice, because they can be easily pretty printed. (If you don't know how to "pretty print" an s-expression, look it up in the Racket documentation.) This compiler has been designed so that we only need to annotate binding forms -- things that bind variables: FUN, CONT and LETREC forms. To accomodate our wish to have a (reasonably) readable IR, we'll collect all the annotations for a term together in a sublist marked with an "@". As you proceed through the analyses in this assignment, you'll keep accumulating additional annotations to add to the grammar you use in nanopass for annotations. Here are some examples: (fun (x k) ;; Initial form after parameters is the annotations set: (@ (kind first-order) ; This FUN is a first-order lambda. (label fun38) ; Its global label is FUN38. (free-vars b y)) ; It contains free references to B & Y. (if b (k 0) ; Here's the body ($+ x y k))) ; of the FUN. (letrec ((f (fun (n k1) ...)) ; bindings (g (fun (a b k2) ... z ...))) (@ (label lr81) ; annotations (free-vars x z ktop)) (f x ktop)) ; body You'll need to engineer up ways to query, delete, add, and update the annotations on a term. To convert back to your pre-analysis form, you just drop the (@ ...) subforms everywhere. ------------------------------------------------------------------------------- * 1. Static, global names for binding forms ------------------------------------------- Write a pass that assigns a globally unique name (called its "label") to every lambda and LETREC term in your program. These names will be useful later when we want to globally refer to different lambdas. Labels are drawn from a flat, global namespace that is completely separate from the lexically scoped namespace used for source variables; this is why a code label must be unique across the entire program. The label of a lambda or a LETREC is given by the (label ) annotation, e.g. (fun (x k) (@ ... (label fun37) ...) ; This FUN is labelled FUN27. ($+ x x k)) In our AST, a label is a symbol. ------------------------------------------------------------------------------- * 2. Free-variable analysis --------------------------- Write an analysis that annotates each lambda with its set of free variables, e.g. (fun (x k) (@ (free-vars y z)) (if y ($+ x y k) (k z))) You could, in fact, interleave this with the next two passes (kind classification and first-order CFA)... but you might want to keep it a separate pass just to keep the details separate to make them easier to code and debug. And it's a good warmup analysis: a simple tree walk, pretty easy. ------------------------------------------------------------------------------- * 3. Classification of lambdas ------------------------------ Write a pass in your compiler that will analyse a CPS term and classify (by annotation) every lambda (FUN or CONT) as one of the following three kinds: - Open An "open" lambda is a term that will get executed exactly where it appears in the code, so that + Its evaluation environment and its application environment match exactly. + It is called from exactly one, statically known, call site. + That call site calls no other lambda. There are two kinds of open lambdas: 1. A lambda appearing as the function part of a beta redex, e.g. ((cont (f) ... f ... f ...) ; This CONT is an open lambda. (fun (x k) ...)) ; This FUN is not. Such lambdas are basically LETs. Invariants: - Any such LET binds its LHS variable to a lambda RHS. Why? If the variable was bound to a constant or variable RHS, your simplifier would have reduced it away. - The variable bound by any such LET has more than one reference. If it only had 0 or 1 reference, your simplifier would have reduced the redex away. - So these LETs bind a *lambda* to a variable that has *multiple references*. In other words, this kind of open lambda always binds/names a "join point" -- code to which control is transferred from two or more spots. These join-point LETs allow us to encode a control-flow graph that is a DAG; without them we can just code up a tree using IFs as code splits. Want cycles in the CFG, too? Then we have to use LETRECs. Here's another way to look at these "join" LETs. An open lambda don't really do any computation. It just introduces a local name for a local thing. Here we are exploiting the property of names to give us the ability to refer to something multiple times: just give it a name, and use the name twice. 2. A CONT form appearing as the continuation of a primop call, e.g. ($+ x y (cont (z) ...)) - First-order A "first-order" lambda is one that is LET- or LETREC-bound to a variable that is only referenced in the *function position* of call forms. E.g., ((cont (sqr) (sqr 3 (cont (a) (sqr 5 (cont (b) ($+ a b k)))))) (fun (x k') ($* x x k'))) ; This FUN is first-order & bound to SQR. Given this definition, a first-order lambda has these properties: + It is only called from statically known call sites. + These call sites only call this one lambda. + All calls to this lambda occur in a lexical context that sees all variables that are in scope for the lambda, with the same bound values. A first-order lambda is only bound to *one* variable, ever, and that one variable is only bound to that lambda. So we can think of every first-order function as having a unique name -- that one variable to which it is bound. They are paired together. A first-order lambda has the nice property that every call to it is statically known and *only* calls that lambda, so we can tailor these calls *specifically* for that one lambda. This is pretty liberating, as we'll see. - Closed A "closed" lambda is anything not open or first-order. Essentially, it is a lambda that will need to have a closure created each time it is evaluated. If it's a FUN, the closure will be allocated on the heap; if it's a CONT, the closure will be allocated in the stack frame that is the current, top frame at the time the closure is created. What's good about a closed lambda is that evaluating such a lambda makes a real *value* that we can pass around the program, put onto lists, throw into hash tables, etc. (For comparison, look at the rules for first-order lambdas -- they cannot be used / are never used as actual values.) What's bad is the cost of this extra utility: (1) we have to allocate storage for the closure at run time, every time the program evaluates the lambda, and (2) we will have to call these functions using a general-purpose, globally fixed, heavyweight calling protocol. (Could we do better? If we had, say, higher-order flow analysis and built lambda/call webs? Stay tuned...) Why are "first-order" lambdas called first-order? Because they correspond to the simple skeleton of code that we could write in a first-order programming language like Pascal or Algol. You can name functions, but only use those names as the function part of a function call -- which provides lots of useful constraints we can exploit when compiling the code. First-order and open lambdas, taken together, give us the control structure that C, Pascal, and Fortran compilers call the "control-flow graph" (CFG) of a given function. That is, in CPS, loops, conditional splits, joins and straight-line basic blocks are all provided by open and first-order lambdas: - Basic blocks are sequences of primops chained together by open continuations. - Loops are made with LETREC-bound first-order functions. - Join points are made with LET-bound first-order continuations. - OK, we cheated on conditional splits in our IR, with a core IF syntax form, to keep things simple, but it can be easily managed with multi-continuation conditional primops. Take a moment and reflect on all the different things we are saying when we say a lambda is first-order: - Control: It's only called in tightly controlled ways, from compiler-known places. - Environment: There's a very strong relationship between the environment where it is evaluated and the one where it is invoked. - Data: No closure needed. Its procedures are *not run-time data*. We've tightly constrained *all three* fundamental program structures involved with this lambda. Which just goes to show you how powerful the general mechanism of lambda is -- because the lambda that the programmer gets is a lot more than just first-order lambdas. You record your classification by adding a KIND annotation to the lambda. For example, (fun (x k) ; Evaluating this lambda will allocate a (@ (kind closed)) ; closure on the heap, for COEFF & the code. ($* x coeff k)) (cont (a) ; A return point for a function call. (@ (kind closed)) ; Free vars M-1 & K will be kept on the stack ($* m-1 a k)) ; frame of the FUN to which we belong. (cont (x) ; We jump here from two or more (@ (kind first-order)) ; known places: a CFG join point. ($+ 1 x k)) Additionally, when you find a first-order lambda L that is LET- or LETREC-bound to some variable V -- so that V is either the parameter of an open lambda or LETREC binder B -- your analysis should mark V as being bound to a first-order lambda by putting it in a FIRST-ORDER-VARS annotation on B. For example: (fun (j k) ; Slow even test for natural numbers. (@ (kind closed)) (letrec ((even? (fun (n k1) ; Only called => first-order ($zero? n (cont (b) (if b (k1 #t) ($- n 1 (cont (m) (odd? m k1)))))))) (odd? (fun (p k2) ; Only called => first-order ($zero? p (cont (b) (if b (k2 #f) ($- p 1 (cont (q) (even? q k2))))))))) (@ (first-order-vars even? odd?)) ; Both of this LETREC's (even? j k)) ; vars are 1st order. This section is described separately from the following control-flow analysis in the next section, but you should combine the two into a single unified pass. They are really a single algorithm that produces multiple distinct (but related) annotations. ------------------------------------------------------------------------------- * 4. First-order control-flow analysis -------------------------------------- For every first-order lambda, find all the places it could be called. If we had assigned labels to all the call sites in the program, we could simply annotate each first-order lambda with the labels of those calls. Instead, we'll work in terms of *binder form* labels, as follows. If we start with any call expression in the program and start moving up through the AST, we will ascend through a sequence of IF terms until we eventually arrive at a FUN, CONT or LETREC -- our three binding forms. These binding forms *are* labelled. So we'll use these. Let's call the innermost binding form that contains a call C its "parent binder." That's the view looking up the AST. Another way to put this is the downward view: the CPS grammar says that the body of a binding form is a binary tree of IF's whose leaves are either (1) LETREC forms or (2) calls. (That tree could, of course, be a simple leaf, no IF.) (Note that if we'd gone ahead and represented conditionals using calls to primops that take multiple continuations, we could say the above more simply: the *immediate* parent of every call expression is either a lambda or a LETREC. We wouldn't have to deal with the tree-of-ifs structure.) First, write a pass that will add a (@ ... (parent-binder