CS 6983 Special Topics: The Technology of Lambda                -*- Outline -*-
Olin Shivers
Homework 5: Closure conversion part II: transform
Due Monday 2025/4/15 midnight by email

We've marked up the CPS program with all the annotations we need to transform
it to closure-converted form. Again, after closure conversion, all variables
can be viewed as if they were registers (on a machine with an arbitrary
number of registers).

Our target language is no longer lambda calculus of any kind. The target
program executes on a machine that has
- a program counter,
- a stack, which we can think of as a sequence of frames,
- a heap where we can allocate closure tuples for first-class function
  values, and
- an arbitrary number of registers.

The target language is called AM, for "abstract machine." It is a kind of SSA
form; we can regard variables as registers, and every variable has only one
lexical definition. The language has two name spaces: code labels and
var/register names.

The language has been carefully defined to hide word size and be independent
of what direction the stack grows -- from high memory downwards, or from low
memory upwards.

-------------------------------------------------------------------------------
* Grammar of AM
---------------

;;; An INSN is a straight-line computation.
;;; An INSN+ terminates in a control transfer somewhere.
;;; As each kind of code includes the ability to call a function
;;; in the middle of the computation, non-termination is also
;;; a possibility in both cases.
;;;
;;; This is essentially SSA: each variable is assigned/bound
;;; at only one spot (bound across a jump, or by an := statement).

prog ::= insn+

insn-seq  ::= (insn ...)       ; Possibly empty
insn-seq+ ::= (insn ... insn+) ; Never empty

insn ::= (BLOCK . insn-seq)       ; Compound instruction
       | (LABEL (label code) ...) ; Locally introduce code blocks
       | (:= var rhs)             ; Bind var / assign reg
       | (FRAME-STORE n var)      ; Store VAR's value into frame slot #n

insn+ ::= (BLOCK . insn-seq+)         ; Compound instruction
        | (IF var insn+ insn+)
        | (TAIL target arg ... retpc) ; Tail call: pop frame & jump to proc
        | (CALL target arg ... retpc) ; Simple call: jump to proc, no pop
        | (RETURN target arg ...)     ; Return: pop frame & jump to cont
        | (GO target arg ...)         ; Local jump, no frame-pop

retpc, target ::= var | labref ; Can jump through reg or to labelled location

labref ::= (% label)           ; Address of labelled code

code ::= (CODE (var ...) . insn-seq+)        ; Continuation
       | (PROC (var ...) kvar n . insn-seq+) ; Procedure; frame size N.

rhs ::= (prim arg ...)        ; Note: no continuation arg
      | (CALL target arg ...) ; Note: no continuation arg
      | (CLOSURE arg ...)     ; Allocate closure tuple on heap
      | memref                ; Load from stack frame or a closure tuple

arg ::= labref
      | const          ; A constant value
      | loc            ; A value source: reg ref or memory load
      | (OFFSET loc n) ; For making proc values from closure tuples

loc ::= var | memref   ; The location of a previously computed value

memref ::= (FRAME-REF n)     ; Load slot #n off the stack frame
         | (TUPLE-REF var n) ; Load slot #n off the closure tuple

-------------------------------------------------------------------------------
* AM discussion and rationale
-----------------------------

Label references are marked with the (% label) form, to distinguish them from
variable/register references.

** INSN's and INSN+'s
=====================

INSN forms include the declaration of labelled code blocks. We permit these
declarations anywhere to make it easy for tree-walking code to add these as
locally needed.
INSN forms do not include IF conditionals, to ensure that all join points in
the code are explicitly labelled (and take parameters).

The compiled body of a lambda is an INSN+, which is essentially a binary tree
of IF's with control transfers at the leaves. There are four kinds of control
transfers:

- (CALL target arg ... retpc)
  This jumps to TARGET (which can be a register or a fixed label), which
  should give the address of a PROC form. The call passes the given arguments
  plus the return-address RETPC. This form is used to make a non-tail call --
  it returns to a shared join point (RETPC) that will run using the caller's
  stack frame, so it doesn't pop the current stack frame.

  Note that most non-tail calls are *not* made this way, but are rather made
  on the right-hand side of a (:= var (CALL ...)) INSN. This INSN+ jump is
  used only when the return point is a shared join point (e.g., locally bound
  by the caller so it can be used in both arms of an IF). See a following
  section for further discussion.

- (TAIL target arg ... retpc)
  This jumps to TARGET (which can be a register or a fixed label), which
  should give the address of a PROC form. The call passes the given arguments
  plus the return-address RETPC. We use this form when we want a function to
  terminate in a tail call, so the current frame of the stack is popped as we
  jump to TARGET.

- (RETURN target arg ...)
  This jumps to TARGET, which should be the return address of some call. We
  use this form when we want a function to terminate by returning to its
  caller, so the current frame of the stack is popped as we jump to TARGET.

- (GO target arg ...)
  This jumps to TARGET, which should be the address of code local to the
  current function -- that is, the target of the jump has been compiled to
  run with the same stack frame as the source of the jump. So the stack is
  *not* popped.

Here's the table of the four INSN+ jump instructions, which are distinguished
by two properties: where we are jumping (frame-allocating PROC, or just a
CODE form that runs on a pre-existing stack frame), and whether or not we pop
the caller's frame as part of the jump:

                   stack op
                 POP     NO-POP
               +--------+------+
  jump    PROC | TAIL   | CALL |
  target  CONT | RETURN | GO   |
               +--------+------+

We write a procedure body as

    (PROC (var ...) kvar n . insn-seq+)

This stands for code that
1. Binds/loads the VAR variables/registers with the values being passed from
   the caller.
2. Binds/loads variable/register KVAR with the return address being passed
   from the caller.
3. Adds an N-word frame onto the tip of the stack.
4. Runs the code in INSN-SEQ+

A CODE form is similar, but
- no return address is passed from the source of the jump, and there is no
  corresponding KVAR to be bound, and
- no new stack frame is pushed on entry.

Named CODE blocks are used for join points in intra-procedural control-flow
graphs; the CODE's parameter list provides the equivalent of phi functions in
an SSA representation. Just as PROC forms arise from (some) user lambdas in
the CPS, CODE blocks arise from (some) continuation lambdas in the CPS
source.

A PROC form can be called from three places in the grammar:
- A (TAIL target arg ... retpc) call performed at the terminal leaf of some
  INSN+ IF-tree.
- A (CALL target arg ... retpc) call, likewise performed at the terminal leaf
  of some INSN+ IF-tree.
- A function call done on the right-hand side of an := binding/assignment
  form, e.g.,

      (:= y-small? (call < y 5))

  Note that in this case, the RHS CALL form does *not* provide a continuation
  argument.
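To make the join-point machinery concrete, here is a small sketch (not part
of the assignment's examples, and assuming a $< comparison primop) of a PROC
whose two IF arms merge at a labelled CODE block:

    (proc (x) k 0                 ; Compute |x| + 1 and return it.
      (label (join (code (y k1)   ; Explicit join point: its parameters
               (:= y+1 ($+ y 1))  ; play the role of SSA phi functions.
               (return k1 y+1))))
      (:= b ($< x 0))             ; Assumed comparison primop.
      (if b
          (block
            (:= nx ($- 0 x))
            (go (% join) nx k))   ; Both arms GO to the shared join,
          (go (% join) x k)))     ; passing the joined value and K.

Because JOIN is a CODE block, it starts an empty variable scope, so K must be
passed to it explicitly as a parameter.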
A register/variable V is assigned / bound to values in one of two ways:

- As the result of a (:= v rhs) insn. There are five kinds of right-hand
  side:
  1. A primop application, e.g., (:= x ($* y z))
  2. A non-tail call to a procedure, e.g., (:= x (call leap-year? y))
  3. The allocation of a closure tuple, e.g.,

         (:= clo (closure (% lam87) ; Slot 0 is code pointer
                          n         ; Slot 1 is free-var value
                          a))       ; Slot 2 is free-var value

  4. A load from the stack frame, e.g., (:= x (frame-ref 5))
     The frame is indexed starting from 0.
  5. A load from a closure tuple, e.g., (:= x (tuple-ref clo 3))
     The tuple is indexed starting at 0.

- As the result of a control jump: all control jumps pass values to the
  registers/variables named at the target CODE or PROC form. (For a
  comparison, this serves the purpose provided by the phi functions of SSA,
  at code join points.) The TAIL/CALL/RETURN/GO form doing the control
  transfer can pass six kinds of arguments to the variables/registers waiting
  to be bound/assigned at the jump target:
  1. A code label, e.g., (% lam87)
  2. A constant, e.g., 5.
  3. A variable/register, e.g., x.
  4. The address of a specific slot in a memory tuple, e.g., (offset clo 2)
     If variable CLO's value is the address of some tuple, then the
     (offset clo 2) form produces that address plus a 2-word offset, that is,
     the address of slot #2 in the tuple. This is used to convert the address
     of a closure tuple into a procedure, when we have multiple procedures
     sharing a single closure record.
  5. A stack-frame load, e.g., (FRAME-REF 3)
  6. A closure-tuple load, e.g., (TUPLE-REF clo 2)

Arguments passed across control jumps are restricted to be values already
computed; actual computation happens in := forms. We include memory loads as
legal arguments to set us up to handle things if we ever extend the compiler
to handle register spilling. E.g., if we perform a function call with twenty
arguments on a processor with sixteen registers, we have no chance of getting
all the arguments into physical registers before the call. We also do this
for arguments to the closure-allocating form, for similar reasons. In
general, any atomic computational form in AM that could take an arbitrary
number of inputs is defined so that the inputs can come from either variables
(registers) or memory.

When a program has a LETREC that binds multiple variables to closure lambdas,
we arrange for all of these procedures to share one closure tuple. For
example, if we have a pair of mutually recursive F and G functions

    (letrec ((f (fun (n1 k1) ... g ... x ...))  ; F & G must be closed.
             (g (fun (n2 k2) ... f ... y ...))) ; Closure includes X & Y.
      (h f g ktop))                             ; Export F & G closures to H.

then the corresponding AM code makes only *one*, shared 4-slot closure tuple:

    (:= clo (closure (% f-lam) ; <-- Procedure F points here
                     (% g-lam) ; <-- Procedure G points here
                     x
                     y))
    ;; Procedure F is (offset clo 0) = CLO
    ;; Procedure G is (offset clo 1)

Now the F-LAM code accesses X from its closure parameter CF as

    (tuple-ref cf 2)

while the G-LAM code accesses X from its closure parameter CG as

    (tuple-ref cg 1)

because CG already points one slot off from the beginning of its containing
tuple. (A fuller worked sketch of F-LAM and G-LAM appears below.)

Stack-frame slots are written with the FRAME-STORE insn, e.g.,

    (frame-store 7 X)

stores variable/register X into slot #7 of the current stack frame.

INSN terms permit labels to be declared for PROC and CODE code blocks, using
the LABEL form. These labels obey lexical scope.
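To see this offset arithmetic worked through end to end, here is a hedged
sketch of F-LAM and G-LAM, assuming (purely for illustration) that F's body
computes (g ($+ n1 x)) and G's body computes ($* n2 y):

    (label (f-lam (proc (cf n1) k1 0
             (:= x1 ($+ n1 (tuple-ref cf 2))) ; X is slot 2 relative to CF.
             (:= g-code (tuple-ref cf 1))     ; G's code ptr: slot 1 from CF.
             (tail g-code (offset cf 1)       ; G's closure arg is CF plus
                   x1 k1)))                   ; one slot.
           (g-lam (proc (cg n2) k2 0
             (:= p ($* n2 (tuple-ref cg 2)))  ; Y is absolute slot 3, which
             (return k2 p))))                 ; is slot 2 relative to CG.

Note how G, when called from F, receives (offset cf 1) as its closure
parameter, so all of its tuple indices are shifted down by one relative to
F's.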
In AM, primops no longer take continuation arguments, as they are now invoked
on the right-hand side of := definitions. So if our source CPS term is

    ($+ x y (cont (z) ...following code in scope of x, y & z...))

in AM we'd say

    (block (:= z ($+ x y))
           ...following code in scope of x, y & z...)

Note that in CPS, the continuation parameter K in a user FUN form

    (fun (x y k) ...)

is bound to a continuation value -- a closure tuple packaging up some code
with some environment. But in AM, the "continuation parameter" RET of PROC
form

    (proc (x y) ret 3 ...)

is bound to *just* a code pointer, the return address of the caller. The rest
of the continuation (the saved variable bindings that the code at RET will
use) is the stack frame at the tip of the stack on entry to the PROC. When
the PROC's body returns by jumping to RET, it will ensure that the stack is
restored back to its state when the caller jumped to the PROC form (*before*
the PROC allocated *its* three-slot frame), so the stack's tip frame when we
jump (return) to RET will be the same as when we originally jumped to the
PROC. (The necessary stack-frame popping is done by the TAIL and RETURN INSN+
forms.)

The := assignment/binding form only defines a single variable. We could have
a multi-var binder (:= (var ...) rhs) form (after all, continuations can be
called on multiple arguments), but we won't bother for this basic compiler,
as our source language doesn't have a multiple-value return primop like
Scheme's VALUES function. It would be a simple extension to add, however.
(Feel free to add it, if you like.)

** Scope
========

AM is lexically scoped, with a scoping principle tuned specifically for its
use as a low-level, pre-register-allocated IR.

A code label is visible from its point of declaration inside a PROC or CODE
block down to the end of the code block. (Unless shadowed by a following
definition of a different label using the same name.) Additionally, a label
declaration has circular scope. For example, a LABEL declaration

    (label (f (code (w x) ... (% f) ... (% g) ...))  ; Labels F & G
           (g (code (y z) ... (% f) ... (% g) ...))) ; visible in here.
    ...Labels F & G visible from here forward to end
       of containing PROC or CODE form...

makes labels F and G visible
- in the code of the innermost containing PROC or CODE form, from the point
  of the LABEL form to the end of the code.
- inside the two declared code blocks that are bound to F and G.

The circular-scope property is why the LABEL form permits you to bind more
than one label: so that you can declare a group of mutually referential code
blocks, together.

Variable scoping works as follows. Every PROC and CODE block starts a
completely empty variable scope (which means that it doesn't matter where in
the program it occurs, in terms of variables), which is then immediately
augmented by the parameters of the code block. A (:= var rhs) form defines
VAR, which is visible to the end of the code block -- unless shadowed by a
subsequent definition of a new variable with the same name. If we think of a
PROC or CODE block as being a tree of IF's with TAIL/CALL/RETURN/GO jumps at
the leaves of the tree, then a := form binds a variable at some internal node
of the IF-tree, which is then visible to the subtree rooted at the := form.

Additionally, a (:= var (CALL ...)) INSN resets the scope of following code
to *just* VAR. This is odd and worth emphasizing: a subroutine call kills all
prior register bindings. We are modelling the behavior of *registers* with
our variables.
For example:

    (block
      (:= x ($+ y 7))
      (:= z (call (% reverse) numlist))
      ;; Here, we cannot reference X or Y. Only Z.
      ...)

In short, variables are always local to (a portion of) the PROC or CODE block
in which they appear; we abandon all these local variables when execution
arrives at some leaf of the IF-tree and jumps to some other labelled code
block; all live variables must be passed to that code block as parameters of
the TAIL/CALL/RETURN/GO jump.

All this parameter-passing is not expensive because we haven't done actual
register allocation yet. When we do, we'll coalesce register assignments
across calls, so the variable/variable copies that happen across
TAIL/CALL/RETURN/GO jumps will mostly go away (that is, they are handled
statically, by the compiler).

Note that one way to look at what we've done is to say that we've made
liveness analysis for variables a simple question of local scope.

BLOCK forms are not scope boundaries. For example, this is a well-formed
procedure body:

    (proc (a b) ret-pc 0
      (block (:= c ($+ a b))) ; The C defined in this block
      (return ret-pc c))      ; is visible after/outside the block.

So you can use BLOCK forms promiscuously anywhere you want to package up a
sequence of INSN's into a single compound INSN, or turn some INSN's and one
INSN+ into a single compound INSN+.

If you give these scoping principles a little thought, you'll see:

- You can easily alphatise labels to ensure all labels in a program have
  distinct names. It's just lambda-calculus alphatisation, which you've
  already done, and which isn't very hard.

- You can merge two adjacent LABEL declarations, although you might have to
  alpha-rename one set of labels if they overlap.

- You can always pull LABEL bindings up the code tree. This works because
  variable scope is always local to its containing CODE or PROC block, so we
  can move these code blocks around without messing up variable scoping.

This all means that, if you first alphatised your program's labels, you could
simply yank all labelled code blocks from their declaration inside the code
tree, up to top level. Your program would then look like

    (BLOCK (LABEL (lab1 code1) (lab2 code2) ...) . insn-seq+)

where the INSN-SEQ+ would have no internal LABEL declarations at all. That
is, your program would be (1) a set of labelled code blocks and (2) a "main"
block to execute in the context of those labels.

Alternatively, you could implement a pass to split LABEL binding sets into
connected components, then push each binding set down the code tree as far as
possible, until we reach
- an IF split where both arms contain free references to a label in the
  binding set, or
- a primitive INSN or INSN+ that directly references a label in the binding
  set.

This will make an AM program that resembles the kind of control-flow graph
built by compilers for first-order languages such as C or Fortran. Local
loops, recursions and join points will show up as local, LABELed code blocks.
You might consider the relationship of such an AST to the dominator tree for
the program -- and what this might say about how to find loops for loop
optimisations.

This is all sensible and rigorous, but in practical terms you can use labels
and variables in a simpler regime: just ensure you always use fresh
identifiers and preserve alpha-uniqueness, and the issues of name capture
become irrelevant.

So, there are two ways you can proceed, in terms of name management:

- Operate in an alpha-unique way, worry free.
- Reuse names with abandon, exploiting the scoping rules, and be prepared to
  alpha-rename labels if you do transforms that move code blocks around.

Are AM variables assigned (by side effect) or bound (as in a functional
language)? The answer is: yes. That is to say: it doesn't matter. Think of it
whichever way you prefer.

** Other
========

Why no (IF arg insn1 insn2) form in the INSN grammar? That would introduce
implicit control joins at the end of INSN1 and INSN2 -- and all joins in AM
are explicit jumps that pass values.

Some compiler people refer to a tree of IF's where control only enters at the
root of the tree as an "extended basic block" (EBB). AM's INSN+ non-terminal
is similar, except that we permit arbitrary function calls inside the tree,
so an INSN+ is not atomic and not guaranteed to terminate. (True basic blocks
and extended basic blocks are an all-or-nothing kind of
"macro-instruction.")

-------------------------------------------------------------------------------
* Examples
----------

Here are some examples of code written in our target language. I have written
these in a fully alphatised way to make the code easier to read. Note also
that I use variable names like N-1 and A*I when I think that will make the
code easier to read.

;;; Recursive factorial
(block
  (label (fact (proc (n) k 2                 ; 2 means 2-slot frame
           (:= b ($zero? n))
           (if b
               (return k 1)
               (block
                 (frame-store 0 n)           ; Save N on frame
                 (frame-store 1 k)           ; Save ret pc
                 (:= n-1 ($- n 1))
                 (:= a1 (call (% fact) n-1)) ; a1 = (fact n-1)
                 (:= n2 (frame-ref 0))       ; Load N from frame
                 (:= a2 ($* a1 n2))          ; a2 = n * (fact n-1)
                 (:= k2 (frame-ref 1))       ; Load K from frame
                 (return k2 a2))))))         ; n! => K
  (:= m ($read))
  (:= ans (call (% fact) m))                 ; Leave answer in ANS.
  (halt))

;;; Iterative factorial, iterates by tail-calling LP.
(block
  (label (fact (proc (n) k 0                 ; 0 means empty frame
           ;; (LP I A K1) => (K1 (* I! A))
           (label (lp (proc (i a) k1 0
                    (:= b ($zero? i))
                    (if b
                        (return k1 a)
                        (block
                          (:= a*i ($* a i))
                          (:= i-1 ($- i 1))
                          (tail (% lp) i-1 a*i k1))))))
           (tail (% lp) n 1 k))))
  (:= m ($read))
  (:= ans (call (% fact) m))                 ; Leave answer in ANS.
  (halt))

;;; Iterative factorial after contifying the loop. This is
;;; better, but your compiler doesn't know how to do this
;;; transform.
(block
  (label (fact (proc (n) k 0                 ; 0 means empty frame
           ;; (LP I A K1) => (K1 (* I! A))
           (label (lp (code (i a k1)         ; Loop is now CODE.
                    (:= b ($zero? i))
                    (if b
                        (return k1 a)
                        (block
                          (:= a*i ($* a i))
                          (:= i-1 ($- i 1))
                          (go (% lp) i-1 a*i k1))))))
           (go (% lp) n 1 k))))
  (:= m ($read))
  (:= ans (call (% fact) m))                 ; Leave answer in ANS.
  (halt))

;;; Function composition. This Scheme code
;;;
;;; (let ((compose (lambda (f g)
;;;                  (lambda (x) (f (g x))))))
;;;   ... (compose fun1 fun2) ... (compose fun3 fun4) ...)
;;;
;;; compiles to the following, assuming COMPOSE is used in first-order
;;; ways. This is an example that is heavy on closure creation & unpacking.
(block
  (label (lam21 (proc (clo1 x) k1 2          ; LAM21 is (lambda (x) ...)
           (:= f1 (tuple-ref clo1 1))        ; Fetch F from closure
           (:= g1 (tuple-ref clo1 2))        ; Fetch G from closure
           (frame-store 0 f1)                ; Save F on stack
           (frame-store 1 k1)                ; Save K1 on stack
           (:= g1-code (tuple-ref g1 0))     ; Call (G X) with
           (:= tmp (call g1-code g1 x))      ; closure protocol
           (:= f2 (frame-ref 0))             ; Load F off stack
           (:= k2 (frame-ref 1))             ; Load return address
           (:= f2-code (tuple-ref f2 0))     ; Tail call (F TMP) with
           (tail f2-code f2 tmp k2)))        ; closure protocol
         (compose (proc (f g) k3 0
           (:= clo2 (closure (% lam21) f g))
           (return k3 clo2))))

  ... body of the LET, using COMPOSE, goes here ...)
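;;; One control transfer the examples above don't show is the CALL INSN+
;;; from the jump table: a non-tail call whose return point is a join
;;; shared by both arms of an IF. A hedged, illustrative sketch, assuming
;;; a one-argument, first-order F (code label F-CODE) with no extended
;;; variables: return (f 1) if X is zero, else (f x).
(proc (x) k 1
  (frame-store 0 k)                 ; K is live across the calls; save it.
  (label (ret (code (r)             ; Shared return point: both calls land
           (:= k1 (frame-ref 0))    ; here, running on the caller's frame.
           (return k1 r))))
  (:= b ($zero? x))
  (if b
      (call (% f-code) 1 (% ret))   ; Both arms return to RET, so we use
      (call (% f-code) x (% ret)))) ; the CALL INSN+, which doesn't pop.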
-------------------------------------------------------------------------------
* From CPS to AM
----------------

** Call arguments
=================

In CPS, a call argument can be a constant, a variable, or a lambda. Constants
and variables are reasonably straightforward.

If an argument is a FUN expression being used as a value (that is, not a
first-order or open lambda), however, we must first create a closure tuple
packaging up the lambda's code with the values of its free variables in a
little prelude preceding the call; then we can do the call using a register
whose value is this tuple as the actual argument. So if we had the CPS call

    (f (fun (x k1) (@ (label lam38)) ...) k2)

where F is a first-order function (whose code label is F-CODE and which, for
simplicity, needs no extended arguments), we'd produce this AM INSN+:

    (block
      (label (lam38 (proc (clo38 x) k1 ...)))
      (:= proc38 (closure (% lam38) q y z))
      (tail (% f-code) proc38 k2))

where LAM38 is the label of the (FUN (X K1) ...) code, CLO38 is the extra
argument for the closure tuple that will be passed to this code when it is
called, and PROC38 is the variable/register holding the closure when we
allocate it. The closure's first slot is the code pointer (% LAM38), and its
next three slots hold the values of free variables Q, Y and Z. We produce the
PROC form for the lambda by recursively compiling the CPS lambda.

If an argument is a closed CONT form, e.g.,

    (f x (cont (y) ...body...))

then we produce an AM := definition for Y, and carry on compiling the body of
the CONT form in following code:

    (block
      (:= y (call f x))              ; Do the call.
      ...compiled body goes here...) ; Do this after the call.

But wait! This is not enough. While F is running, before it returns to our
call site, it might recursively run *this* code, thus clobbering our local
registers, including any that are live across the (:= y (call f x)) call.
This is why they are not in scope in the code following the call. So we have
to
- first save onto the stack any variables needed by the ...BODY... that will
  run when the F call returns, and
- then make sure we compile the ...BODY... code with a symbol table
  indicating that these CPS variables correspond to AM stack-slot references,
  *not* AM variables.

We know which variables need to be saved on the stack before the F call --
that is given by the FRAME+ annotation on the CONT form. So our CPS example
above really looks like this

    (f x (cont (y) (@ (frame+ (a 3) (q 7))) ...body...))

and it is compiled to this AM code

    (block
      (frame-store 3 a')  ; Save A at slot 3 of stack frame
      (frame-store 7 q')  ; Save Q at slot 7 of stack frame
      (:= y (call f x))   ; Call F.
      ...compiled body    ; Return here. A Q ref from here
         goes here...)    ; forward is (FRAME-REF 7), not Q'.

In AM, a "call argument" is one of the expressions that appear in
TAIL/CALL/RETURN/GO control jumps, which are passed to the target PROC or
CODE block and bound to its initial parameters. These call arguments are
intended to be things already computed; the point of passing them is to give
these things new names. Here are the kinds of things we can use as arguments
in an AM call:

- Constants: either a label reference such as (% fact-code), or a simple
  constant such as 17.
- A reference to some data source: a register/variable or a memory location,
  e.g., X, (FRAME-REF 3) or (TUPLE-REF CLO 7).
- A constant offset from some address fetched from a register or memory
  location, e.g., (OFFSET CLO 2), which means the address of the tuple that
  is the value of variable CLO, plus 2 words.
The last case is how we convert a closure tuple allocated by a CPS LETREC
into the different procedures that share that tuple. The offset produces the
address of the slot in the tuple where the desired procedure's code pointer
lives. So, if we had the CPS call to first-order function F

    (letrec ((g (fun ... a ...))   ; G closes over A
             (h (fun ... b ...)))  ; H closes over B
      (f 3
         g
         (fun (x k1) ($+ x a k1))  ; Closes over A, too.
         h
         k2))

we'd produce this AM code (writing F-CODE for F's code label, and assuming F
needs no extended arguments):

    (block
      (label (g-code (proc ...))   ; Compiled code for (fun ... a ...)
             (h-code (proc ...)))  ; Compiled code for (fun ... b ...)
      ;; Make the shared closure for G & H.
      (:= gh-clo (closure (% g-code) ; Slot 0: G code pointer
                          (% h-code) ; Slot 1: H code pointer
                          a          ; Slot 2: A
                          b))        ; Slot 3: B
      ;; Now do the F call. First we have to close arg #3.
      (label (lam102 (proc (clo1 x) k1 0
               (:= tmp ($+ x (tuple-ref clo1 1)))
               (return k1 tmp))))
      (:= clo (closure (% lam102) a)) ; Close over A
      (tail (% f-code)
            3
            (offset gh-clo 0)  ; G (But, really, just use GH-CLO)
            clo                ; The (fun (x k1) ...) closure
            (offset gh-clo 1)  ; H
            k2))               ; Tail call -- F will return to K2.

** To compile a call
====================

There are multiple kinds of call. First, bear in mind that some CPS arguments
will require producing pre-call code to heap-allocate closures or save a
continuation's free vars on the stack. In the case of a FUN user lambda,
you'll also have to produce a LABEL declaration for the FUN's compiled code,
as described above.

OK, onward to the actual calls. There are seven cases: we can call six
different kinds of lambda

    {open, first-order, closed} x {user lambda, continuation}

or we can call a primop. Each case is handled differently.

- Open continuation call / beta-redex: ((cont (x) ...body...) arg)

  Here, we know that ARG must be a (FUN ...) form, not a constant or a
  variable, and that there is more than one reference to X in the body of the
  continuation (thanks to your simplifier).

  + If ARG is a first-order FUN
    In this case, compile ARG into a PROC form and name it with its label in
    a LABEL form. Compile the body of the continuation with an extended
    symbol table marking X as a first-order variable:

        (block
          (label (lab102 (proc ...arg's code goes here...)))
          ...compiled cont body here...) ; Calls to X will use (% lab102).

    Don't forget to add extra parameters to the PROC for ARG's extended free
    variables, which will be passed to the PROC from all calls to X.

  + If ARG is a closed FUN
    In this case, make a closure for ARG and assign it to a fresh
    variable/register.

        (block
          (label (lab102 (proc ...arg's code goes here...)))
          (:= clo (closure (% lab102) fv1 ...)))

    This code is followed by the code for the body of the continuation,
    translated using a symbol table that maps CPS variable X to AM variable
    CLO.

- Open user-lambda call / beta-redex: ((fun (x k) ...body...) arg carg)

  This is handled the same way as above. [Fixme: Except for the impact on
  TAIL vs. CALL jumps. This is to be written up later.]

- First-order user-procedure call: (f arg ... carg)

  + Tail call: (f x 37 k)
    A tail call is one where the continuation arg is a variable. Since tail
    calls don't return to the caller, this is the end of the local code
    sequence -- we're going to F, who will return to K. We're out of the
    picture. Thus, we're producing a leaf in the INSN+ IF-tree.

    This produces a TAIL INSN+ to F's label. Since this is a first-order
    call, we must pass F values for its extended variables, too. Note that
    when we look up CPS variable K in the symbol table, we'll get an AM
    argument that represents the return address.
    The rest of K's closure is on the stack frame under the current one; the
    TAIL form is responsible for popping the current frame so that K's frame
    will be at the tip of the stack when we jump to the continuation's code.

    Put all this together, and we might get this:

        (tail (% f-lab) x' 37 a z q k')

    where F-LAB is F's code label; X' is the AM variable we're using for X;
    A, Z & Q are extended free variables for F, and K' is the AM variable for
    K.

    How do we know that F is a first-order variable, that its AM code label
    is F-LAB, and which CPS source variables are in its EXTENDED set, so that
    we can pass their corresponding AM variables along in the call? All this
    information is stored in the symbol table as the entry for CPS variable
    F.

  + Non-tail call:

        (f x 37                       ; Call F
           (cont (y)                  ; Return here.
             (@ (frame+ (b 3) (w 7)))
             ... b ... w ...))

    This turns into a block:

        (block
          (frame-store 3 b')                   ; Close the continuation as
          (frame-store 7 w')                   ; directed by FRAME+ annotation.
          (:= y' (call (% f-lab) x' 37 a z q)) ; Do the call
          ...                                  ; Cont's code goes here
          )

    Here, we first save the continuation's variable needs on the stack, then
    do the call: B' and W' are the AM variables for the B and W CPS
    variables. Since this is a first-order call, we must pass F's extended
    free variable needs as extra parameters A, Z and Q. No return address is
    specified in the CALL form; it is implicitly the following code.

    When we compile the body of the (cont (y) ...) continuation to make the
    code following the F call, we have to use a symbol table updated to map
    B & W to stack slots (frame-ref 3) and (frame-ref 7), instead of AM
    variables B' and W'.

- First-order continuation call: (k arg ...)

  This turns into a GO to the label of K's associated CONT form. Don't forget
  to pass along all of K's extended variables as specified by its EXTENDED
  annotation:

      (go (% k-label) arg1' ... xv1 ...) ; XVi are the extended free vars

- Primop call: ($+ x 3 (cont (z) ...)) or ($+ x 3 k)

  In the first case, where the primop's continuation is an explicit CONT
  form, this turns into a (:= z' ($+ x' 3)) INSN followed by the code for the
  body of the continuation produced with an updated symbol table, where Z'
  and X' are the AM variables being used for the CPS variables Z and X.

  The second case is like the first, but we assign a fresh temporary
  variable, which is then used in an immediately following RETURN or GO to K.
  (K might be a closure or a first-order continuation -- to call one of
  these, see other entries in this bullet list.) If we assume that K is not a
  first-order variable, the code would look like this:

      (block
        (:= tmp ($+ x' 3))
        (return k' tmp))

  If K is a first-order variable, the code would be

      (block
        (:= tmp ($+ x' 3))
        (go (% k-label) tmp))

  where K-LABEL is the code label for K.

- Closed user-procedure call: (f arg ... carg)

  In this case, F is a pointer into a closure tuple. The AM code should
  extract the code address at offset 0 from the pointer and call it, passing
  the tuple as an extra, first argument. Here's a tail call (that is, one
  where CARG is a CPS variable) to the example procedure above:

      (block
        (:= f-code (tuple-ref f' 0))
        (tail f-code f' arg' ... carg'))

  where F', ARG' and CARG' are the AM translations of F and the call
  arguments.

  Alternately, the F call is non-tail, which means that CARG is some CONT
  form, (cont (x) ...).
  In this case, the call compiles into some stack saves, as indicated by the
  CONT's FRAME+ annotation, followed by an (:= x (call f-code f' arg' ...))
  INSN, followed by the AM code for the body of the continuation, using a
  symbol table mapping those saved variables to frame loads.

- Closed continuation call: (k arg ...)

  This is a no-kidding procedure return. In the corresponding AM code, the
  continuation variable will hold the return address. The AM code will be a
  RETURN INSN+:

      (return k' arg' ...)

  where K' and ARG' are the AM expressions corresponding to K and ARG. Of
  course, as usual, if any args are closed FUN forms, you'll first have to
  produce a chunk of AM code to bind labels to their compiled code and then
  allocate closures.

** Symbol table
===============

As the CPS->AM compiler recursively walks the CPS tree translating it to AM,
the translation is critically mediated by
1. the annotations your previous analysis phase sprinkled over the code, and
2. the symbol table passed down through the tree walk.

The symbol table maps in-scope CPS source variables to information needed
when we encounter references to them.

- A first-order variable's entry gives its call protocol -- that is, the
  ordered list of extended-free CPS variables that must be passed to it when
  it is called, and the label of the AM PROC form to call.

- A regular variable has these bits:

  + How to refer to it when including it in a closure record
    This is an AM ARG (see the AM grammar): an AM var, a FRAME-REF or a
    TUPLE-REF memory-load expression. It will never be a constant, a
    (% label) label reference, or an OFFSET form. Why never an OFFSET form?
    Because you'll add the OFFSET when you use the closure tuple as a
    procedure value (that is, as an arg).

  + How to refer to it as an argument in a call (or as the function being
    called)
    Again, this is an AM ARG, but if it is describing a LETREC-bound
    procedure that is part of a shared closure tuple, the AM term will
    include the specific offset needed for this particular procedure.

As an example, suppose we have three closed functions bound by a CPS LETREC
form to variables F, G and H

    (letrec ((f (fun ...))
             (g (fun ...))
             (h (fun ...)))
      ...)

The AM code allocates a single closure tuple

    (:= fgh-clo (closure (% f-code) (% g-code) (% h-code)
                         free-var1 free-var2 ...))

The symbol table that we use to compile the body of the LETREC says that CPS
variables F, G, H all map to FGH-CLO *when we want to close code over F, G,
or H* or *when we want to pass F, G, or H as an extended free var to a
first-order procedure.* But! If a function has to close over both F *and* G,
we'll just include the FGH-CLO in its closure tuple *once*. This is because
in the first case (including a variable in a closure record), we're passing
around *environment structure* -- the *bindings* of F, G and H. In the second
case (using the variable as a call argument), we want the *values* of F, G
and H that can be passed around the program (consed into lists, stored in
hash tables, passed as arguments, returned as values and, of course,
eventually called). In this case, we have to take the binding information and
convert it into a value: add the right offset to the shared closure tuple.

Because of this, your symbol table for general, non-first-order variables
needs to provide two entries.

When are [Var |--> FRAME-REF] entries added to the symbol table? Answer: when
we compile code for a closed CONT form, as directed by the CONT's FRAME+
annotation.

When are [Var |--> TUPLE-REF] entries added to the symbol table? Answer: when
we compile code for a closed FUN form, as directed by the FUN's EXTENDED
annotation.
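Here is one way you might represent such a table; a minimal Scheme sketch,
assuming an association-list representation (the entry shapes and procedure
names are illustrative, not required):

    ;; First-order entry: the call protocol -- the code label of the AM
    ;; PROC, and the ordered extended free variables passed at every call.
    (define (bind-first-order table v label extended)
      (cons (list v 'first-order label extended) table))

    ;; Regular entry: two AM ARGs -- how to refer to V when including it
    ;; in a closure record, and how to refer to it as a call argument
    ;; (the latter carries the (offset ...) for LETREC-shared closures).
    (define (bind-var table v closure-ref call-ref)
      (cons (list v 'var closure-ref call-ref) table))

    (define (lookup table v)
      (cond ((assq v table) => cdr)
            (else (error "unbound CPS variable" v))))

For the F/G/H example above, F's entry would pair FGH-CLO (the environment
structure) with (offset fgh-clo 0) (the procedure value).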
** To compile a LETREC
======================

We have three pieces to handle:
- a set of first-order bindings
- a set of closure bindings
- the body of the LETREC

1. Compile the RHS lambdas to a set of labelled PROC and CODE blocks. These
   go in a (LABEL ...) binding.

   - To compile a 1st-order FUN:
     Create fresh AM variables for the lambda's parameters *and* all its
     extended free variables -- these will be the parameters of the AM PROC
     form you produce. (But remember... just *one* AM variable for all
     extended free variables that are bound by the same LETREC.) (Including
     this LETREC itself, if it binds any closed lambdas.) Compile the body of
     the lambda with this symbol table.

     The symbol table also needs entries describing the call protocol (code
     label and extended free var set) for each first-order lambda bound by
     this LETREC.

   - To compile a closed FUN:
     Create fresh AM variables for the lambda's parameters, plus one more for
     the extra closure-tuple parameter that the caller will pass. Add entries
     to the symbol table for
     + The CPS lambda's parameters, which map to the corresponding AM
       variables.
     + The lambda's extended free variables, which map to TUPLE-REF
       expressions fetching the values from the closure-tuple parameter.

     Remember that when you fetch a value from the closure tuple CLO that is,
     itself, a closure tuple, CLO', that was created by some other, outer
     LETREC, you'll need to add an OFFSET to the CLO' value to make the
     actual procedure pointer, so it'll look something like

         (offset (tuple-ref clo 3) ; This fetches a shared closure tuple
                 2)                ; This adds the offset needed for a
                                   ; specific procedure using this tuple.

     This offset adjustment is used for variables bound to closures by *this*
     LETREC, too -- just make an offset from the FUN's own closure tuple!

     Don't forget to adjust your offsets to account for the fact that the
     closure-tuple parameter doesn't point to the first word of the closure
     tuple; it points to the slot in the tuple that contains this lambda's
     code pointer.

     First-order variables bound by this LETREC are mapped to their calling
     protocol, as above.

   - To compile a 1st-order or closed CONT:
     Similar to the FUN cases, but a closed CONT fetches closed variables
     from the stack; it doesn't need an extra parameter for the closure
     tuple. Your compiler will never have to handle a CONT form appearing on
     the RHS of a LETREC binding unless you introduce some extra transforms
     ("contification"). So this is optional.

2. Following the (LABEL ...) that provides the code for all the RHS lambdas,
   you want a piece of code that allocates the single, shared closure tuple
   for all the closed RHS FUN's, if you have any:

       (:= clo (closure (% rhs-code1) (% rhs-code2) ... ...))

   The CLOSURE form references code labels from the LABEL form you made in #1
   above.

3. Now compile the body of the LETREC in an appropriately extended symbol
   table.

-------------------------------------------------------------------------------
* BLOCK flattening
------------------

Write a simple transform that flattens nested BLOCK forms, e.g.,

    (block insn1
           (block insn2 insn3)
           (block insn4 insn+))
    =>
    (block insn1 insn2 insn3 insn4 insn+)

You want to flatten the bodies of BLOCK, PROC and CODE forms. This is an easy
pass to implement. If you use the BLOCK feature heavily for packaging up
little snippets of code that then get dropped into other snippets, dumping
all this structure after you've produced your code (or while you're doing so)
might make your AM code much less ugly to read and debug.
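If you'd like a starting point, here is a minimal sketch of the flattening
pass in Scheme, assuming INSNs are represented as s-expressions with
lowercase tags (adjust the accessors to your own representation):

    (define (flatten-seq insns)
      ;; Splice nested BLOCK contents into one flat instruction sequence.
      (apply append
             (map (lambda (i)
                    (if (and (pair? i) (eq? (car i) 'block))
                        (flatten-seq (cdr i))
                        (list (flatten-insn i))))
                  insns)))

    (define (flatten-insn i)
      ;; Recurse into compound forms; leave simple INSNs alone.
      (cond ((not (pair? i)) i)
            ((eq? (car i) 'block)
             (cons 'block (flatten-seq (cdr i))))
            ((eq? (car i) 'if)          ; (if var insn+ insn+)
             (list 'if (cadr i)
                   (flatten-insn (caddr i))
                   (flatten-insn (cadddr i))))
            ((eq? (car i) 'label)       ; (label (lab code) ...)
             (cons 'label
                   (map (lambda (binding)
                          (list (car binding)
                                (flatten-code (cadr binding))))
                        (cdr i))))
            (else i)))

    (define (flatten-code c)
      ;; (code (var ...) . body) and (proc (var ...) kvar n . body)
      (case (car c)
        ((code) `(code ,(cadr c) ,@(flatten-seq (cddr c))))
        ((proc) `(proc ,(cadr c) ,(caddr c) ,(cadddr c)
                       ,@(flatten-seq (cddddr c))))))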
Alternately, you can ensure your INSN-SEQ and INSN-SEQ+ sequences are always
flattened by writing "smart" constructors to flatten them as you create them
-- for example:

- A function (insns->block i1 ...) which makes a BLOCK form out of INSNs I1,
  I2, ..., splicing in the contents of any argument which is itself a BLOCK
  form.
- A function (insns+->block i1 ... i') which does the same with INSNs I1,
  I2, ... and INSN+ I'.
- A function (insn->seq i) which converts an INSN I into a sequence of INSNs:
  if I is a (BLOCK i1 ...), the answer is (list i1 ...); if I is not a BLOCK,
  the answer is (list i).
- A similar function (insn+->seq i), which does the same for an INSN+.

-------------------------------------------------------------------------------
* Looking back, looking forward
-------------------------------

Stop and reflect on where the program is, at this point. You're below the
level of C code. Your program is code, registers and blocks of memory. No
lambda.

What's left to do, to get something that's machine level? Here's the list:

1. Register allocation
   We're set up to do a coloring allocator, coalescing the var/var copies
   that happen at jumps to first-order lambda join points.

1a. Spilling
    We need some way of spilling, when our register allocator needs more
    physical registers than the machine has. There are two ways to do this.
    One, we can simply "create more registers," by setting aside a little
    bank of RAM as extra registers. (This block of statically managed memory
    will certainly reside in the on-chip L1 cache, so we'll get some wins
    here, even if they won't be as fast as the true registers.) Two, we can
    do the standard choose/edit/retry technique, where we choose spill
    candidates, force them out to a spill area on the stack frame, and
    rewrite all uses of the variable so that every use is preceded by a load
    and every write of the variable is followed by a store, giving the
    variable tiny live ranges that don't intersect other variable live
    ranges.

1b. Eliminating duplicate frame and closure loads
    Once a variable is saved onto the stack (because it is live across a
    call), it is reloaded from the stack each time it is needed, over and
    over again. We could eliminate these redundant, repeated loads if we had
    some form of simple "common subexpression elimination" optimisation that
    worked for frame and closure loads, so that

        (block ... (:= x1 (frame-ref 7))
               ... (:= x2 (frame-ref 7)) ...)

    is optimised by killing the X2 definition and replacing all references to
    X2 with X1. Note that this increases register pressure by extending the
    live range of X1, that is, by tying up whatever physical register we
    assign X1 across a larger span of code. So X1 may end up getting spilled
    as a result...

    This can also be done with loads from closure tuples. The frame-slot case
    is complicated slightly because we *reuse* frame slots -- when a slot
    goes dead, it will be reused. In our simple, non-circular tree-walk "flow
    analysis," an assignment to a frame slot "kills" any register copy we
    might be tracking for that slot. This is not a problem for tracking loads
    from closure tuples, as their contents are never altered after being
    allocated.

1c. Callee-saves
    If you look at your produced code, you'll notice that a function call is
    assumed to kill all registers. Any variable live across a function call
    is saved to the stack before the call. This is basically a caller-saves
    management strategy. We can do better if we manage some registers (four
    seems to be the traditional number) using a callee-saves strategy.
    Appel discusses how to do this in *Compiling with Continuations* and also
    in separate papers that he published. It amounts to a simple strategy:
    continuation closures are "spread" across six registers/arguments: the
    return-PC parameter, four others used for the first four elements of the
    closure, and SP, the implicitly passed stack register, which points to
    the rest of the closure. You would alter the PROC form to be

        (PROC (var ...) (ret-pc kv1 kv2 kv3 kv4) frame-size . insn-seq+)

    and the continuation call protocol would be altered to pass KV1 through
    KV4 as the first four arguments to the continuation code:

        (RETURN ret-pc kv1 kv2 kv3 kv4 arg1 ...)

    Now a leaf procedure, like the ADD5 function, looks like this:

        (proc (n) (ret-pc kv1 kv2 kv3 kv4) 0
          (:= n+5 ($+ n 5))
          (return ret-pc kv1 kv2 kv3 kv4 n+5))

    The KVi vars are passed from ADD5's caller through ADD5 and forward to
    the continuation code without ever going onto the stack.

2. Instruction selection for a specific processor
   This is pretty straightforward: the current IR is machine level, just not
   machine specific. You'll have to get specific about the details of your
   function-call protocol and stack management, which have been slightly
   abstracted in our current IR.

3. Contification
   If you write a tight loop in your source Scheme program, e.g., to compute
   factorial iteratively, your loop will be expressed by a LETREC-bound
   function that tail-calls itself. Your current compiler will render such an
   iteration as a tail call to a user procedure. So each time around the
   loop, the current frame will be popped and then re-pushed. It would be
   much better instead to demote the loop lambda from user procedure to
   continuation by arranging for the final return continuation not to be
   passed around and around the loop as a parameter. This is how you get
   continuation lambdas bound by LETRECs. MLton relies on this transform to
   get exactly the wins described above, for example.

   For example, here is iterative factorial before and after contification,
   in CPS:

       (fun (n ktop)
         (letrec ((lp (fun (a i k)
                        ($zero? i (cont (z?)
                          (if z?
                              (k a)
                              ($* a i (cont (a*i)
                                ($- i 1 (cont (i-1)
                                  (lp a*i i-1 k)))))))))))
           (lp 1 n ktop)))

       (fun (n ktop)
         (letrec ((lp (cont (a i)             ; No longer a FUN but a CONT,
                        ($zero? i (cont (z?)  ; so no frame pop & push
                          (if z?              ; on each iteration; we just
                              (ktop a)        ; run on the PROC's (empty)
                              ($* a i         ; frame.
                                 (cont (a*i)
                                   ($- i 1 (cont (i-1)
                                     (lp a*i i-1)))))))))))
           (lp 1 n)))

   You can use simple first-order syntactic criteria or do a full flow
   analysis to spot these opportunities. There's a paper by the MLton people
   explaining how they do it that you should read first.

4. Code linearisation
   Real machine code is a graph embedded into a linear sequence. In our
   current IR, that graph is embedded into a tree. In particular,
   machine-code conditionals either branch or fall through to the following
   instruction, while our current IR uses two-way IF forms. You need some way
   of linearising the code to use machine-code conditionals. Appel's textbook
   (chapter 8) shows a simple way to do this in the context of his Tiger
   compiler. LLVM tries harder to linearise code with better layouts, but
   it's the same problem.

5. Better IF
   Most processors have conditional branches that do more than test the
   true/false value of a register -- for example, on a RISC-V processor, you
   can compare two registers ra and rb and then branch if ra < rb, ra >= rb,
   ra <> rb, or a few other possibilities. We could produce better code if
   our CPS IF form directly reflected these possibilities.
   That is, we could extend IF to be something like

       (IF (rel arg1 arg2) prog prog)

   where REL is one of: < > <= >= <> =.

6 and up:

- You're compiling Scheme, not SML, so you need run-time arity checks on
  function calls. And you'd like to optimise away cases where you know the
  arity check is satisfied at compile time. You could extend closure tuples
  to store arity in the slot just before the code pointer. Or you could put
  arity information into the low 10 bits of the procedure pointer and shift
  them away. Or you could require CODE and PROC forms to have an initial
  header with arity information.

- Oh, and what about n-ary Scheme functions, where a call induces list
  consing? In real Scheme, this is written as

      (lambda (x1 x2 . more-xs) ...)

  This function takes at least two arguments, which are bound to X1 and X2.
  Arguments #3 and following are consed into a list which is bound to
  MORE-XS. This is the standard. It's clearly painfully inefficient. Perhaps
  you could extend the language to support more efficient n-ary call
  protocols? (Dybvig has published alternative mechanisms. Racket provides
  them.)

  Maybe you could provide programmers the ability to define extensions of
  associative operators to n-ary functions, with special beta-reduction
  rules, using a magic ASSOCIATIVE-NARY keyword, e.g.,

      (define + (associative-nary binary+ 0))
      (define * (associative-nary binary* 1))

  where the compiler will convert

      (+ x y z) => (binary+ (binary+ x y) z)
      (+ x y)   => (binary+ x y)
      (+ x)     => x
      (+)       => 0

  Oops, now you've drifted from writing a compiler into language design
  (influenced by your knowledge of how compilers work). This is how it
  starts. When you start publishing papers in ICFP, it will be time to admit
  you have a language problem and consider seeking a support group and
  possibly entering rehab.

- Your standard function-call protocol is to pass all arguments in
  registers. This is clearly impossible if you call a function with more
  parameters than the machine has registers (unless you do what Orbit did,
  and have an "extended register set" out in RAM for just this purpose).
  Change the protocol to pass the first four parameters in registers and the
  remaining ones on the stack.

- Flow analysis? Of course!

- Factor LETREC bindings into strongly connected components. You need to do
  this to prevent space leaks from shared closures.

- Web-based function-call protocol specialisation.

- You're missing out on the power of Scheme if you don't have a general,
  hygienic macro front end. Go figure out how to do that.

- Don't forget: real Scheme has call/cc... General continuations involve
  copying the entire stack -- very heavyweight. Perhaps you could do some
  analysis to spot the cheap, easy cases? (For example, should throwing out
  of the middle of a tight loop really require saving and restoring the
  *entire stack* to/from the heap? That's crazy talk.) Perhaps you could
  redesign your compilation strategies and run-time stack representations to
  make continuation capture incremental? See work by Dybvig for more.

- Your numeric-intensive codes would run a *lot* faster if you could figure
  out how to transform your loops to use MMX/SSE/AVX SIMD instructions on
  the processor. How hard could that be?

- And what about the garbage collector and tagged data representations?

- Or you could just punt lots of the above off to LLVM.

- Why are you compiling Scheme, anyway? Wouldn't you be happier with a
  polymorphic static type system?

- I could go on...