CS 6983 Special Topics: The Technology of Lambda
Olin Shivers

Homework 3: Writing a lambda simplifier
Due 2025/2/29 midnight by email

* Part 1: Symbol table
----------------------

Implement a functional/persistent symbol -> number dictionary. You can
implement a simple one using alists, or you can build a more efficient
one using a balanced-tree (e.g., red-black tree) dictionary. Feel
free to use existing dictionary implementations already built for
Racket, if you like. This is not a sophomore-year class on data
structures.

A symbol->number dictionary has this interface:

;;; The empty symbol table
(empty-symtab)

;;; A new symbol table just like SYMTAB, but symbol SYM maps to number I.
(update-symtab symtab sym i)

;;; Returns the number to which SYM maps, or #f if no entry.
(lookup-sym symtab sym)

;;; Note that this function doesn't provide any functionality you
;;; can't get from LOOKUP-SYMTAB, but we provide it so that you
;;; can write clearer code.
(symtab-contains? symtab sym) ; Returns #f or some true value.

;;; Here are two useful utilities, written for you:

;;; Apply function F to the entry kept for SYM in the table.
(define (modify-entry symtab sym f)
  (update-symtab symtab
  		 (f (lookup-sym symtab sym))))

;;; Add DELTA to SYM's entry in the table. "No entry" is treated as 0.
;;; Returns the updated table.
(define (incr-entry symtab sym delta)
  (modify-entry symtab sym 
                (lambda (x)
		  (+ (or x 0) delta))))


-------------------------------------------------------------------------------
* Part 2: Alphatisation
-----------------------

Write a nanopass transform that assigns every variable in your program
a unique name. Making unique names on demand is pretty easy. Here is
code that will do it:

  (define *gensym-counter* 0)
  (define (reset-gensym-counter!) (set! *gensym-counter* 0))
  (define (new-name base)
    (let ((name (string->symbol
                  (string-append (symbol->string base)
		                 "."
				 (number->string *gensym-counter*)))))
      (set! *gensym-counter* (add1 *gensym-counter*))
      name))

But feel free to work a little more to make the names not too ugly -- e.g., 
it would be nice if your transform alphatised this

  (fun (x k)
    (letrec ((x (fun (y k) ($+ y 3 k))))
      (x 8)))

into this

  (fun (x.1 k.1)
    (letrec ((x.2 (fun (y.1 k.2) ($+ y.1 3 k.2))))
      (x.2 8 k.1)))

Note that the X's are numbered independently of the Y's and the K's.

- You can use your symbol->number dictionary for this cosmetic 
  dot-number trick.

Or you can expend a bit more cleverness and arrange for your
transform to be idempotent. That is, applying your transform to
a term where the variables are *already* alpha-unique doesn't
alter the term at all.

This is all cosmetic stuff intended to help keep your transformed code
as human-readable as possible, which is handy for a limited-effort,
one-semester compiler.

-------------------------------------------------------------------------------
* Part 3: Variable census
-------------------------

Write a "census" analysis, which returns a dictionary mapping every
variable in a program term with the number of times it is referenced.
This should only be applied to a term that has been "alphatised,"
so that all its variables have unique names.

Write an analysis that computes a list of the free variables in a
term. (It's not particularly efficient, but you can make use of the
LSET-... functions from the SRFI-1 list library for doing set
operations on lists.)


-------------------------------------------------------------------------------
* Part 4: Substitution
----------------------
Write a substitution function for lambda terms:
    (substitute term var new)
returns the tree TERM except that every free occurence of VAR in TERM is
replaced with NEW.

You might think that you can exploit the alpha-uniqueness property of
your terms, post-alphatisation, to simplify and speed up this function. 
But... it's not that simple.

- Suppose TERM and NEW are both subtrees of a larger term, PROG --
  that is, we're recurring down inside PROG, and have decided to do
  a substitution.

  If there is a free reference to VAR down in TERM, then when we are
  done, our result will have NEW where that reference occured. But NEW
  may still occur elsewhere in PROG. So now we have two copies of NEW
  in the new version of PROG we are recursively constructing. This
  would violate alpha-uniqueness! Every variable locally bound in NEW
  would now have multiple binders. In such a case, every time we drop
  a copy of NEW into place inside TERM, we have to sweep over NEW and
  make sure the copied version binds fresh local identifiers.

- If we do a substitution of NEW into TERM and TERM contains
  *multiple* references to VAR, then, again, we'll have to "freshen"
  up each copy of NEW we drop into place -- each copy has to have
  distinct bound vars.
  
To that end, you should write a FRESHEN function that walks a term
like NEW and makes a copy whose bound vars are *globally* unique
across the whole program. (So your FRESHEN function will need access
to the global state you maintain for your make-a-fresh-identifier
machinery in the alphatisation pass.)

You might also consider writing a special-case variant of SUBSTITUTE,
    (substitute/once term var new)
which can *only* be called if
- TERM contains exactly *one* free reference to VAR, and
- we know that NEW will only appear inside TERM when done.
In other words, SUBSTITUTE/ONCE can be used to *move* NEW from one
place in the program some other place. No freshening necessary.
This can make one-ref beta reductions much simpler and faster.

We have seen in class how to do all this efficiently (with the
uplinked BUBS data structure), but you can do it simply and not worry
about speed.


-------------------------------------------------------------------------------
* Part 5: Simplify
------------------
Enough preliminaries. Now we are ready to do simplification.

Write a set of nanopass transforms that will simplify a term in CPS form.
Your simplifier should do the following:

- Any beta redexes should be reduced if
  + the bound variable is referred to only 0 or 1 times, or
  + the redex's substituend (the "argument" part of the redex)
    is a variable or a constant.

- Eta redexes should be reduced

- Constant fold arithmetic primops

- Constant fold IF expressions with constant tests.

- Propagate information determined by IF tests. That is,
    (if a
        (if a 4 5)
	(f a 6))
  should simplify to
    (if a
        (if #t 4 5)
	(f #f 6))
  which a later pass will constant-fold to
    (if a 4 (f #f 6))
	
- Want to do more? Fee free...

Your simplifier should set a mutable CHANGED? boolean any time it
finds a hit; your top level simplifier should iterate all the
simplification passes until no change.

  
-------------------------------------------------------------------------------
* Part 6: Anchor pointing
-------------------------
Consider this piece extra credit, as it is a little challenging.

Arrange for your simplifier to do "anchor pointing" -- that is,
the IF-of-an-IF transform. In non-cps, direct style, we
want to transform
  (if (if e1 e2 e3) e4 e5)
to
  (let ((thunk4 (lambda () e4))
        (thunk5 (lambda () e5)))
    (if e1
        (if e2 (thunk4) (thunk5))
	(if e3 (thunk4) (thunk5))))
   
Doing this in CPS will take a little thought.

Note that the anti-code-replication thunks THUNK4 & THUNK5 may
be eligible for reduction if expressions E4 or E5 are small.


-------------------------------------------------------------------------------
* Part 7: Demo
--------------

Show off a little. Submit some tests showing the combined power of
these simple optimisations on some code of your choosing.


-------------------------------------------------------------------------------
* Suggestion
------------

You can test your simplifier on large terms using the CPS interpreter
to run both the "before" and "after" programs.