From Robert_Maclachlan@RAM.IUS.CS.CMU.EDU Tue Jan 23 23:31:52 2001 Received: from knight.cons.org (knight.cons.org [194.233.237.86]) by mailhub.mclink.it (8.9.1/8.9.0) with ESMTP id XAA13404 for ; Tue, 23 Jan 2001 23:46:18 +0100 (CET) Received: from RAM.IUS.CS.CMU.EDU (RAM.IUS.CS.CMU.EDU [128.2.206.16]) by knight.cons.org (8.11.1/8.11.1) with SMTP id f0NMWCX09687 for ; Tue, 23 Jan 2001 23:32:13 +0100 (CET) Message-Id: <200101232232.f0NMWCX09687@knight.cons.org> Received: from RAM.IUS.CS.CMU.EDU by RAM.IUS.CS.CMU.EDU id aa08152; 23 Jan 2001 17:31 EST To: Raymond Toy cc: cmucl-imp@cons.org Subject: Re: Non Vector SIMPLE Arrays. In-reply-to: Your message of 23 Jan 2001 16:32:28 -0500. <4nofwyosyb.fsf@rtp.ericsson.se> Reply-To: ram+@CS.cmu.edu Date: Tue, 23 Jan 2001 17:31:52 -0500 From: Robert Maclachlan Are there other notes available that aren't in the TeX internals document? It would be nice if they could be made available. Ray Well, I don't know. I've appended the notes file if you care to compare. Rob ________________________________________________________________ 1 -*- Dictionary: design; Package: C -*- 2 Done stuff: 3 Glossary: 4 Phases: 5 IR1 CONVERSION 6 Canonical forms: 7 Inline functions: 8 Tail sets: 9 Hairy function representation: 10 IR1 representation of non-local exits 11 Block compilation: 12 Notes: 13 LOCAL CALL ANALYSIS 14 Entry points: 15 FLOW GRAPH CANONICALIZATION 16 IR1 OPTIMIZE 17 Goals for IR1 optimizations: 18 Flow graph simplification: 19 Bottom-up IR1 optimizations: 20 Top-down IR1 optimizations: 21 TYPE CHECK 22 TYPE CONSTRAINT PROPAGATION 23 ENVIRONMENT ANALYSIS 24 CALL/LOOP ANALYSIS 25 GLOBAL TN ASSIGNMENT 26 LOCAL TN ASSIGNMENT 27 IR2 CONVERSION 28 Stack analysis: 29 Non-local exit: 30 REACHING DEFINITIONS 31 LOOP INVARIANT OPTIMIZATION 32 COMMON SUBEXPRESSION ELIMINATION 33 REPRESENTATION SELECTION 34 LIFETIME ANALYSIS 35 Flow analysis: 36 Conflict detection: 37 PACKING 38 Scarce SB packing: 39 Load TN packing: 40 Unbounded SB packing: 41 CONTROL OPTIMIZATION 42 BRANCH DELAY 43 CODE GENERATION 44 ASSEMBLY 45 IR1 FINALIZE 46 RETARGETING 47 Storage bases and classes: 48 Type system parameterization: 49 VOP Definition: 50 Lifetime model: 51 VOP Cost model: 52 Implementation parameterizations: 53 Multiple hardware configurations: 54 Efficiency notes: 55 Standard VOPS: 56 Special-case IR2 conversion: 57 INTERPRETER INTERFACE 58 MISC LOW-LEVEL STUFF 59 Inline shallow binding: 60 Foreign function call: 61 Assembly routines: 62 IR2 CONSISTENCY CHECKING 63 OBSERVED BUGS AND POSSIBLE FIXES Done stuff: Glossary: assert (a type) [see also restrict] In Python, all type checking is done via a general type assertion mechanism. Explicit declarations and implicit assertions (e.g. the arg to + is a number) are recorded in the front-end (implicit continuation) representation. Type assertions (and thus type-checking) are "unbundled" from the operations that are affected by the assertion. This has two major advantages: -- Code that implements operations need not concern itself with checking operand types. -- Run-time type checks can be eliminated when the compiler can prove that the assertion will always be satisfied. back end The back end is the part of the compiler that operates on the VIRTUAL MACHINE intermediate representation. Also included are the compiler phases involved in the conversion from the FRONT END (VIRTUAL MACHINE) representation. bind node This is a node type in the that marks the start of a LAMBDA body in the IMPLICIT CONTINUATION representation. This serves as a placeholder for environment manipulation code. basic block A basic block (or simply "block") has the pretty much the usual meaning of representing a straight-line sequence of code. However, the code sequence ultimately generated for a block might contain internal branches that were hidden inside the implementation of a particular operation. The type of a block is actually CBLOCK. The BLOCK-INFO slot holds an IR2-BLOCK containing backend information. block compilation Block compilation is a term commonly used to describe the compile-time resolution of function names. This enables many optimizations. call graph Each node in the call graph is a function (represented by a FLOW GRAPH.) The arcs in the call graph represent a possible call from one function to another. See also TAIL SET cleanup A cleanup is the part of the implicit continuation representation that retains information scoping relationships. For indefinite extent bindings (variables and functions), we can abandon scoping information after ICR conversion, recovering the lifetime information using flow analysis. But dynamic bindings (special values, catch, unwind protect, etc.) must be removed at a precise time (whenever the scope is exited.) Cleanup structures form a hierarchy that represents the static nesting of dynamic binding structures. When the compiler does a control transfer, it can use the cleanup information to determine what cleanup code needs to be emitted. closure variable A closure variable is any lexical variable that has references outside of its HOME ENVIRONMENT. See also INDIRECT VALUE CELL closed continuation A closed continuation represents a TAGBODY tag or BLOCK name that is closed over. These two cases are mostly indistinguishable after IR1 conversion. home Home is a term used to describe various back-pointers. A lambda variable's "home" is the lambda that the variable belongs to. A lambda's "home environment" is the environment in which that lambda's variables are allocated. indirect value cell Any closure variable that has assignments (SETQs) will be allocated in an indirect value cell. This is necessary to ensure that all references to the variable will see assigned values, since the compiler normally freely copies values when creating a closure. set variable Any variable that is assigned to is called a "set variable". Several optimizations must special-case set variables, and set closure variables must have an INDIRECT VALUE CELL. code generator (in define-vop) The code generator for a VOP is a potentially arbitrary list code fragment which is responsible for emitting assembly code to implement that VOP. codegen The compiler phase which calls each VOP's code generator. constant pool The part of a compiled code object that holds pointers to non-immediate constants. constant TN A Constant TN is the VM representation of a compile-time constant value. constant leaf A Constant LEAF is the IC representation of a compile-time constant value. combination A combination NODE is the IC representation of any fixed-argument function call (not APPLY or MULTIPLE-VALUE-CALL.) top-level component A top-level component is any component whose only entry points are top-level lambdas. top-level lambda A top-level lambda represents the execution of the outermost form on which the compiler was invoked. In the case of COMPILE-FILE, this is often a truly top-level form in the source file, but the compiler can recursively descend into some forms (EVAL-WHEN, etc.) breaking them into separate compilations. component A component is basically a sequence of blocks. Each component is compiled into a separate code object. With BLOCK COMPILATION or LOCAL FUNCTIONS, a component will contain the code for more than one function. This is called a component because it represents a connected portion of the call graph. Normally the blocks are in DFO. component, initial During ICR conversion, blocks are temporarily assigned to initial components. The "flow graph canonicalization" phase determines the true component structure. component, head and tail The head and tail of a component are dummy blocks that mark the start and end of the DFO sequence. local function (call) A local function call is a call to a function known at compile time to be in the same COMPONENT. Local call allows compile time resolution of the target address and calling conventions. See BLOCK COMPILATION. conflict (of TNs, set) Register allocation terminology. Two TNs conflict if they could ever be live simultaneously. The conflict set of a TN is all TNs that it conflicts with. continuation The ICR data structure which represents both: -- The receiving of a value (or multiple values), and -- A control location in the flow graph. In the Implicit Continuation Representation, the environment is implicit in the continuation's BLOCK (hence the name.) The ICR continuation is very similar to a CPS continuation in its use, but its representation doesn't much resemble (is not interchangeable with) a lambda. CONT A slot in the NODE holding the CONTINUATION which receives the node's value(s). Unless the node ends a BLOCK, this also implicitly indicates which node should be evaluated next. COST CS CSTACK dead (TN) DEST DFN DFO DOMINATOR DROPTHRUS EFFECTFUL EFFECTLESS entry point (external) ENV environment analysis null external entry point FIXUP flow graph FOLDABLE FP front end FSC full call function attribute function "real" (allocates environment) meaning function-entry more vague (any lambda?) funny function GEN (kill and...) global TN, conflicts, preference GTN (number) IR IR1 IR2 IR1 conversion, IR2 conversion (translation) inline expansion, call kill (to make dead) also in common subexpression known function LAMBDA leaf let call lifetime analysis, live (tn, variable) load tn LOCS (passing, return locations) local call local TN, conflicts, (or just used in one block) location (selection) loop nesting depth factor segment (of strange loop) head, tail(s), exits LTN (number) main entry mess-up (for cleanup) MISCOP more arg (entry) MV natural loop non-local exit non-packed SC, TN non-set variable operand (to vop) optimizer (in ir1 optimize) optional-dispatch pack, packing, packed pass (in a transform) passing locations (value) conventions (known, unknown) policy (safe, fast, small, ...) predecessor block primitive-type reaching definition REF representation selection for value result continuation (for function) result type assertion (for template) (or is it restriction) restrict a TN to finite SBs a template operand to a primitive type (boxed...) a tn-ref to particular SCs return (node, vops) safe, safety saving (of registers, costs) SB SC (restriction) semi-inline set variable side-effect in IR1 in IR2 SP sparse set splitting (of IR2 blocks) SSET strange loop SUBPRIMITIVE successor block tail recursion tail recursive tail recursive loop user tail recursion template TN TNBIND TN-REF transform (source, IR1) type assertion inference top-down, bottom-up assertion propagation derived, asserted descriptor, specifier, intersection, union, member type check type-check (in continuation) UNBOXED (boxed) descriptor unknown values continuation unset variable unwind-block, unwinding used value (dest) value passing VAR VM VOP XEP Phases: The structure of the compiler may be broadly characterized by describing the compilation phases and the data structures that they manipulate. The steps in the compilation are called phases rather than passes since they don't necessarily involve a full pass over the code. The data structure used to represent the code at some point is called an IR (Intermediate Representation). Two major IRs are used in the compiler: IR1 is used to represent the lisp-level semantics of the source code during the initial phases. Meta-evaluation and semantic analysis are done on this representation. IR1 is roughly equivalent to a subset of Common Lisp, but is represented as a flow-graph rather than a syntax tree. Phases which only manipulate IR1 comprise the "front end". It would be possible to use a different back end such as one that directly generated code for a stack machine. IR2 is used to represent the implementation of the source code on a virtual machine. The virtual machine may vary depending on the the target hardware, but the IR2 representation is sufficiently stylized that most of the phases which manipulate it are portable. Each phase is briefly described here. The ordering below is approximate. Some phases could moved around. The phases from Local call analysis to Type constraint propagation all interact; for maximum optimization, they should be repeated until nothing new is discovered. The name of a phase is in brackets if it may be omitted to save compilation time. IR1 conversion Convert the source into the IR1 representation, doing macroexpansion and simple source-to-source transformation. All names are resolved at this time, so we don't have to worry about name conflicts. Local call analysis Find calls to local functions and splice them into the flow graph so that we can do flow analysis. Create External Entry Points for entry-point functions. Flow graph canonicalization Find flow graph components and compute depth-first ordering. Locate IF-TYPEP constructs. [IR1 optimize] A grab-bag of all the non-flow IR1 optimizations. Fold constant functions, propagate types and eliminate code that computes unused values. Special-case calls to some known global functions by replacing them with a computed function. Merge blocks and eliminate IF-IFs. Substitute let variables. [Type constraint propagation] Use global flow analysis to propagate information about lexical variable types. Eliminate unnecessary type checks and tests. Type check generation Emit explicit IR1 code for necessary type checks too complex to be easily generated on the fly by the back end. Check that type assertions are satisfied, marking places where type checks need to be done. Locate Let calls. Delete functions and variables with no references Environment analysis Determine which distinct environments need to be allocated, and what context needed to be closed over by each environment. We detect non-local exits and set closure variables. We also emit cleanup code as funny function calls. This is the last pure IR1 pass. [Call/loop analysis] Make IR2 annotations about the call and loop structure of the component. The call information is used to determine the feasibility of inter-routine register allocation, and the loop information is used to detect inner loops for register allocation and loop optimization. Control analysis Linearize the flow graph in a way that minimizes the number of branches. Global TN allocation (GTN) Walk over the environment nesting, determining function calling conventions and assigning TNs to local variables. Local TN allocation (LTN) Use type and policy information to determine which IR2 translation to use for known functions, and then create TNs for expression evaluation temporaries. We also accumulate some random information needed by IR2 conversion. Stack analysis Maintain stack discipline for unknown-values continuation in the presence of local exits. IR2 conversion Convert IR1 into IR2 by translating nodes into VOPs. Emit type checks. [Reaching definitions] Compute the reaching definitions for TNs. Use this information to eliminate unnecessary copying of TN values. [Loop invariant optimization] Move simple expressions out of loops where possible. [common subexpression elimination] Combine expressions that are duplicated within a block. Lifetime analysis Do flow analysis to find the set of TNs that have lifetimes that overlap with the lifetimes of each TN being packed. Annotate call VOPs with the TNs that need to be saved across calls. Packing Use cost information to assign each TN to the "best" storage location, choosing the "best" code generator for each VOP as a side-effect. Code generation Call the VOP generators to emit assembly code. [Pipeline reorganization] On some machines, move memory references backward in the code so that they can overlap with computation. On machines with delayed branch instructions, locate instructions that can be moved into delay slots. Assembly Resolve branches and dump into the output file with appropriate loader directives. IR1 finalize This phase is run after all components have been compiled. It scans the global variable references, looking for references to undefined variables and incompatible function redefinitions. IR1 CONVERSION The set of special forms recognized is exactly that specified in the Common Lisp manual. Everything that is described as a macro in CLTL is a macro. Large amounts of syntactic information are thrown away by the conversion to an anonymous flow graph representation. The elimination of names eliminates the need to represent most environment manipulation special forms. The explicit representation of control eliminates the need to represent BLOCK and GO, and makes flow analysis easy. The full Common Lisp LAMBDA is implemented with a simple fixed-arg lambda, which greatly simplifies later code. The elimination of syntactic information eliminates the need for most of the "beta transformation" optimizations in Rabbit. There are no progns, no tagbodys and no returns. There are no "close parens" which get in the way of determining which node receives a given value. In IR1, computation is represented by Nodes. These are the node types: If: Represents all conditionals. Set: Represents a SETQ. Ref: Represents a constant or variable reference. Combination: Represents a normal function call. MV-Combination: Represents a MULTIPLE-VALUE-CALL. This is used to implement all multiple value receiving forms except for MULTIPLE-VALUE-PROG1, which is implicit. Bind: This represents the allocation and initialization of the variables in a lambda. Return: This collects the return value from a Lambda and represents the control transfer on return. Entry: Marks the start of a dynamic extent that can have non-local exits to it. Dynamic state can be saved at this point for restoration on re-entry. Exit: Marks a potentially non-local exit. This node is interposed between the non-local uses of a continuation and the DEST so that code to do a non-local exit can be inserted if necessary. Some slots are shared between all node types. This information held in common between all nodes often makes it possible to avoid special-casing nodes on the basis of type. This shared information is primarily concerned with the order of evaluation and destinations and properties of results. This control and value flow is indicated in the node primarily by pointing to continuations. The Continuation structure represents stuff that is sufficiently related to the normal notion of a continuation that naming it so seems sensible. Basically, a continuation represents a place in the code, or alternatively the destination of an expression result and a transfer of control. These two notions are bound together for the same reasons that they are related in the standard functional continuation interpretation. A Continuation may be deprived of either or both of its value or control significance. If the value of a continuation is unused due to evaluation for effect, then the continuation will have a null DEST. If the NEXT node for a continuation is deleted by some optimization, then NEXT will be :NONE. [### Continuation kinds...] The Block structure represents a basic block, in the the normal sense. Control transfers other than simple sequencing are represented by information in the Block structure. The continuation for the last node in a block represents only the destination for the result. It is very difficult to reconstruct anything resembling the original source from IR1, so we record the original source form in each node. The location of the source form within the input is also recorded, allowing for interfaces such as "Edit Compiler Warnings". Forms such as special-bind and catch need to have cleanup code executed at all exit points from the form. We represent this constraint in IR1 by annotating the code syntactically within the form with a Cleanup structure describing what needs to be cleaned up. Environment analysis determines the cleanup locations by watching for a change in the cleanup between two continuations. We can't emit cleanup code during IR1 conversion, since we don't know which exits will be local until after IR1 optimizations are done. Special binding is represented by a call to the funny function %Special-Bind. The first argument is the Global-Var structure for the variable bound and the second argument is the value to bind it to. Some subprimitives are implemented using a macro-like mechanism for translating %PRIMITIVE forms into arbitrary lisp code. Subprimitives special-cased by IR2 conversion are represented by a call to the funny function %%Primitive. The corresponding Template structure is passed as the first argument. We check global function calls for syntactic legality with respect to any defined function type function. If the call is illegal or we are unable to tell if it is legal due to non-constant keywords, then we give a warning and mark the function reference as :notinline to force a full call and cause subsequent phases to ignore the call. If the call is legal and is to a known function, then we annotate the Combination node with the Function-Info structure that contains the compiler information for the function. Canonical forms: #| Would be useful to have a Freeze-Type proclamation. Its primary use would to be say that the indicated type won't acquire any new subtypes in the future. This allows better open-coding of structure type predicates, since the possible types that would satisfy the predicate will be constant at compile time, and thus can be compiled as a skip-chain of EQ tests. Of course, this is only a big win when the subtypes are few: the most important case is when there are none. If the closure of the subtypes is much larger than the average number of supertypes of an inferior, then it is better to grab the list of superiors out of the object's type, and test for membership in that list. Should type-specific numeric equality be done by EQL rather than =? i.e. should = on two fixnums become EQL and then convert to EQL/FIXNUM? Currently we transform EQL into =, which is complicated, since we have to prove the operands are the class of numeric type before we do it. Also, when EQL sees one operand is a FIXNUM, it transforms to EQ, but the generator for EQ isn't expecting numbers, so it doesn't use an immediate compare. Array hackery: Array type tests are transformed to %array-typep, separation of the implementation-dependent array-type handling. This way we can transform STRINGP to: (or (simple-string-p x) (and (complex-array-p x) (= (array-rank x) 1) (simple-string-p (%array-data x)))) In addition to the similar bit-vector-p, we also handle vectorp and any type tests on which the a dimension isn't wild. [Note that we will want to expand into frobs compatible with those that array references expand into so that the same optimizations will work on both.] These changes combine to convert hairy type checks into hairy typep's, and then convert hairyp typeps into simple typeps. Do we really need non-VOP templates? It seems that we could get the desired effect through implementation-dependent IR1 transforms. The main risk would be of obscuring the type semantics of the code. We could fairly easily retain all the type information present at the time the tranform is run, but if we discover new type information, then it won't be propagated unless the VM also supplies type inference methods for its internal frobs (precluding the use of %PRIMITIVE, since primitives don't have derive-type methods.) I guess one possibility would be to have the call still considered "known" even though it has been transformed. But this doesn't work, since we start doing LET optimizations that trash the arglist once the call has been transformed (and indeed we want to.) Actually, I guess the overhead for providing type inference methods for the internal frobs isn't that great, since we can usually borrow the inference method for a Common Lisp function. For example, in our AREF case: (aref x y) ==> (let ((#:len (array-dimension x 0))) (%unchecked-aref x (%check-in-bounds y #:len))) Now in this case, if we made %UNCHECKED-AREF have the same derive-type method as AREF, then if we discovered something new about X's element type, we could derive a new type for the entire expression. Actually, it seems that baring this detail at the IR1 level is beneficial, since it admits the possibly of optimizing away the bounds check using type information. If we discover X's dimensions, then #:LEN becomes a constant that can be substituted. Then %CHECK-IN-BOUNDS can notice that the bound is constant and check it against the type for Y. If Y is known to be in range, then we can optimize away the bounds check. Actually in this particular case, the best thing to do would be if we discovered the bound is constant, then replace the bounds check with an implicit type check. This way all the type check optimization mechanisms would be brought into the act. So we actually want to do the bounds-check expansion as soon as possible, rather than later than possible: it should be a source-transform, enabled by the fast-safe policy. With multi-dimensional arrays we probably want to explicitly do the index computation: this way portions of the index computation can become loop invariants. In a scan in row-major order, the inner loop wouldn't have to do any multiplication: it would only do an addition. We would use normal fixnum arithmetic, counting on * to cleverly handle multiplication by a constant, and appropriate inline expansion. Note that in a source transform, we can't make any assumptions the type of the array. If it turns out to be a complex array without declared dimensions, then the calls to ARRAY-DIMENSION will have to turn into a VOP that can be affected. But if it is simple, then the VOP is unaffected, and if we know the bounds, it is constant. Similarly, we would have %ARRAY-DATA and %ARRAY-DISPLACEMENT operations. %ARRAY-DISPLACEMENT would optimize to 0 if we discover the array is simple. [This is somewhat inefficient when the array isn't eventually discovered to be simple, since finding the data and finding the displacement duplicate each other. We could make %ARRAY-DATA return both as MVs, and then optimize to (VALUES (%SIMPLE-ARRAY-DATA x) 0), but this would require optimization of trivial VALUES uses.] Also need (THE (ARRAY * * * ...) x) to assert correct rank. |# A bunch of functions have source transforms that convert them into the canonical form that later parts of the compiler want to see. It is not legal to rely on the canonical form since source transforms can be inhibited by a Notinline declaration. This shouldn't be a problem, since everyone should keep their hands off of Notinline calls. Some transformations: Endp ==> (NULL (THE LIST ...)) (NOT xxx) or (NULL xxx) => (IF xxx NIL T) (typep x ') => ( x) (typep x ') => ...composition of simpler operations... TYPEP of AND, OR and NOT types turned into conditionals over multiple TYPEP calls. This makes hairy TYPEP calls more digestible to type constraint propagation, and also means that the TYPEP code generators don't have to deal with these cases. [### In the case of union types we may want to do something to preserve information for type constraint propagation.] (apply #'foo a b c) ==> (multiple-value-call #'foo (values a) (values b) (values-list c)) This way only MV-CALL needs to know how to do calls with unknown numbers of arguments. It should be nearly as efficient as a special-case IR2-Convert method could be. Make-String => Make-Array N-arg predicates associated into two-arg versions. Associate N-arg arithmetic ops. Expand CxxxR and FIRST...nTH Zerop, Plusp, Minusp, 1+, 1-, Min, Max, Rem, Mod (Values x), (Identity x) => (Prog1 x) All specialized aref functions => (aref (the xxx) ...) Convert (ldb (byte ...) ...) into internal frob that takes size and position as separate args. Other byte functions also... Change for-value primitive predicates into (if t nil). This isn't particularly useful during IR1 phases, but makes life easy for IR2 conversion. This last can't be a source transformation, since a source transform can't tell where the form appears. Instead, IR1 conversion special-cases calls to known functions with the Predicate attribute by doing the conversion when the destination of the result isn't an IF. It isn't critical that this never be done for predicates that we ultimately discover to deliver their value to an IF, since IF optimizations will flush unnecessary IFs in a predicate. Inline functions: [### Inline expansion is especially powerful in the presence of good lisp-level optimization ("partial evaluation"). Many "optimizations" usually done in Lisp compilers by special-case source-to-source transforms can be had simply by making the source of the general case function available for inline expansion. This is especially helpful in Common Lisp, which has many commonly used functions with simple special cases but bad general cases (list & sequence functions, for example.) Inline expansion of recursive functions is allowed, and is not as silly as it sounds. When expanded in a specific context, much of the overhead of the recursive calls may be eliminated (especially if there are many keyword arguments, etc.) [Also have MAYBE-INLINE] ] We only record a function's inline expansion in the global environment when the function is in the null lexical environment, since it the expansion must be represented as source. We do inline expansion of functions locally defined by FLET or LABELS even when the environment is not null. Since the appearances of the local function must be nested within the desired environment, it is possible to expand local functions inline even when they use the environment. We just stash the source form and environments in the Functional for the local function. When we convert a call to it, we just reconvert the source in the saved environment. An interesting alternative to the inline/full-call dichotomy is "semi-inline" coding. Whenever we have an inline expansion for a function, we can expand it only once per block compilation, and then use local call to call this copied version. This should get most of the speed advantage of real inline coding with much less code bloat. This is especially attractive for simple system functions such as Read-Char. The main place where true inline expansion would still be worth doing is where large amounts of the function could be optimized away by constant folding or other optimizations that depend on the exact arguments to the call. Tail sets: #| Probably want to have a GTN-like function result equivalence class mechanism for IR1 type inference. This would be like the return value propagation being done by Propagate-From-Calls, but more powerful, less hackish, and known to terminate. The IR1 equivalence classes could probably be used by GTN, as well. What we do is have local call analysis eagerly maintain the equivalence classes of functions that return the same way by annotating functions with a Tail-Info structure shared between all functions whose value could be the value of this function. We don't require that the calls actually be tail-recursive, only that the call deliver its value to the result continuation. [### Actually now done by IR1-OPTIMIZE-RETURN, which is currently making IR1 optimize mandatory.] We can then use the Tail-Set during IR1 type inference. It would have a type that is the union across all equivalent functions of the types of all the uses other than in local calls. This type would be recomputed during optimization of return nodes. When the type changes, we would propagate it to all calls to any of the equivalent functions. How do we know when and how to recompute the type for a tail-set? Recomputation is driven by type propagation on the result continuation. This is really special-casing of RETURN nodes. The return node has the type which is the union of all the non-call uses of the result. The tail-set is found though the lambda. We can then recompute the overall union by taking the union of the type per return node, rather than per-use. How do result type assertions work? We can't intersect the assertions across all functions in the equivalence class, since some of the call combinations may not happen (or even be possible). We can intersect the assertion of the result with the derived types for non-call uses. When we do a tail call, we obviously can't check that the returned value matches our assertion. Although in principle, we would like to be able to check all assertions, to preserve system integrity, we only need to check assertions that we depend on. We can afford to lose some assertion information as long as we entirely lose it, ignoring it for type inference as well as for type checking. Things will work out, since the caller will see the tail-info type as the derived type for the call, and will emit a type check if it needs a stronger result. A remaining question is whether we should intersect the assertion with per-RETURN derived types from the very beginning (i.e. before the type check pass). I think the answer is yes. We delay the type check pass so that we can get our best guess for the derived type before we decide whether a check is necessary. But with the function return type, we aren't committing to doing any type check when we intersect with the type assertion; the need to type check is still determined in the type check pass by examination of the result continuation. What is the relationship between the per-RETURN types and the types in the result continuation? The assertion is exactly the Continuation-Asserted-Type (note that the asserted type of result continuations will never change after IR1 conversion). The per-RETURN derived type is different than the Continuation-Derived-Type, since it is intersected with the asserted type even before Type Check runs. Ignoring the Continuation-Derived-Type probably makes life simpler anyway, since this breaks the potential circularity of the Tail-Info-Type will affecting the Continuation-Derived-Type, which affects... When a given return has no non-call uses, we represent this by using *empty-type*. This consistent with the interpretation that a return type of NIL means the function can't return. Hairy function representation: Non-fixed-arg functions are represented using Optional-Dispatch. An Optional-Dispatch has an entry-point function for each legal number of optionals, and one for when extra args are present. Each entry point function is a simple lambda. The entry point function for an optional is passed the arguments which were actually supplied; the entry point function is expected to default any remaining parameters and evaluate the actual function body. If no supplied-p arg is present, then we can do this fairly easily by having each entry point supply its default and call the next entry point, with the last entry point containing the body. If there are supplied-p args, then entry point function is replaced with a function that calls the original entry function with T's inserted at the position of all the supplied args with supplied-p parameters. We want to be a bit clever about how we handle arguments declared special when doing optional defaulting, or we will emit really gross code for special optionals. If we bound the arg specially over the entire entry-point function, then the entry point function would be caused to be non-tail-recursive. What we can do is only bind the variable specially around the evaluation of the default, and then read the special and store the final value of the special into a lexical variable which we then pass as the argument. In the common case where the default is a constant, we don't have to special-bind at all, since the computation of the default is not affected by and cannot affect any special bindings. Keyword and rest args are both implemented using a LEXPR-like "more args" convention. The More-Entry takes two arguments in addition to the fixed and optional arguments: the argument context and count. (ARG ) accesses the N'th additional argument. Keyword args are implemented directly using this mechanism. Rest args are created by calling %Listify-Rest-Args with the context and count. The More-Entry parses the keyword arguments and passes the values to the main function as positional arguments. If a keyword default is not constant, then we pass a supplied-p parameter into the main entry and let it worry about defaulting the argument. Since the main entry accepts keywords in parsed form, we can parse keywords at compile time for calls to known functions. We keep around the original parsed lambda-list and related information so that people can figure out how to call the main entry. IR1 representation of non-local exits All exits are initially represented by EXIT nodes: How about an Exit node: (defstruct (exit (:include node)) value) The Exit node uses the continuation that is to receive the thrown Value. During optimization, if we discover that the Cont's home-lambda is the same is the exit node's, then we can delete the Exit node, substituting the Cont for all of the Value's uses. The successor block of an EXIT is the entry block in the entered environment. So we use the Exit node to mark the place where exit code is inserted. During environment analysis, we need only insert a single block containing the entry point stub. We ensure that all Exits that aren't for a NLX don't have any Value, so that local exits never require any value massaging. The Entry node marks the beginning of a block or tagbody: (defstruct (entry (:include node)) (continuations nil :type list)) It contains a list of all the continuations that the body could exit to. The Entry node is used as a marker for the the place to snapshot state, including the control stack pointer. Each lambda has a list of its Entries so that environment analysis can figure out which continuations are really being closed over. There is no reason for optimization to delete Entry nodes, since they are harmless in the degenerate case: we just emit no code (like a no-var let). We represent CATCH using the lexical exit mechanism. We do a transformation like this: (catch 'foo xxx) ==> (block #:foo (%catch #'(lambda () (return-from #:foo (%unknown-values))) 'foo) (%within-cleanup :catch xxx)) %CATCH just sets up the catch frame which points to the exit function. %Catch is an ordinary function as far as IR1 is concerned. The fact that the catcher needs to be cleaned up is expressed by the Cleanup slots in the continuations in the body. %UNKNOWN-VALUES is a dummy function call which represents the fact that we don't know what values will be thrown. %WITHIN-CLEANUP is a special special form that instantiates its first argument as the current cleanup when converting the body. In reality, the lambda is also created by the special special form %ESCAPE-FUNCTION, which gives the lambda a special :ESCAPE kind so that the back end knows not to generate any code for it. We use a similar hack in Unwind-Protect to represent the fact that the cleanup forms can be invoked at arbitrarily random times. (unwind-protect p c) ==> (flet ((#:cleanup () c)) (block #:return (multiple-value-bind (#:next #:start #:count) (block #:unwind (%unwind-protect #'(lambda (x) (return-from #:unwind x))) (%within-cleanup :unwind-protect (return-from #:return p))) (#:cleanup) (%continue-unwind #:next #:start #:count)))) We use the block #:unwind to represent the entry to cleanup code in the case where we are non-locally unwound. Calling of the cleanup function in the drop-through case (or any local exit) is handled by cleanup generation. We make the cleanup a function so that cleanup generation can add calls at local exits from the protected form. #:next, #:start and #:count are state used in the case where we are unwound. They indicate where to go after doing the cleanup and what values are being thrown. The cleanup encloses only the protected form. As in CATCH, the escape function is specially tagged as :ESCAPE. The cleanup function is tagged as :CLEANUP to inhibit let conversion (since references are added in environment analysis.) Notice that implementing these forms using closures over continuations eliminates any need to special-case IR1 flow analysis. Obviously we don't really want to make heap-closures here. In reality these functions are special-cased by the back-end according to their KIND. Block compilation: One of the properties of IR1 is that supports "block compilation" by allowing arbitrarily large amounts of code to be converted at once, with actual compilation of the code being done at will. The interface to block compilation will be primarily through extensions to Compile-File rather than syntax in the source file. 1] The file argument may be a list of files to be compiled together into a single binary. 2] The keyword :Block-Compile specifies whether function references may be resolved at compile time. (default NIL) When enabled, full call may still be forced in a case-by-case basis by NOTINLINE declarations. 3] The keyword :Entry-Points determines which DEFUNs are given globally callable definitions. Eliminating entry points always saves space, and saves time when the function is used in only one place. It also allows us to derive argument types, since we are aware of all of the uses. These values are legal: :ALL All functions get entry points (the default). :SPECIFIED Functions in an ENTRY-POINT proclamation get entry points. :EXPORTED Functions are given entry points if they are either specified as an ENTRY-POINT or are exported from their name's home package. [The last is NYI. Also, we might really want some syntax to block compilation of a portion of a file, since block compilation of large files has prohibitive memory use.] In order to preserve the normal semantics we must recognize that proclamations (possibly implicit) are scoped. A proclamation is in effect only from the time of appearance of the proclamation to the time it is contradicted. The current global environment at the end of a block is not necessarily the correct global environment for compilation of all the code within the block. We solve this problem by closing over the relevant information in the IR1 at the time it is converted. For example, each functional variable reference is marked as inline, notinline or don't care. Similarly, each node contains a structure known as a Cookie which contains the appropriate settings of the compiler policy switches. We actually convert each form in the file separately, creating a separate "initial component" for each one. Later on, these components are merged as needed. The main reason for doing this is to cause EVAL-WHEN processing to be interleaved with reading. Compilation policy: We want more sophisticated control of compilation safety than is offered in CL, so that we can emit only those type checks that are likely to discover something (i.e. external interfaces.) #| Make some sort of selective safeness such as "XEP only" which would allow Hemlock and much of Lisp to be compiled unsafe, let CLX interfaces be fully safe with unsafe or semi-safe guts, etc. Have some sort of general DEFPOLICY mechanism that allows the policy for XEPs and for the body to be determined from outside in a config file or something. Sort of a rule-based system or something. We can match on an exact name, or a list of names, or a predicate on the name or on the source file name. Each rule gets to specify an optimize or XEP optimize declaration, and we process all applicable ones, from the least specific to the most specific (however that is determined) giving precedence to the last value for a quality. I suppose instead of having predicates on the file, we could just define the rules before each file in the compilation order. Perhaps hook into IR1 conversion as a way to introduce fairly arbitrary declarations. i.e., it should be possible to do inline, etc, as well as optimize. With-compilation-unit would be a reasonable way to scope this stuff at the file level and greater. WCU could specify a scoped "global" optimize declaration, and also the function-matching stuff for within that scope. The function-matching stuff (external name, XEP, etc.) would be syntactically wrapped around the body of the function, i.e. having a tighter scope than global declarations. This might actually be handled in IR1-convert-lambda, though I guess that would make problems with the current interface, since the name, kind are set afterwards. An alternative would be for callers of IR1-C-L to pass in some additional declarations that would be magically spliced in. I guess XEP-OPTIMIZE (or something) could be a declaration in its own right, or we could count on function-matching of the :EXTERNAL kind. It seems that having it be a separate declaration might be slightly preferable, as it would allow easy local changes in the defaults. Perhaps name OPTIMIZE-INTERFACE. If SAFETY is the only thing we want to specify (???), then INTERFACE-SAFETY. Note that this should affect some functions other than the XEP itself. In particular, the more-arg entry (keyword parsing.) Probably also optional entries, though that is somewhat more questionable. So perhaps we have two new mechanisms: -- The OPTIMIZE-INTERFACE declaration, which is managed almost exactly like the current OPTIMIZE declaration. The only difference is that the initial global value (*DEFAULT-INTERFACE-COOKIE*) is all NILs, and any NIL values are defaulted from the current normal policy at the point that we are converting interface code. -- A context-sensitive declaration mechanism that allows declarations to be inserted at the head of functions automatically depending on the function name, etc. A reasonable model might be a scope list of functions which we call with the function name to be, file info, etc., and if it returns the first value true, we use the second value as additional declarations. We could also have models that allowed multiple rules to apply to the same function, but it isn't clear we need that power. (definitely unnecessary as long as we only use it for the OPTIMIZE declaration.) It is an interesting question whether the context-declaration stuff should be in the LEXENV, or just in some random special bound by W-C-U. Probably in interface to the context declaration stuff, the main thing we want is something that fires on functions that might be external (i.e. the name is external in its home package.) If not a package, we presumably assume external. Macros are always external? Note that we might like the bodies of macros to be compiled safe even when internal functions (and the bodies of external functions) aren't. Perhaps this is best done on a per-file basis. Probably W-C-U should have :OPTIMIZE and :OPTIMIZE-INTERFACE arguments to allow policy to be specified there. |# Notes: Generalized back-end notion provides dynamic retargeting? (for byte code) The current node type annotations seem to be somewhat unsatisfactory, since we lose information when we do a THE on a continuation that already has uses, or when we convert a let where the actual result continuation has other uses. But the case with THE isn't really all that bad, since the test of whether there are any uses happens before conversion of the argument, thus THE loses information only when there are uses outside of the declared form. The LET case may not be a big deal either. Note also that losing user assertions isn't really all that bad, since it won't damage system integrity. At worst, it will cause a bug to go undetected. More likely, it will just cause the error to be signaled in a different place (and possibly in a less informative way). Of course, there is an efficiency hit for losing type information, but if it only happens in strange cases, then this isn't a big deal. LOCAL CALL ANALYSIS All calls to local functions (known named functions and LETs) are resolved to the exact LAMBDA node which is to be called. If the call is syntactically illegal, then we emit a warning and mark the reference as :notinline, forcing the call to be a full call. We don't even think about converting APPLY calls; APPLY is not special-cased at all in IR1. We also take care not to convert calls in the top-level component, which would join it to normal code. Calls to functions with rest args and calls with non-constant keywords are also not converted. We also convert MV-Calls that look like MULTIPLE-VALUE-BIND to local calls, since we know that they can be open-coded. We replace the optional dispatch with a call to the last optional entry point, letting MV-Call magically default the unsupplied values to NIL. When IR1 optimizations discover a possible new local call, they explicitly invoke local call analysis on the code that needs to be reanalyzed. [### Let conversion. What is means to be a let. Argument type checking done by caller. Significance of local call is that all callers are known, so special call conventions may be used.] A lambda called in only one place is called a "let" call, since a Let would turn into one. In addition to enabling various IR1 optimizations, the let/non-let distinction has important environment significance. We treat the code in function and all of the lets called by that function as being in the same environment. This allows exits from lets to be treated as local exits, and makes life easy for environment analysis. Since we will let-convert any function with only one call, we must be careful about cleanups. It is possible that a lexical exit from the let function may have to clean up dynamic bindings not lexically apparent at the exit point. We handle this by annotating lets with any cleanup in effect at the call site. The cleanup for continuations with no immediately enclosing cleanup is the lambda that the continuation is in. In this case, we look at the lambda to see if any cleanups need to be done. Let conversion is disabled for entry-point functions, since otherwise we might convert the call from the XEP to the entry point into a let. Then later on, we might want to convert a non-local reference into a local call, and not be able to, since once a function has been converted to a let, we can't convert it back. A function's return node may also be deleted if it is unreachable, which can happen if the function never returns normally. Such functions are not lets. Entry points: #| Since we need to evaluate potentially arbitrary code in the XEP argument forms (for type checking), we can't leave the arguments in the wired passing locations. Instead, it seems better to give the XEP max-args fixed arguments, with the passing locations being the true passing locations. Instead of using %XEP-ARG, we reference the appropriate variable. Also, it might be a good idea to do argument count checking and dispatching with explicit conditional code in the XEP. This would simplify both the code that creates the XEP and the IR2 conversion of XEPs. Also, argument count dispatching would automatically benefit from any cleverness in compilation of case-like forms (jump tables, etc). On the downside, this would push some assumptions about how arg dispatching is done into IR1. But then we are currently violating abstraction at least as badly in IR2 conversion, which is also supposed to be implementation independent. |# As a side-effect of finding which references to known functions can be converted to local calls, we find any references that cannot be converted. References that cannot be converted to a local call must evaluate to a "function object" (or function-entry) that can be called using the full call convention. A function that can be called from outside the component is called an "entry-point". Lots of stuff that happens at compile-time with local function calls must be done at run-time when an entry-point is called. It is desirable for optimization and other purposes if all the calls to every function were directly present in IR1 as local calls. We cannot directly do this with entry-point functions, since we don't know where and how the entry-point will be called until run-time. What we do is represent all the calls possible from outside the component by local calls within the component. For each entry-point function, we create a corresponding lambda called the external entry point or XEP. This is a function which takes the number of arguments passed as the first argument, followed by arguments corresponding to each required or optional argument. If an optional argument is unsupplied, the value passed into the XEP is undefined. The XEP is responsible for doing argument count checking and dispatching. In the case of a fixed-arg lambda, we emit a call to the %VERIFY-ARGUMENT-COUNT funny function (conditional on policy), then call the real function on the passed arguments. Even in this simple case, we benefit several ways from having a separate XEP: -- The argument count checking is factored out, and only needs to be done in full calls. -- Argument type checking happens automatically as a consequence of passing the XEP arguments in a local call to the real function. This type checking is also only done in full calls. -- The real function may use a non-standard calling convention for the benefit of recursive or block-compiled calls. The XEP converts arguments/return values to/from the standard convention. This also requires little special-casing of XEPs. If the function has variable argument count (represented by an OPTIONAL-DISPATCH), then the XEP contains a COND which dispatches off of the argument count, calling the appropriate entry-point function (which then does defaulting). If there is a more entry (for keyword or rest args), then the XEP obtains the more arg context and count by calling the %MORE-ARG-CONTEXT funny function. All non-local-call references to functions are replaced with references to the corresponding XEP. IR1 optimization may discover a local call that was previously a non-local reference. When we delete the reference to the XEP, we may find that it has no references. In this case, we can delete the XEP, causing the function to no longer be an entry-point. FLOW GRAPH CANONICALIZATION This is a post-pass to IR1 conversion that massages the flow graph into the shape subsequent phases expect. Things done: Compute the depth-first ordering for the flow graph. Find the components (disconnected parts) of the flow graph. Note (IF (typep ') ...) constructs. This pass need only be redone when newly converted code has been added to the flow graph. The reanalyze flag in the component structure should be set by people who mess things up. Another thing we do [### But not yet... Wait for type constraint propagation] is check whether the predicate of each IF is a %TYPEP test with a useful constant type and a lexical variable argument. If it is, we annotate the IF node with the variable and the type tested. This information is used by later type propagation steps. A useful type is one that we can hack at compile time: not a Hairy-Type. If the type is a NOT type, then we can handle it by swapping the alternatives. (if ...) is consider a type test against (not null). We create the initial DFO using a variant of the basic algorithm. The initial DFO computation breaks the IR1 up into components, which are parts that can be compiled independently. This is done to increase the efficiency of large block compilations. In addition to improving locality of reference and reducing the size of flow analysis problems, this allows back-end data structures to be reclaimed after the compilation of each component. IR1 optimization can change the connectivity of the flow graph by discovering new calls or eliminating dead code. Initial DFO determination splits up the flow graph into separate components, but does so conservatively, ensuring that parts that might become joined (due to local call conversion) are joined from the start. Initial DFO computation also guarantees that all code which shares a lexical environment is in the same component so that environment analysis needs to operate only on a single component at a time. [This can get a bit hairy, since code seemingly reachable from the environment entry may be reachable from a NLX into that environment. Also, function references must be considered as links joining components even though the flow graph doesn't represent these.] After initial DFO determination, components are neither split nor joined. The standard DFO computation doesn't attempt to split components that have been disconnected. IR1 OPTIMIZE We are conservative about doing variable-for-variable substitution in IR1 optimization, since if we substitute a variable with a less restrictive type, then we may prevent use of a "good" representation within the scope of the inner binding. Note that variable-variable substitutions aren't really crucial in IR1, since they don't create opportunities for new optimizations (unlike substitution of constants and functions). A spurious variable-variable binding will show up as a Move operation in IR2. This can be optimized away by reaching-definitions and also by targeting. [### But actually, some optimizers do see if operands are the same variable.] #| The IF-IF optimization can be modeled as a value driven optimization, since adding a use definitely is cause for marking the continuation for reoptimization. [When do we add uses? Let conversion is the only obvious time.] I guess IF-IF conversion could also be triggered by a non-immediate use of the test continuation becoming immediate, but to allow this to happen would require Delete-Block (or somebody) to mark block-starts as needing to be reoptimized when a predecessor changes. It's not clear how important it is that IF-IF conversion happen under all possible circumstances, as long as it happens to the obvious cases. [### It isn't totally true that code flushing never enables other worthwhile optimizations. Deleting a functional reference can cause a function to cease being an XEP, or even trigger let conversion. It seems we still want to flush code during IR1 optimize, but maybe we want to interleave it more intimately with the optimization pass. Ref-flushing works just as well forward as backward, so it could be done in the forward pass. Call flushing doesn't work so well, but we could scan the block backward looking for any new flushable stuff if we flushed a call on the forward pass. When we delete a variable due to lack of references, we leave the variable in the lambda-list so that positional references still work. The initial value continuation is flushed, though (replaced with NIL) allowing the initial value for to be deleted (modulo side-effects.) Note that we can delete vars with no refs even when they have sets. I guess when there are no refs, we should also flush all sets, allowing the value expressions to be flushed as well. Squeeze out single-reference unset let variables by changing the dest of the initial value continuation to be the node that receives the ref. This can be done regardless of what the initial value form is, since we aren't actually moving the evaluation. Instead, we are in effect using the continuation's locations in place of the temporary variable. Doing this is of course, a wild violation of stack discipline, since the ref might be inside a loop, etc. But with the IR2 back-end, we only need to preserve stack discipline for unknown-value continuations; this IR1 transformation must be already be inhibited when the DEST of the REF is a multiple-values receiver (EXIT, RETURN or MV-COMBINATION), since we must preserve the single-value semantics of the let-binding in this case. The REF and variable must be deleted as part of this operation, since the IR1 would otherwise be left in an inconsistent state; we can't wait for the REF to be deleted due to bing unused, since we have grabbed the arg continuation and substituted it into the old DEST. The big reason for doing this transformation is that in macros such as INCF and PSETQ, temporaries are squeezed out, and the new value expression is evaluated directly to the setter, allowing any result type assertion to be applied to the expression evaluation. Unlike in the case of substitution, there is no point in inhibiting this transformation when the initial value type is weaker than the variable type. Instead, we intersect the asserted type for the old REF's CONT with the type assertion on the initial value continuation. Note that the variable's type has already been asserted on the initial-value continuation. Of course, this transformation also simplifies the IR1 even when it doesn't discover interesting type assertions, so it makes sense to do it whenever possible. This reduces the demands placed on register allocation, etc. |# There are three dead-code flushing rules: 1] Refs with no DEST may be flushed. 2] Known calls with no dest that are flushable may be flushed. We null the DEST in all the args. 3] If a lambda-var has no refs, then it may be deleted. The flushed argument continuations have their DEST nulled. These optimizations all enable one another. We scan blocks backward, looking for nodes whose CONT has no DEST, then type-dispatching off of the node. If we delete a ref, then we check to see if it is a lambda-var with no refs. When we flush an argument, we mark the blocks for all uses of the CONT as needing to be reoptimized. Goals for IR1 optimizations: #| When an optimization is disabled, code should still be correct and not ridiculously inefficient. Phases shouldn't be made mandatory when they have lots of non-required stuff jammed into them. |# This pass is optional, but is desirable if anything is more important than compilation speed. This phase is a grab-bag of optimizations that concern themselves with the flow of values through the code representation. The main things done are type inference, constant folding and dead expression elimination. This phase can be understood as a walk of the expression tree that propagates assertions down the tree and propagates derived information up the tree. The main complication is that there isn't any expression tree, since IR1 is flow-graph based. We repeat this pass until we don't discover anything new. This is a bit of feat, since we dispatch to arbitrary functions which may do arbitrary things, making it hard to tell if anything really happened. Even if we solve this problem by requiring people to flag when they changed or by checking to see if they changed something, there are serious efficiency problems due to massive redundant computation, since in many cases the only way to tell if anything changed is to recompute the value and see if it is different from the old one. We solve this problem by requiring that optimizations for a node only depend on the properties of the CONT and the continuations that have the node as their DEST. If the continuations haven't changed since the last pass, then we don't attempt to re-optimize the node, since we know nothing interesting will happen. We keep track of which continuations have changed by a REOPTIMIZE flag that is set whenever something about the continuation's value changes. When doing the bottom up pass, we dispatch to type specific code that knows how to tell when a node needs to be reoptimized and does the optimization. These node types are special-cased: COMBINATION, IF, RETURN, EXIT, SET. The REOPTIMIZE flag in the COMBINATION-FUN is used to detect when the function information might have changed, so that we know when where are new assertions that could be propagated from the function type to the arguments. When we discover something about a leaf, or substitute for leaf, we reoptimize the CONT for all the REF and SET nodes. We have flags in each block that indicate when any nodes or continuations in the block need to be re-optimized, so we don't have to scan blocks where there is no chance of anything happening. It is important for efficiency purposes that optimizers never say that they did something when they didn't, but this by itself doesn't guarantee timely termination. I believe that with the type system implemented, type inference will converge in finite time, but as a practical matter, it can take far too long to discover not much. For this reason, IR1 optimization is terminated after three consecutive passes that don't add or delete code. This premature termination only happens 2% of the time. Flow graph simplification: Things done: Delete blocks with no predecessors. Merge blocks that can be merged. Convert local calls to Let calls. Eliminate degenerate IFs. We take care not to merge blocks that are in different functions or have different cleanups. This guarantees that non-local exits are always at block ends and that cleanup code never needs to be inserted within a block. We eliminate IFs with identical consequent and alternative. This would most likely happen if both the consequent and alternative were optimized away. [Could also be done if the consequent and alternative were different blocks, but computed the same value. This could be done by a sort of cross-jumping optimization that looked at the predecessors for a block and merged code shared between predecessors. IFs with identical branches would eventually be left with nothing in their branches.] We eliminate IF-IF constructs: (IF (IF A B C) D E) ==> (IF A (IF B D E) (IF C D E)) In reality, what we do is replicate blocks containing only an IF node where the predicate continuation is the block start. We make one copy of the IF node for each use, leaving the consequent and alternative the same. If you look at the flow graph representation, you will see that this is really the same thing as the above source to source transformation. Bottom-up IR1 optimizations: In the bottom-up pass, we scan the code in forward depth-first order. We examine each call to a known function, and: Replace calls of foldable functions with constant arguments with the result. We don't have to actually delete the call node, since Top-Down optimize will delete it now that its value is unused. Run any Optimizer for the current function. The optimizer does arbitrary transformations by hacking directly on the IR. This is useful primarily for arithmetic simplification and similar things that may need to examine and modify calls other than the current call. The optimizer is responsible for recording any changes that it makes. An optimizer can inhibit further optimization of the node during the current pass by returning true. This is useful when deleting the node. Do IR1 transformations, replacing a global function call with equivalent inline lisp code. Do bottom-up type propagation/inferencing. For some functions such as Coerce we will dispatch to a function to find the result type. The Derive-Type function just returns a type structure, and we check if it is different from the old type in order to see if there was a change. Eliminate IFs with predicates known to be true or false. Substitute the value for unset let variables that are bound to constants, unset lambda variables or functionals. Propagate types from local call args to var refs. We use type info from the function continuation to find result types for functions that don't have a derive-type method. IR1 transformation: IR1 transformation does "source to source" transformations on known global functions, taking advantage of semantic information such as argument types and constant arguments. Transformation is optional, but should be done if speed or space is more important than compilation speed. Transformations which increase space should pass when space is more important than speed. A transform is actually an inline function call where the function is computed at compile time. The transform gets to peek at the continuations for the arguments, and computes a function using the information gained. Transforms should be cautious about directly using the values of constant continuations, since the compiler must preserve eqlness of named constants, and it will have a hard time if transforms go around randomly copying constants. The lambda that the transform computes replaces the original function variable reference as the function for the call. This lets the compiler worry about evaluating each argument once in the right order. We want to be careful to preserve type information when we do a transform, since it may be less than obvious what the transformed code does. There can be any number of transforms for a function. Each transform is associated with a function type that the call must be compatible with. A transform is only invoked if the call has the right type. This provides a way to deal with the common case of a transform that only applies when the arguments are of certain types and some arguments are not specified. We always use the derived type when determining whether a transform is applicable. Type check is responsible for setting the derived type to the intersection of the asserted and derived types. If the code in the expansion has insufficient explicit or implicit argument type checking, then it should cause checks to be generated by making declarations. A transformation may decide to pass if it doesn't like what it sees when it looks at the args. The Give-Up function unwinds out of the transform and deals with complaining about inefficiency if speed is more important than brevity. The format args for the message are arguments to Give-Up. If a transform can't be done, we just record the message where IR1 finalize can find it. note. We can't complain immediately, since it might get transformed later on. Top-down IR1 optimizations: In the top-down pass, we walk the code in reverse depth-first order and: Eliminate any effectless nodes with unused values. In IR1 this is the only way that code is deleted other than the elimination of unreachable blocks. Eliminate any bindings for unused variables. Do top-down type assertion propagation. In local calls, we propagate asserted and derived types between the call and the called lambda. TYPE CHECK [### Somehow split this section up into three parts: -- Conceptual: how we know a check is necessary, and who is responsible for doing checks. -- Incremental: intersection of derived & asserted types, checking for non-subtype relationship. -- Check generation phase. ] We need to do a pretty good job of guessing when a type check will ultimately need to be done. Generic arithmetic, for example: In the absence of declarations, we will use use the safe variant, but if we don't know this, we will generate a check for NUMBER anyway. We need to look at the fast-safe templates and guess if any of them could apply. We compute a function type from the VOP arguments and assertions on those arguments. This can be used with Valid-Function-Use to see which templates do or might apply to a particular call. If we guess that a safe implementation will be used, then we mark the continuation so as to force a safe implementation to be chosen. [This will happen if IR1 optimize doesn't run to completion, so the ir1 optimization after type check generation can discover new type information. Since we won't redo type check at that point, there could be a call that has applicable unsafe templates, but isn't type checkable.] [### A better and more general optimization of structure type checks: in type check conversion, we look at the *original derived* type of the continuation: if the difference between the proven type and the asserted type is a simple type check, then check for the negation of the difference. e.g. if we want a FOO and we know we've got (OR FOO NULL), then test for (NOT NULL). This is a very important optimization for linked lists of structures, but can also apply in other situations.] If after IR1 phases, we have a continuation with check-type set in a context where it seems likely a check will be emitted, and the type is too hairy to be easily checked (i.e. no CHECK-xxx VOP), then we do a transformation on the IR1 equivalent to: (... (the hair ) ...) ==> (... (funcall #'(lambda (#:val) (if (typep #:val 'hair) #:val (%type-check-error #:val 'hair))) ) ...) This way, we guarantee that IR2 conversion never has to emit type checks for hairy types. [Actually, we need to do a MV-bind and several type checks when there is a MV continuation. And some values types are just too hairy to check. We really can't check any assertion for a non-fixed number of values, since there isn't any efficient way to bind arbitrary numbers of values. (could be done with MV-call of a more-arg function, I guess...) ] [Perhaps only use CHECK-xxx VOPs for types equivalent to a ptype? Exceptions for CONS and SYMBOL? Anyway, no point in going to trouble to implement and emit rarely used CHECK-xxx vops.] One potential lose in converting a type check to explicit conditionals rather than to a CHECK-xxx VOP is that IR2 code motion optimizations won't be able to do anything. This shouldn't be much of an issue, though, since type constraint propagation has already done global optimization of type checks. This phase is optional, but should be done if anything is more important than compile speed. Type check is responsible for reconciling the continuation asserted and derived types, emitting type checks if appropriate. If the derived type is a subtype of the asserted type, then we don't need to do anything. If there is no intersection between the asserted and derived types, then there is a manifest type error. We print a warning message, indicating that something is almost surely wrong. This will inhibit any transforms or generators that care about their argument types, yet also inhibits further error messages, since NIL is a subtype of every type. If the intersection is not null, then we set the derived type to the intersection of the asserted and derived types and set the Type-Check flag in the continuation. We always set the flag when we can't prove that the type assertion is satisfied, regardless of whether we will ultimately actually emit a type check or not. This is so other phases such as type constraint propagation can use the Type-Check flag to detect an interesting type assertion, instead of having to duplicate much of the work in this phase. [### 7 extremely random values for CONTINUATION-TYPE-CHECK.] Type checks are generated on the fly during IR2 conversion. When IR2 conversion generates the check, it prints an efficiency note if speed is important. We don't flame now since type constraint progpagation may decide that the check is unnecessary. [### Not done now, maybe never.] In local function call, it is the caller that is in effect responsible for checking argument types. This happens in the same way as any other type check, since IR1 optimize propagates the declared argument types to the type assertions for the argument continuations in all the calls. Since the types of arguments to entry points are unknown at compile time, we want to do runtime checks to ensure that the incoming arguments are of the correct type. This happens without any special effort on the part of type check, since the XEP is represented as a local call with unknown type arguments. These arguments will be marked as needing to be checked. TYPE CONSTRAINT PROPAGATION #| New lambda-var-slot: constraints: a list of all the constraints on this var for either X or Y. How to maintain consistency? Does it really matter if there are constraints with deleted vars lying around? Note that whatever mechanism we use for getting the constraints in the first place should tend to keep them up to date. Probably we would define optimizers for the interesting relations that look at their CONT's dest and annotate it if it is an IF. But maybe it is more trouble then it is worth trying to build up the set of constraints during IR1 optimize (maintaining consistency in the process). Since IR1 optimize iterates a bunch of times before it converges, we would be wasting time recomputing the constraints, when nobody uses them till constraint propagation runs. It seems that the only possible win is if we re-ran constraint propagation (which we might want to do.) In that case, we wouldn't have to recompute all the constraints from scratch. But it seems that we could do this just as well by having IR1 optimize invalidate the affected parts of the constraint annotation, rather than trying to keep them up to date. This also fits better with the optional nature of constraint propagation, since we don't want IR1 optimize to commit to doing a lot of the work of constraint propagation. For example, we might have a per-block flag indicating that something happened in that block since the last time constraint propagation ran. We might have different flags to represent the distinction between discovering a new type assertion inside the block and discovering something new about an if predicate, since the latter would be cheaper to update and probably is more common. It's fairly easy to see how we can build these sets of restrictions and propagate them using flow analysis, but actually using this information seems a bit more ad-hoc. Probably the biggest thing we do is look at all the refs. If have proven that the value is EQ (EQL for a number) to some other leaf (constant or lambda-var), then we can substitute for that reference. In some cases, we will want to do special stuff depending on the DEST. If the dest is an IF and we proved (not null), then we can substitute T. And if the dest is some relation on the same two lambda-vars, then we want to see if we can show that relation is definitely true or false. Otherwise, we can do our best to invert the set of restrictions into a type. Since types hold only constant info, we have to ignore any constraints between two vars. We can make some use of negated type restrictions by using TYPE-DIFFERENCE to remove the type from the ref types. If our inferred type is as good as the type assertion, then the continuation's type-check flag will be cleared. It really isn't much of a problem that we don't infer union types on joins, since union types are relatively easy to derive without using flow information. The normal bottom-up type inference done by IR1 optimize does this for us: it annotates everything with the union of all of the things it might possibly be. Then constraint propagation subtracts out those types that can't be in effect because of predicates or checks. This phase is optional, but is desirable if anything is more important than compilation speed. We use an algorithm similar to available expressions to propagate variable type information that has been discovered by implicit or explicit type tests, or by type inference. We must do a pre-pass which locates set closure variables, since we cannot do flow analysis on such variables. We set a flag in each set closure variable so that we can quickly tell that it is losing when we see it again. Although this may seem to be wastefully redundant with environment analysis, the overlap isn't really that great, and the cost should be small compared to that of the flow analysis that we are preparing to do. [Or we could punt on set variables...] A type constraint is a structure that includes sset-element and has the type and variable. [### Also a not-p flag indicating whether the sense is negated.] Each variable has a list of its type constraints. We create a type constraint when we see a type test or check. If there is already a constraint for the same variable and type, then we just re-use it. If there is already a weaker constraint, then we generate both the weak constraints and the strong constraint so that the weak constraints won't be lost even if the strong one is unavailable. We find all the distinct type constraints for each variable during the pre-pass over the lambda nesting. Each constraint has a list of the weaker constraints so that we can easily generate them. Every block generates all the type constraints in it, but a constraint is available in a successor only if it is available in all predecessors. We determine the actual type constraint for a variable at a block by intersecting all the available type constraints for that variable. This isn't maximally tense when there are constraints that are not hierarchically related, e.g. (or a b) (or b c). If these constraints were available from two predecessors, then we could infer that we have an (or a b c) constraint, but the above algorithm would come up with none. This probably isn't a big problem. [### Do we want to deal with (if (eq ') ...) indicating singleton member type?] We detect explicit type tests by looking at type test annotation in the IF node. If there is a type check, the OUT sets are stored in the node, with different sets for the consequent and alternative. Implicit type checks are located by finding Ref nodes whose Cont has the Type-Check flag set. We don't actually represent the GEN sets, we just initialize OUT to it, and then form the union in place. When we do the post-pass, we clear the Type-Check flags in the continuations for Refs when we discover that the available constraints satisfy the asserted type. Any explicit uses of typep should be cleaned up by the IR1 optimizer for typep. We can also set the derived type for Refs to the intersection of the available type assertions. If we discover anything, we should consider redoing IR1 optimization, since better type information might enable more optimizations. ENVIRONMENT ANALYSIS #| A related change would be to annotate IR1 with information about tail-recursion relations. What we would do is add a slot to the node structure that points to the corresponding Tail-Info when a node is in a TR position. This annotation would be made in a final IR1 pass that runs after cleanup code is generated (part of environment analysis). When true, the node is in a true TR position (modulo return-convention incompatibility). When we determine return conventions, we null out the tail-p slots in XEP calls or known calls where we decided not to preserve tail-recursion. In this phase, we also check for changes in the dynamic binding environment that require cleanup code to be generated. We just check for changes in the Continuation-Cleanup on local control transfers. If it changes from an inner dynamic context to an outer one that is in the same environment, then we emit code to clean up the dynamic bindings between the old and new continuation. We represent the result of cleanup detection to the back end by interposing a new block containing a call to a funny function. Local exits from CATCH or UNWIND-PROTECT are detected in the same way. |# The primary activity in environment analysis is the annotation of IR1 with environment structures describing where variables are allocated and what values the environment closes over. Each lambda points to the environment where its variables are allocated, and the environments point back. We always allocate the environment at the Bind node for the sole non-let lambda in the environment, so there is a close relationship between environments and functions. Each "real function" (i.e. not a LET) has a corresponding environment. We attempt to share the same environment among as many lambdas as possible so that unnecessary environment manipulation is not done. During environment analysis the only optimization of this sort is realizing that a Let (a lambda with no Return node) cannot need its own environment, since there is no way that it can return and discover that its old values have been clobbered. When the function is called, values from other environments may need to be made available in the function's environment. These values are said to be "closed over". Even if a value is not referenced in a given environment, it may need to be closed over in that environment so that it can be passed to a called function that does reference the value. When we discover that a value must be closed over by a function, we must close over the value in all the environments where that function is referenced. This applies to all references, not just local calls, since at other references we must have the values on hand so that we can build a closure. This propagation must be applied recursively, since the value must also be available in *those* functions' callers. If a closure reference is known to be "safe" (not an upward funarg), then the closure structure may be allocated on the stack. Closure analysis deals only with closures over values, while Common Lisp requires closures over variables. The difference only becomes significant when variables are set. If a variable is not set, then we can freely make copies of it without keeping track of where they are. When a variable is set, we must maintain a single value cell, or at least the illusion thereof. We achieve this by creating a heap-allocated "value cell" structure for each set variable that is closed over. The pointer to this value cell is passed around as the "value" corresponding to that variable. References to the variable must explicitly indirect through the value cell. When we are scanning over the lambdas in the component, we also check for bound but not referenced variables. Environment analysis emits cleanup code for local exits and markers for non-local exits. A non-local exit is a control transfer from one environment to another. In a non-local exit, we must close over the continuation that we transfer to so that the exiting function can find its way back. We indicate the need to close a continuation by placing the continuation structure in the closure and also pushing it on a list in the environment structure for the target of the exit. [### To be safe, we would treat the continuation as a set closure variable so that we could invalidate it when we leave the dynamic extent of the exit point. Transferring control to a meaningless stack pointer would be apt to cause horrible death.] Each local control transfer may require dynamic state such as special bindings to be undone. We represent cleanup actions by funny function calls in a new block linked in as an implicit MV-PROG1. CALL/LOOP ANALYSIS Note that Stack Analysis introduces cleanup code after this phase runs. We have to stick the new IR2-Block into the appropriate loop. [### This should be rethought at some point. The current code for strange loops seems not to be totally debugged, and it isn't really clear that anyone cares. If we aren't using loop info for environment hackery, then we may want to redo stuff to be more useful for cost assignment and IR2 optimizations. Probably what we really want to do is to compute the call graph before we do loop analysis. It seems that we should be able to use this information to avoid getting confused during loop analysis. We still represent the calls the same way in the flow graph, since this makes the actual control flow explicit for flow analysis. If we do a depth-first walk of the call graph, then we can find recursive calls using the depth-first numbering. Every recursion will involve an arc from a higher numbered node to a lower numbered one. If a function is the destination of such an arc, then we mark it as recursive. Nodes in the call graph ("functions") will be similar to loops. We do loop analysis on each function, [walking the call graph bottom-up?], using the function as the root for the loop nesting. We want to integrate the "loop depth" used for cost computation with the information from the call graph. For example, a non-recursive function would have the "depth" of the maximum of its call sites, and a recursive function would have 1+ this, since the recursion is a "loop". Probably we want the "function" node to be the IR2 environment structure. This will presumably also include some "loop" structure. We definitely need to be aware of tail recursions at this point. Tail recursive calls are represented at the flow graph level, and don't appear in the call graph at all. Tail recursions will be indistinguishable from explicit iteration. ] This phase is optional, but should be done if speed or space is more important than compile speed. We annotate the IR2 with information about loops. First we find the set of the blocks which dominate each block. Note that if we keep the set sorted by DFN, then they will be in order of dominance as well. We probably want to special-case local function call, since standard flow analysis won't realize that the return point of the call is dominated by the call point. The Loop structure contains a list of all the blocks in it, and also the loop depth. We arrange things so that Loops are reasonably nested, and have up and down pointers. We omit blocks contained within inner loops from our list. We also list the exit points from the loop. We must deal with "strange" loops (not just a concession to TAGBODY, since mutual recursion can create non-reducible flow graphs.) First we find the back edges and natural loops, and then we find the strange loops. A strange loop is a cycle which contains no back edges. It turns out that every strange loop contains a retreating edge which is not a back edge, and all such edges appear in strange loops. Strange loops are broken into segments, where each segment is the code in the loop which is dominated by a given entry point. The representation for a segment of a strange loop is about the same as for a natural loop. We can find the heads for the segments in a strange loop by doing a graph walk forward from the ascending non-back edge, recursing edges which are not back edges. If we reach the start node or a node already in the loop, then we add all the nodes on our path to the strange loop as we unwind. Each block in the strange loop with predecessors outside of the loop is the head of a segment. When there are multiple back branches to the same start block, we consider all the code to be in the same loop. A loop is effectively defined by its head, rather than by any particular back branch. Loop nesting is determined by the dominance relation on loop heads. A loop is nested within another if the head of the inner loop is dominated by the head of the outer loop. Local call tends to cause the flow graph to be non-reducible. This shouldn't be a correctness issue, since other code should be assuming arbitrarily bizarre flow graphs. This is an efficiency issue, since it could inhibit the combination of the allocations for some functions, resulting in more environments being allocated than is really necessary. This can probably be fixed by special-casing local call during loop analysis. All we need to do to get the correct dominators is to union the dominators at the call site with the dominators at each return site. It is less clear how this should affect loop analysis. In some ways, it would make sense to treat local functions as being nested in all the loops that call them. On the other hand, if we are interested in the dominance properties of nesting, then it makes more sense for the code for the function to be in none of the loops. I guess that this conflict is due to different people wanting different things from loops. Some people are only interested in the importance of the code, which they determine from the loop depth. Other people such as environment analysis are interested in proving that there are no cyclic paths, and want the dominance relations preserved. Other people are interested in doing "loop optimizations", and are primarily interested in entry and exit points. We may want to do funny things with the loop depth anyway, since we will want to bound it so that costs remain fixnums. GLOBAL TN ASSIGNMENT [### Rename this phase so as not to be confused with the local/global TN representation.] The basic mechanism for closing over values is to pass the values as additional implicit arguments in the function call. This technique is only applicable when: -- the calling function knows which values the called function wants to close over, and -- the values to be closed over are available in the calling environment. The first condition is always true of local function calls. Environment analysis can guarantee that the second condition holds by closing over any needed values in the calling environment. If the function that closes over values may be called in an environment where the closed over values are not available, then we must store the values in a "closure" so that they are always accessible. Closures are called using the "full call" convention. When a closure is called, control is transferred to the "external entry point", which fetches the values out of the closure and then does a local call to the real function, passing the closure values as implicit arguments. In this scheme there is no such thing as a "heap closure variable" in code, since the closure values are moved into TNs by the external entry point. There is some potential for pessimization here, since we may end up moving the values from the closure into a stack memory location, but the advantages are also substantial. Simplicity is gained by always representing closure values the same way, and functions with closure references may still be called locally without allocating a closure. All the TN based IR2 optimizations will apply to closure variables, since closure variables are represented in the same way as all other variables in IR2. Closure values will be allocated in registers where appropriate. Closures are created at the point where the function is referenced, eliminating the need to be able to close over closures. This lazy creation of closures has the additional advantage that when a closure reference is conditionally not done, then the closure consing will never be done at all. The corresponding disadvantage is that a closure over the same values may be created multiple times if there are multiple references. Note however, that IR2 loop and common subexpression optimizations can eliminate redundant closure consing. In any case, multiple closures over the same variables doesn't seem to be that common. #| Having the Tail-Info would also make return convention determination trivial. We could just look at the type, checking to see if it represents a fixed number of values. To determine if the standard return convention is necessary to preserve tail-recursion, we just iterate over the equivalent functions, looking for XEPs and uses in full calls. |# The Global TN Assignment pass (GTN) can be considered a post-pass to environment analysis. This phase assigns the TNs used to hold local lexical variables and pass arguments and return values and determines the value-passing strategy used in local calls. To assign return locations, we look at the function's tail-set. If the result continuation for an entry point is used as the continuation for a full call, then we may need to constrain the continuation's values passing convention to the standard one. This is not necessary when the call is known not to be part of a tail-recursive loop (due to being a known function). Once we have figured out where we must use the standard value passing strategy, we can use a more flexible strategy to determine the return locations for local functions. We determine the possible numbers of return values from each function by examining the uses of all the result continuations in the equivalence class of the result continuation. If the tail-set type is for a fixed number of values, then we return that fixed number of values from all the functions whose result continuations are equated. If the number of values is not fixed, then we must use the unknown-values convention, although we are not forced to use the standard locations. We assign the result TNs at this time. We also use the tail-sets to see what convention we want to use. What we do is use the full convention for any function that has a XEP its tail-set, even if we aren't required to do so by a tail-recursive full call, as long as there are no non-tail-recursive local calls in the set. This prevents us from gratuitously using a non-standard convention when there is no reason to. LOCAL TN ASSIGNMENT [Want a different name for this so as not to be confused with the different local/global TN representations. The really interesting stuff in this phase is operation selection, values representation selection, return strategy, etc. Maybe this phase should be conceptually lumped with GTN as "implementation selection", since GTN determines call strategies and locations.] #| [### I guess I believe that it is OK for IR2 conversion to dick the IR1 flow graph. An alternative would be to give IR2 its very own flow graph, but that seems like overkill. In particular, it would be very nice if a TR local call looked exactly like a jump in IR2. This would allow loop optimizations to be done on loops written as recursions. In addition to making the call block transfer to the head of the function rather than to the return, we would also have to do something about skipping the part of the function prolog that moves arguments from the passing locations, since in a TR call they are already in the right frame. In addition to directly indicating whether a call should be coded with a TR variant, the Tail-P annotation flags non-call nodes that can directly return the value (an "advanced return"), rather than moving the value to the result continuation and jumping to the return code. Then (according to policy), we can decide to advance all possible returns. If all uses of the result are Tail-P, then LTN can annotate the result continuation as :Unused, inhibiting emission of the default return code. [### But not really. Now there is a single list of templates, and a given template has only one policy.] In LTN, we use the :Safe template as a last resort even when the policy is unsafe. Note that we don't try :Fast-Safe; if this is also a good unsafe template, then it should have the unsafe policies explicitly specified. With a :Fast-Safe template, the result type must be proven to satisfy the output type assertion. This means that a fast-safe template with a fixnum output type doesn't need to do fixnum overflow checking. [### Not right to just check against the Node-Derived-Type, since type-check intersects with this.] It seems that it would be useful to have a kind of template where the args must be checked to be fixnum, but the template checks for overflow and signals an error. In the case where an output assertion is present, this would generate better code than conditionally branching off to make a bignum, and then doing a type check on the result. How do we deal with deciding whether to do a fixnum overflow check? This is perhaps a more general problem with the interpretation of result type restrictions in templates. It would be useful to be able to discriminate between the case where the result has been proven to be a fixnum and where it has simply been asserted to be so. The semantics of result type restriction is that the result must be proven to be of that type *except* for safe generators, which are assumed to verify the assertion. That way "is-fixnum" case can be a fast-safe generator and the "should-be-fixnum" case is a safe generator. We could choose not to have a safe "should-be-fixnum" generator, and let the unrestricted safe generator handle it. We would then have to do an explicit type check on the result. In other words, for all template except Safe, a type restriction on either an argument or result means "this must be true; if it is not the system may break." In contrast, in a Safe template, the restriction means "If this is not true, I will signal an error." Since the node-derived-type only takes into consideration stuff that can be proved from the arguments, we can use the node-derived-type to select fast-safe templates. With unsafe policies, we don't care, since the code is supposed to be unsafe. |# Local TN assignment (LTN) assigns all the TNs needed to represent the values of continuations. This pass scans over the code for the component, examining each continuation and its destination. A number of somewhat unrelated things are also done at the same time so that multiple passes aren't necessary. -- Determine the Primitive-Type for each continuation value and assigns TNs to hold the values. -- Use policy information to determine the implementation strategy for each call to a known function. -- Clear the type-check flags in continuations whose destinations have safe implementations. -- Determine the value-passing strategy for each continuation: known or unknown. -- Note usage of unknown-values continuations so that stack analysis can tell when stack values must be discarded. If safety is more important that speed and space, then we consider generating type checks on the values of nodes whose CONT has the Type-Check flag set. If the destinatation for the continuation value is safe, then we don't need to do a check. We assume that all full calls are safe, and use the template information to determine whether inline operations are safe. This phase is where compiler policy switches have most of their effect. The speed/space/safety tradeoff can determine which of a number of coding strategies are used. It is important to make the policy choice in IR2 conversion rather than in code generation because the cost and storage requirement information which drives TNBIND will depend strongly on what actual VOP is chosen. In the case of +/FIXNUM, there might be three or more implementations, some optimized for speed, some for space, etc. Some of these VOPS might be open-coded and some not. We represent the implementation strategy for a call by either marking it as a full call or annotating it with a "template" representing the open-coding strategy. Templates are selected using a two-way dispatch off of operand primitive-types and policy. The general case of LTN is handled by the LTN-Annotate function in the function-info, but most functions are handled by a table-driven mechanism. There are four different translation policies that a template may have: Safe The safest implementation; must do argument type checking. Small The (unsafe) smallest implementation. Fast The (unsafe) fastest implementation. Fast-Safe An implementation optimized for speed, but which does any necessary checks exclusive of argument type checking. Examples are array bounds checks and fixnum overflow checks. Usually a function will have only one or two distinct templates. Either or both of the safe and fast-safe templates may be omitted; if both are specified, then they should be distinct. If there is no safe template and our policy is safe, then we do a full call. We use four different coding strategies, depending on the policy: Safe - safety > space > speed, or we want to use the fast-safe template, but there isn't one. Small - space > (max speed safety) Fast - speed > (max space safety) Fast-Safe + type check - safety > speed > space, or we want to use the safe template, but there isn't one. "Space" above is actually the maximum of space and cspeed, under the theory that less code will take less time to generate and assemble. [### This could lose if the smallest case is out-of-line, and must allocate many linkage registers.] IR2 CONVERSION #| Single-use let var continuation substitution not really correct, since it can cause a spurious type error. Maybe we do want stuff to prove that an NLX can't happen after all. Or go back to the idea of moving a combination arg to the ref location, and having that use the ref cont (with its output assertion.) This lossage doesn't seem very likely to actually happen, though. [### must-reach stuff wouldn't work quite as well as combination substitute in psetq, etc., since it would fail when one of the new values is random code (might unwind.)] Is this really a general problem with eager type checking? It seems you could argue that there was no type error in this code: (+ :foo (throw 'up nil)) But we would signal an error. Emit explicit you-lose operation when we do a move between two non-T ptypes, even when type checking isn't on. Can this really happen? Seems we should treat continuations like this as though type-check was true. Maybe LTN should leave type-check true in this case, even when the policy is unsafe. (Do a type check against NIL?) At continuation use time, we may in general have to do both a coerce-to-t and a type check, allocating two temporary TNs to hold the intermediate results. IR2 Control representation: We represent all control transfer explicitly. In particular, :Conditional VOPs take a single Target continuation and a Not-P flag indicating whether the sense of the test is negated. Then an unconditional Branch VOP will be emitted afterward if the other path isn't a drop-through. So we linearize the code before IR2-conversion. This isn't a problem, since there isn't much change in control flow after IR2 conversion (none until loop optimization requires introduction of header blocks.) It does make cost-based branch prediction a bit ucky, though, since we don't have any cost information in IR1. Actually, I guess we do have pretty good cost information after LTN even before IR2 conversion, since the most important thing to know is which functions are open-coded. |# IR2 preserves the block structure of IR1, but replaces the nodes with a target dependent virtual machine (VM) representation. Different implementations may use different VMs without making major changes in the back end. The two main components of IR2 are Temporary Names (TNs) and Virtual OPerations (VOPs). TNs represent the locations that hold values, and VOPs represent the operations performed on the values. A "primitive type" is a type meaningful at the VM level. Examples are Fixnum, String-Char, Short-Float. During IR2 conversion we use the primitive type of an expression to determine both where we can store the result of the expression and which type-specific implementations of an operation can be applied to the value. [Ptype is a set of SCs == representation choices & representation specific operations] The VM specific definitions provide functions that do stuff like find the primitive type corresponding to a type and test for primitive type subtypep. Usually primitive types will be disjoint except for T, which represents all types. The primitive type T is special-cased. Not only does it overlap with all the other types, but it implies a descriptor ("boxed" or "pointer") representation. For efficiency reasons, we sometimes want to use alternate representations for some objects such as numbers. The majority of operations cannot exploit alternate representations, and would only be complicated if they had to be able to convert alternate representations into descriptors. A template can require an operand to be a descriptor by constraining the operand to be of type T. A TN can only represent a single value, so we bare the implementation of MVs at this point. When we know the number of multiple values being handled, we use multiple TNs to hold them. When the number of values is actually unknown, we use a convention that is compatible with full function call. Everything that is done is done by a VOP in IR2. Calls to simple primitive functions such as + and CAR are translated to VOP equivalents by a table-driven mechanism. This translation is specified by the particular VM definition; IR2 conversion makes no assumptions about which operations are primitive or what operand types are worth special-casing. The default calling mechanisms and other miscellaneous builtin features are implemented using standard VOPs that must implemented by each VM. Type information can be forgotten after IR2 conversion, since all type-specific operation selections have been made. Simple type checking is explicitly done using CHECK-xxx VOPs. They act like innocuous effectless/unaffected VOPs which return the checked thing as a result. This allows loop-invariant optimization and common subexpression elimination to remove redundant checks. All type checking is done at the time the continuation is used. Note that we need only check asserted types, since if type inference works, the derived types will also be satisfied. We can check whichever is more convenient, since both should be true. Constants are turned into special Constant TNs, which are wired down in a SC that is determined by their type. The VM definition provides a function that returns constant a TN to represent a Constant Leaf. Each component has a constant pool. There is a register dedicated to holding the constant pool for the current component. The back end allocates non-immediate constants in the constant pool when it discovers them during translation from IR1. [### Check that we are describing what is actually implemented. But this really isn't very good in the presence of interesting unboxed representations...] Since LTN only deals with values from the viewpoint of the receiver, we must be prepared during the translation pass to do stuff to the continuation at the time it is used. -- If a VOP yields more values than are desired, then we must create TNs to hold the discarded results. An important special-case is continuations whose value is discarded. These continuations won't be annotated at all. In the case of a Ref, we can simply skip evaluation of the reference when the continuation hasn't been annotated. Although this will eliminate bogus references that for some reason weren't optimized away, the real purpose is to handle deferred references. -- If a VOP yields fewer values than desired, then we must default the extra values to NIL. -- If a continuation has its type-check flag set, then we must check the type of the value before moving it into the result location. In general, this requires computing the result in a temporary, and having the type-check operation deliver it in the actual result location. -- If the template's result type is T, then we must generate a boxed temporary to compute the result in when the continuation's type isn't T. We may also need to do stuff to the arguments when we generate code for a template. If an argument continuation isn't annotated, then it must be a deferred reference. We use the leaf's TN instead. We may have to do any of the above use-time actions also. Alternatively, we could avoid hair by not deferring references that must be type-checked or may need to be boxed. Stack analysis: Think of this as a lifetime problem: a values generator is a write and a values receiver is a read. We want to annotate each IR2-Block with the unknown-values continuations that are live at that point. If we do a control transfer to a place where fewer continuations are live, then we must deallocate the newly dead continuations. We want to convince ourselves that values deallocation based on lifetime analysis actually works. In particular, we need to be sure that it doesn't violate the required stack discipline. It is clear that it is impossible to deallocate the values before they become dead, since later code may decide to use them. So the only thing we need to ensure is that the "right" time isn't later than the time that the continuation becomes dead. The only reason why we couldn't deallocate continuation A as soon as it becomes dead would be that there is another continuation B on top of it that isn't dead (since we can only deallocate the topmost continuation). The key to understanding why this can't happen is that each continuation has only one read (receiver). If B is on top of A, then it must be the case that A is live at the receiver for B. This means that it is impossible for B to be live without A being live. The reason that we don't solve this problem using a normal iterative flow analysis is that we also need to know the ordering of the continuations on the stack so that we can do deallocation. When it comes time to discard values, we want to know which discarded continuation is on the bottom so that we can reset SP to its start. [I suppose we could also decrement SP by the aggregate size of the discarded continuations.] Another advantage of knowing the order in which we expect continuations to be on the stack is that it allows us to do some consistency checking. Also doing a localized graph walk around the values-receiver is likely to be much more efficient than doing an iterative flow analysis problem over all the code in the component (not that big a consideration.) #| Actually, what we do is do a backward graph walk from each unknown-values receiver. As we go, we mark each walked block with ther ordered list of continuations we believe are on the stack. Starting with an empty stack, we: -- When we encounter another unknown-values receiver, we push that continuation on our simulated stack. -- When we encounter a receiver (which had better be for the topmost continuation), we pop that continuation. -- When we pop all continuations, we terminate our walk. [### not quite right... It seems we may run into "dead values" during the graph walk too. It seems that we have to check if the pushed continuation is on stack top, and if not, add it to the ending stack so that the post-pass will discard it.] [### Also, we can't terminate our walk just because we hit a block previously walked. We have to compare the the End-Stack with the values received along the current path: if we have more values on our current walk than on the walk that last touched the block, then we need to re-walk the subgraph reachable from from that block, using our larger set of continuations. It seems that our actual termination condition is reaching a block whose End-Stack is already EQ to our current stack.] If at the start, the block containing the values receiver has already been walked, the we skip the walk for that continuation, since it has already been handled by an enclosing values receiver. Once a walk has started, we ignore any signs of a previous walk, clobbering the old result with our own, since we enclose that continuation, and the previous walk doesn't take into consideration the fact that our values block underlies its own. When we are done, we have annotated each block with the stack current both at the beginning and at the end of that block. Blocks that aren't walked don't have anything on the stack either place (although they may hack MVs internally). We then scan all the blocks in the component, looking for blocks that have predecessors with a different ending stack than that block's starting stack. (The starting stack had better be a tail of the predecessor's ending stack.) We insert a block intervening between all of these predecessors that sets SP to the end of the values for the continuation that should be on stack top. Of course, this pass needn't be done if there aren't any global unknown MVs. Also, if we find any block that wasn't reached during the walk, but that USEs an outside unknown-values continuation, then we know that the DEST can't be reached from this point, so the values are unused. We either insert code to pop the values, or somehow mark the code to prevent the values from ever being pushed. (We could cause the popping to be done by the normal pass if we iterated over the pushes beforehand, assigning a correct END-STACK.) [### But I think that we have to be a bit clever within blocks, given the possibility of blocks being joined. We could collect some unknown MVs in a block, then do a control transfer out of the receiver, and this control transfer could be squeezed out by merging blocks. How about: (tagbody (return (multiple-value-prog1 (foo) (when bar (go UNWIND)))) UNWIND (return (multiple-value-prog1 (baz) bletch))) But the problem doesn't happen here (can't happen in general?) since a node buried within a block can't use a continuation outside of the block. In fact, no block can have more then one PUSH continuation, and this must always be be last continuation. So it is trivially (structurally) true that all pops come before any push. [### But not really: the DEST of an embedded continuation may be outside the block. There can be multiple pushes, and we must find them by iterating over the uses of MV receivers in LTN. But it would be hard to get the order right this way. We could easily get the order right if we added the generators as we saw the uses, except that we can't guarantee that the continuations will be annotated at that point. (Actually, I think we only need the order for consistency checks, but that is probably worthwhile). I guess the thing to do is when we process the receiver, add the generator blocks to the Values-Generators, then do a post-pass that re-scans the blocks adding the pushes.] I believe that above concern with a dead use getting mashed inside a block can't happen, since the use inside the block must be the only use, and if the use isn't reachable from the push, then the use is totally unreachable, and should have been deleted, which would prevent the prevent it from ever being annotated. ] ] |# We find the partial ordering of the values globs for unknown values continuations in each environment. We don't have to scan the code looking for unknown values continuations since LTN annotates each block with the continuations that were popped and not pushed or pushed and not popped. This is all we need to do the inter-block analysis. After we have found out what stuff is on the stack at each block boundary, we look for blocks with predecessors that have junk on the stack. For each such block, we introduce a new block containing code to restore the stack pointer. Since unknown-values continuations are represented as , we can easily pop a continuation using the Start TN. Note that there is only doubt about how much stuff is on the control stack, since only it is used for unknown values. Any special stacks such as number stacks will always have a fixed allocation. Non-local exit: If the starting and ending continuations are not in the same environment, then the control transfer is a non-local exit. In this case just call Unwind with the appropriate stack pointer, and let the code at the re-entry point worry about fixing things up. It seems like maybe a good way to organize IR2 conversion of NLX would be to have environment analysis insert funny functions in new interposed cleanup blocks. The thing is that we need some way for IR2 conversion to: 1] Get its hands on the returned values. 2] Do weird control shit. 3] Deliver the values to the original continuation destination. I.e. we need some way to interpose arbitrary code in the path of value delivery. What we do is replace the NLX uses of the continuation with another continuation that is received by a MV-Call to %NLX-VALUES in a cleanup block that is interposed between the NLX uses and the old continuation's block. The MV-Call uses the original continuation to deliver it's values to. [Actually, it's not really important that this be an MV-Call, since it has to be special-cased by LTN anyway. Or maybe we would want it to be an MV call. If did normal LTN analysis of an MV call, it would force the returned values into the unknown values convention, which is probably pretty convenient for use in NLX. Then the entry code would have to use some special VOPs to receive the unknown values. But we probably need special VOPs for NLX entry anyway, and the code can share with the call VOPs. Also we probably need the technology anyway, since THROW will use truly unknown values.] On entry to a dynamic extent that has non-local-exists into it (always at an ENTRY node), we take a complete snapshot of the dynamic state: the top pointers for all stacks current Catch and Unwind-Protect current special binding (binding stack pointer in shallow binding) We insert code at the re-entry point which restores the saved dynamic state. All TNs live at a NLX EP are forced onto the stack, so we don't have to restore them, and we don't have to worry about getting them saved. REACHING DEFINITIONS This phase is optional, but should be done whenever speed or space is more important than compile speed. We use global flow analysis to find the reaching definitions for each TN. This information is used here to eliminate unnecessary TNs, and is also used later on by loop invariant optimization. In some cases, IR2 conversion will unnecessarily copy the value of a TN into another TN, since it may not be able to tell that the initial TN has the same value at the time the second TN is referenced. This can happen when IR1 optimize is unable to eliminate a trivial variable binding, or when the user does a setq, or may also result from creation of expression evaluation temporaries during IR2 conversion. Whatever the cause, we would like to avoid the unnecessary creation and assignment of these TNs. What we do is replace TN references whose only reaching definition is a Move VOP with a reference to the TN moved from. This deletion of references can cause the TN to be dead at the location of the Move VOP, causing conflict analysis to force it into the bit-bucket SC. The generator for Move will then realize that it doesn't have to do anything. [Probably should just delete the MOVE VOP. This way representation selection doesn't get confused, etc.] Some degree of cleverness is probably useful to prevent the flow analysis from being too expensive. As for lifetime analysis, we only need to do flow analysis on global packed TNs. We can't do the real local TN assignment pass before this, since we allocate TNs afterward. Probably we do some kind of pre-pass that marks the TNs that are local for our purposes. We don't care if block splitting eventually causes some of them to be considered global. Note also that we really only are interested in known if there is a unique reaching definition, which we could possibly mash into our flow analysis rules by doing an intersection somewhere. Then a definition would only appear in the set when it is unique. And we could propagate only definitions of TNs with only one write, which would allow the TN to stand for the definition. LOOP INVARIANT OPTIMIZATION This phase is optional, but should be done whenever speed is more important than compile speed and space. We scan the loops from the inside out, moving VOPs to before the head of the loop when we can show that they compute the same value every time around the loop. Probably most loop invariant expressions will be due to code implicitly emitted by the compiler, the largest contribution being error checking code. We need to be more careful than lots of compilers, since we must guarantee that we don't evaluate the invariant expressions when they would not have been previously. For example, it would be unacceptable to move an error check out of the loop when the loop might run zero times. The simplest solution is to only consider VOPs in blocks that dominate all the loop exits. This is almost worthless, since most loops have the exit test at the head. What we have to do is guard the invariant code with a replication of the exit test. A simple but useful version of this general transformation can be implemented when the head is an exit. We do something like this: LOOP (if test (go EXIT)) . . (go LOOP) EXIT ==> (if test (go EXIT)) (go SKIP) LOOP (if test (go EXIT)) SKIP . . (go LOOP) EXIT What we do is remove invariants in the head block, then copy the entire head block and use it as a guard for the other invariants, which must dominate every other exit, but don't dominate the head. It makes absolutely no difference what the code in the head block is or what the exit test is. Of course, copying blocks can be a major space waste. We might want to inhibit copying of large blocks unless optimize space is 0. [### Note that this scheme can only only move code out one loop level, since the guard on the invariant block prevents it from dominating the loop exits. This transformation can be regarded as a special case of the transform: -- Given a loop with an invariant conditional, replicate the loop for each value of the condition and run the correct version. The invariant test is "runs more than once". There is a potential exponential blowup here. To avoid excessive code bloat, we might have one copy of all N loops that is invariant optimized and for which we test that *all* levels of the loop will run more than once before jumping in. But that only works if the loop control of all loops is independent (as is true in the common nested-array-loop case.) Or something...] We determine what expressions are invariants by doing global flow analysis to find the reaching definitions for each TN. A VOP is a potential invariant when every arg is either constant, has all its definitions outside of the loop, or has as its only definition an invariant VOP inside the loop. Such a VOP can be moved out of the loop when we know that it isn't affected by any side-effects in the loop. This will be trivially true if the VOP isn't affected by any side-effects. A somewhat more general solution would be to take the union of the side-effects of all the VOPs in the loop, and only do the move when the VOP is not affected by any of them. [### Actually, we don't need to use reaching-definitions information: we can just use the reference information for each TN. This won't detect some invariants that would be detected by using reaching-definitions, but it isn't true for any of the internal subexpressions that we care about, since they all have only one definition. The main case where this would happen is in the "only definition outside" case: it could be that only an invariant definition reaches the use even when there are definitions inside the loop. But this would only seem to happen if people are gratuitously recycling variables. If we don't need to use reaching-definitions, then reaching-definitions analysis could be relegated to a higher optimization level, and could be tuned for the move optimization case, where we only want to know what the definition is when there is a single one, and can get away with a simple indication of "many" if there is more than one. The move-deletion action also seems conceptually pretty similar to the stuff that register allocation does. Maybe it could be squeezed in there somehow? ] We can relax the "dominates all exits" requirement when we can prove that the operation can be successfully executed whenever its operands are available. Of course, any error checking code doesn't fall into this category. But we also need to show that "nothing bad will happen": system integrity will be maintained when the operation is done with arguments that it that is might not otherwise be done with. For one, we need to be sure that localized type information didn't influence our original decision to emit that particular VOP. Uses of THE, variable type declarations not attached to bindings, and function argument type restrictions must be disregarded, since in the original program, the code containing the declarations or call might nor have been executed. In contrast, we can use type information derived from declarations for variable bindings. Since we eagerly check the types of variables, we know that any reference to the variable will satisfy the type. Read this to say we can use the variable's Leaf-Type: we have to be more careful about using the Refs Node-Derived-Type, since IR1 optimizations may discover local information about the variable's type. For example, type constraint propagation puts localized type information in the derived type. Note that the Leaf-Type is represented in the TN-Primitive type, accessible in IR2. How about this: For a VOP to be subject to aggressive code motion, it must be explicitly declared as such (i.e. :Movable T), and it must have primitive-type restrictions on any arguments that must be of a particular type for successful execution. Motion won't be done unless the TN-Primitive-Type of operands after the move satisfy the restrictions. "Successful execution" means that *no* invocation of that vop with *any* argument TNs with the specified primitive types will result in any sort of error or in any trap such as illegal memory access, floating point overflow, divide by 0. Also, the result generated must be in the representation demanded by the result TN. If there are any result type restrictions, then it must be harmless for them to be violated (as long as nobody uses the result.) For example, it a lowtag scheme, it is harmless for a fixnum output restriction to be violated. [But not in a hightag scheme, since a garbaged type code results from overflow.] Type coercion operations aren't subject to aggressive motion (i.e. coerce-from-t), since coercion requires the argument to be the right type, yet coercion operations aren't ever emitted unless the argument type is T. Data structure accesses are o.k. as long as we establish that the operand is some sort of pointer (boxed in the case of boxed accessors). [But we can't bless the result of moved code as being of the correct type. If we do a bogus reference, we may pick up an some immediate object, rather than the pointer we were expecting. So we can't do aggressive motion of a whole reference expression, only of the innermost reference.] Note that we could still do aggressive motion of structure accesses, even though the primitive type discards the exact structure type. All could translate to STRUCTURE, assuming that mixing up accessors doesn't result in death. [Systems that use protected pages to mark the end of the heap could get confused, though.] This means that the kinds of operations subject to aggressive motion is relatively limited. Considering the complexity and subtle issues involved, it seems questionable whether it is worth attempting. The only interesting case seems to be fixnum arithmetic, and this only works given assumptions about the legality of fixnum overflow. Aggressive motion of special references (including boundp checks) is probably culturally acceptable, but the correctness is dubious. We could do it in unsafe code, though. I guess there also wouldn't be any problem if we factored out the boundp check into a special check operation. COMMON SUBEXPRESSION ELIMINATION This phase is optional, but should be done if speed or space is more important than compilation speed. This phase scans each block and combines VOPs that compute the same expression. We scan forward, adding VOPs to some kind of hashtable if they have an attribute indicating that they are a candidate for a subexpression, and clearing out VOPs in the table that are killed by side-effects that we encounter along the way. Probably we want a separate table for unaffected VOPs so that we have fewer to scan when we notice a side-effect. We can deal with invalidating VOPs whose arguments are clobbered by looking at the Tn-refs for each TN that we see being used as a result. Probably many expressions should be killed by function call, since they won't be worth doing two memory references to save. We do this after all the other optimizations so that we can combine duplicated crud that may have been stuck at loop heads by previous optimizations. We could do global common subexpression elimination, but it seems like a lot more trouble that it's worth. Global common subexpression is computationally difficult, both because it uses flow analysis and because it involves creating and repeatedly scanning large sets of expressions. Local common subexpression probably gets us most of the win; common subexpression elimination is more of a space optimization that a time optimization anyway, since loop invariant elimination moves the common subexpressions out of loops. [### On the other hand, GCS could optimize stuff that safe loop invariant optimization is too wimpy to handle. The problem is that we can only optimize invariants that dominate the exit (unless we can somehow prove to ourselves that the operation cannot possibly result in an error.) Also, it isn't clear that symbolic programs really do spend most of their time burning in inner loops. With GCS, if the expression has already been computed outside of the loop, then we know we can flush the evaluation; there is no need to worry about safety. Of course, we are already planning to do GCS-like optimization of of type checks in the IR1 type constraint propagation step, so we won't get any type checks this way. Special references and array index calculations (as well as garbage macro expansions) are a possibility, though.] Probably we would represent an "expression" by arbitrarily designating the first use we see of a particular VOP/argument combination as the "expression". A pre-pass would build sets of the operations generated, those killed by assignments, and a representation of the aggregate side-effects of the block. We may want to divide the set of available expressions according to whether the expression's VOP is affected by any side-effects. This is because during flow analysis of any block that has side-effects, we need to iterate over all of the possibly affected VOPs to see which which expressions need to be killed (since we don't the pre-pass to have to compute the set of all expressions killed by the side-effects.) It is likely that the set of available affectable expressions will be smaller than the unaffected set, since affected expressions tend to be killed by side effects. REPRESENTATION SELECTION #| It seems that now that representation selection and coercions are done by a pre-pass to pack, IR2 conversion will have to leave stubs for representation specific move operations. These will be VOPs that representation selection will recognize by name, replacing them with the appropriate representation specific VOPs. For MOVE itself, we just use the MOVE VOP. If it turns out that a funny move is needed, then we delete the MOVE and insert the new VOP. I guess we can similarly special-case MOVE-ARGUMENT and MOVE-RESULT. Probably this is a good time to stick in the new (but old) argument passing mechanism in which we move arguments into the callee's frame. If we do this, it eliminates the need for a separate MOVE-RESULT. [### But this opens the question of how we handle passing of arguments on the number stack. It seems that it doesn't work very well to index the number stack off of the top pointer, as it makes it difficult to discuss moving between frames. Maybe we should allocate a NFP whenever it is possible that something will be allocated on the number stack. But how do we know this before pack runs? I guess representation selection can note if it chose a SC (or an alternate SC) that is on the number stack. Should the NFP be passed as an arg if it exists? It seems that we might do well to defer the whole operand moving aspect of call/known-return until representation selection. Maybe we should leave nothing in the more operand until representation selection, and have a different marker VOP that accepts the argument/result values. This marker VOP would be replaced with the appropriate MOVE VOPs. Or perhaps we can flush the call/known-return more operands? The more operand would be moved to the MOVE-ARGUMENTS VOP, which would be emitted right before the call/known-return. The appropriate move/coerce VOPs would be inserted before this VOP, which would remain in the IR2 to terminate the lifetimes. Before the moves would come the VOPs to create the necessary frame(s). Perhaps these would be emitted during IR2tran and would be passed to the call VOP. All these VOPs would inhibit automatic loading of the frame pointer operands, and would only load when necessary. Only allocate number stack FP when we need it: when there is a WITH-STACK-ALIEN. We determine this by seeing if there are any refs of the funny function %WITH-STACK-ALIEN in the current function. If so, we allocate a number stack FP restricted to a register. If NFP is allocated, it is passed as a codegen info arg to ALLOCATE-FRAME. New interface: instead of CURRENT-FRAME-SIZE, have CURRENT-SB-SIZE which returns the current element size of the named SB. How can we have primitive types that overlap, i.e. (UNSIGNED-BYTE 32), (SIGNED-BYTE 32), FIXNUM? Primitive types are used for two things: Representation selection: which SCs can be used to represent this value? For this purpose, it isn't necessary that primitive types be disjoint, since any primitive type can choose an arbitrary set of representations. For moves between the overlapping representations, the move/load operations can just be noops when the locations are the same (vanilla MOVE), since any bad moves should be caught out by type checking. VOP selection: Is this operand legal for this VOP? When ptypes overlap in interesting ways, there is a problem with allowing just a simple ptype restriction, since we might want to allow multiple ptypes. This could be handled by allowing "union primitive types", or by allowing multiple primitive types to be specified (only in the operand restriction.) The latter would be long the lines of other more flexible VOP operand restriction mechanisms, (constant, etc.) Ensure that load/save-operand never need to do representation conversion. The PRIMITIVE-TYPE more/coerce info would be moved into the SC. This could perhaps go along with flushing the TN-COSTS. We would annotate the TN with best SC, which implies the representation (boxed or unboxed). We would still need represent the legal SCs for restricted TNs somehow, and also would have to come up with some other way for pack to keep track of which SCs we have already tried. A SC would have a list of "alternate" SCs and a boolean SAVE-P value that indicates it needs to be saved across calls in some non-SAVE-P SC. A TN is initially given its "best" SC. The SC is annotated with VOPs that are used for moving between the SC and its alternate SCs (load/save operand, save/restore register). It is also annotated with the "move" VOPs used for moving between this SC and all other SCs it is possible to move between. We flush the idea that there is only c-to-t and c-from-t. But how does this mesh with the idea of putting operand load/save back into the generator? Maybe we should instead specify a load/save function? The load/save functions would also differ from the move VOPs in that they would only be called when the TN is in fact in that particular alternate SC, whereas the move VOPs will be associated with the primary SC, and will be emitted before it is known whether the TN will be packed in the primary SC or an alternate. I guess a packed SC could also have immediate SCs as alternate SCs, and constant loading functions could be associated with SCs using this mechanism. So given a TN packed in SC X and a SC restriction for Y and Z, how do we know which load function to call? There would be ambiguity if X was an alternate for both Y and Z and they specified different load functions. This seems unlikely to arise in practice, though, so we could just detect the ambiguity and give an error at define-vop time. If they are doing something totally weird, they can always inhibit loading and roll their own. Note that loading costs can be specified at the same time (same syntax) as association of loading functions with SCs. It seems that maybe we will be rolling DEFINE-SAVE-SCS and DEFINE-MOVE-COSTS into DEFINE-STORAGE-CLASS. Fortunately, these changes will affect most VOP definitions very little. LIFETIME ANALYSIS This phase is a preliminary to Pack. It involves three passes: -- A pre-pass that computes the DEF and USE sets for live TN analysis, while also assigning local TN numbers, splitting blocks if necessary. ### But not really... -- A flow analysis pass that does backward flow analysis on the component to find the live TNs at each block boundary. -- A post-pass that finds the conflict set for each TN. #| Exploit the fact that a single VOP can only exhaust LTN numbers when there are large more operands. Since more operand reference cannot be interleaved with temporary reference, the references all effectively occur at the same time. This means that we can assign all the more args and all the more results the same LTN number and the same lifetime info. |# Flow analysis: It seems we could use the global-conflicts structures during compute the inter-block lifetime information. The pre-pass creates all the global-conflicts for blocks that global TNs are referenced in. The flow analysis pass just adds always-live global-conflicts for the other blocks the TNs are live in. In addition to possibly being more efficient than SSets, this would directly result in the desired global-conflicts information, rather that having to create it from another representation. The DFO sorted per-TN global-conflicts thread suggests some kind of algorithm based on the manipulation of the sets of blocks each TN is live in (which is what we really want), rather than the set of TNs live in each block. If we sorted the per-TN global-conflicts in reverse DFO (which is just as good for determining conflicts between TNs), then it seems we could scan though the conflicts simultaneously with our flow-analysis scan through the blocks. The flow analysis step is the following: If a TN is always-live or read-before-written in a successor block, then we make it always-live in the current block unless there are already global-conflicts recorded for that TN in this block. The iteration terminates when we don't add any new global-conflicts during a pass. We may also want to promote TNs only read within a block to always-live when the TN is live in a successor. This should be easy enough as long as the global-conflicts structure contains this kind of info. The critical operation here is determining whether a given global TN has global conflicts in a given block. Note that since we scan the blocks in DFO, and the global-conflicts are sorted in DFO, if we give each global TN a pointer to the global-conflicts for the last block we checked the TN was in, then we can guarantee that the global-conflicts we are looking for are always at or after that pointer. If we need to insert a new structure, then the pointer will help us rapidly find the place to do the insertion.] Conflict detection: [### Environment, :more TNs.] This phase makes use of the results of lifetime analysis to find the set of TNs that have lifetimes overlapping with those of each TN. We also annotate call VOPs with information about the live TNs so that code generation knows which registers need to be saved. The basic action is a backward scan of each block, looking at each TN-Ref and maintaining a set of the currently live TNs. When we see a read, we check if the TN is in the live set. If not, we: -- Add the TN to the conflict set for every currently live TN, -- Union the set of currently live TNs with the conflict set for the TN, and -- Add the TN to the set of live TNs. When we see a write for a live TN, we just remove it from the live set. If we see a write to a dead TN, then we update the conflicts sets as for a read, but don't add the TN to the live set. We have to do this so that the bogus write doesn't clobber anything. [We don't consider always-live TNs at all in this process, since the conflict of always-live TNs with other TNs in the block is implicit in the global-conflicts structures. Before we do the scan on a block, we go through the global-conflicts structures of TNs that change liveness in the block, assigning the recorded LTN number to the TN's LTN number for the duration of processing of that block.] Efficiently computing and representing this information calls for some cleverness. It would be prohibitively expensive to represent the full conflict set for every TN with sparse sets, as is done at the block-level. Although it wouldn't cause non-linear behavior, it would require a complex linked structure containing tens of elements to be created for every TN. Fortunately we can improve on this if we take into account the fact that most TNs are "local" TNs: TNs which have all their uses in one block. First, many global TNs will be either live or dead for the entire duration of a given block. We can represent the conflict between global TNs live throughout the block and TNs local to the block by storing the set of always-live global TNs in the block. This reduces the number of global TNs that must be represented in the conflicts for local TNs. Second, we can represent conflicts within a block using bit-vectors. Each TN that changes liveness within a block is assigned a local TN number. Local conflicts are represented using a fixed-size bit-vector of 64 elements or so which has a 1 for the local TN number of every TN live at that time. The block has a simple-vector which maps from local TN numbers to TNs. Fixed-size vectors reduce the hassle of doing allocations and allow operations to be open-coded in a maximally tense fashion. We can represent the conflicts for a local TN by a single bit-vector indexed by the local TN numbers for that block, but in the global TN case, we need to be able to represent conflicts with arbitrary TNs. We could use a list-like sparse set representation, but then we would have to either special-case global TNs by using the sparse representation within the block, or convert the local conflicts bit-vector to the sparse representation at the block end. Instead, we give each global TN a list of the local conflicts bit-vectors for each block that the TN is live in. If the TN is always-live in a block, then we record that fact instead. This gives us a major reduction in the amount of work we have to do in lifetime analysis at the cost of some increase in the time to iterate over the set during Pack. Since we build the lists of local conflict vectors a block at a time, the blocks in the lists for each TN will be sorted by the block number. The structure also contains the local TN number for the TN in that block. These features allow pack to efficiently determine whether two arbitrary TNs conflict. You just scan the lists in order, skipping blocks that are in only one list by using the block numbers. When we find a block that both TNs are live in, we just check the local TN number of one TN in the local conflicts vector of the other. In order to do these optimizations, we must do a pre-pass that finds the always-live TNs and breaks blocks up into small enough pieces so that we don't run out of local TN numbers. If we can make a block arbitrarily small, then we can guarantee that an arbitrarily small number of TNs change liveness within the block. We must be prepared to make the arguments to unbounded arg count VOPs (such as function call) always-live even when they really aren't. This is enabled by a panic mode in the block splitter: if we discover that the block only contains one VOP and there are still too many TNs that aren't always-live, then we promote the arguments (which we'd better be able to do...). This is done during the pre-scan in lifetime analysis. We can do this because all TNs that change liveness within a block can be found by examining that block: the flow analysis only adds always-live TNs. When we are doing the conflict detection pass, we set the LTN number of global TNs. We can easily detect global TNs that have not been locally mapped because this slot is initially null for global TNs and we null it out after processing each block. We assign all Always-Live TNs to the same local number so that we don't need to treat references to them specially when making the scan. We also annotate call VOPs that do register saving with the TNs that are live during the call, and thus would need to be saved if they are packed in registers. We adjust the costs for TNs that need to be saved so that TNs costing more to save and restore than to reference get packed on the stack. We would also like more often saved TNs to get higher costs so that they are packed in more savable locations. PACKING #| Add lifetime/pack support for pre-packed save TNs. Fix GTN/IR2 conversion to use pre-packed save TNs for old-cont and return-PC. (Will prevent preference from passing location to save location from ever being honored?) We will need to make packing of passing locations smarter before we will be able to target the passing location on the stack in a tail call (when that is where the callee wants it.) Currently, we will almost always pack the passing location in a register without considering whether that is really a good idea. Maybe we should consider schemes that explicitly understand the parallel assignment semantics, and try to do the assignment with a minimum number of temporaries. We only need assignment temps for TNs that appear both as an actual argument value and as a formal parameter of the called function. This only happens in self-recursive functions. Could be a problem with lifetime analysis, though. The write by a move-arg VOP would look like a write in the current env, when it really isn't. If this is a problem, then we might want to make the result TN be an info arg rather than a real operand. But this would only be a problem in recursive calls, anyway. [This would prevent targeting, but targeting across passing locations rarely seems to work anyway.] [### But the :ENVIRONMENT TN mechanism would get confused. Maybe put env explicitly in TN, and have it only always-live in that env, and normal in other envs (or blocks it is written in.) This would allow targeting into environment TNs. I guess we would also want the env/PC save TNs normal in the return block so that we can target them. We could do this by considering env TNs normal in read blocks with no successors. ENV TNs would be treated totally normally in non-env blocks, so we don't have to worry about lifetime analysis getting confused by variable initializations. Do some kind of TN costing to determine when it is more trouble than it is worth to allocate TNs in registers. Change pack ordering to be less pessimal. Pack TNs as they are seen in the LTN map in DFO, which at least in non-block compilations has an effect something like packing main trace TNs first, since control analysis tries to put the good code first. This could also reduce spilling, since it makes it less likely we will clog all registers with global TNs. If we pack a TN with a specified save location on the stack, pack in the specified location. Allow old-cont and return-pc to be kept in registers by adding a new "keep around" kind of TN. These are kind of like environment live, but are only always-live in blocks that they weren't referenced in. Lifetime analysis does a post-pass adding always-live conflicts for each "keep around" TN to those blocks with no conflict for that TN. The distinction between always-live and keep-around allows us to successfully target old-cont and return-pc to passing locations. MAKE-KEEP-AROUND-TN (ptype), PRE-PACK-SAVE-TN (tn scn offset). Environment needs a KEEP-AROUND-TNS slot so that conflict analysis can find them (no special casing is needed after then, they can be made with :NORMAL kind). IR2-component needs PRE-PACKED-SAVE-TNS so that conflict analysis or somebody can copy conflict info from the saved TN. Note that having block granularity in the conflict information doesn't mean that a localized packing scheme would have to do all moves at block boundaries (which would clash with the desire the have saving done as part of this mechanism.) All that it means is that if we want to do a move within the block, we would need to allocate both locations throughout that block (or something). Load TN pack: A location is out for load TN packing if: The location has TN live in it after the VOP for a result, or before the VOP for an argument, or The location is used earlier in the TN-ref list (after) the saved results ref or later in the TN-Ref list (before) the loaded argument's ref. To pack load TNs, we advance the live-tns to the interesting VOP, then repeatedly scan the vop-refs to find vop-local conflicts for each needed load TN. We insert move VOPs and change over the TN-Ref-TNs as we go so the TN-Refs will reflect conflicts with already packed load-TNs. If we fail to pack a load-TN in the desired SC, then we scan the Live-TNs for the SB, looking for a TN that can be packed in an unbounded SB. This TN must then be repacked in the unbounded SB. It is important the load-TNs are never packed in unbounded SBs, since that would invalidate the conflicts info, preventing us from repacking TNs in unbounded SBs. We can't repack in a finite SB, since there might have been load TNs packed in that SB which aren't represented in the original conflict structures. Is it permissible to "restrict" an operand to an unbounded SC? Not impossible to satisfy as long as a finite SC is also allowed. But in practice, no restriction would probably be as good. Register saving needs some thought. What is the representation of the save info? How about a list of the TNs live after the args are read and before the results are written... [Same as all TNs live after call that aren't results.] Have a pack post-pass that assigns a stack save-tn to each TN that needs to be saved? Just select a location on the stack for the original TN, then pack the save TN there instead. We need to somehow associate the save TN with the saved TN. I guess we use the info from define-save-scs to tell what SC to allocate the save TN in. [Note that load-tns never need to be saved, so this could be jammed into load TN pack with little difficulty.] Actually, we could emit explicit moves (?). That would have the advantage of removing complexity from the call VOPs. But the move VOPs had better not need any registers. Or... Instead of annotating the call VOP with the save TNs, we could emit the save/restore code after lifetime analysis (in the post-pass) and before pack. An advantage of pushing the saving into the call VOPs is that it would be more feasible to use load/store multiple operations. Note also that the idea of using LM/STM indicates against allocating explicit save TNs, since it would be difficult for pack to allocate those TNs where the instruction wants them. But at some point, we need to determine what the size of the stack frame is, including the save area. On the other hand, using TNs to represent the save locations probably simplifies life for the code generator, and makes dumping debug information easier. We have to know where each register TN is saved. A TN should always be saved in the same place so that it is feasible to dump this information. This imposes some constraints on the saving discipline. How useful is FSC information? Might want to know if FSC is the same before deciding to attempt targeting. The Costs, SC and Offset slots in the TN would probably all be flushed once loop-local packing is done, since this info would be loop-local. We assume all locations can be used when an sc is based on an unbounded sb. Associate a bit-vector with each loop that represents the set of blocks in the loop, including all nested loops and all code for functions called in the same environment. This allows a quick test for whether a particular block is within the loop we are currently packing, so we know whether conflicts in that block are relevant. To check for conflicts on a particular location in a given extent, we iterate over the global conflicts for the TN, using the bit-vector to tell whether the block is in the extent. We compute this set by doing a walk on the call/loop graph, adding the blocks directly in our loop into the set, then recursing on inferiors and OR'ing their sets into ours. Don't have to pack loops bottom-up breadth-first, since parallel loops are disjoint. We only have to pack all inferior loops before packing each loop: a post-order walk of the loop nesting. It also makes no difference which order we pack functions in as long as they have disjoint environments (modulo boundary conditions such as passing locations). If we have functions that share their environment, then the code in the function is nested within several loops. This causes problems, since packing loops surrounding one call would add conflicts in the called function which would interfere with packing more inner loops surrounding other calls. But this probably wouldn't cause much lossage, since loops within the called function would still get packed first. The only significant effect would be that one caller would be favored in allowing stuff to stay in registers over the call. And this is only an issue if we do implement such calls... Even dealing with this issue sub-optimally, it would surely be better than doing all that saving and restoring. Constant caching: the current scheme of doing lifetime analysis for all constant TNs is losing pretty badly. Probably we want some kind of constant TN that we don't attempt to cache, and then we selectively emit cached constant TNs depending on policy and perhaps on a guess for the likely desirability of caching based on the number of references, loop depth, kind of constant (cost of loading), etc. For example, caching an immediate constant is probably rarely (only worthwhile when we can squeeze out any move from the cache register, since the immediate load would be as fast as the register-to-register move.) Note that we need to discover which TNs are immediate constants anyway so that we can give them special costs which allow packing in the constant SCs. Perhaps have separately specified constant SCs for each primitive-type. These SCs would be in addition to the packed SCs when doing representation selection/pack. If a constant SC comes up best, then we don't attempt to cache. Note that specifying a generator restriction that restricts an argument to a constant SC causes costs to be computed that allow the operand TN only into the constant SCs. This would prevent caching. If you just want to emit an immediate variant of an instruction in the constant case, then the thing to do is to use only one generator that explicitly allows the immediate SC, and then do a sc-case off of the operand within the generator body. Of course, it is silly inhibiting caching of a constant just because someone isn't prepared to use the cache. Such users can still use the constant directly even when it is cached. I guess the way to reflect this would be to have the costs for the ref be all zeros, saying we don't care what the SC is, but when selecting the generator we still don't consider these other SCs to be possible. Specifying 0 move costs into the constant SCs would have the first effect, but would also allow the immediate generator to be used for any SC. [### But it probably isn't worth trying to fix restrictions to constant SCs, since it doesn't seem to be useful. We only need the power when an operand being in a particular constant SC modifies the costs for other operands. This doesn't seem very likely. Representation selection is a choice of the best representations for each TN, with the best representations for a given VOP taken as constant. This means that any interesting choice about the "best code sequence" is made at IR2 conversion time. There just don't seem to be any interesting choices that can't be made then. If we allowed conditional packing based on SCs, then we could do things like squeeze out the case of operations with a massively better sequence when an argument is constant, but this isn't supported by the current multi-generator code, and could probably be supported better by having support for automatic special-casing of constants in IR2 conversion. ] It might be useful to make the write-p slot encode more information. How about making it a "Kind" field, having values such as :Argument, :Result, :Temp-Read and :Temp-Write? Would it be a good idea to explicitly emit Move vops to represent operand loading and saving? It would make generator functions simpler, and would allow all knowledge of moving to be moved into the Move VOP (maybe... targeting?). It would also allow the Load-TN slot in the TN-Ref structure to be eliminated, since the TN-Refs for the actual VOP would directly reference the load TN. But, this would also disallow the possibility of a VOP getting control over the loading process. This seemed to be necessary for loading operations that require additional temporaries. But on the other hand, considering the loading operations to be VOPs might make this fall out more naturally. If we pack from inner loops to outer, one loop at a time, then it is unimportant (undesirable?) to consider loop depth in costs. This simplifies cost determination stuff. This isn't a pessimization: in our new interpretation, the purpose of costs is determining the best representation for a value within a given pack extent (loop). Weighting by loop factors represented the desirability of favoring the inner loop if its FSC differed from that in the outer loop. In our new scheme, each loop can have its own FSC. Since at any given time, we only pack TNs at the same loop level, all TNs are equally "important". Even supposing we wanted to use some cost-based "importance" to determine packing order, the loop depth would play no role. I guess the old scheme did provide a way to trade off the enhanced importance of an inner loop with important things outside the loop, which the new algorithm is fundamentally unable to do. But it is unlikely that this loss offsets all the gains of more localized packing. TN-Refs are be convenient structures to build the target graph out of. If we allocated space in every TN-Ref, then there would certainly be enough to represent arbitrary target graphs. Would it be enough to allocate a single Target slot? If there is a target path though a given VOP, then the Target of the write ref would be the read, and vice-versa. To find all the TNs that target us, we look at the TN for the target of all our write refs. We separately chain together the read refs and the write refs for a TN, allowing easy determination of things such as whether a TN has only a single definition or has no reads. It would also allow easier traversal of the target graph. When computing preference sets, take usage of TN into consideration. TNs that are only used once (i.e. have a single write and read) are much easier to fold together. If we only joined single-use, non-conflicting TNs into preference sets, this would amount to computing a target path. A lot of the preferences that it is important to honor are of this relatively simple nature (including most local preferences). Do we want to consider more explicitly doing targeting? This would be a phase that runs after lifetime analysis, but before back. It would mash together TNs that we would like to be on a particular target path. An icky issue here is targeting values to wired (or even restricted) locations. Basically, we can't do this in a pre-pass to pack, since extending such TN lifetimes could cause packing to fail. Pack must somehow do something sensible with preferences to wired TNs if full (and miscop) call is going to win. Basing preference analysis more closely on examination of the code also makes more sense in packing algorithms where allocation is done more locally during a sort of code walk. But having some sort of global preference mechanism is probably useful for encouraging compatible packing between the separately packed units. Perhaps we want to incrementally build a data structure that represents affinities between TNs and *locations* during the packing process? However we represent this, we want it to be able to represent the implicit moves that may have to be done when we enter or exit a pack region. But if we always pack from inner loops out, then we will always be able to express our desire as a desire to pack the TN in a certain location. We also need to express targeting to wired TNs, but once again this is a preference to a location, rather than to a TN. But we also need to communicate across, since we would like parallel inner loops to choose the same location so that the enclosing loop can agree with both. When packing in a loop, consider all conflicts for the TN in blocks in enclosed loops, as well as conflicts for blocks directly in that loop. This allows packing of inner loops to totally ignore TNs that aren't referenced within the current loop or its inferiors. In other words, when we pack a TN, we pack it in that loop and in all inferiors of that loop, instead of just in that loop. Represent per-scarce-location conflicts as vectors indexed by block number of per-block conflict info. To test whether a TN conflicts on a location, we would then have to iterate over the TNs global-conflicts, using the block number and LTN number to check for a conflict in that block. But since most TNs are local, this test actually isn't much more expensive than indexing into a bit-vector by GTN numbers. The big win of this scheme is that it is much cheaper to add conflicts into the conflict set for a location, since we never need to actually compute the conflict set in a list-like representation (which requires iterating over the LTN conflicts vectors and unioning in the always-live TNs). Instead, we just iterate over the global-conflicts for the TN, using BIT-IOR to combine the conflict set with the bit-vector for that block in that location, or marking that block/location combination as being always-live if the conflict is always-live. Generating the conflict set is inherently more costly, since although we believe the conflict set size to be roughly constant, it can easily contain tens of elements. We would have to generate these moderately large lists for all TNs, including local TNs. In contrast, the proposed scheme does work proportional to the number of blocks the TN is live in, which is small on average (1 for local TNs). This win exists independently from the win of not having to iterate over LTN conflict vectors. [### Note that since we never do bitwise iteration over the LTN conflict vectors, part of the motivation for keeping these a small fixed size has been removed. But it would still be useful to keep the size fixed so that we can easily recycle the bit-vectors, and so that we could potentially have maximally tense special primitives for doing clear and bit-ior on these vectors.] This scheme is somewhat more space-intensive than having a per-location bit-vector. Each vector entry would be something like 150 bits rather than one bit, but this is mitigated by the number of blocks being 5-10x smaller than the number of TNs. This seems like an acceptable overhead, a small fraction of the total IR2 representation. The space overhead could also be reduced by using something equivalent to a two-dimensional bit array, indexed first by LTN numbers, and then block numbers (instead of using a simple-vector of separate bit-vectors.) This would eliminate space wastage due to bit-vector overheads, which might be 50% or more, and would also make efficient zeroing of the vectors more straightforward. We would then want efficient operations for OR'ing LTN conflict vectors with rows in the array. This representation also opens a whole new range of allocation algorithms: ones that store allocate TNs in different locations within different portions of the program. This is because we can now represent a location being used to hold a certain TN within an arbitrary subset of the blocks the TN is referenced in. Pack goals: Pack should: Subject to resource constraints: -- Minimize use costs -- "Register allocation" Allocate as many values as possible in scarce "good" locations, attempting to minimize the aggregate use cost for the entire program. -- "Representation selection" When multiple representations for a value are possible, choose the one that has the lowest cost according to the context-sensitive use information. -- "Save optimization" Don't allocate values in registers when the save/restore costs exceed the expected gain for keeping the value in a register. (Similar to "opening costs" in RAOC.) [Really just a case of representation selection.] -- Minimize preference costs Eliminate as many moves as possible. "Register allocation" is basically an attempt to eliminate moves between registers and memory. "Save optimization" counterbalances "register allocation" to prevent it from becoming a pessimization, since saves can introduce register/memory moves. Preference optimization reduces the number of moves within an SC. Doing a good job of honoring preferences is important to the success of the compiler, since we have assumed in many places that moves will usually be optimized away. The "representation selection" problem is similar in formulation to "register allocation", but it seems that we can expect to handle it somewhat better because of its more local character. It isn't so much a zero-sum game; we are choosing between two good possibilities, rather than choosing which losses are more acceptable. A lot of the generality in our cost model is more aimed toward representation selection than toward register optimization; we don't necessarily need to make all the subtleties of context-sensitive cost determination apparent to the scarcity-oriented aspects of pack (such as priority determination). We want to have the "correct" FSC before we go to pack, but it isn't crucial that we keep the rank up to date during the packing process. The scarcity-oriented aspect of "register allocation" is handled by a greedy algorithm in pack. We try to pack the "most important" TNs first, under the theory that earlier packing is more likely to succeed due to fewer constraints. The drawback of greedy algorithms is their inability to look ahead. Packing a TN may mess up later "register allocation" by precluding packing of TNs that are individually "less important", but more important in aggregate. Packing a TN may also prevent preferences from being honored. The "voting" algorithms described in RAOC attempt to avoid the "bad" and favor the "good" by a local examination of the preference/conflict graph for the TN being packed. The basic problem is that at the time we go to pack a TN, we only know stuff about TNs that have already been packed; we don't know how the remaining TNs will be packed, so we don't know how our packing choice will affect the other TNs. In the case of "register allocation", we don't know which location (if any) we can pack that will result in less ultimate cost than in local savings. In the case of "preference optimization", we don't know which preferences we will be precluding by adding additional constraints, since we don't know where unpacked TNs are going to want to be preferenced to. Things seem to be especially ucky in the case of preferences; pack doesn't deal very well with n'th order preference chains. It might be interesting to consider if we could determine good preference equivalence classes in the absence of scarcity constraints. This information can be used directly by unbounded pack, and also might be put to good use by finite pack in determining what preference relationships we would like to exist. The algorithm suggested for unbounded pack is of the right character, but it doesn't consider preference/conflict issues at all. Another thing that we might do in a pre-pass to pack would be to try to get some sort of accurate picture of the resource demands of packing each TN; the number of TNs that it conflicts with and their aggregate cost. This could be used to get a pre-pack estimate of the TNs that should probably not be packed in registers so as to avoid precluding other packings. If we did this kind of stuff, the actual pack phase might look somewhat different. It might start to look more like the "graph coloring" algorithms. What we would do is find the sets of TNs that would like to be packed in the same location for preference reasons, disregarding scarcity constraints. We would then attempt to pack the entire set of TNs in a single (scarce) location. If this isn't possible due to conflicts, then we either pack some subset of the TNs, allowing the remaining TNs to be packed in a single unbounded location, or we unpack some TNs already packed in a scarce location (moving them to their preference-set's corresponding unbounded location). We could also do some combination of unpacking and subset packing. What we might do is initially pack the subset of the TNs that we think have a modest resource demand, and then attempt to pack TNs with more conflicts later on, after we have packed everything else. A central idea to this is that we assign a stack location to each preference set as though all of the TNs in the set are actually allocated on the stack. Then we can choose to pack any subset of the TNs in the preference set into a register, without worrying about interactions with stack packing. In particular, load-TN packing doesn't really have to do anything at all to eject a TN from a register. All it has to do is change the TN to use the stack location for its preference set. I guess this amounts to regarding registers as a cache for the stack. Perhaps this is a useful way to think about it? ### Is there some good order to consider TNs in that will cleverly exploit the locality in the code? For example, sorting TNs by the "time" they first become live. We might use a particular register to cache a particular preference set over a certain time interval. We would never combine TNs into a preference set if they have different FSCs (or perhaps different FSC SBs), since we really don't want to honor those preferences. And of course, there are no conflicting TNs in a preference set. One way that we might consider preference/conflict issues in an unbounded scheme would be to find sets of TNs that "want" to have their preferences honored, with only partial consideration of conflict issues. We would consider "attractive" preferences: preferences between TNs that don't conflict. We could then label TNs with the component they appear in in the transitive closure of this graph. When determining "attractive" preferences, we might want to try a little harder to consider preferences that seem "good", for example if a TN is preferenced to TNs that mutually conflict, then we would choose which preference we will attempt to honor, and (at least initially) disregard the others. The crucial point is that we want to throw out enough preferences so that there are a fairly large number of components in the closure of the graph, with each of these components representing TNs that are are "fairly strongly" connected. How many preferences we need to throw out depends on how strongly connected the preference graph is once we throw out preferences with direct conflicts. If we consider local preferences at this point (which we would have to), then there are going to be lots of preferences. In this scheme, the local/global preference distinction becomes pretty meaningless. If conflicting TNs appear in a component, then we split the component up so that there are no conflicts. This would basically amount to doing packing separately on each collection of TNs that we believe have interesting preferencing relationships. So the algorithm might looks like this: -- Find the "ideal preference set". This is a collection of TNs that we think would all like to be packed in the same location, but might contain some second-order conflicts. -- Split the ideal preference set as necessary in order to eliminate internal conflicts. One algorithm might be to rank TNs by the total strength of their preferences within the ideal preference set, and then iterate over the TNs in the set in that order, adding all possible ones to an actual preference set, and rejecting TNs that conflict. Rejected TNs would go back to the beginning of the whole packing process. -- Squish preference sets into scarce locations, as mumbled about above (subset packing/unpacking). How about: Representation selection is handled by the initial FSC determination. We select generators before pack, and don't change them. [Which makes conditional packing feasible again. Oh well...] Maybe we want to assume that before pack we determine two SCs for each TN: a finite SC and an unbounded SC. The finite SC may be omitted if the unbounded SC is better. The unbounded SC may be omitted if the TN is restricted. The thing that we are giving up here is the possibility of attempting packing in several finite SCs before packing on the stack. This doesn't seem to be an important limitation. Probably what we would actually do is have an FSC and an NFSC, which might be the same. We compute a "goodness" for each TN that is the difference in the costs of the FSC and the NFSC. We also compute a "badness", which is the sum of the LTN counts of all blocks that the TN is live in (giving some measure of the amount of code that the TN is live in). Then we compute the preference sets. We rank preferences by the benefit (move count * move cost in the FSC) divided by the sum of the "badness" of the preferenced TNs. We then iterate over the ranked preferences, putting the preferenced TNs in the same preference set when the TNs don't conflict and have the same finite and unbounded SCs (or had those SCs missing). But this can result in second-order conflicts within the preference set. We can either compute the aggregate conflicts for the set, and only combine when we don't conflict with any TN in the set, or we can remove conflicts after the fact. Maybe it would be useful to have a sparse representation of the aggregate conflict set of the TNs in the preference set? Perhaps represented with global-conflicts structures (or similar). A list sorted by block number, with either :Live or a LTN conflict bit-vector. An advantage of this is that adding a TN to a preference set would basically involve merging the global-conflict lists, rather than actually computing the conflict set. Using this information, we could consider simultaneously packing all the TNs in a preference set. The advantage would be that we could efficiently compute the aggregate conflict set from the merged global-conflicts lists, instead of having to iterate over the conflicts of each TN that we pack. We would iterate over the TNs in the preference set, packing each TN that doesn't have a conflict for the register in that register, and letting the rest go on the stack. Of course, this would give spurious conflicts on registers, since TNs that went on the stack would still have their conflicts in the corresponding register. But this might work well enough, especially if we pack the preference sets in an order determined by the aggregate ranks of the TNs in the set. |# Assign TNs to SCs using cost and lifetime info. This phase is mandatory. PQCC papers is that we support load TNs. Scarce SB packing: #| Pack all TNs restricted to a finite SC first, before packing any other TNs. These TNs are all VOP temporaries, and it probably wouldn't hurt to assume that all VOP temporaries are restricted to finite SCs. If one isn't, we don't lose anything by packing immediately placing it in an unbounded SB [except that we might pack in a register when we didn't have to, possibly preventing a legal packing.] This eliminates to business of "infinite ranks" and "urgent TNs", which I don't really understand or see how it works. This really amounts to making all VOP temporaries "urgent" from the time they are allocated, eliminating the need to decide when the TN becomes urgent due to SCs filling up. But maybe we kind of have to really do the same thing anyway to keep track of SCs that non-restricted TNs have been unsuccessfully packed in [But no, since we can record that when the unsuccessful packing attempt is made. The problem with urgent TNs is that we need to anticipate failure of packing attempts so that we know a restricted TN's second choice SC is in fact unavailable, and thus must be packed in it's first choice.] Note that there probably won't be many such TNs, since most VOP temporaries will be wired to specific locations, rather than packed. Some kind of penalty for packing like the P@-(-) in RAOC is probably worthwhile. The idea is to disfavor packing TNs that have lots of competing expensive TNs in their conflict set. Using the average positive cost of packing conflicting TNs seems to be a reasonable basis, but we need a factor that indicates the probable number of TNs excluded. Another possible angle to approach this from would be to use the per-location costs obtained from location selection, abandoning packing the TN in that SC if we can't find a location whose cost is more positive than the negated rank of the TN. [Or if we added the rank to each possible location, a location with a non-negative cost.] This approach seems like it is a lot more accurate than the one in RAOC, since it actually looks at individual locations. Instead of permanently rejecting the TN from the SC, we could just knock it's rank down somehow, in the hope that we would find we could pack it after all (as in RAOC). This seems like it might be a bit ugly, since we would have to (at least sort of) keep this penalty up to date. We don't have to do a very good job though. Although ideally we would like to reduce the penalty for conflict-set TNs whenever we pack a TN, we don't really have to. This may result in ranks being artificially low, but at least the TN will get a second chance eventually. We could just have a penalty for each TN that is initially 0. If we don't pack because there is no location good enough, then we store the cost for the best location as the TN penalty, add it to the rank, and re-rank. Since this would result in negative ranks, the effect would be similar to setting the TNs aside until all other TNs have had a chance. |# Here are the location selection techniques, roughly in increasing order of difficulty: 1] We favor locations containing TNs that we are preferenced to with the strength of the preference. We use both the global and local preferences in location selection. This is trivial: we just iterate over our preferences, voting for the locations of packed TNs. 2] Favor packing in locations that an unpacked TN that we *don't* conflict with and are preferenced to is preferenced to, in the hope that we can pack the other TN in that location and honor both preference links. The potential benefit is the sum of the strength of the preferences, but it isn't certain that we will be able to honor the preference, so we should reduce the benefit to reflect this uncertainty. [Not true that benefit is the sum, since we can hardly take credit for causing the second-order preference to be honored. The benefit is at most the first-order preference.] For each unpacked TN that we are preferenced to and don't conflict with, we favor all the locations it is preferenced to and can be packed in (as determined by the per-location conflicts.) 3] Disfavor packing in locations that an unpacked TN you conflict with is preferenced to. If we pack in a location that a TN we conflict with is preferenced to, then the preference cannot be honored. The penalty is proportional to the preference strength, but we might want to make less than the whole, since we might not honor the preference anyway. Finding the conflict set for global TNs is somewhat ugly. We have to iterate over the sets for all the blocks the TN is live in, rejecting duplicates. This is acceptable, since we have to know the full conflict set for the TN we are going to pack anyway. [This is kind of a special case of 4. It says that we disfavor locations extra-strongly if a conflict is preferenced to the location. 4] Favor packing in locations that hold TNs that conflict with unpacked TNs that you conflict with. This is so that we don't gratuitously narrow down the packing opportunities of TNs we conflict with. If we pack in a location that TNs in our conflict set are excluded from, then we don't reduce the number of places they could be packed. The benefit is uncertain, but would be proportional to the rank of the excluded TN, since if we did exclude it the cost would be increased by that amount. Using the conflict set of the conflict set is probably not reasonable, since we don't have this information conveniently lying around -- fortunately, we don't need the second order conflict set. Instead, we iterate over the unpacked TNs in the conflict set, favoring locations that they cannot be packed in. We can easily determine this by indexing the per-location conflicts bit-vector with the TN number of the TN in the conflict set. [To support packing rejection, we would want to change this rule to *disfavor* packing in locations that your conflicts *don't* conflict with. (i.e. disfavor locations that your conflicts can be packed in.)] [The penalty can be the conflicting TNs rank divided by the number of locations in the SC that it might be packed into. Assuming we randomly assign to a location, this is the probability it will be assigned to that location. When a preference? It seems our information about all possible packings of the TN in the SC could be used to scale the preference penalty.] In the scarce resource (register) packing algorithm, we represent the set of conflicts for a location using a vector indexed by the global TN number. The value for a TN's index is the number of TNs packed in that location which conflict with with the TN, thus a TN may be packed in a location whenever the conflict count for its number is zero. We use a conflict count rather than a single bit so that we can unpack TNs by decrementing the conflict count for all the TNs in its conflict set. Although there may be thousands of TNs, using an array representation of the conflicts should be tractable since there are only tens of scarce locations. One might suppose that Pack would have to treat TNs in different environments differently, but this is not the case. Pack simply assigns TNs to locations so that no two conflicting TNs are in the same location. In the process of implementing call semantics in conflict analysis, we cause TNs in different environments not to conflict. In the case of passing TNs, cross environment conflicts do exist, but this reflects reality, since the passing TNs are live in both the caller and the callee. Environment semantics has already been implemented at this point. This means that Pack can pack all TNs simultaneously, using one data structure to represent the conflicts for each location. So we have only one vector of conflict count vectors per SB, rather than one per SB per environment. If we guarantee that load TNs are always packed immediately after they are allocated, then we can avoid having to create TN conflicts information for load TNs, which would be painful if there isn't room for any more local TNs in the block that the reference is in. Instead, we just pack the TN, adding the appropriate conflicts to the location's conflicts vector. Allocating load TNs during Pack is a bit of a pain, since already packed TNs didn't specify the new TN as a conflict when they were packed. When we create a load TN, we must iterate over the conflicts of the TN, incrementing the conflict counts for the new TN number in the locations that conflicting TNs have been packed in. Load TN packing: [### Idea: pack load TNs after all TNs that can be are packed in the scarce SBs. Instead of remembering lifetime information from conflict analysis, we rederive it. We scan each block backward while keeping track of which locations have live TNs in them. When we find a reference that needs a load TN packed, we try to pack it in an unused location. If we can't, we unpack the currently live TN with the lowest cost and force it into an unbounded SC. This isn't as optimal as a scheme that would allow repacking of unpacked TNs, but it is a lot simpler, since we don't have to update much of anything when we pack a load TN. The per-location and per-TN conflict information used by scarce pack doesn't need to be updated, since we are done with scarce pack. Since load TNs can only be packed in scarce SBs, they can't conflict with any TNs that haven't been packed yet. We also don't need to create any TN-Refs for load TNs. [??? How do we keep track of load-tn lifetimes? It isn't really that hard, I guess. We just remember which load TNs we created at each VOP, killing them when we pass the loading (or saving) step. This suggests we could flush the Refs thread if we were willing to sacrifice some flexibility in explicit temporary lifetimes. Flushing the Refs would make creating the IR2 representation easier.] [### It probably isn't worth trying to get the lifetimes for load TNs exactly right. Instead, we just consider them to conflict with all TNs referenced by the VOP. But this may not be easier than doing it right...] The main possibility for pessimization would arise when there are TNs that can be packed in distinct, overlapping, sets of locations. A TN might be packed in one of the overlap locations when it could be packed elsewhere. A load TN required to be in the other SC might force us to unpack the TN and put it on the stack, when it could still go in a register if we realized this earlier. It may also be a bit harder to satisfy local preferences on the load TNs. [Clever location selection in scarce pack may be able to help these...] Since we don't unpack TNs as far as scarce pack is concerned, we can use a bit-vector to represent the TNs conflicting in a given location. The lifetime analysis done in this second pass can also probably be enhanced into a consistency check. If we see a read of a TN packed in a location which has a different TN currently live, then there is a packing bug. If any of the TNs recorded as being live at the block beginning are packed in a scarce SB, but aren't current in that location, then we also have a problem. Location selection is much simpler than in scarce Pack. For one, there are no global preferences on load TNs. The conflict structure is also much less interesting, since the load TNs for arguments and results all conflict with each other, and don't conflict with much else. Probably we just try packing in locations that we have local preferences to before trying at random. There isn't any need for voting. ] The "assumption of infinite registers" used by PQCC seems like it may be a lose on the RT. The assumption that they make is that they can get away without allocating registers for operand loading, under the optimistic assumption that they won't necessary. It looks to me like they get away with it primarily because many architectures don't require register operands in most cases, so even if the arg TN doesn't make it into a register, they won't have to allocate a load register. This is definitely not the case on a load-store architecture. A possible solution would be to get Pack to allocate operand loading TNs. What we do is have an optional SC requirement associated with TN-refs. If we pack the TN in an SC which is different from the required SC for the reference, then we create a TN for each such reference, and pack it into the required SC. In many cases we will be able to pack the load TN with no hassle, but in general we may need to unpack a TN that has already been packed. The only real trick in unpacking is choosing a TN to unpack that can be unpacked with minimum disruption and pessimization. We only need to consider unpacking TNs that conflict with the one we need to pack, and obviously we would prefer to unpack the least important such TN. We must avoid unpacking TNs that we may not be able to repack: TNs that are restricted to finite SCs. Probably this can be done with a simple modification of the location selection algorithm. When we throw the switch, packing in a location which conflicts with an unpackable TN would be given an appropriate finite cost, rather than an infinite one. Once we have decided on a location, we eject any TN that we conflict with. It is possible that the load TNs for the victim will force unpacking of yet another TN, but this process should be rare and limited, since each unpacking increases the odds of successful load TN packing due to replacing TNs that have long lifetimes with short-lived load TNs. Unbounded SB packing: We use a different packing algorithm for packing in SBs that represent an abundant resource such as memory. If we know that all TNs that want to be packed in the SB can be packed in it, then we can optimize preference costs at the expense of storage usage. When we attempt to pack in an abundant SB, we just add the TN to a per-SB list and return success. Our algorithms might also exploit the statistical difference in the kind of TNs that get packed on the stack: overall, most TNs are local and will get packed in registers. The TNs stack allocated are a small fraction of the total, and have a much larger percentage of global TNs. As in scarce Pack, we represent the conflicts for a location with a bit-vector that has a 1 for the number of every TN that cannot be packed in that location, but we index using new TN numbers assigned only to the TNs we are trying to pack on the stack. This should reduce the size of the vectors quite a bit, since the stack TNs will be a small fraction of the total. First, we attempt to honor all global preference links (ignoring local preferences for now). We find all the preference links between TNs in the SB, and rank them by strength. This should reject a lot of preferences, since many preferences will be to TNs that got packed in registers. We then process the preferences in order: -- If one of the preferenced TNs has a location assigned, then we check the location's conflicts to see if we can pack in that location: if not, we skip this preference. If we pack, then we add the new TN's conflicts to the locations conflicts vector, computing them from the TN's conflicts information. -- If both of the preferenced TNs have locations, then we see if the locations can be combined. This involves iterating over the TNs packed in one location and seeing if any of them are in the conflicts for the other location. We maintain a list of the TNs packed in each location so that we can do this iteration efficiently. The lists should be small during the initial preferencing pass, since the size is bounded by the number of TNs that can be combined for purely preference reasons, which should be O(1). We combine the conflicts vectors using bit-or. -- If neither of the TNs have locations, and they don't conflict, then we assign a location to the pair, building a conflicts vector from the combined conflicts. After we have done all we can with the preferences, we pack stuff together in the minimum amount of space: -- Combine the already allocated locations where possible. Although the number of TNs packed in a location may become large at this point, testing for conflicts will still be reasonable, since we iterate over the small list of the location we are combining. We don't even need to update the TN list for the location we are combining into, since we won't ever try to combine it with anything. -- Pack all the remaining TNs allocated in the SB. These are the TNs that didn't have any successful preference (probably a large fraction.) If a TN can't be packed in any existing location, then we make a new one. CONTROL OPTIMIZATION In this phase we annotate blocks with drop-throughs. This controls how code generation linearizes code so that drop-throughs are used most effectively. [### We may want to totally linearize the code here, allowing code generation to scan the blocks using the drop-thru rather than having the figure out what to do when there isn't one.] One question is how much of the original sequencing in the code we should attempt to preserve in IR1 conversion. It seems that this depends on both how smart the compiler is and how dumb the programmer is. Probably we want to preserve the drop-thrus in a TAGBODY at least until we can do loop rotation and similar things. There are basically two aspects to this optimization: 1] Dynamically reducing the number of branches taken v.s. branches not taken under the assumption that branches not taken are cheaper. 2] Statically minimizing the number of unconditional branches, saving space and presumably time. These two goals can conflict, but if they do it seems pretty clear that the dynamic optimization should get preference. The main dynamic optimization is changing the sense of a conditional test so that the more commonly taken branch is the fall-through case. The problem is determining which branch is more commonly taken. The most clear-cut case is where one branch leads out of a loop and the other is within. In this case, clearly the branch within the loop should be preferred. The only added complication is that at some point in the loop there has to be a backward branch, and it is preferable for this branch to be conditional, since an unconditional branch is just a waste of time. In the absence of such good information, we can attempt to guess which branch is more popular on the basis of difference in the cost between the two cases. Min-max strategy suggests that we should choose the cheaper alternative, since the percentagewise improvement is greater when the branch overhead is significant with respect to the cost of the code branched to. A tractable approximation of this is to compare only the costs of the two blocks immediately branched to, since this would avoid having to do any hairy graph walking to find all the code for the consequent and the alternative. It might be worthwhile discriminating against ultra-expensive functions such as ERROR. For this to work, we have to detect when one of the options is empty. In this case, the next for one branch is a successor of the other branch, making the comparison meaningless. We use dominator information to detect this situation. When a branch is empty, one of the predecessors of the first block in the empty branch will be dominated by the first block in the other branch. In such a case we favor the empty branch, since that's about as cheap as you can get. Statically minimizing branches is really a much more tractable problem, but what literature there is makes it look hard. Clearly the thing to do is to use a non-optimal heuristic algorithm. A good possibility is to use an algorithm based on the depth first ordering. We can modify the basic DFO algorithm so that it chooses an ordering which favors any drop-thrus that we may choose for dynamic reasons. When we are walking the graph, we walk the desired drop-thru arc last, which will place it immediately after us in the DFO unless the arc is a retreating arc. We scan through the DFO and whenever we find a block that hasn't been done yet, we build a straight-line segment by setting the drop-thru to the unreached successor block which has the lowest DFN greater than that for the block. We move to the drop-thru block and repeat the process until there is no such block. We then go back to our original scan through the DFO, looking for the head of another straight-line segment. This process will automagically implement all of the dynamic optimizations described above as long as we favor the appropriate IF branch when creating the DFO. Using the DFO will prevent us from making the back branch in a loop the drop-thru, but we need to be clever about favoring IF branches within loops while computing the DFO. The IF join will be favored without any special effort, since we follow through the most favored path until we reach the end. This needs some knowledge about the target machine, since on most machines non-tail-recursive calls will use some sort of call instruction. In this case, the call actually wants to drop through to the return point, rather than dropping through to the beginning of the called function. BRANCH DELAY On machines with delayed branch instructions, we would like to locate code that can be moved into the delay slots after the branch. Although this optimization is usually done as a peephole optimization on assembly code, there are advantages to doing it on VOPs in IR2. The big advantage is that we already have massive semantic information about the VOPs, their operands and side effects. This makes the use of delayed conditional branches much more tractable, and should allow us to fill the delay slots a greater percentage of the time. Another advantage of doing this optimization on IR2 is that we eliminate the need for an assembly code phase in the compiler, since there aren't any other significant peephole optimizations. Basically what we do is locate an appropriate number of VOPs which can be moved into the delay slot, and then mark them so that the code generator for the branch can special-case their generation. Perhaps we would move them to the very end of the block (after any IF VOP). The easiest VOPs to move to the delay slot would be Move VOPs. We would just look for a Move VOP with different source and destination where the source is not written and the destination is not read before the branch. In the case of conditional branches where one arm leads out of the loop, we could attempt to move code from the probable destination backward into the delay slot. In the case where the normal arm is the drop-through (an exit test in a loop body), this is fairly straightforward. In the case of the back branch at the end of a loop, we would have to replicate the delayed VOPs at the ends of all the predecessors of the loop head. We could get clever and try to move things other than Move VOPs. In the case of an unconditional branch this is fairly straightforward. For each VOP, we would need to know if it is a good candidate for a delay slot, i.e. is guaranteed not to expand into illegal stuff. In the case of conditional branches it is a bit more difficult. We can only move VOPs into the delay slot of a conditional branch when it is effectless and we know that nothing "bad" will happen if the VOP is given garbage for its arguments. Doing this pass after Pack is important, since we need to know which Move VOPs actually do anything. Unsurprisingly, there are reasons for wanting to do some stuff before pack. The main problem is that Pack knows nothing about the desirability of not reusing the operands to potential delay VOPs between the VOP and the end of the block. If pack does reuse the operands, then we cannot move the VOP. This could be helped by having preload generation allocate temporaries to hold the arguments of probable delay VOPs. CODE GENERATION This should be fairly straightforward. We lay out the blocks somehow or other, taking into account the dropthrus, and then translate VOPs into instruction sequences on a per-block basis. The linearization process should attempt to minimize branch lengths and place related code together. After code generation, the IR2 representation is gone. Everything is represented by the assembler data structures. ASSEMBLY [### So far as input to the dumper/loader, how about having a list of Entry-Info structures in the IR2-Component? These structures contain all information needed to dump the associated function objects, and are only implicitly associated with the functional/XEP data structures. Load-time constants that reference these function objects should specify the Entry-Info, rather than the functional (or something). We would then need to maintain some sort of association so IR2 conversion can find the appropriate Entry-Info. Alternatively, we could initially reference the functional, and then later clobber the reference to the Entry-Info. We have some kind of post-pass that runs after assembly, going through the functions and constants, annotating the IR2-Component for the benefit of the dumper: Resolve :Label load-time constants. Make the debug info. Make the entry-info structures. Fasl dumper and in-core loader are implementation (but not instruction set) dependent, so we want to give them a clear interface. open-fasl-file name => fasl-file Returns a "fasl-file" object representing all state needed by the dumper. We objectify the state, since the fasdumper should be reentrant. (but could fail to be at first.) close-fasl-file fasl-file abort-p Close the specified fasl-file. fasl-dump-component component code-vector length fixups fasl-file Dump the code, constants, etc. for component. Code-Vector is a vector holding the assembled code. Length is the number of elements of Vector that are actually in use. Fixups is a list of conses (offset . fixup) describing the locations and things that need to be fixed up at load time. If the component is a top-level component, then the top-level lambda will be called after the component is loaded. load-component component code-vector length fixups Like Fasl-Dump-Component, but directly installs the code in core, running any top-level code immediately. (???) but we need some way to glue together the componenents, since we don't have a fasl table. Dumping: Dump code for each component after compiling that component, but defer dumping of other stuff. We do the fixups on the code vectors, and accumulate them in the table. We have to grovel the constants for each component after compiling that component so that we can fix up load-time constants. Load-time constants are values needed my the code that are computed after code generation/assembly time. Since the code is fixed at this point, load-time constants are always represented as non-immediate constants in the constant pool. A load-time constant is distinguished by being a cons (Kind . What), instead of a Constant leaf. Kind is a keyword indicating how the constant is computed, and What is some context. Some interesting load-time constants: (:label .