Sunday, 6 May 2012

Learning garbage collection theory

I want to learn the theory behind garbage collection. How do i go about it?
I am also a dabbler interested in garbage collection (to the point that I wrote my own garbage collected VM called HLVM). I learned by reading as many research papers on garbage collection as I could get my hands on and by playing with the ideas myself, both raw in my virtual machine and also by writing memory-safe high-level simulations.
The obvious answer is - a compiler textbook... The question is, is it necessary to learn lexical analysis, parsing and other stuff that usually precedes garbage collection in a text?
The lexical analysis, parsing and other stuff is not relevant to garbage collection. You might get an outdated cursory overview of garbage collection from a compiler book but you need to read research papers to get an up-to-date view, e.g. with regard to multicore.
In short, what are the prerequisites to learning about Garbage collection theory?
You need to know about basic graph theory, pointers, stacks, threads and (if you're interested in multi-threading) low-level concurrency primitives such as locks.
Garbage collection is all about determining reachability. When a program can no longer obtain a reference to a value because that value has become unreachable then the GC can recycle the memory that value is occupying. Reachability is determined by traversing the heap starting from a set of "global roots" (global variables and pointers on the thread's stacks and in the core's registers)
GC design has many facets but you might begin with the four main garbage collection algorithms:
  • Mark-and-sweep (McCarthy, 1960)
  • Mark-and-compact (Haddon and Waite, 1967)
  • Stop-and-copy (Cheney, 1970)
  • Mark-region (McKinley et al., 2007)
Perhaps the most notable evolution of these basic ideas is generational garbage collection, which was the defacto standard design for many years.
My personal feeling is that some of the obscure work on garbage collection conveys just as much useful information so I'd also highly recommend:
You might also like to study the three kinds of write barrier (Dijkstra's, Steele's and Yuasa's) and look at the card marking and remembered set techniques.
Then you might also like to examine the actual design decisions some implementors chose for language implementations like Java and .NET as well as the SML/NJ compiler for Standard ML, the OCaml compiler, the Glasgow Haskell compiler and others. The differences between the techniques adopted are as big as the similarities between them!
There are also some great tangentially-related papers like Henderson's Accurate Garbage Collection in an Uncooperative Environment. I used that technique to avoid having to write a stack walker for HLVM.
The website is an invaluable resource, particularly the glossary of GC-related terms. Finally, the Garbage Collection Handbook by Jones et al. is a superb monograph on this subject.

Why do garbage collectors pause?

GCs work by tracing reachable heap blocks starting from a set of global roots (global variables, thread stacks and CPU registers). GCs lie on a sliding scale from snapshot to on-the-fly. Snapshot GCs work from a snapshot of the global roots and heap topology. On-the-fly GCs incrementally update their interpretation of the heap as the mutators run.
Wholly snapshot GCs attain high throughput because the collector runs almost entirely independently of the mutators but have high latency because taking a snapshot incurs a pause. Wholly on-the-fly GCs attain low latency because everything is done incrementally but low throughput because of the fine grained communication between the mutators and GC.
In practice, all GCs lie somewhere between these two extremes. VCGC is primarily a snapshot GC but it uses a write barrier to keep the collector apprised of changes to the heap topology. Staccato was the world's first parallel and concurrent and real-time GC but it still batches up some operations in order to retain the efficiency of stack allocation.

Functors vs generics

From this Stack Overflow question:
If I understand it correctly, functor gives you a new set of functions from a type you provide
More generally, functors map modules to modules. Your Set example maps a module adhering to theORDERED_TYPE signature to a module implementing a set. The ORDERED_TYPE signature requires a type and a comparison function. Therefore, your C# is not equivalent because it parameterizes the set only over the type and not over the comparison function. So your C# can only implement one set type for each element type whereas the functor can implement many set modules for each element module, e.g. in ascending and descending order.
Even if functors are just generics-for-namespace, what's the significant advantage of that approach?
One major advantage of a higher-order module system is the ability to gradually refine interfaces. In OOP, everything is public or private (or sometimes protected or internal etc.). With modules, you can gradually refine module signatures at will giving more public access closer to the internals of a module and abstracting more and more of it away as you get further from that part of the code. I find that to be a considerable benefit.
Two examples where higher-order module systems shine compared to OOP are parameterizing data structure implementations over each other and building extensible graph libraries. See the section on "Structural abstraction" in Chris Okasaki's PhD thesis for examples of data structures parameterized over other data structures, e.g. a functor that converts a queue into a catenable list. See OCaml's excellent ocamlgraph library and the paper Designing a Generic Graph Library using ML Functors for an example of extensible and reusable graph algorithms using functors.
Classes can also be used as ad-hoc namespaces using nested classes.
In C#, you cannot parameterize classes over other classes. In C++, you can do some things like inheriting from a class passed in via a template.
Also, you can curry functors...

Is metaprogramming possible in C#?

Metaprogramming is possible in .NET (see compiler compilers, regular expressions, code DOM, reflection etc.) but C# is not capable of template metaprogramming because it does not have that language feature.

Why don't purely functional languages use reference counting?

You can create cyclic data structures using purely functional programming simply by defining mutually-recursive values at the same time. For example, two mutually-recursive lists in OCaml:
let rec xs = 0::ys and ys = 1::xs
However, it is possible to define languages that make it impossible to create cyclic structures by design. The result is known as a unidirectional heap and its primary advantage is that garbage collection can be as simple as reference counting.
Some real languages do prohibit cycles and use reference counting. Erlang and Mathematica are examples. For example, in Mathematica when you reference a value you make a deep copy of it so mutating the original does not mutate the copy:
In[1] := xs = {1, 2, 3}
Out[1] = {1, 2, 3}

In[2] := ys = xs
Out[2] = {1, 2, 3}

In[3] := xs[[1]] = 5
Out[3] = 5

In[4] := xs
Out[4] = {5, 2, 3}

In[5] := ys
Out[5] = {1, 2, 3}

What's wrong with C#?

  • null everywhere.
  • const nowhere.
  • APIs are inconsistent, e.g. mutating an array returns void but appending to a StringBufferreturns the same mutable StringBuffer.
  • Collection interfaces are incompatible with immutable data structures, e.g. Add in System.Collections.Generic.IList<_> cannot return a result.
  • No structural typing so you write System.Windows.Media.Effects.SamplingMode.Bilinear instead of just Bilinear.
  • Mutable IEnumerator interface implemented by classes when it should be an immutable struct.
  • Equality and comparison are a mess: you've got System.IComparable and Equals but then you've also got:
    System.IComparable<_> System.IEquatableSystem.Collections.IComparer System.Collections.IStructuralComparable System.Collections.IStructuralEquatable System.Collections.Generic.IComparer System.Collections.Generic.IEqualityComparer
  • Tuples should be structs but structs unnecessarily inhibit tail call elimination so one of the most common and fundamental data types will allocate unnecessarily and destroy scalable parallelism.

Algebraic data types vs objects

I am a long time OO programmer and a functional programming newbie. From my little exposure algebraic data types only look like a special case of inheritance to me where you only have one level hierarchy and the super class cannot be extended outside the module.
You are describing closed sum types, the most common form of algebraic data types, as seen in F# and Haskell. Basically, everyone agrees that they are a useful feature to have in the type system, primarily because pattern matching makes it easy to dissect them by shape as well as by content and also because they permit exhaustiveness and redundancy checking.
However, there are other forms of algebraic datatypes. An important limitation of the conventional form is that they are closed, meaning that a previously-defined closed sum type cannot be extended with new type constructors (part of a more general problem known as "the expression problem"). OCaml's polymorphic variants allow both open and closed sum types and, in particular, the inference of sum types. In contrast, Haskell and F# cannot infer sum types. Polymorphic variants solve the expression problem and they are extremely useful. In fact, some languages are built entirely on extensible algebraic data types rather than closed sum types.
In the extreme, you also have languages like Mathematica where "everything is an expression". Thus the only type in the type system forms a trivial "singleton" algebra. This is "extensible" in the sense that it is infinite and, again, it culminates in a completely different style of programming.
So my (potentially dumb) question is: If ADTs are just that, a special case of inheritance (again this assumption may be wrong; please correct me in that case), then why does inheritance gets all the criticism and ADTs get all the praise?
I believe you are referring specifically to implementation inheritance (i.e. overriding functionality from a parent class) as opposed to interface inheritance (i.e. implementing a consistent interface). This is an important distinction. Implementation inheritance is often hated whereas interface inheritance is often loved (e.g. in F# which has a limited form of ADTs).
You really want both ADTs and interface inheritance. Languages like OCaml and F# offer both.

Friday, 4 May 2012

Quotations in F#, OCaml and Lisp

A quotation mechanism lets you embed code in your code and have the compiler transform that code from the source you provide into a data structure that represents it. For example, the following gives you a data structure representing the F# expression 1+2:
> <@ 1+2 @>;;
val it : Quotations.Expr<int> =
  Call (None, Int32 op_Addition[Int32,Int32,Int32](Int32, Int32),
      [Value (1), Value (2)])
    {CustomAttributes = [NewTuple (Value ("DebugRange"),
          NewTuple (Value ("stdin"), Value (3), Value (3), Value (3), Value (6)))];
     Raw = ...;
     Type = System.Int32;}
You can then hack on this data structure in order to apply transformations to your code, such as translating it from F# to Javascript in order to run it client side on almost any browser.
The F# quotation mechanism is extremely limited in functionality compared to the quotation mechanisms of languages like OCaml and Lisp. Moreover, although the .NET Framework and F# compiler provide everything required to compile and execute quoted code at full speed, the evaluation mechanism for quoted code is orders of magnitude slower than real F# code which, again, renders it virtually useless. Consequently, I am not familiar with any real applications of it beyond Websharper.
For example, you can only quote certain kinds of expressions in F# and not other code such as type definitions:
> <@ type t = Int of int @>;;

  <@ type t = Int of int @>;;

C:\Users\Jon\AppData\Local\Temp\stdin(4,4): error FS0010: Unexpected keyword 'type' in quotation literal
Most quotation mechanisms let you quote any valid code at all. For example, OCaml's quotation mechanism can quote the type definition that F# just barfed on:
$ ledit ocaml dynlink.cma camlp4oof.cma
        Objective Caml version 3.12.0

        Camlp4 Parsing version 3.12.0
# open Camlp4.PreCast;;
# let _loc = Loc.ghost;;
val _loc : Camlp4.PreCast.Loc.t = <abstr>
# <:expr< 1+2 >>;;
- : Camlp4.PreCast.Ast.expr =
Camlp4.PreCast.Ast.ExApp (<abstr>,
  Camlp4.PreCast.Ast.ExApp (<abstr>,
  Camlp4.PreCast.Ast.ExId (<abstr>, Camlp4.PreCast.Ast.IdLid (<abstr>, "+")),
  Camlp4.PreCast.Ast.ExInt (<abstr>, "1")),
  Camlp4.PreCast.Ast.ExInt (<abstr>, "2"))
# <:str_item< type t = Int of int >>;;
- : Camlp4.PreCast.Ast.str_item =
Camlp4.PreCast.Ast.StSem (<abstr>,
  Camlp4.PreCast.Ast.StTyp (<abstr>,
  Camlp4.PreCast.Ast.TyDcl (<abstr>, "t", [],
    Camlp4.PreCast.Ast.TySum (<abstr>,
    Camlp4.PreCast.Ast.TyOf (<abstr>,
      Camlp4.PreCast.Ast.TyId (<abstr>,
      Camlp4.PreCast.Ast.IdUid (<abstr>, "Int")),
      Camlp4.PreCast.Ast.TyId (<abstr>,
      Camlp4.PreCast.Ast.IdLid (<abstr>, "int")))),
  Camlp4.PreCast.Ast.StNil <abstr>)
Here is an example in Common Lisp:
$ sbcl
This is SBCL, an implementation of ANSI Common Lisp.
More information about SBCL is available at <>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.
* '(+ 1 2)
(+ 1 2)
Metaprogramming is one application where pattern matching can be extremely useful but pattern matching is a general-purpose language feature. You may appreciate the article from the Benefits of OCaml about a minimal interpreter. In particular, note how easy pattern matching makes it to act upon each of the different kinds of expression:
> let rec eval vars = function
    | EApply(func, arg) ->
        match eval vars func, eval vars arg with
        | VClosure(var, vars, body), arg -> eval ((var, arg) :: vars) body
        | _ -> invalid_arg "Attempt to apply a non-function value"
    | EAdd(e1, e2) -> VInt (int(eval vars e1) + int(eval vars e2))
    | EMul(e1, e2) -> VInt (int(eval vars e1) * int(eval vars e2))
    | EEqual(e1, e2) -> VBool (eval vars e1 = eval vars e2)
    | EIf(p, t, f) -> eval vars (if bool (eval vars p) then t else f)
    | EInt i -> VInt i
    | ELetRec(var, arg, body, rest) ->
        let rec vars = (var, VClosure(arg, vars, body)) :: vars in
        eval vars rest
    | EVar s -> List.assoc s vars;;
val eval : (string * value) list -> expr -> value = <fun>
That OCaml article was used as the basis of the F#.NET Journal article "Language-oriented programming: The Term-level Interpreter" (31st December 2007).
You can write compilers in F#. In fact, F# is derived from a family of languages that were specifically designed for metaprogramming, the so-called MetaLanguages (ML) family.
The article "Run-time code generation using System.Reflection.Emit" (31st August 2008) from the F#.NET Journal described the design and implementation of a simple compiler for a minimal language called Brainf*ck. You can extend this to implement more sophisticated languages like Scheme. Indeed, the F# compiler is mostly written in F# itself.