Saturday, 14 September 2013

A look at the IronJS source code

The author of IronJS has made some claims (e.g. that F# can feel “like a slightly crappier version of C#”) that I find surprising, so I have been reading his code. I find it quite odd for a variety of reasons:
  1. Surprisingly OOP (dozens of interfaces and classes and hundreds of augmentations, static members and overloads) given how well suited ML is to writing compilers.
  2. Lots of low-level bit-twiddling.
  3. Uses List.empty instead of [].
  4. Uses large tuples.
  5. Uses arrays of closures, e.g. Stmt : (Lexer.Token -> State -> Ast.Tree) array.
  6. Doesn't use a symbol table.
  7. Choice of data structures appears to incur lots of unnecessary boxing.
  8. Odd choice of data structures for memory representation, e.g. the null, left and stmt fields for a token are scattered.
  9. Doesn't seem to use concurrency or parallelism.
  10. Parser contains only 42 match expressions in 1,400 lines of code.
  11. Massive amounts of code duplication.
  12. Hand-rolled parser is unusual. Hand-rolled lexer without tables is even more unusual.
  13. Strange choice of syntax, e.g. lots of p |> csymbol instead of just csymbol p.
  14. Allegedly optimized but no structs.
  15. Uses HashSet.Add repeatedly instead of the declarative constructor from a seq.
  16. Doesn't use HashIdentity.Structural.
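On points 15 and 16, a minimal sketch of the more declarative alternatives (the keyword list here is purely illustrative, not from the IronJS source):

```fsharp
open System.Collections.Generic

// Point 15: construct a HashSet from a sequence instead of repeated Add calls.
let keywords = HashSet<string>([ "break"; "case"; "catch" ])

// Point 16: request structural hashing/equality explicitly when keying a
// dictionary on a tuple or other structural type.
let table = Dictionary<int * int, string>(HashIdentity.Structural)
table.[(1, 2)] <- "example"
```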
Furthermore, I don't recall ever seeing that author ask anyone for help with it and it looks like he could do with quite a bit of advice. So I'd take the "rewrite in C#" with a pinch of salt...
Here's an example of some code from the hand-rolled lexer that is using many comparisons of unicode characters when single table lookups could suffice:

let inline isAlpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let inline isDecimal c = c >= '0' && c <= '9'
let inline isOctal c = c >= '0' && c <= '7'
let inline isNonOctalDigit c = c = '8' || c = '9'
let inline isHex c = isDecimal c || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f')
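For comparison, a single precomputed table can replace all those chained comparisons; here is a minimal sketch, assuming (as the JS grammar allows for these productions) that only ASCII characters matter on the hot path:

```fsharp
// One 128-entry table of character-class flags for ASCII input.
// Flags are combined with bitwise-or so a single lookup answers every query.
let [<Literal>] Alpha   = 1
let [<Literal>] Decimal = 2
let [<Literal>] Octal   = 4
let [<Literal>] Hex     = 8

let charClasses =
  let t : int[] = Array.zeroCreate 128
  for c in 'a' .. 'z' do t.[int c] <- Alpha
  for c in 'A' .. 'Z' do t.[int c] <- Alpha
  for c in '0' .. '9' do t.[int c] <- t.[int c] ||| Decimal ||| Hex
  for c in '0' .. '7' do t.[int c] <- t.[int c] ||| Octal
  for c in 'a' .. 'f' do t.[int c] <- t.[int c] ||| Hex
  for c in 'A' .. 'F' do t.[int c] <- t.[int c] ||| Hex
  t

let inline isAlpha c = int c < 128 && charClasses.[int c] &&& Alpha <> 0
let inline isDecimal c = int c < 128 && charClasses.[int c] &&& Decimal <> 0
let inline isHex c = int c < 128 && charClasses.[int c] &&& Hex <> 0
```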

Here's an example of some odd code that uses OOP methods to give names to the elements of a tuple when it should just use a record type instead of a tuple:

member x.TokenName =
  let s, _, _, _ = x.Token in s |> Lexer.Symbol.getName

member x.TokenValue =
  let _, v, _, _ = x.Token in v

member x.TokenLine =
  let _, _, l, _ = x.Token in l

member x.TokenColumn =
  let _, _, _, c = x.Token in c

I believe this is the offending type definition:

type Token = int * string * int * int
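A record gives the elements names directly, so the member-per-element boilerplate above disappears; a sketch (the field names are my guesses from the accessors):

```fsharp
// Named fields replace positional tuple access.
type Token =
  { Symbol : int
    Value  : string
    Line   : int
    Column : int }

// TokenName/TokenValue/TokenLine/TokenColumn reduce to ordinary field access.
let token = { Symbol = 0; Value = "break"; Line = 1; Column = 1 }
let line = token.Line
```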

Here's another example of some strange code from the lexer where the programmer has manually assigned consecutive integers to symbols:

let [<Literal>] Break = 0
let [<Literal>] Case = 1
let [<Literal>] Catch = 2
let [<Literal>] Continue = 3
let [<Literal>] Default = 4
...

Note the complete absence of any metaprogramming here, which is surprising because it is written in a MetaLanguage (ML).
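A discriminated union would let the compiler assign the tags and retain exhaustive pattern matching; as a bonus, the `Symbol.getName` lookup comes for free from the case names. A minimal sketch:

```fsharp
// The compiler numbers the cases; no hand-maintained [<Literal>] constants.
type Symbol =
  | Break
  | Case
  | Catch
  | Continue
  | Default

// getName falls out of the union case name itself.
let getName (s : Symbol) = sprintf "%A" s
```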

Here's another oddity, the use of new when constructing non-IDisposable objects:

new Dictionary<string, int>(

Yet another oddity, pretending that all type definitions are mutually recursive when they are actually completely independent:

and ScopeType
  = GlobalScope
  | FunctionScope
and EvalMode
  = Clean
  | Contains
  | Effected
and LookupMode
  = Static
  | Dynamic
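Since none of these types refer to one another, `and` (needed only for genuine mutual recursion) can simply be dropped and each type introduced independently:

```fsharp
// Independent definitions: each type stands alone.
type ScopeType =
  | GlobalScope
  | FunctionScope

type EvalMode =
  | Clean
  | Contains
  | Effected

type LookupMode =
  | Static
  | Dynamic
```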

Very odd to say that a Dictionary being faster than a Map is "sad" when it isn't even being mutated:

// Sadly a normal Dictionary is so
// much faster then a F# Map that
// it's worth using it
new Dictionary<string, int>(
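If the table is built once and only read afterwards, there is nothing to lament: F#'s built-in dict function even expresses that directly, returning a read-only view backed by a Dictionary (the keyword pairs below are illustrative):

```fsharp
// Built once from a sequence of pairs; the result is a read-only
// IDictionary<_,_> backed by a Dictionary, so lookups stay fast.
let keywordIds = dict [ "break", 0; "case", 1; "catch", 2 ]

let caseId = keywordIds.["case"]  // fast hash lookup, no mutation needed
```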

In summary, although IronJS is written in F# it is not idiomatic F# code and there is a lot of room for improvement. I see no reason to believe that the problems with the IronJS code base have anything to do with the F# programming language.

3 comments:

Anton Tayanovskyy said...

Mostly agree, just two questions on parsing/lexing.

1. Lexing: I once toyed with lexing JS files with fslex, and lexing a file the size of jQuery was visibly slow. I think I might have used --unicode. This probably can be fixed in fslex; a table-driven approach certainly should be fast enough when properly done. For the moment, though, there is at least some motivation for hand-written lexers.

Do you default to using FsLex, or are there other better tools?

2. Parsing: the JS (ECMA-262) grammar is expressed in EBNF but has the most annoying exceptions in margin notes, notably semicolon insertion and some meta-rules on lookahead. In my attempts I utterly failed to coerce this into a format ocamlyacc or fsyacc (LALR(1)) would accept; that does not mean it is impossible, probably just that I am not very experienced with these tools. They have some "error recovery" feature that might do it, but I was not sure how. So in my code I do this with a hand-rolled parser or parsec-style combinators, where there is a lot of manual control.

Do you routinely use fsyacc for parsing? FParsec? Something else perhaps?

Steve Gilham said...

Even granted that this picks on the more egregious highlights, it sounds like a case of "The determined Real Programmer can write FORTRAN programs in any language." (except that the state of the art seems to have moved on from FORTRAN to Java).

Perhaps someone with the bandwidth to take it on should attempt a rewrite in F# :)

Jon Harrop said...

@Anton: In F# today I use active patterns for both lexing and parsing. I would use fslex and fsyacc if they had tooling (i.e. VS integration and preferably run-time compilation rather than 2-stage) but they don't so I steer clear of them.
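To give a flavour of the approach, here is a minimal sketch of lexing with active patterns (illustrative only, not IronJS code):

```fsharp
// An active pattern classifies the head of a char list, so the lexer
// becomes a set of declarative pattern matches.
let (|Digit|_|) = function
  | c :: rest when c >= '0' && c <= '9' -> Some (c, rest)
  | _ -> None

// Consume a run of digits, accumulating their integer value.
let rec lexInt acc = function
  | Digit (c, rest) -> lexInt (acc * 10 + int c - int '0') rest
  | rest -> acc, rest

let n, remaining = lexInt 0 (List.ofSeq "42+x")
// n = 42, remaining = ['+'; 'x']
```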

@Steve: Exactly.