Thursday, 25 December 2008

Building a better future: the High-Level Virtual Machine

Microsoft's Common Language Run-time (CLR) was a fantastic idea. The ability to interoperate safely and at a high level between different languages, from managed C++ to F#, has greatly accelerated development on the Microsoft platform. The resulting libraries, like Windows Presentation Foundation, are already a generation ahead of anything available on any other platform.

Linux and Mac OS X do not currently have the luxury of a solid foundation like the CLR. Consequently, they are composed entirely of uninteroperable components written in independent languages, from unmanaged custom C++ dialects to Objective-C and Python. Some developers choose to restrict themselves to the lowest common denominator (e.g. writing GTK in C), which aids interoperability but only at a grave cost in productivity. Other developers gravitate to huge libraries written in custom dialects of particularly uninteroperable languages (e.g. Qt). Both approaches have a bleak future.

The situation is compounded by the fact that Linux has a far richer variety of programming languages than Windows, thanks to Linux being the platform of choice for academics such as programming language researchers who develop and maintain a variety of state-of-the-art programming languages, libraries and tools on the Linux platform. However, despite any benefits of languages like OCaml, Erlang, Haskell, Lisp, Scheme, ATS, Pure and others, these languages are almost entirely uninteroperable because they do not have a shared run-time and many do not even have easy foreign function interfaces (FFIs) to access existing unmanaged libraries.

If there were a high-level virtual machine (HLVM) available for Linux that could act as a common language run-time for these kinds of languages then it may be possible to build a better future for software development on these platforms. The impedance mismatch between different languages (including C) would be a lot smaller and the ability to write and consume libraries from other languages would greatly improve productivity.

We believe this approach has a bright future and, consequently, we have begun developing a new HLVM that is designed to act as a common language run-time, initially for the ML family of languages, in the hope that others will build upon it and efforts can be combined between language communities. We are using the excellent LLVM library, which provides high-performance native code generation across a variety of architectures and platforms, including x86/x64 and Linux/Mac OS X.

Although the project is still at a very early stage of development, we already have some promising results. We can compile a subset of ML including bool, int, float and array types. We have full tail calls between internal functions, and external functions use the C calling convention so they can be invoked directly. Our implementation is 2-4× faster than OCaml on x86 on several simple benchmarks, including the Sieve of Eratosthenes and a Mandelbrot renderer.
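To give a flavour of the kind of program this subset covers, here is a minimal sketch of a Sieve of Eratosthenes written in ordinary OCaml. It is only an illustration and not the benchmark code behind the figures above; the names sieve, cross_off, loop and count_primes are hypothetical. It uses nothing beyond bools, ints, arrays and tail-recursive functions.

```ocaml
(* Sieve of Eratosthenes over a bool array (illustrative sketch only).
   Assumes n >= 0. Both loops are written as tail-recursive functions. *)
let sieve n =
  let is_prime = Array.make (n + 1) true in
  is_prime.(0) <- false;
  if n >= 1 then is_prime.(1) <- false;
  (* Mark j, j+i, j+2i, ... as composite; the recursive call is a tail call. *)
  let rec cross_off i j =
    if j <= n then begin
      is_prime.(j) <- false;
      cross_off i (j + i)
    end in
  (* Visit each candidate i up to sqrt n and cross off its multiples. *)
  let rec loop i =
    if i * i <= n then begin
      if is_prime.(i) then cross_off i (i * i);
      loop (i + 1)
    end in
  loop 2;
  is_prime

(* Count the entries still marked as prime. *)
let count_primes n =
  Array.fold_left (fun acc b -> if b then acc + 1 else acc) 0 (sieve n)

let () = Printf.printf "%d\n" (count_primes 100)  (* prints 25 *)
```

Because the loops are expressed as tail calls rather than built-in loop constructs, code in this style depends on the compiler turning those calls into jumps.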

The main features that we have yet to implement are algebraic datatypes, pattern matching and garbage collection. Once those features have been completed we shall release a first version of our HLVM as an open source project and ask for contributors and developers to start improving and building upon this foundation. This will take time but hopefully we can work together to build a better future for high-level programming on the Linux and Mac OS X platforms.


12 comments:

adir1 said...

check openLINA

sampo said...

A high level language (with GC, reflection, interactive top-level, higher level threading constructs), fast implementation, good FFI and works on OS X and Linux... It may be a tall order, but that'd be a dream come true!

Kurt Schelfthout said...

Your reasoning seems a bit strange: there are too many different, non-interoperable platforms available. So let's make our own?

Wouldn't it make more sense to port your favourite ML to Mono (insofar as F# doesn't already fill that gap), or the JVM?

Flying Frog Consultancy Ltd. said...

@Kurt

Absolutely. Writing a VM is a huge undertaking and should not be taken lightly. Consequently, Mono and the JVM were my first ports of call for this project and I have studied both VMs in quite some detail.

The JVM offers excellent performance for structless imperative monomorphic code (i.e. Java code) but the JVM lacks value types, tail calls and implements parametric polymorphism via type erasure. Lack of value types makes tuples and multiple return values slow. Lack of tail calls makes functional code either very slow or very fragile. Lack of real parametric polymorphism makes polymorphic code slow (and almost all ML code is polymorphic thanks to automatic generalization by the compiler).
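As a small illustration of why tail calls matter for functional code, consider this sketch in plain OCaml (the function names are hypothetical and chosen only for the example). The two functions call each other in tail position, so a compiler with tail-call elimination runs them in constant stack space; without it, a recursion ten million calls deep would typically exhaust the stack.

```ocaml
(* Mutually recursive functions whose recursive calls are all tail calls. *)
let rec is_even n = if n = 0 then true else is_odd (n - 1)
and is_odd n = if n = 0 then false else is_even (n - 1)

(* Ten million nested calls: fine with tail-call elimination. *)
let () = Printf.printf "%b\n" (is_even 10_000_000)  (* prints true *)
```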

Mono has a much more compelling feature list than the JVM: it is supposed to support value types, tail calls and real parametric polymorphism and they are even reimplementing Microsoft's mainstream libraries. However, Mono simply does not deliver. Firstly, Mono does not have an accurate garbage collector. Secondly, Mono's code generator is poor (my F# code runs 3x slower under Mono than .NET). Thirdly, Mono is largely untested because it has a tiny user base (comparable to OCaml's). Fourthly, Mono only aspires to reimplement Microsoft's previous generation of libraries (i.e. WinForms and not WPF). Finally, Mono's VM is badly written and can only be fixed by a complete rewrite because the code generator must work in harmony with a real garbage collector (that does not exist yet).

Consequently, LLVM is the VM of choice for my project because it provides essential functionality that the JVM lacks and it is far more robust and well written than Mono. Moreover, LLVM facilitates a C compatible FFI which makes interoperability far easier from my HLVM than from alternatives like OCaml.

Indeed, if anyone ever wants to do a decent job of Mono then I would urge them to build upon LLVM.

James said...

Interestingly, although it took them four years to build it, the mono guys just announced a new code generation engine for Mono 2.2. They still haven't touched the GC though.

Kurt Schelfthout said...

Jon,

Thank you for your answer. Please don't take anything I say as criticism - I am simply intrigued by your project.

As for your assessments of current VMs - I'll take your word for it.

It seems you want to provide a powerful type system already at the VM level. In that case, I'm interested in your view on approaches like Newspeak and Gilad Bracha's ideas on pluggable type systems. Essentially the idea is this: a language, and certainly a VM, should never offer a typing system to begin with; this constrains the programs one can write in this language too much. Instead, type systems should be added later, and should be pluggable, i.e. viewed as a useful, yet optional, static analysis one can execute on programs. I know that Gilad has strong opinions on this, saying e.g. that basing the DLR on the CLR does not make sense; instead, MS should have started with a DLR, then layered a CLR on top of that.

Finally, regarding Mono: it seems that most of your criticisms are self-fulfilling prophecies. For example, if no one dares to use mono because of its tiny user base, then their user base is not likely to grow.

cheers,

Kurt

Flying Frog Consultancy Ltd. said...

James, the Mono guys seem to be grossly underestimating the performance benefits of their new code gen, saying that it can be up to 50% faster. According to their results using my F# port of SciMark2, Mono 2.2 is running up to 370% faster than Mono 2.0.

Kurt, there have been some (potentially) important developments over the past week that I will blog about next. For my HLVM, I only desire a type system capable of supporting ML efficiently and not anything particularly clever. That could be as simple as a safe derivative of LLVM's own type system + generics. I have not studied Gilad Bracha's ideas (I'll take a look) but, from your interpretation of them, they look like the ideal way to implement interoperable dynamic languages but would undermine the performance and correctness of static languages, most notably predictable performance. Obviously I have no desire to create or use dynamic languages so I would not pursue those ideas.

Marshall said...

I'm interested in following the development of this. How might I best do that? I lurk on the LLVM list; does that help?

Flying Frog Consultancy Ltd. said...

@Marshall

The best way to follow this work is by reading my articles about HLVM in the OCaml Journal. They describe my completed and tested work.

If you would like to chat about ideas then feel free to e-mail me. Alternatively, I shall be publishing this as open source once it is of some practical use, i.e. as soon as algebraic datatypes and GC are complete.

Pietro Braione said...

There was once an HLVM project, also related to LLVM and (if I remember correctly) with a similar aim. Is the choice of the name incidental?

Flying Frog Consultancy Ltd. said...

@Pietro

Yes. We recycled the name HLVM because the old HLVM project has been dead for many years now. However, their HLVM was designed for dynamic languages whereas our HLVM is designed to take full advantage of statically-typed languages.

jhuni said...

Ever heard of the parrot vm?

http://www.parrot.org/

It is a very effective foundation for high-level language interoperability and it already has Perl6, JavaScript, Lua, Basic, and a few others working, so rather than build some new "HLVM" thing, I would recommend checking out parrot which already has lots of support...

Allison Randal is developing on it all the time and Perl6 on parrot is getting a major release in June.