What is LLVM? The power behind Swift, Rust, Clang, and more

Learn how the compiler framework for programmatically generating machine-native code has made it easier than ever to roll new languages and enhance existing ones

What is LLVM? The power behind Swift, Rust, Clang, and more
Thinkstock

New languages, and improvements on existing ones, are mushrooming throughout the develoment landscape. Mozilla’s Rust, Apple’s Swift, Jetbrains’s Kotlin, and many other languages provide developers with a new range of choices for speed, safety, convenience, portability, and power.

Why now? One big reason is new tools for building languages—specifically, compilers. And chief among them is LLVM (Low-Level Virtual Machine), an open source project originally developed by Swift language creator Chris Lattner as a research project at the University of Illinois.

LLVM makes it easier to not only create new languages, but to enhance the development of existing ones. It provides tools for automating many of the most thankless parts of the task of language creation: creating a compiler, porting the outputted code to multiple platforms and architectures, and writing code to handle common language metaphors like exceptions. Its liberal licensing means it can be freely reused as a software component or deployed as a service.

The roster of languages making use of LLVM has many familiar names. Apple’s Swift language uses LLVM as its compiler framework, and Rust uses LLVM as a core component of its tool chain. Also, many compilers have an LLVM edition, such as Clang, the C/C++ compiler (this the name, “C-lang”), itself a project closely allied with LLVM. And Kotlin, nominally a JVM language, is developing a version of the language called Kotlin Native that uses LLVM to compile to machine-native code.

LLVM defined

At its heart, LLVM is a library for programmatically creating machine-native code. A developer uses the API to generate instructions in a format called an intermediate representation, or IR. LLVM can then compile the IR into a standalone binary, or perform a JIT (just-in-time) compilation on the code to run in the context of another program, such as an interpreter for the language.

LLVM’s APIs provide primitives for developing many common structures and patterns found in programming languages. For example, almost every language has the concept of a function and of a global variable. LLVM has functions and global variables as standard elements in its IR, so instead of spending time and energy reinventing those particular wheels, you can just use LLVM’s implementations and focus on the parts of your language that need the attention.

hello world llvm IDG

An example of LLVM’s intermediate representation (IR). On the right is a simple program in C; on the left is the same code translated into LLVM IR by the Clang compiler.

LLVM: Designed for portability

One way to think of LLVM is as something that’s often been said about the C programming language: C is sometimes described as a portable, high-level assembly language, because it has constructions that can map closely to system hardware, and it has been ported to almost every system architecture there is. But C only works as a portable assembly language as a side effect of how it works; that wasn’t really one of its design goals.

By contrast, LLVM’s IR was designed from the beginning to be a portable assembly. One way it accomplishes this portability is by offering primitives independent of any particular machine architecture. For example, integer types aren’t confined to the maximum bit width of the underlying hardware (such as 32 or 64 bits). You can create primitive integer types using as many bits as needed, like a 128-bit integer. You also don’t have to worry about crafting output to match a specific processor’s instruction set; LLVM takes care of that for you too.

If you want to see live examples of what LLVM IR looks like, go to the ELLCC Project website and try out the live demo that converts C code into LLVM IR right in the browser.

How programming languages uses LLVM

The most common use case for LLVM is as an ahead-of-time (AOT) compiler for a language. But LLVM makes other things possible as well.

Just-in-time compiling with LLVM

Some situations require code to be generated on the fly at runtime, rather than compiled ahead of time. The Julia language, for example, JIT-compiles its code, because it needs to run fast and interact with the user via a REPL (read-eval-print loop) or interactive prompt. Mono, the .Net implementation, has an option to compile to native code by way of an LLVM back end.

Numba, a math-acceleration package for Python, JIT-compiles selected Python functions to machine code. It can also compile Numba-decorated code ahead of time, but (like Julia) Python offers rapid development by being an interpreted language. Using JIT compilation to produce such code complements Python’s interactive workflow better than ahead-of-time compilation.

Others are experimenting with unorthodox ways to use LLVM as a JIT, such as compiling PostgreSQL queries, yielding up to a fivefold increase in performance.

numba llvm IDG

Numba uses LLVM to just-in-time-compile numerical code and accelerate its execution. The JIT-accelerated sum2d function finishes its execution about 139 times faster than the regular Python code.

Automatic code optimization with LLVM

LLVM doesn’t just compile the IR to native machine code. You can also programmatically direct it to optimize the code with a high degree of granularity, all the way through the linking process. The optimizations can be quite aggressive, including things like inlining functions, eliminating dead code (including unused type declarations and function arguments), and unrolling loops.

Again, the power is in not having to implement all this yourself. LLVM can handle them for you, or you can direct it to toggle them off as needed. For example, if you want smaller binaries at the cost of some performance, you could have your compiler front end tell LLVM to disable loop unrolling.

Domain-specific languages with LLVM

LLVM has been used to produce compilers for many general-purpose languages, but it’s also useful for producing languages that are highly vertical or exclusive to a problem domain. In some ways, this is where LLVM shines brightest, because it removes a lot of the drudgery in creating such a language and makes it perform well.

The Emscripten project, for example, takes LLVM IR code and converts it into JavaScript, in theory allowing any language with an LLVM back end to export code that can run in-browser. The long-term plan is to have LLVM-based back ends that can produce WebAssembly, but Emscripten is a good example of how flexible LLVM can be.

Another way LLVM can be used is to add domain-specific extensions to an existing language. Nvidia used LLVM to create the Nvidia CUDA Compiler, which lets languages add native support for CUDA that compiles as part of the native code you’re generating, instead of being invoked through a library shipped with it.

Working with LLVM in various languages

The typical way to work with LLVM is via code in a language you’re comfortable with (and that has support for LLVM’s libraries, of course).

Two common language choices are C and C++. Many LLVM developers default to one of those two for several good reasons: 

  • LLVM itself is written in C++.
  • LLVM’s APIs are available in C and C++ incarnations.
  • Much language development tends to happen with C/C++ as a base

Still, those two languages are not the only choices. Many languages can call natively into C libraries, so it’s theoretically possible to perform LLVM development with any such language. But it helps to have an actual library in the language that elegantly wraps LLVM’s APIs. Fortunately, many languages and language runtimes have such libraries, including C#/.Net/Mono, Rust, Haskell, OCAML, Node.js, Go, and Python.

One caveat is that some of the language bindings to LLVM may be less complete than others. With Python, for example, there are many choices, but each varies in its completeness and utility:

  • The LLVM project maintains its own set of bindings to LLVM’s C API, but they are currently not maintained.
  • llvmpy fell out of maintenance in 2015—bad for any software project, but even more so when working with LLVM, given the number of changes that come along in each edition of LLVM.
  • llvmlite, developed by the team that creates Numba, has emerged as the current contender for working with LLVM in Python. It implements only a subset of LLVM’s functionality, as dictated by the needs of the Numba project. But that subset provides the vast majority of what LLVM users need.
  • llvmcpy aims to bring the Python bindings for the C library up to date, keep them updated in an automated way, and make them accessible using Python’s native idioms. llvmcpy is still in the early stages, but already can do some rudimentary work with the LLVM APIs.

If you’re curious about how to use LLVM libraries to build a language, LLVM’s own creators have a tutorial, using either C++ or OCAML, that steps you through creating a simple language called Kaleidoscope. It’s since been ported to other languages:

  • HaskellA direct port of the original tutorial.
  • Python: One such port follows the tutorial closely, while the other is a more ambitious rewrite with an interactive command line. Both of those use llvmlite as the bindings to LLVM.
  • Rust and Swift: It seemed inevitable we’d get ports of the tutorial to two of the languages that LLVM itself helped bring into existence.

Finally, the tutorial’s also available in human languages as well. It’s been translated into Chinese, using the original C++ and Python.

What LLVM doesn’t do

With all that LLVM does provide, it’s useful to also know what it doesn’t do.

For instance, LLVM does not parse a language’s grammar. Many tools already do that job, like lex/yacc, flex/bison, and ANTLR. Parsing is meant to be decoupled from compilation anyway, so it’s not surprising LLVM doesn’t try to address any of this.

LLVM also does not directly address the larger culture of software around a given language. Installing the compiler’s binaries, managing packages in an installation, and upgrading the tool chain—you need to do that on your own.

Finally, and most important, there are still common parts of languages that LLVM doesn’t provide primitives for. Many languages have some manner of garbage-collected memory management, either as the main way to manage memory or as an adjunct to strategies like RAII (which C++ and Rust use). LLVM doesn’t give you a garbage-collector mechanism, but it does provide tools to implement garbage collection by allowing code to be marked with metadata that makes writing garbage collectors easier.

None of this, though, rules out the possibility that LLVM might eventually add native mechanisms for implementing garbage collection. LLVM is developing quickly, with a major release every six months or so. And the pace of development is likely to only pick up thanks to the way many current languages have put LLVM at the heart of their development process.