Some things I know about Clang and LLVM, part 1

2016-06-17 04:10 -0300. Tags: comp, prog, llvm, mestrado, in-english

In this post I'm going to talk about a few things about Clang and LLVM which I learned during my Master's and which might be useful to people new to Clang/LLVM.

The LLVM Project

According to its website, "The LLVM Project is a collection of modular and reusable compiler and toolchain technologies." The LLVM Project encompasses a number of sub-projects, the main ones being LLVM and Clang. Basically, LLVM is an infrastructure for code compilation, analysis and transformation. LLVM originally stood for "Low Level Virtual Machine", but it is not really a virtual machine, so nowadays "LLVM" is not considered an acronym anymore, it's just the name of the project. Clang (which is pronounced "clang", by the way, not "C-lang") is a C/C++/Objective C compiler which uses LLVM for code generation. The great things about Clang and LLVM are:

Unlike traditional compilers, they are designed as reusable components. You can use Clang and LLVM components as libraries to write your own program analysis/transformation tools, for instance, or you might extend Clang and LLVM with new passes/plugins, or use LLVM as a backend to generate machine code for your compiler. People have implemented JIT compilers (such as this) using LLVM as a basis for code generation, for example.
LLVM is designed around a well-defined intermediate language, called LLVM IR (Intermediate Representation), which is sort of an assembly-like language for an abstract machine. All code transformations at the LLVM level (apart from machine code generation) are implemented as transformations taking an LLVM IR program as input and producing a modified LLVM IR program as output. This makes it easier to add new passes, use them individually, combine them, inspect what each pass does, etc. LLVM IR has a textual representation (which you can print out or feed as input to LLVM), a bitcode representation which is more space-efficient, and an in-memory data structure representation (which is what the LLVM tools use internally).

Nowadays LLVM is quite popular as a compiler backend for various languages, such as Rust. The great thing about targeting LLVM for code generation is that it implements a large number of code optimizations. In fact, when Clang compiles a C/C++/ObjC program, it emits very naive, unoptimized code – most optimizations happen at the LLVM level. Because of this, any compiler targeting LLVM is able to use those same optimizations without having to do anything in particular (other than emitting code which LLVM is able to optimize – LLVM can't do magic, after all).

An example of a weirder project using Clang/LLVM is Emscripten, which implements an LLVM backend, the part which translates LLVM IR to machine code – except in this case the machine code is JavaScript.

Documentation, caveat, and scope of this post

The LLVM Project website has plenty of documentation (for suitable values of "plenty"), for LLVM and Clang. You should consult those for reference. The mailing lists (there are separate ones for the various projects) also have plenty of useful information (although I usually end up there by searching stuff on ~~Google~~ StartPage rather than going directly to the mailing list). As far as I can tell, people are quite helpful if you ask questions there (but I have never personally asked anything).

It's probably a good moment to warn that LLVM and Clang development moves quite fast, so it's probable some (or most) things in this post will be out of date sooner or later. So, when in doubt, consult the documentation. I will not attempt to duplicate the information in the documentation here, but rather will try to provide an overview of things I had to learn and some gotchas I found during the process. As of now, the current stable version of LLVM is 3.8, although I used initially 3.6 and later 3.7 for most of my Master's (which were the most current stable versions at the times).

Compiling LLVM and Clang

If you want to compile LLVM and Clang from source, you can find information in Getting Started with the LLVM System, Building LLVM with CMake, and Clang – Getting Started.

The main gotchas I found in the process were:

If you compile the source from the tarball (rather than the version on SVN/Git) using the standard ./configure; make; make install, it will compile a release build, not a debug build, even if you specify you want debug symbols to ./configure (and if you intend to write new passes/plugins or modify Clang/LLVM, you probably want a debug build). The solution is to compile with CMake instead of the standard Makefiles which come with the distribution. According to my notes here, the commands I used were:
```
tar -xvf llvm-3.6.1.src.tar.xz 

tar -xvf cfe-3.6.1.src.tar.xz 
mv cfe-3.6.1.src llvm-3.6.1.src/tools/clang

tar -xvf compiler-rt-3.6.1.src.tar.xz
mv compiler-rt-3.6.1.src llvm-3.6.1.src/projects/compiler-rt

mkdir build
cd build

cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ../llvm-3.6.1.src
ninja     # or ninja -j1
```
ninja (package ninja-build in Debian) is a build program similar to make. You can pass cmake the argument -G"Unix Makefiles" instead of -GNinja, and then it will generate classical Makefiles instead of Ninja files and you can run make instead of ninja, but then you don't get to think of this video every time you type ninja in the terminal.
By default, ninja will run multiple compilation jobs in parallel. This is all fine and dandy, and you probably want do to that if you have a multi-core machine, except linking Clang eats a lot of memory, and with four linkers in parallel it will likely eat up all your RAM (I froze my 8GB RAM machine a couple times due to that). This only happens during linking, at the very end of the compilation process. A couple of things help here:
- I very much recommend disabling swap space (sudo swapoff -a) during compilation. This way the compilation process will die when it eats up all your memory instead of freezing your system. After compilation you can reenable swap with sudo swapon -a.
- After the compilation process dies from lack of memory (or alternatively, if you interrupt it manually after your system starts to freeze), you can restart it with ninja -j1, which tells ninja to run a single compilation job. This way, you can run most of the process in parallel, and then restart it with a single job for the memory intensive last part.
- Another thing that may help is using the gold linker, which consumes somewhat less memory than the standard ld linker. It comes with recent versions of GNU Binutils, so you probably already have it on your system under the name ld.gold, but it is not used by default. You can set up a symlink in /usr/bin from ld to ld.gold to use it (you'll probably want to rename the old ld first).

If you just want to use the LLVM/Clang infrastructure, rather than modifying it, you may not need to compile it from source; you can install your distribution's development packages for LLVM and Clang instead (e.g., llvm-3.7-dev and libclang-3.7-dev on Debian). Then you can compile your pass/plugin/whatever against those.

Interfacing with Clang/LLVM

As far as I can tell, the primary interface with LLVM is the C++ API. There are C bindings to it too, but I don't know how common it is to use them. Besides C, there are bindings for OCaml, Python and Go, as well as third-party ones for Haskell, Rust, and maybe others. I can't attest to their stability or completeness (I remember trying to compile the OCaml bindings and failing miserably, but I didn't really try hard enough).

For Clang, there is a number of interfaces, the most stable of which (as in "the one that changes the least across Clang versions") is the C LibClang. There is also the Plugin interface and LibTooling, both of which are based on C++ and provide finer-grained control over the generated AST.

LLVM IR as an interface

If you want to use LLVM from a language for which there are no bindings (and you don't want to write the bindings yourself), an alternative is to communicate with LLVM by parsing and emitting LLVM IR directly, rather than using LLVM's APIs. This is what I did for my Master's software, which I wrote in Scheme. If you intend to take LLVM IR code as input (e.g., for writing a code analysis/transformation), you will have to write an LLVM IR parser, which is somewhat annoying (LLVM IR syntax could be quite a bit more regular, if you ask me), but is not particularly hard. If you don't need to read LLVM IR code, but only emit it (for example, if you are using LLVM as a backend for a compiler), then you don't need a parser, you just need to be able to print valid LLVM IR code. The drawback of this approach rather than using a binding is that you will have an extra overhead from converting your data structures to textual LLVM IR, and then feeding it to LLVM (typically invoked as a separate program (usually the opt tool)), which will then reconstruct it as the in-memory LLVM IR representation, rather than generating the in-memory representation directly and running the LLVM routines as library calls in the same process. On the other hand, that's exactly what a traditional compiler (such as GCC) does when calling the assembler, which takes textual assembly code as input (usually piped into it), so it's not like you're necessarily going to have an unacceptable overhead from this.

If you are writing an LLVM IR transformation in this way, and you want to run it as if it were a pass during compilation of a C/C++ program, you'll have to do some tricks. If you want to run your transformation after all other LLVM IR passes, then your life is simple: you can run clang -S -emit-llvm -o - (your normal arguments) to tell Clang to generate "assembly" code rather than an executable (-S), to emit LLVM IR rather than assembly, to output to stdout rather than a file (-o -), and use your normal compilation flags and arguments. Then you can pipe the LLVM IR output into your program (or make your program call clang and read its output via a pipe), transform it as you wish, and then pipe the result back into Clang with clang -x ir - (more arguments) to finish compilation, where -x ir - tells Clang to read code in LLVM IR language from stdin, and (more arguments) will typically include -o executable-name.

If you need to take the output from Clang before any optimization passes are run, things are slightly more tricky. Even if you run Clang with -O0 some LLVM passes may still run. Worse, if you do that, Clang will not include within the LLVM IR code information needed by the optimization passes, such as type information used by type-based alias analysis (TBAA), which means that if you try to do something like clang -O0 ... | your-pass | clang -O3 ..., the result won't be as optimized as if you had directly run clang -O3 on the source, because clang -O0 will lose information which is needed by some of the optimizations performed by clang -O3. The solution is:

clang -S -emit-llvm -Xclang -disable-llvm-optzns -o - -O3 (your normal arguments)

This will make sure Clang includes all information required by optimizations, but stops Clang from invoking the optimizations themselves. Then you can feed this into clang -x ir - -O3 later and optimizations will work properly. (-Xname option passes the option to the compilation subprocess name. Note also that -x ir will apply to all inputs specified afterwards in the command line, not just the -; if you need to pass, say, an extra C file to be combined with the result of your transformation, then you have to specify -x c filename.)

As far as I know, there is no way to simply intercalate a new external pass (i.e., one implemented as an external program) into the process, like "I just want to run:

clang -O3 -lsomelibrary -o hello hello.c

but with this new pass intercalated"; if you want your "compiler+pass" to accept the same arguments as the standard compiler, you'll have to write a routine or script to do some juggling of the arguments passed to each call to the compiler, to get something like:

clang -S -emit-llvm -Xclang -disable-llvm-optzns -o - -O3 hello.c |
     your-pass |
     clang -x ir - -O3 -l somelibrary -o hello

This is another drawback of using an external program and communicating purely via the IR, rather than writing a real LLVM IR pass (which I guess you could intercalate with some -Xclang option or something, I don't really know).

If you need to run specific LLVM passes on an LLVM IR program, you can use the opt tool. For example, if you want to run the reg2mem pass, you can add opt -S -reg2mem in the pipeline. You can run opt -help for a list of available passes. (-S tells opt to emit textual LLVM IR, rather than bitcode.)

End of Transmission

That's it for today. In the next post, I intend to talk a bit about the LLVM IR language itself.

Comentários / Comments (4)

Marcus Aurelius, 2016-06-28 10:57:38 -0300 #

Wait, it isn't pronounced c-lang? I'll never get used to that!

Vítor De Araújo ★, 2016-06-28 20:13:30 -0300 #

My advisors call it "C-lang" too, which annoys me no end. :P

Vítor De Araújo ★, 2016-06-28 20:14:04 -0300 #

(But don't tell them. :P)

Garantula, o Gargula de Tarântula, 2016-06-29 13:22:41 -0300 #

Clang, uma linguagem C

Elmord's Magic Valley

Computers, languages, and computer languages. Às vezes em Português, sometimes in English.