In this post I'm going to cover a few things I learned about Clang and LLVM during my Master's, which might be useful to people new to Clang/LLVM.
According to its website, "The LLVM Project is a collection of modular and reusable compiler and toolchain technologies." The LLVM Project encompasses a number of sub-projects, the main ones being LLVM and Clang. Basically, LLVM is an infrastructure for code compilation, analysis and transformation. LLVM originally stood for "Low Level Virtual Machine", but it is not really a virtual machine, so nowadays "LLVM" is no longer considered an acronym; it's just the name of the project. Clang (which is pronounced "clang", by the way, not "C-lang") is a C/C++/Objective-C compiler which uses LLVM for code generation. The great things about Clang and LLVM are:
Nowadays LLVM is quite popular as a compiler backend for various languages, such as Rust. The great thing about targeting LLVM for code generation is that it implements a large number of code optimizations. In fact, when Clang compiles a C/C++/ObjC program, it emits very naive, unoptimized code – most optimizations happen at the LLVM level. Because of this, any compiler targeting LLVM is able to use those same optimizations without having to do anything in particular (other than emitting code which LLVM is able to optimize – LLVM can't do magic, after all).
The LLVM Project website has plenty of documentation (for suitable values of "plenty"), for LLVM and Clang. You should consult those for reference. The mailing lists (there are separate ones for the various projects) also have plenty of useful information (although I usually end up there by searching stuff on
This is probably a good moment to warn that LLVM and Clang development moves quite fast, so some (or most) things in this post will likely be out of date sooner or later. So, when in doubt, consult the documentation. I will not attempt to duplicate the information in the documentation here, but rather will try to provide an overview of things I had to learn and some gotchas I found during the process. As of now, the current stable version of LLVM is 3.8, although I initially used 3.6 and later 3.7 for most of my Master's (which were the most current stable versions at the time).
The main gotchas I found in the process were:
If you compile the source from the tarball (rather than the version on SVN/Git) using the standard ./configure; make; make install, it will compile a release build, not a debug build, even if you tell ./configure that you want debug symbols (and if you intend to write new passes/plugins or modify Clang/LLVM, you probably want a debug build). The solution is to compile with CMake instead of the standard Makefiles which come with the distribution. According to my notes here, the commands I used were:
tar -xvf llvm-3.6.1.src.tar.xz
tar -xvf cfe-3.6.1.src.tar.xz
mv cfe-3.6.1.src llvm-3.6.1.src/tools/clang
tar -xvf compiler-rt-3.6.1.src.tar.xz
mv compiler-rt-3.6.1.src llvm-3.6.1.src/projects/compiler-rt
mkdir build
cd build
cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ../llvm-3.6.1.src
ninja  # or ninja -j1
ninja (package ninja-build in Debian) is a build program similar to make. You can pass cmake the argument -G"Unix Makefiles" instead of -GNinja, and then it will generate classical Makefiles instead of Ninja files and you can run make instead of ninja, but then you don't get to think of this video every time you type ninja in the terminal.
By default, ninja will run multiple compilation jobs in parallel. This is all fine and dandy, and you probably want to do that if you have a multi-core machine, except that linking Clang eats a lot of memory, and with four linkers in parallel it will likely eat up all your RAM (I froze my 8GB RAM machine a couple of times due to that). This only happens during linking, at the very end of the compilation process. A couple of things help here:
If you just want to use the LLVM/Clang infrastructure, rather than modifying it, you may not need to compile it from source; you can install your distribution's development packages for LLVM and Clang instead (e.g., llvm-3.7-dev and libclang-3.7-dev on Debian). Then you can compile your pass/plugin/whatever against those.
As far as I can tell, the primary interface with LLVM is the C++ API. There are C bindings to it too, but I don't know how common it is to use them. Besides C, there are bindings for OCaml, Python and Go, as well as third-party ones for Haskell, Rust, and maybe others. I can't attest to their stability or completeness (I remember trying to compile the OCaml bindings and failing miserably, but I didn't really try hard enough).
For Clang, there are a number of interfaces, the most stable of which (as in "the one that changes the least across Clang versions") is LibClang, which is a C API. There are also the Plugin interface and LibTooling, both of which are based on C++ and provide finer-grained control over the generated AST.
If you want to use LLVM from a language for which there are no bindings (and you don't want to write the bindings yourself), an alternative is to communicate with LLVM by parsing and emitting LLVM IR directly, rather than using LLVM's APIs. This is what I did for my Master's software, which I wrote in Scheme. If you intend to take LLVM IR code as input (e.g., for writing a code analysis/transformation), you will have to write an LLVM IR parser, which is somewhat annoying (LLVM IR syntax could be quite a bit more regular, if you ask me), but is not particularly hard. If you don't need to read LLVM IR code, but only emit it (for example, if you are using LLVM as a backend for a compiler), then you don't need a parser; you just need to be able to print valid LLVM IR code. The drawback of this approach, compared to using a binding, is the extra overhead: you convert your data structures to textual LLVM IR and feed it to LLVM (typically invoked as a separate program, usually the opt tool), which then reconstructs the in-memory LLVM IR representation, whereas with a binding you would generate the in-memory representation directly and run the LLVM routines as library calls in the same process. On the other hand, that's exactly what a traditional compiler (such as GCC) does when calling the assembler, which takes textual assembly code as input (usually piped into it), so it's not like you're necessarily going to have an unacceptable overhead from this.
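To give an idea of what "just printing valid LLVM IR" can look like, here is a minimal sketch in Python (not the Scheme code mentioned above). The function name emit_function, the tuple-based expression format, and the %t0, %t1, ... naming of temporaries are all made up for this illustration, and only 32-bit integer add, sub and mul are handled.

```python
import itertools
import re

# Map a few arithmetic operators to LLVM IR instruction names.
OPCODES = {"+": "add", "-": "sub", "*": "mul"}


def emit_function(name, params, expr):
    """Return textual LLVM IR for an i32 function that computes `expr`.

    `expr` is an int literal, a parameter name, or a tuple (op, lhs, rhs).
    This expression format is hypothetical, chosen only for the example.
    """
    # Temporaries are named %t0, %t1, ...; refuse parameters that would clash.
    if any(re.fullmatch(r"t\d+", p) for p in params):
        raise ValueError("parameter names like 't0' are reserved for temporaries")

    body = []
    fresh = itertools.count()

    def operand(e):
        # Return the IR operand for `e`, appending instructions as needed.
        if isinstance(e, int):
            return str(e)
        if isinstance(e, str):
            if e not in params:
                raise ValueError(f"unknown variable: {e}")
            return f"%{e}"
        op, lhs, rhs = e
        left, right = operand(lhs), operand(rhs)
        tmp = f"%t{next(fresh)}"
        body.append(f"  {tmp} = {OPCODES[op]} i32 {left}, {right}")
        return tmp

    result = operand(expr)
    args = ", ".join(f"i32 %{p}" for p in params)
    lines = [f"define i32 @{name}({args}) {{", "entry:", *body,
             f"  ret i32 {result}", "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Emits a function computing x * 2 + y.
    print(emit_function("f", ["x", "y"], ("+", ("*", "x", 2), "y")), end="")
```

Using named temporaries rather than numbered ones (%0, %1, ...) avoids having to keep LLVM's sequential numbering of unnamed values straight. The printed text can then be handed to opt or Clang as textual IR.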
If you are writing an LLVM IR transformation in this way, and you want to run it as if it were a pass during compilation of a C/C++ program, you'll have to do some tricks. If you want to run your transformation after all other LLVM IR passes, then your life is simple: you can run clang -S -emit-llvm -o - (your normal arguments) to tell Clang to generate "assembly" code rather than an executable (-S), to emit LLVM IR rather than assembly (-emit-llvm), to output to stdout rather than a file (-o -), and to use your normal compilation flags and arguments. Then you can pipe the LLVM IR output into your program (or make your program call clang and read its output via a pipe), transform it as you wish, and then pipe the result back into Clang with clang -x ir - (more arguments) to finish compilation, where -x ir - tells Clang to read code in the LLVM IR language from stdin, and (more arguments) will typically include -o executable-name.
If you need to take the output from Clang before any optimization passes are run, things are slightly more tricky. Even if you run Clang with -O0 some LLVM passes may still run. Worse, if you do that, Clang will not include within the LLVM IR code information needed by the optimization passes, such as type information used by type-based alias analysis (TBAA), which means that if you try to do something like clang -O0 ... | your-pass | clang -O3 ..., the result won't be as optimized as if you had directly run clang -O3 on the source, because clang -O0 will lose information which is needed by some of the optimizations performed by clang -O3. The solution is:
clang -S -emit-llvm -Xclang -disable-llvm-optzns -o - -O3 (your normal arguments)
This will make sure Clang includes all information required by optimizations, but stops Clang from invoking the optimizations themselves. Then you can feed this into clang -x ir - -O3 later and optimizations will work properly. (An option of the form -Xname option passes option to the compilation subprocess name; here, -Xclang passes -disable-llvm-optzns to the Clang compiler proper. Note also that -x ir will apply to all inputs specified afterwards in the command line, not just the -; if you need to pass, say, an extra C file to be combined with the result of your transformation, then you have to specify -x c filename.)
As far as I know, there is no way to simply intercalate a new external pass (i.e., one implemented as an external program) into the process, like "I just want to run:
clang -O3 -lsomelibrary -o hello hello.c
but with this new pass intercalated"; if you want your "compiler+pass" to accept the same arguments as the standard compiler, you'll have to write a routine or script to do some juggling of the arguments passed to each call to the compiler, to get something like:
clang -S -emit-llvm -Xclang -disable-llvm-optzns -o - -O3 hello.c | your-pass | clang -x ir - -O3 -l somelibrary -o hello
This is another drawback of using an external program and communicating purely via the IR, rather than writing a real LLVM IR pass (which I guess you could intercalate with some -Xclang option or something, I don't really know).
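The argument juggling described above might be sketched roughly as follows, in Python. This is only an illustration under simplifying assumptions: the names split_clang_args and run_pipeline are made up, exactly one source file is supported, and the classification of options (linker-ish options go to the second clang call, -O levels go to both, everything else goes to the first) is a guess that will certainly miss cases in real build systems.

```python
import re
import subprocess

SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx", ".m", ".mm")
# Options that may be followed by a separate value ("-o hello", "-l m").
LINK_OPTS_WITH_VALUE = ("-o", "-l", "-L")
FRONT_OPTS_WITH_VALUE = ("-I", "-D", "-U", "-include", "-isystem")
OPT_LEVEL = re.compile(r"-O([0-3sgz]|fast)?")


def split_clang_args(args):
    """Split compiler arguments into (frontend_cmd, backend_cmd) argv lists."""
    front_opts, back_opts, sources = [], [], []
    it = iter(args)
    for arg in it:
        if arg in LINK_OPTS_WITH_VALUE or arg in FRONT_OPTS_WITH_VALUE:
            value = next(it, None)
            if value is None:
                raise ValueError(f"missing value after {arg}")
            target = back_opts if arg in LINK_OPTS_WITH_VALUE else front_opts
            target.extend([arg, value])
        elif OPT_LEVEL.fullmatch(arg):
            # The optimization level is needed by both clang invocations.
            front_opts.append(arg)
            back_opts.append(arg)
        elif arg == "-c" or arg.startswith(("-l", "-L", "-Wl,")):
            back_opts.append(arg)
        elif arg.startswith("-"):
            # Assumption: anything else is a front-end flag (-g, -Wall, -std=...).
            front_opts.append(arg)
        elif arg.endswith(SOURCE_SUFFIXES):
            sources.append(arg)
        else:
            raise ValueError(f"unsupported input: {arg}")
    if len(sources) != 1:
        raise ValueError("this sketch handles exactly one source file")
    frontend = ["clang", "-S", "-emit-llvm", "-Xclang", "-disable-llvm-optzns",
                "-o", "-", *front_opts, *sources]
    backend = ["clang", "-x", "ir", "-", *back_opts]
    return frontend, backend


def run_pipeline(args, pass_cmd):
    """Run frontend | pass_cmd | backend; return the first non-zero exit code."""
    frontend, backend = split_clang_args(args)
    p1 = subprocess.Popen(frontend, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(pass_cmd, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits early
    p3 = subprocess.Popen(backend, stdin=p2.stdout)
    p2.stdout.close()
    codes = [p.wait() for p in (p1, p2, p3)]
    return next((code for code in codes if code != 0), 0)
```

With this, run_pipeline(["-O3", "-lsomelibrary", "-o", "hello", "hello.c"], ["your-pass"]) would spawn the same three-stage pipeline as the command line shown above, with your-pass standing in for whatever program implements the transformation.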
If you need to run specific LLVM passes on an LLVM IR program, you can use the opt tool. For example, if you want to run the reg2mem pass, you can add opt -S -reg2mem in the pipeline. You can run opt -help for a list of available passes. (-S tells opt to emit textual LLVM IR, rather than bitcode.)
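If your program drives the pipeline itself, an opt stage can be spawned much like any other subprocess. The Python sketch below is an illustration only; the helper names opt_command and run_opt are made up. The legacy_syntax switch is there because, as far as I know, newer LLVM releases expect passes to be given as opt -passes=name1,name2 rather than as individual -name flags; which spelling works depends on your LLVM version, so check opt -help.

```python
import subprocess


def opt_command(passes, legacy_syntax=True):
    """Build the argv for an opt invocation that emits textual IR (-S)."""
    if legacy_syntax:
        # Older style used in this post: one flag per pass (opt -S -reg2mem).
        return ["opt", "-S", *[f"-{p}" for p in passes]]
    # Newer style (assumption about your LLVM version): a single -passes= list.
    return ["opt", "-S", "-passes=" + ",".join(passes)]


def run_opt(ir_text, passes, legacy_syntax=True):
    """Pipe textual LLVM IR through opt and return the transformed IR."""
    result = subprocess.run(
        opt_command(passes, legacy_syntax),
        input=ir_text,        # opt reads from stdin when given no input file
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError if opt fails
    )
    return result.stdout
```

For example, run_opt(ir_text, ["reg2mem"]) corresponds to adding opt -S -reg2mem to the pipeline.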
That's it for today. In the next post, I intend to talk a bit about the LLVM IR language itself.
Copyright © 2010-2020 Vítor De Araújo
The content of this blog, unless otherwise specified, may be used under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.