12
Sep

Ghost in the Compiler: Backdooring Rust // Manish Goregaokar


>>MANISH GOREGAOKAR: I name is Manish. Today,
I’ll I’m going to talk to you about compilers. So a bit about me. I work at ‑‑ yes, we
have a ‑‑ for a logo. I contribute to the Rust community and the Rust compiler.
Rust, it’s a programming language, that ‑‑ sorry. That tries to empower everyone to build fast
and reliable and secure systems. Rust is a compiled language. And compiled languages,
you have a compiler program that takes the source code that you write, and converts it
into machine code that the compiler can understand. But the computer can understand. The compilers
are a fundamental tool that we use to build programs and work with computers. I think
compilers are really, really cool. So compilers are what compile our program, but where do
they come from? Who compiles the compiler? I mean, we had a compiler here last time,
so it’s got to be some other compiler, right? It’s ‑‑ that means it’s compilers all
the way down, I guess? I don’t know? For many languages, it’s not just that you need a compiler
to build that compiler, you need a compiler for the same language. The compiler for your
language is written in the same language, which somehow makes sense. I promise you,
it does make sense. Eventually. The Rust compiler is written in Rust, many C++ compilers are
written in C++. GCC is written in C++ for example. Such compilers are called boot strapped
compilers. The way their development works is that you can build a version of the compiler
with an older version of the compiler. So for example, if you want to build the latest
Rust compiler, I think, 32, you need the previous version of that, if you want to build this,
then you need that one and so on, all the way back. Woops. Similarly, if you want to
build GCC, most C++ compilers from the last ten‑ish years will work. Eventually, this
chicken and egg problem has to end up somewhere, right? You can’t just go compilers all the
way, there’s something at the bottom. So it turns out that this is where the boot strapping
comes in. Boot strapping comes from the phrase like pulling yourself up by your own boot
straps. And what happens is that initially you start writing a simpler compiler and add
complexity to it as you go forward. When Denis Richey wrote his first C compiler, he wrote
a compiler for a much simpler language, C0, in assembly, assembly is this language, it’s
a human readable version of machine code and what computers can understand, so really ‑‑ it’s
hard to program an assembly, but you can write a simple compiler. And he used this assembler,
which is a simple program which converts assembly to machine code to get the first compiler
for this ‑‑ for this subset of C called C0. Then, he wrote, then he wrote another
compiler for a much ‑‑ a much better compiler for a language that I’m going to call C1.
That was written in the subset C0. And he used the compilers that he had to build a
‑‑ to compile this one. And now you can add features to this. And write a better version
of that. Compile a better version, and then add more feature, compile it with the old
one and keep doing this until you have a compiler that supports all of the features you wanted.
This takes a while, but you get to a C compiler. Newer languages don’t actually do all the
way up to the top, they’ll start with a compiler written in a different language that already
has a compiler, and then at some point, once it’s mature enough, switch over to being boot
strapped and then initially, you can use that other compiler to start. So the Rust compiler
was initially written in a language called OCaml. Have you ever thought about how much
we trust the tools we use? In 1984, Ken Thompson wrote give a talk about this attack. And about
this nature ‑‑ the nature of trust in our systems. And he focused on a very, very
fascinating outcome of this whole boot strapping process, your compiler is written in the language
that it compiles. He first talked about how it’s possible to make a compiler malicious.
Let’s say you have a log in program. It will probably have like some function like check
password that takes your password and I don’t know checks it against some list of what the
passwords are allowed to be. Now, you’re compiling your program, you’re compiling this log in
program using your compiler and using it on your system and that’s fine, except, someone
can ‑‑ what if the compiler was changed? So that when it finds this check password
function, it actually adds a back door to the function? This kind of function ‑‑ this
back door could be something like, oh, if this ‑‑ if the user types in this password,
always let them in. And now, the log in function suddenly lets you log in as anyone as long
as you know the secret password that the compiler just inserted in the program. So this means
that if you want to trust your program, you have to trust the compiler that you’re using
to build the program. Okay. If we have to do that, we have our compiler source code,
we can just build our own compiler, we have the source code we just build it. But wait.
That’s also compiled by a compiler. That compiler could also do things to our code. So you have
the same problem. Since you need a copy of that compiler to build a copy of your compiler,
someone can back door the first compiler ‑‑ that compiler copy, so that it inserts itself into
the copy of the compiler that you’re building. You’re now forced to trust your code. For
example here, you can change the compiler, it adds the same back door, it also finds
the function that compiled stuff and the compiler and adds this whole back door to itself. And
then, you can compile a compiler’s ‑‑ whose source code is nice and clean and clean of
back doors and get a bad compiler out that and that will continue forever, as much as
you use that compiler. Yourself forced to trust your tools. And this attack is called
the “Trusting Trust” attack, because of the name of the original talk. So when I first
heard about this, I was really intrigued. Compilers have this like ‑‑ so it turns
out, compilers have this weird chicken and egg situation, where they’re built with themselves
and somehow that works. And further more, you can abuse this to insert back doors that
don’t even show up in the source code, that they’re just like ghosts in the compiler.

( laughter )

( applause )

 So this really, really got me interested in
compilers. At the time I didn’t really understand how they worked. I just understood that they’re
weird and cool. So my personal goal, I gave for myself, back then, was one day I will
actually do this for myself. And then I kind of forgot about it, but it’s fine. Many years
later, I had become a contributor to the Rust programming language, so now I knew how compilers
worked and I realized, I thought about it, oh, this is actually not that hard to do.
So I tried doing it. It’s worth understanding how compilers kind of work to understand this
attack. This diagram is a very simplified pipeline of how most compilers work, so first
you have your source code, it parses the source code that you give to it and it turns it into
a more structured format, often called the syntax tree. Then, it will look at this syntax
tree, and find what things are referring to what things and like link things up, like
if you have imports, where the imports came from, things like that, that’s called name
resolution. Then it will run a bunch of checks like type checking, for example, checking
if all of your types match, and everything. And once you know that your program is correct,
and you have all of this structure about it, you can now restructure the program to optimize
it and convert it to machine code which again, the computer understands. In the Rust compiler,
these last two steps are handled by a different library called LLVM, as you go further down
this pipeline, things get more and more complicated. You have a structured format of the source
code. Here, you have that structured format plus, like a billion other things that depend
on that. So poking at that structured format might mean invalidating other things.
so
when I wanted to pick where to stick this attack, I decided to put it right after parsing,
because then I have a very nice structured representation that I can work with. But I
don’t have to deal with all of that other stuff that’s floating around that I might
break if I change things. All right. So my first step was to insert just some back door
into the compiler, not the weird loopy “Trusting Trust” back door, just some back door. I didn’t
want to do the entire attack at once because I wasn’t yet sure how to do it. I instead
wrote a very simple back door that replaces things that said hello, with the equivalent
in my native world, apologies to the transcriber, when I just ‑‑ when I wrote this talk,
it didn’t occur to me that I’d have to deal with that. A program that prints out hello
world, when compiled by my compiler will print out ( sp? ). They have a code in this slide
‑‑ pseudocode ‑‑ 
 ( laughter )
 So the code in this slide is pseudocode, the
compiler code is more complicated, but it’s only complicated because like the compiler
has these structures and stuff, and I’ve simplified it for this purpose, but the idea is the same.
So and if you want to look at the actual code, I’ll link to a blog post at the end that contains
the actual code. So I first found the part of the compiler that handles the step right
after parsing, I’m going to call it after parsing here, it takes in this structured
representation, the syntax of the program, calling program here, and it does some other
stuff to it. I added a line to that, that is just called this function add back door.
On this program. And I passed this variable to ‑‑ so I passed this variable to this
add back door function up there. Which I also defined. In this function, I basically it
rate through all of the expressions in the program, and I find expressions which are
string literals that say hello world, and instead, I replaced them with a new string
that says ( sp? ). All right. The next step is to make the back door add itself, this
is where it gets interesting. We can it rate through all of the functions in the program,
we can looking for the function after parsing that we’ve just added a line to. And we can
add this function called of add back door to that, but now, we need to insert this add
back door function itself. How do we do that? We don’t have it like ‑‑ if we wanted
to do that, we’d need that function as a string inside of itself, but then, that string also
needs to be inside of that string, which needs to be inside that string, which needs to be
inside that string and this is terrible. So we needed a different solution. Some of you
may have worked with things called quines, something similar to what happens when you
try to write a quine, it prints its own source code, you should try to do one sometime in
your favorite language. These are tricky to construct, because of the same reason, if
your program is printing its own source code, it needs to contain a string that contains
a source code, that needs to contain that string that needs to contain that string,
it’s terrible. This makes your program infinitely large. With ( sp? ) There’s a trick to get
around it, you use a variable. So this is our quine. We first create a variable that
contained the second line. Don’t look at the contents, just the same as the second line.
As you can see, this is just the second line. In the second line, we do two things. We first
print out the line that sets the variable itself. So now we’re printing out this first
line. And that’s done by ‑‑ we can just reuse the variable, because we’re just printing
its contents after setting it to a variable. And then, we print out the line that’s actually
printing the program, which is the second line, so all we have to do is use the variable
twice, and now this program isn’t really large. If you really look closely at this example,
like this is python code. But it doesn’t quite solve the problem. Because you still need
to do with the escaping. And there’s various ways to fix that problem. But it depends on
the language and it’s not so relevant. The core insight is that if you set most of the
code to a variable, and then use that variable twice, once to set the line that sets the
code and wants to print the rest of the code, you can create a quine and really, we can
use the same trick here, we’re inserting the back door, and it’s kind of like printing,
but we’re doing a bit more work. So let’s try what we’ve learned to our attack. We now
have a variable, which I’m calling program string at the top and it contains all of this
code as a string, it’s not on the slide because it would get too big. And this is all of add
back door function, it doesn’t contain itself, it contains this function. When we find the
after parsing function, just like last time, we add the function call to the add back door
function. And then, we do two things, first, we parse this program string line to insert
the add back door function. This entire function is contained in this string, so the if we
parse it and insert it, we get this function in our code. And then, we create the line
that actually sets this constant. We create this line and use the variable to create it
own contents and insert that somewhere. As you can see here, we’re using this variable
twice, we first use it to insert the function, and then we use it to insert the line that
creates itself. While I did this with the Rust compiler, most compilers work really
similarly. And it should be possible to like replicate this attack with mostly the same
features with a compiler of your choice. This attack is really ‑‑ a really fun way to
learn more about your compiler. I highly recommend trying it out, if you’re interested in playing
with a compiler. Just pick one and like find where it parses stuff, and look at things
and you’ll eventually get something. Anyway. Compilers are really, really cool. I hope
this talk motivates you to start poking at compilers, hack on them, maybe write this
attack or write a feature that’s useful and not an attack. 
( laughter )
Those are
fine. I like the attack one more. I really enjoyed my time hacking on compilers. And
I hope you do, too. Thank you.

Tags: , , , , , , , , , , ,

There are no comments yet

Why not be the first

Leave a Reply

Your email address will not be published. Required fields are marked *