Open-source Tools for Binary Analysis and Rewriting
Unfortunately, binary-only software is unavoidable: dependencies of active software projects, firmware and applications distributed without source access, or simply old software whose developers are no longer drawing paychecks (or drawing breath). Consequently, binary analysis and rewriting are topics of perennial interest to security and software engineering researchers and practitioners.
Binary analysis enables the review of binary software, and binary rewriting enables the remediation of problems in binary software. Both depend on a high-quality intermediate representation (IR) of the binary and a high-quality disassembler to lift executables to this IR. GrammaTech is releasing both under open-source licensing: GTIRB, GrammaTech's IR for Binaries (not to be confused with an intermediate language [1]), and ddisasm, a fast and accurate disassembler capable of lifting binary programs to GTIRB. We are also releasing GTIRB-pprinter, a pretty printer from GTIRB to assembler which, when used with an assembler and linker, completes a robust end-to-end binary rewriting system.
We are in the midst of a revival of interest in binary analysis and reverse engineering. Signs of this include the new Binary Analysis Research (BAR) workshop co-located with NDSS (we published at BAR 2018) and the centrality of binary analysis to DARPA's Cyber Grand Challenge (we took silver). Analysis and rewriting tools, once a niche domain largely for government agencies and CTF competitions, are now commonplace, with dozens of academics and companies building their own frameworks. IDA Pro remains the most commonly used interactive binary analysis tool, with some newer tools like Binary Ninja rapidly gaining in popularity. NSA's recently open-sourced Ghidra has garnered a great deal of attention. Of the academic tools, Angr (also from a CGC finalist) is probably the most widely used open-source binary analysis platform supporting binary rewriting, and it currently has the best published binary rewriting results. There are many platforms supporting binary rewriting. [2]
GrammaTech has been working in this field for roughly 20 years. Our CodeSonar for Binaries is an easy-to-use, on-premises automated fault detection tool for native binaries. Our binary analysis and rewriting framework, which supports this commercial tool as well as our binary rewriting tooling, is the most mature framework of its kind.
We are releasing GTIRB in the hope that it provides a common data structure to facilitate communication and collaboration between the many new entrants to this space. We hope that the combined open-source suite of GTIRB, ddisasm, and GTIRB-pprinter will lower the barrier to entry so that anyone with an interesting new approach to binary analysis, transformation, or rewriting can try out their ideas without first having to make the huge investment required to get a usable IR. (Ideally we would like to see ddisasm and GTIRB become the Clang and LLVM of binary analysis research.) In the remainder of this tool introduction we will review both GTIRB and ddisasm, and demonstrate their use by identifying, and then rewriting to neutralize, a real-world (if old) command injection exploit in a popular piece of open-source software.
Case Study: UnrealIRCd Backdoor Detection and Removal
This Dockerfile may be used to play along.
To introduce these tools, let's use them to find and fix a real flaw in a real program. (We'll use an open-source program for the example, so you could just fix this problem in the source code, but let's play along regardless.) In 2009 there was a brazen backdoor in version 3.2.8.1 of UnrealIRCd (see also this writeup). This backdoor could be used by an attacker to run any command on the system running the IRC daemon. Due to some clever hiding in nested macros, the backdoor persisted for seven months before it was detected. (This is actually one case where the exploit is easier to see in the binary, thanks to the C preprocessor removing the obfuscation. In fact, GrammaTech's CodeSonar automatically flags this as a Command Injection vulnerability. [3])
So, if you want to play along, you can build it yourself from this tarball of the backdoored source. [4] To get this to build locally I had to (1) remove the inline declaration from parse_addlag in src/parse.c and (2) manually re-run the final gcc invocation, adding -ldl (both are included in the Dockerfile). After this you should find the vulnerable binary in src/ircd (remember, don't run it on a public machine). With the vulnerable binary built, and with GTIRB, GTIRB-pprinter, and ddisasm built locally (please open issues on the GitHub repos if you have any problems building any of our tools), you're ready to lift the binary to IR, repair, and rewrite.
Lifting should be easy. Simply run the ddisasm executable, instructing it to dump a GTIRB representation of UnrealIRCd.
ddisasm /ircd --ir ircd.gtirb
(This may take a couple of minutes, time we can use to explain how ddisasm works.) The ddisasm tool is a disassembler whose analysis is implemented in Datalog. Exact disassembly is impossible in general, so all practical approaches rely on heuristics based on assumptions about what common compilers and common assembly code are known to do. As it turns out, Datalog rules are an absolutely marvelous way to express these heuristics. We can write them declaratively and concisely, and the high-performance Soufflé Datalog engine combines the rules and compiles them into fast parallel C++ code. With this approach ddisasm is able to lift faster and more accurately than the best previously published lifters, with much less implementation effort (measured in either FTE months or KLOC). We are really hoping to see the binary analysis community contribute additional heuristics to ddisasm.
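To give a flavor of how declarative rules combine, here is a toy sketch in Python of Datalog-style fixpoint evaluation. The three rules and the tiny instruction table are invented for illustration and are not ddisasm's actual heuristics; Soufflé compiles real Datalog programs rather than running a loop like this one.

```python
# Toy Datalog-style evaluation: derive facts from rules until nothing changes.
# The instruction table and the rules are hypothetical, not ddisasm's own.

# Candidate decodings: address -> (size in bytes, kind, jump target or None)
CANDIDATES = {
    0x10: (2, "mov", None),
    0x12: (2, "jmp", 0x18),
    0x14: (4, "data?", None),   # bytes that happen to decode but are never reached
    0x18: (1, "ret", None),
}
ENTRY_POINTS = {0x10}


def infer_code(candidates, entry_points):
    """Return the set of addresses derived as code.

    Rules, in Datalog notation:
      code(A) :- entry_point(A).
      code(B) :- code(A), falls_through(A, B).
      code(T) :- code(A), jumps_to(A, T).
    """
    code = set(entry_points)
    changed = True
    while changed:                          # naive fixpoint iteration
        changed = False
        for addr in list(code):
            size, kind, target = candidates[addr]
            derived = set()
            if kind not in ("jmp", "ret"):
                derived.add(addr + size)    # fall-through successor
            if target is not None:
                derived.add(target)         # explicit jump target
            for new in derived:
                if new in candidates and new not in code:
                    code.add(new)
                    changed = True
    return code


code_addrs = infer_code(CANDIDATES, ENTRY_POINTS)
print(sorted(hex(a) for a in code_addrs))
```

The candidate at 0x14 is never derived as code because no rule reaches it, which is the kind of conclusion a heuristic rule set is meant to support.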
Now that we have lifted to GTIRB, we can investigate the results using the reference C++ GTIRB library. Note that GTIRB is serialized using Google's Protobuf, so other languages may also be used, and we hope to develop a dedicated Python library as well. See the Examples section of the GTIRB manual for examples demonstrating the use of GTIRB in C++, Python, and Java.
This simple C++ program, blog.cpp, finds a block invoking system and then walks back from this block in GTIRB's inter-procedural control flow graph, printing the names of referenced symbols.
We can compile this and run it on the IR resulting in the following:
g++ --std=c++17 -lgtirb blog.cpp
./a.out ircd.gtirb 1
# => Basic Block calling system found at 0x00407f32
# => Predecessor 0x00407f32 references readbuf
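The kind of search blog.cpp performs can be sketched in Python on a hand-built toy graph. This sketch does not use the GTIRB API. Only the symbol names system and readbuf come from the output above; the addresses, the strlen reference, the dictionaries, and the function names are assumptions made for illustration.

```python
# Toy stand-in for the search described above; this is NOT the GTIRB API.
# Addresses and the graph shape are hypothetical.
from collections import deque

# Each block: address -> set of symbols its instructions reference.
BLOCK_SYMBOLS = {
    0x1000: {"strlen"},
    0x1010: {"readbuf"},
    0x1020: {"system"},
    0x1030: set(),
}
# Control-flow edges as (source block, destination block).
CFG_EDGES = [
    (0x1000, 0x1010),
    (0x1010, 0x1020),
    (0x1020, 0x1030),
]


def blocks_calling(symbol, block_symbols):
    """Addresses of every block that references the given symbol."""
    return sorted(a for a, syms in block_symbols.items() if symbol in syms)


def predecessors_within(start, edges, max_depth):
    """Breadth-first walk backwards over CFG edges, up to max_depth hops."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        block, depth = queue.popleft()
        if depth == max_depth:
            continue                      # do not expand past the hop limit
        for p in preds.get(block, []):
            if p not in seen:
                seen.add(p)
                order.append(p)
                queue.append((p, depth + 1))
    return order


for call_block in blocks_calling("system", BLOCK_SYMBOLS):
    print(f"Basic block calling system found at {call_block:#010x}")
    for pred in predecessors_within(call_block, CFG_EDGES, max_depth=1):
        for sym in sorted(BLOCK_SYMBOLS[pred]):
            print(f"Predecessor {pred:#010x} references {sym}")
```

Raising max_depth widens the search to earlier blocks, at the cost of reporting more unrelated symbols.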
So the first (and only) basic block calling system has a predecessor which references readbuf. This looks like the backdoor described in the previously mentioned writeup. Finally, we can repair, re-assemble, and re-link a new UnrealIRCd with the vulnerability removed. For simplicity we can perform this repair by modifying the pretty-printed assembly code. We first invoke gtirb-pprinter to convert the GTIRB to assembler.
gtirb-pprinter /ircd.gtirb -o /ircd.s
Now we simply remove the bad call to system. There is only one call to system in this case (otherwise we could leverage the output of our simple GTIRB program above to remove only the bad call), so we can easily use sed as our binary rewriter to repair the program.
Confirm there is only one call to system:
# Only one single call to system:
grep -c system@PLT ircd.s
# => 1
# We can view its context:
grep -C3 system@PLT ircd.s
# => .L_407f32:
# =>
# => mov EDI,OFFSET readbuf
# => call system@PLT
# => .L_407f3c:
# => jmp .L_407de2
# => .L_407f41:
Use sed to replace the call to system with a call to puts:
sed 's/system@PLT/puts@PLT/' -i ircd.s
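If the program contained several calls to system, a blanket substitution would be too blunt. As a sketch of the more targeted alternative mentioned earlier, the Python below rewrites the call only inside one labeled block of the assembly text. The function name and the second block in the sample listing (.L_500000) are illustrative assumptions; the other labels and symbols follow the grep output above.

```python
import re


def retarget_call_in_block(asm_text, label, old_target, new_target):
    """Replace `call old_target` with `call new_target` only in one block.

    A block is taken to start at the line `label:` and to end at the next
    line that looks like a label (an identifier followed by a colon).
    """
    label_re = re.compile(r"^\s*[.\w$]+:\s*$")
    call_re = re.compile(r"(\bcall\s+)" + re.escape(old_target) + r"(?![\w@])")
    out, in_block = [], False
    for line in asm_text.splitlines():
        if label_re.match(line):
            in_block = line.strip() == label + ":"
        elif in_block:
            line = call_re.sub(lambda m: m.group(1) + new_target, line)
        out.append(line)
    return "\n".join(out)


SAMPLE = """\
.L_407f32:
        mov EDI,OFFSET readbuf
        call system@PLT
.L_407f3c:
        jmp .L_407de2
.L_500000:
        call system@PLT
"""

fixed = retarget_call_in_block(SAMPLE, ".L_407f32", "system@PLT", "puts@PLT")
print(fixed)
```

Only the call under .L_407f32 is redirected; the hypothetical second call is left alone, which is the behavior a single global sed expression cannot give.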
Finally, we re-assemble and re-link the modified assembler (we could call ld explicitly or we can let gcc do this for us). We run ldd on the original ircd binary to see what dynamic libraries we need to link against,
ldd ircd
# => linux-vdso.so.1 (0x00007fff7278a000)
# => libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fb286413000)
# => librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb28620b000)
# => libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb286007000)
# => libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb285c16000)
# => libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb2859f7000)
# => /lib64/ld-linux-x86-64.so.2 (0x00007fb28664b000)
and we invoke gcc to link the new repaired ircd-fixed binary. (Note that we add the -no-pie flag so that gcc does not try to build a position-independent executable, which is the default on Ubuntu 18.)
gcc ircd.s -o ircd-fixed -lcrypt -lrt -ldl -lc -lpthread -no-pie
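The step from the ldd listing to the -l flags is mechanical, so it can be scripted. The following Python is a minimal sketch under the assumption that every needed library appears in the usual `libNAME.so.N => /path` form; the function name is made up for this example.

```python
import re


def link_flags_from_ldd(ldd_output):
    """Turn `ldd` output lines into -l flags, preserving their order.

    Only entries of the form `libNAME.so[.N] => /path` are used; the vDSO
    and the dynamic loader do not match that form and are skipped.
    """
    flags = []
    for line in ldd_output.splitlines():
        m = re.match(r"\s*lib([\w+-]+?)\.so(?:\.\d+)*\s+=>", line)
        if m:
            flag = "-l" + m.group(1)
            if flag not in flags:
                flags.append(flag)
    return flags


LDD_OUTPUT = """\
linux-vdso.so.1 (0x00007fff7278a000)
libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007fb286413000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb28620b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb286007000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb285c16000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb2859f7000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb28664b000)
"""

flags = link_flags_from_ldd(LDD_OUTPUT)
print(" ".join(flags))
```

The resulting flags match the library list in the gcc command above; -no-pie still has to be added by hand because ldd says nothing about how the executable was linked.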
The resulting ircd-fixed binary is just like the original, but without the nefarious call to system that formed the backdoor. When the backdoor is triggered, what was previously a remote shell should now print harmless debug output.
Hopefully this was informative. We are really proud of these tools and we hope others find them useful as well. We will be continuing to improve them as we use them more widely here at GrammaTech.
This material is based upon work supported by the Office of Naval Research under Contract No. N68335-17-C-0700. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Office of Naval Research.
[1] GTIRB is not an IL for representing the semantics of assembler instructions in the same way that BAP's BIL, Angr's Vex, or Ghidra's P-code are ILs. GTIRB represents the higher-level structure of the binary. These structures are often the result of sophisticated analyses (e.g., those performed by our lifter, ddisasm). They include identification of code and data blocks, construction of the control flow graph, and information on cross-references (i.e., symbolization). These structures support additional binary analysis, and they are also sufficient to support modification of the binary and re-assembly of a new binary executable.
For instruction representation, GTIRB uses the most general and efficient representation we could find: the machine code bytes. Users of GTIRB read and write these bytes using the decoder/encoder of their own favorite intermediate language (e.g., BIL, Vex, P-code) or using the high-quality open-source Capstone/Keystone.
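As a rough picture of the structures listed in this note, the Python sketch below models blocks that keep raw machine code bytes, symbolic references attached to byte offsets, and a control flow graph as a list of edges. It is an illustrative data model only: the class and field names are assumptions, and the real GTIRB schema is defined by its Protobuf files and differs in names and detail.

```python
# Illustrative data model only; NOT the real GTIRB schema or API.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Block:
    address: int
    raw_bytes: bytes                 # instructions stay as machine code bytes
    is_code: bool = True
    # byte offset within the block -> symbol referenced there (symbolization)
    symbolic_refs: Dict[int, str] = field(default_factory=dict)


@dataclass
class Module:
    name: str
    blocks: Dict[int, Block] = field(default_factory=dict)
    cfg_edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_block(self, block: Block) -> None:
        self.blocks[block.address] = block

    def successors(self, address: int) -> List[int]:
        return [dst for src, dst in self.cfg_edges if src == address]


ircd = Module("ircd")
# 0xBF is `mov edi, imm32` and 0xE8 is `call rel32`; the zeroed operands are
# placeholders whose meaning is carried by the symbolic references instead.
ircd.add_block(Block(
    0x1000,
    b"\xbf\x00\x00\x00\x00\xe8\x00\x00\x00\x00",
    symbolic_refs={1: "readbuf", 6: "system"},
))
ircd.add_block(Block(0x100A, b"\xc3"))                      # 0xC3 is `ret`
ircd.add_block(Block(0x2000, b"hello\x00", is_code=False))  # a data block
ircd.cfg_edges.append((0x1000, 0x100A))
print(ircd.successors(0x1000))
```

Keeping operands symbolic rather than as fixed addresses is what allows blocks to move when the program is re-assembled.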
[2] Here is a partial list of binary analysis platforms with some rewriting support (various levels of maturity).
[3] CodeSonar automatic warning of command injection in UnrealIRCd.