Adding C++, Python, Java, and C# Bindings for the CodeSonar API (Part 3) 

This is the third in a series of posts about adding additional language bindings for the CodeSonar API.

[Read the first part | second part | third part | fourth part | fifth part]


Example #3

Invoke any function that returns a vector

// C
size_t bn;
cs_result r;
cs_ast_field *buf = NULL;
r = cs_ast_children(node, NULL, 0, &bn);
if( r == CS_TRUNCATED )
{
    buf = malloc( bn );
    if( !buf ) abort();
}
else if( r != CS_SUCCESS )
    abort();
r = cs_ast_children(node, buf, bn, &bn);
if( r != CS_SUCCESS ) abort();
free(buf);

// C++
std::vector<ast_field> vec = node.children();

# Python
vec = node.children()

// Java
ast_field_vector vec = node.children();


Enter SWIG

The code "lifting" the C API to Scheme clocks in at about 15KLOC of hand-written C. It is a maintenance headache to keep the two consistent, but on the bright side the code is highly formulaic — so formulaic, one wonders if a tool could generate it.

SWIG is exactly that. It is an open source tool that takes C/C++ function prototypes as input and generates the glue code necessary to "lift" those functions to other languages such as Python, Java, C#, and many more. If it sounds a little too good to be true, that’s because it is. Handing it a random header file will likely produce a disorganized, memory-unsafe, partially functioning API. Additionally, support for some target languages (Python) is much more mature than for others (Lua).

For SWIG to produce a properly organized API, one with objects and methods, it needs C++ class definitions as input, as opposed to C-style prototypes. By default, it will lift all public methods and data members. Our task, therefore, is to produce a C++ API that is usable both directly by humans and also by SWIG. In order for the higher-level languages to have memory safety, the C++ API must be bullet-proof. It should be difficult for a C++ API client to cause a hard crash in the absence of new, delete, casts, unions, pointers, arithmetic exceptions, and other scary things I've overlooked. If this condition is met, then SWIG should be able to lift the API to other languages without introducing memory-safety issues.

By using SWIG, we can generate APIs that are isomorphic with the C++ API in many languages. Consistency across multiple languages will lower maintenance, testing, documentation, and training costs. Furthermore, the implementation cost without SWIG would have been prohibitive. The API layering looks like this:

[Figure: API layering]

SWIG and public data members: Just say "no"

If memory safety is desirable, then do not let SWIG see any public data members. In order to implement Python code like this:

p.q.f = 42

SWIG can end up running code conceptually like:

T *field_ptr = &p->q;

decref(p);

field_ptr->f = 42;

The problem is that p’s reference count could drop to 0 at the decref, at which point p is freed and the write through field_ptr becomes a use-after-free. Use getters and setters to avoid this behavior.
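A minimal sketch of the getter/setter alternative. The types `inner` and `outer` are hypothetical stand-ins for `q` and `p` above and are not part of the CodeSonar API. Because the accessors pass values rather than exposing the address of a member, no pointer into the parent object outlives the call.

```cpp
// Hypothetical types for illustration only.
class inner {
public:
    inner() : f_(0) {}
    int f() const { return f_; }           // getter returns a copy
    void set_f(int value) { f_ = value; }  // setter takes a value
private:
    int f_;                                // not visible to SWIG
};

class outer {
public:
    inner q() const { return q_; }         // a copy, not a pointer into *this
    void set_q(const inner &value) { q_ = value; }
private:
    inner q_;
};

// The equivalent of "p.q.f = 42" against the accessor-based interface.
// Input by value, output by value.
inline outer with_f(outer p, int value) {
    inner tmp = p.q();
    tmp.set_f(value);
    p.set_q(tmp);
    return p;
}
```

The cost is an extra copy per access, which is the trade being made for never handing out an interior pointer.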

SWIG and pointers: best avoided

SWIG has all the same issues with pointers that humans have: it doesn't know which ones are…

  • Input parameters
  • Output parameters
  • Arrays
  • Destroyed by the call
  • Created by the call

There are out-of-band ways to specify these things to SWIG, but since humans are also confused by pointers, let’s avoid them in the C++ API. const C++ references are OK since they are clearly inputs. Other pointers and references should not be used as return types or parameter types. This means absolutely no output parameters in the C++ API. Private members can still use pointers. Hopefully the compiler will be smart about doing Return Value Optimization (RVO), since all functions return their outputs by-value.
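The rule above can be sketched as a wrapper around the two-call C idiom from Example #3. Everything here is hypothetical: `demo_children` is a toy stand-in for a C function that reports a required element count, and `children` is the pointer-free C++ face of it. Pointers appear only inside the wrapper, never in its signature.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical C-style function following the two-call idiom: it always
// writes the required element count to *needed and fills buf only when
// capacity is large enough.
enum demo_result { DEMO_SUCCESS, DEMO_TRUNCATED };

inline demo_result demo_children(std::size_t node, int *buf,
                                 std::size_t capacity, std::size_t *needed) {
    const std::size_t count = node;  // toy rule: node N has N children
    *needed = count;
    if (capacity < count) return DEMO_TRUNCATED;
    for (std::size_t i = 0; i < count; ++i)
        buf[i] = static_cast<int>(node * 10 + i);
    return DEMO_SUCCESS;
}

// Pointer-free wrapper: input by value, output by value, errors as exceptions.
inline std::vector<int> children(std::size_t node) {
    std::size_t needed = 0;
    demo_result r = demo_children(node, NULL, 0, &needed);
    if (r == DEMO_SUCCESS) return std::vector<int>();  // nothing to fetch
    std::vector<int> out(needed);
    r = demo_children(node, out.data(), out.size(), &needed);
    if (r != DEMO_SUCCESS) throw std::runtime_error("demo_children failed");
    return out;
}
```

A caller writes `std::vector<int> v = children(3);` and never sees a buffer, a size, or a status code.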

SWIG and STL: Sufficient

With pointers off the table, we need some way of implementing arrays and strings. Good news: SWIG has some support for lifting STL types. For example, it is smart enough to translate a std::string into Python’s str type and back.

All we need are std::string and std::vector. However, by default SWIG will not translate std::vector into the native sequence type of the higher-level language; it will instead lift a new opaque type wrapping an actual std::vector. There is a good reason for this in some cases: converting a std::vector to a Python list takes O(n) time, whereas wrapping an existing std::vector takes O(1). However, for our purposes, all the functions that have vectors as input or output already take O(n) time, so we would like to use the native type, such as Python’s list.

This was easy enough to do with Python: someone else had already done the work and I basically needed to flip a flag and make a few adjustments. Java and C# are a different story: they still use the opaque types.

In general, the C++ API mimics STL’s naming and style conventions.

Unlike the C API, the C++ API has compile-time dependencies on system header files. This could potentially cause problems with some compilers, but I suspect it will be OK with most implementations of the STL.

SWIG and templates: Yes

Modern versions of SWIG have fairly good support for templates. However, you must explicitly tell SWIG about each template instantiation it should lift. SWIG also has different rules than C++ about the order in which it must read code: it wants to have seen the definition of a type before seeing a template instantiation that uses that type. I would say this was the most painful part of the whole process.

SWIG and inheritance: Yes

SWIG supports inheritance, but there are a few gotchas. Similarly to the template situation, SWIG must see the definition of a base class before seeing the definition of a derived class. If the base class is a template instantiation, then this can become particularly painful.

SWIG and the Curiously Recurring Template Pattern: Tricky

CodeSurfer has several "set" types with the same interface that store differently-typed values. For instance, there is a set of symbols and there is a set of program points. To preserve asymptotics, these types needed to be lifted as opposed to converting them to native types of the higher level languages.

In order to avoid code duplication in the C++ interface, I used the Curiously Recurring Template Pattern and template specialization to tie into the correct C functions for each template instantiation. Thus, a C++ API client might declare a set of symbols as cs::set<cs::symbol>. Surprisingly, it was eventually possible to make SWIG swallow this after much trial and error with respect to the order in which SWIG saw various declarations.
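The combination described above might look roughly like the following sketch. All names live in a hypothetical `demo` namespace, the two `c_*_set_kind` functions stand in for type-specific C functions, and the storage is a toy vector rather than anything resembling the real sets. The shared interface is written once in a CRTP base, and each specialization of `set<T>` supplies the type-specific hook.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

namespace demo {

struct symbol { int id; };
struct point  { int id; };

// Hypothetical stand-ins for the type-specific C functions.
inline std::string c_symbol_set_kind() { return "symbol_set"; }
inline std::string c_point_set_kind()  { return "point_set"; }

template <class T> class set;  // primary template, specialized below

// CRTP base: the common interface, dispatching to Derived where types differ.
template <class Derived, class T>
class set_base {
public:
    void add(const T &value) {
        if (!contains(value)) items_.push_back(value.id);
    }
    bool contains(const T &value) const {
        return std::find(items_.begin(), items_.end(), value.id) != items_.end();
    }
    std::size_t size() const { return items_.size(); }
    std::string kind() const {
        return static_cast<const Derived *>(this)->kind_impl();
    }
private:
    std::vector<int> items_;
};

// Each specialization ties the shared interface to "its" C functions.
template <> class set<symbol> : public set_base<set<symbol>, symbol> {
public:
    std::string kind_impl() const { return c_symbol_set_kind(); }
};

template <> class set<point> : public set_base<set<point>, point> {
public:
    std::string kind_impl() const { return c_point_set_kind(); }
};

}  // namespace demo
```

A client then declares `demo::set<demo::symbol>` and gets `add`, `contains`, and `size` without any of them being duplicated per element type.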

SWIG and callbacks: Yes

CodeSonar exposes a visitor interface with which API clients can register visitor callbacks to be invoked on various IR types during various analysis passes. SWIG implements a feature termed “directors.” It enables higher-level languages to subclass virtual C++ classes. This is the natural way to do callbacks in a statically typed object oriented language anyway, so the C++ API exposes abstract functors to be subclassed.
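As a rough illustration of the abstract-functor shape, here is a sketch with hypothetical names; the real CodeSonar classes and visit signatures may differ. The base class has a single pure virtual method, and a client subclass overrides it. With directors enabled, a subclass written in Python or Java would play the role of `counting_visitor`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical abstract functor; symbols are modeled as plain names here.
class symbol_visitor {
public:
    virtual ~symbol_visitor() {}
    virtual void visit(const std::string &symbol_name) = 0;  // overridden by clients
};

// Stand-in for an analysis pass that invokes the visitor once per symbol.
inline void run_pass(const std::vector<std::string> &symbols,
                     symbol_visitor &visitor) {
    for (std::size_t i = 0; i < symbols.size(); ++i)
        visitor.visit(symbols[i]);
}

// A client-side subclass that simply counts the symbols it is shown.
class counting_visitor : public symbol_visitor {
public:
    counting_visitor() : count_(0) {}
    virtual void visit(const std::string &) { ++count_; }
    int count() const { return count_; }
private:
    int count_;
};
```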

For the Python API, we add some syntactic sugar so that arbitrary callables can be used as functors, avoiding the need for subclassing. For Python visitors, we expose decorators such as symbol_visitor:

@cs.symbol_visitor
def visit_symbol(sym):
    print('sym is', sym)

SWIG and exceptions: Yes

Exceptions are absolutely necessary for producing a reasonable API with SWIG. The good news is, SWIG can lift C++ exceptions to the higher level languages. However, it is quite a bit of work to get it just right.

SWIG reads exception specifier lists ("throw (int)") to determine what exceptions a method might throw. However, inline methods can’t have these, and I don't particularly want to write over 1000 of them. I ended up telling SWIG that basically every function in the universe potentially throws our main exception type (essentially a status code), and then overriding that in a few places (mostly destructors).

SWIG needs hand-written language-specific glue code for each C++ exception type that tells it how to catch the exception, construct the exception object in the higher-level language, and then throw it in the higher-level language. This required learning details about CPython/JNI/C# that I would have been happy to skip.

SWIG and callbacks that throw exceptions: Ouch!

Allowing user-defined subclasses to throw arbitrary exceptions (the only reasonable thing to do) is quite painful. It tends to involve writing language-specific code to recognize that an exception is happening in the higher level language, repackaging it in a way that allows the C stack to unwind, and in some cases re-raising it in some transitive caller just before returning control to the higher-level language. This requires special code for every language in every lifted method that transitively invokes callbacks.

Part of the issue is that you cannot catch an arbitrarily-typed C++ exception, allow the C stack to unwind, and then rethrow it as a C++ exception later, since not all C++ exceptions share a virtual base class. This is an issue with C++ independent of SWIG.
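The limitation above applies to compilers without C++11 support. As an assumption beyond what this post relies on: where C++11 is available, `std::current_exception` and `std::rethrow_exception` can park an exception of any type and rethrow it after the C frames have returned. The sketch below uses a hypothetical C-style driver and trampoline; none of these names come from the CodeSonar API.

```cpp
#include <exception>

// Hypothetical C-style driver: it understands only status codes, so no
// exception may propagate through it.
typedef int (*c_callback)(void *context);
inline int c_driver(c_callback cb, void *context) { return cb(context); }

struct callback_state {
    void (*user_function)();     // the client's callback, which may throw
    std::exception_ptr pending;  // whatever was thrown, of any type
};

// Trampoline handed to the C layer: catch everything and park it.
inline int trampoline(void *context) {
    callback_state *state = static_cast<callback_state *>(context);
    try {
        state->user_function();
        return 0;
    } catch (...) {
        state->pending = std::current_exception();
        return 1;
    }
}

// C++ entry point: run the C driver, then rethrow once it has returned.
inline void run_with_callback(void (*user_function)()) {
    callback_state state = { user_function, std::exception_ptr() };
    c_driver(&trampoline, &state);
    if (state.pending) std::rethrow_exception(state.pending);
}

inline void throwing_callback() { throw 42; }  // not derived from std::exception
inline void quiet_callback() {}
```

This covers only the C++ side; repackaging exceptions raised in Python, Java, or C# still needs the language-specific code described above.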

In the end, we have something solid, but it was laborious for Java and C# (Python was significantly easier).

SWIG and Enums

I decided to implement boxed enum and flag types in C++, instead of using native C++ enum types. Why? I wanted stronger runtime validation and typing, and I also wanted the enum types to have methods. So we have a class for each enum type, and a global of the class type for each enum value. The C++11 feature "constexpr" would be very useful for defining these globals, but it will be a few years before the feature is widely available.
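A boxed enum along these lines could be sketched as follows. The `color` type and its values are hypothetical, chosen only to show the shape: a class with a private constructor, a validating conversion from raw integers, a method, and one global per value.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical boxed enum; not a CodeSonar type.
class color {
public:
    // Validating conversion from a raw integer, e.g. one handed back by C code.
    static color from_int(int raw) {
        if (raw < 0 || raw > 2) throw std::out_of_range("invalid color value");
        return color(raw);
    }
    int as_int() const { return value_; }
    // A method on the enum type, which a native enum cannot have.
    std::string name() const {
        static const char *const names[] = { "red", "green", "blue" };
        return names[value_];
    }
    bool operator==(const color &other) const { return value_ == other.value_; }
    bool operator!=(const color &other) const { return value_ != other.value_; }
private:
    explicit color(int value) : value_(value) {}  // no unchecked construction
    int value_;
};

// One global of the class type per enum value.
const color color_red   = color::from_int(0);
const color color_green = color::from_int(1);
const color color_blue  = color::from_int(2);
```

Because the only way to obtain a `color` is through `from_int` or one of the globals, an out-of-range value is rejected at the boundary instead of flowing through the API.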

Easy printf debugging

Every type implements operator<< to facilitate convenient "printf debugging." In each language, operator<< is lifted to the appropriate primitive (repr in Python, toString in Java). It generally results in a short human-readable string. For example, printing a procedure typed variable will print the procedure’s name:

cout << some_procedure << endl;
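A sketch of what sits behind a line like that one, using a hypothetical `procedure` class that holds only a name. The `to_display_string` helper is also an assumption; it shows one way a wrapper's repr or toString could be derived from `operator<<`.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a procedure type.
class procedure {
public:
    explicit procedure(const std::string &name) : name_(name) {}
    std::string name() const { return name_; }
private:
    std::string name_;
};

// Printing a procedure prints its name.
inline std::ostream &operator<<(std::ostream &out, const procedure &p) {
    return out << p.name();
}

// Routes through operator<< so every language sees the same text.
inline std::string to_display_string(const procedure &p) {
    std::ostringstream buffer;
    buffer << p;
    return buffer.str();
}
```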

Side effects

Very few functions have side effects and the method names generally make it obvious when there are side effects. For example, the “add” method of the “set” class adds an element to the set.

Java and CamelCase

Java usually uses CamelCase, and I wish the API could be compliant. SWIG has limited support for magically renaming things, but I do not believe it works for the %template construct, and so I eventually threw in the towel.

C++ and Exceptions

Some C++ developers object to exceptions. Luckily, it is possible to use the C++ API without exceptions. If an API user can avoid causing any API function to throw an exception, and is happy enough to have the program terminate in the case where an exception would be thrown, then it is not necessary to compile with support for exceptions. A special preprocessor flag must be specified to activate this mode.
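One way such a mode could be wired up is sketched below. The flag name `CS_DEMO_NO_EXCEPTIONS` and the macro `CS_DEMO_FAIL` are hypothetical; the post does not name the actual flag. With the flag defined, the failure path terminates the program; without it, the same path throws.

```cpp
#include <cstdlib>
#include <stdexcept>

// Hypothetical flag and macro names, for illustration only.
#ifdef CS_DEMO_NO_EXCEPTIONS
#define CS_DEMO_FAIL(message) std::abort()
#else
#define CS_DEMO_FAIL(message) throw std::runtime_error(message)
#endif

// A toy API function with one failure condition.
inline int checked_divide(int numerator, int denominator) {
    if (denominator == 0) CS_DEMO_FAIL("division by zero");
    return numerator / denominator;
}
```

A client that never passes a zero denominator behaves identically in both modes, which is the property that makes the exception-free build usable.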

Resource Acquisition is Initialization and null values

Many of the C++ types simply do not offer default constructors, and API functions will never return null-valued IR elements. This should reduce errors from null or uninitialized IR elements.
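A small sketch of the idea, with a hypothetical `symbol` type and `lookup` function. The class has no default constructor, so an uninitialized `symbol` cannot be declared, and the lookup reports failure by throwing rather than by returning a null-like value. The `static_assert` assumes a C++11 compiler.

```cpp
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical IR element type: constructible only from real data.
class symbol {
public:
    explicit symbol(const std::string &name) : name_(name) {}
    std::string name() const { return name_; }
private:
    std::string name_;
};

// "symbol s;" does not compile, which this check makes explicit.
static_assert(!std::is_default_constructible<symbol>::value,
              "symbol must not be default-constructible");

// Failure is an exception, never a null symbol.
inline symbol lookup(const std::string &name) {
    if (name.empty()) throw std::invalid_argument("no such symbol");
    return symbol(name);
}
```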

That concludes our discussion of design considerations and implementation of the new APIs. The next post discusses fuzzing the Python API to find bugs.