cat /dev/brain

Making Python Faster with Rust and CFFI, or Not

Recently, I had the (genuine) joy of helping port a Python library with a C extension to work on both Python 2 and Python 3. C was the first language I really understood well, and I have some (possibly misplaced) nostalgia for the time when I wrote C exclusively and aspired to hack on parts of Linux.

After the fact, however, I started thinking about how one might implement the same functionality with Rust. I ran into some problems along the way and, further, found something rather interesting: extending Python with Rust via CFFI was, in this very specific situation, not much better than a pure CPython implementation.

First, some background information and requirements:

  • The existing C extension only works with CPython.
  • There are already benchmarks in the project showing that the C extension can be anywhere from 17 to 30 times faster than the pure-Python version.
  • I wanted to use Rust and CFFI to potentially replace the C extension so it could be used on both PyPy and CPython, but it needed to be at least as fast (and hopefully faster).
  • The user experience should be seamless: if the user gets a dictionary back from the pure-Python implementation, it should behave exactly the same when using the C or Rust extensions.
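To make that last requirement concrete, here's a rough sketch of the behavior every implementation has to match: a hypothetical, heavily simplified pure-Python loads (no escape handling, naive splitting) that parses an hstore string into whatever return_type the caller asks for. This is my illustration, not the project's actual code.

```python
def naive_loads(hstore_string, return_type=dict):
    # Hypothetical, simplified parser for illustration only: assumes pairs
    # are separated by ", " and that keys/values contain no escapes.
    pairs = []
    for chunk in hstore_string.split(", "):
        if not chunk:
            continue
        key_part, _, value_part = chunk.partition("=>")
        key = key_part.strip().strip('"')
        value_part = value_part.strip()
        value = None if value_part == "NULL" else value_part.strip('"')
        pairs.append((key, value))
    return return_type(pairs)
```

Whatever backs it, the pure-Python, C, and Rust versions all need to produce the same dict (or list of pairs) for the same input.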

The project, pghstore, was already fairly mature and had existed for quite a while. As part of adding Python 3 support, I re-organized it a little to ensure we were testing the installed version of pghstore by placing its source inside a src directory, and added tests for missing corners. As such, I reused the src directory for my Rust project, so the directory tree looks like:

src/
 + pghstore/
    + __init__.py
    + .. other C and Python files
 + pghstorers/
    + lib.rs
    + .. other Rust files
Cargo.toml
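For CFFI to load the Rust code at runtime, it has to be built as a C-compatible shared library, which means giving the crate the cdylib crate type. A minimal Cargo.toml along those lines (the exact fields here are my guesses based on the tree above, not necessarily the project's actual manifest):

```toml
[package]
name = "pghstorers"
version = "0.1.0"

[lib]
# CFFI dlopens a C-compatible shared library, hence cdylib
crate-type = ["cdylib"]
path = "src/pghstorers/lib.rs"
```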

I built a Rust crate that could act as a stand-alone library to generate and parse HStore strings. I used a rather naive algorithm that benchmarked at less than 0.0005s (or 500,000 nanoseconds) even for very large examples. If there were any doubts before, it was clear that Rust was fast enough for my needs. I then also wrote a naive bridge between C(FFI) and Rust. This included having to create structs that were representations CFFI would use to pass the data back and forth. Finally, I wrote the code in Python to use CFFI.
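The CFFI side of that bridge amounts to declaring the C-compatible structs and functions so Python knows how to call them. A rough sketch of what that declaration looks like; the struct layouts and names here are my reconstruction from the Python wrapper code, not the project's exact declarations, though the function names do appear in the wrapper:

```python
from cffi import FFI

ffi = FFI()
# An item struct holding one key/value pair, an error struct with a
# description, and the load/free functions the Python wrapper calls.
ffi.cdef("""
    typedef struct { char *key; char *value; } HStoreItem;
    typedef struct { char *description; } HStoreError;

    HStoreItem *hstore_loads(char *string, int *returned_length,
                             HStoreError *error);
    void hstore_loads_free(HStoreItem *items);
    void hstore_dumps_free(char *string);
""")
```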

Our Results

Unfortunately (as you may have guessed), when I ran our Python benchmarks, it turned out that on smaller examples the Rust extension was roughly as fast as the pure-Python implementation (no more than 0.5s). Only on sufficiently large examples did the Rust extension outperform CPython, and even then it was only about 2x as fast (19 seconds versus 38). With the exact same examples, the C extension took a fraction of the time: on the largest example, the C extension took 1.4s while the Rust extension took 19s.

This was a little surprising at first, because in my Rust-only benchmarks even much larger examples ran much faster. So I took a closer look at the Python end of things. Let's look at this together:

def _decode_cdata(cdata, encoding):
    if cdata == ffi.NULL:
        return None
    return ffi.string(cdata).decode(encoding)


def loads(hstore_string, encoding='utf-8', return_type=dict):
    """Load a string into a dictionary."""
    error = error_ptr()
    hstore_char_ptr = char_from(hstore_string, encoding)
    returned_length = ffi.new("int *", 0)
    cdata = lib.hstore_loads(hstore_char_ptr, returned_length, error)

    length = returned_length[0]
    if cdata == ffi.NULL:
        if error.description != ffi.NULL:
            description = ffi.gc(error.description, lib.hstore_dumps_free)
            message = ffi.string(description)
            raise exceptions.ParseError(message, hstore_string)
        # No data and no error means there was nothing to parse
        return return_type()

    cdata = ffi.gc(cdata, lib.hstore_loads_free)
    items = ffi.unpack(cdata, length)
    return return_type(
        (_decode_cdata(i.key, encoding), _decode_cdata(i.value, encoding))
        for i in items
    )

This is exactly what we're benchmarking. Let's review the steps in the code:

  1. We're converting the hstore_string to something we can pass to a C function. This is very fast in CFFI.

    hstore_char_ptr = char_from(hstore_string, encoding)
    
  2. We do some error checking, which is also pretty darn fast.

    if cdata == ffi.NULL:
    
  3. Finally, we have to unpack the returned key-value pairs and coerce them into the return type that the user wants (to match the pure Python and C extension APIs). This is the slow part.

    items = ffi.unpack(cdata, length)
    return return_type(
        (_decode_cdata(i.key, encoding), _decode_cdata(i.value, encoding))
        for i in items
    )
    

It turns out that no matter how you shave that yak, creating a large dictionary in pure-Python is really darn slow. I'd be willing to believe that I'm doing something silly here, but I can't find a faster way to build a dictionary.
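You can see that cost with no FFI involved at all. A quick sketch with timeit (the sizes and names here are arbitrary, not the project's benchmarks) that mirrors the hot path of decoding every key and value and then materializing the dict:

```python
import timeit

def build(pairs):
    # Mirrors the hot path: decode every key/value, then build the dict
    return dict((k.decode("utf-8"), v.decode("utf-8")) for k, v in pairs)

pairs = [(b"key%d" % i, b"value%d" % i) for i in range(100000)]
seconds = timeit.timeit(lambda: build(pairs), number=10)
result = build(pairs)
print("10 builds of a 100,000-entry dict took %.3fs" % seconds)
```

The per-item Python-level work (the decode calls, the tuple creation, the generator machinery) is, as far as I can tell, what dominates, and it's exactly the work the C extension performs in compiled code instead.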

Why doesn't the C version suffer this same problem?

That's a great question; I'm glad I asked it for you. The short answer is that the C implementation uses CPython's C API to build the extension. That gives it direct access to all of CPython's types at a really low level, and because the code is compiled instead of interpreted, those operations are incredibly fast. Let's take a look at the C implementation together:

static PyObject *
_speedups_loads(PyObject *self, PyObject *args, PyObject *keywds)
{
  static char *keyword_argument_names[] = {"string", "encoding", "return_type", NULL};
  const char *errors = "strict";
  char *encoding = "utf-8";
  char *s;
  Py_ssize_t s_length = 0;
  int i, null_value = 0;
  int is_dictionary = 0;
  char *key_start, *key_end, *value_start, *value_end;
  PyTypeObject *return_type = &PyDict_Type;
  PyObject *return_value, *key, *value;

  if (!PyArg_ParseTupleAndKeywords(args, keywds, "s#|sO", keyword_argument_names, &s, &s_length, &encoding, &return_type)) {
    return NULL;
  }
  /* NOTE(sigmavirus24): Because we use `s#` to parse the input string, we
   * also need to pass s_length to get the length of the string because `s`
   * may have embedded null characters.
   * All of our tests presently pass but it's plausible that we could find
   * data with null characters inside and have to update this to match.
   * We need `s#` as a format argument here so we can receive both unicode and
   * bytes objects in char *s.
   */

  return_value = PyObject_CallObject((PyObject *) return_type, NULL);
  is_dictionary = PyDict_Check(return_value);

  // Each iteration will find one key/value pair
  while ((key_start = strchr(s, '"')) != NULL) {
    // move to the next character after the found "
    key_start++;

    // Find end of key
    key_end = strchr_unescaped(key_start, '"');

    // Find beginning of value or NULL
    for (i=1; 1; i++) {
      switch (*(key_end+i)) {
      case 'N':
      case 'n':
        // found NULL
        null_value = 1;
        break;
      case '"':
        // found beginning of value
        value_start = key_end+i+1;
        break;
      case '\0':
        // found end of string
        return return_value;
      default:
        // neither NULL nor beginning of value, keep looking
        continue;
      }
      break;
    }

    key = unescape(key_start, key_end, encoding, errors);
    if (key == NULL) {
        goto _speedup_loads_cleanup_and_exit;
    }
    if (null_value == 0) {
      // find and null terminate end of value
      value_end = strchr_unescaped(value_start, '"');
      value = unescape(value_start, value_end, encoding, errors);
    } else {
      Py_INCREF(Py_None);
      value = Py_None;
    }

    if (key == NULL || value == NULL) {
_speedup_loads_cleanup_and_exit:
        Py_XDECREF(key);
        Py_XDECREF(value);
        Py_DECREF(return_value);
        return NULL;
    }

    if (is_dictionary) {
      PyDict_SetItem(return_value, key, value);
    } else {
      PyList_Append(return_value, PyTuple_Pack(2, key, value));
    }

    Py_DECREF(key);
    Py_DECREF(value);
    key = NULL;
    value = NULL;

    // set new search position
    if (null_value == 0) {
      s = value_end + 1;
    } else {
      s = key_end + i;
    }

    // reset null value flag
    null_value = 0;
  }
  return return_value;
}

Most of the work that involves parsing and building the return type is done in this loop:

  // Each iteration will find one key/value pair
  while ((key_start = strchr(s, '"')) != NULL) {
    // move to the next character after the found "
    key_start++;

    // Find end of key
    key_end = strchr_unescaped(key_start, '"');

    // Find beginning of value or NULL
    for (i=1; 1; i++) {
      switch (*(key_end+i)) {
      case 'N':
      case 'n':
        // found NULL
        null_value = 1;
        break;
      case '"':
        // found beginning of value
        value_start = key_end+i+1;
        break;
      case '\0':
        // found end of string
        return return_value;
      default:
        // neither NULL nor beginning of value, keep looking
        continue;
      }
      break;
    }

    key = unescape(key_start, key_end, encoding, errors);
    if (key == NULL) {
        goto _speedup_loads_cleanup_and_exit;
    }
    if (null_value == 0) {
      // find and null terminate end of value
      value_end = strchr_unescaped(value_start, '"');
      value = unescape(value_start, value_end, encoding, errors);
    } else {
      Py_INCREF(Py_None);
      value = Py_None;
    }

    if (key == NULL || value == NULL) {
_speedup_loads_cleanup_and_exit:
        Py_XDECREF(key);
        Py_XDECREF(value);
        Py_DECREF(return_value);
        return NULL;
    }

    if (is_dictionary) {
      PyDict_SetItem(return_value, key, value);
    } else {
      PyList_Append(return_value, PyTuple_Pack(2, key, value));
    }

    Py_DECREF(key);
    Py_DECREF(value);
    key = NULL;
    value = NULL;

    // set new search position
    if (null_value == 0) {
      s = value_end + 1;
    } else {
      s = key_end + i;
    }

    // reset null value flag
    null_value = 0;
  }

We scan the string to find the start of a key (key_start = strchr(s, '"')) and then scan until we find its end (key_end = strchr_unescaped(key_start, '"')). We then find the enclosing bounds of the value in a for-loop that also checks for NULL, and convert the key and value to unicode using our unescape function. Finally, we check whether the user wanted the return type to be a dictionary or a list and add the key-value pair to the return value as appropriate.
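The same scanning approach is straightforward to express in Python, which is a useful way to see that the algorithm is identical and the gap really is interpreter overhead. A simplified sketch of my own (ignoring escaped quotes, which the real strchr_unescaped handles):

```python
def scan_pairs(s):
    # Simplified mirror of the C loop: find a key's opening quote, then its
    # closing quote, then either a NULL marker or the value's opening quote.
    pos = 0
    while True:
        key_start = s.find('"', pos)
        if key_start == -1:
            return
        key_end = s.find('"', key_start + 1)  # real code skips escaped quotes
        key = s[key_start + 1:key_end]
        i = key_end + 1
        while i < len(s) and s[i] not in 'Nn"':
            i += 1
        if i >= len(s):
            return  # end of string; mirrors the '\0' case in the C switch
        if s[i] == '"':
            value_end = s.find('"', i + 1)
            yield key, s[i + 1:value_end]
            pos = value_end + 1
        else:
            yield key, None  # found NULL
            pos = i + 1
```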

The creation of that return value is done in C before we start our loop:

return_value = PyObject_CallObject((PyObject *) return_type, NULL);
is_dictionary = PyDict_Check(return_value);

And we have direct access to the Py_INCREF and Py_DECREF macros, meaning we're managing the reference counts ourselves. Unfortunately, it seems there's just no beating this implementation.

Personal Take-aways

I love writing Rust, and I had a bizarre amount of fun making the C extension compatible with both Python 2 and Python 3 via CPython's C API. This whole experiment was surprising but also deeply satisfying.

I'm also pretty confident this result is particular to this situation. Perhaps with different requirements for the Python API, we could have made this implementation faster. As a result, I'm still going to keep looking for places where Rust could improve the performance of the work I'm doing, and experiment to see whether it actually does. If and when I get those opportunities, I will happily report back.

But you didn't ...

I'm sure there are a bunch of things I missed in this post, or things I could have tried that I didn't. I'd love to hear about them, and even more, I'd love it if you blogged about them so others can learn from you too. If you write a post, please do send it my way.