lloyd.io - What's New In YAJL Two?

What's New In YAJL Two?

2011-04-26 00:00:00 -0700

YAJL is a little SAX-style JSON parser written in C (conforming to C99). The first iteration was put together in a couple of evening/weekend hacking sessions, and YAJL sat in version zero for about two years (2007-2009), quietly delighting a small number of folks with extreme JSON parsing needs. On April 1st 2009 YAJL was tagged 1.0.0 (apparently that was a joke, because the same day it hit version 1.0.2).

Given that 2 years seems to be YAJL’s natural period for major version bumps, I’m happy to announce YAJL 2, which is available now. This post will cover the changes and features in the new version.

First, Thanks!

Over the course of the last two years YAJL has had contributions from at least 24 different individuals and provided the heavy lifting for at least 8 different high level language bindings. I’ve also received questions and comments from all over the planet, from folks working in companies with market caps in the hundreds of billions to university students in places like Cairo, Bangkok, and Bangladesh. YAJL has been particularly useful to the Rails and iPhone communities, given its performance and low level streaming API. And anyone remember the twitpocalypse and YAJL Error 3? Dude, that’s not my bag!

At the risk of abusing my time on stage, I just wanted to say that there’s something very nice about a world where one can go and build something as modest as YAJL and have it touch billions of people. So again, I’m pleased YAJL pleases you.

What’s new?

This post is intended both to tour new features and to serve as a quick and dirty porter’s guide for users of YAJL 1.x. If you want to skip all the language and examples, you’re more than welcome to head over to the ChangeLog and go from there.

License Changes

YAJL was three-clause BSD, now it’s ISC. The functional difference between these two is that YAJL no longer includes a ‘non-endorsement’ clause, and really, I don’t care so much if you use YAJL in your product and decide to apply a “Lloyd Hilaiel inside” logo. But, YMMV with that.

So the new license is a bit more permissive but preserves everything I care about (you can’t say you wrote it), and is quite a bit smaller. That’s all.

Faster

YAJL 2 is somewhere between 20% and 35% faster in its raw parsing performance. This is specifically due to lexer optimizations in string scanning. My above-average colleague Michael Hanson suggested these changes, and you’re encouraged to review the commit where the change landed.
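For a feel of what a string scanning optimization can look like, here is a minimal sketch of one common technique: classify every byte value once in a 256 entry table, then skip over runs of ordinary characters in a tight loop. This is not the code from the commit, and the real change may work differently; the names needs_attention, init_scan_table and skip_plain_bytes are made up for this illustration.

```c
#include <stddef.h>

/* Hypothetical illustration of table-driven string scanning.
 * This is NOT YAJL's actual lexer code. */

/* needs_attention[b] is nonzero when byte b cannot be skipped blindly
 * while scanning the inside of a JSON string. */
static unsigned char needs_attention[256];

/* Fill the table once, before any scanning. */
static void init_scan_table(void)
{
    int b;
    for (b = 0; b < 256; b++) {
        needs_attention[b] = (unsigned char) (
            b < 0x20        /* raw control characters are not allowed */
            || b == '"'     /* a quote ends the string */
            || b == '\\'    /* a backslash starts an escape sequence */
            || b >= 0x80);  /* multi-byte UTF-8 may need validation */
    }
}

/* Return how many leading bytes of buf[0..len) are plain ASCII string
 * content that needs no further processing. */
static size_t skip_plain_bytes(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len && !needs_attention[buf[i]]) i++;
    return i;
}
```

A lexer built this way would call skip_plain_bytes() first and only drop into its slower per-character logic at the byte where the fast loop stopped.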

New “Tree” API

Given how prevalent YAJL is, it’s quite natural that lots of people get pointed to it when they ask about a good JSON parser for the C language. Unfortunately, many folks with simple parsing needs (they don’t need streaming, they have no data representation of their own) who just want to parse a little JSON file and extract some stuff have been repeatedly disappointed with YAJL. These folks want to pass in a buffer, get back a memory representation, plunk out a value or two, and go on their way.

Florian Forster decided to fill this need by writing a tiny little high level tree parser implementation on top of YAJL. The implementation is less than 10k of compiled object code, and if you statically link it’ll be stripped right out when not used. At that small cost, why not include an alternate high level API if so many people seem to want it?

To get an idea of how it works see the configuration parser example, and you can give the header file a read for more details.

Final note, this code is young, so expect several fixes and improvements in the coming months as people start sending in patches and bugs.

Changed YAJL Configuration

The 1.x YAJL implementation used C structures for configuration. So client code would allocate, and populate a structure, then pass it into the yajl_alloc() routine to change the behavior of the parser.

This design sucked for several reasons:

Addition of new options couldn’t be done in a binary compatible way
Client code always had to think about options, even if it just wanted default behaviors
It wasn’t clear from reading client code what was going on

Brian Lopez suggested an alternative API, which YAJL 2 implements. Now parser setup is, in my opinion, a lot clearer:

hand = yajl_alloc(&callbacks, NULL, (void *) g);
yajl_config(hand, yajl_allow_comments, 1);
yajl_config(hand, yajl_dont_validate_strings, 1);

The generator works similarly and now has a yajl_gen_config() function.

Little API Changes

There were also several smaller API breaking changes, which will be especially interesting as you port:

YAJL no longer will build with a compiler that doesn’t support C99.
size_t is now used instead of unsigned int wherever buffer lengths or offsets are represented.
Integers are now always represented with long long, which is at least 64 bits wide on all modern compilers.
yajl_parse_complete() is now yajl_complete_parse(), see the commit to understand why.

Big API Changes

One of the longest standing issues in YAJL was its default tendency to consume as little of input buffers as required to complete a parse. That is, if you did this:

const char * buf = "2009-10-20@20:38:21.539575";
yajl_status s = yajl_parse(h, (const unsigned char *) buf, strlen(buf));

YAJL would parse out that 2009 as an integer, and consider the parse complete! He got his value, the rest of your buffer is your business. Given that YAJL is a stream parser, the design goal was specifically that YAJL would process as few bytes as possible. If you wanted to ensure that an entire buffer consisted of a single valid JSON value, you’d need to call yajl_get_bytes_consumed() after the parse completed and check that your entire input buffer was consumed.

This turned out to be a bad decision. Greg Olszewski fixed this with a patch that provided additional configuration flags to help people change the behavior of YAJL into what they expected.

Beyond this patch, I’ve folded the configuration into the new API mentioned above, and changed the default configuration to be what I expect you expect. Specifically, an entire buffer will be consumed (so trailing junk will be considered a parse error), and if you call yajl_complete_parse() and a complete top level value hasn’t been parsed, it’s a hard error.

For more information on the changed semantics, refer to the yajl_option documentation and take a look at the updated JSON reformatter example which demonstrates how error handling should fit into your parsing flow.

A Built-in Benchmark

Lots of people pick YAJL because it’s fast. But one thing that we’ve lacked in the past is a stable way to assess just how fast, and more specifically to gauge the performance impact of code changes.

To solve these problems I wrote a small in-tree performance test that spends some number of seconds parsing through three sample JSON documents (from popular APIs around the net), and reports the parsing speed as a throughput figure.

If you want to give the tool a whirl, it’s simple:

$ make
...
$ build/perf/perftest
-- speed tests determine parsing throughput given 3 different sample documents --
With UTF8 validation:
Parsing speed: 267.096 MB/s
Without UTF8 validation:
Parsing speed: 267.443 MB/s

Now it’s not a particularly sophisticated benchmark, but it is a good starting point for giving a quick high level view of the performance implications of changes.
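If you are curious how a throughput number like that can be computed, here is a minimal sketch of the idea: parse the same document over and over for a fixed amount of time, count the bytes, and divide. This is not the perftest source. The names parse_fn, stand_in_parse, throughput_mb_per_sec and measure_throughput are hypothetical, the stand-in parser just touches every byte, and treating 1 MB as 1024 * 1024 bytes is my assumption.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: parse one document, nonzero on success. */
typedef int (*parse_fn)(const char *doc, size_t len);

/* Stand-in for a real parser so the sketch is self-contained:
 * it only reads every byte once and always succeeds. */
static int stand_in_parse(const char *doc, size_t len)
{
    volatile unsigned char sink = 0;
    size_t i;
    for (i = 0; i < len; i++) sink ^= (unsigned char) doc[i];
    (void) sink;
    return 1;
}

/* Convert a byte count and elapsed seconds into MB/s
 * (assuming 1 MB = 1024 * 1024 bytes). */
static double throughput_mb_per_sec(size_t bytes, double seconds)
{
    if (seconds <= 0.0) return 0.0;
    return ((double) bytes / (1024.0 * 1024.0)) / seconds;
}

/* Parse doc repeatedly for at least min_seconds of CPU time.
 * Returns throughput in MB/s, or -1.0 if a parse fails. */
static double measure_throughput(parse_fn parse, const char *doc,
                                 size_t len, double min_seconds)
{
    clock_t start = clock();
    size_t bytes = 0;
    double elapsed;

    do {
        if (!parse(doc, len)) return -1.0;
        bytes += len;
        elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < min_seconds);

    return throughput_mb_per_sec(bytes, elapsed);
}
```

Swapping stand_in_parse for a function that runs a real parse over the buffer would give a number comparable from one build to the next on the same machine, which is all a benchmark like this needs to do.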

More Details

A couple of less interesting changes are in there too, so feel free to peruse the ChangeLog and commit history if you need more!

Happy YAJLing, lloyd