Wrapper’s Delight

Patrick Palka, University of Illinois at Chicago

August 26, 2019

engineering-practice, internship-projects

During my summer at Trail of Bits, I took full advantage of the latest C++ language features to build a new SQLite wrapper from scratch that is easy to use, lightweight, high performance, and concurrency friendly—all in under 750 lines of code. The wrapper is available at https://github.com/trailofbits/sqlite_wrapper under the Apache 2.0 license. Comments and pull requests are welcome.

The motivation for this new SQLite wrapper came from working on an internal code-auditing tool built on top of Clang that uses a database to store and perform queries on semantic information about source code. Originally, RocksDB was chosen as the backing database, but I quickly found myself wrestling with the rigidity of a key-value database as queries became more complex. Wishing that we had a more expressive relational database, I began to explore switching to SQLite.

Initial experiments suggested that switching would not degrade performance if the database was properly tuned (see below), so I began looking at existing C++ SQLite wrappers to see which, if any, could suit our needs. We wanted something that would let us perform raw SQL queries that fit all our criteria for being both lightweight and able to handle concurrency. Unfortunately, none of the existing wrappers satisfied all of these, so I set out to write one from scratch.

After migrating over the backend to SQLite, we were impressed by the scalability and feature-richness of SQLite. It has a command line interface that makes debugging and prototyping easy, handles databases on the order of 100GB without a sweat and even has a built-in full-text search (FTS) extension. The wrapper makes interfacing with SQLite in C++ about as easy as possible, too.

Details

An example of its usage can be found at https://gist.github.com/patrick-palka/ffd836d0294f71d183f4199d0e842186. For simplicity, we chose to model parameter and column bindings as variadic function calls, so that all binds are specified at once.
Some of the modern C++ language features you’ll notice are inline and template variables, constexpr if and auto template parameters, generic lambdas, fold expressions, and thread_local variables.
This wrapper supports user-defined serialization and deserialization hooks. (https://gist.github.com/patrick-palka/d22df0eceb9b73ed405e6dfec10068c7)
This wrapper also supports automatic marshaling of C++ functions into SQL user functions. (https://gist.github.com/patrick-palka/00f002c76ad35ec55957716879c87ebe)
There is no database or connection object to explicitly initialize. Because the wrapper utilizes thread_local objects to manage connections to the database, a connection is made implicitly before the first use of the connection and is disconnected once the thread exits.
The database name and query strings are passed as template arguments instead of function arguments. This creates compile-time separation of the thread_local connection objects, which are per-database, and the thread_local prepared-statement caches, which are per-database, per-query-string. This design decision discourages the use of dynamically-generated query strings, since non-type template arguments must be compile-time constant expressions. In cases where a database name or a query string must be dynamically generated, the wrapper does support passing a lambda, which builds and returns the query string at runtime.
Every single prepared statement made by this wrapper is cached and reused, so the bare minimum number of calls to sqlite3_prepare is made throughout the lifetime of the program.
Downside: This wrapper cannot be used to manually manage connections to the database. It currently handles connections using a thread_local object, so a connection is created before the first query is performed on a given thread and is destroyed during thread exit. If you find yourself needing fine-grained control of when to connect or disconnect from your SQLite database, this wrapper may not work well for you. But this is limitation may be amended in the future.

Fine-tuning SQLite

Here are some SQLite tuning tips to maximize performance. Our wrapper does the first three for you automatically.

Prefer “external” FTS tables when using SQLite’s FTS extension. Build the table after the data is inserted using the ‘rebuild’ command. (https://www.sqlite.org/fts5.html#the_rebuild_command)
Reuse prepared statements. Create them using the sqlite_prepare_v3() routine and pass in the SQLITE_PREPARE_PERSISTENT option. (https://www.sqlite.org/c3ref/prepare.html)
Use the SQLITE_STATIC option when binding text and blob values via the sqlite3_bind_*() routines. Ensure that the underlying memory is valid until the first call to sqlite3_step(). This avoids a redundant copy of the text or blob data. (https://www.sqlite.org/c3ref/bind_blob.html)
Perform your insertions and updates in bulk transactions when possible. The speedup relative to using unit-size transactions grows nearly linearly with the size of the transaction, so inserting 10,000 rows per transaction is thousands of times faster than inserting 1 row per transaction.
Create your indexes after all most or all of your data has been inserted, and choose your indices wisely. Creating indices once is going to be faster overall than continuously building and rebuilding them as more data gets inserted. Use the SQLite command-line interface to double-check each query’s plan, and install a log callback to have SQLite inform you whenever it decides to create a temporary index.
Don’t use the same database connection or prepared statement object concurrently. SQLite serializes access to these objects. (https://sqlite.org/threadsafe.html) Also, the isolation component of ACID is guaranteed only between separate connections to the same database. (https://www.sqlite.org/isolation.html)
Consider pragma temp_store = memory when storing temporary tables and data structures in memory. (https://www.sqlite.org/pragma.html#pragma_temp_store)

SQLite C API Tips

Finally, here are some miscellaneous tips to simplify working with the SQLite C API, where the first two are done for you by our wrapper.

Install a concurrency-friendly sqlite_busy_handler to avoid having to check for `SQLITE_BUSY` after every API call. (https://www.sqlite.org/c3ref/busy_handler.html)
Set up a log callback to have errors and other notices, like hints on where to add indices, printed automatically. (https://www.sqlite.org/errlog.html)
Bundle a copy of SQLite into your project. This is the recommended way to use SQLite in your application. (https://www.sqlite.org/custombuild.html) Doing so also lets you enable SQLite’s full-text-search extension and its other useful disabled-by-default extensions.
Use C++11 raw string literals to format query strings.

Final Thoughts

This summer, some of my takeaways were that when locally storing a moderate amount of structured data without large concurrency demands, sooner or later you will want to perform complex queries on this data. Unless you have a clear vision for your data-access patterns from the outset, using a key-value database will quickly back you into a corner whenever your data-access pattern changes. On the other hand, relational databases make it easy to adapt your database to continuously changing access patterns. And finally, modern C++ can help make interfacing with SQLite and other C APIs concise and easy, and when configured properly, SQLite is quite scalable.