Optimizing Lifted Bitcode with Dead Store Elimination

Tim Alberdingk Thijm

As part of my Springternship at Trail of Bits, I created a series of data-flow-based optimizations that eliminate most “dead” stores that emulate writes to machine code registers in McSema-lifted programs. For example, applying my dead-store-elimination (DSE) passes to Apache httpd eliminated 117,059 stores, or 50% of the store operations to Remill’s register State structure. If you’re a regular McSema user, then pull the latest code to reap the benefits. DSE is now enabled by default.

Now, you might be thinking, “Back it up, Tim, isn’t DSE a fundamental optimization that’s already part of LLVM?” You would be right to ask this (and the answer is yes), because if you’ve used LLVM then you know that it has an excellent optimizer. However, despite LLVM’s excellence, the truth is that, like any optimizer, LLVM can only cut instructions it knows to be unnecessary. The Remill dead code eliminator has the advantage of possessing higher-level information about the nature of lifted bitcode, which lets it be more aggressive than LLVM in its optimizations.

But every question answered just raises more questions! You might now be thinking, “LLVM only does safe optimizations. This DSE is more aggressive… How do we know it didn’t break the lifted httpd program?” Fear not! The dead store elimination tool is specifically designed to perform a whole-program analysis on lifted bitcode that has already been optimized. This ensures that it can find dead instructions with the maximum possible context, avoiding mistakes where the program assumes some code won’t be used. The output is a fully-functioning httpd executable, minus a mountain of useless computation.

What Happens When We Lift

The backbone of Remill/McSema’s lifted bitcode is the State structure, which models the machine’s register state. Remill emulates reads and writes to registers by using LLVM load and store instructions that operate on pointers into the State structure. Here’s what Remill’s State structure might look like for a toy x86-like architecture with two registers: eax and ebx.

struct State {
  uint32_t eax;
  uint32_t ebx;
};

This would be represented in LLVM as follows:

%struct.State = type { i32, i32 }

Let’s say we’re looking at a few lines of machine code in this architecture:

mov eax, ebx
add eax, 10

A heavily-simplified version of the LLVM IR for this code might look like this:

%eax_ptr = getelementptr %struct.State, %struct.State* %state, i32 0, i32 0
%ebx_ptr = getelementptr %struct.State, %struct.State* %state, i32 0, i32 1
%ebx_0 = load i32, i32* %ebx_ptr
store i32 %ebx_0, i32* %eax_ptr
%eax_0 = load i32, i32* %eax_ptr
%eax_1 = add i32 %eax_0, 10
store i32 %eax_1, i32* %eax_ptr

The first two lines derive pointers to the memory backing the emulated eax and ebx registers (%eax_ptr and %ebx_ptr, respectively) from a pointer to the state (%state). This derivation is performed using the getelementptr instruction, and is equivalent to the C code &(state->eax) and &(state->ebx). The next two lines represent the mov instruction, where the emulated ebx register is read (load), and the value read is then written to (store) the emulated eax register. Finally, the last three lines represent the add instruction.

We can see that %ebx_0 is stored to %eax_ptr and then %eax_0 is loaded from %eax_ptr without any intervening store to that pointer. This means that the load into %eax_0 is redundant. We can simply use %ebx_0 anywhere that %eax_0 is used, i.e. forward the store to the load.

Next, we might also notice that the store %ebx_0, %eax_ptr instruction isn’t particularly useful either, since store %eax_1, %eax_ptr happens before %eax_ptr is read from again. In fact, this is a dead store. Eliminating these kinds of dead stores is what my optimization focuses on!

This process will go on in real bitcode until nothing more can be forwarded or killed.

So now that you have an understanding of how dead store elimination works, let’s explore how we could teach this technique to a computer.

As it turns out, each of the above steps is related to a data-flow analysis. To build our eliminator, we’re going to want to figure out how to represent these decisions using data-flow techniques.

Building the Eliminator

With introductions out of the way, let’s get into how this dead code elimination is supposed to work.

Playing the Slots

The DSE pass needs to recognize loads/stores through %eax_ptr and %ebx_ptr as being different. The DSE pass does this by chopping up the State structure into “slots”, which roughly represent registers, with some small distinctions for cases where we bundle sequence types like arrays and vectors as one logical object. The slots for our simplified State structure are:

  • Slot 0: eax, occupying bytes 0–3
  • Slot 1: ebx, occupying bytes 4–7

After chopping up the State structure, the DSE pass tries to label instructions with the slot to which that instruction might refer. But how do we even do this labelling? I mentioned earlier that we have deeper knowledge about the nature of lifted bitcode, and here’s where we get to use it. In lifted bitcode, the State structure is passed into every lifted function as an argument. Every load or store to an emulated register is therefore derived from this State pointer (e.g. via getelementptr, bitcast, etc.). Each such derivation results in a new pointer that is possibly offset from its base. Therefore, to determine the slot referenced by any given pointer, we need to calculate that pointer’s offset, and map the offset back to the slot. If it’s a derived pointer, then we need to calculate the base pointer’s offset. And if the base pointer is derived then… really, it’s just offsets all the way down.
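That offset computation can be sketched as a small recursive walk. This is a Python sketch under assumed, simplified data structures (the pointer names and derivation records are illustrative, not Remill’s actual code):

```python
# Each derived pointer records its base pointer and byte displacement;
# the base State pointer is the root of every chain. (Illustrative.)
DERIVATIONS = {
    "%state":   (None, 0),      # the base State pointer
    "%eax_ptr": ("%state", 0),  # &state->eax
    "%ebx_ptr": ("%state", 4),  # &state->ebx
}

def offset_of(ptr):
    """Walk base pointers until we bottom out at the State pointer."""
    base, delta = DERIVATIONS[ptr]
    if base is None:
        return delta
    return delta + offset_of(base)  # offsets all the way down

# Map a resolved byte offset back to a slot name.
SLOTS = {0: "eax", 4: "ebx"}
```

With this in hand, labelling a load or store is just `SLOTS[offset_of(ptr)]` on the pointer it dereferences.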

And They Were Slot-mates!

The case that interests us most is when two instructions get friendly and alias to the same slot. That’s all it takes for one instruction to kill another: in Remill, it’s the law of the jungle.

To identify instructions which alias, we use a ForwardAliasVisitor (FAV). The FAV keeps two maps: one from pointers to their offsets into the State structure, and one recording the instructions that access the State structure. As the name implies, it iterates forward through the instructions it’s given, updating its maps whenever it notices that one of the addresses it’s tracking has been modified or used.

Here’s how this information is built up from our instructions:

Each time the FAV visits an instruction, it checks if updates need to be made to its maps.

The accesses map stores the instructions which access state offsets. We’ll use this map later to determine which load and store instructions could potentially alias. You can already see here that the offsets of three instructions are all the same: a clear sign that we can eliminate instructions later!

The offsets map ensures the accesses map can get the right information. Starting with the base %state pointer, the offsets map accumulates any pointers that may be referenced as the program runs. You can think of it as the address book which the loads and stores use to make calls to different parts of the state structure.

The third data structure shown here is the exclude set. This keeps track of all the other values instructions might refer to that we know shouldn’t contact the state structure. These would be the values read by load instructions, or pointers to alloca’d memory. In this example, you can also see that if a value is already in the offsets map or exclude set, any value produced from one such value will remain in the same set (e.g. %eax_1 is excluded since %eax_0 already was). You can think of the exclude set as the Do-Not-Call list to the offset map’s address book.
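Putting the three structures together, here is a toy Python rendition of the FAV’s forward walk over the IR from earlier. The instruction triples and map shapes are illustrative simplifications, not Remill’s actual data structures:

```python
# A toy forward walk in the spirit of the ForwardAliasVisitor.
instructions = [
    ("%eax_ptr", "getelementptr", ("%state", 0)),   # &state->eax
    ("%ebx_ptr", "getelementptr", ("%state", 4)),   # &state->ebx
    ("%ebx_0",   "load",          ("%ebx_ptr",)),
    (None,       "store",         ("%ebx_0", "%eax_ptr")),
    ("%eax_0",   "load",          ("%eax_ptr",)),
    ("%eax_1",   "add",           ("%eax_0", 10)),
    (None,       "store",         ("%eax_1", "%eax_ptr")),
]

offsets = {"%state": 0}   # pointer -> byte offset into State
accesses = {}             # instruction index -> offset it touches
exclude = set()           # values known not to point into State

for i, (result, op, operands) in enumerate(instructions):
    if op == "getelementptr" and operands[0] in offsets:
        # A derived pointer: base offset plus displacement.
        offsets[result] = offsets[operands[0]] + operands[1]
    elif op == "load" and operands[0] in offsets:
        accesses[i] = offsets[operands[0]]
        exclude.add(result)   # a loaded value is data, not a State pointer
    elif op == "store" and operands[1] in offsets:
        accesses[i] = offsets[operands[1]]
    elif any(v in exclude for v in operands):
        exclude.add(result)   # values derived from excluded values stay excluded
```

After the walk, instructions 3, 4, and 6 (the store/load/store through %eax_ptr) all map to offset 0, and %eax_1 lands in the exclude set because %eax_0 already had.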

The FAV picks through the code and ensures that it’s able to visit every instruction of every function. Once it’s done, we can associate the relevant state slot to each load and store as LLVM metadata, and move on to the violent crescendo of the dead code eliminator: eliminating the dead instructions!

You’ll Be Stone Dead In a Moment

Now it’s time for us to pick through the aliasing instructions and see if any of them can be eliminated. We have a few techniques available to us, following a similar pattern as before. We’ll look through the instructions and determine their viability for elimination as a data-flow problem.

Sequentially, we run the ForwardingBlockVisitor to forward unnecessary loads and stores and then use the LiveSetBlockVisitor to choose which ones to eliminate. For the purpose of this post, however, we’ll cover these steps in reverse order to get a better sense of why they’re useful.

Live and Set Live

The LiveSetBlockVisitor (LSBV) has the illustrious job of inspecting each basic block of a module’s functions to determine the overall liveness of slots in the State. Briefly, live variable analysis allows the DSE to check if a store will be overwritten (“killed”) before a load accesses (“revives”) the slot. The LiveSet of LSBV is a bitset representing the liveness of each slot in the State structure: if a slot is live, the bit in the LiveSet corresponding to the slot’s index is set to 1.

The LSBV proceeds from the terminating blocks (blocks ending with ret instructions) of the function back to the entry block, keeping track of a live set for each block. This allows it to determine the live set of preceding blocks based on the liveness of their successors.

Here’s an example of how an LSBV pass proceeds. Starting from the terminating blocks, we iterate through the block’s instructions backwards and update its live set as we do. Once we’re finished, we add the block’s predecessors to our worklist and continue with them. After analyzing the entry block, we finish the pass. Any stores visited while a slot was already dead can be declared dead stores, which we can then remove.
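The backward pass above can be condensed into a minimal Python sketch. The data structures are illustrative, not Remill’s implementation: each block is a list of ("load", slot) and ("store", slot, id) events in program order.

```python
def live_in(events, live_out):
    """Liveness at the top of a block, given liveness at its bottom."""
    live = set(live_out)
    for ev in reversed(events):
        if ev[0] == "load":
            live.add(ev[1])      # a load revives the slot
        else:
            live.discard(ev[1])  # a store kills the slot
    return live

def find_dead_stores(blocks, succs):
    # Fixpoint: a block's live-out is the union of its successors' live-ins.
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            out = set()
            for s in succs.get(b, []):
                out |= live_in(blocks[s], live_out[s])
            if out != live_out[b]:
                live_out[b], changed = out, True
    # Final backward walk: a store to a slot that is not live is dead.
    dead = set()
    for b, events in blocks.items():
        live = set(live_out[b])
        for ev in reversed(events):
            if ev[0] == "load":
                live.add(ev[1])
            else:
                _, slot, instr = ev
                if slot not in live:
                    dead.add(instr)
                live.discard(slot)
    return dead
```

For a single block that stores to eax, loads it, then stores to eax and ebx again with nothing live afterwards, the sketch marks the two trailing stores dead while sparing the first, which the load revives.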

In order to avoid any undefined behaviour, the LSBV has a few generalizations in place. For some instructions, like resume or indirectbr, that could cause uncertain changes to the block’s live set, it conservatively marks all slots as live. This provides a simple way of avoiding dangerous eliminations and leaves an opportunity for future improvements.

Not To Be Forward, But…

Our work could end here with the LSBV, but there are still potential improvements we can make to the DSE. As mentioned earlier, we can “forward” some instructions by replacing unnecessary sequences of storing a value, loading that value, and using that value with direct use of the value prior to the store. This is handled by the ForwardingBlockVisitor, another backward block visitor. Using the aliases gathered by the FAV, it iterates through the instructions of the block from back to front, keeping track of the upcoming load from each slot of the State. If an earlier store accesses the same slot, we can forward its value to that load to cut down on the number of operations, as shown in the earlier elimination example.
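The backward forwarding walk can be sketched like so (again illustrative: Remill tracks real LLVM instructions rather than tuples):

```python
def forward_block(block):
    """block: list of ("load", slot, dst) / ("store", slot, src) tuples
    in program order. Returns {load_dst: stored_value} for every load
    that can be satisfied directly by the preceding store to its slot."""
    next_load = {}   # slot -> destination of the upcoming load
    forwarded = {}
    for op, slot, val in reversed(block):
        if op == "load":
            next_load[slot] = val
        else:
            # Walking backward, the first store we meet is the latest
            # one before the load, so it is the value the load observes.
            if slot in next_load:
                forwarded[next_load.pop(slot)] = val
    return forwarded
```

On the example block from earlier, this maps %eax_0 to %ebx_0: exactly the store-to-load forwarding described above.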

Doing this step before the LSBV pass allows the LSBV to identify more dead instructions than before. Looking again at our example, we’ve now set up another store to be killed by the LSBV pass. This type of procedure allows us to remove more instructions than before by better exploiting our knowledge of when slots will be used next. Cascading eliminations this way is part of what allows DSE to remove so many instructions: if a store is removed, there may be more instructions rendered useless that can also be eliminated.

A DSE Diet Testimonial

Thanks to the slimming power of dead store elimination, we can make some impressive cuts to the number of instructions in our lifted code.

For an amd64 Apache httpd, we were able to generate the following report:

Candidate stores: 210,855
Dead stores: 117,059
Instructions removed from DSE: 273,322
Forwarded loads: 840
Forwarded stores: 2,222
Perfectly forwarded: 2,836
Forwarded by truncation: 215
Forwarded by casting: 11
Forwarded by reordering: 61
Could not forward: 1,558
Unanalyzed functions: 0

An additional feature of the DSE is the ability to generate DOT diagrams of the instructions removed. Currently, the DSE will produce three diagrams for each function visited, showing the offsets identified, the stores marked for removal, and the post-removal instructions.

DOT diagrams are produced that show eliminated instructions

Still Hungry for Optimizations?

While this may be the end of Tim’s work on the DSE for the time being, future improvements are already in the pipeline to make Remill/McSema’s lifted bitcode even leaner. Work will continue to handle cases that the DSE is currently not brave enough to take on, like sinking store instructions when a slot is only live down one branch, handling calls to other functions more precisely, and lifting live regions to allocas to benefit from LLVM’s mem2reg pass.

Think what Tim did was cool? Check out the “intern project” GitHub issue tags on McSema and Remill to get involved, talk to us in the #binary-lifting channel of the Empire Hacking Slack, or reach out to us via our careers page.

Tim is starting a PhD in programming language theory this September at Princeton University, where he will try his hand at following instructions, instead of eliminating them.

Trail of Bits donates $100,000 to support young researchers through SummerCon

We have a soft spot in our hearts for SummerCon. This event, the longest-running hacker conference in the US, is a great chance to host hacker friends from around the world in NYC, catch up in person, and learn about delightfully weird security topics. It draws a great crowd, ranging from “hackers to feds to convicted felons to concerned parents.”

The folks running SummerCon have pulled together an excellent line-up of high-quality talks time and again. However, this year there’s a big-time issue: all the speakers are men.

We recognize the thanklessness of the job of hosting SummerCon and assume the best of intentions. Nonetheless, we were disappointed. This lineup isn’t an exception among security conferences – it’s close to the norm. The exclusion of women and minorities from the security industry is an endemic problem that we need to address. The hacker conference that started them all should be at the forefront of the solution.

This year we’ll be working together to change that.

A grant for inclusion in security research

We are partnering with the SummerCon Foundation to create the Trail of Bits SummerCon Fellowship. This grant will provide $100,000 in funding for budding security researchers. At least 50% of the program spots will be reserved for minority and female-identifying candidates. The organization will reach out directly to women- and minority-serving groups at universities to encourage them to apply (shout out to @MaddieStone for that awesome idea!). Participants will receive grant funding, mentorship from Trail of Bits and the SummerCon Foundation, and an invitation to present their findings at SummerCon after their fellowship.

In addition to this program, SummerCon has committed to a greater level of transparency and representation in its future selection of speakers. They’ll publish well-defined criteria for their CFP. They will identify the SummerCon alumni who comprise their speaker-selection committee. Finally, they will expand the selection team to include 50% minorities and women.

Next, SummerCon has committed to making the conference a safe space of inclusion. They’ve announced and will enforce a clear anti-harassment policy with multiple points of contact for reporting disrespectful behavior. Violators will be kicked out.

Finally, in a small effort to bring more awareness to the change, we have a sweet bonus in store: Keep your eyes peeled for the Trail-of-Bits-sponsored ice cream flavor in a Van Leeuwen ice cream truck outside LittleField. For every scoop sold, we’ll be matching the sales with a donation to Girls Who Code.


Serving up tasty treats for a cause!

Does it fix the problem?

No. This is a small step. The issue of inclusion within security is much bigger than one small annual hacker meetup. Fortunately, everyone in the industry can help, including us. Even today, our growing team of 37 people has only four women, only two of whom are engineers. We must do better.

We’ve already taken some steps to improve.

Here’s what we’ll do this year:

  • Actively work with diversity- and inclusion-recruiting groups to get out of the cycle of predisposing our recruiting toward homogeneity
  • Continue to search for opportunities to volunteer and mentor with groups that support inclusion in tech and infosec
  • Reimburse employees for any tax expenses incurred for insurance of domestic partners

Get involved!

Want to participate as a SummerCon research fellow? Keep an eye on @trailofbits. We’ll be making a joint announcement with SummerCon soon.

Have other ideas about how to foster a more inclusive security environment? Contact us!

Announcing the Trail of Bits osquery support group

As great as it is, osquery could be a whole lot better. (Think write access for extensions, triggered responses upon detection, and even better performance, reliability and ease of use.)

Facebook’s small osquery team can’t respond to every request for enhancement. That’s understandable. They have their hands full with managing the osquery community, reviewing PRs, and ensuring the security of the world’s largest social network. It’s up to the community to move the osquery platform forward.

Good news: none of these feature requests are infeasible. The custom engineering is just uneconomical for individual organizations to bankroll.

We propose a strategy for osquery users to share the cost of development. Participating companies could pool resources and collectively target specific features. This would accelerate the deprecation of other full-suite tools that are more expensive, less flexible, and less transparent.

It’s the only way to make real progress quickly. Otherwise, projects rely solely on the charity and coordination of their contributors.

Can an open-source tool replace commercial solutions?

We think that open-source security solutions are inherently better. They’re transparent. They’re more flexible. Their costs are tied closely to the value you get, not just access. Finally, each time there’s an investment in the tool, it increases the advantages for current users and grows the number of users who can access those advantages.

However, in order to compete with their commercial counterparts, open source projects need implementation support and development support. The former is basically the ability to “set it and forget it.” The latter ensures the absence of show-stopping bugs and the regular addition of new required features.

Companies like Kolide and Uptycs provide user-friendly support for deployment.

For development support, you can now hire us.

Announcing the Trail of Bits osquery support group

We’re offering two ‘flavors’ of support plans: one for year-round assurance, the other for custom development.

12-month assurance plan

Think of this like an all-you-can-eat buffet for critical features and fixes. Any time you need a bug fixed or a feature added, just file a ticket with us. This option’s great for root-causing and fixing issues, developing new tables and extensions, or redesigning parts of osquery’s core. Basically, the stuff that is holding you back from cancelling those expensive monthly contracts with the proprietary vendors.

Bespoke development

This plan’s for you if you need one-off help with a big-time osquery change. Perhaps: ports to new platforms, non-core features, or forks.

Regardless of the plan you choose, you’ll get:

  • Access to a private Trail of Bits Slack channel for direct access to our engineers
  • The opportunity to participate in a bi-weekly iteration planning meeting for collaborative feature ideation, problem-solving, and feature prioritization
  • A private GitHub repository with issue tracker for visibility and influence over what features are worked on
  • Special access and support for our osquery extensions
  • Early access to all software increments

Whether you’re a long-time osquery user with a list of feature requests, or part of a team that has been holding out for osquery’s feature-parity with commercial tools, this may be the opportunity you’ve been waiting for. As a member, you’ll gain multiple benefits: confidence that there aren’t any show-stopping bugs; direct access to our team of world-class engineers, many of whom have been doing this exact work since we ported osquery to Windows; peace of mind that your internal engineers won’t spend any more time on issues with osquery; and the chance to drive osquery’s product direction while leaving the heavy lifting to us.

Want in? Let us know.

QueryCon 2018: our talks and takeaways

Sometimes a conference just gets it right. Good talks, single track, select engaged attendees, and no sales talks. It’s a recipe for success that Kolide got right on its very first try with QueryCon, the first-ever osquery conference.

It’s no secret that we are huge fans of osquery, Facebook’s award-winning open source endpoint detection tool. From when we ported osquery to Windows in 2016 to our launch of our osquery extension repo this year, we’ve been one of the leading contributors to the tool’s development. This is why we were delighted Kolide invited us to participate in QueryCon!

The two-day conference, hosted at the beautiful Palace of the Fine Arts in San Francisco, drew over 120 attendees and 16 speakers. The attendance list was a Who’s Who in Big Tech security: teams from Facebook, Airbnb, Yelp, Atlassian, Adobe, Netflix, Salesforce, and more. It was great to meet face-to-face. We’ve been collaborating with some of these teams on osquery for years. It was also exciting to see the widespread adoption of the technology manifested in person. Though some of the teams attending were there to learn about the tech before deploying, the majority seemed to be committed adopters.

The talks ranged from the big-picture (operational security preparedness by Rob Fry of JASK) to the highly technical (breakdowns of macOS internals by Michael Lynn of Facebook), with consistent levity, epitomized by the brilliantly sulky Ben Hughes of Stripe. Scott Lundgren of Carbon Black gave a report-card-style review of the community from an outsider’s perspective. Longtime osquery evangelist Chris Long of Palantir provided a candid user experience of working with osquery’s audit framework in his organization. It was a well-curated mix of subjects, speakers, and perspectives. They all taught us something new.

What we learned at QueryCon

1. The community is bigger and stronger than we thought

As of this week, osquery’s Slack has 1,703 users. Until the sold-out showing at QueryCon, I never thought to check how many of those users were active: 431 in the last 30 days. 120 of those people made it to QueryCon. Dozens more joined the waitlist.

2. Some users are innovating in very cool ways

We came to QueryCon intent on pushing the community to use osquery in new, innovative ways. Turns out, it didn’t need much pushing. Take the security team at Netflix. They’re using osquery in multiple internal open source projects: Diffy, a digital forensics and incident response (DFIR) tool, and Stethoscope, their security detection and recommendation application. We heard many more examples from many more teams.

3. The community really likes our contributions

Many of the talks mentioned our team and our work. We knew we were contributing significant engineering effort, but we hadn’t truly realized how much others had been benefiting. It felt great to hear that work done for our clients truly advances the whole community.

4. The goals are clear, but the way there is not

We gleaned some clear takeaways that are likely common for a first meetup of a new open source project:

  • We need to define and broadcast osquery’s guiding principles;
  • We need to solidify some best practices for effective collaboration;
  • We need to tackle technical debt.

However, we didn’t determine how these will get done. Facebook was clear in defining its role in this process. Their small dedicated osquery team will continue to put in the hard work of testing, managing versions, and holding the community to high standards for both written code and community inclusion. However, it’s up to the community to take care of the rest.

What we shared at QueryCon

Osquery Super Features

Speaker: Lauren Pearl

Abstract: In this talk, we reviewed a user feature wishlist gathered from interviews with five Silicon Valley tech teams who use osquery. From these, we identified Super Features – features that would fundamentally improve the value proposition of osquery. We explained how these developments could transform osquery’s power in technical organizations. Finally, we walked through the high-level development plans for making these Super Features a reality.

Link to Video: QueryCon 2018 | Lauren Pearl (Trail of Bits) – Three Super Features That Could Transform Osquery

Slides: Super Features PDF

The Osquery Extensions Skunkworks Project: Unconventional Uses for Osquery

Speaker: Mike Myers

Abstract: Facebook created osquery with certain guiding principles: don’t pry into users’ data, don’t change the state of the system, don’t create network traffic to third parties. It was originally intended as a read-only information gatherer. For those that didn’t want to play by these rules, there’s the extension interface. We’ve begun experimenting with extensions that don’t align with mainline osquery: integrating with third-party services, writable tables, host-based firewall administration, malware vaccination, and more. We shared some of our lessons-learned on the challenges of using osquery as a control interface.

Link to Video: QueryCon 2018 | Mike Myers (Trail of Bits) – Extensions Skunkworks: Unconventional Uses for Osquery

Slides: Skunkworks Extensions PDF

Thank you so much!

This was a great first conference for an emerging technology. It awakened community leaders to issues and opportunities and started the conversation of how to push forward. Attendees renewed enthusiasm and commitment to advance and maintain the project.

It’s hard to believe that this was Kolide’s first time hosting such an event. Director of Operations Antigoni Sinanis, the lady in charge of the event’s success, has set a high bar for her company to clear next year. We at Trail of Bits are already looking forward to round two!

Manage your fleet’s firewalls with osquery

We’re releasing an extension for osquery that lets you manage the local firewalls of your fleet.

Each of the three major operating systems provides a native firewall, capable of blocking incoming and outgoing access when configured. However, the interfaces of these three firewall systems are dissimilar, and each requires a different method of configuration. Furthermore, there are few options for cross-platform fleet configuration, and nearly all are commercial and proprietary.

In partnership with Airbnb, we have created a cross-platform firewall management extension for osquery. The extension enables programmatic control over the native firewalls and provides a common interface for each host operating system, permitting more advanced control over an enterprise fleet’s endpoint protections as well as closing the loop between endpoint monitoring and endpoint management.

Along with our Santa management extension, this extension shows the utility of writable tables in osquery extensions. Programmatic control over endpoint firewalls means that an administrator can react more quickly to prevent the spread of malware on their fleet, prevent unexpected data egress from particularly vital systems, or block incoming connections from known malicious addresses. This is a huge advance in osquery’s capabilities, shifting it from merely a monitoring tool into both prevention and recovery domains.

What it can do now

The extension creates two new tables: HostBlacklist and PortBlacklist. These virtual tables generate their entries via the underlying operating systems’ native firewall interfaces: iptables on Linux, netsh on Windows, and pfctl on macOS. This keeps them compatible with the widest possible range of deployments and avoids further dependence on external libraries or applications. It will work with your existing configuration, and, regardless of underlying platform, provide the same interface and capabilities.

Use osquery to access the local firewall configuration on Mac, Windows, and Linux

What’s on the horizon

While the ability to read the state of the firewall is useful, it’s the possibility of controlling them that we’re most excited about. With writable tables available in osquery, blacklisting a port or a host on a managed system will become as simple as an INSERT statement. No need to deploy an additional firewall management service. No more reviewing how you configure the firewall on macOS. Just write an INSERT statement and push it out to the fleet.
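Once writable tables land, a fleet-wide block might look something like this sketch (the column names are illustrative, since the tables’ write schema isn’t shown above — check the extension’s documentation for the real ones):

```sql
-- Hypothetical writable-table usage; column names are illustrative.
INSERT INTO HostBlacklist (address) VALUES ('203.0.113.7');
INSERT INTO PortBlacklist (port) VALUES (23);
```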

Instantly block hostnames and ports across your entire fleet with osquery

Give it a try

With this extension you can query the state of blacklisted ports and hosts across a managed fleet and ensure that they’re all configured to your specifications. With the advent of the writable tables feature osquery can shift from a monitoring role to a management and preventative tool. This extension takes the first step in that direction.

We’re adding this extension to our managed repository. We’re committed to maintaining and extending our collection of extensions. You should check in and see what else we’ve released.

Do you have an idea for an osquery extension? File an issue on our GitHub repo for it. Contact us for osquery development.

Manage Santa within osquery

We’re releasing an extension for osquery that lets you manage Google Santa without the need for a separate sync server.

Google Santa is an application whitelist and blacklist system for macOS ideal for deployment across managed fleets. It uses a sync server from which daemons pull rules onto managed computers. However, the sync server provides no functionality for the bulk collection of logs or configuration states. It does not indicate whether all the agents have pulled the latest rules or how often those agents block execution of blacklisted binaries.

In partnership with Palantir, we have integrated Santa into the osquery interface as an extension. Santa can now be managed directly through osquery and no longer requires a separate sync server. Enterprises can use a single interface, osquery, to centrally manage logs and update or review agent configuration.

We’ve described writable access to endpoints as a superfeature of osquery. This extension shows why. Now, it’s possible to add remote management features to the osquery agent, which is normally limited to read-only access. This represents a huge advance in osquery’s capabilities, moving it from the role of strictly monitoring into an active and preventative role. Trail of Bits is pleased to announce the release of the Santa extension into our open-source repository of osquery extensions.

What it can do

Santa gives you fine-grained control over which applications may run on your computer. Add osquery and this extension into the mix, and now you’ve got fine-grained control over which applications may run on your fleet. Lock down endpoints to only run applications signed by a handful of approved certificates, or blacklist known malicious applications before they get a chance to run.

The extension can be loaded at the startup of osquery with the extension command-line argument, e.g., osqueryi --extension path/to/santa.ext. On loading, it adds two new tables to the database: santa_rules and santa_events. The tables themselves are straightforward.

santa_rules consists of three text columns: shasum, state, and type. The type column contains the rule type and may be either certificate or binary. state is either whitelist or blacklist. shasum contains either the hash of the binary or the signing certificate’s hash, depending on rule type.

The santa_events table has four text columns: timestamp, path, shasum, and reason. timestamp marks the time the deny event was logged. path lists the path to the denied application. shasum displays the hash of the file. reason shows the type of rule that caused the deny (either binary or certificate).
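Given those schemas, inspecting deny events and adding a rule could look like the following sketch (the hash is a placeholder, and the INSERT depends on the writable-tables support noted below):

```sql
-- Which binaries are being denied most often?
SELECT path, shasum, reason, COUNT(*) AS denies
FROM santa_events
GROUP BY path, shasum, reason
ORDER BY denies DESC;

-- Blacklist a binary by its hash (placeholder value).
INSERT INTO santa_rules (shasum, state, type)
VALUES ('<binary-sha256>', 'blacklist', 'binary');
```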

Time to use it

This extension provides a simplified interface to oversee and control your Santa deployment across your fleet, granting easy access to both rules and events. You can find it and other osquery extensions in our repository of maintained osquery extensions. We’ll continue to add new extensions. Take a look and see what we have available.

Hire us to tailor osquery to your needs

Do you have an idea for an osquery extension? File an issue on our GitHub repo for it. Contact us for osquery development.

Note: This feature depends on writable tables support for extensions which has not yet been merged. Contact us if you’d like to try this feature now — we create custom binary builds to test upcoming features of osquery for our clients.

Collect NTFS forensic information with osquery

We’re releasing an extension for osquery that will let you dig deeper into the NTFS filesystem. It’s one more tool for incident response and data collection. But it’s also an opportunity to dispense with forensics toolkits and commercial services that offer similar capabilities.

Until now, osquery has been inadequate for performing the kind of filesystem forensics that is often part of an incident response effort. It collects some information about files on its host platforms – timestamps, permissions, owner and more – but anyone with experience in forensics will tell you that there’s a lot more data available on a file system if you’re willing to dig. Think additional timestamps, unallocated metadata, or stale directory entries.

The alternatives are often closed source and expensive. They become one more item in your budget, deployment roadmap, and maintenance schedule. And none of them integrate with osquery. You have to go to the extra effort of mapping the forensic report back to your fleet.

That changes today. In partnership with Crypsis, we have integrated NTFS forensic information into the osquery interface as an extension. Consider this the first step toward a better, cost-effective, more efficient alternative that’s easier to deploy.

What it can do

The NTFS forensics extension provides specific additional file metadata from NTFS images, including filename timestamp entries, the security descriptor for files, whether a file has Alternate Data Streams (ADS), as well as other information. It also provides index entries for directory indices, including entries that are deallocated. You can find the malware that just cleaned up after itself, or altered its file timestamps but forgot about the filename timestamps, or installed a rootkit in the ADS of calc.exe, all without ever leaving osquery.

How to use it

Load the extension at the startup of osquery with the command line argument, e.g., <code>osqueryi.exe --extension path\to\ntfs_forensics.ext.exe</code>. On loading, three new tables will be added to the database: ntfs_part_data, ntfs_file_data, and ntfs_indx_data.

ntfs_part_data

This table provides information about partitions on a disk image. If queried without a specified disk image, it will attempt to interrogate the physical drives of the host system by walking up from \\.\PhysicalDrive0 until it finds a drive number it fails to open.
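That walk-up strategy can be sketched as follows. Here `probe` is a hypothetical stand-in for attempting to open the device handle, which the real extension performs natively; only the enumeration logic is illustrated.

```python
def enumerate_physical_drives(probe):
    r"""Collect \\.\PhysicalDriveN paths, walking up from drive 0 until the
    first drive number that fails to open. probe(n) is a hypothetical
    stand-in for opening the device handle; it returns True on success."""
    drives, n = [], 0
    while probe(n):
        drives.append(r"\\.\PhysicalDrive%d" % n)
        n += 1
    return drives

# With a fake probe that can "open" drives 0 and 1 only:
print(enumerate_physical_drives(lambda n: n < 2))
```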

Enumerating partition entries in an NTFS image

ntfs_file_data

This table provides information about file entries in an NTFS file system. The device and partition columns must be specified explicitly in the WHERE clause to query the table. If the path or inode column is specified, then a single row about the specified file is returned. If the directory column is specified, then a row is returned for every file in that directory. If nothing is specified, a walk of the entire partition is performed. Because the walk of the entire partition is costly, results are cached to be reused without reperforming the entire walk. If you need fresh results of a partition walk, use the hidden column from_cache in the WHERE clause to force the collection of live data (e.g., select * from ntfs_file_data where device="\\.\PhysicalDrive0" and partition=2 and from_cache=0;).

Displaying collected data on a single entry in an NTFS file system

ntfs_indx_data

This table provides the content of index entries for a specified directory, including index entries discovered in slack space. Like ntfs_file_data, the device and partition columns must be specified in the WHERE clause of a query, as well as either parent_path or parent_inode. Entries discovered in slack space will have a non-zero value in the slack column.

Displaying inode entries recovered from a directory index’s slack space

Getting Started

This extension offers a fast and convenient way to perform filesystem forensics on Windows endpoints as a part of an incident response. You can find it and our other osquery extensions in our repository. We’re committed to maintaining and extending our collection of extensions. Take a look, and see what else we have available.

Hire us to tailor osquery to your needs

Do you have an idea for an osquery extension? File an issue on our GitHub repo for it. Contact us for osquery development.

State Machine Testing with Echidna

Property-based testing is a powerful technique for verifying arbitrary properties of a program via execution on a large set of inputs, typically generated stochastically. Echidna is a library and executable I’ve been working on for applying property-based testing to EVM code (particularly code written in Solidity).

Echidna is a library for generating random sequences of calls against a given smart contract’s ABI and making sure that their evaluation preserves some user-defined invariants (e.g.: the balance in this wallet must never go down). If you’re from a more conventional security background, you can think of it as a fuzzer, with the caveat that it looks for user-specified logic bugs rather than crashes (as programs written for the EVM don’t “crash” in any conventional way).

The property-based testing functionality in Echidna is implemented with Hedgehog, a property-based testing library by Jacob Stanley. Think of Hedgehog as a nicer version of QuickCheck. It’s an extremely powerful library, providing automatic minimal testcase generation (“shrinking”), well-designed abstractions for things like ranges, and most importantly for this blog post, abstract state machine testing tools.

After reading a particularly excellent blog post by Tim Humphries (“State machine testing with Hedgehog,” which I’ll refer to as the “Hedgehog post” from now on) about testing a simple state machine with this functionality, I was curious if the same techniques could be extended to the EVM. Many contracts I see in the wild are just implementations of some textbook state machine, and the ability to write tests against that invariant-rich representation would be invaluable.

The rest of this blog post assumes at least a degree of familiarity with Hedgehog’s state machine testing functionality. If you’re unfamiliar with the software, I’d recommend reading Humphries’s blog post first. It’s also worth noting that the code below demonstrates advanced usage of Echidna’s API; you can use Echidna to test code without writing a line of Haskell.

First, we’ll describe our state machine’s states, then its transitions, and once we’ve done that we’ll use it to actually find some bugs in contracts implementing it. If you’d like to follow along on your own, all the Haskell code is in examples/state-machine and all the Solidity code is in solidity/turnstile.

Step 0: Build the model

Fig. 1: A turnstile state machine

The state machine in the Hedgehog post is a turnstile with two states (locked and unlocked) and two actions (inserting a coin and pushing the turnstile), with “locked” as its initial state. We can copy this code verbatim.

data ModelState (v :: * -> *) = TLocked
                              | TUnlocked
                              deriving (Eq, Ord, Show)

initialState :: ModelState v
initialState = TLocked

However, in the Hedgehog post the effectful implementation of this abstract model was a mutable variable that required I/O to access. We can instead use a simple Solidity program.

contract Turnstile {
  bool private locked = true; // initial state is locked

  function coin() {
    locked = false;
  }

  function push() returns (bool) {
    if (locked) {
      return(false);
    } else {
      locked = true;
      return(true);
    }
  }
}

At this point, we have an abstract model that just describes the states, not the transitions, and some Solidity code we claim implements a state machine. In order to test it, we still have to describe this machine’s transitions and invariants.

Step 1: Write some commands

To write these tests, we need to make explicit how we can execute the implementation of our model. The examples given in the Hedgehog post work in any MonadIO, as they deal with IORefs. However, since EVM execution is deterministic, we can work instead in any MonadState VM.

The simplest command is inserting a coin. This should always result in the turnstile being unlocked.

s_coin :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState
s_coin = Command (\_ -> Just $ pure Coin)
                 -- Regardless of initial state, we can always insert a coin
  (\Coin -> cleanUp >> execCall ("coin", []))
  -- Inserting a coin is just calling coin() in the contract
  -- We need cleanUp to chain multiple calls together
  [ Update $ \_ Coin _ -> TUnlocked
    -- Inserting a coin sets the state to unlocked
  , Ensure $ \_ s Coin _ -> s === TUnlocked
    -- After inserting a coin, the state should be unlocked
  ]

Since the push function in our implementation returns a boolean value we care about (whether or not pushing “worked”), we need a way to parse EVM output. execCall has type MonadState VM m => SolCall -> m VMResult, so we need a way to check whether a given VMResult is true, false, or something else entirely. This turns out to be pretty trivial.

match :: VMResult -> Bool -> Bool
match (VMSuccess (B s)) b = s == encodeAbiValue (AbiBool b)
match _ _ = False

Now that we can check the results of pushing, we have everything we need to write the rest of the model. As before, we’ll write two Commands, modeling pushing while the turnstile is locked and unlocked, respectively. Pushing while locked should fail and leave the turnstile locked. Pushing while unlocked should succeed and lock the turnstile.

s_push_locked :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState
s_push_locked = Command (\s -> if s == TLocked then Just $ pure Push else Nothing)
                        -- We can only run this command when the turnstile is locked
  (\Push -> cleanUp >> execCall ("push", []))
  -- Pushing is just calling push()
  [ Require $ \s Push -> s == TLocked
    -- Before we push, the turnstile should be locked
  , Update $ \_ Push _ -> TLocked
    -- After we push, the turnstile should be locked
  , Ensure $ \before after Push b -> do before === TLocked
                                        -- As before
                                        assert (match b False)
                                        -- Pushing should fail
                                        after === TLocked
                                        -- As before
  ]
s_push_unlocked :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState
s_push_unlocked = Command (\s -> if s == TUnlocked then Just $ pure Push else Nothing)
                          -- We can only run this command when the turnstile is unlocked
  (\Push -> cleanUp >> execCall ("push", []))
  -- Pushing is just calling push()
  [ Require $ \s Push -> s == TUnlocked
    -- Before we push, the turnstile should be unlocked
  , Update $ \_ Push _ -> TLocked
    -- After we push, the turnstile should be locked
  , Ensure $ \before after Push b -> do before === TUnlocked
                                        -- As before
                                        assert (match b True)
                                        -- Pushing should succeed
                                        after === TLocked
                                        -- As before
  ]

Recall the image from Step 0: the states we enumerated there are the shapes, and the transitions we wrote here are the arrows. Our arrows are also equipped with some rigid invariants about the conditions that must be satisfied to make each state transition (that’s our Ensure above). We now have a language that totally describes our state machine, and we can simply describe how its statements compose to get a Property!

Step 2: Write a property

This composition is actually fairly simple: we just tell Echidna to execute our actions sequentially, and since the invariants are captured in the actions themselves, that’s all that’s required to test! The only thing we need now is the actual subject of our testing, which, since we work in any MonadState VM, is just a VM that we can parametrize the property on.

prop_turnstile :: VM -> Property
prop_turnstile v = property $ do
  actions <- forAll $ Gen.sequential (Range.linear 1 100) initialState
    [s_coin, s_push_locked, s_push_unlocked]
  -- Generate between 1 and 100 actions, starting with a locked (model) turnstile
  evalStateT (executeSequential initialState actions) v
  -- Execute them sequentially on the given VM.

You can think of the above code as a function that takes an EVM state and returns a Hedgehog-checkable assertion that it implements our (Haskell) state machine definition.
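For readers who don’t speak Haskell, here is a sketch of the same idea in plain Python: a reference implementation, a model, and a checker that runs random call sequences against both. This is only an illustration of what Echidna and Hedgehog are doing for us; it has none of their generation or shrinking machinery.

```python
import random

class Turnstile:
    """Reference implementation mirroring the Solidity contract."""
    def __init__(self, locked=True):
        self.locked = locked

    def coin(self):
        self.locked = False

    def push(self):
        if self.locked:
            return False
        self.locked = True
        return True

def check(make_impl, runs=1000, max_len=100):
    """Run random call sequences against the model; return the failing
    action prefix (no shrinking, unlike Hedgehog) or None if all runs pass."""
    for _ in range(runs):
        impl, model_locked = make_impl(), True   # the model starts locked
        trace = []
        for _ in range(random.randint(1, max_len)):
            action = random.choice(["coin", "push"])
            trace.append(action)
            if action == "coin":
                impl.coin()
                model_locked = False             # a coin always unlocks
            else:
                succeeded = impl.push()
                if succeeded == model_locked:    # push must succeed iff unlocked
                    return trace
                model_locked = True              # a push always leaves it locked
    return None

print(check(Turnstile))                          # correct contract: None
print(check(lambda: Turnstile(locked=False)))    # bad init: a failing trace
```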

Step 3: Test

With this property written, we’re ready to test some Solidity! Let’s spin up ghci to check this property with Echidna.

λ> (v,_,_) <- loadSolidity "solidity/turnstile/turnstile.sol" -- set up a VM with our contract loaded
λ> check $ prop_turnstile v -- check that the property we just defined holds
  ✓ passed 10000 tests.
True
λ>

It works! The Solidity we wrote implements our model of the turnstile state machine. Echidna evaluated 10,000 random call sequences without finding anything wrong.

Now, let’s find some failures. Suppose we initialize the contract with the turnstile unlocked, as below. This should be a pretty easy failure to detect, since it’s now possible to push successfully without putting a coin in first.

We can just slightly modify our initial contract as below:

contract Turnstile {
  bool private locked = false; // initial state is unlocked

  function coin() {
    locked = false;
  }

  function push() returns (bool) {
    if (locked) {
      return(false);
    } else {
      locked = true;
      return(true);
    }
  }
}

And now we can use the exact same ghci commands as before:

λ> (v,_,_) <- loadSolidity "solidity/turnstile/turnstile_badinit.sol"
λ> check $ prop_turnstile v
  ✗ failed after 1 test.

       ┏━━ examples/state-machine/StateMachine.hs ━━━
    49 ┃ s_push_locked :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState
    50 ┃ s_push_locked = Command (\s -> if s == TLocked then Just $ pure Push else Nothing)
    51 ┃   (\Push -> cleanUp >> execCall ("push", []))
    52 ┃   [ Require $ \s Push -> s == TLocked
    53 ┃   , Update $ \_ Push _ -> TLocked
    54 ┃   , Ensure $ \before after Push b -> do before === TLocked
    55 ┃                                         assert (match b False)
       ┃                                         ^^^^^^^^^^^^^^^^^^^^^^
    56 ┃                                         after === TLocked
    57 ┃ ]

       ┏━━ examples/state-machine/StateMachine.hs ━━━
    69 ┃ prop_turnstile :: VM -> Property
    70 ┃ prop_turnstile v = property $ do
    71 ┃   actions <- forAll $ Gen.sequential (Range.linear 1 100) initialState
    72 ┃   [s_coin, s_push_locked, s_push_unlocked]
       ┃   │ Var 0 = Push
    73 ┃   evalStateT (executeSequential initialState actions) v

    This failure can be reproduced by running:
    > recheck (Size 0) (Seed 3606927596287211471 (-1511786221238791673))

False
λ>

As we’d expect, our property isn’t satisfied. The first time we push it should fail, as the model thinks the turnstile is locked, but it actually succeeds. This is exactly the result we expected above!

We can try the same thing with some other buggy contracts as well. Consider the below Turnstile, which doesn’t lock after a successful push.

contract Turnstile {
  bool private locked = true; // initial state is locked

  function coin() {
    locked = false;
  }

  function push() returns (bool) {
    if (locked) {
      return(false);
    } else {
      return(true);
    }
  }
}

Let’s use those same ghci commands one more time:

λ> (v,_,_) <- loadSolidity "solidity/turnstile/turnstile_nolock.sol"
λ> check $ prop_turnstile v
  ✗ failed after 4 tests and 1 shrink.

       ┏━━ examples/state-machine/StateMachine.hs ━━━
    49 ┃ s_push_locked :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState
    50 ┃ s_push_locked = Command (\s -> if s == TLocked then Just $ pure Push else Nothing)
    51 ┃   (\Push -> cleanUp >> execCall ("push", []))
    52 ┃   [ Require $ \s Push -> s == TLocked
    53 ┃   , Update $ \_ Push _ -> TLocked
    54 ┃   , Ensure $ \before after Push b -> do before === TLocked
    55 ┃                                         assert (match b False)
       ┃                                         ^^^^^^^^^^^^^^^^^^^^^^
    56 ┃                                         after === TLocked
    57 ┃  ]

       ┏━━ examples/state-machine/StateMachine.hs ━━━
    69 ┃ prop_turnstile :: VM -> Property
    70 ┃ prop_turnstile v = property $ do
    71 ┃   actions <- forAll $ Gen.sequential (Range.linear 1 100) initialState
    72 ┃   [s_coin, s_push_locked, s_push_unlocked]
       ┃   │ Var 0 = Coin
       ┃   │ Var 1 = Push
       ┃   │ Var 3 = Push
    73 ┃   evalStateT (executeSequential initialState actions) v

    This failure can be reproduced by running:
    > recheck (Size 3) (Seed 133816964769084861 (-8105329698605641335))

False
λ>

When we insert a coin then push twice, the second should fail. Instead, it succeeds. Note that in all these failures, Echidna finds the minimal sequence of actions that demonstrates the failing behavior. This is because of Hedgehog’s shrinking features, which provide this behavior by default.

More broadly, we now have a tool that will accept arbitrary contracts (that implement the push/coin ABI), check whether they implement our specified state machine correctly, and return a minimal falsifying counterexample if they do not. As a Solidity developer working on a turnstile contract, I can run this on every commit and get a simple explanation of any regression that occurs.

Concluding Notes

Hopefully the above presents a motivating example for testing with Echidna. We wrote a simple description of a state machine, then tested four different contracts against it; each case yielded either a minimal proof the contract did not implement the machine or a statement of assurance that it did.

If you’d like to try implementing this kind of testing yourself on a canal lock, use this exercise we wrote for a workshop.

What do you wish osquery could do?

Welcome to the third post in our series about osquery. So far, we’ve described how five enterprise security teams use osquery and reviewed the issues they’ve encountered. For our third post, we focus on the future of osquery. We asked users, “What do you wish osquery could do?” The answers we received ranged from small requests to huge advancements that could disrupt the incident-response tool market. Let’s dive into those ‘super features’ first.

osquery super features

Some users’ suggestions could fundamentally expand osquery’s role beyond incident detection, potentially allowing it to steal significant market share from commercial prevention and response tools (we listed a few of these in our first blog post). This would be a big deal. A free and open source tool that gives security teams access to incident response abilities normally reserved for customers of expensive paid services would be a windfall for the community. It could democratize fleet security and enhance the entire community’s defense against attackers. Here are the features that could take osquery to the next level:

Writable access to endpoints

What it is: Currently, osquery is limited to read-only access on endpoints. Such access allows the program to detect and report changes in the operating systems it monitors. Write-access via an osquery extension would allow it to edit registries in the operating system and change the way endpoints perform. It could use this access to enforce security policies throughout the fleet.

Why it would be amazing: Write-access would elevate osquery from a detection tool to the domain of prevention. Rather than simply observing system issues with osquery, write-access would afford you the ability to harden the system right from the SQL interface. Application whitelisting and enforcement, managing licenses, partitioning firewall settings, and more could all be available.

How we could build it: If not built correctly, write-access in osquery could cause more harm than good. Write-access goes beyond the scope of osquery core. Some current users are only permitted to deploy osquery throughout their fleet because of its limited read-only permissions. Granting write-access through osquery core would bring heightened security risks as well as potential for system disruption. The right way to implement this would be to make it available to extensions that request the functionality during initialization and minimize the impact this feature has on the core.

IRL Proof: In fact, we have a pull request waiting on approval that would support write-access through extensions! The code enables write-permissions for extensions but also blocks write-permissions for tables built into core.

We built this feature in support of a client who wanted to block malicious IP addresses, domains and ports for both preventative and reactive use-cases. Once this code is committed, our clients will be able to download our osquery firewall extension to use osquery to partition firewall settings throughout their fleets.

Event-triggered responses

What it is: If osquery reads a log entry that indicates an attack, it could automatically respond with an action such as quarantining the affected endpoint(s). This super feature would add automated prevention and incident response to osquery’s capabilities.

Why it would be amazing: This would elevate osquery’s capabilities to those of commercial vulnerability detection/response tools, but it would be transparent and customizable. Defense teams could evaluate, customize, and match osquery’s incident-response capabilities to their companies’ needs, as a stand-alone solution or as a complement to another more generic response suite.

How we could build it: Automated event response for osquery could be built flexibly to allow security teams to define their own indicators of incidents and their preferred reactions. Users could select from known updated databases: URL reputation via VirusTotal, file reputation via ReversingLabs, IP reputation of the remote addresses of active connections via OpenDNS, etc. The user could pick the type of matching criteria (e.g., exact, partial, particular patterns, etc.), and prescribe a response such as ramping up logging frequency, adding an associated malicious ID to a firewall block list, or calling an external program to take an action. As an additional option, event triggering that sends logs to an external analysis tool could provide more sophisticated response without damaging endpoint performance.
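To make the shape of such a rule-and-response loop concrete, here is a hypothetical sketch in Python. None of the names or structures below correspond to real osquery APIs; the rules, matching criteria, and responses are all invented for illustration.

```python
# All names here are hypothetical; nothing below is a real osquery API.
RULES = [
    {"field": "remote_address", "match": "exact",
     "values": {"203.0.113.7"},          # e.g., a known-bad IP from a feed
     "response": "block_ip"},
    {"field": "path", "match": "substring",
     "values": {"/tmp/"},                # e.g., execution out of /tmp
     "response": "raise_log_frequency"},
]

def matches(rule, event):
    """Apply the rule's matching criterion to one field of an event row."""
    value = event.get(rule["field"], "")
    if rule["match"] == "exact":
        return value in rule["values"]
    return any(v in value for v in rule["values"])

def respond(event):
    """Collect the responses prescribed by every rule the event matches."""
    return [r["response"] for r in RULES if matches(r, event)]

print(respond({"remote_address": "203.0.113.7", "path": "/usr/bin/curl"}))
# ['block_ip']
```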

IRL Proof: Not only did multiple interviewees long for this feature; some teams have started to build rudimentary versions of it. As discussed in “How are teams currently using osquery?”, we spoke with one team who built incident alerting with osquery by piping log data into ElasticSearch and auto-generating Jira tickets through ElastAlert upon anomaly detection. This example doesn’t demonstrate full response capability, but it illustrates that useful, just-in-time business-process reaction to incidents is possible with osquery. If osquery can monitor event-driven logs (FIM, process auditing, etc.), trigger an action based on detection of a certain pattern, and administer a protective response, it can provide an effective endpoint protection platform.

Technical debt overhaul

What it is: Many open source projects carry ‘technical debt’: code engineered to be effective for short-term goals but unsuitable for long-term program architecture. A distributed developer community, with each member enhancing the technology for slightly different requirements, exacerbates this problem. Solving it requires costly coordination and effort from multiple community members to rebuild and standardize the system.

Why it would be amazing: Decreasing osquery’s technical debt would upgrade the program to a standard adoptable by a significantly wider range of security teams. Users in our osquery pain points research cited performance effects and reliability among organizational leadership’s top concerns about adopting osquery. Ultimately, the teams we interviewed won the argument, but there are likely many teams who didn’t get the green light on using osquery.

How we could build it: Tackling technical debt is hard enough within an organization. It’s liable to be even harder in a distributed community. Unless developers have a specific motivation for tackling very difficult high-value inefficiencies, the natural reward for closing an issue biases developers toward smaller efforts. To combat this, leaders in the community could dump and sort all technical debt issues along a matrix of value and time, leave all high-value/low-time issues for individual open source developers, and pool community resources to resolve harder problems as full-fledged development projects.

IRL Proof: We know that pooling community resources to tackle technical debt works. We’ve been doing it for over a year. Trail of Bits has been commissioned by multiple companies to build features and fixes too big for the open source community. We’ve leveraged this model to port osquery to Windows, enhance FIM and process auditing, and much more that we’re excited to share with the public over the coming months. Often, multiple clients are interested in building the same things. We’re able to pool resources to make the project less expensive for everyone involved while the entire community benefits.

Other features users want

osquery shows considerable potential to grow beyond endpoint monitoring. However, the enterprise security teams and developers we interviewed say that the open source tool has room for improvement. Here are some of the other requests we heard from users:

  • Guardrails & rules for queries: Right now, a malformed query or practice can hamper the user’s workflow. Interviewees wanted guidance on targeting the correct data, querying at correct intervals, gathering from recommended tables, and customized recommendations for different environments.
  • Enhance Deployment Options: Users sought better tools for deploying throughout fleets and keeping these implementations updated. Beyond recommended QueryPacks, administrators wanted to be able to define and select platform-specific configurations of osquery across multi-platform endpoints. Automatically detecting and deploying configurations for unique systems and software was another desired feature.
  • Integrated Testing, Debugging, and Diagnostics: In addition to the current debugging tools, users wanted more resources for testing and diagnosing issues. New tools should help improve reliability and predictability, avoid performance issues, and make osquery easier to use.
  • Enhanced Event-Driven Data Collection: osquery has support for event-based data collection through FIM, Process Auditing, and other tables. However, these data sources suffer from logging implementation issues and are not supported on all platforms. Better event-handling configurations, published best practices, and guardrails for gathering data would be a great help.
  • Enhanced Performance Features: Users want osquery to do more with fewer resources. This would either lead to overall performance enhancements, or allow osquery to operate on endpoints with low resource profiles or mission-critical performance requirements.
  • Better Configuration Management: Enhancements such as custom tables and osqueryd scheduled queries for differing endpoint environments would make osquery easier to deploy and maintain on a growing fleet.
  • Support for Offline Endpoint Logging: Users reported a desire for forensic data availability to support remote endpoints. This would require offline endpoints to store data locally (including storage of failed queries) and push it to the server upon reconnection.
  • Support for Common Platforms: Facebook built osquery for its fleet of macOS- and Linux-based endpoints. PC sysadmins were out of luck until our Windows port last year. Support for other operating systems has been growing steadily thanks to the development community’s efforts. Nevertheless, there are still limitations. Think of this as one umbrella feature request: support for all features on all operating systems.

The list keeps growing

Unfortunately for current and prospective osquery users, Facebook can’t satisfy all of these requests. They’ve shared a tremendous gift by open sourcing osquery. Now it’s up to the community to move the platform forward.

Good news: none of these feature requests are infeasible. The custom engineering is just uneconomical for individual organizations to invest in.

In the final post in this series, we’ll propose a strategy for osquery users to share the cost of development. Companies that would benefit could pool resources and collectively target specific features.

This would accelerate the rate at which companies could deprecate other full-suite tools that are more expensive, less flexible and less transparent.

If any of these items resonate with your team’s needs, or if you use osquery currently and have another request to add to the list, please let us know.

How to prepare for a security audit

You’ve just approved an audit of your codebase. Do you:

  • Send a copy of the repository and wait for the auditors’ reports, or
  • Take the extra effort to set the auditors up for success?

By the end of the audit, the difference between these answers will lead to profoundly disparate results. In the former case, you’ll waste money, lose time, and miss security issues. In the latter case, you’ll reduce your risk, protect your time, and get more valuable security guidance.

It’s an easy choice, right?

Glad you agree.

Now, here’s how to make that audit more effective, valuable, and satisfying for everybody involved.

Set a goal for the audit

This is the most important step of an audit, and paradoxically the one most often overlooked. You should have an idea of what kind of question you want answered, such as:

  • What’s the overall level of security for this product?
  • Are all client data transactions handled securely?
  • Can a user leak information about another user?

Knowing your biggest area of concern will help the auditing team tailor their approach to meet your needs.

Resolve the easy issues

Handing the code off to the auditing team is a lot like releasing the product: the cleaner the code, the better everything will go. To that end:

  • Enable and address compiler warnings. Go after the easy stuff first: turn on every compiler warning you can find, understand each one, then fix your code until they’re all gone. Upgrade your compiler to the latest version, then fix all the new warnings and errors. Even innocuous-seeming warnings can indicate problems lying in wait.
  • Increase unit and feature test coverage. Ideally this has been part of the development process, but everyone slips up: tests don’t get updated, or new features don’t quite match the old integration tests. Now is the time to update the tests and run them all.
  • Remove dead code, stale branches, unused libraries, and other extraneous weight. You may know which branch is active and which is dead, but the auditors won't, and they'll waste time investigating it for potential issues. The same goes for that new feature that hasn't seen progress in months, or that third-party library that doesn't get used anymore.

Some issues will persist — a patch that isn’t quite ready, or a refactor that’s not integrated yet. Document any incomplete changes as thoroughly as possible, so that your auditors don’t waste a week digging into code that will be gone in two months’ time.

Document, Document, Document

Think of an audit team as a newly hired, fully remote developer: skilled at what they do, but unfamiliar with your product and code base. The more documentation, the faster they'll get up to speed and the sooner they'll be able to start their analysis.

  • Describe what your product does, who uses it, and how. The most important documentation is high level: what does your product do? What do users want from it? How does it achieve that goal? Use clear language to describe how systems interact and the rationale for design decisions made during development.
  • Add comments in-line with the code. Functions should have comments containing high-level descriptions of their intended behavior. Complicated sections of code should have comments describing what is happening and why this particular approach was chosen.
  • Label and describe your tests. More complicated tests should describe the exact behavior they’re testing. The expected results of tests, both positive and negative, should be documented.
  • Include past reviews and bugs. Previous audit reports can provide guidance to the new audit team. Similarly, documentation regarding past security-relevant bugs can give an audit team clues about where to look most carefully.

Deliver the code batteries included

Just like a new fully remote developer, the audit team will need a copy of the code and clear guidance on how to build and deploy your application.

  • Prepare the build environment. Document the steps to create a build environment from scratch on a computer that is fully disconnected from your internal network. Where relevant, be specific about software versions. Walk through this process on your own to ensure that it is complete. If you have external dependencies that are not publicly available, include them with your code. Fully provisioned virtual machine images are a great way to deliver a working build environment.
  • Document the build process. Include both the debug and release build processes, and also include steps on how to build and run the tests. If the test environment is distinct from the build environment, include steps on how to create the test environment. A well-documented build process enables an auditor to run static analysis tools far more efficiently and effectively.
  • Document the deploy process. This includes how to build the deployment environment. It is very important to list all the specific versions of external tools and libraries for this process, as the deployment environment is a considerable factor in evaluating the security of your product. A well-documented deployment process enables an auditor to run dynamic analysis tools in a real-world environment.

The payoff

At this point you’ve handed off your code, documentation, and build environment to the auditors. All that prep work will pay off. Rather than puzzling over how to build your code or what it does, the audit team can immediately start work integrating advanced analysis tools, writing custom fuzzers, or bringing custom internal tools to bear. Knowing your specific goals will help them focus where you want them to.

An audit can produce a lot of insight into the security of your product. Having a clear goal for the audit, a clean codebase, and complete documentation will not only help the audit, it will also make you more confident about the quality of the results.

Interested in getting an audit? Contact us to find out what we can do for you.

Checklist

Resolve the easy issues

  • Enable and address every last compiler warning.
  • Increase unit and feature test coverage.
  • Remove dead code, stale branches, unused libraries, and other extraneous weight.

Document

  • Describe what your product does, who uses it, why, and how it delivers.
  • Add comments about intended behavior in-line with the code.
  • Label and describe your tests and results, both positive and negative.
  • Include past reviews and bugs.

Deliver the code batteries included

  • Document the steps to create a build environment from scratch on a computer that is fully disconnected from your internal network. Include external dependencies.
  • Document the build process, including debug and release builds, the tests, and the test environment.
  • Document the deploy process and environment, including all the specific versions of external tools and libraries for this process.