# QueryCon 2018: our talks and takeaways

Sometimes a conference just gets it right. Good talks, single track, select engaged attendees, and no sales talks. It’s a recipe for success that Kolide got right on its very first try with QueryCon, the first-ever osquery conference.

It’s no secret that we are huge fans of osquery, Facebook’s award-winning open source endpoint detection tool. From when we ported osquery to Windows in 2016 to our launch of our osquery extension repo this year, we’ve been one of the leading contributors to the tool’s development. This is why we were delighted Kolide invited us to participate in QueryCon!

The two-day conference, hosted at the beautiful Palace of the Fine Arts in San Francisco, drew over 120 attendees and 16 speakers. The attendance list was a Who’s Who in Big Tech security; teams from Facebook, Airbnb, Yelp, Atlassian, Adobe, Netflix, Salesforce, and more. It was great to meet face-to-face. We’ve been collaborating with some of these teams on osquery for years. It was also exciting to see the widespread adoption of the technology manifested in person. Though some of the teams attending were there to learn about the tech before deploying, the majority seemed to be committed adopters.

The talks ranged from the big-picture (operational security preparedness by Rob Fry of JASK) to the highly technical (breakdowns of macOS internals by Michael Lynn of Facebook), with consistent levity, epitomized by the brilliantly sulky Ben Hughes of Stripe. Scott Lundgren of Carbon Black gave a report-card-style review of the community from an outsider’s perspective. Longtime osquery evangelist Chris Long of Palantir provided a candid user experience of working with osquery’s audit framework in his organization. It was a well-curated mix of subjects, speakers, and perspectives. They all taught us something new.

## What we learned at QueryCon

### 1. The community is bigger and stronger than we thought

As of this week, osquery’s Slack has 1,703 users. Until the sold-out showing at QueryCon, I never thought to check how many of those users were active; 431 in the last 30 days. 120 of those people made it to QueryCon. Dozens more joined the waitlist.

### 2. Some users are innovating in very cool ways

We came to QueryCon intent on pushing the community to use osquery in new, innovative ways. Turns out, it didn’t need much pushing. Take the security team at Netflix. They’re using osquery in multiple internal open source projects: Diffy, a digital forensics and incident response (DFIR) tool, and Stethoscope, their security detection and recommendation application. We heard many more examples from many more teams.

### 3. The community really likes our contributions

Many of the talks mentioned our team and our work. We knew we were contributing significant engineering effort, but we hadn’t truly realized how much others had been benefiting. It felt great to hear that work done for our clients truly advances the whole community.

### 4. The goals are clear, but the way there is not

We gleaned some clear takeaways that are likely common for a first meetup of a new open source project:

• We need to define and broadcast osquery’s guiding principles;
• We need to solidify some best practices for effective collaboration;
• We need to tackle technical debt.

However, we didn’t determine how these will get done. Facebook was clear in defining its role in this process. Their small dedicated osquery team will continue to put in the hard work of testing, managing versions, and holding the community to high standards for both written code and community inclusion. However, it’s up to the community to take care of the rest.

## What we shared at QueryCon

### Osquery Super Features

Speaker: Lauren Pearl

Abstract: In this talk, we reviewed a user feature wishlist gathered from interviews with five Silicon Valley tech teams who use osquery. From these, we identified Super Features – features that would fundamentally improve the value proposition of osquery. We explained how these developments could transform osquery’s power in technical organizations. Finally, we walked through the high-level development plans for making these Super Features a reality.

Link to Video: To be released soon!

Slides: Super Features PDF

### The Osquery Extensions Skunkworks Project: Unconventional Uses for Osquery

Speaker: Mike Myers

Abstract: Facebook created osquery with certain guiding principles: don’t pry into users’ data, don’t change the state of the system, don’t create network traffic to third parties. It was originally intended as a read-only information gatherer. For those that didn’t want to play by these rules, there’s the extension interface. We’ve begun experimenting with extensions that don’t align with mainline osquery: integrating with third-party services, writable tables, host-based firewall administration, malware vaccination, and more. We shared some of our lessons-learned on the challenges of using osquery as a control interface.

Link to Video: To be released soon!

Slides: Skunkworks Extensions PDF

## Thank you so much!

This was a great first conference for an emerging technology. It awakened community leaders to issues and opportunities and started the conversation of how to push forward. Attendees renewed enthusiasm and commitment to advance and maintain the project.

It’s hard to believe that this was Kolide’s first time hosting such an event. Director Of Operations, Antigoni Sinanis, the lady in charge of the event’s success, has set a high bar for her company to clear next year. We at are already looking forward to round two!

# Manage your fleet’s firewalls with osquery

Each of the three major operating systems provides a native firewall, capable of blocking incoming and outgoing access when configured. However, the interface for each of these three firewall systems are dissimilar and each requires different methods of configuration. Furthermore, there are few options for cross-platform fleet configuration, and nearly all are commercial and proprietary.

In partnership with Airbnb, we have created a cross-platform firewall management extension for osquery. The extension enables programmatic control over the native firewalls and provides a common interface for each host operating system, permitting more advanced control over an enterprise fleet’s endpoint protections as well as closing the loop between endpoint monitoring and endpoint management.

Along with our Santa management extension, this extension shows the utility of writable tables in osquery extensions. Programmatic control over endpoint firewalls means that an administrator can react more quickly to prevent the spread of malware on their fleet, prevent unexpected data egress from particularly vital systems, or block incoming connections from known malicious addresses. This is a huge advance in osquery’s capabilities, shifting it from merely a monitoring tool into both prevention and recovery domains.

## What it can do now

The extension creates two new tables: HostBlacklist and PortBlacklist. These virtual tables generate their entries via the underlying operating systems’ native firewall interfaces: iptables on Linux, netsh on Windows, and pfctl on MacOS. This keeps them compatible with the widest possible range of deployments and avoids further dependence on external libraries or applications. It will work with your existing configuration, and, regardless of underlying platform, provide the same interface and capabilities.

Use osquery to access the local firewall configuration on Mac, Windows, and Linux

## What’s on the horizon

While the ability to read the state of the firewall is useful, it’s the possibility of controlling them that we’re most excited about. With writable tables available in osquery, blacklisting a port or a host on a managed system will become as simple as an INSERT statement. No need to deploy an additional firewall management service. No more reviewing how you configure the firewall on macOS. Just write an INSERT statement and push it out the fleet.

Instantly block hostnames and ports across your entire fleet with osquery

## Give it a try

With this extension you can query the state of blacklisted ports and hosts across a managed fleet and ensure that they’re all configured to your specifications. With the advent of the writable tables feature osquery can shift from a monitoring role to a management and preventative tool. This extension takes the first step in that direction.

We’re adding this extension to our managed repository. We’re committed to maintaining and extending our collection of extensions. You should check in and see what else we’ve released.

Do you have an idea for an osquery extension? File an issue on our GitHub repo for it. Contact us for osquery development.

# Manage Santa within osquery

We’re releasing an extension for osquery that lets you manage Google Santa without the need for a separate sync server.

Google Santa is an application whitelist and blacklist system for macOS ideal for deployment across managed fleets. It uses a sync server from which daemons pull rules onto managed computers. However, the sync server provides no functionality for the bulk collection of logs or configuration states. It does not indicate whether all the agents have pulled the latest rules or how often those agents block execution of blacklisted binaries.

In partnership with Palantir, we have integrated Santa into the osquery interface as an extension. Santa can now be managed directly through osquery and no longer requires a separate sync server. Enterprises can use a single interface, osquery, to centrally manage logs and update or review agent configuration.

We’ve described writable access to endpoints as a superfeature of osquery. This extension shows why. Now, it’s possible to add remote management features to the osquery agent, which is normally limited to read-only access. This represents a huge advance in osquery’s capabilities, moving it from the role of strictly monitoring into an active and preventative role. Trail of Bits is pleased to announce the release of the Santa extension into our open-source repository of osquery extensions.

## What it can do

Santa gives you fine-grained control over which applications may run on your computer. Add osquery and this extension into the mix, and now you’ve got fine-grained control over which applications may run on your fleet. Lock down endpoints to only run applications signed by a handful of approved certificates, or blacklist known malicious applications before they get a chance to run.

The extension can be loaded at the startup of osquery with the extension command line argument, e.g., osqueryi --extension path/to/santa.ext. On loading, it adds two new tables to the database: santa_rules and santa_events.The tables themselves are straightforward.

santa_rules consists of the three text columns: shasum, state, and type. The type column contains the rule type and may be either certificate or binary. state is either whitelist or blacklist. shasum contains either the hash of the binary or the signing certificate’s hash, depending on rule type.

The santa_events table has four text columns: timestamp, path, shasum, and reason. timestamp marks the time the deny event was logged. path lists the path to the denied application. shasum displays the hash of the file. reason shows the type of rule that caused the deny (either binary or certificate).

## Time to use it

This extension provides a simplified interface to oversee and control your Santa deployment across your fleet, granting easy access to both rules and events. You can find it and other osquery extensions in our repository of maintained osquery extensions. We’ll continue to add new extensions. Take a look and see what we have available.

## Hire us to tailor osquery to your needs

Do you have an idea for an osquery extension? File an issue on our GitHub repo for it. Contact us for osquery development.

Note: This feature depends on writable tables support for extensions which has not yet been merged. Contact us if you’d like to try this feature now — we create custom binary builds to test upcoming features of osquery for our clients.

# Collect NTFS forensic information with osquery

We’re releasing an extension for osquery that will let you dig deeper into the NTFS filesystem. It’s one more tool for incident response and data collection. But it’s also an opportunity to dispense with forensics toolkits and commercial services that offer similar capabilities.

Until now, osquery has been inadequate for performing the kind of filesystem forensics that is often part of an incident response effort. It collects some information about files on its host platforms – timestamps, permissions, owner and more – but anyone with experience in forensics will tell you that there’s a lot more data available on a file system if you’re willing to dig. Think additional timestamps, unallocated metadata, or stale directory entries.

The alternatives are often closed source and expensive. They become one more item in your budget, deployment roadmap, and maintenance schedule. And none of them integrate with osquery. You have to go to the extra effort of mapping the forensic report back to your fleet.

That changes today. In partnership with Crypsis, we have integrated NTFS forensic information into the osquery interface as an extension. Consider this the first step toward a better, cost-effective, more efficient alternative that’s easier to deploy.

## What it can do

The NTFS forensics extension provides specific additional file metadata from NTFS images, including filename timestamp entries, the security descriptor for files, whether a file has Alternate Data Streams (ADS), as well as other information. It also provides index entries for directory indices, including entries that are deallocated. You can find the malware that just cleaned up after itself, or altered its file timestamps but forgot about the filename timestamps, or installed a rootkit in the ADS of calc.exe, all without ever leaving osquery.

## How to use it

Load the extension at the startup of osquery with the command line argument, e.g., <code>osqueryi.exe --extension path\to\ntfs_forensics.ext.exe</code>. On loading, three new tables will be added to the database: ntfs_part_data, ntfs_file_data, and ntfs_indx_data.

### ntfs_part_data

This table provides information about partitions on a disk image. If queried without a specified disk image, it will attempt to interrogate the physical drives of the host system by walking up from \\.\PhysicalDrive0 until it finds a drive number it fails to open.

Enumerating partition entries in an NTFS image

### ntfs_file_data

This table provides information about file entries in an NTFS file system. The device and partition columns must be specified explicitly in the WHERE clause to query the table. If the path or inode column is specified, then a single row about the specified file is returned. If the directory column is specified, then a row is returned for every file in that directory. If nothing is specified, a walk of the entire partition is performed. Because the walk of the entire partition is costly, results are cached to be reused without reperforming the entire walk. If you need fresh results of a partition walk, use the hidden column from_cache in the WHERE clause to force the collection of live data (e.g., select * from ntfs_file_data where device=”\\.\PhysicalDrive0” and partition=2 and from_cache=0;).

Displaying collected data on a single entry in an NTFS file system

### ntfs_indx_data

This table provides the content of index entries for a specified directory, including index entries discovered in slack space. Like ntfs_file_data, the device and partition columns must be specified in the WHERE clause of a query, as well as either parent_path or parent_inode. Entries discovered in slack space will have a non-zero value in the slack column.

Displaying inode entries recovered from a directory index’s slack space

### Getting Started

This extension offers a fast and convenient way to perform filesystem forensics on Windows endpoints as a part of an incident response. You can find it and our other osquery extensions in our repository. We’re committed to maintaining and extending our collection of extensions. Take a look, and see what else we have available.

## Hire us to tailor osquery to your needs

Do you have an idea for an osquery extension? File an issue on our GitHub repo for it. Contact us for osquery development.

# State Machine Testing with Echidna

Property-based testing is a powerful technique for verifying arbitrary properties of a program via execution on a large set of inputs, typically generated stochastically. Echidna is a library and executable I’ve been working on for applying property-based testing to EVM code (particularly code written in Solidity).

Echidna is a library for generating random sequences of calls against a given smart contract’s ABI and making sure that their evaluation preserves some user-defined invariants (e.g.: the balance in this wallet must never go down). If you’re from a more conventional security background, you can think of it as a fuzzer, with the caveat that it looks for user-specified logic bugs rather than crashes (as programs written for the EVM don’t “crash” in any conventional way).

The property-based testing functionality in Echidna is implemented with Hedgehog, a property-based testing library by Jacob Stanley. Think of Hedgehog as a nicer version of QuickCheck. It’s an extremely powerful library, providing automatic minimal testcase generation (“shrinking”), well-designed abstractions for things like ranges, and most importantly for this blog post, abstract state machine testing tools.

After reading a particularly excellent blog post by Tim Humphries (“State machine testing with Hedgehog,” which I’ll refer to as the “Hedgehog post” from now on) about testing a simple state machine with this functionality, I was curious if the same techniques could be extended to the EVM. Many contracts I see in the wild are just implementations of some textbook state machine, and the ability to write tests against that invariant-rich representation would be invaluable.

The rest of this blog post assumes at least a degree of familiarity with Hedgehog’s state machine testing functionality. If you’re unfamiliar with the software, I’d recommend reading Humphries’s blog post first. It’s also worth noting that the below code demonstrates advanced usage of Echidna’s API, and you can also use it to test code without writing a line of Haskell.

First, we’ll describe our state machine’s states, then its transitions, and once we’ve done that we’ll use it to actually find some bugs in contracts implementing it. If you’d like to follow along on your own, all the Haskell code is in examples/state-machine and all the Solidity code is in solidity/turnstile.

## Step 0: Build the model

Fig. 1: A turnstile state machine

The state machine in the Hedgehog post is a turnstile with two states (locked and unlocked) and two actions (inserting a coin and pushing the turnstile), with “locked” as its initial state. We can copy this code verbatim.

data ModelState (v :: * -> *) = TLocked
| TUnlocked
deriving (Eq, Ord, Show)

initialState :: ModelState v
initialState = TLocked


However, in the Hedgehog post the effectful implementation of this abstract model was a mutable variable that required I/O to access. We can instead use a simple Solidity program.

contract Turnstile {
bool private locked = true; // initial state is locked

function coin() {
locked = false;
}

function push() returns (bool) {
if (locked) {
return(false);
} else {
locked = true;
return(true);
}
}
}


At this point, we have an abstract model that just describes the states, not the transitions, and some Solidity code we claim implements a state machine. In order to test it, we still have to describe this machine’s transitions and invariants.

## Step 1: Write some commands

To write these tests, we need to make explicit how we can execute the implementation of our model. The examples given in the Hedgehog post work in any MonadIO, as they deal with IORefs. However, since EVM execution is deterministic, we can work instead in any MonadState VM.

The simplest command is inserting a coin. This should always result in the turnstile being unlocked.

s_coin :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState
s_coin = Command (\_ -> Just $pure Coin) -- Regardless of initial state, we can always insert a coin (\Coin -> cleanUp >> execCall ("coin", [])) -- Inserting a coin is just calling coin() in the contract -- We need cleanUp to chain multiple calls together [ Update$ \_ Coin _ -> TUnlocked
-- Inserting a coin sets the state to unlocked
, Ensure $\_ s Coin _ -> s === TUnlocked -- After inserting a coin, the state should be unlocked ]  Since the push function in our implementation returns a boolean value we care about (whether or not pushing “worked”), we need a way to parse EVM output. execCall has type MonadState VM => SolCall -> m VMResult, so we need a way to check whether a given VMResult is true, false, or something else entirely. This turns out to be pretty trivial. match :: VMResult -> Bool -> Bool match (VMSuccess (B s)) b = s == encodeAbiValue (AbiBool b) match _ _ = False  Now that we can check the results of pushing, we have everything we need to write the rest of the model. As before, we’ll write two Commands; modeling pushing while the turnstile is locked and unlocked, respectively. Pushing while locked should succeed, and result in the turnstile becoming locked. Pushing while unlocked should fail, and leave the turnstile locked. s_push_locked :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState s_push_locked = Command (\s -> if s == TLocked then Just$ pure Push else Nothing)
-- We can only run this command when the turnstile is locked
(\Push -> cleanUp >> execCall ("push", []))
-- Pushing is just calling push()
[ Require $\s Push -> s == TLocked -- Before we push, the turnstile should be locked , Update$ \_ Push _ -> TLocked
-- After we push, the turnstile should be locked
, Ensure $\before after Push b -> do before === TLocked -- As before assert (match b False) -- Pushing should fail after === TLocked -- As before ]  s_push_unlocked :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState s_push_unlocked = Command (\s -> if s == TUnlocked then Just$ pure Push else Nothing)
-- We can only run this command when the turnstile is unlocked
(\Push -> cleanUp >> execCall ("push", []))
-- Pushing is just calling push()
[ Require $\s Push -> s == TUnlocked -- Before we push, the turnstile should be unlocked , Update$ \_ Push _ -> TLocked
-- After we push, the turnstile should be locked
actions <- forAll $Gen.sequential (Range.linear 1 100) initialState [s_coin, s_push_locked, s_push_unlocked -- Generate between 1 and 100 actions, starting with a locked (model) turnstile evalStateT (executeSequential initialState actions) v -- Execute them sequentially on the given VM.  You can think of the above code as a function that takes an EVM state and returns a hedgehog-checkable assertion that it implements our (haskell) state machine definition. ## Step 3: Test With this property written, we’re ready to test some Solidity! Let’s spin up ghci to check this property with Echidna. λ> (v,_,_) <- loadSolidity "solidity/turnstile/turnstile.sol" -- set up a VM with our contract loaded λ> check$ prop_turnstile v -- check that the property we just defined holds
✓ passed 10000 tests.
True
λ>


It works! The Solidity we wrote implements our model of the turnstile state machine. Echidna evaluated 10,000 random call sequences without finding anything wrong.

Now, let’s find some failures. Suppose we initialize the contract with the turnstile unlocked, as below. This should be a pretty easy failure to detect, since it’s now possible to push successfully without putting a coin in first.

We can just slightly modify our initial contract as below:

contract Turnstile {
bool private locked = false; // initial state is unlocked

function coin() {
locked = false;
}

function push() returns (bool) {
if (locked) {
return(false);
} else {
locked = true;
return(true);
}
}
}


And now we can use the exact same ghci commands as before:

λ> (v,_,_) <- loadSolidity "solidity/turnstile/turnstile_badinit.sol"
λ> check $prop_turnstile v ✗ failed after 1 test. ┏━━ examples/state-machine/StateMachine.hs ━━━ 49 ┃ s_push_locked :: (Monad n, MonadTest m, MonadState VM m) => Command n m ModelState 50 ┃ s_push_locked = Command (\s -> if s == TLocked then Just$ pure Push else Nothing)
51 ┃   (\Push -> cleanUp >> execCall ("push", []))
52 ┃   [ Require $\s Push -> s == TLocked 53 ┃ , Update$ \_ Push _ -> TLocked
54 ┃   , Ensure $\before after Push b -> do before === TLocked 55 ┃ assert (match b False) ┃ ^^^^^^^^^^^^^^^^^^^^^^ 56 ┃ after === TLocked 57 ┃ ] ┏━━ examples/state-machine/StateMachine.hs ━━━ 69 ┃ prop_turnstile :: VM -> property 70 ┃ prop_turnstile v = property$ do
71 ┃   actions <- forAll $Gen.sequential (Range.linear 1 100) initialState 72 ┃ [s_coin, s_push_locked, s_push_unlocked] ┃ │ Var 0 = Push 73 ┃ evalStateT (executeSequential initialState actions) v This failure can be reproduced by running: > recheck (Size 0) (Seed 3606927596287211471 (-1511786221238791673)) False λ>  As we’d expect, our property isn’t satisfied. The first time we push it should fail, as the model thinks the turnstile is locked, but it actually succeeds. This is exactly the result we expected above! We can try the same thing with some other buggy contracts as well. Consider the below Turnstile, which doesn’t lock after a successful push. contract Turnstile { bool private locked = true; // initial state is locked function coin() { locked = false; } function push() returns (bool) { if (locked) { return(false); } else { return(true); } } }  Let’s use those same ghci commands one more time λ> (v,_,_) <- loadSolidity "solidity/turnstile/turnstile_nolock.sol" λ> check$ prop_turnstile v
✗ failed after 4 tests and 1 shrink.

┏━━ examples/state-machine/StateMachine.hs ━━━
50 ┃ s_push_locked = Command (\s -> if s == TLocked then Just $pure Push else Nothing) 51 ┃ (\Push -> cleanUp >> execCall ("push", [])) 52 ┃ [ Require$ \s Push -> s == TLocked
53 ┃   , Update $\_ Push _ -> TLocked 54 ┃ , Ensure$ \before after Push b -> do before === TLocked
55 ┃                                         assert (match b False)
┃                                         ^^^^^^^^^^^^^^^^^^^^^^
56 ┃                                         after === TLocked
57 ┃  ]

┏━━ examples/state-machine/StateMachine.hs ━━━
69 ┃ prop_turnstile :: VM -> property
70 ┃ prop_turnstile v = property $do 72 ┃ [s_coin, s_push_locked, s_push_unlocked] ┃ │ Var 0 = Coin ┃ │ Var 1 = Push ┃ │ Var 3 = Push 73 ┃ evalStateT (executeSequential initialState actions) v This failure can be reproduced by running: > recheck (Size 3) (Seed 133816964769084861 (-8105329698605641335)) False λ>  When we insert a coin then push twice, the second should fail. Instead, it succeeds. Note that in all these failures, Echidna finds the minimal sequence of actions that demonstrates the failing behavior. This is because of Hedgehog’s shrinking features, which provide this behavior by default. More broadly, we now have a tool that will accept arbitrary contracts (that implement the push/coin ABI), check whether they implement our specified state machine correctly, and return either a minimal falsifying counterexample if they do not. As a Solidity developer working on a turnstile contract, I can run this on every commit and get a simple explanation of any regression that occurs. ## Concluding Notes Hopefully the above presents a motivating example for testing with Echidna. We wrote a simple description of a state machine, then tested four different contracts against it; each case yielded either a minimal proof the contract did not implement the machine or a statement of assurance that it did. If you’d like to try implementing this kind of testing yourself on a canal lock, use this exercise we wrote for a workshop. # What do you wish osquery could do? Welcome to the third post in our series about osquery. So far, we’ve described how five enterprise security teams use osquery and reviewed the issues they’ve encountered. For our third post, we focus on the future of osquery. We asked users, “What do you wish osquery could do?” The answers we received ranged from small requests to huge advancements that could disrupt the incident-response tool market. Let’s dive into those ‘super features’ first. ## osquery super features Some users’ suggestions could fundamentally expand osquery’s role from an incident detection tool, potentially allowing it to steal significant market share from commercial tools in doing prevention and response (we listed a few of these in our first blog post). This would be a big deal. A free and open source tool that gives security teams access to incident response abilities normally reserved for customers of expensive paid services would be a windfall for the community. It could democratize fleet security and enhance the entire community’s defence against attackers. Here are the features that could take osquery to the next level: ### Writable access to endpoints What it is: Currently, osquery is limited to read-only access on endpoints. Such access allows the program to detect and report changes in the operating systems it monitors. Write-access via an osquery extension would allow it to edit registries in the operating system and change the way endpoints perform. It could use this access to enforce security policies throughout the fleet. Why it would be amazing: Write-access would elevate osquery from a detection tool to the domain of prevention. Rather than simply observing system issues with osquery, write-access would afford you the ability to harden the system right from the SQL interface. Application whitelisting and enforcement, managing licenses, partitioning firewall settings, and more could all be available. How we could build it: If not built correctly, write-access in osquery could cause more harm than good. Write-access goes beyond the scope of osquery core. Some current users are only permitted to deploy osquery throughout their fleet because of its limited read-only permissions. Granting write-access through osquery core would bring heightened security risks as well as potential for system disruption. The right way to implement this would be to make it available to extensions that request the functionality during initialization and minimize the impact this feature has on the core. IRL Proof: In fact, we have a pull request waiting on approval that would support write-access through extensions! The code enables write-permissions for extensions but also blocks write-permissions for tables built into core. We built this feature in support of a client who wanted to block malicious IP addresses, domains and ports for both preventative and reactive use-cases. Once this code is committed, our clients will be able to download our osquery firewall extension to use osquery to partition firewall settings throughout their fleets. ### Event-triggered responses What it is: If osquery reads a log entry that indicates an attack, it could automatically respond with an action such as quarantining the affected endpoint(s). This super feature would add automated prevention and incident response to osquery’s capabilities. Why it would be amazing: This would elevate osquery’s capabilities to those of commercial vulnerability detection/response tools, but it would be transparent and customizable. Defense teams could evaluate, customize, and match osquery’s incident-response capabilities to their companies’ needs, as a stand-alone solution or as a complement to another more generic response suite. How we could build it: Automated event response for osquery could be built flexibly to allow security teams to define their own indicators of incidents and their preferred reactions. Users could select from known updated databases: URL reputation via VirusTotal, file reputation via ReversingLabs, IP reputation of the remote addresses of active connections via OpenDNS, etc. The user could pick the type of matching criteria (e.g., exact, partial, particular patterns, etc.), and prescribe a response such as ramping up logging frequency, adding an associated malicious ID to a firewall block list, or calling an external program to take an action. As an additional option, event triggering that sends logs to an external analysis tool could provide more sophisticated response without damaging endpoint performance. IRL Proof: Not only did multiple interviewees long for this feature; some teams have started to build rudimentary versions of it. As discussed in “How are teams currently using osquery?”, we spoke with one team who built incident alerting with osquery by piping log data into ElasticSearch and auto-generated Jira tickets through ElastAlert upon anomaly detection. This example doesn’t demonstrate full response capability, but it illustrates how useful just-in-time business process reaction to incidents is possible with osquery. If osquery can monitor event-driven logs (FIM, process auditing, etc), trigger an action based on detection of a certain pattern, and administer a protective response, it can provide an effective endpoint protection platform. ### Technical debt overhaul What it is: Many open source projects carry ‘technical debt.’ That is, some of the code engineering is built to be effective for short-term goals but isn’t suitable for long-term program architecture. A distributed developer community each enhancing the technology for slightly different requirement exacerbates this problem. Solving this problem requires costly coordination and effort from multiple community members to rebuild and standardize the system. Why it would be amazing: Decreasing osquery’s technical debt would upgrade the program to a standard that’s adoptable to a significantly wider range of security teams. Users in our osquery pain points research cited performance effects and reliability among organizational leadership’s top concerns for adopting osquery. Ultimately, the teams we interviewed won the argument, but there are likely many teams who didn’t get the green light on using osquery. How we could build it: Tackling technical debt is hard enough within an organization. It’s liable to be even harder in a distributed community. Unless developers have a specific motivation for tackling very difficult high-value inefficiencies, the natural reward for closing an issue biases developers toward smaller efforts. To combat this, leaders in the community could dump and sort all technical debt issues along a matrix of value and time, leave all high-value/low-time issues for individual open source developers, and pool community resources to resolve harder problems as full-fledged development projects. IRL Proof: We know that pooling community resources to tackle technical debt works. We’ve been doing it for over a year. Trail of Bits has been commissioned by multiple companies to build features and fixes too big for the open source community. We’ve leveraged this model to port osquery to Windows, enhance FIM and process auditing, and much more that we’re excited to share with the public over the coming months. Often, multiple clients are interested in building the same things. We’re able to pool resources to make the project less expensive for everyone involved while the entire community benefits. ## Other features users want osquery shows considerable potential to grow beyond endpoint monitoring. However, the enterprise security teams and developers whom we interviewed say that the open source tool has room for improvement. Here are some of the other requests we heard from users: • Guardrails & rules for queries: Right now, a malformed query or practice can hamper the user’s workflow. Interviewees wanted guidance on targeting the correct data, querying at correct intervals, gathering from recommended tables, and customized recommendations for different environments. • Enhance Deployment Options: Users sought better tools for deploying throughout fleets and keeping these implementations updated. Beyond recommended QueryPacks, administrators wanted to be able to define and select platform-specific configurations of osquery across multi-platform endpoints. Automatically detecting and deploying configurations for unique systems and software was another desired feature. • Integrated Testing, Debugging, and Diagnostics: In addition to the current debugging tools, users wanted more resources for testing and diagnosing issues. New tools should help improve reliability and predictability, avoid performance issues, and make osquery easier to use. • Enhanced Event-Driven Data Collection: osquery has support for event-based data collection through FIM, Process Auditing, and other tables. However, these data sources suffer from logging implementation issues and are not supported on all platforms. Better event-handling configurations, published best practices, and guardrails for gathering data would be a great help. • Enhanced Performance Features: Users want osquery to do more with fewer resources. This would either lead to overall performance enhancements, or allow osquery to operate on endpoints with low resource profiles or mission-critical performance requirements. • Better Configuration Management: Enhancements such as custom tables and osqueryd scheduled queries for differing endpoint environments would make osquery easier to deploy and maintain on a growing fleet. • Support for Offline Endpoint Logging: Users reported a desire for forensic data availability to support remote endpoints. This would require offline endpoints to store data locally –- including storage of failed queries –- and push to the server upon reconnection • Support for Common Platforms: Facebook built osquery for its fleet of macOS- and Linux-based endpoints. PC sysadmins were out of luck until our Windows port last year. Support for other operating systems has been growing steadily thanks to the development community’s efforts. Nevertheless, there are still limitations. Think of this as one umbrella feature request: support for all features on all operating systems. ## The list keeps growing Unfortunately for current and prospective osquery users, Facebook can’t satisfy all of these requests. They’ve shared a tremendous gift by open sourcing osquery. Now it’s up to the community to move the platform forward. Good news: none of these feature requests are unfeasible. The custom engineering is just uneconomical for individual organizations to invest in. In the final post in this series, we’ll propose a strategy for osquery users to share the cost of development. Companies that would benefit could pool resources and collectively target specific features. This would accelerate the rate at which companies could deprecate other full-suite tools that are more expensive, less flexible and less transparent. If any of these items resonate with your team’s needs, or if you use osquery currently and have another request to add to the list, please let us know. # How to prepare for a security audit You’ve just approved an audit of your codebase. Do you: • Send a copy of the repository and wait for the auditors’ reports, or • Take the extra effort to set the auditors up for success? By the end of the audit, the difference between these answers will lead to profoundly disparate results. In the former case, you’ll waste money, lose time, and miss security issues. In the latter case, you’ll reduce your risk, protect your time, and get more valuable security guidance. It’s an easy choice, right? Glad you agree. Now, here’s how to make that audit more effective, valuable, and satisfying for everybody involved. ## Set a goal for the audit This is the most important step of an audit, and paradoxically the one most often overlooked. You should have an idea of what kind of question you want answered, such as: • What’s the overall level of security for this product? • Are all client data transactions handled securely? • Can a user leak information about another user? Knowing your biggest area of concern will help the auditing team tailor their approach to meet your needs. ## Resolve the easy issues Handing the code off to the auditing team is a lot like releasing the product: the cleaner the code, the better everything will go. To that end: • Enable and address compiler warnings. Go after the easy stuff first: turn on every single compiler warning you can find, understand each warning, then fix your code until they’re all gone. Upgrade your compiler to the latest version, then fix all the new warnings and errors. Even innocuous seeming warnings can indicate problems lying in wait. • Increase unit and feature test coverage. Ideally this has been part of the development process, but everyone slips up, tests don’t get updated, or new features don’t quite match the old integrations tests. Now is the time to update the tests and run them all. • Remove dead code, stale branches, unused libraries, and other extraneous weight. You may know which branch is active and which is dead but the auditors won’t and will waste time investigating it for potential issues. The same goes for that new feature that hasn’t seen progress in months, or that third-party library that doesn’t get used anymore. Some issues will persist — a patch that isn’t quite ready, or a refactor that’s not integrated yet. Document any incomplete changes as thoroughly as possible, so that your auditors don’t waste a week digging into code that will be gone in two months’ time. ## Document, Document, Document Think of an audit team as a newly hired, fully remote developer; skilled at what they do, but unfamiliar with your product and code base. The more documentation, the faster they’ll get up to speed and the sooner they’ll be able to start their analysis. • Describe what your product does, who uses it, and how. The most important documentation is high level: what does your product do? What do users want from it? How does it achieve that goal? Use clear language to describe how systems interact and the rationale for design decisions made during development. • Add comments in-line with the code. Functions should have comments containing high-level descriptions of their intended behavior. Complicated sections of code should have comments describing what is happening and why this particular approach was chosen. • Label and describe your tests. More complicated tests should describe the exact behavior they’re testing. The expected results of tests, both positive and negative, should be documented. Include past reviews and bugs. Previous audit reports can provide guidance to the new audit team. Similarly, documentation regarding past security-relevant bugs can give an audit team clues about where to look most carefully. ## Deliver the code batteries included Just like a new fully remote developer, the audit team will need a copy of the code and clear guidance on how to build and deploy your application. • Prepare the build environment. Document the steps to create a build environment from scratch on a computer that is fully disconnected from your internal network. Where relevant, be specific about software versions. Walk through this process on your own to ensure that it is complete. If you have external dependencies that are not publicly available, include them with your code. Fully provisioned virtual machine images are a great way to deliver a working build environment. • Document the build process. Include both the debug and release build processes, and also include steps on how to build and run the tests. If the test environment is distinct from the build environment, include steps on how to create the test environment. A well-documented build process enables an auditor to run static analysis tools far more efficiently and effectively. • Document the deploy process. This includes how to build the deployment environment. It is very important to list all the specific versions of external tools and libraries for this process, as the deployment environment is a considerable factor in evaluating the security of your product. A well-documented deployment process enables an auditor to run dynamic analysis tools in a real world environment. ## The payoff At this point you’ve handed off your code, documentation, and build environment to the auditors. All that prep work will pay off. Rather than puzzling over how to build your code or what it does, the audit team can immediately start work integrating advanced analysis tools, writing custom fuzzers, or bringing custom internal tools to bear. Knowing your specific goals will help them focus where you want them to. An audit can produce a lot of insight into the security of your product. Having a clear goal for the audit, a clean codebase, and complete documentation will not only help the audit, it’ll make you more confident about the quality of the results. Interested in getting an audit? Contact us to find out what we can do for you. ## Checklist Resolve the easy issues • Enable and address every last compiler warning. • Increase unit and feature test coverage. • Remove dead code, stale branches, unused libraries, and other extraneous weight. Document • Describe what your product does, who uses it, why, and how it delivers. • Add comments about intended behavior in-line with the code. • Label and describe your tests and results, both positive and negative. • Include past reviews and bugs. Deliver the code batteries included • Document the steps to create a build environment from scratch on a computer that is fully disconnected from your internal network. Include external dependencies. • Document the build process, including debugging and the test environment. • Document the deploy process and environment, including all the specific versions of external tools and libraries for this process. # Vulnerability Modeling with Binary Ninja This is Part 3 in a series of posts about the Binary Ninja Intermediate Language (BNIL) family. You can read Part 1 here and Part 2 here. In my previous post, I demonstrated how to leverage the Low Level IL (LLIL) to write an architecture-agnostic plugin that could devirtualize C++ virtual functions. A lot of new and exciting features have been added to Binary Ninja since then; in particular, Medium Level IL (MLIL) and Single Static Assignment (SSA) form[1]. In this post, I’m going to discuss both of these and demonstrate one fun use of them: automated vulnerability discovery. Plenty of static analyzers can perform vulnerability discovery on source code, but what if you only have the binary? How can we model a vulnerability and then check a binary to see if it is vulnerable? The short answer: use Binary Ninja’s MLIL and SSA form. Together, they make it easy to build and solve a system of equations with a theorem prover that takes binaries and turns them, alchemy-like, into vulnerabilities! Let’s walk through the process with everyone’s favorite hyped vulnerability of yesteryear, Heartbleed. ## Hacking like it’s 2014: Let’s find Heartbleed! For those who might not remember or be familiar with the Heartbleed vulnerability, let’s run through a quick refresher. Heartbleed was a remote information-disclosure vulnerability in OpenSSL 1.0.1 – 1.0.1f that allowed an attacker to send a crafted TLS heartbeat message to any service using TLS. The message would trick the service into responding with up to 64KB of uninitialized data, which could contain sensitive information such as private cryptographic keys or personal data. This was possible because OpenSSL used a field in the attacker’s message as a size parameter for malloc and memcpy calls without first validating that the given size was less than or equal to the size of the data to read. Here’s a snippet of the vulnerable code in OpenSSL 1.0.1f, from tls1_process_heartbeat:  /* Read type and payload length first */ hbtype = *p++; n2s(p, payload); pl = p; /* Skip some stuff... */ if (hbtype == TLS1_HB_REQUEST) { unsigned char *buffer, *bp; int r; /* Allocate memory for the response, size is 1 bytes * message type, plus 2 bytes payload length, plus * payload, plus padding */ buffer = OPENSSL_malloc(1 + 2 + payload + padding); bp = buffer; /* Enter response type, length and copy payload */ *bp++ = TLS1_HB_RESPONSE; s2n(payload, bp); memcpy(bp, pl, payload); bp += payload; /* Random padding */ RAND_pseudo_bytes(bp, padding); r = ssl3_write_bytes(s, TLS1_RT_HEARTBEAT, buffer, 3 + payload + padding);  Looking at the code, we can see that the size parameter (payload) comes directly from the user-controlled TLS heartbeat message, is converted from network-byte order to host-byte order (n2s), and then passed to OPENSSL_malloc and memcpy with no validation. In this scenario, when a value for payload is greater than the data at pl, memcpy will overflow from the buffer starting at pl and begin reading the data that follows immediately after it, revealing data that it shouldn’t. The fix in 1.0.1g was pretty simple:  hbtype = *p++; n2s(p, payload); if (1 + 2 + payload + 16 > s->s3->rrec.length) return 0; /* silently discard per RFC 6520 sec. 4 */ pl = p;  This new check ensures that the memcpy won’t overflow into different data. Back in 2014, Andrew blogged about writing a clang analyzer plugin that could find vulnerabilities like Heartbleed. A clang analyzer plugin runs on source code, though; how could we find the same vulnerability in a binary if we didn’t have the source for it? One way: build a model of a vulnerability by representing MLIL variables as a set of constraints and solving them with a theorem prover! ## Model binary code as equations with z3 A theorem prover lets us construct a system of equations and: 1. Verify whether those equations contradict each other. 2. Find values that make the equations work. For example, if we have the following equations: x + y = 8 2x + 3 = 7  A theorem prover could tell us that a) a solution does exist for these equations, meaning that they don’t contradict each other, and b) a solution to these equations is x = 2 and y = 6. For the purposes of this exercise, I’ll be using the Z3 Theorem Prover from Microsoft Research. Using the z3 Python library, the above example would look like the following: >>> from z3 import * >>> x = Int('x') >>> y = Int('y') >>> s = Solver() >>> s.add(x + y == 8) >>> s.add(2*x + 3 == 7) >>> s.check() sat >>> s.model() [x = 2, y = 6]  Z3 tells us that the equations can be satisfied and provides values to solve them. We can apply this technique to modeling a vulnerability. It turns out that assembly instructions can be modeled as algebraic statements. Take the following snippet of assembly:  lea eax, [ebx+8] cmp eax, 0x20 jle allocate int3 allocate: push eax call malloc ret When we lift this assembly to Binary Ninja’s LLIL, we get the following graph: Figure 1. LLIL makes it easy to identify the signed comparison conditional. In this code, eax takes the value of ebx and then adds 8 to it. If this value is above 0x20, an interrupt is raised. However, if the value is less than or equal to 0x20, the value is passed to malloc. We can use LLIL’s output to model this as a set of equations that should be unsatisfiable if an integer overflow is not possible (e.g. there should never be a value of ebx such that ebx is larger than 0x20 but eax is less than or equal to 0x20), which would look something like this: eax = ebx + 8 ebx > 0x20 eax <= 0x20  What happens if we plug these equations into Z3? Not exactly what we’d hope for. >>> eax = Int('eax') >>> ebx = Int('ebx') >>> s = Solver() >>> s.add(eax == ebx + 8) >>> s.add(ebx > 0x20) >>> s.add(eax <= 0x20) >>> s.check() unsat  There should be an integer overflow, but our equations were unsat. This is because the Int type (or “sort” in z3 parlance) represents a number in the set of all integers, which has a range of -∞ to +∞, and thus an overflow is not possible. Instead, we must use the BitVec sort, to represent each variable as a vector of 32 bits: >>> eax = BitVec('eax', 32) >>> ebx = BitVec('ebx', 32) >>> s = Solver() >>> s.add(eax == ebx + 8) >>> s.add(ebx > 0x20) >>> s.add(eax <= 0x20) >>> s.check() sat  There’s the result we expected! With this result, Z3 tells us that it is possible for eax to overflow and call malloc with a value that is unexpected. With a few more lines, we can even see a possible value to satisfy these equations: >>> s.model() [ebx = 2147483640, eax = 2147483648] >>> hex(2147483640) '0x7ffffff8'  This works really well for registers, which can be trivially represented as discrete 32-bit variables. To represent memory accesses, we also need Z3’s Array sort, which can model regions of memory. However, stack variables reside in memory and are more difficult to model with a constraint solver. Instead, what if we could treat stack variables the same as registers in our model? We can easily do that with Binary Ninja’s Medium Level IL. ## Medium Level IL Just as LLIL abstracts native disassembly, Medium Level IL (MLIL) adds another layer of abstraction on top of LLIL. Whereas LLIL abstracted away flags and NOP instructions, MLIL abstracts away the stack, presenting both stack accesses and register accesses as variables. Additionally, during the process of mapping LLIL to MLIL, memory stores that aren’t referenced later are identified and eliminated. These processes can be observed in the example below. Notice how there are no stack accesses (i.e. push or pop instructions) and var_8 does not appear in the MLIL at all. Figure 2a. An example function in x86. Figure 2b. LLIL of the example function. Figure 2c. MLIL of the example function. Another feature you might notice in the MLIL is that variables are typed. Binary Ninja initially infers these types heuristically, but the user can override these with manual assignment later. Types are propagated through the function and also help inform the analysis when determining function signatures. ### MLIL structure Structurally, MLIL and LLIL are very similar; both are expression trees and share many of the same expression types for operations (for more on the IL’s tree structure, see my first blog post). However, there are several stark differences. Obviously, MLIL does not have analogous operations for LLIL_PUSH and LLIL_POP, since the stack has been abstracted. The LLIL_REG, LLIL_SET_REG, and other register-based operations are instead MLIL_VAR, MLIL_SET_VAR, and similar. On top of this, thanks to typing, MLIL also has a notion of structures; MLIL_VAR_FIELD and MLIL_STORE_STRUCT expressions describe these operations. Figure 3. Types in MLIL can generate some very clean code. Some operations are common to both LLIL and MLIL, though their operands differ. The LLIL_CALL operation has a single operand: dest, the target of the call. In contrast, the MLIL_CALL operation also specifies the output operand that identifies what variables receive a return value and the params operand, which holds a list of MLIL expressions that describe the function call’s parameters. A user-specific calling convention, or one determined by automated analysis based on usage of variables interprocedurally, determines these parameters and return values. This allows Binary Ninja to identify things like when the ebx register is used as a global data pointer in an x86 PIC binary, or when a custom calling convention is used. Putting all of this together, MLIL comes pretty close to decompiled code. This also makes MLIL ideal for translating to Z3, due to its abstraction of both registers and stack variables, using Binary Ninja’s API. ### MLIL and the API Working with MLIL in the Binary Ninja API is similar to working with LLIL, though there are some notable differences. Like LLIL, a function’s MLIL can be accessed directly via the medium_level_il property of the Function class, but there is no corresponding MLIL method to get_low_level_il_at. In order to directly access a specific instruction’s MLIL, a user must first query for the LLIL. The LowLevelILInstruction class now has a medium_level_il property that retrieves its MediumLevelILInstruction form. As a single line of Python, this would look like current_function.get_low_level_il_at(address).medium_level_il. It is important to remember that this can sometimes be None, as an LLIL instruction can be optimized away completely in MLIL. The MediumLevelILInstruction class introduces new convenience properties that aren’t available in the LowLevelILInstruction class. The vars_read and vars_written properties make it simple to query an instruction for a list of variables the instruction uses without parsing the operands. If we revisit an old instruction from my first blog post, lea eax, [edx+ecx*4], the equivalent MLIL instruction would look similar to the LLIL. In fact, it appears to be identical at first glance. >>> current_function.medium_level_il[0] <il: eax = ecx + (edx << 2)> But, if we look closer, we can see the difference: >>> current_function.medium_level_il[0].dest <var int32_t* eax> Unlike LLIL, where dest would have been an ILRegister object representing the register eax, the dest operand here is a typed Variable object, representing a variable named eax as an int32_t pointer. There are several other new properties and methods introduced for MLIL as well. If we wanted to extract the variables read by this expression, this would be as simple as: >>> current_function.medium_level_il[0].vars_read [<var int32_t* ecx>, <var int32_t edx>]  The branch_dependence property returns the conditional branches of basic blocks that dominate the instruction’s basic block when only the true or false branch dominates, but not both. This is useful for determining which decisions an instruction explicitly depends on. Two properties use the dataflow analysis to calculate the value of an MLIL expression: value can efficiently calculate constant values, and possible_values uses the more computationally-expensive path-sensitive dataflow analysis to calculate ranges and disjoint sets of values that an instruction can result in. Figure 5. Path-sensitive dataflow analysis identifies all concrete data values that can reach a certain instruction. With these features at our disposal, we can model registers, stack variables, and memory, but there is one more hangup that we need to solve: variables are often re-assigned values that are dependent on the previous value of the assignment. For example, if we are iterating over instructions and come across something like the following: mov eax, ebx lea eax, [ecx+eax*4]  When creating our equations, how do we model this kind of reassignment? We can’t just model it as: eax = ebx eax = ecx + (eax * 4)  This can cause all sorts of unsatisfiability, because constraints are purely expressing mathematical truths about variables in a system of equations and have no temporal element at all. Since constraint solving has no concept of time, we need to find some way to bridge this gap, transforming the program to effectively remove the idea of time. Moreover, we need to be able to efficiently determine from where the previous value eax originates. The final piece of the puzzle is another feature available via the Binary Ninja API: SSA Form. ## Single Static Assignment (SSA) Form In concert with Medium Level IL’s release, Binary Ninja also introduced Static Single Assignment (SSA) form for all representations in the BNIL family. SSA form is a representation of a program in which every variable is defined once and only once. If the variable is assigned a new value, a new “version” of that variable is defined instead. A simple example of this would be the following:  a = 1 b = 2 a = a + b  a1 = 1 b1 = 2 a2 = a1 + b1  The other concept introduced with SSA form is the phi-function (or Φ). When a variable has a value that is dependent on the path the control flow took through the program, such as an if-statement or loop, a Φ-function represents all of the possible values that that variable could take. A new version of that variable is defined as the result of this function. Below is a more complicated (and specific) example, using a Φ-function:  def f(a): if a > 20: a = a * 2 else: a = a + 5 return a  def f(a0): if a0 > 20: a1 = a0 * 2 else: a2 = a0 + 5 a3 = Φ(a1, a2) return a3  SSA makes it easy to explicitly track all definitions and uses of a variable throughout the lifetime of the program, which is exactly what we need to model variable assignments in Z3. ### SSA form in Binary Ninja The SSA form of the IL can be viewed within Binary Ninja, but it’s not available by default. In order to view it, you must first check the “Enable plugin development debugging mode” box in the preferences. The SSA form, seen below, isn’t really meant to be consumed visually, as it’s more difficult to read than a normal IL graph. Instead, it is primarily intended to be used with the API. Figure 6. An MLIL function (left) and its corresponding SSA form (right). The SSA form of any of the intermediate languages is accessible in the API through the ssa_form property. This property is present in both function (e.g. LowLevelILFunction and MediumLevelILFunction) and instruction (e.g. LowLevelILInstruction and MediumLevelILInstruction) objects. In this form, operations such as MLIL_SET_VAR and MLIL_VAR are replaced with new operations, MLIL_SET_VAR_SSA and MLIL_VAR_SSA. These operations use SSAVariable operands instead of Variable operands. An SSAVariable object is a wrapper of its corresponding Variable, but with the added information of which version of the Variable it represents in SSA form. Going back to our previous re-assignment example, the MLIL SSA form would output the following: eax#1 = ebx#0 eax#2 = ecx#0 + (eax#1 << 2)  This solves the problem of reusing variable identifiers, but there is still the issue of locating usage and definitions of variables. For this, we can use MediumLevelILFunction.get_ssa_var_uses and MediumLevelILFunction.get_ssa_var_definition, respectively (these methods are also members of the LowLevelILFunction class). Now our bag of tools is complete, let’s dive into actually modeling a real world vulnerability in Binary Ninja! ## Example script: Finding Heartbleed Our approach will be very similar to Andrew’s, as well as the one Coverity used in their article on the subject. Byte-swapping operations are a pretty good indicator that the data is coming from the network and is user-controlled, so we will use Z3 to model memcpy operations and determine if the size parameter is a byte-swapped value. Figure 7. A backward static slice of the size parameter of the vulnerable memcpy in tls1_process_heartbeat of OpenSSL 1.0.1f, in MLIL ### Step 1: Finding our “sinks” It would be time-consuming and expensive to perform typical source-to-sink taint tracking, as demonstrated in the aforementioned articles. Let’s do the reverse; identify all code references to the memcpy function and examine them.  memcpy_refs = [ (ref.function, ref.address) for ref in bv.get_code_refs(bv.symbols['_memcpy'].address) ] dangerous_calls = [] for function, addr in memcpy_refs: call_instr = function.get_low_level_il_at(addr).medium_level_il if check_memcpy(call_instr.ssa_form): dangerous_calls.append((addr, call_instr.address))  ### Step 2: Eliminate sinks that we know aren’t vulnerable In check_memcpy, we can quickly eliminate any size parameters that Binary Ninja’s dataflow can calculate on its own[2], using the MediumLevelILInstruction.possible_values property. We’ll model whatever is left. def check_memcpy(memcpy_call): size_param = memcpy_call.params[2] if size_param.operation != MediumLevelILOperation.MLIL_VAR_SSA: return False possible_sizes = size_param.possible_values # Dataflow won't combine multiple possible values from # shifted bytes, so any value we care about will be # undetermined at this point. This might change in the future? if possible_sizes.type != RegisterValueType.UndeterminedValue: return False model = ByteSwapModeler(size_param, bv.address_size) return model.is_byte_swap()  ### Step 3: Track the variables the size depends on Using the size parameter as a starting point, we use what is called a static backwards slice to trace backwards through the code and track all of the variables that the size parameter is dependent on.  var_def = self.function.get_ssa_var_definition(self.var.src) # Visit statements that our variable directly depends on self.to_visit.append(var_def) while self.to_visit: idx = self.to_visit.pop() if idx is not None: self.visit(self.function[idx])  The visit method takes a MediumLevelILInstruction object and dispatches a different method depending on the value of the instruction’s operation field. Recalling that BNIL is a tree-based language, visitor methods will recursively call visit on an instruction’s operands until it reaches the terminating nodes of the tree. At that point it will generate a variable or constant for the Z3 model that will propagate back through the recursive callers, very similar to the vtable-navigator plugin of Part 2. The visitor for MLIL_ADD is fairly simple, recursively generating its operands before returning the sum of the two:  def visit_MLIL_ADD(self, expr): left = self.visit(expr.left) right = self.visit(expr.right) if None not in (left, right): return left + right  ### Step 4: Identify variables that might be part of a byte swap MLIL_VAR_SSA, the operation that describes an SSAVariable, is a terminating node of an MLIL instruction tree. When we encounter a new SSA variable, we identify the instruction responsible for the definition of this variable, and add it to the set of instructions to visit as we slice backwards. Then, we generate a Z3 variable to represent this SSAVariable in our model. Finally, we query Binary Ninja’s range value analysis to see if this variable is constrained to being a single byte (i.e. within the range 00xff, starting at an offset that is a multiple of 8). If it is, we go ahead and constrain this variable to that value range in our model.  def visit_MLIL_VAR_SSA(self, expr): if expr.src not in self.visited: var_def = expr.function.get_ssa_var_definition(expr.src) if var_def is not None: self.to_visit.append(var_def) src = create_BitVec(expr.src, expr.size) value_range = identify_byte(expr, self.function) if value_range is not None: self.solver.add( Or( src == 0, And(src = value_range.step) ) ) self.byte_vars.add(expr.src) return src  The parent operation of an MLIL instruction that we visit will generally be MLIL_SET_VAR_SSA or MLIL_SET_VAR_PHI. In visit_MLIL_SET_VAR_SSA, we can recursively generate a model for the src operand as usual, but the src operand of an MLIL_SET_VAR_PHI operation is a list of SSAVariable objects, representing each of the parameters of the Φ-function. We add each of these variables’ definition sites to our set of instructions to visit, then write an expression for our model that states dest == src0 || dest == src1 || … || dest == srcn:  phi_values = [] for var in expr.src: if var not in self.visited: var_def = self.function.get_ssa_var_definition(var) self.to_visit.append(var_def) src = create_BitVec(var, var.var.type.width) # ... phi_values.append(src) if phi_values: phi_expr = reduce( lambda i, j: Or(i, j), [dest == s for s in phi_values] ) self.solver.add(phi_expr)  In both visit_MLIL_SET_VAR_SSA and visit_MLIL_SET_VAR_PHI, we keep track of variables that are constrained to a single byte, and which byte they are constrained to:  # If this value can never be larger than a byte, # then it must be one of the bytes in our swap. # Add it to a list to check later. if src is not None and not isinstance(src, (int, long)): value_range = identify_byte(expr.src, self.function) if value_range is not None: self.solver.add( Or( src == 0, And(src = value_range.step) ) ) self.byte_vars.add(*expr.src.vars_read) if self.byte_values.get( (value_range.step, value_range.end) ) is None: self.byte_values[ (value_range.step, value_range.end) ] = simplify(Extract( int(math.floor(math.log(value_range.end, 2))), int(math.floor(math.log(value_range.step, 2))), src ) )  And finally, once we’ve visited a variable’s definition instruction, we mark it as visited so that it won’t be added to to_visit again. ### Step 5: Identify constraints on the size parameter Once we’ve sliced the size parameter and located potential bytes used in our byte swap, we need to make sure that there aren’t any constraints that would restrict the value of the size on the path of execution leading to the memcpy. The branch_dependence property of the memcpy’s MediumLevelILInstruction object identifies mandatory control flow decisions required to arrive at the instruction, as well as which branch (true/false) must be taken. We examine the variables checked by each branch decision, as well as the dependencies of those variables. If there is a decision made based on any of the bytes we determined to be in our swap, we’ll assume this size parameter is constrained and bail on its analysis.  for i, branch in self.var.branch_dependence.iteritems(): for vr in self.function[i].vars_read: if vr in self.byte_vars: raise ModelIsConstrained() vr_def = self.function.get_ssa_var_definition(vr) if vr_def is None: continue for vr_vr in self.function[vr_def].vars_read: if vr_vr in self.byte_vars: raise ModelIsConstrained  ### Step 6: Solve the model If the size isn’t constrained and we’ve found that the size parameter relies on variables that are just bytes, we need to add one final equation to our Z3 Solver. To identify a byte swap, we need to make sure that even though our size parameter is unconstrained, the size is still explicitly constructed only from the bytes that we previously identified. Additionally, we also want to make sure that the reverse of the size parameter is equal to the identified bytes reversed. If we just added an equation to the model for those properties, it wouldn’t work, though. Theorem checkers only care if any value satisfies the equations, not all values, so this presents a problem. We can overcome this problem by negating the final equation. By telling the theorem solver that we want to ensure that no value satisfies the negation and looking for an unsat result, we can find the size parameters that satisfy the original (not negated) equation for all values. So if our model is unsatisfiable after we add this equation, then we have found a size parameter that is a byte swap. This might be a bug!  self.solver.add( Not( And( var == ZeroExt( var.size() - len(ordering)*8, Concat(*ordering) ), reverse_var == ZeroExt( reverse_var.size() - reversed_ordering.size(), reversed_ordering ) ) ) ) if self.solver.check() == unsat: return True  ### Step 7: Find bugs I tested my script on two versions of OpenSSL: first the vulnerable 1.0.1f, and then 1.0.1g, which fixed the vulnerability. I compiled both versions on macOS with the command ./Configure darwin-i386-cc to get a 32-bit x86 version. When the script is run on 1.0.1f, we get the following: Figure 8. find_heartbleed.py successfully identifies both vulnerable Heartbleed functions in 1.0.1f. If we then run the script on the patched version, 1.0.1g: Figure 9. The vulnerable functions are no longer identified in the patched 1.0.1g! As we can see, the patched version that removes the Heartbleed vulnerability is no longer flagged by our model! ## Conclusion I’ve now covered how the Heartbleed flaw led to a major information leak bug in OpenSSL, and how Binary Ninja’s Medium Level IL and SSA form translates seamlessly to a constraint solver like Z3. Putting it all together, I demonstrated how a vulnerability such as Heartbleed can be accurately modeled in a binary. You can find the script in its entirety here. Of course, a static model such as this can only go so far. For more complicated program modeling, such as interprocedural analysis and loops, explore constraint solving with a symbolic execution engine such as our open source tool Manticore. Now that you know how to leverage the IL for vulnerability analysis, hop into the Binary Ninja Slack and share with the community your own tools, and if you’re interested in learning even more about the BNIL, SSA form, and other good stuff. Finally, don’t miss the Binary Ninja workshop at Infiltrate 2018. I’ll be hanging around with the Vector 35 team and helping answer questions! [1] After Part 2, Jordan told me that Rusty remarked, “Josh could really use SSA form.” Since SSA form is now available, I’ve added a refactored and more concise version of the article’s script here! [2] This is currently only true because Binary Ninja’s dataflow does not calculate the union of disparate value ranges, such as using bitwise-or to concatenate two bytes as happens in a byte swap. I believe this is a design tradeoff for speed. If Vector 35 ever implements a full algebraic solver, this could change, and a new heuristic would be necessary. # Use our suite of Ethereum security tools Two years ago, when we began taking on blockchain security engagements, there were no tools engineered for the work. No static analyzers, fuzzers, or reverse engineering tools for Ethereum. So, we invested significant time and expertise to create what we needed, adapt what we already had, and refine the work continuously over dozens of audits. We’ve filled every gap in the process of creating secure blockchain software. Today, we’re happy to share most of these tools in the spirit of helping to secure Ethereum’s foundation. Think of what follows as the roadmap. If you are new to blockchain security, just start at the top. You have all you need. And, if you’re diligent, less reason to worry about succumbing to an attack. ## Development Tools To build a secure Ethereum codebase: get familiar with known mistakes to avoid, run a static analysis on every new checkin of code, fuzz new features, and verify your final product with symbolic execution. ### 1. Not So Smart Contracts This repository contains examples of common Ethereum smart contract vulnerabilities, including real code. Review this list to ensure you’re well acquainted with possible issues. The repository contains a subdirectory for each class of vulnerability, such as integer overflow, reentrancy, and unprotected functions. Each subdirectory contains its own readme and real-world examples of vulnerable contracts. Where appropriate, contracts that exploit the vulnerabilities are also provided. We use these examples as test cases for our Ethereum bug-finding tools, listed below. The issues in this repository can be used to measure the effectiveness of other tools you develop or use. If you are a smart contract developer, carefully examine the vulnerable code in this repository to fully understand each issue before writing your own contracts. ### 2. Slither Slither combines a set of proprietary static analyses on Solidity that detect common mistakes such as bugs in reentrancy, constructors, method access, and more. Run Slither as you develop, on every new checkin of code. We continuously incorporate new, unique bugs types that we discover in our audits. Slither is privately available to all firms that work with us, and may become available for licensing or accessible via an API if there’s enough interest. Sign up to get notified if Slither becomes available. Running Slither is simple: $ slither.py contract.sol

Slither will then output the vulnerabilities it finds in the contract.

### 3. Echidna

Echidna applies next-generation smart fuzzing to EVM bytecode. Write Echidna tests for your code after you complete new features. It provides simple, high coverage unit tests that discover security bugs. Until your app has 80+% coverage with Echidna, don’t consider it complete.

Using Echidna is simple:

1. Add some Echidna tests to your existing code (like in this example),
2. Run ./echidna-test contract.sol, and
3. See if your invariants hold.

If you want to write a fancier analysis (say, abstract state machine testing), we have support for that too.

### 4. Manticore

Manticore uses symbolic execution to simulate complex multi-contract and multi-transaction attacks against EVM bytecode. Once your app is functional, write Manticore tests to discover hidden, unexpected, or dangerous states that it can enter. Manticore enumerates the execution states of your contract and verifies critical functionality.

If your contract doesn’t require initialization parameters, then you can use the command line to easily explore all the possible executions of your smart contract as an attacker or the contract owner:

manticore contract.sol --contract ContractName --txaccount [attacker|owner]

Manticore will generate a list of all the reachable states (including assertion failures and reverts) and the inputs that cause them. It will also automatically flag certain types of issues, like integer overflows and use of uninitialized memory.

Using the Manticore API to review more advanced contracts is simple:

1. Initialize your contract with the proper values
2. Define symbolic transactions to explore potential states
3. Review the list of resulting transactions for undesirable states

## Reversing Tools

Once you’ve developed your smart contract, or you want to look at someone else’s code, you’ll want to use our reversing tools. Load the binary contract into Ethersplay or IDA-EVM. For an instruction set reference, use our EVM Opcodes Database. If you’d like to do more complex analysis, use Rattle.

### 1. EVM Opcode Database

Whether you’re stepping through code in the Remix debugger or reverse engineering a binary contract, you may want to look up details of EVM instructions. This reference contains a complete and concise list of EVM opcodes and their implementation details. We think this is a big time saver when compared to scrolling through the Yellow Paper, reading Go/Rust source, or checking comments in StackOverflow articles.

### 2. Ethersplay

Ethersplay is a graphical EVM disassembler capable of method recovery, dynamic jump computation, source code matching, and binary diffing. Use Ethersplay to investigate and debug compiled contracts or contracts already deployed to the blockchain.

Ethersplay takes EVM bytecode as input in either ascii hex encoded or raw binary format. Examples of each are test.evm and test.bytecode, respectively. Open the test.evm file in Binary Ninja, and it will automatically analyze it, identify functions, and generate a control flow graph.

Ethersplay includes two Binary Ninja plugins to help. “EVM Source Code” will correlate contract source to the EVM bytecode. “EVM Manticore Highlight” integrates Manticore with Ethersplay, graphically highlighting code coverage information from Manticore output.

### 3. IDA-EVM

IDA-EVM is a graphical EVM disassembler for IDA Pro capable of function recovery, dynamic jump computation, applying library signatures, and binary diffing using BinDiff.

IDA-EVM allows you to analyze and reverse engineer smart contracts without source. To use it, follow the installation instructions in the readme, then open a .evm or .bytecode file in IDA.

### 4. Rattle

Rattle is an EVM static analyzer that analyzes the EVM bytecode directly for vulnerabilities. It does this by disassembling and recovering the EVM control flow graph and lifting the operations to a Single Static Assignment (SSA) form called EVM::SSA. EVM::SSA optimizes out all pushes, pops, dups, and swaps, often reducing the instruction count by 75%. Rattle will eventually support storage, memory, and argument recovery as well as static security checks similar to those implemented in Slither.

Rattle is privately available to all firms that work with us, and may become available for licensing or accessible via an API if there’s enough interest. Sign up to be notified if Rattle becomes available.

To use Rattle, supply it runtime bytecode from solc or extracted directly from the blockchain:

\$ ./rattle -i path/to/input.evm

## Work with us!

Please, use the tools, file issues in their respective repos, and participate in their feature and bug bounties. Let us know how they could be better on the Empire Hacking Slack in #ethereum.

Now that we’ve introduced each tool, we plan to write follow-up posts that dig into their details.

# An accessible overview of Meltdown and Spectre, Part 2

This is the second half of our blog post on the Meltdown an Spectre vulnerabilities, describing Spectre Variant 1 (V1) and Spectre Variant 2 (V2). If you have not done so already, please review the first blog post for an accessible review of computer architecture fundamentals. This blog post will start by covering the technical details of Spectre V1 and Spectre V2, and conclude with a discussion of how these bugs lurked undetected for so long, and what we think the future holds.

Like Meltdown, the Spectre vulnerabilities rely on speculative execution and timing side channels to read memory without proper permission. The difference between Meltdown and Spectre is the method of operation and the potential impact — more computers are vulnerable to Spectre. Meltdown works by taking advantage of an asynchronous permissions check, and affects only Intel processors. Spectre works by tricking the branch predictor (Figure 1), and affects almost every processor released in the last 25 years.

Figure 1: How branch misprediction leads to speculative execution. When the branch predictor makes an incorrect guess about the destination of a conditional branch, some instructions are speculatively executed. The execution of these instructions is undone, but their effects on the cache remain. Spectre causes the branch predictor to guess wrong and speculatively execute a carefully chosen set of instructions.

## Spectre (Variant 1)

The first variant of Spectre allows a program to read memory that it should not have the ability to read. Spectre V1 attacks are possible because of the confluence of two optimizations: branch prediction and speculative execution. A Spectre V1 attack tricks the branch predictor (for conditional branches) into skipping security checks and speculatively executing instructions in the wrong context. The effects of these speculatively executed instructions are visible via a cache timing side-channel.

### Technical Details

Let’s walk through a hypothetical example. Even though the steps seem complex, they are quite possible to execute in a web browser via JavaScript — see the Spectre paper’s working proof of concept. The following example assumes a Spectre attack delivery via JavaScript in a web browser. That isn’t a requirement. Spectre can work on all kinds of programs, not just web browsers. Also, some specifics are omitted for brevity and clarity.

1. First we’ll create two large memory allocations. We’ll call one leaker and the other reader. In JavaScript, these can be created with ArrayBuffers, a structure used for storing binary data (like pictures). The leaker allocation will be used for the cache timing side-channel. The reader allocation only exists to train the branch predictor. The arrays we allocated have a fixed length. It is only legal to read items from the array starting at zero (the first element) and ending before the array length (the last element). So for a 256 byte array, the valid items are numbered 0 to 255. This range of the first to last element is called the array bounds.
2. Next we start training the branch predictor. A Spectre attack relies on grooming the branch predictor to guess that certain security checks always pass. In this specific case, we are going to make the branch predictor guess that the values read out of reader are always in bounds. JavaScript checks bounds on array reads; an out of bounds read will fail and cause the program to stop working. Array bounds are (currently!) checked via branch instructions. For this part of the Spectre attack, we will repeatedly read in bounds values from reader. Recall that the processor assumes current behavior is predictive of future behavior. If branches are always taken (or not taken), then the branch predictor will be trained to expect that same behavior.
3. Next we ensure that no part of leaker is cached. This is also possible to do in JavaScript. We avoid the details for brevity.
4. Now that the preparation is out of the way, let’s get to the core Spectre V1 flaw. We will read an out of bounds element from reader. Let’s call this value secret. Because we read secret from outside the array bounds, it can, by definition, be any chosen memory location. The read of secret will temporarily succeed even though there is a bounds check that logically prevents it from ever happening. It succeeds only speculatively, because the branch predictor has been primed to assume that the bounds check will succeed. At this point the countdown also starts for the processor to discover the mispredicted branch and un-execute the mispredicted instructions.
5. Next, we use the value of secret as an index to read an element from reader (i.e., reader[secret]). This action will be speculatively executed and cause that element of reader to be cached. At this point, the branch predictor can correct itself and un-execute all speculatively executed instructions.
6. By measuring the time to read every element of secret, it is possible to determine which element was cached. The index of the cached element will be the value of secret, a value the program was not permitted to read. For example, if reader[42] was cached, the value of secret would be 42.
7. The attack can now be repeated to read more bytes. The bandwidth of this channel is estimated at 10 KB/s.

See Figure 2 for a graphical representation of the reader and secret data memory locations at Step 4.

Figure 2: A hypothetical layout of reader and some secret data in computer memory. Because computer memory is laid out linearly, it is possible to access any part of it in terms of a positive or negative index into reader. Access to reader should be limited by its bounds, 0 to 255. If bounds checks are bypassed, even by a few speculatively executed instructions, it is possible to access memory without proper permission.

### What is the impact?

Spectre (Variant 1) is a really big deal to desktop, laptop, and mobile users. It lets websites break security and isolation guarantees built into web browsers like Chrome, IE, and Firefox. As an example, Spectre could allow a website open in one tab to read passwords you are typing into a different tab, or allow an ad on a checkout page to read the credit card number you are typing.

Spectre is equally devastating for cloud providers and internet companies. A lot of the code powering your favorite websites relies on isolation guarantees provided by programming languages. Spectre renders those guarantees into good intentions. Application hosting providers have to re-evaluate their security architecture, rebuild a lot of core code (with performance loss), or both.

There is no generic fix for Variant 1. Affected software has to be re-compiled to avoid using vulnerable code patterns. However, exploitation of the vulnerability can be made more difficult. This is the path taken by web-browser vendors so far. They have removed high-resolution timers (necessary to determine if something was cached) and are actively working to avoid using vulnerable code patterns.

### What should I do?

Update your browser, operating system, and BIOS to the latest version. All browser vendors have released mitigations for Spectre vulnerabilities.

All major cloud providers have deployed mitigations. You as a customer should have nothing to worry about moving forward.

## Spectre (Variant 2)

The second variant of Spectre allows a program to read memory that it should not have access to, regardless of whether the memory is part of the same program, another program, or the core operating system.

Like Spectre V1, Spectre V2 also relies on abusing branch prediction, speculative execution, and cache timing side-channels. Whereas V1 tricks the conditional branch predictor, V2 tricks the indirect branch predictor. The indirect branch predictor is indifferent to privilege changes, including from user to kernel, from program to program, and even from a virtual machine (think cloud computing instance) to hypervisor (the cloud control plane). For those reasons, Spectre V2 attacks can happen across most if not all privilege levels provided by modern processors.

### Technical Details

Out of the three published attacks, Spectre V2 attacks are the most complex. To explain a Spectre V2 attack, we are going to describe a simplified version outlined in the Google Project Zero blog. Many technical details will be omitted with the hope of providing a more accessible understanding of the attack.

The Spectre V2 attack described here reads privileged hypervisor memory from a guest virtual machine. Google Project Zero demonstrated that this attack works with real software running on real processors.

First, a quick refresher about indirect branches and the indirect branch predictor. Indirect branches tell a processor to start executing instructions at some new location. This location can be stored in memory. If that location is not cached, the processor would have to pause a very long time (relative to instruction execution speed) while waiting to find out where to get instructions. The indirect branch predictor guesses where the processor would go next, so the processor doesn’t have to wait. If the guess turns out to be wrong, the processor un-executes some instructions and goes back to the correct place. If the guess turns out to be correct, the program runs much faster. Processors are very good at guessing correctly, so this usually results in a big speedup.

Time for the first Spectre V2 concept: the indirect branch predictor works by keeping a history of recently executed branches in the Branch History Buffer (BHB), sometimes called the Branch History Table, because previous branch decisions are typically indicative of future branch decisions (Figure 3). As with Meltdown and Spectre V1, by carefully measuring how long some branch operations take, it is possible to “read” the BHB and determine the location of previously executed instructions — even if those instructions were executed at a different privilege level or in a different program.

Figure 3: The branch history buffer is used to predict the target of indirect branches, so that the processor can execute them faster. When an indirect branch executes, an entry is written into the table. The table has limited space, and is indexed based on branch address. Because of this limitation, it is possible to “poison” the branch predictor and make it guess an attacker-chosen address for any indirect branch. In this hypothetical prediction example, any branch at an address ending in 045C would have the same predicted destination.

Now for the second Spectre V2 concept: The branch predictor is shared among all programs running on a processor. Branches executed in one program change how the indirect branch predictor guesses in another program. By executing a carefully chosen series of branches, you can “write” the BHB, and force the branch predictor to guess a location you chose for a future indirect branch — even if that indirect branch will be executed at a different privilege level. This is analogous to Spectre V1, but using indirect branches instead of conditional branches.

The steps described for Google Project Zero’s Spectre V2 attack are less detailed because there isn’t a good way to condense the required information. Each step described below is extremely involved and difficult to accomplish properly. The described scenario is one where a cloud computing instance reads memory contents of the cloud control plane (Linux-KVM, specifically). That memory contains private information about what every other cloud computing instance on that physical machine is doing, including private user data, software source code, encryption keys, etc.

#### Attack Preparation

1. First, allocate a large block of memory that will be used for the cache-timing side channel. Per our previous convention, we’ll call this block reader.
2. To leak out data via speculative execution, we will need some way to force the hypervisor to read a byte of privileged memory first, and to use that byte to access reader. This code pattern may be hard to find, but luckily we can cheat. The KVM hypervisor is a part of the Linux kernel, and the Linux kernel includes a feature that runs custom programs to quickly deal with network traffic. This feature is called the eBPF interpreter. We can use the eBPF interpreter to create the code pattern we need in the hypervisor. To do that, we’ll need to:
1. Find the eBPF interpreter in the hypervisor, and
2. Supply some code to the eBPF interpreter, which we’ll call ebpf_code.
3. Because Spectre V2 relies on tricking the indirect branch predictor, we need to know the location of an indirect branch in the hypervisor. We’ll call this indirect_branch.
4. The hypervisor has a different view of computer memory than our program in the cloud instance. Because we’ll be leaking memory from the hypervisor, we’ll need to find out where in hypervisor-land reader, ebpf_code, indirect_branch and the eBPF interpreter are located.
5. To find these items, we will need to know the location of some code in the hypervisor. We can’t simply read hypervisor code. We lack the proper privileges. Instead, we leak out code locations via the BHB. How? We’ll ask the hypervisor to do some normal operations on behalf of the cloud instance. In the course of these operations, the hypervisor executes indirect branches, the locations of which will be stored in the BHB. We then “read” the BHB to identify these code locations.
6. Once we know some code locations in the hypervisor, with some math and multiple repeated attempts we can find where in hypervisor-land reader, ebpf_code, indirect_branch and the eBPF interpreter reside (Figure 4).

Figure 4: A simplified diagram of how a cloud computing system would look after completing Step 6 of this attack description. The attacker has gathered all information they need to conduct the attack, but the key part is still to come.

#### Attack Execution

1. We ensure that no part of reader is cached.
2. Time for the magic! We execute a series of indirect branches that write new entries into the BHB. These branches and their targets are set up to trick the indirect branch predictor into guessing that indirect_branch points to the eBPF interpreter (Figure 5).

Figure 5: This is the core of the Spectre V2 attack: specially crafted indirect branches can trick the Branch History Buffer into predicting an attacker-chosen speculative destination for an indirect branch. In this case, an indirect branch in the hypervisor is set to speculatively point to the eBPF interpreter code instead of its original location.

#### Leaking Secrets

1. We set up processor state so that if it were to start executing the eBPF interpreter, then the interpreter would run ebpf_code, which would in turn read a byte of hypervisor memory and use that byte to access a part of reader.
2. We ask the hypervisor to perform an innocuous action that is guaranteed to trigger indirect_branch. A complex chain of events now happens:
1. The processor will guess indirect_branch points to the eBPF interpreter.
2. The processor will start speculatively executing the eBPF interpreter. Also, the countdown starts until the processor knows it guessed the wrong branch target.
3. The eBPF interpreter will execute instructions provided by ebpf_code, which will:
1. Read a byte of hypervisor memory.
2. Use that byte to access a piece of reader.
3. Cause that piece of reader to be cached.
4. The processor figures out it executed the wrong branch, and un-executes every instruction speculatively executed thus far. However, a piece of reader is now in the cache (Figure 6).
3. As with Meltdown and Spectre V1, we time access to every piece of reader and identify which piece is read much faster than the others. The index of that piece is the value of the byte we read from hypervisor memory.
4. We can repeat this process to read more bytes, although with less setup since we can skip the Attack Preparation steps.

Figure 6: A visualization of how secrets can be leaked in a Spectre V2 attack. The eBPF interpreter speculatively executes attacker-specified eBPF code which will read a secret value and use it to access reader. The effects of speculative execution can be observed via a cache-timing side channel.

Again, the steps described above are greatly simplified to show the general concepts and some of the exciting trickery involved in a working Spectre V2 attack. Each step is very complex on its own (i.e., one does not simply “read” the BHB).

### What is the impact?

This is very bad news for everyone — no matter if you are a cloud provider, an internet company, or just a citizen of the web. The impact combines the worst parts of Spectre (Variant 1) with the worst parts of Meltdown. Unlike Variant 1, there is no proof of concept that targets web browsers, but the possibility can’t be ruled out.

The news for cloud providers is worse. There is a proof-of-concept exploit that breaks the strongest isolation mechanisms used to separate tenants on cloud computing systems. Also, not only Intel, but processors from various vendors across multiple CPU architectures are vulnerable to Variant 2.

There are multiple mitigations available, each of which has some performance cost. Processor vendors have updates available that tune how their processors work to lessen the impact. When that is not possible, software can be re-compiled to avoid using indirect branch instructions. We have not been able to find reliable numbers for the performance penalty of these fixes, but they are certainly not zero and must be paid in addition to the penalties for fixing Meltdown.

### What should I do?

Update your operating system and firmware (i.e., BIOS or UEFI) to the latest version. The latest operating systems and BIOS updates will mitigate the most serious instances of Variant 2 either via processor microcode updates, workarounds to control indirect branch predictor behavior, or recompilation to avoid indirect branches.

All major cloud providers have deployed mitigations. You as a customer should have nothing to worry about.

## How Did This Happen?

So how did a fundamental computer design flaw go unnoticed for the past 25 years? The answer has two parts: our fundamental assumptions about computing have changed, and the many hints about these security implications weren’t put together in a working proof-of-concept until now.

When speculative execution made its consumer processor debut in the 90s, the way we used computers was different. Most machines were single user, and the most popular operating systems of the day — Windows 9x and Mac OS Classic — lacked memory protection. You didn’t need Spectre or Meltdown to read (or even write!) another application’s memory — you could just do it. In that environment, the performance gains were real, and the security implications weren’t.

Nowadays, multi-tenant cloud computing, whether using virtual machines or containers, powers a huge part of the web. It is a regular occurrence for web browsers to download and run untrusted code (i.e. JavaScript) that is meant to be sandboxed from other untrusted code (i.e. other tabs you have open). In this environment, leaking information from one isolated memory compartment to another (e.g. between browser tabs) is a huge problem.

There have been numerous hints that speculative execution could lead to leaks of privileged information. The Google Project Zero blog post cites multiple sources (including one by our very own Sophia D’Antoine) that insinuate the problem. Multiple, independent researchers identified and reported the Spectre and Meltdown vulnerabilities, and others were very close. There is, however, a big difference between thinking there may be a problem and writing a proof-of-concept showing the problem is real. The work done by the researchers reporting this issue was fantastic. I hope this blog post shows just how difficult it was.

## Conclusion

We hope you have a better understanding of how Meltdown and Spectre work at a technical level, their impact, and the available mitigations. This blog post was written to be accessible to someone without a computer architecture background, and we sincerely hope we succeeded. To our more technical readers, the Meltdown and Spectre papers, and the Project Zero blog post are better sources for the gory details.

Looking forward, micro-architectural attacks on computing platforms are going to be an exciting area of computer security. Because so many deployed platforms are vulnerable to Meltdown and Spectre, micro-architectural attacks will continue to be relevant and dangerous for many years to come.