Real-time file monitoring on Windows with osquery

March 16, 2020

Page content

TL;DR: Trail of Bits has developed ntfs_journal_events, a new event-based osquery table for Windows that enables real-time file change monitoring. You can use this table today to performantly monitor changes to specific files, directories, and entire patterns on your Windows endpoints. Read the schema documentation!

$Terminal output of an osquery query selecting FileWrite events from the ntfs_journal_events table, showing two records for C:\Users\user\Downloads\foobaz.txt$

File monitoring for fleet security and management purposes

File event monitoring and auditing are vital primitives for endpoint security and management:

Many malicious activities are reliably sentineled or forecast by well-known and easy to identify patterns of filesystem activity: rewriting of system libraries, dropping of payloads into fixed locations, and (attempted) removal of defensive programs all indicate potential compromise
Non-malicious integrity violations can also be detected through file monitoring: employees jailbreaking their company devices or otherwise circumventing security policies
Software deployment, updating, and automated configuration across large fleets: “Does every host have Software X installed and updated to version Y?”
Automated troubleshooting and remediation of non-security problems: incorrect permissions on shared files, bad network configurations, disk (over)utilization

A brief survey of file monitoring on Windows

Methods for file monitoring on Windows typically fall into one of three approaches:

Win32/WinAPI interfaces: FindFirstChangeNotification, ReadDirectoryChangesW
Filesystem filter drivers and minifilters
Journal monitoring

We’ll cover the technical details of each of these approaches, as well as their advantages and disadvantages (both general and pertaining to osquery) below.

Win32 APIs

The Windows API provides a collection of (mostly) filesystem-agnostic functions for polling for events on a registered directory:

FindFirstChangeNotification can be used to place a set of notification filters on a particular directory’s entries (and those of all subdirectories, if requested).
The handle returned by FindFirstChangeNotification can be used with the standard Windows object waiting routines, like WaitForSingleObject and WaitForMultipleObjects.
Once waited for and processed, subsequent events can be queued with FindNextChangeNotification.

These routines come with several gotchas:

FindFirstChangeNotification does not monitor the specified directory itself, only its entries. Consequently, the “correct” way to monitor both a directory and its entries is to invoke the function twice: once for the directory itself, and again for its parent (or drive root). This, in turn, requires additional filtering if the only entry of interest in the parent is the directory itself.

These routines provide the filtering and synchronization for retrieving filesystem events, but do not expose the events themselves or their associated metadata. The actual events must be retrieved through ReadDirectoryChangesW, which takes an open handle to the watched directory and many of the same parameters as the polling functions (since it can be used entirely independently of them). Users must also finagle with the bizarre world of OVERLAPPED in order to use ReadDirectoryChangesW safely in an asynchronous context.

ReadDirectoryChangesW can be difficult to use with the Recycling Bin and other pseudo-directory concepts on Windows. This SO post suggests that the final moved name can be resolved with GetFinalPathNameByHandle. This GitHub issue suggests that the function’s behavior is also inconsistent between Windows versions.

Last but not least, ReadDirectoryChangesW uses a fixed-size buffer for each directory handle internally and will flush all change records before they get handled if it cannot keep up with the number of events. In other words, its internal buffer does not function as a ring, and cannot be trusted to degrade gradually or gracefully in the presence of lots of high I/O loads.

An older solution also exists: SHChangeNotifyRegister can be used to register a window as the recipient of file notifications from the shell (i.e., Explorer) via Windows messages. This approach also comes with numerous downsides: it requires the receiving application to maintain a window (even if it’s just a message-only window), uses some weird “item-list” view of filesystem paths, and is capped by the (limited) throughput of Windows message delivery.

All told, the performance and accuracy issues of these APIs make them poor candidates for osquery.

Filter drivers and minifilters

Like so many other engineering challenges in Windows environments, file monitoring has a nuclear option in the form of a kernel-mode APIs. Windows is kind enough to provide two general categories for this purpose: the legacy file system filter API and the more recent minifilter framework. We’ll cover the latter in this post, since it’s what Microsoft recommends.

Minifilters are kernel-mode drivers that directly interpose the I/O operations performed by Windows filesystems. Because they operate at the common filesystem interface layer, minifilters are (mostly) agnostic towards their underlying storage — they can (in theory) interpose any of the filesystem operations known by the NT kernel regardless of filesystem kind or underlying implementation. Minifilters are also composable, meaning that multiple filters can be registered against and interact with a filesystem without conflict.

Minifilters are implemented via the Filter Manager, which establishes a filter loading order based on a configured unique “altitude” (lower altitudes corresponding to earlier loads, and thus earlier access) and presence in a “load order group”, which corresponds to a unique range of altitudes. Load order groups are themselves loaded in ascending order with their members loaded in random order, meaning that having a lower altitude than another minifilter in the same group as you does not guarantee higher precedence. Microsoft provides some documentation for (public) load order groups and altitude ranges; a list of publicly known altitudes is available. You can even request one yourself!

While powerful and flexible and generally the right choice for introspecting the filesystem on Windows, minifilters are unsuitable for osquery’s file monitoring purposes:

For in-tree (i.e., non-extension) tables, osquery has a policy against system modifications. Installing a minifilter requires us to modify the system by loading a driver, and would require osquery to either ship with a driver or fetch one at install-time.
Because minifilters are full kernel-mode drivers, they come with undesirable security and stability risks.
The design of osquery makes certain assurances to its users: that it is a single-executable, user-mode agent, self-monitoring its performance overhead at runtime — a kernel-mode driver would violate that design.

Journal monitoring

A third option is available to us: the NTFS journal.

Like most (relatively) modern filesystems, NTFS is journaled: changes to the underlying storage are preceded by updates to a (usually circular) region that records metadata associated with the changes. Dan Luu’s “Files are fraught with peril” contains some good motivating examples of journaling in the form of an “undo log”.

Journaling provides a number of benefits:

Increased resilience against corruption: the full chain of userspace-to-kernel-to-hardware operations for an single I/O operation (e.g., unlinking a file) isn’t internally atomic, meaning that a crash can leave the filesystem in an indeterminate or corrupted state. Having journal records for the last pre-committed operations makes it more likely that the filesystem can be rolled back to a known good state.
Because the journal provides a reversible record of filesystem actions, interactions with the underlying storage hardware can be made more aggressive: the batch size for triggering a commit can be increased, increasing performance.
Since the journal is timely and small (relative to the filesystem), it can be used to avoid costly filesystem queries (e.g., stat) for metadata. This is especially pertinent on Windows, where metadata requests generally involve acquiring a full HANDLE.

NTFS’s journaling mechanism is actually split into two separate components: $LogFile is a write-ahead log that handles journaling for rollback purposes, while the change journal ($Extend\$UsnJrnl) records recent changes on the volume by kind (i.e., without the offset and size information needed for rollback).

Windows uses the latter for its File History feature, and it’s what we’ll use too.

Accessing the change journal

⚠ The samples below have been simplified for brevity’s sake. They don’t contain error handling and bounds checking, both of which are essential for safe and correct use. Read MSDN and/or the full source code in osquery before copying! ⚠

Fortunately for us, opening a handle to and reading from the NTFS change journal for a volume is a relatively painless affair with just a few steps.

We obtain the handle for the volume that we want to monitor via a plain old CreateFile call:
We issue a DeviceIoControl[FSCTL_QUERY_USN_JOURNAL] on the handle to get the most recent Update Sequence Number (USN). USNs uniquely identify a batch records committed together; we’ll use our first to “anchor” our queries chronologically:
We issue another DeviceIoControl, this time with FSCTL_READ_USN_JOURNAL, to pull a raw buffer of records from the journal. We use a READ_USN_JOURNAL_DATA_V1 to tell the journal to only give us records starting at the USN we got in the last step:

Mind the last two fields (2U and 3U) — they’ll be relevant later.

Interpreting the change record buffer

DeviceIoControl[FSCTL_READ_USN_JOURNAL] gives us a raw buffer of variable-length USN_RECORDs, prefixed by a single USN that we can use to issue a future request:

Then, in our process_usn_record:

Recall those last two fields from READ_USN_JOURNAL_DATA_V1 — they correspond to the range of USN_RECORD versions returned to us, inclusive. We explicitly exclude v4 records, since they’re only emitted as part of range tracking and don’t include any additional information we need. You can read more about them on their MSDN page.

MSDN is explicit about these casts being necessary: USN_RECORD is an alias for USN_RECORD_V2, and USN_RECORD_V3 is not guaranteed to have any common layout other than that defined in USN_RECORD_COMMON_HEADER.

Once that’s out of the way, however, the following fields are available in both:

Reason: A bitmask of flags indicating changes that have accumulated in the current record. See MSDN’s USN_RECORD_V2 or USN_RECORD_V3 for a list of reason constants.
FileReferenceNumber: A unique (usually 128-bit) ordinal referencing the underlying filesystem object. This is the same as the FRN that can be obtained by calling GetFileInformationByHandleEx with FileIdInfo as the information class. FRNs correspond roughly to the “inode” concept in UNIX-land, and have similar semantics (unique per filesystem, not system-wide).
ParentFileReferenceNumber: Another FRN, this one for the parent directory (or volume) of the file or directory that this record is for.
FileNameLength, FileNameOffset, FileName: The byte-length, offset, and pointer to the filename of the file or directory that this is for. Note that FileName is the base (i.e., unqualified) name — retrieving the fully qualified name requires us to resolve the name of the parent FRN by opening a handle to it (OpenFileById), calling GetFinalPathNameByHandle, and joining the two.

Boom: file events via the change journal. Observe that our approach has sidestepped many of the common performance and overhead problems in file monitoring: we operate completely asynchronously and without blocking the filesystem whatsoever. This alone is a substantial improvement over the minifilter model, which imposes overhead on every I/O operation.

Caveats

Like the other techniques mentioned, the change journal approach to file monitoring is not without its disadvantages.

As the name of the table suggests, change journal monitoring only works on NTFS (and ReFS, which appears to be partially abandoned). It can’t be used to monitor changes on FAT or exFAT volumes, as these lack journaling entirely. It also won’t work on SMB shares, although it will work on cluster-shared volumes of the appropriate underlying format.

Handling of rename operations is also slightly annoying: the change journal records one event for the “old” file being renamed and another for the “new” file being created, meaning that we have to pair the two into a single event for coherent presentation. This isn’t hard (the events reference each other and have distinct masks), but it’s an extra step.

The change journal documentation is also conspicuously absent of information about the possibility of dropped records: the need for a start USN and the returning of a follow-up USN in the raw buffer imply that subsequent queries are expected to succeed, but no official details about the size of wraparound behavior of the change journal are provided. This blog post indicates that the default size is 1MB, which is probably sufficient for most workloads. It’s also changeable via fsutil.

Potentially more important is this line in the MSDN documentation for the Reason bitmask:

The flags that identify reasons for changes that have accumulated in this file or directory journal record since the file or directory opened.
When a file or directory closes, then a final USN record is generated with the USN_REASON_CLOSE flag set. The next change (for example, after the next open operation or deletion) starts a new record with a new set of reason flags.

This implies that duplicate events in the context of an open file’s lifetime can be combined into a single bit in the Reason mask: USN_REASON_DATA_EXTEND can only be set once per record, so an I/O pattern that consists of an open, two writes, and a close will only indicate that some write happened, not which or how many. Consequently, the change journal can’t answer detailed questions about the magnitude of I/O on an open resource; only whether or not some events did occur on it. This is not a major deficiency for the purposes of integrity monitoring, however, as we’re primarily interested in knowing when files change and what their end state is when they do.

Bringing the change journal into osquery

The snippets above give us the foundation for retrieving and interpreting change journal records from a single volume. osquery’s use case is more involved: we’d like to monitor every volume that the user registers interest in, and perform filtering on the retrieved records to limit output to a set of configured patterns.

Every NTFS volume has its own change journal, so each one needs to be opened and monitored independently. osquery’s pub-sub framework is well suited to this task:

We define an event publisher (NTFSEventPublisher)
In our configuration phase (NTFSEventPublisher::configure()), we read user configuration a la the Linux file_events table:
The configuration gives us the base list of volumes to monitor change journals on; we create a USNJournalReader for each and add them as services via Dispatcher::addService()
Each reader does its own change journal monitoring and event collection, reporting a list of events back to the publisher
We perform some normalization, including reduction of “old” and “new” rename events into singular NTFSEventRecords. We also maintain a cache of parent FRNs to directory names to avoid missing changes caused by directory renames and to minimize the number of open-handle requests we issue
The publisher fire()s those normalized events off for consumption by our subscribing table: ntfs_journal_events

Put together, this gives us the event-based table seen in the screenshot above. It’s query time!

Wrapping things up

The ntfs_journal_events table makes osquery a first-class option for file monitoring on Windows, and further decreases the osquery feature gap between Windows and Linux/macOS (which have had the file_events table for a long time).

Do you have osquery development or deployment needs? Drop us a line! Trail of Bits has been at the center of osquery’s development for years, and has worked on everything from the core to table development to new platform support.