ETW internals for security research and forensics

By Yarden Shafir

Why has Event Tracing for Windows (ETW) become so pivotal for endpoint detection and response (EDR) solutions in Windows 10 and 11? The answer lies in the value of the intelligence it provides to security tools through secure ETW channels, which are now also a target for offensive researchers looking to bypass detections.

In this deep dive, we’re not just discussing ETW’s functionalities; we’re exploring how ETW works internally so you can conduct novel research or forensic analysis on a system. Security researchers and malware authors already target ETW. They have developed several techniques to tamper with or bypass ETW-based EDRs, hook system calls, or gain access to ETW providers normally reserved for anti-malware solutions. Most recently, the Lazarus Group bypassed EDR detection by disabling ETW providers. Here, we’ll explain how ETW works and what makes it such a tempting target, and we’ll embark on an exciting journey deep into Windows.

Overview of ETW internals

Two main components of ETW are providers and consumers. Providers send events to an ETW globally unique identifier (GUID), and the events are written to a file, a buffer in memory, or both. Every Windows system has hundreds or thousands of providers registered. We can view available providers by running the command logman query providers:
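The command prints a simple two-column table of provider names and GUIDs. A representative excerpt is below; this is a sketch of the output shape, truncated, and the exact provider list varies by machine:

Provider                                 GUID
-------------------------------------------------------------------------------
Microsoft-Windows-Kernel-Process         {22FB2CD6-0E7B-422B-A0C7-2FAD1FD0E716}
Microsoft-Windows-PowerShell             {A0C1853B-5C40-4B15-8766-3CF1C58F985A}
...

The command completed successfully.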

Checking my system shows nearly 1,200 registered providers.

Each of these ETW providers defines its own events in a manifest file, which is used by consumers to parse provider-generated data. ETW providers may define hundreds of different event types, so the amount of information we can get from ETW is enormous. Most of these events can be seen in Event Viewer, a built-in Windows tool that consumes ETW events. But you’ll only see some of the data. Not all logs are enabled by default in Event Viewer, and not all event IDs are shown for each log.

On the other side we have consumers: trace logging sessions that receive events from one or several providers. For example, EDRs that rely on ETW data for their detection will consume events from security-related ETW channels such as the Threat Intelligence channel.

We can look at all running ETW consumers via Performance Monitor; clicking one of the sessions will show the providers it subscribes to. (You may need to run as SYSTEM to see all ETW logging sessions.)

The list of processes that receive events from a log session is useful information but not easy to obtain. As far as I can tell, there is no way to get that information from user mode at all, and even from kernel mode it's not an easy task unless you are very familiar with ETW internals. So we will see what we can learn from a kernel debugging session using WinDbg.

Finding ETW consumer processes

There are ways to get some information about the consumers of ETW log sessions from user mode, but it is very partial and isn't enough in all cases. So instead, we'll head to our kernel debugger session. One way to get information about ETW sessions from the debugger is the built-in extension !wmitrace. This extremely useful extension allows users to investigate all of the running loggers and their attributes, consumers, and buffers. It even allows users to start and stop log sessions (on a live debugger connection). Still, like all legacy extensions, it has its limitations: it can't easily be automated, and since it's a precompiled binary, it can't be extended with new functionality.

So instead we’ll write a JavaScript script—scripts are easier to extend and modify, and we can use them to get as much data as we need without being limited to the preexisting functionality of a legacy extension.

Every handle contains a pointer to an object. For example, a file handle will point to a kernel structure of type FILE_OBJECT. A handle to an object of type EtwConsumer will point to an undocumented data structure called ETW_REALTIME_CONSUMER. This structure contains a pointer to the process that opened it, the events used to notify it of different actions, flags, and one piece of information that will (eventually) lead us back to the log session: the LoggerId. Using a custom script, we can scan the handle tables of all processes for handles to EtwConsumer objects. For each one, we can get the linked ETW_REALTIME_CONSUMER structure and print the LoggerId:

"use strict";

function initializeScript()
{
    return [new host.apiVersionSupport(1, 7)];
}


function EtwConsumersForProcess(process)
{
    let dbgOutput = host.diagnostics.debugLog;
    let handles = process.Io.Handles;

    try
    {
        for (let handle of handles)
        {
            try
            {
                // We only care about handles to EtwConsumer objects
                let objType = handle.Object.ObjectType;
                if (objType === "EtwConsumer")
                {
                    // The object body is the undocumented ETW_REALTIME_CONSUMER structure
                    let consumer = host.createTypedObject(handle.Object.Body.address, "nt", "_ETW_REALTIME_CONSUMER");
                    let loggerId = consumer.LoggerId;

                    dbgOutput("Process ", process.Name, " with ID ", process.Id, " has handle ", handle.Handle, " to Logger ID ", loggerId, "\n");
                }
            } catch (e) {
                dbgOutput("\tException parsing handle ", handle.Handle, " in process ", process.Name, "!\n");
            }
        }
    } catch (e) {
        // Some processes have no accessible handle table; ignore them
    }
}

Next, we load the script into the debugger with .scriptload and call our function to identify which process consumes ETW events:

dx @$cursession.Processes.Select(p => @$scriptContents.EtwConsumersForProcess(p))
@$cursession.Processes.Select(p => @$scriptContents.EtwConsumersForProcess(p))                
Process svchost.exe with ID 0x558 has handle 0x7cc to Logger ID 31
Process svchost.exe with ID 0x114c has handle 0x40c to Logger ID 36
Process svchost.exe with ID 0x11f8 has handle 0x2d8 to Logger ID 17
Process svchost.exe with ID 0x11f8 has handle 0x2e8 to Logger ID 3
Process svchost.exe with ID 0x11f8 has handle 0x2f4 to Logger ID 9
Process NVDisplay.Container.exe with ID 0x1478 has handle 0x890 to Logger ID 38
Process svchost.exe with ID 0x1cec has handle 0x1dc to Logger ID 7
Process svchost.exe with ID 0x1d2c has handle 0x780 to Logger ID 8
Process CSFalconService.exe with ID 0x1e54 has handle 0x760 to Logger ID 3
Process CSFalconService.exe with ID 0x1e54 has handle 0x79c to Logger ID 45
Process CSFalconService.exe with ID 0x1e54 has handle 0xbb0 to Logger ID 10
Process Dell.TechHub.Instrumentation.SubAgent.exe with ID 0x25c4 has handle 0xcd8 to Logger ID 41
Process Dell.TechHub.Instrumentation.SubAgent.exe with ID 0x25c4 has handle 0xdb8 to Logger ID 35
Process Dell.TechHub.Instrumentation.SubAgent.exe with ID 0x25c4 has handle 0xf54 to Logger ID 44
Process SgrmBroker.exe with ID 0x17b8 has handle 0x178 to Logger ID 15
Process SystemInformer.exe with ID 0x4304 has handle 0x30c to Logger ID 16
Process PerfWatson2.exe with ID 0xa60 has handle 0xa3c to Logger ID 46
Process PerfWatson2.exe with ID 0x81a4 has handle 0x9c4 to Logger ID 40
Process PerfWatson2.exe with ID 0x76f0 has handle 0x9a8 to Logger ID 47
Process operfmon.exe with ID 0x3388 has handle 0x88c to Logger ID 48
Process operfmon.exe with ID 0x3388 has handle 0x8f4 to Logger ID 49

While we still don’t get the name of the log sessions, we already have more data than we did in user mode. We can see, for example, that some processes have multiple consumer handles since they are subscribed to multiple log sessions. Unfortunately, the ETW_REALTIME_CONSUMER structure doesn’t have any information about the log session besides its identifier, so we must find a way to match identifiers to human-readable names.

The registered loggers and their IDs are stored in a global list of loggers (or at least they were until the introduction of server silos; now every isolated process has its own separate ETW loggers, while non-isolated processes use the global list, which is also the one I'll use in this post). The global list is stored inside an ETW_SILODRIVERSTATE structure within the host silo globals, nt!PspHostSiloGlobals:

dx ((nt!_ESERVERSILO_GLOBALS*)&nt!PspHostSiloGlobals)->EtwSiloState
((nt!_ESERVERSILO_GLOBALS*)&nt!PspHostSiloGlobals)->EtwSiloState                 : 0xffffe38f3deeb000 [Type: _ETW_SILODRIVERSTATE *]
    [+0x000] Silo             : 0x0 [Type: _EJOB *]
    [+0x008] SiloGlobals      : 0xfffff8052bd489c0 [Type: _ESERVERSILO_GLOBALS *]
    [+0x010] MaxLoggers       : 0x50 [Type: unsigned long]
    [+0x018] EtwpSecurityProviderGuidEntry [Type: _ETW_GUID_ENTRY]
    [+0x1c0] EtwpLoggerRundown : 0xffffe38f3deca040 [Type: _EX_RUNDOWN_REF_CACHE_AWARE * *]
    [+0x1c8] EtwpLoggerContext : 0xffffe38f3deca2c0 [Type: _WMI_LOGGER_CONTEXT * *]
    [+0x1d0] EtwpGuidHashTable [Type: _ETW_HASH_BUCKET [64]]
    [+0xfd0] EtwpSecurityLoggers [Type: unsigned short [8]]
    [+0xfe0] EtwpSecurityProviderEnableMask : 0x3 [Type: unsigned char]
    [+0xfe4] EtwpShutdownInProgress : 0 [Type: long]
    [+0xfe8] EtwpSecurityProviderPID : 0x798 [Type: unsigned long]
    [+0xff0] PrivHandleDemuxTable [Type: _ETW_PRIV_HANDLE_DEMUX_TABLE]
    [+0x1010] RTBacklogFileRoot : 0x0 [Type: wchar_t *]
    [+0x1018] EtwpCounters     [Type: _ETW_COUNTERS]
    [+0x1028] LogfileBytesWritten : {4391651513} [Type: _LARGE_INTEGER]
    [+0x1030] ProcessorBlocks  : 0x0 [Type: _ETW_SILO_TRACING_BLOCK *]
    [+0x1038] ContainerStateWnfSubscription : 0xffffaf8de0386130 [Type: _EX_WNF_SUBSCRIPTION *]
    [+0x1040] ContainerStateWnfCallbackCalled : 0x0 [Type: unsigned long]
    [+0x1048] UnsubscribeWorkItem : 0xffffaf8de0202170 [Type: _WORK_QUEUE_ITEM *]
    [+0x1050] PartitionId      : {00000000-0000-0000-0000-000000000000} [Type: _GUID]
    [+0x1060] ParentId         : {00000000-0000-0000-0000-000000000000} [Type: _GUID]
    [+0x1070] QpcOffsetFromRoot : {0} [Type: _LARGE_INTEGER]
    [+0x1078] PartitionName    : 0x0 [Type: char *]
    [+0x1080] PartitionNameSize : 0x0 [Type: unsigned short]
    [+0x1082] UnusedPadding    : 0x0 [Type: unsigned short]
    [+0x1084] PartitionType    : 0x0 [Type: unsigned long]
    [+0x1088] SystemLoggerSettings [Type: _ETW_SYSTEM_LOGGER_SETTINGS]
    [+0x1200] EtwpStartTraceMutex [Type: _KMUTANT]

The EtwpLoggerContext field points to an array of pointers to WMI_LOGGER_CONTEXT structures, each describing one logger session. The size of the array is saved in the MaxLoggers field of the ETW_SILODRIVERSTATE. Not all entries of the array are necessarily used; unused entries are set to 1. Knowing this, we can dump all of the initialized entries of the array (I've hardcoded the array size for convenience):

dx ((nt!_WMI_LOGGER_CONTEXT*(*)[0x50])(((nt!_ESERVERSILO_GLOBALS*)&nt!PspHostSiloGlobals)->EtwSiloState->EtwpLoggerContext))->Where(l => l != 1)
((nt!_WMI_LOGGER_CONTEXT*(*)[0x50])(((nt!_ESERVERSILO_GLOBALS*)&nt!PspHostSiloGlobals)->EtwSiloState->EtwpLoggerContext))->Where(l => l != 1)                
    [2]              : 0xffffe38f3f0c9040 [Type: _WMI_LOGGER_CONTEXT *]
    [3]              : 0xffffe38f3fe07640 [Type: _WMI_LOGGER_CONTEXT *]
    [4]              : 0xffffe38f3f0c75c0 [Type: _WMI_LOGGER_CONTEXT *]
    [5]              : 0xffffe38f3f0c9780 [Type: _WMI_LOGGER_CONTEXT *]
    [6]              : 0xffffe38f3f0cb040 [Type: _WMI_LOGGER_CONTEXT *]
    [7]              : 0xffffe38f3f0cb600 [Type: _WMI_LOGGER_CONTEXT *]
    [8]              : 0xffffe38f3f0ce040 [Type: _WMI_LOGGER_CONTEXT *]
    [9]              : 0xffffe38f3f0ce600 [Type: _WMI_LOGGER_CONTEXT *]
    [10]             : 0xffffe38f79832a40 [Type: _WMI_LOGGER_CONTEXT *]
    [11]             : 0xffffe38f3f0d1640 [Type: _WMI_LOGGER_CONTEXT *]
    [12]             : 0xffffe38f89535a00 [Type: _WMI_LOGGER_CONTEXT *]
    [13]             : 0xffffe38f3dacc940 [Type: _WMI_LOGGER_CONTEXT *]
    [14]             : 0xffffe38f3fe04040 [Type: _WMI_LOGGER_CONTEXT *]
       …

Each logger context contains information about the logger session such as its name, the file that stores the events, the security descriptor, and more. Each structure also contains a logger ID, which matches the index of the logger in the array we just dumped. So given a logger ID, we can find its details like this:

dx (((nt!_ESERVERSILO_GLOBALS*)&nt!PspHostSiloGlobals)->EtwSiloState->EtwpLoggerContext)[@$loggerId]
 (((nt!_ESERVERSILO_GLOBALS*)&nt!PspHostSiloGlobals)->EtwSiloState->EtwpLoggerContext)[@$loggerId]                 : 0xffffe38f3f0ce600 [Type: _WMI_LOGGER_CONTEXT *]
    [+0x000] LoggerId         : 0x9 [Type: unsigned long]
    [+0x004] BufferSize       : 0x10000 [Type: unsigned long]
    [+0x008] MaximumEventSize : 0xffb8 [Type: unsigned long]
    [+0x00c] LoggerMode       : 0x19800180 [Type: unsigned long]
    [+0x010] AcceptNewEvents  : 0 [Type: long]
    [+0x018] GetCpuClock      : 0x0 [Type: unsigned __int64]
    [+0x020] LoggerThread     : 0xffffe38f3f0d0040 [Type: _ETHREAD *]
    [+0x028] LoggerStatus     : 0 [Type: long]
       …

Now we can implement this as a function (in DX or JavaScript) and print the logger name for each open consumer handle we find:
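A minimal JavaScript sketch of the name lookup might look like the following. The function name is mine, and reading the LoggerName UNICODE_STRING through toString() is an assumption about how the debugger object model marshals it:

function GetLoggerNameById(loggerId)
{
    // Get the ETW state of the host silo, as before
    let hostSiloGlobals = host.getModuleSymbolAddress("nt", "PspHostSiloGlobals");
    let typedhostSiloGlobals = host.createTypedObject(hostSiloGlobals, "nt", "_ESERVERSILO_GLOBALS");

    // The logger ID doubles as the logger's index in the EtwpLoggerContext array;
    // unused slots hold the value 1, so real callers should validate the entry first
    let logger = typedhostSiloGlobals.EtwSiloState.EtwpLoggerContext[loggerId];

    // LoggerName is a UNICODE_STRING whose display string is the session name
    return logger.LoggerName.toString();
}

Wired into EtwConsumersForProcess from earlier, the output now includes the session names: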

dx @$cursession.Processes.Select(p => @$scriptContents.EtwConsumersForProcess(p))
@$cursession.Processes.Select(p => @$scriptContents.EtwConsumersForProcess(p))                
Process svchost.exe with ID 0x558 has handle 0x7cc to Logger ID 31
    Logger Name: "UBPM"

Process svchost.exe with ID 0x114c has handle 0x40c to Logger ID 36
    Logger Name: "WFP-IPsec Diagnostics"

Process svchost.exe with ID 0x11f8 has handle 0x2d8 to Logger ID 17
    Logger Name: "EventLog-System"

Process svchost.exe with ID 0x11f8 has handle 0x2e8 to Logger ID 3
    Logger Name: "Eventlog-Security"

Process svchost.exe with ID 0x11f8 has handle 0x2f4 to Logger ID 9
    Logger Name: "EventLog-Application"

Process NVDisplay.Container.exe with ID 0x1478 has handle 0x890 to Logger ID 38
    Logger Name: "NOCAT"

Process svchost.exe with ID 0x1cec has handle 0x1dc to Logger ID 7
    Logger Name: "DiagLog"

Process svchost.exe with ID 0x1d2c has handle 0x780 to Logger ID 8
    Logger Name: "Diagtrack-Listener"

Process CSFalconService.exe with ID 0x1e54 has handle 0x760 to Logger ID 3
    Logger Name: "Eventlog-Security"
...

In fact, by using the logger array, we can build a better way to enumerate ETW log session consumers. Each logger context has a Consumers field, which is a linked list connecting all of the ETW_REALTIME_CONSUMER structures that are subscribed to this log session:

So instead of scanning the handle table of each and every process in the system, we can go directly to the loggers array and find the registered processes for each one:

function EtwLoggersWithConsumerProcesses()
{
    let dbgOutput = host.diagnostics.debugLog;

    // Get the ETW state of the host silo
    let hostSiloGlobals = host.getModuleSymbolAddress("nt", "PspHostSiloGlobals");
    let typedhostSiloGlobals = host.createTypedObject(hostSiloGlobals, "nt", "_ESERVERSILO_GLOBALS");

    let maxLoggers = typedhostSiloGlobals.EtwSiloState.MaxLoggers;
    for (let i = 0; i < maxLoggers; i++)
    {
        // Skip unused slots in the logger array, which are set to 1
        let logger = typedhostSiloGlobals.EtwSiloState.EtwpLoggerContext[i];
        if (host.parseInt64(logger.address, 16).compareTo(host.parseInt64("0x1")) != 0)
        {
            dbgOutput("Logger Name: ", logger.LoggerName, "\n");

            // Walk the linked list of ETW_REALTIME_CONSUMER structures subscribed to this logger
            let consumers = host.namespace.Debugger.Utility.Collections.FromListEntry(logger.Consumers, "nt!_ETW_REALTIME_CONSUMER", "Links");
            if (consumers.Count() != 0)
            {
                for (let consumer of consumers)
                {
                    dbgOutput("\tProcess Name: ", consumer.ProcessObject.SeAuditProcessCreationInfo.ImageFileName.Name, "\n");
                    dbgOutput("\tProcess Id: ", host.parseInt64(consumer.ProcessObject.UniqueProcessId.address, 16).toString(10), "\n");
                    dbgOutput("\n");
                }
            }
            else
            {
                dbgOutput("\tThis logger has no consumers\n\n");
            }
        }
    }
}

Calling this function should get us the exact same results as earlier, only much faster!
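For example, after reloading the script with .scriptload, the new function can be invoked through the same script contents accessor we used before:

dx @$scriptContents.EtwLoggersWithConsumerProcesses()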

With that done, we can continue to another piece of information that could be useful: the list of GUIDs that provide events to a log session.

Finding provider GUIDs

Finding the consumers of an ETW log session is only half the battle; we also want to know which providers notify each log session. We saw earlier that we can get that information from Performance Monitor, but let's see how we can also get it from a debugger session, which is useful when a live machine isn't available or when we need details that user-mode tools don't supply.

If we look at the WMI_LOGGER_CONTEXT structure, we won’t see any details about the providers that notify the log session. To find this information, we need to go back to the ETW_SILODRIVERSTATE structure from earlier and look at the EtwpGuidHashTable field. This is an array of buckets storing all of the registered provider GUIDs. For performance reasons, the GUIDs are hashed and stored in 64 buckets. Each bucket contains three lists linking ETW_GUID_ENTRY structures. There is one list for each ETW_GUID_TYPE:

  • EtwpTraceGuidType
  • EtwpNotificationGuidType
  • EtwpGroupGuidType
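A single bucket can be inspected directly from the silo state we used earlier. For example, this one-liner dumps one list head (a sketch; bucket 0 and list index 0, the trace GUID type, are arbitrary choices):

dx ((nt!_ESERVERSILO_GLOBALS*)&nt!PspHostSiloGlobals)->EtwSiloState->EtwpGuidHashTable[0].ListHead[0]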

Each ETW_GUID_ENTRY structure contains an EnableInfo array with eight entries, each holding information about one log session that the GUID provides events for (which means that a GUID entry can supply events to up to eight different log sessions):

dt nt!_ETW_GUID_ENTRY EnableInfo.
   +0x080 EnableInfo  : [8] 
      +0x000 IsEnabled   : Uint4B
      +0x004 Level       : UChar
      +0x005 Reserved1   : UChar
      +0x006 LoggerId    : Uint2B
      +0x008 EnableProperty : Uint4B
      +0x00c Reserved2   : Uint4B
      +0x010 MatchAnyKeyword : Uint8B
      +0x018 MatchAllKeyword : Uint8B

Visually, this is what this whole thing looks like:

As we can see, each EnableInfo entry contains a LoggerId field, which we can use as the index into the EtwpLoggerContext array to find the log session.

With this new information in mind, we can write a simple JavaScript function to print the GUIDs that match a logger ID. (In this case, I chose to go over only one ETW_GUID_TYPE at a time to make this code a bit cleaner.) Then we can go one step further and parse the ETW_REG_ENTRY list in each GUID entry to find out which processes notify it, or if it’s a kernel-mode provider:

function GetGuidsForLoggerId(loggerId, guidType)
{
    let dbgOutput = host.diagnostics.debugLog;

    // Get the GUID hash table from the ETW state of the host silo
    let hostSiloGlobals = host.getModuleSymbolAddress("nt", "PspHostSiloGlobals");
    let typedhostSiloGlobals = host.createTypedObject(hostSiloGlobals, "nt", "_ESERVERSILO_GLOBALS");
    let guidHashTable = typedhostSiloGlobals.EtwSiloState.EtwpGuidHashTable;
    for (let bucket of guidHashTable)
    {
        // Each bucket links ETW_GUID_ENTRY structures, one list per ETW_GUID_TYPE
        let guidEntries = host.namespace.Debugger.Utility.Collections.FromListEntry(bucket.ListHead[guidType], "nt!_ETW_GUID_ENTRY", "GuidList");
        if (guidEntries.Count() != 0)
        {
            for (let guid of guidEntries)
            {
                // Check whether any of the entry's eight EnableInfo slots
                // references the logger ID we're interested in
                for (let enableInfo of guid.EnableInfo)
                {
                    if (enableInfo.LoggerId === loggerId)
                    {
                        dbgOutput("\tGuid: ", guid.Guid, "\n");
                        let regEntryLinkField = "RegList";
                        if (guidType == 2)
                        {
                            // group GUIDs registration entries are linked through the GroupRegList field
                            regEntryLinkField = "GroupRegList";
                        }
                        // Walk the ETW_REG_ENTRY list to see who registered this GUID
                        let regEntries = host.namespace.Debugger.Utility.Collections.FromListEntry(guid.RegListHead, "nt!_ETW_REG_ENTRY", regEntryLinkField);
                        if (regEntries.Count() != 0)
                        {
                            dbgOutput("\tProvider Processes:\n");
                            for (let regEntry of regEntries)
                            {
                                // User-mode providers have DbgUserRegistration set;
                                // otherwise the GUID was registered from kernel mode
                                if (regEntry.DbgUserRegistration != 0)
                                {
                                    dbgOutput("\t\tProcess: ", regEntry.Process.SeAuditProcessCreationInfo.ImageFileName.Name, " ID: ", host.parseInt64(regEntry.Process.UniqueProcessId.address, 16).toString(10), "\n");
                                }
                                else
                                {
                                    dbgOutput("\t\tKernel Provider\n");
                                }
                            }
                        }
                        break;
                    }
                }
            }
        }
    }
}

As an example, here are all of the trace provider GUIDs and the processes that notify them for ETW session UBPM (LoggerId 31 in my case):

dx @$scriptContents.GetGuidsForLoggerId(31, 0)
    Guid: {9E03F75A-BCBE-428A-8F3C-D46F2A444935}
    Provider Processes:
        Process: "\Device\HarddiskVolume3\Windows\System32\svchost.exe" ID: 2816
    Guid: {2D7904D8-5C90-4209-BA6A-4C08F409934C}
    Guid: {E46EEAD8-0C54-4489-9898-8FA79D059E0E}
    Provider Processes:
        Process: "\Device\HarddiskVolume3\Windows\System32\dwm.exe" ID: 2268
    Guid: {D02A9C27-79B8-40D6-9B97-CF3F8B7B5D60}
    Guid: {92AAB24D-D9A9-4A60-9F94-201FED3E3E88}
    Provider Processes:
        Process: "\Device\HarddiskVolume3\Windows\System32\svchost.exe" ID: 2100
        Kernel Provider
    Guid: {FBCFAC3F-8460-419F-8E48-1F0B49CDB85E}
    Guid: {199FE037-2B82-40A9-82AC-E1D46C792B99}
    Provider Processes:
        Process: "\Device\HarddiskVolume3\Windows\System32\lsass.exe" ID: 1944
    Guid: {BD2F4252-5E1E-49FC-9A30-F3978AD89EE2}
    Provider Processes:
        Process: "\Device\HarddiskVolume3\Windows\System32\svchost.exe" ID: 16292
    Guid: {22B6D684-FA63-4578-87C9-EFFCBE6643C7}
    Guid: {3635D4B6-77E3-4375-8124-D545B7149337}
    Guid: {0621B9DF-3249-4559-9889-21F76B5C80F3}
    Guid: {BD8FEA17-5549-4B49-AA03-1981D16396A9}
    Guid: {F5528ADA-BE5F-4F14-8AEF-A95DE7281161}
    Guid: {54732EE5-61CA-4727-9DA1-10BE5A4F773D}
    Provider Processes:
        Process: "\Device\HarddiskVolume3\Windows\System32\svchost.exe" ID: 4428
    Guid: {18F4A5FD-FD3B-40A5-8FC2-E5D261C5D02E}
    Guid: {8E6A5303-A4CE-498F-AFDB-E03A8A82B077}
    Provider Processes:
        Kernel Provider
    Guid: {CE20D1C3-A247-4C41-BCB8-3C7F52C8B805}
    Provider Processes:
        Kernel Provider
    Guid: {5EF81E80-CA64-475B-B469-485DBC993FE2}
    Guid: {9B307223-4E4D-4BF5-9BE8-995CD8E7420B}
    Provider Processes:
        Kernel Provider
    Guid: {AA1F73E8-15FD-45D2-ABFD-E7F64F78EB11}
    Provider Processes:
        Kernel Provider
    Guid: {E1BDC95E-0F07-5469-8E64-061EA5BE6A0D}
    Guid: {5B004607-1087-4F16-B10E-979685A8D131}
    Guid: {AEDD909F-41C6-401A-9E41-DFC33006AF5D}
    Guid: {277C9237-51D8-5C1C-B089-F02C683E5BA7}
    Provider Processes:
        Kernel Provider
    Guid: {F230D19A-5D93-47D9-A83F-53829EDFB8DF}
    Provider Processes:
        Process: "\Device\HarddiskVolume3\Windows\System32\svchost.exe" ID: 2816

Putting all of those steps together, we finally have a way to know which log sessions are running on the machine, which processes notify each of the GUIDs in a session, and which processes are subscribed to it. This can help us understand the purpose of different ETW log sessions running on the machine, such as identifying the log sessions used by EDR software or interesting hardware components. These scripts can also be modified as needed to identify ETW irregularities, such as a log session that has been disabled in order to blind security products. From an attacker's perspective, gathering this information can tell us which ETW providers are used on a machine and which ones are ignored and, therefore, don't present any risk of detection.

Overall, ETW is a very powerful mechanism, so getting more visibility into its internal workings is useful for attackers and defenders alike. This post only scratches the surface, and there’s so much more work that can be done in this area.

All of the JavaScript functions shown in this post can be found in this GitHub repo.

How CISA can improve OSS security

By Jim Miller

The US government recently issued a request for information (RFI) about open-source software (OSS) security. In this blog post, we present a summary of our response and proposed solutions. Our solutions include rewriting widely used legacy code in memory-safe languages such as Rust, funding OSS solutions to improve compliance, sponsoring research and development of vulnerability tracking and analysis tools, and educating developers on how to reduce attack surfaces and manage complex features.

Background details

The government entities responsible for the RFI were the Office of the National Cyber Director (ONCD), Cybersecurity and Infrastructure Security Agency (CISA), National Science Foundation (NSF), Defense Advanced Research Projects Agency (DARPA), and Office of Management and Budget (OMB). The specific objective of this RFI was to gather public comments on future priorities and long-term areas of focus for OSS security. This RFI is a key part of the ongoing efforts by these organizations to identify systemic risks in OSS and foster the long-term sustainability of OSS communities.

The RFI includes five potential areas for long-term focus and prioritization. In response to this request, we are prioritizing the “Securing OSS Foundations” area and each of its four sub-areas: fostering the adoption of memory-safe programming languages, strengthening software supply chains, reducing entire classes of vulnerabilities at scale, and advancing developer education. We will provide suggested solutions for each of these four sub-areas below.

Fostering the adoption of memory-safe programming languages

Memory corruption vulnerabilities remain a grave threat to OSS security, as demonstrated by the number and impact of vulnerabilities such as the recent heap buffer overflow in libwebp, which was being actively exploited while we drafted our RFI response. Exploits like these illustrate the need for solutions beyond runtime mitigations, and languages like Rust, which provide both memory and type safety, are the most promising.

In addition to dramatically reducing vulnerabilities, Rust also blends well with legacy codebases, offers high performance, and is relatively easy to use. Thus, our proposed solution centers on sponsoring strategic rewrites of important legacy codebases. Since rewrites are very costly, we specifically recommend undertaking a comprehensive and systematic analysis to identify the most suitable OSS candidates for transitioning to memory-safe languages like Rust. We propose a strong focus on software components that are widely used, poorly covered by tests, and prone to memory safety vulnerabilities.

Strengthening software supply chains

Supply chain attacks, as demonstrated by the 2020 SolarWinds hack, represent another significant risk to OSS security. Supply chain security is a complex and multifaceted problem. Therefore, we propose improving protections across the entire software supply chain—from individual developers, to package indices, to downstream users.

Our suggested strategy includes establishing “strong link” guidelines that CISA could release. These would provide guidance for each of the critical OSS components: OSS developers, repository hosts, package indices, and consumers. In addition to this guidance, we also propose funding OSS solutions that better enable compliance, such as improving software bill of materials (SBOM) fidelity by integrating with build systems.

Reducing entire classes of vulnerabilities at scale

Another area of focus should be large-scale reduction of vulnerabilities in the OSS ecosystem. Efforts such as OSS-Fuzz have successfully mitigated thousands of potential security issues, and we propose funding similar projects using this as a model. In addition, vulnerability tracking tools (like cargo-audit and pip-audit) have been successful at quickly remediating vulnerabilities that affect a large number of users. A critical part of keeping these tools effective is properly maintaining the vulnerability database and preventing the over-reporting of insignificant security issues, which can result in security fatigue: developers ignoring alerts because there are too many to process.

Therefore, our proposed solution is sponsoring the development and maintenance of tools for vulnerability tracking, analysis tools like Semgrep and CodeQL, and other novel techniques that could work at scale. We also recommend sponsoring research pertaining to novel tools and techniques to help solve specific high-value problems, such as secure HTTP parsing.

Advancing developer education

Lastly, we believe that improving developer education is an important long-term focus area for OSS security. In contrast to current educational efforts, which focus primarily on common vulnerabilities, we propose fostering an extension of developer education that covers areas like reducing attack surfaces, managing complex features, and “shifting left.” If done effectively, creating documentation and training materials specifically for these areas could have a substantially positive, long-term impact on OSS security.

Looking ahead

Addressing OSS security can be a complex challenge, but by making targeted interventions in these four areas, we can make significant improvements. We believe the US government can maximize impact through a combination of three strategies: providing comprehensive guidance, allocating funding through agencies like DARPA and ONR, and fostering collaboration with OSS foundations like OSTIF, OTF, and OpenSSF. This combined approach will enable the sponsorship and monetary support necessary to drive the research and engineering tasks outlined in our proposed solutions.

Together, these actions can build a safer future for open-source software. We welcome the initiative by ONCD, CISA, NSF, DARPA, and OMB for fostering such an open discussion and giving us the chance to contribute.

We welcome you to read our full response.

Assessing the security posture of a widely used vision model: YOLOv7

By Alvin Crighton, Anusha Ghosh, Suha Hussain, Heidy Khlaaf, and Jim Miller

TL;DR: We identified 11 security vulnerabilities in YOLOv7, a popular computer vision framework, that could enable attacks including remote code execution (RCE), denial of service, and model differentials (where an attacker can trigger a model to perform differently in different contexts).

Open-source software provides the foundation of many widely used ML systems. However, these frameworks have been developed rapidly, often at the cost of secure and robust practices. Furthermore, these open-source models and frameworks are not specifically intended for critical applications, yet they are being adopted for such applications at scale, through momentum or popularity. Few of these software projects have been rigorously reviewed, leading to latent risks and a rise in unidentified supply chain attack surfaces that impact the confidentiality, integrity, and availability of the model and its associated assets. For example, pickle files, which are used widely in the ML ecosystem, can be exploited to achieve arbitrary code execution.

Given these risks, we decided to assess the security of a popular and well-established vision model: YOLOv7. This blog post shares and discusses the results of our review, which comprised a lightweight threat model and secure code review, including our conclusion that the YOLOv7 codebase is not suitable for mission-critical applications or applications that require high availability. A link to the full public report is available here.

Disclaimer: YOLOv7 is a product of academic work. Academic prototypes are not intended to be production-ready or to have appropriate security hygiene, and our review is not intended as a criticism of the authors or their development choices. However, as with many ML prototypes, YOLOv7 has been adopted within production systems (e.g., it is promoted by Roboflow and has 3.5k forks). Our review is only intended to bring to light the risks of using such prototypes without further security scrutiny.

As part of our responsible disclosure policy, we contacted the authors of the YOLOv7 repository to make them aware of the issues we identified. We did not receive a response, but we propose concrete solutions and changes that would mitigate the identified security gaps.

What is YOLOv7?

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection system whose combination of high accuracy and good performance has made it a popular choice for vision systems embedded in mission-critical applications such as robotics, autonomous vehicles, and manufacturing. YOLOv1 was initially developed in 2015; the latest version, YOLOv7, is the open-source revision of the YOLO codebase developed by Academia Sinica. It implements their corresponding academic paper, which outlines how YOLOv7 outperforms both transformer-based and convolutional-based object detectors (including YOLOv5).

The codebase has over 3k forks and allows users to provide their own pre-trained files, model architecture, and dataset to train custom models. Even though YOLOv7 is an academic project, YOLO is the de facto algorithm in object detection, and is often used commercially and in mission-critical applications (e.g., by Roboflow).

What we found

Our review identified five high-severity and three medium-severity findings, which we attribute to the following insecure practices: 

  • The codebase is not written defensively; it has no unit tests or testing framework, and inputs are poorly validated and sanitized.
  • Complete trust is placed in model and configuration files that can be pulled from external sources.
  • The codebase dangerously and unnecessarily relies on permissive functions in ways that introduce vectors for RCE. 

The table below summarizes our high-severity findings:

**It is common practice to download datasets, model pickle files, and YAML configuration files from external sources, such as PyTorch Hub. To compromise these files on a target machine, an attacker could upload a malicious file to one of these public sources.

Building the threat model

For our review, we first carried out a lightweight threat model to identify threat scenarios and the most critical components that would in turn inform our code review. Our approach draws from Mozilla’s “Rapid Risk Assessment” methodology and NIST’s guidance on data-centric threat modeling (NIST 800-154). We reviewed YOLO academic papers, the YOLOv7 codebase, and user documentation to identify all data types, data flow, trust zones (and their connections), and threat actors. These artifacts were then used to develop a comprehensive list of threat scenarios that document each of the possible threats and risks present in the system.

The threat model accounts for the ML pipeline’s unique architecture (relative to traditional software systems), which introduces novel threats and risks due to new attack surfaces within the ML lifecycle and pipeline such as data collection, model training, and model inference and deployment. Corresponding threats and failures can lead to the degradation of model performance, exploitation of the collection and processing of data, and manipulation of the resulting outputs. For example, downloading a dataset from an untrusted or insecure source can lead to dataset poisoning and model degradation.

Our threat model thus aims to examine ML-specific areas of entry as well as outline significant sub-components of the YOLOv7 codebase. Based on our assessment of YOLOv7 artifacts, we constructed the following data flow diagram.

Figure 1: Data flow diagram produced during the lightweight threat model

Note that this diagram and our threat model do not target a specific application or deployment environment. Our identified scenarios were tailored to bring focus to general ML threats that developers should consider before deploying YOLOv7 within their ecosystem. We identified a total of twelve threat scenarios pertaining to three primary threats: dataset compromise, host compromise, and YOLO process compromise (such as injecting malicious code into the YOLO system or one of its dependencies).

Code review results

Next, we performed a secure code review of the YOLOv7 codebase, focusing on the most critical components identified in the threat model's threat scenarios. We used both manual and automated testing methods; our automated testing tools included Trail of Bits' repository of custom Semgrep rules, which target the misuse of ML frameworks such as PyTorch and which identified one security issue and several code quality issues in the YOLOv7 codebase. We also used the TorchScript automatic trace checking tool to automatically detect potential errors in traced models. Finally, we ran the public Python CodeQL queries across the codebase and identified multiple code quality issues.

In total, our code review resulted in the discovery of twelve security issues, five of which are high severity. The review also uncovered twelve code quality findings that serve as recommendations for enhancing the quality and readability of the codebase and preventing the introduction of future vulnerabilities.

All of these findings are indicative of a system that was not written or designed with a defensive lens:

  • Five security issues could individually lead to RCE, most of which are caused by the unnecessary and dangerous use of permissive functions such as subprocess.check_output, eval, and os.system. See the highlight below for an example. 
  • User and external data inputs are poorly validated and sanitized. Multiple issues enable a denial-of-service attack if an end user can control certain inputs, such as model files, dataset files, or configuration files (TOB-YOLO-9, TOB-YOLO-8, TOB-YOLO-12). For example, the codebase allows engineers to provide their own configuration files, whether they describe a different model architecture or pre-trained weights (given the different applications of the YOLO model architecture). These files and datasets are loaded into the training network, where PyTorch is used to train the model. For a more secure design, the amount of trust placed in external inputs needs to be drastically reduced, and these values need to be carefully sanitized and validated.
  • There are currently no unit tests or any testing framework in the codebase (TOB-YOLO-11). A proper testing framework would have prevented some of the issues we uncovered, and without this framework it is likely that other implementation flaws and bugs exist in the codebase. Moreover, as the system continues to evolve, without any testing, code regressions are likely to occur.

Below, we highlight some of the details of our high severity findings and discuss their repercussions on ML-based systems.

Secure code review highlight #1: How YAML parsing leads to RCE

Our most notable finding concerns the insecure parsing of YAML files, which could result in RCE. Like many ML systems, YOLO uses YAML files to specify the architecture of models. Unfortunately, the YAML parsing function, parse_model, parses the file by calling eval on its unvalidated contents, as shown in this code snippet:

Figure 2: Snippet of parse_model in models/yolo.py

If an attacker is able to manipulate one of the YAML files used by a target user, they could inject malicious code that would be executed during parsing. This is particularly concerning because these YAML files are often obtained from third-party websites that host them alongside other model files and datasets, and a sophisticated attacker could compromise one of those third-party services or hosted assets. Proper inspection of the YAML files can catch this issue, but only if it is done closely and often.
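The anti-pattern is easy to reproduce in any language with an eval primitive. The following self-contained JavaScript sketch is hypothetical (the actual YOLOv7 code is Python), but it shows how evaluating config-supplied strings hands whoever controls the config file arbitrary code execution:

"use strict";

// Hypothetical stand-in for model configuration loaded from an untrusted YAML file
const benignConfig    = { layer: "64 * 2" };
const maliciousConfig = { layer: "require('child_process').execSync('id').toString()" };

function parseModel(config)
{
    // The anti-pattern: evaluating unvalidated configuration contents,
    // just as parse_model calls eval on fields of the YAML file
    return eval(config.layer);
}

console.log(parseModel(benignConfig));    // 128
console.log(parseModel(maliciousConfig)); // executes an attacker-chosen command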

Given the potential severity of this finding, we proposed an alternative implementation as a mitigation: remove the need for the parse_model function altogether by rewriting the given architectures defined in config files as different block classes that call standard PyTorch modules. This rewrite serves a few different purposes:

  • It removes the inherent vulnerability present in calling eval on unsanitized input.
  • The block class structure more effectively replicates the architecture proposed in the implemented paper, allowing for easier replication of the given architecture and definition of subsequent iterations of similar structures.
  • It presents a more extensible base to continue defining configurations, as the classes are easily modifiable based on different parameters set by the user.

Our proposed fix can be tracked here.

Secure code review highlight #2: ML-specific vulnerabilities and improvements

As previously noted, ML frameworks are leading to a rise in novel attack avenues targeting the confidentiality, integrity, and availability of models and their associated assets. Highlights from the ML-specific issues that we uncovered during our security assessment include the following:

  • The YOLOv7 codebase uses pickle files to store models and datasets; these files have not been verified and may have been obtained from third-party sources. We previously found that the widespread use of pickle files in the ML ecosystem is a security risk, as pickle files enable arbitrary code execution. To deserialize a pickle file, a virtual machine known as the Pickle Machine (PM) interprets the file as a sequence of opcodes. Two opcodes contained in the PM, GLOBAL and REDUCE, can execute arbitrary Python code outside of the PM, thereby enabling arbitrary code execution. We built and released fickling, a tool to reverse engineer and analyze pickle files; however, we further recommend that ML implementations use safer file formats instead, such as safetensors.
  • The way YOLOv7 traces its models could lead to model differentials—that is, the traced model that is being deployed behaves differently from the original, untraced model. In particular, YOLO uses PyTorch’s torch.jit.trace to convert its models into the TorchScript format for deployment. However, the YOLOv7 models contain many tracer edge cases: elements of the model that are not accurately captured by tracing. The most notable occurrence was the inclusion of input-dependent control flow. We used TorchScript’s automatic trace checker to confirm this divergence by generating an input that had different outputs depending on whether or not the model was traced, which could lead to backdoors. An attacker could release a model that exhibits a specific malicious behavior only when it is traced, making it harder to catch.

Specific recommendations and mitigations are outlined in our report.

Enhancing YOLOv7’s security

Beyond the identified code review issues, a series of design and operational changes are needed to ensure sufficient security posture. Highlights from the list of strategic recommendations provided in our report include:

  • Implementing an adequate testing framework with comprehensive unit tests and integration tests
  • Removing the use of highly permissive functions, such as subprocess.check_output, eval, and os.system
  • Improving the development process of the codebase
  • Enforcing the usage of secure protocols, such as HTTPS and RTMPS, when available
  • Continuously updating dependencies to ensure upstream security fixes are applied
  • Providing documentation to users about the potential threats of using untrusted training data or webcam streams

Although the identified security gaps may be acceptable for academic prototypes, we do not recommend using YOLOv7 within mission-critical applications or domains, despite existing use cases. For affected end users who are already using and deploying YOLOv7, we strongly recommend disallowing end users from providing datasets, model files, configuration files, and any other type of external inputs until the recommended changes are made to the design and maintenance of YOLOv7.

Coordinated disclosure timeline

As part of the disclosure process, we reported the vulnerabilities to the YOLOv7 maintainers first. Despite multiple attempts, we were not able to establish contact with the maintainers in order to coordinate fixes for these vulnerabilities. As a result, at the time of this blog post being released, the identified issues remain unfixed. As mentioned, we have proposed a fix to one of the issues that is being tracked here. The timeline of disclosure is provided below:

  • May 24, 2023: We notified the YOLOv7 maintainers that we intended to review the YOLOv7 codebase for internal purposes and invited them to participate and engage in our audit.
  • June 9, 2023: We notified the maintainers that we had begun the audit and again invited them to participate and engage with our efforts.
  • July 10, 2023: We notified the maintainers that we had several security findings and requested engagement to discuss them.
  • July 26, 2023: We informed the maintainers of our official security disclosure notice with a release date of August 21, 2023.
  • November 15, 2023: The disclosure blog post was released and issues were filed with the original project repository.

Our audit of PyPI

By William Woodruff

This is a joint post with the PyPI maintainers; read their announcement here!

This audit was sponsored by the Open Tech Fund as part of their larger mission to secure critical pieces of internet infrastructure. You can read the full report in our Publications repository.

Late this summer, we performed an audit of Warehouse and cabotage, the codebases that power and deploy PyPI, respectively. Our review uncovered a number of findings that, while not critical, could compromise the integrity and availability of both. These findings reflect a broader trend in large systems: security issues largely correspond to places where services interact, particularly where those services have insufficiently specified or weak contracts.

PyPI

PyPI is the Python Package Index: the official and primary packaging index and repository for the Python ecosystem. It hosts half a million unique Python packages uploaded by 750,000 unique users and serves over 26 billion downloads every single month. (That’s over three downloads for every human on Earth, each month, every month!)

Consequently, PyPI’s hosted distributions are essentially the ground truth for just about every program written in Python. Moreover, PyPI is extensively mirrored across the globe, including in countries with limited or surveilled internet access.

Before 2018, PyPI was a large and freestanding legacy application with significant technical debt that accumulated over nearly two decades of feature growth. An extensive rewrite was conducted from 2016 to 2018, culminating in the general availability of Warehouse, the current codebase powering PyPI.

Various significant feature enhancements have been performed since then, including the addition of scoped API tokens, TOTP- and WebAuthn-based MFA, organization accounts, secret scanning, and Trusted Publishing.

Our audit and findings

Under the hood, PyPI is built out of multiple components, including third-party dependencies that are themselves hosted on PyPI. Our audit focused on two of its most central components:

  • Warehouse: PyPI’s “core” back end and front end, including the majority of publicly reachable views on pypi.org, as well as the PEP 503 index, public REST and XML-RPC APIs, and administrator interface
  • cabotage: PyPI’s continuous deployment infrastructure, enabling GitOps-style deployment by the PyPI administrators

Warehouse

We performed a holistic audit of Warehouse’s codebase, including the relatively small amount of JavaScript served to browser clients. Some particular areas of focus included:

  • The “legacy” upload endpoint, which is currently the primary upload mechanism for package submission to PyPI;
  • The administrator interface, which allows admin-privileged users to perform destructive and sensitive operations on the production PyPI instance;
  • All user and project management views, which allow their respectively privileged users to perform destructive and sensitive operations on PyPI user accounts and project state;
  • Warehouse’s AuthN, AuthZ, permissions, and ACL schemes, including the handling and adequate permissioning of different credentials (e.g., passwords, API tokens, OIDC credentials);
  • Third-party service integrations, including integrations with GitHub secret scanning, the PyPA Advisory Database, email delivery and state management through AWS SNS, and external object storages (Backblaze B2, AWS S3);
  • All login and authentication flows, including TOTP and WebAuthn-based MFA flows as well as account recovery and password reset flows.

During our review, we uncovered a number of findings that, while not critical, could potentially compromise Warehouse’s availability, integrity, or the integrity of its hosted distributions. We also uncovered a finding that would allow an attacker to disclose ordinarily private account information. Following a post-audit fix review, we believe that each of these findings has been mitigated sufficiently or does not pose an immediate risk to PyPI’s operations.

Findings of interest include:

  • TOB-PYPI-2: weak signature verification could allow an attacker to manipulate PyPI’s AWS SNS integration, including topic subscriptions and bounce/complaint notices against individual user emails.
  • TOB-PYPI-5: an attacker could use an unintentional information leak on the upload endpoint as a reconnaissance oracle, determining account validity without triggering ordinary login attempt events.
  • TOB-PYPI-14: an attacker with access to one or more of PyPI’s object storage services could cause cache poisoning or confusion due to weak cryptographic hashes.

Our overall evaluation of Warehouse is reflected in our report: Warehouse’s design and development practices are consistent with industry-standard best practices, including the enforcement of ordinarily aspirational practices such as 100% branch coverage, automated quality and security linting, and dependency updates.

cabotage

As with Warehouse, our audit of cabotage was holistic. Some particular areas of focus included:

  • The handling of GitHub webhooks and event payloads, including container and build dispatching logic based on GitHub events;
  • Container and image build and orchestration;
  • Secrets handling and log filtering;
  • The user-facing cabotage web application, including all form and route logic.

During our review, we uncovered a number of findings that, while not critical, could potentially compromise cabotage’s availability and integrity, as well as the availability and integrity of the containers that it builds and deploys. We also uncovered two findings that could allow an attacker to circumvent ordinary access controls or log filtering mechanisms. Following a post-audit fix review, we believe that these findings have been mitigated sufficiently or do not pose an immediate risk to PyPI’s operations (or other applications deployed through cabotage).

Findings of interest include:

  • TOB-PYPI-17: an attacker with build privileges on cabotage could potentially pivot into backplane control of cabotage itself through command injection.
  • TOB-PYPI-19: an attacker with build privileges on cabotage could potentially pivot into backplane control of cabotage itself through a crafted hosted application Procfile.
  • TOB-PYPI-20: an attacker with deployment privileges on cabotage could potentially deploy a legitimate-looking-but-inauthentic image due to GitHub commit impersonation.

From the report, our overall evaluation is that cabotage's codebase is not as mature as Warehouse's. In particular, our evaluation reflects operational deficiencies that are not shared with Warehouse: cabotage has a single active maintainer, has limited available public documentation, does not have a complete unit test suite, and does not use a CI/CD system to automatically run tests or evaluate code quality metrics.

Takeaways

Unit testing, automated linting, and code scanning are all necessary components in a secure software development lifecycle. At the same time, as our full report demonstrates, they cannot guarantee the security of a system or design: manual code review remains invaluable for catching interprocedural and systems-level flaws.

We worked closely with the PyPI maintainers and administrators throughout the audit and would like to thank them for sharing their extensive knowledge and expertise, as well as for actively triaging reports submitted to them. In particular, we would like to thank Mike Fiedler, the current PyPI Safety & Security Engineer, for his documentation and triage efforts before, during, and after the engagement period.

Adding build provenance to Homebrew

By William Woodruff

This is a joint post with Alpha-Omega—read their announcement post as well!

We’re starting a new project in collaboration with Alpha-Omega and OpenSSF to improve the transparency and security of Homebrew. This six-month project will bring cryptographically verifiable build provenance to homebrew-core, allowing end users and companies to prove that Homebrew’s packages come from the official Homebrew CI/CD. In a nutshell, Homebrew’s packages will become compliant with SLSA Build L2 (formerly known as Level 2).

As the dominant macOS package manager and popular userspace alternative on Linux, Homebrew facilitates hundreds of millions of package installs per year, including development tools and toolchains that millions of programmers rely on for trustworthy builds of their software. This critical status makes Homebrew a high-profile target for supply chain attacks, which this project will help stymie.

Vulnerable links in the supply chain

The software supply chain is built from individual links, and the attacker’s goal is to break the entire chain by finding and subverting the weakest link. Conversely, the defender aims to strengthen every link because the attacker needs to break only one to win.

Previous efforts to strengthen the entire chain have focused on various links:

  • The security of the software itself: static and dynamic analyses, as well as the rise of programming languages intended to eliminate entire vulnerability classes
  • Transport security: the use of HTTPS and other authenticated, integrity-preserving channels for retrieving and publishing software artifacts
  • Packaging index and manager security: the adoption of 2FA by package indices, as well as technologies like PyPI’s Trusted Publishing for reducing the “blast radius” of package publishing workflows

With this post, we’d like to spotlight another link that urgently needs strengthening: opaque and complex build processes.

Taming beastly builds with verifiable provenance

Software grows in complexity over time, and builds are no exception; modern build processes contain all the indications of a weak link in the software supply chain:

  • Opaque, unauditable build hosts: Much of today’s software is built on hosted CI/CD services, forming an implicit trust relationship. These services inject their dependencies into the build environment and change constantly—often for important reasons, such as patching vulnerable software.
  • Large, dense dependency graphs: We rely more than ever on small third-party dependencies, often maintained (or not) by hobbyists with limited interest or experience in secure development. The pace of development we’ve come to expect necessitates this dense web of small dependencies. Still, their rise (along with the rise of automatic dependency updating) means that all our projects contain dozens of left-pad incidents waiting to happen.
  • Complex, unreproducible build systems and processes: Undeclared and implicit dependencies, environments that cannot be reproduced locally, incorrect assumptions, and race conditions are just a few of the ways in which builds can misbehave or fail to reproduce, leaving engineers in the lurch. These reliability and usability problems are also security problems in our world of CI/CD and real-time security releases.

Taming these complexities requires visibility into them. We must be able to enumerate and formally describe the components of our build systems to analyze them automatically. This goes by many names and covers many techniques (SBOMs, build transparency, reproducibility, etc.), but the basic idea is one of provenance.

At the same time, collecting provenance adds a new link to our chain. Without integrity and authenticity protections, provenance is just another piece of information that an attacker could potentially manipulate.

This brings us to our ultimate goal: provenance that we can cryptographically verify, giving us confidence in our claims about a build’s origin and integrity.

Fortunately, all the building blocks for verifiable provenance already exist: Sigstore gives us strong digital signatures bound to machine (or human) identities, DSSE and in-toto offer standard formats and signing procedures for crafting signed attestations, and SLSA provides a formal taxonomy for evaluating the strength and trustworthiness of our statements.

Verifiable provenance for Homebrew

What does this mean for Homebrew? Once complete, every single bottle provided by homebrew-core will be digitally signed in a way that attests it was built on Homebrew’s trusted CI/CD. Those digital signatures will be provided through Sigstore; the attestations behind them will be performed with the in-toto attestation framework.

Even if an attacker manages to compromise Homebrew’s bottle hosting or otherwise tamper with the contents of the bottles referenced in the homebrew-core formulas, they will not be able to forge an authentic digital signature for their changes.

This protection complements Homebrew’s existing integrity and source-side authenticity guarantees. Once provenance on homebrew-core is fully deployed, a user who runs brew install python will be able to prove each of the following:

  1. The formula metadata used to install Python is authenticated, thanks to Homebrew’s signed JSON API.
  2. The bottle has not been tampered with in transit, thanks to digests in the formula metadata.
  3. The bottle was built in a public, auditable, controlled CI/CD environment against a specific source revision.

That last property is brand new and is equivalent to Build L2 in the SLSA taxonomy of security levels.

Follow along!

This work is open source and will be conducted openly, so you can follow our activity. We are actively involved in the Sigstore and OpenSSF Slacks, so please drop in and say hi!

Alpha-Omega, an associated project of OpenSSF, is funding this work. The Alpha-Omega mission is to protect society by catalyzing sustainable security improvements to the most critical open-source software projects and ecosystems. OpenSSF holds regularly scheduled meetings for its working groups and projects, and we’ll be in attendance.

The issue with ATS in Apple’s macOS and iOS

By Will Brattain

Trail of Bits is publicly disclosing a vulnerability (CVE-2023-38596) that affects iOS, iPadOS, and tvOS before version 17, macOS before version 14, and watchOS before version 10. The flaw resides in Apple’s App Transport Security (ATS) protocol handling. We discovered that Apple’s ATS fails to require the encryption of connections to IP addresses and *.local hostnames, which can leave applications vulnerable to information disclosure vulnerabilities and machine-in-the-middle (MitM) attacks.

Note: Apple published an advisory on September 18, 2023 confirming that CVE-2023-38596 had been fixed.

Background

ATS is a network security feature enabled by default in applications linked against iOS 9+ and macOS 10.11+ software development kits (SDK). ATS requires the use of the Transport Layer Security (TLS) protocol for network connections made by an application. Before iOS version 10 and macOS version 10.12, ATS disallowed connections to .local domains and IP addresses by default but allowed for the configuration of exceptions. As of iOS version 10 and macOS version 10.12, connections to .local domains and IP addresses are allowed by default.

Proof of concept

We created a simple app protected by ATS that POSTs to a user-specified URL. The following table summarizes the tests we performed. Notably, to demonstrate the flaw in ATS’s protocol handling, we submitted POST requests to an unencrypted IP address and *.local domain and observed that the requests succeeded when they should not have, as shown in figure 1.

Note: The URLs with the IP address (http://174.138.48.47/) and the local domain (http://ats-poc.local) both map to http://ie.gy/.

Figure 1: We submitted POST requests to an unencrypted IP address (left) and *.local domain (right). Both requests succeeded.

This behavior demonstrates that ATS requirements are not enforced on requests to .local domains and IP addresses. Thus, network connections established by iOS and macOS apps through the URL Loading System may be susceptible to information disclosure vulnerabilities and MitM attacks—both of which pose a risk to the confidentiality and integrity of transmitted data.

An exploit scenario

An app is designed to securely transfer data to WebDAV servers. The app relies on ATS to ensure that traffic to user-provided URLs (WebDAV servers) is protected using encryption. When a URL is added, the app makes a request to it, and if ATS does not block the connection, it is assumed to be safe.

A user unwittingly adds a URL that specifies an IP address (e.g., http://174.138.48.47/) in the form, which ATS allows even though the connection is not encrypted. The user then accesses the URL from an insecure network, such as mall WiFi. Because the traffic is not encrypted, an attacker who can capture the network traffic can access all data sent to the server, including basic auth credentials; those credentials in turn enable the attacker to recover all sensitive data stored on the WebDAV server that is accessible to the compromised user.

Check your apps!

Now that Apple requires encryption for connections to .local domains and IP addresses, developers should check that their apps continue to work if they rely on such addresses.

Coordinated disclosure

As part of the disclosure process, we reported the vulnerability to Apple first. The timeline of the disclosure is provided below:

  • October 21, 2022: Discovered ATS vulnerability.
  • November 3, 2022: Disclosed the vulnerability to Apple and communicated that we planned to publicly disclose on December 5.
  • November 14, 2022: Apple requested a delay until February 2023; we requested details about why the delay was necessary.
  • November 16, 2022: Agreed to delay after Apple explained their internal testing and validation processes.
  • November 28, 2022: Requested a status update from Apple.
  • November 29, 2022: Apple confirmed that they were still investigating the vulnerability.
  • December 9, 2022: Apple confirmed the vulnerability and continued its investigation.
  • January 31, 2023: Apple delayed the release due to the potential impact on apps and developers.
  • March 31, 2023: Requested a status update from Apple.
  • April 10, 2023: Apple indicated they were preparing an update regarding the remediation timeline.
  • April 18, 2023: Apple indicated that a fix would be ready for a post-WWDC beta release.
  • September 18, 2023: Apple published an advisory confirming that CVE-2023-38596 had been fixed.

Numbers turned weapons: DoS in Osmosis’ math library

By Sam Alws

Trail of Bits is publicly disclosing a vulnerability in the Osmosis chain that allows an attacker to craft a transaction that takes up a disproportionate amount of compute time on Osmosis nodes compared to the amount of gas it consumes. Using the vulnerability, an attacker can halt the Osmosis chain by spamming validators with these transactions. After we informed the Osmosis developers about this bug, they performed a hard fork that fixed the vulnerability, avoiding the attack.

Osmosis is a Cosmos chain with native functionality for swap pools. Users exchange hundreds of thousands of dollars of value daily on Osmosis’ pools. Naturally, these pools need to perform a significant amount of fairly precise calculations, and that’s where our bug comes in.

The vulnerability

We found the vulnerability in Osmosis’ math library, which is used to give approximate answers to mathematical functions. In particular, the bug affected their exponentiation function. A Taylor series approximation (the generalized binomial series) was used to calculate a^b:

a^b = (1 + (a-1))^b = 1 + b(a-1) + [b(b-1)/2!](a-1)^2 + [b(b-1)(b-2)/3!](a-1)^3 + …

Note the “…” at the end: since we’re working with computers and have only a finite amount of time to do this calculation, we need to choose when to stop. An intuitive choice here would be to stop when the terms we’re adding onto the end are sufficiently small; once that happens, we know we’re “close enough” to the real answer. This is exactly what the Osmosis developers did. Here’s a pseudocode version of their implementation:

// calculate a^b
// assumption: a is between 0 and 2, b is between 0 and 1
fn PowApprox(a,b) {
  total <- 1
  i <- 0
  term <- 1
  const precision = 0.00000001
  // (the real implementation took precision as a function parameter rather than a constant)
  while abs(term) >= precision {
    i <- i + 1
    term <- term * ((b-(i-1)) / i) * (a-1)
    total <- total + term
  }
  return total
}

However, there’s a problem with this implementation. The while loop runs until term is sufficiently small, but there is no bound on the maximum number of iterations. If we hand-pick values of a and b, we can make this loop take a very large number of iterations to terminate. In particular, calculating 1.99999999999999^0.1 using PowApprox takes over two million iterations, running for over 800 milliseconds on an M1 processor.
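
As a rough illustration, here is a small, self-contained C version of the loop above (hypothetical code using double-precision floats; the real implementation uses a fixed-point decimal type, so the exact iteration count differs slightly):

#include <math.h>
#include <stdio.h>

// Count the iterations PowApprox needs for the adversarial inputs
// a = 1.99999999999999 and b = 0.1 with a precision of 10^-8.
int main(void) {
    const double a = 1.99999999999999;
    const double b = 0.1;
    const double precision = 0.00000001;
    double total = 1.0;
    double term = 1.0;
    long i = 0;

    while (fabs(term) >= precision) {
        i++;
        term *= ((b - (i - 1)) / i) * (a - 1);
        total += term;
    }
    printf("a^b ~= %.10f after %ld iterations\n", total, i);
    return 0;
}

Because the ratio between consecutive terms approaches -(a-1), which is nearly -1 for this input, the terms shrink extremely slowly, and the loop terminates only after a couple million iterations.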

This very long runtime is not accounted for in the gas costs of transactions that use the PowApprox function. This means that if an attacker can craft a transaction that calls PowApprox(1.99999999999999, 0.1), they can take up just under a second of runtime on an Osmosis node without having to pay very much gas in exchange. By doing this repeatedly, they can bring the whole chain to a halt.

Luckily for the attacker, such a transaction does exist. There is a call to PowApprox in the following piece of code, used in Osmosis to calculate the amount of shares to give when someone deposits tokens into a swap pool:

shares_to_give = current_total_shares *
    (1 - PowApprox((current_total_tokens + tokens_added) / current_total_tokens,
                   token_weight))

(Note: The real implementation uses a different function called Pow, which is essentially just a wrapper around PowApprox that makes sure all the inputs are in the correct range.)

So if an attacker makes a pool where tokenA has a weight of 0.1, initializes it with 1.0 tokenA, and then deposits 0.99999999999999 more of tokenA, they can trigger the long calculation in PowApprox. By repeatedly depositing and withdrawing this 0.99999999999999 tokenA, they can get the Osmosis nodes stuck calculating PowApprox over and over, and halt the chain!

A simple solution

Luckily, the fix for this problem was very simple: limit the number of loop iterations, and revert the transaction if the limit is reached. Osmosis’ recent hard fork pushed this fix, preventing the attack. As for how to prevent similar bugs from popping up elsewhere, our recommendation is simple: fuzzing. Testing the PowApprox function with a 100ms timeout using gofuzz would’ve quickly detected the bug. Go’s native fuzzer also detects the bug when a 10ms timeout is used instead.

We reported the vulnerability to the Osmosis team on September 6, 2023. A PR containing the fix was merged on October 6, 2023, and a hard fork applying this fix was performed on October 23, 2023.

We would like to thank the Osmosis team for working swiftly with us to address these issues.

Introducing invariant development as a service

Understanding and rigorously testing system invariants are essential aspects of developing robust smart contracts. Invariants are facts about the protocol that should remain true no matter what happens. Defining and testing these invariants allows developers to prevent the introduction of bugs and make their code more robust in the long term. However, it is difficult to build up the internal knowledge and processes needed to create and maintain such invariants. As a result, only the most mature teams have integrated invariants into their development life cycle.

Recognizing this need, we are thrilled to announce our new service: Invariant Development. Clients of this service will receive:

  • Invariants, as code and specification
  • Guidance on how to integrate the invariants in their development lifecycle
  • Training on how to write invariants
  • Preferential treatment for additional Trail of Bits services

Comprehensive invariant development

Our invariant development service identifies, develops, and tests invariants for your codebase. While our security reviews typically encompass some development of invariants in areas believed to contain bugs, this service aims to cover invariants more broadly across your codebase, helping you achieve a more holistic approach to long-term security throughout your development lifecycle—not just at the end.

This service is particularly well suited for codebases that are still in development, as it will equip your engineers to write more secure contracts in the long term.

Trail of Bits engineers will lead discussions with your team to identify and understand the different invariants of the system. Our service includes the following activities:

  • Invariant identification. Based on our experience and discussion with the developers, we will identify potential invariants. This can include function-level invariants that must hold with respect to the execution of the function (e.g., addition is commutative) or system-level invariants (e.g., the balance of a user cannot be greater than the total supply). We will specify the invariants in English and identify their pre-conditions (e.g., a parameter is within a given bound).
  • Invariant implementation. We will implement part of the identified invariants in Solidity. We will identify the best testing approach (internal, external, or partial testing), create the relevant wrappers, and set the fuzzing initialization (contract deployments and pre-conditions). We will aim to minimize disruption to the codebase, and will select the most appropriate approach to ensure that the invariants can be used in the long term.
  • Invariant testing and integration. We will run the invariants locally and on dedicated cloud infrastructure. We will refine the specification based on the fuzz run results, identify arithmetic bounds, and narrow the precondition to reflect realistic scenarios. We will work with the development team to integrate the fuzzing in the CI (e.g., through GitHub actions) for short-term fuzzing campaigns, and we will provide recommendations to run long-term fuzzing campaigns locally or in the cloud.
  • Training and guidance. Through this service, our engineers will aim to upskill your team, empowering them to write their own invariants and to make fuzzing an integral part of your development process. We will provide guidance and advice on how to maintain the provided invariants and write new ones. Additionally, our experts will provide design recommendations tailored to optimize the codebase for fuzzing. Finally, we will invite the developers to co-write invariants with our engineers for immediate feedback.

In addition, customers that go through our invariant offering will receive preferential treatment for additional Trail of Bits services. For example, our engineers will leverage the knowledge gained during the invariant development to reduce the effort and cost needed for a security review.

Trail of Bits is uniquely positioned to offer this service. Our engineers have been writing invariants for more than half a decade (for examples, see the Balancer, Primitive, and Liquity reports). We are the authors of multiple fuzzers (Echidna, Medusa, test-fuzz) and of numerous educational materials on fuzzing (150+ predefined invariants, the “How to fuzz like a pro” conference workshop, a 10-hour fuzzing workshop, and fuzzing tutorials).

Enhance your security

Invariant-based development is set to become a standard for smart contract developers. Our new offering will allow you to do the following:

  • Become proactive instead of reactive in securing your codebase. Invariants prevent the introduction of bugs and address their root causes.
  • Identify and develop the most impactful invariants. Understanding which invariants will have an impact on security requires dedicated expertise, which our team will provide.
  • Educate the team on invariant-driven development. This reorients the development lifecycle toward bug prevention, and enables developers to integrate invariant reasoning into their development process.

Contact us to take advantage of our experience to secure your codebase.

Pitfalls of relying on eBPF for security monitoring (and some solutions)

By Artem Dinaburg

eBPF (extended Berkeley Packet Filter) has emerged as the de facto Linux standard for security monitoring and endpoint observability. It is used by technologies such as BPFTrace, Cilium, Pixie, Sysdig, and Falco due to its low overhead and its versatility.

There is, however, a dark (but open) secret: eBPF was never intended for security monitoring. It is first and foremost a networking and debugging tool. As Brendan Gregg observed:

eBPF has many uses in improving computer security, but just taking eBPF observability tools as-is and using them for security monitoring would be like driving your car into the ocean and expecting it to float.

But eBPF is being used for security monitoring anyway, and developers may not be aware of the common pitfalls and under-reported problems that come with this use case. In this post, we cover some of these problems and provide workarounds. However, some challenges with using eBPF for security monitoring are inherent to the platform and cannot be easily addressed.

Pitfall #1: eBPF probes are not invoked

In theory, the kernel is never supposed to fail to fire eBPF probes. In practice, it does. Sometimes, although very rarely, the kernel will not fire eBPF probes when user code expects to see them. This behavior is not explicitly documented or acknowledged, but you can find hints of it in bug reports for eBPF tooling.

This bug report provides valuable insight. First, the issues involved are rare and difficult to debug. Second, the kernel may be technically correct, but the observed behavior on the user side is missing events, even if the proximate cause was different (e.g., too many active probes). Comments on the bug report present two theories for why events are missing.

More of these issues are likely lurking in the kernel, either as documented edge cases or as surprising emergent effects of unrelated design decisions. eBPF is not a security monitoring mechanism, so there is no guarantee that probes will fire as expected.

Workarounds

None. The callback logic and the maximum number of active kretprobes are hard-coded into the kernel. While one can manually edit and rebuild the kernel source, doing so is not advisable or feasible for most scenarios. Any tool relying on eBPF must be prepared for an occasional missing callback.

Pitfall #2: Data is truncated due to space constraints

An eBPF program’s stack space is limited to 512 bytes. When writing eBPF code, developers need to be particularly cautious about how much scratch data they use and the depth of their call stacks. This limit affects both the amount and kind of data that can be processed using eBPF code. For instance, 512 bytes is less than the longest permitted file path length, which is 4,096 bytes.

Workarounds

There are multiple options to get more scratch space, but they all involve cheating. Thanks to the bpf_map_lookup_elem helper, it’s possible to use a map’s memory directly. Directly using maps as storage effectively functions as malloc, but for eBPF code. A plausible implementation is a per-CPU array with a single key, whose size corresponds to our allocation needs:

u64 first_key = 0;
u8 *scratch_buffer = per_cpu_map.lookup(&first_key); // implemented with bpf_map_lookup_elem
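
In BCC-style code, the backing map for this pattern might be declared as follows (a sketch; the map name matches the snippet above, and the slot size is arbitrary):

// One scratch slot per CPU, allocated outside the 512-byte eBPF stack.
// 8,192 bytes fits a maximum-length 4,096-byte path plus metadata.
struct scratch_slot {
    u8 data[8192];
};
BPF_PERCPU_ARRAY(per_cpu_map, struct scratch_slot, 1);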

However, how do we send this data back to our user mode code? A naive approach is to use even more maps, but this approach fails with variable-sized objects like paths and it also wastes memory. Maps can be very expensive in terms of memory use because data must be replicated per CPU to ensure integrity. Unfortunately, per-CPU maps allocate memory based on the number of possible hot-swappable CPUs. This number can easily be huge—on VMware Fusion, it defaults to 128, so a single map entry wastes 127 times as much space as it uses.

Another approach is to stream data through the perf ring buffer. The linuxevents library uses this method to handle variable paths. The following is an example pseudocode implementation of this approach:

u64 first_key = 0;
u8 *scratch_space = per_cpu_array.lookup(&first_key);
for (const auto &component_ptr : path.components()) {
  bpf_probe_read_str(scratch_space, component_ptr, scratch_space_size);
  perf_submit(scratch_space);
}

Streaming data through the perf ring buffer significantly increases the effective size of each component and also enhances space efficiency, albeit at the expense of additional data reconstruction work. To handle edge cases like untriggered probes or lost/overwritten data, a recovery method must be implemented after data transmission. Unfortunately, perf buffers are allocated in a similar way to per-CPU maps. On newer systems, the BPF ring buffer can be used instead to avoid that issue, since a single ring buffer is shared across all CPUs.
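
Here is a minimal libbpf-style sketch of the BPF ring buffer approach (the event layout and attach point are hypothetical):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event {
    __u32 pid;
    char comm[16];
};

// A single ring buffer shared by all CPUs, unlike per-CPU perf buffers.
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(void *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0; // buffer full: consumers must tolerate missing events

    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";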

Pitfall #3: Limited instruction count

An eBPF program can have only 4,096 instructions, and reusing code (e.g., by defining a function) is not possible. Until recently, loops were not supported (or they had to be manually unrolled). While eBPF allows a maximum of 1 million instructions to be executed at runtime, the program can still be only 4,096 instructions long.

Workarounds

Rebuild your programs to take advantage of bounded loops (i.e., loops where the iteration count can be statically determined). These loops are now supported, and they save precious program space compared to manually unrolled loops. Another workaround for increasing the effective program size is to split the logic across multiple programs that tail call each other; a chain can include up to 32 tail calls before execution is interrupted. A drawback of this approach is that program state is lost between each transition. To keep state across tail calls, consider storing data in an eBPF map accessible by all 32 programs.
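
The tail-call pattern looks roughly like this (a libbpf-style sketch; program and map names are hypothetical):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// A program array that user mode populates with the next stage's program fd.
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} next_stage SEC(".maps");

SEC("kprobe/some_function")
int stage1(void *ctx)
{
    // ...up to 4,096 instructions of work; stash any state needed later
    // in a regular eBPF map, because registers and stack do not survive...
    bpf_tail_call(ctx, &next_stage, 0);
    // Reached only if the tail call fails (e.g., the slot is empty).
    return 0;
}

char LICENSE[] SEC("license") = "GPL";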

Pitfall #4: Time-of-check to time-of-use issues

An eBPF program can and will run concurrently on different CPU cores. This is true even for kernel code. Since there is no way to call kernel synchronization functions or to reliably acquire locks from eBPF, data races and time-of-check to time-of-use issues are a serious concern.

Workarounds

The only workaround is to carefully choose the event attach point, depending on the program. For example, eBPF commonly needs to work with functions that accept user data. In this situation, a good attach point is right after user data has been read into kernel mode.

When dealing with kernel code where synchronization is involved, you may not be able to mitigate time-of-check to time-of-use issues. For example, the dentry structure that backs files is often modified under lock by the kernel, and it is impossible to acquire these locks from an eBPF probe. Often, the only indication that something is wrong is a bad return code from an API like bpf_probe_read_user. Make sure to handle such errors in a way that does not make the event data completely unusable. For example, if you are streaming data through perf in different packets, insert an error packet that notifies clients of missing data so that they can realign themselves to the event stream without causing corruption.

Pitfall #5: Event overload

Because eBPF lacks concurrency primitives and an eBPF probe cannot block the event producer, an attach point can be easily overwhelmed with events. This can lead to the following issues:

  1. Missed events, as the kernel stops calling the probe
  2. Data loss due to the lack of storage space for new data
  3. Data loss due to the complete overwriting of older but not yet consumed data by newer information
  4. Data corruption from partial overwrites or complex data formats, disrupting normal program operation

These data loss and corruption scenarios depend on the number of probes and events that are adding items into the event stream and on the extent of system activity. For instance, a Docker container startup sequence or a deployment script can trigger a surprisingly large number of events. Developers should choose the events to be monitored carefully and should avoid repetition and constructs that can make it harder to recover from data loss.

Workarounds

The user-mode helper should treat all data coming from eBPF probes as untrusted. This includes data from your own eBPF probes, which is also susceptible to accidental corruption. There should also be some application-level mechanism to detect missing or corrupted data.
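
One simple mechanism (a hypothetical layout) is to prefix every event with a per-CPU sequence number so that the user-mode consumer can detect gaps in the stream:

// Prepended to every event submitted from eBPF. The consumer tracks the
// last sequence number seen for each CPU; any jump indicates lost events.
struct event_header {
    __u64 sequence;     // monotonically increasing, per CPU
    __u32 cpu;          // CPU the event was produced on
    __u32 payload_size; // bytes of event data that follow
};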

Pitfall #6: Page faults

Memory that has not been accessed recently may be paged out to disk—be it a swap file, a backing file, or a more esoteric location. Normally, when this memory is needed, the kernel will issue a page fault, load the relevant content, and continue execution. For various reasons, eBPF runs with page faults disabled—if memory is paged out, it cannot be accessed. This is bad news for a security monitoring tool.

Workarounds

The only workaround is to hook right after a buffer is used and hope it does not get paged out before the probe reads it. This cannot be strictly guaranteed since there are no concurrency primitives, but the way the hook is implemented can increase the likelihood of success.

Consider the following example:

int syscall_name(const char *user_mode_ptr) {
  function1();
  function2(user_mode_ptr); // user data is first consumed here
  function3();
  return 0;
}

To make sure that user_mode_ptr can be accessed, this code first hooks into the entry of syscall_name and saves all of the pointer parameters in a map. It then searches for a place where user_mode_ptr is almost certainly accessible (i.e., anything past the call to function2) and sets an attach point there to read the data. The following are some options for the attach point:

  1. On function2 exit
  2. On function3 entry
  3. On function3 exit
  4. On syscall_name exit
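
Here is a libbpf-style sketch of this save-then-read pattern, using option 1 (a kretprobe on function2) as the attach point. All function names come from the hypothetical example above, and reading the argument with PT_REGS_PARM1 assumes the target architecture is defined (e.g., __TARGET_ARCH_x86):

#include <linux/ptrace.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// The saved first argument of syscall_name, keyed by thread.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u64);   // pid_tgid
    __type(value, __u64); // user_mode_ptr
} saved_args SEC(".maps");

SEC("kprobe/syscall_name")
int on_syscall_entry(struct pt_regs *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 ptr = (__u64)PT_REGS_PARM1(ctx); // user_mode_ptr
    bpf_map_update_elem(&saved_args, &id, &ptr, BPF_ANY);
    return 0;
}

SEC("kretprobe/function2")
int on_function2_exit(struct pt_regs *ctx)
{
    char buf[64];
    __u64 id = bpf_get_current_pid_tgid();
    __u64 *ptr = bpf_map_lookup_elem(&saved_args, &id);
    if (!ptr)
        return 0; // function2 was called outside of syscall_name; ignore

    // The buffer was just consumed by function2, so it is very likely
    // resident, but the read can still fail and must be handled.
    if (bpf_probe_read_user_str(buf, sizeof(buf), (const void *)*ptr) < 0) {
        // e.g., emit an error event so consumers know data is missing
    }
    bpf_map_delete_elem(&saved_args, &id);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";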

You may be wondering why we don’t just hook function2 directly. While this can work occasionally, it is normally a bad idea:

  1. function2 is often called outside of the context you are interested in (i.e., outside of syscall_name).
  2. function2 may not have the same signature across kernel revisions. If we just use the function as an opaque breakpoint, signature changes do not affect our probe.

Also note that, at times, the parameter changes during a system call, and we need to read it before the data is gone. For example, the execve system call replaces the entire process memory, erasing all initial data before the call completes.

Again, developers should assume that some memory may be unreadable by the eBPF probe and develop accordingly.

Embracing benefits, addressing limitations

eBPF is a powerful tool for Linux observability and monitoring, but it was not designed for security and comes with inherent limitations. Developers need to be aware of pitfalls like probe unreliability, data truncation, instruction limits, concurrency issues, event overload, and page faults. Workarounds exist, but they are imperfect and often add complexity.

The bottom line is that while eBPF enables exciting new capabilities, it is not a silver bullet. Software using eBPF for security monitoring must be built to gracefully handle missing data and error conditions. Robustness needs to be a top priority.

With care and creativity, eBPF can still be used to build next-generation security tools. But it requires acknowledging and working around eBPF’s constraints, not ignoring them. As with any technology, the most effective security monitoring solutions will embrace eBPF while being aware of how it can fail.

Don’t overextend your Oblivious Transfer

By Joop van de Pol

We found a vulnerability in a threshold signature scheme that allows an attacker to recover the signing key of threshold ECDSA implementations that are based on Oblivious Transfer (OT). A malicious participant in the threshold signing protocol could perform selective abort attacks during the OT extension subprotocol, recover the secret values of other parties, and eventually recover the signing key. Using this key, the attacker could assume the identities of users, gain control over critical systems, or pilfer financial assets. While we cannot yet disclose the client software affected by this vulnerability, we believe it is instructive for other developers implementing MPC protocols.

Protecting against this vulnerability is straightforward: since the attack relies on causing selective aborts during several protocol rounds, punishing or excluding participants that cause selective aborts is sufficient. Still, it’s a good example of a common problem that often leads to severe or even critical issues: a disconnect between assumptions made by academics and the implementers trying to build these protocols efficiently in real systems.

Threshold signature schemes

Threshold signature schemes (TSS) are powerful cryptographic objects that allow decentralized control of the signing key for a digital signature scheme. They are a specific application of the more general multi-party computation (MPC), which aims to decentralize arbitrary computation. Each TSS protocol is typically defined for a specific digital signature scheme because different signature schemes require different computations to create signatures.

Current research aims to define efficient TSS protocols for various digital signature schemes. The target for efficiency includes both computation and communication between the different participants. Typically, TSS protocols rely on standard techniques used in MPC, such as secret sharing, zero-knowledge proofs, and multiplicative-to-additive conversion.

Threshold ECDSA

The ECDSA signature scheme is widely used. However, threshold schemes for ECDSA are generally more complicated than those for other signature schemes. This is because an ECDSA signature requires the computation of the modular inverse of a secret value.

Various MPC techniques can be used to distribute the computation of this modular inverse. Currently, one line of work in threshold signature schemes for ECDSA uses the homomorphic Paillier encryption scheme for this purpose, as shown in work by Lindell et al., Gennaro et al., and following works. This blog post will focus on schemes that rely on oblivious transfer (OT), such as the work by Doerner et al. and following works, or Cait-Sith.

Before explaining what OT is, it should be noted that the basic variant is relatively inefficient. To mitigate this issue, researchers proposed something called OT extension, where a small number of OTs can efficiently be turned into a larger number of OTs. This feature is eagerly used by the creators of threshold signature schemes, as you can run the setup of the small number of OTs once, and then extend arbitrarily many times.

Oblivious transfer

Oblivious transfer is like the setup of a magician’s card trick. A magician has a bunch of cards and wants you to choose one of them, so they can show off their magic prowess by divining which card you chose. To this end, it is important that the magician does not know which card you chose, but also that you choose exactly one card and cannot claim later that you actually chose another card.

In real life, the magician could let you write something on the card that you chose, forcing you to choose exactly one card. However, this would not be good enough for the cryptographic setting, because the magician could afterwards just look at all the cards (using their impressive sleight of hand to hide this fact) and pick out the card that has text on it. A better solution would be to have the magician write a random word on each card, such that you can choose a card by memorizing this word. Now, in real life the magician might allow you to look at multiple cards before choosing one, whereas in the cryptographic case you have to choose a card blindly such that you only learn the random word written on the card that you chose.

After you choose a card and give it back to the magician (shuffling the cards first so that they cannot directly pick out the card that you returned), the magician can try to figure out which card you chose. In real life, the magician will use all kinds of tricks to try to pick out your card, whereas in the cryptographic setting, they actually should not be able to.

So, in a nutshell, OT is about a sender (the magician) who wants to give a receiver (you, the mark) a choice between some values (cards). The sender should not learn the receiver’s choice, and the receiver should not learn any other values than the chosen one.

It turns out that OT is a very powerful MPC primitive, as it can be used as a building block to construct protocols for any arbitrary multi-party computation. However, implementing OT without any special assumptions requires asymmetric cryptography, which is relatively expensive. Using expensive building blocks will lead to an inefficient protocol, so something more is needed to be able to use OT in practice.

OT extension

As noted above, OT requires either asymmetric cryptography or “special assumptions.” In this context, the special assumption is that the two parties already have access to something called correlated randomness. Even better, this correlated randomness can be created from the output of an OT protocol.

As a result, it is possible to run the expensive OT protocol a number of times, and then to extend these “base” OTs into many more OTs. This extension is possible using only symmetric cryptography (such as hash functions and pseudo-random generators), which makes it more efficient than the expensive asymmetric variants.

For this blog post, we will focus on a particular line of work in OT extension, starting with this paper by Ishai et al. It is a bit too complicated to explain in detail how this scheme works, but the following points are important:

  • It uses only symmetric primitives (pseudo-random generator and hash function).
  • The role of sender and receiver is swapped (sender in base OT becomes receiver in extended OT and vice versa).
  • The protocol includes constructing randomness that is correlated with both the old choices (of the base OTs) and the new choices (of the OT extension).
  • The extended OT sender cannot cheat, but the protocol is not secure against a cheating extended OT receiver.
What does this last point mean? The extended OT receiver can cheat and learn the original choice bits belonging to the extended OT sender. Ishai et al. proposed a solution, but it is not very efficient. Therefore, follow-up works such as those by Asharov et al. and by Keller et al. add a kind of consistency check, where the extended OT receiver has to provide some additional information. The extended OT sender can then use this information to verify that the receiver did not cheat.

These consistency checks restrict how much the receiver can learn about the sender’s secret choices, but they are not perfect. The check that the sender performs to verify the receiver’s information depends on the sender’s own secret choices. Therefore, the receiver can still cheat in specific places such that they learn some bits of the sender’s secret choices based on whether or not the sender aborts. This is known as a selective abort attack, as the receiver can selectively try to cause an abort and learns some information from the sender as a result.

The aforementioned papers acknowledge that this kind of leakage can happen with a cheating receiver. However, the authors choose the parameters of the scheme such that the receiver can never learn enough about the sender’s original choice bits when running the protocol once. Problem solved, right?

How the vulnerability works

Recall that in the context of threshold signature schemes based on OT, you want to perform the base OTs once during a set-up phase and reuse this set-up arbitrarily many times to perform OT extension. Since this improves efficiency, implementers will jump on it. What is not mentioned very explicitly, and what caused the vulnerability that we found, is that you can reuse the set-up arbitrarily many times only if the OT extension receiver does not cheat.

If the receiver cheats, then they can learn a few bits of the secret set-up value of the OT extension sender. This does become a problem if you allow the receiver to do this multiple times over different executions of the protocol. Eventually, the receiver learns all secret sender bits, and the security is completely compromised. Typically, depending on the specific TSS, the receiver can now use the secret sender bits to recover sender shares corresponding to the ECDSA nonce. In a scheme with a threshold of two, this means that the receiver recovers the nonce, and they can recover the ECDSA signing key given a valid signature with this nonce. (In schemes with more parties, the attacker may have to repeat this attack for multiple parties.)

So what’s the issue here exactly? Selective abort attacks are known and explicitly discussed in OT extension papers, but those papers are not very clear on whether you can reuse the base OTs. Implementers and TSS protocol designers want to reuse the base OTs arbitrarily many times, because that’s efficient. TSS protocol designers know that selective abort attacks are an issue, so they even specify checks and consider the case closed, but they are not very clear on what implementers are supposed to do when checks fail. This kind of vagueness in academic papers invariably leads to attacks on real-world systems.

In this case, a clear solution would be to throw away the setup for a participant that has attempted to cheat during the OT extension protocol. Looking at some public repositories out there, most OT extension libraries will report something along the lines of “correlation check failed,” which does not tell a user what to do next. In fact, only one library added a warning that a particular check’s failure may represent an attack and that you should not re-run the protocol.

Bridging the gap between academia and implementations

Most academic MPC papers provide a general overview of the scheme and corresponding proof of security; however, they don’t have the detail required to constitute a clear specification and aren’t intended to be blueprints for implementations. Making wrong assumptions when interpreting academic papers to create real-world applications can lead to severe issues. We hope that the recent call from NIST for Multi-Party Threshold Cryptography will set a standard for specifications of MPC and TSS and prevent such issues in the future.

In the meantime, if you’re planning to implement threshold ECDSA, another TSS, or MPC in general, you can contact us to specify, implement, or review those implementations.