Semantic Analysis of Native Programs with CodeReason

February 23, 2014

darpa, program-analysis, static-analysis

Page content

Have you ever wanted to make a query into a native mode program asking about program locations that write a specific value to a register? Have you ever wanted to automatically deobfuscate obfuscated strings?

Reverse engineering a native program involves understanding its semantics at a low level until a high level picture of functionality emerges. One challenge facing a principled understanding of a native mode program is that this understanding must extend to every instruction used by the program. Your analysis must know which instructions have what effects on memory calls and registers.

We’d like to introduce CodeReason, a machine code analysis framework we produced for DARPA Cyber Fast Track. CodeReason provides a framework for analyzing the semantics of native x86 and ARM code. We like CodeReason because it provides us a platform to make queries about the effects that native code has on overall program state. CodeReason does this by having a deep semantic understanding of native instructions.

Building this semantic understanding is time-consuming and expensive. There are existing systems, but they have high barriers to entry or don’t do precisely what we want, or they don’t apply simplifications and optimizations to their semantics. We want to do that because these simplifications can reduce otherwise hairy optimizations to simple expressions that are easy to understand. To motivate this, we’ll give an example of a time we used CodeReason.

Simplifying Flame

Around when the Flame malware was revealed, some of its binaries were posted onto malware.lu. Their overall scheme is to store the obfuscated string in a structure in global data. The structure looks something like this:

struct ObfuscatedString {
  char padding[7];
  char hasDeobfuscated;
  short stringLen;
  char string[];
};

Each structure has variable-length data at the end, with 7 bytes of data that were apparently unused.

There are two fun things here. First I used Code Reason to write a string deobfuscator in C. The original program logic performs string deobfuscation in three steps.

The first function checks the hasDeobfuscated field and if it is zero, will return a pointer to the first element of the string. If the field is not zero, it will call the second function, and then set hasDeobfuscated to zero.

The second function will iterate over every character in the ‘string’ array. At each character, it will call a third function and then subtract the value returned by the third function from the character in the string array, writing the result back into the array. So it looks something like:

void inplace_buffer_decrypt(unsigned char *buf, int len) {
  int counted = 0;
  while( counted < len ) {
    unsigned char *cur = buf + counted;
    unsigned char newChar = get_decrypt_modifier_f(counted);
    *cur -= newChar;
    ++counted;
  }
  return;
}

What about the third function, ‘get_decrypt_modifier’? This function is one basic block long and looks like this:

lea ecx, [eax+11h]
add eax, 0Bh
imul ecx, eax
mov edx, ecx
shr edx, 8
mov eax, edx
xor eax, ecx
shr eax, 10h
xor eax, edx
xor eax, ecx
retn

An advantage of having a native code semantics understanding system is that I could capture this block and feed it to CodeReason and have it tell me what the equation of ‘eax’ looks like. This would tell me what this block ‘returns’ to its caller, and would let me capture the semantics of what get_decrypt_modifier does in my deobfuscator.

It would also be possible to decompile this snippet to C, however what I’m really concerned with is the effect of the code on ‘eax’ and not something as high-level as what the code “looks like” in a C decompilers view of the world. C decompilers also use a semantics translator, but then proxy the results of that translation through an attempt at translating to C. CodeReason lets us skip the last step and consider just the semantics, which sometimes can be more powerful.

Using CodeReason

Getting this from CodeReason looks like this:

$ ./bin/VEEShell -a X86 -f ../tests/testSkyWipe.bin
blockLen: 28
r
...
EAX = Xor32[ Xor32[ Shr32[ Xor32[ Shr32[ Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ], I:U8(0x8) ], Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ] ], I:U8(0x10) ], Shr32[ Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ], I:U8(0x8) ] ], Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ] ]
...
EIP = REGREAD(ESP)

This is cool, because if I implement functions for Xor32, Mul32, Add32, and Shr32, I have this function in C, like so:

unsigned char get_decrypt_modifier_f(unsigned int a) {
return Xor32(
Xor32(
Shr32(
Xor32(
Shr32(
Mul32(
Add32( a, 0xb),
Add32( a, 0x11) ),
0x8 ),
Mul32(
Add32( a, 0xb ),
Add32( a, 0x11 ) ) ),
0x10 ),
Shr32(
Mul32(
Add32( a, 0xb ),
Add32( a, 0x11 ) ),
0x8 ) ),
Mul32(
Add32( a, 0xb ),
Add32( a, 0x11 ) ) );
}

And this also is cool because it works.

C:\code\tmp>skywiper_string_decrypt.exe
CreateToolhelp32Snapshot

We’re extending CodeReason into an IDA plugin that allows us to make these queries directly from IDA, which should be really cool!

The second fun thing here is that this string deobfuscator has a race condition. If two threads try and deobfuscate the same thread at the same time, they will corrupt the string forever. This could be bad if you were trying to do something important with an obfuscated string, as it would result in passing bad data to a system service or something, which could have very bad effects.

I’ve used CodeReason to attack string obfuscations that were implemented like this:

xor eax eax
push eax
sub eax, 0x21ece84
push eax

Where the sequence of native instructions would turn non-string immediate values into string values (through a clever use of the semantics of twos compliment arithmetic) and then push them in the correct order onto the stack, thereby building a string dynamically each time the deobfuscation code ran. CodeReason was able to look at this and, using a very simple pinhole optimizer, convert the code into a sequence of memory writes of string immediate values, like:

MEMWRITE[esp] = '.dll'
MEMWRITE[esp-4] = 'nlan'

Conclusions

Having machine code in a form where it can be optimized and understood can be kind of powerful! Especially when that is available from a programmatic library. Using CodeReason, we were able to extract the semantics of string obfuscation functions and automatically implement a string de-obfuscator. Further, we were able to simplify obfuscating code into a form that expressed the de-obfuscated string values on their own. We plan to cover additional uses and capabilities of CodeReason in future blog posts.