
MAIL: Malware Analysis Intermediate Language

Shahid Alam
Department of Computer Science
University of Victoria, BC, V8P5C2
E-mail: salam@cs.uvic.ca

1 Introduction

Intermediate languages are used in compilers [1] to translate source code into a form that is easy to optimize and that increases portability. The term also refers to the language targeted by compilers of high-level languages that do not produce machine code directly, such as Java and C#. An example of adding two numbers in CIL (Common Intermediate Language), the intermediate language used in implementing C#, is as follows:

   a = a + b;
   is translated to the following CIL code:
   ldloc.0        ; Push the first local on the stack
   ldloc.1        ; Push the second local on the stack
   add            ; Pop the two locals, add them and push the result on the stack
   stloc.0        ; Pop the result and store it in the first local

CIL is a stack-based language, i.e., data is pushed onto a stack instead of being held in registers. That is one of the reasons why, in the example above, one simple add statement is translated into four stack-based statements. The same add statement can be translated into three-address code [1] as:

   a := a + b

Three-address code is an intermediate language used by most compilers. The two popular open source compilers GCC [20] and LLVM [12] use three-address code as part of their intermediate languages.
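The difference between the two forms can be illustrated with a minimal sketch. It assumes a toy interpreter that understands only the four CIL-like instructions used above; the function names run_stack_code and run_three_address are hypothetical and exist only for this illustration.

```python
def run_stack_code(program, local_vars):
    """Run a tiny, hypothetical subset of CIL-like stack instructions."""
    stack = []
    for instruction in program:
        opcode, _, index = instruction.partition(".")
        if opcode == "ldloc":        # push a local onto the stack
            stack.append(local_vars[int(index)])
        elif opcode == "stloc":      # pop the top of the stack into a local
            local_vars[int(index)] = stack.pop()
        elif opcode == "add":        # pop two values, push their sum
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unsupported instruction: {instruction}")
    return local_vars


def run_three_address(dest, left, right, local_vars):
    """Run the single three-address statement dest := left + right."""
    local_vars[dest] = local_vars[left] + local_vars[right]
    return local_vars


# Locals 0 and 1 play the roles of a and b.
stack_result = run_stack_code(["ldloc.0", "ldloc.1", "add", "stloc.0"], [2, 3])
tac_result = run_three_address(0, 0, 1, [2, 3])
print(stack_result, tac_result)
```

Both forms compute the same result; the stack form needs four instructions because every operand has to pass through the stack.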

1.1 Hidden Malwares (Obfuscation)

Detecting whether a given program is a malware (virus) is an undecidable problem [5, 15]. Antimalware software detection techniques are limited by this theoretical result. Malware writers exploit this limitation to avoid detection.

In the early days malware writers were hobbyists, but professionals have now joined this group because of the financial gains [3] attached to it. One of the basic techniques used by a malware writer is obfuscation [14]. Such a technique obscures code to make it difficult to understand, analyze and detect malwares embedded in the code.

Initial obfuscators were simple and were detected by signature-based detectors. These signature-based detectors work on simple signatures such as byte sequences, instruction sequences and string signatures (pattern of a malware that uniquely identifies it). They lack information about the semantics or behavior of the malicious program. To counter these detectors the obfuscation techniques evolved. Some of the mutations used in polymorphic and metamorphic [16] malwares are:

  • Instruction reordering: By changing the ordering of instructions with commutative or associative operators, the structure of the instructions can be changed. This reordering does not change the behavior of the program. As a simple example:

    a = 10; b = 20;                   a = 10; b = 20;
    x = a * b;   can be changed to:   x = b * a
    
    original machine code    and       assembly:
    
    c7 45 f4 0a 00 00 00     movl   [rbp-0xc], 0xa  ; a = 10
    c7 45 f8 14 00 00 00     movl   [rbp-0x8], 0x14 ; b = 20
    8b 45 f4                 mov    eax, [rbp-0xc]  ;
    0f af 45 f8              imul   eax, [rbp-0x8]  ; a * b
    89 45 fc                 mov    [rbp-0x4], eax  ; x = a * b
    
    changed machine code     and       assembly:
    
    c7 45 f4 0a 00 00 00     movl   [rbp-0xc], 0xa  ; a = 10
    c7 45 f8 14 00 00 00     movl   [rbp-0x8], 0x14 ; b = 20
    8b 45 f8                 mov    eax, [rbp-0x8]  ; (reordered)
    0f af 45 f4              imul   eax, [rbp-0xc]  ; b * a (reordered)
    89 45 fc                 mov    [rbp-0x4], eax  ; x = b * a
    

    Because of the two reordered instructions the original and the changed machine codes have different signatures. Other instructions can also be reordered if no dependency exists between the instructions.

  • Dead code insertion: Dead code is code that either does not execute or has no effect on the results of a program. The following is an example of dead code insertion:

    mov   ebx, [ebp+4]
    add   ebx, 0x0           ; dead code
    nop                      ; dead code
    jmp   ebx
    
  • Register renaming: To avoid detection, registers are reassigned in a fragment of binary code. This changes the byte sequence (signature) of the machine code. A signature-based detector will not be able to match the signature if it is searching for a specific register. An example of register renaming is given below (register eax is renamed to edx):

    lea   eax, [RIP+0x203768]               lea   edx, [RIP+0x203768]
    add   eax, 0x10                         add   edx, 0x10
    jmp   eax                               jmp   edx
    
  • Order of instructions: To change the control flow of a program the order of instructions is changed in the binary image of the program, keeping the order of execution the same by using jump instructions.

  • Branch functions: A branch function is used [14] to obscure the flow of control in a program. The target of all or some of the unconditional branches in a program is replaced by the address of a branch function. The branch function makes sure the branch is correctly transferred to the right target for each branch.
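The effect of such mutations on a signature-based detector can be illustrated with a minimal sketch. It assumes a hypothetical detector whose signature is the byte sequence of the mov/imul pair taken from the original code of the instruction reordering example; the names SIGNATURE and matches_signature are ours and do not belong to any real tool.

```python
# Machine code of the instruction reordering example, copied from the listing.
ORIGINAL = bytes.fromhex(
    "c745f40a000000"   # movl [rbp-0xc], 0xa
    "c745f814000000"   # movl [rbp-0x8], 0x14
    "8b45f4"           # mov  eax, [rbp-0xc]
    "0faf45f8"         # imul eax, [rbp-0x8]
    "8945fc"           # mov  [rbp-0x4], eax
)
REORDERED = bytes.fromhex(
    "c745f40a000000"   # movl [rbp-0xc], 0xa
    "c745f814000000"   # movl [rbp-0x8], 0x14
    "8b45f8"           # mov  eax, [rbp-0x8]   (reordered)
    "0faf45f4"         # imul eax, [rbp-0xc]   (reordered)
    "8945fc"           # mov  [rbp-0x4], eax
)

# Hypothetical signature: the bytes of the original mov/imul pair.
SIGNATURE = bytes.fromhex("8b45f40faf45f8")


def matches_signature(code, signature):
    """Return True if the byte signature occurs anywhere in the code."""
    return signature in code


print(matches_signature(ORIGINAL, SIGNATURE))
print(matches_signature(REORDERED, SIGNATURE))
```

The two fragments compute the same value of x, yet only the original one contains the signature bytes, which is why a detector of this kind misses the mutated variant.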

Because of the financial gains attached to the malware industry, malware writers are always targeting new technologies. To improve the detection of malwares, especially metamorphic malwares, we need to develop new methods and techniques that analyze the behavior of a program and make a better detection decision with few false positives.

1.2 Why an Intermediate Language for Malware Analysis

Here we list some of the reasons why we need to transform an assembly language program into an intermediate language for malware analysis:

  • There are hundreds of different instructions in any assembly language. For example, the numbers of instructions in the two most popular ISAs (Instruction Set Architectures) are: Intel x86-64 = 800+ [7] and ARM = 400+ [17]. We need to reduce the number of these instructions considerably to optimize the static analysis of any such assembly program.

  • Not only are the instructions numerous, they are also complex, such as the Intel x86-64 instructions PREFETCHh, MOVD and MOVQ. The instruction PREFETCHh moves data from memory to the cache. Is this action important if we are performing a static analysis for detecting malwares? Our answer is 'NO'. There are other such instructions that are not required for malware analysis, so our intermediate language hides or ignores these instructions and makes them transparent to the static analysis. The instructions MOVD and MOVQ copy a double word or a quad word, respectively, from the source operand to the destination operand. Here we have to ask: do we need to take into account the size of the word being copied in our static analysis? If the answer is 'NO', then in our intermediate language we can replace these kinds of instructions with a much simpler Assignment instruction. Using such techniques, an intermediate language allows us to use simple instructions and so makes our static analysis much simpler.

  • We want a common intermediate language that can be used with different platforms, such as Intel x86-64 and ARM, so that we do not have to perform a separate static analysis for each platform. The intermediate language could be used for any of the above-mentioned or other such platforms.

  • Assembly instructions can have multiple hidden side effects, such as affecting the flags, that can substantially increase the effort required for static analysis. There are three options that an intermediate language can use to make the static analysis easier: remove all the side effects, allow only one side effect, or explicitly define the side effect(s) in the instruction. Because our focus is mainly on malware analysis, in our opinion the first option is the best of the three. We use this option in our intermediate language: the instructions in our language, which we call MAIL (Malware Analysis Intermediate Language), do not have any side effects.

  • Unknown branch addresses in an assembly program make it difficult to build a correct CFG (Control Flow Graph). MAIL takes care of this problem. For example, indirect jumps and calls (branches whose target is unknown or cannot be determined by static analysis) can only be changed by a change in the source code, so it is safe to ignore these branches for malware analysis where the change is carried out only in the machine code. We explain this using an example from one of the PARSEC [4] benchmarks.

    The following example shows the function Condition() from one of the benchmarks in the PARSEC benchmark suite [4]. This function initializes a static condition variable of a thread. A local variable rv is used in a switch statement to jump to the appropriate exception for the value returned by pthread_cond_init(). That function initializes the condition variable of a thread and returns zero if successful; otherwise it returns an error number. The value returned by pthread_cond_init(), and hence the value of rv, can only be determined at runtime.

    The C++ source code with the translated (disassembled) assembly code:
    
    Condition::Condition(Mutex &_M)
               throw(CondException)
    {                                       471b50: push %rbp
       int rv;                              471b51: push %rbx
       M = &_M;                             471b52: sub $0x38,%rsp
       nWaiting = 0;
       nWakeupTickets = 0;                  471b56: mov %rsi,(%rdi)
       rv = pthread_cond_init(&c, NULL);    471b59: movl $0x0,0x8(%rdi)
                                            471b60: movl $0x0,0xc(%rdi)
                                            471b67: xor %esi,%esi
                                            471b69: add $0x10,%rdi
                                            471b6d: callq 404b60 <pthread_cond_init@plt>
    
       switch(rv) {  [  rv UNKNOWN  ]       471b72: cmp $0x16,%eax
          case 0:                           471b75: jbe 471bb0 <Condition:Mutex>
             break;                         471b77: mov 0x21934a(%rip),%r8
          case EAGAIN:                      471b7e: mov $0x8,%edi
          case ENOMEM: {                    471b83: lea 0x10(%r8),%rbp
             CondResourceException e;       471b87: mov %rbp,(%rsp)
             throw e;                       471b8b: callq 404d00 <allocate_exception@plt>
             break;                         471b90: mov 0x219359(%rip),%rdx
          }                                 471b97: mov 0x219342(%rip),%rsi
          case EBUSY:                       471b9e: mov %rax,%rdi
          case EINVAL: {                    471ba1: mov %rbp,(%rax)
             CondInitException e;           471ba4: callq 404da0 <cxa_throw@plt>
             throw e;                       471ba9: nopl 0x0(%rax)
             break;                         471bb0: lea 0x6995(%rip),%rcx <MutexInitException>
          }                                 471bb7: mov %eax,%ebx
          default: {                        471bb9: movslq (%rcx,%rbx,4),%rax
             CondUnknownException e;        471bbd: lea (%rax,%rcx,1),%rdx
             throw e;                       471bc1: jmpq *%rdx     [  UNKNOWN BRANCH TARGET  ]
             break;                         471bc3: nopl 0x0(%rax,%rax,1)
          }                                 471bc8: mov 0x219231(%rip),%rdi
       }                                    471bcf: lea 0x10(%rdi),%rbx
    }                                       471bd3: mov $0x8,%edi
                                            471bd8: mov %rbx,0x10(%rsp)
    

    Dynamic analysis can be used to determine the value of rv, but such an analysis may fail to reach (in any one run) some of the executable paths (here, the cases of the switch statement) if rv is almost always zero and changes only in rare cases. These rare cases may not get executed, or may execute only after running the program for a very long time, which may render the analysis impractical. A malware writer can exploit this weakness and inject malware code by changing the target address of any of the branches inside the switch statement to point to his/her own malicious code. In such a case the dynamic analysis will not be able to detect this malicious behavior.

    That is where static (binary) analysis can help, by building a CFG that covers all the available execution paths, in this case the switch statement. This CFG may not be correct: the disassembled code above contains a branch (jmpq *%rdx) whose target address is unknown and cannot be computed using static analysis. Is it safe to ignore this branch target address while building the CFG for malware analysis? A malware writer cannot use this particular instruction as it is for malicious code. He/she will have to change the code to make the instruction usable, for example by loading the register rdx with the address of the malicious code before the jmpq *%rdx instruction, which is trivial to detect because in this case the branch target address becomes known.

    The language MAIL is specifically designed for malware analysis, so we create a new construct/keyword UNKNOWN that takes care of these branches. This construct will be helpful not only in static but also in dynamic analysis of the malwares.

  • A language such as MAIL can be easily translated into a string, a tree or a graph, and hence can be optimized for the various analyses required for malware analysis and detection, such as pattern matching and data mining. Special patterns are introduced (Sections 2.5, 2.6 and 2.8) in the MAIL language for annotating MAIL statements that can be used for pattern matching.

  • To reduce the number of different instructions for static analysis, functionally equivalent assembly instructions can be grouped together in one intermediate language instruction, such as:

    (xor eax, eax) | (and eax, 0) | (sub eax, eax)  =>  mov eax, 0
    (add ebx, 0x2000) & (add eax, ebx) | (lea eax, [ebx + 0x2000])  =>  load eax, expr
    
    where expr = (ebx + 0x2000) and its value can be known or unknown depending on the
    value of ebx. This information should be explicitly defined in the language.
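
    The grouping of functionally equivalent instructions can be sketched as a small rewriting step. This is only an illustration under our own assumptions: the function name normalize is hypothetical, it handles just two zeroing idioms and the lea form, and the output strings follow the informal notation used above rather than actual MAIL syntax.

```python
import re


def normalize(instruction):
    """Map a few functionally equivalent x86 idioms to one canonical form."""
    text = " ".join(instruction.lower().split())  # collapse case and spacing

    # xor r, r and sub r, r both set the register r to zero.
    match = re.fullmatch(r"(xor|sub) (\w+), (\w+)", text)
    if match and match.group(2) == match.group(3):
        return f"mov {match.group(2)}, 0"

    # lea r, [expr] loads the value of the address expression into r.
    match = re.fullmatch(r"lea (\w+), \[(.+)\]", text)
    if match:
        return f"load {match.group(1)}, ({match.group(2)})"

    return text  # anything else is passed through unchanged


for sample in ("xor eax, eax", "SUB  eax, eax", "lea eax, [ebx + 0x2000]"):
    print(sample, "=>", normalize(sample))
```

    After this step a detector only has to recognise one canonical statement instead of every variant that a mutation engine may emit.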
    

In the following Sections we introduce the new language called MAIL (Malware Analysis Intermediate Language) for malware analysis and detection. In Section 2 we describe its detailed design and how a binary program is translated to MAIL. We also cover CFG (Control Flow Graph) construction and annotation, and how graph and pattern matching techniques are used to detect metamorphic malwares [16]. In Section 3 we carry out an empirical study to test the use of MAIL in a tool. Using the MAIL language, the tool was able to fully automate the process of malware analysis and detection and achieved 100% results. We finally conclude in Section 4.

2 Design of MAIL

In the previous Section we provided motivations for a new language for malware analysis and detection. This Section introduces this new language, called MAIL, and presents its design. MAIL is based on binary analysis to optimize malware detection, so before explaining its design we first give some background on binary analysis for malware detection.

2.1 Binary Analysis for Malware Detection

Almost all malwares use binaries (instructions that a computer can interpret and execute) to infiltrate a computer system, which can be a desktop, a server, a laptop, a kiosk or a mobile device. Binary analysis is the process of automatically analysing the structure and behavior of a binary program. This analysis serves various purposes, among them optimization, verification, profiling, performance tuning and computer security. We further explain how binary analysis can help us understand a program and detect malwares in it by using a simple binary program: a function called sort that is part of the class Merge in a sorting program. This function performs a merge sort on an array of integers. The information produced by its binary analysis (performed using an in-house developed tool) is listed below and explained in the following paragraphs:

              Listing 1.1 Binary Analysis of The Disassembled Function Merge::sort(int key[], int size)

                        Column I                                                    Column II

0  40108e              55  PUSH                    RBP     :     5  40113b        488b45c8   MOV        RAX, [RBP-0x38]
0  40108f          4889e5   MOV               RBP, RSP     :     5  40113f          4189f9   MOV               R9D, EDI
0  401092              53  PUSH                    RBX     :     5  401142          4189f0   MOV               R8D, ESI
0  401093        4883ec48   SUB              RSP, 0x48     :     5  401145          4889de   MOV               RSI, RBX
0  401097        48897dc8   MOV        [RBP-0x38], RDI     :     5  401148          4889c7   MOV               RDI, RAX
0  40109b        488975c0   MOV        [RBP-0x40], RSI     :     5  40114b      e8e2fdffff  CALL               0x400f32
0  40109f          8955bc   MOV        [RBP-0x44], EDX     :     5  401150          8b45e8   MOV        EAX, [RBP-0x18]
0  4010a2          8b45bc   MOV        EAX, [RBP-0x44]     :     5  401153            01c0   ADD               EAX, EAX
0  4010a5            4898  CDQE                            :     5  401155          0145ec   ADD        [RBP-0x14], EAX
0  4010a7        48c1e002   SHL               RAX, 0x2     :     6  401158          8b45e8   MOV        EAX, [RBP-0x18]
0  4010ab          4889c7   MOV               RDI, RAX     :     6  40115b          8b55bc   MOV        EDX, [RBP-0x44]
0  4010ae      e8e9f9ffff  CALL               0x400a9c     :     6  40115e            89d1   MOV               ECX, EDX
0  4010b3        488945d8   MOV        [RBP-0x28], RAX     :     6  401160            29c1   SUB               ECX, EAX
0  4010b7  c745e801000000   MOV  DWORD [RBP-0x18], 0x1     :     6  401162            89c8   MOV               EAX, ECX
0  4010be      e9f2000000   JMP               0x4011b5 (11):     6  401164          3b45ec   CMP        EAX, [RBP-0x14]
1  4010c3  c745ec00000000   MOV  DWORD [RBP-0x14], 0x0     :     6  401167          0f9fc0  SETG                     AL
1  4010ca      e989000000   JMP               0x401158 (6) :     6  40116a            84c0  TEST                 AL, AL
2  4010cf          8b45e8   MOV        EAX, [RBP-0x18]     :     6  40116c    0f855dffffff   JNZ               0x4010cf (2)
2  4010d2          8b55ec   MOV        EDX, [RBP-0x14]     :     7  401172  c745ec00000000   MOV  DWORD [RBP-0x14], 0x0
2  4010d5          8d0402   LEA         EAX, [RDX+RAX]     :     7  401179      e923000000   JMP               0x4011a1 (9)
2  4010d8          0345e8   ADD        EAX, [RBP-0x18]     :     8  40117e          8b45ec   MOV        EAX, [RBP-0x14]
2  4010db          3b45bc   CMP        EAX, [RBP-0x44]     :     8  401181            4898  CDQE
2  4010de    0f8e11000000   JLE               0x4010f5 (4) :     8  401183        48c1e002   SHL               RAX, 0x2
3  4010e4          8b45ec   MOV        EAX, [RBP-0x14]     :     8  401187        480345c0   ADD        RAX, [RBP-0x40]
3  4010e7          8b55bc   MOV        EDX, [RBP-0x44]     :     8  40118b          8b55ec   MOV        EDX, [RBP-0x14]
3  4010ea            89d1   MOV               ECX, EDX     :     8  40118e          4863d2 MOVSXD              RDX, EDX
3  4010ec            29c1   SUB               ECX, EAX     :     8  401191        48c1e202   SHL               RDX, 0x2
3  4010ee            89c8   MOV               EAX, ECX     :     8  401195        480355d8   ADD        RDX, [RBP-0x28]
3  4010f0          2b45e8   SUB        EAX, [RBP-0x18]     :     8  401199            8b12   MOV             EDX, [RDX]
3  4010f3            eb03   JMP               0x4010f8 (5) :     8  40119b            8910   MOV             [RAX], EDX
4  4010f5          8b45e8   MOV        EAX, [RBP-0x18]     :     8  40119d        8345ec01   ADD  DWORD [RBP-0x14], 0x1
5  4010f8          8945e4   MOV        [RBP-0x1c], EAX     :     9  4011a1          8b45ec   MOV        EAX, [RBP-0x14]
5  4010fb          8b45ec   MOV        EAX, [RBP-0x14]     :     9  4011a4          3b45bc   CMP        EAX, [RBP-0x44]
5  4010fe            4898  CDQE                            :     9  4011a7          0f9cc0  SETL                     AL
5  401100        48c1e002   SHL               RAX, 0x2     :     9  4011aa            84c0  TEST                 AL, AL
5  401104          4889c1   MOV               RCX, RAX     :     9  4011ac    0f85ccffffff   JNZ               0x40117e (8)
5  401107        48034dd8   ADD        RCX, [RBP-0x28]     :     10 4011b2          d165e8   SHL  DWORD [RBP-0x18], 0x1
5  40110b          8b45ec   MOV        EAX, [RBP-0x14]     :     11 4011b5          8b45e8   MOV        EAX, [RBP-0x18]
5  40110e          4863d0 MOVSXD              RDX, EAX     :     11 4011b8          3b45bc   CMP        EAX, [RBP-0x44]
5  401111          8b45e8   MOV        EAX, [RBP-0x18]     :     11 4011bb          0f9cc0  SETL                     AL
5  401114            4898  CDQE                            :     11 4011be            84c0  TEST                 AL, AL
5  401116        488d0402   LEA         RAX, [RDX+RAX]     :     11 4011c0    0f85fdfeffff   JNZ               0x4010c3 (1)
5  40111a        48c1e002   SHL               RAX, 0x2     :     12 4011c6        488b45d8   MOV        RAX, [RBP-0x28]
5  40111e          4889c2   MOV               RDX, RAX     :     12 4011ca          4889c7   MOV               RDI, RAX
5  401121        480355c0   ADD        RDX, [RBP-0x40]     :     12 4011cd      e82af9ffff  CALL               0x400afc
5  401125          8b45ec   MOV        EAX, [RBP-0x14]     :     12 4011d2        4883c448   ADD              RSP, 0x48
5  401128            4898  CDQE                            :     12 4011d6              5b   POP                    RBX
5  40112a        48c1e002   SHL               RAX, 0x2     :     12 4011d7              c9 LEAVE
5  40112e          4889c3   MOV               RBX, RAX     :     12 4011d8    ff0502000000 INC_A             [RIP+0x02]
5  401131        48035dc0   ADD        RBX, [RBP-0x40]     :     12 4011de            eb04 JMP_A               0x4011e4
5  401135          8b7de4   MOV        EDI, [RBP-0x1c]     :     12 4011e0        00000000 CTR_A
5  401138          8b75e8   MOV        ESI, [RBP-0x18]     :     12 4011e4              c3   RET

The listing shown above is divided into two columns, numbered I and II and separated by a colon (:). There are 104 assembly instructions in total in this function. The first column lists the first 52 instructions and the second column lists the remaining 52 instructions. This function is part of a binary program (an ELF x86-64 file) that is first disassembled; binary analysis is then performed on the disassembled program to build the CFG of each function in the program. In the listing above each instruction is assigned a block number and an address. Columns I and II are each further divided into five columns: column 1 is the block number, column 2 is the address, column 3 is the machine code, and columns 4 and 5 are the assembly instruction (mnemonic and operands) in Intel syntax.

The function contains 13 blocks, numbered 0 to 12. Each block contains a different number of instructions. For example, block 4 has only 1 instruction whereas block 5 has 30 instructions. A block is a basic block [1] that has the following properties: (1) It has only one entry point but can have more than one exit point. (2) An instruction with a branch to another block in the same function ends the block. (3) If an instruction is the target of another branch within the same function then that instruction starts a new block.

If an instruction branches to another block in the function listed above, the target instruction’s block number is listed at the end enclosed in brackets. For example the last instruction in block 1 ends with (6), because this instruction is branching to the address 401158 and the instruction at this address belongs to (is the first instruction of) block 6. Based on the analysis information listed above we build a CFG of this function that is shown in Figure 1 (a). We are going to compare this CFG with the source code of this function in C++ which is shown in Figure 1 (b).
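
The block properties above can be turned into a short partitioning routine. The following is a minimal sketch, not the in-house tool: the function name split_into_blocks and the tuple layout are our assumptions, and EXCERPT contains only a few of the instructions of blocks 2 to 5 from Listing 1.1.

```python
def split_into_blocks(instructions):
    """Partition (address, mnemonic, target) tuples, sorted by address, into blocks."""
    addresses = {address for address, _, _ in instructions}
    leaders = {instructions[0][0]}            # the first instruction starts a block
    for index, (_, _, target) in enumerate(instructions):
        if target in addresses:               # branch to a place inside the function
            leaders.add(target)               # property (3): the target starts a block
            if index + 1 < len(instructions):
                # property (2): the branch ends its block, so the next
                # instruction starts a new one
                leaders.add(instructions[index + 1][0])
    blocks = []
    for address, _, _ in instructions:
        if address in leaders:
            blocks.append([])
        blocks[-1].append(address)
    return blocks


# A shortened excerpt of blocks 2 to 5 of Listing 1.1 (not every instruction).
EXCERPT = [
    (0x4010CF, "MOV", None),
    (0x4010DB, "CMP", None),
    (0x4010DE, "JLE", 0x4010F5),
    (0x4010E4, "MOV", None),
    (0x4010F3, "JMP", 0x4010F8),
    (0x4010F5, "MOV", None),
    (0x4010F8, "MOV", None),
    (0x4010FB, "MOV", None),
]

for block in split_into_blocks(EXCERPT):
    print([hex(address) for address in block])
```

A CALL to an address outside the function has no target among the function's own addresses, so in this sketch it does not end a block, which agrees with block 0 of the listing.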

Figure 1: The CFG and the Source Code in C++ of the Function in Listing 1.1. (a) The CFG. (b) The Source Code.

The source code was not made available to our binary analysis tool, and the CFG built by this tool is based only on the information available in the binary program. We now want to see how much this analysis tells us about the function for malware detection.

This function has been instrumented, i.e., additional code has been added to it. Three instructions at the end of Listing 1.1 (numbers 101, 102 and 103) have been added to this function. The first of these, INC_A, increments a 32-bit counter CTR_A at address 4011e0. The second, JMP_A, jumps over the counter storage space to address 4011e4, which contains the RET instruction. The counter counts the number of times this function is called. This kind of instrumentation is done to profile a program for further optimization.

The CFG starts at block 0 and ends at block 12. Block 11 jumps back to block 1. This indicates a possibility of a loop. If we look at the CFG there is a loop that starts at block 1 and ends at block 11. The blocks that belong to this loop are in red, green and blue colors. This outer loop has two inner loops that are colored green and blue. The source code of the function Merge::sort() has exactly one outer loop and the outer loop has exactly two inner loops.
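
The loop structure described above can be recovered mechanically from the block-level edges. The sketch below is an assumption-laden illustration: CFG_EDGES is our reading of Listing 1.1 (the annotated branch targets plus fall-through edges), a back jump is taken to be any edge to a block with the same or a lower number, and nesting is decided by a simple block-range heuristic rather than by a dominator analysis.

```python
# Block-level edges read from Listing 1.1: branch targets plus fall-through.
CFG_EDGES = {
    0: [11], 1: [6], 2: [3, 4], 3: [5], 4: [5], 5: [6], 6: [2, 7],
    7: [9], 8: [9], 9: [8, 10], 10: [11], 11: [1, 12], 12: [],
}


def back_jumps(edges):
    """Return the (source, target) edges that jump to an earlier or the same block."""
    return sorted(
        (source, target)
        for source, targets in edges.items()
        for target in targets
        if target <= source
    )


def nested_loops(jumps):
    """Map each loop (start, end) to the loops whose block range lies inside it."""
    loops = [(target, source) for source, target in jumps]
    return {
        outer: [
            inner for inner in loops
            if inner != outer and outer[0] <= inner[0] and inner[1] <= outer[1]
        ]
        for outer in loops
    }


jumps = back_jumps(CFG_EDGES)
print(jumps)
print(nested_loops(jumps))
```

Under these assumptions the sketch reports three back jumps, and the loop spanning blocks 1 to 11 encloses the two loops spanning blocks 2 to 6 and blocks 8 to 9, in line with the one outer and two inner loops of the source code.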

A malware writer can change (just the machine code) the last instruction of block 11 in Listing 1.1:

from: 11  4011c0          0f85fdfeffff          JNZ 0x4010c3 (1)
to:   11  4011c0          e9fefeffff            JMP 0x4010c3 (1)
      11  4011c5          90                    NOP              ; pads the sixth byte of the original JNZ

This turns the outer loop into an infinite loop, so when the function Merge::sort() is called the program never returns. The malware writer in this case has added an unconditional back jump, which in general is a legal jump. Similarly, other back jumps (the last instructions of blocks 6 and 9) can also be changed by a malware writer to create more infinite loops. A signature-based malware detection tool will not be able to detect this kind of malware. Without behavioral information about a binary program, obtained either statically or dynamically, manual inspection with a debugger is required to detect such malwares. This manual labor is very time consuming and can become financially very expensive.

To automate this process, a further binary analysis on the CFG can detect this infinite loop as follows. We have already identified all the identifiable (more on this later) loops in the function. We further analyse all the blocks that contain a back jump. In this case, analysing block 11, we see that the TEST instruction is followed by an unconditional jump instruction (added by the malware writer), which indicates an illegal infinite loop and hence a malware. If the malware writer also replaces the other four instructions in the block with some other instructions, we have to find the unconditional jump from block 11 to block 1 (the ending and starting blocks of the loop) to detect the malware, which is trivial in the presence of the CFG.
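
The check on back jumps can be sketched as follows. The data layout (a mapping from block number to the mnemonic and target block of the last instruction of the block) and the name suspicious_back_jumps are assumptions made for this illustration; only blocks of Listing 1.1 that end in a jump are included.

```python
def suspicious_back_jumps(last_instructions):
    """Return (block, target) pairs where a block ends in an unconditional back jump."""
    flagged = []
    for block, (mnemonic, target) in sorted(last_instructions.items()):
        if mnemonic == "JMP" and target is not None and target <= block:
            flagged.append((block, target))
    return flagged


# Last instruction of each block of Listing 1.1 that ends in a jump.
ORIGINAL_BLOCKS = {
    0: ("JMP", 11), 1: ("JMP", 6), 2: ("JLE", 4), 3: ("JMP", 5),
    6: ("JNZ", 2), 7: ("JMP", 9), 9: ("JNZ", 8), 11: ("JNZ", 1),
}

# The same function after the JNZ at the end of block 11 is replaced by JMP.
PATCHED_BLOCKS = dict(ORIGINAL_BLOCKS)
PATCHED_BLOCKS[11] = ("JMP", 1)

print(suspicious_back_jumps(ORIGINAL_BLOCKS))
print(suspicious_back_jumps(PATCHED_BLOCKS))
```

In the original function every unconditional jump goes forward, so nothing is flagged; after the patch the jump from block 11 to block 1 is reported.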

What if this unconditional jump instruction is a legal instruction, i.e., it has not been added by a malware writer and is part of the program? For example, event-based programs contain one or more infinite loops. In this case we may need to build specific control flow patterns and compare them with the known control flow patterns of malwares of this kind.

Other malicious changes, such as the following register renaming and control flow change, in block 2 in Listing 1.1, cannot be detected by a signature based malware detector:

from:
2  4010cf          8b45e8   MOV        EAX, [RBP-0x18]
2  4010d2          8b55ec   MOV        EDX, [RBP-0x14]
2  4010d5          8d0402   LEA         EAX, [RDX+RAX]
2  4010d8          0345e8   ADD        EAX, [RBP-0x18]
2  4010db          3b45bc   CMP        EAX, [RBP-0x44]
2  4010de    0f8e11000000   JLE               0x4010f5
to:
2  4010cf          8b5de8   MOV        EBX, [RBP-0x18]
2  4010d2          8b55ec   MOV        EDX, [RBP-0x14]
2  4010d5          8d1c1a   LEA         EBX, [RDX+RBX]
2  4010d8          035de8   ADD        EBX, [RBP-0x18]
2  4010db          3b5dbc   CMP        EBX, [RBP-0x44]
2  4010de    0f8e01010000   JLE               0x4011e5     ; Jump to some malicious code

In order to detect such anomaly based malwares automatically, we need control flow information as provided by the binary analysis presented in this Section.

Another technique used by malware writers to deceive signature based detectors is to use instructions other than JMP and CALL to change the control flow of a program. We show this by replacing the last instruction with two instructions in block 7 in Listing 1.1 as follows:

from: 7  401179      e923000000   JMP               0x4011a1
to:   7  401179      68e5114000  PUSH         QWORD 0x4011e5
      7  40117e              c3   RET

The change to Listing 1.1 does not end here. For the code to work correctly, the addresses following these instructions and all the affected jump target addresses need to be updated. A malware writer may or may not update them, depending on the complexity of the malware. The malware writer could use a tool that automates updating these addresses.

A further binary analysis of the above instructions reveals that the last value pushed on the stack before the RET instruction is 4011e5, so the RET instruction will move the value 4011e5 into the RIP register, the instruction pointer. The instruction at address 4011e5 (malicious code) will therefore be executed next.
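
This reasoning can be sketched as a small static scan that tracks constants pushed on the stack. It is a simplified illustration under our own assumptions: the function name resolve_push_ret and the (mnemonic, operand) layout are hypothetical, and only PUSH, POP and RET are modelled.

```python
def resolve_push_ret(instructions):
    """Return the targets of RET instructions that consume a pushed constant."""
    targets = []
    stack = []                      # statically known values; None when unknown
    for mnemonic, operand in instructions:
        if mnemonic == "PUSH":
            stack.append(operand if isinstance(operand, int) else None)
        elif mnemonic == "POP":
            if stack:
                stack.pop()
        elif mnemonic == "RET":
            if stack and stack[-1] is not None:
                targets.append(stack.pop())   # RET behaves like a jump here
            else:
                stack.clear()                 # an ordinary return, nothing to report
    return targets


# The two instructions that replaced the JMP at the end of block 7.
PATCHED_TAIL = [("PUSH", 0x4011E5), ("RET", None)]

print([hex(target) for target in resolve_push_ret(PATCHED_TAIL)])
```

A RET that follows a push of a register, or no push at all, is treated as an ordinary return and is not reported.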

The added instruction at address 40117e indicates the end of a function. Sometimes the binary provides information about the start and end of all the functions in a program. If this information is not available, it is difficult to find the exact start and end of some of the functions; e.g., the addition of the two instructions shown above to the function in Listing 1.1 divides the function into two functions and makes it difficult to find the original function. For malware detection, we may only need to find where the control is flowing (i.e., just the behavior and not the function boundaries) and then compare this behavior with the available samples of known malwares to detect such malwares.

In the above paragraphs we have shown, using an elaborate example, how trivial changes in the binary can make malware analysis and detection intricate, difficult and expensive. But with suitable tools and appropriate binary analysis it is possible to analyse and detect such malwares automatically. We have built a CFG (Figure 1 (a)) from the disassembled instructions of the function in Listing 1.1 for malware analysis and detection. In the next Section we describe the design of the intermediate language MAIL that automates and optimizes this step.

2.2 Design

We believe a good language must start small and simple, and must give the language developers opportunities to grow (extend) the language with its users. Therefore MAIL is designed as a small, simple and extensible language. In this and the next Sections we describe in detail how MAIL is designed.

The basic purpose of the language MAIL is to represent structural and behavioral information of an assembly program for malware analysis and detection. MAIL will also make the program more readable and understandable by a human malware analyst. An assembly program can comprise the following types of instructions. We use Intel x86-64 assembly instructions [7] as sample instructions:

  1. Control instructions: These instructions include instructions that can change the control flow of the program, such as JMP, CALL, RET, CMP, CMPS, CMPPS, PCMPEQW, REP and LOOP instructions.

  2. Arithmetic instructions: These instructions perform arithmetic operations, such as ADD, SUB, MUL, DIV, FSIN, FCOS, PADDW, PSUBW, ADDPS, ADDPD, PMULLD, PAVGW, DPPD, SHR and SHL.

  3. Logical instructions: These include instructions that perform logical operations, such as AND, OR and NOT.

  4. Data transfer instructions: These instructions involve data moving instructions, such as MOV, CMOV, XCHG, PUSH, POP, LODS, STOS, MOVS, MOVAPS, MOVAPD, IN, OUT, INS, OUTS, LAHF, SAHF, PREFETCH, FLDPI, FLDCW, FXSAVE, LEA and LDS.

  5. System instructions: These instructions provide support for operating systems and include instructions LOCK, LGDT, SGDT, LTR, STR and XSAVE etc.

  6. Miscellaneous instructions: All other instructions that do not fit into any other group are included in this group of instructions, such as NOP, CPUID, SCAS, CLC, STC, CLI, HLT, WAIT, MFENCE, PACKSSWB, MAXPS, and UD (undefined instruction).
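The grouping above can be sketched as a simple lookup from mnemonic to group. The table below is hypothetical and lists only a few mnemonics per group, taken from the list above; the names `CATEGORIES` and `categorize` are introduced for illustration only.

```python
# Hypothetical lookup table; only a few mnemonics from each group are listed.
CATEGORIES = {
    "control": {"JMP", "CALL", "RET", "CMP", "LOOP"},
    "arithmetic": {"ADD", "SUB", "MUL", "DIV", "SHR", "SHL"},
    "logical": {"AND", "OR", "NOT"},
    "data_transfer": {"MOV", "XCHG", "PUSH", "POP", "LEA"},
    "system": {"LOCK", "LGDT", "SGDT", "LTR", "STR"},
}


def categorize(mnemonic: str) -> str:
    """Return the group of a mnemonic, or 'miscellaneous' if it is not listed."""
    name = mnemonic.upper()
    for category, members in CATEGORIES.items():
        if name in members:
            return category
    return "miscellaneous"


print(categorize("mov"))   # data_transfer
print(categorize("NOP"))   # miscellaneous
```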

Designing a language that is small and simple, and that accurately represents all these instructions for structural and behavioral information, is non-trivial. Our goal is to create as few statements as possible in the intermediate language and map as many instructions as possible to these statements. For example, we do not translate (i.e: we ignore) the following x86 instructions:

   CLFLUSH:   Flush caches
   CLTS:      Clear task-switched flag in CR0
   SMSW:      Store machine status word
   VERR:      Verify if a segment can be read
   WBINVD:    Writing back and flushing of external caches
   XRSTOR:    Restore processor extended states from memory
   XSAVE:     Save processor extended states to memory

The complete list of x86 and ARM instructions that are not translated into MAIL statements is given in Appendix B.

2.3 MAIL Statements

The majority of assembly instructions are data moving instructions, as shown above. The following two MAIL assignment statements cover the data transfer, arithmetic, logical and some of the system instructions. We use EBNF [8] notation to define these statements:

assignment_s        ::= register_s
                        | address_s ;

register_s          ::= register ’=’ (math_operator)? expr
                        | register ’=’ (expr)? math_operator expr
                        | register ’=’ lib_call_s ;

address_s           ::= address ’=’ (math_operator)? expr
                        | address ’=’ (expr)? math_operator expr
                        | address ’=’ lib_call_s ;

expr                ::= register
                        | address
                        | digit+ ;

register            ::= ’eflags’
                        | ’gr_’ digit+
                        | ’fr_’ digit+
                        | ’sp’
                        | register_name (’:’ register_name)? ;

register_name       ::= letter+ [’0’ - ’9’]?
                        | ’ZF’
                        | ’CF’
                        | ’PF’
                        | ’SF’
                        | ’OF’ ;

address             ::= ’[’ digit+ ’]’
                        | reg_address
                        | ’UNKNOWN’ ;
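As an illustration of how the register_s rule can be recognized, the following sketch uses a regular expression for a simplified subset. It rests on several assumptions that are not part of the grammar above: math_operator is taken to be one of + - * /, numbers may be decimal or hexadecimal with an optional sign, and memory addresses, register pairs and library calls are left out. The names `REGISTER_S` and `parse_register_s` are hypothetical.

```python
import re

# Simplified token patterns (assumptions, see the lead-in above).
REGISTER = r"(?:eflags|sp|gr_\d+|fr_\d+|[A-Za-z]+\d?)"
NUMBER = r"-?(?:0x[0-9A-Fa-f]+|\d+)"
EXPR = rf"(?:{NUMBER}|{REGISTER})"
MATH_OP = r"[-+*/]"

# register '=' expr math_operator expr   or   register '=' (math_operator)? expr
REGISTER_S = re.compile(
    rf"^\s*(?P<dst>{REGISTER})\s*=\s*"
    rf"(?:(?P<lhs>{EXPR})\s*(?P<op>{MATH_OP})\s*(?P<rhs>{EXPR})"
    rf"|(?P<uop>{MATH_OP})?\s*(?P<val>{EXPR}))"
    rf"\s*;?\s*$"
)


def parse_register_s(text):
    """Return the matched parts as a dict, or None if the text does not match."""
    match = REGISTER_S.match(text)
    if match is None:
        return None
    return {k: v for k, v in match.groupdict().items() if v is not None}


print(parse_register_s("EAX = EAX + ECX;"))
print(parse_register_s("jmp 0x400567;"))
```

A full parser would be generated from the complete grammar in Appendix A; the regular expression only shows the shape of the two binary and unary alternatives.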

Control instructions are very important because they can change the behavior of a program, and they can be changed or added by polymorphic and metamorphic malwares to avoid detection. The following MAIL control statements represent the control instructions:

control_s           ::= ( ’if’ condition_s (jump_s | assignment_s) )
                        ( ’else’ (jump_s | assignment_s) )? ;

jump_s              ::= ’jmp’ address ;

lib_call_s          ::= letter+ ’(’ address (, args)* ’)’ ;

function_s          ::= ’start_function_’ digit+ statement ’end_function_’ digit+ ;

condition_s         ::= (expr rel_operator expr)+ ;

All the MAIL language statements can be divided into the following 8 basic statements. The complete MAIL grammar is given in Appendix A:

statements          ::= ( statement* ) ;
statement           ::= assignment_s+
                        | control_s+
                        | condition_s+
                        | function_s+
                        | jump_s+
                        | lib_call_s+
                        | ’halt’
                        | ’lock’ ;

Every statement in the MAIL language has a type also called a pattern that can be used for pattern matching during malware analysis and detection. These patterns are introduced and explained in Section 2.5. MAIL has its own registers but also reuses the registers present in the architecture that is being translated to the MAIL language. There are other special registers such as:

  • Flag registers: ZF (zero flag), CF (carry flag), PF (parity flag), SF (sign flag) and OF (overflow flag). These flag registers are of size one byte and are used in conditional statements, e.g: if (ZF == 1) jmp 405632;

  • eflags: Stores the flag registers.

  • sp: Keeps track of the stack pointer.

  • gr and fr: An unlimited number of general purpose registers for use in integer and floating point instructions respectively.

2.4 MAIL Library

We have added 22 library functions to the MAIL language. Table 1 gives details about all these library functions. These library functions can help in translating most of the complex assembly instructions present in current processor architectures. The purpose of these functions is not to capture the exact functionality of the assembly instruction(s) but to help in analysing the structure and the behavior of the assembly program, and in capturing some of the patterns in the program that can help detect malwares.

Function                      Semantics
abs(op)                       Returns the absolute value of the parameter op
aes(op, mode)                 Performs AES encryption/decryption on op; mode=0 for encrypt and vice versa
allocate(n)                   Allocates n bytes of memory from the heap
atan(op)                      Returns the arc tangent of the parameter op
avg(op1, op2)                 Computes the average of the parameters op1 and op2
bit(op, index, len)           Selects len bits in op starting at index
clear(op, index, len)         Clears len bits in op starting at index
compare(op1, op2)             Compares the two values op1 and op2 and then sets the flag register
complement(op, index)         Complements the bit in op at index
convert(value)                Converts the value to either int or float
cos(op)                       Returns the cosine of the parameter op
count(op)                     Counts the number of ones in op
len(obj)                      Computes the length of the parameter obj
log(op)                       Computes the log of the parameter op
max(op1, op2)                 Returns the maximum of the parameters op1 and op2
min(op1, op2)                 Returns the minimum of the parameters op1 and op2
rev(op)                       Reverses the bit order in op
round(op)                     Rounds the parameter op
scanf(op1, op2)               Stores the index of the first one bit found in op1 in op2 (forward scan)
scanr(op1, op2)               Stores the index of the last one bit found in op1 in op2 (reverse scan)
set(op, index, len)           Sets len bits in op starting at index
sin(op)                       Returns the sine of the parameter op
sqrt(op)                      Computes the square root of the parameter op
substr(value, offset, len)    Returns the substring of the string value starting at offset with length len
swap(op1, op2)                Swaps the bits in op2 and writes the result back to op1
swap(op)                      Swaps the bits in op
tan(op)                       Returns the tangent of the parameter op

Table 1: MAIL Library Functions
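To make the bit-oriented entries of Table 1 concrete, here is a sketch of four of them in Python. The details are assumptions, since the table gives only informal semantics: bit index 0 is taken to be the least significant bit, operands are non-negative integers, `rev` works on an assumed fixed width, and the scan functions return the index instead of storing it in a second operand.

```python
def count(op: int) -> int:
    """Number of one bits in op."""
    return bin(op).count("1")


def rev(op: int, width: int = 32) -> int:
    """Reverse the bit order of op within an assumed width."""
    result = 0
    for i in range(width):
        if op & (1 << i):
            result |= 1 << (width - 1 - i)
    return result


def scanf(op: int) -> int:
    """Index of the lowest set bit (forward scan); -1 if op is zero."""
    if op == 0:
        return -1
    return (op & -op).bit_length() - 1


def scanr(op: int) -> int:
    """Index of the highest set bit (reverse scan); -1 if op is zero."""
    return op.bit_length() - 1


print(count(0b1011), rev(1, 8), scanf(0b1000), scanr(0b1010))
```

Because the library functions are meant to capture structure and patterns and not exact instruction semantics, an analysis tool may never need to evaluate them; the sketch only clarifies what the names stand for.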

2.5 MAIL Patterns for Annotation

The MAIL language can also be used to annotate a CFG of a program using the different patterns available in the language. The purpose of these annotations is to assign patterns to MAIL statements that can be used later for pattern matching during malware detection. Section 2.6 gives an example of a CFG with pattern annotation and Section 2.8 explains how they are used in malware detection. More than one statement in the MAIL language can have the same pattern. There are a total of 21 patterns in the MAIL language; they are listed and explained as follows:

  • ASSIGN: An assignment statement. e.g: EAX = EAX + ECX;

  • ASSIGN_CONSTANT: An assignment statement including a constant. e.g: EAX = EAX + 0x01;

  • CONTROL: A control statement where the target of the jump is unknown. e.g: if (ZF == 1) JMP [EAX+ECX+0x10];

  • CONTROL_CONSTANT: A control statement where the target of the jump is known. e.g: if (ZF == 1) JMP 0x400567;

  • CALL: A call statement where the target of the call is unknown. e.g: CALL EBX;

  • CALL_CONSTANT: A call statement where the target of the call is known. e.g: CALL 0x603248;

  • FLAG: A statement including a flag. e.g: CF = 1;

  • FLAG_STACK: A statement including flag register with stack. e.g: EFLAGS = [SP=SP-0x1];

  • HALT: A halt statement. e.g: HALT;

  • JUMP: A jump statement where the target of the jump is unknown. e.g: JMP [EAX+ECX+0x10];

  • JUMP_CONSTANT: A jump statement where the target of the jump is known. e.g: JMP 0x680376

  • JUMP_STACK: A return jump. e.g: JMP [SP=SP-0x8]

  • LIBCALL: A library call. e.g: compare(EAX, ECX);

  • LIBCALL_CONSTANT: A library call including a constant. e.g: compare(EAX, 0x10);

  • LOCK: A lock statement. e.g: lock;

  • STACK: A stack statement. e.g: EAX = [SP=SP-0x1];

  • STACK_CONSTANT: A stack statement including a constant. e.g: [SP=SP+0x1] = 0x432516;

  • TEST: A test statement. e.g: EAX and ECX;

  • TEST_CONSTANT: A test statement including a constant. e.g: EAX and 0x10;

  • UNKNOWN: Any unknown assembly instruction that cannot be translated.

  • NOTDEFINED: The default pattern. e.g: All the new statements when created are assigned this default value.
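The 21 patterns can be represented as an enumeration, and a statement can be assigned a pattern from its textual form. The classifier below is a heuristic sketch and not the procedure used by any tool described here: it assumes constants are written as hexadecimal literals, handles only some of the patterns, and returns NOTDEFINED for the stack and test forms it does not cover. The names `Pattern` and `classify` are hypothetical.

```python
import re
from enum import Enum

Pattern = Enum(
    "Pattern",
    "ASSIGN ASSIGN_CONSTANT CONTROL CONTROL_CONSTANT CALL CALL_CONSTANT "
    "FLAG FLAG_STACK HALT JUMP JUMP_CONSTANT JUMP_STACK LIBCALL "
    "LIBCALL_CONSTANT LOCK STACK STACK_CONSTANT TEST TEST_CONSTANT "
    "UNKNOWN NOTDEFINED",
)

HEX = re.compile(r"0x[0-9A-Fa-f]+")
FLAGS = {"ZF", "CF", "PF", "SF", "OF"}


def classify(statement):
    """Assign a pattern to a MAIL statement given as text (partial heuristic)."""
    s = statement.strip().rstrip(";").strip()
    low = s.lower()
    if low == "halt":
        return Pattern.HALT
    if low == "lock":
        return Pattern.LOCK
    if low.startswith("if "):
        target = s.split()[-1]  # the jump target is the last token
        return Pattern.CONTROL_CONSTANT if HEX.fullmatch(target) else Pattern.CONTROL
    if low.startswith("jmp "):
        target = s.split(None, 1)[1]
        if target.upper().startswith("[SP="):
            return Pattern.JUMP_STACK
        return Pattern.JUMP_CONSTANT if HEX.fullmatch(target) else Pattern.JUMP
    if low.startswith("call "):
        target = s.split(None, 1)[1]
        return Pattern.CALL_CONSTANT if HEX.fullmatch(target) else Pattern.CALL
    if "[SP=" in s.upper():
        return Pattern.NOTDEFINED  # stack patterns are not handled in this sketch
    if re.match(r"[a-z]+\(", s):
        return Pattern.LIBCALL_CONSTANT if HEX.search(s) else Pattern.LIBCALL
    if "=" in s:
        if s.split("=", 1)[0].strip().upper() in FLAGS:
            return Pattern.FLAG
        return Pattern.ASSIGN_CONSTANT if HEX.search(s) else Pattern.ASSIGN
    return Pattern.NOTDEFINED  # test patterns and anything else


print(classify("EAX = EAX + 0x01;").name)
print(classify("if (ZF == 1) JMP 0x400567;").name)
```

In practice the pattern would be assigned while the statement is being created during translation, when the operand kinds are already known, so no text matching would be needed.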

2.6 Binary to MAIL Translation

For translating a binary program, we first disassemble the binary program into an assembly program. Then we translate this assembly program into a MAIL program. We use a sample program, one of the malware samples, to give an example of the steps involved in the translation, as shown in Figure 2. This example shows how an x86 assembly program is translated to a MAIL program. In Section 2.6.1 we give an example of translating an ARM assembly program to a MAIL program.

Figure 2: Disassembly and Translation to MAIL of one of the Functions of one of the Malware Samples Listed in Table 4

The binary analysis of the function shown in Figure 2 has identified 5 blocks in this function labelled 18 - 25. There are two columns separated by →. The first column lists the x86 assembly instructions and the second column lists the corresponding translated MAIL statements. The mathematical instructions are translated to assignment statements with the appropriate operator added. Most of the data instructions are translated to simple assignment statements. Conditional jump instructions, such as JZ and JNZ, are translated to if statements. Some of the instructions are translated to more than one MAIL statement. For example, the instruction XCHG in block 25 is translated to three MAIL statements. The MAIL library functions are used to translate some of the instructions; for example, the instruction CMP in blocks 20 and 21 is translated using the library function compare. All the MAIL library functions are explained in Section 2.4.
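The translation rules described in this paragraph can be sketched as a small function. This is an illustration under stated assumptions, not the translator used to produce Figure 2: only a handful of instructions are covered, and the three statements emitted for XCHG (a swap through the temporary register gr_1) are an assumed form, since the exact statements appear only in the figure. The name `translate` is hypothetical.

```python
ARITHMETIC = {"ADD": "+", "SUB": "-"}


def translate(mnemonic, *ops):
    """Translate one x86 instruction into a list of MAIL statements (sketch)."""
    name = mnemonic.upper()
    if name == "MOV":
        return [f"{ops[0]} = {ops[1]};"]
    if name in ARITHMETIC:
        return [f"{ops[0]} = {ops[0]} {ARITHMETIC[name]} {ops[1]};"]
    if name == "CMP":
        return [f"compare({ops[0]}, {ops[1]});"]
    if name == "JZ":
        return [f"if (ZF == 1) jmp {ops[0]};"]
    if name == "JNZ":
        return [f"if (ZF == 0) jmp {ops[0]};"]
    if name == "XCHG":
        # Assumed form: swap through a temporary MAIL register.
        return [f"gr_1 = {ops[0]};", f"{ops[0]} = {ops[1]};", f"{ops[1]} = gr_1;"]
    raise NotImplementedError(f"no translation rule for {mnemonic}")


for line in translate("CMP", "EAX", "0x0") + translate("JZ", "0x401267"):
    print(line)
```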

In addition to its own registers the MAIL language reuses all the x86 registers. There is a special register sp used in the MAIL language to keep track of the stack pointer in the program. The example shows data embedded inside the code section in block 27. This block is used to store, load and process data by the function, as pointed out by the underlined addresses in the blocks. There are five instructions that change the control flow of the program and are indicated by the arrows in the Figure. There are two back edges, 20 ← 22 and 19 ← 23. These edges indicate the presence of loops in the program. The jump in block 24 jumps out of the function. MAIL also keeps track of the flags using boolean values. For example, the instruction CLD in block 19 clears the direction flag.

Each MAIL statement is associated with a type, also called a pattern. There are a total of 21 patterns in the MAIL language, as explained in Section 2.5. For example, an assignment statement with a constant value and an assignment statement without a constant value are two different patterns. Jump statements can have up to three patterns. The following patterns are assigned during translation to the statements in block 21 shown in Figure 2:

 21               EAX = EAX + -0x1;     -->     ASSIGN_CONSTANT
 21              compare(EAX, 0x0);     -->     LIBCALL_CONSTANT
 21      if (ZF == 1) jmp 0x401267;     -->     CONTROL_CONSTANT

These patterns are used to annotate a CFG for pattern matching. Section 2.8 explains how they help and improve malware detection.

2.6.1 Translation of an ARM Program to MAIL Program

In the previous Section we gave an example of translating an x86 assembly program to a MAIL program. As already mentioned, MAIL serves as a common language for different platforms, such as x86 and ARM binaries, and thereby helps malware analysis and detection tools to achieve platform independence. In this Section we provide an example of translating an ARM assembly program to a MAIL program. Figure 3 shows an example of such a translation.

Figure 3: An Example of Translation of an ARM Assembly Program to MAIL Program. The Function Merge::sort() shown in Figure 1 is First Disassembled to ARM Assembly Program and Then Translated to MAIL Program.

The most obvious difference between an ARM program and an x86 program is the length of the instructions, as shown in Figures 2 and 3. The size of the hex dump varies from instruction to instruction in the x86 program, whereas it is the same for each instruction in the ARM program. For the sake of completeness we also show the CFG of this program in Figure 4.

Figure 4: CFG of the Function Shown in Figure 3. Except for One of the Inner Loops, this CFG is Similar to the CFG Shown in Figure 1 (a). The CFG Shown in Figure 1 (a) is of the Same Function But Constructed From the x86 Program.

As shown in Figure 3, the binary analysis has identified 10 blocks in the program. The mathematical, load and mov instructions are translated to assignment statements. The one conditional move instruction, movele, is translated to a control statement. Branch instructions are translated to either simple jump or control statements. The instruction cmp is translated using the MAIL library function compare().

2.7 CFG Construction

Figure 5 shows the CFG of the sample program shown in Figure 2. The CFG clearly indicates two back edges and two forward edges that change the control flow of the program. The fifth edge, which jumps out of the function shown in Figure 2, is not shown in this CFG. There are two loops: one outer loop {19, 20, 21, 22, 23} and one inner loop {20, 21, 22}.

Figure 5: CFG of the Function Shown in Figure 2
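Back edges and the loops they induce can be found with a depth-first search. The sketch below is a standard textbook approach and not necessarily the one used to produce Figure 5. The edge list is hypothetical: the back edges 22 → 20 and 23 → 19 come from the text, while the fall-through edges and the two forward jumps (20 → 22 and 21 → 23) are assumed for illustration.

```python
def find_back_edges(cfg, entry):
    """Return edges (u, v) whose target v is on the DFS stack when u is visited."""
    back, visited, on_stack = set(), set(), set()

    def dfs(u):
        visited.add(u)
        on_stack.add(u)
        for v in cfg.get(u, ()):
            if v in on_stack:
                back.add((u, v))
            elif v not in visited:
                dfs(v)
        on_stack.discard(u)

    dfs(entry)
    return back


def natural_loop(cfg, tail, head):
    """Nodes that can reach tail without passing through head, plus head."""
    preds = {}
    for u, succs in cfg.items():
        for v in succs:
            preds.setdefault(v, set()).add(u)
    loop = {head, tail}
    work = [] if tail == head else [tail]
    while work:
        node = work.pop()
        for p in preds.get(node, ()):
            if p not in loop:
                loop.add(p)
                work.append(p)
    return loop


# Hypothetical edge list for the blocks discussed in the text.
cfg = {18: [19], 19: [20], 20: [21, 22], 21: [22, 23],
       22: [20, 23], 23: [19, 24], 24: []}

back_edges = find_back_edges(cfg, 18)
loops = {edge: natural_loop(cfg, *edge) for edge in back_edges}
print(sorted(back_edges))
print({edge: sorted(nodes) for edge, nodes in loops.items()})
```

With this edge list the two back edges yield the inner loop {20, 21, 22} and the outer loop {19, 20, 21, 22, 23}, matching the loops named in the text.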

2.8 Subgraph and Pattern Matching

After the binary analysis performed above we get a CFG of a program as shown in Figure 5. To detect whether a program contains a malware, we compare the CFG of the program with the CFG of a known malware. If the CFG of the malware matches all or part of the CFG of the program then the program contains a malware, i.e: the program is not benign. We formulate this problem of malware detection as follows:

Let $G=(V,E)$ be the graph of the program and $M=(V^{\prime},E^{\prime})$ be the graph of the malware, where $V,V^{\prime}$ and $E,E^{\prime}$ are the vertices and edges of the graphs respectively. Let $G_{sg}=(V_{sg},E_{sg})$ where $V_{sg}\subseteq V$ and $E_{sg}\subseteq E$. If $G_{sg}\cong M$ then $G$ is not benign.

We solve this problem using subgraph isomorphism (matching). Given two input graphs, subgraph isomorphism determines whether one of the graphs contains a subgraph that is isomorphic (similar in shape) to the other graph. In general, subgraph isomorphism is an NP-Complete problem [6]. A CFG of a program is a sparse graph, therefore it is possible to compute the isomorphism of two CFGs in a reasonable amount of time.
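A minimal backtracking search for the formulation above can be sketched as follows. It is a plain exhaustive search written for clarity, not the matching algorithm of any particular tool, and it has exponential worst-case cost. Graphs are assumed to be dictionaries mapping every node to the set of its successors; the function name `find_subgraph_match` is hypothetical.

```python
def find_subgraph_match(program, malware):
    """Map each malware node to a distinct program node so that every malware
    edge is also a program edge. Return the mapping, or None if none exists."""
    m_nodes = list(malware)
    p_nodes = list(program)

    def consistent(mapping, m, p):
        # Edges between m and the already mapped nodes must exist in the program.
        for other, image in mapping.items():
            if other in malware[m] and image not in program[p]:
                return False
            if m in malware[other] and p not in program[image]:
                return False
        # A self loop on m requires a self loop on p.
        return not (m in malware[m] and p not in program[p])

    def extend(mapping):
        if len(mapping) == len(m_nodes):
            return dict(mapping)
        m = m_nodes[len(mapping)]
        used = set(mapping.values())
        for p in p_nodes:
            if p not in used and consistent(mapping, m, p):
                mapping[m] = p
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[m]
        return None

    return extend({})


malware = {"a": {"b"}, "b": {"c"}, "c": {"a"}}       # a 3-node loop
program = {0: {1}, 1: {2}, 2: {3, 0}, 3: set()}      # contains the same loop
print(find_subgraph_match(program, malware))
```

The example also shows why very small malware graphs are problematic: a 3-node loop like this one occurs in many benign programs.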

Very small graphs when matched against a large graph can produce a false positive matching. We conducted an experiment (more details in the next paragraph) and found that some of the malware samples after normalization were reduced to a small graph of 3 nodes as shown in Table 4 and were responsible for producing a large number (87.85%, see Table 3) of false positives. To take care of these and other such graphs we also implemented a Pattern Matching sub-component within the Subgraph Matching component. Every statement in the language MAIL is assigned a pattern as explained in Section 2.5. We use this pattern to match each statement in the matching nodes of the two graphs.

An example of Pattern Matching of two isomorphic CFGs is shown in Figure 6. One of the CFGs of a malware sample, shown in Figure 6 (a), is isomorphic to a subgraph of one of the CFGs of a benign program, shown in Figure 6 (b). Considering these two CFGs as a match for malware detection will produce a wrong result, a false positive. The statements in the benign program do not match with the statements in the malware sample. To reduce this false positive we have two options: (1) we can match each statement exactly with each other or (2) assign patterns to these statements for matching. Option (1) will not be able to detect unknown malware samples and is time consuming, so we use option (2) in our approach, which in addition to reducing false positives has the potential of detecting unknown malware samples.

Figure 6: Example of pattern matching of two isomorphic CFGs. The CFG in (a) is isomorphic to the subgraph (blocks 0 - 3) of the CFG in (b).

For a successful pattern matching we require all the statements in the matching blocks to have the same patterns. In Figure 6, only the statements in block 0 satisfy this requirement. The statements in all the other blocks do not satisfy this requirement, therefore these CFGs fail the pattern matching.
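The requirement can be expressed as a check over the node mapping produced by subgraph matching. The sketch assumes that each block is represented by the ordered list of its statement patterns and that two blocks match only if these lists are equal; the block contents below are made up for illustration and are not those of Figure 6.

```python
def patterns_match(malware_blocks, program_blocks, mapping):
    """True if every matched pair of blocks has the same pattern sequence."""
    return all(
        malware_blocks[m] == program_blocks[p] for m, p in mapping.items()
    )


# Hypothetical blocks: only block 0 agrees, so the match is rejected.
malware_blocks = {0: ["ASSIGN", "CONTROL_CONSTANT"], 1: ["LIBCALL_CONSTANT"]}
program_blocks = {0: ["ASSIGN", "CONTROL_CONSTANT"], 1: ["ASSIGN_CONSTANT"]}
mapping = {0: 0, 1: 1}

print(patterns_match(malware_blocks, program_blocks, mapping))   # False
```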

To verify that adding the Pattern Matching component to the Subgraph Matching component improves the metamorphic malware matching technique, we performed an experimental study. The dataset used for the experiment consisted of a total of 4289 samples, including 266 malware samples. Out of the 266 malware samples, 250 were metamorphic malwares. Out of these 250 metamorphic malware samples we randomly selected 27 samples (10.11% of all the malware samples) for use in the experiment for matching. The dataset distribution based on the size of the CFG after normalization is shown in Table 2.

      Malware Samples Used For Matching (27)      Benign and Malware Samples (4289)
      Size of CFG       Number of Samples         Size of CFG         Number of Samples
      3                 19                        0 – 4               476
      92                1                         5 – 200             402
      93                2                         201 – 1000          753
      95                1                         1001 – 4997         1215
      96                1                         5007 – 9987         579
      99                2                         10012 – 19979       464
      100               1                         20032 – 29889       187
                                                  30109 – 63289       213

Table 2: Data Set Distribution, For the Experimental Study, Based on the Size (number of nodes) of the Control Flow Graph (CFG) after Normalization

The complexity (size) of these graphs ranges from 0 nodes to 63289 nodes. Some of the Windows DLLs (dynamic link libraries) that were used in the experiment do not have code but only data (cannot be executed), and that is why they have 0 node graphs (CFGs). All the 4289 samples were matched against the 27 selected malware samples. The results of this experiment are shown in Table 3. There are 87.85% false positives when only the Subgraph Matching technique is used. The reason for this large number of false positives, as explained earlier, is the small size (3 nodes as shown in Table 2) of the graph samples matched. The false positives are reduced to 0 when the Pattern Matching technique is used in addition to the Subgraph Matching technique.

An interesting observation: When we used the Pattern Matching technique in addition to the Subgraph Matching technique, the number of graphs matched was not 27 but 168. The 27 malware graph samples were matched not only with the same 27 graphs but also with an additional 141 metamorphic malware graphs. When checked manually, the nodes matched contained the same patterns as found in one of the 27 malware graph samples used for matching. That means some of the unknown metamorphic malwares were also detected because of the use of the Pattern Matching technique.

Component(s) Used                         TNG (1)   NGUM (2)   NGM (3)        NMGM (4)   False Positives
Subgraph Matching                         4289      27         4018/93.68%    257 (5)    3768/87.85%
Subgraph Matching and Pattern Matching    4289      27         168/3.92%      168 (6)    0/0%

(1)   TNG:     Total number of graphs (both benign and malware samples)
(2)   NGUM:  Number of graphs (only malware samples) used for matching
(3)   NGM:     Number of graphs (both benign and malware samples) matched
(4)   NMGM:  Number of malware graphs matched
(5)   The graphs matched were the 250 metamorphic malwares and the 7 other
       malwares.
(6)   An interesting observation: The 27 malware graph samples were not
       only matched with the same 27 graphs but they were also matched with
       additional 141 metamorphic malware graphs. When checked manually the
       nodes matched contained the same patterns as found in one of the 27
       malware graph samples used for matching.
Machine used:        Intel Core i5 CPU M 430 @ 2.27 GHz
RAM:                     4 GB
Operating System: Windows 8 Professional
Table 3: Results of the Experiment to Verify the Improved Matching Technique After Adding the Pattern Matching Component.
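The percentages in Table 3 follow directly from the counts and the total of 4289 graphs. The short check below recomputes them; the helper name `percent` is introduced here for illustration.

```python
TOTAL_GRAPHS = 4289


def percent(count, total=TOTAL_GRAPHS):
    """Share of the total, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


print(percent(4018))   # graphs matched, Subgraph Matching only
print(percent(3768))   # false positives, Subgraph Matching only
print(percent(168))    # graphs matched with Pattern Matching added
```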

This experimental study confirms that combining these two matching components improves the detection of already known metamorphic malwares. The 100% detection rate we achieved, as shown in Table 5, is consistent with this result.

3 Empirical Study of Using the MAIL Language

We carried out an empirical study to analyse the correctness and the efficiency of our techniques described above using the MAIL language. This Section describes this empirical study. We developed a prototype tool called MARD that uses MAIL for malware analysis and detection as described above. We collected different metamorphic malwares and Windows programs as samples to use in our tool.

3.1 Dataset

Our dataset for the experiments consisted of 1387 programs. Out of these, 250 are metamorphic malware samples collected from two different sources [18, 19], and the other 1137 are benign Windows programs. Table 4 gives more details about this dataset.

Malware Samples (250)                  Benign Programs (1137)
Size of CFG     Number of Samples      Size of CFG        Number of Samples
3               200                    17                 127
88              1                      30                 44
91 – 99         38                     44 – 998           412
100 – 104       10                     1000 – 9765        535
129             1                      10118 – 15343      19

Table 4: Data Set Distribution, For the Empirical Study, Based on the Size (number of nodes) of the Control Flow Graph (CFG) after Normalization

The data set contains a variety of programs for testing, from simple CFGs to complex CFGs. As shown, the size of the CFG of the malware samples ranges from 3 nodes to 129 nodes, and the size of the CFG of the benign programs ranges from 17 nodes to 15343 nodes. This variety in the samples provides a good testing platform for the graph and pattern matching techniques used in our tool.

3.2 Experiments and Results

Two experiments were carried out using our tool MARD to detect metamorphic malwares: (1) In the first experiment we wanted to see if the tool can detect all the known malwares. This experiment consisted of 250 known malware samples in the test data set. (2) In the second experiment we wanted to see if the tool can detect the unknown malwares. This experiment consisted of 225 unknown malware samples in the test data set. The results for this experiment were obtained using 10-fold cross validation. This Section gives details about these experiments and the results obtained.

3.2.1 Experiment (1): To detect known malwares

The tool MARD first builds the training dataset, also called Malware Templates, using the 250 malware samples. After a program (sample) is translated to MAIL and to a CFG, the tool detects the presence of malwares in the program using the Malware Templates and applying the graph and the pattern matching techniques described above. We ran the experiment on the two machines listed in Table 5, one with 2 cores (using 8 threads) and one with 4 cores (using 64 threads). There was no manual intervention during the complete run. The tool automatically generated the report after all the samples were processed.

Experiment   Analysis   Detection   False       Data Set Size     Real-Time (1)   Platform (2)
Number       Type       Rate        Positives   Benign/Malware
1 (3,4)      Static     100%        0%          1137 / 250                        Win 32
2 (4,5)      Static     93.92%      3.02%       1137 / 250                        Win 32
2 (4,6)      Static     99.6%       3.43%       1137 / 250                        Win 32
2 (4,7)      Static     100%        3.43%       1137 / 250                        Win 32

(1)   Real-time here means the detection is fully automatic and finishes in a reasonable amount of time.
(2)   All the samples (benigns and malwares) used were Windows 32 programs.
       Machines used in the experiments:
(3)   With 2 Cores:        Intel Core i5 CPU M 430 @ 2.27 GHz
       RAM:                     4 GB
       Operating System: Windows 8 Professional
(4)   With 4 Cores:        Intel Core 2 Quad CPU Q6700 @ 2.67 GHz
       RAM:                     4 GB
       Operating System: Windows 7 Professional
(5)   Results obtained using 10-fold cross validation
       Training dataset: 25 samples, Unknown malware samples: 225
(6)   Training dataset: 100 samples, Unknown malware samples: 150
(7)   Training dataset: 200 samples, Unknown malware samples: 50
Table 5: Results of the Experiments Carried out as Part of the Empirical Study

3.2.2 Experiment (2): To detect unknown malwares

For this experiment we selected 25 malware samples out of the 250 malwares. These 25 malware samples were used to train the tool MARD to classify a program as benign or malware. These two steps were repeated 10 times, and each time a different set of 25 malware samples was selected for training (10-fold cross validation). After the program (sample) is translated to MAIL, a CFG for each function in the program is built. Instead of using one large CFG as done in Experiment (1), we divide a program into smaller CFGs. A program that contains some percentage of the control flow of a training malware sample can be classified as a malware. The CFGs of a program help the tool MARD to detect such unknown malwares as explained below.
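The selection of training sets can be sketched as follows. Note that, as described above, each fold of 25 samples is used for training and the remaining 225 samples are treated as unknown, which is the reverse of the more common use of folds. The function name `training_folds` and the fixed random seed are assumptions made for this illustration.

```python
import random


def training_folds(samples, k=10, seed=0):
    """Yield (train, unknown) pairs; each fold of len(samples)//k items trains once."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // k
    for i in range(k):
        train = shuffled[i * size:(i + 1) * size]
        unknown = shuffled[:i * size] + shuffled[(i + 1) * size:]
        yield train, unknown


malware_ids = list(range(250))          # stand-ins for the 250 malware samples
splits = list(training_folds(malware_ids))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```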

Threshold    Detection Rate    False Positives (1)
10 (2)       100%              3.07%
20           99.2%             3.07%
25 (3)       99.2%             3.07%
30           93.2%             3.07%
40           86.4%             3.07%
50           82.8%             3.07%
60           76%               3.07%
70           76%               3.07%
80           76%               3.07%
90           76%               3.07%

(1)   The same set of training dataset is used for all the experiments,
       therefore all of them have the same number of false positives.
(2)   We did not pick 10 as the threshold because we used only
       one set of dataset and 100% detection rate seems too perfect.
(3)   We picked 25 as the threshold because after this the
       detection rate started falling considerably.
Test dataset size: 1137 benigns and 250 malwares.
Training dataset size: 25 malwares.
Machine used in the experiment:
With 4 Cores:        Intel Core 2 Quad CPU Q6700 @ 2.67 GHz
RAM:                     4 GB
Operating System: Windows 7 Professional
Table 6: Results of the Experiment Carried out to Pick the Optimum Threshold Value to be Used in Experiment (2)

Applying the graph and the pattern matching techniques described above, the tool MARD matches the CFGs of each of the 25 malware samples with the CFGs of a program to classify the program as either benign or malware. We base these classifications on a threshold value of 25%. That is, if 25% or more of the CFGs of a malware sample match the CFGs of a program then the program is classified as malware, else the program is classified as benign. The threshold value was chosen by carrying out experiments with a range of threshold values, as shown and explained in Table 6. The results of experiment (2) are listed in Table 5. The detection rate is 93.92% because of the small size (25 samples) of the training dataset. The detection rate improved to 99.6% and 100% when we used a training dataset of 100 and 200 samples respectively.
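The threshold rule can be sketched as below. The matching of two CFGs is passed in as a function, standing in for the graph and pattern matching described above; in the example CFGs are represented by plain labels and compared for equality, which is purely an assumption to keep the sketch self-contained. The name `classify_program` is hypothetical.

```python
def classify_program(program_cfgs, malware_samples, cfgs_match, threshold=25.0):
    """Label a program 'malware' if, for some training sample, at least
    `threshold` percent of that sample's CFGs match a CFG of the program."""
    for sample_cfgs in malware_samples:
        if not sample_cfgs:
            continue
        matched = sum(
            1 for m in sample_cfgs
            if any(cfgs_match(m, p) for p in program_cfgs)
        )
        if 100.0 * matched / len(sample_cfgs) >= threshold:
            return "malware"
    return "benign"


# Toy data: CFGs are labels, and two CFGs match when the labels are equal.
samples = [["a", "b", "c", "d"]]
same = lambda m, p: m == p
print(classify_program(["a", "x"], samples, same))   # 1 of 4 matched, i.e. 25%
print(classify_program(["x", "y"], samples, same))
```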

3.3 Limitations of MAIL

A program translated to MAIL when executed may not produce the same output as the original program. The language MAIL is designed to perform static binary analysis and is not suitable for performing dynamic binary analysis.

The patterns developed, if used with a behavioral signature of a binary program such as a CFG, have the capability to produce useful classifications for malware analysis and detection, as shown by the results of the above experiments. But if the patterns are used alone, they may not produce the desired results.

The side effects of an assembly instruction are not directly translated to the MAIL statement. With the presence of various flag registers in the MAIL language it is possible for a malware analysis tool to include the side effect(s) of an assembly instruction by generating more statements and updating the affected flag registers.

The MAIL language is most useful in capturing the behavior (including structural and functional) of a binary program and can be used as part of different malware detection techniques such as described in this paper and in [13, 11, 10, 9]. These techniques require behavioral, structural or functional information about a program. In its current form the MAIL language cannot be used as part of other signature-based malware detection techniques, such as [21, 19, 18]. These techniques build the signatures using the opcodes of a binary program.

4 Conclusion

We have developed the new language MAIL for malware analysis and have used it successfully in our tool MARD for malware analysis and detection. We carried out an experimental study and showed that we can achieve detection rates of 100% with 0% false positives for known malwares and ~(94 – 100)% with ~(3 – 3.5)% false positives for unknown malwares, as shown in Table 5. The two main contributions of the language MAIL are: (1) Providing platform independence and automation for malware analysis and detection tools, as is shown by its use in the tool MARD. (2) Optimizing the creation of a behavioral signature of a program, as is shown by creating an ACFG (Annotated Control Flow Graph), a CFG with patterns, of a binary program. We have shown how this ACFG is used for reliable malware analysis and detection in real-time. More recent examples of the use of MAIL can be found in [2].

Currently we are carrying out further research into optimizing the tool to increase its accuracy and efficiency for detecting unknown metamorphic malwares. We found that the behavioral signatures generated by the tool MARD using the MAIL language, and the graph and the pattern matching techniques, are helpful in detecting metamorphic malwares as shown in Table 5. We are collecting more metamorphic malware samples to use in our research and carry out experiments to further improve malware classification and detection.

References

  • [1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006.
  • [2] Shahid Alam, R Nigel Horspool, Issa Traore, and Ibrahim Sogukpinar. A framework for metamorphic malware analysis and real-time detection. Computers & Security, 48:212–233, February 2015.
  • [3] J.M. Bauer, M.J.G. Eeten, and Y. Wu. ITU study on the financial aspects of network security: Malware and spam. ©International Telecommunications Union (http://www.itu.int), 2008.
  • [4] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT ’08, pages 72–81, New York, NY, USA, 2008. ACM.
  • [5] F. Cohen. Computer viruses: Theory and experiments. Computers & Security, 6(1):22–35, February 1987.
  • [6] Stephen A. Cook. The complexity of theorem-proving procedures. In Proceedings of the third annual ACM symposium on Theory of computing, STOC ’71, pages 151–158, New York, NY, USA, 1971. ACM.
  • [7] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z, January 2013.
  • [8] International Standard Organization document reference ISO/IEC. Information Technology - Syntactic Metalanguage - Extended Backus-Naur Form, 14977 : 1996(E), 1996.
  • [9] Mojtaba Eskandari and Sattar Hashemi. ECFGM: Enriched control flow graph miner for unknown vicious infected code detection. Journal in Computer Virology, 8(3):99–108, August 2012.
  • [10] Mojtaba Eskandari and Sattar Hashemi. A graph mining approach for detecting unknown malwares. Journal of Visual Languages and Computing, 23(3):154–162, June 2012.
  • [11] Parvez Faruki, Vijay Laxmi, M. S. Gaur, and P. Vinod. Mining control flow graph as API call-grams to detect portable executable malware. In Proceedings of the Fifth International Conference on Security of Information and Networks, SIN ’12, pages 130–137, New York, NY, USA, 2012. ACM.
  • [12] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, CGO ’04, Washington, DC, USA, 2004. IEEE Computer Society.
  • [13] Jusuk Lee, Kyoochang Jeong, and Heejo Lee. Detecting metamorphic malwares using code graphs. In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC ’10, pages 1970–1977, New York, NY, USA, 2010. ACM.
  • [14] Cullen Linn and Saumya Debray. Obfuscation of executable code to improve resistance to static disassembly. In Proceedings of the 10th ACM conference on Computer and communications security, CCS ’03, pages 290–299, New York, NY, USA, 2003. ACM.
  • [15] David M. Chess and Steve R. White. An undetectable computer virus. Virus Bulletin Conference, September 2000.
  • [16] Philip O’Kane, Sakir Sezer, and Kieran McLaughlin. Obfuscation: The hidden malware. IEEE Security and Privacy, 9(5):41–47, September 2011.
  • [17] ARM Holdings plc. ARM® Architecture Reference Manual ARMv7-A and ARMv7-R edition, January 2012.
  • [18] B.B. Rad, M. Masrom, and S. Ibrahim. Opcodes histogram for classifying metamorphic portable executables malware. In e-Learning and e-Technologies in Education (ICEEE), 2012 International Conference on, pages 209–213, September 2012.
  • [19] Neha Runwal, Richard M. Low, and Mark Stamp. Opcode graph similarity and metamorphic detection. J. Comput. Virol., 8(1-2):37–52, May 2012.
  • [20] GCC Team. GCC: The GNU Compiler Collection. http://gcc.gnu.org, 2013.
  • [21] P. Vinod, V. Laxmi, M.S. Gaur, and G. Chauhan. Momentum: Metamorphic malware exploration techniques using MSA signatures. In Innovations in Information Technology (IIT), 2012 International Conference on, pages 232–237, March 2012.

Appendix A Appendix


TITLE:       MAIL (Malware Analysis Intermediate Language) Grammar in EBNF
AUTHOR:      Shahid Alam (salam@cs.uvic.ca)
DATED:       March 24, 2013
REVISION:    1.0 (March 24, 2013)

DESCRIPTION:

The grammar can be defined by a 3-tuple G = (T, N, P) where
T = set of terminals
N = set of non-terminals
P = set of production rules

This document describes the grammar for MAIL. The grammar uses the EBNF syntax, where
’|’ means a choice, ? means optional, * means zero or more times and + means one or more
times. Line Comments start with "--". Terminator symbol is ";". Terminals are enclosed
in single quotes.

----------------------------------------------------------------------------------------
--                                    PRODUCTION RULES                                --
----------------------------------------------------------------------------------------

statements          ::= ( statement* ) ;
statement           ::= assignment_s+
                        | control_s+
                        | condition_s+
                        | function_s+
                        | jump_s+
                        | lib_call_s+
                        | ’halt’
                        | ’lock’ ;


assignment_s        ::= register_s
                        | address_s ;

register_s          ::= register ’=’ (math_operator)? expr
                        | register ’=’ (expr)? math_operator expr
                        | register ’=’ lib_call_s ;

address_s           ::= address ’=’ (math_operator)? expr
                        | address ’=’ (expr)? math_operator expr
                        | address ’=’ lib_call_s ;

control_s           ::= ( ’if’ condition_s (jump_s | assignment_s) )
                        ( ’else’ (jump_s | assignment_s) )? ;

jump_s              ::= ’jmp’ address ;

lib_call_s          ::= letter+ ’(’ address (’,’ args)* ’)’ ;

function_s          ::= ’start_function_’ digit+ statement ’end_function_’ digit+ ;

condition_s         ::= (expr rel_operator expr)+ ;

----------------------------------------------------------------------------------------
--                                      HELPER RULES                                  --
----------------------------------------------------------------------------------------

expr                ::= register
                        | address
                        | digit+ ;

register            ::= ’eflags’
                        | ’gr_’ digit+
                        | ’fr_’ digit+
                        | ’sp’
                        | register_name (’:’ register_name)? ;

register_name       ::= letter+ [’0’ - ’9’]? ;

address             ::= ’[’ digit+ ’]’
                        | reg_address
                        | ’UNKNOWN’ ;

reg_address         ::= ’[’ register ( arith_operator (register | digit+) )* ’]’
                        | ’[’ ’sp’ ’=’ ’sp’ (’+’ | ’-’) digit+ ’]’
                        | ’[’ register (’:’ register)? ’]’ ;

letter              ::= [’a’ - ’z’] | [’A’ - ’Z’] ;

digit               ::= ’0x’ [’0’ - ’9’] | [’A’ - ’F’] ;

math_operator       ::= arith_operator | log_operator ;

arith_operator      ::= ’+’ | ’-’ | ’*’ | ’/’ | ’%’ | ’.’ ;

log_operator        ::= ’and’ | ’or’ | ’xor’ | ’!’ | ’<<’ | ’>>’ ;

args                ::= address (’,’ address)* ;

rel_operator        ::= ’<’ | ’>’ | ’<=’ | ’>=’ | ’==’ | ’!=’ ;

comment             ::= ’--’ (blank | tab | character | comment)* newline ;

character           ::= ’!’ | ’"’ | ’#’ | ’$’ | ’%’ | ’&’ | ’’’ | ’(’ | ’)’
                        | ’[’ | ’\’ | ’]’ | ’^’ | ’_’ | ’‘’ | ’{’ | ’|’ | ’}’
                        | ’*’ | ’+’ | ’-’ | ’/’ | ’,’ | ’.’ | ’~’
                        | ’:’ | ’;’ | ’<’ | ’=’ | ’>’ | ’?’ | ’@’
                        | [’0’ - ’9’] | letter ;

----------------------------------------------------------------------------------------
--                                          TOKENS                                    --
----------------------------------------------------------------------------------------

WS                  ::= blank | tab | newline ;
COMMENT             ::= ’--’ (blank | tab | character | comment)* newline ;
NUM                 ::= digit+ ;
COMMA               ::= ’,’ ;
COLON               ::= ’:’ ;
SCOLON              ::= ’;’ ;
LOP                 ::= ’and’ | ’or’ | ’xor’ | ’!’ | ’<<’ | ’>>’ ;
AOP                 ::= ’+’ | ’-’ | ’*’ | ’/’ | ’%’ | ’.’ ;
ROP                 ::= ’<’ | ’>’ | ’<=’ | ’>=’ | ’==’ | ’!=’ ;
SFUN                ::= ’start_function_’ digit+ ;
EFUN                ::= ’end_function_’  digit+ ;
EQUAL               ::= ’=’ ;
MUL                 ::= ’*’ ;
DIV                 ::= ’/’ ;
PLUS                ::= ’+’ ;
MINUS               ::= ’-’ ;
LBRKT1              ::= ’(’ ;
RBRKT1              ::= ’)’ ;
LBRKT2              ::= ’[’ ;
RBRKT2              ::= ’]’ ;
IF                  ::= ’if’ ;
ELSE                ::= ’else’ ;
UNKNOWN             ::= ’UNKNOWN’ ;
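
The TOKENS section above can be illustrated with a small tokenizer. The following Python sketch is not part of MARD; it is a minimal example under stated assumptions. The token names are taken from the grammar, but the regular expressions, the IDENT token (used here for register names, library call names and keywords such as jmp), the reading of NUM as either a 0x-prefixed hexadecimal literal or a run of decimal digits, and the sample statements are all assumptions made for illustration.

```python
import re

# Token names follow the TOKENS section of the grammar. The patterns, the
# IDENT token and the treatment of NUM are assumptions made for this sketch.
TOKEN_SPEC = [
    ("COMMENT", r"--[^\n]*"),                       # line comment, up to end of line
    ("WS",      r"[ \t\n]+"),
    ("SFUN",    r"start_function_[0-9]+"),
    ("EFUN",    r"end_function_[0-9]+"),
    ("UNKNOWN", r"\bUNKNOWN\b"),
    ("IF",      r"\bif\b"),
    ("ELSE",    r"\belse\b"),
    ("NUM",     r"0x[0-9A-Fa-f]+|[0-9]+"),
    ("LOP",     r"<<|>>|\b(?:and|or|xor)\b|!(?!=)"),  # tried before ROP so '<<' wins over '<'
    ("ROP",     r"<=|>=|==|!=|<|>"),                  # tried before EQUAL so '==' wins over '='
    ("EQUAL",   r"="),
    ("AOP",     r"[+\-*/%.]"),
    ("COMMA",   r","),
    ("COLON",   r":"),
    ("SCOLON",  r";"),
    ("LBRKT1",  r"\("),
    ("RBRKT1",  r"\)"),
    ("LBRKT2",  r"\["),
    ("RBRKT2",  r"\]"),
    ("IDENT",   r"[A-Za-z_][A-Za-z0-9_]*"),         # hypothetical: registers, call names, jmp
]

MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split MAIL source text into (token_name, lexeme) pairs, skipping WS and COMMENT."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup not in ("WS", "COMMENT"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    # Hypothetical MAIL statements, written only to exercise the tokenizer.
    for line in ("EAX = EAX + 0x10; -- add a constant",
                 "if EAX != 0x0 jmp [0x401000];"):
        print(tokenize(line))
```

The alternatives are ordered so that longer operators are tried before their prefixes, which is one simple way to resolve the overlap between LOP, ROP and EQUAL; a generated lexer would typically apply a longest-match rule instead.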

Appendix B Appendix

List of x86 and ARM instructions, in alphabetical order, that are not translated to MAIL statements:

x86

3DNOW
AAA, AAD, AAM, AAS, AESDEC, AESDECLAST, AESENC
AESENCLAST, AESIMC, AESKEYGENASSIST, ARPL
BOUND
CLFLUSH, CLTS, CMC, CPUID, CRC32
DAA, DAS
EMMS, ENTER
FCLEX, FDECSTP, FEDISI, FEEMS, FENI, FFREE, FINCSTP, FINIT
FLDCW, FLDENV, FNCLEX, FNINIT, FNSAVE, FNSTCW, FNSTENV
FNSTSW, FRSTOR, FSAVE, FSETPM, FSTCW, FSTENV, FSTSW
FXRSTOR, FXRSTOR64, FXSAVE, FXSAVE64, FXTRACT
GETSEC
INT 3, INVD, INVEPT, INVLPG, INVLPGA, INVPCID, INVVPID
LEAVE, LFENCE, LZCNT
MFENCE, MONITOR, MPSADBW, MWAIT
PAUSE, PREFETCH, PSAD, PSHUF, PSIGN
RCL, RCR, RDRAND, RDTSC, RDTSCP, ROL, ROR, RSM
SFENCE, SHUFPD, SHUFPS, SKINIT, SMSW
SYSCALL, SYSENTER, SYSEXIT, SYSRET
VAESDEC, VAESDECLAST, VAESENC, VAESENCLAST, VAESIMC
VAESKEYGENASSIST, VERR, VERW, VMCALL, VMCLEAR, VMFUNC
VMLAUNCH, VMLOAD, VMMCALL, VMPSADBW, VMREAD, VMRESUME
VMRUN, VMSAVE, VMWRITE, VMXOFF, VSHUFPD, VSHUFPS
VZEROALL, VZEROUPPER
WAIT, WBINVD
XRSTOR, XSAVE, XSAVEOPT

ARM

BKPT
CDP, CDP2, CLREX, CLZ, CPS, CPSID, CPSIE, CRC32, CRC32C
DBG, DCPS1, DCPS2, DCPS3, DMB, DSB
HVC
ISB
LDC, LDC2
MCR, MCR2, MCRR, MCRR2, MRC, MRC2, MRRC, MRRC2
PLD, PLDW, PLI, PRFM
SETEND, SEV, SHA1C, SHA1H, SHA1M, SHA1P, SHA1SU0, SHA1SU1
SHA256H, SHA256H2, SHA256SU0, SHA256SU1, SMC
SSAT, SSAT16, STC, STC2, SVC
USAT, USAT16
VCVT, VCVTA, VCVTB, VCVTM, VCVTN, VCVTP, VCVTR, VCVTT, VEXT
VLD1, VLD2, VLD3, VLD4
VQRDMULH, VQRSHL, VQRSHRN, VQRSHRUN
VQSHL, VQSHLU, VQSHRN, VQSHRUN
VRINTA, VRINTM, VRINTN, VRINTP, VRINTR, VRINTX, VRINTZ
VRSHL, VRSHR, VRSHRN, VRSRA, VRSUBHN
VST1, VST2, VST3, VST4, VTBL, VTBX, VTRN, VUZP, VZIP
WFE, WFI, YIELD
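
One way a translator front end might use these lists is to drop the listed instructions before emitting MAIL statements. The Python sketch below is a minimal illustration, not the MARD implementation: the names UNTRANSLATED_X86 and translatable are hypothetical, the set holds only a few entries copied from the x86 list above, and the matching is an exact comparison on the first word of each instruction, so multi-word entries such as INT 3 and family names such as PREFETCH would need extra handling.

```python
# Hypothetical subset of the x86 list above; a complete filter would hold every entry.
UNTRANSLATED_X86 = frozenset({
    "AAA", "CPUID", "LFENCE", "MFENCE", "PAUSE", "RDTSC", "SFENCE", "WAIT",
})


def translatable(instructions, untranslated=UNTRANSLATED_X86):
    """Return the instructions whose mnemonic (first word) is not in the untranslated set."""
    kept = []
    for instruction in instructions:
        parts = instruction.split(None, 1)   # mnemonic, then operands (if any)
        if not parts:
            continue                         # skip empty lines
        if parts[0].upper() not in untranslated:
            kept.append(instruction)
    return kept


if __name__ == "__main__":
    # Hypothetical disassembly listing.
    listing = ["mov eax, ebx", "cpuid", "add eax, 1", "pause"]
    print(translatable(listing))
```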