This is a blog about a LLVM IR mutation tool called alive-mutate based on Alive2.
Alive-mutate can perform a lot mutations on the codes divided into two categories. One is called simple mutations and the other is complex mutations. Simple mutations are in-place mutations. They don’t require extra information from the code and easy to undo. Complex mutations are performed on a copy of the original modular module and need to maintain data structures as iterating the code such as dominated variables.
Simple Mutations are basically intuitive and simple operations such as swapping operands or reset instruction flags. They are easy to be performed and have the “undo” ability to cancel their change; thus, instead of cloning whole module and destroy it after mutation, all performed mutations would be undo at the last. Besides operations supported could be divided into different groups according to what objects they are applied to.
LLVM provides attribute sets associated with functions as well as their parameters such as “noinline” and “immarg”. Attributes constrain certain behaviors of compilers and are served as assumptions for particular optimization passes. It is necessary to see what those optimizations would do with unsatisfied assumptions.
In addition, for function with void as return type, we can remove them safely because we don’t need to care about contaminating the dominance tree due to no new definitions would be introduced here. It is interesting to see how optimization passes would do with removing random void functions.
In summary, we support below mutations on functions.
- Remove void function calls
- Remove or set “nofree” flag for function definitions
- Remove or set “dereferenceable” and “nocapture” for function parameters
LLVM supports a large amount of binary instructions. It is important to have the ability to modify them freely in order to improve search space. However, because flags are associated with certain instructions we can’t just replace the operator randomly. For instance, bit operations such as bit-xor has no associated flags while integer operations such as add or multiply need flags related to overflow. We divide binary instructions into different groups and every member in the group share a common set of attributes. Thus we can replace the operator with any one if both of them are in the same group or reset flags by the group information. Besides, alive-mutate contains two more mutations, swapping the operands and replacing constant with a random chosen one if any.
In a nutshell, we support below mutations on binary instructions.
- Replace operator
- Swap operands
- Reset instruction flags
- Replace constant
Interestingly, there is one more instruction called GEP(GetElementPointer) Instruction. It has an “inbounds” attribute which tends to be buggy so we support remove or reset the flag on GEP instructions as well.
Complex mutations is used for those complicated mutations which requires extra information obtained from the code such as dependency or dominance. Besides, some mutations have difference range to be applied. Unlike simple mutations explained before, some mutations work on a basic block or whole function while some of them are applied on an instruction. It is necessary to distinguish between them and applied appropriately.
Given alive-mutate is often asked to generate a huge amount of variants, it is natural to store some information beforehand and save it for later use instead calculating it every time whenever meets the scenario. For example, we need to use the dominance information thus dominance trees are saved for every function in the module during initialization. Besides, by iterating all instructions over functions, we can collect a set of used constant which is helpful when obtaining a random constant because an existent constant is often a better choice comparing to a totally random one.
Because complex mutations are hard to undo, we will perform mutations on a copy of the origin. However, one question shows up here is the discrepancy between the copy and origin. The information we saved is based on the origin, but the instruction we might update is on the copy. There are different objects in the memory. Moreover, mutations can generate new variables or definition and they would be discarded later. To solve the first problem, a 1-to-1 map is used for building the connection from values(instructions, functions, etc.) in the origin to values in the copy. For the second problem, a 2-level look-up mechanism is used. All information from the origin is saved at the first level container. All temporary definitions are saved at the second level container. When looking up a value, the searcher can pick one freely on two levels. After mutations done, only the second level container would be cleared.
This mutations is used for shuffling consecutive instructions without any dependencies among them. Because of no dependencies, any permutation of instructions is viable and won’t cause any influence on dominated tree. We call such consecutive instructions as a shuffle block. During initialization phase, alive-mutate will record all shuffle blocks with size larger than 1 when iterating all instructions. A hash set is enough to the property of shuffle blocks. When arriving a new LLVM basic block, it will read all available shuffle blocks inside current basic block and shuffle them one by one. One other thing needs to be noticed is that LLVM requires all phi-node must be at the beginning of a basic block; thus phi-nodes must be treated differently with ordinary instructions.
(one shuffle example here)
In the simple mutation section, we talked about swapping operands and replacing constant in a binary instructions but we can do more. For example, we can replace one operand on an int-add with any accessible integer in the context, maybe from global variables, function parameters or some dominated integer definition. To achieve this goal, we need to maintain the information in the 2-level structure. The other possibility is making a fresh variable instead using a existing one. In summary, for an instruction with the form “%a = op %b %c”, this mutation can replace %b or %c with (1) a random picked variable from global variables, function parameters, or accessible variables with the same type. (2) a fresh variable %d whose form is defined by a new generated binary instruction “%d = op1 %d1 %d2”. It ensures %d has the same type and %d1 and %d2 are made by condition (1).
(two examples here, one with (1) and the other with (2))
Moving an instruction around require handling more situations. Say definition B use definition A, i.e. A is defined before B. If we move B before A then we have to update the usage of A to another definition. Similarly, if we move A after B, we have to update the usage at B as well. Moving instruction forward and backwards are two scenarios we must handle.
For updating the usages, we have two strategies like the previous section. The first is a random picked variable from global variables, function parameters, or accessible variable with the same type. However, the second strategy is more complicated because we only need to handle binary instructions before whose types are simple and primitives. Here, the type of usages could be any; thus instead generate a instruction with the type, we choose to insert a new global variable or add an function parameter for the type. In execution, if alive-mutate failed in the first strategy it will turn to the second strategy and randomly pick a constructor.
(two examples here, one move forward and the other move backwards)
Thanks to llvm-stress, a tool for generating random .ll files, we can insert a random code piece to any location we want. By calling the library function, we would insert a code piece whose size ranges from 5 to 9. The generated random instructions includes load and store to memory, extract and insert to vector, select, binary and compare instructions.
(one example here)