The root cause of software vulnerabilities is that modern computers, in implementing the Turing machine model, do not strictly separate program code from data in memory, so data in memory can be executed as code. For this reason, an attacker can construct a malicious payload, implant it into a target program by some means, and thereby gain control of that program. Faced with an endless stream of attack methods, academia and industry have carried out extensive research on software vulnerability analysis and proposed many methods and tools to address these security problems.
Driven by rapid progress in programming language design and program analysis, source-oriented software vulnerability mining has made significant advances. However, source-oriented techniques also have shortcomings. First, vendors do not provide the source code for most software. Second, high-level languages and core operating system components are ultimately compiled into binary code, and vulnerabilities may be introduced during compilation and linking, making them difficult to detect at the source level. Vulnerability mining for binary programs, by contrast, is language-independent, requires neither source code nor a compile-and-link step, and operates directly on the executable, so studying vulnerabilities directly in binary code is more practical.
However, vulnerability analysis for binary programs also faces challenges. The biggest problem is that the underlying instructions are complex and the program lacks the type information that carries semantics and syntax, making it difficult to understand; researchers therefore need extensive low-level systems knowledge. To address this, the community focuses on translating binary programs into an intermediate language, recovering the needed semantic and type information from it, and then using that information for vulnerability analysis. Intermediate languages combine instruction simplicity with semantic richness, compensating for the difficulty of analyzing the underlying instructions directly.
A variety of binary vulnerability analysis techniques exist, and they can be classified in many ways. By degree of automation, they divide into manual and automatic/semi-automatic analysis; by whether the software is executed, into dynamic analysis, static analysis, and combined dynamic-static analysis; by the openness of the software's code, into black-box, white-box, and gray-box testing; and by the form of the object under study, into techniques based on intermediate languages and techniques based on the underlying instruction set. Because these divisions come from different perspectives, the categories overlap regardless of which criterion is applied, and a specific technique can belong to different categories under different classification standards at the same time. Typical techniques include fuzz testing, symbolic execution, and taint analysis, which among them cover almost all of these classification criteria. Taint analysis and symbolic execution can each be either dynamic or static from the execution perspective, and can serve as white-box testing from the code-openness perspective; fuzz testing is both dynamic analysis and black-box testing. In addition, all three techniques can be built either on intermediate languages or on the underlying instructions.
Therefore, although many binary vulnerability analysis techniques exist, a few typical ones can almost represent the whole field. Taint analysis starts from the root of software vulnerability exploitation: it marks untrusted input as taint data, tracks how that data is used as the program executes, and reports a vulnerability if a security-critical operation uses the taint data illegally. Because it captures the essence of exploitation, the method is effective in principle. Meanwhile, with the growth of hardware computing power, symbolic execution has become increasingly widely used thanks to its high path coverage. In contrast to dynamic taint analysis, which focuses on the flow of data, symbolic execution analyzes the flow of control, so the two techniques complement each other well in vulnerability analysis. In addition, fuzz testing, as a representative software vulnerability analysis technique, occupies an important position in the field because it does not rely on program source code and has low system overhead.
Because most current binary vulnerability analysis tools are not integrated on a common platform, earlier research results are often hard to reuse or build upon; subsequent researchers cannot extend them and must re-implement them, wasting considerable effort. A better approach is to combine existing analysis techniques and integrate them into one platform, forming a comprehensive binary vulnerability analysis framework. Accordingly, future work will construct a more powerful binary analysis framework on the basis of existing frameworks.
1 Binary Program Vulnerability Analysis Framework
For software vulnerabilities, researchers have proposed many analysis techniques that are now widely used. However, most existing work either applies a particular tool to a specific type of program or optimizes one of these techniques in isolation, and therefore faces the following problems. First, much research on binary analysis is difficult to reuse: later researchers cannot extend or supplement earlier work and can only re-implement the methods proposed by their predecessors, which wastes a great deal of effort. Second, a single analysis technique cannot achieve complementary advantages. Each technique has its own strengths and weaknesses, and some key problems facing individual techniques remain unsolved, restricting their development; combining two or more techniques so that their advantages complement each other will therefore be an important future research direction.
Given these problems, a better solution is to build a unified binary vulnerability analysis framework that integrates the major analysis techniques, so that subsequent researchers can optimize existing techniques or integrate next-generation binary analysis techniques on top of the framework, achieving reuse.
A number of binary vulnerability analysis frameworks already exist.
BitBlaze is a unified binary analysis platform that combines dynamic and static analysis and is designed for extensibility. BitBlaze contains three main components: Vine, TEMU, and Rudder. Vine is the static analysis component; it translates the underlying instructions into a simple, standardized intermediate language and, on top of it, provides utilities for common static analyses such as building program dependence graphs, data flow graphs, and control flow graphs. TEMU is the dynamic analysis component; it provides whole-system dynamic analysis and implements semantic extraction and user-defined dynamic taint analysis. Rudder is the component that combines concrete and symbolic execution; using the facilities provided by Vine and TEMU, it realizes mixed concrete and symbolic execution at the binary level and provides path selection and constraint solving. The main processing flow of the BitBlaze components is shown in Figure 1.
Figure 1 BitBlaze process flowchart
SHOSHITAISHVILI et al. proposed angr, a multi-architecture binary analysis framework that integrates many binary analysis techniques and supports dynamic symbolic execution and static analysis of binary programs. angr is an automated binary analysis tool developed by the Shellphish team for CTF competitions; it was originally used to find backdoors in programs and can now be applied to vulnerability analysis. Because angr integrates several existing techniques and implements different functions in different modules, it is easy to compare techniques on the platform and exploit their respective advantages. Its processing flow, briefly, is: first, load the binary program into the angr analysis platform; then, convert the binary code into an intermediate language; finally, analyze the program further, through static analysis or symbolic-execution-based exploration. The angr process flowchart is shown in Figure 2.
Figure 2 angr processing flowchart
angr mainly comprises the following modules. The intermediate representation module translates binary code into an intermediate language; the VEX intermediate language allows angr to analyze binary programs on different architectures. The binary loading module (CLE) loads a binary program into the analysis platform. The program state module (SimuVEX) represents program state; a SimState in SimuVEX carries a set of state plugins such as registers, abstract memory, and symbolic memory, whose contents can be specified by the user. The data model module (Claripy) provides an abstract representation of the values stored in a SimState's registers or memory. Finally, the full-program analysis module combines all of these modules, allowing angr to perform complex, complete program analyses.
angr is an open-source tool with good compatibility and cross-platform, cross-architecture support, and it provides a powerful symbolic execution engine with strong analysis capabilities. However, since angr was originally intended to find backdoors in programs, it cannot yet perform complete vulnerability analysis when applied to that field. First, in its symbolic execution component, angr only provides the program's path information, and manual assistance is required for subsequent analysis. Second, angr is currently strong mainly in symbolic execution and has not integrated other typical analysis techniques. Further research is therefore required to automate the subsequent analysis process and to integrate other analysis techniques as a complement.
In future research, a comprehensive binary vulnerability analysis platform will be built around the current mainstream techniques to form a large-scale binary vulnerability analysis framework. The specific construction ideas are as follows: 1. Add modeling of binary programs and, on that basis, formal descriptions, to establish a security policy library. 2. Optimize the intermediate language component, addressing the information loss and low conversion efficiency encountered when translating binary programs into an intermediate language, in order to obtain a more accurate information flow (both data flow and control flow). 3. Integrate symbolic execution and optimize it further to automate the subsequent analysis; at the same time, integrate taint analysis to assist symbolic execution in reducing the number of paths, thereby mitigating path explosion. Taint analysis can also be combined with the security policy library established above for vulnerability determination. 4. Integrate fuzz testing: solve each path's constraints with a constraint solver, generate test cases in a targeted manner, and then use fuzz testing to confirm the vulnerability. The overall flow of the binary vulnerability analysis framework is shown in Figure 3.
Figure 3 Overall flowchart of the binary vulnerability analysis framework
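As a rough sketch of how the four construction steps might fit together, consider the skeleton below. It is purely illustrative: every function is a stub standing in for a real component, and none of the names correspond to an actual framework API.

```python
# Illustrative skeleton of the proposed analysis pipeline; every function is a stub.
def load_binary(path):
    # Load the binary program into the platform (stub).
    return {"path": path}

def lift_to_ir(binary):
    # Step 2: translate binary code into an intermediate language (stub).
    return ["IMark", "GET", "Add32", "PUT"]

def taint_guided_paths(ir):
    # Step 3: symbolic execution, with taint analysis pruning irrelevant paths.
    # Stub: pretend exploration found two feasible path conditions.
    return [["input > 0"], ["input <= 0"]]

def solve_constraints(path_condition):
    # Step 4a: a constraint solver turns a path condition into a concrete input (stub).
    return b"A" * 8

def fuzz(binary, seeds):
    # Step 4b: fuzz testing replays/mutates the solver-generated seeds (stub).
    return list(seeds)

def analyze(path):
    binary = load_binary(path)
    ir = lift_to_ir(binary)
    paths = taint_guided_paths(ir)
    seeds = [solve_constraints(p) for p in paths]
    return fuzz(binary, seeds)

print(len(analyze("target.bin")))  # one candidate test case per explored path
```

The point of the sketch is the data dependency between stages: the intermediate language feeds the path exploration, the solver turns paths into seeds, and fuzzing consumes the seeds.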
There are many types of binary vulnerability analysis techniques with different application scenarios, so three mainstream, representative techniques are selected here: taint analysis, symbolic execution, and fuzz testing. They are combined and integrated into a comprehensive binary vulnerability analysis platform on which vulnerability analysis is then performed. The research status of these three techniques is introduced separately in this article.
2 Background and significance of intermediate language research
Compared with source-oriented vulnerability mining, binary-oriented vulnerability mining is more valuable to study, mainly because it does not rely on source code and is more widely applicable. However, owing to the particular complexity of vulnerability analysis for binary programs, it is difficult to analyze them directly, for two main reasons: 1. The underlying instruction sets are large; the x86 instruction set, for example, has hundreds of instructions, many of them complex. 2. Some seemingly simple problems, such as separating code from data, are undecidable; that is, binary code lacks the corresponding semantic and type information.
Given these difficulties, researchers prefer to represent binary programs in a more tractable form, and so propose replacing binary code with an intermediate language: converting the binary code into intermediate code that carries semantic information for subsequent analysis. The intermediate language must be lower-level than a high-level language, so that the conversion from binary code is not complicated, yet higher-level than the binary program, so that it can supply a large amount of semantic information. The intermediate language is itself a further standardization of the underlying language, designed mainly for later vulnerability analysis. The conversion to an intermediate language is the most important step in binary vulnerability analysis: the semantic information carried by the intermediate language is needed to obtain the program's control flow graph (CFG) and thus its information flow, including control flow and data flow. Obtaining an intermediate language translated from binary code is therefore the basis for all subsequent work.
Many methods and tools exist for converting to an intermediate language. Chen Kaiming et al. proposed a conversion method based on symbolic execution: first define the environment and semantics of symbolic execution, then execute the assembly language in the symbolic environment and convert it into the corresponding intermediate language. Because it must emulate program execution like symbolic execution, this adds overhead. Jiang Lingyan et al. [4] proposed an intermediate representation called VINST, which follows the principle of making the frequently executed part fast and keeping the rest merely correct: it contains only the twenty-odd instructions most common across instruction sets, simulates the remaining relatively simple instructions with several intermediate instructions matching their function, and simulates complex instructions with a CALL intermediate instruction, thereby emulating the functionality of a high-level language. This method is simple in form but very inefficient. CIFUENTES et al. [8] developed the binary translation system UQBT, which describes the instruction-set information of a system via the semantic specification language SSL.
In UQBT, the conversion from assembly language to the intermediate language is done by generating a dictionary of instructions. Since generation of the intermediate language depends entirely on the SSL description of the instruction system, an inaccurate SSL description leads to missing or inaccurate semantics in the intermediate language converted from it. The later decompiler Boomerang was developed on the basis of UQBT and is therefore similar in principle. Ma Jinxin et al. introduced an intermediate language in static single assignment form for type reconstruction, in which each variable is defined only once, reducing the large and complex x86 instruction set to a simple equational form; on this basis they proposed constructing a register abstract syntax tree to solve the base-address pointer alias problem. However, because many cases are ignored during translation, intermediate languages produced by this method suffer information loss. The Valgrind platform also employs an intermediate representation, the architecture-independent VEX intermediate language. VEX uses a RISC-style instruction set with read and write operations for memory and registers: LOAD and STORE read from and write to memory variables, respectively; the GET and PUT operations read a register variable's value into a temporary variable or write a temporary variable's value back to a register variable; and the IMark statement marks the address of a specific instruction.
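As a simplified illustration of these operations, the x86 instruction `add eax, ebx` might lift to VEX statements roughly like the following (register names are written symbolically here; real VEX refers to registers by their byte offset into a guest-state structure, e.g. `GET:I32(offset=8)`):

```
------ IMark(0x401000, 2, 0) ------  ; marks address and length of the guest instruction
t0 = GET:I32(eax)                    ; read register eax into temporary t0
t1 = GET:I32(ebx)                    ; read register ebx into temporary t1
t2 = Add32(t0, t1)                   ; RISC-style addition on temporaries
PUT(eax) = t2                        ; write the result back to eax
```

One machine instruction thus expands into several IR statements: the semantics become explicit and uniform, at the cost of the translation efficiency discussed in this section.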
However, current intermediate-language-based vulnerability analysis also faces problems. First, some information is lost during translation into the intermediate language; for example, the VEX used by Valgrind does not preserve the EFLAGS register. Second, intermediate languages translate a single instruction into several instructions, which reduces efficiency; this affects, for example, the VEX used by Valgrind and the Vine IL used by BitBlaze.
Since the intermediate language is the prerequisite for all subsequent work, improving the efficiency and accuracy of binary translation is an urgent problem for every translator. For the intermediate language component, future research includes two aspects. First, because intermediate code introduces much redundant information, conciseness must be the primary design principle: minimize repeated information and aim for a single function per instruction. Second, while reducing redundancy, the translated intermediate language must retain semantic information as complete as possible, so the current problem of information loss needs further remediation. By comparing the intermediate language with the assembly language obtained by disassembling the corresponding binary program, one can check whether the intermediate language lacks important semantic information; any missing semantics must then be supplemented so as to provide the most complete data flow and control flow possible for subsequent vulnerability analysis.
3 Dynamic taint analysis techniques
3.1 Principle and classification of dynamic taint analysis
Taint analysis is a technique that tracks and analyzes the flow of taint information through a program; it was first proposed by DENNING in 1976. Its main principle is to mark data from untrusted sources, such as the network and files, as "contaminated"; the results of arithmetic and logical operations on such data, and any newly generated data derived from it, inherit the "contaminated" attribute of the source data. By analyzing these attributes, certain properties of the program can be derived. Depending on whether the program is executed, taint analysis divides into two types: dynamic taint analysis and static taint analysis.
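The inheritance of the "contaminated" attribute can be sketched in a few lines of code. This is a toy illustration rather than a real tool; the class name `TaintedValue` is hypothetical.

```python
class TaintedValue:
    """A value paired with a taint flag; operation results inherit taint."""
    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        # Arithmetic on tainted data yields tainted data: the result
        # inherits the "contaminated" attribute of either operand.
        return TaintedValue(self.value + other.value,
                            self.tainted or other.tainted)

# Data from an untrusted source (e.g. the network) is marked contaminated.
net_input = TaintedValue(0x41, tainted=True)
constant = TaintedValue(8)            # program-internal constant: clean

result = net_input + constant
print(result.value, result.tainted)   # 73 True: taint survives the addition
```

The same inheritance rule would be applied to every operator a real engine models; marking and propagation are mechanical once the rule is fixed.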
The advantage of static taint analysis is that it can quickly locate all situations that might arise in the program; but because static analysis cannot obtain certain runtime information, its accuracy is low, and the results often have to be reviewed and confirmed manually. Typically, binary static taint analysis refers to techniques that use a disassembler, such as IDA, to disassemble the binary code and then apply static taint analysis to it.
Dynamic taint analysis, also known as dynamic information flow analysis, is another taint analysis technique that has become popular in recent years. It marks and tracks specific data while the program is running and determines whether the program has vulnerabilities based on whether this data influences certain pre-agreed critical operations. In the security domain, dynamic taint analysis has been used successfully to prevent several classes of attacks, such as buffer overflows, format string attacks, SQL injection, and cross-site scripting (XSS). Recently, researchers have begun applying taint-based methods beyond security, for example to program understanding, software testing, and debugging.
By analysis granularity, dynamic taint analysis divides into fine-grained and coarse-grained variants. Coarse-grained taint analysis is generally used to detect abnormal behavior; it is fast and needs little storage, but its accuracy is often low. Fine-grained taint analysis is mainly used to locate the point at which a vulnerability is attacked; it is accurate and can support tasks such as data flow backtracking, but it incurs high testing cost and excessive space consumption. Shi Dawei et al. combined the two so that their advantages complement each other, proposing a dynamic taint analysis method with mixed coarse and fine granularity.
3.2 Basic process of dynamic taint analysis
The main process of dynamic taint analysis consists of three stages, as shown in Figure 4.
Figure 4 The basic process of dynamic taint analysis
1. Taint source identification
Identifying the taint source is the prerequisite for taint analysis. Current methods of taint source identification mainly include: marking all external inputs of the program as taint data; labeling taint sources by hand on the basis of manual analysis; and using statistical or machine learning techniques to identify and label taint sources automatically.
2. Taint tracking
Taint data is traced according to specific propagation rules in order to observe its propagation paths. The taint propagation rules are the most important part of the taint analysis process. There are two approaches to taint propagation analysis of binary code: analyzing the underlying instruction set directly, or analyzing the intermediate language obtained by translating the binary code. Because direct analysis of binary code lacks semantic information and is therefore difficult, the intermediate-language-based approach is adopted, propagating taint according to the information flow obtained from the intermediate language.
3. Key point detection
Critical points (sinks) are checked according to reasonable detection rules to see whether they use data carrying taint marks.
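The three stages can be put together on a toy three-address trace. Everything below is illustrative: the operation names, the source/sink sets, and the trace format are made up for the example, not taken from any real tool.

```python
TAINT_SOURCES = {"read_input"}   # stage 1: calls that introduce taint
SINKS = {"exec_cmd"}             # stage 3: security-critical operations

def analyze(trace):
    """Run the three stages over a trace of (op, destination, sources) triples."""
    tainted = set()
    alerts = []
    for op, dst, srcs in trace:
        if op in TAINT_SOURCES:
            tainted.add(dst)                        # stage 1: mark the taint source
        elif op in SINKS:
            if any(s in tainted for s in srcs):     # stage 3: sink uses tainted data
                alerts.append((op, srcs))
        else:
            # stage 2: ordinary assignment/arithmetic propagates taint
            if any(s in tainted for s in srcs):
                tainted.add(dst)
            else:
                tainted.discard(dst)                # overwritten with clean data
    return alerts

trace = [
    ("read_input", "a", []),      # a := untrusted input
    ("mov",        "b", ["a"]),   # b := a (taint propagates)
    ("mov",        "c", ["k"]),   # c := constant (stays clean)
    ("exec_cmd",   None, ["b"]),  # sink consumes tainted b -> alert
]
print(analyze(trace))   # [('exec_cmd', ['b'])]
```

A real engine differs mainly in scale: the trace is the instruction stream (or its intermediate-language form), and the propagation rule is defined per opcode rather than as one generic branch.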
3.3 Research Status
In recent years, many scholars at home and abroad have conducted in-depth research on taint analysis. YIN et al. proposed TEMU, an extension platform based on the QEMU virtual machine, which resolves several difficulties in dynamic analysis and makes program analysis on it convenient. TEMU takes a whole-system perspective, analyzing activity in the kernel and interactions between multiple processes, with in-depth analysis at fine granularity. KANG et al. proposed DTA++, a dynamic taint propagation scheme in two stages: first, diagnose the branch generation rules that produce incomplete taint and determine the additional propagation required through offline analysis; second, apply those rules in later dynamic taint analysis. Its basic principle is to find control-flow paths with incomplete taint and solve the path constraints using both symbolic execution and path-predicate-based methods, where the program's execution trace is represented as a set of path constraints determined by a series of branch conditions. Zhuge Jianwei et al. proposed a type-aware dynamic taint analysis combined with symbolic execution over typed variables, which effectively addresses the inability of binary-level taint analysis to provide high-level semantics for software vulnerability analysis. The technique labels the taint variable at the type source, adding type information as its taint attribute to form the taint source.
During taint propagation, the semantics of instructions and public library functions are exploited, supplemented by the type information revealed at Type Sink points, to pass along the variables' type information and perform variable-level taint propagation; symbolic execution over the typed variables is carried out as the taint propagates, and the program's path conditions are collected to form a library of security vulnerability signatures. Ma Jinxin et al. proposed a taint analysis method based on offline indexing of the execution trace: a dynamic instrumentation tool tracks the program's execution and records its state information into a trace file; the trace file is then analyzed and indexed offline, and during taint propagation only the operations related to taint data are considered, while instructions unrelated to taint data are skipped outright, saving analysis time.
3.4 Problems
In current dynamic taint analysis research, the main problems are the following.
3.4.1 Implicit flow problems
The implicit flow problem is the problem of contamination through control-flow dependencies. Taint propagation analysis divides into explicit flow analysis and implicit flow analysis, according to the kind of program dependency considered. Explicit flow analysis tracks the propagation of taint marks along data dependencies between variables; implicit flow analysis tracks their propagation along control dependencies as they travel with program variables. Implicit flow propagation is an important issue in taint analysis: if taint data is not handled properly, the results become inaccurate. There are two main failure modes, under-taint and over-taint. Under-taint arises when implicit flow propagation is handled improperly, so that variables that should be labeled are not. Over-taint arises when too many variables are marked, causing taint marks to proliferate. Current research on implicit flow focuses on minimizing both under-taint and over-taint.
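A minimal, hypothetical example shows why implicit flows cause both failure modes:

```python
def copy_bit_implicitly(secret_bit):
    # `leaked` ends up equal to secret_bit, yet no tainted value is ever
    # assigned to it: the dependency runs purely through the branch condition.
    leaked = 0
    if secret_bit == 1:       # control dependency on the tainted secret_bit
        leaked = 1            # assigns a clean constant
    return leaked

print(copy_bit_implicitly(1), copy_bit_implicitly(0))  # 1 0
```

A tracker that follows only data dependencies leaves `leaked` unmarked (under-taint); conversely, naively tainting every variable written under a tainted branch marks far too much (over-taint). This is exactly the trade-off the research above tries to balance.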
The first concern of dynamic implicit flow analysis is how to determine the scope of statements that must be labeled under a tainted control condition. Since a dynamic execution trace does not reflect the control dependencies of the executed program, the current approach is to use offline static analysis to help determine the range of implicit flow marking during dynamic taint propagation. The second problem is the underreporting of potential security issues caused by partial leakage of taint information: because some paths are not covered, i.e. not executed, taint information may propagate and leak along those unexecuted paths during dynamic taint analysis. The current solution is likewise to supplement dynamic analysis with offline static analysis. The third problem is how to choose the appropriate tainted branches for propagation. Simply propagating through every branch containing a taint mark leads to over-taint, so the tainted branches must be filtered to reduce it. The DTA++ tool proposed by KANG et al. uses symbolic execution over offline execution traces to find the branches through which taint should propagate.
3.4.2 Taint removal problems
In taint analysis, if the number of taint marks grows continuously during execution and some marks that should be removed are not, false positives arise that degrade the accuracy of the results. Proper taint removal reduces the number of taint marks in the system, improves analysis efficiency, and prevents inaccurate results caused by taint proliferation. Taint removal should therefore be considered in special cases, such as the encryption of sensitive data or the presence of constant functions.
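The constant-function case can be sketched as a propagation rule. The opcode names below are invented for the illustration; a real engine would match concrete instruction patterns.

```python
def result_tainted(op, src_taints):
    # Constant idioms such as `xor r, r` or `sub r, r` always produce 0,
    # regardless of how tainted the source register was, so the taint mark
    # on the destination should be cleared rather than propagated.
    if op in ("xor_self", "sub_self"):
        return False
    # Default rule: any tainted operand taints the result.
    return any(src_taints)

print(result_tainted("add", [True, False]))      # True: normal propagation
print(result_tainted("xor_self", [True, True]))  # False: taint removed
```

Without the special case, the `xor_self` result would stay marked forever and keep spreading, which is precisely the false-positive proliferation described above.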
3.4.3 High analysis overhead
Some taint analysis tools require instrumentation or dynamic code rewriting, which imposes significant overhead on the analysis system. For example, TaintCheck uses the instrumentation tool Valgrind, with UCode as the intermediate representation, to implement dynamic taint analysis for x86 programs; Dytan uses the instrumentation framework Pin for the same purpose. To control analysis cost, existing work adopts the idea of selectively analyzing taint propagation over the system's instructions. For example, the fast-path optimization proposed by QIN et al. determines in advance whether a module's inputs and outputs are threatened, and performs taint propagation only when a threat exists, reducing cost. Peng Jianshan et al. used taint backtracking (a reverse taint propagation analysis) to trace taint back to its source by building a set of suspicious taints, narrowing the scope of taints to be analyzed and thus reducing cost. Lin Wei et al. addressed the high time cost of taint propagation analysis over trace files in offline taint analysis, proposing an efficient taint propagation method based on semantic rules that reduces cost and improves efficiency. However, all these methods affect analysis accuracy to a greater or lesser extent, so further research is needed on reducing analysis cost without reducing accuracy.
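The selective-propagation idea behind the fast-path optimization can be sketched as follows. This is a simplified cost model, not the actual QIN et al. implementation: the assumption is that a code region whose inputs are all clean (and which contains no taint source) needs no per-instruction tracking.

```python
def region_propagation_cost(instr_count, input_taints, cost_per_instr=1):
    # Fast path: if no input of the region is tainted, the region cannot
    # spread taint, so per-instruction tracking is skipped entirely.
    if not any(input_taints):
        return 0
    # Slow path: track taint through every instruction of the region.
    return instr_count * cost_per_instr

print(region_propagation_cost(1000, [False, False]))  # 0: fast path taken
print(region_propagation_cost(1000, [True, False]))   # 1000: full tracking
```

Since in typical executions most regions never touch tainted data, the cheap up-front check removes the bulk of the propagation work.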
3.4.4 The rate of false negatives is high
Dynamic taint analysis has a high false negative rate and low coverage, a problem that is hard to solve because it stems from the nature of dynamic execution: the code exercised in a single run varies with the test case fed to the program. The result of a single dynamic taint analysis may therefore cover only part of the program's code, leaving vulnerabilities in the unexecuted portions undiscovered and producing a high false negative rate. Solving this requires analyzing the whole program's code and functionality, continually constructing suitable test cases to push the code coverage of dynamic analysis toward 100%. Zhu Zhengxin et al. proposed a dynamic symbolic taint analysis method to address the high false negative rate of taint analysis: it symbolizes the taint information and risk rules, checks during taint propagation whether any risk rule is violated, and thereby detects unsafe program behavior. Cui Hualiang et al. proposed an offline taint analysis method that splits the taint analysis system into two modules, a dynamic recording module and a static replay module, and reduces the high false negative rate of dynamic analysis by adding static analysis.
3.4.5 High-level semantic information is missing
The underlying binary code lacks the necessary semantic and type information, which limits the application of dynamic taint analysis in binary program analysis. Zhuge Jianwei et al. proposed a type-aware dynamic taint analysis technique combined with symbolic execution over typed variables, effectively addressing the inability of binary-level taint analysis to provide high-level semantics for software vulnerability analysis. The technique makes full use of the type information at input points for type propagation and derivation, combines it with the type information at Type Sink points to raise dynamic taint analysis of binary programs to variable granularity, and performs symbolic execution over typed variables while taint propagates, giving vulnerability analysis and feature extraction better semantic support.
The advantage of dynamic taint analysis is that accurate runtime information can be obtained by executing the program. However, because a single dynamic run covers only one program path, some paths will remain uncovered even after many executions; security problems on those uncovered paths are missed, and some taint information is lost. Dynamic taint analysis can therefore be combined with symbolic execution to improve path coverage. Current research on dynamic taint analysis focuses on the design of propagation logic, the optimization of analysis efficiency, and implicit flow analysis. Because existing taint analysis tools cannot achieve both speed and accuracy, some researchers have proposed a dynamic taint analysis method combining coarse and fine granularity: an online coarse-grained mode ensures rapid collection of taint information, while an offline fine-grained mode improves the accuracy of taint analysis at a reasonable time cost. In the future, dynamic taint analysis will be carried out on the information flow obtained in the intermediate language, and vulnerability determination will be performed in combination with formulated security policies as well as with hybrid symbolic execution techniques.
4 Hybrid symbolic execution techniques
4.1 Fundamentals and classification of symbolic execution
4.1.1 Symbolic execution and hybrid symbolic execution
The basic idea of symbolic execution is to replace program variables with abstract symbols, or to represent the values of program variables as expressions over symbolic values and constants, and to simulate program execution on this basis for further analysis. Symbolic execution has the following advantages. First, one symbolic execution is equivalent to the concrete execution of a possibly infinite set of inputs, which can be called an input equivalence class. Second, algebraic relationships between variables can be discovered, which makes the internal logic of the program easier to understand. Finally, symbolic execution accurately records the constraints along a path, making it easy to judge path feasibility and to construct test cases for the program from those constraints.
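The idea can be made concrete with a toy illustration (a hypothetical mini-example, not a real engine): the variable `y` becomes the expression `2*x + 1` over the symbolic input `x`, each branch contributes a path constraint, and a brute-force search over a small domain stands in for a real constraint solver.

```python
def symbolic_paths(x_min=-100, x_max=100):
    # Program under analysis (conceptually):
    #   y = 2 * x + 1
    #   if y > 10: ...A... else: ...B...
    # Symbolically, y is the expression "2*x + 1", and the two paths carry
    # the constraints 2*x+1 > 10 and 2*x+1 <= 10 respectively.
    paths = [
        ("A", lambda x: 2 * x + 1 > 10),
        ("B", lambda x: 2 * x + 1 <= 10),
    ]
    # "Solve" each path constraint by brute force over a small input
    # domain, standing in for a real SMT solver.
    witnesses = {}
    for name, constraint in paths:
        for x in range(x_min, x_max + 1):
            if constraint(x):
                witnesses[name] = x  # one concrete input reaching this path
                break
    return witnesses

print(symbolic_paths())  # {'A': 5, 'B': -100}
```

Each witness is a concrete member of the input equivalence class for its path: any input satisfying the same constraint drives the program down the same path.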
However, because static symbolic execution lacks information about the program's dynamic runtime behavior, program execution is difficult to simulate fully and the analysis is not accurate enough. To solve this problem, researchers proposed hybrid symbolic execution, a dynamic analysis technique. Its core idea is to determine, in the program's real operating environment, which parts of the program can run directly and which parts need to be executed symbolically. In this way, the program's runtime information is fully utilized, improving the accuracy of the analysis.
4.1.2 Classification of hybrid symbolic execution for binary programs
Binary-oriented hybrid symbolic execution can be divided into online and offline symbolic execution, according to whether symbolic computation is performed during program execution; from the perspective of the form of the analyzed object, it can be divided into symbolic execution over the underlying instruction set and symbolic execution over intermediate code.
Online symbolic execution performs program execution and symbolic computation simultaneously, as in SmartFuzz; offline symbolic execution first records the program's execution trace and then performs symbolic computation while replaying that trace, as in SAGE.
Online symbolic execution faces the following problems. 1. Since the symbolic execution process consumes considerable resources, online symbolic execution may degrade the performance of the target program. For example, during symbolic execution the target program may exit early for lack of computing resources, limiting the depth of the execution trace, so that the collected trace may lack some necessary internal information. 2. Program execution usually involves concurrency and nondeterministic events, which may cause unpredictable problems for online symbolic execution. Synchronization relationships in multithreaded or multi-process programs are likely to be broken by online hybrid symbolic execution, which can lead to resource conflicts or even errors in the target program. 3. If the target program uses protection techniques such as code obfuscation or packing, hybrid symbolic execution is likely to fail. Given these problems, future research work will use offline symbolic execution for vulnerability mining.
Offline symbolic execution effectively alleviates these problems. On the one hand, during offline symbolic execution only the program's execution trace is recorded and no symbolic computation is needed, so the performance loss to the target program is small; with hardware virtualization support, the execution trace can even be recorded at the hardware level. On the other hand, the trace-recording function is relatively simple, has little impact on the execution of the target program, and does not break the synchronization relationships of the target program's threads or processes. Moreover, the symbolic computation during trace replay can be performed entirely on a high-performance computing platform, which also means it can run on a different platform from the target program, thereby improving the performance of the symbolic analysis.
Symbolic execution over the underlying instruction set performs symbolic computation directly according to the semantics of the underlying instructions, as in SAGE; symbolic execution over intermediate code first translates the underlying instructions on the execution trace into intermediate code and then performs symbolic computation on that intermediate code, as in SmartFuzz.
The main advantage of symbolic execution over intermediate code is its good compatibility across instruction sets. Because the symbolic execution engine interprets intermediate code rather than the underlying instructions, it is highly extensible: for programs running on a different instruction set, only the execution-trace recording needs to be re-implemented, the different instruction sets can be translated into the same intermediate code, and programs on that platform can then be analyzed accordingly. During trace replay, the underlying instructions are first translated into the intermediate language, the corresponding information flow is obtained from the intermediate language, and hybrid symbolic execution is then carried out.
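The "translate the underlying instructions into intermediate code" step can be sketched as follows. The lifter and its hypothetical three-address IR are illustrative only (real lifters such as the one producing VEX are far more complete); the point is that implicit side effects of the instruction, like flag updates, become explicit statements an engine can interpret uniformly.

```python
def lift(instr: str):
    """Lower one x86-style instruction into a hypothetical three-address IR."""
    op, _, rest = instr.partition(" ")
    args = [a.strip() for a in rest.split(",")]
    if op == "mov":   # mov dst, src  ->  dst = src
        return [f"{args[0]} = {args[1]}"]
    if op == "add":   # add dst, src  ->  dst = dst + src, plus explicit flags
        return [f"{args[0]} = {args[0]} + {args[1]}",
                f"ZF = ({args[0]} == 0)"]   # side effect made explicit
    raise NotImplementedError(op)

# Lift a tiny trace fragment into IR statements.
ir = [s for i in ["mov eax, ebx", "add eax, 4"] for s in lift(i)]
# ir == ["eax = ebx", "eax = eax + 4", "ZF = (eax == 0)"]
```

Supporting another instruction set then only requires a new `lift` front end; the symbolic engine interpreting the IR is unchanged.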
4.2 The process of symbolic execution
Typically, the basic process of vulnerability analysis using symbolic execution is shown in Figure 5. It comprises two parts: binary program modeling and vulnerability analysis.
Figure 5 The basic process of symbolic execution
1. Binary program modeling
Modeling a binary program involves basic parsing of the binary code to obtain an intermediate representation of the program. Because symbolic execution is usually a path-sensitive analysis, after the binary code is parsed the system usually needs to build structural graphs that describe program paths, such as the call graph and the control flow graph, which may be used during symbolic execution to assist the analysis.
2. Vulnerability analysis
The vulnerability analysis process consists of two parts: symbolic execution and constraint solving. Symbolic execution represents the values of variables as expressions over symbols and constants, and expresses both the path conditions and the conditions under which the program is vulnerable as constraints on the symbolic values. Constraint solving determines whether the path conditions are satisfiable, chooses paths based on the result, and also checks whether the conditions for a vulnerability are satisfied. During symbolic execution it is often necessary to apply vulnerability analysis rules, which describe the situations in which symbols must be introduced and the circumstances under which the program may be vulnerable.
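The two checks described above can be sketched together: a path condition (is this path reachable?) is conjoined with a vulnerability condition (does the input trigger the bug on that path?), and an input satisfying both witnesses the vulnerability. The buffer size, the conditions, and the brute-force search standing in for a real solver are all hypothetical.

```python
BUF_SIZE = 8  # size of a (hypothetical) fixed buffer in the target

def find_exploit_input(domain=range(0, 256)):
    path_cond = lambda n: n % 2 == 0      # the vulnerable path is taken only for even n
    vuln_cond = lambda n: n >= BUF_SIZE   # a write past the end of the buffer
    # A real engine hands (path_cond AND vuln_cond) to an SMT solver;
    # here we search a small domain instead.
    for n in domain:
        if path_cond(n) and vuln_cond(n):
            return n  # concrete input that reaches the path AND triggers the bug
    return None

print(find_exploit_input())  # 8: the smallest even input of at least BUF_SIZE
```

Solving the two conditions jointly is what distinguishes vulnerability-directed symbolic execution from plain path exploration: a reachable path alone does not prove a bug, and a bug condition alone may lie on an infeasible path.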
4.3 Development and research status of symbol execution
Symbolic execution was proposed in 1975 and was first used in software testing. Later, BUSH et al. proposed the static symbolic execution tool PREfix, which uses path-sensitive, interprocedural analysis to reduce false positives and applied symbolic execution to the field of vulnerability mining. To avoid path explosion during simulated execution, PREfix checks at most 50 paths inside each function; after all paths are completed, it merges the results of all paths to build a summary of the function, and when the function is called again the summary can be applied directly to accelerate the simulation. Microsoft later acquired PREfix and on this basis implemented PREfast, which has become one of the standard static source code analysis tools within Microsoft.
Static symbolic execution must simulate the program's execution environment during analysis, and because it lacks information about the program's dynamic behavior, the execution is difficult to simulate fully and the analysis is not accurate enough. To solve this problem, researchers proposed the concept of hybrid symbolic execution, which turned symbolic execution from a static analysis technique into a dynamic one, and implemented a series of representative tools such as CUTE and DART. CUTE uses concolic execution to represent unit-test inputs as memory graphs, generating constraint expressions and tracking them during symbolic execution. To reduce the overhead of symbolic execution, CUTE combines random testing with symbolization: by default it probes the program with random testing, and when random testing approaches saturation, test cases are generated by symbolic execution to move on to a new path and continue execution. KLEE, developed by CADAR et al., is an open source tool that constructs program test cases using symbolic execution and can analyze C programs under Linux. While constructing test cases, it also uses symbolic execution and constraint solving to analyze the value range of symbols at key program points and checks whether that range stays within the bounds of the safety rules; if the value of a symbol violates the safety rules, the program is considered to have a corresponding vulnerability. When KLEE constructs test cases, it considers not only path conditions but also the conditions that trigger a program vulnerability.
Because source-oriented tools are limited by the unavailability of program source code, researchers proposed implementing hybrid symbolic execution at the binary code level and developed many tools on this basis, such as SAGE, BitBlaze and SmartFuzz. SAGE dynamically tests programs with legitimate inputs, tracks and records instruction execution with the help of the binary analysis platform iDNA, collects path constraints during instruction replay and solves them one by one, generating new test sets that drive the program down different paths and maximize code coverage. BitBlaze is a binary analysis platform integrating static analysis, dynamic analysis, and dynamic symbolic execution; it consists of three parts, Vine, TEMU and Rudder, which implement static analysis, dynamic analysis, and dynamic symbolic execution, respectively. SmartFuzz is a Valgrind-based binary instrumentation tool that can discover integer vulnerabilities in binary programs under Linux. SmartFuzz translates binary code into the VEX intermediate language, then performs online symbolic execution and constraint solving, dynamically generating test cases.
4.4 Problems
Current research on symbolic execution mainly faces the following two problems.
4.4.1 Path explosion problem
Symbolic execution faces one major problem: path explosion. The main reason is that whenever a conditional statement such as an if is encountered, each existing path may split into one more new path, so the number of paths grows exponentially.
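The exponential growth can be seen directly: a straight-line program containing k independent two-way branches has 2^k feasible paths, each a distinct sequence of taken/not-taken decisions.

```python
from itertools import product

def enumerate_paths(num_branches):
    """Each path through k independent two-way branches is one
    tuple of taken/not-taken decisions, so there are 2**k paths."""
    return list(product([True, False], repeat=num_branches))

print(len(enumerate_paths(10)))  # 1024 paths from just 10 branches
```

Ten branches already yield over a thousand paths; at thirty, full enumeration exceeds a billion, which is why the mitigation strategies below are necessary.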
There are several main ways to mitigate the path explosion problem: 1. Conventionally limit the number of paths analyzed within each procedure, similar to the bounded path exploration used by PREfix. 2. Improve the analysis performance of symbolic execution by improving the path scheduling algorithm. For example, BOONSTOPPEL et al. proposed, on the basis of the EXE tool, pruning paths according to the way the program executes and its characteristics, cutting out some unnecessary path analysis; this has proven to be a good way to mitigate path explosion during symbolic execution. 3. Use a parallelized symbolic execution strategy, improving the solving speed of the constraint solver with current parallel computing technology.
4.4.2 Performance Issues
Symbolic execution uses constraint solvers to solve the expressions for path conditions, and the program must also be instrumented during analysis. According to statistics, 2,000 lines of C code can grow to 40,000 lines after instrumentation, which not only means a large increase in lines of code but also more calls to the solver, whose performance determines the efficiency of symbolic execution. The current solution is to reduce how often the solver is used. There are two main methods: one is to use caching, storing commonly used constraint expressions or intermediate expressions obtained during solving so as to avoid unnecessary repeated solving; the other is to further optimize the constraints to speed up solving, for example by eliminating irrelevant constraints, rewriting expressions, and simplifying constraint sets.
Today, hybrid symbolic execution has become an important technique for traversing a program's execution space. Unlike traditional static symbolic execution, hybrid symbolic execution makes full use of the information available while the program runs, improving the accuracy of symbolic execution and advancing its development. In addition, as industry demand for binary program analysis continues to grow, hybrid symbolic execution for binary programs is becoming increasingly important. The main future research directions are as follows:
1. Symbolic execution can be further combined with other techniques. Specifically, the strengths of existing techniques can be used to complement symbolic execution and compensate for the problems any single analysis technique faces.
2. The path explosion problem in symbolic execution needs further work. Abstraction can be used to organize the numerous paths into higher-level collections, so that only these abstract collections need to be traversed, greatly reducing the number of paths and alleviating the path explosion problem.
5 Fuzz testing techniques
5.1 Basic principles and classification of fuzz testing techniques
Fuzz testing is a representative software vulnerability analysis technique, first proposed by MILLER et al. in 1989. Because it requires no in-depth analysis of the target program and the vulnerabilities it finds are generally real, fuzz testing occupies an important position in the field of software vulnerability analysis. Its basic idea is to feed a large amount of specially constructed or random data to the program under test, monitor the program for abnormal behavior and record the inputs that cause it, and then, aided by manual analysis, locate the vulnerability in the software from the inputs that caused the error. From the perspective of the test case generation strategy, fuzz testing can be divided into mutation-based and generation-based fuzz testing; from the perspective of code openness, it can be divided into black-box and white-box fuzz testing.
Mutation-based fuzz testing creates new test cases by mutating existing data; generation-based fuzz testing generates test cases from scratch by modeling the specific program under test.
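A minimal sketch of the mutation-based strategy: derive a new test case from an existing seed by flipping random bits. The seed string is hypothetical; a fixed RNG seed is used so the example is reproducible.

```python
import random

def mutate(seed: bytes, num_flips=1, rng=None) -> bytes:
    """Return a copy of `seed` with `num_flips` randomly chosen bits flipped."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    data = bytearray(seed)
    for _ in range(num_flips):
        pos = rng.randrange(len(data))
        data[pos] ^= 1 << rng.randrange(8)   # flip one bit of one byte
    return bytes(data)

seed = b"GET /index.html HTTP/1.0"
case = mutate(seed)
# same length as the seed, differing from it in exactly one byte
```

Generation-based fuzzing would instead build the request from a model of the HTTP grammar; mutation needs no model but depends heavily on the quality of the seeds.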
Black-box fuzz testing, originally called random testing, is a typical black-box dynamic analysis technique. The main idea is to feed a large amount of irregular data to the program under test and observe whether the system behaves abnormally. The method is simple and fast to run, but its inherent flaw is that test case generation is blind, that is, untargeted, so path coverage is low. The term "fuzz testing" by itself generally refers to black-box fuzz testing.
White-box fuzz testing is another fuzz testing technique that has become increasingly popular in recent years; it combines fuzz testing with symbolic execution. The basic idea is: first, symbolic execution traverses the program's paths, analyzes and solves the path conditions, and determines whether each path is reachable; then, for reachable paths, specific test cases are generated from the path constraints and fuzz testing is performed. Because symbolic execution assists in generating test cases, this approach achieves high path coverage; but for the same reason it also inherits the problems of path explosion and limited constraint solving capability.
5.2 Basic Process of Fuzz Testing
Fuzz testing goes through roughly five basic stages, as shown in Figure 6.
Figure 6 Basic process of fuzz testing
1. Identify the target. To choose the appropriate fuzz testing tool or technique, it is first necessary to fully understand the target program and clearly define the target.
2. Identify the inputs. Most vulnerabilities are exploited because programs fail to properly handle illegal data received as external input, so considering all possible input scenarios of a program is important for the success of fuzz testing.
3. Build test cases. Building test cases is the critical step of fuzz testing, and the test case generation strategy determines the efficiency of the whole test. Generation methods fall into two broad categories, mutation-based and generation-based strategies, and a test case set can be built with one or both. Common ways to build test cases include random generation, pre-generated cases, and mutation or forced generation of cases.
4. Monitor execution and filter exceptions. Observe whether the target program behaves abnormally or the system crashes as the test cases execute, and judge vulnerabilities from the abnormal results.
5. Determine exploitability. Further examine each identified vulnerability in the context of the application to determine whether it can be exploited.
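The five stages above can be condensed into a single loop: mutate a seed, feed each variant to the target, and record every input that makes it misbehave. Both the `target` function (with its simulated crash condition) and the exhaustive one-byte mutation strategy are hypothetical stand-ins for a real program and mutation engine.

```python
def target(data: bytes):
    """Hypothetical program under test: 'crashes' on a magic first byte."""
    if data and data[0] == 0xFF:
        raise MemoryError("simulated crash")

def fuzz(seed: bytes):
    """Stage 3: exhaustively mutate one byte at a time.
    Stage 4: run the target and record inputs that raise exceptions."""
    crashes = []
    for pos in range(len(seed)):
        for val in range(256):
            data = bytearray(seed)
            data[pos] = val                  # one-byte mutation
            try:
                target(bytes(data))          # monitor execution
            except Exception:
                crashes.append(bytes(data))  # record the crashing input
    return crashes

crashes = fuzz(b"ABCD")
print(crashes)  # [b'\xffBCD'] -- only the magic first byte triggers the crash
```

Stage 5 (exploitability) then starts from exactly these recorded inputs, which is why faithfully logging the crashing test case matters as much as detecting the crash.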
5.3 Development and research status of fuzz testing
Fuzz testing, originally called random testing, was adopted in software engineering long ago; initially it focused not on evaluating software security but on verifying code quality. In 1989, MILLER et al. developed a fuzz testing tool to assess the robustness of programs on UNIX systems. This early form of fuzzing was a fairly simple black-box test: a random string was constructed and passed to the target application, and if the application misbehaved during execution the test was considered failed, otherwise passed; it turned out that the method crashed about 25% of the programs tested. In 1999, the PROTOS protocol testing project team at the University of Oulu, Finland, developed the network security testing software PROTOS and first proposed the concept of a test set.
In 2002, AITEL released the fuzz testing framework SPIKE. Designed to test network protocols, the framework employs a block-based approach that is effective for testing data blocks containing variables and corresponding variable lengths, while allowing users to create their own custom fuzzers. In 2004, EDDINGTON released Peach, a fuzz testing framework that supports both network protocols and file formats. Written in Python, Peach primarily uses an XML file to guide the entire fuzzing process, and because it runs cross-platform it can test almost any network protocol. In 2007, AMINI released the fuzz testing framework Sulley, a Python-based framework that fuzzes various network protocols and is composed of multiple extensible components, including data generation, agents, and tools. Sulley's advantages are its simplicity of use and its path-based analysis, which takes path coverage into account.
Some scholars have also developed tools that apply fuzz testing to vulnerability mining. Addressing the limitations fuzz testing faces when it encounters checksum protection mechanisms, in 2010 WANG et al. of Peking University combined hybrid symbolic execution with fine-grained dynamic taint propagation to propose a method that can bypass checksum mechanisms, and developed the corresponding tool TaintScope, clearing an obstacle to applying fuzz testing to deep vulnerabilities. Since 2011, many researchers have used genetic algorithms and heuristic algorithms to assist in generating test cases, improving the efficiency of test case generation.
5.4 Problems
Fuzz testing has many advantages over other vulnerability mining methods, but it also has shortcomings and many limitations. One of the more important is that the test cases of fuzz testing are not guaranteed to cover all statement branches, and the degree of automation of fuzz testing is currently not high. Therefore, optimizing test case generation and improving the efficiency of fuzz testing remain the main research directions.
5.4.1 Test case generation strategy needs to be improved
Test case generation is the most critical part of fuzz testing, and its efficiency determines the efficiency of the entire test. Early fuzz testing often used a blindly enumerating test case generation strategy, a black-box approach. With this approach, although the number of test cases may be enormous, the results are often insignificant. Moreover, private data formats and checksums are widely used in some software, and randomly generated data is often simply discarded because it cannot pass the checksum, so deeper vulnerabilities are hard to reach and the test results are unsatisfactory. How to optimize the test case generation strategy is therefore a problem requiring focused attention.
Researchers have now introduced concepts from genetic algorithms, such as selection and crossover recombination, into test case generation strategies to optimize them. In particular, using genetic algorithms to assist in generating specific test cases has been shown to greatly improve the efficiency of test case generation and is a very useful automated testing technique.
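A toy sketch of this idea: evolve byte-string test cases toward higher "coverage" using selection, crossover, and mutation. The fitness function (how many positions match a hypothetical magic header), the small alphabet, and all parameters are illustrative assumptions; a real system would use actual branch-coverage feedback from instrumented runs as fitness.

```python
import random

MAGIC = b"FUZZ"          # hypothetical header the target checks for
ALPHABET = b"ABCDEFUZ"   # small alphabet keeps the toy search fast

def fitness(case: bytes) -> int:
    """Stand-in coverage signal: positions agreeing with the magic header."""
    return sum(1 for a, b in zip(case, MAGIC) if a == b)

def evolve(pop_size=40, generations=100, rng=None):
    rng = rng or random.Random(1)   # fixed seed for reproducibility
    pop = [bytes(rng.choice(ALPHABET) for _ in range(4))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                      # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 4)
            child = bytearray(a[:cut] + b[cut:])            # crossover
            if rng.random() < 0.3:
                child[rng.randrange(4)] = rng.choice(ALPHABET)  # mutation
            children.append(bytes(child))
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()   # converges toward the magic header over the generations
```

Compared with blind enumeration, the population is steadily steered toward inputs that exercise more of the target, which is the efficiency gain the text describes.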
Fuzz testing can also be combined with other binary vulnerability analysis techniques. For example, dynamic taint analysis can extract the program's dynamic data flow or control flow information and integrate it into the binary vulnerability analysis process to guide the generation of fuzzing inputs, making traditional black-box fuzz testing targeted and greatly improving its effectiveness.
5.4.2 The degree of automation needs to be improved
Fuzz testing is a blind injection method, and manual testing takes considerable time, which limits its practical value. In some common fuzz testing frameworks, manual involvement is often required to determine the constraints on input data and to generate test cases. Most current research therefore focuses on automated and intelligent fuzz testing methods; one solution, again, is to use genetic algorithms to raise the degree of test automation.
5.4.3 Inability to find multi-point trigger vulnerabilities
Fuzz testing usually finds vulnerabilities triggered by a single condition rather than vulnerabilities that require multiple conditions. In recent years, some researchers have attempted to apply fuzz testing to mining multi-point trigger vulnerabilities and have proposed the concept of multidimensional fuzz testing. However, multidimensional fuzz testing currently suffers from combinatorial path explosion, so it is developing slowly and needs further research.
Fuzz testing has a great advantage over other vulnerability mining methods in that it does not require an understanding of the program's internal structure, so its cost is low. At the same time, it has limitations such as great blindness, low test efficiency, and no guarantee of code coverage. To address these issues, fuzz testing can be combined with other binary vulnerability analysis techniques, such as dynamic taint analysis or symbolic execution. Future research will integrate fuzz testing into the constructed binary analysis framework and combine it with symbolic execution. Since symbolic execution has good path coverage, when generating test cases one can first use symbolic execution to obtain a set of path constraints for the program, design test cases from those constraints, and then perform fuzz testing. Combining fuzz testing with symbolic execution to generate guided test cases should, in theory, greatly improve both the efficiency and the detection effect of fuzz testing. In addition, using genetic algorithms to assist in generating test cases is also a good approach. In the same spirit, future work can examine similar concepts and ideas in other disciplines, judge whether they can be applied to the field of vulnerability analysis, and solve some of the problems current vulnerability analysis techniques face; this too is a new point of breakthrough.
There are many existing binary vulnerability analysis techniques and many ways to classify them. From the perspective of automation, they can be divided into manual and automatic/semi-automatic analysis; from the perspective of whether the software runs, into dynamic analysis, static analysis, and combined dynamic-static analysis; from the perspective of code openness, into black-box, white-box, and gray-box testing; and from the perspective of the form of the research object, into vulnerability analysis techniques based on intermediate languages and those based on the underlying instruction set. Because these divisions are drawn from different perspectives, the classes overlap regardless of which criterion is used, and a specific technique can belong to different categories under different classification standards at the same time. Typical concrete techniques include fuzz testing, symbolic execution, and taint analysis, and these three together span almost all of the classification criteria. Taint analysis and symbolic execution both admit dynamic and static forms from the perspective of software operation and can be white-box testing from the perspective of code openness; fuzz testing is both dynamic analysis and black-box testing. In addition, all three techniques can be based either on intermediate languages or on the underlying instructions.
Therefore, although there are many binary vulnerability analysis techniques, a few typical ones can almost represent the whole field. Taint analysis starts from the root cause of software exploitation: it marks untrusted input as tainted data, tracks the use of that data during program execution, and reports a vulnerability if a critical operation in the program uses tainted data illegally. Because it captures the essence of exploitation, the method is effective in principle. Meanwhile, with the growth of hardware computing power, symbolic execution has been applied more and more widely thanks to its high path coverage. In contrast to dynamic taint analysis, which focuses on the flow of data, symbolic execution analyzes the flow of control, so the two techniques complement each other well in vulnerability analysis. In addition, fuzz testing, as a representative software vulnerability analysis technique, occupies an important position in the field because it does not rely on program source code and has low system overhead.
Since most current binary vulnerability analysis tools are not integrated on a single platform, earlier research results are often difficult to reuse or build upon: subsequent researchers cannot extend previous work and can only re-implement it, wasting considerable effort. A better idea for solving this problem is to combine existing analysis techniques and integrate them into one platform, forming a comprehensive binary vulnerability analysis framework. Future research will therefore construct a more powerful binary analysis framework based on the existing frameworks.
1 Binary Program Vulnerability Analysis Framework
For vulnerabilities in software, researchers have proposed many analysis techniques that are widely used in the field of vulnerability analysis. However, existing work mostly applies a single analysis tool to a specific type of program, or optimizes an individual technique, and therefore faces the following problems. First, much of the research on binary analysis techniques is difficult to reuse: subsequent researchers cannot extend previous results and can only re-implement the methods proposed by their predecessors, so a great deal of work is wasted. Second, a single analysis technique cannot achieve complementary advantages. Each technique has strengths and weaknesses, and some key problems facing individual techniques remain unsolved, restricting their development; combining two or more techniques so that their advantages complement each other will therefore be a future research direction.
Given these problems in current research, a better solution is to build a unified binary vulnerability analysis framework that integrates the major analysis techniques, so that subsequent researchers can optimize existing techniques or integrate next-generation binary analysis technologies on top of the framework and achieve reuse.
A number of binary vulnerability analysis frameworks already exist.
BitBlaze is a unified binary analysis platform that combines dynamic and static analysis and is extensible. BitBlaze contains three main components: Vine, TEMU, and Rudder. Vine is the static analysis component; it translates the underlying instructions into a simple, standardized intermediate language and, on top of that language, provides utilities for common static analyses such as building program dependence graphs, data-flow graphs, and control-flow graphs. TEMU is the dynamic analysis component; it provides whole-system dynamic analysis and implements semantics extraction and user-defined dynamic taint analysis. Rudder is the component that combines concrete and symbolic execution; it uses the functions provided by Vine and TEMU to achieve mixed concrete and symbolic execution at the binary level and provides path selection and constraint solving. The main processing flow of the BitBlaze components is shown in Figure 1.
Figure 1 BitBlaze process flowchart
SHOSHITAISHVILI et al. proposed angr, a multi-architecture binary analysis framework that integrates many binary analysis techniques and can perform dynamic symbolic execution and static analysis of binary programs. angr is an automated binary analysis tool developed by the Shellphish team for CTF competitions; it was originally used to find backdoors in programs and can now be applied to vulnerability analysis. Since angr integrates a number of existing analysis techniques and implements different functions in different modules, it is easy to compare analysis techniques on the platform and exploit their respective advantages. Its processing flow is, briefly: first, load the binary program into the angr analysis platform; then, convert the binary code into an intermediate language; finally, analyze the program further, through static analysis or symbolic exploration of the program. The angr processing flow is shown in Figure 2.
Figure 2 angr processing flowchart
angr mainly includes the following modules: the intermediate representation module (IR), which translates binary code into an intermediate language — the VEX intermediate language allows binary programs on different architectures to be analyzed; the binary loading module (CLE), which loads a binary program into the analysis platform; the program state module (SimuVEX), which represents program state — SimState in SimuVEX implements a set of state plug-ins, such as registers, abstract memory, and symbolic memory, whose state can be specified by the user; the data model module (Claripy), which provides an abstract representation of the values stored in the registers or memory of a SimState; and the full-program analysis module, which combines all the other modules so that angr can perform complex, complete program analyses.
angr is an open-source tool with good compatibility and cross-platform, cross-architecture support, and it provides a powerful symbolic execution engine with strong analysis capabilities. However, since angr was originally intended to find backdoors in programs, it cannot yet perform a complete vulnerability analysis on its own. First, in the symbolic execution part, angr only provides the path information of the program, and manual assistance is required for subsequent analysis; second, angr is currently strong mainly in symbolic execution and has not integrated other typical analysis techniques. Further research is therefore required to automate the subsequent analysis process and to integrate other analysis techniques as a supplement.
In future research, a comprehensive binary vulnerability analysis platform will be built by combining the current mainstream techniques to form a large-scale binary vulnerability analysis framework. The specific construction ideas are as follows: 1. Add modeling of binary programs and a formal description on that basis to establish a security policy library. 2. Optimize the intermediate language stage, further addressing the information loss and low conversion efficiency faced when translating a binary program into an intermediate language, in the hope of obtaining a more accurate information flow (including data flow and control flow). 3. Integrate symbolic execution and further optimize it to automate the subsequent analysis; at the same time, integrate taint analysis to assist symbolic execution in reducing the number of paths, thereby mitigating path explosion. In addition, taint analysis can be combined with the security policy library established above for vulnerability determination. 4. Integrate fuzz testing: solve the constraints of each path with a constraint solver, generate test cases in a targeted manner, and then combine them with fuzzing to determine vulnerabilities. The overall flow of the binary vulnerability analysis framework is shown in Figure 3.
Figure 3 Overall flowchart of the binary vulnerability analysis framework
There are many kinds of binary vulnerability analysis technology with different application scenarios, so three mainstream and representative techniques are selected here — taint analysis, symbolic execution, and fuzz testing — to be combined and integrated into a comprehensive binary vulnerability analysis platform for vulnerability analysis. The research status of these three techniques is introduced separately in this article.
2 Background and significance of intermediate language research
Compared with source-oriented vulnerability mining, binary-oriented vulnerability mining is more valuable to study, mainly because it does not rely on source code and is more widely applicable. However, because of the particular complexity of vulnerability analysis for binary programs, it is difficult to analyze binary programs directly, for two main reasons: 1. The underlying instruction sets are large — the x86 instruction set, for example, has hundreds of instructions — and complex; 2. Some seemingly simple problems, such as separating code from data, are undecidable; that is, binary code lacks the corresponding semantic and type information.
Given these difficulties, researchers prefer to represent binary programs in a form that is easier to handle, and therefore propose using intermediate languages in place of binary code: the binary code is converted into intermediate code that carries semantic information for subsequent analysis. This requires the intermediate language to be lower-level than a high-level language, so that the conversion from binary code is not complicated, yet higher-level than the binary program, so that it can provide rich semantic information. The intermediate language itself is a further specification of the underlying language, designed mainly for later vulnerability analysis. Conversion to an intermediate language is the most important step in binary vulnerability analysis: the semantic information carried by the intermediate language is needed to obtain the program's CFG and thus its information flow, including control flow and data flow. Obtaining an intermediate language translated from binary code is therefore the basis for all subsequent work.
There are many methods and tools for converting to an intermediate language. Chen Kaiming et al. proposed a conversion method based on symbolic execution, which first defines the environment and semantics of symbolic execution and then executes the assembly language in that symbolic environment, converting it into the corresponding intermediate language; because it must emulate program execution as symbolic execution does, it adds overhead. Jiang Lingyan et al. [4] proposed an intermediate representation called VINST, which follows the basic principle of optimizing the frequently executed part while keeping the other parts correct. It therefore contains only the twenty-odd instructions most commonly used across instruction sets; the remaining relatively simple instructions are simulated with combinations of intermediate instructions corresponding to their functions, and complex instructions are simulated with a CALL intermediate instruction, so that high-level functionality is emulated. This method is simple in form, but its efficiency is very low. CIFUENTES et al. [8] developed the binary translation system UQBT, which describes the instruction information of the target system through the semantic specification language SSL.
In UQBT, the conversion from assembly language to the intermediate language is done by generating a dictionary of instructions. Since generation of the intermediate language depends entirely on the SSL description of the instruction system, if the SSL description is inaccurate, the intermediate language converted from it will have missing or inaccurate semantics. The later decompiler Boomerang was developed on the basis of UQBT and is therefore similar to it in principle. Ma Jinxin et al. introduced an intermediate language in static single assignment form for type reconstruction, in which each variable is defined only once, reducing the large and complex x86 instruction set to a simple equation form; on this basis they proposed a method of constructing a register abstract syntax tree to solve the base-pointer alias problem. However, because many cases are ignored during translation, the intermediate language produced by this method suffers from information loss. The Valgrind platform also employs an intermediate representation, the architecture-independent VEX intermediate language. VEX uses a RISC-style instruction set with read and write operations for memory and registers: LOAD and STORE are the read and write operations on memory variables; GET and PUT are the operations on register variables, which read a register's value into a temporary variable or write a temporary variable's value back into a register, respectively; and the IMark statement marks the address of a specific instruction.
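To make these operations concrete, the following is a minimal sketch of how GET/PUT/LOAD/STORE statements can be interpreted over separate register and memory stores, with temporaries holding intermediate values. The tuple encoding and instruction names here are illustrative, not actual VEX syntax.

```python
# Minimal sketch of a VEX-like IR interpreter (illustrative, not real VEX).
# Registers and memory are separate stores; temporaries hold intermediate values,
# mirroring VEX's GET/PUT (registers) and LOAD/STORE (memory) operations.

def run_ir(stmts, regs, mem):
    tmps = {}
    for op, *args in stmts:
        if op == "IMark":          # marks the address of the guest instruction
            pass
        elif op == "GET":          # t = GET(reg): read register into temporary
            t, reg = args
            tmps[t] = regs[reg]
        elif op == "PUT":          # PUT(reg) = t: write temporary into register
            reg, t = args
            regs[reg] = tmps[t]
        elif op == "LOAD":         # t = LOAD(addr): read memory into temporary
            t, addr = args
            tmps[t] = mem[addr]
        elif op == "STORE":        # STORE(addr) = t: write temporary into memory
            addr, t = args
            mem[addr] = tmps[t]
        elif op == "ADD":          # t = ADD(t1, t2): pure arithmetic on temporaries
            t, a, b = args
            tmps[t] = tmps[a] + tmps[b]
    return regs, mem

# Lifted form of something like "add eax, [0x1000]":
stmts = [
    ("IMark", 0x400000),
    ("GET", "t0", "eax"),
    ("LOAD", "t1", 0x1000),
    ("ADD", "t2", "t0", "t1"),
    ("PUT", "eax", "t2"),
]
regs, mem = run_ir(stmts, {"eax": 5}, {0x1000: 7})
print(regs["eax"])  # 12
```

Note how one guest instruction becomes several IR statements — exactly the efficiency cost discussed below.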
However, vulnerability analysis based on intermediate languages also faces some problems. First, some information is lost during translation into the intermediate language — for example, the VEX language used by Valgrind does not fully preserve the EFLAGS register. Second, intermediate languages translate a single instruction into several instructions, which reduces efficiency; this affects, for example, the VEX used by Valgrind and the Vine IL used by BitBlaze.
Since the intermediate language is the premise of all subsequent work, improving the efficiency and accuracy of binary translation is an urgent problem for every translator. For the intermediate language stage, future research mainly includes two aspects. First, because intermediate code introduces much redundant information, conciseness must be the primary design principle: repetitive information should be minimized, with each instruction performing a single function. Second, while reducing redundancy, the translated intermediate language must retain semantic information as complete as possible, so the current problem of information loss needs further optimization. By comparing the intermediate language with the assembly obtained by disassembling the corresponding binary program, one can check whether the intermediate language lacks important semantic information; the missing semantics must then be supplemented so as to provide the most complete data flow and control flow possible for subsequent vulnerability analysis.
3 Dynamic Taint Analysis Techniques
3.1 Principle and classification of dynamic taint analysis
Taint analysis is a technique that tracks and analyzes the flow of tainted information through a program; it was first proposed by DENNING in 1976. Its main principle is to mark data from untrusted sources, such as the network or files, as "tainted"; data newly generated by arithmetic and logical operations on tainted data inherits the "tainted" attribute of the source data. By analyzing these attributes, certain properties of the program can be derived. Depending on whether the program is executed, taint analysis can generally be divided into two types: dynamic taint analysis and static taint analysis.
The advantage of static taint analysis is that it can quickly locate all situations that may occur in the program; however, because static analysis cannot obtain information that is only available at run time, its accuracy is low, and the results often need to be reviewed and confirmed manually. Typically, binary static taint analysis refers to techniques that use a disassembler such as IDA to disassemble the binary code and then perform static taint analysis on the result.
Dynamic taint analysis, also known as dynamic information flow analysis, is the other taint analysis technique and has become popular in recent years. It marks and tracks specific data while the program is running and determines whether the program has a vulnerability based on whether this data affects certain pre-agreed critical operations. In the security domain, dynamic taint analysis has been used successfully to prevent several types of attack, such as buffer overflows, format string attacks, SQL injection, and cross-site scripting (XSS). Recently, researchers have begun to apply taint-based methods beyond security, for example to program understanding, software testing, and debugging.
By analysis granularity, dynamic taint analysis can be divided into fine-grained and coarse-grained dynamic taint analysis. Coarse-grained taint analysis is generally used to detect abnormal behavior; it is fast and uses little storage, but its accuracy is often low. Fine-grained taint analysis is mainly used to locate the exploit point of a vulnerability; its accuracy is high and it can support tasks such as data-flow backtracking, but it suffers from high testing cost and excessive space consumption. Shi Dawei et al. combined the two to complement each other's advantages and proposed a dynamic taint analysis method that mixes coarse and fine granularity.
3.2 Basic process of dynamic taint analysis
The main process of dynamic taint analysis consists of 3 stages, as shown in Figure 4.
Figure 4 The basic process of dynamic taint analysis
1. Identify the taint source
Identifying the taint source is a prerequisite for taint analysis. Current methods of taint source identification mainly include: marking all external inputs of the program as tainted data; labeling taint sources manually based on manual analysis; and using statistical or machine learning techniques to identify and label taint sources automatically.
2. Taint tracking
Tainted data is tracked according to specific propagation rules in order to obtain its propagation path. The taint propagation rules are the most important part of the taint analysis process. There are two approaches to taint propagation analysis of binary code: one analyzes the underlying instruction set directly, and the other analyzes the intermediate language into which the binary code has been translated. Because direct analysis of binary code lacks semantic information and is therefore difficult, taint analysis based on an intermediate language will be used, propagating taint according to the information flow obtained from the intermediate language.
3. Critical point detection
Critical points are checked according to reasonable detection rules to see whether the critical operations of the program use data marked as tainted.
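The three stages above can be sketched in a few lines. This toy tracker — all names and rules are illustrative, not from any real tool — marks a source, propagates taint through assignments, and raises an alarm when a tainted value reaches a critical sink.

```python
# Toy variable-level taint tracker illustrating the three stages:
# source identification, propagation, and sink (critical point) checking.
# All names and rules are illustrative, not from a real tool.

class TaintTracker:
    def __init__(self):
        self.tainted = set()        # names of tainted variables

    def source(self, var):
        """Stage 1: mark data from an untrusted source as tainted."""
        self.tainted.add(var)

    def assign(self, dst, *srcs):
        """Stage 2: dst inherits taint if any operand is tainted."""
        if any(s in self.tainted for s in srcs):
            self.tainted.add(dst)
        else:
            self.tainted.discard(dst)   # overwriting with clean data clears taint

    def check_sink(self, var):
        """Stage 3: report whether a critical operation consumes tainted data."""
        return var in self.tainted

t = TaintTracker()
t.source("user_input")            # e.g. data read from the network
t.assign("length", "user_input")  # length = f(user_input): taint propagates
t.assign("counter", "constant")   # counter = constant: stays clean
print(t.check_sink("length"))     # True  -> tainted data reaches, say, a memcpy size
print(t.check_sink("counter"))    # False
```

A real tool applies the same logic per instruction (or per IR statement) at byte or bit granularity rather than per named variable.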
3.3 Research Status
In recent years, many scholars at home and abroad have conducted in-depth research on taint analysis. YIN et al. proposed TEMU, an extension platform based on the QEMU virtual machine, which resolves some of the difficulties of dynamic analysis and makes program analysis on top of it convenient. TEMU takes a whole-system perspective to analyze activities in the kernel and interactions between multiple processes, with in-depth, fine-grained analysis. KANG et al. proposed DTA++, a dynamic taint propagation scheme consisting of two stages: first, offline analysis diagnoses the branches that produce incomplete taint and determines the additional propagation rules required; second, those rules are applied in later dynamic taint analysis. The basic principle is to find the control-flow paths with incomplete taint and to solve the constraints of those paths using both symbolic execution and path predicates, where the execution trace of the program is represented as a set of path constraints determined by a series of branch conditions. Zhuge Jianwei et al. proposed a dynamic taint analysis technique based on type awareness together with symbolic execution of typed variables, which effectively addresses the inability of current binary-level taint analysis to provide high-level semantics for software vulnerability analysis. The technique labels the taint variable at the type source and adds type information as a taint attribute to form the taint source.
During taint propagation, the semantics of instructions and public library functions are used, supplemented by the type information revealed at Type Sink points, to pass on the type information of variables and perform variable-level taint propagation; symbolic execution of typed variables is carried out while taint propagates, and the program's path conditions are collected to form a library of security vulnerability features. Ma Jinxin et al. proposed a taint analysis method based on offline indexing of execution traces: a dynamic instrumentation tool tracks the execution of the program and records its state into a trace file; the trace file is then analyzed and indexed offline, and during taint propagation only the operations related to tainted data are considered, with instructions unrelated to tainted data skipped directly, thus saving analysis time.
3.4 Problems
Current research on dynamic taint analysis mainly faces the following problems.
3.4.1 Implicit flow problems
The implicit flow problem is the problem of contamination through control-flow dependencies. Taint propagation analysis can be divided into explicit flow analysis and implicit flow analysis according to the program dependencies of interest. Explicit flow analysis analyzes how taint marks propagate along data dependencies between variables; implicit flow analysis analyzes how taint marks propagate along control dependencies as they travel with program variables. Implicit flow propagation is an important issue in taint analysis: if tainted data is not handled properly, the results of the analysis will be inaccurate. There are currently two main failure modes, under-tainting and over-tainting. Under-tainting occurs when implicit flow propagation is not handled, so variables that should be labeled are not; over-tainting occurs when too many marks are introduced, causing taint variables to proliferate. Current research on the implicit flow problem focuses on minimizing both under-tainting and over-tainting.
The first concern of dynamic implicit flow analysis is how to determine the scope of statements that need to be labeled under a tainted control condition. Since a dynamic execution trace does not reflect the control dependencies of the executed program, the current approach is to use offline static analysis to help determine the range of implicit flow marks during dynamic taint propagation. The second problem is the underreporting of potential security issues due to partial taint information leakage: because some paths are not covered, i.e. not executed, taint information may propagate and leak through these unexecuted paths during dynamic analysis; the current solution is again to add offline static analysis to the dynamic analysis. The third problem is how to choose appropriate tainted branches for propagation. Simply propagating along all branches containing taint marks leads to over-tainting, so the tainted branches must be filtered to reduce it. The DTA++ tool proposed by KANG et al. uses symbolic execution over offline execution traces to find the branches along which taint should be propagated.
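The under-tainting case can be illustrated in a few lines (toy rules, all names this sketch's own): an explicit-only tracker catches a direct copy, but misses a copy made entirely through a branch condition.

```python
# Illustrative example of explicit vs implicit taint flow (toy rules).
tainted = {"x"}                      # x holds untrusted input

# Explicit flow: direct data dependency, handled by ordinary propagation.
def explicit_copy(src, dst):
    if src in tainted:
        tainted.add(dst)

# Implicit flow: the result depends on x only through the branch condition.
# "y = 1 if x == 1 else 0" copies information about x with no data dependency.
def implicit_copy(x_value):
    if x_value == 1:                 # branch condition on tainted x
        return 1                     # assignment under tainted control
    return 0

explicit_copy("x", "y1")
y2 = implicit_copy(1)

print("y1" in tainted)   # True  -- explicit flow is caught
print("y2" in tainted)   # False -- under-tainting: implicit flow is missed
```

Conversely, tainting every variable written anywhere under a tainted condition is the over-tainting extreme; practical tools sit between the two.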
3.4.2 Taint removal problems
In a taint analysis, if the number of taint marks grows continuously during execution and marks that should be removed are not, false positives result and the accuracy of the analysis suffers. Proper taint removal reduces the number of taint marks in the system, improves the efficiency of the analysis, and avoids inaccurate results caused by taint spread. Therefore, for special cases — such as the encryption of sensitive data or operations whose result is constant — removing taint should be considered.
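A classic constant-result case is an instruction like xor eax, eax, which always produces zero regardless of how tainted its operand is, so an illustrative propagation rule (the rule and names are this sketch's own) can clear the taint:

```python
# Sketch of taint removal for a constant-result operation (illustrative rule).
tainted = {"eax"}    # eax currently holds tainted data

def propagate_xor(dst, src1, src2):
    """Model 'xor dst, src1, src2': the result depends on the operands,
    UNLESS src1 == src2, in which case it is always 0 and taint is removed."""
    if src1 == src2:
        tainted.discard(dst)                 # constant result: sanitize
    elif src1 in tainted or src2 in tainted:
        tainted.add(dst)
    else:
        tainted.discard(dst)

propagate_xor("eax", "eax", "eax")   # xor eax, eax -> eax = 0
print("eax" in tainted)              # False: taint correctly removed
```

A tracker without this rule would keep eax tainted forever and let the taint spread to everything computed from it.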
3.4.3 High analysis overhead
Some taint analysis tools require instrumentation or dynamic code rewriting, which imposes significant overhead on the analysis system. For example, TaintCheck uses the instrumentation tool Valgrind, with Ucode as the intermediate representation, to implement dynamic taint analysis for x86 programs; Dytan uses the instrumentation framework Pin for the same purpose. To control the cost of analysis, existing work adopts the idea of selectively analyzing taint propagation over the instructions in the system. For example, the fast-path optimization technique proposed by QIN et al. determines in advance whether the inputs and outputs of a module are tainted, and performs taint propagation only when they are, thereby reducing cost. Peng Jianshan et al. used taint backtracking (a reverse taint propagation analysis) to trace the taint back to its source by building a set of suspicious taints, narrowing the scope to be analyzed and thus reducing cost. Lin Wei et al. further addressed the high time cost of propagation analysis over trace files in offline taint analysis and proposed an efficient taint propagation method based on semantic rules, reducing cost and improving efficiency. However, all of these methods affect the accuracy of the analysis to a greater or lesser extent, so further research is needed on reducing analysis cost without reducing accuracy.
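The fast-path idea can be sketched as follows — a simplification based only on this survey's description, not QIN et al.'s actual implementation: if nothing flowing into a block is tainted, expensive per-instruction propagation is skipped entirely.

```python
# Sketch of fast-path optimization for taint propagation (illustrative).
# If no input of a code block is tainted, skip per-instruction propagation
# and simply mark every output of the block as clean.

def run_block(block, tainted):
    if not any(v in tainted for v in block["inputs"]):
        tainted.difference_update(block["outputs"])   # fast path: all clean
        return "fast"
    for dst, srcs in block["instructions"]:           # slow path: full propagation
        if any(s in tainted for s in srcs):
            tainted.add(dst)
        else:
            tainted.discard(dst)
    return "slow"

tainted = {"net_buf"}
clean_block = {"inputs": ["i"], "outputs": ["j"],
               "instructions": [("j", ["i"])]}
dirty_block = {"inputs": ["net_buf"], "outputs": ["len"],
               "instructions": [("len", ["net_buf"])]}

print(run_block(clean_block, tainted))   # fast
print(run_block(dirty_block, tainted))   # slow
print("len" in tainted)                  # True
```

The cost saving comes from how rarely most code touches tainted data; the accuracy risk is in deciding block inputs and outputs correctly.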
3.4.4 High false negative rate
Dynamic taint analysis suffers from a high false negative rate and low coverage, a problem that is difficult to solve because it is determined by the nature of dynamic execution: the code executed in a single run of a program varies with the test case given as input. The result of a single dynamic taint analysis may therefore cover only part of the program's code, so vulnerabilities in the unexecuted portion cannot be found, resulting in a high false negative rate. To solve this problem, the whole program's code and functionality must be analyzed, and suitable test cases must be constructed continuously to push the code coverage of the dynamic analysis as close to 100% as possible. Zhu Zhengxin et al. proposed a dynamic symbolic taint analysis method to address the high false negative rate of taint analysis: it symbolizes the taint information and the risk rules, detects violations of the risk rules while the tainted data propagates, and thereby detects unsafe program behavior. Cui Hualiang et al. proposed an offline taint analysis method that splits the taint analysis system into two modules, a dynamic recording module and a static replay module, and reduces the high false negative rate of dynamic analysis by adding static analysis.
3.4.5 High-level semantic information is missing
The underlying binary code lacks the necessary semantic and type information, which limits the application of dynamic taint analysis to binary programs. Zhuge Jianwei et al. proposed a dynamic taint analysis technique based on type awareness together with symbolic execution of typed variables, which effectively addresses the inability of current binary-level taint analysis to provide high-level semantics for software vulnerability analysis. The technique makes full use of the type information at input points for type propagation and derivation and, combined with the type information at Type Sink points, raises dynamic taint analysis of binary programs to variable granularity; symbolic execution of typed variables is performed while taint propagates, giving vulnerability analysis and feature extraction better semantic support.
The advantage of dynamic taint analysis is that accurate runtime information can be obtained by executing the program, but because each dynamic run follows only a single program path, some paths remain uncovered even after many executions; security problems on these uncovered paths are then missed, and some taint information leakage problems remain. Dynamic taint analysis can therefore be combined with symbolic execution to improve path coverage. At present, the research focus of dynamic taint analysis includes the design of propagation logic, optimization of analysis efficiency, and implicit flow analysis. Since current taint analysis tools cannot balance speed and accuracy, some researchers have proposed a dynamic taint analysis method mixing coarse and fine granularity, which combines the advantages of both: an online coarse-grained mode ensures rapid collection of taint information, while an offline fine-grained mode improves accuracy at a reasonable time cost. In future work, dynamic taint analysis will be carried out on the information flow obtained from the intermediate language, vulnerabilities will be determined in combination with the formulated security policies, and the determination will also be combined with hybrid symbolic execution.
4 Hybrid symbolic execution techniques
4.1 Fundamentals and classification of symbolic execution
4.1.1 Symbolic execution and hybrid symbolic execution
The basic idea of symbolic execution is to replace program variables with abstract symbols, or to represent the values of program variables as calculation expressions composed of symbolic values and constants, and then simulate program execution to perform the analysis. Symbolic execution has the following advantages. First, a single symbolic execution is equivalent to concrete executions over a possibly infinite set of inputs; this set is called an input equivalence class. Second, algebraic relationships between variables can be discovered, making the internal logic of the program easier to understand. Finally, symbolic execution accurately records the path constraints, making it easier to judge path feasibility and to construct test cases for the program from those constraints.
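To make path constraints and input equivalence classes concrete, here is a deliberately tiny sketch, not a real engine: each path of an invented toy program is represented by its list of branch predicates, and a brute-force search over a small input domain stands in for an SMT solver.

```python
def symbolic_paths():
    """Paths of the toy program
           if x > 10:
               if x % 2 == 0:
                   BUG()
       as (path constraints, outcome) pairs."""
    gt = lambda x: x > 10
    even = lambda x: x % 2 == 0
    return [
        ([lambda x: not gt(x)],       "ok"),   # x <= 10
        ([gt, lambda x: not even(x)], "ok"),   # x > 10, x odd
        ([gt, even],                  "BUG"),  # x > 10, x even: vulnerability condition
    ]

def solve(constraints, domain=range(256)):
    """Toy 'constraint solver': smallest concrete input satisfying every
    predicate on the path, i.e. one member of the input equivalence class."""
    for x in domain:
        if all(c(x) for c in constraints):
            return x
    return None  # path infeasible over this domain

results = [(outcome, solve(cs)) for cs, outcome in symbolic_paths()]
print(results)
```

Each solved witness is one concrete test case standing in for every input on that path, which is why one symbolic run can replace many concrete ones.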
However, static symbolic execution lacks information about the program's dynamic runtime, so program execution is difficult to simulate fully and the analysis is not accurate enough. To solve this problem, researchers proposed hybrid symbolic execution, a dynamic analysis technique. Its core idea is to determine, in the program's real operating environment, which parts of the program can run directly and which parts need symbolic execution. In this way, the program's runtime information is fully utilized, improving the accuracy of the analysis.
4.1.2 Classification of hybrid symbolic execution for binary programs
Binary-oriented hybrid symbolic execution can be divided into online and offline symbolic execution, according to whether analysis happens simultaneously with execution; from the perspective of the form of the analyzed object, it can be divided into symbolic execution over the underlying instruction set and symbolic execution over intermediate code.
Online symbolic execution performs symbolic computation simultaneously with program execution, as in SmartFuzz; offline symbolic execution first records the program's execution trace and then performs symbolic computation while replaying that trace, as in SAGE.
Online symbolic execution faces the following problems. 1. Because symbolic computation occupies considerable resources, online symbolic execution may degrade the performance of the target program; for example, the target program may exit early for lack of computing resources, which limits the depth of the execution trace, so the collected trace may lack some necessary internal information. 2. Program execution usually involves concurrency and nondeterministic events, which can cause unpredictable problems for online symbolic execution; synchronization relationships in multi-threaded or multi-process programs are likely to be broken by online hybrid symbolic execution, leading to resource conflicts or even errors in the target program. 3. If protection techniques such as code obfuscation and packing are used in the target program, hybrid symbolic execution is likely to fail. For these reasons, future research work will use offline symbolic execution for vulnerability mining.
Offline symbolic execution effectively alleviates the above problems. On the one hand, during offline symbolic execution only the program's execution trace is recorded and no symbolic computation is performed, so the performance loss to the target program is small; with hardware virtualization support, the trace can even be recorded at the hardware level. On the other hand, recording the execution trace is a relatively simple function that has little impact on the target program's execution and does not break the synchronization relationships of its threads or processes. Moreover, the symbolic computation performed during trace replay can be carried out entirely on a high-performance computing platform, which may be a different platform from the one the target program runs on, thereby improving the performance of the symbolic analysis.
Symbolic execution for the underlying instruction set performs symbolic computation directly according to the semantics of the underlying instructions, as in SAGE; symbolic execution for intermediate code first translates the underlying instructions on the execution trace into intermediate code and then performs symbolic computation on that code, as in SmartFuzz.
The main advantage of symbolic execution over intermediate code is good compatibility across instruction sets. Because the symbolic execution engine interprets the intermediate code rather than the underlying instructions, it is also highly extensible: for programs running on a different instruction set, only the execution-trace recording needs to be re-implemented, and the different instruction sets can be translated into the same intermediate code so that programs on that platform can be analyzed accordingly. During trace replay, the underlying instructions are first translated into the intermediate language, the corresponding information flow is obtained from the intermediate language, and hybrid symbolic execution is then carried out.
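The translation step can be sketched as follows. The mnemonics and IR tuples here are invented for illustration; real platforms such as Vine or VEX define far richer intermediate representations, but the point is the same: two different instruction sets lift to one IR form, and the analyzer only ever interprets that form.

```python
def lift_x86(insn):
    """Lift an x86-style two-operand instruction to a simple three-address IR."""
    op, args = insn.split(maxsplit=1)
    dst, src = [a.strip() for a in args.split(",")]
    if op == "add":              # x86: add eax, ebx  =>  eax = eax + ebx
        return ("ADD", dst, dst, src)
    if op == "mov":
        return ("MOV", dst, src)
    raise NotImplementedError(op)

def lift_arm(insn):
    """Lift an ARM-style three-operand instruction to the same IR."""
    op, args = insn.split(maxsplit=1)
    parts = [a.strip() for a in args.split(",")]
    if op == "add":              # ARM: add r0, r1, r2  =>  r0 = r1 + r2
        return ("ADD", parts[0], parts[1], parts[2])
    raise NotImplementedError(op)

# Two instruction sets, one IR form for the downstream analyses:
ir1 = lift_x86("add eax, ebx")
ir2 = lift_arm("add r0, r0, r1")
print(ir1)
print(ir2)
```

To support a new platform, only a lifter like the two functions above must be written; taint propagation and symbolic execution over the IR are reused unchanged.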
4.2 The process of symbolic execution
Typically, the basic process of vulnerability analysis using symbolic execution is shown in Figure 5. It mainly includes two parts: binary program modeling and vulnerability analysis.
Figure 5 The basic process of symbolic execution
1. Binary program modeling
Modeling a binary program involves basic parsing of the binary code to obtain an intermediate representation of the program. Because symbolic execution is usually a path-sensitive analysis, after the binary code is parsed the system usually needs to build structural graphs that describe program paths, such as the call graph and the control-flow graph, which may be used to assist symbolic execution during the analysis.
2. Vulnerability analysis
The vulnerability analysis process mainly includes two parts: symbolic execution and constraint solving. Symbolic execution represents the values of variables as evaluation expressions over symbols and constants, and expresses both path conditions and vulnerability conditions in the program as constraints on symbolic values. Constraint solving not only determines whether path conditions are satisfiable, choosing paths based on the result, but also checks whether the program's vulnerability conditions are satisfied. During symbolic execution, certain vulnerability analysis rules are often needed; these describe under which circumstances symbols should be introduced and under which circumstances the program may contain a vulnerability.
4.3 Development and research status of symbolic execution
Symbolic execution was proposed in 1975 and was first used in software testing. Later, BUSH et al. proposed the static symbolic execution tool PREfix, which uses path-sensitive, inter-procedural analysis to reduce false positives and applies symbolic execution to the field of vulnerability mining. To avoid path explosion during simulated execution, PREfix checks at most 50 paths inside each function; after all paths are completed, it merges the results of all paths into a summary of the function, and when the function is called again the built summary can be applied directly to accelerate the simulation. Microsoft then acquired PREfix and on this basis implemented PREfast, which has become one of the standard static source-code detection tools within Microsoft.
Static symbolic execution requires simulating the program's execution environment during analysis, and because it lacks information about the program's dynamic operation, program execution is difficult to simulate fully and the analysis is not accurate enough. To solve this problem, researchers proposed hybrid symbolic execution, transforming symbolic execution from a static program analysis technique into a dynamic one, and implemented a series of representative tools such as CUTE and DART. CUTE uses hybrid execution techniques to represent the inputs of the unit under test as memory graphs, generating and tracking constraint expressions during symbolic execution. To reduce the overhead of symbolic execution, CUTE combines random testing with symbolization: by default it probes the program with random testing, and when random testing approaches saturation, test cases are generated by symbolic execution to move on to a new path and continue execution. KLEE, developed by CADAR et al., is an open-source tool that constructs program test cases using symbolic execution; it can analyze C programs under Linux, and while constructing test cases it also uses symbolic execution and constraint solving to analyze the value ranges of symbols at key program points and check whether those ranges stay within the bounds of the safety rules. If a symbol's value is found not to satisfy the safety rules, the program is considered to contain the corresponding vulnerability. When KLEE constructs test cases, it considers not only path conditions but also the conditions that trigger program vulnerabilities.
Because source-oriented tools are limited by the inability to obtain program source code, researchers have proposed implementing hybrid symbolic execution at the binary-code level and have developed many tools on this basis, such as SAGE, BitBlaze, and SmartFuzz. SAGE dynamically tests programs with legitimate inputs, tracks and records instruction execution with the help of the binary analysis platform iDNA, collects path constraints during instruction replay and solves them one by one, and generates new test sets to drive the program down different paths and maximize code coverage. BitBlaze is a binary analysis platform integrating static analysis, dynamic analysis, and dynamic symbolic execution; it is mainly composed of three parts, Vine, TEMU, and Rudder, which implement static analysis, dynamic analysis, and dynamic symbolic execution respectively. SmartFuzz is a Valgrind-based binary instrumentation tool for discovering integer vulnerabilities in binary programs under Linux; it converts binary code into the VEX intermediate language, then performs online symbolic execution and constraint solving, dynamically generating test cases.
4.4 Problems
Current research on symbolic execution mainly faces the following two problems.
4.4.1 Path explosion problem
Symbolic execution faces a major problem: path explosion. The main reason is that whenever a branch statement such as if is encountered, each existing path can split into one more new path, so the number of paths grows exponentially with the number of branches.
There are several main ways to mitigate the path explosion problem. 1. Conventionally limit the number of paths within each procedure, somewhat like PREfix's per-function path limit. 2. Improve the analysis performance of symbolic execution through better path scheduling algorithms; for example, BOONSTOPPEL et al. proposed, on the basis of the EXE tool, pruning paths according to the manner and characteristics of program execution, cutting away some unnecessary path analysis. This method has proven to be a good way to mitigate path explosion during symbolic execution. 3. Use a parallelization strategy for symbolic execution, improving the speed of the constraint solver with current parallel computing technology.
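A toy illustration of the growth rate and of the simplest mitigation above (a per-procedure path cap): a program with n independent branches has 2**n paths, and a cap bounds the exploration at the cost of completeness. The numbers are arbitrary.

```python
from itertools import product

def enumerate_paths(n_branches, cap=None):
    """Enumerate branch-outcome tuples; each tuple is one program path."""
    paths = []
    for choices in product([False, True], repeat=n_branches):
        paths.append(choices)
        if cap is not None and len(paths) >= cap:
            break                    # per-procedure path cap (mitigation 1)
    return paths

full = enumerate_paths(10)           # 2**10 = 1024 paths from just 10 branches
capped = enumerate_paths(10, cap=50) # bounded exploration, PREfix-style
print(len(full), len(capped))
```

Ten branches already yield over a thousand paths; at realistic program sizes the uncapped enumeration is hopeless, which is why scheduling and pruning strategies matter.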
4.4.2 Performance Issues
Symbolic execution uses constraint solvers to solve path-condition expressions and also needs to instrument the program under analysis. According to statistics, 2,000 lines of C code can reach 40,000 lines after instrumentation, which not only adds a large number of lines of code but also increases the number of calls to the solver, and the performance of the solver determines the efficiency of symbolic execution. The current solution is to reduce the frequency with which the solver is invoked, mainly in two ways: one is to use caching, storing commonly used constraint expressions or intermediate expressions obtained during solving so as to avoid unnecessary repeated solving; the other is to further optimize the constraints to improve solving speed, for example by eliminating irrelevant constraints, rewriting expressions, and simplifying constraint sets.
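The caching idea can be sketched as follows. The "solver" here is a deliberately trivial stub standing in for an SMT solver call (the expensive part); the point is that constraint sets are normalized (order-insensitive) before lookup, so syntactically different but equivalent queries hit the cache instead of re-invoking the solver.

```python
solver_calls = 0

def expensive_solve(constraints):
    """Stub standing in for an SMT solver call."""
    global solver_calls
    solver_calls += 1
    # toy rule: unsatisfiable iff the set contains both "x>0" and "x<0"
    return not ({"x>0", "x<0"} <= set(constraints))

cache = {}

def cached_solve(constraints):
    key = frozenset(constraints)     # normalize: order-insensitive
    if key not in cache:
        cache[key] = expensive_solve(key)
    return cache[key]

queries = [["x>0", "y>1"], ["y>1", "x>0"], ["x>0", "x<0"], ["x>0", "y>1"]]
results = [cached_solve(q) for q in queries]
print(results, solver_calls)
```

Four queries cost only two solver calls; real engines apply the same idea to constraint subsets and counterexample reuse.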
Today, hybrid symbolic execution has become an important technique for traversing a program's execution space. Unlike traditional static symbolic execution, hybrid symbolic execution makes full use of information available while the program is running, improving the accuracy of symbolic execution and advancing its development. In addition, as the industry's demand for binary program analysis continues to grow, hybrid symbolic execution for binary programs is becoming increasingly important. The main future research directions are as follows:
1. Symbolic execution can be further combined with other techniques, leveraging their strengths to complement symbolic execution and compensate for the problems each analysis technique faces in isolation.
2. The path explosion problem faced by symbolic execution needs further improvement. Abstraction can be used to process the numerous paths into higher-level collections, so that only these abstract collections need to be traversed; this greatly reduces the number of paths and can alleviate the path explosion problem.
5 Fuzz testing techniques
5.1 Basic principles and classification of fuzz testing techniques
Fuzz testing is a representative software vulnerability analysis technique, first proposed by MILLER et al. in 1989. Because it does not require in-depth analysis of the target program, and the vulnerabilities it finds are generally real, fuzz testing occupies an important position in the field of software vulnerability analysis. Its basic idea is to provide a large amount of specially constructed or random data as input to the program under test, monitor the program for abnormal operation and record the inputs that cause the abnormality, and then, supplemented by manual analysis, further locate the vulnerability in the software from the error-inducing inputs. From the perspective of the test case generation strategy, fuzz testing can be divided into mutation-based and generation-based fuzz testing; from the perspective of code openness, it can be divided into black-box and white-box fuzz testing.
Mutation-based fuzz testing creates new test cases by mutating existing data; generation-based fuzz testing builds test cases from scratch by modeling the specific program under test.
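A minimal sketch of the mutation-based strategy (the seed and parameters here are arbitrary, invented for illustration): new cases are produced by flipping random bits of an existing valid input, whereas a generation-based fuzzer would instead build inputs from a model of the input format.

```python
import random

def mutate(seed: bytes, n_mutations: int = 4, rng: random.Random = None) -> bytes:
    """Return a new test case by flipping n_mutations random bits of the seed."""
    rng = rng or random.Random(0)
    data = bytearray(seed)
    for _ in range(n_mutations):
        i = rng.randrange(len(data))
        data[i] ^= 1 << rng.randrange(8)   # flip one random bit of one byte
    return bytes(data)

seed = b"GET /index.html HTTP/1.1"         # an existing, valid input
cases = [mutate(seed, rng=random.Random(s)) for s in range(3)]
for case in cases:
    print(case)
```

Because mutants stay close to a valid input, they tend to pass shallow parsing and reach deeper code than purely random data, which is the practical appeal of the mutation-based approach.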
Black-box fuzz testing, originally called random testing, is a typical black-box dynamic analysis technique. The main idea is to feed large amounts of irregular data into the program under test and watch for abnormal behavior. The method is simple and fast to execute, but its flaw is that test case generation is blind and untargeted, so path coverage is low. The term "fuzz testing" without qualification generally refers to black-box fuzz testing.
White-box fuzz testing, which has become increasingly popular in recent years, combines fuzz testing with symbolic execution. The basic idea is as follows: first, symbolic execution traverses the program's paths, analyzes and solves the path conditions, and determines whether each path is reachable; then specific test cases are generated from the path constraints of the reachable paths and used for fuzz testing. Because symbolic execution assists in generating test cases, this approach achieves high path coverage; but for the same reason it also inherits the problems of path explosion and limited constraint-solving ability.
5.2 Basic Process of Fuzz Testing
A fuzz test goes through roughly five basic stages, as shown in Figure 6.
Figure 6 Basic process of fuzz testing
1. Identify the target. To choose the appropriate fuzz testing tool or technique, it is first necessary to fully understand the target program.
2. Identify the inputs. Most vulnerabilities are exploited because programs fail to properly handle some illegal data received as external input, so considering all possible input scenarios for a program is important for the success of fuzz testing.
3. Build test cases. Building test cases is a critical step in fuzz testing, and the test case generation strategy determines the efficiency of the whole test. Generation methods fall into two broad categories, mutation-based and generation-based strategies, and a test case set can be built with either or both. Common ways to build test cases include pre-generated test cases, randomly generated test cases, and mutation or forced generation of test cases.
4. Monitor execution and filter exceptions. Observe whether the target program behaves abnormally or the system crashes as each test case executes, and judge the vulnerability from the abnormal result.
5. Determine exploitability. Further examine the identified vulnerability in the context of the application to determine whether it can be exploited.
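Stages 3 through 5 can be sketched with an in-process harness. The `buggy_parser` target is invented for illustration, and uncaught exceptions stand in for the signals or crashes a real monitor would observe; the harness records each crashing input for later triage and exploitability analysis.

```python
def buggy_parser(data: bytes):
    """Toy target: crashes when the declared length exceeds the payload."""
    if len(data) < 1:
        raise ValueError("empty input")
    declared = data[0]
    payload = data[1:]
    return payload[:declared][declared - 1]  # IndexError if declared > len(payload)

def run_campaign(cases):
    """Monitor execution: run each case, filter abnormal results, keep inputs."""
    crashes = []
    for case in cases:
        try:
            buggy_parser(case)
        except Exception as exc:             # a crash in the toy target
            crashes.append((case, type(exc).__name__))
    return crashes

cases = [b"\x03abc", b"\x05ab", b""]
found = run_campaign(cases)
print(found)
```

The recorded (input, exception) pairs are exactly what the manual-analysis step in the basic fuzzing process starts from when locating the vulnerability.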
5.3 Development and research status of fuzz testing
Fuzz testing, long used in software engineering and originally called random testing, initially focused not on evaluating software security but on verifying code quality. In 1989, MILLER et al. developed a fuzz testing tool to test the robustness of programs on UNIX systems. This early fuzzing method was a relatively simple black-box test: a random string was constructed and passed to the target application; if the application raised an exception during execution the test was considered failed, otherwise passed. It turned out that this method crashed roughly 25% of the tested programs. In 1999, the protocol testing project team of the University of Oulu, Finland, developed the network security testing software PROTOS and first proposed the concept of a test set.
In 2002, AITEL released the fuzz testing framework SPIKE. The framework is designed for testing network protocols and employs a block-based approach that is effective for data blocks containing variables and corresponding variable lengths, while allowing users to create their own custom fuzzers. In 2004, EDDINGTON released Peach, a fuzz testing framework supporting both network protocols and file formats. Written in Python, Peach primarily uses an XML file to guide the entire fuzzing process, and because it runs cross-platform it can test almost any network protocol. In 2007, AMINI released the fuzz testing framework Sulley, a Python-based framework that fuzzes various network protocols and is composed of multiple extensible components, including data generation, agents, and utilities. Sulley's advantages are that it is simple to use and performs path-based analysis, taking path coverage into account.
In applying fuzz testing to vulnerability mining, scholars have also developed corresponding tools. In view of the limitation of fuzz testing when it encounters checksum protection mechanisms, in 2010 WANG et al. of Peking University combined hybrid symbolic execution with fine-grained dynamic taint propagation to propose a method that can bypass checksum mechanisms, and developed the corresponding tool TaintScope, clearing the way for fuzz testing to find deep vulnerabilities. Since 2011, genetic algorithms and heuristic algorithms have been used by many researchers to assist in generating test cases, improving the efficiency of test case generation.
5.4 Problems
Fuzz testing has many advantages over other vulnerability mining methods, but it also has shortcomings and limitations. One of the more important is that fuzz test cases are not guaranteed to cover all statement branches; in addition, the current degree of automation of fuzz testing is not high. Optimizing test case generation and improving the efficiency of fuzz testing therefore remain the main research directions.
5.4.1 Test case generation strategy needs to be improved
Test case generation is the most critical part of fuzz testing, and its efficiency determines the efficiency of the entire test. Early fuzz testing often used blindly enumerated test case generation strategies, a black-box testing method. Although the size of the test case set may reach a very high order of magnitude with this approach, the test results are often insignificant. Moreover, in some software, private data formats and checksums are widely used; randomly generated data is often discarded outright because it cannot pass the checksum, so deeper vulnerabilities are hard to reach and the test effect is unsatisfactory. How to optimize the test case generation strategy is therefore a problem that currently needs focused attention.
Currently, researchers have introduced concepts from genetics, such as selection and crossover recombination, into the test case generation strategy to optimize it. In particular, using genetic algorithms to assist in generating specific test cases has been shown to greatly improve the efficiency of test case generation and is a very useful automated testing technique.
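A hedged sketch of this genetics-inspired strategy follows. The fitness function is a toy stand-in that counts bytes matching a hypothetical magic header (a real system would use coverage feedback), and all constants are arbitrary; but the sketch shows selection, crossover recombination, and mutation working together, and because the best individuals are always retained, the best fitness never decreases across generations.

```python
import random

MAGIC = b"FUZZ"                              # hypothetical header the target checks
rng = random.Random(42)

def fitness(candidate: bytes) -> int:
    """Toy coverage proxy: count positions matching the magic header."""
    return sum(a == b for a, b in zip(candidate, MAGIC))

def crossover(a: bytes, b: bytes) -> bytes:
    cut = rng.randrange(1, len(a))           # single-point crossover
    return a[:cut] + b[cut:]

def mutate(data: bytes) -> bytes:
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

population = [bytes(rng.randrange(256) for _ in range(4)) for _ in range(20)]
initial_best = max(map(fitness, population))
for _generation in range(200):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                # selection: keep the fittest half
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(10)]          # crossover recombination + mutation
    population = parents + children
final_best = max(map(fitness, population))
print(initial_best, final_best)
```

Replacing the toy fitness with real branch-coverage feedback turns this skeleton into the kind of guided test case generation the surveyed research describes.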
At the same time, fuzz testing can be combined with other binary vulnerability analysis techniques. For example, dynamic taint analysis can extract the program's dynamic data-flow or control-flow information and feed it into the binary vulnerability analysis process, guiding the generation of input data in the fuzz test; this makes traditional black-box fuzz testing targeted and greatly improves its effect.
5.4.2 The degree of automation needs to be improved
Fuzz testing is a blind injection method, and manual testing takes considerable time, which limits its practical value. In some common fuzz testing frameworks, manual involvement is often required to determine the constraints on input data and to generate test cases. Most current research therefore focuses on automated and intelligent fuzz testing methods; one solution is again to use genetic algorithms to improve the degree of automation of the tests.
5.4.3 Inability to find multipoint-trigger vulnerabilities
Fuzz testing often finds vulnerabilities triggered by a single condition rather than vulnerabilities that require multiple conditions. In recent years, some researchers have attempted to apply fuzz testing to mining multipoint-trigger vulnerabilities and have proposed the concept of multidimensional fuzz testing. However, multidimensional fuzz testing currently suffers from combinatorial path explosion, so it is developing slowly and needs further research.
Fuzz testing has a great advantage over other vulnerability mining methods in that it does not require an understanding of the internal structure of the program, so it is inexpensive. At the same time, it has limitations such as high test blindness, low test efficiency, and no guarantee of code coverage. To address these issues, fuzz testing can be combined with other binary vulnerability analysis techniques, such as dynamic taint analysis or symbolic execution. Future research will integrate fuzz testing into the binary analysis framework being built and combine it with symbolic execution. Since symbolic execution has good path coverage, when generating test cases one can first use symbolic execution to constrain the program and obtain a set of path constraints, then design test cases according to those constraints, and then perform fuzz testing. Combining fuzz testing with symbolic execution to generate guided test cases should, in theory, greatly improve the efficiency of fuzz testing and significantly improve the detection effect. In addition, using genetic algorithms to assist in generating test cases is also a good approach. Following this idea, future work can also look at similar concepts and ideas in other disciplines, judge whether they can be applied to the field of vulnerability analysis, and solve some of the problems faced by current vulnerability analysis technology; this too is a new breakthrough point.