Leveraging Artificial Intelligence on Binary Code Comprehension
Abstract.
Understanding binary code is an essential but complex software engineering task for reverse engineering, malware analysis, and compiler optimization. Unlike source code, binary code has limited semantic information, which makes it challenging for human comprehension. At the same time, compiling source to binary code, or transpiling among different programming languages (PLs), can provide a way to introduce external knowledge into binary comprehension. We propose to develop Artificial Intelligence (AI) models that aid human comprehension of binary code. Specifically, we propose to incorporate domain knowledge from large corpora of source code (e.g., variable names, comments) to build AI models that capture a generalizable representation of binary code. Lastly, we will investigate metrics to assess the performance of models applied to binary code using human studies of comprehension.
1. Introduction
Artificial Intelligence for Software Engineering (AI4SE) is an emerging direction aiming to leverage AI-based approaches to assist software engineering tasks (Harman, 2012; Feldt et al., 2018). In particular, binary code comprehension (in this paper, we use "binary code" to refer to both executable binaries and assembly code because they are trivially interchangeable) is among the most challenging tasks due to the complexity and lack of semantics in binary code (Meng and Miller, 2016). Meanwhile, understanding binary code is an essential step in critical tasks such as reverse engineering (Canfora et al., 2011), malware analysis (Alrabaee et al., 2022), and compiler optimization (Ren et al., 2021). In addition, AI is highly effective at revealing complex patterns in large corpora. For example, AI has been used for code search (Gu et al., 2018), malware classification (Kalash et al., 2018), and code summarization (Ahmad et al., 2020). Thus, leveraging AI for binary code comprehension can help address complex problems while paving the way for new research directions in both the SE and AI communities.
Existing research in code comprehension includes both SE and AI approaches. Traditional SE practices depend on the generalizability of inherent code structure and fixed properties to make complex decisions. For example, researchers use various compilation methods and consistent structural information for code clone detection (Baxter et al., 1998; Keivanloo et al., 2012), or adopt dynamic and static analysis to automatically extract statistical features from assembly functions or binary code (Harris and Miller, 2005; Caballero et al., 2009). However, these techniques have not been reliably applied to binary code because it contains less useful structural and semantic information. Meanwhile, research has shown that AI-based models can facilitate human comprehension of source code (Sridharan et al., 2022; Bui et al., 2021; Rus et al., 2022). Related studies apply state-of-the-art neural networks to extract code information from scratch (Martínez-Fernández et al., 2022; Giray, 2021), or adopt large-scale pretrained models (Feng et al., 2020; Guo et al., 2020, 2022) to transfer knowledge learned from corpora of Natural Languages (NLs).
However, there has been limited work assisting binary code comprehension using AI-based approaches. There are two key challenges in this direction: (1) binary code does not capture valuable, abstract semantics, and (2) there is a dearth of infrastructure and literature to support large-scale training on binary code (e.g., there is no public dataset for binary code comprehension tasks). In this thesis, we propose to leverage AI to facilitate the comprehension of binary code via four research thrusts:
(1) We will transfer knowledge contained in AI models for source code to apply to binary code, with a particular focus on stable knowledge that generalizes across software application domains (Section 3.1).
(2) We will apply contrastive learning to integrate the semantic information learned in Thrust 1 to develop an enhanced embedding for binary code. This thrust will enable transferring rich knowledge from source code to binary code (Section 3.2).
(3) We will investigate the generalizability of our binary embedding from Thrust 2 across multiple programming languages. This thrust will strengthen the stability of our binary code comprehension model (Section 3.3).
(4) We will systematically investigate effective metrics for evaluating the applicability of our AI models for binary code comprehension through human studies. This final research thrust will incorporate critical feedback from human engineers (Section 3.4).
This dissertation will result in new AI models and associated representations of programs that will aid in the comprehension of binary code.
2. Related Work
Various approaches exploit machine learning techniques to analyze myriad representations of code fragments to guide the comprehension of source code, such as code summarization, clone detection, or program generation. One approach is to adopt the structure of abstract syntax trees (ASTs) to build models (Lei et al., 2022). For instance, ASTNN (Zhang et al., 2019) incorporates AST fragments into a bidirectional RNN model to build a vector representation of programs; TECCD (Gao et al., 2019) captures an AST embedding using a lightweight neural method.
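The intuition behind these AST-based representations can be illustrated with a toy example. The sketch below is a deliberate simplification (not ASTNN or TECCD themselves): it uses Python's standard `ast` module to reduce a snippet to a histogram of its AST node types, a crude structural fingerprint that is invariant to identifier renaming, which is exactly the kind of property clone detectors exploit.

```python
import ast
from collections import Counter

def ast_node_histogram(source: str) -> Counter:
    """Reduce a snippet to a histogram of its AST node types."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

# Two clones that differ only in identifier names...
h1 = ast_node_histogram("def add(x, y):\n    return x + y")
h2 = ast_node_histogram("def plus(a, b):\n    return a + b")
assert h1 == h2  # ...have identical structural fingerprints
```

Real models replace this bag-of-nodes view with learned embeddings over AST fragments, but the underlying invariance to surface-level renaming is the same.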
In addition, compiling source to binary code can be used to assist program comprehension. For example, jTrans (Wang et al., 2022) learns representations of binary code using the Transformer neural architecture, and BugGraph (Ji et al., 2021) uses a graph triplet-loss network on the control flow graph to produce a similarity ranking.
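As a hedged illustration of the triplet-loss idea underlying BugGraph (not its actual graph neural network), the sketch below computes a standard hinge-based triplet loss over plain NumPy vectors: an anchor binary-function embedding should lie closer to a positive (e.g., the same function compiled with different options) than to a negative (an unrelated function), by at least a margin.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: pull the anchor toward the positive, push it from the negative."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])    # e.g., embedding of a binary function
positive = np.array([0.1, 0.0])  # same function, different compilation
negative = np.array([2.0, 0.0])  # unrelated function
assert triplet_loss(anchor, positive, negative) == 0.0  # already well separated
```

Minimizing this loss over many such triplets shapes an embedding space in which distance encodes semantic similarity, which is what makes a similarity ranking over binaries possible.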
Moreover, program comprehension can be improved through transpilation of a program from one language to another. While there is no single perfect technique for transpilation, several works have investigated transpilation under specific circumstances. For example, NGST2 (Mariano et al., 2022) leverages neural-guided synthesis to convert imperative code to functional, and HeteroGen (Zhang et al., 2022) converts C/C++ code to support High-Level Synthesis.
Last, because both NLs and PLs share surface forms such as characters and tokens, researchers often adopt contemporary metrics from Natural Language Processing (NLP) to evaluate AI models in code comprehension, such as the BLEU score (Papineni et al., 2002). Most such work aims to improve these metrics using sequential neural network models (Yu et al., 2019; Vaswani et al., 2017) or large-scale pretrained models (Feng et al., 2020; Guo et al., 2020). However, researchers have demonstrated that directly applying NLP metrics does not enable AI models to achieve satisfactory performance on SE tasks (Stapleton et al., 2020).
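To make that limitation concrete, the minimal BLEU-1-style sketch below (clipped unigram precision with a brevity penalty; real BLEU also combines higher-order n-grams) shows that such metrics can ignore word order entirely, one reason they may misjudge code summaries.

```python
import math
from collections import Counter

def unigram_bleu(candidate, reference):
    """Clipped unigram precision with a brevity penalty (BLEU-1 style)."""
    if not candidate:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    # Each candidate token counts only up to its frequency in the reference.
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    precision = overlap / len(candidate)
    # Penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * precision

ref = ["return", "the", "sum", "of", "two", "numbers"]
scrambled = ["numbers", "two", "of", "sum", "the", "return"]
assert unigram_bleu(scrambled, ref) == unigram_bleu(ref, ref) == 1.0
```

A candidate with the reference's words in scrambled order scores exactly as well as the reference itself, even though the scrambled summary may be meaningless to a programmer reading it.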
3. Hypotheses and Methodology
Our research methodology comprises four stages. First, we will incorporate domain knowledge of PLs to improve the stability and generalizability of source code comprehension. Second, we will transfer domain knowledge from source to binary code via contrastive learning. Third, we will develop a general representation across multiple PLs to enhance the performance of AI models. Last, we will define unified metrics to better evaluate our AI models for binary code comprehension. The planned work for each stage is described below:
3.1. Source Code Comprehension
Many studies have already incorporated properties of source code to build AI models, but often directly use the source code or an associated intermediate representation without an in-depth understanding of the domain knowledge contained within. Therefore, we define our first hypothesis as:
Hypothesis 1: Program source code contains critical domain knowledge relevant to program comprehension. Incorporating this knowledge can improve the stability and generalizability of AI models.
To validate this hypothesis, we will design domain-knowledge-guided source code comprehension models and evaluate them on program generation, clone detection, and code summarization, testing the performance of our source code comprehension model on cross-domain tasks.
3.2. Knowledge Transfer from Source Code to Binary Code Comprehension
Next, we will use contrastive learning to transfer valuable domain knowledge from source code to binary code. Most recent research adopts state-of-the-art AI models to learn properties of binary code without considering the connection between source code and binary code. Therefore, we define our second hypothesis as:
Hypothesis 2: Binary code shares similar properties to source code, but the representation precludes extracting such properties. By compiling source code, we can develop AI models that associate source code with corresponding binary code. This will provide generalizable models for tasks involving binary code when the original source code is unavailable.
To validate this hypothesis, we will develop and evaluate models that apply to source code and corresponding binary code for reverse engineering and vulnerability detection.
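One plausible instantiation of this contrastive setup (a sketch under our own assumptions, not a finalized design) is an InfoNCE-style objective over a batch of (source, binary) embedding pairs obtained by compilation: the embedding of source function i should be most similar to the embedding of its own compiled binary, and dissimilar to every other binary in the batch.

```python
import numpy as np

def info_nce(src, bin_, temperature=0.1):
    """InfoNCE over a batch: source embedding i should match binary embedding i."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)    # unit-normalize rows
    bin_ = bin_ / np.linalg.norm(bin_, axis=1, keepdims=True)
    logits = src @ bin_.T / temperature                       # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # cross-entropy on the matched diagonal

# Perfectly aligned pairs incur a lower loss than misaligned ones.
aligned = np.eye(4)
assert info_nce(aligned, aligned) < info_nce(aligned, aligned[::-1])
```

Training a binary encoder against a (possibly frozen) source encoder with such a loss is one concrete mechanism by which semantic knowledge learned from source corpora could be pulled into the binary embedding space.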
3.3. Generalizability Across Different PLs
Third, we will combine research outputs in the previous two steps to develop a general representation across multiple PLs. General approaches in this task still rely on a case-by-case analysis and are not well-studied from the perspective of domain generalization. Therefore, we define our third hypothesis as:
Hypothesis 3: We can extract common semantic information for a given program written in several programming languages that can aid binary code comprehension.
To validate this hypothesis, we will apply transpilation-knowledge-guided domain generalization methods to extract common features from programs written in multiple languages, and use dimensionality reduction techniques (e.g., PCA (Daffertshofer et al., 2004), t-SNE (Van der Maaten and Hinton, 2008)) to visualize these features and justify their effectiveness.
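For the visualization step, PCA can be computed directly from an SVD of the centered embedding matrix; the sketch below (names and shapes are our own illustrative choices) projects embeddings onto their first two principal components.

```python
import numpy as np

def pca_2d(embeddings):
    """Project rows onto the first two principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)  # PCA requires centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # coordinates in the top-2 principal plane

X = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.1, 0.0]])
Y = pca_2d(X)
assert Y.shape == (4, 2)
assert np.var(Y[:, 0]) >= np.var(Y[:, 1])  # first component captures most variance
```

If embeddings of semantically equivalent programs written in different PLs cluster together in such a 2-D projection, that is qualitative evidence for the common semantic representation Hypothesis 3 posits.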
3.4. Unified Metrics for Evaluation of Binary Code Comprehension Models
Finally, we will define unified metrics for evaluating our AI models for binary code comprehension. Researchers have demonstrated that directly applying the metrics from NLP to SE tasks can be ineffective or even problematic in comprehending code and assisting programmers. Therefore, we define our fourth hypothesis as:
Hypothesis 4: PLs and NLs have syntactic and semantic differences. New metrics that account for these differences can improve the general performance of AI models in binary code comprehension by providing a better optimization target.
To validate this hypothesis, we will test our new metrics on several binary code datasets for multiple tasks, including binary code similarity comparison, vulnerability detection, and reverse engineering. We will investigate how models tailored to these binary comprehension tasks are affected by our new metrics.
4. Expected Contribution
This thesis will investigate how to leverage AI for binary code comprehension. The expected contributions of the thesis are:
(1) A systematic study of applying domain knowledge in software engineering to improve stability and generalizability in AI models with respect to source code understanding,
(2) An approach to transferring knowledge from source to binary code to account for the limited semantics in binary code,
(3) A general representation across multiple PLs to enhance the performance of AI models for binary code comprehension, and
(4) A unified system of metrics for evaluating binary code comprehension models as a basis for improving models overall.
References
- Ahmad et al. (2020) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653 (2020).
- Alrabaee et al. (2022) Saed Alrabaee, Mourad Debbabi, and Lingyu Wang. 2022. A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features. ACM Computing Surveys (CSUR) 55, 1 (2022), 1–41.
- Baxter et al. (1998) Ira D Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. 1998. Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272). IEEE, 368–377.
- Bui et al. (2021) Nghi DQ Bui, Yijun Yu, and Lingxiao Jiang. 2021. TreeCaps: Tree-based capsule networks for source code processing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 30–38.
- Caballero et al. (2009) Juan Caballero, Noah M Johnson, Stephen McCamant, and Dawn Song. 2009. Binary code extraction and interface identification for security applications. Technical Report. California Univ Berkeley Dept of Electrical Engineering and Computer Science.
- Canfora et al. (2011) Gerardo Canfora, Massimiliano Di Penta, and Luigi Cerulo. 2011. Achievements and challenges in software reverse engineering. Commun. ACM 54, 4 (2011), 142–151.
- Daffertshofer et al. (2004) Andreas Daffertshofer, Claudine JC Lamoth, Onno G Meijer, and Peter J Beek. 2004. PCA in studying coordination and variability: a tutorial. Clinical biomechanics 19, 4 (2004), 415–428.
- Feldt et al. (2018) Robert Feldt, Francisco Gomes de Oliveira Neto, and Richard Torkar. 2018. Ways of applying artificial intelligence in software engineering. In 2018 IEEE/ACM 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). IEEE, 35–41.
- Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
- Gao et al. (2019) Yi Gao, Zan Wang, Shuang Liu, Lin Yang, Wei Sang, and Yuanfang Cai. 2019. Teccd: A tree embedding approach for code clone detection. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 145–156.
- Giray (2021) Görkem Giray. 2021. A software engineering perspective on engineering machine learning systems: State of the art and challenges. Journal of Systems and Software 180 (2021), 111031.
- Gu et al. (2018) Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 933–944.
- Guo et al. (2022) Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. arXiv preprint arXiv:2203.03850 (2022).
- Guo et al. (2020) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020).
- Harman (2012) Mark Harman. 2012. The role of artificial intelligence in software engineering. In 2012 First International Workshop on Realizing AI Synergies in Software Engineering (RAISE). IEEE, 1–6.
- Harris and Miller (2005) Laune C Harris and Barton P Miller. 2005. Practical analysis of stripped binary code. ACM SIGARCH Computer Architecture News 33, 5 (2005), 63–68.
- Ji et al. (2021) Yuede Ji, Lei Cui, and H Howie Huang. 2021. Buggraph: Differentiating source-binary code similarity with graph triplet-loss network. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security. 702–715.
- Kalash et al. (2018) Mahmoud Kalash, Mrigank Rochan, Noman Mohammed, Neil DB Bruce, Yang Wang, and Farkhund Iqbal. 2018. Malware classification with deep convolutional neural networks. In 2018 9th IFIP international conference on new technologies, mobility and security (NTMS). IEEE, 1–5.
- Keivanloo et al. (2012) Iman Keivanloo, Chanchal K Roy, and Juergen Rilling. 2012. Sebyte: A semantic clone detection tool for intermediate languages. In 2012 20th IEEE International Conference on Program Comprehension (ICPC). IEEE, 247–249.
- Lei et al. (2022) Maggie Lei, Hao Li, Ji Li, Namrata Aundhkar, and Dae-Kyoo Kim. 2022. Deep learning application on code clone detection: A review of current knowledge. Journal of Systems and Software 184 (2022), 111141.
- Mariano et al. (2022) Benjamin Mariano, Yanju Chen, Yu Feng, Greg Durrett, and Işil Dillig. 2022. Automated transpilation of imperative to functional code using neural-guided program synthesis. Proceedings of the ACM on Programming Languages 6, OOPSLA1 (2022), 1–27.
- Martínez-Fernández et al. (2022) Silverio Martínez-Fernández, Justus Bogner, Xavier Franch, Marc Oriol, Julien Siebert, Adam Trendowicz, Anna Maria Vollmer, and Stefan Wagner. 2022. Software engineering for AI-based systems: a survey. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 2 (2022), 1–59.
- Meng and Miller (2016) Xiaozhu Meng and Barton P Miller. 2016. Binary code is not easy. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 24–35.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
- Ren et al. (2021) Xiaolei Ren, Michael Ho, Jiang Ming, Yu Lei, and Li Li. 2021. Unleashing the hidden power of compiler optimization on binary code difference: An empirical study. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 142–157.
- Rus et al. (2022) Vasile Rus, Peter Brusilovsky, Lasang Jimba Tamang, Kamil Akhuseyinoglu, and Scott Fleming. 2022. DeepCode: An Annotated Set of Instructional Code Examples to Foster Deep Code Comprehension and Learning. In International Conference on Intelligent Tutoring Systems. Springer, 36–50.
- Sridharan et al. (2022) Murali Sridharan, Mika Mäntylä, Maëlick Claes, and Leevi Rantala. 2022. SoCCMiner: A Source Code-Comments and Comment-Context Miner. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). IEEE, 242–246.
- Stapleton et al. (2020) Sean Stapleton, Yashmeet Gambhir, Alexander LeClair, Zachary Eberhart, Westley Weimer, Kevin Leach, and Yu Huang. 2020. A human study of comprehension and code summarization. In Proceedings of the 28th International Conference on Program Comprehension. 2–13.
- Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Wang et al. (2022) Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, and Chao Zhang. 2022. jTrans: Jump-Aware Transformer for Binary Code Similarity. arXiv preprint arXiv:2205.12713 (2022).
- Yu et al. (2019) Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. 2019. A review of recurrent neural networks: LSTM cells and network architectures. Neural computation 31, 7 (2019), 1235–1270.
- Zhang et al. (2019) Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 783–794.
- Zhang et al. (2022) Qian Zhang, Jiyuan Wang, Guoqing Harry Xu, and Miryung Kim. 2022. HeteroGen: transpiling C to heterogeneous HLS code with automated test generation and program repair. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 1017–1029.