
ShellCore: Defeating IoT Malware through in-depth Analysis and Detection of IoT Shell Code

Abstract

The Linux shell is a command-line interpreter that provides users with a command interface to the operating system, allowing them to perform a variety of functions depending on their access privileges. Although very useful, it also exposes the device to risk in the event of compromise. The security constraints in these persistently connected devices give adversaries a prime opportunity to use them for malicious activities. With access to IoT devices, malware authors can abuse the Unix shell of those embedded devices to propagate infections and ultimately launch large-scale attacks, such as DDoS. In this work, we provide a first look at shell commands used in Linux-based IoT malware towards their detection. We analyze the malicious shell commands found in IoT malware and build a neural network-based detection model, ShellCore, to detect malicious commands. More specifically, we disassemble 2,891 IoT malware samples to extract malicious shell commands, and assemble a dataset of benign shell commands collected through real-time network traffic analysis and volunteered data collection from Linux users. Using word- and character-level features over both standard and neural network-based techniques, we demonstrate that ShellCore achieves an accuracy of more than 95% in detecting individual malicious shell commands, and more than 99% in detecting malicious files. We note, among other advantages, that our approach detects malicious shell commands and binaries irrespective of the architecture they target.

Index Terms:
Linux Shell Commands, IoT Security, Malware Detection

I Introduction

The increased use of Internet of Things (IoT) devices for everyday activities has been paralleled by increased susceptibility to risks, including major attack vectors such as vulnerabilities in hardware and software and the use of default usernames and passwords, leading to major attacks by IoT botnets. Recently, for example, GitHub experienced a high-bandwidth Distributed Denial of Service (DDoS) attack [GithHub1Tb]. Similarly, an Internet infrastructure company, Dyn, faced a massive outage and network congestion launched from hacked IoT devices [DynAttack]. Malware exploits these IoT attack vectors to launch targeted attacks; upon gaining access to a host, often due to the use of default credentials or vulnerable protocols, the malware executes a series of commands for propagation.

With the onset of devices persistently connected to the Internet, gaining remote access to a device's shell gives the adversary (somewhat) easy access to the device. Additionally, embedded devices utilize a packed version of the Linux utilities, called BusyBox [wells2000], to implement Linux capabilities. The Linux shell has been susceptible to many attacks, including brute-force, privilege escalation, and shellshock, among other vulnerabilities (e.g., CVE-2018-9310, CVE-2019-1656, CVE-2018-0183, CVE-2017-6707) [nvd_cvss, MadeOfBug, ChenMWZZK11, UittoRML15]. With the majority of prior work focused on other shell interpreters (e.g., PowerShell and web shells), and the emergence of Linux-based IoT malware that heavily utilizes the Linux shell, detecting the malicious use of the Linux shell in IoT devices becomes of prime importance. In this paper, we propose ShellCore to understand and detect the malicious use of the Linux shell in context.

wget %s -q -O DNS.txt
|| busybox wget %s -O DNS.txt
|| /bin/busybox wget %s -O DNS.txt
|| /usr/busybox wget %s -O DNS.txt
Figure 1: Retrieving a list of target hosts.

Shell Commands and Automation. Malware-infected hosts use Command and Control (C2) servers to obtain payloads that include instructions to compromised machines (or bots), synchronizing their actions: their cycles of activity in attacking targets, their propagation by recruiting new bots and acting as a source of further infection, and their stealthy operation to evade detection. In their operation, IoT botnets heavily rely on Linux shell commands. For example, bots use the shell to execute the chmod command to change privileges, to launch dictionary-based brute-force attacks to infect other hosts, and to propagate by connecting to the C2 server to download instructions over HTTP. Additionally, to launch an attack, a bot typically obtains a set of targets from a dropzone by invoking a set of commands, as shown in Figure 1, uses the shell to flood the victim with HTTP requests, and removes the traces of execution by executing the rm command [AntonakakisABBB17].

Adversaries today leverage the results from search engines for Internet-connected devices, such as Shodan [Matherly09]. For example, a simple “default password” search on Shodan returned 72,763 results. Moreover, malware authors can arm themselves by exploiting vulnerabilities in the services used by the devices (e.g., CVE-2018-0183) or outdated firmware (e.g., CVE-2016-1560) to gain access to the shell. On top of that, zero-day vulnerabilities can be abused to access the shell for short-lived attacks. Therefore, detecting malicious shell commands is of paramount importance for device safety. Prior works have looked into the malicious use of Windows PowerShell; the malicious utilization of the Linux shell, on the other hand, has not received due attention. With this work, we inch towards filling this gap.

In this work, we design, implement, and evaluate ShellCore, a system for detecting malicious Linux shell commands used in IoT malware. To do so, we collect a dataset of residual commands used by malware. To keep our work focused on the abuse of the Linux shell, we gather a dataset of IoT malware samples from IoTPOT [PaSYM2016]. Preliminary analysis shows that shell commands can be found embedded in the disassembled code. We therefore utilize static analysis to search through the disassembled code and extract the Linux commands used in malware samples. For evaluation, we also collect a dataset of benign commands from benign applications and users. In particular, we use the traffic generated from applications in a real-time environment, and augment our benign dataset by gathering commands from Linux users. We then utilize a Natural Language Processing (NLP) approach for feature generation, followed by a machine learning (neural network-based) model to detect malicious commands.

Contributions. Our goal is to utilize static analysis to detect the malicious use of Linux shell commands in IoT binaries, and use those shell commands as a modality for IoT malware detection. As such, we make the following two contributions:

  • Using shell commands extracted from 2,891 recent IoT malware samples, along with commands from a benign dataset, we design an accurate detection system that detects malicious shell commands with more than 95% accuracy. Compared to the state-of-the-art approach, which uses expensive deep learning and yields an accuracy of 89% [HendlerKR18], our approach is quite efficient (using shallow learning) and more accurate. Our explicit features are easy to interpret, helping to understand each feature's contribution to detection. Moreover, our approach targets the Linux shell, used very recently by IoT malware, whereas the state-of-the-art approach targets PowerShell.

  • We extend our command-level detection approach and design a detection model for malicious files (malware samples), which often include multiple commands. Extending the results of detecting individual commands, we group the commands by file and detect malicious files with more than 99% accuracy. Our detection can be applied to files compiled for any IoT hardware architecture (e.g., ARM, MIPS, Power PC, etc.) as long as the shell commands are extracted (which can be done statically and efficiently).

Organization. We review the literature in section II. We provide the background of this work in section III. The methodology and dataset are outlined in section IV. In section V, we explain our detection model, including the feature representation and classification. We evaluate ShellCore and discuss our results in section VI. Finally, we conclude our work in section VII.

II Related Work

TABLE I: Previous analysis and detection works on shell commands. AUC: Area Under the Curve, TPR: True Positive Rate, TNR: True Negative Rate, AC: Accuracy, FNR: False Negative Rate, FPR: False Positive Rate, NLP: Natural Language Processing, CNN: Convolutional Neural Network, MS: Malware Signature, MF: Malware Functions, LW: Longest Word in file header, DL: Deep Learning, MLP: Multi-Layer Perceptron, SVM: Support Vector Machine, SDA: Static and Dynamic Analysis, Pr.: Precision, Re.: Recall, and F1: F1-score.
Work | Shell Type | Dataset | Capability | Performance | Method
Hendler et al. [HendlerKR18] | PowerShell | 66,388 | Detection | AUC (98.5-99%), TPR (0.24-0.99%) | NLP, CNN
Pontiroli and Martinez [PontiroliM15] | PowerShell and .NET | 6 | Analysis | - | SDA
Uitto et al. [UittoRML15] | Linux shell | 13,257 | Analysis | - | Diversification
Starov et al. [StarovDAHN16] | Web shell | 481 | Analysis | - | SDA
Tian et al. [TianWZZ17] | Web shell | 7,681 | Detection | Pr. (98.6%), Re. (98.6%), F1 (98.6%) | CNN
Tu et al. [tuGXW14] | Web shell | 39,596 | Detection | AUC (82.9-100%), TPR (74.5-100%), TNR (88.9-100%) | Score-based approach for MS, MF, LW
Rusak et al. [RusakAO18] | PowerShell | 4,079 | Detection | AC (85%) | DL
Ours | Linux shell | 190,897 | Detection | AC (92.9-99.9%), FNR (0-31.7%), FPR (0-8.5%) | MLP, SVM

Shell Commands. Hendler et al. [HendlerKR18] detect malicious PowerShell commands using a combination of Machine Learning (ML) approaches, Natural Language Processing (NLP), and Convolutional Neural Networks (CNN). Pontiroli and Martinez [PontiroliM15] analyze PowerShell and .NET malware by examining code execution. We note that both works are focused on shell commands that can only run on Microsoft Windows, for a single architecture, and it is unclear whether the same insights can be applied to IoT software and command artifacts. On the other hand, Uitto et al. [UittoRML15] proposed a diversification technique for Linux shell commands, modifying and extending commands to protect against injection attacks.

Web Shell. A web shell is a script that allows an adversary to run commands on a web server, controlling the targeted server remotely as an administrator. There have been some works on detecting malicious usage of web shells. Starov et al. [StarovDAHN16] statically and dynamically analyzed a set of web shells to uncover visible and hidden features of malicious Hypertext Preprocessor (PHP) shells. Leveraging VirusTotal, they achieved an accuracy of 90% and 88.5% for obfuscated and de-obfuscated shells, respectively, as detected by at least one antivirus engine. Tian et al. [TianWZZ17] proposed a system to detect malicious web shell commands using CNN- and word2vec-based approaches, with precision, recall, and F1-score all around 98.6%. Additionally, Tu et al. [tuGXW14] proposed a detection system that identifies web shells by comparing threshold scores, calculated from Malware Signatures (MS), Malware Functions (MF), and the Longest Word (LW) in a file's header, against a database of MS and MF values; they achieved an accuracy of 82.9%, 93.7%, and 100% for MS, MF, and LW, respectively. Moreover, Rusak et al. [RusakAO18] proposed a deep learning approach to classify malicious PowerShell by family, based on two features from the abstract syntax tree representation of the PowerShell code, achieving an accuracy of 85% with 3-fold cross-validation. Table I highlights the literature related to the analysis and detection of the different malware types using shell commands.

IoT Malware Detection. A few works have addressed IoT malware detection. Su et al. [SuVPSFS2018] proposed a neural network-based detection system, with features extracted from malware binaries converted to grayscale images, achieving an accuracy of 94%; their approach, however, is expensive. Kirat et al. [KiratVK2014] proposed an elegant sequencing-based system for detecting obfuscation and evasion techniques. Pa et al. [PaSYM2016] proposed IoTPOT, a system that analyzed and detected Telnet-based attacks on IoT devices, using a sandbox that supports multiple malware architectures. Bertino and Islam [BertinoI17] noted that IoT devices are vulnerable to botnet attacks, and proposed a behavior-based approach that combines behavioral artifacts and external threat indicators for malware detection. The approach, however, relies on external online threat intelligence feeds (e.g., VirusTotal) and cannot be generalized beyond home network environments (due to offloading computations).

Hossain et al. [HossainHZ18] proposed Probe-IoT, a forensic system that investigates IoT-related malicious activities. Montella et al. [MontellaRK18] proposed a cloud-based data transfer protocol for IoT devices to secure sensitive data transferred among different applications, although without addressing the insecurity of the IoT software itself. Angrishi [Angrishi17] outlined the anatomy of IoT botnets and recommended several security measures to address them. Dahl et al. [DahlSDY13] classified a large number of malware samples by reducing their representation (features) using Principal Component Analysis (PCA), allowing a large number of algorithms that could otherwise be impractical to be applied to the problem domain. Pascanu et al. [PascanuSSMT15] detected malware using language-level instructions and a standard classifier. Cozzi et al. [CozziGFB2018] analyzed Linux malware at scale, investigating malicious behaviors and discussing the obfuscation techniques that malware authors use. We note that in all of those studies, IoT malware is addressed indirectly, by addressing generic malware that is not particularly tailored to IoT devices.

III Background

This work rests on the analysis of IoT malware samples. Embedded IoT devices use wrapped utilities and libraries of Linux systems, giving them Linux capabilities without much overhead. Drawing insights into IoT malware through understanding the Linux shell, as well as understanding the difference between Linux shell commands used by malware and by benign software, forms the basis of this work. To extract the commands from the malware samples, we analyze the samples statically.

The Use of Shell as a Weapon. The shell is a command line interpreter that provides a command line interface for operating systems (OS): it executes a command entered at the terminal by calling the appropriate OS function, abstracting the details of the communication between the user and the kernel. However, adversaries use shell commands to gain access to host devices and launch attacks, facilitated by owners' use of default credentials and by vulnerabilities in services such as SSH and in device firmware. In 2014, the Shellshock attacks exploited a Bash vulnerability on Apache systems through crafted HTTP requests, using the wget command to download a file from a remote host and save it to the tmp directory to cause infection [Koch15]. Additionally, a recently reported vulnerability (CVE-2019-1656), caused by improper input validation in a Linux-based OS, can be exploited by adversaries sending crafted commands to gain access to targeted devices [nvd_cvss]. Moreover, privilege escalation exploits also compromise the Linux kernel; for example, the econet privilege escalation comprised three vulnerabilities in the Linux system, enabling, among others, DoS attacks [MadeOfBug, ChenMWZZK11].

As demonstrated above, adversaries can utilize the shell to brute-force user credentials and gain access to a device by launching a dictionary attack. Additionally, they can use the shell to connect to C2 servers to download instructions, e.g., for infecting the device, propagating, or launching a series of directed flooding attacks. Moreover, malware can use the bash find command to look for uninfected files on the host device, and use the tmp directory to download and run malware. Figure 1 depicts the use of the Linux shell by malware to abuse a device by executing shell commands. Worth observing is the use of the OR operator (||) to brute-force different command variants.

Static analysis. To defend against malicious utilization of the Linux shell, it is important to have a baseline of the shell commands injected by malware. Static analysis approaches employ various techniques to reveal indicators that determine whether a software is malicious or benign. One such indicator is command executions, which can be utilized to observe and analyze the Linux shell commands used by the malware, hinting at possible execution patterns of the malware. This goal can be achieved by observing the strings, functions, and disassembled code of the program. While executing programs can show commands being used in real time, static analysis can recover the commands more efficiently. We employ static analysis on our malware dataset to create our final dataset of malicious commands. To do so, we disassemble every malware sample in our dataset and extract the strings from them. We then utilize the strings to (i) create a dataset of these commands and label them as malicious commands and (ii) group the commands by file (malware sample) and label each as a malicious sample.
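For illustration, strings and their offsets can be pulled from a binary by scripting a disassembler. The minimal sketch below uses radare2's Python bindings (r2pipe), an assumed tooling choice consistent with the Radare2-based extraction described in section IV; the sample path and the downstream filtering are illustrative, not our exact pipeline.

# Sketch: extracting printable strings (and their offsets) from a binary
# via r2pipe. Assumptions: r2pipe is installed and "sample.bin" exists.
import r2pipe

def extract_strings(sample_path):
    """Return (offset, string) pairs for all strings radare2 finds."""
    r2 = r2pipe.open(sample_path, flags=["-2"])  # -2 suppresses stderr noise
    strings = r2.cmdj("izzj")  # 'izz' scans the whole binary; 'j' = JSON
    r2.quit()
    return [(s["vaddr"], s["string"]) for s in strings or []]

if __name__ == "__main__":
    for offset, text in extract_strings("sample.bin"):
        # illustrative filter; the real rules are described in section IV-A1
        if "busybox" in text or text.startswith("cd "):
            print(hex(offset), text)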

Figure 2: Shell command dataset creation (malicious and benign) and the detection workflow overview. The malicious dataset is created with the help of static analysis of malware, while real-time network infrastructure and benign shell usage from Linux users are leveraged for the creation of the benign dataset. Features, such as letters, numbers, n-grams, etc., are then extracted from the commands and used, after feature reduction using PCA, to classify the commands as malicious or benign. The command detection is extended by grouping the commands by file to classify the files as malicious or benign (M/B). S.A., Cmds., B.C., B.F., and Anon. refer to Static Analysis, Commands, By Command, By File, and Anonymize, respectively.

IV Dataset and Methodology

As we aim to detect malicious commands and files, we begin by disassembling malware samples to extract strings from their binaries.

IV-A Dataset and Data Processing

We obtain a dataset of 2,891 randomly selected malware samples from the IoTPOT project [PaSYM2016], a honeypot emulating IoT devices. To test our proposed detector, ShellCore, we build a dataset of benign samples gathered from real-time networks and voluntary submissions of shell history. Figure 2 represents our approach, split into three modules: initial discovery, command extraction, and detection. In the initial discovery module, we disassemble the malware binaries. To create a set of rules that we can automatically apply to samples for obtaining the relevant commands, we manually examine all shell commands extracted from the strings of 18 malware samples and establish patterns of those commands. Those common patterns are then used to automate the extraction of shell commands from the rest of the samples.

As shown in Figure 2, the second component (module) in our workflow, the command extraction module, takes the command patterns obtained in the initial discovery phase and applies them to the text obtained from each sample. More precisely, in the command extraction module, we extract the shell commands from the malicious and benign binary samples by concentrating on the strings, and label them as malicious or benign. More details about the patterns and how they are used follow.

IV-A1 Malicious Dataset Creation

Using an off-the-shelf tool, Radare2, we disassemble each malware sample (among the 2,891 samples in our initial dataset) and extract the strings from the disassembled code. We then utilize the strings appearing in each sample to identify the commands embedded in them, and add them to our dataset of malicious commands. For coverage of the shell commands appearing in malware, we gather the strings from the disassembled code. For faster extraction of the shell commands, we calculate the offset where each string resides, i.e., the memory address where the string is referenced in the disassembled code, and then conduct the disassembly from that offset. We then pull the instruction set at the offset and extract the desired command. We manually analyze some samples in full to observe patterns that could uniquely identify the shell commands. We do this manual analysis over 18 samples and their associated strings. In total, we identified 1,273 patterns that we later used to extract shell commands from the samples not manually analyzed. Such patterns include keywords that begin commands, e.g., kill, wait, disown, suspend, fc, history, break, etc. (we utilize online resources to build a dataset of shell command keywords to augment our automation process), used in complex predicate-based patterns (examples are shown in Table II and in the appendix). For example, strings beginning with shell command keywords, such as "cd ", or lying between if and fi, among other similar command structures, are extracted.

Having found the patterns identifying the shell commands, we use regular expressions to search for those patterns within the strings obtained statically from the malware, and automate the process of extracting them for all malware samples. Although the commands contained in the strings may not be syntactically correct, e.g., the spaces may be masked with special characters, they hint at the location of shell command references. To find the actual commands, we go to the address where a particular string is referenced and disassemble at that offset. We then store the command gathered from the offset and label it as malicious. Algorithm 1 in appendix A shows the steps used to extract the strings from the malware samples, from which the shell commands are then obtained. We repeat the same algorithm on all malware samples to obtain all strings used in these samples. In total, we obtained a list of 2,008 unique commands.
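To illustrate this pattern matching, the following is a minimal sketch; the keyword list and regular expressions are illustrative stand-ins for the 1,273 patterns described above, not the patterns themselves.

# Sketch: regex-based recognition of shell-command-like strings.
# The keywords and patterns below are illustrative assumptions.
import re

SHELL_KEYWORDS = ["cd", "rm", "chmod", "wget", "kill", "wait", "history"]

PATTERNS = [
    re.compile(r"^(%s)\s" % "|".join(SHELL_KEYWORDS)),  # starts with a keyword
    re.compile(r"\bif\b.*\bfi\b"),                      # an if ... fi block
]

def looks_like_shell_command(candidate):
    return any(p.search(candidate) for p in PATTERNS)

print(looks_like_shell_command("cd /tmp; wget http://example.com/x"))  # True
print(looks_like_shell_command("GCC: (GNU) 4.8.5"))                    # False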

We also analyzed our malware samples to find the architectures they are compiled for. To do so, we use the Linux file command, which determines a file's format, target architecture, etc. Table III depicts the malware distribution according to architecture and the corresponding percentage of our dataset.

IV-A2 Assembling a Benign Dataset

To evaluate our proposed detection system, and along with the malicious command dataset, we also need a dataset of benign commands. Compared to the malicious dataset, assembling the command line usage of benign applications is a challenging task. First, while Linux-based applications are ubiquitous, and one can easily collect a set of Linux binaries, extract the corresponding shell commands, and use them as a baseline for our benign dataset, those binaries are not necessarily intended for embedded Linux devices. Second, even if we were to eavesdrop on traffic to collect shell commands issued by benign users, encrypted traffic would represent a major hurdle: with the majority of traffic nowadays carried over HTTPS, we cannot extract such benign shell commands in the wild.

To cope with this issue, we build our own monitoring network setup to collect a set of benign commands. In particular, we look for commands coming from various Linux-based tools, frameworks, and software. Since an entry point for many malware families is the abuse of application-layer protocols, such as HTTP, FTP, TFTP, etc., with the intent to distribute malicious payloads and scripts, we monitor those protocols in a benign setup for similar, but benign, data collection. As such, we build our benign command collection framework with two separate networks, as highlighted in Figure 3. The first network is hidden behind a NAT and consists of five stations, while the second network is a home network with 11 open ports: 21, 22, 80, 443, 12174, 1900, 3282, 3306, 3971, 5900, and 9040. The main purpose of this setup is to capture the incoming and outgoing packets of the home network. Our home network in this experimental setup consisted of two 64-bit Linux devices, one Amazon Alexa, one iPhone, one Mac device with the voice assistant Siri being continuously used, and a router.

Figure 3 represents a high-level depiction of our benign data collection system. In the first network (left), we have five devices used in our lab under “normal execution” (i.e., not controlled by any adversary) by graduate students, representing everyday use, where the devices are monitored over a period of 24 hours and their network traffic is captured individually. The second network is a home network comprising the variety of devices mentioned earlier, also operating under “normal execution”. The configured voice assistants are actively queried during our monitoring time. To establish a baseline, the devices are first removed from the network and the network is monitored; the devices are then added back and the monitoring continues. For the voice assistants, we iterate over a set of questions requiring access to the Internet. We actively monitor the traffic at the router to and from the home network for seven hours. Overall, we gather a dataset of around 34 GB from the first network, while the home network generated around 1 GB of traffic.

In addition to the network traffic, we also gathered bash history data from nine volunteers. To protect the privacy of the users, we anonymize user identity by manually inspecting the commands and removing all clearly identifying information, including usernames, domain names, IP addresses, etc. Overall, we collect a dataset of around 143 MB, consisting of 5,772 commands. These commands correspond to services such as ssh, git, apt, Makefile, curl, etc., and generic Linux commands, such as cd, rm, chmod, cp, find, etc.
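A sketch of the kind of scrubbing involved is shown below; the anonymization in our study was performed manually, so these rules are illustrative assumptions rather than our exact procedure.

# Sketch: scrubbing identifying information from bash history lines.
# The regexes and placeholders are illustrative assumptions.
import re

SCRUBBERS = [
    (re.compile(r"\b[\w.-]+@[\w.-]+\b"), "<USER@HOST>"),    # user@host pairs
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # bare IPv4 addresses
    (re.compile(r"/home/[^/\s]+"), "/home/<USER>"),         # home directories
]

def anonymize(command):
    for pattern, placeholder in SCRUBBERS:
        command = pattern.sub(placeholder, command)
    return command

print(anonymize("scp data.tgz alice@192.168.2.7:/home/alice/in"))
# -> scp data.tgz <USER@HOST>:/home/<USER>/in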

Table II shows a sample of the payloads from the four data sources described earlier: three benign and one malicious. The traffic gathered from the five volunteers in Network 1 resulted in a total of 28,578,754 individual payloads, the majority of which are encrypted and cannot be used for our data collection. The same dataset, however, had 1,625,143 unencrypted payloads, which we utilize for our data collection. Similarly, the five sources in Network 2 generated 4,735 unencrypted payloads, while the bash data source consisted of a total of 5,772 commands from nine volunteers.

Figure 3: Monitoring stations for benign dataset creation. Two network implementations are used: a NAT and a home network.
TABLE II: Data sources for dataset. 1 Number of files used to extract commands, 2 Number of commands from source files.
Data Sources #Sources1 #Commands2 Example
PCAP Network 1 5 1,625,143 GET /update-delta/hfnkpimlhhgieaddgfemjhofmfblmnib/5092/5091/193cb84a0e51a5f0
ca68712ad3c7fddd65bb2d6a60619d89575bb263fc5dec26.crxd HTTP/1.1\r\nHost: storage.
googleapis.com\r\nConnection: keep-alive\r\nUser-Agent: Mozilla/5.0 (X11; Linux
x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
\r\nAccept-Encoding: gzip, deflate\r\n\r\n
PCAP Network 2 5 4,735 GET /favicon.ico HTTP/1.1\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (compatible;
Nmap Scripting Engine; https://nmap.org/book/nse.html)\r\nHost: 192.168.2.1\r\n\r\n
Bash Commands 9 5,772 sudo wget https://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5a
a5bca3f542482c60/jdk-8u201-linux-x64.tar.gz
Malware 2,891 2,008 GET /cdn-cgi/l/chk_captcha?id=%s & g-recaptcha-response=%s HTTP/1.1 User-Agent:
%s Host: %s Accept: */⁢ Referer: http://%s/ Upgrade-Insecure-Requests: 1 Connection:
keep-alive Pragma: no-cache Cache-Control: no-cache
TABLE III: Malware dataset by architecture. #Samples means the number of samples, while Perc. represents the percentage representation in the dataset.
Architecture #Samples Perc.
ARM 668 23.11%
MIPS 600 20.75%
Intel 80386 449 15.53%
Power PC 270 9.34%
X86-64 242 8.37%
Renesas SH 233 8.06%
Motorola m68k 217 7.51%
SPARC 212 7.33%
Total 2,891 100%

IV-B Methodology: High-Level Overview

The shell is a single point of entry for malware to launch attacks. Accordingly, detecting malicious commands before they are executed on the host will help secure the host, even when the malware is able to exploit a vulnerability in the device to access its shell, and will help mitigate such targeted attacks. Our preliminary analysis highlights the use of shell commands for infection, propagation, and attack by malware authors. The Linux capabilities of embedded IoT devices give adversaries the required avenue to abuse the shell. Figure 1 shows an example usage of the shell for propagation. The malware attempts to download a file named DNS.txt from a dropzone, whose address is stored in the variable %s; the file contains a list of target devices for malware propagation. As is evident from the figure, the malware first attempts to download DNS.txt to the home directory (the default entry point upon gaining access). In case of failure, the malware attempts to initiate BusyBox from three different locations, the success of which depends on where BusyBox is installed on the device, followed by downloading the DNS.txt file.

We exploit this insight to build ShellCore. In the following, we explain ShellCore and its building blocks. The goal of ShellCore is to realize an effective detection model that detects malicious files based on their usage of the Linux shell. To this end, we break the problem into two parts: (i) detecting malicious commands and (ii) detecting malicious files. We use real-world IoT malware samples and disassemble them to extract shell commands, as explained earlier. We then extract features from the commands by taking advantage of the bag-of-words technique. In particular, we create a corpus of words from the individual commands, considering all the commands in our dataset, and count the occurrences of each feature in the corpus to represent each command as a feature vector. Along with the words, we also use n-grams to represent the commands as feature vectors. Moreover, we reduce the feature representations of the commands in our dataset using PCA.

Upon representing the malicious and benign commands as feature vectors, ShellCore aims to detect malicious commands, as shown in Figure 2. To do so, ShellCore employs machine learning algorithms to classify commands individually. We train our model and test it using the different datasets above, using cross-validation to limit bias towards a certain training dataset. Using the same model, we extend the classifier to classify files as either benign or malicious. The following sections explain our model in more detail, along with the evaluation results.

Similarly, when dealing with malicious files, ShellCore groups the commands by malware sample and benign application, represents each file as a feature vector, and finally applies feature reduction. This is abstracted in the command extraction engine in Figure 2.

V Detection Model

We leverage the potential of neural networks to detect malicious commands using NLP-based feature generation. To better learn the specifics of shell commands, we tune the default NLP algorithms to enrich the feature representation of the commands. In particular, we represent the commands as feature vectors utilizing bag-of-words, reduce the features using PCA, and use ML-based algorithms for the malicious/benign command and sample classification.

V-A Featurization

Feature representation is a method to depict the attributes of samples: the process of cleansing and linking the data so that it is transformed into a format understood by the algorithms employed for detection. In this section, we discuss selecting features that best represent the characteristics of the samples in a dataset. There are many methods of featurization, depending upon the nature of the data. Considering the textual nature of our samples, we focus on text-based featurization. Towards this, we first leverage the traditional NLP-based approach of considering words in the samples as features. Since such an approach misses crucial attributes, we then employ a customized NLP approach to meet our goals.

V-A1 Traditional NLP-based model

We leverage NLP for feature generation. This is facilitated by considering independent words as features and occurrences of spaces and/or characters as tokenizers, while only words with a length greater than two are considered in the bag of words for feature vector creation. We adopt the bag-of-words approach along with n-grams. Let I_1 represent the individual words in a command, and n represent the total number of words in the command. Each word in the command can then be represented as I_{1i}, where i ∈ [1, n], such that

I_1 = {I_{11}, I_{12}, I_{13}, ..., I_{1n}}

V-A2 Customized NLP-based model

We acknowledge that the traditional NLP-based approach ignores words with length less than three. Moreover, it does not take individual characters into consideration, which undermines many discriminating and dominant characteristics of a command, thereby not representing the commands accurately. The presence of many shell commands utilizing keywords of length l ≤ 2 calls for a more accommodating feature generation mechanism. To build one, we change the boundaries of the definition of a word by considering every space, special character, alphabet character, and number as a word, along with n-grams and command statistics. This augments our vocabulary with more granular features that capture the attributes precisely. Let I_2 represent each character, alphabet character, number, etc. constituting a command, and n represent the total number of such constituents in the command. Every such constituent can then be represented as I_{2j}, where j ∈ [1, n], such that

I_2 = {I_{21}, I_{22}, I_{23}, ..., I_{2n}}

V-B Feature Representation

To represent every element in the dataset from a defined reference point, we represent each element with respect to axes in a space. In particular, every command/sample in the dataset is represented as a feature vector in the defined feature space. To train our detection model, each command is represented as a feature vector, where every element represents a distinct feature of the input. In this regard, we begin by finding the feature space to determine the dimensionality of the vectors. Particularly, the commands are augmented such that every feature of the commands in the dataset has a representation in the feature space. Every command in the dataset is then represented in a space of n axes, where n is the size of the feature space. To do so, we devise multiple representations of the commands, such as by including the words in the commands and by splitting the commands on spaces and every special character. We also form a feature vector by considering every letter as a feature, combined with the special characters. We implement the bag-of-words method to define our feature space. The rest of this section explains our feature representation mechanism using the bag-of-words technique.

Bag of words. We realize a representation of commands/samples using the bag-of-words technique. Depending upon the splitting pattern of the samples, we create a central vector that stores all the words in the samples. Each sample in the dataset is then mapped to a sparse vector representation, i.e., a feature vector with an index for every word in the vocabulary: with n different words in the central vector, each sample is represented over n indices, which are left zero except at the words that occur in the sample (multi-hot encoding). Putting this into perspective: to generate the vector space, we add every word to an array. For every sample, we initialize its feature vector with a size equal to the bag of words. For every occurrence of a word in a sample, the word's index location is incremented. Therefore, the feature vector of a sample represents the frequencies of the corresponding words in the dictionary.
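As a concrete illustration, the following minimal sketch builds word-level frequency vectors in the spirit of the traditional model; assuming scikit-learn as the implementation vehicle (the text does not name one), with a token pattern that keeps only words of length three or more.

# Sketch: word-level bag-of-words over shell commands (traditional model).
# The scikit-learn API and the example commands are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer

commands = [
    "wget http://host/DNS.txt -O /tmp/DNS.txt",
    "chmod 777 /tmp/DNS.txt",
]

# token_pattern keeps alphanumeric words of length >= 3, mirroring the
# traditional model's behavior of dropping shorter tokens
vectorizer = CountVectorizer(token_pattern=r"\b\w{3,}\b")
X = vectorizer.fit_transform(commands)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # per-command word frequencies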

A very important characteristic of the commands is their syntax, which depends on the structure of a command. Therefore, in addition to the standard features gathered from the commands, we also augment the feature space with feature proximity, to capture the structure of the commands. To do so, we represent the features as n-grams: every n contiguous words in a sample's shell command are considered a feature. When using n-grams as features, every n contiguous words occurring in a sample are added to the bag of words corresponding to them in the feature space.

For each of the two aforementioned models, we create a separate bag of words, such that the bag contains all the words I_{1i}^k, where i ∈ [1, n] and k ∈ [1, m], with n the total number of words in a command and m the total number of commands in the dataset, along with the n-grams. Therefore, the words in all the commands, per the traditional NLP model, can be combined as I_{11}^1, I_{11}^2, I_{11}^3, ..., I_{12}^1, I_{13}^1, ..., I_{1n}^m. Let B be the bag of words for the dataset, such that B = {B_1, B_2, B_3, ..., B_t}, where t ≤ m*n and each B_p, p ∈ [1, t], is unique in B. Each command I_i, where i ∈ [1, m], can then be represented as a feature vector F with respect to the bag of words B, where the p-th index holds the frequency of occurrence, in the command, of the p-th word in the bag: F = {f_{B_1}, f_{B_2}, f_{B_3}, ..., f_{B_t}}, where f_{B_p}, p ∈ [1, t], is the frequency of the word appearing at index p of the bag B in the command I_i.

V-C Feature Reduction

We capture as many features as possible to achieve accurate results. Beyond a certain point, however, the performance of the model becomes inversely proportional to the number of features: using a wide variety of features to represent samples leads to a very high-dimensional feature vector, which in turn leads to (i) a high cost of learning and (ii) overfitting, i.e., the model may perform very well on the training dataset but poorly on the test dataset. Dimensionality (feature) reduction is applied to address these two problems. We implement Principal Component Analysis (PCA) for feature reduction to improve the performance and quality of ShellCore's classifier. PCA is a statistical technique that extracts features (components) from multiple raw features, where the raw features here are the n-grams and statistical measurements. PCA creates new variables, named Principal Components (PCs), that are linear combinations of the original variables, transforming a possibly large number of correlated variables into a low-dimensional set of uncorrelated PCs. It normalizes the dataset by transforming it into a normal distribution with the same standard deviation [chiang2000fault], generating a standard representation of variables in order to identify a subset that best characterizes the underlying data [uguz11]. We reduce the d-dimensional vector representation of individual commands to q principal components onto which the retained variance under projection is maximal.
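As an illustrative sketch, assuming a scikit-learn-style PCA (the concrete implementation is not prescribed here), the reduction to components retaining 99% of the variance, the threshold used in section VI-B, can be expressed as follows.

# Sketch: PCA reduction of command feature vectors, retaining 99% of the
# variance. The random matrix stands in for real d-dimensional vectors.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 500))        # stand-in for d-dimensional command vectors

pca = PCA(n_components=0.99)      # keep components explaining 99% of variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)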

V-D Classification Methods

After representing each sample as a feature vector, we classify the samples into two classes: malicious or benign. Considering the text-based non-linear features and the high dimensionality of the sample vectors, we utilize two well-known machine learning algorithms for non-linear classification tasks: Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM).

1) Multi-Layer Perceptron (MLP). MLP neural networks are connected, feed-forward neural networks with one or more hidden layers between the input and output layers. The hidden layers consist of one or more neurons in parallel, each connected with a certain weight to all nodes in the following layer to generate a single output for the next layer. There are no recurrent connections within a layer and no direct connections between the input and output layers. All layers are fully connected with their adjacent layers. The number of input neurons need not match the number of output neurons, and the number of hidden neurons in a layer can be smaller or larger than the number of input or output neurons [Cybenko89, Frias-MartinezSV06].

Given a feature vector X of length q and target y, MLP learns a function f(·): R^q → R^o, where q is the input's dimension and o is the output's dimension. With multiple hidden layers, the dimension of the output of every hidden layer decreases with transformation. Each neuron in a hidden layer transforms the values of the preceding layer using a linearly weighted summation, w_1 + w_2 + w_3 + ... + w_q, which passes through a non-linear activation function g(·): R → R. The output of the hidden layers is then transformed into the final output by the activation function f.

2) Support Vector Machine (SVM). SVM classifies the data by finding the best hyperplane separating the data of the two classes. When training a classifier for a preferred class, the training instances belonging to that class are considered positive examples, while the remaining instances are negative examples. To classify a new instance, the classifier computes the margin and selects the hyperplane with the largest margin between the two classes [steinwartC08]. We use SVM due to its effectiveness in high-dimensional spaces, including when the number of dimensions exceeds the number of samples, and its memory efficiency. To this end, we utilize the following decision function [GuyonBV92, CortesV95]:

sgn(Σ_{i=1}^{n} y_i α_i K(x_i, x) + ρ),

where x_i, i ∈ [1, q], is the training feature vector of a sample, ρ is the hyperplane margin, y_i are the output labels, and the kernel function K(x_i, x) is defined as K(x_i, x) = φ(x_i)^T φ(x_j).
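For illustration, a compact sketch of both classifiers over PCA-reduced vectors follows, evaluated with 10-fold cross-validation as in our evaluation; the hidden layer widths and the synthetic data are assumptions, since the text reports five hidden layers without specifying their sizes.

# Sketch: MLP and SVM over PCA-reduced command vectors with 10-fold CV.
# Layer widths, kernel choice, and the synthetic data are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_reduced = rng.random((200, 40))   # stand-in for PCA-reduced features
y = rng.integers(0, 2, 200)         # malicious (1) / benign (0) labels

mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16, 8), max_iter=500)
svm = SVC(kernel="rbf")             # non-linear kernel K(x_i, x)

for name, clf in [("SVM", svm), ("MLP", mlp)]:
    scores = cross_val_score(clf, X_reduced, y, cv=10)  # 10-fold CV
    print(name, "mean accuracy:", scores.mean().round(3))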

VI Evaluation and Discussion

In this section, we evaluate ShellCore's performance and discuss the results. We start by classifying individual commands using the NLP-based approach. In all evaluations, our model exhibits high accuracy. We divide our evaluation into two parts. First, we build a detection system to detect malicious commands by considering every individual command in the dataset, irrespective of its source application. Second, this detection system is extended to detect malicious files, where the commands corresponding to an application are combined, representing a single file as a feature vector over multiple commands.

In the following, we provide further details of the datasets and their characteristics, and the utilized evaluation metric. We then describe the traditional and customized NLP-based models. Finally, we describe how these two models are leveraged for detecting individual commands and malicious files.

TABLE IV: Size characteristics of the different datasets. Cmd. is the number of commands, Med. stands for Median, Std. is the standard deviation, and Net. stands for Network.
Data Cmd. Max Min Average Med. Std.
Net. 1 1,625,143 1,564 52 184.68 185 4.88
Net. 2 4,755 1,536 8 209.01 167 146.26
Bash 5,772 356 2 23.00 14 27.71
Malware 2,008 984 5 293.91 384 168.03

Dataset. Table IV shows the number of commands as well as the command-length statistics (maximum, minimum, average, median, and standard deviation). We note that the low deviation of command lengths in Net. 1 indicates that the commands have similar lengths. Moreover, we notice that the Net. 2 (corresponding to the IoT devices setting) and Malware datasets have the closest lengths overall, per the average and standard deviation of their distributions.

Parameter Tuning. For better representation of the commands' features, we utilize n-grams; particularly, 1- to 5-grams. For the MLP classifier, we also try multiple combinations of parameters to tune the classifier for better performance. We achieve the best performance when using five hidden layers.

K-Fold Cross-Validation. The evaluation of a machine learning algorithm depends on the training and testing data. To generalize the evaluation, cross-validation is used. For K-fold cross-validation, the data are sampled into K subsets; the model is trained on K-1 of the subsets and tested on the remaining one. The process is repeated K times, allowing each subset to serve as the testing data while the remaining nine are used for training, and the performance results are taken as the average of all runs. In this work, we use 10-fold cross-validation.

Evaluation Metrics. We evaluate the classification results in terms of accuracy, false negative rate, and false positive rate. For a class C_i, False Positive (FP), False Negative (FN), True Positive (TP), and True Negative (TN) are defined as follows: TP of C_i is all C_i instances classified correctly; TN of C_i is all non-C_i instances not classified as C_i; FP of C_i is all non-C_i instances classified as C_i; and FN of C_i is all C_i instances not classified as C_i. Accuracy (AC) is the sum of true positives and true negatives divided by the sum of true positives, false positives, false negatives, and true negatives. The False Negative Rate (FNR) is the number of false negatives divided by the sum of true positives and false negatives. The False Positive Rate (FPR) is the number of false positives divided by the sum of false positives and true negatives. We report the metrics as mean AC, mean FNR, and mean FPR over the 10 folds.
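As a worked example of these definitions, the following computes the three metrics from a made-up binary confusion matrix, treating malicious as the positive class.

# Worked example: AC, FNR, FPR from a binary confusion matrix.
# The counts below are illustrative, not results from our evaluation.
tp, fn = 950, 50      # malicious commands: correctly / incorrectly classified
fp, tn = 8, 992       # benign commands: incorrectly / correctly classified

ac  = (tp + tn) / (tp + tn + fp + fn)   # accuracy
fnr = fn / (tp + fn)                    # false negative rate
fpr = fp / (fp + tn)                    # false positive rate
print(f"AC={ac:.3f} FNR={fnr:.3f} FPR={fpr:.3f}")
# AC=0.971 FNR=0.050 FPR=0.008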

VI-A Traditional NLP-based model

The traditional NLP-based learning model uses words as features, with spaces and other special characters as tokenizers, and does not consider words less than three characters long. To better represent the locality of the words, the model utilizes n-grams; particularly, 1- to 5-grams. With 10-fold cross-validation, the model achieves the results shown in Table V, for both the SVM and MLP classifiers, with MLP yielding better results.

TABLE V: Evaluation results: malicious command detection. Trad. refers to Traditional and Custom. refers to Customized.
Metric | Trad. NLP (SVM) | Trad. NLP (MLP) | Custom. NLP (MLP)
AC | 0.929 | 0.934 | 0.953
FNR | 0.317 | 0.21 | 0.0271
FPR | 0 | 0.0085 | 0.0853

VI-B Customized NLP-based model

We note that the traditional approach only considers words, neglecting the characters, spaces, and words of length less than three. This presents a major shortcoming, since a large number of command keywords have a length less than three, including cd, ls, etc., or consist of special characters, such as ||, &&, etc. To address this shortcoming, we modify the feature generation step such that it considers these important domain-specific characteristics that would otherwise be ignored. To do so, we change the way a word is defined by carefully declaring the tokenizers such that no character is ignored. Subsequently, we change the bag of words to operate at the character level, containing every letter, number, and character as an individual feature. Moreover, to capture the placement of letters, characters, and spaces, we also consider combinations of these elements in the form of n-grams (up to 5-grams) in the vector space. Finally, for feature reduction, we use PCA such that it covers 99% of the variance in the training dataset.
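A minimal sketch of this character-level featurization follows, again assuming scikit-learn as the implementation vehicle: tokens become single characters combined into 1- to 5-grams, so short keywords (cd, ls) and operators (||, &&) are no longer discarded.

# Sketch: character-level 1- to 5-gram features (customized model).
# The example commands are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

commands = ["cd /tmp || busybox wget http://host/x", "ls -la && rm -rf x"]

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 5))
X = vectorizer.fit_transform(commands)
print(X.shape)  # every 1..5-character sequence becomes a feature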

VI-C Detecting Malicious Commands

We use the models defined above for detecting individual malicious commands. We first present the results of the traditional model, followed by the customized model.

When used with MLP, the traditional model provides an accuracy of 93.4%, with an FNR of 21% and an FPR of 0.85%, as can be observed in Table V, while the SVM shows an accuracy of 92.9%. We observe that the accuracy, although greater than that of SVM, is relatively low. We therefore work towards improving the accuracy and reducing the false positives and false negatives by considering other important features of the samples, evaluating the customized model. Since we observe better performance with the MLP classifier, we select MLP as the classifier for ShellCore.

We test the customized NLP-based model for detecting individual malicious commands over the same dataset; Table V shows the results. As shown, the approach improves the performance of the model, with an improvement in accuracy over the traditional model of about 2%, i.e., to 95.3%.

The difference in the evaluation results of the two models is expected, and is mainly due to the difference in their features, hinting at the importance of special characters and words of length less than three. Moreover, Figure 4 plots the accuracy for each fold of the 10-fold cross-validation. We observe training accuracy in the range of 97.4% to 97.8% in every fold when using the traditional model, while its testing accuracy is in the range of 78.8% to 99.1%. For the customized model, the testing accuracy ranges between 67.3% and 99.7%, dipping below 95% only once. Overall, the 95.3% accuracy in detecting malicious commands points to the effectiveness of the latter model.

Figure 4: Accuracy variations over the 10-fold cross-validation when detecting malicious commands. (a) Traditional model; (b) Customized model.
TABLE VI: Evaluation results of malware detection. E.M.: Evaluation Metrics, and other abbreviations are in Table V.
E.M. Trad. NLP Custom. NLP
AC 0.996 0.998
FNR 0.001 0.002
FPR 0.006 0.001

VI-D Malware Detection

The next natural step is to generalize from shell command detection to binary (malware) detection, which we pursue in the following. In particular, in this section we classify files as malicious or benign using one feature vector per file, combining the feature values of the shell commands associated with each file.

Dataset. We use the same malware samples and collected benign dataset as in the other experiments and analyses. Particularly, we group the commands by their source. For the malicious dataset, we group the commands by malware sample. However, we do not have designated source information for the benign commands. To generate a benign dataset of files, we cluster the benign commands under separate file labels. Keeping in mind the susceptibility of clustering to human and algorithmic biases, we cluster the commands such that the probability distribution of the benign dataset across files matches that of the malicious dataset: we observe the probability distribution of the number of commands per malware sample, and sample from that distribution to determine the number of commands in each benign file. Given the number of commands for a file, we then randomly select that many commands from the pool of benign commands.
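A minimal sketch of this grouping follows; the variable names and the concrete resampling step are illustrative assumptions.

# Sketch: grouping benign commands into synthetic "files" whose sizes
# follow the per-sample command-count distribution observed in malware.
import numpy as np

rng = np.random.default_rng(0)
malware_counts = [12, 3, 40, 7, 19]             # commands per malware sample
benign_pool = [f"cmd_{i}" for i in range(500)]  # flat pool of benign commands

benign_files = []
for size in rng.choice(malware_counts, size=10):    # mimic the distribution
    picked = rng.choice(benign_pool, size=size, replace=False)
    benign_files.append(list(picked))

print([len(f) for f in benign_files])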

We use ShellCore to train and test the model over the per-file dataset. Particularly, the commands corresponding to a file are represented as a feature vector of that file. Similar to individual command detection, we try both the traditional and the customized NLP-based approaches. Table VI presents the performance results of ShellCore for malware detection, demonstrating that ShellCore can detect malicious files with high accuracy and very low error (false positives and false negatives). Moreover, the same table shows that the accuracy of ShellCore improves when using the customized NLP-based model.

VI-E Discussion

This work studied the usage of the Linux shell, with the aim of detecting malicious shell commands and the malware utilizing them. To do so, the proposed system, ShellCore, leverages machine learning- and neural network-based algorithms. The system is then evaluated, and the results are presented. Our results show that while ShellCore detects individual commands with comparatively lower accuracy, it performs very well in detecting the malware that uses those commands. We elaborate on those results below.

Detecting individual shell commands. Although researchers have looked into the malicious usage of Windows PowerShell, and except for analyses of vulnerabilities in the Linux shell (e.g., shellshock), the malicious usage of Linux shell commands has not been analyzed in the past. Prior works have analyzed and detected the use of shell commands to propagate attacks, e.g., sending malicious bots [Geer05] and installing ELF executables on Android systems [SchmidtSCYKCA08]. Given the large ecosystem of connected embedded devices with Linux capabilities, and sensing the urgency, we analyze the usage of shell commands by malware. This work presents a system to detect malicious commands with 95.3% accuracy.

Malware Detection. IoT malware has been on the rise. Given the difficulty of obtaining samples, very few works have been done on detecting IoT malware, and even fewer using residual strings in the binaries. section II discusses the methods that work on detecting IoT malware. In this work, we use the commands embedded in the malware samples to detect them.

Our detection model achieves an accuracy of 99.8% with an FNR and FPR of 0.2% and 0.1%, respectively, while the equivalent FPR of Hendler et al. [HendlerKR18] is 0.89%, using an expensive deep learning approach.

As malware abuses the shell of the host device, detecting it at the shell will safeguard the device from becoming infected. Malware typically accesses a device by breaking into the host with a dictionary attack, typically a single shell command execution. Alternatively, a host can be infected through a zero-day vulnerability or an existing exploitable vulnerability in an outdated device, among others, which are also exercised through individual shell commands. In a successful event, where the adversary breaks into a host, it will abuse the shell to infect the host, propagate the malware, and build a botnet to launch attacks. As such, a detector of such high accuracy, at both the individual command level and the malware sample level, with low FPR and FNR, will help stop the host device from being used as an intermediary target for launching attacks, despite the presence of vulnerabilities on the host. This makes this work very timely and necessary.

Dataset. One of the biggest challenges in developing and studying systems for detecting malicious commands is the absence of a benign dataset. We propose a way to assemble such a dataset. To reduce bias, we collect the dataset from two different networks with a diverse set of devices and behaviors. Moreover, we assemble benign shell usage by end-users, based on several volunteers' device usage. For generating the benign dataset for malware detection, we use mathematical models that aggregate files into groups while respecting the distribution of the original sample sizes in both the benign and malicious files. As such, and to remove possible human bias, we cluster the benign commands into files such that the probability distribution of the number of commands is the same in the malicious and benign datasets. While the shell commands of both the benign and malicious datasets used in this study will be made public, we leave exploring additional methods for obtaining benign datasets, and contrasting them to the datasets used in this study, as future work.

VII Conclusion and Future Work

In this paper, we propose a system, called ShellCore, for detecting malicious commands and malware targeting Linux-based IoT devices. We analyzed malicious shell commands from a dataset of 2,891 IoT malware samples, along with a dataset of benign shell commands assembled from benign applications. ShellCore leverages neural network-based algorithms to detect malicious commands and files, and NLP-based approaches for feature creation. ShellCore detects individual malicious commands with an accuracy of more than 95%, and detects malware with an accuracy of around 99.8%, with low FPR and FNR. The results reflect that despite a comparatively lower detection rate for individual commands, the proposed model is able to detect their source with high accuracy. While we consider the shell commands embedded in the malware code, we do not consider bash files downloaded by the malware and executed on the host device at run-time. In the future, we will extend this work by implementing it on real IoT devices.

Appendix A Appendix

A-A Algorithms

Algorithm 1 shows the steps used to extract the strings from the malware and benign samples, from which the shell commands are then obtained.

Input: directory to malware
Output: commands in the malware

malwareDirectory ← directory to malware
for malware in malwareDirectory do
      Extract the strings in the program
      strings ← split strings by new line
      for i ← 0 to len(strings) by 1 do
            if pattern in strings[i] then
                  offset ← offset of strings[i]
                  Traverse to the offset
                  instructionSet ← disassemble at offset
                  command ← command in the instructionSet
            else
                  continue
            end if
      end for
end for

Algorithm 1: Command extractor