From Threat Reports to Continuous Threat Intelligence: A Comparison of Attack Technique Extraction Methods from Textual Artifacts

Md Rayhanur Rahman, Laurie Williams
mrahman@ncsu.edu, lawilli3@ncsu.edu
North Carolina State University

Abstract

The cyberthreat landscape is continuously evolving. Hence, continuous monitoring and sharing of threat intelligence have become a priority for organizations. Threat reports, published by cybersecurity vendors, contain detailed descriptions of attack Tactics, Techniques, and Procedures (TTP) written in an unstructured text format. Extracting TTP from these reports aids cybersecurity practitioners and researchers learn and adapt to evolving attacks and in planning threat mitigation. Researchers have proposed TTP extraction methods in the literature, however, not all of these proposed methods are compared to one another or to a baseline. The goal of this study is to aid cybersecurity researchers and practitioners choose attack technique extraction methods for monitoring and sharing threat intelligence by comparing the underlying methods from the TTP extraction studies in the literature. In this work, we identify ten existing TTP extraction studies from the literature and implement five methods from the ten studies. We find two methods, based on Term Frequency-Inverse Document Frequency(TFIDF) and Latent Semantic Indexing (LSI), outperform the other three methods with a F1 score of 84% and 83%, respectively. We observe the performance of all methods in F1 score drops in the case of increasing the class labels exponentially. We also implement and evaluate an oversampling strategy to mitigate class imbalance issues. Furthermore, oversampling improves the classification performance of TTP extraction. We provide recommendations from our findings for future cybersecurity researchers, such as the construction of a benchmark dataset from a large corpus; and the selection of textual features of TTP. Our work, along with the dataset and implementation source code, can work as a baseline for cybersecurity researchers to test and compare the performance of future TTP extraction methods.

1 Introduction

Information technology (IT) systems have been gaining continuous attention from threat actors with financial motives [19] and organized backing (i.e., state sponsored [26]). For example, in 2021, Sonatype reported that software supply chain attacks increased by 650% in 2020 from the previous year [40]. Moreover, a cyberattack on the Colonial pipeline [46], JBS [31] and Ireland health services [27] show that threat actors can destabilize millions of people’s lives by fuel price surge, food supply shortage, and disruption in healthcare services. Thwarting cyberattacks has become more complicated as the threat landscape evolves rapidly. Hence, continuous monitoring and sharing of threat intelligence has become a priority, as emphasized in Section 2(iv) of the US Executive Order 14028: Improving the Nation’s cybersecurity: “service providers share cyber threat and incident information with agencies, doing so, where possible, in industry-recognized formats for incident response and remediation.“ [18].

Threat reports, published by cybersecurity vendors and researchers, contain detailed descriptions on how malicious actors utilize specific tactics, relevant techniques, and describe procedures for performing the attack - known as Tactics, Techniques, Procedures (TTP) (see Section 2.2) [44, 42, 24]) - to launch cyberattacks. Consider a threat report from FireEye describing the attack procedures of the Solarwinds supply chain attack [7] in Example 1, where we show the attackers’ actions in bold text. One of the observed (mentioned in the report) TTP is T1518.001: Security Software Discovery which allows an attacker to bypass the security defense by discovering security software running in the system [23]. The rise in cyberattack incidents with evolving attack techniques results in a growing number and volume of threat reports. Extracting the TTP from threat reports can help cybersecurity practitioners and researchers with cyberattack characterization, detection, and mitigation [14] from the past knowledge of cyberattacks. Analyzing TTP also helps cybersecurity practitioners in continuous monitoring and sharing of threat intelligence. For example, organizations can learn how to adapt to the evolution of cyberattacks. Cybersecurity red and blue teams also benefit in threat hunting by threat intelligence sharing [44], attack profiling [29], and forecasting [37].

Threat reports contain a large amount of text and manually extracting TTP is error-prone and inefficient [14]. Cybersecurity researchers have proposed automated extraction of TTP from threat reports (e.g. [34, 14, 15, 28, 5, 29]). Moreover, the MITRE [2] organization uses an open-source tool [3] for finding TTP from threat reports. These TTP extraction work use natural language processing (NLP) along with supervised and unsupervised machine learning (ML) techniques to classify texts to the corresponding TTPs. However, no comparison among this existing work has been conducted, and the research has not involved an established ground truth dataset [34], highlighting the need for a comparison of underlying methods of existing TTP extraction work. A comparative study among these methods would provide cybersecurity researchers and practitioners a baseline for choosing the best method for TTP extraction, finding room for improvement.

Rahman et al. systematically surveyed the literature and obtained ten TTP extraction studies [34]. None of these studies compared their work with a common baseline and only two of these studies [5, 15] compared their results with one other. In our work, we first select five studies [29, 28, 14, 5, 21] from the ten based on inclusion criteria (Section 3.2) and implement the underlying methods of the five selected studies. We then compare the performance of classifying text (i.e., attack procedure description) to the corresponding attack techniques. Moreover, as the number of attack techniques are growing due to the evolution of attack techniques, we also investigate (i) how the methods perform given that the dataset has class imbalance problems (existence of majority and minority classes); and (ii) how the methods perform when we increase the classification labels (labels are the name of techniques that would be classified from attack procedure descriptions).

The goal of this study is to aid cybersecurity researchers and practitioners choose attack technique extraction methods for monitoring and sharing of threat intelligence by comparing underlying method from the TTP extraction studies in the literature. We investigate these following research questions (RQs):

RQ1: Classification performance: How do the TTP extraction methods perform in classifying textual descriptions of attack procedures to attack techniques across different classifiers?
RQ2: Effect of class imbalance mitigation: What is the effect on the performance of the compared TTP extraction methods when oversampling is applied to mitigate class imbalance?
RQ3: Effect of increase in class labels: How do the TTP extraction methods perform when the number of class labels is increased exponentially?

We implement the underlying methods of these five studies: [29, 28, 14, 5, 21]. We construct a pipeline for comparing the methods on the same machine learning workflow. We run the comparison utilizing a dataset constructed from the MITRE ATT&CK framework [24]. We also run the methods on oversampled data to investigate how the effect of class imbalance can be mitigated. Finally, we use six different multiclass classification settings ( $n=2,4,8,16,32,64$ where $n$ denotes the number of class labels) to investigate how the methods perform in classifying a large number of available TTP. We list our contributions below.

•

A comparative study of the five TTP extraction methods from the literature. This article, to the best of our knowledge, is the first study to conduct direct comparisons of the TTP extraction methods.
•

A sensitivity analysis on the effect of using oversampling and multiclass classification on the compared method. Our work investigates these two important aspects of classification as the number of techniques is more than hundred and the technique enumeration is being updated gradually resulting in majority and minority classes.
•

A pipeline for conducting the comparison settings which ensure the methods are executed in the same machine learning workflow. We also make our dataset and implementation source code available at [4] for future researchers. The pipeline along with the dataset and implementation sources serve as a baseline for cybersecurity researchers to test and compare the performance of the future TTP extraction method.
•

We provide recommendations on how the methods can be improved for better extraction performance.

The rest of the article is organized as follows. In Section 2, we discuss a few key concepts relevant to this study. In Section 3 and 4, we discuss our process to identify the selected studies for comparison. In Section 5, we discuss our methodology for designing and running the experiment. In Section 6 and 7, we report and discuss our observations from the experiment. In Section 9 and 8, we identify several limitations to our work followed by highlighting potential future research paths. In Section 10, we discuss related work in the literature followed by concluding the article in Section 11. We report a few supplementary information in the Appendix.

2 Key Concepts

In this section, we discuss several key concepts relevant in the context of our study.

2.1 Threat Intelligence:

Threat intelligence - also known as Cyberthreat intelligence (CTI) - is defined as ‘evidence-based knowledge, including context, mechanisms, indicators, implications, and actionable advice about an existing or emerging menace or hazard to assets that can be used to inform decisions regarding the subject’s response to that menace or hazard‘ [22]. Threat intelligence can be used to forecast, prevent and defend attacks.

2.2 Tactics, techniques and procedures (TTP):

Tactics are high level goals of an attacker, whereas techniques are lower level descriptions of the execution of the attack in the context of a given tactic [24, 44]. Procedures are the lowest level step by step execution of an attack being performed. TTP can be used to profile or analyze the lifecycle of an attack on a targeted system. For example, privilege escalation is a tactic for gaining elevated permission on a system. One technique for privilege escalation can be access token manipulation [24]. An attacker can gain elevated privilege in a system by tampering the access token to bypass the access control mechanism. An example procedure is an attacker manipulating an access token by using Metasploit’s named-pipe impersonation [24].

2.3 ATT&CK:

The MITRE [2] organization developed ATT&CK [24], a framework derived from real world observations of adversarial TTPs deployed by attack groups. ATT&CK contains an enumeration of high level attack stages known as tactics. Each tactic has an enumeration of corresponding techniques, and each technique has associated procedure description(s). Procedures are written in unstructured text and describe how a particular technique has been used by the attacker to gain an objective of the corresponding tactic to launch a cyberattack. ATT&CK was first introduced in 2013 to model the lifecycle and common TTP utilized by threat actors in launching APT (advanced persistent threat) attacks. In our research, we utilized Version 9 of the ATT&CK framework which consists of 14 Tactics, 170 Techniques, and 8,104 procedures.

3 Selection of TTP extraction methods

In this section, we discuss the methodology for selecting and comparing the TTP extraction methods in five studies [29, 28, 14, 5, 21] found in literature.

3.1 Finding TTP extraction work from the literature:

Rahman et al. [34] systematically collected automated threat intelligence extraction-related studies from scholarly databases and found 64 relevant studies. From these, the first author of this paper identified ten studies that extracted TTP from the text automatically using NLP and ML techniques. We select these ten work as potential candidates for our comparison study. We refer to these ten works as the candidate set. In the Appendix, Table 7, we list the bibliographic information of the candidate set.

Id	Dataset type	Dataset source	# threat reports	NLP/ML techniques and features
$S_{1}$ *	Data breach incident reports	Github APTnotes [6] and custom search engine [1]	327	Latent Semantic Indexing(LSI)
$S_{2}$ *	APT attack reports	Github APTnotes	445	Dependency parsing, TFIDF of independent noun phrases
$S_{3}$	APT attack reports	Github APTnotes	50	Named entity recognition(NER)
$S_{4}$ *	APT attack reports	Github APTnotes	18,257	TFIDF
$S_{5}$	Malware report	Github APTnotes, MicrosoftS/Adobe Security Bulletins, National Vulnerability Database description	474	NER, Cybersecurity ontology
$S_{6}$ *	APT attack reports	Attack technique dataset (Source not reported)	200	LSI
$S_{7}$	Computer security literature and Android developer documentation	IEEE S&P, CCS, USENIX articles, Android API [12]	1,068	Dependency parsing
$S_{8}$	-	-	18	NER, Dependency parsing, Basilisk
$S_{9}$ *	Malware report	Symantec threat reports	17,000	Dependency parsing, BM25
$S_{10}$	Malware report	Symantec threat reports	2,200	Dependency parsing, BM25
Id with (*) symbol denotes that the study is selected for comparison

Table 1: Datasets and methods used in candidate set

3.2 Inclusion criteria for TTP extraction work:

A comprehensive comparison of TTP extraction methods is not a straightforward task. One difficulty in setting up the study is to find a labelled and universally agreed upon dataset. Moreover, constructing such a dataset is inherently challenging as the set of TTP is subject to change with evolution of the manner of attack. Another challenge is to determine whether the extraction should be performed on the sentence level or paragraph level. Finally, in the candidate set, TTP extraction methods were designed targeting different use cases, such as transforming the extracted TTP to structured threat intelligence formats [14] or building a knowledge graph [32]. Hence, not every study in the candidate set is able to extract all known TTPs. Hence, we define the following inclusion criteria:

1.

All methods selected for the comparison can work on the same textual artifacts
2.

Besides labelling the text to corresponding technique, no other manual labelling is required for comparison
3.

All methods can be compared using the same set of technique names which will be used as labels for classification tasks.

3.3 Filtering the TTP extraction work for comparison:

In Table 1, we report the dataset type, dataset source, and relevant NLP/ML techniques used for our candidate set. Next, we report how we filter the candidate set.

•

We drop $S_{3}$ , $S_{5}$ , and $S_{8}$ because Named Entity Recognition (NER) labelling of words from the text is required (violates filtering criteria [2]).
•

We drop $S_{7}$ because this work (a) uses Android development documentation (violates filtering criteria [1]), and (b) extracts the features for Android-specific malware only (violates filtering criteria [3]).
•

We drop $S_{10}$ because the work requires additional manual work on identifying relevant verbs and objects from Wikipedia articles on computing and cybersecurity related concepts (violates filtering criteria [2]).

Finally, we keep the remaining work for our comparison study: $S_{1}$ , $S_{2}$ , $S_{4}$ , $S_{6}$ , and $S_{9}$ . $S_{1}$ and $S_{6}$ utilized Latent Semantic Indexing (LSI) [20]; $S_{2}$ and $S_{4}$ utilized Term frequency - inverse document frequency (TFIDF); and $S_{9}$ utilized dependency parsing and BM25.

4 Overview of the selected studies for comparison

We report a brief overview of the studies selected for comparison followed by observed similarities and dissimilarities.

$S_{1}$: The authors used the data breach incident reports produced by cybersecurity vendors and then searched high level attack patterns from those reports. The authors used the ATT&CK framework for the common vocabulary of attack pattern names. They used LSI for searching the attack pattern names from the texts. Finally, they correlated these searched attack patterns with responsible APT actor groups.
$S_{2}$: The authors used APT attack related articles as dataset and MITRE ATT&CK framework for the common vocabulary of TTP. Then they extracted independent noun phrases from the corpus that appear in the corpus at least once without being part of a larger noun phrase. Then they computed TFIDF vectors of these noun phrases. Finally, using these vectors, they retrieved the most relevant set of articles associated with specific TTP keywords such as data breach, privilege escalation.
$S_{4}$: The authors used APT attack-related articles and Symantec threat reports as dataset and MITRE ATT&CK framework for the common vocabulary of TTP. They computed TFIDF vectors of the articles and then applied three bias correction techniques named kernel mean matching [13], Kullback-Liebler importance estimation procedure [43], and relative density ratio estimation. Finally, they used SVM classifier on bias corrected data.
$S_{6}$: The authors used advanced persistent threat (APT) attack related online articles as dataset and MITRE ATT&CK framework for the common vocabulary of TTP. They first computed the TFIDF vectors of the description of TTP. Then they applied LSI on articles for retrieving a set of topics. After that, for each article, the authors computed the cosine similarity score between TFIDF vectors of each TTP and the retrieved topics. Then they used these computed similarity scores as features. Finally, the authors used two multi-label classification techniques named Binary Relevance and Label Powerset [36, 41].
$S_{9}$: The authors used Symantec threat reports as dataset and MITRE ATT&CK framework for the common vocabulary of TTP. First, they created an ontology of threat actions from the description of ATT&CK. Then they extracted threat actions as (subject, verb, object) tuples from each sentence in the corpus. Finally, they computed BM25 score for threat actions against their created ontology and mapped the extracted threat actions to the corresponding entities in their ontology and converted the reports to the Structured Threat Intelligence Exchange (STIX [30]) format.

From the above description, we see the following similarities among the articles:

•

Threat reports are used as dataset,
•

The MITRE ATT&CK framework is used as the common vocabulary of TTP
•

NLP techniques (such as TFIDF, LSI, BM25) are used for feature extraction or computation from text.
•

The extracted or computed textual features are fed to machine learners.

However, we also observed the following dissimilarities among the studies as well:

•

The purpose of the TTP extraction is different. For example, $S_{1}$ correlated extracted TTP with APT actor groups while $S_{9}$ constructed STIX threat intelligence format from unstructured threat reports
•

Not all studies used classification techniques, such as $S_{1}$ , $S_{2}$
•

Bias correction on dataset is only applied in $S_{4}$
•

Only one study ( $S_{6}$ ) modelled their approach as multi label classification problem

5 Comparison Study

In this section, we provide our research steps for the comparison study. We report how we design and utilize a five-step TTP extraction pipeline (see Section 5.3) for running the comparison study. We construct our dataset from MITRE ATT&CK framework. Next, we apply pre-processing to the corpus. After that, we implement the underlying method of the five selected studies. Finally, we feed the extracted/computed features to classifiers with oversampling (see Section 5.7) and multiclass classification (see Section 5.9) settings.

5.1 Comparison scope

In Section 4, we report similarities and dissimilarities among the articles we choose for comparison and as a result, we define the following scope of the comparison.

•

We mention in Section 2 that ATT&CK contains the mapping between (a) tactics and techniques, and (b) techniques and textual descriptions of procedures. Classifying the procedure description to the corresponding technique would find the corresponding tactics as well as ATT&CK provides the tactics-technique mapping as part of the framework. Hence, for TTP extraction, we choose to classify techniques from a given text (which are procedure descriptions), which will also give us the associated tactics from the tactic-technique mapping.
•

We assume that a piece of text (i.e., sentence(s)) is related to one technique. Hence, we will use classifiers to classify a piece of text to its corresponding technique.
•

All methods will be compared using the same dataset, class labelling, same set of classifiers
•

All the methods will be compared in the same machine learning pipeline (See Section 5.3)

As we will use the same dataset, labels, and classifiers, hence, we will only compare the NLP methods for extracting textual features that we will feed towards classifiers.

1.

For $S_{1}$ , we will use LSI vectors as features. We will refer this method as M:LSI.
2.

For $S_{2}$ , we will use TFIDF of unique noun phrases (See Section 4) as features. We will refer this method as M:TFIDF-NP.
3.

For $S_{4}$ , we will use TFIDF vectors of the corpus as features. We will refer this method as M:TFIDF.
4.

For $S_{6}$ , we will use the cosine similarity score as features. The score is computed from TFIDF vector of each technique description and vectors of topics generated by LSI. We will refer this method as M:LSI-Co.
5.

For $S_{9}$ , we will use BM25 score of (subject, verb, object) tuples of TTP description and corpus texts as features. We will refer this method as M:BM25.

5.2 Metrics for performance measurement

We use the following metrics for measuring the performance of TTP extraction methods we are comparing.

Precision: the ratio of true positives and the sum of true positives and false positives indicating the relevance of the performed classification. The higher the precision score is, the better the classifier is in finding relevant examples from a given class.
Recall: the ratio of true positives and the sum of true positives and false negatives indicating the completeness of the performed classification. The higher the recall score is, the better the classifier is in finding all examples from a given class.
F1 score: The harmonic mean of precision and recall indicating how good the classification task in terms of both precision and recall. The higher the F1 score is, the better the classifier is in both classifying relevant examples to correct classes and classifying all examples correctly belonging to the same class.
AUC score: the area under the curve of Receiver Operating Characteristics indicating the ability of a classifier to separate the true positive and true negative examples. Higher AUC score indicates better performance in classification. AUC score equalling $0.5$ denotes that the classifier is as good as a random guess only, less than $0.5$ means worse than a random guess. AUC closer to $1$ is desirable and a higher AUC score indicates the classifier is able to choose a randomly selected true positive example with higher confidence than choosing a true negative example.

Refer to caption — Figure 1: Instantiated pipeline for our comparison experiment

5.3 TTP extraction pipeline for comparison

We construct a TTP extraction pipeline for comparing the methods through five steps, as defined below.

Step 1: dataset collection In this step, we collect the dataset (Section 5.4) for the comparison experiment. The dataset contains (a) the textual description of attack procedures; and (b) ground truth for the classification task which are the corresponding attack techniques used in the description.

Step 2: text pre-processing In this step, we apply the following on the corpus: (a) pre-processed for filtering the punctuation marks and whitespaces; (b) tokenization; (c) stop word removal; (d) stemming; and (e) lemmatization.

Step 3: feature extraction The underlying methods of the selected TTP extraction work (see Section 4) varies from one another in the context of NLP techniques used and the choice of the classifiers. Hence, in this step, we implement the methods we discuss in Section 4 to extract the textual features that we would provide to classifiers.

Step 4: oversampling (optional) In this step, we apply oversampling to our dataset to mitigate the introduced bias by class imbalance in the dataset. This step is optional and the methods would be compared both with this step and without this step.

Step 5: classification In this step, we train the classifiers with a portion from our dataset and then test the classification performance with the rest of the dataset that we do not use for training purposes.

We instantiate these five steps in the following five subsections. Figure 1 shows the instantiated pipeline.

Procedure Id	Procedure description	Technique Id	Technique name
G0016	APT29 has used encoded PowerShell scripts uploaded to CozyCar installations to download and install SeaDuke. APT29 also used PowerShell to create new tasks on remote machines, identify configuration settings, evade defenses, exfiltrate data, and to execute other commands.	T1059	Command and Scripting Interpreter
S0045	Most of the strings in ADVSTORESHELL are encrypted with an XOR-based algorithm; some strings are also encrypted with 3DES and reversed. API function names are also reversed, presumably to avoid detection in memory	T1027	Obfuscated Files or Information
S0154	Cobalt Strike can conduct peer-to-peer communication over Windows named pipes encapsulated in the SMB protocol. All protocols use their standard assigned ports.	T1071	Application Layer Protocol
G007	APT28 has downloaded additional files, including by using a first-stage downloader to contact the C2 server to obtain the second-stage implant	T1105	Ingress Tool Transfer
S0449	Maze has used the "Wow64RevertWow64FsRedirection" function following attempts to delete the shadow volumes, in order to leave the system in the same state as it was prior to redirection	T1070	Indicator Removal On Host

Table 2: Example of attack procedure description and corresponding technique in our dataset

5.4 Step 1: dataset collection

We construct the dataset from the textual description of attacks procedures mentioned in the MITRE ATT&CK framework. In the MITRE ATT&CK framework, each of the tactics has a one-to-many mapping with techniques and each of these techniques has a one-to-many mapping with textual descriptions of procedures taken from real-world cybersecurity incidents described in threat reports. We choose MITRE ATT&CK framework for constructing the dataset for the following reasons:

•

Although there is an abundance of threat reports in the internet, threat reports have to be manually filtered for relevance and manual labelled to TTP for each sentence of those threat reports. On the other hand, in the ATT&CK framework, the mapping between techniques and descriptions of procedures are already present and these mappings have been performed by cybersecurity professionals and researchers. The procedure-technique mapping done by these professionals are our ground truth.
•

All studies we selected for comparison use the ATT&CK framework for the common vocabulary of attack technique names.
•

The dataset contains textual descriptions of the attack procedures listed in the ATT&CK framework. The textual description of attack procedures consists of a few sentences and has the mapping to the associated technique name.
•

The ATT&CK framework is regularly updated and maintained with the evolution of attacks, hence, we get the latest set of TTP with the latest version of the ATT&CK framework.

We use Version 9 of the ATT&CK framework [25]. The dataset contains 8,104 attack procedure descriptions and corresponding techniques used in a real world cyber attack. The dataset contains 170 techniques. However, some techniques have more than hundreds of procedure descriptions while other techniques have only one. In our study, we use techniques that have at least 30 procedure descriptions, resulting in 7,061 procedure descriptions and 64 techniques. In Table 2, we show a few examples of techniques and textual description of procedures. For each of the textual descriptions, the mapped technique name will be used as the label for the classification task.

5.5 Step 2: text pre-processing

We first remove the urls and citations. Then, we use the gensim.parsing.preprocessing from gensim Python library for removing punctuation, whitespaces, stop-words and performing tokenization, stemming and lemmatization.

5.6 Step 3: feature extraction

M:TFIDF: We compute the TFIDF vectors of the corpus. Then, we normalize the TFIDF vectors to unit length. Finally, we feed these normalized TFIDF vectors to classifiers. We use TfidfVectorizer from the scikit-learn Python package.
M:LSI: We first compute the TFIDF vectors of the corpus. Then we normalize the TFIDF vectors to unit length. Then, we apply LSI to apply dimensionality reduction of the computed vectors. Finally we feed these normalized TFIDF vectors to classifiers. We use tfidfmodel and lsimodel from gensim Python packages.
M:LSI-Co: We apply pre-processing on the corpus as well as textual description of the techniques mentioned in the ATT&CK framework. Then, we compute the TFIDF vector for each of the technique description in the ATT&CK framework followed by normalization. After that, we apply LSI to each textual description in the corpus. We then compute the cosine similarity between TFIDF vectors (of each of the techniques) and topic vectors (of each of the textual descriptions from the corpus). Finally, we use these cosine similarities as features that we will feed towards classifiers. We use tfidfmodel, lsimodel from gensim, and scipy Python packages.
M:TFIDF-NP: We use parts of speech tagging to determine the noun phrases in the corpus. Then we construct a list of all noun phrases identified in the corpus. After that, we identify the noun phrases from the list which have been found in the corpus at least once without being part of a larger noun phrase. Then we compute the TFIDF vectors of these independent noun phrases for each procedure description. Then we normalize the TFIDF vectors to unit length. Finally, we feed these normalized TFIDF vectors to classifiers. We use TfidfVectorizer from scikit-learn and spacy Python package.
M:BM25: We first apply pre-processing to the corpus and textual description of the techniques from the ATT&CK framework. We then apply dependency parsing to both the corpus and textual description of the techniques from the ATT&CK framework. After that, we then extract the (subject, verb, object) (SVO) tuples from both the corpus and textual description of techniques from the ATT&CK framework. We randomly select 100 procedure descriptions and the first two authors manually verified the tuple extraction performance. The first author finds 74% of the tuples extracted and the second author finds 88% of the tuples extracted by our implementation. Next, we construct Bag-of-words representation of the extracted (subject, verb, object) tuples from both the corpus and textual description of techniques from the ATT&CK framework. After that, we compute the similarity score using BM25 ranking method for the extracted (subject, verb, object) tuples for each corpus and (subject, verb, object) tuples for each technique from ATT&CK framework. Finally, we use these similarity scores as features that we would feed to the classifiers. We use spacy, BM25Okapi.

5.7 Step 4: oversampling (optional)

As we mention in Section 5.4, in our dataset, the techniques do not have the same amount of procedure descriptions. Some techniques have more than hundreds and some of the techniques have only 30. As a result, there are majority and minority classes in the dataset leading to class imbalance in the dataset. We use oversampling to mitigate the class imbalance problem in our dataset, and we apply a technique called SMOTE which stands for Synthetic Minority Oversampling Technique [10]. SMOTE works by choosing a random example from a minority class and then selecting the nearest $k$ neighbors of that minority class to generate more synthetic examples of that minority class. We ran all methods with and without oversampling technique to observe their performance with and without handling the class imbalance issue. As SMOTE can only apply oversampling to numeric features, we applied SMOTE on the computed features in each method, such as TFIDF vectors, similarity score. We use SMOTE Python package to implement the oversampling.

# Classes	# Descriptions
# Classes	Before oversampling	After oversampling
$n=2$	890	1,064
$n=4$	1,492	2,128
$n=8$	2,448	4,256
$n=16$	3,688	8,512
$n=32$	5,390	17,024
$n=64$	7,061	34,048

Table 3: Procedure description count for each multiclass classification setting before and after oversampling

5.8 Step 5: classification

We use six classifiers named Naive Bayes(NB), Support Vector Machine(SVM), Neural Network(NN), K-nearest Neighbor(KNN), Decision Tree(DT) and Random Forest(RF) classifiers. Collectively, these six classifiers are used by the authors of the study selected for the comparison. We use scikit-learn Python package for these classifiers and we use GaussianNB, SVC, KNeighborsClassifier, MLPClassifier, DecisionTreeClassifier, and RandomForestClassifier for NB, SVM, KNN, NN, DT, and RF classifiers respectively.

5.9 Multi-class classification settings

We mention in Section 5.4 that our dataset contains 64 technique names. Hence, in a corpus, a piece of text can be classified to one of the maximum 64 possible technique names. Thus, in this experiment, the TTP extraction can be considered a multi-class classification problem. We run each of the methods with the classifiers in these two following cases: (a) where a piece of text can be classified to one from two possible technique names, and (b) where a piece of text can be classified to one from more than two possible technique names to observe how the method performs in both cases. For case (a), we sort the technique names by the count of the corresponding procedure descriptions in the dataset, and then we select the top two techniques and their corresponding descriptions. For case (b), we we sort the technique names by the count of the corresponding procedure descriptions in the dataset, and then we select the top $n$ (here, $n$ is the number of possible classification labels) techniques and their corresponding descriptions. We run all methods in the following six different cases ( $n=2,4,8,16,32,64$ where $n$ denotes the number of class labels) which we will refer as multiclass classification settings settings. We report the procedure description count for each of the multiclass classification settings before and after oversampling in Table 3.

5.10 Cross-validation

We apply K-fold cross-validation technique to split our dataset into different sets of training and testing set. We choose K=5 and for each fold, we use $80\%$ of the dataset as training set and rest of the dataset ( $20\%$ ) as testing set.

5.11 Hyperparameters

We use the default hyperparameter settings reported in the scikit-learn package for the classifiers. For SMOTE, we choose $k=6$ for generating synthetic examples from the nearest neighbors. Finally, before executing any method, we set the np.random.RandomState property of the Python environment to $0$ to ensure the replicability of our results from execution. For determining the number of topics for LSI model, we use the coherencemodel.get_coherence() method from lsimodel to get the coherence value for num_topic from 1 to 500. We observe that the coherence value increases monotonically. Hence, for LSI models, we use the num_topic = 500.

6 Results

In this section, we report the findings for each of the RQs.

6.1 RQ1: Classification performance

We report the precision, recall, F1 score, and AUC scores for all implemented methods paired with each of the six classifiers in Table 4. Each corresponding cell in the table reports the score in $A-B(C)$ format where $A$ is the minimum observed score, $B$ is the maximum observed score, and $C$ is the arithmetic average of a method run with a classifier in all multiclass classification settings (see Section 5.9): $n=2,4,8,16,32,64$ where $n$ denotes the number of class labels in the dataset. For example, the top left cell containing $63-88(76)$ denotes the precision score of M:TFIDF method paired with KNN classifier. The minimum and maximum precision scores observed are $63$ and $88$ , respectively, from all of six possible multi-class classification settings ( $n=2,4,8,16,32,64$ ), and the average precision score observed is $76$ . We bold the cell which shows the maximum average score for each method paired with six classifiers. We report the performance score of all methods across six classifiers, all oversampling settings, and all multiclass classification settings at Table 8 in Appendix.

M	C	P	R	F	A
M:TFIDF	KNN	63-88(76)	55-84(71)	56-85(71)	88-93(91)
	NB	28-71(41)	25-71(40)	25-71(40)	62-71(64)
	SVM	77-92(87)	65-90(81)	68-90(83)	98-99(99)
	DT	58-89(77)	56-84(74)	56-84(74)	80-88(85)
	RF	72-92(85)	66-89(82)	67-89(82)	98-99(98)
	NN	74-91(86)	71-90(84)	71-90(84)	97-99(98)
M:TFIDF-NP	KNN	39-77(59)	33-74(55)	33-73(55)	79-86(84)
	NB	31-83(54)	35-76(53)	30-77(51)	76-88(82)
	SVM	42-80(63)	28-78(55)	30-77(56)	88-89(89)
	DT	48-81(68)	46-78(66)	45-77(66)	77-86(83)
	RF	58-84(75)	53-82(73)	53-81(72)	93-96(95)
	NN	57-84(73)	52-82(71)	53-81(71)	92-94(93)
M:LSI	KNN	61-84(71)	51-81(64)	53-81(65)	86-89(87)
	NB	62-72(67)	55-69(61)	40-69(59)	62-94(86)
	SVM	75-92(87)	67-90(83)	69-90(83)	98-98(99)
	DT	35-87(66)	35-86(66)	35-86(66)	71-86(81)
	RF	67-90(83)	55-86(77)	57-86(78)	95-98(97)
	NN	72-86(80)	70-85(78)	70-84(77)	82-98(94)
M:LSI-Co	KNN	32-70(49)	25-69(47)	25-69(47)	75-81(79)
	NB	28-71(44)	23-67(40)	18-68(38)	74-82(79)
	SVM	37-71(51)	21-67(41)	22-67(41)	74-87(81)
	DT	22-75(46)	22-75(46)	21-75(46)	63-75(70)
	RF	39-77(59)	31-77(56)	32-77(56)	85-90(88)
	NN	33-71(49)	29-66(45)	29-66(45)	77-88(83)
M:BM25	KNN	35-80(59)	26-78(55)	27-78(56)	77-90(84)
	NB	31-82(56)	23-82(52)	22-82(51)	79-90(85)
	SVM	55-86(74)	36-85(65)	39-85(66)	93-96(95)
	DT	28-77(55)	28-76(55)	27-76(54)	68-81(75)
	RF	42-85(67)	36-84(64)	36-84(64)	90-95(93)
	NN	47-87(72)	45-87(70)	45-87(71)	93-96(95)

Table 4: Performance of methods across six classifiers across all multiclass classification settings, without oversampling. M = Method, C = Classifier, P = Precision, R = Recall, F = F1 measure, A = AUC, unit is percentage

We also report the boxplot of F1 score of for all implemented methods paired with each of the six classifiers in Figure 2. Each of the classifiers were run using all values of $n$ . We discuss our observations from Table 4 and Figure 2.

SVM and NN classifiers work best for the M:TFIDF method. We find SVM classifier shows the best performance in precision (87) and AUC score (99). The NN classifier shows the best performance in recall (84) and F1 score (84). SVM classifier differs by 3% and 1% in recall and F1 performance respectively compared with the NN classifier. The NN classifier differs by 1% in both precision and AUC performance compared with the SVM classifier. The NB classifier performs the worst among the six classifiers. The F1 performance difference is around 40% compared to NN and SVM classifiers. From Figure 2, we observe that SVM, RF, NN classifier shows close Q1 (25th percentile) and Q3 (75th percentile) in F1 score. KNN and DT classifiers also shows close Q1 and Q3 in F1 score, however, lagging behind the SVM, RF, and NN. NB has no overlap in Q1-Q3 range with rest of the five classifiers and performs the worst than rest of the classifiers.

RF classifiers work best for M:TFIDF-NP method. We find RF classifier shows the best performance in all four metrics. NN classifier performs similar to RF classifier, however, differs by $1-2\%$ in four metrics compared to RF classifier. Moreover, NB classifier performs the worst among the six classifiers and the performance difference is around 20% in F1 score compared to RF and NN classifiers. From Figure 2, we observe that all six classifiers have mutual overlap in Q1-Q3 range and the interquartile range is bigger than that of M:TFIDF.

SVM classifiers work best for M:LSI method. We find SVM classifier shows the best performance in all four metrics. RF and NN classifier performs similar to one another, however, differs by $5\%$ in f1 score compared to SVM classifier. Moreover, NB classifier performs the worst among the six classifiers in F1 score differing by 24%. From Figure 2, we observe that RF and NN also follow SVM closely. Interquartile range varies among the six classifiers (i.e., DT and rest of the classifiers).

RF classifiers work best for M:LSI-Co method. We find RF classifier shows the best performance in all four metrics. KNN and DT classifier performs similar to one another, however, differs by $10\%$ in f1 score compared to RF classifier. Moreover, NB classifier performs the worst among the six classifiers and the performance difference is around 18% in F1 score compared to RF classifier. From Figure 2, we observe that RF is closely followed by KNN and DT. All six classifiers have mutual overlap with one another in Q1-Q3.

NN classifiers work best for M:BM25 method. We find NN classifier shows the best performance in recall, F1 and AUC metrics. However, SVM classifier performs best in precision and equal to NN in AUC score. Moreover, NB classifier performs the worst among the six classifiers in F1 score differing by 20%. From Figure 2, we observe that there are mutual overlaps among all six classifiers and the interquartile range is bigger than that of M:TFIDF and M:LSI.

Method	Metric	Oversampling		Gain(%)
Method	Metric	No	Yes	Gain(%)
M:TFIDF	F1	72	92	28
M:TFIDF	AUC	89	97	9
M:TFIDF-NP	F1	62	82	32
M:TFIDF-NP	AUC	87	95	9
M:LSI	F1	71	89	20
M:LSI	AUC	91	96	5
M:LSI-Co	F1	46	68	48
M:LSI-Co	AUC	89	97	9
M:BM25	F1	60	82	36
M:BM25	AUC	88	95	8

Table 5: F1 and AUC score of five methods with and without applying oversampling

In Figure 3, we report the boxplot of precision, recall, F1 and AUC score of the five methods run with all classifiers and classification settings. We list our observation below.

M:TFIDF and M:LSI performs best on TTP classification. We observe that in all four metrics, M:TFIDF and M:LSI are ahead of the rest three methods. However, M:TFIDF shows slightly better performance than M:LSI. M:TFIDF-NP and M:BM25 show similar median score, however, M:BM25 shows a higher interquartile range than that of M:TFIDF-NP. M:LSI-Co lags behind the rest four methods. We also observe that the interquartile ranges of M:TFIDF and M:LSI are lower than the rest three methods. Finally, in AUC score, all methods are closer than the rest three metrics.

6.2 RQ2: Effect of class imbalance mitigation

We report the average F1 and AUC scores of each method with and without oversampling run with six classifiers and six classification settings in Table 5. We list observations below.

Oversampling helps gaining more performance. We observe that in the case of both F1 and AUC scores, all methods show better performance when oversampling is applied. In case of F1 score, the oversampling improved the score by 20 to 48%. M:LSI-Co gained the most performance (48%) and M:LSI gained the least (20%). In case of AUC score, the oversampling improved the score by 5 to 9%. M:LSI gained the least (5%), and rest of the method gained 8-9%. We also observe that the gain in F1 score is much higher than that of AUC score.

After applying oversampling, order of the methods in terms of performance remains similar. Before applying oversampling, we see M:TFIDF shows the best F1 score, followed by M:LSI, M:TFIDF-NP, M:BM25, and M:LSI-CO. After applying oversampling, we observe the same order except M:TFIDF-NP and M:BM25 having a tie. Although M:BM25, M:LSI-Co, M:TFIDF-NP made the most gain in performance, these three methods still perform worse than M:TFIDF and M:LSI even after applying oversampling.

6.3 RQ3: Effect of increase in class labels

We report the boxplot of F1 and AUC scores using six classification settings ( $n=2,4,8,16,32,64$ ) of all methods run with and without applying oversampling in Figure 4, 5, 6, and 7. We list our observations from the figure below.

The F1 score monotonically decrease when oversampling was not applied. We observe from Figure 4 that all the methods’ F1 score drops strictly when $n$ increases, indicating that (i) methods perform better if the classifiers need to classify on two labels, (ii) methods show strictly decreasing order of performance when classifiers need to classify in multiclass classification, the more is the number of class labels, the less the performance is. We also observe that M:TFIDF and M:LSI show more robustness than the rest three methods when we increase the value of $n$ .

The F1 score does not monotonically increase or decrease when oversampling was applied. We observe Figure 5 that, on oversmapled data, the methods behave differently to one another when we increase the value of $n$ . M:TFIDF, M:TFIDF-NP, M:LSI, and M:BM25 show increase in performance, however, not in strict order. We also observe M:LSI-Co drops the performance from $n=2$ to $n=8$ , however, then the performance increases bettering the case: $n=2$ .

The AUC score first starts to increase and then decrease when oversampling was not applied. We observe Figure 6 that, for all methods, with the increase of $n$ , the scores first increase, reach a plateau, and then decrease. We observe that M:TFIDF shows the least variation of performance change, while M:LSI-Co shows the most variance.

The AUC score monotonically increase when oversampling was applied. We observe Figure 7 that for all methods, AUC scores strictly increase with the increase of $n$ . M:TFIDF-NP and M:LSI-Co made the most gain in AUC score. M:TFIDF demonstrates the least variance of AUC score with the increase of $n$ .

7 Discussion

We observe the following order of performance for the compared methods: M:TFIDF, M:LSI, M:TFIDF-NP, M:BM25, M:LSI-Co. The performance rank of the methods does not change over different settings. We observe, with or without applying oversampling and across all multi-class classification settings, the M:TFIDF and M:LSI methods perform better than the other three methods. These two methods use the TFIDF vector from the whole corpus as features, while the other three methods use vectors of certain parts of speeches or similarity score. These results suggest that the specific existence of noun or verbs related to TTP is likely to be the most dominant feature for extracting TTP.

Table 6 shows the maximum observed score from our implemented methods and the reported maximum performance score of the corresponding publications [29, 28, 5, 21, 14]. We observe similarity in the performance score of methods and corresponding studies. M:LSI and M:TFIDF shows 90% F1 score while their corresponding work $S_{1}$ , and $S_{4}$ shows F1 score of 96% and accuracy score of 86% respectively. M:LSI-Co is the least performing method from our observation and the reported score of the corresponding work $S_{6}$ is also associated with the least F1 score. In the case of M:BM25, we observe 87% F1 score and the reported F1 score in the corresponding work $S_{9}$ is 86%.

Precision and recall performance do not vary by a large margin. We observe that precision and recall performance of each row in Table 4 do not have more than 9% difference, which is observed in the case of M:BM25 paired with SVM classifier. Our observations suggest that the methods offer a similar proportion of relevance and completeness while classifying TTP from a given text.

We observe that AUC scores of all methods are higher than F1 score of each method, indicating that (a) there could be a certain set of classification hyperparameters that would make the classification performance better; and (b) there could be bias in the dataset which impacts the precision and recall score.

We implement SVO tuples for comparing M:BM25 method, where we observe that our implementation for extracting SVO tuples is not perfect as the implementation cannot completely extract all words of objects in the case of: (i) when the object consists of multiple nouns; and (b) when the sentence is complex or written in passive voice. Improving the SVO tuple extraction, converting the passive voice to active voice, and breaking down a complex sentence to multiple simple sentences could make the performance better.

Study	Reported Score(F1)	Implementation	Observed Score(F1)
$S_{1}$	96	M:LSI	90
$S_{2}$	-	M:TFIDF-NP	81
$S_{4}$	86*	M:TFIDF	90
$S_{6}$	57	M:LSI-Co	77
$S_{9}$	86	M:BM25	87
*: the reported score in $S_{4}$ is accuracy
-: no performance score is reported in $S_{2}$

Table 6: Reported and observed performance score of compared methods

8 Future research direction

We advocate cybersecurity researchers and practitioners to investigate the performance of the discussed methods on applying a large corpus of threat reports. Moreover, we recommend establishing a benchmark dataset of large threat report corpus for conducting such experiments. Moreover, using NER, patterns of verb and noun co-occurrence, and regular expression of specific IoCs as features could have made the classification performance better. We recommend cybersecurity researchers investigate the performance benefits of incorporating these features while extracting TTP. As we discussed in Section 7, existence of specific TTP related nouns and verbs could be the most dominant feature for TTP classification, and hence, we recommend researchers investigate the best performing textual features for the classification task through feature selection and ranking techniques. We run six classifiers with their default parameters, researchers can investigate the optimal set of hyperparameter [39] for each classifier for the best classification performance. Finally, classifying texts to corresponding tactics, procedures and applying multi-label classification (a sentence can be related to more than one TTP) could also be investigated.

9 Threats to validity

We report the limitations we identify in this study. We did not compare the methods on corpus constructed from large threat reports. We also assume that one sentence is associated with only one corresponding attack technique. However, in threat reports, there could be sentences that might have more than one corresponding technique which can be classified with multi-label classifiers. For classification tasks, the dataset contains 64 technique names and hence, the classification performance for rest of the TTP is not evaluated. We also do not perform hyperparameter tuning for individual classifiers, which would have led to better performance. We also applied SMOTE technique to generate synthetic samples from numeric textual features, which might have introduced bias in the dataset. We did not perform a comparison study of all ten identified work and we also did not compare the five methods with tools from the industry such as [3]. Finally, we implement the underlying methods from the study with our best effort, however there could be additional bias introduced along with our implementation.

10 Related work

Rahman et al. [33] performed a systematic literature review on threat intelligence extraction from unstructured texts where they found 34 relevant studies and 8 data sources. The authors later surveyed 64 related studies in the extraction of threat intelligence from unstructured text where they identified ten CTI extraction goals [34]. The authors proposed a generic pipeline for CTI extraction and surveyed the NLP and ML techniques utilized for extraction. Bridges et al. [8] performed a comparison study of prior work [9, 17, 16] on cybersecurity entity extraction techniques. The authors used online blog articles on cybersecurity, National Vulnerability Database, and Common Vulnerabilities and Exposure (CVE) databases as corpus. The authors reported low recall of the compared methods and lack of published dataset. Our work differs from this work from Bridges et al. as our work compares the studies on TTP extraction while Bridges et al. compared the cybersecurity entity extraction from text. Tounsi et al. [44] defined the four categories of CTI and discussed technical CTI, existing issues, emerging research, and trends in CTI domain. They also compared the features of existing CTI gathering and sharing tools. Wagner et al. [47] evaluated the technical and nontechnical challenges in state-of-the-art CTI sharing systems. Sauerwein et al. [38] studied 22 CTI-sharing platforms enabling automation of the generation, refinement, and examination of security data. Tuma et al. [45] conducted a systematic literature review on 26 methodologies on applicability, outcome, and ease of access of cyberthreat analysis. While these work focuses on CTI extraction and technical aspects of CTI tools, in our work, we focus on comparing TTP extraction methods to determine what method performs best and what are the areas for further enhancement.

11 Conclusion

In this work, we compare the underlying methods of five existing TTP extraction work [29, 28, 5, 21, 14] and the corresponding implementations are M:LSI, M:TFIDF-NP, M:TFIDF, M:LIS-Co, and M:BM25, respectively. We compared these methods on the performance of classifying textual descriptions of procedures to the corresponding technique names given in the ATT&CK framework. From our experiment, we observe that: (a) M:TFIDF and M:LSI perform best in TTP classification from text; (b) performance of the methods drops when we increase the class labels in the classification task; and (c) oversampling improves the performance by mitigating the bias introduced by majority and minority classes in the dataset. We recommend (i) constructing an agreed-upon benchmark dataset; (ii) investigating all TTP extraction work from the literature and industry on large corpus dataset, and (iii) selecting optimal features for extracting TTPs from text. Cybersecurity researchers can use our work as a baseline for comparing and testing future TTP extraction methods.

Availability

The dataset and source code of the implemented methods are available to download at this Github repository: [4].

References

[1] Programmable search engine. https://cse.google.com/cse?cx=003248445720253387346:turlh5vi4xc, 2016. [Online; accessed 1-December-2020].
[2] The MITRE Corporation. https://www.mitre.org/, 2021. [Online; accessed 14-May-2021].
[3] Threat report - ATT&CK mapping. https://github.com/center-for-threat-informed-defense/tram, 2021. [Online; accessed 14-May-2021].
[4] usenix-22-ttps-extraction-comparison. https://github.com/brokenquark/usenix-22-ttps-extraction-comparison/tree/main Accessed 4 Oct 2021, 2021.
[5] Gbadebo Ayoade, Swarup Chandra, Latifur Khan, Kevin Hamlen, and Bhavani Thuraisingham. Automated threat report classification over multi-source data. In 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), pages 236–245. IEEE, 2018.
[6] Kiran Bandla. APTnotes. https://github.com/aptnotes/data, 2016. [Online; accessed 1-December-2020].
[7] FireEye Threat Research Blog. Highly Evasive Attacker Leverages SolarWinds Supply Chain to Compromise Multiple Global Victims With SUNBURST Backdoors. https://www.fireeye.com/blog/threat-research/2020/12/evasive-attacker-leverages-solarwinds-supply-chain-compromises-with-sunburst-backdoor.html, 2020. [Online; accessed 22-August-2020].
[8] Robert A Bridges, Kelly MT Huffer, Corinne L Jones, Michael D Iannacone, and John R Goodall. Cybersecurity automated information extraction techniques: Drawbacks of current methods, and enhanced extractors. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 437–442. IEEE, 2017.
[9] Robert A Bridges, Corinne L Jones, Michael D Iannacone, Kelly M Testa, and John R Goodall. Automatic labeling for entity extraction in cyber security. arXiv preprint arXiv:1308.4941, 2013.
[10] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
[11] Yumna Ghazi, Zahid Anwar, Rafia Mumtaz, Shahzad Saleem, and Ali Tahir. A supervised machine learning based approach for automatically extracting high-level threat intelligence from unstructured sources. In 2018 International Conference on Frontiers of Information Technology (FIT), pages 129–134. IEEE, 2018.
[12] Google. Android api reference. https://developer.android.com/reference Accessed 4 Oct 2021, 2021.
[13] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Sch0lkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
[14] Ghaith Husari, Ehab Al-Shaer, Mohiuddin Ahmed, Bill Chu, and Xi Niu. Ttpdrill: Automatic and accurate extraction of threat actions from unstructured text of cti sources. In Proceedings of the 33rd Annual Computer Security Applications Conference, pages 103–115, 2017.
[15] Ghaith Husari, Xi Niu, Bill Chu, and Ehab Al-Shaer. Using entropy and mutual information to extract threat actions from cyber threat intelligence. In 2018 IEEE International Conference on Intelligence and Security Informatics (ISI), pages 1–6. IEEE, 2018.
[16] Corinne L Jones, Robert A Bridges, Kelly MT Huffer, and John R Goodall. Towards a relation extraction framework for cyber-security concepts. In Proceedings of the 10th Annual Cyber and Information Security Research Conference, pages 1–4, 2015.
[17] Arnav Joshi, Ravendar Lal, Tim Finin, and Anupam Joshi. Extracting cybersecurity related linked data from text. In 2013 IEEE Seventh International Conference on Semantic Computing, pages 252–259. IEEE, 2013.
[18] Joseph R. Biden Jr. Executive order on improving the nations cybersecurity. https://www.whitehouse.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/ Accessed 4 Oct 2021, 2021.
[19] Swati Khandelwal. New Group of Hackers Targeting Businesses with Financially Motivated Cyber Attacks. https://thehackernews.com/2019/11/financial-cyberattacks.html, 2019. [Online; accessed 22-August-2020].
[20] Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284, 1998.
[21] Mengming Li, Rongfeng Zheng, Liang Liu, and Pin Yang. Extraction of threat actions from threat-related articles using multi-label machine learning classification method. In 2019 2nd International Conference on Safety Produce Informatization (IICSPI), pages 428–431. IEEE, 2019.
[22] Rob McMillan. Definition: threat intelligence. Gartner. com, 2013.
[23] MITRE. Software Discovery: Security Software Discovery. https://attack.mitre.org/techniques/T1518/001/, 2020. [Online; accessed 22-August-2020].
[24] MITRE. MITRE ATT&CK. https://attack.mitre.org/, 2021. [Online; accessed 14-May-2021].
[25] MITRE. MITRE ATT&CK. https://attack.mitre.org/versions/v9/resources/working-with-attack/, 2021. [Online; accessed 14-May-2021].
[26] Angela Moon. State-sponsored cyberattacks on banks on the rise: report. https://www.reuters.com/article/us-cyber-banks/state-sponsored-cyberattacks-on-banks-on-the-rise-report-idUSKCN1R32NJl, 2019. [Online; accessed 22-August-2020].
[27] BBC News. Cyber-crime: Irish health system targeted twice by hackers. https://www.bbc.com/news/world-europe-57134916 Accessed 4 Oct 2021, 2021.
[28] Amirreza Niakanlahiji, Jinpeng Wei, and Bei-Tseng Chu. A natural language processing based trend analysis of advanced persistent threat techniques. In 2018 IEEE International Conference on Big Data (Big Data), pages 2995–3000. IEEE, 2018.
[29] Umara Noor, Zahid Anwar, Tehmina Amjad, and Kim-Kwang Raymond Choo. A machine learning-based fintech cyber threat attribution framework using high-level indicators of compromise. Future Generation Computer Systems, 96:227–242, 2019.
[30] OASIS. Introduction to stix. https://oasis-open.github.io/cti-documentation/stix/intro.html Accessed 4 Oct 2021, 2021.
[31] Ericka Pingol. Meat supply giant jbs suffers cyberattack. https://www.trendmicro.com/en_us/research/21/f/meat-supply-giant-jbs-suffers-cyberattack.html Accessed 4 Oct 2021, 2021.
[32] Aritran Piplai, Sudip Mittal, Anupam Joshi, Tim Finin, James Holt, and Richard Zak. Creating cybersecurity knowledge graphs from malware after action reports. IEEE Access, 8:211691–211703, 2020.
[33] Md Rayhanur Rahman, Rezvan Mahdavi-Hezaveh, and Laurie Williams. A literature review on mining cyberthreat intelligence from unstructured texts. In 2020 International Conference on Data Mining Workshops (ICDMW), pages 516–525. IEEE, 2020.
[34] Md Rayhanur Rahman, Rezvan Mahdavi-Hezaveh, and Laurie Williams. What are the attackers doing now? automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey. arXiv preprint arXiv:2109.06808, 2021.
[35] Roshni R Ramnani, Karthik Shivaram, and Shubhashis Sengupta. Semi-automated information extraction from unstructured threat advisories. In Proceedings of the 10th Innovations in Software Engineering Conference, pages 181–187, 2017.
[36] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333–359, 2011.
[37] Carl Sabottke, Octavian Suciu, and Tudor Dumitra\textcommabelows. Vulnerability disclosure in the age of social media: Exploiting twitter for predicting real-world exploits. In 24th USENIX Security Symposium (USENIX Security 15), pages 1041–1056, 2015.
[38] Clemens Sauerwein, Christian Sillaber, Andrea Mussmann, and Ruth Breu. Threat Intelligence Sharing Platforms: An Exploratory Study of Software Vendors and Research Perspectives. In Jan Marco Leimeister and Walter Brenner, editors, Proceedings of the 13th International Conference on Wirtschaftsinformatik (WI 2017), pages 837–851, 2017.
[39] Rui Shu, Tianpei Xia, Jianfeng Chen, Laurie Williams, and Tim Menzies. How to better distinguish security bug reports (using dual hyperparameter optimization). Empirical Software Engineering, 26(3):1–37, 2021.
[40] Sonatype. 2021 state of the software supply chain. https://www.sonatype.com/resources/state-of-the-software-supply-chain-2021 Accessed 4 Oct 2021, 2021.
[41] Newton Spolaor, Everton Alvares Cherman, Maria Carolina Monard, and Huei Diana Lee. A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292:135–151, 2013.
[42] DNS Stuff. What is Threat Intelligence. https://www.dnsstuff.com/what-is-threat-intelligence, 2020. [Online; accessed 22-August-2020].
[43] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bunau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
[44] Wiem Tounsi and Helmi Rais. A survey on technical threat intelligence in the age of sophisticated cyber attacks. Computers & security, 72:212–233, 2018.
[45] Katja Tuma, Gul Çalikli, and Riccardo Scandariato. Threat analysis of software systems: A systematic literature review. Journal of Systems and Software, 144:275–294, 2018.
[46] William Turton and Kartikay Mehrotra. Hackers breached colonial pipeline using compromised password. https://www.bloomberg.com/news/articles/2021-06-04/hackers-breached-colonial-pipeline-using-compromised-password Accessed 4 Oct 2021, 2021.
[47] Thomas D Wagner, Khaled Mahbub, Esther Palomar, and Ali E Abdallah. Cyber threat intelligence sharing: Survey and research directions. Computers & Security, 87:101589, 2019.
[48] Ziyun Zhu and Tudor Dumitraş. Featuresmith: Automatically engineering features for malware detection by mining the security literature. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 767–778, 2016.

Appendix

Id	Publication
$S_{1}$ [29]	Noor, Umara, Zahid Anwar, Tehmina Amjad, and Kim-Kwang Raymond Choo. "A machine learning-based FinTech cyberthreat attribution framework using high-level indicators of compromise." Future Generation Computer Systems 96 (2019): 227-242.
$S_{2}$ [28]	Niakanlahiji, Amirreza, Jinpeng Wei, and Bei-Tseng Chu. "A natural language processing based trend analysis of advanced persistent threat techniques." In 2018 IEEE International Conference on Big Data (Big Data), pp. 2995-3000. IEEE, 2018.
$S_{3}$ [11]	Ghazi, Yumna, Zahid Anwar, Rafia Mumtaz, Shahzad Saleem, and Ali Tahir. "A supervised machine learning based approach for automatically extracting high-level threat intelligence from unstructured sources." In 2018 International Conference on Frontiers of Information Technology (FIT), pp. 129-134. IEEE, 2018.
$S_{4}$ [5]	Ayoade, Gbadebo, Swarup Chandra, Latifur Khan, Kevin Hamlen, and Bhavani Thuraisingham. "Automated threat report classification over multi-source data." In 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), pp. 236-245. IEEE, 2018.
$S_{5}$ [32]	Piplai, Aritran, Sudip Mittal, Anupam Joshi, Tim Finin, James Holt, and Richard Zak. "Creating cybersecurity knowledge graphs from malware after action reports." IEEE Access 8 (2020): 211691-211703.
$S_{6}$ [21]	Li, Mengming, Rongfeng Zheng, Liang Liu, and Pin Yang. "Extraction of Threat Actions from Threat-related Articles using Multi-Label Machine Learning Classification Method." In 2019 2nd International Conference on Safety Produce Informatization (IICSPI), pp. 428-431. IEEE, 2019.
$S_{7}$ [48]	Zhu, Ziyun, and Tudor Dumitraş. "Featuresmith: Automatically engineering features for malware detection by mining the security literature." In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 767-778. 2016.
$S_{8}$ [35]	Ramnani, Roshni R., Karthik Shivaram, and Shubhashis Sengupta. "Semi-automated information extraction from unstructured threat advisories." In Proceedings of the 10th Innovations in Software Engineering Conference, pp. 181-187. 2017.
$S_{9}$ [14]	Husari, Ghaith, Ehab Al-Shaer, Mohiuddin Ahmed, Bill Chu, and Xi Niu. "Ttpdrill: Automatic and accurate extraction of threat actions from unstructured text of cti sources." In Proceedings of the 33rd Annual Computer Security Applications Conference, pp. 103-115. 2017.
$S_{10}$ [15]	Husari, Ghaith, Xi Niu, Bill Chu, and Ehab Al-Shaer. "Using entropy and mutual information to extract threat actions from cyber threat intelligence." In 2018 IEEE International Conference on Intelligence and Security Informatics (ISI), pp. 1-6. IEEE, 2018.

Table 7: List of studies in the candidate set for the comparison

Table 8: Performance score of all the methods across six classifiers, six multiclass classification settings, and oversampling settings

M	C	OS	n = 2				n = 4				n = 8				n = 16				n = 32				n = 64
M	C	OS	P	R	F	A	P	R	F	A	P	R	F	A	P	R	F	A	P	R	F	A	P	R	F	A
M:TFIDF	KNN	No	88	84	85	93	80	76	77	92	79	76	76	92	76	69	70	91	70	63	64	89	63	55	56	88
	KNN	Yes	85	79	78	91	83	80	78	94	86	85	83	97	88	88	87	98	91	91	90	99	94	94	94	99
	NB	No	71	71	71	71	41	42	41	62	40	39	39	64	34	33	32	63	31	30	29	63	28	25	25	62
	NB	Yes	81	79	79	80	74	74	73	84	80	80	79	90	85	85	83	93	88	88	87	95	92	92	91	97
	SVM	No	92	90	90	99	90	86	86	99	91	86	87	99	90	84	86	99	84	76	78	99	77	65	68	98
	SVM	Yes	96	96	96	99	96	96	96	100	97	97	97	100	98	97	98	100	98	98	98	100	99	99	99	100
	DT	No	89	84	84	84	86	81	81	87	81	80	79	88	78	77	76	87	69	68	67	84	58	56	56	80
	DT	Yes	93	93	93	93	93	92	92	95	91	91	91	95	92	92	92	96	92	92	92	96	93	93	93	96
	RF	No	92	89	89	99	90	89	89	98	88	88	87	99	87	85	85	98	80	77	77	98	72	66	67	98
	RF	Yes	95	95	95	99	95	95	95	99	95	95	95	100	96	96	96	100	97	97	97	100	98	98	98	100
	NN	No	91	90	90	97	91	89	90	98	90	89	89	99	87	87	86	99	80	79	79	99	74	71	71	98
	NN	Yes	95	95	95	99	95	95	95	100	97	96	96	100	98	98	97	100	98	98	98	100	99	99	99	100
M:TFIDF-NP	KNN	No	77	74	73	85	71	70	68	86	63	60	60	85	56	51	52	84	50	44	44	82	39	33	33	79
	KNN	Yes	82	81	81	89	75	74	74	91	76	75	75	94	77	77	77	95	80	80	79	97	82	82	82	97
	NB	No	83	76	77	88	60	58	56	80	58	57	55	83	49	52	47	82	43	42	39	80	31	35	30	76
	NB	Yes	84	81	81	92	71	66	65	86	77	74	73	93	77	74	73	95	79	74	74	97	81	78	77	98
	SVM	No	80	78	77	89	71	70	70	88	66	60	60	88	62	52	53	89	55	42	44	89	42	28	30	88
	SVM	Yes	84	83	83	91	77	77	77	92	70	68	68	93	69	66	66	96	68	65	65	97	72	67	67	98
	DT	No	81	78	77	86	76	76	75	85	75	74	73	85	68	66	66	83	60	58	58	80	48	46	45	77
	DT	Yes	88	87	87	91	84	84	83	90	86	85	85	93	87	86	86	93	88	88	88	94	89	89	89	95
	RF	No	84	81	80	93	82	82	81	94	81	80	79	96	77	74	74	96	69	65	65	96	58	53	53	94
	RF	Yes	89	88	88	95	88	88	88	97	90	90	90	99	92	91	91	99	94	94	94	100	95	95	95	100
	NN	No	84	82	81	92	80	80	79	93	77	77	76	93	74	72	73	94	67	64	64	94	57	52	53	93
	NN	Yes	88	87	87	95	87	87	87	96	89	88	88	98	90	90	90	99	92	92	92	99	93	93	93	100
M:LSI	KNN	No	84	81	81	89	71	67	67	86	70	66	66	87	70	61	63	87	67	59	61	87	61	51	53	86
	KNN	Yes	82	76	75	87	77	75	75	91	81	80	79	95	86	85	85	97	90	90	90	98	94	94	93	100
	NB	No	68	56	40	62	65	55	55	81	69	67	67	91	72	69	69	93	68	64	64	94	62	56	57	94
	NB	Yes	75	58	48	65	74	68	68	89	82	79	79	96	84	80	81	98	86	81	83	99	86	82	83	99
	SVM	No	92	90	90	98	91	87	87	99	91	87	88	99	90	86	87	99	83	78	79	99	75	67	69	98
	SVM	Yes	96	96	96	99	96	96	96	100	97	97	97	100	98	98	98	100	98	98	98	100	98	98	98	100
	DT	No	87	86	86	86	83	83	83	88	76	75	75	85	66	66	65	82	51	51	50	76	35	35	35	71
	DT	Yes	91	90	90	90	89	89	89	93	87	87	87	92	86	86	86	93	85	85	85	92	84	84	84	92
	RF	No	90	86	86	96	90	86	87	97	88	85	85	98	85	80	81	97	78	70	72	97	67	55	57	95
	RF	Yes	95	95	95	99	94	94	94	99	95	95	95	100	96	96	96	100	97	97	97	100	98	98	98	100
	NN	No	77	72	72	82	78	77	77	92	85	85	83	97	86	85	84	98	79	77	77	98	72	70	70	98
	NN	Yes	91	89	90	96	91	91	90	98	96	96	96	100	97	97	97	100	98	98	97	100	98	98	98	100
M:LSI-Co	KNN	No	70	69	69	79	61	61	61	81	50	50	49	80	45	41	41	80	38	35	34	78	32	25	25	75
	KNN	Yes	79	79	79	85	72	72	71	89	67	67	67	91	72	72	71	95	81	81	80	97	88	88	88	98
	NB	No	71	67	68	77	52	46	46	74	41	40	38	77	39	37	35	80	33	26	25	82	28	23	18	82
	NB	Yes	71	68	67	77	56	54	54	78	41	41	38	79	44	40	40	86	45	34	35	88	47	37	36	92
	SVM	No	71	67	67	74	54	51	49	76	47	43	42	80	51	36	36	85	44	30	31	87	37	21	22	86
	SVM	Yes	71	69	68	77	61	59	58	82	57	55	55	87	62	57	58	93	74	71	71	97	85	83	84	99
	DT	No	75	75	75	75	62	63	62	75	50	50	50	72	40	39	39	69	28	28	28	65	22	22	21	63
	DT	Yes	79	79	79	79	72	72	71	82	66	66	66	81	68	68	68	83	71	71	71	85	76	76	76	88
	RF	No	77	77	77	85	69	68	68	87	64	61	62	89	57	53	53	89	50	44	44	90	39	31	32	89
	RF	Yes	82	82	82	89	78	78	78	93	80	80	79	96	83	83	83	98	89	89	89	99	94	94	94	100
	NN	No	71	66	66	77	56	54	53	77	48	45	45	81	48	41	41	86	40	37	37	88	33	29	29	88
	NN	Yes	68	67	67	78	58	58	57	82	53	53	52	85	61	60	60	93	72	72	71	97	85	85	84	99
M:BM25	KNN	No	80	78	78	87	74	72	72	90	68	64	65	88	54	51	52	84	44	38	39	80	35	26	27	77
	KNN	Yes	83	82	82	91	80	79	79	94	80	80	80	96	82	82	81	97	84	84	83	98	89	89	88	99
	NB	No	82	82	82	90	67	68	65	87	62	58	57	87	51	47	45	85	41	35	34	84	31	23	22	79
	NB	Yes	83	83	83	90	71	69	67	89	68	62	62	90	60	53	53	89	55	45	47	90	52	40	41	88
	SVM	No	86	85	85	94	84	79	79	96	79	74	75	96	74	63	66	95	67	51	53	94	55	36	39	93
	SVM	Yes	89	89	89	96	89	88	89	98	89	89	89	99	87	86	86	99	88	87	87	100	91	91	91	100
	DT	No	77	76	76	78	73	72	71	81	61	61	60	78	51	51	50	75	41	41	41	71	28	28	27	68
	DT	Yes	84	83	83	85	81	80	81	88	78	78	78	88	76	76	76	88	76	76	76	88	78	78	78	89
	RF	No	85	84	84	93	81	77	78	95	74	72	72	94	65	62	63	93	55	50	51	92	42	36	36	90
	RF	Yes	89	89	89	96	88	88	88	97	88	88	88	98	89	89	89	99	90	90	90	99	93	93	93	100
	NN	No	87	87	87	95	86	83	84	96	80	79	79	96	71	70	70	95	60	58	58	94	47	45	45	93
	NN	Yes	92	92	92	97	92	91	91	98	91	90	90	99	92	91	91	99	92	92	92	99	93	93	93	100