This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Meta-Path-based Probabilistic Soft Logic for Drug-Target Interaction Prediction

Shengming Zhang shengming.zhang@rutgers.edu Rutgers UniversityNewarkNew JerseyUSA  and  Yizhou Sun yzsun@cs.ucla.edu University of California, Los AngelesLos AngelesCaliforniaUSA
Abstract.

Drug-target interaction (DTI) prediction, which aims at predicting whether a drug will be bounded to a target, have received wide attention recently, with the goal to automate and accelerate the costly process of drug design. Most of the recently proposed methods use single drug-drug similarity and target-target similarity information for DTI prediction, which are unable to take advantage of the abundant information regarding various types of similarities between them. Very recently, some methods are proposed to leverage multi-similarity information, however, they still lack the ability to take into consideration the rich topological information of all sorts of knowledge bases where the drugs and targets reside in. More importantly, the time consumption of these approaches is very high, which prevents the usage of large-scale network information. We thus propose a network-based drug-target interaction prediction approach, which applies probabilistic soft logic (PSL) to meta-paths on a heterogeneous network that contains multiple sources of information, including drug-drug similarities, target-target similarities, drug-target interactions, and other potential information. Our approach is based on the PSL graphical model and uses meta-path counts instead of path instances to reduce the number of rule instances of PSL. We compare our model against five methods, on three open-source datasets. The experimental results show that our approach outperforms all the five baselines in terms of AUPR score and AUC score.

Drug target interaction predictions, Data mining, Meta-path, Probabilistic soft logic, Convex optimization
ccs: Applied computing Bioinformaticsccs: Information systems Association rulesccs: Mathematics of computing Markov networksccs: Theory of computation Convex optimization

1. Introduction

New drug development is usually a very time consuming and expensive procedure. Recently, computer aided drug design has received wide attention, with the goal to accelerate drug design. Among them, the study of predicting unknown drug-target interactions based on existing domain-specific knowledge using mathematical models becomes an area of growing interest (Chen et al., 2016). By quantitatively expressing the similarity between drug-drug and target-target, one can find a mathematical relationship between drugs and targets, which could help to predict potential interactions between existing drugs and unknown targets, or vice versa (Fourches et al., 2010).

There are several existing methods to model the drug-target interaction prediction (Ding et al., 2014) task, most of which apply a network-based representation (Yildirim et al., 2007): (Fakhraei et al., 2014) constructs a bipartite interaction network which has two types of nodes: drug nodes and target nodes. There can be edges between two drug nodes or two target nodes, denoting as similarity information. Edges between drug nodes and target nodes represent interaction information. However, such bipartite interaction network constrains the type of nodes within two, and is unable to add additional measures.

In addition to a bipartite interaction network, (Chen et al., 2015) constructs a heterogeneous internet that could directly integrate richer domain-specific knowledge into the network, such as drug-drug/target-target interaction information, drug-cure-disease and disease-caused-by-target information etc. Instead of transferring these information into similarities using standard measures (such as the Jaccard and Spearman indexes), the DTI prediction task can be solved using the information directly extracted out of the heterogeneous network.

Using link prediction methods can predict potential interactions within a network, which is proposed both in (Getoor and Diehl, 2005) and (Lü and Zhou, 2011). However, some link prediction methods ignore the inner relationship between different semantic similarity information (Fu et al., 2016) and other domain-specific knowledge, while some methods tend to take all the detail into consideration to achieve good results, but the time consumption becomes very heavy (Fakhraei et al., 2014; Cichonska et al., 2018; Luo et al., 2017).

In this paper we present a drug-target interaction prediction method based on probabilistic soft logic (PSL) (Kimmig et al., 2012). We predict unknown drug-target interactions using multi-relational information and existing interactions of the network. In order to avoid grounding out all the rule instances that could significantly slow down the inference process like the original PSL model, We apply summated meta-path (Fu et al., 2016) which defines several semantic meta-paths and uses matrix multiplications to calculate the path counts, storing as commuting matrices. We apply a Bayesian probabilistic approach that transfers the path counts into probabilities, indicating probabilities for the body parts of PSL rules (Bach et al., 2017), then apply the PSL model for the DTI prediction. Our summated meta-path PSL model outperforms all the five baselines (Fakhraei et al., 2014; Fu et al., 2016; Cichonska et al., 2018; Luo et al., 2017) in both AUPR score and AUC score on three open-sourced datasets together with significant time-consumption reductions.

In this paper, we define several semantic meta-paths and use matrix multiplications to generate commuting matrices corresponding to each semantic meta-path, which is similar to (Fu et al., 2016). Afterwards, we apply a Bayesian probabilistic approach that transfers the path-counts of the commuting matrices into probabilities. Then we could define several PSL rules. The first type of the PSL rules is: ”Each semantic meta-path metric may imply potential interactions”; the second type of PSL rules is: ”By default, drug and target does not interacts with each other”. The first type of rules corresponds with the pre-defined semantic meta-paths, and the second type of rules is a negative prior. Different with the original PSL model, for each drug-target pair, we only have one rule instance within each PSL rule, since we applied summation using meta-path counts on all the existing rule instances. For each drug-target pair, we could treat the probabilities based on different semantic meta-paths as the body part of the PSL rule instances. For more details about the probabilistic soft logic (PSL) please refer to (Bach et al., 2017, 2013).

Our main contribution and novelty is that we figure out the shortcuts in both Meta-path method and PSL method and provide a perfect solution: Although the Meta-path approach proposes a robust path count topological feature, it does not give the feature a comprehensible meaning, only treating them as vector feature; the PSL model is a good probabilistic graphical model, but it tends to ground out every single rule instances and take them all into consideration, which results in heavy time consumption. By using the PSL model, the meta-path count topological feature can get a probabilistic comprehension, and by using path count summation strategy, the total number of rule instances for PSL significantly reduces so that it could be used for much larger datasets.

Based on the new settings, We implement a new DTI prediction framework, and compare our model with other five multi-similarity or internet-based approaches (Fakhraei et al., 2014; Fu et al., 2016; Cichonska et al., 2018; Luo et al., 2017) on three open-sourced datasets used by (Fakhraei et al., 2014; Fu et al., 2016; Luo et al., 2017) ((Fu et al., 2016) proposes two methods). The experimental results indicate that our model significantly reduces the running time as well as outperforming all five baseline models on all three datasets in AUC score and AUPR score.

2. Related Work

There are a number of Network structure-based methods for drug-target interaction predictions. (Cockell et al., 2010) constructs a network structure metagraph to organize drug, target, protein and genes showing their relationships. They point out that drugs with resemble structures can behave similarly in interactions. Based on a network structure, (Cheng et al., 2012) treats the DTI prediction problem as an inference problem on a drug-target bipartite network which integrates chemical drug-drug similarities, sequencial target-target similarities and known drug-target interactions.

(Yamanishi et al., 2008) integrates compound structure similarities, protein sequence similarities and several open-sourced drug-target interaction datasets in a network, generating the Gold standard datasets which characterizes the drug-target interaction network into four classes: Enzyme, Ion channel, GPCR and Nuclear receptor. More recently, there are a number of matrix factorization drug-target interaction prediction methods (Liu et al., 2016; Ezzat et al., 2017b) which give state-of-the-art results on the golden standard dataset generated by (Yamanishi et al., 2008).

In order to take multiple similarities and additional domain-specific measures into consideration, (Luo et al., 2017) uses a dimensionality reduction scheme, diffusion component analysis (DCA), to obtain informative feature representations based on relational properties, association information and topological context of each drug and target within a heterogeneous network. After generating the feature representations, they use a learned projection matrix that best projects the drug feature into protein space so that the distance between the projected feature vectors from drug space to target space and the interacting proteins is minimized.

(Chen et al., 2012) uses advanced topological features such as distance and number of shortest paths between node pairs for drug-target predictions. Furthermore, (Fu et al., 2016) uses semantic meta-path topological features and apply SVM and Random Forest algorithm as classifier on a network integrated with multiple objects, including compound, protein, disease and gene etc.

(Cichonska et al., 2018) uses a kernel-based approach that first learns a corresponding weight for each similarity measure, then calculates a corresponding weight for each pairwise kernel. By detecting high ranking values within the weighted summed kernel matrix for the unknown drug-target pairs, they could be considered as new interactions. The dataset provided by this paper is a drug bioactivity prediction to cancer cells dataset, which labels are numerical bioactivity measures. As a result, it is not proper to be used as a comparison dataset for DTI prediction use.

Instead of using topological features, (Perlman et al., 2011) combines multiple sources of drug-drug similarity and target-target similarity data together into features, using logistic regression as classifier. (Fakhraei et al., 2014) extracts a subset of drugs and targets from the dataset created by (Perlman et al., 2011), formulating the DTI prediction as an inference problem on a network and use probabilistic soft logic (PSL) framework for inference.

There are also non-similarity based approaches that takes pre-calculated drug and target features out of raw descriptors as the input of their models (Ezzat et al., 2016, 2017a). Furthermore, (Wen et al., 2017) uses deep learning based approaches for DTI prediction that takes the raw drug and target descriptors as input and train their model among interactions of all the FDA approved drugs and targets. Since the input of datasets provided by these approaches are descriptors, the datasets are not proper to be used for comparison experiments using our multi-similarity and internet-based approaches.

3. Preliminaries

3.1. Meta-path Count Topological Feature

A semantic meta-path defines a certain type of paths linking the starting and ending objects. The total number of path instances of one semantic meta-path can be treated as a topological feature which evaluates the strength of associations between the starting and ending objects and is also called path count (Fu et al., 2016). In the DTI prediction task, a meta-path starts from a drug, and ends with a target, meaning there is one valid meta-path instance between the drug and target.

For instance, we can define two types of semantic meta-paths:

(A) DrugsimilartoDruginteractsTargetDrug\operatorname*{\longrightarrow}\limits^{similar\ to}Drug\operatorname*{\longrightarrow}\limits^{interacts}Target
(B) DruginteractsTargetsimilartoTargetDrug\operatorname*{\longrightarrow}\limits^{interacts}Target\operatorname*{\longrightarrow}\limits^{similar\ to}Target

Meta-path (A) indicates that if a drug interacts with a target while this drug has similarity with another drug, then the other drug is also likely to interact with this target. Meta-path (B) has a similar interpretation.

For each drug-target interaction pair, we first calculate the corresponding path counts within the dataset under each semantic meta-path, then concatenate the numbers into a topological feature vector, each dimension denoting as the path count value of one semantic meta-path.

3.2. Hinge-loss Markov Random Fields and Probabilistic Soft Logic

Probabilistic soft logic (PSL) uses first-order logic syntax to form a hinge-loss Markov random fields (HL-MRFs) model. A hinge-loss Markov random fields model is defined over continuous variables which can naturally assemble probabilities and other real-valued attributes. By applying a maximum a posteriori (MAP) inference on a Hinge-loss Markov random fields, we can efficiently get exact inference result for all variables, as the MAP inference on a HL-MRFs model is a convex optimization problem (Bach et al., 2013). A hinge-loss Markov random fields can be formulized as a log-concave joint probability density function:

(1) P(Y|X)=1Zexp(r=1Mλrϕr(Y,X))P(Y|X)=\frac{1}{Z}exp(-\sum\limits^{M}\limits_{r=1}\lambda_{r}\phi_{r}(Y,X))

Y is the set of unknown variables and X is the set of observed values. λ\lambda represents the set of weight parameters, Z is a normalization constant. The potential function ϕr(Y,X)\phi_{r}(Y,X) for HL-MRFs is defined as:

(2) ϕr(Y,X)=(max(lr(Y,X),0))p\phi_{r}(Y,X)=(\max{(l_{r}(Y,X),0)})^{p}

lrl_{r} is a linear function of Y and X and p{1,2}p\in\{1,2\}. The MPE inference is also equivalent to minimizing the convex energy.

In the PSL setting, all the grounded out rule instances which are associated with both known values, such as similarity information, and unknown variables, such as potential drug-target interactions, will be treated as terms in the potential function of a HL-MRFs model.

More in detail, we consider a general form of PSL rule:

(3) w:P(A,B)Q(B,C)R(A,C)w:P(A,B)\wedge Q(B,C)\rightarrow R(A,C)

R (A, C) represents a continuous target variable, such as the probability of interactions between drug A and target C; P (A,B) and Q (B,C) represent observed values, such as the similarity between drug A and drug B and the probability of interaction between drug B and target C. The soft logic defines a relaxed convex representation of this logical implication:

(4) max{P(A,B)+Q(B,C)R(A,C)1,0}\max\{P(A,B)+Q(B,C)-R(A,C)-1,0\}

This continuous value can be treated as the distance to satisfaction of the logical implication.

The MAP inference algorithm aims to minimize the energy of a hinge-loss Markov random fields, which is equivalent to minimize the total weighted distance to satisfaction for all the grounded out rule instances. A full description of PSL is described in (Bach et al., 2015).

4. Problem Definition

We consider the problem of drug-target interaction problem as inferring new edges between drug nodes and target nodes on a heterogeneous network. Given a set of drugs D = {D1,,DmD_{1},...,D_{m}} and a set of targets T = {T1,,TnT_{1},...,T_{n}}, the total potential interactions between drugs and targets can be denoted as an interaction matrix Im×nI_{m\times n}, where Ii,j=1I_{i,j}=1 represents a positive interaction between drug DiD_{i} and target TjT_{j}. In addition, a set of domain-specific measures are represented as edges between nodes in the network. For example, the multiple drug-drug similarities {Sd1,,SdkdS^{1}_{d},...,S^{k_{d}}_{d}} can be treated as edges between drug nodes, and the disease-caused-by-target measure can be treated as edges between disease nodes and target nodes. Figure 1 shows a heterogeneous network based on the dataset provided by (Fu et al., 2016).

Refer to caption
Figure 1. A visual demonstration of the DTI prediction heterogeneous network.

The drug-target interaction prediction problem is to take use of all the information within the heterogeneous network to predict the unobserved interaction edges between drug nodes and target nodes.

5. Summated Meta-path based Probabilistic Soft Logic (SMPSL)

In this section, we are going to introduce our Summated Meta-path based Probabilistic Soft Logic (SMPSL) model.

5.1. Overview

The principal aim for our method is to find an efficient and robust summation method onto the PSL rule instances, so that we could accelerate the inference, which is relatively slow in the original PSL model.

In a drug-target interaction network with m drugs and n targets, there are mnmn possible interactions and if we consider all the similarities between each drug-drug and target-target pairs, the total count of meta-path instances can add up to O(m2n2)O(m^{2}n^{2}), which is very expensive for a large-scale network. Although (Fakhraei et al., 2014) applies a blocking method that finds the nearest k neighbors to pick the most similar drugs or targets and sets similarity values with farther neighbors to zero in order to reduce the total rule instances, the total number of rule instances can still be very big. Also, due to the fact that for some association measures, we only have binary value instead of continuous relevance, operating blocking method on these measures is unachievable. In order to reduce the number of rule instances, we propose our summated meta-path based probabilistic soft logic approach (SMPSL).

5.2. Summated Meta-path based Probabilistic Soft Logic (SMPSL)

(Fu et al., 2016) introduces a topological feature, Meta-path count, that uses matrix multiplications to sum up the total count of all the meta-paths. Since both meta-path and PSL rule are association rules, we can take the same procedure to calculate a rule instance count.For instance, consider a drug-drug similarity based rule like this:

D1ChemicalD2InteractsT1D1\operatorname*{\longrightarrow}\limits^{Chemical}D2\operatorname*{\longrightarrow}\limits^{Interacts}T1

It says that if drug D1 is similar in chemical fingerprints to drug D2, whil D2 is known to interact with target T1, then we could imply that drug D1 may also interacts with target T1. Figure 1 visualizes the process of generating a commuting matrix.

Refer to caption
Figure 2. Using matrix multiplications to generate a commuting matrix. Instead of using binary value for the similarity measure, we use exact similarities and generate a relaxed rule instance count.

Instead of using binary value for the similarity measure, we use exact similarities and generate a relaxed rule instance count. The matrix multiplication can be written as:

(5) Cd1=Sd1×I{C}_{d}^{1}=S_{d}^{1}\times I

Cd1C_{d}^{1} denotes to the commuting matrix of the chemical-based drug similarity rule. Sd1S_{d}^{1} is the chemical-based similarity adjacent matrix, I is the drug-target interaction adjacent matrix. We could possibly consider multiple similarity measures and assign each rule with them by following the same procedure above. Similarly, if we consider another rule that introduces new types of nodes other than drug or target on a heterogeneous network:

DrugTreatsDiseaseCausedbyTargetDrug\operatorname*{\longrightarrow}\limits^{Treats}Disease\operatorname*{\longrightarrow}\limits^{Caused\ by}Target

If a drug is known to treat a certain type of disease, while a target is known as the cause of this disease, then we can imply that the drug may interacts with the target. The matrix multiplication form of calculating the commuting matrix for this meta-path can be written as:

(6) CDisease=Treatd×Causedt{C}_{Disease}=Treat_{d}\times Caused_{t}

CDiseaseC_{Disease} means the commuting matrix for disease-caused by rule, TreatdTreat_{d} is the Drug-Disease association matrix, CausedtCaused_{t} is the Disease-Target association matrix.

Although the value within commuting matrices can already represent the strength of drug-target pairs, we want to transfer these values into genuine probabilities so that they can fit in the probabilistic soft logic setting. We introduce a Bayesian probabilistic approach where the probability P that a drug-target interaction pair is positive interaction given commuting matrix value CC is defined as:

(7) P(C)=prior(1)×c(C,1)N(C,1)prior(1)×c(C,1)N(C,1)+prior(0)×c(C,0)N(C,0)P(C)=\frac{prior(1)\times\frac{c(C,1)}{N(C,1)}}{prior(1)\times\frac{c(C,1)}{N(C,1)}+prior(0)\times\frac{c(C,0)}{N(C,0)}}

c(C,label)c(C,label) is the number of drug-target pairs with known label{0,1}label\in\{0,1\} that has the same commuting matrix value, under the same PSL rule CdlC^{l}_{d}. N(C,label)N(C,label) denotes to the total number of non-zero valued drug-target pairs with known label k under the same PSL rule CC. The prior probability for a label prior(label)prior(label) is defined as:

(8) prior(label)=N(C,label)N(C,label)+N(C,1label)prior(label)=\frac{N(C,label)}{N(C,label)+N(C,1-label)}

After applying the Bayesian probabilistic approach, we generate only one rule instance for each pre-defined rules. We also include a negative prior rule which indicates that all drugs and targets tend not to interact with each other by default:

(9) ¬Interacts(D,T)\neg Interacts(D,T)

Based on the rules and probability obtained from the commuting matrices, we can write down our objective that minimizing the total energy on the Hinge-loss Markov Random Fields:

(10) f(I)=argminIrRwm+1Irp+k=1mwkmax{Pr,kIr,0}pf(I^{*})=\operatorname*{argmin}\limits_{I}\sum\limits_{r\in R}w_{m+1}I_{r}^{p}+\sum\limits_{k=1}\limits^{m}w_{k}\max\{P_{r,k}-I_{r},0\}^{p}

R is the union of all unknown drug-target interaction pairs. I=(I1,,In)RI=(I_{1},...,I_{n})\in R represents the probability that unknown drug-target interaction pairs are positive. ww is a weight parameter, pp is an exponential parameter and in our experiment we take p=2p=2. Same as PSL model, given a settled weight parameters ww, our objective aims to minimize the total distance to satisfaction dr(I)pd_{r}(I)^{p}.

Due to the effect of summations, we significantly reduce the number of rule instances compared with original PSL framework, from over one million instances down to the number of defined PSL rules (meta-paths). Also, comparing with (Fu et al., 2016), we assign comprehensible propability to meta-path counts and use a probabilistic model to predict unknown links.

Weight Learning: In most cases, different rules may contribute unevenly, so adding a weight learning procedure before entering inference part is essential. We run a stochastic gradient descent algorithm on a portion of observed links to learn the weight parameters. The gradient of the log-likelihood with respect to the weight ww can be written as:

(11) wf(I)=dr(I)p+E[dr(I)p]\frac{\partial}{\partial w}f(I)=-d_{r}(I)^{p}+E[d_{r}(I)^{p}]

6. Experimental Results

6.1. Datasets

We use three datasets for our experiments. One is the dataset constructed and used by (Fakhraei et al., 2014; Perlman et al., 2011), which is a multi-similarity based dataset; another is used by (Fu et al., 2016) and the other is used by (Luo et al., 2017), both of which incorporate additional domain-specific knowledge and form a heterogeneous network.

6.1.1. Dataset I

The dataset contains 315 drugs, 250 targets and 1306 known interactions. Besides, there are five drug-drug similarities and three target-target similarities within the dataset which are obtained from (Perlman et al., 2011). The five drug-drug similarity measures are: Chemical-based, Ligand-based, Expression-based, Side-effect-based and Annotation-based. The three target-target similarity measures are: Sequence-based, Protein-protein interaction network-based and Gene Ontology-based. The drug-target interactions of this dataset are obtained from several open-source online databases organized by (Fakhraei et al., 2014), including DrugBank (Wishart et al., 2007), KEGG Drug (Kanehisa et al., 2009), Drug Combination database (Liu et al., 2009), and Matador (Günther et al., 2007). A brief description of each similarity extraction is provided below:

  • Chemical-based drug similarity: (Perlman et al., 2011) use the chemical development kit (Steinbeck et al., 2006) to compute a hashed fingerprint for each drug based on the specification information obtained from Drugbank. They compute the Jaccard similarity of the fingerprints. A Jaccard similarity score between two sets X and Y is defined as:

    Jaccard(X,Y)=|XY||XY|Jaccard(X,Y)=\frac{|X\bigcap Y|}{|X\bigcup Y|}

  • Ligand-based drug similarity: (Perlman et al., 2011) compare the specification information from Drugbank against a collection of ligand sets using the similarity ensemble approach (SEA) search tool (Keiser et al., 2009). The ligand sets denote to the Jaccard similarity between the corresponding sets of protein-receptor families for each drug pair.

  • Expression-based drug similarity: (Perlman et al., 2011) use the Spearman rank correlation coefficient to compute a similarity of how gene expression responses to drugs which is obtained from the Connectivity Map Project (Lamb, 2007; Lamb et al., 2006). The Spearman rank correlation coefficient between two sets X and Y is calculated as:

    Spearman(X,Y)=i(xix¯)(yiy¯)i(xix¯)2i(yiy¯)2Spearman(X,Y)=\frac{\sum_{i}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i}(x_{i}-\overline{x})^{2}\sum_{i}(y_{i}-\overline{y})^{2}}}

  • Side-effect-based drug similarity: The Jaccard similarity of side-effect sets for each drug pairs. The side-effect sets are obtained from (Günther et al., 2007).

  • Annotation-based drug similarity: (Perlman et al., 2011) use the semantic similarity algorithm of Resnik (Resnik, 1999) to calculate the similarity of ATC code which is obtained from DrugBank and matched against the World Health Organization ATC classification system (Skrbo et al., 2004) for each drug pair.

  • Sequence-based target similarity: (Perlman et al., 2011) use the sequence-based similarity score as a target similarity measure which is suggested in (Bleakley and Yamanishi, 2009).

  • Protein-protein interaction network-based target similarity: They calculate the distance between target pairs as similarity measure using an all-pairs shortest path algorithm on the human protein-protein interactions network.

  • Gene Ontology-based target similarity: They compute the semantic similarity between Gene Ontology annotations from the source of (Jain et al., 2009) using Resnik’s method (Resnik, 1999)

We follow the same procedure as (Fakhraei et al., 2014) that uses a ten-folder validation for the experiments. Each folder we have 90%90\% of positive links and negative links for training, and the remaining 10%10\% of links for testing.

The pre-defined rules for this multi-similarity dataset is showned below:

m1:D1ChemicalD2InteractsT1m_{1}:D1\operatorname*{\longrightarrow}\limits^{Chemical}D2\operatorname*{\longrightarrow}\limits^{Interacts}T1

m2:D1LigandD2InteractsT1m_{2}:D1\operatorname*{\longrightarrow}\limits^{Ligand}D2\operatorname*{\longrightarrow}\limits^{Interacts}T1

m3:D1ExpressionD2InteractsT1m_{3}:D1\operatorname*{\longrightarrow}\limits^{Expression}D2\operatorname*{\longrightarrow}\limits^{Interacts}T1

m4:D1SideeffectD2InteractsT1m_{4}:D1\operatorname*{\longrightarrow}\limits^{Side-effect}D2\operatorname*{\longrightarrow}\limits^{Interacts}T1

m5:D1AnnotationD2InteractsT1m_{5}:D1\operatorname*{\longrightarrow}\limits^{Annotation}D2\operatorname*{\longrightarrow}\limits^{Interacts}T1

m6:D1InteractsT2SequenceT1m_{6}:D1\operatorname*{\longrightarrow}\limits^{Interacts}T2\operatorname*{\longrightarrow}\limits^{Sequence}T1

m7:D1InteractsT2ProteinT1m_{7}:D1\operatorname*{\longrightarrow}\limits^{Interacts}T2\operatorname*{\longrightarrow}\limits^{Protein}T1

m8:D1InteractsT2GeneOntologyT1m_{8}:D1\operatorname*{\longrightarrow}\limits^{Interacts}T2\operatorname*{\longrightarrow}\limits^{GeneOntology}T1

6.1.2. Dataset II

A total of up to nine types of nodes are presented, including drugs (compounds), targets (proteins), adverse side effects, Gene Ontology (GO) annotations, ChEBI types, substructures, tissues, biological pathways and diseases; ten types of edges are presented, including drug-ChEBI types associations, drug-protein interactions, drug-substructure associations, adverse side effect-drug associations, disease-drug associations, target-target interactions, target-GO annotation associations, disease-protein associations, pathway-protein associations and tissue-target associations. In addition, a 2D structural based drug-drug similarity measure and a sequence based target-target similarity measure are included. The similarity measures are obtained from the PubChem databases (Kim et al., 2015; Wang et al., 2009). The total number of nodes in this heterogeneous network is 295897, including 258030 drugs and 22056 targets, and the total number of edges in the network is 7191240, the total number of meta-paths is 6487339992.

We follow the procedure in (Fu et al., 2016) that a set of 145,622 positively labeled DTI links and 600,000 negatively labeled DTI links contain in the current heterogeneous network are treated as the training set; another set of 43,159 positive links and 195,000 negative links that are not observed in the network are treated as the testing set.

We follow the method of (Fu et al., 2016) that defines 51 different semantic meta-paths on the heterogeneous network. A sample of meta-paths is presented in Table 1, and the full version of 51 meta-paths is shown in the supplement.

Table 1. Sample of Semantic meta-paths defined for Dataset II
Index Semantics
C1 drugsimilartodruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C2 druginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C3 druginteractstargetsimilartotargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target
C4 drugtreatsdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C5 drugsimilartodruginteractsdruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C11 drugtreatsdiseasetreatedbydruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{treated\ by}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C14 druginteractstargetexpressedintissueexpressestargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{expressed\ in}tissue\operatorname*{\longrightarrow}\limits^{expresses}target
C15 druginteractstargetcausesdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{causes}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target

6.1.3. Dataset III

A total of four types of nodes (drugs, proteins, diseases and side-effects) and six types of edges (drug-protein interactions, drug-drug interactions, drug-disease associations, drug-side-effect associations, protein-disease associations and protein-protein interactions) representing diverse drug-related information are collected from the public databases that were used to construct this dataset (Luo et al., 2017). It contains 12,015 nodes and 1,895,445 edges in total, including 708 drugs and 1512 drugs. A set of drug and target similarity measures is also included.

The known DTIs as well as drug-drug interactions are collected from DrugBank (Wishart et al., 2007), the protein-protein interactions are collected from the HPRD database (Keshava Prasad et al., 2008), the drug-disease and protein-disease associations are collected from the Comparative Toxicogenomics Datasbase (Davis et al., 2012) and the drug-side-effect associations are collected from the SIDER database (Kuhn et al., 2010).

We follow the same experimental settings in (Luo et al., 2017) that randomly pick 90%90\% of positive DTI links and corresponding number of negative links as training set, and the remaining 10%10\% positive links and corresponding number of negative links as testing set. Since the information type provided in Datset III is a subset of Dataset II, we pick a subset of 21 meta-paths out of 51 used in Dataset II. The subset meta-path we select is: {C1, C2, C3, C4, C5, C6, C7, C11, C15, C16, C17, C18, C19, C24, C25, C26, C27, C44, C45, C46, C47}.

6.2. Evaluation Metrics

We use the area under the Precision-Recall curve - the AUPR score and the area under the Receiver Operatin Characteristic (ROC) curve - the AUC score as measurements.

The AUC score is a commonly used measurement of a binary classifier in related pubilications. The ROC curve is plotted by the true positive rate (TPR) against the false-positive rate (FPR) at various thresholds. By applying the AUC score, we can compare our model with many other papers’ approaches. However, if a dataset is highly imbalanced i.e. the number of positive test cases is too small compared with the number of negative test cases, the change of AUC score will be subtle. So we propose a second evaluation criteria, the AUPR score, since it can be more informative than the AUC score under an imbalanced dataset.

6.3. Baselines

We compare our model with five approaches: original PSL method (Fakhraei et al., 2014), Meta-path count feature + Random Forests and Support Vector Machine methods (Fu et al., 2016), the Pairwise MKL method (Cichonska et al., 2018) and the DTINet method (Luo et al., 2017). All five baselines can predict drug-target interactions on a heterogeneous network.

6.3.1. Probabilistic Soft Logic

(Fakhraei et al., 2014) introduces the Probabilistic Soft Logic model, which pre-defines a series of association rules, treated as ”rules”, and solves the DTI prediction problem as inference on a bipartite graph. More specifically, it introduces eight different similarity based association rules for the DTI prediction task. (Shown at section 6.1.1) After defining the rules, (Fakhraei et al., 2014) incorporates all rule instances within a bipartite graph, and minimizes a total distance to satisfaction based on all rule instances to make predictions.

(Fu et al., 2016) introduces the Meta-path topological feature, which also defines a series of association rules but on a heterogeneous network, denoted as ”Meta-path”, then uses matrix multiplications to calculate a Meta-path count. For the DTI prediction task, they define 51 different meta-paths (introduced in section 6.1.2) and for each drug-target pair, they form a 51-dimensional vector based on the meta-path count topological features, then uses machine learning classifiers, Support Vector Machine and Random Forests, to solve the DTI prediction problem as a supervised learning task.

Moreover, (Luo et al., 2017) introduces a method that could generate a low dimensional representation for both drug and target, based on a series of association matrices. For each drug, they have four association matrices and for each target, they have three association matrices. They apply a diffusion component analysis (DCA), to obtain the informative low-dimensional feature representations, then use a learned projection matrix that could project the drug feature into protein space so that the distance between the projected vectors and the interacted targets are minimized.

We also compare our model with a latest approach (Cichonska et al., 2018) that uses a kernel-based approach that learns a set of weights for each single kernel as well as for each combination of kernel pairs.

6.4. Results

We operate several experiments including both effectiveness comparison and running time comparison, and we compare results between different approaches on all three datasets introduced in section 6.1. Moreover, we report a case study of the weight learning and the selection of meta-paths.

6.4.1. Effectiveness Comparison

Table 2 shows the comparison experimental results between our model and other approaches on all datasets in AUC score, and AUPR score.

Table 2. Effectiveness comparison between our model and baselines on all datasets
PSL triads Meta-path+SVM Meta-path+RFs DTINet PairwiseMKL SMPSL
AUC in Dataset I 0.920 0.719 0.766 0.844 0.825 0.929
AUPR in Dataset I 0.617 0.378 0.430 0.381 0.225 0.617
AUC in Dataset II N/A 0.867 0.845 & 0.542 (self) N/A N/A 0.917
AUPR in Dataset II N/A 0.523 N/A & 0.248 (self) N/A N/A 0.815
AUC in Dataset III N/A 0.509 0.884 0.914 N/A 0.928
AUPR in Dataset III N/A 0.505 0.884 0.932 N/A 0.947

From the experimental results of Dataset I we can tell that the PSL approach proposed in 2014 still outperforms the Metapath+SVM/RFs (2016), DTINet (2017) and PairwiseMKL (2018) methods. Furthermore, our Summated Meta-path PSL model preserves the performance after applying the summation and gives a slightly better results in AUC score.

The Dataset II is a very huge dataset: the total number of meta-path counts is 6487339992, over 100000 times the size of Dataset I (which is 38251); the number of drugs and targets are also hundreds times the size of Dataset I. As a result, the PSL, DTINet and PairwiseMKL model fail to finish running in a reasonable time limit (168 hours), because they do not apply any summation strategies. For the Meta-path + SVM method, we only sampled 1% of training data to feed the classifier. The AUC score (0.845) of the Meta-path + RFs method is proposed by the original paper (Fu et al., 2016), yet the AUPR score are not reported. Based on the reported parameters in the original paper, we implement a Python based CM-RFs model using the sklearn.RandomForestClassifier() function, and the parameters are: n_estimators=500,max_features=13n\_estimators=500,max\_features=13.

In terms of Dataset III, we apply the same experiment protocol as (Luo et al., 2017) that runs a ten-folder cross-validation, each folder randomly picks 90% of positive links and corresponding number of negative links as training, and the remaining 10% of positive links corresponding with the same number of randomly picked negative links as the test set. By doing so, each validation folder only contains 3460 labeled drug-target interaction links and 384 links to be predicted, which helps the DTINet approach executable. Yet the total number of meta-path counts is still very large, which is 783950268303, because there are multiple association and similarity measures on the heterogeneous network. As a result, the PSL model and the PairwiseMKL model still cannot finish running on this dataset in a reasonable amount of time.

6.4.2. Running Time Comparison

More importantly, due to the fact that we use a summation strategy to reduce the number of rule instances of a PSL model, we can accelerate the DTI prediction process. Table 3 shows the running time comparison between all five approaches on three datasets.

Table 3. Running time comparison between our model and baselines on all datasets (in minutes)
Methods DatasetIDataset\ I DatasetIIDataset\ II DatasetIIIDataset\ III
PSL triads 14.95 >10000>10000 >10000>10000
Meta-path+SVM 9.32 3.68 3.85
Meta-path+RFs 1.32 N/A & 32.53 (self) 0.913
DTINet 19.13 >10000>10000 18.83
PairwiseMKL 386 >10000>10000 >10000>10000
SMPSL 0.159 2.03 0.85

From the dataset I result we can tell that the pairwiseMKL method takes the moust amount of time. In the original paper (Cichonska et al., 2018), the experiments were done on a 120 drug 120 cancer bipartite network, and the running time was 1.45h. Thus, the running time for pairwiseMKL on dataset I is logical.

We run our experiments on a computer with a (16×2)(16\times 2) 3.3 GHz Intel Xeon CPU and 128GB of RAM. We gain significant efficiency improvement compared with the original PSL model, resulting in over 99% of time reduction while gaining comparable results. Even comparing with other approaches, our method still takes the least amount of time.

6.4.3. Case Study: Weight Learning

We study the effect of weight learning. The effectiveness and running time comparison with/without weight learning is shown below:

Table 4. Effectiveness and running time comparison with/without weight learning on dataset I, II, III acccordingly
Methods AUC AUPR Running Time(min)
(I)weight learning 0.929 0.617 0.1590.159
(I)no weight learning 0.927 0.0.552 0.142
(II)weight learning 0.917 0.815 2.0302.030
(II)no weight learning 0.911 0.779 1.815
(III)weight learning 0.928 0.947 0.8500.850
(III)no weight learning 0.926 0.945 0.814

From Table 4 we can tell that weight learning is positively effective, especially when the dataset is imbalanced (For Dataset I, the ratio of positive links and negative links in the test set is approximately 1 : 45; For Dataset II, the ratio is 1 : 4.5; For Dataset III, the ratio is 1 : 1).

We also demonstrate a figure that shows the log-scale importance of the relative weight parameter ww in Dataset II:

Refer to caption
Figure 3. The log-scale importance of the relative weight parameter ww in Dataset II

The weight parameter ww indicates the importance of a meta-path. From figure 2 we can tell that the top-6 most effective meta-paths selected by our model is: {C7,C18,C1,C3,C19,C2}\{C7,C18,C1,C3,C19,C2\}. We can draw a conclusion from this case study that shorter meta-paths C1,C2,C3{C1,C2,C3} tend to have higher effect. Besides, all the top-6 meta-paths are defined using only interaction and similarity association measures within drugs and targets. Which satisfies the truth that interaction and similarity information are more solid in the DTI prediction task.

7. Conclusion

In this paper, we propose a summated meta-path and probabilistic soft logic model (SMPSL) for the drug-target interaction (DTI) prediction. We form the DTI prediction problem into an inference problem on a heterogeneous network with multiple domain-specific measures, such as similarities, associations, interactions etc. We detect the shortcut for both Meta-path method and PSL method, while succeed in combining both method together, showing success to transfer the meta-path topological feature into a probabilistic metric by generating the probabilistic commuting matrices based on a Bayesian probabilistic approach. By using the value of commuting matrices as the rule instance of the PSL model, we significantly reduce the total number of rule instances of a PSL model, drawing over 99% of time reductions while the performance still remains comparable in AUC score and AUPR score. Besides, we compare our model with other four latest DTI prediction approaches on three open-source large-scale datasets, showing performance improvement in both AUC/AUPR score and framework efficiency.

In addition, it is worth pointing out that our method may be eligible to extend to broader scope of applications because both the meta-path method and the PSL method are general tools widely used in various areas, including but not constrained in spammer detection, scheme mapping, page ranking, recommendation system etc. This possibility motivates us to keep digging the potential of our model.

References

  • (1)
  • Bach et al. (2015) Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. 2015. Hinge-loss markov random fields and probabilistic soft logic. arXiv preprint arXiv:1505.04406 (2015).
  • Bach et al. (2017) Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. 2017. Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. Journal of Machine Learning Research (JMLR) 18 (2017), 1–67. https://github.com/stephenbach/bach-jmlr17-code
  • Bach et al. (2013) Stephen H. Bach, Bert Huang, Ben London, and Lise Getoor. 2013. Hinge-loss Markov Random Fields: Convex Inference for Structured Prediction. In Uncertainty in Artificial Intelligence.
  • Bleakley and Yamanishi (2009) Kevin Bleakley and Yoshihiro Yamanishi. 2009. Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics 25, 18 (2009), 2397–2403.
  • Chen et al. (2012) Bin Chen, Ying Ding, and David J Wild. 2012. Assessing drug target association using semantic linked data. PLoS computational biology 8, 7 (2012), e1002574.
  • Chen et al. (2015) Xing Chen, Chenggang Clarence Yan, Xiaotian Zhang, Xu Zhang, Feng Dai, Jian Yin, and Yongdong Zhang. 2015. Drug–target interaction prediction: databases, web servers and computational models. Briefings in bioinformatics 17, 4 (2015), 696–712.
  • Chen et al. (2016) Xing Chen, Chenggang Clarence Yan, Xiaotian Zhang, Xu Zhang, Feng Dai, Jian Yin, and Yongdong Zhang. 2016. Drug–target interaction prediction: databases, web servers and computational models. Briefings in Bioinformatics 17, 4 (2016), 696–712. https://doi.org/10.1093/bib/bbv066
  • Cheng et al. (2012) Feixiong Cheng, Chuang Liu, Jing Jiang, Weiqiang Lu, Weihua Li, Guixia Liu, Weixing Zhou, Jin Huang, and Yun Tang. 2012. Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS computational biology 8, 5 (2012), e1002503.
  • Cichonska et al. (2018) Anna Cichonska, Tapio Pahikkala, Sandor Szedmak, Heli Julkunen, Antti Airola, Markus Heinonen, Tero Aittokallio, and Juho Rousu. 2018. Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics 34, 13 (2018), i509–i518.
  • Cockell et al. (2010) Simon J Cockell, Jochen Weile, Phillip Lord, Claire Wipat, Dmytro Andriychenko, Matthew Pocock, Darren Wilkinson, Malcolm Young, and Anil Wipat. 2010. An integrated dataset for in silico drug discovery. Journal of integrative bioinformatics 7, 3 (2010), 15–27.
  • Davis et al. (2012) Allan Peter Davis, Cynthia Grondin Murphy, Robin Johnson, Jean M Lay, Kelley Lennon-Hopkins, Cynthia Saraceni-Richards, Daniela Sciaky, Benjamin L King, Michael C Rosenstein, Thomas C Wiegers, et al. 2012. The comparative toxicogenomics database: update 2013. Nucleic acids research 41, D1 (2012), D1104–D1114.
  • Ding et al. (2014) Hao Ding, Ichigaku Takigawa, Hiroshi Mamitsuka, and Shanfeng Zhu. 2014. Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Briefings in Bioinformatics 15, 5 (2014), 734–747. https://doi.org/10.1093/bib/bbt056
  • Ezzat et al. (2016) Ali Ezzat, Min Wu, Xiao-Li Li, and Chee-Keong Kwoh. 2016. Drug-target interaction prediction via class imbalance-aware ensemble learning. BMC bioinformatics 17, 19 (2016), 509.
  • Ezzat et al. (2017a) Ali Ezzat, Min Wu, Xiao-Li Li, and Chee-Keong Kwoh. 2017a. Drug-target interaction prediction using ensemble learning and dimensionality reduction. Methods 129 (2017), 81–88.
  • Ezzat et al. (2017b) A. Ezzat, P. Zhao, M. Wu, X. Li, and C. Kwoh. 2017b. Drug-Target Interaction Prediction with Graph Regularized Matrix Factorization. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14, 3 (May 2017), 646–656. https://doi.org/10.1109/TCBB.2016.2530062
  • Fakhraei et al. (2014) S. Fakhraei, B. Huang, L. Raschid, and L. Getoor. 2014. Network-Based Drug-Target Interaction Prediction with Probabilistic Soft Logic. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11, 5 (Sept 2014), 775–787. https://doi.org/10.1109/TCBB.2014.2325031
  • Fourches et al. (2010) Denis Fourches, Eugene Muratov, and Alexander Tropsha. 2010. Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. Journal of Chemical Information and Modeling 50, 7 (2010), 1189–1204. https://doi.org/10.1021/ci100176x arXiv:https://doi.org/10.1021/ci100176x PMID: 20572635.
  • Fu et al. (2016) Gang Fu, Ying Ding, Abhik Seal, Bin Chen, Yizhou Sun, and Evan Bolton. 2016. Predicting drug target interactions using meta-path-based semantic network analysis. BMC Bioinformatics 17, 1 (12 Apr 2016), 160. https://doi.org/10.1186/s12859-016-1005-x
  • Getoor and Diehl (2005) Lise Getoor and Christopher P. Diehl. 2005. Link Mining: A Survey. SIGKDD Explor. Newsl. 7, 2 (Dec. 2005), 3–12. https://doi.org/10.1145/1117454.1117456
  • Günther et al. (2007) Stefan Günther, Michael Kuhn, Mathias Dunkel, Monica Campillos, Christian Senger, Evangelia Petsalaki, Jessica Ahmed, Eduardo Garcia Urdiales, Andreas Gewiess, Lars Juhl Jensen, et al. 2007. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic acids research 36, suppl_1 (2007), D919–D922.
  • Jain et al. (2009) Eric Jain, Amos Bairoch, Severine Duvaud, Isabelle Phan, Nicole Redaschi, Baris E Suzek, Maria J Martin, Peter McGarvey, and Elisabeth Gasteiger. 2009. Infrastructure for the life sciences: design and implementation of the UniProt website. BMC bioinformatics 10, 1 (2009), 136.
  • Kanehisa et al. (2009) Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hirakawa. 2009. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic acids research 38, suppl_1 (2009), D355–D360.
  • Keiser et al. (2009) Michael J Keiser, Vincent Setola, John J Irwin, Christian Laggner, Atheir I Abbas, Sandra J Hufeisen, Niels H Jensen, Michael B Kuijer, Roberto C Matos, Thuy B Tran, et al. 2009. Predicting new molecular targets for known drugs. Nature 462, 7270 (2009), 175.
  • Keshava Prasad et al. (2008) TS Keshava Prasad, Renu Goel, Kumaran Kandasamy, Shivakumar Keerthikumar, Sameer Kumar, Suresh Mathivanan, Deepthi Telikicherla, Rajesh Raju, Beema Shafreen, Abhilash Venugopal, et al. 2008. Human protein reference database—2009 update. Nucleic acids research 37, suppl_1 (2008), D767–D772.
  • Kim et al. (2015) Sunghwan Kim, Paul A Thiessen, Evan E Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A Shoemaker, et al. 2015. PubChem substance and compound databases. Nucleic acids research 44, D1 (2015), D1202–D1213.
  • Kimmig et al. (2012) Angelika Kimmig, Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. 2012. A Short Introduction to Probabilistic Soft Logic. In NIPS Workshop on Probabilistic Programming: Foundations and Applications.
  • Kuhn et al. (2010) Michael Kuhn, Monica Campillos, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. 2010. A side effect resource to capture phenotypic effects of drugs. Molecular systems biology 6, 1 (2010), 343.
  • Lamb (2007) Justin Lamb. 2007. The Connectivity Map: a new tool for biomedical research. Nature reviews cancer 7, 1 (2007), 54.
  • Lamb et al. (2006) Justin Lamb, Emily D Crawford, David Peck, Joshua W Modell, Irene C Blat, Matthew J Wrobel, Jim Lerner, Jean-Philippe Brunet, Aravind Subramanian, Kenneth N Ross, et al. 2006. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. science 313, 5795 (2006), 1929–1935.
  • Liu et al. (2009) Yanbin Liu, Bin Hu, Chengxin Fu, and Xin Chen. 2009. DCDB: drug combination database. Bioinformatics 26, 4 (2009), 587–588.
  • Liu et al. (2016) Yong Liu, Min Wu, Chunyan Miao, Peilin Zhao, and Xiao-Li Li. 2016. Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS computational biology 12, 2 (2016), e1004760.
  • Luo et al. (2017) Yunan Luo, Xinbin Zhao, Jingtian Zhou, Jinglin Yang, Yanqing Zhang, Wenhua Kuang, Jian Peng, Ligong Chen, and Jianyang Zeng. 2017. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nature communications 8, 1 (2017), 573.
  • Lü and Zhou (2011) Linyuan Lü and Tao Zhou. 2011. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications 390, 6 (2011), 1150 – 1170. https://doi.org/10.1016/j.physa.2010.11.027
  • Perlman et al. (2011) Liat Perlman, Assaf Gottlieb, Nir Atias, Eytan Ruppin, and Roded Sharan. 2011. Combining drug and gene similarity measures for drug-target elucidation. Journal of computational biology 18, 2 (2011), 133–145.
  • Resnik (1999) Philip Resnik. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of artificial intelligence research 11 (1999), 95–130.
  • Skrbo et al. (2004) A Skrbo, B Begović, and S Skrbo. 2004. Classification of drugs using the ATC system (Anatomic, Therapeutic, Chemical Classification) and the latest changes. Medicinski arhiv 58, 1 Suppl 2 (2004), 138–141.
  • Steinbeck et al. (2006) Christoph Steinbeck, Christian Hoppe, Stefan Kuhn, Matteo Floris, Rajarshi Guha, and Egon L Willighagen. 2006. Recent developments of the chemistry development kit (CDK)-an open-source java library for chemo-and bioinformatics. Current pharmaceutical design 12, 17 (2006), 2111–2120.
  • Wang et al. (2009) Yanli Wang, Jewen Xiao, Tugba O Suzek, Jian Zhang, Jiyao Wang, and Stephen H Bryant. 2009. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research 37, suppl_2 (2009), W623–W633.
  • Wen et al. (2017) Ming Wen, Zhimin Zhang, Shaoyu Niu, Haozhi Sha, Ruihan Yang, Yonghuan Yun, and Hongmei Lu. 2017. Deep-learning-based drug–target interaction prediction. Journal of proteome research 16, 4 (2017), 1401–1409.
  • Wishart et al. (2007) David S Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. 2007. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research 36, suppl_1 (2007), D901–D906.
  • Yamanishi et al. (2008) Yoshihiro Yamanishi, Michihiro Araki, Alex Gutteridge, Wataru Honda, and Minoru Kanehisa. 2008. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24, 13 (2008), i232–i240.
  • Yildirim et al. (2007) Muhammed A Yildirim, Kwang-Il Goh, Michael E Cusick, Albert-László Barabási, and Marc Vidal. 2007. Drug-target network. Nature biotechnology 25, 10 (October 2007), 1119—1126. https://doi.org/10.1038/nbt1338

Appendix A Supplement

Table 4 and Table 5 show the total 51 meta-paths defined for Dataset II:

Table 5. Semantic meta-paths defined for Dataset II Part A
Index Semantics
C1 drugsimilartodruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C2 druginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C3 druginteractstargetsimilartotargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target
C4 drugtreatsdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C5 drugsimilartodruginteractsdruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C6 drugsimilartodruginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C7 druginteractstargetinteractsdruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C8 drughaspartsubstructurepartofdruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{has\ part}substructure\operatorname*{\longrightarrow}\limits^{part\ of}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C9 drughastypeChEBItypetypeofdruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{has\ type}ChEBI\ type\operatorname*{\longrightarrow}\limits^{type\ of}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C10 druginducesadversesideeffectinducedbydruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{induces}adverse\ side\ effect\operatorname*{\longrightarrow}\limits^{induced\ by}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C11 drugtreatsdiseasetreatedbydruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{treated\ by}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C12 druginteractstargethasannotationGOannotationannotationoftargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{has\ annotation}GO\ annotation\operatorname*{\longrightarrow}\limits^{annotation\ of}target
C13 druginteractstargetparticipatesinpathwayhasparticipantstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{participates\ in}pathway\operatorname*{\longrightarrow}\limits^{has\ participants}target
C14 druginteractstargetexpressedintissueexpressestargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{expressed\ in}tissue\operatorname*{\longrightarrow}\limits^{expresses}target
C15 druginteractstargetcausesdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{causes}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C16 druginteractstargetinteractsdruginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C17 druginteractstargetinteractsdrugtreatsdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C18 druginteractstargetinteractsdruginteractsdruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C19 druginteractstargetinteractsdruginteractstargetsimilartotargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target
C20 drugtypeofChEBItypetypeofdruginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{type\ of}ChEBI\ type\operatorname*{\longrightarrow}\limits^{type\ of}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C21 drugtypeofChEBItypetypeofdrugtreatsdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{type\ of}ChEBI\ type\operatorname*{\longrightarrow}\limits^{type\ of}drug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C22 drugtypeofChEBItypetypeofdruginteractstargetsimilartotargetdrug\operatorname*{\longrightarrow}\limits^{type\ of}ChEBI\ type\operatorname*{\longrightarrow}\limits^{type\ of}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target
C23 drugtypeofChEBItypetypeofdrugsimilartodruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{type\ of}ChEBI\ type\operatorname*{\longrightarrow}\limits^{type\ of}drug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C24 drugtreatsdiseasetreatedbydruginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{treated\ by}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C25 drugtreatsdiseasetreatedbydrugtreatsdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{treated\ by}drug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C26 drugtreatsdiseasetreatedbydruginteractstargetsimilartotargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{treated\ by}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target
C27 drugtreatsdiseasetreatedbydrugsimilartodruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{treated\ by}drug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C28 druginducesadversesideeffectinducedbydruginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{induces}adverse\ side\ effect\operatorname*{\longrightarrow}\limits^{induced\ by}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C29 druginducesadversesideeffectinducedbydrugtreatsdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{induces}adverse\ side\ effect\operatorname*{\longrightarrow}\limits^{induced\ by}drug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C30 druginducesadversesideeffectinducedbydruginteractstargetsimilartotargetdrug\operatorname*{\longrightarrow}\limits^{induces}adverse\ side\ effect\operatorname*{\longrightarrow}\limits^{induced\ by}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target
C31 druginducesadversesideeffectinducedbydrugsimilartodruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{induces}adverse\ side\ effect\operatorname*{\longrightarrow}\limits^{induced\ by}drug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C32 drughaspartsubstructurepartofdruginteractstargetinteractstargetdrug\operatorname*{\longrightarrow}\limits^{has\ part}substructure\operatorname*{\longrightarrow}\limits^{part\ of}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target
C33 drughaspartsubstructurepartofdrugtreatsdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{has\ part}substructure\operatorname*{\longrightarrow}\limits^{part\ of}drug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C34 drughaspartsubstructurepartofdruginteractstargetsimilartotargetdrug\operatorname*{\longrightarrow}\limits^{has\ part}substructure\operatorname*{\longrightarrow}\limits^{part\ of}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target
C35 drughaspartsubstructurepartofdrugsimilartodruginteractstargetdrug\operatorname*{\longrightarrow}\limits^{has\ part}substructure\operatorname*{\longrightarrow}\limits^{part\ of}drug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target
C36 druginteractstargetinteractstargethasannotationGOannotationannotationoftargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{has\ annotation}GO\ annotation\operatorname*{\longrightarrow}\limits^{annotation\ of}target
Table 6. Semantic meta-paths defined for Dataset II part B
Index Semantics
C37 drugtreatsdiseasecausedbytargethasannotationGOannotationannotationoftargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target\operatorname*{\longrightarrow}\limits^{has\ annotation}GO\ annotation\operatorname*{\longrightarrow}\limits^{annotation\ of}target
C38 druginteractstargetsimilartotargethasannotationGOannotationannotationoftargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target\operatorname*{\longrightarrow}\limits^{has\ annotation}GO\ annotation\operatorname*{\longrightarrow}\limits^{annotation\ of}target
C39 drugsimilartodruginteractstargethasannotationGOannotationannotationoftargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{has\ annotation}GO\ annotation\operatorname*{\longrightarrow}\limits^{annotation\ of}target
C40 druginteractstargetinteractstargetparticipatesinpathwayhasparticipantstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{participates\ in}pathway\operatorname*{\longrightarrow}\limits^{has\ participants}target
C41 drugtreatsdiseasecausedbytargetparticipatesinpathwayhasparticipantstargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target\operatorname*{\longrightarrow}\limits^{participates\ in}pathway\operatorname*{\longrightarrow}\limits^{has\ participants}target
C42 druginteractstargetsimilartotargetparticipatesinpathwayhasparticipantstargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target\operatorname*{\longrightarrow}\limits^{participates\ in}pathway\operatorname*{\longrightarrow}\limits^{has\ participants}target
C43 drugsimilartodruginteractstargetparticipatesinpathwayhasparticipantstargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{participates\ in}pathway\operatorname*{\longrightarrow}\limits^{has\ participants}target
C44 druginteractstargetinteractstargetcausesdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{causes}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C45 drugtreatsdiseasecausedbytargetcausesdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target\operatorname*{\longrightarrow}\limits^{causes}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C46 druginteractstargetsimilartotargetcausesdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target\operatorname*{\longrightarrow}\limits^{causes}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C47 drugsimilartodruginteractstargetcausesdiseasecausedbytargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{causes}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target
C48 druginteractstargetinteractstargetexpressedintissueexpressestargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{expressed\ in}tissue\operatorname*{\longrightarrow}\limits^{expresses}target
C49 drugtreatsdiseasecausedbytargetexpressedintissueexpressestargetdrug\operatorname*{\longrightarrow}\limits^{treats}disease\operatorname*{\longrightarrow}\limits^{caused\ by}target\operatorname*{\longrightarrow}\limits^{expressed\ in}tissue\operatorname*{\longrightarrow}\limits^{expresses}target
C50 druginteractstargetsimilartotargetexpressedintissueexpressestargetdrug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{similar\ to}target\operatorname*{\longrightarrow}\limits^{expressed\ in}tissue\operatorname*{\longrightarrow}\limits^{expresses}target
C51 drugsimilartodruginteractstargetexpressedintissueexpressestargetdrug\operatorname*{\longrightarrow}\limits^{similar\ to}drug\operatorname*{\longrightarrow}\limits^{interacts}target\operatorname*{\longrightarrow}\limits^{expressed\ in}tissue\operatorname*{\longrightarrow}\limits^{expresses}target