
Bioinformatics

doi.10.1093/bioinformatics/xxxxxx
Advance Access Publication Date: Day Month Year
Manuscript Category

Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz411/5497254 by Springfield College user on 23 May 2019


ReSimNet: Drug Response Similarity Prediction
using Siamese Neural Networks
Minji Jeon 1†, Donghyeon Park 1†, Jinhyuk Lee 1, Hwisang Jeon 2, Miyoung Ko 1, Sunkyu Kim 1, Yonghwa Choi 1, Aik-Choon Tan 3, and Jaewoo Kang 1,2∗
1 Department of Computer Science and Engineering, Korea University, Seoul, Korea
2 Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Korea
3 Division of Medical Oncology, Department of Medicine, Translational Bioinformatics and Cancer Systems Biology Laboratory, University of Colorado Anschutz Medical Campus, Aurora, USA
∗ To whom correspondence should be addressed.
† These two authors contributed equally to this work.
Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract
Motivation: Traditional drug discovery approaches identify a target for a disease and find a compound
that binds to the target. In this approach, structures of compounds are considered as the most important
features because it is assumed that similar structures will bind to the same target. Therefore, structural
analogs of the drugs that bind to the target are selected as drug candidates. However, even though
compounds are not structural analogs, they may achieve the desired response. A new drug discovery
method based on drug response, which can complement the structure-based methods, is needed.
Results: We implemented Siamese neural networks called ReSimNet that take as input two chemical
compounds and predict the CMap score of the two compounds, which we use to measure the
transcriptional response similarity of the two compounds. ReSimNet learns the embedding vector of
a chemical compound in a transcriptional response space. ReSimNet is trained to minimize the difference
between the cosine similarity of the embedding vectors of the two compounds and the CMap score of
the two compounds. ReSimNet can find pairs of compounds that are similar in response even though
they may have dissimilar structures. In our quantitative evaluation, ReSimNet outperformed the baseline
machine learning models. The ReSimNet ensemble model achieves a Pearson correlation of 0.518 and
a precision@1% of 0.989. In addition, in the qualitative analysis, we tested ReSimNet on the ZINC15
database and showed that ReSimNet successfully identifies chemical compounds that are relevant to a
prototype drug whose mechanism of action is known.
Availability: The source code and the pre-trained weights of ReSimNet are available at
https://github.com/dmis-lab/ReSimNet
Contact: kangj@korea.ac.kr
Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) (2019). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

1 Introduction

The discovery and development of a new drug remains a challenging process. It has been estimated that it takes 10-15 years and 2.6 billion dollars on average to commercialize a drug, yet the success rate is less than 10% (Paul et al., 2010; DiMasi et al., 2015). High-throughput screening (HTS) is the initial step for identifying candidate compounds for certain biological activities. However, the brute-force approach employed in HTS is a naïve approach to identifying candidate compounds, as it would be time consuming and costly to screen numerous compounds in a large compound search space. Once a candidate compound has been identified by HTS, the next step is to optimize the candidate compound to achieve the desired biological activity such as inhibiting the growth rate or increasing apoptosis in cancer cells. Most of the time, the underlying mechanism of action (MOA) of such a compound is initially unknown. Computer-aided drug design (CADD) approaches such as ligand-based and structure-based approaches have been implemented in drug discovery and development pipelines to improve the HTS step and predict the MOA of candidate compounds (Sliwoski et al., 2014).

The ligand-based drug discovery approach, which is a commonly used CADD approach, assumes that compounds with similar chemical structures will bind to the same target and exert the same MOA. However, the biological effects of chemical compounds are not always similar even if they share similar structures. For example, the two anti-diabetic drugs rosiglitazone and troglitazone share very similar chemical structures but have different targets and different MOAs (Camp et al., 2000). In contrast, the structure-based drug discovery approach relies on the structural information of a target protein and optimizes compounds based on the interactions between the target protein and a compound. This approach is useful when high-resolution structural data on the target protein is available, as this will increase the likelihood of identifying candidate compounds that can bind to the active site of the target protein, and thus avoid off-target effects. For example, first-generation epidermal growth factor receptor (EGFR) inhibitors such as gefitinib and erlotinib have demonstrated significant clinical benefit in EGFR-mutant non-small cell lung cancer (NSCLC) patients. However, patients whose cancer initially responded to the EGFR inhibitors eventually progressed. The acquired secondary gatekeeper mutation in EGFR T790M is one of the common resistance mechanisms. Thus, pharmaceutical companies have focused on using the mutated form of the EGFR protein structure to develop novel drugs that can address this resistance mechanism. By the structure-based drug discovery approach, the new drug osimertinib was discovered and developed. It was recently approved by the FDA in 2018 as the first-line treatment for EGFR-mutant NSCLC patients (Cross et al., 2014).

High-throughput characterization of gene expression changes in cells after drug treatment provides information on the MOAs of drugs. This approach, which is based on the Connectivity Map (CMap) concept (Subramanian et al., 2017; Lamb et al., 2006), suggests that gene expression signatures can be used to measure the similarity between different drugs that induce the same drug activity. The CMap concept is a new data-driven drug discovery paradigm in both academia and the pharmaceutical industry (De Wolf et al., 2018; Verbist et al., 2015; Senkowski et al., 2016; Yoo et al., 2018; Readhead et al., 2018). Machine learning algorithms that predict drug-target interactions and drug-MOA relationships using large-scale transcriptomic response data have been developed. For example, Janssen Pharmaceutica has applied the CMap concept to drug discovery and added transcriptomic profiles to its dataset containing 31K compounds for HTS. By this approach, they have used transcriptional connection scores of compounds for generating compound similarities, and employed machine learning algorithms to predict target activities and identify candidate chemical scaffolds for compound optimization (De Wolf et al., 2018).

Several deep learning approaches have been developed and applied to data-driven drug discovery. For example, deep neural network approaches that use structures for molecular binding affinity prediction (Ghasemi et al., 2018; Wallach et al., 2015) or compound-protein interaction prediction, or use the fingerprint of a compound and the domain fingerprint of a protein or the sequence of a protein (Tian et al., 2016; Wen et al., 2017; Gonczarek et al., 2017) have been proposed. Deep learning methods that predict phenotypes such as the toxicity and side effects of chemical compounds have also been proposed. DeepTox is a deep learning model that predicts the toxicity of compounds using the Tox21 dataset (Tice et al., 2013) and characterizes toxicophores (Mayr et al., 2016). Ramsundar et al. (2015) designed multitask deep neural networks that can learn common information on compounds from different datasets for toxicity and protein binding prediction. Coley et al. (2017) proposed a model that makes predictions on octanol solubility, aqueous solubility, melting points, and toxicity. Altae-Tran et al. (2017) predicted the toxicity and side effects of a new drug using a considerably small training set. For predicting the aqueous solubility of chemical compounds, Lusci et al. (2013) formed multiple directed acyclic graph recursive neural networks (DAG-RNNs) into undirected graph recursive neural networks, training each DAG-RNN and combining the results of all the DAG-RNNs. However, there is no deep learning method that identifies drug candidates using the transcriptional response similarity of drugs.

2 Approach

In this study, we propose ReSimNet, a Siamese neural network model that predicts the transcriptional response similarity of drugs. As shown in Figure 1, the inputs of ReSimNet are two chemical compounds represented as 2048-bit extended connectivity fingerprints (ECFPs). ReSimNet learns the response similarity of the two compounds collected from the Connectivity Map (CMap), which is the perturbation-driven gene expression dataset provided by the Broad Institute (Subramanian et al., 2017). The dataset includes similarity scores, called CMap scores, of compound pairs based on their compound-induced gene expression similarity in nine core cell lines. ReSimNet can generate the compound embedding vectors of two compounds as intermediate products, and the cosine similarity of the two compounds' embedding vectors is trained to be similar to the CMap score. We evaluated the performance of ReSimNet in predicting the response similarity of compounds and compared ReSimNet with structure-based vectors generated by Mol2vec (Jaeger et al., 2018) and ECFP, and other machine learning models such as Support Vector Machine Regressor, Ridge Regression, Gradient Boosting, Multi-Layer Perceptron, and Random Forest. We also evaluated the drug candidates discovered by ReSimNet by a literature survey.

3 Methods

3.1 CMap score - drug response similarity based on differential gene expression

We obtained the CMap scores of compound pairs from the Touchstone dataset (Touchstone V1.0), which is the CMap reference dataset provided by the Broad Institute (Subramanian et al., 2017). Profiled signatures of perturbagens such as compounds and shRNAs across nine core cell lines (A375, A549, HA1E, HCC515, HEPG2, HT29, MCF7, PC3, and VCAP) are included in the dataset. The CMap scores are enrichment scores of query compounds included in a bidirectional gene list (an up-regulated gene list and a down-regulated gene list) for reference signatures. The CMap scores of compound pairs range from -100 to 100 and a score above 90 indicates that the two compounds are similar in transcriptional response. We rescaled the CMap scores by dividing the CMap scores by 100.

3.2 Dataset sampling and splitting

We obtained the CMap scores of more than 2.9M compound pairs, which were generated from 2,428 compounds in the Touchstone dataset. Since the CMap scores are asymmetric, we filtered out compound pairs for which the difference between the bidirectional rescaled CMap scores is greater than 0.1, in order to construct a robust sample set. For the remaining pairs, the average bidirectional CMap scores are used as their final CMap scores. Moreover, we sampled pairs so that the number of samples with CMap scores of 0.9 or more and the number of samples with CMap scores of less than 0.9 were the same. Among the 2.9M samples, we use 269,542 samples for our model.

We trained our model with a real-world drug discovery scenario in mind. We divided the samples into training and evaluation sets. The ligand-based drug discovery approach is commonly used to find new drugs similar to well-known drugs. Like this method, we randomly divided the compounds in the Touchstone dataset into the following two groups: known
[Figure 1. Upper panel, "ReSimNet Model Training and Evaluation": Dataset Preparation (CMap scores, i.e. similarity scores between two differential gene expression patterns after compound treatments, from the Touchstone dataset); Compound Pairs as Input (e.g. Erlotinib and Gefitinib, converted to ECFP bits with RDKit); Compound Representation Learning (Siamese neural network with shared weights, cosine similarity of the two embedding vectors, loss against the actual CMap score); Evaluation (Pearson correlation coefficient, MSE, AUROC, Precision@k). Lower panel, "Virtual Drug Discovery Pipeline using ReSimNet": Compound Pairs as Input (chemical compounds with known effects from Touchstone, e.g. Haloperidol, and chemical compounds from ZINC); Trained ReSimNet; Results (top ranked novel drug candidates predicted to have similar effect as Haloperidol).]

Fig. 1. Pipeline of ReSimNet: The upper part of the figure illustrates the training process and the quantitative evaluation of ReSimNet. ReSimNet takes the structural information of two
compounds as input and learns the transcriptional response similarity of the two compounds. As the lower part of the figure shows, ReSimNet is applied to the drug discovery process using
the ZINC15 database. ReSimNet takes Haloperidol as one input and produces a ranked list of drug candidates from the ZINC15 database whose effects are predicted to be similar to those
of Haloperidol.
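
As a rough illustration of the dataset preparation stage in the upper part of Figure 1, the sketch below applies the steps described in Sections 3.1 and 3.2 to a handful of made-up scores: raw CMap scores are divided by 100, pairs whose two directional scores differ by more than 0.1 are dropped, the remaining pairs take the average of both directions, and equal numbers of pairs are sampled on each side of the 0.9 threshold. The function names, the dictionary layout, and the rule of skipping pairs scored in only one direction are assumptions made for this example, not the authors' code.

```python
import random


def prepare_pairs(directional_scores, max_gap=0.1):
    """Rescale, filter, and average directional CMap scores.

    directional_scores maps an ordered pair (query, reference) to a raw
    CMap score in [-100, 100]. Returns {(a, b): final_score} with a < b.
    """
    prepared = {}
    for (a, b), raw_ab in directional_scores.items():
        if a >= b:
            continue  # visit each unordered pair once, via its sorted key
        raw_ba = directional_scores.get((b, a))
        if raw_ba is None:
            continue  # assumption: pairs scored in one direction are skipped
        s_ab, s_ba = raw_ab / 100.0, raw_ba / 100.0  # rescale to [-1, 1]
        if abs(s_ab - s_ba) > max_gap:
            continue  # the two directions disagree too much
        prepared[(a, b)] = (s_ab + s_ba) / 2.0
    return prepared


def balance(prepared, threshold=0.9, seed=0):
    """Sample equal numbers of pairs scoring >= threshold and below it."""
    positives = sorted(p for p, s in prepared.items() if s >= threshold)
    negatives = sorted(p for p, s in prepared.items() if s < threshold)
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    return rng.sample(positives, n) + rng.sample(negatives, n)


# Hypothetical raw scores for three compounds A, B, C.
raw = {
    ("A", "B"): 95, ("B", "A"): 91,    # consistent, similar response
    ("A", "C"): 80, ("C", "A"): 40,    # directions differ by 0.4: dropped
    ("B", "C"): -20, ("C", "B"): -25,  # consistent, dissimilar response
}
pairs = prepare_pairs(raw)
sample = balance(pairs)
print(pairs)
print(sample)
```

Keying each pair by its sorted compound names keeps the final score independent of input order, which matches the use of a single averaged score per pair.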

compounds (K) (90%) and unknown compounds (U) (10%). The unknown compounds are considered as new compounds whose effects are unknown, and thus the unknown compounds are excluded from the training set and included in only the validation and test sets. The training set consists of only pairs of known compounds (KK), and the validation set and test set consist of pairs of known compounds (KK), pairs of known and unknown compounds (KU), and pairs of unknown compounds (UU) (Figure S1). The training, validation, and test sets are divided as follows: 70% for training, 10% for validation, and 20% for testing, respectively. Note that a KK type test sample represents a pair of known compounds that have never appeared together in the training set.

3.3 Compound representations for model input

We tested the following drug input representation methods which are used to obtain input for our model: SMILES, InChIKey, ECFP, and Mol2vec (Jaeger et al., 2018). We obtained the SMILES and InChIKey information of the 2,428 compounds from the Touchstone dataset. 2048-bit ECFPs of the compounds were produced using RDKit (http://www.rdkit.org/) and 300-dimensional Mol2vec vectors were generated based on ECFPs. Among the four kinds of input representations, ECFP obtained the best performance in predicting CMap scores. ECFP seems to be more suitable for our model and for learning the relationship between the response similarity and substructures of chemical compounds. Thus, we decided to use ECFP to represent drug structures and use them as input for our neural network model. The detailed results of the four input representation methods are provided in the supplementary file (Table S1).

3.4 Drug response-based Siamese neural networks

To predict the CMap scores of pairwise compound inputs, we construct Siamese neural networks, motivated by the work of Koch et al. (2015). Siamese neural networks are one kind of neural network architecture that contain two or more identical subnetworks. The identical subnetworks share weights that are updated simultaneously during training. Siamese neural networks are commonly used for finding similarities or relationships between two inputs. Based on the similarity score of two compounds, our Siamese neural networks ReSimNet learns the distributed representation of each compound. The pipeline of ReSimNet is illustrated in Figure 1.

Input pairs and target label. A pair of inputs $(x_a, x_b)$ is given as two 2048-bit ECFP vectors which represent the structural information of compound $x_a$ and compound $x_b$. The target values $t_{ab}$ are the CMap scores of input pairs. Once ReSimNet is properly trained, it can predict the CMap scores of new input pairs.

Model architecture. We built two identical multilayer perceptrons (MLPs) that share the weights in our Siamese networks ReSimNet. Given an input pair, the outputs of each MLP are the embedding vectors of two compounds $(c_a, c_b)$, and are calculated as follows:

$$c_a = W_2 f(W_1 x_a + b_1) + b_2$$

$$c_b = W_2 f(W_1 x_b + b_1) + b_2$$
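
The two expressions for the embedding vectors can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the activation f is taken to be a rectified linear unit (the choice reported in the Methods text), the toy sizes 8, 4, and 3 stand in for the 2048-bit input, hidden size 512, and embedding size 300, and `init_params` is a hypothetical random initialiser. The point it shows is that both compounds pass through the same parameters, so the predicted similarity does not depend on input order.

```python
import math
import random


def affine(W, x, b):
    """Return W x + b for a weight matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    """Element-wise max(x, 0)."""
    return [max(x, 0.0) for x in v]


def embed(x, params):
    """c = W2 f(W1 x + b1) + b2, with f taken to be ReLU."""
    W1, b1, W2, b2 = params
    return affine(W2, relu(affine(W1, x, b1)), b2)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def predict_similarity(xa, xb, params):
    """Both inputs use the same parameters (the Siamese part)."""
    return cosine_similarity(embed(xa, params), embed(xb, params))


def init_params(n_in, h, e, rng):
    """Hypothetical small random initialisation, for illustration only."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    def vec(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    return mat(h, n_in), vec(h), mat(e, h), vec(e)


rng = random.Random(0)
params = init_params(n_in=8, h=4, e=3, rng=rng)
xa = [1, 0, 1, 1, 0, 0, 1, 0]  # toy stand-ins for 2048-bit fingerprints
xb = [1, 0, 1, 0, 0, 1, 1, 0]
s_ab = predict_similarity(xa, xb, params)
s_ba = predict_similarity(xb, xa, params)
s_aa = predict_similarity(xa, xa, params)
print(s_ab, s_ba, s_aa)
```

Because `embed` is a function of a single compound, the embedding of a compound never seen in training can be computed from its fingerprint alone once the parameters are fixed.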

Table 1. The performance of the baseline models and ReSimNet on all types of test samples. We calculated the Pearson correlation, MSE, and AUROC of each model.
We compared the predicted CMap scores with the actual CMap scores. We also calculated the precision@k% of the top k% samples with the highest predicted CMap
scores. The performance of the ReSimNet ensemble is measured based on the average predicted CMap scores of the 10 individual ReSimNets. Note that the test set
samples were randomly bootstrapped with replacement 1 million times and the mean and standard deviation of the scores are provided.



Model Correlation MSE (Total) MSE (1%) MSE (2%) MSE (5%) AUROC Precision@1% Precision@2% Precision@5%

Mol2vec 0.021 (0.004) 0.132 (0.001) 0.056 (0.008) 0.082 (0.007) 0.105 (0.005) 0.519 (0.003) 0.820 (0.017) 0.782 (0.013) 0.724 (0.009)
ECFP 0.064 (0.004) 0.499 (0.001) 0.323 (0.007) 0.359 (0.006) 0.394 (0.004) 0.522 (0.003) 0.868 (0.016) 0.771 (0.013) 0.695 (0.009)
Linear SVR 0.049 (0.004) 0.146 (0.001) 0.169 (0.016) 0.148 (0.010) 0.153 (0.006) 0.540 (0.003) 0.657 (0.021) 0.673 (0.014) 0.655 (0.009)
Ridge 0.092 (0.004) 0.115 (0.001) 0.117 (0.010) 0.109 (0.007) 0.098 (0.004) 0.538 (0.003) 0.625 (0.021) 0.637 (0.015) 0.640 (0.009)
Gradient boosting 0.145 (0.004) 0.113 (0.001) 0.080 (0.009) 0.086 (0.007) 0.085 (0.004) 0.562 (0.002) 0.769 (0.019) 0.743 (0.013) 0.708 (0.009)
MLP 0.367 (0.005) 0.107 (0.001) 0.212 (0.015) 0.167 (0.010) 0.122 (0.005) 0.662 (0.002) 0.710 (0.020) 0.721 (0.014) 0.741 (0.008)
Random Forest 0.269 (0.005) 0.107 (0.001) 0.018 (0.001) 0.031 (0.003) 0.041 (0.002) 0.625 (0.002) 0.979 (0.007) 0.920 (0.009) 0.865 (0.007)
ReSimNet(Ensemble) 0.518 (0.004) 0.084 (0.001) 0.002 (0.002) 0.007 (0.002) 0.014 (0.002) 0.737 (0.002) 0.990 (0.005) 0.977 (0.005) 0.939 (0.005)
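
To make the MSE (k%) and Precision@k% columns of Table 1 concrete, here is a small sketch of how such scores could be computed from predicted and actual CMap scores. The helper names and the ten toy scores are hypothetical, and the 0.9 cut-off follows the "greater than 0.9" wording used for precision@k% in the Results text; the bootstrapping mentioned in the caption is left out for brevity.

```python
def top_k_percent(predicted, actual, k_percent):
    """Return (predicted, actual) pairs for the top k% by predicted score."""
    n_top = max(1, int(len(predicted) * k_percent / 100))
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    return [(predicted[i], actual[i]) for i in order[:n_top]]


def precision_at_k_percent(predicted, actual, k_percent, threshold=0.9):
    """Fraction of the top k% whose actual score exceeds the threshold."""
    top = top_k_percent(predicted, actual, k_percent)
    return sum(1 for _, t in top if t > threshold) / len(top)


def mse(pairs):
    """Mean squared error over (predicted, actual) pairs."""
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


# Hypothetical predicted and actual (rescaled) CMap scores.
predicted = [0.95, 0.10, 0.80, 0.99, 0.40, 0.92, 0.05, 0.60, 0.97, 0.30]
actual = [0.93, 0.20, 0.50, 0.96, 0.10, 0.85, 0.00, 0.95, 0.91, 0.40]

p_at_20 = precision_at_k_percent(predicted, actual, 20)
p_at_50 = precision_at_k_percent(predicted, actual, 50)
mse_top_20 = mse(top_k_percent(predicted, actual, 20))
mse_total = mse(list(zip(predicted, actual)))
print(p_at_20, p_at_50, mse_top_20, mse_total)
```

In this toy data the precision drops from the top 20% to the top 50% because lower-ranked predictions are admitted, which is the trend the Results text describes for increasing k.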

In the expressions for $c_a$ and $c_b$, $W_*$ and $b_*$ are the trainable weights and biases of the MLP, respectively, and $f(\cdot)$ denotes an element-wise nonlinear activation function such as a logistic sigmoid. Note that each embedding vector is calculated using the shared parameters $W_1 \in \mathbb{R}^{h \times 2048}$, $W_2 \in \mathbb{R}^{e \times h}$, $b_1 \in \mathbb{R}^{h}$ and $b_2 \in \mathbb{R}^{e}$, where $h$ is the dimension of a hidden layer, and $e$ is the dimension of an output layer. We set $h = 512$ and $e = 300$, and use a rectified linear unit ($\max(x, 0)$) as $f(\cdot)$.

We compute the cosine similarity of the two output vectors $(c_a, c_b)$ from the MLPs to predict the CMap score of two compounds. Unlike the study of Koch et al. (2015) where a weighted L1 distance is employed, we used cosine similarity as the similarity measure to directly utilize the outputs of each MLP as compound embeddings.

$$s_{ab} = \frac{c_a \cdot c_b}{\lVert c_a \rVert \, \lVert c_b \rVert}$$

We can predict the response similarity of two compounds using the cosine similarity of the embedding vectors of the two compounds. A compound embedding vector of an unseen compound can be generated using the trained ReSimNet and the ECFP vector of the compound.

Loss and optimization function. We train our model to minimize the mean squared error between the target CMap score $t_{ab}$ and the cosine similarity $s_{ab}$ of two outputs as follows:

$$J(\Theta) = \frac{1}{N} \sum_{a,b} (s_{ab} - t_{ab})^2$$

where we define $N$ as the total number of training examples of input pairs, and $\Theta$ denotes the trainable parameters of our model. The hyperparameters of ReSimNet are selected using the validation set. The details of the hyperparameters and the considered values are provided in the supplementary file (Tables S2-S7). We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.005.

4 Results

4.1 Model evaluation

Baseline models. For the quantitative evaluation of ReSimNet, we compared the performance of ReSimNet with that of the following seven baseline models: ECFP and Mol2vec (Jaeger et al., 2018), both of which are structure-based chemical compound representation methods; and Support Vector Machine Regressor with a linear kernel (Linear SVR), Ridge regression, Gradient Boosting, Multi-Layer Perceptron (MLP), and Random Forest, all of which are machine learning models. ECFP and Mol2vec are used to evaluate the relationship between the structural similarity and response similarity of compounds. For ECFP, Jaccard similarity (or Tanimoto similarity) is commonly used as a structural similarity measure. We computed the Jaccard similarity of the ECFP vectors of two compounds and used the similarity as the predicted CMap score of the two compounds. Mol2vec is a model that considers the substructures of a whole structure as words and the whole structure as a sentence, and uses Word2vec (Mikolov et al., 2013) to generate a 300-dimensional vector for each compound. We calculated the cosine similarity of a pair of compounds represented using Mol2vec and used the similarity as the predicted CMap score of the two compounds.

We trained the five baseline machine learning models to directly predict the CMap scores of two compounds given the ECFPs of the two compounds as input. Unlike ReSimNet, the baseline machine learning models are sensitive to the order of the two compounds in an input. To deal with this problem, we doubled the size of the training set by replicating each sample with the order of its compounds reversed, and trained the models on this new dataset. The inference of the baseline models must be performed twice, once on the original samples and once on the samples with compounds in reverse order, and the results need to be averaged to obtain the final prediction.

We also applied an ensemble method to ReSimNet and evaluated its effect on performance. It is known that an ensemble of complex high-variance models such as deep neural networks often improves performance. We evaluated the ensemble of 10 ReSimNets that are trained independently with random weight initializations. The 10-ReSimNet ensemble (by averaging the predictions) consistently outperformed the single ReSimNet model. Hence, we report the results obtained by the 10-ReSimNet ensemble model.

Performance results. As mentioned in Section 3.2, the samples in the test set are either KK, KU, or UU pairs. We report the performance on each pair type. Note that K denotes a compound used in the training set, and U denotes a compound used for validation and testing. A KK pair in the test set denotes a pair of compounds that appeared separately and were never a pair in the training set. Using the results of the KK pairs with high predicted similarity scores, we can hypothesize new uses of the known compounds, which can be considered as drug repositioning. The results of the KU pairs can be used to determine whether a model can find drug candidates that are similar to well-known drugs. The results of the UU pairs can be used to gauge the response similarity between unknown compounds. The results of the KK, KU, and UU pair samples are provided in the supplementary file (Tables S8, S9, S10).

Table 1 shows the performance results on the full test set which includes all three pair types. As shown in Table 1, the ReSimNet ensemble model

outperformed all the baseline models in all evaluation metrics. For example, the ReSimNet ensemble achieved a Pearson correlation of 0.518 and an MSE of 0.084 between the predicted similarity scores and the actual CMap scores on all test samples while MLP, which obtains the second best performance, only achieved a Pearson correlation of 0.367 and an MSE of 0.107. To evaluate the statistical significance of the observed performance differences, we performed t-tests and ReSimNet obtained a p-value of $< 10^{-18}$ against all baseline methods. Moreover, when the MSE of the top k% samples is calculated based on the predicted similarity scores, the MSEs of the ReSimNet ensemble are significantly lower than those of the baseline models. For example, the MSE of the top 1% of samples for the ReSimNet ensemble is 0.002 while that for the second best model, Random Forest, is 0.018, which is nine times larger than that of our model (p-value $< 10^{-18}$). This result implies that the high confidence predictions made by our model are significantly more accurate than those of the other baseline models.

In addition, when measuring AUROC, if a sample has a CMap score of 0.9 or more, a positive label is given; otherwise, a negative label is given. A threshold of 0.9 is chosen based on the similarity criteria used by the CMap. The AUROC scores of Mol2vec and ECFP are close to that of a random predictor and the AUROC scores of the baseline machine learning models are lower than those of ReSimNet. The result shows that ReSimNet learned the relationship between substructures and drug response similarity. Moreover, the low performance of Mol2vec and ECFP suggests that the ligand-based drug discovery approach used to develop new drugs by designing analogs of well-known drugs has limitations.

Although obtaining high accuracy on every sample is important, high accuracy on the high confidence predictions is much more important as only the top predicted drug candidates are examined in practice. We calculated precision@k% scores to measure the performance of ReSimNet on the top ranked samples. Table 1 shows the precision@k% score, which denotes the fraction of samples whose CMap scores are greater than 0.9 among the top k% of samples with the highest predicted CMap scores (k = 1%, 2%, 5%). The top k% of samples and the number of samples whose CMap scores are greater than 0.9 among the top k% samples are indicated in the table. The precision@k% will be 0.5 if we randomly predict the CMap scores. As k increases, precision@k% tends to decrease because more samples with lower predicted similarity scores are included. The precision@1% column in Table 1 shows that among the top 1% of the ReSimNet ensemble prediction results, 98.9% of the samples have CMap scores greater than 0.9. The precision@k% of Mol2vec and ECFP shows that there is a significant relationship between structure similarity and response similarity. However, Mol2vec and ECFP cannot find drug pairs that have different drug structures but similar effects or drug pairs that have similar structures but dissimilar effects. Some examples of this found by ReSimNet are described in the supplementary file (Figures S2 and S3).

4.2 Use case scenario: searching for drug candidates in the ZINC15 database using ReSimNet

For more practical verification, we used trained ReSimNet for a drug discovery pipeline. We simulated the process of finding new drugs that could be similar to drugs that were already known to be effective for a disease. To simulate this scenario, ReSimNet predicts the drug response similarity between a compound used in ReSimNet in the training phase and a new compound in the ZINC15 database. The ZINC15 database (Sterling and Irwin, 2015) contains more than 230 million compounds and we selected around 16,000 ZINC15 compounds that are named and purchasable, satisfy all the Lipinski rules, and are not included in the Touchstone dataset. We limited our search to named compounds because we had to perform literature surveys to verify the compounds found in the ZINC15 database. We also used the Lipinski rules to select compounds with drug-like properties. However, in practice, without loss of generality, all of these constraints can be removed and virtually any compound can be a candidate for screening. To demonstrate the effectiveness of ReSimNet, we report the virtual screening results of the two compounds Haloperidol and Selumetinib (from the Touchstone dataset) whose mechanisms of action are known.

Haloperidol (BRD-K67783091) is a Dopamine receptor antagonist and an FDA-approved drug for neurological or psychiatric diseases. We obtained the top 10 drug candidates that were predicted to have a drug response (predicted similarity score > 0.9) similar to that of Haloperidol (Table 2). Table 2 also shows the Jaccard similarity coefficient of ECFP vectors of two compounds. Haloperidol and the top 10 ranked drug candidates were searched in pairs using the Biomedical Entity Search Tool (Lee et al., 2016) to identify which compounds are mentioned together with Haloperidol in abstracts. Six of the top 10 candidate compounds were mentioned with Haloperidol in at least one abstract. The fact that the compounds are mentioned with Haloperidol in the abstracts of articles indirectly suggests that the compounds are related and were experimentally compared. We listed the description and references in the description column of Table 2.

Drugs with significantly lower structural similarity but similar gene expression profiles can also be identified by ReSimNet. Although Bromperidol, a drug approved by the FDA for the treatment of dementia, depression, schizophrenia, anxiety disorders, and psychosomatic disorders, was not included in the Touchstone dataset, Bromperidol is selected as one of the drug candidates that is similar to Haloperidol in terms of drug response. Since Bromperidol is also an approved antipsychotic drug, we can say that ReSimNet successfully found a compound similar to Haloperidol. In addition, Chlorohaloperidol, Amiperone, Moperone, Budipine, Ganaxolone, and Butorphanol are dopamine receptor antagonists or possible treatments for neuropsychiatric diseases. There is no research describing the mechanisms of action of Cyantraniliprole, B-Hyodeoxycholate, or 2,3-Dibromopropanol. However, based on the ReSimNet results, we hypothesized that these compounds have gene expression signatures similar to Haloperidol, and potentially could be used as antipsychotic drugs. The results on Selumetinib are provided in the supplementary file (Table S11). Our literature survey shows that ReSimNet can be used to find new drug candidates with targets or effects similar to those of known drugs, even if their structural similarity is low, which is not possible with ligand-based or structure-based drug discovery methods.

5 Discussion

We developed ReSimNet, a novel drug response-based Siamese neural network model that predicts whether compounds have gene expression signatures similar to those of known compounds and produces transcriptional response similarity-based embedding vectors of the compounds. In ReSimNet, we exploited the CMap scores as target values for training the models, with the assumption that the CMap score increases as the drug response similarity of two compounds becomes higher. We tested the performance of ReSimNet on a large database (ZINC15), and the predictions of ReSimNet were validated by a literature survey. As mentioned in Section 3.3, we have used various types of input representations obtained by SMILES, InChIKey, ECFP, and Mol2vec. For the SMILES and InChIKey input representations, we used bidirectional long short term memory (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) because Bi-LSTM can capture sequential information of inputs. Although Bi-LSTM uses more expressive structures than a simple MLP model, the simple MLP model performed better. The simple MLP model has been tested on the inputs obtained by Mol2vec. The simple MLP model

Table 2. Top drug candidates for Haloperidol from the ZINC15 database

ZINC15 ID    | ZINC15 name         | Predicted similarity score by ReSimNet | Similarity score by ECFP | # of articles | Description
ZINC2516029  | chlorohaloperidol   | 0.995 | 0.884 | 6  | Chlorohaloperidol targets the Dopamine D2 receptor (a)
ZINC601270   | bromperidol         | 0.967 | 0.792 | 66 | Bromperidol is an FDA approved drug for dementia, depression, schizophrenia, anxiety disorders, and psychosomatic disorders (Yasui-Furukori et al., 2002)
ZINC4214827  | amiperone           | 0.961 | 0.704 | 0  | Amiperone targets the Dopamine D3 receptor and D3 is a potential target of Parkinson’s disease and schizophrenia (Varady et al., 2003)
ZINC538026   | moperone            | 0.955 | 0.792 | 11 | Moperone is a Dopamine D2 receptor antagonist (b)
ZINC35851465 | cyantraniliprole    | 0.946 | 0.098 | 0  | -
ZINC1481990  | budipine            | 0.942 | 0.172 | 1  | Budipine is used in the treatment of Parkinson’s disease (Klockgether et al., 1993)
ZINC12494203 | B-Hyodeoxycholate   | 0.938 | 0.113 | 0  | -
ZINC3824281  | ganaxolone          | 0.933 | 0.129 | 1  | Ganaxolone is one of neurosteroids and is used for epilepsy (Nohria and Giller, 2007)
ZINC3812988  | butorphanol         | 0.933 | 0.200 | 9  | Butorphanol is a neuropsychiatric agent (Iyengar et al., 1987)
ZINC2041178  | 2,3-Dibromopropanol | 0.93  | 0.158 | 0  | -

(a) https://pubchem.ncbi.nlm.nih.gov/compound/173712#section=ChEMBL-Target-Tree
(b) https://www.kegg.jp/dbget-bin/www_bget?D01105
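
The screening behind Table 2 can be pictured as three steps: filter the library, score each remaining compound against the query drug, and rank. The sketch below is a minimal illustration under stated assumptions, not the released ReSimNet code. It assumes that the response similarity is the cosine similarity of two embedding vectors and that the ECFP score is a Tanimoto coefficient over fingerprint on-bits. The `Candidate` record, the function names, and every number are hypothetical toy values.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one library compound (toy values only)."""
    name: str
    mol_weight: float        # molecular weight in daltons
    log_p: float             # octanol-water partition coefficient
    h_donors: int            # hydrogen-bond donors
    h_acceptors: int         # hydrogen-bond acceptors
    embedding: list[float]   # stand-in for a learned response embedding
    fp_bits: frozenset[int]  # stand-in for the on-bits of an ECFP fingerprint


def passes_lipinski(c: Candidate) -> bool:
    """Rule of five: MW <= 500, logP <= 5, donors <= 5, acceptors <= 10."""
    return (c.mol_weight <= 500 and c.log_p <= 5
            and c.h_donors <= 5 and c.h_acceptors <= 10)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto coefficient of two on-bit sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_candidates(query: Candidate, library: list[Candidate], top_k: int = 10):
    """Return (name, response_sim, structure_sim) rows, best response_sim first."""
    scored = [
        (c.name,
         cosine(query.embedding, c.embedding),
         tanimoto(query.fp_bits, c.fp_bits))
        for c in library
        if passes_lipinski(c)  # drop compounds that are not drug-like
    ]
    return sorted(scored, key=lambda row: row[1], reverse=True)[:top_k]


# Toy data: the query drug and three library compounds.
query = Candidate("query", 380.0, 4.0, 1, 3, [1.0, 0.0, 1.0], frozenset({1, 2, 3, 4}))
library = [
    Candidate("analog", 390.0, 4.1, 1, 3, [0.9, 0.1, 1.0], frozenset({1, 2, 3, 5})),
    Candidate("distinct", 250.0, 2.0, 2, 4, [1.0, 0.2, 0.8], frozenset({7, 8, 9})),
    Candidate("too_heavy", 650.0, 3.0, 1, 3, [1.0, 0.0, 1.0], frozenset({1, 2, 3, 4})),
]

ranked = rank_candidates(query, library)
for name, response_sim, structure_sim in ranked:
    print(f"{name:10s} response={response_sim:.3f} structure={structure_sim:.3f}")
```

In this toy run, the compound named `distinct` shares no fingerprint bits with the query yet still ranks highly on response similarity. Rows such as cyantraniliprole in Table 2 show the same pattern: a high ReSimNet score next to a low ECFP score.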

obtains better performance on the input obtained by ECFP than on the inputs obtained by Mol2vec. The experimental results of the four input representations are provided in the supplementary file (Table S1).

The Siamese networks in our model allow us to learn the representations of drugs in a transcriptional response space. The embedding vectors (the last hidden layers of ReSimNet) represent drugs in a vector space where drugs with similar transcriptional responses are located close to each other. In contrast, conventional structure-based representation methods (e.g., ECFP) place drugs with similar structures close to each other. Our drug embedding vectors could be used as input in addition to the conventional structure representations in downstream applications such as synergy prediction (Preuer et al., 2017; Jeon et al., 2018; Menden et al., 2018), personalized drug response prediction (Menden et al., 2013), drug toxicity prediction (Mayr et al., 2016), drug-drug interaction prediction (Lee et al., 2012), drug repositioning (Napolitano et al., 2013), or compound mechanism of action prediction. The Siamese networks could also be applied directly to other tasks such as synergy prediction. We plan to explore both directions in future work.

We acknowledge that our qualitative analysis in the use case scenario may not represent general cases. We used drugs with which we are familiar and selected the reported drugs from the results. Although we chose the cases after conducting only a small number of trials, there may be a confirmation bias in our case study. To address this issue, in future work, we plan to conduct wet-lab experiments to validate the hypotheses made by ReSimNet (e.g., Cyantraniliprole, B-Hyodeoxycholate, and 2,3-Dibromopropanol as drug candidates that may have effects similar to Haloperidol).

Finally, we would like to emphasize that ReSimNet is not intended to compete with well-established drug discovery methods but to complement them. ReSimNet can predict the transcriptional response similarity of drugs, which can be useful for drug repurposing. Moreover, since ReSimNet is not limited to structural analogs, it can find novel drug candidates whose structure greatly differs from that of prototype drugs. We also believe that exploiting the transcriptional response similarity of drugs is important for data-driven drug discovery, as the high-throughput characterization of gene expression changes in cells after compound treatment provides more information on the MOAs of the drugs.

6 Conclusion
In summary, we have developed and implemented a novel Siamese neural network model that predicts the similarity of the gene expression patterns of two compounds. Compared with models such as Mol2vec and ECFP, which depend solely on compound features, we found that ReSimNet is more effective in extracting the embedding vectors of compounds and thus more appropriate for novel drug discovery. Literature surveys were also used to verify that ReSimNet can find new compounds in the ZINC15 database that are similar to Haloperidol and Selumetinib. We found that ReSimNet can find drug candidates similar to drugs that were proven to be effective for diseases, and can reduce the search space of the drug discovery pipeline. The source code and the pre-trained weights of ReSimNet are available at https://github.com/dmis-lab/ReSimNet.

Funding
This research was supported by the National Research Foundation of Korea (NRF-2016M3A9A7916996 and NRF-2017M3C4A7065887) and by the National IT Industry Promotion Agency grant funded by the Ministry of Science and ICT and Ministry of Health and Welfare (NO. C1202-18-1001, Development Project of The Precision Medicine Hospital Information System (P-HIS)).

References
Altae-Tran, H. et al. (2017). Low data drug discovery with one-shot learning. ACS central science, 3(4), 283–293.
Camp, H. S. et al. (2000). Differential activation of peroxisome proliferator-activated receptor-gamma by troglitazone and rosiglitazone. Diabetes, 49(4), 539–547.

Coley, C. W. et al. (2017). Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and modeling, 57(8), 1757–1772.
Cross, D. A. et al. (2014). AZD9291, an irreversible EGFR TKI, overcomes T790M-mediated resistance to EGFR inhibitors in lung cancer. Cancer discovery, pages CD–14.
De Wolf, H. et al. (2018). High-throughput gene expression profiles to define drug similarity and predict compound activity. Assay and drug development technologies, 16(3), 162–176.
DiMasi, J. A. et al. (2015). The cost of drug development. New England Journal of Medicine, 372(20), 1972–1972. PMID: 25970070.
Ghasemi, F. et al. (2018). Deep neural network in QSAR studies using deep belief network. Applied Soft Computing, 62, 251–258.
Gonczarek, A. et al. (2017). Interaction prediction in structure-based virtual screening using deep learning. Computers in biology and medicine.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.
Iyengar, S. et al. (1987). Agonist action of the agonist/antagonist analgesic butorphanol on dopamine metabolism in the nucleus accumbens of the rat. Neuroscience letters, 77(2), 226–230.
Jaeger, S. et al. (2018). Mol2vec: Unsupervised machine learning approach with chemical intuition. Journal of chemical information and modeling, 58(1), 27–35.
Jeon, M. et al. (2018). In silico drug combination discovery for personalized cancer therapy. BMC systems biology, 12(2), 16.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Klockgether, T. et al. (1993). The antiparkinsonian agent budipine is an N-methyl-D-aspartate antagonist. Journal of Neural Transmission-Parkinson’s Disease and Dementia Section, 5(2), 101–106.
Koch, G. et al. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.
Lamb, J. et al. (2006). The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313(5795), 1929–1935.
Lee, K. et al. (2012). Drug-drug interaction analysis using heterogeneous biological information network. In Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on, pages 1–5. IEEE.
Lusci, A. et al. (2013). Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of chemical information and modeling, 53(7), 1563–1575.
Mayr, A. et al. (2016). DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3, 80.
Menden, M. P. et al. (2013). Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS one, 8(4), e61318.
Menden, M. P. et al. (2018). A cancer pharmacogenomic screen powering crowd-sourced advancement of drug combination prediction. bioRxiv, page 200451.
Mikolov, T. et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
Napolitano, F. et al. (2013). Drug repositioning: a machine-learning approach through data integration. Journal of cheminformatics, 5(1), 30.
Nohria, V. and Giller, E. (2007). Ganaxolone. Neurotherapeutics, 4(1), 102–105.
Paul, S. M. et al. (2010). How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nature reviews Drug discovery, 9(3), 203.
Preuer, K. et al. (2017). DeepSynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics, 34(9), 1538–1546.
Ramsundar, B. et al. (2015). Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072.
Readhead, B. et al. (2018). Expression-based drug screening of neural progenitor cells from individuals with schizophrenia. Nature communications, 9(1), 4412.
Senkowski, W. et al. (2016). Large-scale gene expression profiling platform for identification of context-dependent drug responses in multicellular tumor spheroids. Cell chemical biology, 23(11), 1428–1438.
Sliwoski, G. et al. (2014). Computational methods in drug discovery. Pharmacological reviews, 66(1), 334–395.
Sterling, T. and Irwin, J. J. (2015). ZINC 15–ligand discovery for everyone. J. Chem. Inf. Model, 55(11), 2324–2337.
Subramanian, A. et al. (2017). A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6), 1437–1452.
Tian, K. et al. (2016). Boosting compound-protein interaction prediction by deep learning. Methods, 110, 64–72.
Tice, R. R. et al. (2013). Improving the human hazard characterization of chemicals: a Tox21 update. Environmental health perspectives, 121(7), 756.
Varady, J. et al. (2003). Molecular modeling of the three-dimensional structure of dopamine 3 (D3) subtype receptor: discovery of novel and potent D3 ligands through a hybrid pharmacophore- and structure-based database searching approach. Journal of medicinal chemistry, 46(21), 4377–4392.
Verbist, B. et al. (2015). Using transcriptomics to guide lead optimization in drug discovery projects: Lessons learned from the QSTAR project. Drug discovery today, 20(5), 505–513.
Wallach, I. et al. (2015). AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.
Wen, M. et al. (2017). Deep-learning-based drug–target interaction prediction. Journal of proteome research, 16(4), 1401–1409.
Yasui-Furukori, N. et al. (2002). Comparison of prolactin concentrations between haloperidol and bromperidol treatments in schizophrenic patients. Progress in Neuro-Psychopharmacology and Biological Psychiatry, 26(3), 575–578.
Yoo, M. et al. (2018). Exploring the molecular mechanisms of traditional Chinese medicine components using gene expression signatures and connectivity map. Computer methods and programs in biomedicine.
