doi.10.1093/bioinformatics/xxxxxx
Advance Access Publication Date: Day Month Year
Manuscript Category
Abstract
Motivation: Traditional drug discovery approaches identify a target for a disease and find a compound
that binds to the target. In this approach, structures of compounds are considered as the most important
features because it is assumed that similar structures will bind to the same target. Therefore, structural
analogs of the drugs that bind to the target are selected as drug candidates. However, even though
compounds are not structural analogs, they may achieve the desired response. A new drug discovery
method based on drug response, which can complement the structure-based methods, is needed.
Results: We implemented Siamese neural networks called ReSimNet that take as input two chemical
compounds and predict the CMap score of the two compounds, which we use to measure the
transcriptional response similarity of the two compounds. ReSimNet learns the embedding vector of
a chemical compound in a transcriptional response space. ReSimNet is trained to minimize the difference
between the cosine similarity of the embedding vectors of the two compounds and the CMap score of
the two compounds. ReSimNet can find pairs of compounds that are similar in response even though
they may have dissimilar structures. In our quantitative evaluation, ReSimNet outperformed the baseline
machine learning models. The ReSimNet ensemble model achieves a Pearson correlation of 0.518 and
a precision@1% of 0.989. In addition, in the qualitative analysis, we tested ReSimNet on the ZINC15
database and showed that ReSimNet successfully identifies chemical compounds that are relevant to a
prototype drug whose mechanism of action is known.
Availability: The source code and the pre-trained weights of ReSimNet are available at
https://github.com/dmis-lab/ReSimNet
Contact: kangj@korea.ac.kr
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) (2019). Published by Oxford University Press. All rights reserved. For Permissions, please email:
journals.permissions@oup.com
2 Jeon et al.
pipelines to improve the HTS step and predict the MOA of candidate compounds (Sliwoski et al., 2014).
The ligand-based drug discovery approach, which is a commonly used CADD approach, assumes that compounds with similar chemical

aqueous solubility of chemical compounds, Lusci et al. (2013) formed multiple directed acyclic graph recursive neural networks (DAG-RNNs) into undirected graph recursive neural networks, training each DAG-RNN and combining the results of all the DAG-RNNs. However, there is no deep
[Figure 1 graphic. Upper panel: "ReSimNet Model Training and Evaluation". Lower panel: "Virtual Drug Discovery Pipeline using ReSimNet".]
Fig. 1. Pipeline of ReSimNet: The upper part of the figure illustrates the training process and the quantitative evaluation of ReSimNet. ReSimNet takes the structural information of two
compounds as input and learns the transcriptional response similarity of the two compounds. As the lower part of the figure shows, ReSimNet is applied to the drug discovery process using
the ZINC15 database. ReSimNet takes Haloperidol as one input and produces a ranked list of drug candidates from the ZINC15 database whose effects are predicted to be similar to those
of Haloperidol.
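The screening step in the lower part of Figure 1 can be sketched as a ranking by cosine similarity between a query embedding and the embeddings of library compounds. This is a minimal illustration, not the authors' implementation: the function names, the 3-dimensional vectors, and the candidate names are hypothetical, and real embeddings would come from a trained ReSimNet.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, library_embeddings, top_n=5):
    """Return the top_n (name, score) pairs, highest predicted similarity first."""
    scored = [
        (name, cosine_similarity(query_embedding, emb))
        for name, emb in library_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy embeddings (hypothetical values, not ReSimNet outputs).
query = [1.0, 0.0, 1.0]
library = {
    "candidate_A": [0.9, 0.1, 1.1],
    "candidate_B": [-1.0, 0.5, 0.0],
    "candidate_C": [0.0, 1.0, 0.0],
}
ranking = rank_candidates(query, library, top_n=2)
```

Because each library compound is embedded independently of the query, the library embeddings could in principle be computed once and reused for every prototype drug.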
compounds (K) (90%) and unknown compounds (U) (10%). The unknown compounds are considered as new compounds whose effects are unknown, and thus the unknown compounds are excluded from the training set and included in only the validation and test sets. The training set consists of only pairs of known compounds (KK), and the validation set and test set consist of pairs of known compounds (KK), pairs of known and unknown compounds (KU), and pairs of unknown compounds (UU) (Figure S1). The training, validation, and test sets are divided as follows: 70% for training, 10% for validation, and 20% for testing, respectively. Note that a KK type test sample represents a pair of known compounds that have never appeared together in the training set.

3.3 Compound representations for model input
We tested the following drug input representation methods which are used to obtain input for our model: SMILES, InChIKey, ECFP, and Mol2vec (Jaeger et al., 2018). We obtained the SMILES and InChIKey information of the 2,428 compounds from the Touchstone dataset. 2048-bit ECFPs of the compounds were produced using RDKit¹ and 300-dimensional Mol2vec vectors were generated based on ECFPs. Among the four kinds of input representations, ECFP obtained the best performance in predicting CMap scores. ECFP seems to be more suitable for our model and for learning the relationship between the response similarity and substructures of chemical compounds. Thus, we decided to use ECFP to represent drug structures and use them as input for our neural network model. The detailed results of the four input representation methods are provided in the supplementary file (Table S1).
¹ http://www.rdkit.org/

3.4 Drug response-based Siamese neural networks
To predict the CMap scores of pairwise compound inputs, we construct Siamese neural networks, motivated by the work of Koch et al. (2015). Siamese neural networks are one kind of neural network architecture that contain two or more identical subnetworks. The identical subnetworks share weights that are updated simultaneously during training. Siamese neural networks are commonly used for finding similarities or relationships between two inputs. Based on the similarity score of two compounds, our Siamese neural networks ReSimNet learns the distributed representation of each compound. The pipeline of ReSimNet is illustrated in Figure 1.

Input pairs and target label. A pair of inputs $(x_a, x_b)$ is given as two 2048-bit ECFP vectors which represent the structural information of compound $x_a$ and compound $x_b$. The target values $t_{ab}$ are the CMap scores of input pairs. Once ReSimNet is properly trained, it can predict the CMap scores of new input pairs.

Model architecture. We built two identical multilayer perceptrons (MLPs) that share the weights in our Siamese networks ReSimNet. Given an input pair, the outputs of each MLP are the embedding vectors of two compounds $(c_a, c_b)$, and are calculated as follows:

$$c_a = W_2 f(W_1 x_a + b_1) + b_2$$
$$c_b = W_2 f(W_1 x_b + b_1) + b_2$$
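The shared-weight computation in the two embedding equations can be sketched in plain Python as follows. This is a toy illustration under stated assumptions rather than the released ReSimNet code: the layer sizes (8 input bits, 4 hidden units, 3 output dimensions) are hypothetical stand-ins for the 2048-bit ECFP input and the much larger layers of the paper, the weights are random instead of trained, and `init_params`, `embed`, and the other names are invented for the example. ReLU is used as $f$, and the two embeddings are compared with cosine similarity, as the abstract describes.

```python
import math
import random


def relu(v):
    """Element-wise rectified linear unit, max(x, 0)."""
    return [max(x, 0.0) for x in v]


def affine(W, x, b):
    """Compute W x + b for a matrix W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def embed(x, params):
    """Embedding c = W2 f(W1 x + b1) + b2 with f = ReLU."""
    W1, b1, W2, b2 = params
    return affine(W2, relu(affine(W1, x, b1)), b2)


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def init_params(n_in, h, e, rng):
    """Random (untrained) parameters; a single set is shared by both branches."""
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(h)]
    b1 = [0.1] * h
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(e)]
    b2 = [0.1] * e
    return W1, b1, W2, b2


rng = random.Random(0)
params = init_params(n_in=8, h=4, e=3, rng=rng)

x_a = [1, 0, 1, 1, 0, 0, 1, 0]  # toy fingerprint bits (hypothetical)
x_b = [0, 1, 1, 0, 0, 1, 1, 0]

c_a = embed(x_a, params)  # both branches reuse the same params
c_b = embed(x_b, params)
s_ab = cosine_similarity(c_a, c_b)
s_ba = cosine_similarity(c_b, c_a)  # swapping the inputs gives the same score
```

Because both inputs pass through the same parameters and cosine similarity is symmetric, the predicted score does not depend on the order of the two compounds.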
Table 1. The performance of the baseline models and ReSimNet on all types of test samples. We calculated the Pearson correlation, MSE, and AUROC of each model.
We compared the predicted CMap scores with the actual CMap scores. We also calculated the precision@k% of the top k% samples with the highest predicted CMap
scores. The performance of the ReSimNet ensemble is measured based on the average predicted CMap scores of the 10 individual ReSimNets. Note that the test set
samples were randomly bootstrapped with replacement 1 million times and the mean and standard deviation of the scores are provided.
Mol2vec 0.021 (0.004) 0.132 (0.001) 0.056 (0.008) 0.082 (0.007) 0.105 (0.005) 0.519 (0.003) 0.820 (0.017) 0.782 (0.013) 0.724 (0.009)
ECFP 0.064 (0.004) 0.499 (0.001) 0.323 (0.007) 0.359 (0.006) 0.394 (0.004) 0.522 (0.003) 0.868 (0.016) 0.771 (0.013) 0.695 (0.009)
Linear SVR 0.049 (0.004) 0.146 (0.001) 0.169 (0.016) 0.148 (0.010) 0.153 (0.006) 0.540 (0.003) 0.657 (0.021) 0.673 (0.014) 0.655 (0.009)
Ridge 0.092 (0.004) 0.115 (0.001) 0.117 (0.010) 0.109 (0.007) 0.098 (0.004) 0.538 (0.003) 0.625 (0.021) 0.637 (0.015) 0.640 (0.009)
Gradient boosting 0.145 (0.004) 0.113 (0.001) 0.080 (0.009) 0.086 (0.007) 0.085 (0.004) 0.562 (0.002) 0.769 (0.019) 0.743 (0.013) 0.708 (0.009)
MLP 0.367 (0.005) 0.107 (0.001) 0.212 (0.015) 0.167 (0.010) 0.122 (0.005) 0.662 (0.002) 0.710 (0.020) 0.721 (0.014) 0.741 (0.008)
Random Forest 0.269 (0.005) 0.107 (0.001) 0.018 (0.001) 0.031 (0.003) 0.041 (0.002) 0.625 (0.002) 0.979 (0.007) 0.920 (0.009) 0.865 (0.007)
ReSimNet(Ensemble) 0.518 (0.004) 0.084 (0.001) 0.002 (0.002) 0.007 (0.002) 0.014 (0.002) 0.737 (0.002) 0.990 (0.005) 0.977 (0.005) 0.939 (0.005)
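The bootstrapped mean and standard deviation described in the Table 1 caption can be sketched as below. This is an illustrative sketch only: the helper names and the eight toy score pairs are hypothetical, 200 resampling rounds are used instead of the 1 million reported in the caption, and the population form of the standard deviation is an assumption, since the caption does not state which form was used.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mse(predicted, actual):
    """Mean squared error between predicted and actual scores."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def bootstrap_metric(predicted, actual, metric, n_rounds, rng):
    """Resample test pairs with replacement; return (mean, std) of the metric."""
    n = len(predicted)
    scores = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([predicted[i] for i in idx],
                             [actual[i] for i in idx]))
    mean = sum(scores) / n_rounds
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n_rounds)
    return mean, std


# Toy scores (hypothetical, not results from the paper).
actual = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.60, 0.75]
predicted = [0.15, 0.35, 0.40, 0.70, 0.95, 0.25, 0.55, 0.80]

rng = random.Random(0)
corr_mean, corr_std = bootstrap_metric(predicted, actual, pearson, 200, rng)
mse_mean, mse_std = bootstrap_metric(predicted, actual, mse, 200, rng)
```

Each entry of Table 1 has the form "mean (standard deviation)", which corresponds to the pair returned by `bootstrap_metric` for one model and one metric.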
where $W_*$ and $b_*$ are trainable weights and biases of the MLP, respectively, and $f(\cdot)$ denotes an element-wise nonlinear activation function such as a logistic sigmoid. Note that each embedding vector is calculated using the shared parameters $W_1 \in \mathbb{R}^{h \times 2048}$, $W_2 \in \mathbb{R}^{e \times h}$, $b_1 \in \mathbb{R}^{h}$ and $b_2 \in \mathbb{R}^{e}$, where $h$ is the dimension of a hidden layer, and $e$ is the dimension of an output layer. We set $h = 512$ and $e = 300$, and use a rectified linear unit ($\max(x, 0)$) as $f(\cdot)$.

We compute the cosine similarity of two output vectors $(c_a, c_b)$ from the MLPs, respectively, to predict CMap scores of two compounds. Unlike the study of Koch et al. (2015) where a weighted L1 distance is employed, we used cosine similarity as a distance measure to directly utilize the outputs of each MLP as compound embeddings.

$$s_{ab} = \frac{c_a \cdot c_b}{\lVert c_a \rVert \, \lVert c_b \rVert}$$

We can predict the response similarity of two compounds using the cosine similarity of the embedding vectors of the two compounds. A compound embedding vector of an unseen compound can be generated using the trained ReSimNet and the ECFP vector of the compound.

Loss and optimization function. We train our model to minimize the mean squared error between the target CMap score $t_{ab}$ and the cosine similarity $s_{ab}$ of two outputs as follows:

$$J(\Theta) = \frac{1}{N} \sum_{a,b} (s_{ab} - t_{ab})^2$$

where we define $N$ as the total number of training examples of input pairs, and $\Theta$ denotes the trainable parameters of our model. The hyperparameters of ReSimNet are selected using the validation set. The details of the hyperparameters and the considered values are provided in the supplementary file (Tables S2-S7). We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.005.

4 Results
4.1 Model evaluation
Baseline models. For the quantitative evaluation of ReSimNet, we compared the performance of ReSimNet with that of the following seven baseline models: ECFP and Mol2vec (Jaeger et al., 2018), both of which are structure-based chemical compound representation methods; and Support Vector Machine Regressor with a linear kernel (Linear SVR), Ridge regression, Gradient Boosting, Multi-Layer Perceptron (MLP), and Random Forest, all of which are machine learning models. ECFP and Mol2vec are used to evaluate the relationship between the structural similarity and response similarity of compounds. For ECFP, Jaccard similarity (or Tanimoto similarity) is commonly used as a structural similarity measure. We computed the Jaccard similarity of the ECFP vectors of two compounds and used the similarity as the predicted CMap score of the two compounds. Mol2vec is a model that considers the substructures of a whole structure as words and the whole structure as a sentence, and uses Word2vec (Mikolov et al., 2013) to generate a 300-dimensional vector for each compound. We calculated the cosine similarity of a pair of compounds represented using Mol2vec and used the similarity as the predicted CMap score of the two compounds.

We trained the five baseline machine learning models to directly predict the CMap scores of two compounds given the ECFPs of the two compounds as input. Unlike ReSimNet, the baseline machine learning models are sensitive to the order of the two compounds in an input. To deal with this problem, we doubled the size of the training set by replicating each sample with the order of its compounds reversed, and trained the models on this new dataset. The inference of the baseline models must be performed twice, once on the original samples and once on the samples with compounds in reverse order, and the two results are averaged to obtain the final prediction.

We also applied an ensemble method to ReSimNet and evaluated its effect on performance. It is known that an ensemble of complex high-variance models such as deep neural networks often improves performance. We evaluated the ensemble of 10 ReSimNets that are trained independently with random weight initializations. The 10-ReSimNet ensemble (by averaging the predictions) consistently outperformed the single ReSimNet model. Hence, we report the results obtained by the 10-ReSimNet ensemble model.

Performance results. As mentioned in Section 3.2, the samples in the test set are either KK, KU, or UU pairs. We report the performance on each pair type. Note that K denotes a compound used in the training set, and U denotes a compound used for validation and testing. A KK pair in the test set denotes a pair of compounds that appeared separately and were never a pair in the training set. Using the results of the KK pairs with high predicted similarity scores, we can hypothesize new uses of the known compounds, which can be considered as drug repositioning. The results of the KU pairs can be used to determine whether a model can find drug candidates that are similar to well-known drugs. The results of the UU pairs can be used to gauge the response similarity between unknown compounds. The results of the KK, KU, and UU pair samples are provided in the supplementary file (Tables S8, S9, S10).

Table 1 shows the performance results on the full test set which includes all three pair types. As shown in Table 1, the ReSimNet ensemble model
outperformed all the baseline models in all evaluation metrics. For example, the ReSimNet ensemble achieved a Pearson correlation of 0.518 and an MSE of 0.084 between the predicted similarity scores and the actual CMap scores on all test samples while MLP, which obtains the second best

properties. However, in practice, without loss of generality, all of these constraints can be removed and virtually any compound can be a candidate for screening. To demonstrate the effectiveness of ReSimNet, we report the virtual screening results of the two compounds Haloperidol and
Table 2. Top drug candidates for Haloperidol from the ZINC15 database
[Table 2 body not recovered. Surviving column headings: ZINC15 ID, ZINC15 name, Predicted similarity score, Description.]
obtains better performance on the input obtained by ECFP than on the inputs obtained by Mol2vec. The experimental results of the four input representations are provided in the supplementary file (Table S1).

The Siamese networks in our model allow us to learn the representations of drugs in a transcriptional response space. The embedding vectors (the last hidden layers of ReSimNet) represent drugs in a vector space where drugs with similar transcriptional responses are located close to each other. However, conventional structure-based representation methods (e.g., ECFP) place drugs with similar structures close to each other. Our drug embedding vectors could be used as input in addition to the conventional structure representations in downstream applications such as synergy prediction (Preuer et al., 2017; Jeon et al., 2018; Menden et al., 2018), personalized drug response prediction (Menden et al., 2013), drug toxicity prediction (Mayr et al., 2016), drug-drug interaction prediction (Lee et al., 2012), drug repositioning (Napolitano et al., 2013), or compound mechanism of action prediction. The Siamese networks could also be applied directly to other tasks such as synergy prediction. We plan to explore both directions in future work.

We acknowledge that our qualitative analysis in the use case scenario may not represent general cases. We used drugs with which we are familiar and selected the reported drugs from the results. Although we chose the cases after conducting only a small number of trials, there may be a confirmation bias in our case study. To address this issue, in future work, we plan to conduct wet-lab experiments to validate the hypotheses made by ReSimNet (e.g., Cyantraniliprole, B-Hyodeoxycholate, and 2,3-Dibromopropanol as drug candidates that may have effects similar to Haloperidol).

Finally, we would like to emphasize that ReSimNet is not intended to compete with well-established drug discovery methods but to complement the existing methods. ReSimNet can predict the transcriptional response similarity of drugs, which can be useful for drug repurposing. Moreover, since ReSimNet is not limited to structural analogs, ReSimNet can find novel drug candidates whose structure greatly differs from that of prototype drugs. We also believe that exploiting the transcriptional response similarity of drugs is important for data-driven drug discovery as the high-throughput characterization of gene expression changes in cells after compound treatment provides more information on the MOAs of the drugs.

6 Conclusion
In summary, we have developed and implemented a novel Siamese neural network model that predicts the similarity of the gene expression patterns of two compounds. Compared with models such as Mol2vec and ECFP, which depend solely on compound features, we found that ReSimNet is more effective in extracting the embedding vectors of compounds and thus more appropriate for novel drug discovery. Literature surveys also support that ReSimNet can find new compounds that are similar to Haloperidol and Selumetinib from the ZINC15 database. We found that ReSimNet can find drug candidates similar to drugs that were proven to be effective for diseases, and can reduce the search space of the drug discovery pipeline. The source code and the pre-trained weights of ReSimNet are available at https://github.com/dmis-lab/ReSimNet.

Funding
This research was supported by the National Research Foundation of Korea (NRF-2016M3A9A7916996 and NRF-2017M3C4A7065887) and by the National IT Industry Promotion Agency grant funded by the Ministry of Science and ICT and Ministry of Health and Welfare (NO. C1202-18-1001, Development Project of The Precision Medicine Hospital Information System (P-HIS)).

References
Altae-Tran, H. et al. (2017). Low data drug discovery with one-shot learning. ACS Central Science, 3(4), 283–293.
Camp, H. S. et al. (2000). Differential activation of peroxisome proliferator-activated receptor-gamma by troglitazone and rosiglitazone. Diabetes, 49(4), 539–547.