PREVIOUS WORK
Classification and clustering of gene expression
in the form of microarray or RNA-seq data are well
studied. There are various approaches for the
classification of cancer cells and healthy cells using gene
expression profiles and supervised learning models. The
self-organizing map (SOM) was used to analyze leukemia
cancer cells. A support vector machine (SVM) with a dot
product kernel has been applied to the diagnosis of
ovarian, leukemia, and colon cancers. SVMs with
nonlinear kernels (polynomial and Gaussian) were also
used for classification of breast cancer tissues from
microarray data. Unsupervised learning techniques are
capable of finding global patterns in gene expression data.
Gene clustering groups genes according to similar expression patterns. Hierarchical clustering and maximal-margin linear programming are examples of this kind of learning, and they have been used to classify colon cancer cells. K-nearest
neighbors (KNN) has also been applied to breast cancer data. Due to the large number of genes, the high amount of noise in gene expression data, and the complexity of biological networks, there is a need to deeply analyze the raw data and exploit the important subsets of genes. Regarding this matter, other techniques such as principal component analysis (PCA) have been proposed for dimensionality reduction of expression profiles to aid clustering of the relevant genes. PCA uses an orthogonal transformation to map high-dimensional data to linearly uncorrelated components. However, PCA reduces the dimensionality of the data only linearly, so it may not capture some nonlinear relationships in the data. In contrast, other approaches such as kernel PCA (KPCA) may be capable of uncovering these nonlinear relationships.

ALGORITHM

Recurrent Neural Network (RNN)

A recurrent neural network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all inputs and outputs are independent of each other, but in tasks such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them. RNNs solve this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence.
As part of the tutorial we will implement a recurrent neural network based language model. The applications of language models are two-fold: first, a language model allows us to score arbitrary sentences based on how likely they are to occur in the real world, which gives us a measure of grammatical and semantic correctness; such models are typically used as part of machine translation systems. Second, a language model allows us to generate new text (I think that's the much cooler application). Training a language model on Shakespeare allows us to generate Shakespeare-like text.

The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other, but for many tasks that is a very bad idea: if you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences; in practice, as briefly mentioned above, it is a bit more complicated, because the hidden state typically cannot capture information from too many time steps ago.

The above diagram shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word.

Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs, and it greatly reduces the total number of parameters we need to learn.

The above diagram has outputs at each time step, but depending on the task this may not be necessary. For example, when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.

RNNs have shown great success in many NLP tasks. At this point I should mention that the most commonly used type of RNNs are LSTMs, which are much better at capturing long-term dependencies than vanilla RNNs are. But don't worry: LSTMs are essentially the same thing as the RNN we will develop in this tutorial; they just have a different way of computing the hidden state. We'll cover LSTMs in more detail in a later post.

RNN Extensions

Over the years researchers have developed more sophisticated types of RNNs to deal with some of the shortcomings of the vanilla RNN model. We will cover them in more detail in a later post, but this section serves as a brief overview so that you are familiar with the taxonomy of models.
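As a concrete illustration, the basic recurrence described in this section — a hidden state updated at every step using the same shared parameters — can be sketched in a few lines of numpy. The parameter names (U, W, V) and all sizes below are illustrative choices, not values from this report:

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 8, 16  # illustrative sizes

# Shared parameters, reused at every time step (the key property of an RNN).
U = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(inputs):
    """Unroll the RNN over a sequence of word indices.

    Returns the output distribution at each step and the final hidden state.
    """
    s = np.zeros(hidden_size)  # hidden state: the network's "memory"
    outputs = []
    for t in inputs:
        x = np.zeros(vocab_size)           # one-hot encode word index t
        x[t] = 1.0
        s = np.tanh(U @ x + W @ s)         # update hidden state from input + memory
        outputs.append(softmax(V @ s))     # distribution over the next word
    return outputs, s

outs, final_state = forward([3, 1, 4, 1, 5])  # a 5-word "sentence"
print(len(outs), outs[0].shape)               # one vocab-sized output per step
```

Note that the same U, W, and V are applied at every step; unrolling over a 5-word sentence reuses them five times rather than introducing new parameters per layer.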
COMPARISON TABLE
HMM Representation
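No details of the HMM representation survive in this section, so the following is only a generic sketch of what such a representation looks like: a transition matrix, an emission matrix, an initial distribution, and the forward algorithm for scoring an observation sequence. All states, symbols, and probabilities below are invented for illustration:

```python
import numpy as np

# A toy two-state HMM; every number here is illustrative, not from this report.
A = np.array([[0.7, 0.3],    # transition probabilities between hidden states
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # emission probabilities: P(observation | state)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial hidden-state distribution

def forward_likelihood(obs):
    """Forward algorithm: total probability of an observation sequence."""
    alpha = pi * B[:, obs[0]]            # initialize with first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate, then weight by emission
    return alpha.sum()

p = forward_likelihood([0, 1, 0])
print(p)  # likelihood of the observation sequence under this toy model
```

The same machinery extends to cancer-progression modeling by letting hidden states stand for unobserved progression stages and observations for measured markers, with A and B then being the parameters to estimate.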
PROPOSED METHOD
Gene-Panels
For most clinical applications, the use of gene-
panels to sequence only a discrete number of genes of
interest has been the method of choice, because of its
cost-efficiency, and because at the same time it achieves
high coverage of ROIs and offers simplicity in the raw and subsequent data analyses. When the number of genes sequenced is restricted to the few already analysed in previous diagnostic tests using traditional methods, this is normally called targeted re-sequencing.

Different protocols are available to design and capture panels of genes and other ROIs. In most cases, companies providing the library preparation kits offer online user-friendly tools to design the hybridisation probes or the PCR oligos to enrich the desired ROIs.

Whole-Exome-Sequencing

Protocols/kits to enrich the library for all exons are available from several companies and use the same or similar technologies as mentioned for the enrichment of gene-panels. Following sequencing, raw data analysis is relevant in order to determine the quality of the experiments, checking for difficulties that may have occurred at the level of library preparation and/or sequencing. Both steps are crucial to obtain good quality data. A high sequence-on-target yield of more than 90% of the ROIs and coverage higher than 20× per nucleotide is necessary for sufficient specificity and sensitivity in mutation detection. Normally, when less than 90% of the ROIs are sequenced but coverage is high, sample processing was suboptimal; when the ROIs are sufficiently sequenced (>90%) but coverage is low, the sequencing reaction was suboptimal and re-sequencing is required.

Data Analyses and Interpretation

After raw data are assessed for sufficient quality, data analysis and interpretation continue using different pipelines depending on the approach used (gene-panel, WES, WGS or targeted-RNA-seq) and on the questions that need to be answered.

Base-calling is performed using software such as the CASAVA pipeline, which produces FASTQ files (the raw initial data); these can then be aligned to the human reference genome using the Burrows-Wheeler-Alignment tool (BWA). Single-base variants can be identified using Sequence Alignment/Map tools (SAMtools) and annotated. Additional software and scripts (normally developed in-house) match the data from NGS analysis to variants in reference databases.

Gene-Panels in Cancer Syndromes

Initial technical difficulties were related to suboptimal enrichment of GC-rich regions and to problems in the bioinformatics pipeline in correctly calling indels, and were solved by subsequent improvements in capture protocols and data-analysis tools. Most cases carried BRCA1 gene mutations (11% of the subjects), followed by BRCA2 (6%) and 10 additional genes (6%). Loss of heterozygosity in the wild-type allele was confirmed in more than 80% of the cases.

Clinical Relevance

Numerous additional publications confirmed this potential utility. In HBOC, all studies consistently indicated that genes besides BRCA1 and BRCA2 are mutated and confer a moderate to high cancer risk. In a study of 708 consecutive patients suspected of HBOC, besides 69 germline deleterious alterations in BRCA1 and BRCA2, additional putative pathogenic mutations were identified in PALB2 (almost 1% of the patients), TP53, CHEK2, ATM, RAD51C, MSH2, PMS2 and MRE11A (between 0.4% and 0.7% of the patients), followed by RAD50, NBS1, CDH1 and BARD1 (about 0.1%).

CONCLUSION

This report presents a solid theoretical construction of how to model cancer progression and how to deal with the high number of parameters to estimate. The algorithm developed in this project seems to perform very well when detecting the hidden parameters of the model. However, there are some problems when the graph that represents cancer progression is too sparse, meaning a lack of information. New methods should be devised in order to solve this problem. Since the algorithm was implemented in MATLAB, the performance was significantly slower than expected. In retrospect, it seems that coding in C++ would have been a better choice. This is also the reason why a rigorous estimation of the transition parameters was not possible, a question that should be addressed in future work.