Machine learning (ML) is a branch of artificial intelligence in which an algorithm learns to
recognize and predict patterns in a set of data. Deep learning (DL) is a subtype of ML that uses
multi-layered deep neural networks (DNNs) to recognize images, sounds and text in the data it is fed
(1). DL can perform multifactorial analyses as well as dimensionality reduction,
which makes it a favourable approach for identifying the HLA alleles that are linked with SJS/TEN (2).
Figure 1: The evolution of neural networks with the introduction of deep learning. Adapted from
Waldrop (3).
DL can be broadly categorized into supervised and unsupervised learning. In supervised learning, the
algorithm is trained on a dataset that has been labelled with the correct classifications. Alternatively, in
unsupervised DL, the algorithm is left to learn hidden and inherent patterns within a dataset by
Component Analysis (PCA). It can perform dimensionality reduction on genotype data and
reveal unknown relationships between individuals (5). So far, unsupervised DL has been the
preferred method for genomic research, although supervised DL is becoming feasible because
deep learning models can process and encode genomic data whilst retaining its salient
features (6).
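As an illustration of the dimensionality reduction described above, the following is a minimal sketch of PCA applied to a genotype matrix. It assumes genotypes are coded as 0, 1 or 2 copies of an allele per SNP; the function name `pca_project`, the two simulated populations and all parameter values are hypothetical and are not taken from the cited study (5).

```python
import numpy as np

def pca_project(genotypes, n_components=2):
    """Project a samples-by-SNPs genotype matrix onto its top principal components."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # centre each SNP column
    # SVD of the centred matrix: the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates
    explained = (S ** 2) / np.sum(S ** 2)            # variance ratio per axis
    return scores, explained[:n_components]

# Simulated data (an assumption for illustration only): two populations
# of 20 individuals whose allele frequencies differ across 50 SNPs.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=50)
freq_b = np.clip(freq_a + rng.normal(0, 0.3, size=50), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, 50))
pop_b = rng.binomial(2, freq_b, size=(20, 50))
genotypes = np.vstack([pop_a, pop_b])

scores, explained = pca_project(genotypes)
print(scores.shape)  # one 2-D coordinate per individual
print(explained)     # share of variance captured by each component
```

Plotting the two columns of `scores` against each other would give the kind of low-dimensional view in which relationships between individuals can become visible.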
Several types of neural network are used for deep learning. One main type is the feedforward
neural network (FNN). FNNs are trained to map a fixed-size input (for example, an image) to a fixed-size output
(for example, the probabilities of each image category). Convolutional Neural Networks (CNNs) are a subtype of FNN that
can process data supplied as multiple arrays; for example, the different layers of pixels from a colour image
can be fed to a CNN. This makes CNNs easier to train and to apply widely. So far, CNNs have been
successfully used for document reading and speech recognition. Recurrent Neural Networks (RNNs)
are another class, which processes a sequence one element at a time whilst retaining information
about the past elements of the sequence. Despite being difficult to train, RNNs have been able to
Three subsets of data are used for DL. Training datasets are used to learn the parameters of
the model. Validation datasets are used to choose the most appropriate model. Additionally, test datasets are
able to analyse an unfamiliar dataset efficiently. To achieve this, a compromise needs to
be reached between the size of the data fed to the model and the flexibility of the model. If the model is too
basic, it will fail to discover patterns within the data. Conversely, an overly complex model will
DL is especially important for genomic analysis because of the size and diversity of the data collected from
the human genome. Convolutional Neural Networks (CNNs), which are a type of deep learning model,
allow large sections of genetic data to be reduced while identifying the important sections in the
input data (8). Supervised DL has provided the most accurate results in genomic analyses so far (9).
However, issues such as adapting DL to interpret the genome within the medical paradigm,
This research project aims to create an autoencoder that can accurately compress
multidimensional data in order to identify the HLA alleles that are associated with CBZ-SJS/TEN. A similar
approach was used in a recent study to predict the genes associated with cancer tumours,
potentially identifying the activated pathways and suitable treatments. To accomplish this, a
variational autoencoder (VAE) was trained on data obtained from The Cancer Genome
Atlas. In addition to compressing data, VAEs can also be used to analyse latent spaces to identify
treatments, biological pathways and cancer states (12). Although this approach is useful and
innovative, this project does not require a VAE, as only the identification of the culprit alleles is
required. Another recent study attempted this by using a stacked autoencoder to graphically
classify DNA sequences of HLA-A alleles. The autoencoder was able to compress a dataset with 3288
dimensions to 2. The compressed data were presented as a graphical projection onto a two-dimensional feature space,
showing the autoencoder's ability to accurately categorise the allele subtypes. In addition to the
binary numerical vector graph, a document vector graph was produced (11).
Figure 3: Graphical representation of 2 dimensional
Figure 4: Graphical representation of histogram based document
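The idea of compressing binary-encoded DNA sequences to a two-dimensional feature space can be sketched as follows. This is a minimal single-hidden-layer autoencoder written with NumPy, not the stacked architecture used in the cited study (11). The toy sequences, the helper names (`one_hot`, `mutate`, `train_autoencoder`, `encode`) and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat binary vector (4 bits per base)."""
    vec = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        vec[i, BASES.index(base)] = 1.0
    return vec.ravel()

def mutate(seq, rng, n_mut=1):
    """Return a copy of seq with n_mut positions redrawn at random."""
    s = list(seq)
    for pos in rng.choice(len(s), size=n_mut, replace=False):
        s[pos] = BASES[rng.integers(4)]
    return "".join(s)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30, 30)))

def train_autoencoder(X, code_dim=2, epochs=2000, lr=0.1, seed=0):
    """Fit encoder (tanh) and decoder (sigmoid) weights by gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, code_dim)); b1 = np.zeros(code_dim)
    W2 = rng.normal(0, 0.1, (code_dim, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W1 + b1)   # 2-D code for each sequence
        R = sigmoid(Z @ W2 + b2)   # reconstruction of the binary input
        eps = 1e-9
        # Binary cross-entropy, summed over bits and averaged over samples
        loss = -np.mean(np.sum(X * np.log(R + eps)
                               + (1 - X) * np.log(1 - R + eps), axis=1))
        losses.append(loss)
        dA2 = (R - X) / n              # gradient at decoder pre-activation
        dW2 = Z.T @ dA2; db2 = dA2.sum(axis=0)
        dA1 = (dA2 @ W2.T) * (1 - Z ** 2)  # back through tanh
        dW1 = X.T @ dA1; db1 = dA1.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return (W1, b1, W2, b2), losses

def encode(X, params):
    W1, b1, _, _ = params
    return np.tanh(X @ W1 + b1)

# Toy data (hypothetical): two "allele groups", each made of noisy
# copies of a 12-base reference sequence.
rng = np.random.default_rng(1)
refs = ["ACGTTGCAAGCT", "TTGACCGTAGGA"]
seqs = [mutate(ref, rng) for ref in refs for _ in range(20)]
X = np.array([one_hot(s) for s in seqs])   # shape (40, 48)

params, losses = train_autoencoder(X)
codes = encode(X, params)                  # shape (40, 2)
print(losses[0], losses[-1])
```

Each row of `codes` is a point in the two-dimensional feature space, so a scatter plot of `codes` coloured by group would be the analogue of the graphical projection described above. A stacked autoencoder would insert further hidden layers between the input and the code.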
1. Yu D, Deng L. Deep Learning and Its Applications to Signal and Information Processing
[Exploratory DSP]. IEEE Signal Processing Magazine. 2011;28(1):145-54.
2. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-44.
3. Waldrop MM. News Feature: What are the limits of deep learning? Proceedings of the
National Academy of Sciences. 2019;116(4):1074-7.
4. Sathya R, Abraham A. Comparison of Supervised and Unsupervised Learning Algorithms for
Pattern Classification. 2013;2(2).
5. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography
within Europe. Nature. 2008;456(7218):98-101.
6. Schrider DR, Kern AD. Supervised Machine Learning for Population Genetics: A New
Paradigm. Trends in Genetics. 2018;34(4):301-12.
7. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in
genomics. Nature Genetics. 2019;51(1):12-8.
8. Graham B. Fractional Max-Pooling. 2014.
9. Gupta A, Zou J. Feedback GAN (FBGAN) for DNA: a Novel Feedback-Loop Architecture for
Optimizing Protein Functions. 2018.
10. Ghorbani A, Abid A, Zou J. Interpretation of Neural Networks is Fragile. 2017.
11. Miyake J, Kaneshita Y, Asatani S, Tagawa S, Niioka H, Hirano T. Graphical classification of
DNA sequences of HLA alleles by deep learning. Human Cell. 2018;31(2):102-5.
12. Way GP, Greene CS. Extracting a biologically relevant latent space from cancer
transcriptomes with variational autoencoders. Pac Symp Biocomput. 2018;23:80-91.