
Lesson: 11

Topic: Neural Networks & Sequences


NEURAL NETWORKS

Computer science has taken two approaches to making a computer behave with what
appears to be intelligence.
1. One is the expert system approach: figuring out the rules that guide human
behavior, then representing these rules, in a totally different form, in a computer
program.
2. Neural networks exemplify a second approach. They try to mimic the way human
and animal brains function: first learning from experience how to deal with
certain types of situations, then applying this learning to new situations of the same type.

A neural network consists of multiple, interconnected cells whose behavior is modeled
on the neurons that control the behavior of humans and animals. Each neuron receives
signals from system inputs or from other neurons. Based on the signals it receives, it
generates an output, which it sends to system outputs or to other neurons.

[Figure: conceptual structure of a neural network. The network input feeds the input
layer, which connects through a hidden layer to the output layer, producing the
network outputs.]
• The cells and their interconnections are usually simulated by suitable data
structures and variables whose values change over time.
• At each step, each neuron sends a data value to the other neurons to which it is
connected, as defined by the tables that control the simulation.
• In the next time step, each neuron collects its inputs, figures out what its output
should be, and sends that output on to the next level.
• The process continues until the output emerges.
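
To make this time-stepped simulation concrete, here is a minimal Python sketch. The
layer sizes, weights, and the threshold activation function below are illustrative
assumptions, not values from the lesson:

    # Minimal sketch of a time-stepped neural network simulation.
    # The connection tables play the role of the tables that control
    # the simulation; all values here are illustrative assumptions.

    def step(x):
        """Simple threshold activation: fire (1.0) if input exceeds 0."""
        return 1.0 if x > 0 else 0.0

    # weights[i][j] is the weight from neuron j in one layer to
    # neuron i in the next layer.
    input_to_hidden = [[0.5, -0.4], [0.9, 0.1]]   # 2 inputs -> 2 hidden
    hidden_to_output = [[0.3, 0.8]]               # 2 hidden -> 1 output

    def forward(network_input):
        # Time step 1: each hidden neuron collects its inputs and
        # figures out its output.
        hidden = [step(sum(w * x for w, x in zip(row, network_input)))
                  for row in input_to_hidden]
        # Time step 2: the output neurons do the same with the hidden
        # outputs, and the network output emerges.
        return [step(sum(w * h for w, h in zip(row, hidden)))
                for row in hidden_to_output]

    print(forward([1.0, 0.0]))  # e.g. [1.0]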

How to connect the neurons?


• The challenge of developing a neural network to identify patterns is figuring
out how to connect its neurons.
• The basic approach involves making a random set of connections, trying out
the resulting network, and seeing if it produces useful outputs.
• If it does, the connections in it are given high scores.
• If it doesn’t, they are given low scores.
• The process is repeated with many other sets of connections.
• The high-scoring connections are reused in new network designs while the
low-scoring ones are discarded.
• This procedure is known as a genetic algorithm because it resembles the way
biological genetics works: the scoring for results corresponds to survival of
the fittest, while the mixing of high-scoring connections corresponds to the
mixing of parental genetic material.
• Genetic algorithms also incorporate the capability for statistical mutations,
preventing them from getting permanently bogged down with one set of
“chromosomes” and never improving beyond what that set makes possible.
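
A minimal sketch of this genetic procedure follows. The population size, mutation
rate, and the fitness function are illustrative assumptions; in particular, the
fitness function below is a stand-in for "try out the network and score its outputs":

    import random

    # Sketch of a genetic algorithm over sets of connections, coded
    # here as bit strings (1 = connection present). All values below
    # are illustrative assumptions, not from the lesson.

    TARGET = [1, 0, 1, 1, 0, 1, 0, 0]   # assumed "ideal" connection set

    def fitness(connections):
        # Stand-in for running the network and scoring its outputs.
        return sum(c == t for c, t in zip(connections, TARGET))

    def crossover(a, b):
        cut = random.randrange(1, len(a))   # mix parental "genes"
        return a[:cut] + b[cut:]

    def mutate(connections, rate=0.05):
        # Occasional random mutation keeps the population from getting
        # permanently stuck with one set of "chromosomes".
        return [1 - c if random.random() < rate else c for c in connections]

    population = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
    for generation in range(50):
        # Survival of the fittest: keep high-scoring connection sets.
        population.sort(key=fitness, reverse=True)
        parents = population[:10]
        # Breed new designs from the survivors, with mutation.
        population = parents + [mutate(crossover(random.choice(parents),
                                                 random.choice(parents)))
                                for _ in range(10)]

    best = max(population, key=fitness)
    print(best, fitness(best))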
Drawback:
• A practical drawback of neural networks is that they cannot tell us how they
reach a conclusion.

Nearest Neighbor Approaches


This approach involves finding old cases that are close to each new one and assuming
that the new case's outcome will match the majority of those neighbors. This can be
done either on the fly, by looking for neighbors each time a new case comes along, or
in advance, by predetermining the regions within which old cases tend to have one
outcome or another. This is called the k-nearest neighbors method (k-NN), where k is
the number of neighbors to look at for each point. The approach is also referred to as
memory-based reasoning, because it is based on remembering where other points fall in
the database.

The number of neighbors to consider, k, is a parameter in this type of analysis. Using too
few neighbors can create small regions where a random cluster of anomalous results
distorts the outcome.
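
A minimal on-the-fly k-NN sketch in Python follows; the sample cases, their labels,
and the choice of k are illustrative assumptions:

    from collections import Counter
    import math

    # On-the-fly k-nearest neighbors: for each new case, look up the k
    # closest old cases and take the majority outcome. The data and k
    # below are illustrative, not from the lesson.

    old_cases = [((1.0, 2.0), "buy"), ((1.5, 1.8), "buy"),
                 ((5.0, 8.0), "skip"), ((6.0, 9.0), "skip"),
                 ((5.5, 7.5), "skip")]

    def classify(new_point, k=3):
        # Sort old cases by Euclidean distance to the new case.
        neighbors = sorted(old_cases,
                           key=lambda case: math.dist(case[0], new_point))
        outcomes = [outcome for _, outcome in neighbors[:k]]
        # Majority vote among the k nearest neighbors.
        return Counter(outcomes).most_common(1)[0][0]

    print(classify((1.2, 1.9)))  # -> "buy"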
Putting the results to use

When there are more than two factors, it is usually possible to divide the set of
subjects into more than two categories.

Lift measures how much more common a desired outcome is among the subjects a model
selects than in the overall population. In data mining, high lift is good, as it means
that the data mining process has identified factors that affect the outcome. The
higher the lift, the higher the business value of the model. A model with no lift at
all, that is, a model that has no ability to predict which members of the overall
population are more or less likely to behave in a desirable or undesirable fashion,
is of no business value at all. Conversely, a model that could distinguish between
the two population subsets with total accuracy would be valuable indeed.
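
As a worked illustration (the numbers are assumed, not from the lesson): if 5% of the
overall population responds to an offer, but 20% of the subset the model scores
highest responds, the model's lift on that subset is 20% / 5% = 4. A lift of 1
corresponds to no predictive ability at all.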

SIMILARITY SEARCH OVER SEQUENCES


A lot of information stored in databases consists of sequences.
Query model: The user specifies a query sequence and wants to retrieve all data
sequences that are similar to the query sequence.
A data sequence X = <x1, ..., xk>.
A subsequence Z = <z1, ..., zj> is obtained from another sequence X = <x1, ..., xk>
by deleting numbers from the front and back of X; that is, for some offset i,
z1 = xi, z2 = xi+1, ..., zj = xi+j-1.
Given two sequences of the same length, X = <x1, ..., xk> and Y = <y1, ..., yk>, we
use the Euclidean norm as the distance between them:

||X - Y|| = sqrt( Σ i=1..k (xi - yi)² )
Similarity queries over sequences can be classified into two types.
• Complete Sequence Matching: The query sequence and the sequences in the
database have the same length. Given a user-specified threshold parameter ε,
our goal is to retrieve all sequences in the database that are within ε-distance
of the query sequence.
• Subsequence Matching: The query sequence is shorter than the sequences in
the database. In this case, we want to find all subsequences of sequences in the
database such that the subsequence is within distance ε of the query sequence.
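
To make the two query types concrete, here is a small Python sketch. The database
contents and the ε values below are illustrative assumptions:

    import math

    def distance(x, y):
        """Euclidean distance between two equal-length sequences."""
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def complete_match(query, database, eps):
        # Complete sequence matching: every data sequence has the same
        # length as the query; keep those within eps of it.
        return [seq for seq in database if distance(query, seq) <= eps]

    def subsequence_match(query, database, eps):
        # Subsequence matching: slide a window of the query's length
        # over each (longer) data sequence; keep windows within eps.
        j = len(query)
        return [(seq, i) for seq in database
                for i in range(len(seq) - j + 1)
                if distance(query, seq[i:i + j]) <= eps]

    # Illustrative data, not from the lesson:
    db = [(1.0, 2.0, 3.0, 4.0), (2.0, 2.1, 3.1, 5.0)]
    print(complete_match((1.0, 2.0, 3.0, 4.1), db, eps=0.5))
    print(subsequence_match((2.0, 3.0), db, eps=0.2))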

An algorithm to find similar sequences


Given a collection of data sequences, a query sequence, and a distance threshold ε,
how can we efficiently find all sequences within ε-distance of the query sequence?
• One possibility is to scan the database, retrieve each data sequence, and compute
its distance to the query sequence. While this algorithm has the merit of being
simple, it always retrieves every data sequence.
• Because we consider the complete sequence matching problem, all data sequences
and the query sequence have the same length.
• Each data sequence and query sequence can be represented as a point in a k-
dimensional space.
• If we insert all data sequences into a multidimensional index, we can retrieve data
sequences that exactly match the query sequence by querying the index.
• But since we want to retrieve not only data sequences that match the query
exactly but also all sequences within ε-distance of the query sequence, we do not
use a point query as defined by the query sequence.
• Instead, we query the index with a hyper-rectangle that has side-length 2ε and the
query sequence as center, and we retrieve all sequences that fall within this hyper-
rectangle.
• We then discard sequences that are actually further than ε away from the query
sequence. Using the index allows us to greatly reduce the number of sequences we
consider and significantly decreases the time to evaluate the similarity query.
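
A minimal sketch of this filter-then-refine step follows. The brute-force rectangle
scan below stands in for a real multidimensional index (for example, an R-tree), and
the data and ε are illustrative assumptions:

    import math

    def distance(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def in_hyper_rectangle(seq, query, eps):
        # The rectangle has side length 2*eps and the query as center,
        # so every coordinate must lie within eps of the query's.
        return all(abs(s - q) <= eps for s, q in zip(seq, query))

    def similarity_query(query, database, eps):
        # Filter step: in a real system this rectangle query would be
        # answered by a multidimensional index rather than by scanning;
        # the scan here is only for illustration.
        candidates = [s for s in database
                      if in_hyper_rectangle(s, query, eps)]
        # Refine step: discard candidates actually further than eps.
        return [s for s in candidates if distance(s, query) <= eps]

    db = [(1.0, 2.0, 3.0), (1.1, 2.1, 3.1), (4.0, 5.0, 6.0)]
    print(similarity_query((1.0, 2.0, 3.0), db, eps=0.5))

Because the ε-ball around the query is contained in the hyper-rectangle of side 2ε,
the filter step can produce false positives but never misses a true match, which is
why the refine step only needs to discard candidates, not search again.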
