
Unit -I

Random Access Machine model


Algorithms can be measured in a machine-independent way using the Random Access Machine
(RAM) model. This model assumes a single processor. In the RAM model, instructions are
executed one after the other, with no concurrent operations. This model of computation is an
abstraction that allows us to compare algorithms on the basis of performance. The assumptions
made in the RAM model to accomplish this are:

 Each simple operation takes one time step.
 Loops and subroutines are not simple operations; they are built from many single-step operations.
 Each memory access takes one time step, and there is no shortage of memory.

For any given problem, the running time of an algorithm is taken to be the number of time
steps. The space used by an algorithm is taken to be the number of RAM memory cells.
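
As a rough illustration (not part of the original notes), the following C sketch counts simple operations the way the RAM model charges them while summing n numbers; the step counts in the comments are approximate bookkeeping, not a precise standard.

#include <stdio.h>

/* Count the simple operations performed while summing n numbers,
 * the way the RAM model would charge them: one time step each. */
long sum_with_step_count(const int *a, int n, long *steps) {
    long sum = 0;                   /* 1 step: assignment */
    *steps = 1;
    for (int i = 0; i < n; i++) {   /* comparison and increment each pass */
        sum += a[i];                /* one addition (plus a memory access) */
        *steps += 3;                /* comparison + addition + increment, roughly */
    }
    *steps += 2;                    /* loop initialisation and the final failed comparison */
    return sum;
}

int main(void) {
    int a[] = {3, 1, 4, 1, 5};
    long steps;
    long s = sum_with_step_count(a, 5, &steps);
    printf("sum = %ld, approx RAM steps = %ld\n", s, steps); /* roughly 3n + 3 */
    return 0;
}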

Review Of Data Structures

Introduction
When we talk about balanced search trees, we are specifically talking about two things:

 A search tree.
 A height-balanced tree.
Here we will consider a Binary Search Tree and try to equip it with the ability to
re-balance itself upon modification. Before getting to the logic and the code, we
need to understand the motivation behind this kind of data structure and, more precisely,
the need for such a complex scenario.

Apart from that, we also need to understand what it means for a tree to be termed balanced.

What does balancing mean?


We say a tree is balanced when it is height-balanced, which means the following:
 It is not skewed.
 It has a uniform distribution of nodes throughout the structure.
 The height is as small as possible.
 A general relation can be derived: the height is proportional to log N,
where N is the total number of nodes.
Why is balancing important in a search tree?
Balanced search trees make a tree data structure efficient: they guarantee log N running
time for all the dictionary operations (Insert, Search, Delete). This is, of course, achieved by
paying some cost, as mentioned below:

 Re-balancing upon insertion of a new node.

 Re-balancing upon deletion of an existing node.

 Balanced Trees

 Mergeable Sets
In computer science, a disjoint-set data structure, also called a union–find data structure or merge–
find set, is a data structure that keeps track of a set of elements partitioned into a number
of disjoint (non-overlapping) subsets. It provides near-constant-time operations (bounded by the
inverse Ackermann function) to add new sets, to merge existing sets, and to determine whether
elements are in the same set. In addition to many other uses (see the Applications section), disjoint-
sets play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph.

Initially there are 10 subsets, and each subset has a single element in it.
When each subset contains only a single element, each entry of the array Arr is the element itself.

Let's perform some Union operations:
1) Union(2, 1)
2) Union(4, 3)
3) Union(8, 4)
4) Union(9, 3)
5) Union(6, 5)

After each Union(A, B), the array Arr is updated so that the two elements (and their subsets) share one representative.

After performing some operations of Union(A ,B), you can see that now there are 5 subsets. First
has elements {3, 4, 8, 9}, second has {1, 2}, third has {5, 6}, fourth has {0} and fifth has {7}. All these
subsets are said to be Connected Components.

One can also relate these elements with the nodes of a graph. The elements in one subset can be
considered as the nodes of a graph which are connected to each other directly or indirectly;
therefore each subset can be considered as a connected component.

From this, we can infer that Union-Find data structure is useful in Graphs for performing various
operations like connecting nodes, finding connected components etc.

Let’s perform some Find(A, B) operations.
1) Find(0, 7) − as 0 and 7 are disconnected, this gives a false result.
2) Find(8, 9) − though 8 and 9 are not connected directly, there exists a path connecting 8 and 9,
so it gives a true result.

When we view the above operations in terms of components:

Union(A, B) − replace the components containing the two objects A and B with their union.
Find(A, B) − check whether the two objects A and B are in the same component or not.

So if we perform the operation Union(5, 2) on the above components, the components {5, 6} and {1, 2}
merge into a single component, and Arr is updated accordingly.
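
The following is a minimal C sketch of such a Union-Find structure over the array Arr, assuming the parent-array representation with path compression; the function names (make_sets, root, union_sets, find) are illustrative, not a standard API.

#include <stdio.h>

#define N 10

/* Arr[i] holds the parent of element i; an element whose parent is itself
 * is the representative (root) of its subset. Initially every element is
 * its own subset, as in the example above. */
int Arr[N];

void make_sets(void) {
    for (int i = 0; i < N; i++)
        Arr[i] = i;
}

/* Follow parent links to the representative, compressing the path on the way. */
int root(int x) {
    while (Arr[x] != x) {
        Arr[x] = Arr[Arr[x]];   /* path compression (halving) */
        x = Arr[x];
    }
    return x;
}

/* Union(A, B): merge the components containing A and B. */
void union_sets(int a, int b) {
    int ra = root(a), rb = root(b);
    if (ra != rb)
        Arr[ra] = rb;
}

/* Find(A, B): are A and B in the same component? */
int find(int a, int b) {
    return root(a) == root(b);
}

int main(void) {
    make_sets();
    union_sets(2, 1);
    union_sets(4, 3);
    union_sets(8, 4);
    union_sets(9, 3);
    union_sets(6, 5);
    printf("Find(0,7) = %d\n", find(0, 7)); /* 0: disconnected   */
    printf("Find(8,9) = %d\n", find(8, 9)); /* 1: same component */
    return 0;
}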

Algorithm Design Techniques

 Iterative Techniques
In computational mathematics, an iterative method is a mathematical procedure that generates a
sequence of improving approximate solutions for a class of problems, in which the n-th
approximation is derived from the previous ones. A specific implementation of an iterative
method, including the termination criteria, is an algorithm of the iterative method. An iterative
method is called convergent if the corresponding sequence converges for given initial
approximations. A mathematically rigorous convergence analysis of an iterative method is
usually performed; however, heuristic-based iterative methods are also common. In the problems
of finding the root of an equation (or a solution of a system of equations), an iterative method
uses an initial guess to generate successive approximations to a solution. In contrast, direct
methods attempt to solve the problem by a finite sequence of operations. In the absence of
rounding errors, direct methods would deliver an exact solution (for example, solving a linear system of
equations Ax = b by Gaussian elimination). Iterative
methods are often the only choice for nonlinear equations. However, iterative methods are often
useful even for linear problems involving a large number of variables (sometimes of the order of
millions), where direct methods would be prohibitively expensive (and in some cases
impossible) even with the best available computing power.

 Divide and Conquer


In the divide and conquer approach, the problem at hand is divided into smaller sub-problems, and
then each sub-problem is solved independently. When we keep on dividing the sub-problems into
even smaller sub-problems, we may eventually reach a stage where no more division is possible.
Those "atomic", smallest possible sub-problems (fractions) are solved. The solutions of all sub-
problems are finally merged in order to obtain the solution of the original problem.
Broadly, we can understand divide-and-conquer approach in a three-step process.
Divide/Break
This step involves breaking the problem into smaller sub-problems. Sub-problems should
represent a part of the original problem. This step generally takes a recursive approach to divide
the problem until no sub-problem is further divisible. At this stage, sub-problems become
atomic in nature but still represent some part of the actual problem.

Conquer/Solve
This step receives a lot of smaller sub-problems to be solved. Generally, at this level, the
problems are considered 'solved' on their own.

Merge/Combine
When the smaller sub-problems are solved, this stage recursively combines them until they
formulate a solution of the original problem. This algorithmic approach works recursively, and
the conquer and merge steps work so closely together that they appear as one.

Examples
The following computer algorithms are based on the divide-and-conquer programming approach −

 Merge Sort
 Quick Sort
 Binary Search
 Strassen's Matrix Multiplication
 Closest pair (points)
There are various ways to solve any computer problem, but the ones
mentioned above are good examples of the divide and conquer approach.
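
As a small illustration, the following C sketch shows binary search written in the divide-and-conquer style; the array values are arbitrary test data.

#include <stdio.h>

/* Binary search as a divide-and-conquer example: divide the range in half,
 * conquer by recursing into the half that can contain the key, and the
 * combine step is trivial (just return the answer). Assumes a[] is sorted. */
int binary_search(const int a[], int lo, int hi, int key) {
    if (lo > hi)                 /* atomic case: empty range, key absent */
        return -1;
    int mid = lo + (hi - lo) / 2;
    if (a[mid] == key)
        return mid;
    if (key < a[mid])
        return binary_search(a, lo, mid - 1, key);   /* left half  */
    return binary_search(a, mid + 1, hi, key);       /* right half */
}

int main(void) {
    int a[] = {10, 14, 19, 27, 33, 35, 42, 44};
    printf("index of 27 = %d\n", binary_search(a, 0, 7, 27));
    printf("index of 30 = %d\n", binary_search(a, 0, 7, 30));
    return 0;
}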

 Dynamic Programming
The dynamic programming approach is similar to divide and conquer in breaking down the problem
into smaller and yet smaller possible sub-problems. But unlike divide and conquer, these sub-
problems are not solved independently. Rather, the results of these smaller sub-problems are
remembered and reused for similar or overlapping sub-problems.

Dynamic programming is used where we have problems which can be divided into similar sub-
problems, so that their results can be re-used. Mostly, these algorithms are used for
optimization. Before solving the sub-problem at hand, a dynamic algorithm will examine
the results of the previously solved sub-problems. The solutions of sub-problems are combined
in order to achieve the best solution.

So we can say that −

 The problem should be able to be divided into smaller overlapping sub-problems.

 An optimum solution can be achieved by using an optimum solution of smaller sub-problems.

 Dynamic algorithms use memoization.

Comparison
In contrast to greedy algorithms, where local optimization is addressed, dynamic algorithms are
motivated for an overall optimization of the problem.

In contrast to divide and conquer algorithms, where solutions are combined to achieve an
overall solution, dynamic algorithms use the output of a smaller sub-problem and then try to
optimize a bigger sub-problem. Dynamic algorithms use memoization to remember the output
of already solved sub-problems.

Example
The following computer problems can be solved using dynamic programming approach −

 Fibonacci number series

 Knapsack problem

 Tower of Hanoi
 All pair shortest path by Floyd-Warshall

 Shortest path by Dijkstra

 Project scheduling

Dynamic programming can be used in both a top-down and a bottom-up manner. And of course,
most of the time, referring to the previous solution output is cheaper than recomputing in terms
of CPU cycles.
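
A minimal C sketch of the idea follows, using the Fibonacci series as the overlapping sub-problem; the memo table and its size limit are illustrative assumptions.

#include <stdio.h>

#define MAX_N 64

/* Top-down dynamic programming: remember (memoize) the result of every
 * sub-problem so overlapping sub-problems are solved only once. */
long long memo[MAX_N];

long long fib(int n) {
    if (n <= 1)
        return n;
    if (memo[n] != 0)        /* already solved: reuse the stored answer */
        return memo[n];
    memo[n] = fib(n - 1) + fib(n - 2);
    return memo[n];
}

int main(void) {
    printf("fib(40) = %lld\n", fib(40));   /* fast; naive recursion would recompute heavily */
    return 0;
}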

 Greedy Algorithms
An algorithm is designed to achieve an optimum solution for a given problem. In the greedy
approach, decisions are made from the given solution domain. Being greedy, the closest
solution that seems to provide an optimum solution is chosen.

Greedy algorithms try to find a localized optimum solution, which may eventually lead to
globally optimized solutions. However, generally greedy algorithms do not provide globally
optimized solutions.

Counting Coins
This problem is to reach a desired value by choosing the fewest possible coins; the greedy
approach forces the algorithm to pick the largest possible coin at each step. If we are provided coins of ₹ 1,
2, 5 and 10 and we are asked to count ₹ 18, then the greedy procedure will be −

 1 − Select one ₹ 10 coin, the remaining count is 8

 2 − Then select one ₹ 5 coin, the remaining count is 3

 3 − Then select one ₹ 2 coin, the remaining count is 1

 4 − And finally, the selection of one ₹ 1 coin solves the problem

It seems to work fine: for this amount we need to pick only 4 coins. But if we
slightly change the problem, the same approach may not be able to produce the same
optimum result.

For a currency system where we have coins of value 1, 7 and 10, counting coins for the value 18 will
be absolutely optimum, but for an amount like 15 it may use more coins than necessary. For
example, the greedy approach will use 10 + 1 + 1 + 1 + 1 + 1, a total of 6 coins, whereas the same
problem could be solved by using only 3 coins (7 + 7 + 1).

Hence, we may conclude that the greedy approach picks an immediate optimized solution and
may fail where global optimization is a major concern.
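
The following C sketch mirrors the counting-coins discussion above; the denominations are taken from the example and the function name greedy_coins is illustrative.

#include <stdio.h>

/* Greedy coin counting: repeatedly pick the largest coin that still fits.
 * With denominations {1, 2, 5, 10} this happens to be optimal; with
 * {1, 7, 10} and amount 15 it uses 6 coins where 3 (7 + 7 + 1) suffice. */
int greedy_coins(const int coins[], int ncoins, int amount) {
    int used = 0;
    for (int i = 0; i < ncoins && amount > 0; i++) {   /* coins[] sorted descending */
        while (amount >= coins[i]) {
            amount -= coins[i];
            used++;
        }
    }
    return used;
}

int main(void) {
    int indian[] = {10, 5, 2, 1};
    int odd[]    = {10, 7, 1};
    printf("{1,2,5,10}, amount 18: %d coins\n", greedy_coins(indian, 4, 18)); /* 4 */
    printf("{1,7,10},   amount 15: %d coins\n", greedy_coins(odd, 3, 15));    /* 6, not optimal */
    return 0;
}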
Examples
Most networking algorithms use the greedy approach. Here is a list of a few of them −

 Travelling Salesman Problem

 Prim's Minimal Spanning Tree Algorithm

 Kruskal's Minimal Spanning Tree Algorithm

 Dijkstra's Shortest Path Algorithm

 Graph - Map Coloring

 Graph - Vertex Cover

 Knapsack Problem

 Job Scheduling Problem


There are lots of similar problems that use the greedy approach to find an optimum solution.
Unit- II

Selection Sort
Selection sort is a simple sorting algorithm. This sorting algorithm is an in-
place comparison-based algorithm in which the list is divided into two parts,
the sorted part at the left end and the unsorted part at the right end.
Initially, the sorted part is empty and the unsorted part is the entire list.

The smallest element is selected from the unsorted array and swapped with
the leftmost element, and that element becomes a part of the sorted array.
This process continues, moving the unsorted array boundary one element to
the right.

This algorithm is not suitable for large data sets as its average and worst
case complexities are of Ο(n²), where n is the number of items.

How Selection Sort Works?


Consider the depicted array as an example.

For the first position in the sorted list, the whole list is scanned sequentially.
The first position is where 14 is stored presently; we search the whole list and
find that 10 is the lowest value.

So we swap 14 with 10. After one iteration 10, which happens to be the
minimum value in the list, appears in the first position of the sorted list.

For the second position, where 33 is residing, we start scanning the rest of
the list in a linear manner.
We find that 14 is the second lowest value in the list and it should appear at
the second place. We swap these values.

After two iterations, two least values are positioned at the beginning in a
sorted manner.

The same process is applied to the rest of the items in the array.

Following is a pictorial depiction of the entire sorting process −


Now, let us learn some programming aspects of selection sort.

Algorithm
Step 1 − Set MIN to location 0
Step 2 − Search the minimum element in the list
Step 3 − Swap with value at location MIN
Step 4 − Increment MIN to point to next element
Step 5 − Repeat until list is sorted
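
A minimal C sketch of these steps follows; the test array is arbitrary example data.

#include <stdio.h>

/* Selection sort: grow the sorted part on the left by repeatedly selecting
 * the minimum of the unsorted part and swapping it into position. */
void selection_sort(int a[], int n) {
    for (int i = 0; i < n - 1; i++) {
        int min = i;                      /* Step 1: assume the current position holds the minimum */
        for (int j = i + 1; j < n; j++)   /* Step 2: search the unsorted part for the minimum */
            if (a[j] < a[min])
                min = j;
        if (min != i) {                   /* Step 3: swap it into the sorted part */
            int tmp = a[i];
            a[i] = a[min];
            a[min] = tmp;
        }
    }                                     /* Steps 4-5: advance the boundary and repeat */
}

int main(void) {
    int a[] = {14, 33, 27, 10, 35, 19, 42, 44};
    selection_sort(a, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}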

Bubble Sort Algorithm


Bubble sort is a simple sorting algorithm. This sorting algorithm is a
comparison-based algorithm in which each pair of adjacent elements is
compared and the elements are swapped if they are not in order. This
algorithm is not suitable for large data sets as its average and worst case
complexity are of Ο(n²), where n is the number of items.

How Bubble Sort Works?


We take an unsorted array for our example. Bubble sort takes Ο(n²) time, so
we're keeping the example short and precise.

Bubble sort starts with very first two elements, comparing them to check
which one is greater.

In this case, value 33 is greater than 14, so these two are already in sorted order.


Next, we compare 33 with 27.

We find that 27 is smaller than 33 and these two values must be swapped.

The new array should look like this −

Next we compare 33 and 35. We find that both are in already sorted
positions.

Then we move to the next two values, 35 and 10.

We then find that 10 is smaller than 35. Hence they are not sorted.
We swap these values. We find that we have reached the end of the array.
After one iteration, the array should look like this −

To be precise, we are now showing how an array should look like after each
iteration. After the second iteration, it should look like this −

Notice that after each iteration, at least one value moves to the end.

And when no swap is required, bubble sort learns that the array is
completely sorted.

Now we should look into some practical aspects of bubble sort.

Algorithm
We assume list is an array of n elements. We further assume
that the swap function swaps the values of the given array elements.

begin BubbleSort(list)

   for each pass over list
      swapped = false
      for each adjacent pair (i, i+1) of list
         if list[i] > list[i+1]
            swap(list[i], list[i+1])
            swapped = true
         end if
      end for
      if swapped = false
         break            // no swap required: the list is sorted
      end if
   end for

   return list

end BubbleSort

Insertion Sort
This is an in-place comparison-based sorting algorithm. Here, a sub-list is
maintained which is always sorted. For example, the lower part of an array
is maintained to be sorted. An element which is to be inserted into this
sorted sub-list has to find its appropriate place and then be
inserted there. Hence the name, insertion sort.

The array is searched sequentially and unsorted items are moved and
inserted into the sorted sub-list (in the same array). This algorithm is not
suitable for large data sets as its average and worst case complexity are of
Ο(n²), where n is the number of items.

How Insertion Sort Works?


We take an unsorted array for our example.

Insertion sort compares the first two elements.

It finds that both 14 and 33 are already in ascending order. For now, 14 is
in the sorted sub-list.

Insertion sort moves ahead and compares 33 with 27.

And finds that 33 is not in the correct position.


It swaps 33 with 27. It also checks with all the elements of sorted sub-list.
Here we see that the sorted sub-list has only one element 14, and 27 is
greater than 14. Hence, the sorted sub-list remains sorted after swapping.

By now we have 14 and 27 in the sorted sub-list. Next, it compares 33 with 10.

These values are not in sorted order.

So we swap them.

However, swapping makes 27 and 10 unsorted.

Hence, we swap them too.

Again we find 14 and 10 in an unsorted order.

We swap them again. By the end of the third iteration, we have a sorted sub-
list of 4 items.
This process goes on until all the unsorted values are covered in a sorted
sub-list. Now we shall see some programming aspects of insertion sort.

Algorithm
Now we have a bigger picture of how this sorting technique works, so we
can derive simple steps by which we can achieve insertion sort.
Step 1 − If it is the first element, it is already sorted. return 1;
Step 2 − Pick the next element
Step 3 − Compare with all elements in the sorted sub-list
Step 4 − Shift all the elements in the sorted sub-list that are greater than the value to be sorted
Step 5 − Insert the value
Step 6 − Repeat until list is sorted
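
A minimal C sketch of these steps follows; the test array is arbitrary example data.

#include <stdio.h>

/* Insertion sort: take the next unsorted element, shift every larger element
 * of the sorted sub-list one place to the right, and insert it into the hole. */
void insertion_sort(int a[], int n) {
    for (int i = 1; i < n; i++) {      /* a[0..0] is trivially sorted (Step 1) */
        int key = a[i];                /* Step 2: pick the next element         */
        int j = i - 1;
        while (j >= 0 && a[j] > key) { /* Steps 3-4: compare and shift          */
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;                /* Step 5: insert the value              */
    }                                  /* Step 6: repeat until the list is sorted */
}

int main(void) {
    int a[] = {14, 33, 27, 10, 35, 19, 42, 44};
    insertion_sort(a, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}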

Sorting Techniques
Sorting refers to arranging data in a particular format. A sorting algorithm
specifies the way to arrange data in a particular order. The most common orders
are numerical or lexicographical order.

The importance of sorting lies in the fact that data searching can be
optimized to a very high level, if data is stored in a sorted manner. Sorting
is also used to represent data in more readable formats. Following are some
of the examples of sorting in real-life scenarios −

 Telephone Directory − The telephone directory stores the telephone numbers
of people sorted by their names, so that the names can be searched easily.

 Dictionary − The dictionary stores words in alphabetical order so that
searching for any word becomes easy.

In-place Sorting and Not-in-place Sorting


Sorting algorithms may require some extra space for comparison and
temporary storage of a few data elements. Algorithms that do not require
any extra space are said to sort in-place, that is,
within the array itself. This is called in-place sorting. Bubble sort is an
example of in-place sorting.
However, in some sorting algorithms, the program requires space which is
more than or equal to the number of elements being sorted. Sorting which uses equal
or more space is called not-in-place sorting. Merge sort is an example of
not-in-place sorting.

Stable and Not Stable Sorting


If a sorting algorithm, after sorting the contents, does not change the
relative order of equal elements in which they appear, it is called stable
sorting.

If a sorting algorithm, after sorting the contents, changes the relative order of
equal elements in which they appear, it is called unstable sorting.

Stability of an algorithm matters when we wish to maintain the original sequence of
equal elements, as in a tuple, for example.

Adaptive and Non-Adaptive Sorting Algorithm


A sorting algorithm is said to be adaptive if it takes advantage of already
'sorted' elements in the list that is to be sorted. That is, while sorting, if the
source list has some elements already sorted, adaptive algorithms take
this into account and try not to re-order them.
A non-adaptive algorithm is one which does not take into account the
elements which are already sorted. It tries to force every single element
to be re-ordered to confirm that it is sorted.

Important Terms
Some terms are generally used while discussing sorting techniques; here
is a brief introduction to them −

Increasing Order
A sequence of values is said to be in increasing order, if the successive
element is greater than the previous one. For example, 1, 3, 4, 6, 8, 9 are
in increasing order, as every next element is greater than the previous
element.

Decreasing Order
A sequence of values is said to be in decreasing order, if the successive
element is less than the current one. For example, 9, 8, 6, 4, 3, 1 are in
decreasing order, as every next element is less than the previous element.

Non-Increasing Order
A sequence of values is said to be in non-increasing order, if the
successive element is less than or equal to its previous element in the
sequence. This order occurs when the sequence contains duplicate values.
For example, 9, 8, 6, 3, 3, 1 are in non-increasing order, as every next
element is less than or equal to (in case of 3) but not greater than any
previous element.

Non-Decreasing Order
A sequence of values is said to be in non-decreasing order, if the
successive element is greater than or equal to its previous element in the
sequence. This order occurs when the sequence contains duplicate values.
For example, 1, 3, 3, 6, 8, 9 are in non-decreasing order, as every next
element is greater than or equal to (in case of 3) but not less than the
previous one.

Quick Sort
Quick sort is a highly efficient sorting algorithm and is based on partitioning
an array of data into smaller arrays. A large array is partitioned into two
arrays, one of which holds values smaller than a specified value, say pivot,
based on which the partition is made, and the other array holds values
greater than the pivot value.

Quick sort partitions an array and then calls itself recursively twice to sort
the two resulting subarrays. This algorithm is quite efficient for large-sized
data sets: its average case complexity is Ο(n log n), although the worst case is Ο(n²),
where n is the number of items.

Partition in Quick Sort


The pivot value divides the list into two parts. Recursively, we find a
pivot for each sub-list until every sub-list contains only one element.

Quick Sort Pivot Algorithm


Based on our understanding of partitioning in quick sort, we will now try to
write an algorithm for it, which is as follows.
Step 1 − Choose the highest index value as pivot
Step 2 − Take two variables to point left and right of the list excluding pivot
Step 3 − left points to the low index
Step 4 − right points to the high index
Step 5 − while value at left is less than pivot move right
Step 6 − while value at right is greater than pivot move left
Step 7 − if both step 5 and step 6 do not match swap left and right
Step 8 − if left ≥ right, the point where they meet is the new pivot position

Quick Sort Algorithm


Using pivot algorithm recursively, we end up with smaller possible
partitions. Each partition is then processed for quick sort. We define
recursive algorithm for quicksort as follows −
Step 1 − Make the right-most index value pivot
Step 2 − partition the array using pivot value
Step 3 − quicksort left partition recursively
Step 4 − quicksort right partition recursively
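
A minimal C sketch follows. It uses the common Lomuto partition scheme with the right-most element as pivot, which is a slight variant of the two-pointer partition sketched above; names and test data are illustrative.

#include <stdio.h>

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Partition a[lo..hi] around the right-most element as pivot (Lomuto scheme):
 * everything <= pivot ends up to its left, everything greater to its right.
 * Returns the pivot's final position. */
static int partition(int a[], int lo, int hi) {
    int pivot = a[hi];
    int i = lo - 1;
    for (int j = lo; j < hi; j++)
        if (a[j] <= pivot)
            swap(&a[++i], &a[j]);
    swap(&a[i + 1], &a[hi]);
    return i + 1;
}

/* Quick sort: partition, then sort each side recursively. */
void quick_sort(int a[], int lo, int hi) {
    if (lo < hi) {
        int p = partition(a, lo, hi);
        quick_sort(a, lo, p - 1);
        quick_sort(a, p + 1, hi);
    }
}

int main(void) {
    int a[] = {35, 33, 42, 10, 14, 19, 27, 44};
    quick_sort(a, 0, 7);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}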


Heap Sort

A heap is a special case of a balanced binary tree data structure where the
root-node key is compared with its children and arranged accordingly.
If α has a child node β, then −

key(α) ≥ key(β)

As the value of the parent is greater than that of the child, this property
generates a Max Heap. Based on this criterion, a heap can be of two types −

Min-Heap − Where the value of the root node is less than or equal to
either of its children.

Max-Heap − Where the value of the root node is greater than or equal to
either of its children.

For the input → 35 33 42 10 14 19 27 44 26 31, both trees are constructed
using the same input and order of arrival.

Max Heap Construction Algorithm


We shall use the same example to demonstrate how a Max Heap is created.
The procedure to create Min Heap is similar but we go for min values
instead of max values.

We are going to derive an algorithm for max heap by inserting one element
at a time. At any point of time, the heap must maintain its property. During
insertion, we also assume that we are inserting a node into an already
heapified tree.
Step 1 − Create a new node at the end of heap.
Step 2 − Assign new value to the node.
Step 3 − Compare the value of this child node with its parent.
Step 4 − If value of parent is less than child, then swap them.
Step 5 − Repeat step 3 & 4 until Heap property holds.

Note − In Min Heap construction algorithm, we expect the value of the


parent node to be less than that of the child node.

Let's walk through Max Heap construction using the same input sample that we used earlier.

Max Heap Deletion Algorithm


Let us derive an algorithm to delete from max heap. Deletion in Max (or
Min) Heap always happens at the root to remove the Maximum (or
minimum) value.
Step 1 − Remove root node.
Step 2 − Move the last element of last level to root.
Step 3 − Compare the value of this child node with its parent.
Step 4 − If value of parent is less than child, then swap them.
Step 5 − Repeat step 3 & 4 until Heap property holds.
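
The two procedures can be sketched in C as follows, using the usual array representation of a heap (the parent of index i sits at (i-1)/2). The fixed capacity and function names are illustrative assumptions; extracting repeatedly yields the values in descending order, which is the basis of heap sort.

#include <stdio.h>

#define CAP 64

/* Array-backed max heap: parent of i is (i-1)/2, children are 2i+1 and 2i+2. */
int heap[CAP];
int size = 0;

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Insertion: place the new value at the end of the heap, then sift it up
 * while it is larger than its parent (Steps 1-5 of the construction algorithm). */
void heap_insert(int value) {
    int i = size++;
    heap[i] = value;
    while (i > 0 && heap[(i - 1) / 2] < heap[i]) {
        swap(&heap[(i - 1) / 2], &heap[i]);
        i = (i - 1) / 2;
    }
}

/* Deletion: remove the root (maximum), move the last element to the root,
 * then sift it down while a child is larger (Steps 1-5 of the deletion algorithm). */
int heap_extract_max(void) {
    int max = heap[0];
    heap[0] = heap[--size];
    int i = 0;
    for (;;) {
        int largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l] > heap[largest]) largest = l;
        if (r < size && heap[r] > heap[largest]) largest = r;
        if (largest == i) break;
        swap(&heap[i], &heap[largest]);
        i = largest;
    }
    return max;
}

int main(void) {
    int input[] = {35, 33, 42, 10, 14, 19, 27, 44, 26, 31};
    for (int i = 0; i < 10; i++)
        heap_insert(input[i]);
    /* Extracting repeatedly yields the values in descending order: heap sort. */
    while (size > 0)
        printf("%d ", heap_extract_max());
    printf("\n");
    return 0;
}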

Merge Sort Algorithm


Merge sort is a sorting technique based on the divide and conquer approach.
With a worst-case time complexity of Ο(n log n), it is one of the most
respected algorithms.

Merge sort first divides the array into equal halves and then combines them
in a sorted manner.

How Merge Sort Works?


To understand merge sort, we take an unsorted array as the following −

We know that merge sort first divides the whole array iteratively into equal
halves until atomic values are achieved. We see here that an array of
8 items is divided into two arrays of size 4.
This does not change the sequence of appearance of items in the original.
Now we divide these two arrays into halves.

We further divide these arrays until we reach atomic values which cannot be
divided any further.

Now, we combine them in exactly the same manner as they were broken
down. Please note the color codes given to these lists.

We first compare the element for each list and then combine them into
another list in a sorted manner. We see that 14 and 33 are in sorted
positions. We compare 27 and 10 and in the target list of 2 values we put
10 first, followed by 27. We change the order of 19 and 35 whereas 42 and
44 are placed sequentially.

In the next iteration of the combining phase, we compare lists of two data
values and merge them into a list of four data values, placing all of them in
sorted order.

After the final merging, the list should look like this −

Now we should learn some programming aspects of merge sorting.

Algorithm
Merge sort keeps on dividing the list into equal halves until it can no more
be divided. By definition, if it is only one element in the list, it is sorted.
Then, merge sort combines the smaller sorted lists keeping the new list
sorted too.
Step 1 − if it is only one element in the list it is already sorted, return.
Step 2 − divide the list recursively into two halves until it can no more be divided.
Step 3 − merge the smaller lists into new list in sorted order.
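
A minimal C sketch of these steps follows; the temporary buffer size and the test array are illustrative assumptions.

#include <stdio.h>
#include <string.h>

/* Merge two already-sorted halves a[lo..mid] and a[mid+1..hi] into sorted order,
 * using a temporary buffer (merge sort is not-in-place). */
static void merge(int a[], int tmp[], int lo, int mid, int hi) {
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    memcpy(&a[lo], &tmp[lo], (size_t)(hi - lo + 1) * sizeof(int));
}

/* Merge sort: split in half until one element remains, then merge back up. */
static void merge_sort_rec(int a[], int tmp[], int lo, int hi) {
    if (lo >= hi)            /* one element: already sorted */
        return;
    int mid = lo + (hi - lo) / 2;
    merge_sort_rec(a, tmp, lo, mid);
    merge_sort_rec(a, tmp, mid + 1, hi);
    merge(a, tmp, lo, mid, hi);
}

void merge_sort(int a[], int n) {
    int tmp[64];             /* scratch space; enough for this small example */
    merge_sort_rec(a, tmp, 0, n - 1);
}

int main(void) {
    int a[] = {14, 33, 27, 10, 35, 19, 42, 44};
    merge_sort(a, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}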

Shell Sort
Shell sort is a highly efficient sorting algorithm and is based on the insertion
sort algorithm. This algorithm avoids large shifts, as happen in insertion sort
when a smaller value is far to the right and has to be moved to the far left.

This algorithm uses insertion sort on widely spread elements first to sort
them, and then sorts the less widely spaced elements. This spacing is
termed the interval. The interval is calculated based on Knuth's formula as

Knuth's Formula
h = h * 3 + 1

where −
h is interval with initial value 1

This algorithm is quite efficient for medium-sized data sets. Its running time
depends on the chosen gap sequence; with Knuth's increments the worst case is
about Ο(n^(3/2)), where n is the number of items.

How Shell Sort Works?


Let us consider the following example to have an idea of how shell sort
works. We take the same array we have used in our previous examples. For
our example and ease of understanding, we take the interval of 4. Make a
virtual sub-list of all values located at the interval of 4 positions. Here these
values are {35, 14}, {33, 19}, {42, 27} and {10, 44}
We compare values in each sub-list and swap them (if necessary) in the
original array. After this step, the new array should look like this −

Then, we take interval of 2 and this gap generates two sub-lists - {14, 27,
35, 42}, {19, 10, 33, 44}

We compare and swap the values, if required, in the original array. After
this step, the array should look like this −

Finally, we sort the rest of the array using interval of value 1. Shell sort
uses insertion sort to sort the array.
Following is the step-by-step depiction −

We see that it required only four swaps to sort the rest of the array.

Algorithm
Following is the algorithm for shell sort.
Step 1 − Initialize the value of h
Step 2 − Divide the list into smaller sub-lists of equal interval h
Step 3 − Sort these sub-lists using insertion sort
Step 4 − Repeat until the complete list is sorted
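
A minimal C sketch follows, using Knuth's gap sequence from the formula above; the test array is illustrative.

#include <stdio.h>

/* Shell sort with Knuth's gap sequence h = 3h + 1 (1, 4, 13, 40, ...):
 * gapped insertion sort with shrinking intervals, finishing with h = 1. */
void shell_sort(int a[], int n) {
    int h = 1;
    while (h < n / 3)
        h = 3 * h + 1;                 /* largest Knuth gap below n */
    while (h >= 1) {
        for (int i = h; i < n; i++) {  /* insertion sort on the h-spaced sub-lists */
            int key = a[i];
            int j = i;
            while (j >= h && a[j - h] > key) {
                a[j] = a[j - h];
                j -= h;
            }
            a[j] = key;
        }
        h /= 3;                        /* shrink the interval */
    }
}

int main(void) {
    int a[] = {35, 33, 42, 10, 14, 19, 27, 44};
    shell_sort(a, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}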

External Sorting
Review
External Sorting--This term is used to refer to sorting methods that are employed
when the data to be sorted is too large to fit in primary memory.

Characteristics of External Sorting


1. During the sort, some of the data must be stored externally. Typically the data
will be stored on tape or disk.
2. The cost of accessing data is significantly greater than either bookkeeping or
comparison costs.
3. There may be severe restrictions on access. For example, if tape is used, items
must be accessed sequentially.

Criteria for Developing an External Sorting Algorithm


1. Minimize number of times an item is accessed.
2. Access items in sequential order

Sort Merge
Sort merge is the strategy of choice for external sorting because it:

1. Accesses records sequentially


2. Minimizes block accesses
3. Gives a stable sort

Sort Merge Strategy


1. Divide the file into runs such that the size of a run is small enough to fit into
main memory
2. Sort each run in main memory using a fast in-memory sorting algorithm
3. Merge the resulting runs together into successively bigger runs, until the file is
sorted.

Balanced Multiway Merging


Strategy
1. Select an equal number of I/O units (e.g., tapes, disks)
2. Sort Step
o Select half of the I/O units to be output files
o Using an in-memory sorting algorithm, create the initial runs and divide
them evenly among the output files
3. Merge Step
o (a) On each merge step, alternate the I/O units so that on one step a unit
is an input file and on the next step it is an output file.
o (b) Read one run from each of the input files, merge the runs, and store
the resulting run on an output file. Alternate output files so that the runs
are evenly distributed on the output files.
o (c) Repeat step (b) until all the runs on the input files have been merged.
o (d) Repeat steps (a)-(c) until the file has been sorted.
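
As a rough sketch of a single merge step, the following C code merges two sorted runs; plain arrays stand in here for the sequential files (tapes or disks) that an external sort would actually use.

#include <stdio.h>

/* One merge step in sort-merge external sorting: two sorted runs are read
 * front-to-front and merged into one bigger run, touching every record
 * exactly once and only in sequential order. */
int merge_runs(const int run1[], int n1, const int run2[], int n2, int out[]) {
    int i = 0, j = 0, k = 0;
    while (i < n1 && j < n2)
        out[k++] = (run1[i] <= run2[j]) ? run1[i++] : run2[j++];
    while (i < n1) out[k++] = run1[i++];
    while (j < n2) out[k++] = run2[j++];
    return k;   /* length of the merged run */
}

int main(void) {
    int run1[] = {10, 14, 27, 33};      /* initial runs created by in-memory sorting */
    int run2[] = {19, 35, 42, 44};
    int out[8];
    int n = merge_runs(run1, 4, run2, 4, out);
    for (int i = 0; i < n; i++)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}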
Unit-III
Lower Bounding Techniques
Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.

The following decision tree is for the concept "buy computer", which indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a
class.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.

 It is easy to comprehend.

 The learning and classification steps of a decision tree are simple and fast.

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data
due to noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −

 Pre-pruning − The tree is pruned by halting its construction early.

 Post-pruning - This approach removes a sub-tree from a fully grown tree.

Cost Complexity
The cost complexity is measured by the following two parameters −

 Number of leaves in the tree, and

 Error rate of the tree.

Adversaries
String processing
Processing strings of characters is one of the oldest applications of mechanical computers, arguably
predating numerical computation by at least fifty years. Assuming you've already solved the problem of
how to represent characters in memory (e.g. as the C char type encoded in ASCII), there are two
standard ways to represent strings:

As a delimited string, where the end of a string is marked by a special character. The advantages of this
method are that only one extra byte is needed to terminate an arbitrarily long string, that
strings can be manipulated by simple pointer operations, and in some cases that common string
operations that involve processing the entire string can be performed very quickly. The disadvantage is
that the delimiter can't appear inside any string, which limits what kind of data you can store in a string.

As a counted string, where the string data is prefixed or supplemented with an explicit count of the
number of characters in the string. The advantage of this representation is that a string can hold
arbitrary data (including delimiter characters) and that one can quickly jump to the end of the string
without having to scan its entire length. The disadvantage is that maintaining a separate count typically
requires more space than adding a one-byte delimiter (unless you limit your string length to 255
characters) and that more care needs to be taken to make sure that the count is correct.
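
A small C sketch contrasting the two representations; the struct counted_string type and its fixed buffer are illustrative assumptions, not a standard library type.

#include <stdio.h>
#include <string.h>

/* Two ways to represent the same text in C. A delimited string marks its end
 * with a special character ('\0'), so finding the length means scanning it.
 * A counted string carries an explicit length, so the length is known without
 * a scan and the data may contain any byte, including the delimiter itself. */
struct counted_string {
    size_t len;
    char   data[32];          /* fixed buffer just for the example */
};

int main(void) {
    const char *delimited = "hello";                 /* 6 bytes: 5 chars + '\0' */
    struct counted_string counted = {5, "hello"};

    printf("delimited length (scan): %zu\n", strlen(delimited));
    printf("counted length (stored): %zu\n", counted.len);

    /* A counted string can hold the delimiter itself as ordinary data. */
    struct counted_string with_nul = {3, {'a', '\0', 'b'}};
    printf("counted string may contain NUL, length %zu\n", with_nul.len);
    return 0;
}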

Knuth-Morris-Pratt (KMP)

Algorithm            Processing time   Matching time
Knuth-Morris-Pratt   O(m)              O(n)

Algorithm

• The Knuth-Morris-Pratt (KMP) string searching algorithm differs from the brute-force algorithm by keeping
track of information gained from previous comparisons.
• A failure function (f) is computed that indicates how much of the last comparison can be reused if it fails.
• Specifically, f(j) is defined to be the length of the longest prefix of the pattern P[0..j] that is also a suffix of P[1..j].
- Note: not a suffix of P[0..j]
• Example - values of the KMP failure function:

  j    0 1 2 3 4 5
  P[j] a b a b a c
  f(j) 0 0 1 2 3 0

• This shows how much of the beginning of the string matches up to the portion immediately preceding a failed
comparison.
- If the comparison fails at position 4, we know the a,b in positions 2,3 are identical to the a,b in positions 0,1.

Pros
1. Its searching complexity is O(m+n), which is faster than brute force and Rabin-Karp in the worst case.
2. It's fairly easy to implement.

Cons
1. It needs additional space and time − O(m) for pre-processing.
2. The basic Morris-Pratt failure function can be optimized a bit further (which is what Knuth-Morris-Pratt does).

Final Words
Obviously this algorithm is quite useful because it improves in some very elegant manner the brute force
matching. In the other hand you must know that there are faster string searching algorithms like the Boyer-
Moore algorithm. However the Morris-Pratt algorithm can be quite useful in many cases, so understanding its
principles can be very handy
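
A minimal C sketch of the failure-function computation and the search follows; the fixed pattern-length limit and the function names are illustrative assumptions. For the pattern "ababac" it reproduces the table above.

#include <stdio.h>
#include <string.h>

/* Compute the KMP failure function: f[j] is the length of the longest proper
 * prefix of P[0..j] that is also a suffix of P[1..j]. For "ababac" this gives
 * 0 0 1 2 3 0, matching the table above. */
void kmp_failure(const char *p, int m, int f[]) {
    f[0] = 0;
    int k = 0;
    for (int j = 1; j < m; j++) {
        while (k > 0 && p[k] != p[j])
            k = f[k - 1];          /* fall back to a shorter reusable prefix */
        if (p[k] == p[j])
            k++;
        f[j] = k;
    }
}

/* Search for pattern p in text t, reusing previous comparisons via f[]. */
int kmp_search(const char *t, const char *p) {
    int n = (int)strlen(t), m = (int)strlen(p);
    int f[64];                     /* assumes the pattern is shorter than 64 chars */
    kmp_failure(p, m, f);
    int k = 0;                     /* number of pattern chars currently matched */
    for (int i = 0; i < n; i++) {
        while (k > 0 && p[k] != t[i])
            k = f[k - 1];
        if (p[k] == t[i])
            k++;
        if (k == m)
            return i - m + 1;      /* match starts here */
    }
    return -1;
}

int main(void) {
    printf("%d\n", kmp_search("abacababacac", "ababac"));  /* 4 */
    return 0;
}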

Boyer-Moore

Rabin-Karp
Rabin-Karp string search algorithm

The Rabin-Karp algorithm is a string searching algorithm created by Michael O.
Rabin and Richard M. Karp in 1987 that uses hashing to find a substring in a text. It is
used for multiple pattern matching rather than single pattern matching. For text of
length n and pattern of length m, its average and best case running time is O(n), but
the worst case performance of O(nm) is a reason why it is not widely used. However,
it has the unique advantage of being able to find any one of k strings or fewer in O(n)
time on average, regardless of the magnitude of k.

A practical application of Rabin-Karp is detecting plagiarism. Given source material,
Rabin-Karp can rapidly search through a paper for instances of sentences from the
source material, ignoring details such as case and punctuation. Because of the
abundance of the sought strings, single-string searching algorithms are impractical.


Shifting substrings search and competing algorithms


A simple substring search algorithm checks all possible positions:

 1 function NaiveSearch(string s[1..n], string sub[1..m])
 2     for i from 1 to n-m+1
 3         for j from 1 to m
 4             if s[i+j-1] ≠ sub[j]
 5                 jump to next iteration of outer loop
 6         return i
 7     return not found

This algorithm works well in many practical cases, but can exhibit relatively long
running times on certain examples, such as searching for a string of 10,000 "a"s
followed by a "b" in a string of 10 million "a"s, in which case it exhibits its worst-
case Θ(mn) time.

The Knuth-Morris-Pratt algorithm reduces this to Θ(n) time using precomputation to
examine each text character only once; the Boyer-Moore algorithm skips forward not by
1 character, but by as many as possible for the search to succeed, effectively
decreasing the number of times we iterate through the outer loop, so that the number
of characters examined can be as small as n/m in the best case. The Rabin-Karp
algorithm focuses instead on speeding up lines 3-6.

Use of hashing for shifting substring search


Rather than pursuing more sophisticated skipping, the Rabin-Karp algorithm seeks to
speed up the testing of equality of the pattern to the substrings in the text by using
a hash function. A hash function is a function which converts every string into a
numeric value, called its hash value; for example, we might have hash("hello")=5.
Rabin-Karp exploits the fact that if two strings are equal, their hash values are also
equal. Thus, it would seem all we have to do is compute the hash value of the
substring we're searching for, and then look for a substring with the same hash value.

However, there are two problems with this. First, because there are so many different
strings, to keep the hash values small we have to assign some strings the same
number. This means that if the hash values match, the strings might not match; we
have to verify that they do, which can take a long time for long substrings. Luckily, a
good hash function promises us that on most reasonable inputs, this won't happen too
often, which keeps the average search time good.

The algorithm is as follows:

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m]); hs := hash(s[1..m])
 3     for i from 1 to n-m+1
 4         if hs = hsub
 5             if s[i..i+m-1] = sub
 6                 return i
 7         hs := hash(s[i+1..i+m])
 8     return not found

Lines 2, 5, and 7 each require Θ(m) time. However, line 2 is only executed once, and
line 5 is only executed if the hash values match, which is unlikely to happen more
than a few times. Line 4 is executed n times, but only requires constant time. So the
only problem is line 7.

If we naively recompute the hash value for the substring s[i+1..i+m], this would
require Θ(m) time, and since this is done on each loop, the algorithm would require
Ω(mn) time, the same as the most naive algorithms. The trick to solving this is to note
that the variable hs already contains the hash value of s[i..i+m-1]. If we can use this
to compute the next hash value in constant time, then our problem will be solved.

We do this using what is called a rolling hash. A rolling hash is a hash function
specially designed to enable this operation. One simple example is adding up the
values of each character in the substring. Then, we can use this formula to compute
the next hash value in constant time:
hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]
This simple function works, but will result in statement 5 being executed more often
than other more sophisticated rolling hash functions such as those discussed in the
next section.

Notice that if we're very unlucky, or have a very bad hash function such as a constant
function, line 5 might very well be executed n times, on every iteration of the loop.
Because it requires Θ(m) time, the whole algorithm then takes a worst-case Θ(mn)
time.

Hash function used


The key to Rabin-Karp performance is the efficient computation of hash values of the
successive substrings of the text. One popular and effective rolling hash function
treats every substring as a number in some base, the base being usually a large prime.
For example, if the substring is "hi" and the base is 101, the hash value would be 104
× 101^1 + 105 × 101^0 = 10609 (the ASCII value of 'h' is 104 and of 'i' is 105).

Technically, this algorithm is only similar to the true number in a non-decimal system
representation, since for example we could have the "base" less than one of the
"digits". See hash function for a much more detailed discussion. The essential benefit
achieved by such representation is that it is possible to compute the hash value of the
next substring from the previous one by doing only a constant number of operations,
independent of the substrings' lengths.

For example, if we have the text "abracadabra" and we are searching for a pattern of
length 3, we can compute the hash of "bra" from the hash of "abr" (the previous
substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 × 101^2 (97
is the ASCII value of 'a' and 101 is the base we are using), multiplying by the base and adding
for the last 'a' of "bra", i.e. 97 × 101^0 = 97. If the substrings in question are long, this
algorithm achieves great savings compared with many other hashing schemes.

Theoretically, there exist other algorithms that could provide convenient
recomputation, e.g. multiplying together the ASCII values of all characters so that shifting the
substring would only entail dividing by the first character and multiplying by the last.
The limitation, however, is the limited size of the integer data type and the necessity of
using modular arithmetic to scale down the hash results, for which see hash
function article; meanwhile, those naive hash functions that would not produce large
numbers quickly, like just adding ASCII values, are likely to cause many hash
collisions and hence slow down the algorithm. Hence the described hash function is
typically the preferred one in Rabin-Karp.
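
A minimal C sketch of Rabin-Karp with such a rolling hash follows; the base 101 comes from the example above, while the modulus Q and the function name are illustrative choices.

#include <stdio.h>
#include <string.h>

/* Rabin-Karp with the polynomial rolling hash described above: each substring
 * is treated as a number in base B, reduced modulo a prime Q to keep values
 * small. B = 101 and Q are illustrative choices, not prescribed constants. */
#define B 101ULL
#define Q 1000000007ULL

int rabin_karp(const char *s, const char *sub) {
    size_t n = strlen(s), m = strlen(sub);
    if (m == 0 || m > n) return -1;

    unsigned long long hsub = 0, hs = 0, pow = 1;
    for (size_t i = 0; i < m; i++) {
        hsub = (hsub * B + (unsigned char)sub[i]) % Q;
        hs   = (hs   * B + (unsigned char)s[i])   % Q;
        if (i > 0) pow = (pow * B) % Q;       /* pow = B^(m-1) mod Q */
    }

    for (size_t i = 0; i + m <= n; i++) {
        /* Hash values equal? Verify character by character (hashes can collide). */
        if (hs == hsub && memcmp(s + i, sub, m) == 0)
            return (int)i;
        if (i + m < n) {
            /* Rolling update: drop s[i], shift by the base, append s[i+m] - constant time. */
            hs = (hs + Q * B - (unsigned long long)(unsigned char)s[i] * pow % Q) % Q;
            hs = (hs * B + (unsigned char)s[i + m]) % Q;
        }
    }
    return -1;
}

int main(void) {
    printf("%d\n", rabin_karp("abracadabra", "bra"));  /* 1 */
    return 0;
}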
Rabin-Karp and multiple pattern search
Rabin-Karp is inferior for single pattern searching to Knuth-Morris-Pratt
algorithm, Boyer-Moore string searching algorithm and other faster single pattern string
searching algorithms because of its slow worst case behavior. However, Rabin-Karp is
an algorithm of choice for multiple pattern search.

That is, if we want to find any of a large number, say k, fixed length patterns in a text,
we can create a simple variant of Rabin-Karp that uses a Bloom filter or a set data
structure to check whether the hash of a given string belongs to a set of hash values of
patterns we are looking for:
function RabinKarpSet(string s[1..n], set of string subs, m) {
    set hsubs := emptySet
    for each sub in subs
        insert hash(sub[1..m]) into hsubs
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs ∈ hsubs
            if s[i..i+m-1] = a substring with hash hs
                return i
        hs := hash(s[i+1..i+m])
    return not found
}

Here we assume all the substrings have a fixed length m, but this assumption can be
eliminated. We simply compare the current hash value against the hash values of all
the substrings simultaneously using a quick lookup in our set data structure, and then
verify any match we find against all substrings with that hash value.

Other algorithms can search for a single pattern in O(n) time, and hence they can be
used to search for k patterns in O(n k) time. In contrast, the variant Rabin-Karp above
can find all k patterns in O(n+k) time in expectation, because a hash table checks
whether a substring hash equals any of the pattern hashes in O(1) time.
