
Rev Date: 08-02-13

Data Structures & Algorithms MOD002641

Dr. Ian van der Linde

Assignment 2012-13: Experimental Algorithm Characterisation


Hand Out Date: Friday 8th February 2013, Hand In Date: Friday 10th May 2013 [13 Work Weeks]

Introduction: During the module so far, we have conducted thought experiments to determine the best-case, worst-case and average-case behaviour of the algorithms that we have studied. For simple algorithms, this approach is perfectly possible and it has served us well. However, when an algorithm is more complicated or contains random elements, paper-based analysis becomes difficult.

In this assignment, we will consider two types of algorithm. The first relies upon random chance to stumble across the solution to a problem. We will refer to this kind of algorithm as Saltation, which is an embodiment of Hoyle's Fallacy. The second uses the principle of Cumulative Selection to evolve a solution to a problem. This is a Genetic Algorithm, an embodiment of evolution by survival of the fittest. Refer to Lecture 05 for a discussion of these techniques.

In the first part of the assignment (worth 60 marks), we will characterise the performance of these algorithms by empirical experimentation: running each algorithm on a real computer under different input conditions, and recording how much work each approach undertook to solve the problem. In the second part of the assignment (worth 40 marks), we will modify the programs and evaluate the impact that these modifications had on performance.

Two programs, written in C/C++, are provided in Appendix A.

The first uses a Saltation approach. It tries to generate a random string (which we will call the child genome) that exactly matches a target string (which we will call the target genome). It counts how many attempts were needed to accomplish this goal, which we will refer to as generations, and provides this as an output at the end of the run.

The second uses a Genetic Algorithm to tackle the same task. It evolves the random string (child genome) over successive generations by using the best matching child from the previous generation as a parent genome to sire the next generation. To create each child genome, the current parent genome is mutated slightly (i.e., we randomise some, but not all, of its characters). Again, the number of generations it took to match the target genome is recorded, and provided as an output at the end of the run.

Part 1 (60 marks): Experimental Comparison of Saltation and Genetic String Matching Algorithms

In part 1, your task is to experimentally evaluate the performance of these two algorithms under different input conditions, and to make some deductions about their behaviour.

(a) First, we will test the Saltation algorithm. It requires two command-line inputs to work, so must be run from a shell (e.g., Command Prompt in Windows, or Terminal in MacOS or Linux). The first command-line argument (following the name of the executable file) is the length of the string to be matched (the genome length); the second is the number of possible values that each character in the string can assume (the gene varieties). When you run the program, it outputs how many attempts were needed to match the target genome by repeatedly populating the child genome with random values from 0 to gene varieties - 1 and checking how many characters (genes) in the child genome match the target genome at the corresponding positions (array indices). We will refer to this process as evaluating the fitness of the child genome.

Plot a graph with genome length on the x-axis and generations to solve on the y-axis. Run the program to test its performance for a range of genome lengths (i.e., each position on the x-axis, e.g., from 1 to 10), and numbers of gene varieties (e.g., 2, 4, 8, 16, 32). This will require several separate lines on your graph, one line for each of the different numbers of gene varieties we run for (e.g., using the settings suggested above, there will be 5 separate lines). Note that this algorithm could take minutes, hours or even days to run. It might also never finish, so start with small genome lengths and low numbers for gene varieties and build up to the larger values.

For each particular setting (e.g., genome length=5, gene varieties=5) you will need to repeat your measurement a number of times to obtain a good estimate (smooth lines in your graph). We want average-case behaviour, and both of the algorithms we are testing have a random component; if we don't run each configuration a number of times and take the mean, our estimate may be nearer to the worst or best case by pure chance. If you run for the range of settings suggested above (10 genome lengths and 5 settings for gene varieties), there will be 50 data points in your graph in total, 10 for each of the 5 lines.

As mentioned above, each data point on our graph should be the mean of a number of repetitions of that setting. In addition, you should annotate each data point with error bars showing the dispersion of the samples contributing to it. Error bars can be added to Excel graphs. The standard error for a set of measurements for a given point can be calculated in Excel using the following formula, placed in a conveniently located cell, =stdev(A1:A50)/sqrt(50), assuming that we have 50 individual measurements that are stored in column A, rows 1 to 50. You could also plot in MATLAB, if you are familiar with it. Remember to label your axes, and to title your graphs. Not doing this will lose marks.

After you have completed data collection and plotted the graph, you will have characterised the Saltation algorithm. Write 2 or 3 paragraphs explaining your findings. (20 marks)
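For those who prefer to compute the mean and standard error outside Excel, the same calculation can be sketched in C++. This is only an illustration: the names RunSummary and summarise are hypothetical and do not appear in the provided programs. It uses the sample standard deviation (n - 1 denominator), which is what Excel's STDEV computes, divided by the square root of the number of samples.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Summary of the repeated measurements behind one data point on the graph.
// (Hypothetical helper type, not part of the provided programs.)
struct RunSummary {
    double mean;           // arithmetic mean of the samples
    double standardError;  // sample standard deviation / sqrt(n)
};

// Computes the mean and standard error of `generations`, the generation
// counts recorded from repeated runs of a single configuration.
RunSummary summarise(const std::vector<double>& generations) {
    assert(generations.size() >= 2);  // a standard error needs at least two samples
    const double n = static_cast<double>(generations.size());

    double sum = 0.0;
    for (double g : generations) sum += g;
    const double mean = sum / n;

    double squaredDeviations = 0.0;
    for (double g : generations) squaredDeviations += (g - mean) * (g - mean);

    // Sample standard deviation uses the (n - 1) denominator, as Excel's STDEV does.
    const double sampleStdDev = std::sqrt(squaredDeviations / (n - 1.0));

    return {mean, sampleStdDev / std::sqrt(n)};
}
```

Calling summarise once per configuration, with the generation counts from all repetitions of that configuration, gives the height of the data point (mean) and the half-length of its error bar (standardError).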

(b) Next, we will test the Cumulative Selection algorithm. Just like Saltation, it accepts command-line inputs, so must be run from a shell. As in Saltation, genome length and gene varieties are accepted as the first two command-line arguments. However, our Cumulative Selection algorithm accepts two additional arguments, corresponding to two other settings that it requires. In this question we will evaluate the impact of the first of these settings, mutation rate.

During the creation of a new child genome from a parent genome, the mutation rate dictates how likely it is that a particular gene (i.e., a character in our string) will be replaced. For example, a mutation rate of 0 means that there is no chance of any of the genes being replaced. A mutation rate of 100 means that every single gene will be replaced, making the algorithm effectively the same as Saltation.

For each genome length, and number of gene varieties, you are to test different values of mutation rate, to see what effect this has on performance (i.e., observe which setting permits the target genome to be evolved in the fewest generations). For example, you might measure the number of generations required to evolve the target genome for each mutation rate from 1% to 100%. Again, you will need to repeat every measurement a number of times, creating standard error bars in Excel or MATLAB.

This time, it makes more sense to place mutation rate on the x-axis, and generations to solve (i.e., to match the target genome) on the y-axis. If we test mutation rates 1% to 100%, we can have one line on the graph for each number of gene varieties (2, 4, 8, 16, 32), as we did in part (a), above. However, since genome length is no longer on the x-axis, we will need to generate several graphs, one for each genome length (e.g., if we test for genome lengths 1 to 10, we will need 10 graphs).

Furthermore, Cumulative Selection takes one additional argument: the children per generation. For the time being, we will fix this at 10.

After you have completed this question, you should write 2 or 3 paragraphs explaining your findings. Does the optimum mutation rate (i.e., the mutation rate that enables the target genome to be evolved in the fewest generations) differ as we change the number of gene varieties and/or the genome length, or is it always the same? (20 marks)

(c) Next, we will test the impact of the number of children per generation on the number of generations required to evolve the target genome in our Cumulative Selection algorithm. Fix the number of gene varieties to 16 for now, and create 10 graphs, one for each genome length from 1 to 10, as we did in part (b), above. On the x-axis of each graph, vary children per generation (e.g., from 1 to 100). After you have completed this question, you should write 2 or 3 paragraphs explaining your findings. (20 marks)
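Sweeping mutation rate or children per generation by hand means many separate runs. One possible way to organise the measurements is to wrap a single trial in a function that can be called in a loop. The sketch below is an assumption-laden illustration, not the provided program: the function name generationsToSolve, the maxGenerations safety cap (which returns -1 so that settings such as a mutation rate of 0 cannot hang the sweep), the use of std::mt19937 in place of rand(), and the choice to fall back to the unchanged parent when no child scores above zero are all choices made here.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Runs one cumulative-selection trial and returns the number of generations
// needed to match the target genome, or -1 if `maxGenerations` is reached
// first. The cap is a hypothetical safeguard, absent from the provided program.
long generationsToSolve(int genomeLength, int geneVarieties, int mutationRate,
                        int childrenPerGeneration, long maxGenerations,
                        std::mt19937& rng) {
    assert(genomeLength >= 0 && geneVarieties >= 1 && childrenPerGeneration >= 1);
    std::uniform_int_distribution<int> randomGene(0, geneVarieties - 1);
    std::uniform_int_distribution<int> percent(0, 99);

    // Random target, and a random first parent to start evolving from.
    std::vector<int> target(genomeLength), parent(genomeLength);
    for (int g = 0; g < genomeLength; ++g) {
        target[g] = randomGene(rng);
        parent[g] = randomGene(rng);
    }

    int highestEverFitness = 0;
    long generation = 0;
    while (highestEverFitness != genomeLength) {
        if (generation == maxGenerations) return -1;  // give up on this trial
        ++generation;

        std::vector<int> bestChild = parent;  // assumption: keep parent if no child scores above zero
        int bestFitness = 0;
        for (int c = 0; c < childrenPerGeneration; ++c) {
            std::vector<int> child = parent;  // copy parent as basis for the child
            int fitness = 0;
            for (int g = 0; g < genomeLength; ++g) {
                if (percent(rng) < mutationRate) child[g] = randomGene(rng);
                if (child[g] == target[g]) ++fitness;
            }
            if (fitness > bestFitness) {
                bestFitness = fitness;
                bestChild = child;
            }
        }
        parent = bestChild;  // best child sires the next generation
        if (bestFitness > highestEverFitness) highestEverFitness = bestFitness;
    }
    return generation;
}
```

A sweep would then loop over mutation rates (or children per generation), call this function a fixed number of times per setting with the same std::mt19937 object, and record the results for averaging. Trials that return -1 need to be reported separately rather than averaged in.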


Part 2 (40 marks): Modifying the String Matching Algorithms

In part 2, your task is to experimentally evaluate the impact of several modifications to the two algorithms that you have been provided with. You are to submit your code along with your experimental results and discussion (see below). Note that to receive full marks, your code must be commented, correctly indented, economical (the smaller the better), and use sensible/intuitive variable names. Refer to the two example programs for guidance on how to accomplish this.

(a) What will happen if, instead of the target genome being a set of randomised genes, it simply contains the same character repeated over the entire genome? For example, if we ask for a genome length of 10, we set the target genome to [0 0 0 0 0 0 0 0 0 0], irrespective of the number of gene varieties. Present the results of your experiment, in the form of one or more graphs, and a written summary of your findings (2-3 paragraphs). Note that this modification can be applied to both Saltation and Cumulative Selection algorithms. (10 marks)

(b) At present, the Cumulative Selection program has only a single parent genome to sire each generation. It is common for a genetic algorithm to keep a set of the best child genomes from the preceding generation (i.e., parent genomes for the next generation), rather than just the single best one. This set is called a breeding pool. It can be formed by selecting the N best children, sometimes expressed as a percentage of the total number of children produced per generation (e.g., keeping the top 10% of child genomes where there are 100 children per generation will entail storing 10 child genomes). This setting should be controlled through a further command-line argument.

Modify the program to keep a set of the best child genomes to become new parent genomes in the next generation; when it comes to creating each new child genome for the next generation, select one parent genome from the breeding pool at random (called a stochastic approach). What effect does the use of a breeding pool have on the number of generations needed to evolve the target genome? Present the results of your experiment, in the form of one or more graphs, and a written summary of your findings (2-3 paragraphs). (10 marks)

(c) Sometimes we choose to add some weak individuals (typically selected randomly) to the breeding pool to ensure we have good population diversity. Modify the program to enable this feature (e.g., in addition to telling the algorithm to keep the top 10% of child genomes to form the breeding pool, as in part (b), above, provide a further command-line argument that states the percentage chance that a randomly selected individual will be admitted to the breeding pool instead). What effect does this modification have on the performance of the algorithm? Present the results of your experiment, in the form of one or more graphs, and a written summary of your findings (2-3 paragraphs). (10 marks)
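To make the breeding pool idea in (b) and (c) concrete, here is a minimal sketch of one possible data layout and selection routine. It is not the expected solution and makes several assumptions: the names Individual, buildBreedingPool and pickParent are hypothetical, genomes are stored as std::vector<int> rather than the arrays used in the provided programs, and a randomly admitted individual may duplicate one already in the pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One child genome together with its fitness (hypothetical helper type).
struct Individual {
    std::vector<int> genome;
    int fitness;
};

// Builds a breeding pool of `poolSize` individuals from one generation's
// children. Each slot normally receives the next-fittest child; with
// probability `randomAdmitPercent` (0 to 100) it instead receives a child
// chosen uniformly at random, which may be a weak one.
std::vector<Individual> buildBreedingPool(std::vector<Individual> children,
                                          std::size_t poolSize,
                                          int randomAdmitPercent,
                                          std::mt19937& rng) {
    assert(poolSize >= 1 && poolSize <= children.size());

    // Sort fittest first; stable_sort keeps earlier children ahead on ties.
    std::stable_sort(children.begin(), children.end(),
                     [](const Individual& a, const Individual& b) {
                         return a.fitness > b.fitness;
                     });

    std::uniform_int_distribution<int> percent(0, 99);
    std::uniform_int_distribution<std::size_t> anyChild(0, children.size() - 1);

    std::vector<Individual> pool;
    pool.reserve(poolSize);
    for (std::size_t slot = 0; slot < poolSize; ++slot) {
        const bool admitRandom = percent(rng) < randomAdmitPercent;
        pool.push_back(admitRandom ? children[anyChild(rng)] : children[slot]);
    }
    return pool;
}

// Stochastic parent selection: every pool member is equally likely to be chosen.
const Individual& pickParent(const std::vector<Individual>& pool, std::mt19937& rng) {
    assert(!pool.empty());
    std::uniform_int_distribution<std::size_t> anyMember(0, pool.size() - 1);
    return pool[anyMember(rng)];
}
```

With randomAdmitPercent set to 0 this reduces to the plain top-N pool of part (b). Whether random admission should replace a slot, as here, or enlarge the pool is a design decision the handout leaves open.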

(d) Once we have a breeding pool in place, we can use crossover to add diversity to the child genomes we create, rather than using mutation only. To create a new child genome using the crossover technique, we select a number of parent genomes from the breeding pool, and a number of crossover points. For example, if we decide to use two parent genomes to create a child genome (similarly to nature), and have one randomly chosen crossover point from 1 to the genome length (not so natural!), we may have the situation shown below (wherein a and b represent genes from parent 1 and parent 2, respectively):

Parent 1 = [a a a a a a a a]
Parent 2 = [b b b b b b b b]
Child    = [a a a a b b b b]

In the above example, the crossover point was (randomly) in the centre of the genome (gene four). If instead we wanted to use three parent genomes to create one child genome, and had four crossover points, we might have the following situation:

Parent 1 = [a a a a a a a a]
Parent 2 = [b b b b b b b b]
Parent 3 = [c c c c c c c c]
Child    = [a a b c c c a a]

In this case, since there were insufficient parent genomes to furnish each crossed-over section of the child genome, we cycled back to the first parent again. In general, we cannot have more crossover points than the genome length. What effect does adding crossover capability have on the performance of the algorithm? Present the results of your experiment, in the form of one or more graphs, and a written summary of your findings (2-3 paragraphs). (10 marks)
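The crossover scheme described above can be sketched as a small function. This is one possible reading, not a prescribed implementation: the names Genome and crossover are hypothetical, and a crossover point p is interpreted here as "switch to the next parent before gene index p", so a point equal to the genome length leaves an empty final section. The handout does not state where the four points in the three-parent example fall; positions 2, 3, 6 and 8 are an assumption that happens to reproduce the child shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Genome = std::vector<char>;  // hypothetical alias for this sketch

// Builds a child by copying genes from `parents` in turn, switching to the
// next parent at each crossover point and cycling back to the first parent
// when the parents run out. `crossoverPoints` must be sorted in ascending
// order; all parents must have the same length.
Genome crossover(const std::vector<Genome>& parents,
                 const std::vector<std::size_t>& crossoverPoints) {
    assert(!parents.empty());
    const std::size_t genomeLength = parents[0].size();
    for (const Genome& p : parents) assert(p.size() == genomeLength);

    Genome child(genomeLength);
    std::size_t currentParent = 0;
    std::size_t nextPoint = 0;
    for (std::size_t gene = 0; gene < genomeLength; ++gene) {
        // Switch parent once for every crossover point located at this gene.
        while (nextPoint < crossoverPoints.size() &&
               crossoverPoints[nextPoint] == gene) {
            currentParent = (currentParent + 1) % parents.size();
            ++nextPoint;
        }
        child[gene] = parents[currentParent][gene];
    }
    return child;
}
```

Under these assumptions, two parents with a single point at 4 give [a a a a b b b b], and three parents with points 2, 3, 6 and 8 give [a a b c c c a a]. In a full program the points would be drawn at random and sorted before the call.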


Appendix A.1: String Matching by Saltation (saltation.cpp)


// Title:  Saltation String Matching Algorithm (Hoyle's Fallacy)
//
// Author: Ian van der Linde
//
// Date:   06-02-13

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NARGS 3

int main(int argc, char** argv)
{
    if(argc<NARGS)
    {
        printf("\nUsage: %s [Genome Length] [Gene Varieties]\n\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    srand(getpid());

    int genomeLength  = atoi(argv[1]);
    int geneVarieties = atoi(argv[2]);

    unsigned char targetGenome[genomeLength];
    unsigned char childGenome[genomeLength];

    for(int currentGene = 0; currentGene < genomeLength; currentGene++)
        targetGenome[currentGene] = rand()%geneVarieties; // Create Our Target Genome for Matching Against

    int currentFitness = 0;
    int highestEverFitness = 0;
    long int currentGenerationNumber = 0;

    while(highestEverFitness != genomeLength)
    {
        currentGenerationNumber++;
        currentFitness = 0;

        for(int currentGene = 0; currentGene < genomeLength; currentGene++)
        {
            childGenome[currentGene] = rand()%geneVarieties;

            if(childGenome[currentGene] == targetGenome[currentGene])
            {
                currentFitness++;
            }
        } // Loop To Mutate All Genes To Create A New Specimen, AND Simultaneously Calculate Its Fitness

        if(currentFitness>highestEverFitness)
        {
            highestEverFitness = currentFitness;
        } // Evaluate The Fitness Of Our New Specimen; Is It The Best We've Ever Had?

    } // Repeat Until A Perfect Specimen Is Produced By Blind Chance (Termination Condition)

    printf("%d\t\t%d\t\t%ld", genomeLength, geneVarieties, currentGenerationNumber);

    return 0;
}


Appendix A.2: String Matching by Cumulative Selection (cumulative.cpp)


// Title:  Cumulative Selection String Matching Algorithm (Genetic)
//
// Author: Ian van der Linde
//
// Date:   06-02-13

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NARGS 5

int main(int argc, char** argv)
{
    if(argc<NARGS)
    {
        printf("\nUsage: %s [Genome Length] [Gene Varieties] [Mutation Rate (0 to 100)] [Children per Generation]\n\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    srand(getpid());

    int genomeLength          = atoi(argv[1]);
    int geneVarieties         = atoi(argv[2]);
    int mutationRate          = atoi(argv[3]);
    int childrenPerGeneration = atoi(argv[4]);

    unsigned char targetGenome[genomeLength];
    unsigned char parentGenome[genomeLength];
    unsigned char currentChildGenome[genomeLength];
    unsigned char bestChildGenomeThisGeneration[genomeLength];

    for(int currentGene=0; currentGene<genomeLength; currentGene++)
    {
        targetGenome[currentGene] = rand()%geneVarieties;
        parentGenome[currentGene] = rand()%geneVarieties;
    } // Create Our Target Genome for Matching Against, AND A First Parent to Start Evolving From

    int currentChildFitness = 0, highestEverFitness = 0, highestFitnessThisGeneration=0;
    long int currentGenerationNumber=0;

    while(highestEverFitness!=genomeLength)
    {
        currentGenerationNumber++;
        highestFitnessThisGeneration = 0;

        for(int currentChild = 0; currentChild < childrenPerGeneration; currentChild++)
        {
            currentChildFitness = 0;

            for(int currentGene = 0; currentGene < genomeLength; currentGene++)
            {
                currentChildGenome[currentGene] = parentGenome[currentGene];
            } // Copy Current Parent Genome As Basis For The Next Child

            for(int currentGene = 0; currentGene < genomeLength; currentGene++)
            {
                if(rand()%100 < mutationRate)
                    currentChildGenome[currentGene] = rand()%geneVarieties;

                if(currentChildGenome[currentGene] == targetGenome[currentGene])
                    currentChildFitness++;
            } // Loop Through Current Child Genome Mutating Some Genes, AND Calculate Its Fitness

            if(currentChildFitness > highestFitnessThisGeneration)
            {
                highestFitnessThisGeneration = currentChildFitness;

                for(int currentGene = 0; currentGene < genomeLength; currentGene++)
                    bestChildGenomeThisGeneration[currentGene] = currentChildGenome[currentGene];

                if(highestFitnessThisGeneration > highestEverFitness)
                    highestEverFitness = highestFitnessThisGeneration;
            } // Evaluate Fitness Of New Child; Is It The Best We've Ever Had? If So, Keep It

        } // Generate Next Child in This Generation

        for(int currentGene = 0; currentGene < genomeLength; currentGene++) // Make New Parent From Best Child
            parentGenome[currentGene] = bestChildGenomeThisGeneration[currentGene];

    } // Repeat Until A Perfect Child Evolves (Termination Condition)

    printf("\n%d\t\t%d\t\t%d\t\t%d\t\t%ld\n",genomeLength,geneVarieties,mutationRate,childrenPerGeneration, currentGenerationNumber);
}
