Table of Content
Chapter |
Title |
Page |
1 |
Chapter 1 : Introduction |
1 |
1.1 |
Introduction |
1 |
1.2 |
Algorithm |
1 |
1.3 |
Types of Algorithm |
1 |
1.4 |
Top-down vs. Bottom-up object database design |
1 |
1.5 |
Top-Down Algorithms taken into study |
3 |
16 |
Bottoms-Up Algorithms feasible for this project |
3 |
1.7 |
Conclusion |
3 |
1.7.1 |
TOP of the tree |
3 |
1.7.2 |
BOTTOM of the tree |
4 |
1.8 |
Memoization, Tabulation |
4 |
1.9 |
Recursive |
5 |
1.10 |
Practical concerns |
5 |
1.11 |
Optimality |
5 |
1.12 |
Advanced optimizations |
5 |
2 |
Chapter 2 : Recursive-Descent Parsing: Top-Down Approach |
6 |
2.1 |
Parsing |
6 |
2.2 |
Recursive Descent |
6 |
2.3 |
Eliminating Left Recursion |
6 |
2.4 |
Left-Factoring |
7 |
2.5 |
Recursive Descent Parsing |
7 |
2.5.1 |
Example |
8 |
3 |
Chapter 3 : Genetic Algorithms: Bottom-Up Approach |
10 |
3.1 |
Introduction |
10 |
3.2 |
Introduction to Optimization |
10 |
3.3 |
What are Genetic Algorithms? |
10 |
3.4 |
Advantages of GAs |
11 |
3.5 |
Limitations of GAs |
11 |
3.6 |
Genetic Algorithms – Motivation |
11 |
3.6.1 |
Solving Difficult Problems |
11 |
3.6.2 |
Failure of Gradient Based Methods |
11 |
3.6.3 |
Getting a Good Solution Fast |
12 |
3.7 |
Genetic Algorithms - Fundamentals |
12 |
3.7.1 |
Basic Terminology |
12 |
3.7.2 |
Basic Structure |
14 |
3.7.3 |
Genotype Representation |
15 |
3.7.4 |
Binary Representation |
15 |
3.7.5 |
Real Valued Representation |
15 |
3.7.6 |
Integer Representation |
16 |
3.7.7 |
Permutation Representation |
16 |
3.8 |
Genetic Algorithms - Population |
16 |
3.8.1 |
Population Initialization |
16 |
3.8.2 |
Population Models |
17 |
3.8.2.1 |
Steady State |
17 |
3.8.2.2 |
Generational |
17 |
3.9 |
Genetic Algorithms - Fitness Function |
17 |
3.10 |
Genetic Algorithms - Parent Selection |
18 |
3.10.1 |
Fitness Proportionate Selection |
18 |
3.10.1.1 |
Roulette Wheel Selection |
18 |
3.10.1.2 |
Stochastic Universal Sampling (SUS) |
19 |
3.10.2 |
Tournament Selection |
20 |
3.10.3 |
Rank Selection |
20 |
3.10.4 |
Random Selection |
21 |
3.11 |
Genetic Algorithms - Crossover |
21 |
3.11.1 |
Introduction to Crossover |
21 |
3.11.2 |
Crossover Operators |
21 |
3.11.3 |
One Point Crossover |
21 |
3.11.4 |
Multi Point Crossover |
22 |
3.11.5 |
Uniform Crossover |
22 |
3.11.6 |
Whole Arithmetic Recombination |
22 |
3.11.7 |
Davis’ Order Crossover (OX1) |
22 |
3.12 |
Genetic Algorithms – Mutation |
23 |
3.12.1 |
Mutation Operators |
23 |
3.12.2 |
Bit Flip Mutation |
23 |
3.12.3 |
Random Resetting |
23 |
3.12.4 |
Swap Mutation |
23 |
3.12.5 |
Scramble Mutation |
24 |
3.12.6 |
Inversion Mutation |
24 |
3.13 |
Genetic Algorithms - Survivor Selection |
24 |
3.13.1 |
Age Based Selection |
24 |
3.13.2 |
Fitness Based Selection |
25 |
3.14 |
Genetic Algorithms - Termination Condition |
26 |
3.15.1 |
Models of Lifetime Adaptation |
26 |
3.14.2 |
Lamarckian Model |
26 |
3.14.3 |
Baldwinian Model |
27 |
3.14.4 |
Effective Implementation |
27 |
3.14.5 |
Introduce problem-specific domain knowledge |
27 |
3.14.6 |
Reduce Crowding |
27 |
3.14.7 |
Randomization Helps! |
28 |
3.14.8 |
Hybridize GA with Local Search |
28 |
3.14.8.1 |
Variation of parameters and techniques |
29 |
3.15 |
Genetic Algorithms - Application Areas |
29 |
4 |
Summary |
30 |
5 |
Literature references |
31 |
List of Figures:
Fig No |
Title |
Page |
1 |
Top-down Parsing data flow format |
2 |
2 |
Bottoms-Up Parsing data flow format |
3 |
3 |
General process flow |
10 |
4 |
Failure of Gradient Based Methods |
12 |
5 |
Genetic Algorithm Basic flow |
13 |
6 |
Encoding and Decoding |
13 |
7 |
Basic Structure |
14 |
8 |
Binary Representation |
15 |
9 |
Real Valued Representation |
16 |
10 |
Integer Representation |
16 |
11 |
Permutation Representation |
16 |
12 |
Fitness Calculation |
18 |
13 |
Roulette Wheel Selection |
19 |
14 |
Stochastic Universal Sampling (SUS) |
19 |
15 |
Tournament Selection |
20 |
16 |
Rank Selection |
20 |
17 |
One Point Crossover |
21 |
18 |
Multi Point Crossover |
22 |
19 |
Uniform Crossover |
22 |
20 |
Whole Arithmetic Recombination |
22 |
21 |
Davis’ Order Crossover |
23 |
22 |
Bit Flip Mutation |
23 |
23 |
Swap Mutation |
24 |
24 |
Scramble Mutation |
24 |
25 |
Inversion Mutation |
24 |
26 |
Age Based Selection |
25 |
27 |
Fitness Based Selection |
25 |
28 |
Michalewicz’s view of the EA |
27 |
29 |
Crossbreed with Native Quest Engine |
28 |
List of Tables:
Table |
Title |
Page |
1 |
Rank Selection |
21 |
Chapter 1:
- Introduction:
This project “Automated Academic Course Scheduler” aims to bridge the gap of performance assessment of the conflicts in existing algorithm logics and overcome them with the performance of DB engines which is least used in any web or cloud based applications for any typical transactions.
- Algorithm:
Algorithm is the key principle for the solution of any requirement or a problem statement. It’s designed based on how the functional logic flows for to derive a solution for given input data set with the features and limitation of the technology used in the application.
- Types of Algorithm:
Though we have wide varieties of algorithm to elect, we majorly focus on 2 categories of them in this report.
- Top-Down Approach
- Bottoms-Up Approach
As per names, the solution for course scheduler application is designed starting from given input parameters or based on desired output data set.
Typically the algorithms were designed and implemented in application code, but I’ve focused to migrate them into database transactional code i.e. MS-SQL language code by following one of the bottoms up algorithmic methodology.
As DB engines’ architecture mainly focuses on fetching data present in tables, the data access rate will be robust when we pursue for a solution by using DB codes rather than transferring data from DB layer to application code layer and then initiate the solution algorithm over the same to get an output.
- Top-down vs. Bottom-up object database design
There are two approaches for designing a database principally targeted for one solution, the top-down method and the bottom-up method. These approaches perform fundamentally on different strategies.
The top-down method starts from the final output set and progresses parsing to the single entity of input data set. Essentially, you start with an expected output layout for the system and let the end-users to provide basic entity depending on how data is required by the logic flow. The business analyst will then explain the users on how data is stored in the database.
Using the top-down method necessitates that the business analyst has a meticulous knowledge and understanding of the system. The top-down method also can have inadequacies. In some cases, comprising large chunk of data for input phase, top-down design can lead to inadequate results because the analyst and end-users can miss something that is significant and is obligatory for the system due to technical grounds.
The bottom-up approach set off with the elementary entities like possible range of data input streams and evolves up to a narrow spectrum of output pattern in an ideal work flow. To begin a bottom-up design, the system analyst will scrutinize all the crossing points in the system, analysis reports, screens, and forms. The analyst will work backwards through the system to define what data should be stockpiled in the database.
To comprehend the dissimilarities between these strategies, let's deliberate some works that are bottom-up in nature. Doctors observe definite symptoms and then infer the general disease that roots the symptoms.
An example of jobs that necessitate the top-down approach include project management and engineering tasks where the overall requirements must be quantified before the detail can be understood and concluded.
For instance, an automobile manufacturer must follow a top-down approach to meet the overall specifications for the car. If a car has the requirement that it cost less than Rs.6,00,000, gets 25 kmpl and seating of five people with airbags. In order to meet these requirements the engineers must start by constructing a specification document and then drilling down to meet these requirements.
The business analyst will have no choice but to discuss with users to regulate as per their significances and as a result concludes what data should be stored in the database.
Figure 01: Top-Down Parsing data flow format
Figure 02: Bottoms-Up Parsing data flow format
- Top-Down Algorithms taken into study:
- Recursive descent parser
- Predictive parser
- Earley parser
- Bottoms-Up Algorithms feasible for this project:
- Precedence parser
- bounded context parsing
- Shift-Reduce Parser
- Conclusion
Dynamic programming is all about collating your notions in a tactic that you avoid recalculating identical logics in multiple phases. You have a main problem (the root of your tree of sub problems), and sub problems (sub trees). The sub problems classically repeat and intersect.
For example, consider your favorite instance of Fibonacci series. This is the full tree of sub problems, if we did a naive recursive call:
TOP of the tree
fib(4)
fib(3)...................... + fib(2)
fib(2)......... + fib(1) fib(1)........... + fib(0)
fib(1) + fib(0) fib(1) fib(1) fib(0)
fib(1) fib(0)
BOTTOM of the tree
In some other exceptional scenarios, this tree could be infinite in some branches, representing non-termination, and thus the bottom of the tree may be outsized. Likewise in some scenarios, you might not know what the full tree looks like ahead of time, and thus you might need a strategic plan to elect which sub problems to expose to output phase.
- Memoization, Tabulation
There are at least two main practices of dynamic programming, such as:
- Memoization - This is a laissez-faire approach: Presuming that we already have listed all sub problems with no scope on optimal evaluation order. So we typically implement a recursive call opening from the root or either hopes we will get close to the optimal evaluation order. And we ensure that the recursive call never be reiterated over a sub problem as we cache the result sets.
Example – On calculating the Fibonacci sequence such as fib(n) we don’t need to process the loop for fib(2) since we cached it.
This flow gets initiated at the root node and evaluates the sub problems from the leaves or sub trees back up towards the root.
- Tabulation – Recalling on dynamic programming strategy like a "table-filling" algorithm, this is similar to memoization but more dynamic and implicates one additional step on selecting at earlier phases of work flow for the executional order we may shadow up on.
Example: for the same Fibonacci series problem, we opt to process the sequence in this order “fib(2),fib(3),fib(4)...” and also caching every value so you can calculate the next ones more certainly.
Before concluding the algorithm strategy, the programmer studies the whole tree and then composes an algorithm to calculate the sub problems in a precise order towards the root node, commonly populating in a table.
At its most general, in "
dynamic programming" I would say the programmer studies the whole tree and then writes an algorithm that gears up by a strategy for evaluating sub problems (which optimizes any attributes as per the requirements, commonly a blend of time and space-complexity).
The work flow must get triggered anywhere in random, some particular sub problem, and perhaps adapts itself based on the results of those evaluations. In "
dynamic programming" in the most universal exercise, we typically try to cache these sub problems in various iterations to avoid revisiting them, a subtle distinction perhaps in the case of graphs. Recurrently these data structures are at their core like arrays or tables. Solutions to any sub problems can be scrapped if we don't need them any longer.
Ultimately it is important to recognize the peculiarity rather than the terminology.
- Recursive
Note that both top-down and bottom-up can be employed with recursion table-filling, however it may not be natural.
- Practical concerns
With memoization, if the tree is very deep (e.g. fib(10^6)), you will run out of stack space, since each overdue iteration must be laid on the stack where we will have 10^6 of them.
- Optimality
Either strategy may not be time-optimal if the sequence we exercise to visit sub problems is not optimal, precisely if there is more than one method to evaluate a sub problem. Memoization will typically add on the time-complexity to the space-complexity in the system.
- Advanced optimizations
While undertaking unusually dense problems, we may not have any choice but to implement in tabulation method. Also in circumstances where optimization is undeniably critical, tabulation will allow us to do optimizations which memoization would not.
Chapter 2:
- Recursive-Descent Parsing: Top-Down Approach
- Parsing
Parsing is the process of taking a string with delimiters and finding a derivation for that string of symbols in a context-free grammar. Parsing is critical task in implementing an interpreter or compiler for a programming language, since the syntax of a program contains essential information about the meaning of the program.
A parser is the module of an interpreter or compiler which performs parsing. It takes a sequence of tokens from the lexical analyzer (you know how to write one of those!), finds a derivation for the sequence of tokens, and builds a parse tree (also known as a syntax tree) representing the derivation. We have seen that parse trees are very important in figuring out the meaning of a program (or part of a program).
- Recursive Descent
Recursive descent is a simple parsing algorithm that is very easy to implement. It is a top-down parsing algorithm because it builds the parse tree from the top (the start symbol) down.
The main limitation of recursive descent parsing (and all top-down parsing algorithms in general) is that they only work on grammars with certain properties. For example, if a grammar contains any left recursion, recursive descent parsing doesn't work.
- Eliminating Left Recursion
Here's our simple expression grammar we discussed earlier:
S → E
E → T | E + T | E - T
T → F | T * F | T / F
F → a | b | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
[S, E, T, and F are nonterminal symbols, and a, b, and the digits 0-9 are terminal symbols.]
Unfortunately, this grammar is not suitable for parsing by recursive descent, because it uses left recursion. For example, consider the production
E → E + T
This production is left recursive because the nonterminal on the left hand side of the production, E, is the first symbol on the right hand side of the production.
To adapt this grammar to use with a recursive descent parser, we need to eliminate the left recursion. There is a simple technique for eliminating immediate instances of left recursion. [This technique won't handle indirect instances of left recursion.]
Given an occurrence of left-recursion:
A → A α
A → β
Note that some non-recursive production of the form A → β must exist; otherwise, we could never eliminate occurrences of the nonterminal A from our working string when constructing a derivation.
We can rewrite the rules to eliminate the left recursion as follows:
A → β A'
A' → α A'
A' → ε
So, for example,
E → E + T
E → T
becomes
E → T E'
E' → + T E'
E' → ε
Here's the entire expression grammar, with left-recursion eliminated.
E → T E'
E' → + T E'
E' → - T E'
E' → ε
T → F T'
T' → * F T'
T' → / F T'
T' → ε
F → a | b | 0 | 1 | 2 | ... | 9
- Left-Factoring
Another property required of a grammar to be suitable for top-down parsing is that the grammar is left-factored. A left-factored grammar is one where for each nonterminal, there are no two productions on that nonterminal which have a common nonempty prefix of symbols on the right-hand side of the production.
For example, here is a grammar that is not left-factored:
A → a b c
A → a b d
Both productions share the common prefix a b on the right hand side of the production.
We can left-factor the grammar by making a new nonterminal to represent the alternatives for the symbols following the common prefix:
A → a b A'
A' → c
A' → d
Our non-left-recursive expression grammar is already left-factored, so we don't need to change it.
- Recursive Descent Parsing
Once you have a non-left-recursive, left-factored grammar, recursive descent parsing is extremely easy to implement.
Each nonterminal symbol has a parsing function. The purpose of the parsing function for a nonterminal is to consume a string of terminal symbols that are "generated" by an occurrence of that nonterminal.
Terminal symbols are consumed either by directly reading them from the lexer, or by calling the parse methods of another nonterminal (or nonterminals). In general, the behavior of the parse method for a particular nonterminal is governed by what string of symbols (nonterminal and terminal) is legal for the right-hand side of productions on the nonterminal.
Typically, any parser will construct a parse tree as a side-effect of parsing a string of input symbols. The nodes of a parse tree represent the nonterminal and nonterminal symbols generated in the derivation of the input string. So, we can have the parse method for each nonterminal return a reference to a parse tree node for that particular nonterminal symbol.
- Example
Here's what a parse method for the E nonterminal in our revised expression grammar might look like:
public Symbol parseE() throws IOException {
NonterminalSymbol e = new NonterminalSymbol("E");
e.addChild(parseT());
e.addChild(parseEPrime());
return e;
}
The parseE method internally calls parseT and parseEPrime methods, because the only production on E is
E → T E'
The parseT method is similar.
The parseEPrime method is more interesting. There are three productions on E':
E' → ε
E' → + T E'
E' → - T E'
When parseEPrime is called, we should ask the lexical analyzer if we've reached the end of the input string. If so, then the epsilon production must be applied.
If the lexer is not at end-of-input, we should ask the lexical analyzer for a single terminal symbol. If we see a symbol other than + or -, then we'll again assume that the epsilon production should be applied. Otherwise, we'll add the terminal symbol to the parse node we're creating for the E' symbol, and continue by parsing the T and E' nonterminal symbols by calling their parse methods:
private Symbol parseEPrime() throws IOException {
NonterminalSymbol ePrime = new NonterminalSymbol("E'");
TerminalSymbol op = lexer.peek(); // look ahead
if (op == null) {
// end of input: epsilon production is applied
System.out.println("end of input");
} else {
if (!op.getValue().equals("+") && !op.getValue().equals("-")) {
// we saw a terminal symbol other than + or -:
// apply epsilon production
} else {
// consume the operator
lexer.get();
// saw + or -
// production is
// E' -> + T E'
// or
// E' -> - T E'
ePrime.addChild(op);
ePrime.addChild(parseT());
ePrime.addChild(parseEPrime());
}
}
return ePrime;
}
Note a few interesting things going on in parseEPrime:
The lexical analyzer is the lexer object. We're using two of its methods: peek, which asks for the next token without consuming it, and get, which asks for and consumes the next token. Both methods return a TerminalSymbol object. peek returns the null value when the end of input is reached.
Applying the epsilon production means we don't add any child symbols to the parse node being created.
The parseEPrime method can call itself recursively, because the
E' → + T E'
E' → - T E'
productions contain the symbol E' on the right hand side. That's why it's called recursive descent!
To use a recursive descent parser to parse an entire input string, simply call the parse method for the grammar's start symbol. It will return a reference to the root node of the parse tree.
Chapter 3
- Genetic Algorithms: Bottom-Up Approach
- Introduction
Genetic Algorithm (GA) is an optimized scan-based search technique based on the fundamentals of Genetics and Natural Selection. It is often used to derive an optimal set of solutions for complicated requirements which might take indefinite time to get solutions in traditional design logics.
- Introduction to Optimization
Optimization is the practice of enhancing the performance by reducing the response period. In general for a process, we have a set of inputs and a set of outputs as shown below.
Figure 3: General Process flow
Optimization refers to handling of input data set in such a way that we get the “seamless” permutation of output data set. The scale range for the term “seamless” varies for every requirement, i.e. maximizing performance or minimizing response time on one or more objective functions, by varying the input data permutations or volume or its schema.
The collection of potential solutions that inputs can take populates the search process space. In this search space, lies a constraint or a set of constraints which gives the prime solution. The aim of optimization is to discover that constraint or set of constraints in the search process space for each given requirements.
- What are Genetic Algorithms?
Genetic Algorithms (GA) are search based algorithms designed by the principles of natural selection and genetics. Genetic Algorithms are a subset of a much wider spectrum of composite computation known as Evolutionary Computation.
GAs were optimally designed and developed by John Holland and his colleagues at the University of Michigan, most notably David E. Goldberg.
In GAs, we will get a collection of potential resultant data set to the given requirement statements. These result then undergo recombination and mutation, producing new child set of results referred as offspring and this process is repeated over several generations and occasionally across adjacent generations as well. Each individual result (or a potential solution) is assigned a fitness value and the fitter result sets are given privileges to reprocess and yield more “fitter” potential solutions.
In this way we keep progressing to discover the most potential individual result set over iterations, till we reach a verge of termination constraints.
Genetic Algorithms are resourcefully randomized to provide a wide range of permutation in nature, but this progression is considerably better than random local search as they exploit and validate with distinct set of cached potential results as well.
- Advantages of GAs
GAs have various advantages such as below:
- Does not necessitate any specific derivative data.
- Has a very good parallel computing and overall scalability.
- Optimizes both common and distinct functions.
- Generates a collection of potential solutions that gets fitter in further generations naturally.
- Necessitates a very large search space but still faster in processing with huge chunk of input data sets.
- Limitations of GAs
GAs also has few limitations such as:
- GAs are not suited for derivative information based strategy.
- Fitness value is calculated recurrently which might be expensive for CPU time in some requirements.
- Despite many advantages, there are no assurances on the optimality score for the resultant data set.
- If not designed properly as per requirement statement and environment of implementation, the GA may not yield an optimal yet potential result set.
- Genetic Algorithms Motivation
Genetic Algorithms can naturally generate a potential yet optimal solutions set. So designing a genetic algorithm becomes interesting for to implement in solving optimization requirements over large data set.
The reasons why GAs are needed are as below
There are a large set of problems which are
NP-Hard. Essentially even the most powerful computing systems take a very long time to solve that problem if attempted through traditional strategies. In such a scenario, GAs proves to be an efficient approach to generate potential yet optimal solutions in a shorter response time.
Traditional calculus based techniques work by starting with a random data set and by moving along with the gradient, till the fitness score of objectivity reach its peak. This practice is optimal only for single-peaked objective requirement.
However, in most of the realistic scenarios, we have a very composite requirement referred to as landscapes comprising many peaks and valleys, causing such approaches to fail, as they certainly tend to select the immediate or local optima as per local search space population.
Figure 4: Failure of Gradient Based Methods
- Optimal yet Potential Result:
Common yet composite most requirements like the Travelling Salesperson Problem (TSP) have realistic applications like optimum path finding.
Suppose if we use a GPS router, which takes a few mins or much more time to evaluate the “optimal” path from the source to destination with traditional techniques. Delay in such a realistic applications is not adequate and thus an potential solution that delivers in optimal short time is expected.
- Genetic Algorithms Fundamentals
This section introduces the basic terminology required to understand GAs. Also, a generic structure of GAs is presented in both pseudo-code and graphical forms. The reader is advised to properly understand all the concepts introduced in this section and keep them in mind when reading other sections of this report as well.
- Basic Terminology
Let me brief though some basic terminologies which will be stated throughout this report.
- Population − A subset of all the physical permutations for a solution to the given requirement. The population for a GA is corresponding to the population for human beings except that instead of human beings, we have Potential Solutions representing human beings.
- Chromosomes − A chromosome is one such solution entity to the given requirement.
- Gene − A gene is one element position i.e. array index of a chromosome.
- Allele − It is the data value in a gene of any particular chromosome.
Figure 5: Genetic Algorithm Basic flow
- Genotype − Genotype is the machine readable population in the computation space.
- Phenotype − Phenotype is the population in the human readable format for a realistic solution space.
- Decoding and Encoding − for simple requirements, the phenotype and genotype spaces are the same else they aren’t. Decoding is a process of transforming a solution from the genotype to the phenotype space, while encoding is a vice-versa process.
- For instance, consider the binary bit Knapsack Problem. The Phenotype space consists of solutions comprising the index numbers of the items to be chosen.
However, in the genotype space it can be represented as a binary string of length n. A
0 at position x represents that
x^{th} item is picked while a
1 represents the reverse. This is a case where genotype and phenotype spaces are different.
Figure 6: Encoding and Decoding
- Fitness Function − A fitness function picks any potential solution from a population as input and produces the suitability score for that solution. Occasionally, the fitness and the objective function may be the same based on the requirements.
- Genetic Operators − these defines the genetic composition logic for the offspring such as crossover, mutation, selection, etc.
- Basic Structure
Main GA solution logic is invoked with an initial population which may be generated at random or seeded by other heuristics on selecting the parent chromosomes from this population to proceed on mating. Executing crossover and mutation operators on the parents to populate with new off-springs. And finally these off-springs replace the existing chromosomes in the population and the process repeats towards optimal result set.
Each of the following steps is covered in further segments of this report.
Figure 7: Basic Structure
A generalized pseudo-code for a GA is explained as below:
GA()
initialize population
find fitness of population
while (termination criteria is reached) do
parent selection
crossover with probability pc
mutation with probability pm
decode and fitness calculation
survivor selection
find best
return best |
- Genotype Representation
It’s a vital decision on designing a genetic algorithm for a specific requirement to elect the representation method that will be used in our solutions. It has been witnessed that inappropriate representation method may not yield potential solution in an optimal flow.
In this segment, we discuss about some commonly used representations methods for GA. On the other hand, representation is highly requirement specific and the reader might presume that another method or a mix of few representation methods mentioned here might suit the requirement in a better stream.
- Binary Representation
Modest yet most widely used representation method in GAs. In here the genotype is comprised of bit strings.
For scenarios necessitating the solution space being populated with Boolean decision variables such as
yes or no, the binary representation is natural.
For instance the 0/1 Knapsack Problem. If there are
n items, we can populate the solution’s chromosomes with a binary string of
n elements, where the
x^{th} element tells whether the item
x is picked (1) or not (0).
Figure 8: Binary Representation
While handling the integer numbers, we can populate the data as in by their binary correspondent values. Therefore mutation and crossover functions may breed undesired consequences, but can be fixed to some point of scope by integrating Grey Coding. Also a change in one bit does not have an immense consequence on the solution space.
- Real Valued Representation
While we necessitate populating the genes by continuous over discrete variables, the real valued representation is the most seeming technique. The data range of these real valued numbers is however bounded to the computer.
Figure 9: Real Valued Representation
- Integer Representation
For a discrete valued genes, we cannot bind the solution space to binary type data. Such that, if subjected to encode the four integer values it can be encoded as
{0,1,2,3}. In such cases, integer representation is appropriate.
Figure 10: Integer Representation
- Permutation Representation
In many scenarios, the solution data is defined by an order of granular elements. In such cases permutation representation is the most apt method.
A typical yet composite requirement of this representation method is the travelling salesman problem (TSP). The total distance of the travel has to be saturated to a lower point without repetition in locations. The solution to this TSP is naturally one of the permutation patterns of all the locations and thus adopting the permutation representation generates potential solutions for this problem.
Figure 11: Permutation Representation
- Population
Population is a subset of potential solutions in the present generation. It can also be termed as a set of chromosomes.
- The assortment of the population is to be handled optimally and if not may lead to premature unhealthy solution set.
- The population size should be kept saturated to optimal limit as it can cause a GA to slow down. While a lesser sized population may not be apt for populating the mating collection with healthy varieties. Therefore, an optimal population size needs to be decided by trial and error.
The population is usually defined by a two dimensional array of
– population size
X (per generation)
– chromosome size
Y
- Population Initialization
There are two primary procedures to initialize a population data set in a GA.
- Random Initialization − Populate exclusively with random solution set.
- Heuristic initialization − Populate partially with known heuristic data for the same requirement.
It has been perceived that the population should not be exclusive while using a heuristic strategy, as it can result with the solution space populated with very little diversity in data ranges. It has been cited by experts that the random data sets are the ones to drive the population to higher optimality score.
It has also been witnessed that heuristic initialization method in some cases, only affects the initial fitness scores of a population, but later the diversity of the solutions takes over the lead to optimality.
- Population Models
There are two population models extensively in use
- Steady State
In the GA design incorporated with steady state population model, we breed a couple of off-spring result sets for every iteration and they replace the weakest chromosome sets from the population. A steady state GA is also known as Incremental GA.
- Generational
In the generational model, we manipulate
‘x’ off-springs, where
X is the population array size, and then entire population set is replaced by the new one at the end of this iteration.
- Fitness Function
The fitness function ideally takes a potential solution to the requirement as input and gives us score on how “accurate” or how “stable” the solution is with respect to the problem in consideration.
Fitness score is computed repeatedly in any GA and thus it should be adequately fast or else it can be affected undesirably and make it exceptionally slow.
Often fitness and objective functions are the same in a GA for a given requirement statement. However, for more composite requirements with multi-objective and constraints, an Algorithm Designer might have to choose for implementation of a different fitness function.
A fitness function should possess the below attributes:
- Sufficiently fast to compute the fitness score.
- Provide how fit a given solution is or how fit its next generations might be.
Sometimes, direct computation might not be possible due to the complexities of the requirement. So, we do consider fitness approximation score to proceed further.
Below is a simple fitness function which just sums the profit values of the items being picked, scanning the elements from left to right till the knapsack is full.
Figure 12: Fitness Calculation
- Parent Selection
We select parents to combine or recombine their data to create off-springs for the next generation. Parent selection is very vital to the conjunction rate of the GA as healthy parent chromosomes produce next gen individuals with much better health scores.
Still, preventing an extremely fit solution from being taken often over the entire population across few generations’ increases the diversity score to higher range and is much essential for a successful GA design. This taking up of the entire population by one extremely fit solution is known as premature convergence and is an undesirable condition in a GA.
- Fitness Proportionate Selection
It is one of the most popular methods in use where every individual can be selected as a parent with high probability, proportional to its fitness. Thus, fitter individuals have a higher chance of transmitting their features to the next generation during crossover.
Consider a circular wheel that is divided into
n pies, where
n is the number of chromosome in the population. Each individual gets a portion of the circle which is relative to its fitness score.
Two implementations strategies in this method are:
- Roulette Wheel Selection
- Stochastic Universal Sampling
- Roulette Wheel Selection
In here, a fixed point is chosen on the wheel circumference as shown and the wheel is rotated. The region of the wheel which comes in front of the fixed point is chosen as the parent and repeated once more for the second parent as well.
Figure 13: Roulette Wheel Selection
It is clear that a fitter individual has a greater pie on the wheel, giving more chance of stopping in front of the fixed point when rotated.
Implementation wise, we use:
- Calculate S = the sum of a finesses.
- Generate a random number between 0 and S.
- Starting from the top of the population, keep adding the finesses to the partial sum P, till P<S.
- The individual for which P exceeds S is the chosen individual.
- Stochastic Universal Sampling (SUS)
This is similar to previous strategy; however in here we have multiple fixed points as shown below. Thus, all the parents are chosen in just one spin of the wheel. Also, such a setup encourages the highly fit individuals to be chosen at least once.
Figure 14: Stochastic Universal Sampling (SUS)
These selection methods don’t work for scenarios where the fitness scores ranges from negative values.
- Tournament Selection
In K-Way tournament selection, we select K individuals from the population at random and select the best out of these K chromosomes as a parent. The same process is repeated for selecting the next parent. This strategy is also frequently used in literature surveys as it can even work with negative fitness values.
Figure 15: Tournament Selection
- Rank Selection
This strategy is mostly used when the chromosomes in the population have very close fitness scores that lead to each individual having an almost equal share of the pie as shown below and hence each individual no matter how fit relative to each other has an approximately same probability of getting selected as a parent. This in turn leads to a loss in the selection pressure towards fitter individuals, making the GA to make poor parent selections in such scenarios.
Figure 16: Rank Selection
In this, we remove the concept of a fitness value while selecting a parent. However, every individual in the population is ranked according to their fitness. The selection of the parents depends on the rank of each individual and not the fitness. The higher ranked individuals are preferred more than the lower ranked ones.
Chromosome |
Fitness Value |
Rank |
A |
8.1 |
1 |
B |
8.0 |
4 |
C |
8.05 |
2 |
D |
7.95 |
6 |
E |
8.02 |
3 |
F |
7.99 |
5 |
Table 1: Rank Selection
- Random Selection
In this strategy we randomly select parents from the current population. There is no selection pressure towards fitter individuals and therefore this strategy is usually evaded.
- Crossover
The crossover operator function is equivalent to reproduction and biological crossover. In this more than one parent is selected and one or more off-springs are produced using the genetic material of the parents. Crossover is usually applied in a GA with a high probability –
p_{c}.
- Crossover Operators
In here the crossover operators are very generic and the GA designer can choose to implement a problem-specific crossover operator as well.
- One Point Crossover
In this a random crossover point (gene index) is selected and the tails of its two parents are swapped to get new off-springs.
Figure 17: One Point Crossover
- Multi Point Crossover
This is a generalization of the one-point crossover with wherein alternating segments of the parent chromosomes are swapped to form new off-springs.
Figure 18: Multi Point Crossover
- Uniform Crossover
Here we don’t divide the chromosome into segments; rather we treat each gene separately. On the flip of a coin for each gene it’s been decided whether or not to be included in the off-spring. We can also bias the coin to one parent, to have more genetic material in the child from that parent.
Figure 19: Uniform Crossover
- Whole Arithmetic Recombination
This is usually used for integer representations and works by taking the weighted average of the two parents by using the a formula
Child1 = α.x + (1-α).y
Obviously, if α = 0.5, then both the children will be identical as shown in the following image.
Figure 20: Whole Arithmetic Recombination
- Davis’ Order Crossover
OX1 is a permutation based crossovers with the aim of sharing alleles about relative ordering to the off-springs.
- Pick 2 random crossover points across the parent and copy the segment between them from the first parent to the first offspring.
- Then, starting from the second point of second parent, copy the lasting numbers from the second parent to the first child, wrapping around the list.
- By reversing the parent’s role, fill in for second child.
Figure 21: Davis’ Order Crossover
Other rarely used crossover strategies are such as
Partially Mapped Crossover (PMX),
Order based crossover (OX2),
Shuffle Crossover, Ring Crossover,
etc.
- Mutation
Practice of a small random tweak in the chromosome, to get a new solution. It is used to maintain high diversity score in population and is mostly applied when with a low probability –
p_{m}.
Mutation is the part of the GA which is related to the “diversity” of the search space. It has been observed that mutation is essential to the merging of the GA while crossover is not.
- Mutation Operators
Alike the crossover operators, this is not an extensive list and the GA designer can pick a mixture of these strategies or a problem-specific strategy will be more apt.
- Bit Flip Mutation
Here just pick one or more random bits and flip them. This is used for binary encoded GAs.
Figure 22: Bit Flip Mutation
- Random Resetting
It is an extension of the bit flip for the integer representation specifically. In this, a random allele from the set of acceptable values range is assigned to a randomly chosen gene in that chromosome.
- Swap Mutation
In here, we pick two genes on a chromosome at random, and swap the alleles. This is also communal in permutation based encodings.
Figure 23: Swap Mutation
- Scramble Mutation
This is also popular with permutation representations. In this, the subset of chosen genes and their alleles are either scrambled or shuffled randomly to form new solution.
Figure 24: Scramble Mutation
- Inversion Mutation
In here we start the process like in scramble mutation, but instead of shuffling the subset, we purely overturn the entire string order in the subset and copy it into new chromosome.
Figure 25: Inversion Mutation
- Survivor Selection
The Survivor selection policy regulates which chromosomes are to be kicked out and which are to be set aside for next generation. It is vital as it should ensure that the fitter chromosomes are not kicked out of the population, meanwhile diversity should also be retained at higher score in the population.
Some GAs uses Elitism. In simple terms, it means the current fittest member of the population is always transmitted to the next generation. Thus, under no scenarios can the fittest member of the current population be substituted.
The easiest policy is to kick random members out of the population, but such a tactic frequently has merging issues, therefore the following strategies are widely used.
- Age Based Selection
In here, we don’t take fitness score as a constraint. It is based on the principle that every chromosome in the population is processed for few finite generations where it is breed until a predefined verge limit; upon which it is opted out of the population despite its strong fitness score.
For instance, in here, the age is the count of generations it’s been present actively. So the oldest members of current population i.e. P4 and P7 are opted out, tailed by incrementing age by 1 for the rest of chromosomes.
Figure 26: Age Based Selection
- Fitness Based Selection
In this strategy, the newest offspring are inclined to substitute the weakest individuals in a population. By using one of strategies on parent selection operator, we opt the weakest chromosome in current population.
For instance, the children replace the least fit chromosomes P1 and P10 of this population of which P1 and P9 despite possessing the same fitness score, the choice to replace either of them is non-subjective usually.
Figure 27: Fitness Based Selection
- Termination Condition
The termination constraint is significant in prominent when a GA run will end. Typically in initial phases, the GA breeds much sooner with healthy solutions which later unexpectedly tend to get saturated on solution’s aspects possessing negligible improvements in them. So typically we should pre-define a termination constraint such that GA will be flagged for end of progression with highest possible optimality.
Commonly, we opt one of the following termination constraints
- Once the improvement quotient of a population is closure to 0 over X iterations.
- Once we are stretched to an outright number of generations.
- Once the objective function score matches the expectations.
For instance, in a GA we track on a counter variable for generation’s count for which the improvement quotient is below the threshold score for the population. Initially, we set this counter to zero. Each time we don’t generate off-springs which are better than the individuals in the population, we increment the counter.
Yet, intermediately if the fitness score of any off-springs is better, then we reset the counter to zero. The aim is to terminate when the counter reaches the encoded value.
- Models of Lifetime Adaptation
So far in this report, we have briefed on procedures corresponds to the Darwinian model of evolution such as natural selection and genetic variation through recombination and mutation. So far, only the data enclosed in the individual’s chromosome genotype can be migrated to the next generation.
Other models of lifespan variation – Lamarckian Model and Baldwinian Model also do exist. These choice of lifetime adaptation strategies are highly problem specific as witnessed by most of experts so far.
Often, we crossbreed a GA with local search so as like Memetic Algorithms. In such cases, either Lamarckian or Baldwinian Model can be opted to proceed with computation on the initial phase’s population/chromosome to start the bigtime process of opted GA.
- Lamarckian Model
The Lamarckian Model principally defines that the attributes which an individual attains in its lifespan can be shared on to its offspring. It is entitled over French biologist
Jean-Baptiste Lamarck.
Albeit, natural biology has entirely marginalized Lamarckism as we all know that only the data in the chromosome can be transferred. Conversely, from an implementation view point, its witnessed that opting for the Lamarckian model gives good potential solutions for some of the common requirements.
In here, a resident quest engine scrutinizes the vicinity to yield new attributes, and if a healthier chromosome is produced, it becomes the offspring.
- Baldwinian Model
This practice is a transitional idea titled after James Mark Baldwin (1896). In this practice, the chromosomes can possess an inclination of learning advantageous conducts. Unlike the Lamarckian approach, we don’t transfer the yielded attributes to next generation, and neither do we utterly overlook the yielded attributes like in the Darwinian strategy.
The Baldwin approach is in the mid of these two extremes, wherein the affinity of a chromosome to yield definite attributes is programmed reasonably differing from the attributes of themselves.
In here, a native search function scrutinizes the vicinity and if a healthier chromosome is discovered, it only attains the enriched fitness to the chromosome and does not revise the chromosome itself. The change in competency denotes the chromosomes competence to “attain the attribute”, though it is not handed openly to the upcoming generations.
- Effective Implementation
GA is very broad-spectrum in nature and unbiasedly smearing them to any optimization requirement wouldn’t yield healthy solutions. In this section, we describe a few points which would benefit and guide a GA designer in their effort.
- Introduce problem-specific domain knowledge
It has been perceived that the auxiliary requirement-precise sphere understanding we integrate into the GA; the healthier unbiased solutions we get. Adding requirement-precise knowledge can be done by using requirement-precise crossover or mutation tactics and with bespoke representations techniques, etc.
Figure 28: Michalewicz’s view of the EA
- Shrink Congregating
Crowding happens when a healthy chromosome gets to breed a lot, and in a few compeers, the complete population is boasted with similar chromosomes possessing analogous health. This diminishes miscellany which is a very vital constituent to guarantee the accomplishment of a potential GA design. There are copious techniques to limit flocking such as:
- Augmenting diversity by support of Mutation.
- Swapping to rank or tournament selection that possesses more pressure than fitness equivalent selection for chromosomes with analogous fitness.
- The individual’s fitness is to be condensed if the previous population possesses alike chromosomes.
- Arbitration Benefits
It has been experimentally witnessed that the healthiest solutions are yielded by randomized chromosomes as they divulge diversity to the population. The GA designer should be cautious to possess adequate quantity of randomization and miscellany in the population for the healthier results.
- Crossbreed with Native Quest Engine
Native search refers to scrutiny of the solutions in the vicinity of a yielded solution to gaze for healthier solution standards for the objective.
It possibly will be every now and then expedient to crossbreed the GA with native quest engine.
Figure 29: Crossbreed with Native Quest Engine
- Distinction of Factors and Practices
In GAs, there is no derivative solution set or a mystic recipe which suits for all requirements. Even after the preliminary GA design is organized; it grosses a lot of time and effort to research with the constraints like population magnitude, transmutation and crossover likelihood etc. to discover the ones that suits the precise requirement statement.
- Implementation Extents
Genetic Algorithms are predominantly applied for optimization requirements of several breeds, but they are habitually implied for additional application areas as well.
In this segment, we listed some of the scopes in which GA is recurrently implemented
- Optimization − Genetic Algorithms are most regularly practiced in for requirements wherein we have to exploit or diminish a given objective function value under a specified collection of controls.
- Finances − GAs are employed to distinguish different commercial prototypes like the cobweb model, game theory equilibrium resolution, asset pricing, etc.
- Neural Network − GAs are also incorporated to impart neural networks, predominantly recurring neural networks strategy.
- Parallelization − GAs also have brilliant analogous proficiencies and ascertain to be productive means in cracking definite requirements and along with it delivers a good extent for research and exploration.
- Image Processing − GAs are implemented for several Digital Image Processing (DIP) chores as well like dense pixel pattern matching and toning.
- Travelling Salesman Problem − with several lenient time frames, several warehouses and a assorted fleets, the GA plays a vital role in providing them with optimal solutions well in effective processing time.
- Scheduling Applications – the efficiency of GAs are consumed to disentangle several scheduling tasks as well, predominantly the course time tabling tasks.
- Machine Learning − as previously deliberated, inheritances grounded machine learning is a niche area in machine learning.
- Robot Trajectory − GAs have been employed to plot the track that a robot arm grosses by moving from one position to another.
- Parametric Design of Aircraft − GAs have been used to design aircrafts by constantly fluctuating the factors and sprouting a healthier resolutions.
- DNA Exploration − GAs have been incorporated to regulate the arrangement of DNA by means of spectrometric information about the mockup.
- Multimodal Optimization − GAs are evidently brilliant methodologies for multimodal optimization in which we have to discover several prime yet optimal solutions.
- Summary:
After studying several algorithm designs from top down strategy, thus determined that genetic algorithm employed in SQL language yields more performance benefits due to swiftest data access on huge volumes of input streams.
As entire algorithm progression logic code is being incorporated in solitary computing engine i.e. SQL Engine the time consumed for data transmission is much proficiently condensed and yielded more value for swiftest and alleviated yet potential solution sets.
These potential solutions are recorded silently in the process of every specific strategic phase; they are much helpful in being used as heuristics in future iterations of the same.
Classifying that genetic algorithm to be supreme operative procedure for huge data volume input streams and adding SQL engine’s performance on top of it to the competency of GA offers us the most constructive outcome for composite requirements.
- Literature references:
[1] Ramez Elmasri, Durvasula VLN Somayajulu, Shamkant B. Navathe and Shyam K. Gupta. Fundamentals of Database System. Pearson Education. May, 2007.
[2] Pang-Ning Tan, Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Pearson Education. May 2013.
[3] John Sharp, Douglas McMurtry, Andrew Oakley, Mani Subramanian, AND Hanzhong Zhang. Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence. Microsoft patterns & practices; 1 edition. September 2013
[4] David E Goldberg. Genetic algorithms in search, optimization, and machine learning. 2009.
[5] Zbigniew Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. 1992.
[6] Kalyanmoy Deb. Multi-Objective Optimization Using Evolutionary Algorithms. 2001.
[7]
https://app.pluralsight.com/library/courses/relational-database-design/table-of-contents
[8]
https://app.pluralsight.com/library/courses/sqlserver-why-physical-db-design-matters/table-of-contents