Disclaimer: This dissertation has been written by a student and is not an example of our professional work, which you can see examples of here.

Any opinions, findings, conclusions, or recommendations expressed in this dissertation are those of the authors and do not necessarily reflect the views of UKDiss.com.

Hyper-heuristics Using Reinforcement Learning for the Examination Timetabling Problem

Info: 11312 words (45 pages) Dissertation
Published: 9th Dec 2019

Reference this

Tagged: Education

Hyper-heuristics using Reinforcement Learning for the Examination Timetabling Problem

Abstract Selection hyper-heuristics as general-purpose search methods controlling a set of low level heuristics have been successfully applied to various problem domains. A key to designing an effective selection hyper-heuristic is to find the right combination of heuristic selection and move acceptance methods which are invoked successively under an iterative single-point-based search framework. The examination timetabling problem is a well-known challenging real world problem faced recurrently by many educational institutions across the world. In this study, we investigate various reinforcement learning techniques for heuristic selection embedded into a selection hyper-heuristic using simulated annealing with reheating for examination timetabling. Reinforcement learning maintains a utility score for each low level heuristic. At each iteration, a heuristic is selected based on those adaptively updated utility scores and applied to the solution in hand with the goal of improvement. All selection hyper-heuristics using different reinforcement learning schemes are tested on the examination timetabling benchmark of ITC 2007. The results show that -decay-Greedy reinforcement learning which chooses a low level heuristic with the maximum utility score with a decaying probability rate, otherwise choosing a random low level heuristic performs the best. The proposed tuned approach although does not perform as good as the state-of-the-art, it delivers a better performance than some existing hyper-heuristics.

Keyword: reinforcement learning, hyper-heuristic, exam timetabling

1 Introduction

The research involved in the search for suitable solutions for Examination timetabling problems requires a combination of practical and experimental-based techniques [1]. The main focus of current hyper-heuristic research is towards the design and adaptation of heuristic methods in automating optimisation for hard computational search problems. Over the last few decades a lot of research has been conducted on the application of hyper-heuristic techniques in solving examination timetabling problems. A number of review papers highlighting these efforts have have been published [2], [3]. Experimental results generated by various different techniques in the literature have been reported, focussing on success measures including generality of the solver and the time cost [2].

For a given examination timetabling problem, whether a technique can generate a feasible and workable solution within a practical time limit is an important factor in assessing the success level of the technique employed. An examination scheduling track, based on the post-enrolment examination timetabling problem was introduced in the second International Timetabling Competition (ITC2007) [4]. This track introduced 12 real-world datasets which are drawn from anonymised data from several institutions worldwide and is used to test our designed solver in this paper.

The solving process of the examination timetabling problem normally commonly conforms to two main stages; the initial construction stage and the subsequent improvement stage. In this first stage, a feasible solution is constructed. Along with feasibility, some focus is given to the relative quality of the solution in terms of satisfying the problem, to benefit the improvement stage which is more computationally time-consuming. Then, in the second stage, the solution undergoes a series of improvement transformations until the process terminates. In this case this will occur when the time limit is reached. The improvement process outlined in this paper describes a hyper-heuristic search methodology used to search for higher quality solutions when given specific objectives [6].

In the literature, research on the application of hyper-heuristic techniques to solve examination timetabling problems has proved the potential of their effectiveness in this area. Certain research has focussed on combining reinforcement learning with hyper-heuristics, yielding competitive results which suggest that reinforcement learning can be applied to improve the performance of hyper-heuristics in terms of solution quality. Also, reinforcement learning has also been effective in solving resource allocation problems. In this work, rather than applying reinforcement learning directly to solving examination timetabling problems, reinforcement learning algorithms are combined with hyper-heuristics within the process of improvement. The learning process when using reinforcement learning on solving examination timetabling problems directly could be rather time and resource consuming due to the complexity of the problem. This can also be considered as an interesting approach to explore how the generality of the solution process can be improved using reinforcement learning. An assessment can be made on how the combination of reinforcement learning algorithms with hyper-heuristics may widen the application of solvers to other types of resource allocation problems.

In the approach described here, the initial feasible solution is created using squeaky wheel construction, with hyper-heuristics employed for the optimization and improvement phase. The hyper-heuristic methods consists of two levels; one is the higher heuristic search from the low heuristics pool, and the other is the low-level heuristic search process within the problem domain. For the higher-level heuristics, a Simulated Annealing (SA) method is combined with different reinforcement learning methods to enable intelligence-based selection of the low level heuristics. In each iteration information is gathered to establish and record the current total performance of low level heuristics. The selection methods inspired by the reinforcement learning are three different greedy search methods and the softmax method, which are compared with a simple random method to analysis performance difference. For the low level heuristics, we designed six small-move low heuristics, three large-move low heuristics and two directed low heuristics.

The remainder of the paper is as follows: Section 2 briefly describes the background of this paper, the examination timetabling problem and the hyper-heuristic method. Section 3 describes the Squeaky Wheel constructor and our SA-RL hyper-heuristic method. Section 4 describes the experimental environment and time parameters used during experimentation. Section 5 presents and discusses the results and section 6 concludes the paper with a brief discussion on the effectiveness of the technique and potential future research areas.

2 Background

2.1 The Examination Timetabling Problem

Examination timetabling is one of the academic timetabling problems and has been proven to be NP-hard [7]. Solving an examination timetabling problem requires allocating the given exams with suitable timeslots and rooms under both hard and soft constraints. Among these constraints, all hard constraints must be satisfied in order to achieve a feasible solution, with the satisfaction of the soft constraints used to determine the quality of a given solution.  Therefore, the solving process consists of two main parts: first, the generation of a feasible solution that meets all the hard constraints; second, minimization of violations to the soft constraints of the current solution while maintaining feasibility. What’s more, the quality of (feasible) timetables can be calculated numerically based on the satisfaction level of the soft constraints which can be used to help in the evaluation of the corresponding algorithms. In the examination timetabling problems, before the scheduling process all the student enrolment data is generally known, which allows an accurate use of the available resources during the examination period.

In real-world examination timetabling problems, the type and parameter values of the hard and soft constraints varies between institutions. In particular, the soft constraints usually reflect an institutions’ different preferences. As interest has increased on the application of research to examination timetabling problems, benchmarks which identify the common soft and hard constraints from real world institutions using various standard defined measures have emerged. These are used to advance the field of research in solving Examination Timetabling problems in providing a way to allow meaningful scientific comparison and the public exchange of research achievements and expertise in the domain.

Carter et al. introduced a set of 13 benchmark examination datasets in 1996 [5], drawn from three Canadian high schools, five Canadian universities, one American university, one British university and one university in Saudi Arabia. These datasets have been widely tested and used in examination timetabling research [2]. These datasets were supplemented by a series of new datasets, drawn from anonymised data provided by several institutions worldwide for the 2007 International Timetabling Competition (ITC2007).

2.1.1 Related work in examination timetabling problems

The research on solving the examination timetabling problems involves a variety of different methodologies.   Combining the use of Memetic algorithms with hill climbing method was proposed to solve the timetabling problem by Ender Ozcan and Alpay Alkan in 2007 [9], which offered an interesting approach to evolutionary timetabling and experimental results which were initially promising. In [10] and [11], large neighbourhood search techniques were introduced to solve the problem. One such approach was based on an improvement graph construction of solution adjacency and identified improvement moves for timetabling by solving negative cost subset-disjoint graph cycles using a modified shortest path label-correcting algorithm [10]. Another investigated a constructive local search approach for examination timetabling, employing an adaptive decomposition strategy [11]. In 2010, Genetic algorithms [8] were utilized to solve a real world timetabling problems in Nigerian Universities using a hierarchy of constraints. This was introduced in conjunction with a new real-world examination timetabling dataset at the University of Agriculture, Abeokuta Nigeria. Tabu search as a Meta heuristic algorithm was combined with large neighbourhood search for solving the examination timetabling problem in [12]. Here the capacitated examination timetabling problem is considered as a partitioning problem and improvement moves are kept in a Tabu list for a certain number of iterations.

Two Ant algorithm methods were proposed to solve the examination timetabling problems in [13]. The timetabling problems were formulated as combinatorial optimization problems, as ant algorithms had been demonstrated to be a powerful solution approach to such problems. The proposed ant algorithm methods were tested on the Toronto benchmark and the experimental results turned out to be competitive as compared to the literature.

The use of simulated annealing in the solution of the multi-objective examination timetabling problem was described in [14]. The algorithm proposed was based on graph theory, with the application of three variants of the proposed simulated annealing method.

An extended Great deluge algorithm was employed in a two-phased approach in solving the Examination Timetabling problem data sets introduced as part of the ITC2007 Competition [15]. The approach attempts to exploit the inherent advantages with this Extended Great Deluge technique in escaping from local optima while also maintaining a relatively simple set of neighbourhood moves.  Also, Variable neighbourhood techniques in [16] and multi-objective approaches in [17] were also implemented to solving examination timetabling problems.

Late Acceptance Hyper-heuristics were introduced by Burke and Bykov [18]. Traditionally, the approach in hyper-heuristics was to compare the current solution with the solution immediately preceding within the neighbourhood search process. In late acceptance, the current solution is compared with what was the current solution a number of iterations previously. Late acceptance techniques are able to produce competitive results in a short timeframe. Reinforcement Learning techniques are used to influence heuristic selection for hyper-heuristics. A memory log of heuristic actions is kept during execution, with successful actions being rewarded and unsuccessful actions being punished. With this log, successful actions are chosen more often and unsuccessful actions are chosen less often across the search space.

Other various methodologies which have produced relatively promising experimental results on solving the examination timetabling problems are Case-based reasoning [19], Fuzzy approach [20] and Neural networks [21].

2.2 Hyper-heuristics

The use of Hyper-heuristics combines a set of approaches that share the common goal of automating the design and adaptation of heuristic methods in order to solve hard computational search problems [23]. A heuristic refers to a whole search algorithm or used to refer to a particular decision process sitting within some repetitive control structure, while meta-heuristic refers to an overall control strategy exploring a solution search-space [23]. Unlike meta-heuristics, hyper-heuristics search in a space of low level heuristics [24].

Specifically, a hyper-heuristic can be seen as a process which, when given a particular problem instance and a number of low-level heuristics, manages the selection of which low level heuristic to apply at any given time, until a stopping condition is met [23]. The hyper-heuristic operates at a higher level of abstraction without knowledge of the domain in which it operates. It only has access to a set of low level heuristics [24], simple local search operators or domain dependent heuristics [24], while the higher level strategy (High Heuristics, for selecting or generating heuristics) can either be a heuristic, a sophisticated knowledge-based technique, or a learning mechanism [22].

A variety of hyper-heuristic categories have been proposed in the literature based on different rules. When considering the nature of the heuristic search space, the hyper-heuristic can be split into two groupings: 1) heuristic selection, heuristics choose heuristics process, the hyper-heuristic uses a set of known domain-dependent low level heuristics; 2) heuristic generation, heuristics generate heuristics process, the hyper-heuristic evolves new low level heuristics by making use of the components of the existing set. Meanwhile, the hyper-heuristic can be categorized into three types considering the source of feedback during learning: 1) Online learning hyper-heuristic; learning while solving a given instance of a problem, 2) Offline learning hyper-heuristic; learning from a set of training instances, a method that would generalize to unseen instances; 3) No-learning hyper-heuristic; using no feedback from the search process [22] & [23].

Both the traditional single-point based search hyper-heuristic, where a single low level heuristic is selected at a time and applied to a single working solution, and cooperative hyper-heuristic framework which allows the possibility of having parallel execution of multiple low level heuristics that can cooperate by sharing many different working solutions to initiate the low level heuristics are sufficiently researched in the literature [24]. The latter enables the strengths of one low level heuristic to compensate for the weaknesses of another [25]. The traditional single-point based search hyper-heuristic framework relies on two key components; heuristic selection method and move acceptance criteria. Operating on a single solution, low-level heuristics are repeatedly selected and applied and a decision made as to whether to accept the move until some termination criteria is met [25].

2.2.1 Related work in Hyper-heuristics

Graph colouring hyper-heuristics were investigated as a new constructive hyper-heuristic method to solve the examination timetabling problems in [26]. The author utilized the hierarchical hybridizations of four low level graph colouring heuristics which including largest degree, saturation degree, largest coloured degree and largest enrolment to produce four ordered lists. The sequence of exams selection for scheduling is based on the generated lists. Furthermore, a roulette wheel selection mechanism is included in the algorithm to improve the effectiveness of timeslot selection, aiming at probabilistically selecting an appropriate timeslot for the chosen exam. The proposed approach was tested upon the Carter benchmarks as well as the ITC2007 benchmarks. The experimental results proved that the graph colouring constructive hyper-heuristic is capable of producing good results and outperforming other approaches on various benchmark instances.

Inspired by the musical improvisation process, the Harmony Search Algorithm (HSA) is a relatively new metaheuristic algorithm and is used within the Harmony Search-based Hyper-heuristic (HSHH) for solving examination timetabling problems. The Harmony Search algorithm will operate at a high level of abstraction which intelligently evolves a series of low-level heuristics in the improvement process. Each low-level heuristic equates to a move and swap strategy. The authors tested the proposed method using ITC-2007 benchmark datasets, with 12 different datasets of various complexity and size. The proposed method produced strong competitively comparable results.

Monte Carlo based hyper heuristics are designed and implemented to solve the examination problems in [28].  Simulated annealing involving a reheating scheme was introduced to this Monte Carlo based hyper heuristic. However, this method compared poorly as compared to other hyper heuristic methods. It was believed that the reason for this is due to fundamental weaknesses such as the choice of appropriate rewarding mechanism and the evaluation of the utility values used for heuristic selection. The conclusions suggested that the use of a choice function rather than incorporating Simulated Annealing might improve the method itself.

E Burke brings forward the concept of the reinforcement learning process to hyper heuristics in [29]. In this paper, Hyper-heuristics methodologies were identified to search the heuristic selection space and use selected low level heuristics to search the problem domain. According to the general definition, their proposed method is an   iterative hyper-heuristic framework formed of a single candidate solution and multiple perturbative low level heuristics. Basically, two parts including the heuristic selection and the move acceptance with certain termination criteria formed this algorithm. Inspired by related work in this area, one of the main focuses of this study is to analyse the working processes of learning heuristic selection within the automatic search for examination timetabling solutions..

3 RL-SA based hyper heuristic Algorithm

The hyper-heuristic construction for this research consists of two stages; first an initial construction phase which generates a feasible solution, secondly a further improvement phase to ultimately achieve a better solution. In the construction stage, we use squeaky wheel construction while the optimization stage is based on the Simulated Annealing process, combining reinforcement learning with the use of hyper-heuristics.

3.1 Squeaky Wheel Construction

Squeaky Wheel (adaptive) construction [30] is an iterative construction process, building an initial schedule by placing one exam at a time, in the order determined by a weighted sequence. There are a number of different methods for determining the initial order of the weighted list. The technique presented here calculates the initial ordering based on examination size and the number of conflicts. Each exam is assigned to the first available time and room combination, where possible, ensuring that a feasible solution is maintained while minimizing soft constraint violations. If an exam cannot be scheduled in the current iteration, it is left unscheduled and the constructor moves onto the next exam. When an exam is scheduled a weighting based upon its current penalty, as defined by the various soft constraint violations, is added to the stored weighting in the weighted list. If an exam cannot be scheduled a suitably large weighting is used instead. Once an attempt has been made to schedule all exams, the weighted list is re-sorted and subsequently those exams with the highest weighted value (or most difficult exams) are first to be scheduled on the next iteration or ”run” of the constructor. The weightings held in the list evolve over the duration of the entire construction process.

Algorithm 1: Squeaky Wheel
Read in the problem file into memory and build conflict and suitability matrices;
Calculate an initial weighting based on pre-defined criteria;
WeightedList is used for store the scheduling ordering of all the exams
1. while stopping criteria not met do
2. foreach exam ei in weightedList do
3.  for all suitable timeslots ti of ei do
4.   if CanSchedule(ei, ti) then
5.    best penalty and store best (bestti); //minimize violation to soft constraints
6.   end
7.  end
8.  if multipleBest found then
9.   Schedule(ei, randomBestti) and store associated weighting in weightedList;
10.  end

11.  else if bestTi found then

12.   Schedule (ei, bestti) and store associated weighting in weightedList;
13.  end
14.  else
15.   Leave exam unscheduled and add large weighting in weightedList;
16.  end
17. end
18. Sort(weightedList); //leave hard-to-schedule exam to later
19. end

3.2 RL-SA based hyper-heuristic Optimization

The hyper-heuristic consists of three main parts; the move acceptance Simulated Annealing algorithm, heuristic selection method Reinforcement Learning algorithms and a set of various low level heuristics. The selection method will select low heuristics to optimize the current solution and the selected low heuristic will propose a move to optimize the current solution. The move acceptance is the reference for the algorithm in deciding whether to accept this move or not. For each low heuristic, a utility value is used to represent its performance and is updated dynamically. The utility update methods are also based on reinforcement learning algorithms.

Input – n: number of heuristics, u: array holding utility value for each heuristic, totalTime, 0: initial temperature
1. // initialisation
2. Generate a random complete solution Scurrent ;
3. Initialise utility values;
4. fcurrent = f0 = quality( Scurrent );
5. startTime = t = time(); level = fcurrent
6. ← 0;sinceImp=0; numNonImprovingReheat=0;maxReheatTimes,=5;
7. bool bImprovementFound=false;
8. // main loop executes until total running time allowed is exceeded
9. while ( t < totalTime ) {
10. // heuristic selection
11. i = selectHeuristic( u ); // select a heuristic using the utility values
12. Stemp = applyHeuristic( i );
13. ftemp = quality( Stemp );
14. t = time() – startTime;
15. // move acceptance
16. if (ftemp < fcurrent ) then {
17.     ui = reward( ui ); // improving move
18.     Scurrent = Stemp;}
19. else {
20.     ui = punish( ui ); // worsening move}
21. Δ = ftemp − fcurrent;
22. ← ∈ [0,1];
23. if ( Δ < 0 | | < (Δ, ) ) then{// accept if non-worsening or with the Boltzmann probability of P
24.     if (Δ < 0) {Scurrent = Stemp; }else sinceImp++;}
25.     if (sinceImp==reheatFrequency) {sinceImp = 0;  //reset
26.         if(!bImprovementFound) numNonImprobingReheat++;
27.         else bImprovementFound=false;
28.         if(numNonImprobingReheat>maxReheatTimes) {
29.             numNonImprovingReheat = 0;
30.             temperture = (initialTemperature – temperture) / 2.0;
31.             bestEval = (long)Math.Round(bestEval * 1.1);}}
32.     Temp=Temp*0.9;// decrease temperature according to LundeAndMees rule
33. UNTIL( terminate-when-the-given-MAX-CPU-time-is-exceeded);
34. return Scurrent;}}

3.2.1 Simulated Annealing

Simulated Annealing (SA) is a global stochastic optimization technique that has been widely used in several types of combinatorial optimization problems ([32], [33], [34], [35]). It is a variant of local search which allows uphill moves to be accepted in a controlled manner. The typical SA algorithm accepts a new solution if its cost is lower than the cost of the current solution. However, even if the cost of the new solution is greater, there is a probability that this solution will also be accepted. This acceptance criteria, along with a reheating scheme, is particularly effective in helping the improvement process in escaping from local optima.

For the hyper heuristic, simulated annealing is used for the move acceptance. Therefore, the decision as to whether to accept any new solution depends on whether it improves on the current solution or the acceptance random probability is greater than the Boltzman probability. The current “temperature” of the improvement process is reduced for each iteration of the search process and directly influences the acceptance probability. The reheating scheme is invoked when the iteration count reaches the given reheat criteria, at which point the temperature will be increased as opposed to reduced.

3.2.2 Reinforcement Learning

Reinforcement Learning (RL) is a general term for a set of widely used approaches which provide a learning mechanism in determining how to behave against certain outcomes, or “how to map situations to actions” through trial-and-error interactions [31]. The proposed heuristic selection method involved the design of four different selection methods as well as four different low heuristic utility reward/punish methods inspired by reinforcement learning algorithms. Selection methods

(0) Equal Sequence

In this selection method, each low heuristic will be selected in sequence to ensure each will be applied in equal measure. This is inspired by the Maximize Explore theory used within reinforcement learning.

(1) Greedy

In this selection method, the process checks whether all the low heuristics have been selected at least once. If not, a random low heuristic is selected, with this process continuing until all low heuristics have been selected. From this point the low heuristic with the highest current utility value will be selected. Where there are more than one low heuristic selected, one is chosen randomly from the contenders.

(2) -Greedy

In this selection method, random selection is employed until all low heuristic has been selected at least once. After that, if the random probability is smaller than greedy probability , low heuristics are selected at random. Otherwise, the low heuristic with the highest utility value is selected. In determining the most effective value for the greedy probability , experimental tests were performed with  set to 0.1 as well as  set to 0.01.

(3) -Decay-Greedy

Similar to the previous method, although the value of the greedy probability  is dynamic, decreasing in value as the optimization process continues. The rational for this is based on the principle that as the hyper heuristic algorithm learns, the random low heuristic will become less necessary. The algorithm for this selection method is outlined in detail below.

Algorithm 3: e-decay-greedy
input: iteration: current iteration times;llhList: arraylist of all the low heuristics
1. int index;//index of low heuristic
2. double edecay = 1 / (System.Math.Sqrt(iteration));
3. Random randDouble = new Random();
4. if (randDouble.NextDouble() < edecay || !checkSelected(llhList)){
5.     //under edecay probablity, do random select
6.     Random randInt = new Random();
7.     index = randInt.Next(0, llhList.Count);}
8.  else{
9.      index = llhMaxUtilityIndex(llhList);
10.     index = randomDuplicateSelectionSum(index, llhList);}
11. //select one of the highest utility heuristic
12.  return index;}

(4) softmax selection

Unlike the previously described selection methods, this calculates the softmax probability value for each low heuristic and uses this probability value for selecting among the low heuristics. Utility update methods

Within the process of updating the utility values, a positive “reward” (increment) is given to a low heuristic if it was effective in generating a better solution, otherwise a “punish” value is applied. For all of the utility update methods, the lower bound of the utility value is set to 0, upper bound set to 40 and initial value of 20. These values were reached as a result of trial experimentation.

(5) Sum utility

In this method the reward value is +1 and the punish value is -1.

(6) Average utility value/Monte Carlo

In this method the reward value is +1 and the punish value is -1. The sum utility is then divided by the iteration count to set as an average.

(7) Discount sum

This method is similar to the first utility update method, although the positive reward and negative punish values are dynamic. The positive reward value is calculated as (+1)*(0.9^ (iteration times)), with the negative punish value calculated as (-1)*(0.9^ (iteration times)).

Algorithm 4: Discount Sum Utility method
Input: bool flag: reward or punish, int index, llhList: heuristics arrylist, SelectedTimes, discountFactor=0.9.
1. for (int i = 0; i < llhList.Count; i++){
2. if (llhList[i].LlhId == index){
3.     if(flag) llhList[i].utility+= (+1) * discountFactorR^selectedTimes;
4.     else llhList[i].utility+= (-1) * discountFactorR^selectedTimes;}}

(8) Temporal difference

The theory of this utility update method is that it will estimate the future possible reward based the current performance of each low heuristics. First, this method will calculate the sum utility with the reward and punish values (again set as +1 and -1 respectively). Secondly, an estimate reward value is calculated. The selected probability of the current low heuristic in the next step is 1.0 / (number of low heuristics) and the reward Probability (the probability it might make an improvement in the next iteration if selected) is 0.5 (50%). Therefore the estimate reward is (selected Probability)*(reward Probability). This estimate reward is then added to the current calculated sum utility as the low heuristic’s final utility value. The step size value has been experimentally tested between 1 and 5 to determine the most effective setting.

3.3.3 Low heuristics design

For this work, 19 low heuristics were implemented, grouped into four main types of heuristic and described below. Most of the above low-level heuristics are either 1-opt or 2-opt and there is also a mixture of some randomness and deterministic selection of exams and slots.  We purposely test low-level heuristics with simple moves rather than low-level heuristic with intelligence and complex moves because we want to make  sure  that  the  hyper-heuristic  can  recognise  good  moves  and  make  an  intelligent  decision  based  on  these  simple  moves.  Furthermore, we want to make  the  problem-domain  knowledge  heuristics  easy  to  implement  and  the  hyper-heuristic more generalised. Small perturbative moves

Table 1: small heuristics
H1 (random period assignment):
Select a Random Exam in Period P1 with Room assignment R1.

Assign Exam to Random Period P2 (if feasible), do not change Room assignment.

H2 (random room assignment):
Select a Random Exam in Period P1 with Room assignment R1.

Move the Exam to a feasible Random Room R2 – do not change the Period.

H3 (random assignment):
Select a Random Exam in Period P1 with Room assignment R1.

Assign Exam to Random Period P2 (if feasible) and Random Room R2 which can accommodate the exam (if feasible – exclusive constraint).

H4 (random period swap):
Select 2 Random Exams, one in Period P1 with Room assignment R1 and the other in Period P2 (≠P1) with Room assignment R2.

Swap them both to the others Periods (if both feasible) – no swapping of Rooms.

H5 (random room swap):
Select 1 Random Exam with Room assignment R1 (ignore the period – ensure not exclusive).

Select another Random Exam with similar capacity requirement assigned to Room R2 (≠R1).

Swap them both to the others Room (if both feasible, e.g. both not exclusive) – no swapping of Periods.

 H6 (random exam swap):
Select 1 Random Exam with Room assignment R1 in Period P1.

Select another Random Exam Room assignment R2 in Period P2, Swap them both to the others Room (if both feasible, e.g. both not exclusive)  and swapping their  Periods. Large perturbative moves


Table 2: large heuristics
H7 (random group of exams swap based on period):
Select two Random Periods P1 and P2 and swap ALL exams between them.
H8 (random group of exams move based on period):
Select one Period P1 and move each exam assigned to that period to a new feasible period, one by one (APPLY H1).
H7_2 (random group of exams swap based on room):
Select two Random Room R1 and R2 and swap ALL exams between them.
H8_2 (random group of exams move based on period):
Select one Room R1 and move each exam assigned to that room to a new feasible room, one by one (APPLY H2).
H7_3 (random group of exams swap based on timeslot):
Select two Random Timeslot T1 and T2 and swap ALL exams between them.
H8_3 (random group of exams move based on period):
Select one Timeslot T1 and move each exam assigned to that timeslot to a new feasible timeslot, one by one (APPLY H3).

Note: Timeslot in the above table refers to the combination of period and room. Very large moves

Table 3: shuffle heuristics
H9 (random group of exams shuffle based on period):
Shuffle all periods at random. Loop over all Periods i=P_1..P_K; H6(i, Random(i+1,K)).
H9_2 (random group of exams shuffle based on room):
Shuffle all rooms at random. Loop over all Rooms i=R_1..R_K; H6(i, Random(i+1,K)).
H9_3 (random group of exams shuffle based on timeslot):
Shuffle all timeslots at random. Loop over all Timeslots i=T_1..T_K; H6(i, Random(i+1,K)). Directed moves

Table 4: directed heuristics
H10 (ranked exams group move – consider both hard and soft constraint violations):
Generate a weighted-List based on all hard and soft constraint violations for exams.

For the top 10 exams in the list, re-allocate them to a random new timeslot.

The new timeslot is selected from the suitabletimeslot-List.

Update weighted-List whenever a move is made.

H10_2 (ranked exams group swap – consider both hard and soft constraint violations):
Generate a weighted-List based on all hard and soft constraint violations for exams.

From the top 10 exams in the list, select two random exams, swap their timeslot.

Update weighted-List whenever a  move is made

H11 (ranked exams group move – consider only soft constraint violations):
Generate a softconstraint-List based on all soft violations for exams.

For the top 10 exams in the list, re-allocate them to a random new timeslot.

The new timeslot is selected from the suitabletimeslot-List.

Update softconstraint-List whenever a move is made.

H11_2 (ranked exams group swap – consider only soft constraint violations):
Generate a softconstraint-List based on all soft constraint violations for exams.

From the top 10 exams in the list, select two random exams, swap their timeslot.

Update softconstraint-List whenever a move is made.

4 Experimental Environment

The algorithm was implemented and tested on a PC with a 2.9 GHz Intel Core i7processor, 8GB RAM and macOS 10.12.05. The program was coded in C# targeting the .NET Framework 4.5. For each problem set, the program was executed for ten iterations, with a 270 second time limit per iteration determined by a benchmarking application released by the competition organizers. During initial experimentation it was found that allowing adaptive construction to execute for approximately 10% of the total execution time provided the best results.

5 Results and Analysis

As described above, four different selection methods and four different utility update methods were designed and implemented. Note that for the exam timetabling solution, the lower the score achieved the better the solution quality as it represents the total violations of hard and soft constraints. The least amount and penalty of violations, the higher quality the resultant solution.

5.1 Benchmark data sets

The algorithm was tested with the 2007 International Timetabling Competition (ITC2007) benchmark data sets. These were chosen due to the fact they are based on real examination data problem sets with a richer set of constraint considerations, to a certain extent helping to determine how this approach may assist in real-world similar problems.

Table 5: benchmark attributes


Exams Students


Periods Rooms Conflict Density Period Hard Constraints Room Hard Constraints
Exam 1 607 7891 54 7 5.05% 12 0
Exam 2 870 12743 40 49 1.17% 12 2
Exam 3 934 16439 36 48 2.62% 170 15
Exam 4 273 5045 21 1 15.00% 40 0
Exam 5 1018 9253 42 3 0.87% 27 0
Exam 6 242 7909 16 8 6.16% 23 0
Exam 7 1096 14676 80 15 1.93% 28 0
Exam 8 598 7718 80 8 4.55% 20 1
Exam 9 169 655 25 3 7.84% 10 0
Exam 10 214 1577 32 48 4.97% 58 0
Exam 11 934 16439 26 40 2.62% 170 15
Exam 12 78 1653 12 50 18.45% 9 7

Table 5 lists the main characteristics for each of the examination datasets provided by the organizers of ITC2007. The conflict density is a measure of the number of examinations that are in conflict due to student enrolments, defining how tightly the problem is constrained by student module choice. It is initially observed that the conflict density for most of the datasets is quite low, which is reflective of the amount of choice available to students within a modern curriculum, with a large variation in course or subject choices between each student. The measure of problem size, based on the number of exams and students, varies across the datasets. The largest exam dataset could be argued to be either Exam 3/Exam 11 or Exam 7 and the smallest to be either Exam 9 or Exam 12. The amount of periods and rooms available will also have a measurable effect on the difficulty of constructing a feasible solution. Exam 3 and Exam 11 are almost identical, however Exam 11 has a much smaller set of period resources available. The differences between Exam 3 and Exam 11 reflect a ”real-world” situation where an examination session has been shortened to minimize space and staff costs, while keeping all other existing constraints where possible.

The following tests focus on datasets 7, 9, 11 and 12 as they include the biggest and smallest data size as well as conflict density and provide a reasonable variation for testing purposes. The tests on the four datasets were used to guide the design of the final hyper heuristic reheating scheme, selection method and the utility value update method.

5.2 Simulated Annealing operator tests

For the simulated annealing process, four groups of experiments were carried out to define the best settings for reheating frequency and to determine whether to combine the reheating scheme with the reset of the low heuristics utility value on reaching the maximum reheating count (which is 5).

5.2.1 Reheating frequency

The reheating frequency is tested at values 1000 and 10000 on datasets 7, 9, 11 and 12 with ten runs for each dataset. Here, a simple random selection method and sum utility update method is used, and the utility value is not reset when the maximum reheating count is reached.

Table 6: reheating  frequency tests
reheat frequency = 1000 reheat frequency = 10000
Dataset 7 8631 8497 8858 8805
Dataset 9 1347 1278 1352 1300
Dataset 11 35782 35266 35504 35611
Dataset 12 5985 5965 6034 5911

The results presented in table 6 above suggest that the reheat frequency of 1000 is generally better. Therefore, in the following test the reheat frequency is set as 1000.

5.2.2 Reset low heuristics value

The utility values of the low heuristics can become similar in value after a certain amount of iterations, which can adversely affect the selection methods and render the process ineffective. We test the approaches of both resetting and not resetting the utility value of the low heuristics when the maximum reheating point has been reached. This uses a simple -greedy selection method and sum utility update method.

Table 7: reset low heuristics utility value tests
reset No reset
Dataset 7 7530 6721 7890 7528
Dataset 9 1340 1320 1520 1506
Dataset 11 35733 35519 35673 35375
Dataset 12 5995 5955 6015 6165

The results presented in table 7 above suggest that generally the performance is improved by resetting the utility value of all low heuristics on reaching the maximum reheating iteration count. Therefore, this approach is adopted in the selection method tests outlined in the next section.

5.3 Selection Method test

As described above, four different selection methods were designed, some with differing operators. Tests were performed on all the selection methods with different operator values on datasets 7, 9, 11 and 12 with ten runs per dataset. These tests use the sum utility update method. The best results for each selection method are presented in the table below.

Table 8: different selection methods comparison
equal sequence greedy e-greedy,




Decay greedy Softmax,




Dataset 7 8644 7792 8381 7953 6501 8856 8912
Dataset 9 1372 1303 1343 1306 1288 1276 1396
Dataset 11 35840 35295 35038 35281 34054 35832 35441
Dataset 12 6065 5965 5965 5965 5965 6065 5965

In this test, for the -greedy two values, 0.1 and 0.01 are tested, with the setting 0.1 beating 0.01 in most cases. For the softmax selection method, the setting of temperature controls the balance of exploration and exploitation. The test temperature values of 0.1 and 1 are chosen based on the literature, with the results suggesting setting this value to 0.1 on average performs better, although the advantage is not great. Overall, the decay greedy method has the lowest best result over the chosen datasets.

5.4 Utility update method

This test uses the decay greedy selection method combined with several different utility update methods. For this test, the temporal difference utility update method uses an estimation step-size setting value of 1 in order to establishing effectiveness of this approach.

Table 9: different utility value update method comparison
  Sum Average Discount Temporal Difference
Best Best Best Best
Dataset 7 6501 8824 6721 9233
Dataset 9 1288 1370 1258 1423
Dataset 11 34954 35529 34390 34399
Dataset 12 6065 6065 5883 5916

It is clear from the results that the discount sum utility update method performs best over the chosen datasets.

5.5 Comparison with the published literature

Having chosen the best selection method (decay greedy) and the best utility update method (discount sum utility), experimentation is carried out over all 12 datasets, with a best evaluation value recorded over 10 times for each. A comparison over the current best techniques in the literature is presented in table 10 below.

Table 10: Best result comparison to the literature
  RL-SA-HH Other Techniques
BEST DSO Muller ITC2007 Adaptive Liner Combination Graph Coloring Multistage algorithm
dataset 1 6059 5186 4370 5231 6234 5814
dataset 2 863 405 400 433 395 1062
dataset 3 14027 9399 10049 9265 13302 14179
dataset 4 20031 19031 18141 17787 17940 20207
dataset 5 3637 3117 2988 3083 3900 3986
dataset 6 26910 26055 26585 26060 27000 27755
dataset 7 6572 3997 4213 10712 6214 6885
dataset 8 10485 7303 7742 12713 8552 10449
dataset 9 1267 1048 1030 1111 N/A N/A
dataset 10 14357 14789 16682 14825 N/A N/A
dataset 11 34054 30311 34129 28891 N/A N/A
dataset 12 5509 5369 5535 6181 N/A N/A

The results obtained show that the proposed method is reasonably competitive, and in fact has achieved a best result for dataset 10, and is approaching the best for several others. The approach taken can therefore be argued as promising and worth further research and experimentation.

5.6 Comparison with other hyper-heuristic methods

It is interesting to examine the experimental results of other hyper-heuristic techniques which have been applied to solving the examination timetabling problem using the ITC 2007 benchmarks. A comparison with the approach undertaken in this work is useful to determine whether it improves on the current work using hyper-heuristics.

Table 11: Comparison to other Hyper-heuristics methods in the literature
  RL-SA-HH Other Techniques
dataset 1 6059 11823 6234 8559 6235
dataset 2 863 976 395 830 2974
dataset 3 14027 26670 13002 11576 15832
dataset 4 20031 //// 17940 21901 35106
dataset 5 3637 6772 3900 3969 4873
dataset 6 26910 30980 27000 28340 31756
dataset 7 6572 11762 6214 8167 11562
dataset 8 10485 16286 8552 12658 20994
dataset 9 1267
dataset 10 14357
dataset 11 34054
dataset 12 5509

The results in table 11 above, show that this approach has achieved best results for three of the data sets as compared to the other hyper heuristics in solving the examination timetabling problem for the benchmark ITC2007. Indeed, for the other datasets the results are relatively good (above the average), suggesting that this work makes a positive contribution to improving research within hyper-heuristics.

6 Conclusion

This paper outlined the implementation of a hyper heuristic to solve the examination timetabling problems. The construction stage uses an adaptive squeaky wheel algorithm to generate a feasible and relatively good initial solution. The improvement phase focusses on how combing reinforcement learning algorithms into a hyper-heuristic, with the examination timetabling problems as the target. It uses four selection methods and four utility value update methods inspired by reinforcement learning, with a series of experimentation performed to test the effectiveness of the approach.

The main aim of this work is in identifying a general approach to utilizing reinforcement learning algorithms in solving complicated resource allocation problems. The examination timetabling problems were selected due to their high complexity and the fact this complexity means they are not suitable to be solved directly using reinforcement learning alone. The work and results presented show that combining reinforcement learning with a hyper-heuristic forms a workable way to solve complicated problems. This also contributes to the automatic scheduling research area using a creative hyper heuristic. Multiple selection methods and utility values inspired by reinforcement learning methods fits the learning theory of the hyper heuristic. The problem area of examination timetabling, and in particular the chosen benchmarks used are closely reflective of real world scheduling problems, bringing theory to real world problem solving. In addition, the tests carried out also cover the investigation of how reheating influences the simulated annealing searching process.

Future work is planned to extend the reinforcement learning hyper-heuristic to other resource allocation optimization problems as well as combining further reinforcement learning algorithms into this area. Another strand of research from this work will be involved in the discovery of how the unsupervised learning algorithms can better fit into the optimization research area. The ultimate aim is to prove this approach works in general for a wide variety of optimisation problems, with encouraging results already obtained in progressing towards this goal.


1. B. McCollum, “A perspective on bridging the gap between theory and practice in university timetabling,” Practice and Theory of Automated Timetabling VI, vol. 3867, pp. 3–23, 2007.

2.      R. Qu, E. K. Burke, B. McCollum, L. T. G. Merlot, and S. Y. Lee, “A survey of search methodologies and automated system development for examination timetabling,” Journal of Scheduling, vol. 12, no. 1, pp. 55–89, Oct. 2008.

3.      S. Kristiansen and T. R. Stidsen, “A Comprehensive Study of Educational Timetabling – a Survey,” Department of Management Engineering, Technical University of Denmark, no. November, 2013.

4.      B. McCollum, A. Schaerf, B. Paechter, P. McMullan, R. Lewis, A. J. Parkes, L. D. Gaspero, R. Qu, and E. K. Burke, “Setting the Research Agenda in Automated Timetabling: The Second International Timetabling Competition,” INFORMS Journal on Computing, vol. 22, no. 1, pp. 120–130, May 2009.

5.      M. W. Carter, G. Laporte, and S. Y. Lee, “Examination Timetabling : Algorithmic Strategies and Applications,” The Journal of the Operational Research Society, vol. 47, no. 3, pp. 373–383, 1996.

6.      E. K. Burke, G. Kendall, B. McCollum, and P. Mcmullan, “Constructive versus improvement heuristics: an investigation of examination timetabling,” 3rd Multidisciplinary International Scheduling Conference: Theory and Applications, pp. 28–31, 2007.

7.      T. B. Cooper and J. H. Kingston, “The Complexity of Timetable Construction Prob- lems,” Lecture Notes in Computer Science, vol. 1153, pp. 281–295, 1996.

8.      O. Arogundade, A. Akinwale, and O. Aweda, “Genetic Algorithm Ap- proach for a Real-World University Examination Timetabling Problem,” Int. J. Comput. Appl., vol. 12, no. 5, pp. 0975-8887, Dec., 2010.

9.      E. Ozcan and A. Alkan, “A Memetic Algorithm for Solving a Timetabling Problem:An Incremental Strategy,” In Multidisciplinary international scheduling: theory and applications (MISTA07), Paris, France, 2007.

10.    S. Abdullah, S. Ahmadi, K. Burke, and M. Dror, “Investigating Ahuja-Orlins large neighbourhood search approach for examination timetabling,” OR-Spektrum, vol. 29, no.1, pp. 351372, 2007.

11.     A. Syariza, E. Burke, A. Bargiela, B. McCollum, and E. Ozcan, “A Constructive Approach to Examination Timetabling based on Adaptive Decomposition and Ordering,” Ann. Oper. Res., vol. 218, no. 1, pp. 3-21, 2014.

12.     F. Glover, tabu search fundamentals and uses. University of Colorado, Boulder.

13.     M.Eley,Ant algorithms for the exam timetabling problem, International Conference on the Practice and Theory of Automated Timetabling. Springer Berlin Heidelberg, 2006.

14.     J. Thompson and K. Dowsland, “Variants of simulated annealing for the examination timetabling problem,” Ann. Oper. Res., vol. 63, pp. 105128, 1996.

15.     McCollum, Barry, et al. “An extended great deluge approach to the examination timetabling problem.” Proceedings of the 4th multidisciplinary international scheduling: Theory and applications 2009 (MISTA 2009) (2009): 424-434.

16.     K. Burke, J. Eckersley, B. McCollum, S. Petrovic, and R. Qu, “Hybrid variable neighbourhood approaches to university exam timetabling,” Eur. J. Oper. Res., vol. 206, no. 1, pp. 4653, 2010.

17.     L. Mumford, “A multiobjective framework for heavily constrained examination timetabling problems,” Ann. Oper. Res., vol. 180, no. 1, pp. 3-31, 2010.

18.     E.O ̈zcan, Y.Bykov, M.Birben, and E.Burke, “ Examination timetabling using late acceptance hyper-heuristics,” in Proceedings of the 2009 IEEE congress on evolutionary computation (CEC 2009), Trondheim, Norway.

19.     E. Burke, S. Petrovic, and R. Qu, “Case based heuristic selection for timetabling problems,” Journal of Scheduling, vol. 9, no. 2, pp. 115-132, 2006.

20.     H. Asmuni, E. Burke, J. Garibaldi, B. McCollum, and A. Parkes, “An investigation of fuzzy multiple heuristic orderings in the construction of university examination timetables,” Comput. Oper. Res., vol. 36, no. 4, pp. 981-1001, 2009.

21.     P. Corr, B. McCollum, M. McGreevy, and P. McMullan, “A new neural network based construction heuristic for the examination timetabling problem,” In Lecture notes in computer science: Vol. 4193. Parallel problem solving from naturePPSN IX, pp. 392-401, 2006.

22.     Burke E K, Gendreau M, Hyde M, et al. Hyper-heuristics: A survey of the state of the art[J]. Journal of the Operational Research Society, 2013, 64(12): 1695-1724.

23.     Burke E K, Hyde M, Kendall G, et al. A classification of hyper-heuristic approaches[M]//Handbook of metaheuristics. Springer US, 2010: 449-468.

24.     Ouelhadj D, Petrovic S. A cooperative hyper-heuristic search framework[J]. Journal of Heuristics, 2010, 16(6): 835-857.

25.     Ross P. Hyper-heuristics[M]//Search methodologies. Springer US, 2005: 529-556.

26.     N. R. Sabar, M. Ayob, R. Qu, and G. Kendall, “A graph coloring constructive hyper-heuristic for examination timetabling problems,” Applied Intelligence, vol. 37, pp. 1-11, 2012.

27.     K. Anwar, A. T. Khader, M. A. Al-Betar, and M. A. Awadallah, “Harmony Search-based Hyper-heuristic for examination timetabling,” in Signal Processing and its Applications (CSPA), 2013 IEEE 9th International Colloquium on, 2013, pp. 176-181.

28.     E. Burke, G. Kendall, M. Mısır, and E. Özcan, “Monte Carlo hyper heuristics for examination timetabling,” Annals of Operations Research, vol. 196, pp. 73-90, 2012/07/01 2012.

29.     Özcan E, Mısır M, Ochoa G, et al. A Reinforcement Learning: Great-Deluge Hyper-Heuristic[J]. Modeling, Analysis, and Applications in Metaheuristic Computing: Advancements and Trends: Advancements and Trends, 2012: 34.

30.     D. P. Clements and D. E. Joslin, “Squeaky Wheel Optimization,” Journal Of Artificial Intelligence Research, vol. 10, pp. 353–373, May 1999.

31.     Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. Cambridge: MIT press, 1998.

32.     D. Abramson, “Constructing School Timetables using Simulated Annealing: Sequential and Parallel Algorithms”, Management Science, vol. 37, no.1, 98-113, 1991.

33.     D. Abramson, M. Krishnamoorthy, and H. Dang, “Simulated Annealing Cooling Schedules for School Timetabling Problem”, 1997. Available: http://citeseer.nj.nec.com/104097.html

34.     E. Poupaert, Y. Deville, “Simulated Annealing with Estimated Temperature”, AI Communication, vol. 13, 2000, 19-26.

35.     J. Thompson, and K. A. Dowsland, ”General Cooling Schedules for a Simulated Annealing Based Timetabling System”, In: Practice and Theory of Automated Timetabling, First International Conference: Selected Papers, (E. Burke & P. Ross , Eds.), Edinburgh, U.K., August/September 1995. Lecture Notes in Computer Science 1153, Springer-Verlag, 345-363, 1996.

Cite This Work

To export a reference to this article please select a referencing stye below:

Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.

Related Services

View all

Related Content

All Tags

Content relating to: "Education"

Education is the process of teaching or learning, especially systematically during childhood and adolescence, in a school or college, or the knowledge that someone gains from this. Post study, education can mean the imparting or acquiring of specific knowledge or skills required for a task, or profession for example.

Related Articles

DMCA / Removal Request

If you are the original writer of this dissertation and no longer wish to have your work published on the UKDiss.com website then please: