GENETIC-BASED SOLUTIONS FOR INDEPENDENT BATCH SCHEDULING IN DATA GRIDS

Scheduling in traditional distributed systems has been studied mainly with respect to system performance parameters, without data transmission requirements. With the emergence of Data Grids (DGs) and Data Centers, data-aware scheduling has become a major research issue. In this work we present two implementations of classical genetic-based data-aware schedulers for independent tasks submitted to a grid environment. The results of a simple empirical analysis confirm the high effectiveness of genetic algorithms in solving very complex data-intensive combinatorial optimization problems.


INTRODUCTION
In today's heterogeneous computational systems with massive data processing, data-aware scheduling is one of the crucial problems and has attracted considerable attention from researchers in data-intensive computing. Much of the current effort is focused on scheduling task workloads, data location reorganization [8], and energy-effective scheduling in large-scale data centers [4]. In many grid and cloud approaches, scheduling problems are divided into two main classes: (i) those solved in computational systems, where it is usually assumed that data is delivered a priori and no data transfer times, data access rights, data availability (replication) or security issues are considered; and (ii) those solved specifically in Data Grids or data centers. However, efficient grid or cloud schedulers must take into account the features of both the computing and data infrastructures to achieve the desired performance of grid-enabled applications [7]. In such systems the data hosts are usually distributed in a similar way to the computational nodes, which makes the general scheduling problem a real research challenge [3].
In this work, we address a general grid scheduling problem for data-intensive applications submitted independently by the grid end users. Building on our previous work [6], we have integrated data transmission and data node location criteria with the traditional scheduling objectives, namely makespan and flowtime. We provide a simple empirical analysis of genetic-based schedulers, which were also tested in our previous works on a similar class of problems where data access and processing were ignored (see [5] for details). This analysis confirms the high effectiveness of genetic-based schedulers in solving complex data-intensive combinatorial optimization problems in dynamic computational environments. All experiments were conducted using the Data-Sim-G Batch data-aware grid simulator developed by the authors.
The remainder of this paper is structured as follows. First we define a modified Expected Time to Compute matrix model for data-aware independent batch scheduling. A brief presentation of the genetic schedulers and the main concept of the Data-Sim-G Batch grid simulator is followed by a simple analysis of the experiments conducted for two variants of the genetic schedulers. The paper ends with short conclusions and a future research plan.

Data-aware ETC Matrix model
We consider in this paper a general batch scheduling problem of tasks independently submitted to the system by the data-grid end users. This problem can be defined by the following four components (see also [6]):
• a batch of grid applications (tasks) N_batch = {t_1, ..., t_n}, where n is the size of the batch (the number of tasks in the batch);
• a set of computational grid resources M_batch = {m_1, ..., m_m}, where m is the total number of machines available in the system for a given batch;
• a set of data files F_batch = {f_1, ..., f_r} needed for the completion of the tasks from N_batch; and
• a set of data hosts DH = {dh_1, ..., dh_s} with the necessary data service capabilities.
We assume that 'tasks' in our model can be complex data-intensive applications, and 'machines' can be single CPUs, parallel machines or even small local computing clusters. These applications require multiple data files from data hosts, which can also be distributed in the grid system. This means that the data files needed for completing the grid applications can be located (and/or replicated) at various grid nodes, and their transfer to the computational nodes is provided by networks of varying capability.
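The four components above can be captured as plain data structures. The following sketch is purely illustrative: all class and field names are our own assumptions, not identifiers from the paper's simulator.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    workload_mi: float                                  # wload_j, in Millions of Instructions
    file_ids: list[int] = field(default_factory=list)   # indices into F_batch (the set F_j)

@dataclass
class Machine:
    capacity_mips: float                                # cc_i
    ready_time: float = 0.0                             # ready_i, prior load of the node

@dataclass
class DataFile:
    size_mb: float
    host_ids: list[int] = field(default_factory=list)   # data hosts holding replicas (DH_j)

@dataclass
class BatchInstance:
    tasks: list[Task]                                   # N_batch, |N_batch| = n
    machines: list[Machine]                             # M_batch
    files: list[DataFile]                               # F_batch
    num_hosts: int                                      # |DH| = s
```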
To characterize the tasks in the batch, we introduce a batch workload vector WLoad_batch = [wload_1, ..., wload_n], where wload_j denotes an estimate of the computational load of task t_j (in Millions of Instructions, MI). Each task t_j requires a set of files F_j = {f_(1,j), ..., f_(r,j)} (F_j ⊆ F_batch) that are distributed over a subset DH_j of the data hosts DH. We assume that each data host can serve multiple data files at a time and that data replication is defined a priori by a separate replication process [6].
The computational nodes of the grid system are characterized by a computing capacity vector CC_batch = [cc_1, ..., cc_m], where cc_i denotes the computing capacity of node i. Each cc_i parameter (i = 1, ..., m) can be expressed in clock frequency or in MIPS (Million Instructions Per Second) calculated for the CPUs in the resources. The prior load of each computational node from a given M_batch set is estimated by a ready times vector ready_times(batch) = [ready_1, ..., ready_m]. The workload and computing capacity parameters for tasks and computing grid nodes can be generated using Gamma probability distributions to express task and machine heterogeneity in the system (see [5], Chapter 2, for details).
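A minimal sketch of Gamma-based instance generation is given below. The shape/scale values are illustrative placeholders only; the paper's actual parameterization is given in [5], Chapter 2.

```python
import random

def generate_batch(n_tasks, n_machines, wl_shape=2.0, wl_scale=120.0,
                   cc_shape=2.0, cc_scale=500.0, seed=None):
    """Draw task workloads (MI) and machine capacities (MIPS) from Gamma
    distributions to model task and machine heterogeneity.
    All shape/scale defaults are hypothetical, not the paper's values."""
    rng = random.Random(seed)
    workloads = [rng.gammavariate(wl_shape, wl_scale) for _ in range(n_tasks)]
    capacities = [rng.gammavariate(cc_shape, cc_scale) for _ in range(n_machines)]
    return workloads, capacities
```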

Data-aware task execution time model
We use the Expected Time to Compute (ETC) matrix model [1] to estimate the times needed for the completion of the tasks assigned to the grid resources, taking into account also the data transmission times from the data nodes. The conventional ETC matrix model, used very often for solving independent grid scheduling problems, is based on an ETC array structure in which ETC[i][j] denotes the expected (estimated) time needed for computing task t_j on resource m_i. The values of the ETC[i][j] parameters depend on the processing speed of the machines to which the tasks are assigned. However, in data-aware scheduling, the data transmission times must be included in the model. Let us denote by TT[i][j][f_(p,j)] the time needed for the transfer of the data file f_(p,j) (p ∈ {1, ..., r}) from the data host dh_(p,j) ∈ DH_j to the computational node m_i. This parameter can be calculated as follows [6]:

TT[i][j][f_(p,j)] = response_time(dh_(p,j)) + Size(f_(p,j)) / B(dh_(p,j), i),   (1)

where response_time(dh_(p,j)) denotes the time needed for the computational node m_i to receive the first byte of the data file f_(p,j), measured from the moment the data request is received by the data host dh_(p,j), Size(f_(p,j)) denotes the size of the file, and B(dh_(p,j), i) denotes the bandwidth of the (logical) link between dh_(p,j) and m_i. The impact of the data transfer time on the task completion time depends on the mode in which the data files are processed by the task. There are two main scenarios to be considered: (a) in the first scenario, all data files needed for the execution of task t_j are transferred before the computational process starts; and (b) in the second scenario, it is assumed that the data files which are not necessary for the initialization of the execution of task t_j may be sent to the computational node later, during the calculation process (the files are accessed as data streams during the calculations).
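The transfer-time formula can be expressed directly in code. The function below is a straightforward transcription of Eq. 1, with scalar arguments standing in for the model's indexed quantities.

```python
def transfer_time(response_time, file_size_mb, bandwidth_mbps):
    """TT[i][j][f_(p,j)]: response time of the data host plus the file
    size divided by the bandwidth of the (logical) link to m_i (Eq. 1)."""
    if bandwidth_mbps <= 0:
        raise ValueError("link bandwidth must be positive")
    return response_time + file_size_mb / bandwidth_mbps
```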
We denote by completion[i][j] the estimated completion time for task t_j on machine m_i, calculated from the task's submission until its completion on node m_i, under the assumption that all required data is accessed and transferred from the data hosts. In the first scenario this parameter can be calculated as follows:

completion[i][j] = ready_i + ETC[i][j] + Σ_{p=1}^{r} TT[i][j][f_(p,j)],   (2)

where Σ_{p=1}^{r} TT[i][j][f_(p,j)] denotes the total time required for the 'sequential' transfer of all data files needed for the execution of task t_j.
In the second scenario (case (b)), the completion times for computational machines and tasks are calculated in the following way:

completion[i][j] = ready_i + Σ_{f_(p,j) ∈ F̂_j} TT[i][j][f_(p,j)] + ETC[i][j],   (3)

where F̂_j ⊆ F_j denotes the set of data files which are transferred prior to the task execution (the transfer of the remaining files is assumed to overlap with the computation). We will use the above completion[i][j] parameters for the definition of the optimization criteria (the schedulers' performance measures) in the simple empirical analysis presented in the next section.
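The two completion-time scenarios can be sketched as follows. This is a simplified reading of Eqs. 2-3 in which scenario (b) charges only the pre-staged transfers, assuming streamed files fully overlap with computation; the paper's exact formulation may differ.

```python
def completion_scenario_a(ready, etc, transfer_times):
    """Scenario (a): every file is staged in before execution, so all
    transfers are serialized ahead of the computation (Eq. 2)."""
    return ready + sum(transfer_times) + etc

def completion_scenario_b(ready, etc, prior_transfers):
    """Scenario (b): only the files in F^_j are transferred up front;
    the remaining files are streamed during execution, so their
    transfer is assumed to overlap with the computation (Eq. 3)."""
    return ready + sum(prior_transfers) + etc
```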
To make the system easily adaptable to various scheduling scenarios, we treat the data hosts as data storage centers separated from the computing resources. The scalability and effectiveness of the whole system depend strongly on the replication mechanism and on the data storage and computation capacities of the resources, which in some cases can be the main barrier to improving the schedulers' performance. In our previous works [5, 7] we assumed that each computing resource has its own data storage module. In those cases the internal data transfer times were low and we ignored them.

EMPIRICAL ANALYSIS
In this section we present the results of a simple empirical analysis of the performance of two implementations of GA-based data-aware schedulers for static and dynamic versions of the data-aware independent batch scheduling problem in grids. We developed the Data-Sim-G Batch simulator as a simple extension of our previously defined Sim-G Batch grid simulation toolkit (see [5]) with a data processing module. The GA-based schedulers were evaluated on two benchmarks composed of sets of static and dynamic instances generated by the grid simulator.

Scheduling Objectives
The scheduling phases in data-aware scheduling are similar to those in grid scheduling without data sets, and most of the conventional grid scheduling objectives, such as makespan and flowtime, can be easily adapted to the data-aware case. For the scenarios presented above, the objectives can be defined in the following way:
• Makespan:

makespan = max_{i ∈ M_batch} completion[m_i],   (4)

where completion[m_i] is computed as the sum of the completion times of the tasks assigned to machine m_i (see Eq. 3);
• Flowtime:
- The flowtime of a machine m_i is calculated over the sequence of tasks executed on that machine, that is:

F[i] = Σ_{j ∈ Sorted[i]} completion[i][j],   (5)

where Sorted[i] denotes the set of tasks assigned to machine m_i, sorted in ascending order by the corresponding ETC values.
- The cumulative flowtime in the whole system is defined as the sum of the F[i] parameters, that is:

flowtime = Σ_{i=1}^{m} F[i].   (6)

Both objectives are minimized. We consider a hierarchical optimization process with makespan as the privileged (major) criterion. Flowtime is optimized under the constraint of not increasing the best makespan value found so far. A wider list of scheduling criteria in data grids can be found in [2].
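A compact way to evaluate both objectives for a candidate schedule is sketched below. It interprets the per-machine sequence as tasks finishing one after another, with each task's running finish time contributing to flowtime; the function and argument names are ours, not the simulator's.

```python
def makespan_and_flowtime(schedule, completion):
    """schedule[i]: list of task indices assigned to machine i, already
    sorted in ascending ETC order (the paper's Sorted[i]).
    completion[i][j]: completion-time contribution of task j on machine i.
    Returns (makespan, cumulative flowtime), per Eqs. 4-6."""
    machine_completions = []
    flowtime = 0.0
    for i, tasks in enumerate(schedule):
        finish = 0.0
        for j in tasks:                 # tasks on a machine run sequentially
            finish += completion[i][j]
            flowtime += finish          # add each task's own finishing time
        machine_completions.append(finish)
    return max(machine_completions), flowtime
```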

Genetic-based data-aware schedulers
Owing to the wide assortment of constraints and different optimization criteria in grid scheduling, meta-heuristic methods are effective solutions for data-intensive grid scheduling problems [10]. Genetic-based schedulers can robustly explore the search space and can handle various scheduling attributes.
For solving the data-aware independent batch scheduling problem, we have used two implementations of simple genetic grid schedulers, similar to the methodologies used in our previous works, where a large set of benchmarks and problem instances has been defined (see [5] for a summary of the results). The two implementations, namely GA and StGA, differ in their replacement mechanisms. The general frameworks of the schedulers are based on the classical (µ + λ) evolutionary strategy (see e.g. [9]), adapted to the scheduling problem through the implementation of the following genetic operators:
• Initialization method: randomly generated initial population;
• Selection method: Linear Ranking Selection;
• Crossover operator: Partially Mapped Crossover (PMX);
• Mutation operator: Rebalancing;
• Replacement operators: Elitist Generational (GA) and Struggle (StGA).
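The overall (µ + λ) loop with linear-ranking selection and elitist replacement can be sketched generically as below. This is a minimal illustration of the GA variant's framework, not the authors' implementation; the crossover and mutation operators are passed in as callables, and the Struggle replacement of StGA is not shown.

```python
import random

def evolve(pop, fitness, crossover, mutate, mu, lam, generations, rng=None):
    """Minimal (mu+lambda) evolutionary loop: linear-ranking parent
    selection, then elitist replacement keeping the mu best of parents
    plus offspring. Lower fitness is better (both objectives are minimized)."""
    rng = rng or random.Random()
    for _ in range(generations):
        ranked = sorted(pop, key=fitness)        # best (lowest fitness) first
        n = len(ranked)
        weights = [n - r for r in range(n)]      # linear ranking: best rank weighs most
        offspring = []
        while len(offspring) < lam:
            p1, p2 = rng.choices(ranked, weights=weights, k=2)
            offspring.append(mutate(crossover(p1, p2)))
        # elitist (mu+lambda) replacement over parents and offspring
        pop = sorted(pop + offspring, key=fitness)[:mu]
    return min(pop, key=fitness)
```

Because replacement is elitist over parents and offspring, the best fitness in the population never worsens between generations.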
The detailed definition of these techniques can be found in [5].

Data-aware Batch grid Simulator -basic concept
The main concept of the Data-Sim-G Batch simulator is presented in Fig. 1. We extended the Sim-G Batch grid toolkit defined in [5] with an additional data processing module responsible for generating (i) a set of data files, (ii) a set of data hosts, (iii) a data transmission time matrix, (iv) a response time vector, and (v) a bandwidth vector. All of these are considered basic characteristics of a problem instance and, together with (vi) the task workload vector, (vii) the computing capacity vector, (viii) the prior load vector, and (ix) the ETC matrix, are passed on to the selected scheduler, which computes the schedule of task assignments to the machines. Finally, the scheduler sends the schedule back to the simulator, which performs the allocation.
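The nine inputs enumerated above form the record handed from the simulator to the scheduler. The helper below simply bundles them; all field names are illustrative assumptions, not the simulator's actual interface.

```python
def build_instance(workloads, capacities, ready, etc, files, hosts,
                   tt, response, bandwidth):
    """Bundle the nine scheduler inputs (i)-(ix) into one instance record.
    Field names are hypothetical placeholders."""
    return {
        "files": files,           # (i)   set of data files
        "hosts": hosts,           # (ii)  set of data hosts
        "tt": tt,                 # (iii) data transmission time matrix
        "response": response,     # (iv)  response time vector
        "bandwidth": bandwidth,   # (v)   bandwidth vector
        "workload": workloads,    # (vi)  task workload vector
        "capacity": capacities,   # (vii) computing capacity vector
        "ready": ready,           # (viii) prior load vector
        "etc": etc,               # (ix)  ETC matrix
    }
```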

Key input parameters for simulator and schedulers
The performance of the genetic-based schedulers was analyzed in two types of grid environment: static and dynamic.
In both cases we considered four grid size scenarios: small (32 hosts/512 tasks), medium (64 hosts/1024 tasks), large (128 hosts/2048 tasks), and very large (256 hosts/4096 tasks). The schedulers' key parameters, including mutation and crossover probabilities, population size and stopping criteria (the maximal number of evolution steps or a termination time criterion), are presented in Table 1. The values of the key simulator parameters for the static and dynamic grid scenarios are presented in Table 2, where N(*, **) denotes the Gaussian distribution.
The detailed interpretation of all parameters is available in [5].
Each experiment was repeated 30 times under the same configuration of operators and parameters.

Results
The averaged makespan and flowtime values are presented in Tables 3 and 4.
It can be observed from the comparison of the results that the struggle replacement mechanism has a crucial impact on the performance of the genetic scheduler. In all but three instances, calculated for both criteria in the static and dynamic scenarios, StGA outperforms the classical GA scheduler. The minimization of flowtime, where StGA was the best in all instances, is particularly noticeable if we take into account that flowtime was considered a secondary (less important) objective in the optimization process. Both schedulers are rather stable in the optimization, which is confirmed by the low values of the C.I. parameters. Finally, compared to the results achieved by similar implementations of the schedulers in the case where data transfer times are ignored (see [5], Chapter 4 for details), the values of makespan and flowtime have increased on average by 10-25%, which confirms the high importance of this criterion in data-intensive scheduling.

CONCLUSIONS AND FUTURE WORK
In this paper we have addressed a general data-aware scheduling problem for tasks submitted independently by the grid end users. We assumed that the completion of each task requires data files that are themselves distributed in the grid system and stored at heterogeneous data hosts. We have formalized the transmission time in such a way that it can be easily integrated into the classical optimization objectives of grid scheduling, namely makespan and flowtime, expressed in terms of the completion times of tasks on the computational grid nodes, where data can be transferred a priori or streamed during the task computation. For the empirical analysis, we implemented two versions of a simple genetic-based grid scheduler for the considered scheduling problem, aiming to minimize both the makespan and flowtime objectives in hierarchical mode, with makespan as the major (privileged) objective. The empirical analysis was performed using the developed Data-Sim-G Batch grid simulator.

Figure 1: General concept of Data Sim-G Batch

Table 1: Schedulers' key parameters for the static and dynamic benchmarks.

Table 2: Parameter settings for the grid simulator static instances.