Data-aware Scheduling in Massive Heterogeneous Systems

Data-aware scheduling in large-scale heterogeneous computing systems remains a challenging research issue, especially in the era of Big Data. The data-related components of popular distributed environments, such as Data Clouds (DCs), Data Grids (DGs) and Data Centers, are designed to support the processing, analysis and monitoring of the big data generated at computing centers by end-users, devices and services. Data scheduling must therefore be integrated into a single joint process with the scheduling of computing tasks and applications. As a result, many of the current optimization criteria need to be revised, and new requirements have to be considered in the scheduling process, including data transmission times, data processing times, availability of the data servers, and security and authentication in data access. This paper presents a new version of the Expected Time to Compute Matrix model (ETC Matrix) for data-aware independent batch scheduling in the physical networks of DG and DC environments. Simple genetic-based schedulers have been developed to demonstrate experimentally the significance of the presented problem.


INTRODUCTION
Massive heterogeneous systems, such as grid and cloud environments, are large-scale global infrastructures that enable remote access to a variety of data types and applications and to a large number of databases. Data in such systems can also be generated by multiple highly distributed users and by different types of services and sources, such as mobile devices, computing applications, social networks, enterprises and cameras. Researchers must not only face the problem of scheduling such big data, but also develop new methodologies and models for the effective management of large volumes of data and information.
Common scheduling issues in distributed environments are mainly concerned with the CPU-related requirements of task processing, such as makespan, flowtime, resource usage and energy utilization (Kolodziej 2012). In such approaches, typical data-related scheduling criteria, such as data processing and transmission times, data availability, and data access and security requirements, are not considered. In reality, data facilities can be located anywhere, under different access rights and administrative domains, which is far from these assumptions.
Scheduling with data-awareness has been considered in many research works on cluster computing, DG infrastructures and, more recently, DCs (Buyya et al. 2005). Most of the available surveys concentrate on data processing optimization and on the reliability of data servers in data centers. Other approaches focus on data transmission scheduling and data allocation (Kosar and Balman 2009) for effective resource and storage utilization, or on energy-aware scheduling in large-scale data centers (Kliazovich et al. 2010; Kolodziej et al. 2011). GridBatch (Liu and Orban 2008) is a good example of a system that addresses large-scale data-intensive issues in cloud environments. The significant research challenge is to process the huge amount of data in such infrastructures efficiently, and the major issue is the scheduling process with the data transmission criteria.
In this work, a new version of the Expected Time to Compute Matrix (ETC Matrix) model is defined for Computational Grids (CGs) and the physical layers of cloud environments, with new requirements such as data transmission and the separation of data from transformation (Zeadally et al. 2011; Xhafa 2010). The main aim of this paper is to define the scheduling process with the above criteria as a multiobjective global optimization problem, similarly to classical grid scheduling with the ETC Matrix model (Ali et al. 2000a). Grid schedulers in the proposed model have both DG and CG features to meet the required performance of grid-enabled applications ( and Xhafa 2011a; Xhafa 2011b). This work is a simple extension of our previous results presented in (Szmajduch 2014). We implemented the developed model in dynamic grid scenarios for three types of grid environments: small (1024 tasks/64 hosts), medium (2048 tasks/128 hosts) and large (4096 tasks/256 hosts) grids.
The remainder of the paper is structured as follows. The next section defines the modified data-aware ETC Matrix model for independent batch scheduling and the major scheduling requirements. The analysis of the empirical results is then described. The last section summarizes the paper and presents the conclusions.

DATA-AWARE EXPECTED TIME TO COMPUTE (ETC) MATRIX MODEL
We consider a batch scheduling problem in which independent tasks need, for their execution, multiple data packages located at various heterogeneous data hosts in physical computational infrastructures such as a large-scale cluster, a grid or the Infrastructure as a Service (IaaS) layer of a cloud system. The required data collection can be replicated at different servers and databases and can be delivered to the computational grid over networks of different capabilities (see Fig. 1). Such a data-aware grid system is composed of the following elements: a meta-task $N = \{t_1, \ldots, t_n\}$, defined as a batch of independent tasks; a set of computing grid nodes $M = \{m_1, \ldots, m_m\}$ available for a given batch; a set of data files $F = \{f_1, \ldots, f_r\}$ needed for the batch execution; and a set of data hosts $D = \{dh_1, \ldots, dh_s\}$ dedicated to data storage and having the necessary data service capabilities.
The task workload vector $WL = [wl_1, \ldots, wl_n]$ defines the computational load of the meta-task, where $wl_j$ is the estimated computational load of task $t_j$, measured in Millions of Instructions (MI). Each task $t_j$ needs a batch of data files $F_j \subseteq F$ for its correct computation. This batch is copied to and located at a set of data servers $DH_j$, which is a subset of $D$. Each file $f \in F_j$ is replicated on, and available from, the data hosts in $DH_j$. Each data host is assumed to be able to serve multiple data files at a time, and data replication is defined a priori as a separate replication process.
The computing capacity vector $CC = [cc_1, \ldots, cc_m]$ defines the performance of the available computational servers, where $cc_i$ denotes the computing capacity of server $m_i$, expressed in Millions of Instructions Per Second (MIPS). The ready times vector characterizes the prior load of every machine in $M$. To estimate the completion times of tasks allocated to a specific computational server, the Expected Time to Compute (ETC) matrix model (Ali et al. 2000a) is adapted.
The elements of the ETC matrix are estimated as the ratio of the corresponding coordinates of the workload and computing capacity vectors:
$$ETC[j][i] = \frac{wl_j}{cc_i} \qquad (1)$$
where $wl_j$ is the workload of task $t_j$ and $cc_i$ is the computing capacity of machine $m_i$. For every machine-task pair in Eq. (1), the value of the matrix element depends primarily on the computing speed of the machine. However, the diversity of tasks and resources in the system also has to be taken into account.
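The ratio described for Eq. (1) can be illustrated with a minimal sketch. The function name `etc_matrix` and the sample workloads and capacities are assumptions made for illustration only; they are not taken from the paper or from the simulator.

```python
def etc_matrix(workloads, capacities):
    """Build the ETC matrix: rows are tasks, columns are machines.

    workloads  -- computational load of each task in MI
    capacities -- computing capacity of each machine in MIPS
    Each element is workload / capacity, i.e. an estimated time in seconds.
    """
    if any(cc <= 0 for cc in capacities):
        raise ValueError("computing capacities must be positive")
    return [[wl / cc for cc in capacities] for wl in workloads]


# Hypothetical values: two tasks and two machines.
workloads = [4000.0, 1000.0]   # MI
capacities = [100.0, 200.0]    # MIPS
etc = etc_matrix(workloads, capacities)
print(etc)  # [[40.0, 20.0], [10.0, 5.0]]
```

Each row of the result gives the estimated execution time of one task on every machine, so a faster machine always yields a smaller entry within a row.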
For that reason, this model uses the Gaussian distribution to generate the elements of both the workload and the computing capacity vectors. A further key element of data-aware scheduling is the estimation of the data transfer time. The time needed to transfer a data file $f$, required for the execution of task $t_j$, from data host $dh$ to server $m_i$ is denoted by $TT[i][j][f]$ and can be computed as follows:
$$TT[i][j][f] = response\_time(dh) + \frac{Size[f]}{B(dh, i)} \qquad (2)$$
Fig. 1. Data-aware meta-task grid scheduling problem.
Here $response\_time(dh)$ stands for the response time of the data server $dh$ and is evaluated as the difference between the time when the request was sent to $dh$ and the time when the first byte of the data file $f$ reached the machine $m_i$ processing the task $t_j$. The size of the data file $f$ required for the execution of task $t_j$ is denoted by $Size[f]$ and expressed in Mbits. The bandwidth of the logical link connecting $dh$ and $m_i$ is denoted by $B(dh, i)$ and expressed in Mbits per time unit. The response times are the elements of the Data Response Times matrix. As with the workload and computing capacity vectors, the data response times are generated using the standard Gaussian distribution.
The major scheduling factors in the ETC matrix model are the completion times of the resources. The completion time $completion[j][i]$ of task $t_j$ on machine $m_i$ is the wall-clock time measured from the task submission until its completion. In the data-aware approach it depends strongly on the computing and transmission times specified in Eq. (1) and Eq. (2). The data transfer time can influence the task completion time differently, depending on the method used by the task to process the data file. Two possible scenarios are presented in Figure 2.
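The transfer time estimate described for Eq. (2) can be sketched as follows. The additive form (response time plus size divided by bandwidth) is a reading of the verbal definitions above, and the function name and sample values are hypothetical.

```python
def transfer_time(response_time, file_size_mbit, bandwidth_mbit_per_unit):
    """Estimated time to deliver one data file from a data host to a machine.

    response_time           -- delay until the first byte arrives (time units)
    file_size_mbit          -- size of the data file in Mbits
    bandwidth_mbit_per_unit -- bandwidth of the logical link in Mbits per time unit
    """
    if bandwidth_mbit_per_unit <= 0:
        raise ValueError("bandwidth must be positive")
    return response_time + file_size_mbit / bandwidth_mbit_per_unit


# Hypothetical example: an 800 Mbit file over a 100 Mbit/unit link
# served by a data host that responds after 0.5 time units.
print(transfer_time(0.5, 800.0, 100.0))  # 8.5
```

The sketch makes the two contributions visible: the response time is a fixed cost per request, while the second term grows with the file size and shrinks with the link bandwidth.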
In the first scenario, all data files needed for the computation of the task are delivered to the machine before the execution of any task from the batch assigned to this machine, including the considered task. The bandwidth of every transfer is calculated according to the number of possible simultaneous data transfers. In this case the completion time of the task on the machine is expressed by Eq. (3).
The second scenario is similar, with the difference that only part of the data is delivered in advance and the rest of the data required for the execution of every task on this particular machine (including the considered task) is delivered while the tasks are executing. In this case, the delivery times of the streamed data files are concealed by the execution times of the tasks, and the completion time of the task on the machine is calculated with a different equation, Eq. (4), which involves the batch of data files delivered before the execution of the task and of all other tasks assigned to this machine.
This study treats the data hosts as data storage centers separated from the computing resources.
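The difference between the two scenarios can be sketched in code. This is only one plausible reading of the verbal description, not the paper's Eqs. (3) and (4): it assumes that transfers are serialized (so their times add up) and that streamed transfers fully overlap with execution. The function names and numbers are hypothetical.

```python
def completion_preloaded(transfer_times, etc):
    """Scenario 1: every data file is delivered before the task starts.

    Assumption: transfers are serialized, so their times are summed.
    """
    return sum(transfer_times) + etc


def completion_streamed(preloaded_times, streamed_times, etc):
    """Scenario 2: some files are delivered first, the rest during execution.

    Assumption: streamed transfers overlap with computation, so only the
    longer of the two (streaming or computing) contributes to the total.
    """
    return sum(preloaded_times) + max(sum(streamed_times), etc)


# Hypothetical task: three files with transfer times 2, 3 and 4 units,
# and an expected computation time (ETC entry) of 10 units.
transfers = [2.0, 3.0, 4.0]
etc = 10.0
print(completion_preloaded(transfers, etc))                    # 19.0
print(completion_streamed(transfers[:1], transfers[1:], etc))  # 12.0
```

Under these assumptions the streamed variant is never slower than the preloaded one, since the overlapped transfers are hidden whenever they take less time than the computation.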

Scheduling criteria
The overall data-aware batch scheduling procedure is performed in the following steps: obtain information about the resources available in the system; obtain information about the pending tasks; establish the location of the data hosts where the data files needed for task completion are placed; prepare a set of tasks and calculate a schedule for this set on the available machines and data hosts; allocate the tasks; and monitor the process and reschedule the tasks that failed. This process is presented graphically in Figure 3.
The scheduling criteria are defined as follows:
minimizing the completion time of the set of tasks (Eq. 5), where the completion time of each task is calculated by Eq. (3) or Eq. (4), according to the data transfer mode;
minimizing the makespan, computed as:
$$C_{\max} = \max_{i = 1, \ldots, m} completion[i] \qquad (6)$$
where $completion[i]$ is calculated as the sum of the completion times of all the tasks assigned to machine $m_i$ by using either Eq. (3) or Eq. (4);
minimizing the average flowtime. For a machine $m_i$, the flowtime $F[i]$ is computed over the chain of tasks executed on this machine (Eq. 7). The cumulative flowtime for the entire system is the sum of the $F[i]$ factors, namely:
$$F = \sum_{i = 1}^{m} F[i] \qquad (8)$$
Finally, the scheduling aim is to minimize the average flowtime per machine, which is defined as:
$$\bar{F} = \frac{F}{m} \qquad (9)$$
The above formal definitions of the major scheduling criteria are based on the ETC matrix model, which is very helpful in formulating such equations. The parameters $completion[i]$ form the completion vector. The full list of the major scheduling criteria defined in terms of completion times and the ETC matrix is presented in (Kolodziej 2012).
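The makespan and average flowtime criteria can be illustrated with a short sketch. It assumes a schedule encoded as a list that maps each task index to a machine index, and it assumes that the tasks on a machine run one after another in ascending order of their completion times when the flowtime is accumulated. Both the encoding and the ordering are illustrative assumptions, as are all names and values.

```python
def machine_totals(schedule, times, n_machines):
    """Sum the completion times of the tasks assigned to each machine."""
    totals = [0.0] * n_machines
    for task, machine in enumerate(schedule):
        totals[machine] += times[task][machine]
    return totals


def makespan(schedule, times, n_machines):
    """Largest per-machine total: the time when the last machine finishes."""
    return max(machine_totals(schedule, times, n_machines))


def average_flowtime(schedule, times, n_machines):
    """Sum of task finishing times over all machines, divided by machine count."""
    cumulative = 0.0
    for machine in range(n_machines):
        durations = sorted(
            times[task][machine]
            for task, assigned in enumerate(schedule)
            if assigned == machine
        )
        finish = 0.0
        for duration in durations:
            finish += duration      # finishing time of this task in the chain
            cumulative += finish
    return cumulative / n_machines


# Hypothetical completion times: times[task][machine], 3 tasks, 2 machines.
times = [[4.0, 8.0], [2.0, 6.0], [6.0, 3.0]]
schedule = [0, 0, 1]  # tasks 0 and 1 on machine 0, task 2 on machine 1
print(makespan(schedule, times, 2))          # 6.0
print(average_flowtime(schedule, times, 2))  # 5.5
```

In the example, machine 0 finishes its two tasks at times 2 and 6 and machine 1 finishes its single task at time 3, which gives a cumulative flowtime of 11 and an average of 5.5 per machine.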

EXPERIMENTS
The aim of this simple experimental analysis is to show how much data access and transfer can delay the whole scheduling process. The scheduling criteria considered in the experiments were the makespan and the average flowtime, calculated by using Eqs. 6 and 9. The results of data-aware scheduling were compared with those achieved in conventional scheduling, where data transfer times are ignored. In such a case it is assumed that all necessary data is stored at the computational nodes and ready for use, which is not a realistic scenario. Both data transfer scenarios specified in Section II are considered in the analysis. Therefore, the completion times in Eq. 5 are estimated using Eq. 3 in the first scenario and Eq. 4 in the second scenario.
As the scheduler in our experiments we used a simple genetic-based scheduler, presented in Fig. 4. This strategy is often used for solving classical combinatorial optimization problems (see Xhafa et al. 2007; Pinel et al. 2011; Michalewicz 1992). We configured the genetic operators as follows: selection: Linear Ranking; crossover: Cycle Crossover; mutation: Rebalancing; replacement: Steady State.
All of these genetic operators are commonly used in solving large-scale combinatorial problems. The detailed definitions of the operators and of the schedule representation can be found in ().
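As an illustration of one of the listed operators, the following sketch shows a generic textbook form of linear ranking selection. It is not necessarily the variant implemented in the scheduler used here; the selection pressure `eta_max`, the function names and the toy population are assumptions.

```python
import random


def linear_ranking_probabilities(n, eta_max=1.5):
    """Selection probabilities for ranks 0 (worst) to n - 1 (best).

    The probabilities grow linearly with the rank and sum to 1.
    eta_max in [1, 2] sets the expected number of copies of the best one.
    """
    if n == 1:
        return [1.0]
    eta_min = 2.0 - eta_max
    return [(eta_min + (eta_max - eta_min) * rank / (n - 1)) / n
            for rank in range(n)]


def select_parents(population, cost, k, rng=random):
    """Draw k parents; individuals with a lower cost get a higher rank."""
    ranked = sorted(population, key=cost, reverse=True)  # worst first
    weights = linear_ranking_probabilities(len(ranked))
    return rng.choices(ranked, weights=weights, k=k)


# Toy population of schedules (task -> machine); the cost is a stand-in
# for a criterion such as the makespan.
population = [[0, 1, 1], [1, 0, 0], [1, 1, 1], [0, 0, 1]]
parents = select_parents(population, cost=sum, k=2, rng=random.Random(1))
print(linear_ranking_probabilities(4))
```

Because the probabilities depend only on the rank and not on the raw cost values, the selection pressure stays constant even when the costs in the population differ by orders of magnitude.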
The input parameters for the scheduler are presented in Table 1.

TABLE 1. Settings of the genetic scheduler.

Parameter            Value
                     / 3
mut_prob             0.15
cross_prob           0.9
nb_of_epochs
max_time_to_spend    25 s

The population-size settings give the number of individuals in the base population and in the offspring populations. The parameters cross_prob and mut_prob denote the crossover and mutation probabilities. The parameter nb_of_epochs denotes the maximal number of executions of the main loop of the algorithm. Each loop execution is interpreted as a genetic epoch. The maximal number of such epochs is the main global stopping criterion for the scheduler. However, if the execution of those epochs takes too long, the algorithm is stopped after 25 s (max_time_to_spend).
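The two stopping criteria (an epoch limit and a time limit) can be combined as in the sketch below. The function `run_epochs`, its `step` callback and the injectable `clock` are hypothetical helpers; the epoch limit used in the example is an arbitrary illustrative value, not the setting from Table 1.

```python
import time


def run_epochs(step, nb_of_epochs, max_time_to_spend=25.0, clock=time.monotonic):
    """Run `step` once per genetic epoch until either limit is reached.

    nb_of_epochs      -- maximal number of main-loop executions
    max_time_to_spend -- time budget in seconds
    clock             -- time source; replaceable to make the loop testable
    Returns the number of epochs actually executed.
    """
    start = clock()
    epochs = 0
    while epochs < nb_of_epochs and clock() - start < max_time_to_spend:
        step()
        epochs += 1
    return epochs


# A trivial step finishes well within the time budget,
# so the epoch limit stops the loop.
print(run_epochs(lambda: None, nb_of_epochs=3))  # 3
```

The time is checked before each epoch, so a single long epoch can still overrun the budget; a stricter limit would need a check inside the step itself.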
The main reason for choosing such a simple scheduler was to demonstrate the impact of data transfer and access on the optimization of the scheduling criteria. Therefore, we wanted a simple method that is easy to implement for the performed analysis. However, this is just an early stage of our research in this domain, and we plan to conduct a comprehensive analysis of the effectiveness of various heuristic-based schedulers in data-aware scheduling.
The experiments were conducted using the Sim-G-Batch data grid simulator defined in (Kolodziej et al. 2012). The main input data for the simulator are: the workload vector of tasks, the computing capacity vector of machines, the vector of prior loads of machines, the ETC matrix of estimated execution times of tasks on machines, and the data host response times.
The parameters of the simulator are presented in Table 2. Three grid size scenarios are considered in our experiments: small (64 hosts/1024 tasks), medium (128 hosts/2048 tasks), and large (256 hosts/4096 tasks). The capacities of the resources, the data transmission times and the workloads of tasks are randomly generated from Gaussian distributions. This is the dynamic case, so the number of hosts and tasks can differ between time units (add_host, delet_host, add_task, delete_task parameters). It is assumed that all tasks submitted to the system must be scheduled and that all machines in the system can be used. The sizes of data files and the bandwidths are generated from uniform distributions on the intervals [2; 1600] and [10; 100], respectively.
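A problem instance with these distributions could be generated as in the sketch below. It does not reproduce the Sim-G-Batch simulator: the Gaussian means and standard deviations, the number of files and data hosts, and all names are hypothetical placeholders. Only the uniform intervals [2; 1600] and [10; 100] and the small-grid size come from the text.

```python
import random


def generate_instance(n_tasks, n_hosts, n_files, n_data_hosts, seed=0):
    """Draw one random scheduling instance (all Gaussian parameters assumed)."""
    rng = random.Random(seed)

    def positive_gauss(mean, sd):
        # Redraw non-positive samples: loads, capacities and delays must be > 0.
        value = rng.gauss(mean, sd)
        while value <= 0.0:
            value = rng.gauss(mean, sd)
        return value

    return {
        "workloads": [positive_gauss(1000.0, 300.0) for _ in range(n_tasks)],
        "capacities": [positive_gauss(100.0, 30.0) for _ in range(n_hosts)],
        "response_times": [positive_gauss(1.0, 0.3) for _ in range(n_data_hosts)],
        # Uniform intervals as stated in the text.
        "file_sizes": [rng.uniform(2.0, 1600.0) for _ in range(n_files)],
        # One bandwidth per (data host, machine) link.
        "bandwidths": [[rng.uniform(10.0, 100.0) for _ in range(n_hosts)]
                       for _ in range(n_data_hosts)],
    }


# Small grid scenario: 64 hosts and 1024 tasks; file and data host counts assumed.
small = generate_instance(n_tasks=1024, n_hosts=64, n_files=200, n_data_hosts=8)
print(len(small["workloads"]), len(small["capacities"]))  # 1024 64
```

Fixing the seed makes a generated instance reproducible, which is convenient when the same instance has to be scheduled with and without data transfer times.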

Results
The results of the experiments achieved in the two data transfer scenarios (see Section II) and in the No Data Transfer (NDT) case are presented in Table 3 (makespan) and Table 4 (average flowtime). The results were averaged over 30 independent runs of the simulator and are reported with standard deviation values [±s.d.]. Both makespan and average flowtime are expressed in arbitrary time units.
In both makespan and average flowtime optimization, a large difference is observed between the results achieved with data transfer and those achieved in the no data transfer case. In data-aware scheduling, the results achieved (for makespan and flowtime) in the Medium grid and Large grid infrastructures with data delivery during task execution are better than those for the prior load of all data files before task execution. In the Small grid the results are similar for both scenarios.

CONCLUSIONS AND RESEARCH DIRECTIONS
This paper presents a new version of the ETC Matrix model for batch scheduling in physical clusters with separate computing and data servers. In this model, the completion times of all tasks assigned to the computing nodes of the network include the data transmission times. Two data transmission scenarios were considered: prior load of all files necessary for the execution of the assigned tasks, and ad-hoc delivery of just the requested (necessary) data files during task execution. The model was implemented and analyzed experimentally for a dynamic grid infrastructure, where the number of network nodes and assigned tasks may vary between time intervals. The results of the experiments show that omitting the data transfer phase in the scheduling process may lead to poor estimates of the scheduling times and, more generally, of the scheduling costs.
The analysis is in its early stage. The presented work is a simple extension of the previous analysis published in (Szmajduch 2014). The author plans to extend it to virtual resources and databases and to extended cloud infrastructures, where mobile devices (smartphones, tablets, laptops, etc.) are considered computational nodes of the physical cloud layer and can additionally store and generate data. This will make it possible to validate the proposed model in much more realistic cloud scheduling scenarios, but will also increase the complexity of the scheduling problem.

Fig. 2. Two variants of completion time estimation for a task assigned to the machine $m_i$, with k data files needed for the task execution.

Fig. 3. Phases of the data-aware batch scheduling.
The main data-aware scheduling criteria are very similar to those used in common scheduling systems where data file transfers are not considered. They include the minimization of the completion time of the set of tasks, the makespan and the average flowtime.

Fig. 4. General template of the GA-scheduler implementation.

TABLE 2. Settings of the simulator.