Optimal Scheduling of Periodic Gang Tasks

The gang scheduling of parallel implicit-deadline periodic task systems upon identical multiprocessor platforms is considered. In this scheduling problem, parallel tasks use several processors simultaneously. We propose two DP-Fair (deadline partitioning) algorithms that schedule all jobs in every interval of time delimited by two subsequent deadlines. These algorithms define a static schedule pattern that is stretched at run-time in every interval of the DP-Fair schedule. The first algorithm is based on linear programming and is the first one to be proved optimal for the considered gang scheduling problem. Furthermore, it runs in polynomial time for a fixed number m of processors and an efficient implementation is fully detailed. The second algorithm is an approximation algorithm based on a fixed-priority rule that is competitive under resource augmentation analysis in order to compute an optimal schedule pattern. Precisely, its speedup factor is bounded by (2 − 1/m). Both algorithms are also evaluated through intensive numerical experiments. 2012 ACM


Introduction
We consider the preemptive scheduling of real-time tasks on identical multiprocessor platforms (see [13]). We deal with parallel real-time tasks, the case where each job may be executed on different processors simultaneously, i.e., we have job parallelism. Nowadays, the design of parallel programs is common thanks to parallel programming paradigms like Message Passing Interface (MPI [26,28]) or Parallel Virtual Machine (PVM [40,23]). Even better, sequential programs can be parallelized using standards like the OpenMP application programming interface (see [8] for details).
Contributions. We define and prove correct a technique to optimally schedule periodic implicit-deadline rigid gang tasks (see Definition 1 for details) upon multiprocessors. The algorithm is based on linear programming (LP) and runs in polynomial time for a fixed number $m$ of processors. The second proposed method is a fixed-task priority rule with a performance guarantee of $(2 - 1/m)$ under resource augmentation analysis. These algorithms are compared through numerical experiments.
Organization. Section 2 presents the studied scheduling problem and the related work. Section 3 presents basic results about DP-Fair scheduling of parallel tasks. Section 4 presents the LP-based optimal method and its implementation. Section 5 presents a gang heuristic and its worst-case performance analysis under resource augmentation. Section 6 presents numerical results comparing both methods on randomly generated task systems. Then, Section 7 concludes the paper and presents some future work.

2 Model, Problem and Related Work

Parallel Task & Job Models
We deal with jobs and tasks which may be executed on different processors at the very same instant, in which case we say that job (or task) parallelism is allowed. Various kinds of parallel task models exist. Goossens et al. [25] adapted parallel computing terminology [7] to recurrent (real-time) tasks and jobs as follows.
Definition 1 ([25] Rigid, Moldable and Malleable Job). A job is said to be (i) rigid if the number of processors assigned to this job is specified externally to the scheduler a priori, and does not change throughout its execution; (ii) moldable if the number of processors assigned to this job is determined by the scheduler, and does not change throughout its execution; (iii) malleable if the number of processors assigned to this job can be changed by the scheduler during the job's execution.
A recurrent task is said to be rigid if all its jobs are rigid, and the number of processors assigned to the jobs is specified externally to the scheduler; a recurrent task is said to be moldable if all its jobs are moldable; malleable if all its jobs are malleable.
Additionally, at task level the literature distinguishes between at least three kinds of parallelism:
Multithread [12,32,39,38,1]. Each task is a sequence of phases (multiphase in the following), each phase is composed of several threads, each thread requires a single processor for execution, and threads can be scheduled simultaneously [38]. A particular case is the Fork-Join task model (see e.g. [36]), where the phases are an alternating sequence of sequential and parallel segments: a task begins as a single master thread that executes sequentially until it encounters the first fork construct, where it splits into multiple parallel threads which synchronize (join) on their termination, and so on.
DAG task model [2,6,34]. The model generalizes the fork-join model: each task is represented as a directed acyclic graph, which is a set of precedence-constrained sequential jobs. Any group of jobs that are not constrained may execute in parallel.
Gang [11,4,25,30]. Each task corresponds to an $e \times k$ rectangle, where $e$ is the execution time requirement and $k$ the number of required processors, with the restriction that the $k$ processors must execute the task in unison (i.e., at the exact same time). This model is representative of real-world parallel applications where, at submission time, users or the scheduler select the number of processors for the task [14,16], and consequently the number of generated threads corresponds to the number of used processors, as tools like MPI [41] and OpenMP [9] do. The threads communicate with each other; they must be ready to communicate at the same time, which imposes the synchronous execution of threads.

Our Task/Job Model and Scheduling Problem
At job-level we consider the preemptive scheduling of parallel jobs upon a platform of $m$ identical processors. We focus on the problem of scheduling a set of rigid gang parallel jobs: each job $J_j \stackrel{\mathrm{def}}{=} (r_j, v_j, e_j, d_j)$ is characterized by a release time $r_j$, a required number of processors $v_j$, an execution requirement $e_j$ and an absolute deadline $d_j$. The job $J_j$ must execute for $e_j$ time units over the interval $[r_j, d_j)$ on $v_j$ processors. We consider the scheduling of rigid jobs since $v_j$ is fixed externally to the scheduler.
Our main scheduling problem concerns periodic (and sporadic, see the discussion in Section 5.3) preemptive hard real-time systems. Let $\tau \stackrel{\mathrm{def}}{=} \{\tau_1, \ldots, \tau_n\}$ denote a set of $n$ periodic implicit-deadline rigid gang parallel tasks. Each task $\tau_i \stackrel{\mathrm{def}}{=} (v_i, C_i, T_i)$ will generate an infinite number of jobs, where the $k$-th job of task $\tau_i$ is released at time $(k-1)T_i$, requires $v_i$ processors for $C_i$ time units, and has the absolute deadline $kT_i$. In the following $u_i \stackrel{\mathrm{def}}{=} C_i/T_i$ denotes the utilization factor of task $\tau_i$. This utilization is not related to the number of required processors ($v_i$): it is a horizontal notion. The execution requirement of a job of $\tau_i$ corresponds to a $C_i \times v_i$ rectangle.
This Research. We study the scheduling of preemptive periodic implicit-deadline rigid gang real-time tasks. We address the feasibility and the schedulability questions by designing an optimal scheduler. The proposed approach exploits the deadline partitioning of the schedule. Precisely, in every slice delimited by two subsequent deadlines in the schedule, a static schedule pattern is stretched according to the slice length. Two algorithms are proposed to define such a pattern. The first algorithm is based on linear programming and is the first one to be proved optimal for the considered gang scheduling problem. Furthermore, it runs in polynomial time for a fixed number $m$ of processors and an efficient implementation is fully detailed. The second algorithm is an approximation algorithm based on a fixed-priority rule that is competitive under resource augmentation analysis for computing a static schedule pattern. Precisely, its speedup factor is bounded by $(2 - 1/m)$. Both algorithms are also evaluated through intensive numerical experiments.
For the sake of simplicity we consider periodic tasks and we assume that $C_i$ is the exact duration of the task $\tau_i$. Consequently, we consider an offline scheduling problem. The discussion of Section 5.3 extends the scope of our result to the scheduling of sporadic tasks with $C_i$ as the worst-case execution time.

Related Work
Optimal solutions. To the best of our knowledge there is a single work which provides an optimal solution for the scheduling of recurrent hard real-time parallel computing: the work of Collette et al. [11], which considers sporadic implicit-deadline malleable gang scheduling. The authors provide an optimal scheduler and an exact feasibility/schedulability test. The work we report in this document provides an optimal scheduler and an exact feasibility/schedulability test as well, except that we consider a more realistic parallel task model: our tasks are rigid (the degree of parallelism is specified at design time and does not change at run-time), while the authors of [11] consider malleable jobs.
Non optimal solutions. The other contributions to the scheduling of recurrent hard real-time parallel computing consider non optimal schedulers and schedulability tests (sufficient or exact). [30] considers the EDF (Earliest Deadline First [35]) scheduler for rigid gang sporadic tasks and proposes a sufficient schedulability test. [25,4] consider FTP (Fixed Task Priority, such as Rate Monotonic [35]) periodic gang scheduling and provide exact/sufficient schedulability tests. [12] considers FTP and periodic multithread tasks and proposes an exact schedulability test. [32] considers Deadline Monotonic scheduling of fork-join tasks and provides a competitive analysis for that suboptimal scheduler. [39] considers EDF and FTP scheduling for multiphase multithread periodic tasks and provides a task decomposition technique and a competitive analysis. [38] considers optimal schedulers for sequential tasks (e.g., DP-Fair, described in Section 3.4) and implicit-deadline multiphase multithread recurrent tasks, for which the authors propose a decomposition technique and a competitive analysis. More recent works [6,34] consider the DAG task model (which generalizes the fork-join model). The authors study the competitiveness of global EDF and global RM.

3 Basic Results

Hardness for arbitrary number of processors
Assuming that the number of processors is an input in the problem model, it is easy to show that the preemptive Gang scheduling problem is equivalent to the Bin Packing problem. Such a result was observed (without proof) for the non preemptive scheduling problem of parallel (Gang) tasks having unit processing times [5]. Hereafter, we provide a basic proof sketch to clearly exhibit the reducibility between both problems.
Theorem 2. Preemptive Gang scheduling is NP-hard in the strong sense for problem instances with an arbitrary number of processors.
Proof. (Sketch) We transform from Bin Packing [22]: given a finite set $A$ of items, a rational size $s(a) \in \mathbb{Q}^+$ for each $a \in A$, a positive integer bin capacity $B$, and a positive integer $K$, is there a partition of $A$ into disjoint sets $A_1, A_2, \ldots, A_K$ such that the sum of the sizes of the items in each $A_i$ is no more than $B$?
Without loss of generality we assume that all $s(a)$ have a common denominator and that $B$ is scaled accordingly. We define a Gang scheduling instance as follows: there are $m = B$ processors, and each item $a \in A$ is modeled as a unit-length task of period $K$, $\tau_a = (v_a, e_a, K)$, such that $e_a = 1$ and $v_a = s(a)$. The Bin Packing decision problem is then equivalent to determining whether there is a feasible schedule of period $K$ (i.e., a schedule in which no more than $m$ processors are simultaneously used at any time).
The previous transformation shows that preemptive Gang scheduling with an arbitrary number of processors contains the Bin Packing problem as a particular case (i.e., with all task execution times equal to 1). It is also known that the Bin Packing problem can be solved in polynomial time for any fixed $B$ by exhaustive search [22]. Such a property will also be exploited for defining our optimal polynomial time algorithm for scheduling periodic Gang real-time tasks.
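The transformation used in the proof sketch can be illustrated by a few lines of Python. This is only a sketch under stated assumptions: the function name `bin_packing_to_gang` and the tuple layout `(v_a, e_a, period)` are illustrative choices, and the scaling step makes the "common denominator" assumption explicit by multiplying sizes and capacity by the least common multiple of the denominators.

```python
from fractions import Fraction
from math import lcm


def bin_packing_to_gang(sizes, B, K):
    """Map a Bin Packing instance (sizes, B, K) to a gang instance.

    Returns (m, tasks) where every task is a hypothetical tuple
    (v_a, e_a, period) = (scaled size, 1, K).
    """
    sizes = [Fraction(s) for s in sizes]
    # Scale sizes and capacity so that every processor demand is an integer.
    scale = lcm(*[s.denominator for s in sizes])
    m = B * scale
    tasks = [(int(s * scale), 1, K) for s in sizes]
    return m, tasks


# Three items of sizes 1/2, 1/3 and 1, bins of capacity 1, K = 2 bins.
m, tasks = bin_packing_to_gang([Fraction(1, 2), Fraction(1, 3), 1], B=1, K=2)
print(m, tasks)
```

The sketch only builds the scheduling instance; deciding whether a feasible schedule of period $K$ exists remains the hard part.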

Maximum Rectangle Utilization Bound
In our task model (see Section 2.2), the task utilization $u_i \stackrel{\mathrm{def}}{=} C_i/T_i$ represents a horizontal utilization. In this section we introduce the notion of total rectangle utilization of a task set, which accounts for both the utilization $u_i$ and the number of processors $v_i$ of every task. The maximum rectangle utilization bound $U_b$ of a scheduler guarantees that every system of tasks whose total utilization is smaller than or equal to $U_b$ will be correctly scheduled. If the bound is tight, then beyond this utilization limit there exist systems of tasks which are not schedulable.
First notice that, since our scheduling problem of parallel tasks is a generalization of the popular scheduling problem of periodic sequential tasks, we have to report a negative result: the maximum rectangle utilization bound is $1/m$.
Theorem 3. The maximum rectangle utilization bound for the scheduling of periodic Gang tasks is $1/m$.
Proof. The result is established using a simple parallel task set with two tasks, given as $(v_i, C_i, T_i)$ triples: $\tau_1 = (1, 1, 1)$ and $\tau_2 = (m, \varepsilon, 1)$, where $\varepsilon$ is an arbitrarily small positive number. This system is trivially infeasible: $\tau_1$ occupies one processor during its whole period, so the $m$ processors required by $\tau_2$ are never simultaneously available. The total task rectangle utilization is arbitrarily close to $1/m$.

Scheduling Anomalies
We have to report a second negative result concerning our scheduling problem: FJP (Fixed Job Priority, such as Earliest Deadline First [35]) and consequently FTP gang scheduling are not predictable [25] (see footnote 2). Here is an example with three jobs (see footnote 3) on 2 processors (see Figure 1). Using the priority assignment $J_1 > J_2 > J_3$, Gang FJP schedules the set of jobs ($J_3$ completes at time-instant 2). Unfortunately, if the actual duration of $J_1$ is 1, $J_2$ will preempt $J_3$ at time $t = 1$ and $J_3$ will complete later, at time-instant 3. Then, $J_3$ does not miss its deadline in the "worst case" scenario, but misses it if $J_1$ uses less than its worst case execution time $C_1$. Thus, reducing an execution time can delay the completion of another job.

DP-Fair Scheduling for Sequential Tasks
While this research concerns parallel tasks, we first introduce a scheduling technique defined for sequential tasks (Section 3.5 shows how to apply the very same technique to gang tasks). Consequently, we assume in this section the scheduling of $n$ sequential and implicit-deadline tasks upon $m$ identical processors. Each task $\tau_i \stackrel{\mathrm{def}}{=} (C_i, T_i)$ is characterized by a worst-case execution duration $C_i$ and a period $T_i$. Seminal optimal multiprocessor scheduling techniques were based on the notion of proportionate fairness; this is the case, for instance, of the PF (Proportionate Fairness) scheduler [3]. This type of algorithm assumes that time is discrete.
The quantum-by-quantum construction of the schedule is not necessary in order to define an optimal algorithm [42,20,24]. DP (Deadline Partitioning) scheduling techniques do not decompose the tasks into single-time-unit (sub)tasks. The construction of the schedule is done over time intervals delimited by two consecutive deadlines, called blocks. In each block, every task receives a workload that is proportional to its utilization, so that the fairness property is satisfied at each deadline in the schedule.
Let $L_j$ be the length of the block $j$ delimited by two subsequent task deadlines; every implicit-deadline periodic task $\tau_i$ receives an amount of processor time equal to $L_j \times u_i$. Consequently, a task $\tau_i$ has received an execution time equal to $u_i \times T_i = C_i$ at each of its deadlines.
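The block decomposition described above can be sketched in Python as follows. The function name `dp_fair_blocks` is hypothetical, tasks are assumed to be synchronous with integer periods, and exact rationals are used so that the per-deadline sums can be checked without rounding errors.

```python
from fractions import Fraction
from math import lcm


def dp_fair_blocks(tasks):
    """tasks: list of (C_i, T_i) with integer periods, released at time 0.

    Returns (boundaries, allocations) over one hyperperiod, where
    allocations[j][i] = L_j * u_i is the share of task i in block j.
    """
    H = lcm(*[T for _, T in tasks])
    # Block boundaries are all the deadlines k * T_i within the hyperperiod.
    boundaries = sorted({k * T for _, T in tasks for k in range(H // T + 1)})
    utilizations = [Fraction(C, T) for C, T in tasks]
    allocations = [
        [(end - start) * u for u in utilizations]
        for start, end in zip(boundaries, boundaries[1:])
    ]
    return boundaries, allocations


boundaries, allocations = dp_fair_blocks([(1, 2), (1, 3)])
print(boundaries)   # deadlines of both tasks over the hyperperiod 6
print(allocations)  # per-block shares of each task
```

For the two tasks of the usage example, the blocks are $[0,2)$, $[2,3)$, $[3,4)$ and $[4,6)$, and the second task accumulates $2/3 + 1/3 = 1 = C_2$ by its first deadline at time 3.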
2 A scheduling algorithm is predictable if reducing an execution requirement cannot increase the completion time of tasks.
3 $(r_j, v_j, e_j, d_j)$

The previous algorithms assume that time is by its nature discrete: the times at which the scheduler can be activated are integers (in other words, they correspond to the clock ticks of the real-time operating system). Discrete time is by its nature a source of complexity in multiprocessor scheduling and, for this reason, algorithms which exploit the continuous nature of time have been defined.
We will now detail the simplest algorithm in this category: the DP-Wrap algorithm.
The DP-Wrap algorithm is a very simple deadline fair algorithm which is optimal for tasks with implicit deadlines [20]. Contrary to the previous algorithms, DP-Wrap considers that time is continuous. The schedule is broken down into blocks delimited by deadlines/periods. The share of each task in an interval is equal to the length of the interval multiplied by the utilization of the task, i.e., $u_i \stackrel{\mathrm{def}}{=} C_i/T_i$. Thus, in the interval $[s_j, s_{j+1})$, each task $\tau_i$ is given $(s_{j+1} - s_j) \times u_i$ time units. Consequently, at each deadline, the tasks have received an execution time equal to $u_i \times T_i = C_i$. The scheduling (distribution) in each block is done by McNaughton's algorithm, which was proposed in 1959 [37].
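As an illustration, the wrap-around placement inside one block can be sketched as below. This is a minimal sketch of McNaughton's rule for sequential tasks, not the implementation evaluated in the paper; the function name `mcnaughton_wrap` and the output tuple layout are assumptions made for the example.

```python
from fractions import Fraction


def mcnaughton_wrap(allocations, m, start, length):
    """Place per-task shares in the block [start, start + length) on m processors.

    allocations[i] is the share of task i in the block (at most `length`).
    Returns a list of (task, processor, t0, t1) pieces.
    """
    assert all(a <= length for a in allocations)
    assert sum(allocations) <= m * length
    schedule, proc, offset = [], 0, Fraction(0)
    for task, remaining in enumerate(allocations):
        while remaining > 0:
            chunk = min(remaining, length - offset)
            schedule.append((task, proc, start + offset, start + offset + chunk))
            remaining -= chunk
            offset += chunk
            if offset == length:  # processor is full: wrap to the next one
                proc, offset = proc + 1, Fraction(0)
    return schedule


# Three tasks of utilization 2/3 in a block of length 2 on 2 processors.
for piece in mcnaughton_wrap([Fraction(4, 3)] * 3, m=2, start=0, length=2):
    print(piece)
```

A task split by the wrap-around runs at the end of the block on one processor and at the beginning of the block on the next one; since its share does not exceed the block length, the two pieces never overlap in time.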
In the next section, we show how to reuse the deadline partitioning technique in order to define an optimal static gang scheduling algorithm.

DP-Fair and Gang Scheduling
The next result (Theorem 4) shows that, for our parallel scheduling problem (gang scheduling as defined in Section 2.2), we can, without loss of generality, consider DP-Fair schedules, i.e., schedules where the same pattern is replicated (and stretched) in each interval delimited by deadlines/periods. In Section 4 we will define a technique to build such a pattern optimally.
Theorem 4. Every feasible parallel task set is schedulable with a DP-Fair schedule.
Proof. Assume we have a feasible schedule; we show how to define a feasible DP-Fair schedule. Since parallel tasks are strictly periodic and are simultaneously released, the whole schedule is periodic and has a period equal to the hyperperiod. Within the hyperperiod $H$, every task is executed for $\frac{H}{T_i} C_i = H u_i$. To define a DP-Fair schedule:
1. Schedule pattern: scale the complete schedule of the hyperperiod to fit within a unit-length time slot, in which every task $\tau_i$ is then executed for $u_i$.
2. Stretch the pattern accordingly in every block of the schedule.
The corresponding schedule is DP-Fair. At every block boundary $t$, every task $\tau_i$, $1 \le i \le n$, has received exactly $t \cdot u_i$. Hence, for every time instant $t$ corresponding to a deadline of task $\tau_i$ (i.e., $t = kT_i$), the task $\tau_i$ has received exactly $kT_i u_i = kC_i$ and thus has been executed to completion by its deadline.
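The fairness claim used at the end of the proof can be spelled out as a telescoping sum. The notation $0 = s_0 < s_1 < \dots < s_b = t$ for the successive block boundaries up to $t$ is the one used for DP-Wrap in Section 3.4; the index $b$ is introduced here only for the illustration.

\[
\sum_{j=0}^{b-1} (s_{j+1} - s_j)\, u_i \;=\; (s_b - s_0)\, u_i \;=\; t \cdot u_i ,
\qquad\text{and for } t = kT_i:\quad kT_i \cdot \frac{C_i}{T_i} = kC_i .
\]

Since every deadline $kT_i$ is itself a block boundary, the first $k$ jobs of $\tau_i$ have received exactly their cumulated demand $kC_i$ at that instant.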

4 Optimal Pattern Definition

Research Method
Firstly, we will revisit a non real-time scheduling problem and its solution (the work of Błazewicz et al. [5]) where the main goal is to minimize the schedule length of non recurrent rigid gang jobs. We will show how that technique can be optimally adapted to our hard real-time scheduling problem.

The work of Błazewicz et al. revisited
In the problem studied by Błazewicz et al. [5], the jobs are released simultaneously at the time origin. Upon an identical multiprocessor platform, the scheduling problem is to find a schedule which minimizes the schedule length, i.e., the first instant where all jobs are completed, or equivalently to minimize the makespan. The authors present a polynomial time algorithm for a fixed number of processors $m$, based on linear programming, for computing an optimal schedule in the general case. Particular cases are also considered but are not useful in our framework. Notice that the problem is NP-hard for an arbitrary number of processors (i.e., when $m$ is an input of the problem) [15].
The method decomposes the schedule into a sequence of slices. Remember that a feasible allocation of jobs is one that uses no more than $m$ processors. Each slice $\sigma_i$ is characterized by the set $S_i$ of jobs it executes and the duration $x_i$ of their execution. The algorithm computes the length of every feasible allocation of jobs. As a result, the slices having a positive length are sequenced in arbitrary order. Moreover, the method minimizes $\sum_{i=1}^{M} x_i$, i.e., the makespan, to define an optimal schedule.
Definition 5 (Feasible allocation). A feasible allocation of jobs is a subset $s$ of job indexes that can be processed simultaneously on the platform: $\sum_{i \in s} v_i \le m$.
By definition we have a finite number of different feasible allocation sets. In the following $M$ is the number of different feasible allocation sets. Thus the set of all feasible allocation subsets is denoted $S = \{S_1, \ldots, S_M\}$.
Since every job uses at least one processor, a feasible allocation contains at most $m$ jobs, so the cardinality $M$ of $S$ is at most $\sum_{k=1}^{m} \binom{n}{k}$. It is important to notice that the number of subsets $M$ (i.e., the number of variables in the linear program) is therefore in $O(n^m)$, which is polynomial for fixed values of $m$.
Let $Q_j$ be the set of those subset indexes which contain job $J_j$. Let $x_i$ be the processing time of the subset $S_i$ in the schedule; the $x_i$ are the variables of the linear program. The linear program computes the schedule as a set of slices of length $x_i$. Every value $x_i$ such that $x_i > 0$ defines a slice in the schedule in which the jobs of $S_i$ are executed. Slices are executed in an arbitrary order without any inserted delays between them. Hence, the computed makespan is $\sum_{i=1}^{M} x_i$. The objective function is to minimize the makespan $\sum_{i=1}^{M} x_i$, and the linear program must enforce that all jobs are completed. The corresponding constraint is $\sum_{i \in Q_j} x_i = u_j$, $j = 1, \ldots, n$. Hence, the schedule with the smallest length (i.e., makespan) is defined as follows:
\[
\begin{array}{lll}
\text{minimize} & \sum_{i=1}^{M} x_i & \\
\text{subject to} & \sum_{i \in Q_j} x_i = u_j, & j = 1, \ldots, n \\
 & x_i \ge 0, & i = 1, \ldots, M
\end{array}
\]
Algorithm 1: Optimal Schedule Pattern construction by Linear Programming.
A solution for Example 6 is $x_1 = 1$ (duration of the execution of $J_1$ only), $x_2 = 1$ (duration of the execution of $J_2$ only), $x_3 = 0$ and $x_4 = 2$ (duration of the joint execution of $J_1$ and $J_3$), which corresponds to the schedule of Figure 2.
In the previous linear program, there are $M$ variables and $n$ constraints. It can be solved in polynomial time using for instance Khachiyan's algorithm [31]. This is essentially the method of Błazewicz et al. We next show how that technique can be used to solve our hard real-time scheduling problem (Section 4.3). Then, we will show how to speed up the resolution time by defining an efficient problem construction before calling the LP solver (Section 4.4).
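To make the structure of the linear program concrete, the following Python sketch builds the set of feasible allocations and checks a candidate vector of slice lengths against the completion constraints. It does not solve the LP (an LP solver is still required for that); the function names and the three-job instance are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def feasible_allocations(v, m):
    """All non-empty subsets of job indexes whose processor demand fits on m."""
    n = len(v)
    return [s for k in range(1, n + 1) for s in combinations(range(n), k)
            if sum(v[i] for i in s) <= m]


def check_pattern(x, S, u, tol=1e-9):
    """Return the makespan sum(x) if every job j runs for exactly u[j], else None."""
    if any(xi < -tol for xi in x):
        return None
    for j, uj in enumerate(u):
        # Sum of the slice lengths over Q_j, the allocations containing job j.
        executed = sum(xi for xi, s in zip(x, S) if j in s)
        if abs(executed - uj) > tol:
            return None
    return sum(x)


# Hypothetical instance: m = 3 processors, three jobs (u_j, v_j).
v, u = [2, 2, 1], [0.5, 0.5, 0.75]
S = feasible_allocations(v, 3)
print(S)
print(check_pattern([0, 0.25, 0, 0.5, 0.25], S, u))   # overlapping slices
print(check_pattern([0.5, 0.5, 0.75, 0, 0], S, u))    # purely sequential slices
```

In this instance the first two jobs cannot run together (they need 4 processors), so no pattern can be shorter than $u_1 + u_2 = 1$; the first candidate reaches that length, whereas running the jobs one after the other takes 1.75.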

Minimizing the Makespan vs. Meeting Hard Real-time Deadline of Recurrent Tasks
The next property establishes an equivalence between the Błazewicz et al. optimal solution and an optimal scheduler for our real-time scheduling problem thanks to the DP-Fair scheduling theory [33].
Theorem 7. A periodic implicit-deadline rigid gang task system is feasible if and only if the set of synchronous jobs $\{(u_i, v_i) \mid i = 1, \ldots, n\}$ has a minimum makespan not larger than one.
Proof. ⇐ Assuming the makespan of the set of synchronous jobs $(u_i, v_i)_{i=1,\ldots,n}$ is not larger than one, we have a schedule pattern which executes each task for a duration of $u_i$ in a unit-length interval. Using deadline partitioning fairness theory (DP-Fairness, see [33,20,10,18,21,19]) we can schedule our original periodic task set. The main idea of that kind of schedule is the deadline partitioning of the timeline: time is divided into slices bounded by two successive job deadlines [33,20]. All tasks are assigned a local execution time which is the length of the current slice times the task utilization $u_i$. As in DP-Wrap [20], we use an identical pattern (the solution of the LP) in each time slice, i.e., the unitary pattern is stretched according to the length of the slice. Thanks to the DP-Fair theory, we know that in each time interval $[kT_i, (k+1)T_i)$ the task $\tau_i$ executes during $u_i T_i = C_i$ time units on $v_i$ processors simultaneously. Consequently, the periodic gang task system is feasibly scheduled.
⇒ We show the contrapositive, i.e., we assume that all schedules of the synchronous job set have a length larger than one. Since DP-Fair is optimal for periodic implicit-deadline systems and since the technique of Błazewicz et al. determines the minimal makespan, we can conclude that it is necessary for the schedulability of periodic implicit-deadline tasks that in each slice every active task $\tau_i$ executes for $u_i$ times the slice length. Hence, in slices where all tasks are active (like the first one in the synchronous case), the tasks cannot all execute for $u_i$ times the slice length since the solution of the LP is larger than one. Consequently (because DP-Fair is optimal), the periodic system is not schedulable on $m$ unit-speed processors.
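The stretching step used in the first part of the proof can be sketched as follows. The representation of a pattern as a list of `(subset, x)` slices and the function name `stretch_pattern` are assumptions made for this example; exact rationals keep the block boundaries free of rounding errors.

```python
from fractions import Fraction


def stretch_pattern(pattern, start, length):
    """Scale a unit-length pattern to the block [start, start + length).

    pattern: list of (subset, x) slices with sum of x at most 1.
    Returns a list of (subset, t0, t1) slices inside the block.
    """
    assert sum(x for _, x in pattern) <= 1
    t, slices = Fraction(start), []
    for subset, x in pattern:
        if x > 0:  # slices of zero length are skipped
            slices.append((subset, t, t + x * length))
            t += x * length
    return slices


# Hypothetical unit pattern with three slices, stretched to the block [2, 6).
pattern = [((1,), Fraction(1, 4)), ((0, 2), Fraction(1, 2)), ((1, 2), Fraction(1, 4))]
print(stretch_pattern(pattern, start=2, length=4))
```

Each task that runs for $u_i$ in the unit pattern runs for $u_i$ times the block length in the stretched one, which is the DP-Fair share required by the theorem.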

LP implementation issues
Efficient generation of the set of all feasible allocations $S$ (Definition 5) is the main combinatorial problem when setting up the linear program (Algorithm 1) that computes the optimal schedule pattern. Even if the size of the problem is polynomially bounded for fixed values of $m$, the brute force definition of the set of all feasible allocations $S$ requires a huge amount of time for $n > 20$. This stage is the bottleneck of the approach since the linear program is solved quickly, as will be shown in the experimental section.
Figure 3: Search tree with nodes defined by task indexes in subsets; black nodes are feasible subsets whereas red nodes are infeasible subsets.
A simple way to implement a Brute Force generation of all subsets of tasks is to represent every feasible allocation by an integer whose binary encoding represents the tasks selected in a subset. There are $2^n$ subsets of a set with $n$ elements, exactly as there are $2^n$ different ways to write numbers with $n$ bits. Let $s$ denote such an integer: if the $i$-th binary digit of $s$ is 1, then task $\tau_i$ is in the allocation represented by $s$. With such a binary encoding, the brute force enumeration of all feasible allocations of $n$ tasks simply consists in counting from 1 to $2^n - 1$ and defining subsets from the binary encoding. Subsets corresponding to feasible allocations are those that do not use more than $m$ processors.
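A minimal Python sketch of this brute force enumeration is given below. The function name is hypothetical and bit $i$ (counted from 0) stands for the task at position $i$ of the list `v`; it is an illustration of the encoding, not the Matlab code used in the experiments.

```python
def brute_force_allocations(v, m):
    """Enumerate feasible allocations as bitmasks; bit i set means task i is selected."""
    n = len(v)
    feasible = []
    for s in range(1, 2 ** n):  # every non-empty subset, encoded as an integer
        demand = sum(v[i] for i in range(n) if (s >> i) & 1)
        if demand <= m:
            feasible.append(s)
    return feasible


# Three tasks using 3, 2 and 1 processors on a 3-processor platform.
print([bin(s) for s in brute_force_allocations([3, 2, 1], 3)])
```

The loop always visits all $2^n - 1$ subsets, whatever the number of feasible ones, which is why this approach does not scale with $n$.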
Using the same binary encoding principles for feasible allocations, we define an efficient generation using Depth First Search with lexicographic ordering of enumerated subsets. Tasks are sorted in non-increasing order of $v_i$. As a consequence, vertices corresponding to infeasible allocations are efficiently pruned. A branch in the search tree is pruned if the current vertex in the tree corresponds to a subset of tasks that uses more than $m$ processors. The search tree is illustrated in Figure 3 for three tasks $\{\tau_1, \tau_2, \tau_3\}$ which respectively use 3, 2 and 1 processors. The considered platform has 3 processors. Nodes define indexes of tasks in a subset. Black nodes are feasible subsets whereas red ones are infeasible (i.e., they require more than 3 processors). During the search, the node {123} is not defined since {12} is already infeasible (i.e., the branch is pruned or fathomed). Using a Depth First Search, the search tree simply consists of the list of unexplored subsets (i.e., encoded as one integer each) and its size is upper-bounded by $O(n^2)$ subset entries. Our Matlab implementation of this algorithm, denoted DFSLex hereafter, is limited to 64 tasks (i.e., due to Matlab 64-bit integers). Brute Force and DFSLex methods will be compared in the section dedicated to numerical experiments. The performance of the LP for optimally solving Gang scheduling problems will also be presented in Section 6.
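The pruned depth-first enumeration can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the DFSLex Matlab code: subsets are stored as tuples of 0-based task indexes rather than integers, and the function name is hypothetical. The correctness of the sketch does not depend on the order of `v`.

```python
def dfs_allocations(v, m):
    """Depth-first enumeration of feasible allocations with pruning.

    v is assumed sorted in non-increasing order, as in the text.
    Returns feasible subsets as tuples of task indexes, in lexicographic order.
    """
    n = len(v)
    feasible = []
    stack = [((), 0, 0)]  # (subset, processors used, next index to try)
    while stack:
        subset, used, first = stack.pop()
        # Children are pushed in reverse so that they are popped in increasing order.
        for i in range(n - 1, first - 1, -1):
            if used + v[i] <= m:  # otherwise the whole branch is pruned
                stack.append((subset + (i,), used + v[i], i + 1))
        if subset:
            feasible.append(subset)
    return feasible


# Same instance as Figure 3: tasks using 3, 2 and 1 processors, m = 3.
print(dfs_allocations([3, 2, 1], 3))
```

On the instance of Figure 3 the subsets {1,2} and {1,3} (0-based: (0, 1) and (0, 2)) are never pushed, so none of their supersets is ever generated.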
In the next section, we propose a heuristic that avoids the previous combinatorial problem. This heuristic has a performance guarantee in terms of resource augmentation (i.e., speedup factor).

Gang heuristic
This section presents a scheduling heuristic for defining the schedule pattern.As for the optimal solution presented in the previous section, we consider a DP-Fair schedule in which the pattern will be stretched in every block delimited by two subsequent deadlines of tasks.The heuristic algorithm is a fixed-task priority scheduling rule that runs in O(n log n) for an arbitrary number of processors.We provide a resource augmentation analysis to compare its worst-case performance against the optimal method.

Fixed-Task Priority Scheduling gang-h
[29] presents a heuristic for minimizing the makespan of preemptive parallel jobs. We shall reuse the basis of the algorithm for defining a fixed-task priority algorithm, denoted gang-h hereafter. As in the previous section, the algorithm is used to define the pattern of tasks to schedule in every time slot (i.e., block) of the DP-Fair schedule. As in the optimal method, the pattern is defined as a unit-length schedule in which utilization factors of tasks play the role of execution requirements. For each block in the schedule, the rule is simultaneously used to:
1. allocate the portion of each task to processors;
2. sequence tasks within the block.
gang-h is a fixed-task priority scheduling rule that works as follows [29]:
1. Priorities are assigned in non-increasing order of the number of requested processors (i.e., non-increasing order of $v_i$); ties are broken arbitrarily.
2. Scheduling decisions are taken every time a job is released or completed. At such an event, all tasks are preempted and the priority list is used to greedily allocate ready tasks to the processors, as long as the current job can be scheduled on the remaining available processors.
The complete algorithm is described in Algorithm 2. This algorithm considers synchronous jobs and will be used to define the pattern of jobs to be scheduled in every block of the schedule (as in the optimal algorithm).
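The following Python sketch illustrates the rule on synchronous jobs. It is a reading of the two steps above, not a transcription of Algorithm 2: the function name is hypothetical, only the makespan is returned, and a job that does not fit on the remaining processors is skipped so that lower-priority jobs can still be tried (this last point is an assumption of the sketch).

```python
def gang_h_makespan(jobs, m):
    """jobs: list of (length, v) released at time 0. Returns the greedy makespan."""
    # Fixed priorities: non-increasing v; the stable sort keeps input order on ties.
    order = sorted(range(len(jobs)), key=lambda j: -jobs[j][1])
    remaining = [length for length, _ in jobs]
    t = 0
    while any(r > 0 for r in remaining):
        free, running = m, []
        for j in order:  # greedy allocation following the priority list
            if remaining[j] > 0 and jobs[j][1] <= free:
                running.append(j)
                free -= jobs[j][1]
        if not running:
            raise ValueError("a job requires more than m processors")
        dt = min(remaining[j] for j in running)  # run until the next completion
        for j in running:
            remaining[j] -= dt
        t += dt
    return t


# m = 3: six unit-length jobs and one job of length 3, all on one processor.
m, unit, long_job = 3, [(1, 1)] * 6, (3, 1)
print(gang_h_makespan(unit + [long_job], m))  # long job has the lowest priority
print(gang_h_makespan([long_job] + unit, m))  # long job has the highest priority
```

The usage example takes $m^2 - m$ unit-length jobs and one job of length $m$, all using a single processor: depending on how the tie is broken, the makespan is $2m - 1 = 5$ or $m = 3$.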

Optimality under resource augmentation
The scheduling rule gang-h is obviously not optimal for minimizing the pattern makespan in our DP-Fair approach. Nevertheless, we next prove that it is as powerful as an optimal algorithm if it is allowed to use faster processors than the optimal algorithm, which is executed upon unit-speed processors. Such a performance guarantee quantifies the price being paid for using gang-h. Precisely, we establish that the speedup factor is bounded by $2 - 1/m$. We first recall the speedup factor metric.
Definition 8 (Speedup factor). A scheduling algorithm A has a speedup factor $f$, $f \ge 1$, if any task set that can be scheduled by an optimal algorithm on a given platform can be scheduled by A upon a platform in which each processor is $f$ times as fast as the processors available to the optimal algorithm.
For proving the resource augmentation performance guarantee for our real-time scheduling problem, we reuse the following result, which establishes the competitive ratio for the makespan minimization problem.
Lemma 9 (Theorem 3.1 in [29]). Let $w_L$ be the makespan computed by gang-h and $w_0$ be the optimal makespan, then $w_L \le (2 - \frac{1}{m})\, w_0$.
The approximation bound of $(2 - 1/m)$ can easily be proved to be tight by using the job set proposed in [27], which contains $m^2 - m + 1$ jobs: $(m^2 - m)$ unit-length jobs and one job of length $m$. Every job uses exactly one processor. Since ties are broken arbitrarily, assume that gang-h assigns the lowest priority to the job of length $m$. Consequently, gang-h defines a schedule that first allocates all unit-length jobs to the $m$ processors, which takes $(m^2 - m)/m = m - 1$ time units, and lastly the long job. This schedule is of length $m - 1 + m = 2m - 1$, whereas the optimal makespan is $m$, obtained by allocating the long job to a dedicated processor and by scheduling the unit-length jobs on the $m - 1$ remaining processors.
By Lemma 9, gang-h produces a pattern of length not exceeding $(2 - \frac{1}{m})\, w_0$ on $\Pi$. Now consider a platform $\Pi'$ where each processor is of speed $s = 2 - \frac{1}{m}$. The task set corresponding to $\tau$, denoted $\tau'$, has an execution requirement defined by $C_i/s$ for every task. By Lemma 10, the length of the schedule defined by gang-h upon $\Pi'$ does not exceed $(2 - \frac{1}{m})\, w_0 / s = w_0$. Hence, the pattern defined by gang-h is feasible. Thus, a processor speedup of $2 - 1/m$ is an upper bound on the price being paid for using the presented gang heuristic for defining the schedule pattern to be stretched in every block of the DP-Fair schedule.

Extending the scope of the results
For the sake of simplicity we considered implicit-deadline periodic tasks and we assumed that $C_i$ is the exact duration of the task $\tau_i$. In this section we discuss straightforward extensions (constrained deadlines, asynchronous systems and $C_i$ as the worst-case execution requirement) and possible extensions which are left as future work (sporadic tasks and arbitrary deadlines).

Constrained-deadlines.
We considered in this work implicit-deadline rigid gang parallel tasks.Constrained-deadline tasks are characterized by an additional parameters D i ≤ T i the relative deadline.Each constrained-deadline task τ i def = (v i , C i , T i , D i ) will generate an infinite number of jobs, where the k th job of task . DP-Fair techniques can be obviously extended for constrained-deadlines by considering the task density δ i def = C i /D i instead of task utilization.DP-Fair is not longer optimal for constrained-deadline and sequential tasks, but if the makespan of gang jobs {(δ i , v i ) | i = 1, . . ., n} is not larger than the unity our method schedule feasibly constrained-deadline gang tasks.Funk et al. extended for instance DP-Wrap for constrained-deadlines [20].
Asynchronous periodic tasks. Asynchronous periodic tasks are characterized by an additional parameter O_i, the release time of the first job of τ_i. Each task τ_i ≝ (v_i, C_i, T_i, O_i) generates an infinite number of jobs, the first of which is released at time O_i. Once again, the technique can be used for such asynchronous systems: we define the pattern for the synchronous job scenario, then apply the deadline partitioning method and stretch that pattern accordingly.
Early completion. We assumed in this work that C_i is the exact duration of task τ_i. From an application perspective this is unrealistic: at design time we determine the worst-case execution time of each task (C_i is the WCET), and at run-time the actual duration of any job of τ_i can be smaller than C_i. Again, DP-Fair techniques extend to that case: a task might not use all the capacity reserved for it, but because of the scheduling anomalies reported in Section 3.3 we have to respect the stretched pattern. In other words, it is forbidden to schedule another task earlier.
Sporadic tasks. Sporadic tasks are quite similar to periodic tasks, the only difference being that the period of a sporadic task denotes the minimum inter-arrival time instead of the exact one. While Funk et al. show that handling arrivals within a time slice is fairly straightforward for sequential tasks (see [20], Section 6), we consider that the extension to parallel gang tasks is not direct, and it is left as future work.

Numerical experiments
Intensive numerical experiments have been performed using Matlab. The LP solver used is an interior point method (i.e., the linprog solver included in Matlab). Source code for all algorithms and the experimental results are available at the project page (see footnote 4), including a wiki page detailing the file organization. We next detail the task set synthesis and the numerical results.

Task set synthesis
Input parameters for the task set synthesis are the number m of processors in the platform, the total utilization U, and the number of tasks n. Stafford's algorithm is used for generating the utilization factors u_i of gang tasks τ_1, ..., τ_n to meet a total utilization of m × U. As shown in [17], the method is suitable for task set synthesis for multiprocessor systems. The utilization factors of tasks are drawn by Stafford's algorithm from the interval [0.02, m]. The number of processors used by every gang task is generated using uniformly distributed pseudo-random integers in the interval [1, m). We do not allow a task to use m processors simultaneously, since such a situation is not interesting from a scheduling perspective. Precisely, such tasks can be removed from the optimization problem in order to compute an optimal pattern, and added afterwards to the previously computed optimal pattern.
In DP-Fair scheduling, individual task periods and execution requirements are not needed, since between two subsequent deadlines d_j and d_{j+1} the execution requirement to be scheduled is exactly (d_{j+1} − d_j) × u_i for every task τ_i, 1 ≤ i ≤ n. Furthermore, the presented algorithms build the pattern that will be stretched in every block of the DP-Fair schedule. The length of the interval is set to one hundred to avoid small decimal numbers that can lead to numerical problems when using an LP solver.
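The per-block requirement and the stretching step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the pattern is assumed to be a list of (task indices, duration) slices, and a unit-length interval is used for readability (the experiments described above use an interval of length one hundred).

```python
from fractions import Fraction

def block_requirements(deadlines, utilizations):
    """For each pair of subsequent deadlines (d_j, d_{j+1}), return the block
    bounds and the requirement (d_{j+1} - d_j) * u_i of every task."""
    blocks = []
    for d_j, d_next in zip(deadlines, deadlines[1:]):
        length = d_next - d_j
        blocks.append((d_j, d_next, [length * u for u in utilizations]))
    return blocks

def stretch_pattern(pattern, block_length):
    """Scale a pattern computed for a unit-length interval to a block.

    pattern: list of (task_indices, duration) slices (assumed representation).
    """
    return [(tasks, duration * block_length) for tasks, duration in pattern]

deadlines = [0, 4, 6, 12]
utilizations = [Fraction(1, 2), Fraction(1, 4)]
# Unit-length pattern: tasks 0 and 1 run together for 1/4, then task 0 alone for 1/4.
unit_pattern = [((0, 1), Fraction(1, 4)), ((0,), Fraction(1, 4))]

blocks = block_requirements(deadlines, utilizations)
stretched = stretch_pattern(unit_pattern, 4)   # first block has length 4
print(blocks[0], stretched)
```

In the first block, task 0 receives 2 time units and task 1 receives 1 time unit after stretching, which matches the requirements (d_{j+1} − d_j) × u_i computed for that block.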

LP-based method evaluation
As previously mentioned, the optimal LP-based algorithm must handle two different combinatorial problems: the feasible subsets construction stage and the optimization stage.From the computational time point of view, the optimization stage is quite fast in comparison with the construction of all feasible subsets (i.e., setting up the matrix of constraints).
Figure 4 presents the computation times of Brute Force vs. Depth First Search with Lexicographic order (DFSLex) for this problem construction for m = 16. In the following plots, every point corresponds to 1000 runs (i.e., simulation with a replication factor equal to 1000). As commonly observed, Brute Force remains manageable up to n = 20 but cannot be used beyond that, whereas the DFSLex algorithm runs quite efficiently. The drawback of the DFSLex algorithm is that it is limited to 64 gang tasks due to the binary encoding of feasible subsets as 64-bit integers. The DFSLex results for larger task sets are depicted in Figure 5. We also implemented a similar version of that algorithm that relaxes the 64-bit constraint by using variable-length integer arithmetic routines, but it runs quite slowly in our Matlab implementation (e.g., it is as slow as the Brute Force algorithm for small task sets). All these implementations are available on the project page.
Figure 6 depicts the resolution times of the optimal algorithm (with both stages) for several numbers of processors and for global utilization equal to 50% and 90%. As depicted, the utilization factor has a moderate influence on average resolution times. Problems become harder to solve first when the number of gang tasks increases and second when the number of processors increases, but they require only a few seconds on average. In both cases, the construction of the feasible subsets requires a significant amount of computation as the problem size increases.
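The bitmask encoding and the depth-first enumeration can be sketched as follows. This is a hedged illustration, not the Matlab code from the project page: it assumes that a subset of tasks is feasible when the processor demands v_i of its members sum to at most m, and the function name is hypothetical. Python integers are unbounded, so the 64-bit limit mentioned above does not arise here.

```python
def feasible_subsets(v, m):
    """Enumerate non-empty feasible subsets by depth-first search.

    v[i] is the number of processors used by task i. A subset is assumed
    feasible when its demands sum to at most m. Each subset is encoded as
    an integer bitmask (bit i set means task i belongs to the subset), and
    subsets are produced in lexicographic order of their task indices.
    """
    n = len(v)
    result = []

    def dfs(start, mask, used):
        for i in range(start, n):
            if used + v[i] <= m:          # prune: adding task i must fit on m processors
                child = mask | (1 << i)
                result.append(child)
                dfs(i + 1, child, used + v[i])

    dfs(0, 0, 0)
    return result

subsets = feasible_subsets([2, 1, 3], 3)
print([bin(s) for s in subsets])
```

For v = [2, 1, 3] and m = 3 the enumeration yields the subsets {0}, {0, 1}, {1} and {2}. The pruning test is what distinguishes this search from a brute-force scan of all 2^n bitmasks: a branch is abandoned as soon as its demand exceeds m.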

Acceptance ratio
We compare the optimal algorithm (OPT) and the heuristic (gang-h) for computing a schedule pattern according to the average acceptance ratio for a platform with 16 processors and varying utilization factors. The schedulability tests used are Theorem 7 for the optimal algorithm and its sufficient version for the heuristic (i.e., if gang-h generates a schedule pattern of length not larger than 1, then the task set is schedulable). Figure 7 depicts the results for 20 and 40 tasks, respectively. The replication factor during the simulation is set to 10000 (i.e., every point in the graphs is the average of 10000 results).
For the optimal algorithm, when the number of tasks increases in the experiment for m = 16 processors, then every task has relatively smaller individual utilization due to the task set synthesis method.As a consequence, there are more feasible subsets and consequently more feasible schedules.As depicted in Figure 7, the acceptance ratio for the optimal algorithm doubles for U = 0.95 when the number of tasks doubles.Such a benefit is not observed for the gang heuristic that achieves quite poor results when the total utilization becomes high.

Average error
We also compare the optimal and heuristic algorithms according to the average error on the length of the schedule pattern. Let OPT be the length of the pattern computed by the optimal algorithm and UB be the corresponding upper bound computed by the gang heuristic. The error is defined by err = (UB − OPT)/OPT. By Theorem 11, we have OPT ≤ UB ≤ (2 − 1/m) OPT. Hence, the error is bounded by err ≤ (1 − 1/m). We compare the algorithms for several numbers of tasks and processors; the results are depicted in Figure 8. As for the acceptance ratio, simulations for computing the average error have been replicated 10000 times. In these graphs, the y-axis is delimited by the worst-case error of (1 − 1/m). First, the average error is not sensitive to the utilization factor but only to the number of tasks. Precisely, the average error compares the schedule lengths of the patterns computed by OPT and gang-h, and varying the utilization factors of synthetic task sets yields quite similar pattern shapes that lead to quite similar average errors. Second, when the number of tasks increases, the average error also increases but remains under half of the worst-case error.
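The bound on the error follows in one step from Theorem 11. Written out as a short derivation that uses only the quantities defined above:

```latex
\begin{align*}
\mathrm{err}
  = \frac{\mathrm{UB} - \mathrm{OPT}}{\mathrm{OPT}}
  \le \frac{\left(2 - \tfrac{1}{m}\right)\mathrm{OPT} - \mathrm{OPT}}{\mathrm{OPT}}
  = 1 - \frac{1}{m}.
\end{align*}
```

For m = 16 this worst-case error equals 15/16 = 0.9375.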

Conclusion
In this paper we considered the preemptive scheduling of implicit-deadline periodic gang task systems upon identical multiprocessors. We proposed two algorithms that define static patterns which are stretched at run-time in a DP-Fair way. The first one is optimal and runs in polynomial time for a fixed number of processors; the second one is a sub-optimal fixed-priority rule, but it is competitive under resource augmentation analysis. Precisely, the speedup factor of the heuristic is bounded by (2 − 1/m). Our numerical experiments show that the optimal pattern can be computed efficiently for up to 60 tasks and ensures a high acceptance ratio when the number of tasks is not too small. For larger systems (m ≫ 16 or n > 64), computing an optimal pattern becomes a hard combinatorial problem. In these cases, as for most hard combinatorial problems, we think that heuristics (e.g., gang-h) must be used rather than an optimal algorithm. Concerning the proposed gang heuristic, the experiments show that the acceptance ratio decreases quasi-linearly with the platform utilization factor, but the average error with respect to the optimal pattern length is less than 40%.
Future work. Future work will concern the definition of a pattern schedule that aims to reduce the number of preemptions. This latter problem seems hard to cope with, but it remains significant for enabling practical applications of real-time scheduling methods. As discussed in Section 5.3, the case of sporadic tasks is also left as future work.

Figure 1 Non-predictability of gang FJP schedulers. Job 1 has the highest priority, job 2 the middle one, and job 3 the lowest; they all arrive at time 0.

Figure 2 A solution for Example 6.

Figure 4 Brute Force vs. Depth First Search enumeration of feasible allocations in the linear program OPT.

Figure 5 Depth First Search enumeration of feasible allocations for larger task sets.

Figure 6 Resolution times of the LP-based optimal algorithm.

Figure 7 Acceptance ratio of the LP-based optimal algorithm and gang heuristic. (a) Acceptance ratio for n = 20. (b) Acceptance ratio for n = 40.

Figure 8 Average error on the schedule pattern length for the LP-based optimal algorithm and gang heuristic. Panel shown: average error for m = 16.