Many parallel applications consist of multiple computational components.
While the execution of some of these components or tasks depends on the
completion of other tasks, others can be executed at the same time, which
increases parallelism of the problem. The task scheduling problem is
the problem of assigning the tasks in the system in a manner that will
optimize the overall performance of the application, while assuring
the correctness of the result.
The task scheduling problem can be modeled as a weighted directed acyclic
graph (DAG). A vertex represents a task, and its weight the size
of the task computation. An arc represents the communication among two
tasks, and its weight represents the communication cost. The directed edge shows
the dependency between two tasks.
Example Computation DAG
The primary goal of task scheduling is to schedule tasks on
processors and minimize the makespan of the schedule, i.e.,
the completion time of the last task relative to the start time
of the first task. The output of the problem is an assignment
of tasks to processors.
Unfortunately, standard task scheduling is NP-complete,
even for the unit task size and unit communication cost.
However, various heuristic methods have been proposed that obtain
suboptimal solutions (schedules) in polynomial time.
Wave Front Method (WFM)
The wave fronts of the graph are determined according
to the level of the vertices in a breadth-first-search
traversal of the DAG. The vertices in each wave front
are independent from each other, and are all assigned
to different processors. Following is
an example of applying Wave Front Method.
Critical Path Merge (CPM)
A critical path in a DAG is a maximum weight root to
leaf path (the path weight is the summation of all vertex
and edge weights on the path). CPM computes the critical
path, clusters all tasks in it, assigns them to the same
processor, and removes them from the graph. This process
is iterated until all tasks are scheduled.
Heavy Edge Merge (HEM)
Heavy Edge Merge works by iteratively clustering vertices
(tasks) along edges with non-increasing weights. During an
initialization stage, the edges are sorted in non-increasing
order by edge weight, one task is assigned to each (virtual)
processor, and the makespan of this assignment is computed.
Then, all edges are processed in sorted order. For each edge,
the makespan resulting from merging the tasks associated with
the endpoints (perhaps clusters themselves) is computed.
If the makespan increases, then the merge is not performed.
Dominant Sequence Clustering (DSC)
DSC works by iteratively identifying, and scheduling, so-called
dominant sequence tasks which are defined as follows. An unscheduled
task is called
free if all of its predecessors are already
scheduled. A
dominant sequence task is the highest priority
free task. The priority of a task is defined as
,
where
(
)
is the length of the longest path from an in-degree 0 (out-degree 0) task to
.
DSC initializes
for every free task and inserts them into a free task list. Next,
is computed for each task. The free task
with the highest priority is processed first. If
cannot be reduced by merging it with a predecessor's cluster, then
will be assigned to a new processor. Next, DSC updates the priorities of
's
successors, and inserts any newly free successors into the free list. The
process is repeated until all tasks are scheduled.
Standard Template Adaptive Parallel Library (STAPL) is a parallel version
of Standard Template Library (STL) for C++.
STAPL components
Task
Scheduling is important for the performance of applying STAPL in various
applications. STAPL Scheduler is responsible of mapping tasks onto
multiple processors to gain the maximum performance out of the system.
The STAPL scheduler should also provide an interface for the user to have
control of the scheduling process. The scheduling strategy can be internally
selected, or the user can select a specific way, like wave front execution,
or ultimately provide their own scheduling strategies for their best interests.
The scheduling results are passed to the STAPL
Executor. The Executor is responsible for the real dispatching and
execution of the problem.
One of the applications
of STAPL is the
ASCI
project. Due to the specific requirement of physical computation, we
develop a scheduling package,
ASCI Scheduler, which uses the STAPL scheduler in its construction and is
an example of a user-defined scheduler.
The following pages present a short
introduction
to task scheduling problems, and some
algorithms used in STAPL Scheduler.
Scheduling Sweeps for Particle Transport Calculations
In this work, we consider
the relationship between task scheduling and deterministic mesh sweeps
that arise in particle-transport computations. In particular, we argue
that the efficient parallelization of such computations is most accurately
viewed as a generalization of the traditional task scheduling problem and
not as an application for domain decomposition, as has been generally assumed
in the past. In fact, as we show, the transport problem represents an interesting
composite task-scheduling problem: given one set of tasks, and multiple
dependence graphs for these tasks (one for each distinguishable sweep direction),
find an assignment of tasks to processors that minimizes the time required
to process all such graphs. Within this context, we study and propose scheduling
algorithms that are suitable for the particular generalization of the scheduling
problem that arises in the context of transport sweeps.
This project is an application of
STAPL.
Based on STAPL Scheduler,
we developed ASCI Scheduler for this particular problem.
ASCI Scheduler
ASCI scheduler is developed for the ASCI
project to optimize the Particle transport Calculations on parallel
systems, specificly for the scheduling of multiple dependence computation
graphs emerged from the ASCI project. The problem is classified into two
categories:
For general scheduling to apply to both orthogonal and unstructured grids,
we define the following functionalities of the scheduler.
Scheduling
In the Scheduling phase, the
unique schedule to be used for all the dependence graphs is computed. The
goal of the scheduling is to find the optimal mappings of tasks onto processors,
such that the total computation time of the whole sweep can be
minimized.
For orthogonal grids, we have
KBA, Hybrid, and Volumetric methods. There are also options for the users
to define their own processor mappings. The theoretical analysis of these
algorithms can be found
here.
For unstructured grids, we use the heuristics implemented in
STAPL
Scheduler, namely wave front method, critical path method, heavy edge
merge, and dominant sequence clustering, to compute the best schedule for
the multiple dependence graphs.
Due to the existence of various
machine topologies in supercomputing field, we adopt a hierarchical scheduling
policy, such that each processor partition can be scheduled iteratively
when necessary. In the following example, processor p1 can be viewed as
a virtual processor which represents two physical processors p10 and p11.
The hierarchical scheduling provides flexibility in mapping to different
machine topologies.
Cellset Aggregation
To further optimize the
computation, the scheduled computation graph is further clustered
into a graph on cellsets. Each node in the graph contains a set of cells,
which composes a computation block before each communication. The
purpose of aggregation is to properly overlap the computation and communication,
and balance the communication overhead with the cost of data transferring.
The optimal cellset size is machine dependent, and can be computed based
on machine parameters. There are theoretical analysis results for orthogonal
cases for the optimal cellset size (
description here).
To get the optimal values for unstructured
grids, we can use simulation results.
Graph Transformation
Based on the sweep angles,
the aggregated cellset graph is transformed into different directed cellset
graph in this phase. The transformation is based on the angle between the
face normal of a pair of cellsets and the sweep angle. If the angle is
within 90 degree, there is a directed edge between the cellsets, otherwise,
there is no dependency between them.
Cell Ordering
Cell Ordering is to find
out the execution order of the cells in a cellset for different sweep angles.
The basic ordering uses a topological sort on the directed graph between cells.
Here is an implementation flow chart to show the major steps in the ASCI Scheduler implementation.
Load Balancing Techniques for Scalable Parallelization of Sampling-Based Motion Planning Algorithms, Adam Fidel, Sam Ade Jacobs, Shishir Sharma, Lawrence Rauchwerger, Nancy M. Amato, Technical Report, TR13-002 , Parasol Laboratory, Department of Computer Science, Texas A&M University, Mar 2013.
Technical Report(pdf, abstract)
Task Scheduling and Parallel Mesh-Sweeps in Transport Computations, Nancy M. Amato, Ping An, Technical Report, TR00-009, Department of Computer Science and Engineering, Texas A&M University, Jan 2000.
Technical Report(ps, pdf)