CSc-387 Parallel Processing

CHAPTER 7: LOAD BALANCING AND TERMINATION DETECTION

-------------------
7.1. LOAD BALANCING
-------------------

Related terms: mapping problem, scheduling, task partitioning, task
assignment, etc.

- static load balancing: before the execution starts (at compile time)
- dynamic load balancing: during the execution (on the fly)

(* T Figure 7.1 *)
(* T Figure 7.2 *)

Static load balancing techniques (combinatorial optimization):

- Round-robin scheduling: tasks are assigned to processes in a
  round-robin fashion
- Randomized algorithms: tasks are assigned to processes randomly
- Recursive bisection: uses a graph representation of the tasks and
  recursively divides it into subproblems of equal computational effort
  while minimizing message passing (min-cut approach; Kernighan-Lin
  algorithm)
- Simulated annealing
- Genetic algorithms

- Different mappings may be needed for different networks.

Q: How many different ways can we assign N tasks to P processors?
A: P^N (an exponential search space, so exhaustive search is
   impractical; finding the optimum mapping is NP-complete in general)

Q: How many different ways can we assign N tasks to P processors in a
   load-balanced manner (assume that tasks have identical loads)?

Dynamic Load Balancing
----------------------

Centralized: work pool, processor farm, replicated worker ...
------------

(* T Figure 7.2 *)

- It is best to hand out the larger or most complex tasks first. Why?
- In centralized load balancing, it is easy for the master process to
  recognize when to terminate. Two conditions must be satisfied:
    - the task queue is empty
    - every process has made a request for another task without any
      new tasks being generated

Decentralized Dynamic Load Balancing:
-------------------------------------

In the centralized work pool approach, the master process is the
bottleneck: tasks can only be given out one at a time. A simple
variation on this approach is a Distributed Work Pool:

(* T Figure 7.3 *)

Create "mini-masters" and let them manage the tasks under them.
The tree can be constructed as deep as needed.

Fully Distributed Work Pool:
----------------------------

(* T Figure 7.4 *)

Tasks can be transferred by two methods:
- receiver-initiated method
- sender-initiated method

In the receiver-initiated method, one way to select the process to ask
for work is round-robin: Pi requests tasks from process Px, where x is
given by a counter that is incremented after each request, using
modulo n arithmetic (excluding x = i). In the Random Polling Algorithm,
x is chosen randomly.

- When a process receives a request for tasks, it sends a portion of
  its tasks to the requesting process.

Load Balancing Using a Line Structure
--------------------------------------

(* T Figure 7.6 *)

Processors are connected in a line topology. The master process (P0)
feeds the queue with tasks at one end, and the tasks are shifted down
the queue.

- In this approach, it is better to have two processes running on each
  processor: one for left-right communication and another for
  computation:

(* T Figure 7.7 *)

7.3. DISTRIBUTED TERMINATION DETECTION ALGORITHMS
-------------------------------------------------

Two conditions must be satisfied:
- all the local termination conditions must be satisfied
- there must not be any messages in transit between processes

Termination Detection for a Ring:

(* T Figure 7.10 and 7.11 *)

(assume that a process cannot be reactivated after local termination
is reached)

1. When P0 terminates, it passes a token to P1.
2. If Pi receives a token and it is terminated, it passes the token to
   P(i+1); otherwise it holds the token until it terminates.
3. When P0 receives the token back again, it sends a global
   termination signal to everybody.

If the above assumption does not hold (a process can be reactivated
after local termination is reached), the algorithm can be modified as
follows:

(* T Figure 7.12 *)

Assumptions:
  i)  Processes are originally BLACK.
  ii) Pi does not terminate before receiving an acknowledgment from a
      process Pj to which Pi sent a task.

1. When P0 terminates, it becomes WHITE and passes a WHITE token to P1.
2. If Pi receives a (WHITE or BLACK) token and it is terminated without
   having generated tasks for any other process, it becomes a WHITE
   process and passes on the token to P(i+1) in its original color.
   However, if Pi sends a task to Pj where j < i, Pi becomes a BLACK
   process and passes on a BLACK token. A BLACK process colors the
   token BLACK, while a WHITE process passes the token on in its
   original color.
3. When P0 receives a BLACK token, it passes on a WHITE token; if it
   receives a WHITE token back, all the processes are terminated.

QUESTION: Why doesn't Pi become BLACK when it sends a task to Pj where
j > i? Think about a scenario in which Pi sends a task to Pj (j > i)
and the task arrives at Pj later than the token. Then a WHITE token
might circulate all the way around and back to P0, in which case P0
will think that everyone has terminated.

ANSWER (best guess): there is an implicit assumption that messages are
delivered in the order in which they were sent, i.e. if a process
sends messages in the order m1, m2, m3, ... in local time, m1 will be
the first to arrive at its destination, then m2, then m3, etc. In our
scenario, since the task is sent earlier than the token, it arrives at
Pj earlier and Pj will not think that it has terminated. (If Pj
terminated before the token arrives, there is no problem; it can
become a WHITE process.) Assumption ii) points the same way: Pi cannot
terminate, and so cannot forward the token, until Pj has acknowledged
the task.

TREE ALGORITHM FOR TERMINATION:

The communication topology is a binary tree. The tokens flow along the
branches of the tree toward the root. A process Pi passes a
termination token to its parent if:
  i)  Pi is terminated, and
  ii) Pi has received a termination token from each of its children
      (a leaf has none to wait for).
When P0, the master at the root, receives the termination tokens, it
informs everyone.

================
GRAPH ALGORITHMS (* Chapter 7 from Kumar et al. *)
================

-----------------------------------------------
Minimum Spanning Tree (MST) : Prim's Algorithm
-----------------------------------------------

One application: finding the minimum length of cable necessary to
connect a set of computers in a network.

Sequential Algorithm
--------------------

- Prim's algorithm is a GREEDY one. It first selects an arbitrary
  vertex and then grows the MST by repeatedly choosing a new vertex
  and edge that are guaranteed to be in the MST. It stops when all the
  vertices are covered.

(** T Program 7.1 **)

* Note that it is easier to update d[] by changing lines 12-13 to:

      for each neighbor v of u do
          if ( v E (V - Vt) ) then d[v] = min[d[v], w(u,v)]

(** T Figure 7.5 **) (Go over the example)

Parallel Formulation
--------------------

p processors, n = |V| vertices.
Each processor is assigned n/p vertices (Pi gets the set Vi).
The partitioning of the d[] array and the adjacency matrix A is shown
in:

(** T Figure 7.6 **)

Each PE holds n/p columns of A and n/p elements of the d[] array.

PARALLEL MST ALGORITHM:

 1. Initialize: Vt := {r}; d[k] = INFINITY for all k except d[r] = 0;
 2. P0 broadcasts selectedV = r using one-to-all broadcast.
 3. The PE responsible for "selectedV" marks it as belonging to set Vt.
 4. For v = 2 to n=|V| do
 5.    Each Pi updates d[k] = Min[d[k], w(selectedV, k)] for all k E Vi
 6.    Each Pi computes MIN-di = min. d[] value among its unselected
       elements
 7.    PEs perform a "global minimum" operation using the MIN-di values
       and the result is stored in P0. Call the winning vertex
       selectedV.
 8.    P0 broadcasts "selectedV" using one-to-all broadcast.
 9.    The PE responsible for "selectedV" marks it as belonging to set
       Vt.
10. EndFor

TIME COMPLEXITY ANALYSIS:

(Hypercube)  Tp = n*(n/p)     +  n*logp
                  computation    communication

(Mesh)       Tp = n*(n/p)     +  n*Sqrt(p)

Worst-case Tseq = n^2; on the hypercube, the algorithm is cost-optimal
if plogp/n = O(1).
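
The steps of the parallel MST algorithm can be tried out without a
parallel machine. The following is a minimal Python sketch, not the
program from the textbook: it simulates the p processors one after
another in a single process. The names weight_matrix and
parallel_prim_sim, the block partitioning of the vertices, and the
tie-breaking by smaller vertex number are all assumptions made for
illustration. The graph is assumed to be connected.

```python
import math

INF = math.inf  # stands for "no edge" / INFINITY in the notes


def weight_matrix(n, edges):
    """Build a symmetric n x n weight matrix from (u, v, weight) triples."""
    w = [[INF] * n for _ in range(n)]
    for u, v, c in edges:
        w[u][v] = w[v][u] = c
    return w


def parallel_prim_sim(w, r=0, p=2):
    """Sequentially simulate the partitioned Prim formulation.

    Returns (total MST weight, order in which vertices were selected).
    Assumes the graph described by w is connected.
    """
    n = len(w)
    # Block partition: simulated processor i owns the vertex set Vi.
    size = math.ceil(n / p)
    owned = [range(i * size, min((i + 1) * size, n)) for i in range(p)]

    d = [INF] * n
    d[r] = 0
    in_tree = [False] * n          # membership in Vt
    in_tree[r] = True
    selected, order, total = r, [r], 0

    for _ in range(n - 1):         # "For v = 2 to n"
        local_minima = []
        for Vi in owned:           # each Pi works only on its own vertices
            best = (INF, -1)
            for k in Vi:
                if not in_tree[k]:
                    d[k] = min(d[k], w[selected][k])   # step 5
                    best = min(best, (d[k], k))        # step 6
            local_minima.append(best)
        # Step 7: global minimum (a reduction on a real machine).
        weight, selected = min(local_minima)
        in_tree[selected] = True   # steps 8-9
        total += weight
        order.append(selected)
    return total, order
```

For example, with the edges (0,1,1), (0,2,3), (1,2,1), (1,3,4),
(2,3,2), (2,4,5), (3,4,3) on 5 vertices, calling
parallel_prim_sim(weight_matrix(5, edges), r=0, p=2) selects the
vertices in the order 0, 1, 2, 3, 4 with total weight 7. The result
does not depend on p; only the split of the inner loop changes.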
---------------------------------------------------------
Single-Source Shortest Paths (SSSP): Dijkstra's Algorithm
---------------------------------------------------------

This algorithm is almost identical to Prim's MST algorithm. The major
difference is that, at each step of the loop, instead of keeping track
of the d[] values (the cheapest edge from each v E (V-Vt) to any node
in Vt), we keep track of the distances of the vertices in (V-Vt) to
the source node, r. We use the array l[] to store these incrementally
updated distances.

PARALLEL SSSP ALGORITHM:
/* Each PE holds n/p columns of A and n/p elements of the l[] array */

 1. Initialize: Vt := {r}; l[k] = INFINITY for all k except l[r] = 0;
 2. P0 broadcasts selV = r using one-to-all broadcast.
 3. The PE responsible for "selV" marks it as belonging to set Vt.
 4. For v = 2 to n=|V| do
 5.    Each Pi updates l[k] = Min[l[k], l[selV] + w(selV, k)] for all
       k E Vi
 6.    Each Pi computes MIN-Li = min. l[] value among its unselected
       elements
 7.    PEs perform a "global minimum" operation using the MIN-Li values
       and the result is stored in P0. Call the winning vertex selV.
 8.    P0 broadcasts "selV" (together with l[selV], which step 5
       needs) using one-to-all broadcast.
 9.    The PE responsible for "selV" marks it as belonging to set Vt.
10. EndFor

TIME COMPLEXITIES are the same as for the MST algorithm.

------------------------
All-Pairs Shortest Paths
------------------------

Matrix-multiplication Based Algorithm:
--------------------------------------

Adjacency Matrix = A

Standard matrix multiplication:
    C = AxB where c(i,j) = SUM(k=1, k=n) a(i,k)*b(k,j)

Special matrix multiplication (apply the add-min operation):
    C = AxB where c(i,j) = MIN(k=1, k=n) [a(i,k) + b(k,j)]

If we use the add-min operation for the special matrix multiplication,
then A^k = A^(k-1) x A holds the minimum distances between all pairs
obtained by traversing at most k edges (taking a(i,i) = 0 and
a(i,j) = INFINITY when there is no edge from i to j).

(** T Figure 7.7 **)

Instead of computing A, A^2, A^3, ..., A^n one multiplication at a
time, we can compute A, A^2, A^4, A^8, ..., A^k by repeated squaring.
If k >= n-1 (a shortest path has at most n-1 edges), the final matrix
represents the all-pairs shortest paths.
Computing the final matrix sequentially takes about ceil(log2(n-1))
add-min multiplications (squarings), each taking O(n^3) time, for a
total of O(n^3 * logn).
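
The add-min product and the repeated squaring can be sketched in a few
lines of Python. This is a sequential illustration only, with the
function names min_plus and all_pairs_shortest_paths chosen here for
the example. It assumes that a[i][i] = 0, that a missing edge is stored
as infinity, and that the graph has no negative-weight cycles.

```python
import math

INF = math.inf  # stands for "no edge" / INFINITY in the notes


def min_plus(a, b):
    """Special (add-min) product: c(i,j) = MIN over k of a(i,k) + b(k,j)."""
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def all_pairs_shortest_paths(a):
    """Square A under the add-min product until paths of n-1 edges are covered."""
    n = len(a)
    dist = [row[:] for row in a]   # A^1: paths of at most one edge
    k = 1
    while k < n - 1:               # stop once k >= n-1
        dist = min_plus(dist, dist)
        k *= 2                     # dist now holds A^k for the doubled k
    return dist
```

For the 4-vertex directed graph with rows [0, 3, INF, 7],
[8, 0, 2, INF], [5, INF, 0, 1], [2, INF, INF, 0], two squarings are
enough (k goes 1, 2, 4) and the result has rows [0, 3, 5, 6],
[5, 0, 2, 3], [3, 6, 0, 1], [2, 5, 7, 0].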