CSc-387 Parallel Processing

CHAPTER 4: PARTITIONING AND DIVIDE-AND-CONQUER STRATEGIES

- data/task partitioning, load balancing, domain decomposition, etc.

- Adding n numbers using m processes (slaves)        (* T Figure 4.1 *)

*** MASTER ***
--------------
   s = n/m;                             /* each slave gets this many numbers */
   for (i=0, x=0; i < m; i++, x=x+s)
      send(&numbers[x], s, Pi);         /* send s numbers to slave Pi */
   sum = 0;
   for (i=0; i < m; i++) {
      recv(&part_sum, Pany);            /* wait for results */
      sum = sum + part_sum;
   }

*** SLAVE ***
--------------
   recv(numbers, s, Pmaster);
   part_sum = 0;
   for (i=0; i < s; i++)
      part_sum = part_sum + numbers[i];
   send(&part_sum, Pmaster);

Another way:
------------

*** MASTER ***
--------------
   s = n/m;                                      /* each slave gets this many numbers */
   scatter(numbers, &s, Pgroup, root_master);    /* send numbers to slaves */
   reduce_add(&sum, &s, Pgroup, root_master);

*** SLAVE ***
--------------
   scatter(&numbers, &s, Pgroup, root_master);   /* receive s numbers */
   .... compute part_sum .....
   reduce_add(&part_sum, &s, Pgroup, root_master);

TIME ANALYSIS (with scatter + reduce):
--------------------------------------
   Tseq = O(n)

   Tpar = Tcomm1  + Tcomp1   + Tcomm2
          ^^^^^^^   ^^^^^^^^   ^^^^^^^
          scatter   part_sum   reduce

   Tpar = (Tstart*logm + n*Tdata) + (n/m) + (Tstart*logm)
   Tpar = O(logm + n + n/m) = O(n + logm)

   (* Note that there are errors in the analysis given in the textbook *)

   If we assume that the numbers are already in the processors and ignore
   the time for data distribution (O(n)), then

   Speedup = Tseq / Tpar = n/(n/m + logm)

   (If m = n, speedup = n/(1 + logn), which is approximately n/logn)

----------------------------------------
4.1.2. DIVIDE-AND-CONQUER
----------------------------------------
   (* T Figure 4.2 *)

   Parallel Implementation:   (* T Figure 4.3 *)

- This kind of task distribution maps onto a hypercube perfectly, because
  each processor communicates only with processors whose addresses differ
  from its own in exactly one bit.
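The partitioned sum and the logarithmic combine step can be sketched in plain C. This is only a single-process simulation under stated assumptions, not the textbook's message-passing code: the m "processes" are loop iterations, and the names partitioned_sum, MAX_PROCS, and rounds_out are hypothetical, chosen for this illustration. It assumes n is a multiple of m, as the notes do with s = n/m.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCS 64   /* assumed upper bound on m, for this sketch only */

/* Sum n numbers the way m cooperating processes would.
 * Returns the total; *rounds_out (if non-NULL) receives the number of
 * combine rounds used by the tree reduction. */
long partitioned_sum(const int *numbers, size_t n, size_t m, int *rounds_out)
{
    long part_sum[MAX_PROCS];

    assert(m >= 1 && m <= MAX_PROCS);
    assert(n % m == 0);              /* assumption: s = n/m divides evenly */
    size_t s = n / m;

    /* "Scatter" plus local work: process p sums its own block of s numbers. */
    for (size_t p = 0; p < m; p++) {
        part_sum[p] = 0;
        for (size_t i = 0; i < s; i++)
            part_sum[p] += numbers[p * s + i];
    }

    /* Tree reduction: in the round with stride d, process p (a multiple of
     * 2d) adds the partial sum held by process p + d. Since d is a power of
     * two and bit d of p is zero, the two addresses differ in exactly one
     * bit, which is the hypercube communication pattern. */
    int rounds = 0;
    for (size_t d = 1; d < m; d *= 2) {
        for (size_t p = 0; p + d < m; p += 2 * d)
            part_sum[p] += part_sum[p + d];
        rounds++;
    }

    if (rounds_out != NULL)
        *rounds_out = rounds;
    return part_sum[0];
}
```

With n = 16 and m = 4, each simulated process adds 4 numbers and the partial sums are combined in 2 rounds, matching the n/m + logm terms of the time analysis when data distribution is ignored.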
======================
Example: BUCKET SORT
======================

Assumption: Elements to be sorted are distributed uniformly over an
            interval [a,b)

   The interval [a,b) is divided into m equal-sized buckets
   (each bucket is expected to receive about n/m elements)

---------------------
Sequential Algorithm:   (* T Figure 4.8 *)
---------------------
   1. Each element is put into the appropriate bucket      ===> O(n)

   2. Sort the elements in each bucket:
         m * ((n/m) * log(n/m)) = n*log(n/m)

   If m = Theta(n), then TIME = O(n)

====================
PARALLEL BUCKET SORT
====================
   Select the number of buckets m = p (the number of PEs).

 - An inefficient way of doing it is shown in:   (* T Figure 4.9 *)

An Efficient Parallel Solution:   (* T Figure 4.10 *)
--------------------------------
   1. Each PE partitions its n/p elements into p = m subblocks
                                                            =====> O(n/p)

   2. Using the all-to-all personalized broadcast algorithm, each PE sends
      its subblocks to the appropriate PEs.
      Theoretically, all-to-all scatter can be done in O((n/p)log(p)) time.

   3. Each PE sorts its own bucket using the sequential bucket sort
      algorithm. If the (n/p) elements can be put in (n/p) buckets, then
      sorting can be done in                                ====> O(n/p) time.
      (Otherwise, it is O((n/p)*log(n/p)).)

   Total Time = Tp = Step1 + Step2 + Step3
                   = n/p + (n/p)logp + n/p
                   = O((n/p)logp)

*** Note that there are disagreements between the times here and in the
    textbook related to the time complexities for AlltoAll, scatter,
    gather, etc. In case of such conflicts, you are expected to use the
    ones in these notes.

----------------------
4.2.3. N-Body Problem
----------------------
 - To determine the effects of forces between N "bodies" (e.g.
   gravitational forces, molecular dynamics, fluid dynamics, etc.)

Gravitational N-Body Problem
-----------------------------
   The force between two masses Ma and Mb a distance r apart:

      F = G*Ma*Mb / r^2

 - A body will accelerate according to Newton's second law:  F = ma

 - After a time step dt, all the bodies will move to new positions due to
   the gravitational forces and will have new forces and velocities,
   determined by:

      F(t) = ma = m*(v(t+1) - v(t))/dt
         ======>  v(t+1) = v(t) + F(t)*dt/m

   The body's position changes by:

      x(t+1) = x(t) + v(t+1)*dt

   The new force applied on a mass, Ma, can be computed by summing all the
   gravitational forces applied on Ma by the other masses:

      F(t+1) = G*Ma * SUM(over all other masses Mx) Mx / r^2

   where r is the distance between Ma and Mx at time (t+1), computed from
   the new positions of Ma and Mx.

 - All computations can be performed in a 3-dimensional space. In this
   case, every quantity has 3 vector components to be computed:

      Position(x, y, z)   Force(Fx, Fy, Fz)   Velocity(Vx, Vy, Vz)

----------------
Sequential Code
----------------
   for (t=0; t < Tmax; t++) {              /* for each time step */
      for (i=0; i < N; i++) {              /* for each body */
         F = Force_routine(i);             /* compute force on i-th body */
         v_new[i] = v[i] + F*dt/M[i];      /* new velocity */
         x_new[i] = x[i] + v_new[i]*dt;    /* new position */
      }
      for (i=0; i < N; i++) {              /* for each body */
         x[i] = x_new[i];                  /* update position and velocity */
         v[i] = v_new[i];                  /* for the next iteration */
      }
   }

 - Sequential time complexity: O(N^2) for one time step, so
      Tseq = O(Tmax*N^2) for all Tmax time steps.

 - However, the time complexity can be reduced using the observation that
   a cluster of distant bodies can be approximated as a single body:
   (* T Figure 4.18 *)

 - PARALLEL CODE: use N processes and assign one process to each mass.
   For each time step, each process computes the force, velocity, and
   position for the mass that is assigned to it (O(N)); the Master then
   does a GATHER (O(N)) followed by a Broadcast of the arrays (O(NlogN)).
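One time step of the direct O(N^2) method above can be sketched in C. To keep the example short it works in one dimension, so each body has a scalar position and velocity; in 3 dimensions the same update is applied to each vector component. The names nbody_step, force_on, and MAX_BODIES are hypothetical, and the sketch assumes no two bodies occupy the same position.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BODIES 1024   /* assumed limit, for this sketch only */

/* Net gravitational force on body i from all other bodies (1-D).
 * The magnitude is G*m[i]*m[j]/r^2 and the force points toward body j. */
static double force_on(size_t i, const double *x, const double *m,
                       size_t n, double G)
{
    double f = 0.0;
    for (size_t j = 0; j < n; j++) {
        if (j == i)
            continue;
        double dx = x[j] - x[i];
        assert(dx != 0.0);                 /* assumption: no collisions */
        double mag = G * m[i] * m[j] / (dx * dx);
        f += (dx > 0.0) ? mag : -mag;
    }
    return f;
}

/* Advance all n bodies by one time step dt.
 * New values go into temporary arrays first, so that every force is
 * computed from the positions at the start of the step. */
void nbody_step(double *x, double *v, const double *m,
                size_t n, double G, double dt)
{
    double x_new[MAX_BODIES], v_new[MAX_BODIES];

    assert(n <= MAX_BODIES);
    for (size_t i = 0; i < n; i++) {
        double f = force_on(i, x, m, n, G);
        v_new[i] = v[i] + f * dt / m[i];   /* v(t+1) = v(t) + F*dt/m    */
        x_new[i] = x[i] + v_new[i] * dt;   /* x(t+1) = x(t) + v(t+1)*dt */
    }
    for (size_t i = 0; i < n; i++) {       /* commit for the next step */
        x[i] = x_new[i];
        v[i] = v_new[i];
    }
}
```

For example, two unit masses at x = 0 and x = 1, initially at rest, with G = 1 and dt = 0.5, move to x = 0.25 and x = 0.75 with velocities 0.5 and -0.5. The two-pass structure is what makes the one-process-per-body version possible: within a time step, each body's update reads only the old arrays.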
--------------------
BARNES-HUT ALGORITHM
--------------------
 - Octtree representation: the 3-dimensional object space is recursively
   divided into 8 subcubes until each subcube contains at most one object.

 - Recursive division of a 2-dimensional space (quadtree) is shown in:
   (* T Figure 4.19 *)

 - In general, the tree will be very unbalanced.

 - The total mass and the center of mass of the subcube are stored at
   each node.

 - The force on each body can be obtained by traversing the tree starting
   at the root, stopping at a node when the clustering approximation can
   be used for that particular body, and otherwise continuing to traverse
   the tree downward.

 - CLUSTERING APPROXIMATION can be used when:

      r >= d/A

   where r is the distance from the body to the center of mass of the
   subcube, d x d x d are the dimensions of the subcube, and A is a
   constant less than 1.0.

 - The clustering approximation can significantly reduce the total
   computational effort.

 - However, once all the bodies have been given new velocities and
   positions at the end of a time step, the entire octtree (quadtree) must
   be reconstructed for the new time step. Constructing the tree takes
   O(NlogN) time and so does computing all the forces. Therefore, one time
   step takes:

      Tseq = O(NlogN)

SEQUENTIAL ALGORITHM:
----------------------
   for (t=0; t < Tmax; t++) {   /* for each time period */
      Build_Octtree();          /* construct Octtree (or quadtree) */
      Tot_Mass_Center();        /* compute total mass & center of mass */
      Comp_Force();             /* traverse tree and compute forces for each object */
      Update();                 /* update position/velocity for each object */
   }

   Tot_Mass_Center() must traverse the tree, computing the total mass and
   the center of mass at each node. This can be done RECURSIVELY.

 - The total mass, M, at a node is simply the sum of the total masses of
   its children:

      M = SUM(i=0..7) Mi

   where Mi is the total mass of the i-th child.

 - The center of mass, C, is given by:

      C = (1/M) * SUM(i=0..7) Mi*Ci

   where Ci is the center of mass of the i-th child.

PARALLEL APPROACH: will not improve the sequential time complexity unless
   the Build_Octtree() routine is parallelized, because building the
   Octtree takes O(NlogN) time and has to be done at every time step.
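The recursive computation of M and C performed by Tot_Mass_Center() can be sketched in C as follows. The Node layout is an assumption made for this illustration (the notes do not specify one): a leaf holds a single body's mass and position, an empty subcube is a NULL child pointer, and the function name tot_mass_center is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed octtree node layout, for illustration only. */
typedef struct Node {
    struct Node *child[8];   /* NULL for an empty subcube */
    int    is_leaf;          /* nonzero if the node holds exactly one body */
    double mass;             /* body mass (leaf) or total mass (internal) */
    double c[3];             /* body position (leaf) or center of mass */
} Node;

/* Fill in mass and center of mass for every internal node below `node`
 * and return the total mass of the subtree. */
double tot_mass_center(Node *node)
{
    if (node == NULL)
        return 0.0;                  /* empty subcube contributes nothing */
    if (node->is_leaf)
        return node->mass;           /* a single body: already known */

    double M = 0.0;
    double weighted[3] = {0.0, 0.0, 0.0};

    for (int i = 0; i < 8; i++) {
        Node *ch = node->child[i];
        if (ch == NULL)
            continue;
        double Mi = tot_mass_center(ch);     /* recurse first, so that */
        M += Mi;                             /* ch->c is up to date     */
        for (int k = 0; k < 3; k++)
            weighted[k] += Mi * ch->c[k];    /* accumulate Mi*Ci */
    }

    assert(M > 0.0);   /* assumption: internal nodes contain some body */
    node->mass = M;
    for (int k = 0; k < 3; k++)
        node->c[k] = weighted[k] / M;        /* C = (1/M) * SUM Mi*Ci */
    return M;
}
```

For a root with two leaf children of mass 1 at (0, 0, 0) and mass 3 at (4, 8, 0), the call stores M = 4 and C = (3, 6, 0) in the root. Each node is visited once, so this pass is linear in the number of tree nodes and is not the bottleneck compared with building the tree.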