CSc-387 Parallel Processing

CHAPTER 6: SYNCHRONOUS COMPUTATIONS

- Barrier for synchronization:

    (* T Figure 6.1 *)
    (* T Figure 6.2 *)

  Efficient implementation of a barrier is heavily influenced by the
  underlying architecture, and its time complexity depends on the
  implementation.

- Counter Implementation:

    (* T Figure 6.3 *)

  Initialize a counter to zero. Each process calling the barrier
  increments the counter and checks whether it has reached n.
  If not, the process goes to sleep. Otherwise, it wakes up all the
  other processes.

  Barrier implementation in a message-passing system is shown in:

    (* T Figure 6.4 *)

  Counter implementation takes O(n) steps.

- Tree Implementation

    (* T Figure 6.5 *)

- Butterfly Implementation

    (* T Figure 6.6 *)

  Both Tree and Butterfly implementations take O(log n) steps on a
  hypercube-type network with dedicated communication links.

LOCAL SYNCHRONIZATION
---------------------
For some problems, synchronization can be achieved through local
interactions between neighboring processors:

    Process P(i-1)      Process P(i)        Process P(i+1)
    ----------------    ----------------    -----------------
    recv(P(i))          send(P(i-1))        recv(P(i))
    send(P(i))          send(P(i+1))        send(P(i))
                        recv(P(i-1))
                        recv(P(i+1))

Deadlock Example:
------------------
    Process P(i-1)      Process P(i)
    ----------------    ----------------
    recv(P(i))          recv(P(i-1))
    send(P(i))          send(P(i-1))

(* with blocking recv, both processes wait in recv forever,
   so neither send is ever reached *)

------------------------------------------------
6.2
------------------------------------------------

Prefix Sum
-----------
sequential code:

    for (j = 0; (1 << j) < n; j++)            /* (1 << j) is 2^j */
        for (i = n - 1; i >= (1 << j); i--)   /* descending, so x[i - 2^j] still holds
                                                 its value from the previous pass */
            x[i] = x[i] + x[i - (1 << j)];

Result:

    x0, x0+x1, x0+x1+x2, x0+x1+x2+x3, ...,

    (* T Figure 6.8 *)

Time Complexity = O(n log n)   (* but can be done in O(n) easily *)

Parallel Implementation on a Hypercube
--------------------------------------
Given:

    PE    0     1     2     3     4     5     6     7
          000   001   010   011   100   101   110   111
          ---------------------------------------------
          x0    x1    x2    x3    x4    x5    x6    x7

We would like to obtain:

    000   001     010       011       100       101       110       111
    ---------------------------------------------------------------------
    x0    x0+x1   x0+..+x2  x0+..+x3  x0+..+x4  x0+..+x5  x0+..+x6  x0+..+x7

Here is how you do it:
----------------------
1) Initialize SUM = xi
2) Shift SUM by 1 (without wraparound) and add to previous SUM
3) Shift SUM by 2 (without wraparound) and add to previous SUM
4) Shift SUM by 4 (without wraparound) and add to previous SUM

A Power Shift can be done in two steps (constant time, O(1)) on a
hypercube that uses Gray code ordering. (* will be covered in CS-487 *)

Therefore, the parallel time complexity of Prefix Sum is O(log N) on a
Hypercube.

How about a Mesh, a linear array, or a star network?

- Note also that this algorithm can be used for any prefix operation
  that is associative (e.g. prefix multiplication, AND/OR, etc.)

QUESTION:
---------
How would you use this algorithm to compute a polynomial for a given x
and a set of coefficients?

    a0 + a1*x^1 + a2*x^2 + a3*x^3 + ... + a_(n-1)*x^(n-1)

===============================================
PARALLEL SOLUTION OF LINEAR SYSTEM OF EQUATIONS
===============================================

Solve Ax = b

    A: coefficient matrix (sparse or dense)
    x: vector of unknowns
    b: right-hand side (known values)

    DIRECT                    vs.    ITERATIVE Techniques
    ----------------------           --------------------------
    - Gaussian Elimination           - Jacobi & JOR
    - Gauss-Jordan                   - Gauss-Seidel & SOR
    - LU-Decomposition               - Conjugate Gradient
                                     - Multigrid
            ******                           ******
    - exact solution                 - approximate solution
    - fixed number of steps          - unpredictable number of steps
                                     - a good initial guess required
    - need pivoting strategy         - convergence criterion needed
                                     - may not converge

-------------
JACOBI METHOD
-------------
Could be preferred when the coefficient matrix A is sparse and
convergence can be obtained.

    Ax = b                       (* A must be arranged such that no diagonal
    Dx + (A-D)x = b                 entry is zero, i.e. D is invertible *)

    x(t+1) = D^(-1) [ b - (A-D) x(t) ]

    x_i(t+1) = (1/a_ii) [ b_i - SUM(j =/ i) a_ij * x_j(t) ]

* Convergence is not guaranteed
  (However, it will converge if A is strictly diagonally dominant)

TERMINATION CRITERIA:

    | x_i(t+1) - x_i(t) | < Error-tolerance     for all i

Or

Vector Termination Condition:

    sum of the squares of errors < Tolerance, i.e.
    SUM(i) ( x_i(t+1) - x_i(t) )^2 < Tolerance

PARALLEL CODE
-------------
    x[i] = b[i];
    iter = 0;
    do {
        iter++;
        sum = -a[i][i]*x[i];
        for(j=1; j