CSc-487 Advanced Parallel Computation

===================================================
FUNDAMENTAL OPERATIONS ON A HYPERCUBE MULTICOMPUTER
===================================================
   "Hypercube Algorithms" by S. Ranka and S. Sahni

--------
GENERAL
--------

- Almost all the algorithms covered here use 1-hop communications.
  If this restriction is removed, some algorithms could be simplified
  at the expense of an increase in communication time. Also, SIMD
  machines (by definition) cannot have multi-hop routing.

- The linear array orderings used here are as follows:

     SIMD (Regular)   :  0 1 2 3 4 5 6 7
     MIMD (Gray-Code) :  0 1 3 2 6 7 5 4

- Assumption for the SIMD hypercube used in the textbook:
  At any given time, PEs can only route data to a neighbor, and the
  direction of communication is specified by a bit number which is
  the SAME for all processors.

EXAMPLE :

A) Exchange data in direction-2 (bit-2)  (SIMD)

   ( b2 b1 b0 )
                      0    1    2    3    4    5    6    7
                     000  001  010  011 - 100  101  110  111
                      a    b    c    d    e    f    g    h
   Exchange (bit-2)   e    f    g    h    a    b    c    d

B) Exchange in direction-1 (bit-1)  (SIMD)

   ( b2 b1 b0 )
                     000  001  010  011   100  101  110  111
                      a    b    c    d    e    f    g    h
   Exchange (bit-1)   c    d    a    b    g    h    e    f

C) The following data movements are not allowed in a SIMD hypercube !

                     000  001  010  011   100  101  110  111
                      a    b    c    d    e    f    g    h
 exchange arbitrary   b    a    g    h    f    e    c    d     NOT ALLOWED !
                      bit-0     bit-2     bit-0     bit-2
 shift-by-1 :         h    a    b    c    d    e    f    g     NOT ALLOWED !

-----------------
Data Broadcasting
-----------------

   b2 b1 b0
                                    000
                                   /   \
   send on BIT-0                  /     \
                                 /       \
                              000         001
                              / \         / \
   send on BIT-1             /   \       /   \
                           000   010   001   011
                           / \   / \   / \   / \
   send on BIT-2          /   \ /   \ /   \ /   \
                        000 100 010 110 001 101 011 111

Algorithm:
----------
  Each node X does a receive (the root only sends) and then sends to
  those neighbors whose PIDs are X (EXOR) 2^j such that

                  X < 2^j < 2^dim

- If we want the originating processor to be Px, then XOR all the
  processor labels on the Broadcast tree with Px.
  The resulting algorithm will be:

  Node X does a receive from Z and then sends to those neighbors
  whose PIDs are X (EXOR) 2^j such that

                  (X EXOR Z) < 2^j < 2^dim

  Here (X EXOR Z) = 2^b, where b is the bit position in which X and Z
  differ.

--------------------
Window Broadcasting
--------------------

- How do you define a k-dim window ?

                                            BOUND       FREE
- Represent the window (k-subcube) as      101...11    xxx..xx
                                             d-k          k

  Call the BOUND BITS the "WINDOW-ID".

- If each processor knows the window-id, then the PEs can FIX the
  window-id bits, change only the FREE bits, and run any fundamental
  operation inside the window.

------------------------------
Data Sum (window Sum) : binary tree-based  (Already covered)
------------------------------

---------------
MIMD PowerShift
---------------

- A power of 2 shift (in Gray-Code Ordering) can be made in two steps

EXAMPLE: Shift-by i=4 on a 16 PE hypercube

                     0                                    1
     000 001 011 010 - 110 111 101 100  ==  100 101 111 110 - 010 011 001 000
      A   B   C   D     E   F   G   H        I   J   K   L     M   N   O   P
         wid=00            wid=01               wid=11            wid=10

 shift by 4 in reverse order
      P   O   N   M     D   C   B   A        H   G   F   E     L   K   J   I

 Exchange data in windows of 2
      M   N   O   P     A   B   C   D        E   F   G   H     I   J   K   L

Rationale:   (A|B)^r = (B^r)|(A^r)

ALGORITHM:
----------
1. Each PE selects its window size as i and sets
   WID = WINDOW-ID = most significant [d - log(i)] bits

   Ex: for i=4, the WID for 0111 is 01 and the WID for 1001 is 10.

2. The window-id of the next window is the next label in the Gray code
   order.

   Ex: the WID of the next window for 01 is 11.

3. Each PE sends its data to the corresponding element in the next
   window. (This shifts the data in reverse order.)

4. Choose the window size to be i/2 and
   WID = most significant [d - log(i/2)] bits.
   If we pair up the WIDs, the one on the LEFT will always have a "0"
   in the least significant bit while the one on the RIGHT will have
   a "1" in the LSB. This property holds because of the way the Binary
   Reflected Gray Code is generated. Therefore, the PEs with a "0" in
   the LSB of their WID EXCHANGE DATA with those PEs with a "1" in the
   LSB of their WID, and vice versa.
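
The two steps of the power-of-2 shift can be checked with a small sequential simulation. The sketch below is only an illustration under stated assumptions: the function names (gray, gray_rank, power_shift) are hypothetical, data[pe] models the value held by the PE with label pe, the shift wraps around, and 0 <= m < d. Step 2 is skipped for a shift by 1, since there is no half-size window in that case.

```python
def gray(r):
    """Label of the PE at position r in the Gray-code ordering."""
    return r ^ (r >> 1)


def gray_rank(g):
    """Inverse of gray(): position of label g in the Gray-code ordering."""
    r = 0
    while g:
        r ^= g
        g >>= 1
    return r


def power_shift(data, d, m):
    """Shift data by 2**m positions (with wraparound) in Gray-code order.

    data[pe] is the value held by PE 'pe' of a d-dimensional hypercube.
    Assumes 0 <= m < d. Every transfer below is a 1-hop communication.
    """
    P = 1 << d
    windows = 1 << (d - m)          # number of windows of size 2**m
    low_mask = (1 << m) - 1

    # Steps 1-3: send to the PE with the same free bits in the next window.
    step1 = [None] * P
    for pe in range(P):
        wid, low = pe >> m, pe & low_mask
        next_wid = gray((gray_rank(wid) + 1) % windows)
        dest = (next_wid << m) | low
        assert bin(pe ^ dest).count("1") == 1   # neighbors differ in one bit
        step1[dest] = data[pe]

    if m == 0:                      # shift by 1: no half-size window exists
        return step1

    # Step 4: exchange along bit m-1 (the LSB of the WID for window size i/2).
    return [step1[pe ^ (1 << (m - 1))] for pe in range(P)]


if __name__ == "__main__":
    letters = "ABCDEFGHIJKLMNOP"
    data = [None] * 16
    for r, ch in enumerate(letters):
        data[gray(r)] = ch
    out = power_shift(data, 4, 2)
    print("".join(out[gray(r)] for r in range(16)))
```

Reading the result back in Gray-code order for d=4, m=2 reproduces the last row of the 16-PE example (M N O P A B C D ...).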
Algorithmic Complexity (power-of-2 shift) = O(1)

--------------------
MIMD Shift (general)
--------------------

- How would you perform a shift by i (i being any number) ?

  Ex:   i = 22 = 16 + 4 + 2 = 10110 (binary)

  Ans:  Shift-by 16, then 4, then 2   ======>   Alg.Complexity = O(logP)

-------------------
Prefix Sum/Multiply
-------------------

Given:
     PE     0      1      3      2      6      7      5      4
           000    001    011    010    110    111    101    100
          ------------------------------------------------------
           x1     x2     x3     x4     x5     x6     x7     x8

We would like to obtain:

           000    001      011         010       ...      100
          ------------------------------------------------------------
           x1    x1+x2   x1+x2+x3    x1+..+x4    ...    x1+..+x8

Here is how you do it:
----------------------
1) Initialize SUM=xi
2) Shift SUM by 1 (without wraparound) and add to previous SUM
3) Shift SUM by 2 (without wraparound) and add to previous SUM
4) Shift SUM by 4 (without wraparound) and add to previous SUM

ALGORITHM for Processor k:
--------------------------
STEP-1.  SUM = xk
     2.  for i=0 to (dim-1) do
     3.      shift SUM by 2^i and receive DATA
     4.      SUM = SUM + DATA

Algorithmic Complexity = O(dim) = O(logP)

QUESTION-1:
-----------
How would you use this algorithm to compute a polynomial for a given x
and a set of coefficients ?

     a0 + a1*x^1 + a2*x^2 + a3*x^3 + ..........+ a_(n-1)*x^(n-1)

QUESTION-2:
-----------
How would you use this algorithm to compute all the elements of the
sequence x0, x1, x2, ..., x_(n-1) which are generated by the following
recurrence relation:

     x0 = a0
     xi = ai*x_(i-1) + bi     for i > 0.

QUESTION-3:
-----------
How about computing higher-order (2nd, 3rd, etc.) recurrence relations ?
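
The prefix-sum loop above can be traced with a short sequential simulation. This is a minimal sketch, not the textbook's code: the name prefix_sum is hypothetical, list index r stands for the r'th position in the Gray-code ordering, each shift without wraparound is modeled by reading position r - 2^i, and P is assumed to be a power of 2.

```python
def prefix_sum(x):
    """Simulate the hypercube prefix sum on P = len(x) PEs (P a power of 2).

    Position r in the list is the r'th PE in the Gray-code ordering.
    Each of the dim rounds is one shift by 2**i plus one addition.
    """
    P = len(x)
    dim = P.bit_length() - 1
    assert P == 1 << dim, "this sketch assumes P is a power of 2"

    s = list(x)                                   # STEP 1: SUM = xk
    for i in range(dim):                          # STEP 2
        step = 1 << i
        # STEP 3: shift by 2**i without wraparound; PEs with no sender get 0
        received = [s[r - step] if r >= step else 0 for r in range(P)]
        # STEP 4: SUM = SUM + DATA
        s = [a + b for a, b in zip(s, received)]
    return s


if __name__ == "__main__":
    print(prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Replacing the addition (and the identity value 0) by another associative operation gives the "Prefix Multiply" variant named in the section title.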
-----------------------------------------
DATA CIRCULATION - (all-to-all broadcast)
-----------------------------------------

- Each PE will get each other PE's data

- MIMD Hypercube : Repeatedly shift-by-1, P times                 O(P)

- SIMD Hypercube : shift-by-1 is an expensive operation, therefore

  One Solution: Each PE broadcasts its data and receives from all others
  ------------
                Complexity: O(PlogP)       (* worst case *)

- There is a better solution for the SIMD hypercube

X-Sequence
----------
     X1 = 0
     Xq = X_(q-1)  (q-1)  X_(q-1)

     X1 = 0
     X2 = 0 1 0
     X3 = 0 1 0 2 0 1 0
     X4 = 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
     ..............

Define f(q,i) to be the i'th number in the sequence Xq. Then,

     f(3,1)=0, f(3,2)=1, f(3,4)=2, f(3,5)=0 ..........

SIMD Circulate Example:
-----------------------
              0    1    2    3    4    5    6    7     f(3,i)
             000  001  010  011 - 100  101  110  111
          -----------------------------------------------------
              a    b    c    d    e    f    g    h
STEP-1.       b    a    d    c    f    e    h    g       0
     2.       d    c    b    a    h    g    f    e       1
     3.       c    d    a    b    g    h    e    f       0
     4.       g    h    e    f    c    d    a    b       2
     5.       h    g    f    e    d    c    b    a       0
     6.       f    e    h    g    b    a    d    c       1
     7.       e    f    g    h    a    b    c    d       0

COMPLEXITY : O(P)

---------
QUESTION:
---------
How does processor j find the origin of the data received at step k ?

------------
THEOREM 2.1  (Read the Proof from Textbook):
---------------------------------------------------------------------
     INDEX(j,0) = j                                  /* Initially */
     INDEX(j,k) = INDEX(j,k-1) EXOR 2^[f(dim,k)]
---------------------------------------------------------------------

--------
Example:
--------
Calculate the first 4 indices for PE=3

     INDEX(3,0) = 3                                      (d)
     INDEX(3,1) = 3 EXOR 2^0 = 011 EXOR 001 = 2          (c)
     INDEX(3,2) = 2 EXOR 2^1 = 010 EXOR 010 = 0          (a)
     INDEX(3,3) = 0 EXOR 2^0 = 000 EXOR 001 = 1          (b)

* Note * : The X-sequence can be computed by the control processor in
           O(P) time and saved in an array of size P-1

---------
QUESTION:
---------
How do you generate the X-sequence on the fly ?
ANSWER:
-------
At step k, f(q,k) = bit position of the first (least significant) 1
in the binary representation of k                      ----- O(logP)

*******************
BONUS QUESTION - 1
*******************
Find an O(1) algorithm using a stack of size O(logP) to generate f(q,k)
at step k and receive a 3-POINT BONUS counted towards your TOTAL points.

*******************
BONUS QUESTION - 2
*******************
How would you use the "SIMD data circulation" algorithm to generate
(operate on) pairwise combinations of N = 2P objects in an optimal
manner ?

READING ASSIGNMENT :
``Distributed Evaluation of an Iterative Function for All Object Pairs
on an SIMD Hypercube'' by F. Ercal.
Information Processing Letters, Vol.40, Dec.1991, pp.341-345.

---------------
CONSECUTIVE SUM
---------------

- Each processor has an array X[.] of N values.
- The j'th processor will get the sum  S(j) = SUM(i=0, P-1) X(i)[j]

Example:
            P0     P1     P2     P3
          ----------------------------
            x00    x10    x20    x30
            x01    x11    x21    x31
            x02    x12    x22    x32
            x03    x13    x23    x33

   P1 will get (x01 + x11 + x21 + x31) = SUM(i=0, P-1) [ x(i1) ]
   P3 will get (x03 + x13 + x23 + x33) = SUM(i=0, P-1) [ x(i3) ]

---------------------------------------
MIMD Algorithm (repeated shifts of 1)
---------------------------------------

GRAY CODE ORDER:
           P0             P1             P3             P2
      ----------------------------------------------------------
   <--  S0=x00     <--  S1=x11     <--  S3=x33     <--  S2=x22

   <--  S1=x11     <--  S3=x33     <--  S2=x22     <--  S0=x00
         + x01           + x13           + x32           + x20

   <--  S3=S3      <--  S2=S2      <--  S0=S0      <--  S1=S1
         + x03           + x12           + x30           + x21

   <--  S2=S2      <--  S0=S0      <--  S1=S1      <--  S3=S3
         + x02           + x10           + x31           + x23

- Since each processor knows which partial sum Si it received at each
  step, it can figure out which array value to add to the partial sum.

---------------------------------------
SIMD Algorithm (using SIMD Circulation)
---------------------------------------

           P0          P1          P2          P3        f(2,i)
      -----------------------------------------------------------------
         S0=x00      S1=x11      S2=x22      S3=x33
               \     /                 \     /
                  X                       X                 0
               /     \                 /     \
         S1=x11      S0=x00      S3=x33      S2=x22
          + x01       + x10       + x23       + x32
                                                            1
         S3=S3       S2=S2       S1=S1       S0=S0
          + x03       + x12       + x21       + x30
               \     /                 \     /
                  X                       X                 0
               /     \                 /     \
         S2=S2       S3=S3       S0=S0       S1=S1
          + x02       + x13       + x20       + x31
                                                            1 (all back)

ALGORITHMIC COMPLEXITY = O(P)

---------
QUESTION: How does a PE figure out which array element to add to Si ?
ANSWER  : By using the index formula (Theorem 2.1)
---------
     Index(j,0) = j
     Index(j,k) = Index(j,k-1) EXOR 2^[f(dim,k)]

-------
RANKING
-------

Objective: Assign to each SELECTED processor a rank such that RANK(i)
           is the number of selected processors with index less than i

               0    1    2    3    4    5    6    7                      Exch.
              000  001  010  011 - 100  101  110  111                    Bit#
   -------------------------------------------------------------------------
                    *    *         *         *    *
   RESULT:          0    1         2         3    4
   -------------------------------------------------------------------------
   INITIAL:    0    0    0    0    0    0    0    0    R
   windows     0    1    1    0    1    0    1    1    S
   of d=0
   -------------------------------------------------------------------------
   windows     0    0    0    1    0    1    0    1    R   R_r=R'+S_l     BIT
   of d=1      1    1    1    1    1    1    2    2    S   S=S_r + S_l     0
   -------------------------------------------------------------------------
   windows     0    0    1    2    0    1    1    2    R                  BIT
   of d=2      2    2    2    2    3    3    3    3    S                   1
   -------------------------------------------------------------------------
   windows     0    0    1    2    2    3    3    4    R                  BIT
   of d=3      5    5    5    5    5    5    5    5    S                   2
   -------------------------------------------------------------------------

Algorithmic Complexity = O(logP)

-----------
CONCENTRATE
-----------

- The selected processors are ranked.
- Move the ranked records to the PE whose label is the same as the rank.

Here is how the algorithm works:
--------------------------------
               0     1     2     3     4     5     6     7     agree on
              000   001   010   011 - 100   101   110   111      Bit#
   --------------------------------------------------------------------
               .   (B,0)   .   (D,1) (E,2)   .   (G,3) (H,4)
   RESULT:   (B,0) (D,1) (E,2) (G,3) (H,4)   .     .     .
   --------------------------------------------------------------------
             (B,0)   .     .   (D,1) (E,2)   .   (H,4) (G,3)       0
   --------------------------------------------------------------------
             (B,0) (D,1)   .     .   (H,4)   .   (E,2) (G,3)       1
   --------------------------------------------------------------------
             (B,0) (D,1) (E,2) (G,3) (H,4)   .     .     .         2
   --------------------------------------------------------------------

ALGORITHMIC COMPLEXITY = O(log P)

QUESTION : How can this algorithm go wrong ?
A : If there is a "collision", i.e., if one of the exchange sites
    agrees while the other one disagrees at a certain bit.
    However, THEOREM 2.5 (page 58) proves that a "collision" never
    happens.

-----------
DISTRIBUTE
-----------

- Inverse of data concentration.
- Begin with records in PEs 0, 1, 2, ..., R. Each record has a
  destination D(i) such that D(0) < D(1) < ... < D(R)

OBJECTIVE: move (route) the records to the destination PEs

               0     1     2     3     4     5     6     7     Agree on
   b2b1b0     000   001   010   011 - 100   101   110   111      Bit#
   --------------------------------------------------------------------
             (A,2) (B,4) (C,5) (D,6)   .     .     .     .
   RESULT:     .     .   (A,2)   .   (B,4) (C,5) (D,6)   .
   --------------------------------------------------------------------
             (A,2)   .     .     .     .   (B,4) (C,5) (D,6)       2
              010                           100   101   110
   --------------------------------------------------------------------
               .     .   (A,2)   .   (C,5) (B,4)   .   (D,6)       1
                          010         101   100    .    110
   --------------------------------------------------------------------
               .     .   (A,2)   .   (B,4) (C,5) (D,6)   .         0
                          010         100   101   110    .
   --------------------------------------------------------------------

Algorithmic Complexity = O(log P)

-----------
GENERALIZE
-----------

               0     1     2     3     4     5     6     7
              000   001   010   011 - 100   101   110   111
   ---------------------------------------------------------------------------
             (A,2) (B,4) (C,5) (D,6)   .     .     .     .     (G)
   RESULT:   (A,2) (A,2) (A,2) (B,4) (B,4) (C,5) (D,6)   .
   ---------------------------------------------------------------------------
               .     .     .     .   (A,2) (B,4) (C,5) (D,6)   (F) Exch.- bit 2
               .     .     .     .     .   (B,4) (C,5) (D,6)   (F) Eliminate
             (A,2) (B,4) (C,5) (D,6)   .     .     .     .     (G) Eliminate
             (A,2) (B,4) (C,5) (D,6)   .   (B,4) (C,5) (D,6)   (G) Consolidate
   ---------------------------------------------------------------------------
             (C,5) (D,6) (A,2) (B,4) (C,5) (D,6)   .   (B,4)   (F) Exch.- bit 1
             (C,5) (D,6) (A,2) (B,4) (C,5) (D,6)   .     .     (F) Eliminate
             (A,2) (B,4) (C,5) (D,6)   .   (B,4)   .   (D,6)   (G) Eliminate
             (A,2) (B,4) (A,2) (B,4) (C,5) (B,4)   .   (D,6)   (G) Consolidate
   ---------------------------------------------------------------------------
             (B,4) (A,2) (B,4) (A,2) (B,4) (C,5) (D,6)   .     (F) Exch.- bit 0
             (B,4) (A,2) (B,4)   .   (B,4) (C,5) (D,6)   .     (F) Eliminate
             (A,2) (B,4) (A,2) (B,4) (C,5)   .     .     .     (G) Eliminate
             (A,2) (A,2) (A,2) (B,4) (B,4) (C,5) (D,6)   .     (G) Consolidate
   ---------------------------------------------------------------------------

Algorithmic Complexity = O(logN)

* ELIMINATION CRITERIA for F and G: a record is eliminated if its
  dest-PE is smaller than the smallest processor in the current window.

* CONSOLIDATION CRITERIA: if { F(i) < G(i) } then G(i) <--- F(i)

------------------
RANDOM ACCESS READ
------------------

** Go over the example from the transparency **

                0     1     2     3   |  4     5     6     7   |
               000   001   010   011  | 100   101   110   111  |
   -----------------------------------------------------------
   Read FROM:   4     3     .     1      7     7     .     3

   STEPS INVOLVED:
        sort           O(log^2 (P))
        rank           O(logP)
        concentrate    O(logP)
        distribute     O(logP)
        concentrate    O(logP)
        generalize     O(logP)
        sort           O(log^2 (P))

   RESULT:     D(4)  D(3)   .    D(1)   D(7)  D(7)   .    D(3)
   -----------------------------------------------------------

OVERALL Complexity = O(log^2 (P))

** Read from Textbook pages 71-74. **

-------------------------
RANDOM ACCESS WRITE (RAW)
-------------------------

3 ways to resolve write conflicts:

   1) Arbitrary RAW
   2) Highest/Lowest RAW
   3) Combining RAW

- We will implement "Arbitrary RAW".

** Go over the example from the transparency **

Time Complexity is O(log^2 (P)) due to the "sort" step.

----------------------------------------------
BPC PERMUTATIONS (Bit-Permute-Complement)
----------------------------------------------

Sorting N numbers using N PEs is basically a permutation
(rearrangement) of N numbers. In several situations where the records
are to be permuted, the desired permutation can be expressed in terms
of a relationship between the bit representations of the source and
the destination processors.

Q: Can we find such a relationship for the sorting algorithm ?
A: No,
   because it is not a fixed permutation; in other words, it is
   context sensitive (it depends on the data values to be sorted).

Q: Can we implement perfect shuffle using BPC ?
A: Yes.

Example: PERFECT SHUFFLE
------------------------
The binary representation of the destination d of the data in PE
I = I2 I1 I0 is I1 I0 I2, and this relation is represented as
B = [0,2,1]. If we want to show this relation using a table, the
following is obtained:

   B = [0, 2, 1]  ===>  d = i1 i0 i2

               i            Destination
     i      i2 i1 i0        d2 d1 d0        d
   ---------------------------------------------
     0       0  0  0         0  0  0        0
     1       0  0  1         0  1  0        2
     2       0  1  0         1  0  0        4
     3       0  1  1         1  1  0        6
     4       1  0  0         0  0  1        1
     5       1  0  1         0  1  1        3
     6       1  1  0         1  0  1        5
     7       1  1  1         1  1  1        7

- Every BPC permutation can be performed in O(log^2 N) time by having
  each PE compute the destination PE for its data and then sorting the
  data using the destination PE as key.

- However, there is a better alg. which takes O(logN) time.

- The algorithm is complex. Read from the textbook pages 78-87.

** Go over the example from the transparency **

----------------------------
Some common BPC Permutations
----------------------------

   k=dimension
   ====================================================
   PERMUTATION              B
   ====================================================
   Matrix Transpose         [k/2-1,...,0, k-1,...,k/2]
   ----------------------------------------------------
   Bit Reversal             [0,1,2,....,k-1]
   ----------------------------------------------------
   Vector Reversal          [-(k-1),-(k-2),....,-0]
   ----------------------------------------------------
   Perfect Shuffle          [0,k-1,k-2,......,1]
   ----------------------------------------------------
   Unshuffle                [k-2,k-3,......,1,0,k-1]
   ----------------------------------------------------

-------
Example:  MATRIX TRANSPOSE    B = [k/2-1,...,0, k-1,...,k/2]
-------

   If i = [3 2 1 0], k = 4 and B = [1 0 3 2], then d = i1 i0 i3 i2

               i            d           Sort value
            3 2 1 0      i1i0i3i2                       Destination
   -----------------------------------------------------------------
   a00      0 0 0 0      0 0 0 0        a00  0 0 0 0       a00
   a01      0 0 0 1      0 1 0 0        a01  0 0 0 1       a10
   a02      0 0 1 0      1 0 0 0        a02  0 0 1 0       a20
   a03      0 0 1 1      1 1 0 0        a03  0 0 1 1       a30
   a10      0 1 0 0      0 0 0 1        a10  0 1 0 0       a01
   a11      0 1 0 1      0 1 0 1        a11  0 1 0 1       a11
   a12      0 1 1 0      1 0 0 1        a12  0 1 1 0       a21
   a13      0 1 1 1      1 1 0 1        a13  0 1 1 1       a31
   a20      1 0 0 0      0 0 1 0        a20  1 0 0 0       a02
   a21      1 0 0 1      0 1 1 0        a21  1 0 0 1       a12
   a22      1 0 1 0      1 0 1 0        a22  1 0 1 0       a22
   a23      1 0 1 1      1 1 1 0        a23  1 0 1 1       a32
   a30      1 1 0 0      0 0 1 1        a30  1 1 0 0       a03
   a31      1 1 0 1      0 1 1 1        a31  1 1 0 1       a13
   a32      1 1 1 0      1 0 1 1        a32  1 1 1 0       a23
   a33      1 1 1 1      1 1 1 1        a33  1 1 1 1       a33

==============================
PARALLEL MATRIX MULTIPLICATION
==============================
   Chapter 3 - S.Ranka and S.Sahni

----------------------------------------
Using n^3 processors - 3D representation
----------------------------------------

   C[i,j] = SUM{k=0, n-1} { A[i,k]*B[k,j] }

Algorithm:
----------
STEP 1 : [DISTRIBUTE DATA]
         A(k,i,j) = A[i,k]  and  B(k,i,j) = B[k,j]

STEP 2 : [MULTIPLY]
         C(k,i,j) = A(k,i,j) * B(k,i,j)

STEP 3 : [ADD]
         PE(0,i,j) computes
         SUM{k=0,n-1} C(k,i,j) = SUM{k=0, n-1} { A[i,k]*B[k,j] }

Draw a 3D rep. of processors and show how A and B are initially stored
in a 2D-plane, and perform step-1 through broadcast of these initial
values along the third dimension (window broadcast).

[Figure: 3D processor array with axes I, J and K (z = n-1 is the last
 index). Copies of A, with entries a00 ... azz, are repeated along the
 J axis.]

[Figure: same axes. Copies of B, with entries b00 ... bzz, are repeated
 along the I axis.]

[Figure: same axes. B (b00 ... bzz) is shown in the J-K plane and the
 result C (c00 ... czz) in the I-J plane.]

------------------------
Time Complexity Analysis
------------------------

   STEP-1 (window broadcast) takes              : O(log n)
   STEP-2 takes                                 : O(1)
   STEP-3 (treewise addition on a window) takes : O(log n)

   Speedup    = Tseq/Tpar = (n^3)/logn     (* GOOD *)
   Efficiency = speedup/P = 1/logn         (* POOR *)

----------------------------------------
Using n^2 processors - 2D representation
----------------------------------------

             INITIAL                           AFTER ALIGNMENT
                j                                     j
        00   01   10   11                     00   01   10   11
       ---------------------                 ---------------------
       | 00 | 01 | 02 | 03 |  A              | 00 | 01 | 02 | 03 |  A
   00  | 00 | 01 | 02 | 03 |  B          00  | 00 | 11 | 22 | 33 |  B
       ---------------------                 ---------------------
       | 10 | 11 | 12 | 13 |  A              | 11 | 10 | 13 | 12 |  A
   01  | 10 | 11 | 12 | 13 |  B          01  | 10 | 01 | 32 | 23 |  B
 i     ---------------------           i     ---------------------
       | 20 | 21 | 22 | 23 |  A              | 22 | 23 | 20 | 21 |  A
   10  | 20 | 21 | 22 | 23 |  B          10  | 20 | 31 | 02 | 13 |  B
       ---------------------                 ---------------------
       | 30 | 31 | 32 | 33 |  A              | 33 | 32 | 31 | 30 |  A
   11  | 30 | 31 | 32 | 33 |  B          11  | 30 | 21 | 12 | 03 |  B
       ---------------------                 ---------------------

Alignment can be done in O(log n) steps as follows:

   A[i, j]  === goes to ==>  A[i, i EXOR j ]

(i EXOR j) affects only those bit positions in j which are 1 in i.
Therefore, the following alignment algorithm would work:

 - Each PE checks the bits of its i number from left to right.
 - If that bit = 1, exchange with the PE whose only differing bit is
   in that position; otherwise (that bit = 0), do nothing.

Example:
--------
   A[0110, 1011]  == goes to ==>  A[0110, 1101]

   0110 EXOR 1011 = 1101

   Steps for routing from 1011 to 1101 under the influence of 0110:

   1011 ====> 1011 ====> 1111 ====> 1101 ====> 1101

 * Each PE cycles (log n) times.
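
The alignment rule can be tried out on a small grid. The sketch below is a sequential simulation under stated assumptions: the names route and align are hypothetical, a[i][j] and b[i][j] model the A and B values held by PE (i,j), and n is a power of 2. The rule for B (exchange along the i direction, controlled by the bits of j) is the mirror image of the rule for A and is assumed from the B rows of the AFTER ALIGNMENT table.

```python
def route(i, j, q):
    """Column labels visited when the bits of i are scanned left to right.

    Starts in column j of row i; a 1 bit in i flips the same bit of the
    column label (a 1-hop exchange), a 0 bit leaves it unchanged.
    """
    path = [j]
    for bit in reversed(range(q)):
        if i & (1 << bit):
            j ^= 1 << bit
        path.append(j)
    return path


def align(A, B):
    """Return (a, b) with a[i][j] = A[i][i ^ j] and b[i][j] = B[i ^ j][j].

    Uses log2(n) exchange rounds, one per bit, most significant bit first.
    """
    n = len(A)
    q = n.bit_length() - 1
    assert n == 1 << q, "this sketch assumes n is a power of 2"
    a = [row[:] for row in A]
    b = [row[:] for row in B]
    for bit in reversed(range(q)):
        mask = 1 << bit
        # A: rows whose i has this bit set exchange along the j direction.
        a = [[a[i][j ^ mask] if i & mask else a[i][j] for j in range(n)]
             for i in range(n)]
        # B: columns whose j has this bit set exchange along the i direction.
        b = [[b[i ^ mask][j] if j & mask else b[i][j] for j in range(n)]
             for i in range(n)]
    return a, b


if __name__ == "__main__":
    print([format(c, "04b") for c in route(0b0110, 0b1011, 4)])
    labels = [[f"{i}{j}" for j in range(4)] for i in range(4)]
    a, b = align(labels, labels)
    for i in range(4):
        print(a[i], b[i])
```

With string labels "ij" as input, the rows of a and b can be compared entry by entry with the AFTER ALIGNMENT table.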
NOTE: You may also achieve the same result by using the Random Access
      Write or sorting algorithms. However, both algorithms have
      O(log^2 n) time complexity.

         AFTER ALIGNMENT
                j
        00   01   10   11
       ---------------------
       | 00 | 01 | 02 | 03 |  A
   00  | 00 | 11 | 22 | 33 |  B
       ---------------------
       | 11 | 10 | 13 | 12 |  A
   01  | 10 | 01 | 32 | 23 |  B
 i     ---------------------
       | 22 | 23 | 20 | 21 |  A
   10  | 20 | 31 | 02 | 13 |  B
       ---------------------
       | 33 | 32 | 31 | 30 |  A
   11  | 30 | 21 | 12 | 03 |  B
       ---------------------

F-sequence
----------
        00   01   10   11
       ---------------------
       | 01 | 00 | 03 | 02 |  A
   00  | 10 | 01 | 32 | 23 |  B
       ---------------------
       | 10 | 11 | 12 | 13 |  A
   01  | 00 | 11 | 22 | 33 |  B         BIT-0 EXCHANGE
 i     ---------------------
       | 23 | 22 | 21 | 20 |  A
   10  | 30 | 21 | 12 | 03 |  B
       ---------------------
       | 32 | 33 | 30 | 31 |  A
   11  | 20 | 31 | 02 | 13 |  B
       ---------------------

       ---------------------
       | 03 | 02 | 01 | 00 |  A
   00  | 30 | 21 | 12 | 03 |  B
       ---------------------
       | 12 | 13 | 10 | 11 |  A
   01  | 20 | 31 | 02 | 13 |  B         BIT-1 EXCHANGE
       ---------------------
       | 21 | 20 | 23 | 22 |  A
   10  | 10 | 01 | 32 | 23 |  B
       ---------------------
       | 30 | 31 | 32 | 33 |  A
   11  | 00 | 11 | 22 | 33 |  B
       ---------------------

       ---------------------
       | 02 | 03 | 00 | 01 |  A
   00  | 20 | 31 | 02 | 13 |  B
       ---------------------
       | 13 | 12 | 11 | 10 |  A
   01  | 30 | 21 | 12 | 03 |  B         BIT-0 EXCHANGE
       ---------------------
       | 20 | 21 | 22 | 23 |  A
   10  | 00 | 11 | 22 | 33 |  B
       ---------------------
       | 31 | 30 | 33 | 32 |  A
   11  | 10 | 01 | 32 | 23 |  B
       ---------------------

----------
Algorithm:
----------
STEP 1 : [ALIGN]
         Align A and B such that P(i,j) contains

            A(i,j) ====> A[i, i EXOR j]   and
            B(i,j) ====> B[i EXOR j, j]

         ** This alignment takes O(log n) steps

STEP 2 : [SWAP-MULTIPLY-ADD]
         Move A's and B's so that each processor has an A and a B whose
         product is a new term in the sum for C[i,j]. The f-sequence
         will work in this case (Lemma 3.2 refers to Theorem 2.1).
         ** This step takes O(n) steps

 - Due to Theorem 2.1, after every swap A(i,j) and B(i,j) will have

      A[i, index(j,m)]  and  B[index(i,m), j]

   where index(j,m) = index(i,m)

** TOTAL Time Complexity = O(n)

   Speedup    = Tseq/Tpar = (n^3)/n = n^2 = P     (* LINEAR SPEEDUP *)
   Efficiency = speedup/P = 1                     EXCELLENT !

======================================
TOURNAMENT STYLE COMPUTING in PARALLEL
(INTERACTION AMONG ALL-PAIRS)
======================================

To be covered from Paper:

``Distributed Evaluation of an Iterative Function for All Object Pairs
on an SIMD Hypercube'' by F. Ercal.
Information Processing Letters, Vol.40, Dec.1991, pp.341-345.

===========================
PARALLEL GRAPH PARTITIONING
===========================

Uses "Tournament style computing" described earlier.

To be covered from Paper:

``Parallel Graph Partitioning on a Hypercube,'' by P. Sadayappan,
F. Ercal, and J. Ramanujam.
Proc. of Fourth Conf. on Hypercube Concurrent Comp. and Applications,
pp. 67-70, March, 1989.