CSc-387 Parallel Processing

==============================
CHAPTER 9 : SORTING ALGORITHMS
==============================

- Sorting is one of the most common operations performed by a computer.
- Basically, it is a permutation function which operates on N elements

  Internal Sorting : N is small enough to fit into the PE's main memory
  External Sorting : N is very large (doesn't fit into main memory),
                     therefore auxiliary storage must be used for sorting

Another categorization:
-----------------------
  Comparison-Based sorting   : uses repeated compare/exchanges
                               lower bound for sequential time --> Omega(NlogN)
  Noncomparison-Based sorting: uses certain known properties of the elements
                               (e.g. binary representation or distribution)
                               lower bound for complexity --> Omega(N)
        (e.g. if you know that the numbers are distributed between 1 and N
         and they are all distinct, how do you sort them ?)

ISSUES:
-------
- Input-output sequences will be kept distributed in PE memories,
  i.e. they are not centralized
- The sequence will be sorted with respect to the Processor Enumeration

  Examples: Decimal ordering of PEs, Gray-code ordering
  --------  Snake-like ordering in a 2D MESH etc.        (* T Figure 9.11 *)

9.2 COMPARE-AND-EXCHANGE SORTING ALGORITHMS
-------------------------------------------
- Compare-exchange step with 1 element             (* T Figure 9.5 *)
- Compare-exchange step with N/p elements per PE   (* T Figure 9.7 *)

BUBBLE SORT AND ODD-EVEN TRANSPOSITION SORT
===========================================
- Best Sequential Sorting Algorithm has O(NlogN) Complexity
- Using N processors, if we can achieve linear speedup,
  Best Parallel Complexity would be O(logN).
  O(logN) complexity with N processors is very difficult to achieve
- However, we can achieve linear speedup if we parallelize lousy sorting
  algorithms such as Bubble Sort [O(N^2)]          (* T Figure 9.8 *)

  Time Complexity  Tseq = SUM(i=1, i=N-1) i = N(N-1)/2 = O(N^2)

ODD-EVEN TRANSPOSITION SORT
===========================
Sorts n elements in n phases, each requiring n/2 compare-exchanges
                                                   (* T Figure 9.10 *)
Parallel Time Complexity = O(n)

EVEN-PHASE: even indexed PEs do a compare/exchange with their right neighbors
ODD-PHASE:  odd indexed PEs do a compare/exchange with their right neighbors

CASE: when n >> p
-------------------
Each PE gets n/p numbers. First, the PEs sort their n/p numbers locally, then
they run the odd-even transposition algorithm, each time doing a merge-split
on 2n/p numbers.
---------------------------------------------------------------------------
EXAMPLE:
                     p0           p1           p2           p3
                  13  7 12      8  5  4      6  1  3      9  2 10

1) LOCAL SORT      7 12 13      4  5  8      1  3  6      2  9 10
   (Time = O((n/p)*log(n/p)))

2) O-E               |____________|            |____________|
                   4  5  7      8 12 13      1  2  3      6  9 10

   E-O                            |____________|
                   4  5  7      1  2  3      8 12 13      6  9 10

   O-E               |____________|            |____________|
                   1  2  3      4  5  7      6  8  9     10 12 13

   E-O                            |____________|
   SORTED:         1  2  3      4  5  6      7  8  9     10 12 13
---------------------------------------------------------------------------
Time-Complexity Analysis:   (there are p merge-splits, each taking O(n/p))
-------------------------
  Tpar = (n/p)log(n/p)  +   p*(n/p)    +   p*(n/p)
         ^^^^^^^^^^^^^^     ^^^^^^^        ^^^^^^^
           LOCAL SORT     merge-splits   communication

9.2.3.
TWO-DIMENSIONAL SORTING ON A MESH: SHEARSORT
====================================================
Snake-like ordering in a 2D MESH:                  (* T Figure 9.11 *)

ODD-PHASE:   Even Rows: sort in ascending order  (use O-E Transposition sort)  O(n)
             Odd Rows:  sort in descending order (use O-E Transposition sort)  O(n)
EVEN-PHASE:  Sort each column in ascending order
                                                   (* T Figure 9.12 *)
It takes O(logn) phases to sort (n x n) numbers:  Tpar = O(nlogn)
Since sorting n^2 numbers sequentially takes O(n^2 logn),
Speedup = O(n) ; however, efficiency is only n/n^2 = 1/n, since n^2 PEs are used.

==================
BITONIC MERGE SORT
==================
- A Bitonic List is defined as a list with         (* T Figure 9.24 *)
  no more than one LOCAL MAXIMUM and no more than one LOCAL MINIMUM.
  (Endpoints must be considered - wraparound)

- BINARY-SPLIT: Divide the list equally into two. Compare/Exchange each item
  in the first half with the corresponding item in the second half of the list.

  Example:                                         (* T Figure 9.25 *)

  Another Example:
        24 20 15 9 4 2 5 8          10 11 12 13 22 30 32 45
  Result after Binary-split:
        10 11 12 9 4 2 5 8          24 20 15 13 22 30 32 45

  Notice that:
  a) Each element in the first half is smaller than each element in the
     second half
  b) Each half is a bitonic list of length n/2.

  If you keep applying the BINARY-SPLIT to each half repeatedly, you will
  get a SORTED LIST:

        10 11 12 9 . 4 2 5 8        24 20 15 13 . 22 30 32 45
        4 2 . 5 8 . 10 11 . 12 9    22 20 . 15 13 . 24 30 . 32 45
        4 2 . 5 8 . 10 9 . 12 11    15 13 . 22 20 . 24 30 . 32 45
Sorted: 2 4 . 5 8 . 9 10 . 11 12    13 15 . 20 22 . 24 30 . 32 45

  Q: How many parallel steps does it take to sort ?   ANSWER: logN

  Another Example:                                 (* T Figure 9.26 *)

- Notice that we sorted a BITONIC list rather than an arbitrary list.
- Could you use this algorithm to sort an arbitrary list ?
  A: Yes. First obtain a bitonic sequence, then use this algorithm to sort it.
                                                   (* T Figure 9.27 *)
                                                   (* T Figure 9.28 *)

Q: What is the most tricky part in coding this algorithm on a hypercube?
A: At each step, each PE needs to figure out its partner for compare/exchange
   - This can be done by considering the id of each process (in binary)
     At step j, [Partner-ID = id obtained by complementing bit (d-j) of MYID]

     000  001 - 010  011 -- 100  101 - 110  111
      |    ^     ^           ^
      |____|     |           |        d-3
      |__________|           |        d-2
      |______________________|        d-1

Q: How would you sort a bitonic list WHEN N is NOT A POWER OF 2 ?
A: Use N' processes, where N' is the smallest power of 2 that is >= N.
   Assign a special value such as "infinity" to all the extra processes
   (a kind of padding). The extra processes take part in the algorithm for
   the sake of regularity and simplicity, but won't be doing any useful work.

BITONIC SORT EXAMPLE: (Arbitrary Sequence)
---------------------
Unsorted sequence :     C N M F H A P D    (sorted in windows of 1)
                                           (bitonic sequences of size 2)
D=1                     C N M F A H P D    (sorted in windows of 2)
                                           (bitonic sequences of size 4)
D=2                     C F M N P H A D
                        C F M N P H D A    (sorted in windows of 4)
                                           (bitonic sequences of size 8)
D=3                     C F D A P H M N
                        C A D F M H P N
                        A C D F H M N P    (sorted in windows of 8)

TIME COMPLEXITY FOR BITONIC SORT :
===================================
Bito-sort on windows of size 2
Bito-sort on windows of size 4 : noninc - nondec - noninc - ....
Bito-sort on windows of size 8 : noninc - nondec - noninc - ....

Bito-sort on windows of size 2  :  1 step
Bito-sort on windows of size 4  :  2 steps
Bito-sort on windows of size 8  :  3 steps
........
Bito-sort on windows of size 2^k:  k steps

Total STEPS = 1 + 2 + 3 + 4 + ..... + k      where k = logN
            = 1/2 * k(k+1)
            = 1/2 * logN * (logN + 1)

TOTAL EXECUTION TIME of Bitonic Merge Sort: O((logN)^2)
---------------------------------------------------------------------------
Q: Is it cost-optimal?
A: NO,  if we compare it with an optimal comparison-based sorting algorithm,
        which has a cost of N*(logN)
   YES, if we compare it with the sequential implementation of the
        bitonic sort algorithm, which has a cost of N*(logN)^2
---------------------------------------------------------------------------
Q: How about when N >> P ?
A: Each PE gets N/P elements and runs the same algorithm, with the exception
   that each PE emulates N/P virtual processors.   How? ==> PROJECT 2
---------------------------------------------------------------------------

================
9.1.3. RANK SORT
================
- Find the rank of each number in the list and move it to its correct location
-----------------------------------------------
for (i = 0; i < n; i++) {        /* for each number ... */
    rank = 0;
    for (j = 0; j < n; j++)      /* ... count the numbers smaller than it */
        if (a[i] > a[j]) rank++;
    b[rank] = a[i];              /* copy the number into its correct location */
}
-----------------------------------------------
(*** it is left as an exercise to modify the code to cope with duplicates ****)

Tseq = O(n^2)      (* not a good sequential algorithm *)
------------------
Using n processors      (* assume that every PE has access to the entire list *)
-----------------------------------------------
forall (i = 0; i < n; i++) {     /* PE i handles a[i] */
    rank = 0;
    for (j = 0; j < n; j++)
        if (a[i] > a[j]) rank++;
    b[rank] = a[i];              /* copy the number into its correct location */
}
-----------------------------------------------
Tpar = O(n)
------------------
Using n^2 processors    (* every PE has access to the entire list *)
-----------------------------------------------
- n PEs collectively work on finding the rank of 1 number in parallel.
  This can be done using a tree structure as shown in:   (* T Figure 9.2 *)
  This way, Parallel Time on n^2 processors = O(log n)
- However, efficiency is pretty low: O(1/n)
- As a final note, it is theoretically possible to reduce the time complexity
  to O(1) by using a CRCW-PRAM where all the increment operations can be
  done simultaneously in one step (see Appendix D).

================
9.2.4.
MERGESORT
================
- divide-and-conquer:                              (* T Figure 9.14 *)
- A total of logn phases, each phase processing n numbers
  Therefore Tseq = O(nlogn)

  Tpar  = Tcomm + Tcalc = O(n)
  Tcomm = (2log p)*Ts + 2n*Tw
  Tcalc = (n/p)(1+2+4+...+p) = (n/p)(2p-1) ~ 2n

==================
QUICKSORT
==================
quicksort(list, start, end)
{
   if (start < end) {
      partition(list, start, end, pivot);
      quicksort(list, start, pivot-1);
      quicksort(list, pivot+1, end);
   }
}

- Selection of a "good" PIVOT is very critical in boosting the performance
- Sequential Time Complexity: (Average)    O(NlogN)
                              (Worst Case) O(N^2)

-------------------------------------
A SIMPLE PARALLEL QUICKSORT ALGORITHM
-------------------------------------
One PE does the first split and sends each half to a separate PE
(the number of busy PEs grows like a binary tree)  (* T Figure 9.15 *)

Hypercube version:                                 (* T Figure 9.18 *)

- Time Complexity is bounded by O(N) because of the first step.

===============================================================
Hyperquicksort: An EFFICIENT Quicksort Algorithm on a Hypercube
===============================================================
Assumption: initially each PE has n/p numbers in its local memory
                                                   (* T Figure 9.19 *)
                                                   (* T Figure 9.20 *)
1. Each PE sorts its list sequentially                     O(n/p Log(n/p))
2. The root PE in each subcube selects a pivot and
   broadcasts it to the other PEs in that subcube          O(Log p)
3. The PEs in the "lower" subcube send their numbers which are greater
   than the pivot to their partners in the "upper" subcube.
   The PEs in the "upper" subcube send their numbers which are smaller
   than the pivot to their partners in the "lower" subcube O(n/p)
4. Each PE merges the list received with its own:          O(n/p)

Tpar = n/p Log(n/p) + Log^2 p + (n/p)*Logp

If p=n then Tpar = O(Log^2 n)      (* the pivot broadcast term dominates *)

Note: since the numbers are always kept sorted, the number in the middle
      (i.e. the median) can be selected as pivot in O(1) time.
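
Steps 1-4 of hyperquicksort can be tried out without a parallel machine.
The following is only a minimal sketch, assuming a sequential simulation in
which p virtual PEs are visited one after the other and the "send" of step 3
is just an array copy. The names hyperquicksort_sim, merge_runs, count_le and
MAX_PES are made up for this illustration and are not taken from the text or
from any library. Each PE's median is used as pivot, as suggested in the note
above; an empty root list falls back to pivot 0, which affects only the load
balance, not the correctness.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PES 64   /* limit of this simulation only (assumption) */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge sorted runs a[0..na) and b[0..nb) into out; returns na + nb. */
static int merge_runs(const int *a, int na, const int *b, int nb, int *out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}

/* Number of leading elements of the sorted run v[0..n) that are <= pivot. */
static int count_le(const int *v, int n, int pivot) {
    int k = 0;
    while (k < n && v[k] <= pivot) k++;
    return k;
}

/* Sort data[0..n) by simulating hyperquicksort on p = 2^d virtual PEs.
   Requires n > 0 and n divisible by p. malloc error checks are omitted. */
void hyperquicksort_sim(int *data, int n, int p) {
    assert(n > 0 && p > 0 && p <= MAX_PES && (p & (p - 1)) == 0 && n % p == 0);
    int *list[MAX_PES], len[MAX_PES], pivot[MAX_PES];
    int *new_lo = malloc(n * sizeof(int));
    int *new_hi = malloc(n * sizeof(int));

    /* Step 1: each PE sorts its own n/p numbers. */
    for (int pe = 0; pe < p; pe++) {
        len[pe] = n / p;
        list[pe] = malloc(n * sizeof(int));   /* worst case: one PE holds all */
        memcpy(list[pe], data + pe * len[pe], len[pe] * sizeof(int));
        qsort(list[pe], len[pe], sizeof(int), cmp_int);
    }

    /* One round per hypercube dimension, highest bit first. */
    for (int bit = p >> 1; bit >= 1; bit >>= 1) {
        /* Step 2: the root of each subcube picks the median of its list. */
        for (int root = 0; root < p; root += 2 * bit)
            pivot[root] = len[root] ? list[root][(len[root] - 1) / 2] : 0;

        for (int lo = 0; lo < p; lo++) {
            if (lo & bit) continue;               /* visit each pair once */
            int hi = lo | bit;                    /* partner: flip this bit */
            int pv = pivot[lo & ~(2 * bit - 1)];  /* pivot of lo's subcube */
            int a = count_le(list[lo], len[lo], pv);
            int b = count_le(list[hi], len[hi], pv);
            /* Steps 3-4: lower PE keeps values <= pivot, upper PE the rest;
               both merge what they keep with what they receive. */
            int nlo = merge_runs(list[lo], a, list[hi], b, new_lo);
            int nhi = merge_runs(list[lo] + a, len[lo] - a,
                                 list[hi] + b, len[hi] - b, new_hi);
            memcpy(list[lo], new_lo, nlo * sizeof(int)); len[lo] = nlo;
            memcpy(list[hi], new_hi, nhi * sizeof(int)); len[hi] = nhi;
        }
    }

    /* Reading the lists in PE order gives the sorted sequence. */
    for (int pe = 0, k = 0; pe < p; pe++) {
        memcpy(data + k, list[pe], len[pe] * sizeof(int));
        k += len[pe];
        free(list[pe]);
    }
    free(new_lo);
    free(new_hi);
}
```

Usage: calling hyperquicksort_sim(a, 12, 4) on the 12 numbers of the odd-even
example (13 7 12 8 5 4 6 1 3 9 2 10) leaves the array a sorted. In a real
message-passing version the two memcpy calls inside the pair loop would be
replaced by an exchange between partner PEs, and the list lengths would
generally become unequal, which is the load-balance issue that a good pivot
is meant to limit.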