CSc-387 Parallel Processing

CHAPTER 1 : PARALLEL COMPUTERS

- Areas requiring great computational speed include: numerical modeling,
  scientific simulations, virtual reality
- Grand Challenge Problem: cannot be solved in a reasonable amount of time
  with today's computers (modeling large DNA structures, global weather
  forecasting - go over the example on p.4 ==> need 1.7 Tflops)

  N-body problem needs (N^2 * STEPS) calculations. A galaxy has 10^11 stars!

  (* T Figure 1.1 *)

One way to increase speed: multiple interconnected processors operating
together (a parallel computer). This requires parallel programming.

Other advantages of parallel computing: can solve a larger problem, or can
obtain a more precise solution of the same problem.

The idea is old: (Gill, 1958) (Holland 1959) (Conway 1963)

Conclusion: The future is parallel

----------------------------------------------------
1.2.1. Shared Memory Multiprocessor System
----------------------------------------------------
conventional computer:   (* T Figure 1.2 *)
shared-memory computer:  (* T Figure 1.3 *)   single address space

- Parallel programming: using parallel extensions to FORTRAN and C/C++
  threads: code sequences for individual PEs that can access shared memory
- Uniform/NonUniform memory access (UMA/NUMA)
- Scalability is a problem
- Putting fast CACHE on each PE can help reduce memory access time.
  But ... the Cache Coherency Problem must be solved!

----------------------------------------------------
1.2.2. Message-Passing Multicomputer
----------------------------------------------------
(* T Figure 1.4 *)
- Interconnected PEs, each having its own address space. PEs cannot access
  each other's memory. They communicate through messages.
- Divide the problem into concurrent processes and execute them on separate
  processors. If there are more processes than PEs, then execute more than
  one process on a processor in a time-shared fashion.
- can SCALE much better than shared memory
- explicit message-passing calls could make the code error prone
- data cannot be shared; it must be copied (send/recv). However, there is
  no need for special constructs (e.g. semaphores) for accessing shared data
  ===> increased performance
- Big Advantage: Message-Passing can be used on a Network of Workstations

ONE OTHER APPROACH: Distributed Shared Memory or Shared Virtual Memory
KSR1 uses this technique with distributed caches
(* T Figure 1.5 *)

----------------------------------------------------
1.2.4. MIMD vs. SIMD
----------------------------------------------------
Flynn's classification:
  MIMD: Multiple instruction stream-multiple data stream
  SIMD: Single instruction stream-multiple data stream

  MIMD: shared memory and message-passing architectures described so far
  SIMD: synchronous, instructions are broadcast, single control
        (e.g. low-level image processing operations)

Another Classification:
-----------------------
  MPMD: Multiple Program Multiple Data   (* T Figure 1.6 *)
  SPMD: Single Program Multiple Data (each PE executes the same program)

----------------------------------------------------------------------
1.3. ARCHITECTURAL FEATURES
----------------------------------------------------------------------
Static Interconnection Networks: direct fixed links between nodes

(* T Figure 1.8 *) shows a switch that enables packets to be routed without
the processor being disturbed.

- Bit-serial vs. bit-parallel (32 wires for a 32-bit word --> expensive)

Network Criteria:
  Bandwidth (bits/sec.)
  Network Latency - the time to make a message transfer through the network
  Comm. Latency - total time to send a message including software overheads
  Message latency/startup-time - time to send a zero-length message
  Diameter - minimum # of links between the two farthest nodes in the network
    - Diameter is useful in calculating lower bounds for certain
      algorithms (e.g. sorting, broadcasting)
  Bisection Width - # of wires that must be cut to divide the network into
    two equal parts. Also useful in calculating lower bounds

             Completely-Connected   Ring   2D Mesh         Tree
  Diameter:  1                      n/2    2(sqrt(n)-1)    2(height)

(* T Figure 1.10 *) (* T Figure 1.11 *) (* T Figure 1.12 *)

-----------------
HYPERCUBE NETWORK
-----------------
- other names for hypercube : cosmic cube, n-cube, binary n-cube, etc.
- dimension = d =====> # of processors = P = N = 2^d
- Hypercube topology has excellent mapping capabilities, i.e. many
  topologies such as ring, mesh, tree can be easily embedded into hypercubes
- Can be constructed recursively: (* T Figure 1.13 *) (* T Figure 1.14 *)

  d=0   O

  d=1   O-----O

  d=2   O-----O
        |     |
        O-----O

  d=3     O-----O
         /|    /|
        O-----O |
        | O---|-O
        |/    |/
        O-----O

Labelling the Nodes of a Hypercube
----------------------------------
Nodes of a d-dimensional hypercube can be labelled from 0 to (2^d - 1) in
such a way that there is an edge between any two vertices if and only if
the binary representations of their labels differ in exactly one bit.

Example: d=3
--------
      100 O------O 101
         /|     /|
    000 O------O |
        | O----|-O 111
        |/     |/
    010 O------O 011

  (unlabelled corners: front top right is 001, back bottom left is 110)

-------------------------------------------
IMPORTANT PROPERTIES OF A HYPERCUBE NETWORK
-------------------------------------------
(1) binary labels of neighbor PEs differ in only one bit
(2) Each PE is connected to (d = logN) other PEs (logs are base 2)
(3) A d-dimensional hypercube can be partitioned into two
    (d-1)-dimensional hypercubes: select any bit position; those PEs which
    have 0's at that bit position make up one partition and the others
    form the second partition
(4) Q: How do you find the distance between Pi and Pj in a hypercube ?
    A: # of bit positions in which i and j differ

       Example:      10011
                EXOR 01001
                ---------------
                     11_1_      Distance=3

       (HAMMING DISTANCE = Bitwise EXOR and then count the number of 1's)
(5) Diameter = d = logN

E-cube Routing: compute (Pi EXOR Pj) and, going from left to right, change
  each bit of the current node label that has a 1 in the result; every
  change moves the message across one link

MINIMAL ROUTING - always selects the shortest path
  (e.g. E-cube routing for hypercubes, XY-Routing for a 2D Mesh)

----------------------------------------------------
1.3.2 EMBEDDING
----------------------------------------------------
MOTIVATION : If the LOGICAL comm. structure used in the algorithm matches
the PHYSICAL communication structure of the multicomputer topology, then
performance is enhanced.

FOR EXAMPLE: The logical comm. structure of a PIPELINE is a LINE TOPOLOGY.
It is expected to map perfectly onto a physical line topology. It can also
easily map onto a Ring, 2D-MESH, or a HYPERCUBE, because a LINE TOPOLOGY
can be embedded into these topologies.

Embedding a RING into a TORUS: (* T Figure 1.15 *)

- Let G(V,E) and H(V',E') be two undirected graphs that model two sets of
  interconnected processors. An EMBEDDING of G into H is a mapping of the
  vertices of G into the vertices of H and of the edges of G into simple
  paths of H. |V'| >= |V| must hold.

  EXAMPLE:
       G-graph          H-graph
      1------2             a
      |      |            /|\
      |      |           / | \
      3------4          b  c  d

  One embedding of G into H is : 1 --> a, 2 --> b, 3 --> c, 4 --> d

- DILATION : the length of the longest path in H onto which any edge of G
  is mapped
  DILATION = 2 (edge 3-4 is mapped onto the path c-a-d)
- CONGESTION : the largest number of mapped paths that share one link of H
  CONGESTION = 2 (Two paths cross over link a-b: (1-2) and (2-4) )
- EXPANSION = the ratio |V'|/|V| = 4/4 = 1
- The Hypercube has the important property that many topologies discussed
  so far can be embedded in it.

GRAY CODE
---------
a sequence of n-bit binary numbers such that any two successive numbers
differ in exactly one bit and all n-bit binary numbers are represented.
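
The Gray code definition above can be checked mechanically. The following is a minimal sketch, not part of the original notes: it assumes the standard closed form g(i) = i XOR (i >> 1), which produces one particular Gray code (the binary reflected one), and the function names gray_code and hamming_distance are hypothetical. It checks that successive codes, including the wrap-around from the last code back to the first, are at Hamming distance 1. By the hypercube labelling rule, each such pair is joined by a link, so the sequence traces a ring of 2^n nodes through the n-cube.

```python
def gray_code(n):
    """Return the 2**n n-bit Gray codes as strings, using g(i) = i XOR (i >> 1).

    The closed form is an assumption of this sketch; the notes do not state it.
    """
    return [format(i ^ (i >> 1), "0{}b".format(n)) for i in range(2 ** n)]


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


codes = gray_code(3)
print(codes)

# Successive codes, including the wrap-around pair, must differ in exactly
# one bit, i.e. every consecutive pair is connected by a hypercube link.
ring_ok = all(
    hamming_distance(codes[i], codes[(i + 1) % len(codes)]) == 1
    for i in range(len(codes))
)
print(ring_ok)
```

The same hamming_distance helper reproduces the distance example from property (4): the labels 10011 and 01001 differ in 3 positions.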
BINARY REFLECTED GRAY CODE (Recursive Defn.):
--------------------------------------------
  1-bit : 0 1
  2-bit : 00 01 11 10
  3-bit : 000 001 011 010 110 111 101 100

  G(n+1) = { 0 G(n), 1 R[G(n)] }   where R[] is the reverse operator

- Gray codes allow us to map rings whose lengths are powers of two onto
  hypercubes.
- We can map a length k ring into an n-cube if k is even and 4 <= k <= 2^n

  CONSTRUCTION : Let m=k/2, and n=Ceil{log(k)}.
                 Take the first m elements of G(n-1) and call it G(n-1)[m]
                 Then form { 0 G(n-1)[m], 1 R{G(n-1)[m]} }

- FINALLY, given a LINEAR ARRAY of arbitrary length k, the smallest
  dimension n-cube into which it can be mapped is n = Ceil{log(k)}

EMBEDDING A MESH INTO A HYPERCUBE
----------------------------------------------------
(* T Figure 1.16 *)
Consider a 2-dimensional mesh: 4x8. This can be embedded into a hypercube
of dimension log4+log8 = 2+3 = 5 as shown below. Rows and columns are
labelled with Gray codes; a node's label is its row label a1a2 followed by
its column label b1b2b3:

  10    .    .    .    .    .     .    .    .
  11    .    .    .    .  11 110  .    .    .
                          a1a2 b1b2b3
  01    .    .    .    .    .     .    .    .
  00    .    .    .    .    .     .    .    .
       000  001  011  010  110   111  101  100

THEOREM-1: Any 2D mesh of size (m1 x m2) can be embedded in a
=========  d-dimensional hypercube where d = ceil[log(m1)] + ceil[log(m2)]

THEOREM-2: Any 3D mesh of size (m1 x m2 x m3) can be embedded in a
           d-dimensional hypercube where
           d = ceil[log(m1)] + ceil[log(m2)] + ceil[log(m3)]

EMBEDDING A BINARY TREE INTO A MESH OR A HYPERCUBE
----------------------------------------------------
(* T Figure 1.17 *)
- Is it possible to map an (n-1)-node complete binary tree onto an n-node
  hypercube? How about a 2n-node hypercube?

----------------------------------------------------
1.3.3. Communication Methods
----------------------------------------------------
Circuit switching: establish a path from the source to the destination and
  maintain all the links in the path until the message is delivered
  (e.g. telephone system)

Packet switching: divide the message into "packets", each of which includes
  the source and dest. addresses for routing, and deliver the message in
  packets.
  A packet remains in a buffer if blocked from moving forward to the
  next node.

Store-and-forward packet switching: the entire message is stored in
  intermediate buffers before being forwarded to the next node

Virtual cut-through: if the outgoing link is available, the message is
  immediately passed forward without being stored in the nodal buffer.
  However, if the path is blocked, storage is needed for the complete
  message/packet being received.

Wormhole Routing: is a type of Cut-Through Routing. The message is
  communicated in 'Flits' which are pipelined through the network.
  Uses LESS BUFFER SPACE (just for 1 flit) and it is FASTER. However, it is
  necessary to reserve the complete path for the message as the flits are
  linked, i.e. other packets cannot be interleaved.
  (* T Figure 1.18 *)

PERFORMANCE ANALYSIS:
------------------------
  message-length = L             Bandwidth = B         no. of links = l
  Length of control packet = Lc  header-length = Lh    flit-length = Lf

  Communication Latencies:
    Circuit switching:                   (Lc/B)*l + (L/B)
    Store-and-forward Packet switching:  (L/B)*l
    Virtual cut-through:                 (Lh/B)*l + (L/B)
    Wormhole Routing:                    (Lf/B)*l + (L/B)

- If the length of a flit is much less than the total message, the latency
  of wormhole routing will be approximately constant irrespective of the
  length of the route.
  (* T Figure 1.20 *)

DEADLOCK can occur in both store-and-forward and wormhole networks.
Both E-cube and XY-Routing algorithms are DEADLOCK FREE.
(* T Figure 1.21 *)

-------------------
1.3.4. Input/Output
-------------------
In multicomputers, disks can be attached to individual processors/memories.
However, such single connections become significant bottlenecks if
processors need to access each other's disk memories. One way to alleviate
this problem is to provide multiple paths to the disk memory from various
processors, although this does not scale.

----------------------------------------------------------------------
1.4. NETWORKED COMPUTERS AS A MULTICOMPUTER PLATFORM
----------------------------------------------------------------------
NOWs: Networks of Workstations
COWs: Clusters of Workstations

Key advantages:
  1. readily available at low cost
  2. latest processors can easily be incorporated into an existing system
  3. existing software can be used or modified

MPI: Message-Passing Interface (PVM: Parallel Virtual Machine)

Typical Communication medium: Ethernet (* T Figure 1.23 *)
- A file server holds all the files of the users and the system utilities

Ethernet format: packets (frames) are used. (* T Figure 1.24 *)
- 10 Mbits/sec, 100 Mbits/sec, or Gigabit Ethernet are available
- A collision may occur when more than one packet is transmitted
  simultaneously. The individual packets need to be retransmitted after
  random intervals (in compliance with IEEE standard 802.3)
- Basic Ethernet Latency using TCP/IP = 500 microsec. (usec)
- We will be using a NOW platform with a 100 Mbits/sec Ethernet Switch
  PDC Cluster: http://vision1.cs.umr.edu/pdc/

----------------------------------------------------------------------
1.5. POTENTIAL FOR INCREASED COMPUTATIONAL SPEED
----------------------------------------------------------------------
Tasks <====> Processes <====> granularity

- Need increased granularity for better performance.
  Sometimes increased granularity can be achieved only at the expense of a
  reduced number of processes. For good performance, we need to keep both
  granularity and the number of concurrent processes high.
- Since communication comes as an overhead, we need to keep the
  COMPUTATION/COMMUNICATION ratio high.
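
To see why the computation/communication ratio matters on this kind of platform, the sketch below plugs the figures quoted in these notes (500 usec TCP/IP latency, 100 Mbits/sec Ethernet) into a simple linear cost model. The model itself (time = startup + bits/bandwidth) and the name message_time are assumptions made for illustration. The notes do not give this formula, and real networks add contention and protocol overheads that it ignores.

```python
def message_time(n_bits, startup_s, bandwidth_bps):
    """Assumed linear cost model: fixed startup time plus transmission time."""
    return startup_s + n_bits / bandwidth_bps


STARTUP = 500e-6    # 500 usec Ethernet/TCP-IP latency quoted in the notes
BANDWIDTH = 100e6   # 100 Mbits/sec Ethernet quoted in the notes

for n_bytes in (8, 1024, 1024 * 1024):
    t = message_time(8 * n_bytes, STARTUP, BANDWIDTH)
    print(n_bytes, "bytes:", round(t * 1e6, 2), "usec")
```

Under this model the startup time dominates for short messages, so a process should do a substantial amount of computation per message sent. This is the reason for favoring coarse granularity.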
(* T Figure 1.28 *)

                  Time it takes for the execution of the best
                  sequential algorithm to solve Problem X
SPEEDUP FACTOR = ----------------------------------------------------
                  Time it takes for the execution of the proposed
                  parallel algorithm to solve X

or

                  Number of computational steps using one processor
SpeedupFactor  = ----------------------------------------------------------
                  Number of parallel computational steps using n processors

Maximum Speedup = n (# of processors) (Linear Speedup)

Superlinear Speedup, where S(n) > n, may be seen on occasion
(e.g. branch-and-bound algorithms, or when the architecture favors the
parallel version)

Causes of Overhead in parallel code:
------------------------------------
1. Strictly sequential portions in the code force some processors to idle
2. Extra computations in the parallel version not appearing in the seq. code
3. Communication time for sending messages

What is the maximum speedup achievable for a program? (* T Figure 1.29 *)
---------------------------------------------------------------------------
If   f  = the fraction of "Non-Parallelizable" operations in a program,
     P  = no. of processors
     Ts = Total sequential time
Then:
     Speed-Up <= Ts/(f*Ts + (1-f)*Ts/P) = 1/(f + (1-f)/P)

          limit
         =====>       S = 1/f        AMDAHL's LAW
     (P-->infinity)

Example:
--------
10% of a program's code cannot be parallelized at all and the remaining 90%
can be perfectly parallelized. If we use P=50 processors to run this
program in parallel, what would be the maximum speedup?

Solution: Speed-Up = 1/(f + (1-f)/P) = 1/(0.1 + 0.9/50) = 8.5

(* T Figure 1.30 *)

Efficiency = Speedup/(# of PEs) = (Seq. execution Time Ts)/(Cost in Parallel)

COST = (execution time)*(number of PEs used)
  COST_seq = Ts
  COST_par = Tp * P

SCALABILITY: hardware scalability vs. algorithmic scalability

Hardware Scalability: increase in size ====> proportional increase in perf.
  This very much depends on the interconnection network. Communication
  latency should remain the same as we add more processors.
  Hard to achieve.

Algorithmic Scalability: if we increase the problem size (not necessarily
  the input size, n, but rather the number of computational steps), can we
  increase the number of PEs proportionally and obtain the same efficiency?

COUNTERARGUMENT to AMDAHL's LAW:
  The ratio f is not static; it depends on the problem size.
  For many scientific problems, the parallel part of a program scales up
  when the problem size is increased, while the sequential part remains
  almost constant. This implies that f becomes smaller as we increase the
  problem size.

----------------
Gustafson's Law
----------------
If the parallel part of a program scales up when the problem size n is
increased and the sequential part remains almost constant, then we can make
the following deductions (with the parallel execution time normalized to 1):

  s = the fraction of "Non-Parallelizable" operations in a program,
  P = no. of processors

  Speed-Up = Tseq / Tpar = (s + (1-s)*P) / 1 = P - s*(P-1)

  For s --> 0,  Speed-Up = P  (linear speedup!)
  For large P,  Speed-Up is approximately P*(1-s), which keeps growing
                linearly with P (it scales!)
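
The contrast between the two laws is easiest to see numerically. The sketch below is only an illustration under the formulas given in these notes: Amdahl's bound 1/(f + (1-f)/P) for a fixed problem size and Gustafson's scaled speedup P - s*(P-1). The function names amdahl_speedup and gustafson_speedup are hypothetical. Using the same value 0.1 for f and for s is a simplification, since f is measured on the sequential run and s on the scaled parallel run.

```python
def amdahl_speedup(f, p):
    """Fixed-size speedup bound from the notes: 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)


def gustafson_speedup(s, p):
    """Scaled speedup from the notes: p - s * (p - 1)."""
    return p - s * (p - 1)


# Same serial fraction in both formulas, purely for comparison.
for p in (10, 50, 1000):
    print(p, round(amdahl_speedup(0.1, p), 2), round(gustafson_speedup(0.1, p), 2))
```

For P = 50 this gives roughly 8.47 under Amdahl's law (the 8.5 of the worked example) against 45.1 under Gustafson's law. As P grows, the first value saturates below 1/f = 10, while the second keeps growing with P.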