----------------------------------------------------------------------
2.3. EVALUATING PARALLEL PROGRAMS
----------------------------------------------------------------------

- Time analysis is usually done assuming that all the PEs are identical
  and operate at the same speed.

- When q messages are sent, each containing n data items:

      Tcomm = q*(Tstartup + n*Tdata)

- The time complexity of a parallel program is determined by the
  slowest processor. It is the sum of the complexity of the
  computation and the complexity of the communication.

- Cost-optimality

      COST(seq) = Sequential time complexity
      COST(par) = (# of PEs) * (Parallel time complexity)

  If COST(seq) = COST(par), then the program is COST OPTIMAL.
  Typically, due to parallel overhead, COST(par) > COST(seq).

- Whereas time complexity is widely used for seq. program analysis,
  it is much less useful for evaluating the potential performance of
  parallel programs. The Big-Oh (O(.)) and other complexity notations
  use asymptotic methods, which may not be practical in the case of
  parallel programs because P can never be very large. However, they
  are still useful for finding the theoretical upper/lower bounds on
  the complexity of certain parallel algorithms.

  QUESTION: What is the theoretical minimum time for a broadcast
  --------  operation?

            (Assumptions: store-and-forward routing is used,
                          message size m = 1 and # of processors = P,
                          PEs can only send to one PE at a time,
                          no broadcast line exists)

  ANSWER  : MAX ( diameter, logP )

            (* Note that this is the absolute lower bound, i.e. it is
               impossible to beat. However, it may not always be
               possible to achieve this bound. The best broadcast
               time depends on the topology. *)

- BROADCAST on a HYPERCUBE: Time = O(logN)
      (* T Figure 2.19 *)
      (* T Figure 2.20 *)

- BROADCAST on a MESH: Time = O(2(N-1)), i.e. O(N)
      (* T Figure 2.21 *)

- How about broadcasting on a NOW (Network of Workstations)?

----------------------------------------------------------------------
2.4. DEBUGGING AND EVALUATING PARALLEL PROGRAMS
----------------------------------------------------------------------

xMPI: a visualization tool for analyzing the run-time behaviour of
      parallel programs (available on PDC cluster)

Space-time diagrams:
      (* T Figure 2.25 *)

Debugging Strategies:
  1- Run the program as a single process and debug it like a
     sequential program.
  2- Execute using 2-4 processes on a single computer. Check that
     messages are being sent and delivered correctly.
  3- Execute using 2-4 processes across several computers. Check that
     the program is running correctly.

Measuring Execution Time:
--------------------------
      startwtime = MPI_Wtime();
      ...........
      endwtime = MPI_Wtime();
      printf("Elapsed time = %f\n", endwtime-startwtime);

- Communication latency can be measured using the Ping-Pong method:
  send a message back and forth repeatedly between two processes,
  time the whole exchange, and then take the average.

Profiling
---------
      (* T Figure 2.26 *)

Profiling can be used to identify "hot spots": places in a program
that are executed many times. It is wise to optimize these parts of
the program first.

Tips for Optimizing the Parallel Code
======================================
  1- The number of processes can be changed (lowered) to increase the
     process granularity. Efficiency gets better; however, speedup
     suffers.
  2- The amount of data in the messages can be increased to lessen the
     effects of startup times.
  3- Communication and computation can be overlapped (latency hiding).
  4- Perform a CRITICAL PATH analysis on the program, i.e. find the
     longest path that determines (dominates) the overall execution
     time.
  5- Try to keep as much of the data as possible in processor caches.
     This can sometimes be achieved by simply reordering the memory
     requests in the program.
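
Tip 2 follows directly from the communication model of Section 2.3,
Tcomm = q*(Tstartup + n*Tdata): packing the same data into fewer,
larger messages pays the startup time fewer times. The sketch below
only evaluates that formula. The function names (comm_time,
transfer_time) and the numbers in the usage note are illustrative
assumptions, not measurements from any machine.

```c
#include <assert.h>

/* Communication time model from Section 2.3:
 *   Tcomm = q * (Tstartup + n * Tdata)
 * q = number of messages, n = data items per message. */
double comm_time(long q, long n, double t_startup, double t_data)
{
    assert(q >= 0 && n >= 0);
    return (double)q * (t_startup + (double)n * t_data);
}

/* Time to move total_items data items when each message carries
 * items_per_msg items. For simplicity this sketch assumes that
 * total_items is an exact multiple of items_per_msg. */
double transfer_time(long total_items, long items_per_msg,
                     double t_startup, double t_data)
{
    assert(items_per_msg > 0 && total_items % items_per_msg == 0);
    long q = total_items / items_per_msg;
    return comm_time(q, items_per_msg, t_startup, t_data);
}
```

With hypothetical values Tstartup = 100 and Tdata = 1 (arbitrary time
units), transfer_time(1000, 1, 100.0, 1.0) gives 101000, whereas
transfer_time(1000, 1000, 100.0, 1.0) gives 1100: the data term is the
same, only the number of startups differs.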
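
The critical path of Tip 4 can be computed once the program is
described as a task graph. The following is a minimal sketch under
stated assumptions: tasks are numbered in topological order (every
dependency goes from a lower to a higher index), each task has a known
execution time, and communication delays on the edges are ignored.
The names critical_path, cost, dep and MAX_TASKS are hypothetical.

```c
#include <assert.h>

#define MAX_TASKS 16

/* Length of the critical (longest) path through a task graph.
 * cost[i]   : execution time of task i
 * dep[i][j] : nonzero if task j must wait for task i (i < j)
 * Assumption: tasks are numbered in topological order. */
double critical_path(int n, const double cost[], int dep[][MAX_TASKS])
{
    double finish[MAX_TASKS];   /* earliest finish time of each task */
    double longest = 0.0;

    assert(n > 0 && n <= MAX_TASKS);
    for (int j = 0; j < n; j++) {
        double start = 0.0;
        /* A task can start only after all its predecessors finish. */
        for (int i = 0; i < j; i++)
            if (dep[i][j] && finish[i] > start)
                start = finish[i];
        finish[j] = start + cost[j];
        if (finish[j] > longest)
            longest = finish[j];
    }
    return longest;
}
```

No number of processors can make the program finish faster than the
value returned here, so tasks on this path are the first candidates
for optimization.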
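
As a check on the O(logN) hypercube broadcast bound quoted in Section
2.3, the schedule can be simulated without any message passing. This
is a sketch, assuming the usual dimension-by-dimension scheme: in step
k, every node that already holds the message sends it to its neighbour
across dimension k, so each PE sends to only one PE per step. The
function name hypercube_broadcast and the recv_step array are
illustrative, not part of MPI.

```c
#include <assert.h>

/* Simulate a broadcast from node 0 on a d-dimensional hypercube
 * (2^d nodes). recv_step[i] is set to the step in which node i
 * receives the message (0 for the root). Returns the number of
 * communication steps used. */
int hypercube_broadcast(int d, int recv_step[])
{
    assert(d >= 0 && d < 20);
    int n = 1 << d;
    int steps = 0;

    for (int i = 0; i < n; i++)
        recv_step[i] = -1;          /* -1: message not yet received */
    recv_step[0] = 0;               /* root holds the message */

    for (int k = 0; k < d; k++) {
        steps++;
        for (int i = 0; i < n; i++) {
            /* Only nodes that had the message before this step send. */
            if (recv_step[i] != -1 && recv_step[i] < steps) {
                int partner = i ^ (1 << k);   /* neighbour in dim k */
                if (recv_step[partner] == -1)
                    recv_step[partner] = steps;
            }
        }
    }
    return steps;
}
```

The number of informed nodes doubles in every step, so N = 2^d nodes
are reached in d = logN steps, which matches both the diameter of the
hypercube and the logP term of the lower bound.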