----------------------------------------------------------------------
2.2. USING WORKSTATION CLUSTERS
----------------------------------------------------------------------
2.2.3. MPI: Message Passing Interface
=====================================================
Barriers to the widespread use of parallelism have generally been in
three major areas:

HARDWARE: faster computers require faster communication networks.
          (Recent progress in interconnection network (IN) technology
          is satisfactory.)
          Work to improve I/O performance is still continuing.

ALGORITHMS: There has been so much research in this area that almost
            all problems in computer science have been parallelized. 
  
SOFTWARE: The biggest obstacle to the spread of parallel computing is
          the problem of inadequate software.

Let's elaborate on this statement.
- Compilers that automatically parallelize sequential algorithms
remain limited in their applicability. Although these compilers work
well on certain problems, the best performance is still obtained when
the programmer supplies the parallel algorithm.
Perhaps the right approach is to make parallel programs easier to
write rather than to automate the parallelization process. So, we
need to provide useful, efficient libraries that work on multiple
platforms. The properties we should look for in any parallel language
or library are:
   - portability
   - efficiency
   - expressiveness

Most existing languages/libraries emphasize one property at the
expense of the others, whereas MPI tries to achieve a good balance
among all three.

- MPI is not a new programming language. It is a library of
  routines that can be called from C and Fortran 77 programs. 
- It was developed by an open, international forum consisting of
  representatives from industry, academia, and government laboratories. 
- MPI is based on message passing, one of the most powerful and widely
  used paradigms for programming parallel systems.
- Today, it is the most widely used standard in parallel programming. 

- Overall, it is fair to say that MPI is superior to the existing
languages/libraries when all three properties are considered
together:

   (i) MPI is PORTABLE across a large number of machines
  (ii) Deep involvement of vendors in MPI's definition has ensured
       that vendor-supplied MPI implementations will be EFFICIENT
 (iii) MPI is EXPRESSIVE; i.e. it is designed to be a convenient,
       complete definition of the message-passing model.

- The introduction of MPI makes it possible for developers of parallel
software to write libraries of parallel programs that are both
portable and efficient. 


Parallel Computational Models:
=============================
A parallel computational model is a conceptual view of the types of
operations available to a parallel program. Models can be discussed
along multiple axes:
      shared memory  vs. distributed memory  vs. threads 
      data parallel  vs. control parallel approaches
      SIMD vs. MIMD
      Message passing  vs. uniform memory access vs. NUMA 
      etc.

Advantages of the Message-Passing Model
=======================================
Universality: The message-passing (MP) model fits well on separate
------------  processing elements (PEs) connected by a (fast or slow)
              communication network. Thus, it matches the hardware of
              most of today's parallel supercomputers, as well as the
              networks of workstations (NOWs) that are beginning to
              compete with them. It can also be used on shared-memory
              machines.

Expressivity: it provides the control missing from the data-parallel
------------  and compiler-based approaches. More control over the
              movement of data, etc. Good or bad ?

Ease of debugging: it is easier to write debuggers for the shared
-----------------  memory model, but debugging itself is easier in the
                   MP model because access to memory is explicit.

Performance: Provides scalable memory. Provides more programmer
-----------  control over the locality of memory accesses.
             However, requires faster networks.
 
Quotes from the MPI Book:
=========================
"The primary goal of the MPI specification is to demonstrate that users
need not compromise among efficiency, portability, and functionality.."

"It is an attempt to collect the best features of many existing
message-passing systems, improve them where appropriate, and
STANDARDIZE them... " 

"MPI is a library, not a language..."

"MPI addresses the message-passing model.."



Location of MPI files: /afs/umr.edu/software/mpi109/solaris

===============================
Basic Functions in MPI Library
===============================
The following six MPI functions allow you to write many programs:

         MPI_Init
         MPI_Finalize
         MPI_Comm_size
         MPI_Comm_rank
         MPI_Send
         MPI_Recv

   Additional commonly used MPI functions

         MPI_Bcast
         MPI_Reduce
         MPI_Barrier

For detailed info: http://www.umr.edu/~ercal/387/MPI/qstart.html

-------------------------------------------------------------------

int MPI_Init(argc,argv) -  Initialize the MPI execution environment

INPUT PARAMETERS
     int  *argc   - Pointer to the number of arguments
     char ***argv - Pointer to the argument vector

COMMAND LINE ARGUMENTS
     MPI specifies no command-line arguments but does allow an
     MPI implementation to make use of them.

-------------------------------------------------------------------

int MPI_Finalize() - Terminates MPI execution environment

NOTES: All processes must call this routine before exiting.

-------------------------------------------------------------------

int MPI_Comm_size(comm,size) - Determines the size of the group
                               associated with a communicator
INPUT PARAMETER 
     MPI_Comm comm - communicator (handle)

OUTPUT PARAMETER
      int *size - number of processes in the group

-------------------------------------------------------------------
A COMMUNICATOR is a communication domain that defines a set of
    processes that are allowed to communicate among themselves.
INTRACOMMUNICATOR: allows communication within a group;
                   each PE has a unique rank within a group
  MPI_COMM_WORLD: the predefined intracommunicator containing all
                 processes; used in simple applications for all
                 point-to-point and collective operations.
INTERCOMMUNICATOR: allows communication between groups


-------------------------------------------------------------------
int MPI_Comm_rank(comm, rank) - Determines the rank of the calling
                                process in the communicator
INPUT PARAMETER
    MPI_Comm comm - communicator 

OUTPUT PARAMETER 
    int *rank - rank of the calling process
-------------------------------------------------------------------

int MPI_Send( buf, count, datatype, dest, tag, comm ) 
     Performs a basic (blocking) send. May block until the message has
     been buffered or received; on return, buf can safely be reused.

INPUT PARAMETERS
    void *buf     - initial address of send buffer (choice)
    int count     - number of elements in send buffer 
    MPI_Datatype datatype - datatype of each send buffer element
    int dest      - rank of destination 
    int tag       - message tag (non-negative integer, e.g. 2, i, j;
                    MPI_ANY_TAG is allowed only in a receive)
    MPI_Comm comm - communicator

-------------------------------------------------------------------

int MPI_Recv(buf,count,datatype,source,tag,comm,status) 

INPUT/OUTPUT PARAMETERS
    void *buf  - initial address of receive buffer (OUTPUT)
    int count  - maximum number of elements in receive buffer
    MPI_Datatype datatype - datatype of each receive buffer element
    int source - rank of source
              MPI_ANY_SOURCE means "accept a message from anyone"
    int tag    - message tag; must match the tag used in the send
              MPI_ANY_TAG means "accept a message with any tag value"
    MPI_Comm comm         - communicator
    MPI_Status *status    - status object (OUTPUT)


NOTE: The 'count' argument indicates the maximum length of a message.
      The actual number received can be determined with MPI_Get_count,
       e.g. MPI_Recv(new_par, MAX_PAR, par_type, source, tag, comm, &status);
            MPI_Get_count(&status, par_type, &number);

-------------------------------------------------------------------
int MPI_Isend( buf, count, datatype, dest, tag, comm, request) 
     Performs a non-blocking send. 

int MPI_Irecv(buf,count,datatype,source,tag,comm,request) 
     Performs a non-blocking receive

ASSOCIATED OPERATIONS:  
   MPI_Wait(request, status): Waits for an MPI send or receive to complete
   MPI_Test(request, flag, status): Tests for the completion of a send or receive

-------------------------------------------------------------------


int MPI_Bcast ( buffer, count, datatype, root, comm )
       Broadcasts a message from the process with rank "root" 
       to all other processes of the group.

**All of the processes make the call MPI_Bcast() 
but only the root sends the data while others receive it.

INPUT/OUTPUT PARAMETERS
 void *buffer  - starting address of buffer 
 int count     - number of entries in buffer
 MPI_Datatype datatype - data type of buffer
 int root      - rank of broadcast root 
 MPI_Comm comm - communicator (handle)

ALGORITHM: Implementations typically use a tree-like algorithm for
           the broadcast (the MPI standard does not mandate one).
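
To see why a tree-like broadcast is attractive, the sketch below models
one common scheme, a binomial tree rooted at rank 0, in plain C (no MPI
calls). It is only an illustration: the names bcast_rounds and
bcast_parent are made up for this example, and a real MPI library is
free to use a different algorithm.

```c
/* Number of communication rounds needed to reach nprocs processes.
   In every round, each process that already holds the data forwards
   it to one new process, so the number of holders doubles. */
int bcast_rounds(int nprocs)
{
    int rounds = 0;
    int reached = 1;            /* only the root holds the data at first */
    while (reached < nprocs) {
        reached *= 2;
        rounds++;
    }
    return rounds;
}

/* Rank from which `rank` receives the data when the root is rank 0.
   Returns -1 for the root itself. */
int bcast_parent(int rank)
{
    int high = 1;
    if (rank <= 0)
        return -1;
    while (high * 2 <= rank)    /* highest power of two <= rank */
        high *= 2;
    return rank - high;
}
```

With 6 processes the data reaches everyone in 3 rounds, whereas the
root alone would need 5 consecutive sends; in this scheme rank 5
receives its copy from rank 1.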

-------------------------------------------------------------------
  MPI_Gather - Gathers together values from a group of processes 
  MPI_Scatter - Sends data from one task to all other tasks in a group 
  MPI_Alltoall - Sends data from all to all processes 
  MPI_Reduce - Reduces values on all processes to a single value 
  MPI_Reduce_scatter - Combines values and scatters the results 
  MPI_Scan - Computes the scan (partial reductions) of data 
             on a collection of processes 
  MPI_Barrier - Blocks until all processes have reached this routine. 
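
The phrase "partial reductions" in MPI_Scan may be the least obvious
entry in this list. The sketch below is a sequential model (plain C, no
MPI calls) of what each process would receive from a scan with a sum
operation and one integer per process. The function name model_scan_sum
is hypothetical and exists only for this illustration.

```c
/* contrib[i] is the value contributed by process i.
   result[i] is what process i would receive from an inclusive scan
   with a sum operation: the sum of the contributions of
   processes 0..i. */
void model_scan_sum(const int *contrib, int nprocs, int *result)
{
    int running = 0;
    for (int i = 0; i < nprocs; i++) {
        running += contrib[i];
        result[i] = running;
    }
}
```

For contributions 3, 1, 4, 1 from processes 0..3, the processes would
receive 3, 4, 8, 9 respectively; the last process ends up with the
same value a full reduction would produce.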

-------------------------------------------------------------------
int MPI_Reduce (sendbuf, recvbuf, count, datatype, op, root, comm)

INPUT PARAMETERS
    void         *sendbuf     - address of send buffer 
    int          count        - number of elements in send buffer 
    MPI_Datatype datatype     - data type of elements of send buffer
    MPI_Op       op           - reduce operation (handle)
    int          root         - rank of root process 
    MPI_Comm     comm         - communicator (handle)

OUTPUT PARAMETER
  void      *recvbuf  - address of receive buffer (significant only at root)

MPI_Reduce() combines the operands stored in the memory referenced by
"sendbuf" using operation "op" and stores the result in "*recvbuf" on
process "root". Must be called by all processes in the communicator "comm".
 "count", "datatype", "op", and "root" must be the SAME on each process.

The parameter "op" can be one of the following:

 OPERATION NAME   |      Meaning
-------------------------------------
    MPI_MAX       |      Maximum
    MPI_MIN       |      Minimum
    MPI_SUM       |      Sum
    MPI_PROD      |      Product
    MPI_LAND      |      Logical AND
    MPI_BAND      |      Bitwise AND
    MPI_LOR       |      Logical OR
    MPI_BOR       |      Bitwise OR
    MPI_LXOR      |      Logical exclusive OR
    MPI_BXOR      |      Bitwise exclusive OR
    MPI_MAXLOC    |      Maximum and location of maximum
    MPI_MINLOC    |      Minimum and location of minimum
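
Two points about the reduction are easy to miss: with count > 1 the
operation is applied element by element ACROSS processes (not across
the elements of one buffer), and MPI_MAXLOC works on (value, index)
pairs. The sketch below models both in plain C without any MPI calls;
the names value_index, model_reduce_sum and model_reduce_maxloc are
invented for this illustration.

```c
/* One (value, index) pair, as combined by MPI_MAXLOC / MPI_MINLOC. */
typedef struct {
    double value;
    int    index;
} value_index;

/* Model of a reduction with MPI_SUM. sendbufs holds nprocs rows of
   `count` ints; row p plays the role of the send buffer of process p.
   The result has `count` entries, one per buffer position. */
void model_reduce_sum(const int *sendbufs, int nprocs, int count,
                      int *recvbuf)
{
    for (int i = 0; i < count; i++) {
        recvbuf[i] = 0;
        for (int p = 0; p < nprocs; p++)
            recvbuf[i] += sendbufs[p * count + i];
    }
}

/* Model of MPI_MAXLOC with one pair per process: keeps the largest
   value and, among equal values, the smallest index.
   Requires nprocs >= 1. */
value_index model_reduce_maxloc(const value_index *in, int nprocs)
{
    value_index best = in[0];
    for (int p = 1; p < nprocs; p++) {
        if (in[p].value > best.value ||
            (in[p].value == best.value && in[p].index < best.index))
            best = in[p];
    }
    return best;
}
```

For example, three processes with send buffers {1,2}, {10,20} and
{100,200} would leave {111,222} in the receive buffer at the root.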


-------------------------------------------------------------------
              Get Started With MPI PROGRAMMING
-------------------------------------------------------------------
Example Program to add numbers:    (* T  Figure 2.16 *)
-------------------------------------------------------------------

ANOTHER EXAMPLE: The "hello" program
=============================================
In the following, we build a simple HELLO program step by step. These
are also the general steps for building any simple MPI-based program.

Write the program:
-----------------
Write the program according to the specification. The source of the
"hello" program (also available for download) is:

     /************************/
     /* "Hello" program      */
     /************************/
     #include "mpi.h"
     #include <stdio.h>

     int main(int argc, char **argv)
     {
         int myid, numprocs, resultlen;
         /* MPI_Get_processor_name writes into a caller-supplied buffer */
         char name[MPI_MAX_PROCESSOR_NAME];

         MPI_Init(&argc, &argv);
         MPI_Comm_size(MPI_COMM_WORLD, &numprocs);   /* number of processes */
         MPI_Comm_rank(MPI_COMM_WORLD, &myid);       /* rank of this process */

         if (myid == 0)
             printf("This program is running on %d processes\n", numprocs);

         MPI_Get_processor_name(name, &resultlen);

         printf("From Process-%d: Hello MPI world! I am running on %s \n",
                myid, name);

         MPI_Finalize();

         return 0;
     }

Compile and link the program:
-----------------------------
Use the script file "mpicc" to build the program, e.g. type:

       mpicc -o hello hello.c


For more information about mpicc, refer to the 387 Homepage.



Run the program:
----------------
Use the script file "mpirun" to load and run the "hello" program:

       mpirun -np 6 hello

Here, the "-np 6" option specifies that the program will use 6
processes, and each process will be assigned to a processor. The order
of the processors is taken from a list stored in the "machine.ARCH" file.

A user can create his/her own list for machines and store it in a
file, and then provide its name as an argument to mpirun using the
option "-machinefile". For example, you can create a file called
"mymachines" which contains:

       ultra1.cs.umr.edu
       ultra2.cs.umr.edu
       ultra3.cs.umr.edu
       ultra4.cs.umr.edu       (* do not use "ultra5.cs.umr.edu" *)
       ultra6.cs.umr.edu
       ultra7.cs.umr.edu
       ultra8.cs.umr.edu
       ultra9.cs.umr.edu

When you type "mpirun -np 6 -machinefile mymachines hello",
your program will run on your local machine plus the first five
machines in your list (if they are all available). For more
information about mpirun, refer to the mpirun documentation.

If you run this program on Ultra2.cs.umr.edu, the output of your run
will look like the following:
--------------------------------------------------------------------------
From Process-4: Hello MPI world! I am running on ultra4.yp-server.umr.edu 
From Process-2: Hello MPI world! I am running on ultra2.yp-server.umr.edu 
From Process-3: Hello MPI world! I am running on ultra3.yp-server.umr.edu 
From Process-5: Hello MPI world! I am running on ultra6.yp-server.umr.edu 
From Process-1: Hello MPI world! I am running on ultra1.yp-server.umr.edu 
This program is running on 6 processes
From Process-0: Hello MPI world! I am running on Ultra2.yp-server.umr.edu 
--------------------------------------------------------------------------

Notice that the messages are not printed in any specific order.
This is expected because the processes run at their own speed
and may finish in any order. To force a specific ordering, we need
to use SYNCHRONIZATION primitives.
One such call is MPI_Barrier.

 int MPI_Barrier(comm) - Blocks the caller until all group members
     have called it. The call returns at any process only after all
     group members have entered the call.



For example, in the "hello" program above, if we insert an
MPI_Barrier() call right before the second print statement, we force
process 0 to execute the first print statement before any process
executes the second one:

        /* Modified Hello.c */
        
        /* ......... this part is the same ..... */ 

	 if(myid==0) printf("This program is running on %d processes\n", 
			     numprocs);

	 MPI_Get_processor_name(name, &resultlen); 

         /* force all the processes to meet in this statement */
         MPI_Barrier(MPI_COMM_WORLD);
         
	 printf("From Process-%d: Hello MPI world! I am running on %s \n", myid, name);

        /* ......... this part is the same ..... */ 

One possible output for this new program looks like:
--------------------------------------------------------------------------
This program is running on 6 processes
From Process-3: Hello MPI world! I am running on ultra3.yp-server.umr.edu 
From Process-2: Hello MPI world! I am running on ultra2.yp-server.umr.edu 
From Process-5: Hello MPI world! I am running on ultra6.yp-server.umr.edu 
From Process-1: Hello MPI world! I am running on ultra1.yp-server.umr.edu 
From Process-0: Hello MPI world! I am running on ercal.yp-server.umr.edu 
From Process-4: Hello MPI world! I am running on ultra4.yp-server.umr.edu 
--------------------------------------------------------------------------

Notice that the second print statement still did not get printed in order. 

Exercise: Try to get the messages to be printed in ascending order
          of process number.

 Do you think the following will work?
 -------------------------------------
 ......... 

 if(myid==0) printf("This program is running on %d processes\n", numprocs);
 MPI_Get_processor_name(name, &resultlen); 
 if(myid==1) printf("From Process-%d: Hello MPI world! I am running on %s \n", myid, name);
 if(myid==2) printf("From Process-%d: Hello MPI world! I am running on %s \n", myid, name);

 .........
 ......... 




 ******************************************************************
 *  Project 1: Computing Prime Numbers                            *
 ******************************************************************

PRIME NUMBER SIEVE:
-------------------

Problem: Find all the prime numbers between 1 and N.

Sequential Algorithm:
---------------------

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
  p 

Next_prime = 2 ;  Strike out Multiples of 2 as Non-Prime
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
  p p   *   *   *    *     *     *     *     *     *     *     *     *     *

Next_prime = 3 ;  Strike out Multiples of 3 as Non-Prime
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
  p p p *   *   * *  *     *     *  *  *     *     *  *  *     *     *  *  *

Next_prime = 5 ;  Strike out Multiples of 5 as Non-Prime
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
  p p p * p *   * *  *     *     *  *  *     *     *  *  *     *  *  *  *  *

Next_prime = 7 ;  Strike out Multiples of 7 as Non-Prime
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
  p p p * p * p * *  *     *     *  *  *     *     *  *  *     *  *  *  *  *

Next_prime = 11 ; Strike out Multiples of 11 as Non-Prime
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
  p p p * p * p * *  *  p  *     *  *  *     *     *  *  *     *  *  *  *  *

Keep repeating until the entire list is scanned!
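
The strike-out procedure traced above can be sketched in C as follows.
This is a minimal version under a few assumptions of its own: the
function name sieve and the composite[] array are illustrative, the
caller supplies a buffer of N+1 bytes, and 1 is treated as a
non-prime, following the usual definition.

```c
#include <string.h>

/* Sets composite[k] = 1 for every non-prime k in 0..n and returns the
   number of primes in 2..n. The caller provides n+1 bytes.
   Requires n >= 1. */
int sieve(int n, char *composite)
{
    int count = 0;
    memset(composite, 0, (size_t)n + 1);
    composite[0] = composite[1] = 1;           /* 0 and 1 are not prime */
    for (int p = 2; p <= n; p++) {
        if (composite[p])
            continue;                          /* struck out earlier */
        count++;                               /* p is the next prime */
        for (long m = 2L * p; m <= n; m += p)  /* strike out 2p, 3p, ... */
            composite[m] = 1;
    }
    return count;
}
```

Calling sieve(28, buf) on the list used in the trace leaves the nine
numbers 2, 3, 5, 7, 11, 13, 17, 19 and 23 unmarked.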

Q: How many steps in total?
A: If p_i denotes the i-th prime number, then approximately

     N/p_1 + (N-p_1)/p_2 + (N-p_2)/p_3 + ...... 


Q: How do you perform this operation in parallel?
A: Divide the range 1-N equally among the processors and let
   each processor strike out the Non-Primes in its own region of numbers.


TASK PARTITIONING
-----------------

   1 ___ N/p ___ 2N/p ___________________________________________ N
   
   |       |       |       |   ......................     |       |
   |       |       |       |   ......................     |       |
   |  P0   |  P1   |  P2   |                              |  Pn   |
   |       |       |       |   ......................     |       |
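
One way to compute the boundaries in the picture above is sketched
below. The helper name block_range is hypothetical; it assumes that
the processes are numbered 0..nprocs-1 and that the numbers 1..N are
split into contiguous blocks whose sizes differ by at most one.

```c
/* Computes the inclusive range [*lo, *hi] of the numbers 1..n owned
   by process `rank` (0 <= rank < nprocs). The blocks are contiguous,
   cover 1..n exactly once, and differ in size by at most one. */
void block_range(int n, int nprocs, int rank, int *lo, int *hi)
{
    *lo = (int)((long)rank * n / nprocs) + 1;
    *hi = (int)((long)(rank + 1) * n / nprocs);
}
```

For N = 28 and 4 processes this gives 1..7, 8..14, 15..21 and 22..28.
In an MPI program each process would call the helper with its own rank
from MPI_Comm_rank and the process count from MPI_Comm_size.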


Q: Do we need to strike out the multiples of every prime number 
   between 1 and N in order to find all the primes between 1 and N?
A: No. Here is why:

   1 ________ Sqrt(N)_______________________________________N

   1 ________   15  _______________________________________225


Consider a number Y such that   Sqrt(N) < Y < N   

   1 ________ Sqrt(N)__________________________Y_____________N

If Y is not PRIME, then it can be written as   Y = Y1 * Y2   (with
Y1, Y2 > 1), where at least one of Y1 or Y2 must be less than or equal
to Sqrt(N). Otherwise, both Y1 and Y2 are greater than Sqrt(N) and
Y = Y1*Y2 > N, which violates the initial assumption Y < N.

LEMMA:  Every non-prime number between Sqrt(N) and N 
        is a multiple of a number between 2 and Sqrt(N). 

Therefore, in the Prime Number algorithm, it is sufficient to generate
only multiples of the numbers between 2 and Sqrt(N) to find all the
Non-primes between 2 and N.
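
The effect of this observation on the sequential algorithm can be
sketched as follows. The function names sieve_upto_sqrt and
count_primes are illustrative only. As an extra refinement that is not
part of the argument above, striking starts at p*p, since every smaller
multiple of p also has a smaller prime factor and has already been
struck out.

```c
#include <string.h>

/* Strikes out multiples only of primes p with p*p <= n.
   Fills composite[0..n] (caller provides n+1 bytes, n >= 1) and
   returns how many sieving primes were needed. */
int sieve_upto_sqrt(int n, char *composite)
{
    int sieving_primes = 0;
    memset(composite, 0, (size_t)n + 1);
    composite[0] = composite[1] = 1;           /* 0 and 1 are not prime */
    for (long p = 2; p * p <= n; p++) {
        if (composite[p])
            continue;
        sieving_primes++;
        for (long m = p * p; m <= n; m += p)   /* start at p*p */
            composite[m] = 1;
    }
    return sieving_primes;
}

/* Counts the entries left unmarked, i.e. the primes in 2..n. */
int count_primes(int n, const char *composite)
{
    int count = 0;
    for (int k = 2; k <= n; k++)
        if (!composite[k])
            count++;
    return count;
}
```

For N = 225 only the six primes up to 15 (2, 3, 5, 7, 11, 13) generate
multiples, yet all 48 primes up to 225 are found.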

EASY TASK PARTITIONING
-----------------------

   1 _______ Sqrt(N)_______________________________________N
   
   |             |     |     |     |   .........     |     |
   |     P0      |     |     |     |   .........     |     |
   |   PARENT    | P1  | P2  | P3  |                 | Pn  |
   |   PROCESS   |     |     |     |                 |     |

 **This may not result in a load-balanced distribution

Algorithm: 1) Parent process (P0) finds the next prime number, p, in
              its own region and broadcasts it to all the other
              processors
           2) Each processor (including P0) strikes out the multiples of p
              in its own region
           3) Steps 1-2 repeat until P0 has covered its whole region,
              i.e. reached Sqrt(N).
           4) Processors report all those numbers which are not deleted
              in their region as prime numbers. 
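
Before writing the MPI version, it can help to check the logic of
steps 1-4 with a sequential model in which a loop over "ranks" stands
in for the processes. The sketch below does this in plain C; the names
first_multiple, easy_region and model_easy_sieve are invented here, and
the way the range above Sqrt(N) is cut into equal blocks is an
assumption. In a real program the inner loop over ranks disappears:
each process runs step 2 for its own region only, and p arrives via
MPI_Bcast.

```c
#include <string.h>

/* Smallest multiple of p that is >= lo and is not p itself. */
static int first_multiple(int p, int lo)
{
    int m = ((lo + p - 1) / p) * p;
    return (m < 2 * p) ? 2 * p : m;
}

/* Region [*lo, *hi] of process `rank` under the "easy" partitioning:
   P0 owns 1..root; P1..P(nprocs-1) share root+1..n in equal blocks.
   With a single process, P0 owns everything. */
static void easy_region(int n, int root, int nprocs, int rank,
                        int *lo, int *hi)
{
    if (nprocs == 1) { *lo = 1; *hi = n;    return; }
    if (rank == 0)   { *lo = 1; *hi = root; return; }
    long span = n - root;
    *lo = root + 1 + (int)((rank - 1) * span / (nprocs - 1));
    *hi = root + (int)(rank * span / (nprocs - 1));
}

/* Sequential model of steps 1-4. Fills composite[0..n] (n+1 bytes,
   n >= 1) and returns the number of primes reported. */
int model_easy_sieve(int n, int nprocs, char *composite)
{
    int root = 1, count = 0;
    while ((long)(root + 1) * (root + 1) <= n)
        root++;                               /* root = floor(Sqrt(n)) */

    memset(composite, 0, (size_t)n + 1);
    composite[0] = composite[1] = 1;

    for (int p = 2; p <= root; p++) {         /* step 3: stop at Sqrt(n) */
        if (composite[p])
            continue;
        /* step 1: P0 has found the next prime p and "broadcasts" it */
        for (int rank = 0; rank < nprocs; rank++) {
            int lo, hi;
            easy_region(n, root, nprocs, rank, &lo, &hi);
            /* step 2: strike out the multiples of p in this region */
            for (int m = first_multiple(p, lo); m <= hi; m += p)
                composite[m] = 1;
        }
    }
    for (int k = 2; k <= n; k++)              /* step 4: report survivors */
        if (!composite[k])
            count++;
    return count;
}
```

For N = 225 and 4 processes, P0 owns 1..15 and the other three own
16..85, 86..155 and 156..225; the model reports 48 primes.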

COMPLEX TASK PARTITIONING
--------------------------

   1 _______ Sqrt(N)_______________________________________N
   
   |     |     |     |   .......................     |     |
   |     |     |     |   .......................     |     |
   | P0  | P1  | P2  |                               | Pn  |
   |     |     |     |                               |     |


    ***Equal Load on each PE but more complex to program
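
One plausible source of the extra programming effort is that, with
equal blocks, the numbers up to Sqrt(N) need not all belong to P0, so
more than one process may have to announce sieving primes. The sketch
below only counts how many processes are affected; owner_of and
sieving_owners are hypothetical helper names, and the block boundaries
(process r owns r*N/nprocs + 1 .. (r+1)*N/nprocs, using integer
division) are an assumption.

```c
/* Owner of number x (1 <= x <= n) when 1..n is cut into nprocs
   contiguous blocks of nearly equal size. Returns -1 if x > n. */
int owner_of(int x, int n, int nprocs)
{
    for (int r = 0; r < nprocs; r++)
        if (x <= (long)(r + 1) * n / nprocs)
            return r;
    return -1;
}

/* Number of processes whose block overlaps 1..Sqrt(n), i.e. that may
   hold a sieving prime. */
int sieving_owners(int n, int nprocs)
{
    int root = 1;
    while ((long)(root + 1) * (root + 1) <= n)
        root++;                    /* root = floor(Sqrt(n)) */
    return owner_of(root, n, nprocs) + 1;
}
```

For N = 100 on 20 processes the blocks hold 5 numbers each, so the
sieving prime 7 belongs to P1 rather than P0; for N = 225 on 4
processes everything up to 15 still lies in the block of P0.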

EXAMPLE 2: The "PI" program 
==============================
    
   (*** will be covered using separate transparencies from the MPI pile ***)

You can download the program files cpi.c and cpilog.c
(pi3.f for Fortran users).



 (** Other topics such as Topologies, Finding Neighbors, Partitioning,
     Collective Communication operations will also be covered from the
     same set of slides  
  **)






