----------------------------------------------------------------------
2.2. USING WORKSTATION CLUSTERS
----------------------------------------------------------------------
2.2.3. MPI: Message Passing Interface
=====================================================
Barriers to the widespread use of parallelism have generally been in
three major areas:
HARDWARE: faster computers require faster communication networks.
(recent progress in interconnection network (IN) technology is satisfactory)
Work to improve I/O performance is still continuing.
ALGORITHMS: There has been so much research in this area that almost
all problems in computer science have been parallelized.
SOFTWARE: The biggest obstacle to the spread of parallel computing is
inadequate software.
Let us elaborate on this statement.
- Compilers that automatically parallelize sequential algorithms
remain limited in their applicability. Although these compilers work
well on certain problems, the best performance is still obtained when
the programmer supplies the parallel algorithm.
Perhaps the right approach is to make parallel programs easier to write
rather than to automate the parallelization process. So, we
need to provide useful, efficient libraries that work on multiple
platforms. The properties we should look for in any parallel language
or library are:
- portability
- efficiency
- expressiveness
Most existing languages/libraries emphasize one property at the
expense of the others, whereas MPI tries to achieve a good balance
among all three.
- MPI is not a new programming language. It is a library of
routines that can be called from C and Fortran 77 programs.
- It was developed by an open, international forum consisting of
representatives from industry, academia, and government laboratories.
- MPI is based on message passing, one of the most powerful and widely
used paradigms for programming parallel systems.
- Today, it is the most widely used standard in parallel programming.
- Overall, it is fair to say that MPI is superior to the
existing languages/libraries when all three properties are considered
together:
(i) MPI is PORTABLE across a large number of machines
(ii) Deep involvement of vendors in MPI's definition has ensured
that vendor-supplied MPI implementations will be EFFICIENT
(iii) MPI is EXPRESSIVE; i.e. it is designed to be a convenient,
complete definition of the message-passing model.
- The introduction of MPI makes it possible for developers of parallel
software to write libraries of parallel programs that are both
portable and efficient.
Parallel Computational Models:
=============================
A parallel computational model is a conceptual view of the types of
operations available to a parallel program. Models can be discussed
along multiple axes:
shared memory vs. distributed memory vs. threads
data parallel vs. control parallel approaches
SIMD vs. MIMD
Message passing vs. uniform memory access vs. NUMA
etc.
Advantages of the Message-Passing Model
=======================================
Universality:
------------
The MP model fits well on separate PEs connected by a (fast or slow)
communication network. Thus, it matches the hardware of most of today's
parallel supercomputers, as well as the networks of workstations (NOWs)
that are beginning to compete with them. It can also be used on
shared-memory machines.
Expressivity:
------------
It provides the control missing from the data-parallel and
compiler-based approaches, e.g. more control over the movement of data.
Is this good or bad?
Ease of debugging:
-----------------
It is easier to write debuggers for the shared-memory model, but
debugging itself is easier in the MP model because every access to
another process's memory is explicit.
Performance:
-----------
Provides scalable memory and more programmer control over the locality
of memory accesses. However, it requires faster networks.
Quotes from the MPI Book:
=========================
"The primary goal of the MPI specification is to demonstrate that users
need not compromise among efficiency, portability, and functionality.."
"It is an attempt to collect the best features of many existing
message-passing systems, improve them where appropriate, and
STANDARDIZE them... "
"MPI is a library, not a language..."
"MPI addresses the message-passing model.."
Location of MPI files : /afs/umr.edu/software/mpi109/solaris
===============================
Basic Functions in MPI Library
===============================
The following six MPI functions allow you to write many programs:
MPI_Init
MPI_Finalize
MPI_Comm_size
MPI_Comm_rank
MPI_Send
MPI_Recv
Additional commonly used MPI functions
MPI_Bcast
MPI_Reduce
MPI_Barrier
For detailed info: http://www.umr.edu/~ercal/387/MPI/qstart.html
-------------------------------------------------------------------
int MPI_Init(argc,argv) - Initialize the MPI execution environment
INPUT PARAMETERS
int *argc - Pointer to the number of arguments
char ***argv - Pointer to the argument vector
COMMAND LINE ARGUMENTS
MPI specifies no command-line arguments but does allow an
MPI implementation to make use of them.
-------------------------------------------------------------------
int MPI_Finalize() - Terminates MPI execution environment
NOTES: All processes must call this routine before exiting.
-------------------------------------------------------------------
int MPI_Comm_size(comm,size) - Determines the size of the group
associated with a communicator
INPUT PARAMETER
MPI_Comm comm - communicator (handle)
OUTPUT PARAMETER
int *size - number of processes in the group
-------------------------------------------------------------------
A COMMUNICATOR is a communication domain that defines a set of
processes that are allowed to communicate between themselves.
INTRACOMMUNICATOR: allows communication within a group
each PE has a unique rank within a group
MPI_COMM_WORLD: used in simple applications for all point-to-point
and collective operations.
INTERCOMMUNICATOR: allows communication between groups
-------------------------------------------------------------------
int MPI_Comm_rank(comm, rank) - Determines the rank of the calling
process in the communicator
INPUT PARAMETER
MPI_Comm comm - communicator
OUTPUT PARAMETER
int *rank - rank of the calling process
-------------------------------------------------------------------
int MPI_Send( buf, count, datatype, dest, tag, comm )
Performs a basic send. May block until the message is routed
INPUT PARAMETERS
void *buf - initial address of send buffer (choice)
int count - number of elements in send buffer
MPI_Datatype datatype - datatype of each send buffer element
int dest - rank of destination
int tag - message tag (e.g. 2, i, j); MPI_ANY_TAG is valid only in a receive
MPI_Comm comm - communicator
-------------------------------------------------------------------
int MPI_Recv(buf,count,datatype,source,tag,comm,status)
INPUT/OUTPUT PARAMETERS
void *buf - initial address of receive buffer (OUTPUT)
int count - maximum number of elements in receive buffer
int source - rank of source
MPI_ANY_SOURCE means "accept a message from anyone"
int tag - message tag; must match the tag used in the 'send'
MPI_ANY_TAG means "accept a message with any tag value"
MPI_Datatype datatype - datatype of each receive buffer element
MPI_Comm comm - communicator
MPI_Status *status - status object (OUTPUT)
NOTE: The 'count' argument indicates the maximum length of a message.
The actual number of elements received can be determined with MPI_Get_count.
e.g. MPI_Recv(new_par, MAX_PAR, par_type, source, tag, comm, &status);
MPI_Get_count(&status, par_type, &number);
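The send/receive pair described above can be sketched as a complete
program. This is only an illustration, not part of the course material:
the buffer contents, the tag value 7 and the use of MPI_INT in place of
"par_type" are assumptions made for the example, and it needs at least
two processes (build with mpicc, start with mpirun).

```c
#include <stdio.h>
#include <mpi.h>

#define MAX_PAR 100   /* assumed upper bound on the message length */

int main(int argc, char **argv)
{
    int rank, size;
    const int tag = 7;                 /* arbitrary tag chosen for this sketch */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) printf("Run this example with at least 2 processes\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        /* Sender: transmits 5 integers to process 1. */
        int par[5] = {10, 20, 30, 40, 50};
        MPI_Send(par, 5, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver: accepts up to MAX_PAR integers from any source, any tag. */
        int new_par[MAX_PAR];
        int number;
        MPI_Status status;

        MPI_Recv(new_par, MAX_PAR, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* Ask how many elements actually arrived. */
        MPI_Get_count(&status, MPI_INT, &number);
        printf("Process 1 received %d ints from process %d (tag %d)\n",
               number, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}
```

The wildcards appear only on the receiving side; the status object then
tells the receiver who actually sent the message and with which tag.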
-------------------------------------------------------------------
int MPI_Isend( buf, count, datatype, dest, tag, comm, request)
Performs a non-blocking send.
int MPI_Irecv(buf,count,datatype,source,tag,comm,request)
Performs a non-blocking receive
ASSOCIATED OPERATIONS:
MPI_Wait(request, status): Waits for an MPI send or receive to complete
MPI_Test (req, flag, status): Tests for the completion of a send or receive
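A minimal sketch of the non-blocking calls, assuming a ring in which
every process sends its rank to its right neighbour and receives from
its left neighbour. The ring pattern and the variable names are chosen
for illustration only.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, left, right;
    int outgoing, incoming = -1;
    MPI_Request requests[2];
    MPI_Status statuses[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;           /* neighbour we send to */
    left = (rank + size - 1) % size;     /* neighbour we receive from */
    outgoing = rank;

    /* Both calls return immediately; the transfers proceed in the background. */
    MPI_Irecv(&incoming, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &requests[0]);
    MPI_Isend(&outgoing, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &requests[1]);

    /* Useful computation that does not touch the two buffers could go here. */

    /* The buffers may be reused or read only after the waits return. */
    MPI_Wait(&requests[0], &statuses[0]);
    MPI_Wait(&requests[1], &statuses[1]);

    printf("Process %d received %d from process %d\n", rank, incoming, left);

    MPI_Finalize();
    return 0;
}
```

Posting the receive before the send is a common habit; because neither
call blocks, the ring cannot deadlock regardless of the order.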
-------------------------------------------------------------------
int MPI_Bcast ( buffer, count, datatype, root, comm )
Broadcasts a message from the process with rank "root"
to all other processes of the group.
**All of the processes make the call MPI_Bcast()
but only the root sends the data while others receive it.
INPUT/OUTPUT PARAMETERS
void *buffer - starting address of buffer
int count - number of entries in buffer
MPI_Datatype datatype - data type of buffer
int root - rank of broadcast root
MPI_Comm comm - communicator (handle)
ALGORITHM: Implementations typically use a tree-like algorithm for
broadcast; the MPI standard does not prescribe one.
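As a small illustration of the calling convention, the sketch below
lets process 0 choose a value and broadcast it. The value 1000 and the
variable name "n" are assumptions made for the example.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int n = 0;               /* every process owns a buffer for the value */
    const int root = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == root)
        n = 1000;            /* only the root knows the value initially */

    /* Every process makes the same call; the root sends, the others receive. */
    MPI_Bcast(&n, 1, MPI_INT, root, MPI_COMM_WORLD);

    printf("Process %d now has n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}
```

Note that there is no matching MPI_Recv: the non-root processes obtain
the data through their own MPI_Bcast call.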
-------------------------------------------------------------------
MPI_Gather - Gathers together values from a group of processes
MPI_Scatter - Sends data from one task to all other tasks in a group
MPI_Alltoall - Sends data from all to all processes
MPI_Reduce - Reduces values on all processes to a single value
MPI_Reduce_scatter - Combines values and scatters the results
MPI_Scan - Computes the scan (partial reductions) of data
on a collection of processes
MPI_Barrier - Blocks until all processes have reached this routine.
-------------------------------------------------------------------
int MPI_Reduce (sendbuf, recvbuf, count, datatype, op, root, comm)
INPUT PARAMETERS
void *sendbuf - address of send buffer
int count - number of elements in send buffer
MPI_Datatype datatype - data type of elements of send buffer
MPI_Op op - reduce operation (handle)
int root - rank of root process
MPI_Comm comm - communicator (handle)
OUTPUT PARAMETER
void *recvbuf - address of receive buffer (significant only at root)
MPI_Reduce() combines the operands stored in the memory referenced by
"sendbuf" using operation "op" and stores the result in "*recvbuf" on
process "root". Must be called by all processes in the communicator "comm".
"count", "datatype", "op", and "root" must be the SAME on each process.
The parameter "op" can be one of the following:
OPERATION NAME | Meaning
-------------------------------------
MPI_MAX | Maximum
MPI_MIN | Minimum
MPI_SUM | Sum
MPI_PROD | Product
MPI_LAND | Logical AND
MPI_BAND | Bitwise AND
MPI_LOR | Logical OR
MPI_BOR | Bitwise OR
MPI_LXOR | Logical exclusive OR
MPI_BXOR | Bitwise exclusive OR
MPI_MAXLOC | Maximum and location of maximum
MPI_MINLOC | Minimum and location of minimum
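A minimal sketch of MPI_Reduce with the MPI_SUM operation: each process
contributes one integer and process 0 receives the total. The choice of
"rank + 1" as the local value is an assumption made so that the result
is easy to check by hand.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    int local, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1;        /* process i contributes the value i + 1 */

    /* All processes call MPI_Reduce with the same count, datatype, op and root. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* The result is defined only on the root. */
    if (rank == 0)
        printf("Sum of 1..%d = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```

Replacing MPI_SUM with another entry of the table (e.g. MPI_MAX)
changes the combining operation without any other change to the call.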
-------------------------------------------------------------------
Get Started With MPI PROGRAMMING
-------------------------------------------------------------------
Example Program to add numbers: (* T Figure 2.16 *)
-------------------------------------------------------------------
ANOTHER EXAMPLE: The "hello" program
=============================================
In the following, we build a simple HELLO program step by step. These
are also the general procedures to build a simple MPI-based program.
Write the program:
-----------------
Write the program according to the specification. You can download
the "hello" program from here.
To do this in netscape, simply click the right mouse button on the
link and select the menu option "Save Target As" (or "Save Link As")
to save it into your directory:
/************************/
/* "Hello" program */
/************************/
#include "mpi.h"
#include <stdio.h>
int main(int argc, char **argv)
{
    int myid, numprocs, resultlen;
    char name[MPI_MAX_PROCESSOR_NAME];   /* buffer for the processor name */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);   /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);       /* rank of this process */
    if (myid == 0) printf("This program is running on %d processes\n",
                          numprocs);
    MPI_Get_processor_name(name, &resultlen);
    printf("From Process-%d: Hello MPI world! I am running on %s \n", myid, name);
    MPI_Finalize();
    return 0;
}
Compile and link the program:
-----------------------------
Use the script file "mpicc" to build the program, e.g. type:
mpicc -o hello hello.c
For more information about mpicc, refer to: 387 Homepage .
Run the program:
----------------
Use the script file "mpirun" to load and run the "hello" program:
mpirun -np 6 hello
Here, the "-np 6" option specifies that the program will use 6 processes,
each of which is assigned to a processor. The order of the
processors is taken from a list stored in the "machine.ARCH" file.
Users can create their own list of machines, store it in a
file, and pass its name to mpirun with the
option "-machinefile". For example, you can create a file called
"mymachines" which contains:
ultra1.cs.umr.edu
ultra2.cs.umr.edu
ultra3.cs.umr.edu
ultra4.cs.umr.edu (* do not use "ultra5.cs.umr.edu" *)
ultra6.cs.umr.edu
ultra7.cs.umr.edu
ultra8.cs.umr.edu
ultra9.cs.umr.edu
When you type "mpirun -np 6 -machinefile mymachines hello",
your program will run on your local machine plus the first five
machines in your list (if they are all available). For more
information about mpirun, refer to mpirun .
If you run this program on Ultra2.cs.umr.edu,
the output of your run will look like the following:
--------------------------------------------------------------------------
From Process-4: Hello MPI world! I am running on ultra4.yp-server.umr.edu
From Process-2: Hello MPI world! I am running on ultra2.yp-server.umr.edu
From Process-3: Hello MPI world! I am running on ultra3.yp-server.umr.edu
From Process-5: Hello MPI world! I am running on ultra6.yp-server.umr.edu
From Process-1: Hello MPI world! I am running on ultra1.yp-server.umr.edu
This program is running on 6 processes
From Process-0: Hello MPI world! I am running on Ultra2.yp-server.umr.edu
--------------------------------------------------------------------------
Notice that the messages are not printed in any specific order.
This is expected because the processors run at their own speed
and may finish in any order. To force a specific
ordering, we need to use SYNCHRONIZATION primitives.
One such call is MPI_Barrier.
int MPI_Barrier(comm) - Blocks the caller until all group members
have called it. The call returns at any process only after all
group members have entered the call.
For example, in the "hello" program above, if we insert an
MPI_Barrier() call right before the second print statement, we force
the first print statement to execute before any process reaches the
second one:
/* Modified Hello.c */
/* ......... this part is the same ..... */
if(myid==0) printf("This program is running on %d processes\n",
numprocs);
MPI_Get_processor_name(name, &resultlen);
/* force all the processes to meet in this statement */
MPI_Barrier(MPI_COMM_WORLD);
printf("From Process-%d: Hello MPI world! I am running on %s \n", myid, name);
/* ......... this part is the same ..... */
One possible output for this new program looks like:
--------------------------------------------------------------------------
This program is running on 6 processes
From Process-3: Hello MPI world! I am running on ultra3.yp-server.umr.edu
From Process-2: Hello MPI world! I am running on ultra2.yp-server.umr.edu
From Process-5: Hello MPI world! I am running on ultra6.yp-server.umr.edu
From Process-1: Hello MPI world! I am running on ultra1.yp-server.umr.edu
From Process-0: Hello MPI world! I am running on ercal.yp-server.umr.edu
From Process-4: Hello MPI world! I am running on ultra4.yp-server.umr.edu
--------------------------------------------------------------------------
Notice that the output of the second print statement is still not in
rank order.
Exercise: Try to get the messages printed in ascending order
of process number.
Do you think the following will work?
-------------------------------------
.........
if(myid==0) printf("This program is running on %d processes\n", numprocs);
MPI_Get_processor_name(name, &resultlen);
if(myid==1) printf("From Process-%d: Hello MPI world! I am running on %s \n", myid, name);
if(myid==2) printf("From Process-%d: Hello MPI world! I am running on %s \n", myid, name);
.........
.........
******************************************************************
* Project 1: Computing Prime Numbers                             *
******************************************************************
PRIME NUMBER SIEVE:
-------------------
Problem: Find all the prime numbers between 1 and N.
Sequential Algorithm:
---------------------
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
p
Next_prime = 2 ; Strike out Multiples of 2 as Non-Prime
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
p  p     *     *     *     *     *     *     *     *     *     *     *     *     *
Next_prime = 3 ; Strike out Multiples of 3 as Non-Prime
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
p  p  p  *     *     *  *  *     *     *  *  *     *     *  *  *     *     *  *  *
Next_prime = 5 ; Strike out Multiples of 5 as Non-Prime
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
p  p  p  *  p  *     *  *  *     *     *  *  *     *     *  *  *     *  *  *  *  *
Next_prime = 7 ; Strike out Multiples of 7 as Non-Prime
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
p  p  p  *  p  *  p  *  *  *     *     *  *  *     *     *  *  *     *  *  *  *  *
Next_prime = 11 ; Strike out Multiples of 11 as Non-Prime
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
p  p  p  *  p  *  p  *  *  *  p  *     *  *  *     *     *  *  *     *  *  *  *  *
Keep repeating until the entire list is scanned!
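The sequential procedure illustrated above can be sketched in C as
follows. The function name "sieve" and the "composite" flag array are
assumptions made for this illustration; the caller supplies an array
with room for n + 1 flags.

```c
#include <assert.h>
#include <string.h>

/* Sets composite[k] = 1 for every non-prime k in 2..n and returns the
 * number of primes found in 2..n. composite must hold n + 1 entries. */
int sieve(int n, unsigned char *composite)
{
    int count = 0;

    assert(n >= 0 && composite != NULL);
    memset(composite, 0, (size_t)n + 1);

    for (int p = 2; p <= n; p++) {
        if (composite[p])
            continue;                 /* p was struck out by a smaller prime */
        count++;                      /* p is the next prime */
        for (int m = 2 * p; m <= n; m += p)
            composite[m] = 1;         /* strike out the multiples of p */
    }
    return count;
}
```

For n = 28, as in the illustration, the function returns 9 (the primes
2, 3, 5, 7, 11, 13, 17, 19 and 23).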
Q: How many steps in total?
A: If p_i denotes the i-th prime number, then approximately
N/p_1 + (N-p_1)/p_2 + (N-p_2)/p_3 + ......
Q: How do you perform this operation in parallel?
A: Divide the range 1-N equally between the processors and let
each processor strike out the Non-Primes in its own region of numbers.
TASK PARTITIONING
-----------------
1 ______ N/p ____ 2N/p _____________________________________ N
|        |        |        | ...................... |        |
|        |        |        | ...................... |        |
|   P0   |   P1   |   P2   |                        |   Pn   |
|        |        |        | ...................... |        |
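One way to compute each processor's region is sketched below. The
function name "block_range" and the rule of giving the first few
processes one extra number when N is not divisible by the number of
processes are assumptions of this sketch, not something the notes
prescribe.

```c
#include <assert.h>

/* Computes the inclusive range [*lo, *hi] of the numbers owned by process
 * `rank` when 1..n is split into `nprocs` contiguous, nearly equal blocks. */
void block_range(int n, int nprocs, int rank, int *lo, int *hi)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);

    int base = n / nprocs;     /* minimum block size */
    int extra = n % nprocs;    /* the first `extra` ranks get one more number */

    *lo = rank * base + (rank < extra ? rank : extra) + 1;
    *hi = *lo + base + (rank < extra ? 1 : 0) - 1;
}
```

For n = 10 and 3 processes this yields the regions 1-4, 5-7 and 8-10.
In an MPI program, "rank" and "nprocs" would come from MPI_Comm_rank
and MPI_Comm_size.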
Q: Do we need to strike out the multiples of every prime number
between 1 and N in order to find all the primes between 1 and N?
A: No. Here is why:
1 ________ Sqrt(N)_______________________________________N
1 ________ 15 _______________________________________225
Consider a number Y such that Sqrt(N) < Y < N
1 ________ Sqrt(N)__________________________Y_____________N
If Y is not PRIME, then it can be written as Y = Y1 * Y2 with Y1, Y2 > 1,
where at least one of Y1 or Y2 must be less than Sqrt(N). Otherwise,
both Y1 and Y2 are at least Sqrt(N), so Y1*Y2 >= N, which contradicts
the assumption Y < N.
LEMMA: Every non-prime number between Sqrt(N) and N
is a multiple of a number greater than 1 and at most Sqrt(N).
Therefore, in the Prime Number algorithm, it is sufficient to generate
only multiples of those numbers between 1 and Sqrt(N) to find all the
Non-primes between 1 and N.
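The lemma can be checked numerically with a small helper that searches
for a divisor no larger than the square root. The function name
"small_divisor" is hypothetical and used only for this illustration.

```c
#include <assert.h>

/* Returns the smallest divisor d of y with d > 1 and d*d <= y,
 * or 0 if no such divisor exists (which means y is prime). */
int small_divisor(int y)
{
    assert(y >= 2);

    for (int d = 2; (long)d * d <= y; d++) {
        if (y % d == 0)
            return d;
    }
    return 0;
}
```

With N = 225 (Sqrt(N) = 15), the non-prime 221 = 13 * 17 is caught by
its factor 13, while 223 has no divisor up to 15 and is therefore prime.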
EASY TASK PARTITIONING
-----------------------
1 _______ Sqrt(N) _________________________________________ N
|         |        |        |        |  .........  |        |
|   P0    |        |        |        |  .........  |        |
| PARENT  |   P1   |   P2   |   P3   |             |   Pn   |
| PROCESS |        |        |        |             |        |
**This may not result in a load-balanced distribution
Algorithm: 1) Parent process (P0) broadcasts current prime number, p,
to all the other processors
2) Each processor (including P0) strikes out the multiples of p
in its own region
3) This process continues until P0 reaches Sqrt(N).
4) Processors report all those numbers which are not deleted
in their region as prime numbers.
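Step 2 of the algorithm above, striking out the multiples of p inside
one processor's region, can be sketched as a plain C function. The name
"strike_multiples" and the layout of the "struck" array (entry k - lo
stands for the number k) are assumptions of this sketch; in the MPI
program, p would arrive through MPI_Bcast from P0.

```c
#include <assert.h>

/* Strikes out the multiples of the prime p inside the region [lo, hi].
 * struck[k - lo] is the flag for the number k. The prime p itself is
 * never struck. Returns how many numbers were newly struck out. */
int strike_multiples(unsigned char *struck, int lo, int hi, int p)
{
    assert(struck != 0 && p >= 2 && lo >= 1 && lo <= hi);

    int first = ((lo + p - 1) / p) * p;   /* smallest multiple of p >= lo */
    if (first < 2 * p)
        first = 2 * p;                    /* skip p itself */

    int newly = 0;
    for (int m = first; m <= hi; m += p) {
        if (!struck[m - lo]) {
            struck[m - lo] = 1;
            newly++;
        }
    }
    return newly;
}
```

Each processor only needs the bounds of its own region and the current
prime, so no communication is required beyond the broadcast of p.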
COMPLEX TASK PARTITIONING
--------------------------
1 _______ Sqrt(N) __________________________________________ N
|        |        |        | ...................... |        |
|        |        |        | ...................... |        |
|   P0   |   P1   |   P2   |                        |   Pn   |
|        |        |        |                        |        |
***Equal Load on each PE but more complex to program
EXAMPLE 2: The "PI" program
==============================
(*** will be covered using separate transparencies from the MPI pile ***)
You can download the program files
cpi.c
and cpilog.c
( pi3.f for Fortran users)
(** Other topics such as Topologies, Finding Neighbors, Partitioning,
Collective Communication operations will also be covered from the
same set of slides
**)