------------------------------------------------
PyPAR - Parallel Python, no-frills MPI interface

Author:  Ole Nielsen (2001, 2002, 2003)
Email:   Ole.Nielsen@anu.edu.au
Version: See pypar.__version__
Date:    See pypar.__date__
------------------------------------------------

CONTENTS:
  TESTING THE INSTALLATION
  GETTING STARTED  
  FIRST PARALLEL PROGRAM     
  PROGRAMMING FOR EFFICIENCY
  PYPAR REFERENCE
  DATA TYPES
  STATUS OBJECT	  
  EXTENSIONS
  PROBLEMS
  REFERENCES

TESTING THE INSTALLATION

  After having installed pypar run
  
    mpirun -np 2 testpypar.py
    
  to verify that everything works.
  If it doesn't try to run a C/MPI program 
  (for example ctiming.c) to see if the problem lies with
  the local MPI installation or with pypar.
   
    
GETTING STARTED
  Assuming that the pypar C-extension compiled you should be able to
  write 
  >> import pypar
  from the Python prompt. 
  (This may not work under MPICH under Sunos - this doesn't matter, 
  just skip this interactive bit and go to FIRST PARALLEL PROGRAM.)
  
  Then you can write
  >> pypar.size()  
  to get the number of parallel processors in play.
  (When using pypar from the command line this number will be 1).
  
  Processors are numbered from 0 to pypar.size()-1.
  Try
  >> pypar.rank()
  to get the processor number of the current processor.
  (When using pypar from the command line this number will be 0).  
  
  Finally, try
  >> pypar.Get_processor_name()
  to return host name of current node 
  
FIRST PARALLEL PROGRAM   
  Take a look at the file demo.py supplied with this distribution.
  To execute it in parallel on four processors, say, run
  
    mpirun -np 4 demo.py
    
  
  from the UNIX command line.
  This will start four copies of the program each on its own processor. 
    
  The program demo.py makes a message on processor 0 and 
  sends it around in a ring
  - each processor adding a bit to it - until it arrives back at 
  processor 0.
  
  All parallel programs must find out what processor they are running on.
  This is accomplished by the call
  
    myid = pypar.rank()
  
  The total number of processors is obtained from
  
    proc = pypar.size()
    
  One can then have different codes for different processors by branching
  as in
  
    if myid == 0
      ...  
  
  To send a general Python structure A to processor p, write
    pypar.send(A, p)
  and to receive something from processor q, write
    X = pypar.receive(q)  
  
  This will cater for any (picklable Python structure) and make parallel 
  programs very simple and readable. While reasonably fast, it does not 
  achieve the full bandwidth obtainable by MPI.


PROGRAMMING FOR EFFICIENCY    
  For really fast communication one must stick to Numeric arrays and use 
  'raw' versions of send and receive, e.g.:
  To send a Numeric array A to processor p, write
    pypar.raw_send(A, p)
  and to receive the array from processor q, write
    X = pypar.raw_receive(X, q)    
  Note that X acts as a buffer and must be pre-allocated prior to the 
  receive statement as in Fortran and C programs using MPI.
  
  See the script pytiming for an example of communication of Numeric arrays.
      

MORE PARALLEL PROGRAMMING WITH PYPAR

  When you are ready to move on, have a look at the supplied 
  demos and the script testpypar.py which all give examples
  of parallel programming with pypar.
  You can also look at a standard MPI reference (C or Fortran)
  to get some ideas. The principles are the same in pypar except
  that many calls have been simplified.
  

PYPAR REFERENCE
  Here is a list of functions provided by pypar:
  (See section on Data types for explanation of 'vanilla').  
  
  size() -- Number of processors
  rank() -- Id of current processor
  Get_processor_name() -- Return host name of current node  
  
  send(x, destination, tag=0, vanilla=0) -- Blocking send (all types)
    Sends data in x to destination with given tag. 
    
  y=receive(source, tag=0) -- Blocking receive (all types)
    receives data (y) from source (possible with specified tag).
  
  y, status=receive(source, tag=0, return_status=True) -- Blocking receive (all types)
    receives data (y) and status object from source (possible with specified tag).    
    
  raw_send(x, destination, tag=0, vanilla=0): -- Blocking send (Fastest)
    Sends data in x to destination with given tag.     
    It differs from send in that the receiver MUST provide a buffer
    to store the received data.
    Although it will accept all types raw_send is thought to be used 
    mainly for Numeric arrays.
  
  raw_receive(x, source, tag=0, vanilla=0):  -- Raw blocking receive (Fastest)
    receives data from source (possible with specified tag) and puts
    it in x (which must be of compatible size and type).
    It also returns a reference to x.
    Although it will accept all types raw_send is thought to be used 
    mainly for Numeric arrays.    

  x, status = raw_receive(x, source, tag=0, vanilla=0, return_status=True):  -- Raw blocking receive (Fastest)
    receives data and status object from source (possible with specified tag) and puts
    it in x (which must be of compatible size and type).
    

  bcast(X, rootid) -- Broadcasts X from rootid to all other processors.
                      All processors must issue the same bcast.


  raw_scatter(x, nums, buffer, source, vanilla=0):  
     Scatter the first nums elements in x to buffer
     (of size given by nums) from source.

    
  scatter(x, source, vanilla=0):
     Scatter all elements in x to a buffer 
     created by this function and returned.


  raw_gather(x, buffer, source, vanilla=0):
     Gather all elements in x to buffer
     from source.
     Buffer must have size len(x) * numprocs and 
     shape[0] == x.shape[0]*numprocs

  gather(x, source, vanilla=0):
     Gather all elements in x to buffer of 
     size len(x) * numprocs
     created by this function and returned.     
     If x is multidimensional buffer will have
     the size of zero'th dimension multiplied by numprocs


  raw_reduce(x, buffer, op, source, vanilla=0):
     Reduce all elements in x to buffer (of the same size as x)
     at source applying operation op elementwise.

     
  reduce(x, op, source, vanilla=0):
     Reduce all elements in x at source
     applying operation op elementwise and return result in new buffer.  
     Buffer is created and returned.

		      		         
  Wtime() -- MPI wall time
  Barrier() -- Synchronisation point. Makes processors wait until all 
               processors have reached this point.
  Abort() -- Terminate all processes. 
  Finalize() -- Cleanup MPI. No parallelism can take place after this point. 

  See pypar.py for doc strings on individual functions.   

  
DATA TYPES
  Pypar automatically handles different data types differently
  There are three protocols:
    'array': Numeric arrays of type 'i', 'l', 'f', or 'd' can be communicated
             with the underlying mpiext.send_array and mpiext.receive_array. 
	     This is the fastest mode.
    'string': Text strings can be communicated with mpiext.send_string and
              mpiext.receive_string.
    'vanilla': All other types can be communicated using the scripts 
               send_vanilla and receive_vanilla provided that the objects 
	       can be serialised using
               pickle (or cPickle). The latter mode is less efficient than the
               first two but it can handle complex structures.

     Rules:
     If keyword argument vanilla == 1, vanilla is chosen regardless of 
     x's type.
     Otherwise if x is a string, the string protocol is chosen
     If x is an array, the 'array' protocol is chosen provided that x has one
     of the admissible typecodes.
     
  Function that take vanilla as a keyword argument can force vanilla mode
  on any datatype.     
  
STATUS OBJECT
  A status object can be optionally returned from receive and raw_receive by
  specifying return_status=True in the call.
  The status object can subsequently be queried for information about the communication.
  The fields are:	
    status.source: The origin of the received message (use e.g. with pypar.any_source)
    status.tag: The tag of the received message (use e.g. with pypar.any_tag)    
    status.error: Error code generated by underlying C function
    status.length: Number of elements received
    status.size: Size (in bytes) of one element
    status.bytes(): Number of payload bytes received (excl control info)

  The status object is essential when use together with any_source or any_tag.  	
  

EXTENSIONS
  At this stage only a subset of MPI is implemented. However,
  most real-world MPI programs use only these simple functions.
  If you find that you need other MPI calls, please feel free to
  add them to the C-extension. Alternatively, drop me note and I'll
  use that as an excuse to update pypar.
  
  
PERFORMANCE
  If you are passing simple Numeric arrays around you can reduce 
  the communication time by using the '_raw' versions of send and 
  receive (see REFERENCE above). These version are closer to the underlying MPI 
  implementation in that one must provide receive buffers of the right size.
  However, you will find that this can be somewhat faster as they bypass 
  pypar's mechanism for automatically transferring the needed buffer size.
  Also, using simple numeric arrays will bypass pypar's pickling of complex 
  structures.

  This will mainly be of interest if you are using a 'fine grained' 
  parallelism, i.e. if you have frequent communications.	
  
  Try both versions and see for yourself if there is any noticable difference.
  

PROBLEMS
  If you encounter any problems with this package please drop me a line.
  I cannot guarantee that I can help you, but I will try.
  
  Ole Nielsen
  Australian National University
  Email: Ole.Nielsen@anu.edu.au
   
  
REFERENCES   
  To learn more about MPI in general, see WEB sites
    http://www-unix.mcs.anl.gov/mpi/
    http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg245380.pdf

  or books  
    "Using MPI, 2nd Edition", by Gropp, Lusk, and Skjellum, 
    "The LAM companion to 'Using MPI...'" by Zdzislaw Meglicki
    "Parallel Programming With MPI", by Peter S. Pacheco
    "RS/6000 SP: Practical MPI Programming", by Yukiya Aoyama and Jun Nakano 
     
  To learn more about Python, see the WEB site
    http://www.python.org