Monthly output MPI quandary


Subject: Monthly output MPI quandary
From: Keith Eric Grant (keg@strathspey.llnl.gov)
Date: Mon Jan 04 1999 - 17:07:05 MST


Using the input parameters below, I seem to have run into an MPI blocking
quandary. The input is essentially example 1 from the User's Guide, modified
to run 32 days and to request monthly averaged data.

I have compiled in SPMD (MPI) mode and am using 16 processors with 4
latitudes on each processor. The run goes normally until the 30-day output
is reached and then blocks with all processors hung.

Input parameters:
--------------
&CCMEXP
 caseid = 'ccm3bld'
 datadir = '/g/g16/keg/projects/ccm3/ccm/bld/../data'
 ncdata = 'SEP1.T42.0198.nc'
 bndtvs = 'T42M5079.nc'
 bndtvo = 'ozn.0596.r8.nc'
 incorbuf = .TRUE.
 incorhst = .TRUE.
 incorrad = .TRUE.
 iradsw = -1
 iradlw = -1
 iradae = -12
 dtime = 1200.
 nestep = -32
 ninavg = 'Q'
 mfilt = 1
 irt = 0
 iyear_ad = 1950
 /
 &lsmexp
 datadir = '/g/g16/keg/projects/ccm3/ccm/bld/../data'
 finidat = 'arbitrary initialization'
 /
--------------

I note that the variable lat, giving the latitude index, is generated in
SCAN1BC and passed through LINEMSBC to WRITUP. WRITUP contains the following
loop:
--------------
      kfld = 1
      do ktape=1,mtapes
C
C Check hstwr to determine if it is time to write hist. file no. ktape
C
         if (hstwr(ktape)) then
C
C Write data to history file
C
            call wshist(ktape ,lat ,mflds(1,kfld),
     $ hbuf(hbufpt(ktape)))
         end if
         kfld = kfld + nflds(ktape)
      end do
--------------

This loop implies that, for a given latitude, WSHIST is called for each
history file to be output at the current time. In my case, I find that at
30 days I have mtapes=2 and hstwr(1:2)=1. Thus, for example, my processor 1
will call WSHIST twice for each of latitudes 5, 6, 7, and 8. I have
confirmed, using TotalView, that this is indeed the case.

In WSHIST, I find the following for processors other than the master
processor:
-------------

c
c Pack and ship my data to processor 0
c
        write(0,*)'WSHIST: writing sending my latitude = ',lat
        call MPI_SEND(hbuf,nplen(ktape),
     $ REALTYPE,0,
     $ msgtype+lat,MPI_COMM_WORLD,ier)

------------

For processor 1 and WRITUP as given above, this will result in data being
sent to processor 0 in the following order:

(latitude,ktape) = (5,1), (5,2), (6,1), (6,2), (7,1), (7,2), (8,1), (8,2)
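
As a quick check of that ordering, here is a minimal sketch (in Python
rather than the model's Fortran) that mimics the loop nesting: latitude
outermost, history tape innermost. The function name send_order is
hypothetical, and the sketch assumes hstwr is true for every tape.

```python
# Sketch only: reproduces the nesting in which WSHIST is reached, with
# the latitude loop outermost and the history-tape loop innermost.
def send_order(lats, mtapes):
    """Return (latitude, ktape) pairs in the order WSHIST is called."""
    return [(lat, ktape) for lat in lats for ktape in range(1, mtapes + 1)]


# Processor 1 owns latitudes 5..8; both tapes are due at 30 days.
print(send_order(range(5, 9), 2))
```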

For processor 0, the following segment picks up messages from the other
processors when lat=1. This should thus occur on two successive calls to
WSHIST on processor 0, one with (lat=1,ktape=1) and one with
(lat=1,ktape=2).

-------------------------------------------
c The following code assumes lat=1 is on processor 0. Get all latitudes for
c a given processor now since number of lats per processor may vary.
c
        if (lat.eq.1) then
          do iproc=1,npes-1
            do j=cut(1,iproc),cut(2,iproc)
c
c receive latitude "j"
c
              call MPI_RECV(hbuf,nplen(ktape),
     $ REALTYPE,proc(j),
     $ msgtype+j,MPI_COMM_WORLD,stat,ier)
              call wrtharr(hunit(ktape),hbuf,nplen(ktape),j,plon)
            end do
          end do
        end if
      end if
-----------------------------------------

Thus processor 0 is trying to sweep through all latitudes for ktape=1 and
then, on the next call to WSHIST, sweep through all latitudes for ktape=2.
WHICH IS NOT THE ORDER IN WHICH THEY ARE BEING SENT. Thus, when processor 1
generates (lat=5,ktape=2), processor 0 is looking for (lat=6,ktape=1). Since
MPI_SEND and MPI_RECV are blocking (and the sends are presumably not being
buffered), a hang ensues. Processors 0 and 1 are deadlocked, and all other
processors are waiting for their first latitude to be accepted. Note that
accepting a message depends on the sending processor and on the message
tag, which depends on the latitude index.
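
The mismatch can be reproduced with a small model of the message matching.
This is only a sketch in Python, not the model code. It assumes that each
send completes only when its matching receive is posted (no buffering),
that every processor owns a contiguous block of 4 latitudes, and that
matching depends only on the sender and the latitude-based tag. The names
lats_on and simulate are hypothetical.

```python
from collections import deque

NPES = 16          # processors, as in my run
LATS_PER_PE = 4    # latitudes per processor
MTAPES = 2         # history files due at the 30-day mark


def lats_on(pe):
    """Latitudes owned by processor pe, assuming contiguous blocks."""
    first = pe * LATS_PER_PE + 1
    return range(first, first + LATS_PER_PE)


def simulate():
    """Model unbuffered blocking sends: a send completes only when matched."""
    # Each worker's sends, queued in the order they are issued:
    # latitude outermost, tape innermost.
    pending = {
        pe: deque((lat, k) for lat in lats_on(pe) for k in range(1, MTAPES + 1))
        for pe in range(1, NPES)
    }
    # Processor 0 posts its receives as one full sweep per tape.
    for ktape in range(1, MTAPES + 1):
        for pe in range(1, NPES):
            for j in lats_on(pe):
                head = pending[pe][0] if pending[pe] else None
                # Matching uses sender and tag only; the tag encodes the
                # latitude (msgtype+lat) but not the tape.
                if head is None or head[0] != j:
                    return {"status": "deadlock", "sender": pe,
                            "blocked_send": head, "wanted": (j, ktape)}
                pending[pe].popleft()
    return {"status": "ok"}


print(simulate())
```

Under these assumptions the model stops on the second receive: processor 1
is blocked sending (lat=5,ktape=2) while processor 0 waits for
(lat=6,ktape=1).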

I'd greatly appreciate any comments enlightening me as to the error of my
ways or providing a fix for this apparent logic error.

Thanks,

...Keith Grant

-- 
 
 +-----------------------------+-------------------------------------------+
 I Keith Eric Grant            I Common sense and a sense of humor are the I
 I                             I same thing, moving at different speeds.   I
 I Atmospheric Science Div     I A sense of humor is just common sense,    I
 I P.O. Box 808, L-103         I dancing.   ... Clive James                I
 I Lawrence Livrmr Natn'l Lab  I                                           I
 I EMail: keg@llnl.gov         I (or perhaps dancing is just common sense) I
 I FAX:   (925) 422-5844       I                                           I
 +-----------------------------+-------------------------------------------+


