Subject: Monthly output MPI quandary
From: Keith Eric Grant (keg@strathspey.llnl.gov)
Date: Mon Jan 04 1999 - 17:07:05 MST
Using the input parameters below, I seem to have run into an MPI blocking
quandary. The input is essentially example 1 from the User's Guide modified
to run 32 days and requesting monthly averaged data.
I have compiled in SPMD (MPI) mode and am using 16 processors with 4
latitudes on each processor. The run goes normally until the 30-day output
is reached and then blocks with all processors hung.
Input parameters:
--------------
&CCMEXP
caseid = 'ccm3bld'
datadir = '/g/g16/keg/projects/ccm3/ccm/bld/../data'
ncdata = 'SEP1.T42.0198.nc'
bndtvs = 'T42M5079.nc'
bndtvo = 'ozn.0596.r8.nc'
incorbuf = .TRUE.
incorhst = .TRUE.
incorrad = .TRUE.
iradsw = -1
iradlw = -1
iradae = -12
dtime = 1200.
nestep = -32
ninavg = 'Q'
mfilt = 1
irt = 0
iyear_ad = 1950
/
&lsmexp
datadir = '/g/g16/keg/projects/ccm3/ccm/bld/../data'
finidat = 'arbitrary initialization'
/
--------------
I note that the variable lat, giving the latitude index, is generated in
SCAN1BC and passed through LINEMSBC to WRITUP. WRITUP contains the following
loop:
--------------
      kfld = 1
      do ktape=1,mtapes
C
C Check hstwr to determine if it is time to write hist. file no. ktape
C
         if (hstwr(ktape)) then
C
C Write data to history file
C
            call wshist(ktape ,lat ,mflds(1,kfld),
     $                  hbuf(hbufpt(ktape)))
         end if
         kfld = kfld + nflds(ktape)
      end do
--------------
This loop implies that, for a given latitude, WSHIST is called once for each
history file to be output at the current time. In my case, I find that at
30 days I have mtapes=2 and hstwr(1:2)=1. Thus, for example, my processor 1
will call WSHIST twice for each of latitudes 5, 6, 7, and 8. I have
confirmed, using TotalView, that this is indeed the case.
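As a quick check of the latitude assignment described above, here is a
minimal Python sketch. It is not CCM3 code: the function name
owned_latitudes is hypothetical, and it assumes each processor owns one
contiguous block of 4 latitudes, which matches processor 1 holding
latitudes 5 through 8.

```python
NPES = 16         # processors used in the run (from the post)
LATS_PER_PE = 4   # latitudes per processor (from the post)


def owned_latitudes(pe):
    """Return the 1-based latitude indices owned by processor `pe` (0-based).

    Assumption: latitudes are dealt out in contiguous, equal-sized blocks.
    """
    first = pe * LATS_PER_PE + 1
    return list(range(first, first + LATS_PER_PE))


if __name__ == "__main__":
    for pe in (0, 1, NPES - 1):
        print(pe, owned_latitudes(pe))
```

Under that assumption the 16 processors cover 64 latitudes in total, with
latitude 1 on processor 0.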
In WSHIST, I find the following for processors other than the master
processor:
-------------
c
c Pack and ship my data to processor 0
c
         write(0,*)'WSHIST: writing sending my latitude = ',lat
         call MPI_SEND(hbuf,nplen(ktape),
     $                 REALTYPE,0,
     $                 msgtype+lat,MPI_COMM_WORLD,ier)
------------
For processor 1 and WRITUP as given above, this will result in data being
sent to processor 0 in the following order:
(latitude,ktape) = (5,1), (5,2), (6,1), (6,2), (7,1), (7,2), (8,1), (8,2)
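The send order listed above follows from the loop nesting: latitude is the
outer index and ktape the inner one. A minimal Python sketch of that nesting
(the function name send_order is hypothetical, not part of CCM3):

```python
def send_order(lats, mtapes):
    """Return (latitude, ktape) pairs in the order a worker sends them.

    Latitude is the outer loop and ktape the inner loop, mirroring the
    WRITUP loop being called once per latitude.
    """
    return [(lat, ktape) for lat in lats for ktape in range(1, mtapes + 1)]


if __name__ == "__main__":
    # Processor 1 holds latitudes 5..8 and two history files are due.
    print(send_order([5, 6, 7, 8], 2))
```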
For processor 0, the following segment picks up messages from the other
processors when lat=1. This should thus occur on two successive calls to
WSHIST on processor 0, one with (lat=1,ktape=1) and one with
(lat=1,ktape=2).
-------------------------------------------
c The following code assumes lat=1 is on processor 0. Get all latitudes for
c a given processor now since number of lats per processor may vary.
c
         if (lat.eq.1) then
            do iproc=1,npes-1
               do j=cut(1,iproc),cut(2,iproc)
c
c receive latitude "j"
c
                  call MPI_RECV(hbuf,nplen(ktape),
     $                          REALTYPE,proc(j),
     $                          msgtype+j,MPI_COMM_WORLD,stat,ier)
                  call wrtharr(hunit(ktape),hbuf,nplen(ktape),j,plon)
               end do
            end do
         end if
      end if
-----------------------------------------
Thus processor 0 is trying to sweep through all latitudes for ktape=1 and
then, on the next call to WSHIST, sweep through all latitudes for ktape=2.
WHICH IS NOT THE ORDER IN WHICH THEY ARE BEING SENT. Thus, when processor 1
generates (lat=5,ktape=2) processor 0 is looking for (lat=6,ktape=1). Since
MPI_SEND and MPI_RECV are blocking, a hang ensues. Processors 0 and 1 are
deadlocked, and all other processors are waiting for their first latitude to
be accepted. Note that a receive matches a message by source processor and
by message tag (msgtype+lat above), which depends on the latitude index.
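The mismatch can be reproduced with a small Python model of the traffic
between processors 0 and 1. This is only a sketch under stated assumptions,
not MPI and not CCM3 code: the function name simulate is hypothetical, the
tag is taken to be the latitude index alone (standing in for msgtype+lat),
and every send is assumed to be unbuffered, so it completes only when the
receiver is waiting on the same tag. An MPI implementation that buffers
small messages could behave differently.

```python
def simulate(send_tags, recv_tags):
    """Pair blocking sends with blocking receives in program order.

    Assumption: sends are unbuffered, so the i-th send completes only if the
    i-th receive asks for the same tag. Returns (matched, deadlocked).
    """
    matched = 0
    for sent, wanted in zip(send_tags, recv_tags):
        if sent != wanted:
            # Sender is stuck on `sent`; receiver is stuck waiting on `wanted`.
            return matched, True
        matched += 1
    return matched, False


# Processor 1 sends latitude-major: both tapes for 5, then both for 6, ...
sends = [lat for lat in (5, 6, 7, 8) for _ktape in (1, 2)]
# Processor 0 receives tape-major: all latitudes for tape 1, then for tape 2.
recvs = [lat for _ktape in (1, 2) for lat in (5, 6, 7, 8)]

if __name__ == "__main__":
    print(simulate(sends, recvs))   # order used by the code quoted above
    print(simulate(sends, sends))   # receives issued in the sender's order
```

In this model the first message, (lat=5,ktape=1), goes through, and the
exchange then stalls with the sender holding tag 5 and the receiver waiting
on tag 6. Issuing the receives in the sender's order lets all eight messages
through, which suggests that any fix has to make the two orders agree or
relax the blocking.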
I'd greatly appreciate any comments enlightening the error of my ways or
providing a fix for this apparent logic error.
Thanks,
...Keith Grant
--
+-----------------------------+-------------------------------------------+
I Keith Eric Grant            I Common sense and a sense of humor are the I
I                             I same thing, moving at different speeds.   I
I Atmospheric Science Div     I A sense of humor is just common sense,    I
I P.O. Box 808, L-103         I dancing. ... Clive James                  I
I Lawrence Livrmr Natn'l Lab  I                                           I
I EMail: keg@llnl.gov         I (or perhaps dancing is just common sense) I
I FAX: (925) 422-5844         I                                           I
+-----------------------------+-------------------------------------------+