Subject: Re: creating output files on an ORIGIN2000
From: Keith Eric Grant (keg@strathspey.llnl.gov)
Date: Thu Aug 12 1999 - 16:06:19 MDT
Didi Sariyska wrote:
>
> Hello CCM-Users,
>
> The problem began when I ran the model with a namelist that was a little
> bit different from that of the test run. In particular, I define two
> auxiliary history files, one has data written to it every 12 hours, the
> other has daily averages written to it. The model runs for a while with
> all assigned processors, opens two output files (as it is supposed to
> do), but does not write any data there. After a while most of the
> processors except two go to 'sleep' mode. For more than 10 hours there
> were no error messages and the model was still pretending to be running.
>
> ....
>
> Didi Sariyska
> Department of Earth System Science
> University of California
> Irvine, CA 92697-3100
> dsariysk@uci.edu
Didi,
Your mention of using the non-CRAY SPMD coupled with multiple output files
matches a problem with MPI deadlocks that I posted a question about late
last year.
No one had an answer at that time, but after some debugging I figured out
the cause of the the deadlocks and found a workaround to the problem.
What happens, when multiple tapes are requested, is that messages are sent
to processor zero in the order file varying within latitude while they are
received latitude within file.
This sets up a situation in which buffering of messages is required for
completion. Gropp et al in "Using MPI" note on pages 75-76 that such
buffering is allowed but not required by the standard. The MPI_SEND is
considered to be a blocking send, hence the problem when a processor has
sent one message to processor 0 and processor 0 is waiting for a message
that will only be sent if the current message is picked up.
I have made some minor changes to WHIST.F that appear to get around this
problem, although I haven't examined the output files produced thoroughly. A
coworker running CCM3 on the DEC Alphas believes that this change may have
also removed what he believes is data corruption in an auxiliary data file.
I'd appreciate hearing back if this solves your problem.
Thanks.
...Keith Grant
Modified WHIST.F follows:
-----------------------------------------------------------------------------
#include <misc.h>
#include <params.h>
subroutine wshist(ktape ,lat ,mflds ,hbuf)
C-----------------------------------------------------------------------
C
C Transfer latitude data record to history file unit for ktape
C
C The cpp SPMD definition provides for the funnelling of all program i/o
C through the master processor. Processor 0 either reads restart/history
C data from the disk and distributes it to all processors, or collects
C data from all processors and writes it to disk.
C
C---------------------------Code history--------------------------------
C
C Original version: CCM1
C Standardized: L. Bath, June 1992
C T. Acker, March 1996
C
C-----------------------------------------------------------------------
c
c $Id: wshist.F,v 1.2 1998/07/22 18:19:46 rosinski Exp $
c $Author: rosinski $
c
C-----------------------------------------------------------------------
#include <implicit.h>
C------------------------------Parameters-------------------------------
#include <pmgrid.h>
C-----------------------------------------------------------------------
#include <pagrid.h>
C------------------------------Commons----------------------------------
#include <comhst.h>
C-----------------------------------------------------------------------
#if ( defined SPMD )
#include <comspmd.h>
#include <mpif.h>
integer msgtype,ier,stat(MPI_STATUS_SIZE),iproc
parameter (msgtype=17000)
logical recvd(plat,ptapes) ! latitude has been recvd (proc 0)
integer count(ptapes) ! count of recvd latitudes (proc 0)
save count,recvd
logical flag
integer j
#endif
C------------------------------Arguments--------------------------------
C
C Input arguments
C
integer ktape ! History file number
integer lat ! Record number to write
integer mflds(3,*) ! Integer header information for ktape
real hbuf(*) ! Address of ktape portion of history buffer
C
pointer (phfield,hfield)
real hfield(plon,*) ! Part of hist. file relevant to current field
C
C---------------------------Local Parameters-----------------------------
C
real tootiny
#if ( defined CRAY )
parameter (tootiny=1.e-2000)
#else
parameter (tootiny=0.)
#endif
C
C---------------------------Local workspace-----------------------------
C
integer ifield ! Field index
integer ip ! Incrementing pointer thru unpacked h-file buffer
integer ibufp ! Incrementing pointer through packed h-file buffer
integer itape ! History file number index
integer jtape ! First active history file for current time
integer ndensh ! Packing density
integer kfld ! Field index for nacs,fnorm
integer i ! Longitude index
integer k ! Vertical index
integer numlev ! Number of levels in model field
C
real packsp(plond) ! Packed data array
real fnorm(pflds) ! Divisor for accumulated fields
real dummy ! Placeholder for call to packaf
C
integer iplen ! Length of packed data
C
C------------------------------Externals--------------------------------
C
external packaf ! Fortran packer, library routine
external resetr ! Set real array to specified value
external wrtharr ! Array write of history buffer
C
C-----------------------------------------------------------------------
C
C Initialize history buffer pointers for packed and unpacked data, and
C field index.
C
ibufp = 1
ip = 1
kfld = 1
if (ktape.gt.1) then
do itape = 1,ktape-1
kfld = kfld + nflds(itape)
end do
end if
C
C Beginning of loop over no. of fields in history file
C
do 100 ifield = 1,nflds(ktape)
C
C Set "numlev" appropriately for single or multi-level field
C
if (mod(mflds(1,ifield),10).ge.1) then
numlev = plev
else
numlev = 1
end if
C
C Set pointer to appropriate location in history buffer for this field.
C After setting this pointer, array "hfield" may be accessed. Note that
C hfield is always a part of hbuf.
C
phfield = loc(hbuf(ip))
C
C normalize by number of accumulations.
C
if (mflds(1,ifield)/10.gt.0) then
fnorm(kfld) = 1./amax0(nacs(kfld,lat),1)
nacs(kfld,lat) = 0
do k=1,numlev
do i=1,plon
hfield(i,k) = hfield(i,k)*fnorm(kfld)
end do
end do
end if
C
C Pack field if requested. Note that the field is packed on top
C of itself. This is done to save memory because the history buffer
C is so huge.
C
ndensh = mflds(3,ifield)
cjr
c protect against bug in packer and cprtps which screws up on very
c small numbers
cjr
do k=1,numlev
do i=1,plon
if (abs(hfield(i,k)).lt.tootiny) then
hfield(i,k) = 0.
endif
end do
end do
C
if (ndensh.gt.1) then
iplen = (plon+ndensh-1)/ndensh + 2
do k=1,numlev
call packaf(hfield(1,k), plon, packsp, ndensh, dummy,
$ dummy)
do i=1,iplen
hbuf(ibufp+i-1) = packsp(i)
end do
ibufp = ibufp + iplen
end do
else
ibufp = ibufp + numlev*plon
end if
ip = ip + numlev*plon ! Set unpacked pointer for next field
kfld = kfld + 1
100 continue
C
C End of loop over no. of fields in history file. Now write data out to
disk
C
#if ( defined SPMD )
if (.not.masterproc) then
c
c Pack and ship my data to processor 0
c
c write(0,*)'WSHIST: Sending my latitude,ktape = ',lat,ktape
call MPI_SEND(hbuf,nplen(ktape),
$ REALTYPE,0,
$ msgtype+lat,MPI_COMM_WORLD,ier)
else
c write(0,*)'WSHIST: writing my latitude,ktape = ',lat,ktape
call wrtharr(hunit(ktape),hbuf,nplen(ktape),lat,plon)
cp
c The following code assumes lat=1 is on processor 0. Get all latitudes for
c a given processor now since number of lats per processor may vary.
c
c xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
c
c.... Modifications by Keith Eric Grant, LLNL, 09 March 1999
c
c.... Pick up output from other processors only on first lat
c.... and for call for first history tape.
c
c.... Inner loop over active history files, so that output is
c.... picked up from processors in the order that it is sent.
c.... This prevents an MPI lockup on a blocking MPI_SEND. The
c.... original code could fail when there were auxillary
c... history files.
c
c xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
c
c
c.... Determine the first active history file at current time
jtape = 0
do itape=1,mtapes
if ( jtape .eq. 0 .and. hstwr(itape) ) jtape = itape
end do
c.... Get output from other processors for all history files
c.... only on the first call to processor 0
c
if (lat.eq.1 .and. ktape .eq. jtape ) then
do iproc=1,npes-1
do j=cut(1,iproc),cut(2,iproc)
do itape=1,mtapes
if (hstwr(itape)) then
c
c receive latitude "j"
c
c write(0,*)'WSHIST: Receiving latitude,ktape = ',j,itape
call MPI_RECV(hbuf,nplen(itape),
$ REALTYPE,proc(j),
$ msgtype+j,MPI_COMM_WORLD,stat,ier)
c write(0,*)'WSHIST: writing latitude,ktape = ',j,itape
call wrtharr(hunit(itape),hbuf,nplen(itape),j,plon)
endif
end do
end do
end do
end if
end if
#else
call wrtharr(hunit(ktape) ,hbuf ,nplen(ktape) ,
$ lat ,plon)
#endif
C
C Clear history file buffer (unpacked length)
C
if (nrlen(ktape).gt.0) call resetr(hbuf ,nrlen(ktape),0.)
C
return
end
-- +-----------------------------+-------------------------------------------+ I Keith Eric Grant I Common sense and a sense of humor are the I I I same thing, moving at different speeds. I I Atmospheric Science Div I A sense of humor is just common sense, I I P.O. Box 808, L-103 I dancing. ... Clive James I I Lawrence Livrmr Natn'l Lab I I I EMail: keg@llnl.gov I (or perhaps dancing is just common sense) I I FAX: (925) 422-5844 I I +-----------------------------+-------------------------------------------+
This archive was generated by hypermail 2b27 : Thu Jun 01 2000 - 09:28:16 MDT