Re: On running CCM3.6.6 on LINUX clusters


Subject: Re: On running CCM3.6.6 on LINUX clusters
From: Jim Rosinski (rosinski@cgd.ucar.edu)
Date: Tue Jun 13 2000 - 11:13:08 MDT


Ping Liu;

Regarding the problems you are encountering running CCM3.6.6 in SPMD mode on
Linux clusters, I have a couple of comments. First, I see that you are
writing 2 history tapes. There is a known problem writing more than 1 history
tape in SPMD mode with this model (also CCM3.6). Usually the symptom is
that the model just hangs, but it might also produce the symptom you are
encountering. Someone has posted a fix for this problem to the ccm-users
mail group. So you could either reformulate your integration to
require only 1 history tape, or search through the ccm-users mail archive
(www.cgd.ucar.edu/cms/ccm3/ccm-users.shtml) to find the code fix for writing
multiple history tapes in SPMD mode.
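
As a quick check, the number of history tapes can be read off the logical
unit summary that the model prints at startup, as in the log quoted below.
The following is only a minimal sketch: the function name history_tapes and
the regular expression are assumptions based on the posted log lines, not
part of the CCM distribution.

```python
import re

# Pattern assumed from the posted log, e.g. "History file number 2 = 32".
HIST_RE = re.compile(r"History file number\s+(\d+)\s*=\s*(\d+)")


def history_tapes(log_text):
    """Return {tape number: Fortran unit} for each history file in the log."""
    return {int(tape): int(unit) for tape, unit in HIST_RE.findall(log_text)}


# Excerpt of the logical unit summary from the quoted message.
sample = """\
> History file number 1 = 31
> History file number 2 = 32
> Restart dataset unit (nsds) = 1
> Regeneration units for hist file 1 = 17 18 19 20 21
"""

tapes = history_tapes(sample)
if len(tapes) > 1:
    # More than one tape in SPMD mode is the configuration with the known problem.
    print(f"{len(tapes)} history tapes configured; the SPMD multi-tape problem may apply")
```

If this reports more than one tape for an SPMD run, the known multiple
history tape problem is a likely suspect.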

The second comment has to do with the fact that CCM3.6.6 has severe
performance problems on certain platforms when run in SPMD mode (notably the
IBM SP3). We (CCM developers) are in the process of installing mods to fix
this problem. Sparing you the gory details, the reformulated model will
write history tapes directly in netcdf format. Besides drastically
improving IO performance, this eliminates the SPMD multiple history tapes bug.
Stay tuned.

Regards,

Jim Rosinski
CCM Core Group

On Tue, 13 Jun 2000 liup@lasgsgi4.iap.ac.cn wrote:

> While running CCM3.6.6 on LINUX clusters with just 2 nodes, the
> initialization seems to finish. However, just a few steps later, the
> following error message appears:
>
> **** Summary of Logical Unit assignments ****
>
> History file number 1 = 31
> History file number 2 = 32
> Restart dataset unit (nsds) = 1
> Master regeneration unit (nrg) = 2
> Regeneration dataset units (nrg1) = 3 4 7 8 9
> Abs/ems unit for restart (nrg2) = 10 11 12 13 14
> Regeneration units for hist file 1 = 17 18 19 20 21
> Regeneration units for hist file 2 = 22 23 24 25 26
> 1 (<--lat cycling in scan1bc.F)
> Segmentation fault
> rm_l_1_3936: p4_error: net_recv read: probable EOF on socket: 1
> bm_list_26145: p4_error: net_recv read: probable EOF on socket: 1
>
> The compiler and linker are mpif90 from mpich and pgf90; they all work
> well. I traced the code and found that the subroutine linemsbc in
> /src/dynamics/eul/scan1bc.F could not be executed for even one step. The
> number 1 mentioned above is the lat cycling in scan1bc.
>
> Any suggestions, especially from the NCAR CCM group?
>
> Thanks,
>
> Ping Liu
>


