CCM3 on Beowulf cluster - run time errors


Subject: CCM3 on Beowulf cluster - run time errors
From: Chul Eddy Chung (cchung@fiji.ucsd.edu)
Date: Tue Oct 17 2000 - 11:48:00 MDT


Dear CCM users,

I'd like to report a problem, in the hope that somebody can help me with it.

I have a Beowulf cluster consisting of 1 master + 8 slaves (each of them a
1-CPU PC). I installed the PGI CDK and the netCDF 3.5 beta version. The PGI CDK
includes an MPI library, which apparently links with pgf90; my netCDF library
also links with pgf90, since I compiled the netCDF package with pgf90.
I tested the MPI and netCDF libraries with a sample pgf90 program, and they
worked fine.

Then I downloaded the cluster version of CCM3. I removed "Msecond_underscore" from
the Makefile and changed one line under LINUX to "#define FORTRANUNDERSCORE" in
/src/ccmlsm_share/cfort.h, because neither my libmpich.a nor my libnetcdf.a uses
two underscores.
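
For readers who want to check the underscore convention of their own libraries, the following is a minimal sketch rather than part of the original report. It assumes the default three-column output of `nm` (address, type, name); the sample symbols are hypothetical and only illustrate the format.

```python
def trailing_underscores(symbol):
    """Count the underscores at the end of a linker symbol name."""
    return len(symbol) - len(symbol.rstrip("_"))


def underscore_summary(nm_output):
    """Tally defined text symbols (type 'T') by trailing-underscore count."""
    counts = {}
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols appear as "address type name"; undefined ones
        # (type 'U') have no address and are skipped here.
        if len(parts) == 3 and parts[1] == "T":
            n = trailing_underscores(parts[2])
            counts[n] = counts.get(n, 0) + 1
    return counts


# Hypothetical nm output, for illustration only (not taken from the post).
sample = """\
00000000 T mpi_init_
00000010 T mpi_comm_rank_
00000020 T nf_open_
         U malloc
"""

print(underscore_summary(sample))  # {1: 3}
```

If every Fortran-callable symbol falls under the key 1, the library uses the single-underscore convention that FORTRANUNDERSCORE is meant to match; entries under 2 would point to the double-underscore convention.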

After these changes, CCM3 compiles without any error messages. However, I get
run-time error messages
(I quote):

NODE# NAME
  0 gcm1.ucsd.edu
  1 n01.ucsd.edu
  2 n02.ucsd.edu
  3 n03.ucsd.edu

.......................

 47 QPERT
max rss=0 shared mem=0 unshared data=0 unshared$
max rss=0 shared mem=0 unshared data=0 unshared stack=0
max rss=0 shared mem=0 unshared data=0 unshared stack=0
max rss=0 shared mem=0 unshared data=0 unshared stack=0
max rss=0 shared mem=0 unshared data=0 unshared stack=0
max rss=0 shared mem=0 unshared data=0 unshared stack=0
max rss=0 shared mem=0 unshared data=0 unshared stack=0
rm_l_1_6060: p4_error: net_recv read: probable EOF on socket: 1
rm_l_2_5983: p4_error: net_recv read: probable EOF on socket: 1
rm_l_3_6046: p4_error: net_recv read: probable EOF on socket: 1
 1 A
 48 QRL 18 A

......................

p0_5739: (16.420643) Trying to receive a message when there are no connections;$
--------------------------------------------------------
Then the model stops, just before it makes the first time integration. I believe
that it stops at the point where the slaves start to participate.
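
To see at a glance which nodes reported a p4 error, the quoted log can be scanned with a short script. This is only an illustrative sketch: it assumes that the `rm_l_<rank>_<pid>` prefix identifies the listener process for a given rank, which is how the log above appears to be laid out, and it reuses the node table and error lines quoted in this message.

```python
import re

# Assumed prefix layout: rm_l_<rank>_<pid>: p4_error: <message>
RM_LISTENER = re.compile(r"^rm_l_(\d+)_(\d+): p4_error: (.*)$")


def failed_ranks(log_text):
    """Return {rank: message} for every listener line reporting a p4_error."""
    ranks = {}
    for line in log_text.splitlines():
        match = RM_LISTENER.match(line.strip())
        if match:
            ranks[int(match.group(1))] = match.group(3)
    return ranks


# Node table and log lines copied from the quoted output.
nodes = {
    0: "gcm1.ucsd.edu",
    1: "n01.ucsd.edu",
    2: "n02.ucsd.edu",
    3: "n03.ucsd.edu",
}
log = """\
max rss=0 shared mem=0 unshared data=0 unshared stack=0
rm_l_1_6060: p4_error: net_recv read: probable EOF on socket: 1
rm_l_2_5983: p4_error: net_recv read: probable EOF on socket: 1
rm_l_3_6046: p4_error: net_recv read: probable EOF on socket: 1
"""

for rank, message in sorted(failed_ranks(log).items()):
    print(rank, nodes[rank], message)
```

Under that assumption, every slave in the node table (ranks 1 to 3) reports the same EOF, while the master (rank 0) does not appear among the listener errors.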

I have very little idea what is wrong, and I'd be happy to get any clue. One
thing I now suspect is the file /etc/fstab on each slave, which has this line:
"192.168.0.100:/CCM /CCM nfs noac,rsize=8192,wsize=8192". /CCM is the
CCM3 working directory and is exported from the master.
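
For reference, the fields of that fstab line can be pulled apart as follows. This is a minimal sketch, not a diagnosis: the helper name `parse_fstab_line` is made up for illustration, and the code only splits the entry into device, mount point, filesystem type and options so that each setting can be inspected on its own.

```python
def parse_fstab_line(line):
    """Split one fstab entry into its fields and an options dictionary."""
    fields = line.split()
    entry = {
        "device": fields[0],
        "mount_point": fields[1],
        "fs_type": fields[2],
        "options": {},
    }
    if len(fields) > 3:
        for option in fields[3].split(","):
            key, _, value = option.partition("=")
            # Flags such as "noac" carry no value; store them as True.
            entry["options"][key] = value if value else True
    return entry


# The line quoted above.
entry = parse_fstab_line("192.168.0.100:/CCM /CCM nfs noac,rsize=8192,wsize=8192")
print(entry["device"], entry["mount_point"], entry["fs_type"])
print(entry["options"])
```

Here `noac` turns off NFS attribute caching on the client, and `rsize`/`wsize` set the read and write transfer sizes in bytes. Whether any of these settings is related to the errors above is not established.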

Bye.

--
Chul Eddy Chung: http://www-c4.ucsd.edu/personnel/cchung
Postdoc with Ramanathan, Center for Atmospheric Sciences
Scripps Institution of Oceanography,                UCSD
Tel) 858-822-1356                      Fax) 858-534-7452


