Using PVM and pvmloop


I have written some software which facilitates data reduction and other repetitive tasks on the Titan cluster (or on any other pvm). See man pvmloop for more information.


Setting up PVM

To use pvmloop you need to start a pvm. This in turn requires setting up environment variables and modifying your .rhosts file. This is documented in a general manner in the pvm3 users guide and reference manual, but can be a bit fiddly, so I have described the procedure in detail below.

In what follows I will assume that you are running on cronus, and that your shell is csh or tcsh (to see what your shell is, do echo $SHELL).

First, pvm is installed in /usr/local/bin, so make sure this directory is in your path and do which pvm to check that you can get to it.

IMPORTANT: this directory, and the other parts of the path to be described in a moment, will also need to be in your path on the slaves when remote processes get executed on them. This means that you MUST set your path in your shell rc file .cshrc or .tcshrc rather than in your .login file, since the latter does not get sourced when, for example, rsh is used.

Also, watch out for things being sourced conditionally, for example depending on whether the shell is a login shell.

If in doubt, to check what the actual path is on the slaves, do

rsh node01 'echo $path'

(be sure to use single quotes here). However, note that the environment inherited by processes spawned by pvm is that in effect when the pvm daemons are started, so if you change your path definition in your shell rc files, for example to add the path to some needed software, then you will need to 'halt' your pvm and start it afresh.
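A related check is

rsh node01 'which pvm'

which should report /usr/local/bin/pvm; if instead you get a 'not found' complaint, the path set in your rc file is not being seen by remote shells.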

Second, you need to define the environment variables PVM_ROOT and PVM_ARCH by inserting the lines

setenv PVM_ROOT /usr/local/pvm3

setenv PVM_ARCH SUN4SOL2

in your shell rc file.

Following this, you should insert lines to add the following to your path:

$PVM_ROOT/lib

$PVM_ROOT/lib/$PVM_ARCH

$PVM_ROOT/bin/$PVM_ARCH

$HOME/pvm3/bin/$PVM_ARCH

where the last of these will contain your private pvm executables.
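Putting all of this together, the relevant lines in your rc file might look something like the following (a csh/tcsh sketch; adapt it to the structure of your own .cshrc):

setenv PVM_ROOT /usr/local/pvm3
setenv PVM_ARCH SUN4SOL2
set path = ( $path /usr/local/bin $PVM_ROOT/lib $PVM_ROOT/lib/$PVM_ARCH $PVM_ROOT/bin/$PVM_ARCH $HOME/pvm3/bin/$PVM_ARCH )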

Finally, pvm needs to be able to execute processes remotely, so you need to have a .rhosts file set up on all of the machines you intend to use. For the Titan cluster, our home directories are shared by all of the machines, so we only need to create one file. Mine contains the following lines:


cronus kaiser

hyperion kaiser

theia kaiser

atlas kaiser

phoebe kaiser

crius kaiser

dione kaiser

coeus kaiser

metis kaiser

You will want to replace my username with yours!


The nodes are aliased so that hyperion = node01, etc.

To check that this works, try executing something like

rsh node01 date

on cronus. You should see the usual date output.


Now execute pvm. You should get a prompt like

pvm>

If you type conf you will see the configuration of the pvm; right now it just contains cronus.

Now try to 'enroll' node01 in the pvm by typing add node01 at the pvm> prompt.

If you get an error message like node01 Can't start pvmd, it probably means your path is not set up properly for rsh commands on that host. If the add command claims to be successful, type conf to see the new configuration.
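If all goes well, the exchange will look something like this (the details, in particular the DTID values, will vary with your configuration and pvm version):

pvm> add node01
1 successful
                    HOST     DTID
                  node01    80000
pvm> conf
2 hosts, 1 data format
                    HOST     DTID     ARCH   SPEED
                  cronus    40000 SUN4SOL2    1000
                  node01    80000 SUN4SOL2    1000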

Please refer to the excellent pvm3 users guide and reference manual for more information on installing and running pvm or writing programs to use the pvm library.


You could add all of the slaves manually in this way, but it is easier to provide pvm with a list of slaves when you start it. To do this, type halt at the pvm> prompt to kill the pvm, and then restart it with

pvm ~kaiser/pvm/titan/all.lst

and check that all 8 nodes have been successfully enrolled by typing conf as before.
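If you would rather maintain your own list, a pvm hostfile is at its simplest just a list of host names, one per line (the machine on which you start pvm is included automatically). For the Titan slaves it might contain:

node01
node02
node03
node04
node05
node06
node07
node08

Hostfiles can also carry per-host options; see the pvm documentation for details.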

You can now exit from the pvm command interface by typing quit, which leaves the virtual machine running.

A couple of useful pvm console commands are ps -a, which lists all of the pvm processes running, and reset, which kills them all but leaves the pvm itself running.


Using pvmloop

If you read the pvmloop man page, you will see that it allows you to execute repetitive commands over the pvm. To do this, it uses two auxiliary commands, pvmserver and topvm. These must live in your personal $HOME/pvm3/bin/$PVM_ARCH directory, so you can copy them from mine with

cp ~kaiser/pvm3/bin/SUN4SOL2/pvmloop ~/pvm3/bin/SUN4SOL2/

and similarly for pvmserver and topvm.
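If the target directory does not yet exist you will need to create it first. In csh, the whole job can be done with something like:

mkdir -p ~/pvm3/bin/SUN4SOL2
foreach f ( pvmloop pvmserver topvm )
    cp ~kaiser/pvm3/bin/SUN4SOL2/$f ~/pvm3/bin/SUN4SOL2/
end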

With luck you are now nearly ready to go. The final thing you need to do is create a file pvmslaves.lst, which pvmloop uses to figure out how many slaves to use, and which associates with each node a 'node string' that is used primarily to generate the names of the slaves' local scratch disks.

As a test, go into some convenient temporary directory and copy over the file ~kaiser/pvm/titan/pvmslaves.lst, which tells pvmloop to use the 6 slaves node01 ... node06 with associated node strings 01 ... 06. Then execute the command

pvmloop 6 'df -k /d%N'

What pvmloop does is to start up a pvmserver process on each of the slaves. It then takes the 'command string template' provided as its second argument and generates copies of this, with the special substring '%N' replaced by each node string. It then sends these strings to the pvmserver processes using the pvm message passing machinery, and these processes use the system() call to execute them. It does this in such a way that the stdout and stderr of the child processes are sent back to the pvmloop process running on the master and merged into the corresponding output streams of the pvmloop process. Finally, the pvmserver processes are sent a message telling them to exit().
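To make this concrete: with the 6 slaves and node strings 01 ... 06 defined above, the template 'df -k /d%N' expands to the six commands

df -k /d01     (sent to node01)
df -k /d02     (sent to node02)
df -k /d03     (sent to node03)
df -k /d04     (sent to node04)
df -k /d05     (sent to node05)
df -k /d06     (sent to node06)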

Since the scratch disks on the Titan cluster are called /d01, /d02, ..., this example should tell you how much disk space is free on each of the scratch disks on the 6 slaves.

Please refer to the pvmloop man page for more information, including how to use the other four special codes %n, %i, %I and %%.


PVM hints

I have a utility pvmmonitor that periodically checks the CPU usage on the slaves. You can get a copy of it from ~kaiser/pvm3/bin/SUN4SOL2/.

You may want to inspect some of the scripts in e.g. ~kaiser/12survey/cfh0900/phase1/scripts or in ~kaiser/MACS/subaru1200/phase1/scripts to get an idea of how to use imcat commands to reduce your data (though pvmloop will work happily with any other software that can be executed from scripts).

Note that ps -a in the pvm command line interface will only report the presence of the pvmserver processes, and not the actual tasks spawned by these servers, which get put in the background. Thus, for example, if you kill a pvmloop process and then use pvm's reset command to kill the servers, the actual data processing tasks may still be running. This can lead to serious side effects if you restart the process. I tend to use pvmmonitor in such circumstances to figure out when it is safe to restart the process.
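If you want to check for such orphaned tasks directly (a suggestion only; pvmmonitor is usually more convenient), you can interrogate the slaves with ordinary rsh, e.g.

rsh node01 ps -fu kaiser

substituting your own username and repeating for the other nodes.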


Experimenting with PVM and pvmloop

There is naturally something of a learning curve involved in developing scripts to handle data reduction using a pvm. In this regard, note that it is neither necessary nor desirable to book exclusive use of the Titan cluster while experimenting or debugging scripts. If you explore some of the scripts mentioned above you will see that all details about the format of the CCD mosaic camera, the size of the chips, the location of scratch directories etc. are externalized in database files (suffix .db) which get 'required' by the scripts as necessary. If you follow this approach then it is easy to set up a set of db files for a set of small test images that can be used to test one's procedures without seriously loading the cluster slaves' CPUs or scratch disk capacity.

Note also that it is very easy to install a private version of PVM on any workstation, and that even a single workstation installation can be used to simulate the behaviour of the Titan cluster (though of course with less processing power).
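With such a private installation, starting a one-host virtual machine is just a matter of running

pvm

with no hostfile argument; conf will then show just the local host, and the procedures described above can be exercised in the same way.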


Nick Kaiser - 1/16/2001