Computations on Palma

Compiling Software

The module concept

Environment variables (like PATH, LD_LIBRARY_PATH) for compilers and libraries can be set by modules:

Command (short and long form)      Meaning
module av[ailable]                 Lists all available modules
module li[st]                      Lists all modules in the current environment
module show module_name            Lists all changes caused by a module
module add module1 module2 ...     Adds modules to the current environment
module rm module1 module2 ...      Removes modules from the current environment
module purge                       Removes all modules from the current environment
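
A short example session could look like this (the module name is the Intel compiler module that also appears in the compile example below):

module av                      # which modules are installed?
module show intel/cc/11.1.059  # which environment variables does the module set?
module add intel/cc/11.1.059   # load the module
module li                      # list the currently loaded modules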

The modules set several environment variables that can be used in compiler calls and makefiles.

Example: Compiling a program that uses FFTW:

module add intel/cc/11.1.059
module add mpi/intel/4.0.3.008
module add fftw/intel/3.3.3

${MPIICC} -I ${FFTW_INCLUDE_DIR} -o program program.c -g ${FLAGS_FAST} -L${FFTW_LIB_DIR} -lfftw_mpi -lfftw -lm
Explanation: The fftw module sets the environment variables FFTW_INCLUDE_DIR and FFTW_LIB_DIR. These can be used to shorten the compiler calls (also in makefiles).

For best performance, we recommend using Intel MPI.

Using the module command in submit scripts

If you want to use the module command in submit scripts, the line

source /etc/profile.d/modules.sh

has to be added before the first module command.
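
For example, the beginning of a submit script for an Intel MPI program could look like this (the module names are taken from the compile example above; adjust them to your needs):

source /etc/profile.d/modules.sh   # make the module command available in the script
module add intel/cc/11.1.059
module add mpi/intel/4.0.3.008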

Submitting jobs

The batch system Torque and the scheduler Maui are used to submit jobs. It is not allowed to start jobs manually, i.e. outside the batch system. Batch jobs should only be submitted from the server palma1.

Creating submit-files

Example of a submit file for an MPI job:

#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=4:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
cd $PBS_O_WORKDIR
mpdboot --rsh=ssh -n 4 -f $PBS_NODEFILE  -v
mpirun --rsh=ssh -machinefile $PBS_NODEFILE -np 32 ./executable

An MPI job with 32 processes is started. For this purpose, 4 Westmere nodes with 12 cores each are requested; of the 48 reserved cores, 32 are used by mpirun in this example.

Further Information:

  • project_name: Replace with your own project name; otherwise the job will not run
  • username: Replace with your own username
  • job_directory: Replace with the path where the executable can be found
  • executable: Enter the name of the executable
  • walltime: The time needed for a whole run; at the moment, a maximum of 48 hours is possible in the default queue
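
If you are unsure how many cores and nodes a job actually received, the following optional lines (a small sketch, not part of the original example) can be added to the script after the cd command; they only print information from the node file provided by Torque:

echo "Reserved cores: $(wc -l < $PBS_NODEFILE)"
echo "Distinct nodes: $(sort -u $PBS_NODEFILE | wc -l)"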

If no MPI is needed, the submit file can be simpler.

Example of a job using OpenMP:

#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=1:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=12
./executable

Setting up hybrid jobs

To start a hybrid parallel job, i.e. one that uses both MPI and OpenMP, you can use the following method:

#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=8:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
# PARNODES: total number of reserved cores (one line per core in $PBS_NODEFILE)
export PARNODES=`wc -l $PBS_NODEFILE |gawk '{print $1}'`
# UNIQNODES: number of distinct nodes; one MPI process is started per node
export UNIQNODES=`cat $PBS_NODEFILE |gawk '{print $1}' | uniq | wc -l`
cd $PBS_O_WORKDIR
mpdboot --rsh=ssh -n $UNIQNODES -f $PBS_NODEFILE  -v
OMP_NUM_THREADS=12 mpirun --rsh=ssh -machinefile $PBS_NODEFILE -np $UNIQNODES -env OMP_NUM_THREADS=12 ./executable

This starts 8 MPI processes (one per node) with 12 OpenMP threads each.

Using gcc and OpenMPI, the same job would look like this:

#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=8:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
export PARNODES=`wc -l $PBS_NODEFILE |gawk '{print $1}'`
export UNIQNODES=`cat $PBS_NODEFILE |gawk '{print $1}' | uniq | wc -l`
cd $PBS_O_WORKDIR
OMP_NUM_THREADS=12 mpirun -machinefile $PBS_NODEFILE -n $UNIQNODES -x OMP_NUM_THREADS=12 ./executable

Submitting jobs / Managing the queue

A job is submitted by entering the command

 qsub submit.cmd 

where submit.cmd is the name of the submit file.

Further commands:

  • qstat: Shows the current queue
  • qstat -a: As above, but with the number of requested cores
  • qstat -n: Shows in detail, which nodes are used
  • qdel job_number: Deletes a job from the queue
  • showbf: Shows the number of free cores
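
A typical session could look like this (12345 is only a placeholder for the job number that qsub prints):

qsub submit.cmd   # submit the job; qsub prints the job number
qstat -a          # check the queue and the number of requested cores
qdel 12345        # delete the job again, if necessary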

Choosing the compute nodes

The option "#PBS -l" determines, which resources are required for the batch job. Due to the existence of two different kind of nodes it can be distinguished between them with the attribute "nehalem" and "westmere". The following tables shows different possibilities to reserve nodes.

Option in the submit file        Nodes that will be reserved
-l nodes=10:westmere:ppn=12      10 Westmere nodes with 12 cores each
-l nodes=2:himem:ppn=12          2 Westmere himem nodes (with 48 GB memory per node)
-l nodes=1:ppn=8                 8 cores of a Westmere or a Nehalem node (not recommended)
-l nodes=1:nehalem:ppn=1         1 core of a Nehalem node (recommended for serial jobs)

Please send serial jobs only to the Nehalem nodes and parallel jobs to the Westmere nodes.
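
As a complete example, a minimal submit script for a serial job on one Nehalem core could look like this (it follows the same pattern as the examples above; project_name, username, job_name and executable again have to be replaced):

#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=1:nehalem:ppn=1
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
cd $PBS_O_WORKDIR
./executable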

The queues

There are several different queues on Palma:

  • default: maximum walltime of 48 hours
  • long: same as default, but with a maximum walltime of 160 hours; only 8 jobs of each user will be executed at a time
  • mescashort: jobs will be sent to the nodes palma060-palma063 or palma065, which have 256 GB RAM and 32 CPU cores; they are not accessible via the default queue; only 4 jobs per user will run at the same time; maximum walltime of 48 hours
  • mescalong: the same as mescashort, but with a maximum of 2 jobs per user and a walltime of 160 hours
  • mescatest: higher prioritized queue for testing jobs on the mesca nodes; walltime is restricted to 1 hour
  • mescabig: jobs will be sent to palma064 only, which has 512 GB memory; maximum walltime of 160 hours
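
The queue is selected with the "#PBS -q" line in the submit script, as in the examples above. To use the long queue instead of the default queue, for example, only this line has to be changed:

#PBS -q long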

Global restrictions:

  • Jobs will only start if they use less than 132,710,400 core-seconds of CPU time. This corresponds, for example, to 48 hours on 64 Westmere nodes (48 h × 3600 s/h × 64 nodes × 12 cores = 132,710,400). If you want to use more nodes at a time, please reduce the walltime accordingly.

Using the batch system for jobs with a GUI

To start a program with a graphical user interface (GUI) on a compute node via the batch system, see the separate guide on this topic.

Monitoring jobs

There are several tools for monitoring jobs:

  • qstat -a: Shows the queues with running and waiting jobs
  • pbstop: Similar to qstat, but with a text-based graphical output
  • myJAM: Web interface; see "Rack View" to view the state of every node in the system
  • Ganglia: Shows even more information about every node, including memory and CPU usage

The scratch partition

In /scratch, there are 180 TB of space waiting for user data. For space and performance reasons, data generated by the codes running on Palma should be stored here. The filesystem used here is Lustre, a parallel filesystem. It is therefore optimized for writing a small number of large files; writing a huge number of small files can decrease the performance drastically.

There is no backup of the scratch partition, so please back up your data yourself!
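
A common pattern (only a sketch; the directory name /scratch/username/myjob is an assumed example, use whatever directory you create on /scratch) is to run the job in a scratch directory and to copy the important results back afterwards:

mkdir -p /scratch/username/myjob   # assumed directory name, adjust to your own
cd /scratch/username/myjob
cp $PBS_O_WORKDIR/executable .     # copy the program from the submit directory
./executable                       # the program writes its data to /scratch
cp -r results $PBS_O_WORKDIR/      # "results" is a placeholder for your output files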

-- HolgerAngenent - 2010-08-16
