Computations on Palma
Compiling Software
The module concept
Environment variables (like PATH, LD_LIBRARY_PATH) for compilers and libraries can be set by modules:
Command (short and long form)   | Meaning
module av[ailable]              | Lists all available modules
module li[st]                   | Lists all modules in the current environment
module show modulename          | Lists all changes caused by a module
module add module1 module2 ...  | Adds modules to the current environment
module rm module1 module2 ...   | Removes modules from the current environment
module purge                    | Removes all modules from the current environment
The modules set several environment variables.
To get started, it is usually sufficient to load the module "intel/2016a", a toolchain that loads further modules such as Intel MPI and the MKL.
Using the module command in submit scripts
If you want to use the module command in submit scripts, the line
source /etc/profile.d/modules.sh
has to be added before the first module command.
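For example, the beginning of a submit script that loads the toolchain before starting the program could look like this (a sketch; the resource line and module name follow the examples in this document):

```shell
#!/bin/bash
#PBS -l walltime=01:00:00,nodes=1:westmere:ppn=12
# Make the module command available in the batch environment
source /etc/profile.d/modules.sh
# Load the toolchain (brings in Intel MPI and the MKL)
module add intel/2016a
```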
Submitting jobs
The batch system Torque and the scheduler Maui are used to submit jobs. It is not allowed to start jobs manually on the compute nodes. Batch jobs should only be submitted from the server palma1.
Creating submit-files
Example of a submit file for an MPI job:
#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=4:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
cd $PBS_O_WORKDIR
mpdboot --rsh=ssh -n 4 -f $PBS_NODEFILE -v
mpirun --rsh=ssh -machinefile $PBS_NODEFILE -np 32 ./executable
An MPI job with 32 processes is started. For this purpose, 4 Westmere nodes with 12 cores each (48 cores in total) are requested.
Further information:
- project_name: Replace by your own project name, otherwise the job will not run
- username: Replace by your own username
- job_directory: Replace by the path where the executable can be found
- executable: The name of your executable
- walltime: The time needed for a whole run; at the moment, a maximum of 48 hours is possible
When no MPI is needed, the submit file can be simpler.
Example of a job using OpenMP:
#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=1:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=12
./executable
Setting up hybrid jobs
To start a hybrid parallelized job, i.e. one using both MPI and OpenMP, you can use the following method:
#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=8:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
export PARNODES=`wc -l $PBS_NODEFILE |gawk '{print $1}'`
export UNIQNODES=`cat $PBS_NODEFILE |gawk '{print $1}' | uniq | wc -l`
cd $PBS_O_WORKDIR
mpdboot --rsh=ssh -n $UNIQNODES -f $PBS_NODEFILE -v
OMP_NUM_THREADS=12 mpirun --rsh=ssh -machinefile $PBS_NODEFILE -np $UNIQNODES -env OMP_NUM_THREADS=12 ./executable
This would start 8 MPI processes on 8 nodes with 12 OpenMP threads each.
Using gcc and OpenMPI, the script would read as follows:
#PBS -o output.dat
#PBS -l walltime=01:00:00,nodes=8:westmere:ppn=12
#PBS -A project_name
#PBS -M username@uni-muenster.de
#PBS -m ae
#PBS -q default
#PBS -N job_name
#PBS -j oe
export PARNODES=`wc -l $PBS_NODEFILE |gawk '{print $1}'`
export UNIQNODES=`cat $PBS_NODEFILE |gawk '{print $1}' | uniq | wc -l`
cd $PBS_O_WORKDIR
OMP_NUM_THREADS=12 mpirun -machinefile $PBS_NODEFILE -n $UNIQNODES -x OMP_NUM_THREADS=12 ./executable
Submitting jobs / Managing the queue
A job is submitted by entering the command
qsub submit.cmd
where submit.cmd is the name of the submit file.
Further commands:
- qstat: Shows the current queue
- qstat -a: As above, but with the number of requested cores
- qstat -n: Shows in detail, which nodes are used
- qdel job_number: Deletes jobs from the queue
- showbf: Shows the number of free cores
Choosing the compute nodes
The option "#PBS -l" determines which resources are required for the batch job. Since there are two different kinds of nodes, they can be distinguished with the attributes "nehalem" and "westmere". The following table shows different possibilities to reserve nodes.
Option in the submit file   | Nodes that will be reserved
-l nodes=10:westmere:ppn=12 | 10 Westmere nodes with 12 cores each
-l nodes=2:himem:ppn=12     | 2 Westmere himem nodes (with 48 GB memory per node)
-l nodes=1:ppn=8            | 8 cores on a Westmere or a Nehalem node (not recommended)
-l nodes=1:nehalem:ppn=1    | 1 core on a Nehalem node (recommended for serial jobs)
Please send serial jobs only to the Nehalem nodes and parallel jobs to the Westmere nodes.
The queues
There are several different queues on Palma:
- default: maximal walltime of 48 hours
- long: same as default, but with a maximal walltime of 160 hours; only 8 jobs per user will be executed at a time
- mescashort: jobs will be sent to the nodes palma060-palma063 or palma065, which have 256 GB RAM and 32 CPU cores; they are not accessible via the default queue; only 4 jobs per user will run at the same time, maximal walltime of 48 hours
- mescalong: the same as mescashort, but with maximal 2 jobs per user and a walltime of 160 hours
- mescatest: higher prioritized queue to test jobs on the mesca nodes; walltime is restricted to 1 hour
- mescabig: jobs will be sent to palma064 only, which has 512 GB memory; maximal walltime of 160 hours
Global restrictions:
- Jobs will only start if they use less than 132,710,400 seconds of CPU time. This corresponds, for example, to 48 hours on 64 Westmere nodes (768 cores). If you want to use more nodes at a time, please reduce the walltime accordingly.
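The limit can be checked with a little shell arithmetic before submitting. This sketch assumes Westmere nodes with 12 cores each; CPU time is the walltime in seconds multiplied by the total number of reserved cores:

```shell
# Requested resources (example: 64 Westmere nodes for 48 hours)
NODES=64
CORES_PER_NODE=12
WALLTIME_HOURS=48

# CPU time = walltime in seconds * total number of cores
CPU_SECONDS=$((WALLTIME_HOURS * 3600 * NODES * CORES_PER_NODE))

LIMIT=132710400
echo "requested: $CPU_SECONDS s, limit: $LIMIT s"
if [ "$CPU_SECONDS" -le "$LIMIT" ]; then
    echo "job fits within the CPU-time limit"
else
    echo "reduce walltime or node count"
fi
```

The example request of 48 hours on 64 nodes lands exactly on the limit; doubling the node count would require halving the walltime.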
Using the batch system for jobs with a GUI
To start these programs with a GUI on a node with the batch system, see
this guide
Monitoring jobs
There are different tools for monitoring jobs:
- qstat -a: Shows the queues with running and waiting jobs
- pbstop: Similar to qstat, but with a text-based graphical output
- myJAM: Web interface; see "Rack View" to view the state of every node in the system
- Ganglia: Shows even more information on every node, including memory and CPU usage
The scratch partition
In /scratch there are 180 TB of space waiting for user data. For space and performance reasons, data generated by the codes running on Palma should be stored here. The filesystem used here is GPFS, a parallel filesystem. It is therefore optimized for writing a small number of large files; writing a huge number of small files can decrease performance.
There is no backup of the scratch partition, so please back up your data yourself!
--
HolgerAngenent - 2010-08-16