PALMA-NG

Overview

palma3 is the login node to a newer part of the PALMA system. It has various queues/partitions for different purposes:

u0dawin: A general-purpose queue. Everyone can use it, including users who are not members of a group that has submitted a proposal for PALMA. It replaces the old ZIVHPC cluster.
k20gpu: Four nodes, each equipped with 3 NVIDIA Tesla K20 accelerators.
normal: 29 nodes with 32 Broadwell CPU cores and 128 GB RAM each.
zivsmp: An SMP machine with 512 GB RAM, the old login node of ZIVHPC (not available yet).
phi: Two nodes with 4 Intel Xeon Phi Knights Corner accelerators each (not available yet).
knl: Four nodes with a Xeon Phi Knights Landing accelerator each.
requeue: Jobs in this queue run on the nodes of the queues listed above. If your job is running on one of the exclusive nodes and a job is submitted to the exclusive queue that owns this node, your job will be terminated and requeued, so use with care.
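
Because a job in the requeue partition can be terminated and restarted, it helps if the job script can tell a restart from a fresh start. The following is a minimal sketch, assuming that SLURM sets its standard SLURM_RESTART_COUNT variable for requeued jobs on this system. The checkpoint file name is hypothetical; your program would have to write such a file itself.

```shell
# True if SLURM reports that this job has been restarted at least once.
# SLURM_RESTART_COUNT is unset on a first run, so default to 0.
was_requeued() {
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

CHECKPOINT="checkpoint.dat"   # hypothetical file written by your program

if was_requeued && [ -f "$CHECKPOINT" ]; then
    echo "restart detected, resuming from $CHECKPOINT"
else
    echo "fresh start"
fi
```

Placed near the top of a submit script, this lets the job decide whether to start over or continue from saved state.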

There are some special queues, which are only allowed for certain groups (these are also Broadwell nodes like in the normal queue):

p0fuchs: 8 nodes for exclusive usage
p0kulesz: 4 nodes for exclusive usage
p0klasen: 1 node for exclusive usage
p0kapp: 1 node for exclusive usage
hims: 4 nodes for exclusive usage

The module concept

Environment variables (like PATH, LD_LIBRARY_PATH) for compilers and libraries can be set by modules:

Command (short and long form)	Meaning
module add module1 module2 ...	Adds the modules to the current environment
module av[ailable]	Lists all available modules
module li[st]	Lists all modules in the current environment
module purge	Removes all modules from the current environment
module rm module1 module2 ...	Removes the modules from the current environment
module show modulename	Lists all changes a module makes to the environment

Several environment variables will be set by the modules.

When you log in to palma3, some modules are loaded automatically.
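
To see what a module actually changes, you can compare PATH before and after loading it. The sketch below is only an illustration: the helper functions are not part of the module system, and the module name "intel" is a hypothetical placeholder (use "module av" to see what is installed).

```shell
# Print the entries of a colon-separated variable such as PATH, one per line.
path_entries() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Print the entries that occur in the second list but not in the first.
added_entries() {
    path_entries "$2" | while IFS= read -r entry; do
        case ":$1:" in
            *":$entry:"*) ;;                  # was already present
            *) printf '%s\n' "$entry" ;;      # new entry
        esac
    done
}

MODULE_NAME="${MODULE_NAME:-intel}"   # hypothetical module name

# Only try this where the module command exists (e.g. on palma3).
if command -v module >/dev/null 2>&1; then
    old_path=$PATH
    module add "$MODULE_NAME"
    added_entries "$old_path" "$PATH"
fi
```

The same comparison works for LD_LIBRARY_PATH or any other colon-separated variable.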

Using the module command in submit scripts

If you want to use the module command in a submit script, add the line

source /etc/profile.d/modules.sh; source /etc/profile.d/modules_local.sh

before the first module command. Alternatively, put the "module add" commands in your .bashrc (located in your home directory).
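
If a submit script should fail early when the initialisation scripts are missing, the sourcing step can be wrapped in a small function. This is a sketch, not an official helper: the function name and its optional directory argument are assumptions made for illustration, and only the two file names come from the text above.

```shell
# Source the two module initialisation scripts if both are readable.
# The optional argument (a directory) defaults to /etc/profile.d.
init_modules() {
    dir="${1:-/etc/profile.d}"
    for f in "$dir/modules.sh" "$dir/modules_local.sh"; do
        if [ -r "$f" ]; then
            . "$f"      # "." is the portable spelling of "source"
        else
            echo "warning: $f not found" >&2
            return 1
        fi
    done
}

# Usage in a submit script:
#   init_modules || exit 1
#   module add <modulename>
```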

Monitoring

Ganglia

If you have X forwarding enabled, you can use llview (Just type "llview" at the command line).

The batch system

The batch system on PALMA3 is SLURM, but a PBS wrapper is installed, so most of your scripts should still work. If you want to switch to SLURM, this document might help you: https://slurm.schedmd.com/rosetta.pdf

When using PBS scripts, there are some differences from PALMA:

The first line of the submit script has to be #!/bin/bash
In SLURM terms, a queue is called a partition. The two terms are used interchangeably here.
The variable $PBS_O_WORKDIR will not be set. Instead you will start in the directory in which the script resides.
To use the "module add" command, you have to source some scripts first: "source /etc/profile.d/modules.sh; source /etc/profile.d/modules_local.sh"
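
Old PBS scripts often begin with "cd $PBS_O_WORKDIR", which does not work as intended when the variable is unset. One possible way to keep such scripts working is a fallback chain, sketched below. SLURM_SUBMIT_DIR is a standard SLURM variable that is not mentioned in the list above, and the function name is made up for this example.

```shell
# Pick the working directory: PBS_O_WORKDIR if it is set,
# otherwise SLURM_SUBMIT_DIR, otherwise the current directory.
job_workdir() {
    printf '%s\n' "${PBS_O_WORKDIR:-${SLURM_SUBMIT_DIR:-$PWD}}"
}

cd "$(job_workdir)" || exit 1
```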

Submit a job

Create a file, for example called submit.cmd:

#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set the total number of tasks (CPU cores); with one node, all 8 run on that node
#SBATCH -n 8

# set a partition
#SBATCH -p u0dawin

# set max wallclock time
#SBATCH --time=24:00:00

# set name of job
#SBATCH --job-name=test123

# mail alert at start, end and abort of execution
#SBATCH --mail-type=ALL

# set an output file
#SBATCH -o output.dat

# send mail to this address
#SBATCH --mail-user=your_account@uni-muenster.de

# In the u0dawin queue, you will need the following line
source /etc/profile.d/modules.sh; source /etc/profile.d/modules_local.sh

# run the application
./program

Submit the script to the batch system with the command "sbatch submit.cmd".

A detailed description can be found here: http://slurm.schedmd.com/sbatch.html
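
When sbatch accepts a job, it normally prints a confirmation line of the form "Submitted batch job <jobid>". If you want to use the job ID in later commands, you can extract it as sketched below. The function name is an assumption for this example, and the pattern only covers that standard confirmation line, so the output of the PBS wrapper may look different.

```shell
# Extract the numeric job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 12345". Prints nothing if the line does not match.
parse_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Only submit where sbatch and the script from above are available.
if command -v sbatch >/dev/null 2>&1 && [ -f submit.cmd ]; then
    jobid=$(parse_job_id "$(sbatch submit.cmd)")
    echo "submitted job $jobid"
fi
```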

Show information about the queues

scontrol show partition

Show information about the nodes

sinfo
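
The sinfo output can be reduced to a single number, for example the count of idle nodes in one partition. The following sketch assumes sinfo is called with the format string "%P %D %t" (partition, node count, compact state); the function name is made up for this example.

```shell
# Sum the node counts of all lines for the given partition whose state is "idle".
# Expects "partition nodes state" lines on stdin, as printed by
#   sinfo -h -o "%P %D %t"
count_idle_nodes() {
    awk -v part="$1" '
        { sub(/\*$/, "", $1) }                       # sinfo marks the default partition with "*"
        $1 == part && $3 == "idle" { n += $2 }
        END { print n + 0 }'
}

if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -o "%P %D %t" | count_idle_nodes normal
fi
```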

Running interactive jobs with SLURM

Use for example the following command:

srun -p u0dawin -N 1 --ntasks-per-node=8 --pty bash

This starts a job in the u0dawin queue/partition on one node with eight cores.
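
Inside the interactive shell, SLURM normally exports variables that describe the allocation. A small check like the following sketch shows whether the shell is really running inside a job. The variable names are standard SLURM ones, but whether all of them are set can depend on the options used, so missing values are printed as "unknown".

```shell
# Print a one-line summary of the current SLURM allocation,
# or a hint if the shell is not running inside a job.
describe_allocation() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-unknown} with ${SLURM_NTASKS:-unknown} tasks"
    else
        echo "not inside a SLURM job"
    fi
}

describe_allocation
```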

Information on jobs

List all current jobs for a user:

squeue -u <username>

List all running jobs for a user:

squeue -u <username> -t RUNNING

List all pending jobs for a user:

squeue -u <username> -t PENDING

List all current jobs in the normal partition for a user:

squeue -u <username> -p normal

List detailed information for a job (useful for troubleshooting):

scontrol show jobid -dd <jobid>
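
For a quick overview, the squeue output can be condensed into a count per job state. The sketch below assumes squeue is called with "-h -o %T", which prints one state name (such as RUNNING or PENDING) per job; the function name is made up for this example.

```shell
# Count how often each job state occurs.
# Expects one state name per line on stdin and prints lines of the form STATE=count.
count_job_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%T" | count_job_states
fi
```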

Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.

To get statistics on completed jobs by jobID:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

To view the same information for all jobs of a user:

sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
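
The MaxRSS column is usually printed with a unit suffix, for example "2048K". To compare it with the memory of a node, it can be converted to MiB as in the following sketch. It assumes the suffixes K, M and G and powers of 1024; other formats are reported as "unknown". The function name is made up for this example.

```shell
# Convert a MaxRSS value such as "2048K", "512M" or "1.5G" to whole MiB (rounded).
maxrss_to_mib() {
    printf '%s\n' "$1" | awk '
        /^[0-9.]+[KMG]$/ {
            value = substr($0, 1, length($0) - 1)
            unit  = substr($0, length($0))
            if (unit == "K") value = value / 1024
            if (unit == "G") value = value * 1024
            printf "%d\n", value + 0.5
            next
        }
        { print "unknown" }'
}

# Usage with a value copied from the sacct output:
#   maxrss_to_mib 2048K
```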

Show priorities for waiting jobs:

sprio

Controlling jobs

To cancel one job:

scancel <jobid>

To cancel all the jobs for a user:

scancel -u <username>

To cancel all the pending jobs for a user:

scancel -t PENDING -u <username>

To cancel one or more jobs by name:

scancel --name myJobName
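
Before cancelling several jobs at once, it can be useful to list which jobs would be affected. The following sketch selects pending jobs whose name starts with a given prefix. The function name and the prefix "test" are hypothetical, and the scancel call is left commented out so that nothing is cancelled by accident.

```shell
# From "jobid name" lines on stdin, print the IDs of jobs whose name
# starts with the given prefix. An empty prefix selects nothing.
ids_with_prefix() {
    awk -v p="$1" 'p != "" && index($2, p) == 1 { print $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    ids=$(squeue -h -u "$USER" -t PENDING -o "%i %j" | ids_with_prefix test)
    echo "would cancel: $ids"
    # scancel $ids     # uncomment after checking the list
fi
```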

To hold a pending job (prevent it from starting):

scontrol hold <jobid>

To release a held job:

scontrol release <jobid>

To requeue (cancel and rerun) a particular job:

scontrol requeue <jobid>

-- Holger Angenent - 2016-08-22

Topic revision: r11 - 2017-02-13 - HolgerAngenent
