PALMA II

Attention: A new wiki with information about PALMA II and HPC in general can be found in the WWU Confluence!
Overview

Palma II is the HPC system of the Zentrum für Informationsverarbeitung. To be able to log in, you have to register for the group u0clstr in MeinZIV.

Filesystems

When you log in to the cluster for the first time, a directory in /home is created for you. Please use it only to store your programs, not your numerical results; storage in home is limited to 400 GB. Create a directory in /scratch/tmp and store the data you produce on the compute nodes there. To enforce this, home will be mounted read-only on the compute nodes in the future. Since /scratch is not intended as an archive, please remove your data there as soon as you no longer need it.
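As a minimal sketch, you can create such a directory yourself; the name under /scratch/tmp is only a suggestion, any name you can identify later works:

# create a personal work directory on the scratch filesystem
mkdir -p /scratch/tmp/$USER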
Software / The module concept

The software on palma-ng can be accessed via modules. These are small scripts that set environment variables (such as PATH and LD_LIBRARY_PATH) pointing to the locations where the software is installed (mostly on network drives, so that the software is available on every node of the cluster). The module system used here is LMOD. The modules are arranged hierarchically: load a toolchain or compiler module first, and "module avail" will then show the software that has been compiled with this version. Alternatively, you can use the "module spider" command to search across the whole hierarchy.
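For illustration, a few common Lmod commands (the module names are only examples and may not exist in exactly this form on PALMA II):

module avail              # list the modules that can be loaded right now
module spider Python      # search the whole module hierarchy for a package
module load foss/2018b    # load a toolchain (example name)
ml foss/2018b             # "ml" is the short form of "module load"
module list               # show the currently loaded modules
module purge              # unload all modules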
Monitoring

The batch system

The batch system on PALMA II is SLURM. If you are used to PBS/Maui and want to switch to SLURM, this document might help you: https://slurm.schedmd.com/rosetta.pdf
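For quick orientation, the day-to-day PBS commands map to SLURM roughly as follows (placeholders written as elsewhere on this page):

sbatch submit.cmd       # submit a job script (PBS: qsub)
squeue -u <username>    # list your jobs (PBS: qstat -u <username>)
scancel <jobid>         # cancel a job (PBS: qdel)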
The partitions

Submit a job

Create a file called, for example, submit.cmd:

#!/bin/bash
# set the number of nodes
#SBATCH --nodes=1
# set the number of CPU cores per node
#SBATCH --ntasks-per-node 72
# How much memory is needed (per node). Possible units: K, G, M, T
#SBATCH --mem=64G
# set a partition
#SBATCH --partition normal
# set max wallclock time
#SBATCH --time=24:00:00
# set name of job
#SBATCH --job-name=test123
# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL
# set an output file
#SBATCH --output output.dat
# send mail to this address
#SBATCH --mail-user=your_account@uni-muenster.de
# run the application
./program

You can send your submission to the batch system with the command "sbatch submit.cmd". It is recommended to reserve complete nodes if you can use 72 threads. A detailed description of all options can be found here: http://slurm.schedmd.com/sbatch.html

Starting jobs with MPI-parallel codes

mpirun will get all necessary information from SLURM if the job is submitted appropriately. If you want to start, for example, 144 MPI ranks distributed over two nodes, you could do it the following way:

#!/bin/bash
# set the number of nodes
#SBATCH --nodes=2
# reserve the nodes exclusively
#SBATCH --exclusive
# How much memory is needed (per node). Possible units: K, G, M, T
#SBATCH --mem=64G
# set a partition
#SBATCH --partition normal
# set max wallclock time
#SBATCH --time=2-00:00:00
# set name of job
#SBATCH --job-name=test123
# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL
# set an output file
#SBATCH --output output.dat
# send mail to this address
#SBATCH --mail-user=your_account@uni-muenster.de
# run the application
mpirun program

Some codes do not profit from hyperthreading, so it is better to start only 36 processes per node:

#!/bin/bash
# set the number of nodes
#SBATCH --nodes=2
# reserve the nodes exclusively
#SBATCH --exclusive
# set the number of MPI ranks per node
#SBATCH --ntasks-per-node=36
# How much memory is needed (per node). Possible units: K, G, M, T
#SBATCH --mem=64G
# set a partition
#SBATCH --partition normal
# set max wallclock time
#SBATCH --time=2-00:00:00
# set name of job
#SBATCH --job-name=test123
# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL
# set an output file
#SBATCH --output output.dat
# send mail to this address
#SBATCH --mail-user=your_account@uni-muenster.de
# run the application
mpirun program

For starting hybrid jobs (i.e. jobs using MPI and OpenMP parallelization at the same time), you can use the --cpus-per-task switch:

srun -p normal --nodes=2 --ntasks=72 --ntasks-per-node=36 --cpus-per-task=2 --pty bash
OMP_NUM_THREADS=2 mpirun ./program

Using the GPU nodes

If you want to use a GPU for your computations, submit your job to one of the GPU partitions (gpuk20, gputitanxp or gpuv100); a sketch of such a job script is shown below.
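A minimal sketch of a GPU job script; the --gres=gpu:1 request and the resource values are assumptions and may need to be adapted to the actual PALMA II configuration:

#!/bin/bash
# request one node in a GPU partition (gpuv100 chosen as an example)
#SBATCH --nodes=1
#SBATCH --partition gpuv100
# request one GPU (assumption: GPUs are scheduled as generic resources)
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --job-name=gpu_test
#SBATCH --output output.dat
# load a CUDA-aware toolchain, e.g. the one used for Caffe below
ml fosscuda/2018b
# run the application
./gpu_program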
Using Caffe

Caffe 1.0 is available for Python 3 on the GPU partitions in the fosscuda/2018b toolchain. To use it, you have to load fosscuda/2018b and Caffe (ml fosscuda/2018b Caffe) and export the Caffe PYTHONPATH, which depends on the node type (see below).
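A sketch of the complete sequence, using the Skylake path shown below as an example; the final python3 line is only a quick import check:

# load the toolchain and Caffe
ml fosscuda/2018b Caffe
# export the Caffe PYTHONPATH for Skylake GPU nodes (see the paths below)
export PYTHONPATH=/Applic.HPC/skylakegpu/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/Caffe/1.0-Python-3.6.6/python:$PYTHONPATH
# verify that Caffe can be imported
python3 -c "import caffe; print(caffe.__version__)"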
On Skylake nodes (gputitanxp and gpuv100 partitions)
export PYTHONPATH=/Applic.HPC/skylakegpu/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/Caffe/1.0-Python-3.6.6/python:$PYTHONPATH

On Broadwell nodes (gpuk20 partition)
export PYTHONPATH=/Applic.HPC/k20gpu/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/Caffe/1.0-Python-3.6.6/python:$PYTHONPATH

Show information about the partitions

scontrol show partition

Show information about the nodes

sinfo

Running interactive jobs with SLURM

Use, for example, the following command:

srun --partition express --nodes 1 --ntasks-per-node=8 --pty bash

This starts an interactive job in the express partition on one node with eight cores.

Information on jobs

List all current jobs for a user:
squeue -u <username>

List all running jobs for a user:
squeue -u <username> -t RUNNING

List all pending jobs for a user:
squeue -u <username> -t PENDING

List all current jobs in the normal partition for a user:
squeue -u <username> -p normal

List detailed information for a job (useful for troubleshooting):
scontrol show job -dd <jobid>

Once your job has completed, you can get additional information that was not available during the run, such as run time and memory usage. To get statistics on a completed job by job ID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

Show priorities for waiting jobs:
sprio
Controlling jobs

To cancel one job:
scancel <jobid>

To cancel all the jobs for a user:
scancel -u <username>

To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>

To cancel one or more jobs by name:
scancel --name myJobName

To pause a particular job:
scontrol hold <jobid>

To resume a particular job:
scontrol resume <jobid>

To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>

Visualization

For the visualization of bigger data sets, it is impractical to copy them to your local machine. We therefore offer a solution to do the postprocessing on Palma II. Since the CPUs are quite fast, the rendering is done in software.