Running jobs on Clusty

Clusty employs SLURM for resource management and job scheduling. This service makes sure that all resources available on Clusty are shared and that all jobs submitted to Clusty are scheduled fairly.

Clusty's resources and job scheduling used to be governed by Torque and Maui. For the old, obsolete instructions that no longer apply to Clusty, refer to the old instructions.

To submit a job to Clusty, you need to write a SLURM shell script. The table below lists the most useful directives.

Option:                                        Description:
#SBATCH -J jobname                             Assign the name of the job.
#SBATCH -p big                                 Assign the partition (queue) for the job.
#SBATCH -w nodelist                            Request particular nodes. The list can be comma-separated or a range, e.g. node[1-3].
#SBATCH -N num                                 Request num nodes for the job.
#SBATCH -n num                                 Request num processors for the job.
#SBATCH -t d-hh:mm:ss                          Request the wall time for the job.
#SBATCH -D dirname                             Set dirname as the working directory, either relative to the cwd or absolute.
#SBATCH -o file.%j.out                         Assign file.%j.out as stdout for the job, where %j expands to the job ID.
#SBATCH -e file.%j.err                         Assign file.%j.err as stderr for the job, where %j expands to the job ID.
#SBATCH --mail-type=BEGIN,END,FAIL             Notify by mail on job start, completion or failure.
#SBATCH --mail-user=your.email@somewhere.net   The recipient of email notifications.


A typical example of a SLURM script (run.sh) would be:

#!/bin/bash

#SBATCH -J TestJob
#SBATCH -p big
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -t 0-00:01:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=me@somewhere.net

mpirun python my_mpi_job.py

This delegates the job to one node and requests 32 processors on that node. It also sets a maximum wall time of 1 minute (the format is d-hh:mm:ss).
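The payload, my_mpi_job.py, stands in for whatever program you want to run. As a hypothetical placeholder, a minimal MPI script using mpi4py (assuming the mpi4py package is available on Clusty) could look like this:

from mpi4py import MPI  # hypothetical payload; requires mpi4py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # this process's index within the MPI world
size = comm.Get_size()  # total number of MPI processes (32 in the example above)

print(f"Hello from rank {rank} of {size}")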

To submit this job to Clusty, use the sbatch command:

sbatch run.sh
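sbatch replies with the ID assigned to your job (the number below is illustrative); you will need that ID for the monitoring commands that follow:

$ sbatch run.sh
Submitted batch job 12345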

The table below lists the most common commands for managing and monitoring your job:

Command:      Description:
sinfo         Provides information on cluster state, grouped by partition.
sinfo -leN    Provides a bit more information on cluster state, per machine type.
squeue        Lists all jobs in the queue.
smap          Curses frontend to jobs, partitions and configuration.
sview         Graphical frontend to jobs, partitions and configuration.
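For instance, squeue output looks roughly like this (the job ID, user and node below are illustrative, and the exact columns depend on the site configuration):

$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 12345       big  TestJob       me  R       0:42      1 node1

To cancel a job, pass its ID to scancel, e.g. scancel 12345.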


Some interesting situations:

Q: Running a process on a dedicated node is far more efficient than across nodes. Can I request a whole node to myself even though I might not need all the processors available on that node?
A: Yes, pass the --exclusive switch to sbatch/srun, or include #SBATCH --exclusive in the startup script.
Q: My process runs for a long time, and I would like occasional diagnostics to be printed out. It works, but all the diagnostics appear at the end instead of as the job progresses.
A: This is because of I/O buffering. Prefix your command with stdbuf -oL -eL for line buffering, or stdbuf -o0 -e0 for unbuffered output; see the sketch after this list.
Q: This is so awesome! What can I do to make it up to you?
A: I accept Spanish chocolate, Belgian beer and Californian full-bodied red wine donations. ;)
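As promised above, a minimal sketch of line buffering in a batch script, where my_long_job is a hypothetical executable standing in for your own program:

#!/bin/bash

#SBATCH -J LongJob
#SBATCH -p big
#SBATCH -n 1

# line-buffer stdout and stderr so diagnostics appear as they are printed
stdbuf -oL -eL ./my_long_job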

How to contact me? Email me!

Finally, stuff that probably doesn't interest you as a user but might interest you as a prospective maintainer.