Running batch jobs

From Tycho

Submit Jobs Via The SLURM Queueing System

SLURM is the preferred way to submit batch jobs to the cluster queueing system and to run jobs interactively.

Most Important Queue Commands:

Here we list the most commonly used queueing commands. If you are migrating from a different scheduling system, this cheat sheet may be useful. There is also a compact two-page overview of the most important commands.

Use The 'sinfo' Command To Display Information About Available Resources:

If you use the command without any options, it will display all available partitions. Use the -p switch to select a specific partition, for instance:

astro06:> sinfo -p astro2
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
astro2       up 10-00:00:0      1  down* node458
astro2       up 10-00:00:0     13  alloc node[454-457,459-462,480-481]
astro2       up 10-00:00:0     18   idle node[463-479,482]

The command displays how many nodes in the partition are offline (down), busy (alloc), and still available (idle). For each state, a NODELIST is displayed. The TIMELIMIT column shows the maximum job duration allowed for the partition in days-hours:minutes:seconds format; the value 10-00:00:0 above is a ten-day limit whose final digit is cut off by the column width. You can find more information about how to use the sinfo command on the official SLURM man pages.
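To get a quick per-state total instead of reading the table by eye, the NODES column can be summed for each STATE. The following is a minimal sketch, assuming the default sinfo column layout shown above; the helper name count_nodes_by_state is hypothetical, and the sample input is copied from the example so the snippet also runs on a machine without SLURM.

```shell
# Sum the NODES column (field 4) per STATE (field 5) of sinfo output.
# Assumes the default layout: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
count_nodes_by_state() {
    awk 'NR > 1 {
             sub(/\*$/, "", $5)      # drop the trailing "*" (as in "down*")
             nodes[$5] += $4
         }
         END { for (s in nodes) print s, nodes[s] }' | sort
}

# Sample input taken from the sinfo example above.
count_nodes_by_state <<'EOF'
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
astro2       up 10-00:00:0      1  down* node458
astro2       up 10-00:00:0     13  alloc node[454-457,459-462,480-481]
astro2       up 10-00:00:0     18   idle node[463-479,482]
EOF
```

For the sample input this prints "alloc 13", "down 1" and "idle 18", one per line. On the cluster, the same helper could be fed live data with: sinfo -p astro2 | count_nodes_by_state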

Use The 'squeue' Command To Display Information About Scheduled Jobs:

astro06:> squeue -p astro_long
  JOBID PARTITION  NAME     USER     ST       TIME  NODES NODELIST(REASON)
 566136 astro_long jobname1 username  R      47:22      1 node485
 566135 astro_long jobname2 username  R      54:22      1 node481

The command lists one job per line. Use the JOBID of your job to modify or cancel a scheduled or already running job (see below). The status column (ST) shows the state of each job. The codes stand for: PD (pending), R (running), CA (cancelled), CG (completing), CD (completed), F (failed), TO (timeout), and NF (node failure).

Useful command line switches for squeue include -u (or --user), which lists only the jobs that belong to a specific user. You can find more information about how to use the squeue command on the official SLURM man pages.
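The state codes listed above can be spelled out automatically when scanning a long job list. This is a small sketch, assuming the default squeue column order (JOBID PARTITION NAME USER ST ...) and job names without spaces; the helper names describe_state and annotate_jobs are hypothetical, and the sample input mirrors the example above.

```shell
# Translate a squeue state code (ST column) into its description.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CA) echo "cancelled" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        F)  echo "failed" ;;
        TO) echo "timeout" ;;
        NF) echo "node failure" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

# Read squeue output on stdin and print "JOBID NAME: state" per job.
annotate_jobs() {
    while read -r jobid _ name _ st _; do
        if [ "$jobid" = "JOBID" ]; then
            continue                 # skip the header line
        fi
        echo "$jobid $name: $(describe_state "$st")"
    done
}

# Sample input modelled on the squeue example above.
annotate_jobs <<'EOF'
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
566136 astro_long jobname1 username  R      47:22      1 node485
566135 astro_long jobname2 username  R      54:22      1 node481
EOF
```

On the cluster, your own jobs could be annotated with: squeue -u "$USER" | annotate_jobs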

Use The 'scancel' Command To Cancel A Scheduled Or Running Job:

astro06:> scancel 566136

You can find more information about how to use the scancel command on the official SLURM man pages.

Use The 'srun' Command To Run Jobs Interactively:

You can run serial, OpenMP-parallel, or MPI-parallel code interactively using the srun command. Always make sure to specify the partition to run on via the -p command line switch. When running an MPI job, you can use the -n switch to specify the number of MPI tasks that you require. Command line arguments for your program can be passed at the end.

astro06:> srun -p astro_devel -n 20 <executable> [args...]

You can find more information about how to use the srun command on the official SLURM man pages.
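The example above covers the MPI case. For an OpenMP program, one common pattern is to request a single task with several CPUs via the -c (--cpus-per-task) switch and to set OMP_NUM_THREADS to the same number. The wrapper below is a sketch under that assumption; the function name run_openmp and the thread count are hypothetical, and whether this layout suits a given partition depends on the local configuration.

```shell
# Run an OpenMP program interactively as one task with several CPUs.
# Usage: run_openmp <partition> <threads> <executable> [args...]
run_openmp() {
    partition=$1
    threads=$2
    shift 2
    # One task (-n 1) with $threads CPUs (-c), matching the OpenMP thread count.
    OMP_NUM_THREADS=$threads srun -p "$partition" -n 1 -c "$threads" "$@"
}

# Example call on the cluster (executable name is a placeholder):
#   run_openmp astro_devel 8 ./my_openmp_program input.dat
```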

Use The 'sbatch' Command To Queue A Job Via A Submission Script:

astro06:> sbatch [additional options] job-submission-script.sh

You can find more information about how to use the sbatch command on the official SLURM man pages.
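A submission script is an ordinary shell script whose #SBATCH comment lines carry the same kind of options that would otherwise be given on the command line. The snippet below writes a minimal example and submits it only if sbatch is available. It is a sketch, not a site-approved template: the job name, task count, time limit, output file pattern and program name are all placeholders, and the partition is simply the one used in the srun example above.

```shell
# Write a minimal submission script; every value below is a placeholder.
cat > job-submission-script.sh <<'EOF'
#!/bin/bash
# Name shown in the NAME column of squeue
#SBATCH --job-name=example
# Partition to run on (same meaning as srun -p)
#SBATCH --partition=astro_devel
# Number of MPI tasks (same meaning as srun -n)
#SBATCH --ntasks=20
# Requested run time in days-hours:minutes:seconds
#SBATCH --time=0-01:00:00
# Output file; %j is replaced by the job ID
#SBATCH --output=example-%j.out

# Launch the (hypothetical) MPI program on the allocated resources.
srun ./my_mpi_program
EOF

# Submit only where SLURM is installed; otherwise just report the file.
if command -v sbatch >/dev/null 2>&1; then
    sbatch job-submission-script.sh
else
    echo "sbatch not found; script written to job-submission-script.sh"
fi
```

Options given on the sbatch command line take precedence over the matching #SBATCH lines, so the same script can be reused with, for example, a different partition.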