Submitting jobs using SLURM

The Simple Linux Utility for Resource Management (SLURM) is an open-source, fault-tolerant, and highly scalable system for managing Linux clusters, on which it schedules and manages jobs.

SLURM has three main functions. First, it allocates exclusive or non-exclusive access to computing resources to users for a limited period of time, so that they can run computations on those resources. Second, it provides a framework for starting, executing, and monitoring jobs (usually parallel runs) on the allocated nodes. Finally, it arbitrates between multiple users by managing a queue of pending jobs (usually ordered by submission time).

The commands below are the ones used to launch and manage a job on the Green Grid:

1) sinfo - This command displays the current status of the compute nodes. The nodes are grouped by state, name, or partition. In the example below, you can see that only 4 nodes are active (node001-004) and the rest (node000, node005-014) are inactive. You can also see that the default partition, which is marked with an asterisk, is named normal.
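As a rough sketch of how the same information can be read from a script: the -h (no header) and -o (output format) options and the %D (node count) and %t (compact state) fields are standard sinfo options, while the helper name count_nodes_in_state is only illustrative. The numbers depend on the cluster, and nodes that belong to several partitions may be counted more than once.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): sum the node counts for one state.
# Reads lines of the form "<count> <state>" from standard input, which is
# the shape of the output of: sinfo -h -o "%D %t"
count_nodes_in_state() {
    awk -v want="$1" '
        # A trailing "*" marks nodes that are not responding; strip it.
        { state = $2; sub(/\*$/, "", state) }
        state == want { total += $1 }
        END { print total + 0 }
    '
}

if command -v sinfo >/dev/null 2>&1; then
    echo "idle nodes: $(sinfo -h -o '%D %t' | count_nodes_in_state idle)"
    echo "down nodes: $(sinfo -h -o '%D %t' | count_nodes_in_state down)"
else
    echo "sinfo not found; run this on a cluster login node" >&2
fi
```

Running plain sinfo with no options prints the full table, including the partition column where the default partition carries the asterisk.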

2) squeue - This command shows the status of the SLURM job queue. Each new job is added to the queue. For each job, squeue displays the job ID, partition name, job name, user name, state, elapsed run time, number of nodes assigned to the job, and their names.

The state of each job is shown as one of the following codes: R - running. CG - completing (shutting down). PD - pending. For a pending job, the reason for waiting is shown in parentheses at the end of the line.
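The state codes above can be decoded in a small script. This is only a sketch: the function name state_name is hypothetical, and only the three codes listed above are translated. The -h, -u, and -o options and the %i (job ID), %t (compact state), and %R (reason or node list) fields are standard squeue options.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): translate a compact state code.
state_name() {
    case "$1" in
        R)  echo "running" ;;
        CG) echo "completing" ;;
        PD) echo "pending" ;;
        *)  echo "other ($1)" ;;   # squeue has more states than the three above
    esac
}

if command -v squeue >/dev/null 2>&1; then
    # List only the current user's jobs, without the header line.
    squeue -h -u "$(id -un)" -o "%i %t %R" |
    while read -r job_id state rest; do
        echo "job $job_id is $(state_name "$state"): $rest"
    done
else
    echo "squeue not found; run this on a cluster login node" >&2
fi
```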

3) srun - Runs a job on all of the assigned compute nodes. In the following example, we requested to run the hostname command, which prints the node name, on 3 nodes (node001-003). The number of nodes is given with the -N option (capital N).

Alternatively, you can run the command as 4 separate processes, in which case all of them may in practice be launched on the same computer (node001 in the example).
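The two variants might look like the sketch below. The -N (number of nodes) and -n (number of tasks) options are standard srun options; the helper count_distinct_nodes is a hypothetical name, added only to make it easy to see how many different nodes answered. Where the 4 tasks actually land depends on the cluster and its current load.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): count the distinct host names
# in a list read from standard input, one name per line.
count_distinct_nodes() {
    sort -u | wc -l | tr -d ' '
}

if command -v srun >/dev/null 2>&1; then
    # One hostname task on each of 3 nodes.
    srun -N 3 hostname

    # 4 hostname tasks; the scheduler may place all of them on one node.
    echo "nodes used by 4 tasks: $(srun -n 4 hostname | count_distinct_nodes)"
else
    echo "srun not found; run this on a cluster login node" >&2
fi
```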

4) sbatch - Submits a script file (typically with a .sh extension) to be executed as a batch job, possibly in parallel. In the above example, the script my.script prints the compute node's name as well as the output of the numactl command, which shows the NUMA properties of that node.
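A batch script of this kind could look roughly like the sketch below, which writes the script to a temporary directory and submits it only if sbatch is available. The file name my_job.sh, the job name, and the resource values are assumptions chosen for illustration, as is the use of numactl --hardware; the original script may call numactl differently. The #SBATCH options shown (--job-name, --nodes, --ntasks, --time, --output) are standard sbatch options.

```shell
#!/bin/bash
# Write a small batch script to a temporary directory.
workdir=$(mktemp -d)
script="$workdir/my_job.sh"   # hypothetical file name

cat > "$script" <<'EOF'
#!/bin/bash
# Name shown in the squeue listing
#SBATCH --job-name=my_job
# Number of nodes and tasks to allocate
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Wall-clock limit (hours:minutes:seconds)
#SBATCH --time=00:05:00
# Output file; %j expands to the job ID
#SBATCH --output=my_job_%j.out

# Print the compute node's name, then its NUMA layout.
hostname
numactl --hardware
EOF

if command -v sbatch >/dev/null 2>&1; then
    # On success sbatch prints "Submitted batch job <id>".
    sbatch "$script"
else
    echo "sbatch not found; script written to $script" >&2
fi
```

Keeping the #SBATCH lines at the top matters, because sbatch stops reading them at the first line that is a command rather than a comment.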

5) scancel - Terminates a given job.

Usage: scancel {job_id}

In the example below, the script testtime pauses for 100 seconds. After submitting the script with sbatch and confirming with squeue that it was running (job 360), we terminated the job with scancel 360.
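The submit, check, and cancel sequence can be sketched as follows. Instead of the testtime script, this sketch uses sbatch --wrap with sleep 100 as a stand-in, which is an assumption; the helper job_id_from_sbatch is a hypothetical name. The --wrap option of sbatch and the -j option of squeue are standard.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): extract the job ID from the
# "Submitted batch job <id>" line that sbatch prints.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

if command -v sbatch >/dev/null 2>&1; then
    # --wrap turns a single command into a batch script.
    job_id=$(sbatch --wrap "sleep 100" | job_id_from_sbatch)
    if [ -n "$job_id" ]; then
        squeue -j "$job_id"    # confirm the job is queued or running
        scancel "$job_id"      # terminate it
    fi
else
    echo "sbatch not found; run this on a cluster login node" >&2
fi
```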