Another approach to running several independent tasks in parallel is to use a Slurm job array. It only requires adding a single #SBATCH option to your sbatch file. Here is a simple example that uses the --array option:
#!/bin/bash
#SBATCH --job-name=myarray_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%5
#SBATCH --mem=1gb
#SBATCH --time=0-00:30:00
#SBATCH --output=myarray_job.log
./myapp arg1 arg2
This script (note the --array=1-10%5 option) will submit one parent job that spawns 10 child jobs, each performing a single task.
What is different?
- Slurm starts a new child job whenever enough compute resources are available (instead of waiting until all of them can run in parallel)
- You can limit how many child jobs run in parallel: --array=1-10%5 will run at most 5 child jobs simultaneously
- You can define the set of identifiers for your child jobs (e.g. --array=1,5,6)
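Each child job also receives its own task identifier through the SLURM_ARRAY_TASK_ID environment variable, which can be used directly inside the script. A minimal sketch, assuming hypothetical per-task input files named input_1.txt, input_2.txt, ...:

```shell
#!/bin/bash
#SBATCH --job-name=per_task_input
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-3

# On the cluster, Slurm sets SLURM_ARRAY_TASK_ID for each child job;
# fall back to 1 so the script can also be tested locally.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# input_<N>.txt are hypothetical file names for this sketch.
echo "child job ${task_id} would process input_${task_id}.txt"
```

Submitted with sbatch, child job 2 would print `child job 2 would process input_2.txt`, and so on.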
Separate outputs
Job arrays also make it easy to define a separate output file for every child job.
#SBATCH --output=slurm-%A-%a.log
This option creates one output file per child job: %A is replaced by the ID of the parent job and %a by the ID of the child job.
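To illustrate how the pattern expands (with hypothetical IDs; on the cluster Slurm fills them in automatically):

```shell
#!/bin/bash
# Hypothetical IDs for illustration only.
parent_job_id=130995   # what %A expands to
task_id=3              # what %a expands to

# --output=slurm-%A-%a.log would therefore produce:
logfile="slurm-${parent_job_id}-${task_id}.log"
echo "$logfile"   # slurm-130995-3.log
```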
Using configuration files
Job arrays also set several Slurm environment variables, in particular SLURM_ARRAY_TASK_ID. This variable can be used in your sbatch file to extract information from a configuration file. Here is an example of how to use it to configure the execution of your jobs.
#!/bin/bash
#SBATCH --job-name=myarray_job # Name of the parent job
#SBATCH --ntasks=1 # Each child job runs 1 task
#SBATCH --cpus-per-task=1 # Each task requires 1 cpu
#SBATCH --array=1-10%5 # Running 10 child jobs with IDs in [1,10], at most 5 at a time
#SBATCH --mem=1gb # Each child job uses at most 1GB of memory
#SBATCH --time=0-00:30:00 # Child jobs are killed if they run longer than 30 minutes
#SBATCH --output=logs/array_%A-%a.log # One log file per child job
# Specify the path to the config file
config=config.txt
# Extract the instance number for the current $SLURM_ARRAY_TASK_ID
inst=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $2}' "$config")
# Extract the value of a parameter for the current $SLURM_ARRAY_TASK_ID
param=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $3}' "$config")
# Execute my application/code with the parameters specified in my configuration file
./myapp $inst $param
And the corresponding configuration file config.txt:
ArrayTaskID Instance Parameter
1 1 15
2 2 15
3 3 15
4 4 15
5 5 20
6 6 20
7 7 20
8 8 30
9 9 30
10 10 30
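Before submitting, you can sanity-check the awk extraction locally by setting SLURM_ARRAY_TASK_ID by hand (here simulating child job 5 against the configuration file above):

```shell
#!/bin/bash
# Recreate the configuration file from the example above.
cat > config.txt <<'EOF'
ArrayTaskID Instance Parameter
1 1 15
2 2 15
3 3 15
4 4 15
5 5 20
6 6 20
7 7 20
8 8 30
9 9 30
10 10 30
EOF

# Simulate what child job 5 would extract.
SLURM_ARRAY_TASK_ID=5
inst=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $2}' config.txt)
param=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $3}' config.txt)
echo "instance=${inst} parameter=${param}"   # instance=5 parameter=20
```

These are the same two awk lines each child job runs on the cluster, with Slurm providing SLURM_ARRAY_TASK_ID automatically.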
Running the example
$ sbatch script_array_job.sh
Submitted batch job 130995
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
130995_[6-10%5] ls2n myarray_ b19loger PD 0:00 1 (JobArrayTaskLimit)
130995_1 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_2 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_3 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_4 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_5 ls2n myarray_ b19loger R 0:02 1 srvoad
For more specific usage and more information, check the Slurm documentation.