Another approach to running several independent tasks in parallel is to use a Slurm job array. It only requires adding a single #SBATCH option to your sbatch file. Here is a simple example that uses the --array option:
#!/bin/bash
#SBATCH --job-name=myarray_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%5
#SBATCH --mem=1gb
#SBATCH --time=0-00:30:00
#SBATCH --output=myarray_job.log
./myapp arg1 arg2
This script (note the --array=1-10%5 option) will submit one parent job that spawns 10 child jobs, each performing a single task.
What is different?
- Slurm starts a new child job whenever enough compute resources are available (instead of waiting until all of them can run in parallel)
- You can limit how many child jobs run in parallel: --array=1-10%5 will run at most 5 child jobs simultaneously
- You can define the set of identifiers for your child jobs (e.g. --array=1,5,6)
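Each child job also receives its own task identifier through the SLURM_ARRAY_TASK_ID environment variable, which can be used directly inside the script. A minimal sketch, assuming hypothetical per-task input files named input_1.txt, input_2.txt, ...:

```shell
#!/bin/bash
#SBATCH --job-name=per_task_input
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-3

# On the cluster, Slurm sets SLURM_ARRAY_TASK_ID for each child job;
# fall back to 1 so the script can also be tested locally.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# input_<N>.txt are hypothetical file names for this sketch.
echo "child job ${task_id} would process input_${task_id}.txt"
```

Submitted with sbatch, child job 2 would print `child job 2 would process input_2.txt`, and so on.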
Separate outputs
Job arrays also make it easy to define a separate output file for every child job.
#SBATCH --output=slurm-%A-%a.log
This option creates one output file per child job: %A is replaced by the ID of the parent job and %a by the ID of the child job.
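To illustrate how the pattern expands (with hypothetical IDs; on the cluster Slurm fills them in automatically):

```shell
#!/bin/bash
# Hypothetical IDs for illustration only.
parent_job_id=130995   # what %A expands to
task_id=3              # what %a expands to

# --output=slurm-%A-%a.log would therefore produce:
logfile="slurm-${parent_job_id}-${task_id}.log"
echo "$logfile"   # slurm-130995-3.log
```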
Using configuration files
Job arrays also set several Slurm environment variables, in particular SLURM_ARRAY_TASK_ID. This variable can be used in your sbatch file to extract information from a configuration file. Here is an example of how to use it to configure the execution of your jobs.
#!/bin/bash
#SBATCH --job-name=myarray_job # Name of the parent job
#SBATCH --ntasks=1 # Each child job runs 1 task
#SBATCH --cpus-per-task=1 # Each task requires 1 cpu
#SBATCH --array=1-10%5 # Running 10 child jobs with IDs in [1,10], at most 5 at a time
#SBATCH --mem=1gb # Each child job uses at most 1GB of memory
#SBATCH --time=0-00:30:00 # Child jobs are killed if they run longer than 30 minutes
#SBATCH --output=logs/array_%A-%a.log # One log file per child job
# Specify the path to the config file
config=config.txt
# Extract the instance number for the current $SLURM_ARRAY_TASK_ID
inst=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $2}' "$config")
# Extract the value of a parameter for the current $SLURM_ARRAY_TASK_ID
param=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $3}' "$config")
# Execute my application/code with the parameters specified in my configuration file
./myapp $inst $param
And the corresponding configuration file config.txt:
ArrayTaskID Instance Parameter
1 1 15
2 2 15
3 3 15
4 4 15
5 5 20
6 6 20
7 7 20
8 8 30
9 9 30
10 10 30
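Before submitting, you can sanity-check the awk extraction locally by setting SLURM_ARRAY_TASK_ID by hand (here simulating child job 5 against the configuration file above):

```shell
#!/bin/bash
# Recreate the configuration file from the example above.
cat > config.txt <<'EOF'
ArrayTaskID Instance Parameter
1 1 15
2 2 15
3 3 15
4 4 15
5 5 20
6 6 20
7 7 20
8 8 30
9 9 30
10 10 30
EOF

# Simulate what child job 5 would extract.
SLURM_ARRAY_TASK_ID=5
inst=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $2}' config.txt)
param=$(awk -v ArrayTaskID="$SLURM_ARRAY_TASK_ID" '$1==ArrayTaskID {print $3}' config.txt)
echo "instance=${inst} parameter=${param}"   # instance=5 parameter=20
```

These are the same two awk lines each child job runs on the cluster, with Slurm providing SLURM_ARRAY_TASK_ID automatically.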
Running the example
$ sbatch script_array_job.sh
Submitted batch job 130995
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
130995_[6-10%5] ls2n myarray_ b19loger PD 0:00 1 (JobArrayTaskLimit)
130995_1 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_2 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_3 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_4 ls2n myarray_ b19loger R 0:02 1 srvoad
130995_5 ls2n myarray_ b19loger R 0:02 1 srvoad
For more specific usage and more information, check the Slurm documentation.