# Scheduling jobs

```{instructor-note}
Total: 75min (Teaching: 45min | Discussion: 0min | Breaks: 0min | Exercises: 30min)
```

```{objectives}
- Run a simple Hello World style program on the cluster.
- Submit a simple Hello World style script to the cluster.
- Use the batch system command line tools to monitor the execution of your job.
- Inspect the output and error files of your jobs.
```

## Job scheduler

An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.

The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to wait in a queue to get into a popular restaurant, you will understand why your job does not always start instantly, the way it would on your laptop.

![Compare a job scheduler to a waiter in a restaurant](/images/restaurant_queue_manager.svg)

The scheduler used in this lesson is `SLURM`. Although `SLURM` is not used everywhere, running jobs is quite similar regardless of what software is being used. The exact syntax might change, but the concepts remain the same.

```{discussion}
Have you experienced having to wait to get into a popular restaurant?
```

## Creating our test job

The most basic use of the scheduler is to run a command non-interactively. Any command (or series of commands) that you want to run on the cluster is called a **job**, and the process of using a scheduler to run the job is called **batch job submission**. In this case, the job we want to run is just a shell script. Let's create a demo shell script to run as a test. The login node will have a number of terminal-based text editors installed. Use whichever you prefer. Unsure? `nano` is a pretty good, basic choice.

```{instructor-note}
What is a terminal text editor?
```

```console
## Create a file named example-job.sh with your favourite editor.
## The content should match what the `cat` command displays below.
[MY_USER_NAME@CLUSTER_NAME ~]$ nano example-job.sh
[MY_USER_NAME@CLUSTER_NAME ~]$ cat example-job.sh
#!/bin/bash
echo -n "This script is running on "
hostname

## Make the file executable
[MY_USER_NAME@CLUSTER_NAME ~]$ chmod +x example-job.sh
```

Run the script on the login node.

```{instructor-note}
* Remind students what login nodes and compute nodes are
* Try to use the same project as the one the students have access to
* If the cluster is too busy, even for a test job, use
  * `--reservation=`
  * e.g. `sbatch --account=nn9987k --reservation=nn9987k_gpu --mem=1G --partition=accel --time=01:00 example-job.sh`
* For test jobs on Betzy use
  * `--qos=preproc`
```

```console
[MY_USER_NAME@CLUSTER_NAME ~]$ ./example-job.sh
This script is running on LOGIN_NODE_NAME
```

This job runs on the login node. There is a distinction between running the job through the scheduler and just "running it". To submit this job to the scheduler, we use the `sbatch` command.

```{note}
**Find your project account(s)**

All users on HPC clusters in Norway are part of at least one project. Each project is allocated an amount of credit, and the currency used is "CPU-hours". When communicating with the scheduler, you should indicate which account (project) should be charged.
```
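The project is what you pass to Slurm through the `--account` option, either on the `sbatch` command line or inside the job script itself; you will see both forms later in this episode. A minimal sketch, using the made-up project name `nn9987k` as a placeholder:

```console
## nn9987k is a placeholder -- use one of your own project names instead
[MY_USER_NAME@CLUSTER_NAME ~]$ sbatch --account=nn9987k --mem=1G --time=01:00 example-job.sh
```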
To find out which project(s) you are a member of, use the `projects` command:

`````{tabs}
````{tab} SAGA/Fram/Betzy
```{literalinclude} snippets/saga/13-projects.txt
:language: bash
```
````
````{tab} FOX
```{literalinclude} snippets/fox/13-projects.txt
:language: bash
```
````
`````

```{warning}
You need to replace the empty `--account=` value in the following command with one of the project names printed by the `projects` command above.
```

```{discussion}
Do you have any examples of automation that you can relate to the scheduler?
```

```console
[MY_USER_NAME@CLUSTER_NAME ~]$ sbatch --account= --qos=preproc --mem=1G --time=01:00 example-job.sh
Submitted batch job 137860
```

And that's all we need to do to submit a job. Our work is done -- now the scheduler takes over and tries to run the job for us. While the job is waiting to run, it goes into a list of jobs called the *queue*. To check on our job's status, we check the queue using the command `squeue -u $USER`.

```console
[MY_USER_NAME@CLUSTER_NAME ~]$ squeue -u $USER
  JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
 137860    normal example- usernm  R  0:02     1 c5-59
```

The best way to check our job's status is with `squeue`. Of course, running `squeue` repeatedly to check on things can be a little tiresome. To get a regularly refreshed view of our jobs, we can add the `--iterate` argument to `squeue`, giving the number of seconds between updates.

```{warning}
1. Press `Ctrl-C` when you want to stop the command.
2. This test job may end before the first iteration starts. If that happens, we will see the output for a later job that takes more time.
```

```console
[MY_USER_NAME@CLUSTER_NAME ~]$ sbatch --account= --mem=1G --time=01:00 example-job.sh
[MY_USER_NAME@CLUSTER_NAME ~]$ squeue -u $USER --iterate 5
```

## Where's the output?

On the login node, this script printed its output to the terminal -- but when we run it through the scheduler, nothing appears on our terminal. Where did it go? Cluster job output is typically redirected to a file in the directory you launched the job from, named `slurm-<jobid>.out` by default. Use `ls` to find the file and read it.

### Move arguments to the job script

Instead of typing the same options on the `sbatch` command line every time, we can put them at the top of the job script as `#SBATCH` comment lines, which Slurm reads when the script is submitted:

```console
[MY_USER_NAME@CLUSTER_NAME ~]$ cat example-job.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:01:00
#SBATCH --account=
#SBATCH --qos=preproc
#SBATCH --mem=1G

echo -n "This script is running on "
hostname
```

## Using software modules in scripts

Here we create a job script that loads a particular version of Python, and prints the version number to the Slurm output file.

```console
[MY_USER_NAME@CLUSTER_NAME ~]$ nano python-module.sh
[MY_USER_NAME@CLUSTER_NAME ~]$ cat python-module.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:01:00
#SBATCH --account=
#SBATCH --mem=1G
#SBATCH --job-name=Python_module_test

module purge
module load Python/3.9.6-GCCcore-11.2.0

python --version

[MY_USER_NAME@CLUSTER_NAME ~]$ sbatch python-module.sh
```

For full reproducibility it is good practice to start your job script by purging any modules that happen to be loaded in the environment you submit from, and then explicitly load all the dependencies the current job needs. This makes the script much more robust when it is run again in the future.
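Once the Python job has finished, you can inspect its output the same way as before: `sbatch` reported a job ID when you submitted it, and the corresponding `slurm-<jobid>.out` file appears in the directory you submitted from. A sketch, assuming the reported job ID was 137861 (yours will differ):

```console
## 137861 is an example job ID -- use the one sbatch printed for your job
[MY_USER_NAME@CLUSTER_NAME ~]$ cat slurm-137861.out
Python 3.9.6
```

By default both the standard output and the standard error of the job end up in this file, so it is also the first place to look when a job fails.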