Accessing software

Instructor note

Total: 45min (Teaching:30Min | Discussion:0min | Breaks:0min | Exercises:15Min)

Objectives

  • Questions

    • How can we find out which scientific software is installed on the HPC cluster?

    • How can we access scientific software on the HPC cluster?

  • Objectives

    • Understand how the UNIX system looks for installed software

    • Understand how to load and use a software package

  • Keypoints

    • Search for software with module avail

    • Load software with module load

    • Unload a single module with module unload, or everything at once with module purge

    • The module system handles software versioning and will prevent package conflicts for you automatically

On a high-performance computing system, it is seldom the case that the software we want to use is available when we log in. It is installed, but we will need to “load” it before it can run.

Before we start using individual software packages, however, we should understand the reasoning behind this approach. The three biggest factors are:

  • software incompatibilities

  • versioning

  • dependencies

Software incompatibility is a major headache for programmers. Sometimes the presence (or absence) of a software package will break others that depend on it. Two of the most famous examples are Python 2 versus Python 3 and C compiler versions. Python 3 famously provides a python command that conflicts with the one provided by Python 2. Likewise, software compiled against a newer version of the compiler's runtime libraries and then run where those libraries are not present will fail with a nasty 'GLIBCXX_3.4.20' not found error.

Software versioning is another common issue. A team might depend on a certain package version for their research project: if the software version were to change (for instance, if a package was updated), it might affect their results. Having access to multiple software versions allows a set of researchers to prevent software versioning issues from affecting their results.

Dependencies are where a particular software package (or even a particular version) depends on having access to another software package (or even a particular version of another software package). For example, the VASP materials science software may depend on having a particular version of the FFTW (the Fastest Fourier Transform in the West) software library available for it to work.

Environment modules are the solution to these problems, and we will return to this after looking at globally installed packages.

Globally installed system packages

In this example we will use Python, which is installed globally on the login node in one particular version.

We can test what the python command is actually pointing to with another command called which. which looks for programs the same way that Bash does, so we can use it to tell us where a particular piece of software is stored.

[MY_USER_NAME@CLUSTER_NAME ~]$ python --version
Python 3.9.14
[MY_USER_NAME@CLUSTER_NAME ~]$ which python
/usr/bin/python

[MY_USER_NAME@CLUSTER_NAME ~]$ python3 --version
Python 3.9.14
[MY_USER_NAME@CLUSTER_NAME ~]$ which python3
/usr/bin/python3

What this tells us is that python and python3 refer to the same Python version here. The output of which tells us that typing the command python3 is equivalent to running the full command /usr/bin/python3.

But how did the shell know that python should be linked to /usr/bin/python? To explain this, we first need to understand the nature of the PATH environment variable. PATH is a special environment variable that controls where a UNIX system looks for software. We can inspect its value with the following command (PATH is the variable, $ extracts its value, and echo prints the value):

[MY_USER_NAME@CLUSTER_NAME ~]$ echo $PATH
/node/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/cluster/bin:/cluster/home/MY_USER_NAME/.local/bin:/cluster/home/MY_USER_NAME/bin

What we see here is a colon-separated (:) list of search paths that the shell loops through, in order, when looking for the python command. In this case it finds a match under /usr/bin, so it stops searching and runs /usr/bin/python.
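The search described above can be sketched in a few lines of shell. This is only an illustration of the idea, not how Bash or which is actually implemented; the function name first_in_path is made up for this example.

```shell
# Hypothetical helper (not a standard command): report the first executable
# called "$1" found while walking the directories listed in PATH, in order.
# The body runs in a subshell, so the IFS and "set -f" changes stay local.
first_in_path() (
    name="$1"
    IFS=':'              # split PATH on colons
    set -f               # do not expand glob characters in PATH entries
    for dir in $PATH; do
        [ -n "$dir" ] || dir="."     # an empty entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            exit 0       # stop at the first match
        fi
    done
    exit 1               # no directory in PATH contains the command
)

first_in_path ls
```

Because the loop stops at the first hit, the order of the entries in PATH decides which copy of a program is used.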

Exercise

Exercise (10 min)

  1. What happens if there are other matching commands located later in the search PATH, e.g. /cluster/bin/python?

  2. What happens if you have an executable script in your current directory with the same name as a globally installed program?

Environment modules

A module is a self-contained description of a software package - it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.

There are a number of different environment module implementations commonly used on HPC systems: the two most common are TCL modules and Lmod. Both of these use similar syntax and the concepts are the same so learning to use one will allow you to use whichever is installed on the system you are using. In both implementations the module command is used to interact with environment modules. An additional subcommand is usually added to the command to specify what you want to do. For a list of subcommands you can use module -h or module help. As for all commands, you can access the full help on the man pages with man module.
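If you are unsure which implementation a system uses, the environment often gives a hint. The sketch below assumes that Lmod exports a LMOD_VERSION variable and that module systems in general export MODULESHOME; both are common, but site setups differ, so treat the result as a hint only. The function name detect_module_system is made up for this example.

```shell
# Hypothetical helper: guess the module implementation from environment
# variables. Assumes LMOD_VERSION (Lmod) and MODULESHOME are exported, which
# is common but not guaranteed on every site.
detect_module_system() {
    if [ -n "${LMOD_VERSION:-}" ]; then
        echo "Lmod ${LMOD_VERSION}"
    elif [ -n "${MODULESHOME:-}" ]; then
        echo "Environment Modules or compatible (MODULESHOME=${MODULESHOME})"
    else
        echo "no module system detected"
    fi
}

detect_module_system
```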

On login, you may start out with a default set of modules loaded, or you may start out with an empty environment; this depends on the setup of the system you are using.

Listing currently loaded modules

You can use the module list command to see which modules you currently have loaded in your environment. After logging into one of our systems, your environment should ideally be clean like this:

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module list

Currently Loaded Modules:
  1) StdEnv (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge

You can see that one module is loaded, and that it has the special attribute of being sticky (S). That means that it is not usually unloaded, typically because it is important for the system to function correctly (so removing it with --force is advised against).
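Inside a script it can be easier to read the list of loaded modules from the environment than to parse the output of module list. The sketch below assumes that the module system exports a colon-separated LOADEDMODULES variable, as both Lmod and Tcl-based modules commonly do; the function name loaded_modules and the example value are made up for illustration.

```shell
# Hypothetical helper: print each loaded module on its own line, based on the
# colon-separated LOADEDMODULES variable (assumed to be set by the module system).
loaded_modules() {
    [ -n "${LOADEDMODULES:-}" ] || return 0   # nothing loaded: print nothing
    printf '%s\n' "$LOADEDMODULES" | tr ':' '\n'
}

# Demonstration with a hand-written value, run in a subshell so that the real
# variable in your session is left untouched.
( LOADEDMODULES="StdEnv:GCCcore/11.2.0:Python/3.9.6-GCCcore-11.2.0"; loaded_modules )
```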

Finding and listing available modules

One way to look for available software is to search for keywords using module keyword <KEYWORD>. This will look through the module metadata and return anything that matches. For example, let’s list bioinformatics programs that can be loaded using modules with module keyword bio:

[MY_USER_NAME@CLUSTER_NAME ~]$ module keyword bio
---------------------------------------------------------------------------------------------------

The following modules match your search criteria: "bio"
---------------------------------------------------------------------------------------------------

  ABySS: ABySS/2.0.2-gompi-2019a, ABySS/2.1.5-gompi-2020a
    Assembly By Short Sequences - a de novo, parallel, paired-end sequence assembler

  AUGUSTUS: AUGUSTUS/3.3.2-intel-2018b-Python-2.7.15, AUGUSTUS/3.3.3-foss-2019b, ...
    AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences

  BBMap: BBMap/38.50b-GCC-8.2.0-2.31.1, BBMap/38.79-GCC-8.3.0, BBMap/38.87-iccifort-2020.1.217
    BBMap short read aligner, and other bioinformatic tools.

  bioawk: bioawk/1.0-foss-2018b
    Bioawk is an extension to Brian Kernighan's awk, adding the support of several common
    biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and
    TAB-delimited formats with column names.

  BioPerl: BioPerl/1.7.2-GCCcore-8.2.0-Perl-5.28.1, BioPerl/1.7.2-GCCcore-8.3.0, ...
    Bioperl is the product of a community effort to produce Perl code which is useful in biology.
    Examples include Sequence objects, Alignment objects and database searching objects.


  [removed most of the output here for clarity]

Another option is to search directly on the module name using the module avail command. If you run this command without any search string it will produce a long list of all the installed software modules, like this:

[MY_USER_NAME@CLUSTER_NAME ~]$ module avail

---------------------- /cluster/modulefiles/all ---------------------------------------------------------------------------
   prodigal/2.6.3-GCCcore-10.3.0
   prodigal/2.6.3-GCCcore-11.2.0
   prodigal/2.6.3-GCCcore-11.3.0
   prodigal/2.6.3-GCCcore-12.2.0
   PROJ/8.0.1-GCCcore-10.3.0
   PROJ/8.1.0-GCCcore-11.2.0                               (L)
   PROJ/9.0.0-GCCcore-11.3.0
   PROJ/9.1.1-GCCcore-12.2.0
   PROJ/9.2.0-GCCcore-12.3.0
   prokka/1.14.5-gompi-2021a
   prokka/1.14.5-gompi-2021b
   prokka/1.14.5-gompi-2022a
   Pysam/0.16.0.1-GCC-10.3.0
   Pysam/0.18.0-GCC-11.2.0
   Pysam/0.19.1-GCC-11.3.0
   Pysam/0.21.0-GCC-12.2.0
   Python-bundle-PyPI/2023.06-GCCcore-12.3.0
   Python-bundle-PyPI/2023.10-GCCcore-13.2.0
   Python/3.9.5-GCCcore-10.3.0
   Python/3.9.6-GCCcore-11.2.0                             (L)
   Python/3.10.4-GCCcore-11.3.0
   Python/3.10.8-GCCcore-12.2.0
   Python/3.11.3-GCCcore-12.3.0
   Python/3.11.5-GCCcore-13.2.0
   PyTorch/1.12.0-foss-2022a-CUDA-11.7.0
   PyTorch/1.12.1-foss-2022a-CUDA-11.7.0
   QIIME2/2022.11
   Qualimap/2.2.1-foss-2021b-R-4.1.2
   Qualimap/2.3-foss-2022b-R-4.2.2
   QuantumESPRESSO/6.8-foss-2021a
   QuantumESPRESSO/6.8-intel-2021a
   QuantumESPRESSO/7.0-foss-2021b
   QuantumESPRESSO/7.1-foss-2022a
   QuantumESPRESSO/7.1-intel-2022a

[removed most of the output here for clarity]

---------------------- /cluster/modulefiles/external ----------------------------------------------------------------------
   appusage/1.0    hpcx/2.4    hpcx/2.5    hpcx/2.6

  Where:
   S:        Module is Sticky, requires --force to unload or purge
   L:        Module is loaded
   Aliases:  Aliases exist: foo/1.2.3 (1.2) means that "module load foo/1.2" will load foo/1.2.3

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

You can refine the search by adding a search string to the command, like module avail <SOFTWARE>. In contrast to the module keyword search, this string is matched only against the module name, not against any metadata. For example, we can list all modules that match the string ‘python/’ (including the ‘/’):

[MY_USER_NAME@CLUSTER_NAME ~]$ module avail python/

------------------------------------------------ /cluster/modulefiles/all --------------------------
   Biopython/1.79-foss-2021a
   Biopython/1.79-foss-2021b
   Biopython/1.79-foss-2022a
   Biopython/1.81-foss-2022b
   Boost.Python/1.76.0-GCC-10.3.0
   Boost.Python/1.79.0-GCC-11.3.0
   bx-python/0.8.11-foss-2021a
   bx-python/0.8.13-foss-2021b
   bx-python/0.9.0-foss-2022a
   ecoPCR/1.0.1-GCCcore-11.2.0-Python-2.7.18
   netcdf4-python/1.5.7-foss-2021a
   netcdf4-python/1.5.7-foss-2021b
   netcdf4-python/1.6.1-foss-2022a
   netcdf4-python/1.6.3-foss-2022b
   netcdf4-python/1.6.4-foss-2023a
   Python-bundle-PyPI/2023.06-GCCcore-12.3.0
   Python-bundle-PyPI/2023.10-GCCcore-13.2.0
   Python/3.9.5-GCCcore-10.3.0
   Python/3.9.6-GCCcore-11.2.0               (L)
   Python/3.10.4-GCCcore-11.3.0
   Python/3.10.8-GCCcore-12.2.0
   Python/3.11.3-GCCcore-12.3.0
   Python/3.11.5-GCCcore-13.2.0

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

Loading and unloading software

Any of the software modules that we found in the previous section can be loaded into our environment using the module load command. Let’s say we are not happy with the system version of Python that we get when logging in to the cluster (see “Globally installed system packages” above). We can then instead load a module for the Python version that we want:

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module load Python/3.9.6-GCCcore-11.2.0

[MY_USER_NAME@CLUSTER_NAME ~ ]$ which python
/cluster/software/Python/3.9.6-GCCcore-11.2.0/bin/python

[MY_USER_NAME@CLUSTER_NAME ~ ]$ python --version
Python 3.9.6

So, what just happened? Let’s have a look at the PATH variable again:

[MY_USER_NAME@CLUSTER_NAME ~ ]$ echo $PATH

/cluster/software/Python/3.9.6-GCCcore-11.2.0/bin:
/cluster/software/OpenSSL/1.1/bin:
....
/cluster/bin:
/cluster/home/MY_USER_NAME/.local/bin:
/cluster/home/MY_USER_NAME/bin

You’ll notice that the output is much longer than it was before we loaded the Python module, and if you look closely you’ll see that the last entries of the output are identical to what we had before. This means that by loading the module, we changed the PATH by adding entries to the beginning of the list. The shell will now look first in /cluster/software/Python/3.9.6-GCCcore-11.2.0/bin and the other new locations, before moving on to the “system” paths such as /usr/bin.
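The effect of putting a directory at the front of PATH can be reproduced without any module system. The following is a simulation only: two throwaway directories stand in for a “system” and a “module” installation of a made-up program called demo-tool, and the real module load command does more than this one assignment.

```shell
# Simulation only: create two fake installations of a hypothetical program.
workdir=$(mktemp -d)
mkdir -p "$workdir/system/bin" "$workdir/module/bin"

printf '#!/bin/sh\necho "demo-tool 1.0 (system)"\n' > "$workdir/system/bin/demo-tool"
printf '#!/bin/sh\necho "demo-tool 2.0 (module)"\n' > "$workdir/module/bin/demo-tool"
chmod +x "$workdir/system/bin/demo-tool" "$workdir/module/bin/demo-tool"

# Start with only the "system" directory on the search path.
PATH="$workdir/system/bin:$PATH"
before=$(demo-tool)

# "Load the module": prepend its bin directory so that it is searched first.
PATH="$workdir/module/bin:$PATH"
after=$(demo-tool)

echo "before: $before"
echo "after:  $after"

rm -rf "$workdir"   # clean up the throwaway directories
```

The system copy is still installed and still on the PATH; it is simply shadowed by the entry that comes earlier in the list.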

Let’s examine what’s there:

[MY_USER_NAME@CLUSTER_NAME ~ ]$ ls -lh /cluster/software/Python/3.9.6-GCCcore-11.2.0/bin

....
-rwxrwxr-x 1 vegarde sysapp  264 nov.   4 09:15 py.test
lrwxrwxr-x 1 vegarde sysapp    9 nov.   4 08:54 python -> python3.9
lrwxrwxr-x 1 vegarde sysapp    9 nov.   4 08:54 python3 -> python3.9
-rwxrwxr-x 1 vegarde sysapp  13K nov.   4 08:53 python3.9
-rwxrwxr-x 1 vegarde sysapp 3,2K nov.   4 08:54 python3.9-config
....

Taking this to its conclusion, module load adds software to your $PATH. It “loads” software. A special note on this: depending on which version of the module program is installed at your site, module load will also load required software dependencies.

To demonstrate, let’s use module list. module list shows all loaded software modules.

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module list

Currently Loaded Modules:
  1) StdEnv                         (S)   8) Tcl/8.6.11-GCCcore-11.2.0   (H)
  2) GCCcore/11.2.0                       9) SQLite/3.36-GCCcore-11.2.0  (H)
  3) zlib/1.2.11-GCCcore-11.2.0     (H)  10) XZ/5.2.5-GCCcore-11.2.0     (H)
  4) binutils/2.37-GCCcore-11.2.0   (H)  11) GMP/6.2.1-GCCcore-11.2.0    (H)
  5) bzip2/1.0.8-GCCcore-11.2.0     (H)  12) libffi/3.4.2-GCCcore-11.2.0 (H)
  6) ncurses/6.2-GCCcore-11.2.0     (H)  13) OpenSSL/1.1                 (H)
  7) libreadline/8.1-GCCcore-11.2.0 (H)  14) Python/3.9.6-GCCcore-11.2.0

  Where:
   S:  Module is Sticky, requires --force to unload or purge
   H:             Hidden Module

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module purge
[MY_USER_NAME@CLUSTER_NAME ~ ]$ module list

Currently Loaded Modules:
  1) StdEnv (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module load BLAST+/2.11.0-gompi-2020a
[MY_USER_NAME@CLUSTER_NAME ~ ]$ module list

Currently Loaded Modules:
  1) GCCcore/9.3.0                  7) libxml2/2.9.10-GCCcore-9.3.0     13) OpenMPI/4.0.3-GCC-9.3.0    19) libpng/1.6.37-GCCcore-9.3.0
  2) zlib/1.2.11-GCCcore-9.3.0      8) libpciaccess/0.16-GCCcore-9.3.0  14) gompi/2020a                20) NASM/2.14.02-GCCcore-9.3.0
  3) binutils/2.34-GCCcore-9.3.0    9) hwloc/2.2.0-GCCcore-9.3.0        15) bzip2/1.0.8-GCCcore-9.3.0  21) libjpeg-turbo/2.0.4-GCCcore-9.3.0
  4) GCC/9.3.0                     10) libevent/2.1.11-GCCcore-9.3.0    16) PCRE/8.44-GCCcore-9.3.0    22) LMDB/0.9.24-GCCcore-9.3.0
  5) numactl/2.0.13-GCCcore-9.3.0  11) UCX/1.8.0-GCCcore-9.3.0          17) Boost/1.72.0-gompi-2020a   23) BLAST+/2.11.0-gompi-2020a
  6) XZ/5.2.5-GCCcore-9.3.0        12) libfabric/1.11.0-GCCcore-9.3.0   18) GMP/6.2.0-GCCcore-9.3.0


[MY_USER_NAME@CLUSTER_NAME ~ ]$ module unload BLAST+/2.11.0-gompi-2020a
[MY_USER_NAME@CLUSTER_NAME ~ ]$ module list

Currently Loaded Modules:
  1) GCCcore/9.3.0                  7) libxml2/2.9.10-GCCcore-9.3.0     13) OpenMPI/4.0.3-GCC-9.3.0    19) libpng/1.6.37-GCCcore-9.3.0
  2) zlib/1.2.11-GCCcore-9.3.0      8) libpciaccess/0.16-GCCcore-9.3.0  14) gompi/2020a                20) NASM/2.14.02-GCCcore-9.3.0
  3) binutils/2.34-GCCcore-9.3.0    9) hwloc/2.2.0-GCCcore-9.3.0        15) bzip2/1.0.8-GCCcore-9.3.0  21) libjpeg-turbo/2.0.4-GCCcore-9.3.0
  4) GCC/9.3.0                     10) libevent/2.1.11-GCCcore-9.3.0    16) PCRE/8.44-GCCcore-9.3.0    22) LMDB/0.9.24-GCCcore-9.3.0
  5) numactl/2.0.13-GCCcore-9.3.0  11) UCX/1.8.0-GCCcore-9.3.0          17) Boost/1.72.0-gompi-2020a
  6) XZ/5.2.5-GCCcore-9.3.0        12) libfabric/1.11.0-GCCcore-9.3.0   18) GMP/6.2.0-GCCcore-9.3.0

So module unload unloads the named module, but not its dependencies.
If we wanted to unload everything at once, we could run module purge instead.

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module purge

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) StdEnv

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module list

Currently Loaded Modules:
  1) StdEnv (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge

Note that module purge is informative. It lets us know that all but a default set of packages have been unloaded (and how to actually unload these if we truly so desired).

Software versioning & toolchains

So far, we’ve learned how to load and unload software packages. This is very useful. However, we have not yet addressed the issue of software versioning. At some point or other, you will run into issues where only one particular version of some software will be suitable. Perhaps a key bugfix only happened in a certain version, or version X broke compatibility with a file format you use. In either of these example cases, it helps to be very specific about what software is loaded.

Let’s examine the output of module avail <SOFTWARE> more closely:

[MY_USER_NAME@CLUSTER_NAME ~ ]$ module avail Python/

------------------------- /cluster/modulefiles/all --------------------------
   Biopython/1.79-foss-2021a          netcdf4-python/1.5.7-foss-2021b
   Biopython/1.79-foss-2021b          netcdf4-python/1.6.1-foss-2022a
   Biopython/1.79-foss-2022a          netcdf4-python/1.6.3-foss-2022b
   Biopython/1.81-foss-2022b          netcdf4-python/1.6.4-foss-2023a
   Boost.Python/1.76.0-GCC-10.3.0     Python/3.9.5-GCCcore-10.3.0
   Boost.Python/1.79.0-GCC-11.3.0     Python/3.9.6-GCCcore-11.2.0     (L)
   bx-python/0.8.11-foss-2021a        Python/3.10.4-GCCcore-11.3.0
   bx-python/0.8.13-foss-2021b        Python/3.10.8-GCCcore-12.2.0
   bx-python/0.9.0-foss-2022a         Python/3.11.3-GCCcore-12.3.0
   netcdf4-python/1.5.7-foss-2021a    Python/3.11.5-GCCcore-13.2.0

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

You can see that module avail Python/ lists six versions of ‘Python’, with the version number being the first part after the /. The GCCcore-* part describes the toolchain with which ‘Python’ was compiled, and its version. So the different ‘Python’ versions are compiled with toolchains from GCCcore 10.3.0 to 13.2.0.
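The naming pattern can be taken apart with plain shell parameter expansion. This sketch assumes the name/version-toolchain layout seen in the listing above; the function name split_module_name is made up, and it is not part of the module command.

```shell
# Hypothetical helper: split "Name/Version-Toolchain" into its parts.
# The body runs in a subshell, so the variables do not leak into your session.
split_module_name() (
    full="$1"
    name="${full%%/*}"        # text before the first "/"
    rest="${full#*/}"         # text after the first "/"
    version="${rest%%-*}"     # text up to the first "-"
    if [ "$rest" = "$version" ]; then
        toolchain="none"      # no "-" present, e.g. QIIME2/2022.11
    else
        toolchain="${rest#*-}"    # everything after the first "-"
    fi
    printf 'name=%s version=%s toolchain=%s\n' "$name" "$version" "$toolchain"
)

split_module_name Python/3.9.6-GCCcore-11.2.0
split_module_name QIIME2/2022.11
```

Names that carry an extra suffix after the toolchain, such as AUGUSTUS/3.3.2-intel-2018b-Python-2.7.15, end up with that suffix included in the toolchain field, so this simple split is only a starting point.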

Toolchains are standardized bundles used for installing modules. They usually consist of a compiler, math libraries and MPI implementation. The most common toolchains are GCCcore, intel and foss. It is important to know that modules created with different toolchains are often incompatible. If you try to load two modules that are based on different toolchains, you will get an error message from the module load command. This means that you should always try to find modules with matching toolchains whenever you need to load more than one application.
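When looking for modules that can be loaded together, one simple approach is to filter a list of module names by toolchain. The sketch below is a plain string match on names taken from the listings in this lesson; the function name filter_by_toolchain is made up, and the match knows nothing about any relationships between different toolchains.

```shell
# Hypothetical helper: keep only the module names (read from standard input)
# that contain "-<toolchain>", using a fixed-string match.
filter_by_toolchain() {
    grep -F -e "-$1"
}

# A few module names from the listings in this lesson.
printf '%s\n' \
    "PROJ/8.1.0-GCCcore-11.2.0" \
    "PROJ/9.0.0-GCCcore-11.3.0" \
    "Pysam/0.18.0-GCC-11.2.0" \
    "Python/3.9.6-GCCcore-11.2.0" \
    "Python/3.10.4-GCCcore-11.3.0" |
    filter_by_toolchain "GCCcore-11.2.0"
```

Of the five names, only PROJ/8.1.0-GCCcore-11.2.0 and Python/3.9.6-GCCcore-11.2.0 pass the filter.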

Using software modules in scripts

Here we create a job script that loads a particular version of Python, and prints the version number to the Slurm output file.

[MY_USER_NAME@CLUSTER_NAME ~ ]$ nano python-module.sh
[MY_USER_NAME@CLUSTER_NAME ~ ]$ cat python-module.sh

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --time=00:01:00
#SBATCH --account=<PROJECT_NAME>
#SBATCH --mem=1G
#SBATCH --job-name=Python_module_test

module purge
module load Python/3.9.6-GCCcore-11.2.0

python --version

[MY_USER_NAME@CLUSTER_NAME ~ ]$ sbatch python-module.sh

For full reproducibility it is always good practice to start your job script by purging any existing modules which you might have loaded when you submit the job script. You can then explicitly load all the dependencies for the current job, which makes it much more robust for future execution.
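A job script can also check that the commands it relies on are really available after the modules have been loaded, and stop early with a clear message if they are not. This is a minimal sketch; the function name require_command is made up, and the module lines are shown as comments because they only work on a system with the module command.

```shell
# Hypothetical helper: fail early with a clear message if a command that a
# loaded module should provide is not on PATH.
require_command() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found on PATH; did 'module load' succeed?" >&2
        return 1
    fi
}

# In the job script above, the check would follow the module lines, for example:
#   module purge
#   module load Python/3.9.6-GCCcore-11.2.0
#   require_command python || exit 1
# Here we check a command that exists on practically every system instead.
require_command sh && echo "sh is available"
```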

Exercise

Exercise (15 min)

This exercise can be performed directly on the login node. Before you start, run the command module purge to make sure your environment is clean. Verify that StdEnv is the only loaded module when running module list.

  1. How many programs (not counting versions) are there related to the keyword ‘chemistry’?

  2. Find a module for R version 4.1.2 using module avail (R is a popular software environment for statistical computing). Load this module and verify that you get a working R command in your terminal, e.g. using which R or R --version.

  3. How many other software packages were loaded alongside the requested R module?

  4. Bonus: Find a suitable version of Ruby to load alongside the R module that you already have. Hint: Here we do not care about which version of Ruby we are loading, but it needs to be compatible with the modules we have already loaded (the GCCcore versions need to be the same).