# Accessing software

```{instructor-note}
Total: 45min
(Teaching:30Min | Discussion:0min | Breaks:0min | Exercises:15Min)
```

```{objectives}
- Keypoints
  - Search for software with `module avail`
  - Load software with `module load`
  - Unload software with `module purge`
  - The module system handles software versioning and will prevent package
    conflicts for you automatically

```


On a high-performance computing system, it is seldom the case that the software we want to use is
available when we log in. It is installed, but we will need to "load" it before it can run.

Before we start using individual software packages, however, we should understand the reasoning
behind this approach. The three biggest factors are:

- software incompatibilities
- versioning
- dependencies

Software incompatibility is a major headache for programmers. Sometimes the presence (or absence) of
a software package will break others that depend on it. Two of the most famous examples are Python 2
and 3 and compiler versions. Python 3 famously provides a `python` command that conflicts with
that provided by Python 2. Software compiled against a newer version of the C++ standard library and
then run where that version is not present will fail with a nasty `'GLIBCXX_3.4.20' not found`
error, for instance.

Software versioning is another common issue. A team might depend on a certain package version for
their research project. If the software version were to change (for instance, if a package was
updated), it might affect their results. Having access to multiple software versions allows a set of
researchers to prevent software versioning issues from affecting their results.

Dependencies arise when a particular software package (or even a particular version)
depends on having access to another software package (or even a particular version of
another software package). For example, the VASP materials science software may
depend on having a particular version of the FFTW (the Fastest Fourier Transform in the West)
software library available for it to work.

Environment modules are the solution to these problems, and we will return to
this after looking at globally installed packages.


## Globally installed system packages

In this example we will use Python, which is installed globally on the login
node in one particular version.

We can test what the `python` command is actually pointing to with another command
called `which`. `which` looks for programs the same way that Bash does, so we can use
it to tell us where a particular piece of software is stored.

```console

[MY_USER_NAME@CLUSTER_NAME ~]$ python --version
Python 2.7.5
[MY_USER_NAME@CLUSTER_NAME ~]$ which python
/usr/bin/python

[MY_USER_NAME@CLUSTER_NAME ~]$ python3 --version
Python 3.6.8
[MY_USER_NAME@CLUSTER_NAME ~]$ which python3
/usr/bin/python3

```

We can see that both the `python` and `python3` executables are available. The former
points to a Python 2 version and the latter to a rather old Python 3.6.
What the output of `which` tells us is that typing the command `python3` is
equivalent to running the full command `/usr/bin/python3`.

But how did the shell know that `python` should be linked to `/usr/bin/python`?
To explain this, we first need to understand the nature of the `PATH` environment
variable. `PATH` is a special environment variable that controls where a UNIX system
looks for software. We can inspect its value with the following command (`PATH` is
the variable, `$` extracts its value, and `echo` prints the value):

```console

[MY_USER_NAME@CLUSTER_NAME ~]$ echo $PATH
/node/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/cluster/bin:/cluster/home/MY_USER_NAME/.local/bin:/cluster/home/MY_USER_NAME/bin

```

What we see here is a colon-separated (`:`) list of search paths that the shell
goes through, from left to right, when looking for the `python` command. In this
case it finds a match under `/usr/bin`, so it stops searching and runs
`/usr/bin/python`.

`````{exercise} Exercise (10 min)

1. What happens if there are other matching commands located later in the search
`PATH`, e.g. `/cluster/bin/python`?

2. What happens if you have an executable script in your current directory with
the same name as a globally installed program?

````{solution}
1. If there are other matching commands later in the search path, these will be
shadowed by the first command found. The shell stops searching as soon as it has
found the command in a directory.

2. If your current directory comes first in the search path, the script will be executed.
If, on the other hand, the directory with the globally installed program comes first (or the
current directory is not in the search path at all, which is the usual default), the global
program will be executed. To execute a command named `python` in your current directory, do:
```console
$ ./python
```
````
`````
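The left-to-right lookup described in this section can be sketched in a few lines of shell. This is only an illustration of the idea, not how Bash is implemented: `first_in_path` is a hypothetical helper name, and its optional second argument (a search path to use instead of `PATH`) is an assumption added to make experimenting easy.

```shell
# Hypothetical helper (not a standard command): print the first executable
# called NAME found in a colon-separated search path, scanning left to right.
# The body runs in a subshell, so IFS and the variables do not leak out.
first_in_path() (
    name=$1
    search_path=${2:-$PATH}      # default to the real PATH
    set -f                       # no filename globbing while splitting
    IFS=:
    for dir in $search_path; do
        [ -n "$dir" ] || dir=.   # an empty entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            exit 0               # stop at the first match; later ones are shadowed
        fi
    done
    exit 1                       # nothing found
)

first_in_path python3 || echo "python3 not found in PATH"
```

In Bash itself, `type -a python` lists every match in search order, which is a quick way to see which commands are being shadowed.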


## Environment modules

A *module* is a self-contained description of a software package - 
it contains the settings required to run a software package 
and, usually, encodes required dependencies on other software packages.

There are a number of different environment module implementations commonly
used on HPC systems: the two most common are *TCL modules* and *Lmod*. Both of
these use similar syntax and the concepts are the same so learning to use one will
allow you to use whichever is installed on the system you are using. In both 
implementations the `module` command is used to interact with environment modules. A
subcommand is usually added to the command to specify what you want to do. For a list
of subcommands you can use `module -h` or `module help`. As with most commands, you can
access the full help on the *man* pages with `man module`.

On login, you may start out with a default set of modules loaded, or you may start out
with an empty environment; this depends on the setup of the system you are using.


## Listing currently loaded modules

You can use the `module list` command to see which modules you currently have loaded
in your environment. After logging into one of our systems, your environment
should ideally be clean like this:

```console
[MY_USER_NAME@CLUSTER_NAME ~ ]$ module list

Currently Loaded Modules:
  1) StdEnv (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge
```

You can see that one module is loaded, which has the special attribute of being
sticky (`S`). That means that it is not usually unloaded, typically because it
is important for the system to function correctly (so removing it with `--force` is
advised against).
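If you want the list of loaded modules in a script rather than on screen, one option is the `LOADEDMODULES` environment variable. The sketch below assumes that your module system keeps a colon-separated list of loaded modules in that variable (to our knowledge both Lmod and TCL modules do, but check on your own system); `count_modules` is a hypothetical helper name.

```shell
# Hypothetical helper: count the entries in a colon-separated module list.
# With no argument it reads LOADEDMODULES (assumed to be set by the module system).
count_modules() (
    list=${1-${LOADEDMODULES-}}
    set -f          # no filename globbing while splitting
    IFS=:
    set -- $list    # split the list into positional parameters
    echo "$#"
)

count_modules       # prints 0 if no modules are loaded or the variable is unset
```

Passing a list explicitly, for example `count_modules "StdEnv:R/4.1.0-foss-2021a"`, counts the entries of that list instead.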


## Finding and listing available modules

One way to look for available software is to search for keywords using
`module keyword <KEYWORD>`. This will look through the module metadata and return anything that
matches. For example, `module keyword bio` lists bioinformatics programs that can
be loaded using modules.

Another way is `module avail`, which lists the modules available on the system. You can
refine the search by adding a search string to the command, like `module avail <SOFTWARE>`.
In contrast to the `module keyword` search, this string is only matched against the module
name, not against any metadata. For example, we can list all modules
that match the string 'python/' (including the '/'):

```console
[MY_USER_NAME@CLUSTER_NAME ~]$ module avail python/

------------------------------------------------ /cluster/modulefiles/all --------------------------
   Biopython/1.72-foss-2018b-Python-2.7.15         netcdf4-python/1.4.1-intel-2018b-Python-3.6.6
   Biopython/1.72-foss-2018b-Python-3.6.6          netcdf4-python/1.5.3-foss-2019b-Python-3.7.4
   Biopython/1.72-intel-2018b-Python-2.7.15        netcdf4-python/1.5.3-intel-2019b-Python-3.7.4
   Biopython/1.73-foss-2019a                       netcdf4-python/1.5.5.1-foss-2020b
   Biopython/1.73-intel-2019a                      netcdf4-python/1.5.5.1-intel-2020b
   Biopython/1.75-foss-2019b-Python-3.7.4          netcdf4-python/1.5.7-foss-2021b
   Biopython/1.75-intel-2019b-Python-3.7.4         netcdf4-python/1.5.7-intel-2021a
   Biopython/1.78-foss-2020a-Python-3.8.2          netcdf4-python/1.5.7-intel-2021b
   Biopython/1.78-foss-2020b                       pfft-python/0.1.21-foss-2020a-Python-3.8.2
   Biopython/1.78-intel-2020a-Python-3.8.2         Python/2.7.15-foss-2018b
   Biopython/1.78-intel-2020b                      Python/2.7.15-fosscuda-2018b
   Biopython/1.79-foss-2021a                       Python/2.7.15-GCCcore-8.2.0
   Biopython/1.79-foss-2021b                       Python/2.7.15-intel-2018b
   Biopython/1.79-intel-2021a                      Python/2.7.16-GCCcore-8.3.0
   Biopython/1.79-intel-2021b                      Python/2.7.18-GCCcore-9.3.0
   bx-python/0.8.2-foss-2018b-Python-2.7.15        Python/2.7.18-GCCcore-10.2.0
   bx-python/0.8.4-foss-2019a                      Python/3.6.6-foss-2018b
   bx-python/0.8.9-foss-2020a-Python-3.8.2         Python/3.6.6-fosscuda-2018b
   bx-python/0.8.11-foss-2021a                     Python/3.6.6-intel-2018b
   GitPython/3.1.9-GCCcore-9.3.0-Python-3.8.2      Python/3.7.2-GCCcore-8.2.0
   IPython/5.8.0-foss-2018b-Python-2.7.15          Python/3.7.4-GCCcore-8.3.0
   IPython/7.2.0-foss-2018b-Python-3.6.6           Python/3.8.2-GCCcore-9.3.0
   IPython/7.13.0-fosscuda-2020a-Python-3.8.2      Python/3.8.6-GCCcore-10.2.0
   IPython/7.15.0-intel-2020a-Python-3.8.2         Python/3.9.5-GCCcore-10.3.0
   netcdf4-python/1.4.1-foss-2018b-Python-3.6.6    Python/3.9.6-GCCcore-11.2.0

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

```

## Loading and unloading software

Any of the software modules that we found in the previous section can be loaded
into our environment using the `module load` command. Let's say we are not happy
with the system version of Python that we get when logging in to the cluster
(see "Globally installed system packages" above). We can then instead load a
module for the Python version that we want:

```console
[MY_USER_NAME@CLUSTER_NAME ~ ]$ module load Python/3.9.6-GCCcore-11.2.0

[MY_USER_NAME@CLUSTER_NAME ~ ]$ which python
/cluster/software/Python/3.9.6-GCCcore-11.2.0/bin/python

[MY_USER_NAME@CLUSTER_NAME ~ ]$ python --version
Python 3.9.6
```

So, what just happened? Let's have a look at the `PATH` variable again:

```console
[MY_USER_NAME@CLUSTER_NAME ~ ]$ echo $PATH

/cluster/software/Python/3.9.6-GCCcore-11.2.0/bin:
/cluster/software/OpenSSL/1.1/bin:
....
/cluster/bin:
/cluster/home/MY_USER_NAME/.local/bin:
/cluster/home/MY_USER_NAME/bin
```

You'll notice that the output is much longer than it was before we
loaded the Python module, and if you look closely you'll see that the last
entries of the output are identical to what we had before. By loading the
module, we changed the `PATH` by adding entries to the _beginning_
of the list. The shell will therefore look in the
`/cluster/software/Python/3.9.6-GCCcore-11.2.0/bin` etc. locations first, before
moving on to the "system" paths `/usr/bin` etc.
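The effect of putting a directory at the front of `PATH` can be tried out without any module system. The sketch below only imitates what loading a module does to `PATH`; the temporary directory and the `demo-tool` script are hypothetical stand-ins for a module's `bin/` directory and its program.

```shell
# Illustration only: imitate the effect of `module load` on PATH.
demo_dir=$(mktemp -d)            # stand-in for a module's bin/ directory
printf '#!/bin/sh\necho "hello from the demo tool"\n' > "$demo_dir/demo-tool"
chmod +x "$demo_dir/demo-tool"

old_path=$PATH
PATH="$demo_dir:$PATH"           # prepend, so this directory is searched first
resolved=$(command -v demo-tool) # where the shell now finds the command
echo "$resolved"
demo-tool

PATH=$old_path                   # put PATH back, roughly what unloading does
command -v demo-tool || echo "demo-tool is no longer found"
rm -rf "$demo_dir"
```

A real module typically changes more than `PATH` (library and manual page search paths, for example), so this shows the principle rather than the full picture.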


Taking this to its conclusion, `module load` "loads" software by adding it to your `$PATH`.
A special note on this: depending on which version of the `module` program is
installed at your site, `module load` will also load required software dependencies.


To unload the modules again, use `module purge`. Note that the output of `module purge` is
informative: it lets us know that all but a default set of packages
have been unloaded (and how to actually unload these if we truly so desired).
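A common pattern is to combine the two commands: start from a clean environment, then load exactly the modules you need. The helper below is a sketch of that pattern, not something provided by the module system; `load_clean` is a hypothetical name, and it uses only the `module purge`, `module load` and `module list` subcommands covered in this lesson.

```shell
# Hypothetical helper: purge, load exactly the requested modules, then show
# what ended up loaded (including any dependencies pulled in automatically).
load_clean() {
    if [ "$#" -eq 0 ]; then
        echo "usage: load_clean MODULE [MODULE ...]" >&2
        return 2
    fi
    if ! command -v module >/dev/null 2>&1; then
        echo "the module command is not available in this shell" >&2
        return 1
    fi
    module purge && module load "$@" && module list
}

# Example (on a cluster with this module installed):
#   load_clean Python/3.9.6-GCCcore-11.2.0
```

Starting from a purged environment makes it easier to see which modules a given piece of software really needs.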


`````{exercise} Exercise (15 min)

This exercise can be performed directly on the login node. Before you start,
run the command `module purge` to make sure your environment is clean. Verify
that `StdEnv` is the only loaded module when running `module list`.


1. How many programs (not counting versions) are there related to the keyword
'chemistry'?

2. Find a module for `R` version 4.1.0 using `module avail` (R is a popular
software environment for statistical computing). Load this module and verify
that you get a working `R` command in your terminal, e.g. using `which R` or
`R --version`.

3. How many other software packages were loaded alongside the requested `R`
module?

4. **Bonus:** Find a suitable version of `Ruby` to load alongside the `R` module
that you already have. **Hint:** Here we do not care about which version of `Ruby`
we are loading, but it needs to be _compatible_ with the modules we have already
loaded (`GCCcore` versions need to be the same).

````{solution}


1. This depends on the cluster. Check with
   ```console
   $ module keyword chemistry
   ```
   which at the time of writing found seven packages on Saga: `ADF`, `NBO7`, `NWChem`,
   `OpenBabel`, `OpenMolcas`, `ORCA` and `Schrodinger`.

2. We can search for modules using `module avail`, and we can restrict the
   search by being more specific about the version, here `4`:
   ```console
   $ module avail R/4
      -------------------------------------- /cluster/modulefiles/all ---------------------------------------
      MUMmer/4.0.0beta2-foss-2018b       R/4.1.0-foss-2021a
      MUMmer/4.0.0beta2-GCCcore-9.3.0    R/4.1.2-foss-2021b
      R/4.0.0-foss-2020a                 RepeatMasker/4.0.9-p2-gompi-2019a-HMMER
      R/4.0.0-fosscuda-2020a             RepeatMasker/4.0.9-p2-gompi-2019b-HMMER
      R/4.0.3-foss-2020b                 RepeatMasker/4.1.2-p1-foss-2020b
      R/4.0.3-fosscuda-2020b             Singular/4.1.2-GCC-8.2.0-2.31.1

   Use "module spider" to find all possible modules.
   Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
   ```
   We see there is only one module matching version `4.1.0`, so we load this one:
   ```console
   $ module load R/4.1.0-foss-2021a
   ```

   Finally, we verify that we have the correct version available on the command line:

   ```console
   $ which R
   /cluster/software/R/4.1.0-foss-2021a/bin/R

   $ R --version
   R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
   Copyright (C) 2021 The R Foundation for Statistical Computing
   Platform: x86_64-pc-linux-gnu (64-bit)

   R is free software and comes with ABSOLUTELY NO WARRANTY.
   You are welcome to redistribute it under the terms of the
   GNU General Public License versions 2 or 3.
   For more information about these matters see
   https://www.gnu.org/licenses/.
   ```

3. Check the number of loaded modules with
   ```console
   $ module list
   ...
   [removed long output]
   ...
   39) gettext/0.21-GCCcore-10.3.0       (H)  80) PCRE/8.44-GCCcore-10.3.0             (H)
   40) PCRE2/10.36-GCCcore-10.3.0        (H)  81) libgit2/1.1.0-GCCcore-10.3.0         (H)
   41) GLib/2.68.2-GCCcore-10.3.0        (H)  82) R/4.1.0-foss-2021a
   ```
   which in this case outputs 82 different modules. So in addition to the
   original `StdEnv` and the module we actively loaded (`R/4.1.0-foss-2021a`),
   we got 80 other software packages loaded at the same time.

4. **Bonus:** When we look at the output from the `module list` command above, we
   see that most of the loaded modules contain the `GCCcore-10.3.0` suffix. This
   means that they were all compiled using the same "core" compiler, and thus
   should be fully compatible. If we want to load another (seemingly independent)
   module at the same time, we need to make sure that it is compatible with this
   core compiler. Searching for `Ruby` gives:
   ```console
   $ module avail ruby

   -------------------------------------- /cluster/modulefiles/all ---------------------------------------
     Ruby/2.6.1-GCCcore-7.3.0    Ruby/2.7.1-GCCcore-8.3.0    Ruby/2.7.2-GCCcore-10.2.0
     Ruby/2.6.3-GCCcore-8.2.0    Ruby/2.7.2-GCCcore-9.3.0    Ruby/3.0.1-GCCcore-10.3.0
   ```
   where we see that only the last one has a `GCCcore` version compatible with
   our current `R`, so this one can be loaded without any problems:
   ```console
   $ module load Ruby/3.0.1-GCCcore-10.3.0
   $ ruby --version
   ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
   ```
   ```{warning}
   If you try to load any of the other versions of `Ruby`, you will get an
   error message telling you that the site does not allow "automatic swapping of
   module with the same name". You can still manually do such swapping of
   modules, as explained in the same error message, but it is **not recommended**,
   as it can lead to weird runtime errors that are hard to debug. 
   ```

````

`````
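The selection made by eye in the bonus question can also be scripted. The sketch below works purely on module *names*, using the `Ruby` modules listed in the solution above; it assumes that the toolchain appears as a suffix of the module name, which is a naming convention and not something the `module` command enforces.

```shell
# Module names copied from the `module avail ruby` output in the solution.
candidates="Ruby/2.6.1-GCCcore-7.3.0
Ruby/2.6.3-GCCcore-8.2.0
Ruby/2.7.1-GCCcore-8.3.0
Ruby/2.7.2-GCCcore-9.3.0
Ruby/2.7.2-GCCcore-10.2.0
Ruby/3.0.1-GCCcore-10.3.0"

# Toolchain suffix shared by the modules loaded together with R/4.1.0-foss-2021a.
toolchain="GCCcore-10.3.0"

# Keep only the names that end in the wanted toolchain suffix.
compatible=$(printf '%s\n' "$candidates" | while IFS= read -r name; do
    case $name in
        (*-"$toolchain") printf '%s\n' "$name" ;;
    esac
done)

printf '%s\n' "$compatible"
```

Matching on names is only a heuristic: modules built with a full toolchain such as `foss-2021a` do not carry the `GCCcore` version in their own name, so the list of loaded modules remains the place to check which core compiler is in use.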