Transferring files

Instructor note

Total: 30min (Teaching&Demo:30min | Discussion:0min | Breaks:0min | Exercises:0min)

Objectives

  • Questions

    • How do I upload/download files to the cluster?

  • Objectives

    • Be able to transfer files to and from a compute-cluster.

  • Keypoints

    • wget downloads a file from the internet.

    • rsync/scp transfer files to and from your computer.

    • You can use Visual Studio Code to transfer files with drag and drop.

A remote computer is of very limited use if we cannot get files to and from it. There are several options for transferring data between computing resources, from command line tools to GUI programs, which we will cover here.

Downloading from the internet

Download files from the internet using wget

One of the most straightforward ways to download files is to use wget. Any file that can be downloaded in your web browser with an accessible link can be downloaded using wget. This is a quick way to download datasets or source code.

The syntax is: wget https://some/link/to/a/file.tar.gz. For example, download the lesson sample files using the following command:

# To find the value of <URL> refer to the downloads section of the tutorial

[MY_USER_NAME@CLUSTER_NAME ~]$ wget <URL>
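As a minimal sketch of how the download step could be wrapped in a script, the helpers below show which file name wget uses by default and how an interrupted download can be resumed with the -c option. The function names saved_name and fetch are hypothetical and only for illustration. The sketch assumes a plain URL without a query string, and nothing is downloaded unless you call fetch yourself.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of wget.

# Print the file name wget uses by default for a plain URL:
# the last component of the URL path.
saved_name() {
    basename "$1"
}

# Download a file, resuming a partial download if one exists (-c),
# then show the size of the result.
fetch() {
    local url=$1
    wget -c "$url" && ls -lh "$(saved_name "$url")"
}

# No download happens here; this only shows the expected file name.
saved_name "https://some/link/to/a/file.tar.gz"   # prints file.tar.gz
```

On the cluster you would then call fetch with the link from the downloads section of the tutorial.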

Downloading GitHub repositories

Sometimes the data, pipeline or software you need is stored in a repository on GitHub or GitLab. In this case you can either download individual (“raw”) files using wget or clone the whole repository with git clone.

For example, we can download this test repository with:

[MY_USER_NAME@CLUSTER_NAME ~]$ git clone https://github.com/test/HelloWorld.git

It will be saved into the current directory in a new folder named after the repository, so HelloWorld in this case.
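The naming rule can be tried out without network access. The sketch below uses an empty bare repository on local disk as a stand-in for a hosted one (an assumption made purely for the demonstration) and shows both the default folder name and the optional second argument of git clone that sets the folder name explicitly. The name my-copy is arbitrary.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip the demonstration if git is not available.
command -v git >/dev/null || { echo "git is not installed"; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/clones"

# Stand-in for a hosted repository: an empty bare repository on local disk.
git init --bare -q "$work/HelloWorld.git"

cd "$work/clones"

# Default: the folder is named after the repository, without the .git suffix.
git clone -q "$work/HelloWorld.git"

# Optional second argument: choose the folder name yourself.
git clone -q "$work/HelloWorld.git" my-copy

# List the two folders that were created.
ls "$work/clones"
```

Git warns that the cloned repository is empty, which is expected here because the stand-in repository has no commits.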

Transferring files

Transferring single files and folders

To move files between your computer and the clusters you can use scp, rsync or other tools. As a best practice, we recommend using rsync. This utility allows you to transfer files in an easy and secure manner. On a Windows system, you need to use rsync through Windows Subsystem for Linux (WSL).

We provide examples for both rsync and scp.

To upload to the cluster:

To transfer a single file from your local machine to a cluster using scp, run the following command:

[user@laptop ~]$ scp /path/to/local/file username@saga.sigma2.no:/path/to/remote/directory

The example is for the saga cluster. Replace /path/to/local/file with the path to the file on your local machine, username with your username on the cluster, and /path/to/remote/directory with the path to the remote directory where you want to store the file.

An example:

[user@laptop ~]$ echo $(date) > from_laptop.txt
[user@laptop ~]$ scp from_laptop.txt username@saga.sigma2.no:

# Log in to saga and check the file in the HOME folder

To download from the cluster:

# Create a file on SAGA
[username@login-5.SAGA ~]$ echo $(hostname -f) > from_cluster.txt
[username@login-5.SAGA ~]$ echo $(date) >> from_cluster.txt
# From the laptop download it
[user@laptop ~]$ scp username@saga.sigma2.no:from_cluster.txt .
Password: 

[user@laptop ~]$ cat from_cluster.txt
login-5.saga
ma. 14. mars 19:14:53 +0100 2023

The output from cat will vary depending on which login node you were on and when you created the file.

You can transfer multiple files or directories from your local machine to the cluster and vice versa with scp. Use the -r option to copy a directory recursively. This assumes that there’s a single directory containing all of the files you want to transfer (and nothing else).

For example, to transfer a directory from your local machine to the cluster:

[user@laptop ~]$ scp -r /path/to/local/directory1 username@saga.sigma2.no:/path/to/remote/directory

Or you can use the wildcard * to transfer multiple files.

For example, to transfer multiple files from a cluster to your local machine, use this command:

[user@laptop ~]$ scp username@saga.sigma2.no:/path/to/remote/directory1/* /path/to/local/directory2

This will copy all the files under directory1 on the cluster to directory2 on your laptop. Note that directory1 itself is not transferred, only its contents, and subdirectories are skipped unless you add the -r option. If your local shell complains that the pattern has no matches (zsh does this by default), put the remote path in quotes.

Transferring Large Amounts of Data with rsync

When transferring a large amount of data, it’s recommended to use the --partial and --progress options. The --partial option ensures that partially transferred files are kept, allowing you to resume the transfer in case of interruption. The --progress option displays the progress of the transfer. The -P option combines these flags into one.

To transfer a large amount of data from your local machine to the fram cluster, use the following command:

rsync -avzP /path/to/local/directory username@fram.sigma2.no:/path/to/remote/directory

To transfer a large amount of data from the fram cluster to your local machine, use this command:

rsync -avzP username@fram.sigma2.no:/path/to/remote/directory /path/to/local/directory
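One detail worth checking before a large transfer is whether the source path ends with a slash, because rsync treats the two forms differently. The sketch below demonstrates this with temporary local directories as a stand-in for the cluster (an assumption so that it runs without a login), and uses the -n (dry run) option to preview a transfer without copying anything. The directory and file names are made up for the demonstration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rsync may be missing (on Windows it is only available through WSL).
command -v rsync >/dev/null || { echo "rsync is not installed"; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/directory" "$work/dest_a" "$work/dest_b"
echo "sample data" > "$work/directory/results.txt"

# Preview only: -n lists what would be transferred without copying anything.
rsync -avn "$work/directory/" "$work/dest_b"

# Without a trailing slash the directory itself is copied into the destination.
rsync -a "$work/directory" "$work/dest_a"

# With a trailing slash only the contents of the directory are copied.
rsync -a "$work/directory/" "$work/dest_b"

# Show where the file ended up in each case.
(cd "$work" && find dest_a dest_b -type f | sort)
```

The same rule applies to the remote commands above: as written, without a trailing slash on the source, the source directory is created inside the destination directory.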

Transferring files using a graphical user interface

While command line tools like rsync are efficient for transferring files between your computer and the cluster where you want to do your work, they can be quite intimidating and overwhelming for beginners and inexperienced users. A nice built-in feature of Visual Studio Code (https://code.visualstudio.com) is its ability to act as a GUI for moving files with drag-and-drop functionality.

This is how to use Visual Studio Code as a GUI for file transfer:

Log in using SSH in Visual Studio Code as described here: Connecting to a system with Visual Studio Code. Follow the instructions fully.

Then open a local folder, for instance your Documents folder. To move files or folders to the remote server, drag them to the left side column of your VS Code window. To copy files back to your client machine, right-click (on Mac, ctrl+left-click) the folder or file you want to download, then choose Download from the dropdown menu.

You can also use VS Code to read the content of files, edit files to some extent, delete files and make new folders and files like you are used to on your local client using GUI tools.

Working with files generated in different environments

When you transfer files between different environments, note that a file created or edited in one environment may not open or run correctly in another. A well known issue is that files transferred from a Windows environment to a Unix system environment (Mac, Linux, BSD, Solaris, etc.) can cause problems. On a Unix system, every line in a file ends with a \n (newline). On Windows, every line in a file ends with a \r\n (carriage return + newline).

Though most modern programming languages and software handle this correctly, in some rare instances you may run into an issue. You can check whether a file has Windows line endings with cat -A filename. A file with Windows line endings will have ^M$ at the end of every line. A file with Unix line endings will have $ at the end of every line.

The solution is to either edit the file manually in the Unix system environment, or to convert the file from Windows to Unix line endings by running dos2unix filename:

[MY_USER_NAME@CLUSTER_NAME ~]$ dos2unix File-created-on-windows.txt

(Conversely, to convert back to Windows format, you can run unix2dos filename)
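If dos2unix is not installed, the same check and conversion can be approximated with standard tools. The following is a minimal sketch, not a replacement for dos2unix: it creates a small file with Windows line endings, counts the lines ending in a carriage return with grep, and strips that character with sed. The helper name count_crlf and the output file name converted.txt are made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
win_file="$work/File-created-on-windows.txt"

# Write two lines terminated by \r\n, as a Windows editor would.
printf 'first line\r\nsecond line\r\n' > "$win_file"

# Count lines that end with a carriage return (0 means Unix line endings).
count_crlf() {
    grep -c $'\r$' "$1" || true
}

before=$(count_crlf "$win_file")

# Strip the trailing carriage return from every line into a converted copy.
unix_file="$work/converted.txt"
sed $'s/\r$//' "$win_file" > "$unix_file"

after=$(count_crlf "$unix_file")
echo "CRLF lines before: $before, after: $after"
# prints: CRLF lines before: 2, after: 0
```

Writing to a separate file keeps the original untouched, which is useful if you still need the Windows version.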

Information that might be useful

About Ports

File transfers with scp, rsync and Visual Studio Code, as described above, use encrypted communication over port 22. This is the same connection method used by SSH. In fact, all file transfers using these methods occur through an SSH connection. If you can connect via SSH over the normal port, you will be able to transfer files.

About Backup

The files in your home (/cluster/home) and project folder (/cluster/projects) are regularly backed up to either NIRD or one of the other clusters, as described in the documentation. So if you ever accidentally delete or overwrite a file in one of those folders, you may be able to get your data back.

About File Transfer from NIRD

The NIRD project storage areas, namely NIRD Data Peak (TS) and NIRD Data Lake (DL), are mounted on the login nodes of Betzy, Fram, and Saga. You can access the NIRD project area directly from the login nodes of these compute clusters with the cp command. More details here.