Transferring files
Instructor note
Total: 30min (Teaching&Demo:30Min | Discussion:0min | Breaks:0min | Exercises:0min)
Objectives
Questions
How do I upload/download files to the cluster?
Objectives
Be able to transfer files to and from a compute-cluster.
Keypoints
wget downloads a file from the internet.
rsync/scp transfer files to and from your computer.
You can use Visual Studio Code to transfer files with drag and drop.
Computing with a remote computer offers very limited use if we cannot get files to and from the cluster. There are several options for transferring data between computing resources, from command line options to GUI programs, which we will cover here.
Downloading from the internet
Download files from the internet using wget
One of the most straightforward ways to download files is to use wget. Any file that can be downloaded in your web browser with an accessible link can be downloaded using wget. This is a quick way to download datasets or source code.
The syntax is: wget https://some/link/to/a/file.tar.gz. For example, download the lesson sample files using the following command:
# To find the value of <URL> refer to the downloads section of the tutorial
[MY_USER_NAME@CLUSTER_NAME ~]$ wget <URL>
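As a minimal sketch of what to expect after the download: by default wget names the saved file after the last part of the URL path. The snippet below only derives that name and checks whether such a file already exists in the current directory; it does not contact the network. The URL is the placeholder from the syntax example above, not a real address.

```shell
# Placeholder URL taken from the syntax example; replace it with a real link.
url="https://some/link/to/a/file.tar.gz"

# Strip everything up to and including the last slash to get the file name.
filename="${url##*/}"
echo "wget would save this download as: $filename"

# Check whether a file with that name is already present before downloading.
if [ -e "$filename" ]; then
    echo "$filename already exists in $(pwd)"
else
    echo "$filename not found; download it with: wget $url"
fi
```

If you prefer a different name for the saved file, wget accepts -O followed by the desired file name.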
Downloading GitHub repositories
Sometimes the data, pipeline or software you need is stored in a repository on GitHub or GitLab. In this case you can either download individual (“raw”) files using wget or the whole repository with git clone.
For example, we can download this test repository with:
[MY_USER_NAME@CLUSTER_NAME ~]$ git clone https://github.com/test/HelloWorld.git
The repository will be saved in the current directory, in a new folder named after the repository, so `HelloWorld` in this case.
Transferring files
Transferring single files and folders
To move files between your computer and the clusters you can use scp, rsync, or other tools.
As a best practice, we recommend using rsync. This utility allows you to transfer files in an easy and secure manner. On a Windows system, you need to use rsync through Windows Subsystem for Linux (WSL).
We provide examples for both rsync and scp.
To upload to the cluster:
To transfer a single file from your local machine to a cluster using scp, run the following command:
[user@laptop ~]$ scp /path/to/local/file username@saga.sigma2.no:/path/to/remote/directory
The example is for the saga cluster.
Replace /path/to/local/file with the path to the file on your local machine, username with your username on the cluster, and /path/to/remote/directory with the path to the remote directory where you want to store the file.
An example:
[user@laptop ~]$ echo $(date) > from_laptop.txt
[user@laptop ~]$ scp from_laptop.txt username@saga.sigma2.no:
# Log in to saga and check the file in the HOME folder
To transfer a single file from your local machine to a cluster using rsync, run the following command:
[user@laptop ~]$ rsync -avz /path/to/local/file username@login-1.saga.sigma2.no:/path/to/remote/directory
The example is for the saga cluster.
Replace /path/to/local/file with the path to the file on your local machine, username with your username on the cluster, and /path/to/remote/directory with the path to the remote directory where you want to store the file.
An example:
[user@laptop ~]$ echo $(date) > from_laptop.txt
[user@laptop ~]$ rsync -avz from_laptop.txt username@saga.sigma2.no:
# Log in to SAGA and check the file in the HOME folder
To download from the cluster:
# Create a file on SAGA
[username@login-5.SAGA ~]$ echo $(hostname -f) > from_cluster.txt
[username@login-5.SAGA ~]$ echo $(date) >> from_cluster.txt
# From the laptop download it
[user@laptop ~]$ scp username@saga.sigma2.no:from_cluster.txt .
Password:
[user@laptop ~]$ cat from_cluster.txt
login-5.saga
ma. 14. mars 19:14:53 +0100 2023
The output from cat will vary depending on which login node you were on when you created the file.
You can transfer multiple files or directories from your local machine to the cluster and vice versa with scp. Use the -r option to copy directories recursively. This assumes that there’s a single directory containing all of the files you want to transfer (and nothing else).
For example, to transfer a directory from your local machine to the cluster:
[user@laptop ~]$ scp -r /path/to/local/directory1 username@saga.sigma2.no:/path/to/remote/directory
Alternatively, you can use the wildcard * to transfer multiple files. For example, to transfer multiple files from a cluster to your local machine, use this command:
[user@laptop ~]$ scp username@saga.sigma2.no:/path/to/remote/directory1/* /path/to/local/directory2
This will copy all the files under directory1 on the cluster to your laptop under directory2. Note that directory1 itself will not be transferred, only its content.
# Create a file on SAGA
[MY_USER_NAME@login-1.SAGA ~]$ echo $(hostname) > from_cluster.txt
[MY_USER_NAME@login-1.SAGA ~]$ echo $(date) >> from_cluster.txt
# From the laptop download it
[user@laptop ~]$ rsync -avz username@login-1.saga.sigma2.no:from_cluster.txt .
Password:
[user@laptop ~]$ cat from_cluster.txt
login-1.saga
ma. 14. mars 19:14:53 +0100 2023
To transfer multiple files or directories from your local machine to the cluster, use the following command:
[user@laptop ~]$ rsync -avz /path/to/local/directory1 /path/to/local/file2 username@saga.sigma2.no:/path/to/remote/directory
To transfer multiple files or directories from a cluster to your local machine, use this command:
rsync -avz username@saga.sigma2.no:/path/to/remote/directory1 /path/to/local/directory
A trailing slash on the target directory is optional and has no effect here, although it can matter in other commands.
A trailing slash on a source directory makes rsync copy only the content of the folder, not the folder itself.
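The effect of a trailing slash on the source can be tried safely on your own machine before touching the cluster. The sketch below copies between temporary local directories only; the directory and file names (data, dest_without, dest_with, a.txt) are made up for illustration, and the demonstration is skipped if rsync is not installed.

```shell
# Local-only demonstration: no cluster is contacted.
if command -v rsync >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir -p "$work/data" "$work/dest_without" "$work/dest_with"
    echo "hello" > "$work/data/a.txt"

    # No trailing slash on the source: the folder itself is copied.
    rsync -a "$work/data" "$work/dest_without"

    # Trailing slash on the source: only the content of the folder is copied.
    rsync -a "$work/data/" "$work/dest_with"

    echo "Without trailing slash:"
    ls "$work/dest_without"   # lists the folder: data
    echo "With trailing slash:"
    ls "$work/dest_with"      # lists the content: a.txt
else
    echo "rsync is not installed; skipping the demonstration"
fi
```

The same rule applies when the source or the target is on a remote machine.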
Transferring Large Amounts of Data with rsync
When transferring a large amount of data, it’s recommended to use the --partial and --progress options. The --partial option ensures that partially transferred files are kept, allowing you to resume the transfer in case of interruption. The --progress option displays the progress of the transfer. The -P option combines these flags into one.
To transfer a large amount of data from your local machine to the fram cluster, use the following command:
rsync -avzP /path/to/local/directory username@fram.sigma2.no:/path/to/remote/directory
To transfer a large amount of data from the fram cluster to your local machine, use this command:
rsync -avzP username@fram.sigma2.no:/path/to/remote/directory /path/to/local/directory
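Because -P keeps partially transferred files, an interrupted rsync command can simply be run again. One possible way to automate this, shown here only as a sketch, is a small retry wrapper. The retry function and the flaky_transfer stand-in are hypothetical helpers written for this illustration; they are not part of rsync. The stand-in fails twice and then succeeds, so the snippet runs without any network connection.

```shell
# Retry a command up to max_attempts times, waiting delay seconds in between.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Stand-in for an unreliable transfer: fails on the first two calls.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky_transfer() {
    n=$(( $(cat "$counter_file") + 1 ))
    echo "$n" > "$counter_file"
    [ "$n" -ge 3 ]
}

retry 5 0 flaky_transfer
calls=$(cat "$counter_file")
echo "transfer succeeded after $calls calls"

# With a real transfer the call would look like this (paths are placeholders):
# retry 5 30 rsync -avzP /path/to/local/directory username@fram.sigma2.no:/path/to/remote/directory
```

Note that each new attempt may ask for your password again unless key-based login is set up.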
Transferring files using a graphical user interface
While command line tools like rsync are efficient for transferring files between your computer and the cluster where you want to do your work, they can be intimidating and overwhelming for beginners and inexperienced users. A nice built-in feature of Visual Studio Code (https://code.visualstudio.com) is that it can act as a GUI for moving files with drag-and-drop functionality.
This is how to use Visual Studio Code as a GUI for file transfer:
Log in using ssh in Visual Studio Code as described here: Connecting to a system with Visual Studio Code. Follow the instructions fully.
Then open a local folder, for instance your Documents folder. To move files or folders to the remote server, drag them to the left side column in your VS Code window. To copy files back to your client machine, right-click (or, on Mac, ctrl+left-click) the folder or file you want to download, then choose Download from the dropdown menu.
You can also use VS Code to read the content of files, edit files to some extent, delete files and make new folders and files like you are used to on your local client using GUI tools.
Working with files generated in different environments
When you transfer files between different environments, note that opening or executing a file that was created or edited in a different environment from the one where you plan to use it can cause problems. A well-known issue arises with files transferred from a Windows environment to a Unix system environment (Mac, Linux, BSD, Solaris, etc.). On a Unix system, every line in a file ends with a \n (newline). On Windows, every line in a file ends with a \r\n (carriage return + newline).
Though most modern programming languages and software handle this correctly, in some rare instances you may run into an issue. You can check whether a file has Windows line endings with cat -A filename. A file with Windows line endings will have ^M$ at the end of every line. A file with Unix line endings will have $ at the end of every line.
The solution is either to edit the file manually in the Unix system environment, or to convert the file from Windows to Unix line endings by running dos2unix filename:
[MY_USER_NAME@CLUSTER_NAME ~]$ dos2unix File-created-on-windows.txt
(Conversely, to convert back to Windows format, you can run unix2dos filename.)
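To see the difference for yourself, the sketch below creates a small file with Windows line endings, counts the lines that contain a carriage return, and then strips them. It uses sed as a stand-in for dos2unix in case dos2unix is not installed; the file names windows.txt and unix.txt are made up for this illustration.

```shell
workdir=$(mktemp -d)
cr=$(printf '\r')   # a literal carriage return character

# Create a two-line file with Windows (CRLF) line endings.
printf 'first line\r\nsecond line\r\n' > "$workdir/windows.txt"

# Count the lines that contain a carriage return.
crlf_before=$(grep -c "$cr" "$workdir/windows.txt")

# Remove the carriage return at the end of each line (stand-in for dos2unix).
sed "s/${cr}\$//" "$workdir/windows.txt" > "$workdir/unix.txt"

# grep exits with a non-zero status when nothing matches, hence the "|| true".
crlf_after=$(grep -c "$cr" "$workdir/unix.txt" || true)

echo "lines with a carriage return before: $crlf_before, after: $crlf_after"
```

Unlike dos2unix, which converts the file in place, this sketch writes the converted text to a new file and leaves the original untouched.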
Information that might be useful
About Ports
All file transfers using the above methods use encrypted communication over port 22. This is the same connection method used by SSH. In fact, all file transfers using these methods occur through an SSH connection. If you can connect via SSH over the normal port, you will be able to transfer files.
About Backup
The files in your home (/cluster/home) and project folder (/cluster/projects) are regularly backed up to either NIRD or one of the other clusters, as described in the documentation. So if you ever accidentally delete or overwrite a file in one of those folders, you may be able to get your data back.
About File Transfer from NIRD
The NIRD project storage areas, namely NIRD Data Peak (TS) and NIRD Data Lake (DL), are mounted on the login nodes of Betzy, Fram, and Saga. You can access the NIRD project area directly from the login nodes of these compute clusters with the cp command. More details here.