Transferring files
Instructor note
Total: 45min (Teaching: 30min | Discussion: 0min | Breaks: 0min | Exercises: 10min)
Objectives
Questions
How do I upload/download files to the cluster?
Objectives
Be able to transfer files to and from a computing cluster.
Keypoints
wget downloads a file from the internet.
scp transfers files to and from your computer.
You can use an SFTP client like FileZilla to transfer files through a GUI.
Computing on a remote cluster is of limited use if we cannot get files to or from it. There are several options for transferring data between computing resources, from command-line tools to GUI programs, which we will cover here.
Downloading from the internet
Download files from the internet using wget
One of the most straightforward ways to download files is to use wget. Any file that can be downloaded in your web browser with an accessible link can be downloaded using wget. This is a quick way to download datasets or source code.
The syntax is:
wget https://some/link/to/a/file.tar.gz
For example, download the lesson sample files using the following command:
# To find the value of <URL>, refer to the downloads section of the tutorial
[MY_USER_NAME@CLUSTER_NAME ~]$ wget <URL>
Downloading GitHub repositories
Sometimes the data, pipeline or software you need is stored in a repository on GitHub or GitLab. In this case you can either download individual (“raw”) files using wget or the whole repository with git clone.
For example, we can download this test repository with:
[MY_USER_NAME@CLUSTER_NAME ~]$ git clone https://github.com/test/HelloWorld.git
The repository is saved in the current directory, in a new folder named after the repository, so “HelloWorld” in this case.
Transferring files
Transferring single files and folders with scp
To copy a single file to or from the cluster, we can use scp (“secure copy”). The syntax can be a little complex for new users, but we’ll break it down.
To create and upload a file:
[user@laptop ~]$ echo $(date) > from_laptop.txt
[user@laptop ~]$ scp from_laptop.txt MY_USER_NAME@saga.sigma2.no:
# Log in to SAGA and check the file in the home folder
[user@laptop ~]$ echo $(date) > from_laptop.txt
[user@laptop ~]$ scp from_laptop.txt MY_USER_NAME@fox.educloud.no:
One-Time_Code:
Password:
from_laptop.txt 100% 33 3.7KB/s 00:00
# Now log in to Fox and check
To download from the cluster:
# Create a file on SAGA
[MY_USER_NAME@login-5.SAGA ~]$ echo $(hostname) > from_saga.txt
[MY_USER_NAME@login-5.SAGA ~]$ echo $(date) >> from_saga.txt
# From the laptop download it
[user@laptop ~]$ scp MY_USER_NAME@saga.sigma2.no:from_saga.txt .
Password:
from_saga.txt 100% 63 5.4KB/s 00:00
[user@laptop ~]$ cat from_saga.txt
login-5.saga
ma. 14. mars 19:14:53 +0100 2022
# Create a file on Fox
[MY_USER_NAME@fox.educloud.no ~]$ echo $(hostname) > from_fox.txt
[MY_USER_NAME@fox.educloud.no ~]$ echo $(date) >> from_fox.txt
# From the laptop download it
[user@laptop ~]$ scp MY_USER_NAME@fox.educloud.no:from_fox.txt .
One-Time_Code:
Password:
from_fox.txt 100% 63 5.4KB/s 00:00
[user@laptop ~]$ cat from_fox.txt
login-4.fox.ad.fp.educloud.no
ma. 14. mars 19:08:33 +0100 2022
To recursively copy a directory, we just add the -r (recursive) flag:
[user@laptop ~]$ scp -r some-local-folder MY_USER_NAME@CLUSTER_NAME:
This will create the directory ‘some-local-folder’ on the remote system, and recursively copy all the content from the local to the remote system. Existing files on the remote system will not be modified, unless there are files from the local system with the same name, in which case the remote files will be overwritten.
A trailing slash on the target directory is optional, and has no effect for scp -r, but it can be important in other commands.
Transferring files using a graphical user interface
While scp is an efficient way of transferring files between your computer and the cluster, it can be quite intimidating and overwhelming in the beginning. Luckily we can also use programs with a GUI (Graphical User Interface) to make it easier for us to browse through remote folders on the cluster and upload or download files.
FileZilla is available for all popular operating systems and can be downloaded from the FileZilla website.
After installing it, you only have to enter the host (e.g. saga.sigma2.no), your username, password and port 22, as shown here:
You can upload and download files by dragging them from the local (left side) to the remote (right side) pane or vice versa.
Archiving files
One of the biggest challenges we often face when transferring data between remote HPC systems is handling large numbers of files. Every file transfer carries its own overhead, and when we transfer many files these overheads add up and slow the transfer down considerably.
The solution to this problem is to archive multiple files into smaller numbers of larger files before we transfer the data to improve our transfer efficiency. Sometimes we will combine archiving with compression to reduce the amount of data we have to transfer and so speed up the transfer.
The most common archiving command you will use on a (Linux) HPC cluster is tar. tar can be used to combine files into a single archive file and, optionally, compress it. For example, to collect all files contained inside output_data into an archive file called output_data.tar we would use:
[user@laptop ~]$ tar -cvf output_data.tar output_data/
The options we used for tar are:
-c - Create new archive
-v - Verbose (print what you are doing!)
-f output_data.tar - Create the archive in file output_data.tar
The tar command allows users to concatenate flags. Instead of typing tar -c -v -f, we can use tar -cvf.
The tar command can also be used to interrogate and unpack archive files. The -t argument (“table of contents”) lists the contents of the referred-to file without unpacking it. The -x (“extract”) flag unpacks the referred-to file. To unpack the file after we have transferred it:
[user@laptop ~]$ tar -xvf output_data.tar
This will put the data into a directory called output_data. Be careful, it will overwrite data there if this directory already exists!
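The round trip described above can be sketched in a scratch directory as follows. This is only an illustration: the file names run_1.txt and so on, and the restored directory, are made up for the example. The -C flag (change into a directory before extracting) is not covered in the text above; it is used here so that the extraction cannot overwrite the original output_data directory.

```shell
# Work in a temporary scratch directory; all names below are illustrative
workdir=$(mktemp -d)
cd "$workdir"

# Create a directory with a few small files
mkdir output_data
for i in 1 2 3; do
    echo "result $i" > "output_data/run_$i.txt"
done

# Pack the directory into a single archive
tar -cf output_data.tar output_data/

# List the contents of the archive without unpacking it
tar -tf output_data.tar

# Extract into a separate directory so the original is not overwritten
mkdir restored
tar -xf output_data.tar -C restored
```

Listing the archive with -t before extracting is a cheap way to see which directory the files will land in.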
Sometimes you may also want to compress the archive to save space and speed up the transfer. However, you should be aware that for large amounts of data compressing and un-compressing can take longer than transferring the un-compressed data, so you may not want to compress. To create a compressed archive using tar we add the -z option and add the .gz extension to the file to indicate it is gzip-compressed, e.g.:
[user@laptop ~]$ tar -czvf output_data.tar.gz output_data/
The tar command is used to extract the files from the archive in exactly the same way as for uncompressed data. The tar command recognizes that the data is compressed, and automatically selects the correct decompression algorithm at the time of extraction:
[user@laptop ~]$ tar -xvf output_data.tar.gz
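To get a feel for how much compression can save, the following sketch archives the same directory with and without -z and compares the sizes. The file log.txt and its contents are invented for the example, and the text is deliberately repetitive, so it compresses far better than typical binary data would.

```shell
# Work in a temporary scratch directory; all names below are illustrative
workdir=$(mktemp -d)
cd "$workdir"

mkdir output_data
# Repetitive text compresses well; real data may compress much less
for i in $(seq 1 2000); do
    echo "sample line $i of highly repetitive text output"
done > output_data/log.txt

# Create an uncompressed and a gzip-compressed archive of the same data
tar -cf output_data.tar output_data/
tar -czf output_data.tar.gz output_data/

# Compare the archive sizes in bytes
plain_size=$(wc -c < output_data.tar)
gz_size=$(wc -c < output_data.tar.gz)
echo "uncompressed: $plain_size bytes, compressed: $gz_size bytes"

# Listing works the same way on the compressed archive
tar -tf output_data.tar.gz
```

Whether the saving is worth the extra compression time depends on the data and on the speed of the connection, so it can be worth trying this on a small sample first.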
Working with Windows
When you transfer files from a Windows system to a Unix system (Mac, Linux, BSD, Solaris, etc.) this can cause problems. Windows encodes its files slightly differently from Unix, and adds an extra character to every line.
On a Unix system, every line in a file ends with a \n (newline). On Windows, every line in a file ends with a \r\n (carriage return + newline). This causes problems sometimes.
Though most modern programming languages and software handle this correctly, in some rare instances, you may run into an issue. The solution is to convert a file from Windows to Unix encoding with the dos2unix command.
You can identify if a file has Windows line endings with cat -A filename. A file with Windows line endings will have ^M$ at the end of every line. A file with Unix line endings will have $ at the end of a line.
To convert the file, just run dos2unix filename. (Conversely, to convert back to Windows format, you can run unix2dos filename.)
[MY_USER_NAME@CLUSTER_NAME ~]$ dos2unix File-created-on-windows.txt
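The check and the conversion can be tried on a throwaway file, as in the sketch below. The file name windows.txt is made up for the example. Since dos2unix is not installed everywhere, the sketch falls back to stripping carriage returns with tr; that fallback is an assumption added here, not something the lesson prescribes.

```shell
# Work in a temporary scratch directory; all names below are illustrative
workdir=$(mktemp -d)
cd "$workdir"

# Write a two-line file with Windows (CRLF) line endings
printf 'first line\r\nsecond line\r\n' > windows.txt

# Count the lines that contain a carriage return
cr=$(printf '\r')
grep -c "$cr" windows.txt

# Convert with dos2unix if it is installed; otherwise strip carriage returns with tr
if command -v dos2unix >/dev/null 2>&1; then
    dos2unix windows.txt
else
    tr -d '\r' < windows.txt > unix.txt && mv unix.txt windows.txt
fi

# Count again; "|| true" because grep exits non-zero when nothing matches
grep -c "$cr" windows.txt || true
```

The first count reports the number of lines with Windows line endings, and the second should report none once the file has been converted.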
A note on ports
File transfers with scp and with SFTP clients such as FileZilla use encrypted communication over port 22. This is the same connection method used by SSH. In fact, all file transfers using these methods occur through an SSH connection. If you can connect via SSH over the normal port, you will be able to transfer files.
How to use backup
The files in your home (/cluster/home) and project folder (/cluster/projects) are regularly backed up to either NIRD or one of the other clusters, as described in the documentation. So if you ever accidentally delete or overwrite a file in one of those folders, you can get your data back.