Data Transfer
Overview
Teaching: 20 min
Exercises: 10 minQuestions
How do I move data in and out of Blue Waters?
Objectives
Know which tools to use and where to put data
Learn how to use Globus Online and a
globus
tool
If we step back and consider the entire process of our interaction with Blue Waters, it consists of three main steps:
- Bringing data (input) and, possibly, code onto the system
- Running simulations and/or analysis
- Moving data (results) from the system
In this episode we will discuss several ways of transfering data between your machine and Blue Waters. Because this is an introduction on how to work with Blue Waters and not any specific scientific code, we will use a tarball of Linux Kernel version 4.16.8. Please download it to your computer by clicking here.
Cunning scp
The simplest way to bring data from your computer to Blue Waters is by using the scp
command.
scp
command works similarly to the standard cp
command and its general form is:
$ scp source destination
For example, to upload Linux Kernel to Blue Waters, navigate to the directory where you saved
linux-4.16.8.tar.xz
and execute:
$ scp linux-4.16.8.tar.xz bw:~/
In this command we are uploading this file to our home directory on Blue Waters.
Note, that we are using bw
shortcut that we created earlier.
Enter your password and let it upload the file.
Let us now examine both source and destination files by
obtaining the basic information about the files with stat linux-4.16.8.tar.xz
on your personal computer and on Blue Waters.
Note, that macOS users have to add -x
flag to get the output in the form similar to that on Linux.
If your terminal emulator does not provide stat
command, use ls -lh
.
$ stat -x linux-4.16.8.tar.xz # on your computer
File: "/Path/to/linux-4.16.8.tar.xz"
Size: 103030932 FileType: Regular File
Mode: (0640/-rw-r-----) Uid: ( 502/ macUser) Gid: ( 20/ staff)
Device: 1,5 Inode: 8609432423 Links: 1
Access: Wed May 10 11:12:13 2018
Modify: Wed May 9 03:00:40 2018
Change: Wed May 10 11:12:13 2018
And now on Blue Waters:
$ stat ~/linux-4.16.8.tar.xz
File: `/u/sciteam/username/linux-4.16.8.tar.xz'
Size: 103030932 Blocks: 201248 IO Block: 4194304 regular file
Device: 6c963686h/1821783686d Inode: 144126968785316320 Links: 1
Access: (0640/-rw-r-----) Uid: (47840/ username) Gid: (14802/bw_staff)
Access: 2018-05-16 02:04:26.000000000 -0500
Modify: 2018-05-16 02:06:47.000000000 -0500
Change: 2018-05-16 02:06:47.000000000 -0500
Birth: -
All timestamps have been reset to the time at which the upload took place!
If you repeat the upload or, instead, download the file, timestamps will be updated again.
This constitutes a problem if you rely on data timestamps in your analysis.
To preserve timestamps, use -p
flag when transferring data with scp
:
$ scp -p linux-4.16.8.tar.xz bw:~/
But what happens if we (accidentally) execute the command again?
Well, scp
will blindly transfer the file again, even though the destination file exists
and is identical to the one we’re uploading. This is the first problem with the scp
command.
Let’s consider another scenario, in which we have to transfer a directory rather than a single file.
In such a case, we have to add -r
flag:
$ scp -pr local-directory bw:~/path/
That was not too difficult… But there is, in fact, a caveat: scp
follows symbolic links.
This means that instead of transfering symbolic links as such, scp
transfers actual files they
point to.
rsync
, the savior
rsync
is a popular and, in many ways, a much better alternative to scp
.
Without much ado, its most useful syntax is:
$ rsync -avP linux-4.16.8.tar.xz bw:~/
building file list ...
1 file to consider
sent 92 bytes received 20 bytes 6.79 bytes/sec
total size is 103030932 speedup is 919919.04
Note that command completes almost instantly.
rsync
determined that there is, in fact file with the same contents on Blue Waters
and did not transfer it, saving us bandwidth and time!
Training accounts
If you’re using a training account, add
-e 'ssh bwbay.ncsa.illinois.edu ssh'
to thersync
command above:$ rsync -avP -e 'ssh -l traXXX bwbay.ncsa.illinois.edu ssh' linux-4.16.8.tar.xz bw.ncsa.illinois.edu:~/
Note, that again we are using the bw
shortcut that we created for the ssh
command.
If we “accidentally” try uploading it again, rsync
will, again, skip the upload.
wget
and curl
When files that you’d like to upload onto Blue Waters are located somewhere in the World Wide
Web, it makes sense to skip the step of downloading them to your computer and copy them directly
to Blue Waters. In such a case, you can use wget
or curl
:
$ wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.16.8.tar.xz
$ curl -O https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.16.8.tar.xz
git
If you’re hosting your personal code in a git
repository, use the git clone
command:
$ git clone https://github.com/your-github-account/repository.git ~/source-code/
Large/long data transfers
When dealing with large data, we have to consider the details of the data transfer process.
Commands that we issued above use bw
as one of the “end points” of the data transfer.
As we already know, bw
is, in fact, a non-existent machine that connects us
to one of the three login nodes.
Thus, when we transfer data using bw
as destination we, in fact use Blue Waters login nodes.
This can and, in fact, does cause problems for other Blue Waters users who happen to be logged in
to those login nodes.
Moreover, if you launch a long data transfer and there is a problem with network connection,
you will have to manually re-initiate the transfer.
For these purposes, Blue Waters has dedicated Input/Output (IO) nodes. These nodes, however, can be accessed using a tool called Globus Online.
And again, without much ado, let’s jump right in!
- We need to create an account on https://www.globus.org:
- Click “Log in”
- Select your organization, use your Google or ORCiD iD, or create a Globus ID to sign in.
- Navigate to the Transfer Files section at https://www.globus.org/app/transfer
- To transfer data to or from your computer, click on Get Globus Connect Personal link and follow the instructions to create your personal End Point.
- Go back to the Transfer Files section, specify the source and destination End Points by clicking on the two End Point text fields and specifying End Points. Blue Waters has two End Points that you can use: ncsa#BlueWaters and ncsa#Nearline, the latter being tape storage system.
- Select files and folders that you wish to trasfer and click one of the big arrows to initiate data transfer.
Key Points
Use
scp
,rsync
,wget
,git
for “small” data transfersUse “Globus Online” (GO) for large/long data transfers
Blue Waters’ and Nearline addresses in GO are
ncsa#BlueWaters
andncsa#Nearline
, correspondinglyUse Globus to store data in Nearline