Use S3 as major storage

S3 is the most important storage service on AWS. Knowing how to use it is crucial for almost any project on the cloud.

Use S3 from the console

Please follow the official S3 tutorial to learn how to use S3 in the graphical console. It feels much like Dropbox or Google Drive: you upload and download files by clicking buttons. Before continuing, you should know how to:

  1. Create an S3 bucket
  2. Upload a file (called an “object” in S3) into that bucket
  3. Download that file
  4. Delete that file and the bucket

The S3 console is convenient for viewing files, but most of the time you will use AWSCLI to work with S3 because:

  • It is much easier to recursively upload/download directories with AWSCLI.
  • To transfer data between S3 and EC2, you have to use AWSCLI, since EC2 instances have no graphical console.
  • To work with public datasets, AWSCLI is almost the only option. Recall that in the previous chapter you used aws s3 ls s3://nasanex/ to list the NASA-NEX data. You cannot see the “s3://nasanex/” bucket in the S3 console, since it doesn’t belong to you.

Working with S3 using AWSCLI

On an EC2 instance launched from the GEOSChem tutorial AMI, configure AWSCLI with aws configure as in the previous chapter, and make sure aws s3 ls runs without error.

Now, say you’ve made some changes to the geosfp_4x5_standard/ run directory, such as tweaking model configurations in input.geos or running simulations to produce new diagnostics files. You want to keep those changes after you terminate the server, so you can retrieve them when you continue the work next time.

Create a new bucket with aws s3 mb s3://your-bucket-name. Note that S3 bucket names must be unique across all accounts, which makes it easier to share data between people (if others’ buckets are public, you can access them just as the owners do). Getting a unique name is easy: if the name already exists, just add your initials or some prefix. If you use the exact name in the example below, you are likely to get a make_bucket failed error, since that bucket already exists in my account:

$ aws s3 mb s3://geoschem-run-directory
make_bucket: geoschem-run-directory
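
One way to come up with a name that is unlikely to collide is to append your initials to a descriptive base name. The helper below is only a sketch: make_bucket_name is a hypothetical function and "JZ" stands in for your own initials. It also lowercases the result and replaces characters outside a-z, 0-9, dot, and hyphen, since S3 bucket names cannot contain uppercase letters or underscores.

```shell
# Hypothetical helper: build an S3-friendly bucket name from a base name and a suffix.
make_bucket_name() {
    # $1 = descriptive base name, $2 = personal suffix such as your initials
    printf '%s-%s' "$1" "$2" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -c 'a-z0-9.-' '-'
}

bucket=$(make_bucket_name "geoschem-run-directory" "JZ")   # "JZ" is a placeholder
# Only prints the command; run it yourself once the name looks right
echo "aws s3 mb s3://${bucket}"
```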

Then you can see your bucket with aws s3 ls (you can also see it in the S3 console):

$ aws s3 ls
2018-03-09 18:54:18 geoschem-run-directory

Now use aws s3 cp local_file s3://your-bucket-name to transfer the directory to S3 (add the --recursive option to copy a directory recursively, just like the normal Linux command cp -r):

$ aws s3 cp --recursive geosfp_4x5_standard s3://geoschem-run-directory/
upload: geosfp_4x5_standard/FJX_spec.dat to s3://geoschem-run-directory/FJX_spec.dat
upload: geosfp_4x5_standard/HEMCO.log to s3://geoschem-run-directory/HEMCO.log
upload: geosfp_4x5_standard/HISTORY.rc to s3://geoschem-run-directory/HISTORY.rc
...

The default bandwidth between EC2 and S3 is ~100 MB/s, so copying that run directory takes just seconds.

Note

To make incremental changes to an existing S3 bucket, aws s3 sync is more efficient than aws s3 cp. Instead of re-uploading every file, sync only transfers files that are new or have actually changed.
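
As a sketch of the sync workflow, the function below wraps aws s3 sync so the same call can be repeated after each round of edits. The function name backup_rundir and the bucket name are placeholders. --dryrun is a standard aws s3 option that prints what would be transferred without copying anything.

```shell
# Hypothetical wrapper around `aws s3 sync`; replace the bucket with your own.
backup_rundir() {
    # $1 = local run directory, $2 = bucket name; any extra options are passed through
    rundir=$1
    bucket_name=$2
    shift 2
    aws s3 sync "$rundir" "s3://${bucket_name}/" "$@"
}

# Preview first, then run for real (commented out so nothing is transferred by accident):
# backup_rundir geosfp_4x5_standard your-bucket-name --dryrun
# backup_rundir geosfp_4x5_standard your-bucket-name
```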

Now list your S3 bucket content by aws s3 ls s3://your-bucket-name:

$ aws s3 ls s3://geoschem-run-directory
2018-03-09 19:13:45        364 .gitignore
2018-03-09 19:13:45       9712 FJX_j2j.dat
2018-03-09 19:13:45      50125 FJX_spec.dat
...

You can also see all the files in the S3 console, which is quite a convenient way to view your data without launching any servers.

Then, try to get data back from S3 by swapping the arguments to aws s3 cp:

$ aws s3 cp --recursive s3://geoschem-run-directory/ rundir_copy
download: s3://geoschem-run-directory/.gitignore to rundir_copy/.gitignore
download: s3://geoschem-run-directory/FJX_j2j.dat to rundir_copy/FJX_j2j.dat
download: s3://geoschem-run-directory/FJX_spec.dat to rundir_copy/FJX_spec.dat
...

Since your run directory now lives safely in an S3 bucket that is independent of any server, terminating your EC2 instance won’t cause data loss. You can use aws s3 cp to get the data back from S3 on any number of newly launched EC2 instances.

Warning

S3 is not a standard Linux file system and thus cannot preserve Linux file permissions. After retrieving your run directory from S3, the executables geos.mp and getRunInfo will not have execute permission by default. Simply type chmod u+x geos.mp getRunInfo to grant permission again.

Another way to preserve permissions is to use tar -zcvf to compress your directory before uploading it to S3, and then use tar -zxvf to decompress it after retrieving it from S3. Only consider this approach if you absolutely need to preserve all the permission information.
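
The tar round trip can be tried locally before involving S3 at all. The sketch below uses a throwaway directory (rundir_demo, a made-up name) containing one executable file, archives it, extracts it elsewhere, and checks that the execute bit survived. In real use you would copy the .tar.gz to and from your bucket with aws s3 cp between the two tar steps.

```shell
# Self-contained demo: archive a directory with tar and confirm permissions survive.
workdir=$(mktemp -d)
mkdir -p "$workdir/rundir_demo" "$workdir/restored"

# Stand-in for an executable such as geos.mp
printf '#!/bin/sh\necho hello\n' > "$workdir/rundir_demo/geos.mp"
chmod u+x "$workdir/rundir_demo/geos.mp"

# Compress (the -v flag is dropped here to keep the output quiet)
tar -zcf "$workdir/rundir_demo.tar.gz" -C "$workdir" rundir_demo

# ... upload rundir_demo.tar.gz with `aws s3 cp`, then download it on another instance ...

# Decompress into a different location
tar -zxf "$workdir/rundir_demo.tar.gz" -C "$workdir/restored"

if [ -x "$workdir/restored/rundir_demo/geos.mp" ]; then
    echo "execute permission preserved"
else
    echo "execute permission lost"
fi
```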

S3 also has no concept of symbolic links created by ln -s. By default, AWSCLI follows each link and uploads a real copy of the file it points to. You can use aws s3 cp --no-follow-symlinks ... to ignore links.
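
To see which entries would be affected before uploading, you can list the symbolic links in a directory with find. This is a small sketch: list_symlinks is a hypothetical helper name, and the demo directory is created only for illustration.

```shell
# Hypothetical helper: print every symbolic link under a directory.
list_symlinks() {
    find "$1" -type l
}

# Demo on a throwaway directory containing one regular file and one link to it
demo=$(mktemp -d)
echo "data" > "$demo/real_file.txt"
ln -s real_file.txt "$demo/link_to_file.txt"

list_symlinks "$demo"
```

On a real run directory, call list_symlinks geosfp_4x5_standard before deciding whether to pass --no-follow-symlinks.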

Those simplifications make S3 much more scalable (and cheaper) than normal file systems. You just need to be aware of those caveats.

Access GEOS-Chem input data repository in S3

List our bucket by:

$ aws s3 ls --request-payer=requester s3://gcgrid/
                           PRE BPCH_RESTARTS/
                           PRE CHEM_INPUTS/
                           PRE GCHP/
                           PRE GEOS_0.25x0.3125/
                           PRE GEOS_0.25x0.3125_CH/
                           PRE GEOS_0.25x0.3125_NA/
                           PRE GEOS_0.5x0.625_AS/
                           PRE GEOS_0.5x0.625_NA/
                           PRE GEOS_2x2.5/
                           PRE GEOS_4x5/
                           PRE GEOS_MEAN/
                           PRE GEOS_NATIVE/
                           PRE GEOS_c360/
                           PRE HEMCO/
                           PRE SPC_RESTARTS/
2018-03-08 00:18:41       3908 README

The GEOS-Chem input data bucket uses requester-pays mode. Transferring data from S3 to EC2 (in the same region) has no cost, but you do need to pay the egress fee if you download data to local machines.

The tutorial AMI only has 4x5 GEOS-FP metfields for one month (2013/07). You can get other metfields from that S3 bucket to support simulations with any configuration.

For example, download the 4x5 GEOS-FP data for the next month (2013/08):

$ aws s3 cp --request-payer=requester --recursive \
  s3://gcgrid/GEOS_4x5/GEOS_FP/2013/08/ \
  ~/gcdata/ExtData/GEOS_4x5/GEOS_FP/2013/08/

download: s3://gcgrid/GEOS_4x5/GEOS_FP/2013/08/GEOSFP.20130801.A1.4x5.nc to gcdata/ExtData/GEOS_4x5/GEOS_FP/2013/08/GEOSFP.20130801.A1.4x5.nc
download: s3://gcgrid/GEOS_4x5/GEOS_FP/2013/08/GEOSFP.20130801.A3mstC.4x5.nc to gcdata/ExtData/GEOS_4x5/GEOS_FP/2013/08/GEOSFP.20130801.A3mstC.4x5.nc
...

Downloading this ~2.5 GB of data should take just 10-20 s.

To download more months (but not the entire year), consider a simple bash “for” loop:

for month in 09 10
do
  aws s3 cp --request-payer=requester --recursive \
    s3://gcgrid/GEOS_4x5/GEOS_FP/2013/$month \
    ~/gcdata/ExtData/GEOS_4x5/GEOS_FP/2013/$month
done
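
For a longer range, typing each month by hand gets tedious. A possible variant generates the zero-padded month numbers with seq and printf. Here month_list is a hypothetical helper, and the loop only echoes the commands so you can inspect them before running them.

```shell
# Hypothetical helper: print zero-padded month numbers from $1 to $2, one per line.
month_list() {
    for m in $(seq "$1" "$2"); do
        printf '%02d\n' "$m"
    done
}

for month in $(month_list 9 12)
do
    # Remove the leading `echo` to actually start the download
    echo aws s3 cp --request-payer=requester --recursive \
        "s3://gcgrid/GEOS_4x5/GEOS_FP/2013/$month" \
        "$HOME/gcdata/ExtData/GEOS_4x5/GEOS_FP/2013/$month"
done
```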

Wildcards are also supported, but they behave quite differently from common Linux wildcards. I often find writing bash scripts a lot quicker.
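
As one illustration of the difference: aws s3 cp does not expand * in the S3 path itself. Patterns are instead passed through the --exclude and --include options, and filters given later on the command line take precedence. The sketch below would fetch only the A1 files for one month. fetch_a1_files is a hypothetical function name, and the local path follows the layout used above.

```shell
# Hypothetical helper: download only files matching *.A1.4x5.nc for a given year and month.
fetch_a1_files() {
    # $1 = year (e.g. 2013), $2 = zero-padded month (e.g. 08)
    # --exclude "*" drops everything, then --include re-adds the A1 files
    aws s3 cp --request-payer=requester --recursive \
        --exclude "*" --include "*.A1.4x5.nc" \
        "s3://gcgrid/GEOS_4x5/GEOS_FP/$1/$2/" \
        "$HOME/gcdata/ExtData/GEOS_4x5/GEOS_FP/$1/$2/"
}

# Uncomment on an instance with AWSCLI configured:
# fetch_a1_files 2013 08
```

Adding --dryrun to the aws s3 cp call is a cheap way to check that the patterns select the files you expect.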

Then you may want to change the simulation date in input.geos to test the new data. For example, change to the next month:

Start YYYYMMDD, hhmmss  : 20130801 000000
End   YYYYMMDD, hhmmss  : 20130901 000000
Run directory           : ./
Input restart file      : GEOSChem_restart.201307010000.nc

(Note that the restart file is still at 2013/07 in this case.)

The EC2 instance launched from the tutorial AMI only has a 70 GB disk by default, so the disk will fill up very soon. You will learn how to increase the disk size in the next tutorial.
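
Before a large download, it can help to compare the size of the S3 directory with the free space on disk. A minimal sketch, assuming the same bucket layout as above, with s3_dir_size as a hypothetical helper name. The --summarize and --human-readable options of aws s3 ls append the total object count and total size to the listing.

```shell
# Hypothetical helper: keep only the last two lines of the listing,
# which hold the totals when --summarize is used.
s3_dir_size() {
    # $1 = S3 URI of the directory to inspect
    aws s3 ls --request-payer=requester --recursive \
        --summarize --human-readable "$1" | tail -n 2
}

# Free space on the disk holding the current directory
df -h .

# Total size of one month of 4x5 GEOS-FP data
# (uncomment on an instance with AWSCLI configured)
# s3_dir_size s3://gcgrid/GEOS_4x5/GEOS_FP/2013/08/
```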

Note

Tired of lengthy S3 commands? The s3fs-fuse tool can make S3 buckets and objects behave just like normal directories and files on disk. However, it doesn’t work well with requester-pays buckets yet (issue#635). If that issue is resolved, we will add more instructions.