AWS Snowball User Guide
How to Transfer Petabytes of Data Efficiently
you want to transfer. When you check the performance again, you see that all three instances of the
snowball cp command are each running at 25 MB/second, for a total of 75 MB/second. Even though the
performance of each individual instance has decreased in this example, the overall performance has
increased.
Experimenting in this way, using the techniques listed in Speeding Up Data Transfer (p. 34), can help
you optimize your data transfer performance.
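The arithmetic in the example above can be sketched as follows. This is a minimal illustration, not part of the Snowball client. The function names and the 10 TB data set size are assumptions chosen for the example, and the time estimate assumes a constant transfer rate.

```python
def aggregate_rate_mb_s(per_instance_rates_mb_s):
    """Total throughput of several concurrent copy commands, in MB/second."""
    return sum(per_instance_rates_mb_s)


def transfer_hours(data_size_tb, rate_mb_s):
    """Rough transfer time in hours, assuming a constant rate and 1 TB = 1,000,000 MB."""
    seconds = data_size_tb * 1_000_000 / rate_mb_s
    return seconds / 3600


# Three concurrent instances at 25 MB/second each, as in the example above.
total_rate = aggregate_rate_mb_s([25, 25, 25])
print(total_rate)  # 75

# Hypothetical 10 TB data set moved at the combined rate.
print(round(transfer_hours(10, total_rate), 1))
```

Comparing the combined rate before and after adding an instance shows whether the extra instance helped overall, even if each individual instance slowed down.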
Performance Considerations for HDFS Data Transfers
When getting ready to transfer data from a Hadoop Distributed File System (HDFS) cluster (version
2.x) into a Snowball, we recommend that you follow the guidance in the previous section, and also the
following tips:
• Don't copy the entire cluster over in a single command – Transferring an entire cluster in a single
command can cause issues, including slow transfers, "flipped" bits, and missing or corrupted data on
the Snowball. Instead, we recommend that you split the data transfer into multiple parts.
• Don't transfer a large number of small files – Suppose that you have a large number of files (for
example, more than 1,000) and those files are small (for example, under 1 MB each). In this case,
transferring them all at once degrades your transfer performance. This degradation is due to the
per-file overhead incurred when you transfer data from HDFS clusters.
If you must transfer a large number of small files, we recommend that you find a method of collecting
them into larger archive files, and then transfer those archives. Keep in mind that the archives
themselves are what is imported into Amazon S3. If you want the files in their original state, extract
them from the archives after the import completes.
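One possible way to collect small files into a larger archive is sketched below, using Python's standard tarfile module. This is a sketch under stated assumptions, not a method prescribed by this guide: it assumes the files are reachable through a local file system path, and the function name bundle_small_files and the constant SMALL_FILE_LIMIT are hypothetical names introduced for illustration.

```python
import os
import tarfile

SMALL_FILE_LIMIT = 1024 * 1024  # 1 MB, the "small file" size used in the text


def bundle_small_files(source_dir, archive_path, size_limit=SMALL_FILE_LIMIT):
    """Add every file under source_dir smaller than size_limit to one tar archive.

    Returns the relative paths that were archived. Files at or above the
    limit are left alone so that they can be transferred individually.
    """
    archived = []
    archive_abs = os.path.abspath(archive_path)
    with tarfile.open(archive_path, "w") as tar:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Never add the archive to itself if it lives under source_dir.
                if os.path.abspath(full_path) == archive_abs:
                    continue
                if os.path.getsize(full_path) < size_limit:
                    rel_path = os.path.relpath(full_path, source_dir)
                    tar.add(full_path, arcname=rel_path)
                    archived.append(rel_path)
    return archived
```

After the import, the original files can be recovered by downloading the archive and extracting it, for example with tarfile.open(archive_path).extractall(target_dir).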
Important
The --batch option for the Snowball client's copy command is not supported for HDFS data
transfers.
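For the first tip above, one way to split a cluster transfer into multiple parts is to group directories by size and copy each group separately. The following is a minimal sketch. The directory paths, their sizes, and the 1,000 GB limit per part are hypothetical values, and this guide does not prescribe a particular part size.

```python
def split_into_parts(dir_sizes_gb, max_part_gb):
    """Group directories into parts whose combined size does not exceed max_part_gb.

    dir_sizes_gb maps a directory path to its size in GB. A directory that is
    larger than max_part_gb on its own is placed in a part by itself.
    """
    parts = []
    current, current_size = [], 0.0
    for path, size in sorted(dir_sizes_gb.items()):
        # Start a new part when adding this directory would exceed the limit.
        if current and current_size + size > max_part_gb:
            parts.append(current)
            current, current_size = [], 0.0
        current.append(path)
        current_size += size
    if current:
        parts.append(current)
    return parts


# Hypothetical top-level directories and sizes in GB.
sizes = {"/data/logs": 400, "/data/images": 500, "/data/db": 300, "/data/archive": 1200}
for part in split_into_parts(sizes, max_part_gb=1000):
    print(part)
```

Each resulting part can then serve as the source list for a separate copy command, rather than copying the whole cluster at once.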
How to Transfer Petabytes of Data Efficiently
When transferring petabytes of data, we recommend that you plan and calibrate your data transfer
between the Snowball you have on-site and your workstation according to the following guidelines.
Small delays or errors can significantly slow your transfers when you work with large amounts of data.
Topics
• Planning Your Large Transfer (p. 36)
• Calibrating a Large Transfer (p. 38)
• Transferring Data in Parallel (p. 39)
Planning Your Large Transfer
To plan your petabyte-scale data transfer, we recommend the following steps:
• Step 1: Understand What You're Moving to the Cloud (p. 37)
• Step 2: Prepare Your Workstations (p. 37)
• Step 3: Calculate Your Target Transfer Rate (p. 37)
• Step 4: Determine How Many Snowballs You Need (p. 37)
• Step 5: Create Your Jobs Using the AWS Snowball Management Console (p. 38)