Friday, December 28, 2012

How to copy files from S3 to Amazon EMR's HDFS?

Amazon itself has a wrapper implemented over distcp, namely : s3distcp . S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3 I found that s3distcp is a very powerful tool. In addition to being able to use it to copy a large amount of files in and out of S3, you can also perform fast cluster-to-cluster copies with large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary. Example Copy log files from Amazon S3 to HDFS : This following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'</pre>

7 comments:

  1. i copied the files from s3 amozone to my my website information Good Morning Quotes

    ReplyDelete
  2. I like everything from the post i also copy a large amount of files form the information Best Friend Quotes thank you.

    ReplyDelete
  3. thanks for sharing this post & if u want to know more about medical studies kindly visit School of Nursing

    ReplyDelete
  4. Awesome post!!! Thanks a lot for sharing with us....
    You will get an introduction to the Python programming language and understand the importance of it. How to download and work with Python along with all the basics of Anaconda will be taught. You will also get a clear idea of downloading the various Python libraries and how to use them.
    Topics
    About ExcelR Solutions and Innodatatics
    Do's and Don’ts as a participant
    Introduction to Python
    Installation of Anaconda Python
    Difference between Python2 and Python3
    Python Environment
    Operators
    Identifiers
    Exception Handling (Error Handling)

    ReplyDelete

Any feedback, good or bad is most welcome.

Name

Email *

Message *