HDFS Storage Tier – Archiving to Cloud w/ S3

By default HDFS does not distinguish between different storage types hence making it difficult to optimize installations with heterogeneous storage devices. Since Hadoop 2.3 and the integration of HDFS-2832 HDFS supports placing block replicas on persistent tiers with different durability and performance requirements.

Typically applications have the need to distinguish data based on their utilization temperature in HOT, WARM, and COLD, where HOT data is in high demand and currently utilized frequently, WARM data is often requested, and COLD is rarely being accessed. Especially since Spark more and more developers demand optimized storage performance for their applications.

For now the following storage policies are supported by HDFS:

  • Lazy_Persists
  • All_SSD
  • One_SSD
  • Hot (default)
  • Warm
  • Cold

with the corresponding block placement strategies:

  • RAM_DISK: 1, DISK n-1
  • SSD: n
  • SSD: 1, DISK: n-1
  • DISK: n
  • DISK: 1, ARCHIVE: n-1
  • ARCHIVE: n

Here n is the replication factor while RAM_DISK, SSD, DISK, and Archive are the different Storage Types supported. For details you can check here: Archival Storage, SSD & Memory

For the NameNode to be able to make the correct block placements for the policy attached to the data, at the heartbeat interval DataNodes are required to additionally send the attached storage types. At the DataNode the different storage types are configured with each data dir under dfs.datanode.data.dir. A sample config could look like this:


Additionally storage policies need to be enabled with dfs.storage.policy.enabled.

As a simple example in this post we are going to attache S3 block storage to HDFS for cloud archival. With cloud storage a HDFS cluster could easily be extended, while sacrificing performance which makes it an ideal archival use case.

Mount S3 to Linux

Mounting S3 to a possible DataNode can be achieved with using s3fs-fuse. First a bucket needs to be created here by using the AWS management console:

DataNodes would not be able to share the exact same remote location and would either need to be separated by folder or in case of S3 by bucket.

On a CentOS 7 a s3fs-fuse install could be achieved like this:

# installing dependencies
$ sudo yum install automake fuse fuse-devel 
        gcc-c++ git libcurl-devel libxml2-devel 
        make openssl-devel
# build from source
$ cd /tmp
$ git clone https://github.com/s3fs-fuse/s3fs-fuse.git
$ cd s3fs-fuse
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Store the credential of S3 to a password file:

$ echo MYIDENTITY:MYCREDENTIAL > /path/to/s3_passwd

With s3fs-fuse installed we can mount the bucket to the OS. Here we choose to use the directory /gridc:

# create the directory
$ mkdir /gridc
# need a directory for ownership
$ mkdir /gridc/hadoop
$ chown -R hdfs. /gridc

Execute the s3fs mount as the hdfs user, so the /gridc dir will be owned by hdfs and not root.

$ s3fs hdfs-hdp-archive /gridc -o passwd_file=/path/to/s3_passwd
$ df -h
s3fs                     256T       0  256T    0% /gridc

If you encounter an exception try:

$ s3fs hdfs-hdp-archive /gridc -o passwd_file=/path/to/s3_passwd 
-d -f -o f2 -o curldbg

HDFS Storage Tier

With the folder mounted to S3 we can use it to attache a ARCHIVE storage type to our DataNode. For dfs.datanode.data.dir we will configure the directory /gridc as our archive storage. The configuration could look like this:


To demonstrate the functionality we here move the data under /apps/hdp into our archive in S3:

$ hdfs storagepolicies -listPolicies

Block Storage Policies:
	BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], replicationFallbacks=[]}
	BlockStoragePolicy{WARM:5, storageTypes=[DISK, ARCHIVE], creationFallbacks=[DISK, ARCHIVE], replicationFallbacks=[DISK, ARCHIVE]}
	BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
	BlockStoragePolicy{ONE_SSD:10, storageTypes=[SSD, DISK], creationFallbacks=[SSD, DISK], replicationFallbacks=[SSD, DISK]}
	BlockStoragePolicy{ALL_SSD:12, storageTypes=[SSD], creationFallbacks=[DISK], replicationFallbacks=[DISK]}
	BlockStoragePolicy{LAZY_PERSIST:15, storageTypes=[RAM_DISK, DISK], creationFallbacks=[DISK], replicationFallbacks=[DISK]}

$ hdfs storagepolicies -setStoragePolicy -path /hdp/apps -policy COLD
Set storage policy COLD on /hdp/apps

The mover will periodically scan the blocks to make sure that the placement of blocks eventually satisfies the required policies, similar to the HDFS Balancer:

$ hdfs mover
17/01/22 22:40:09 INFO mover.Mover: namenodes = {hdfs://localhost:8020=null}
17/01/22 22:40:12 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/01/22 22:40:13 INFO block.BlockTokenSecretManager: Setting block keys
17/01/22 22:40:13 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
17/01/22 22:40:13 INFO block.BlockTokenSecretManager: Setting block keys
17/01/22 22:40:13 INFO net.NetworkTopology: Adding a new node: /default-rack/
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741829_1005 with size=134217728 from to through
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741832_1008 with size=71232335 from to through
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741837_1013 with size=54509450 from to through
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741836_1012 with size=134217728 from to through
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741834_1010 with size=134217728 from to through

The result can be viewed in the AWS Console:

Further Reading


4 thoughts on “HDFS Storage Tier – Archiving to Cloud w/ S3

  1. Great article Henning, super helpful! I actually had the same idea. Have you tried this in a production environment? I am wondering where the pitfalls are.


    1. Thanks and glad it was helpful.

      I personally have not used it in a production environment, but for me there seems no reason why not
      Sure it adds complexity as you have another process to monitor and a mounted drive to watch. Things that can fail.

      Interesting would also be, if the blocks in S3 could be used independently?


      1. Also on the complexity side you need a process to assign storage policies to files in HDFS as Hadoop has no automatic means to identify which blocks to tier to ARCHIVE. Any thoughts on how to best accomplish that?


  2. Hi Henning,
    I wanted to let you know I tested this in my dev enviroment and ran into some scalability issues when HDFS attempted to run a du on the S3FS volume to list all the 1k blocks. I’d love to hear if you have any feedback around this or if you have any ideas to alleviate it!


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s