By default HDFS does not distinguish between different storage types, which makes it difficult to optimize installations with heterogeneous storage devices. Since Hadoop 2.3 and the integration of HDFS-2832, HDFS supports placing block replicas on persistent storage tiers with different durability and performance requirements.
Typically applications need to distinguish data by its utilization temperature into HOT, WARM, and COLD, where HOT data is in high demand and accessed frequently, WARM data is requested regularly, and COLD data is rarely accessed. Especially since the rise of Spark, more and more developers have been demanding optimized storage performance for their applications.
For now the following storage policies are supported by HDFS:
- Lazy_Persist
- All_SSD
- One_SSD
- Hot (default)
- Warm
- Cold
with the corresponding block placement strategies:
- RAM_DISK: 1, DISK: n-1
- SSD: n
- SSD: 1, DISK: n-1
- DISK: n
- DISK: 1, ARCHIVE: n-1
- ARCHIVE: n
Here n is the replication factor, while RAM_DISK, SSD, DISK, and ARCHIVE are the different storage types supported. For details you can check the Hadoop documentation: Archival Storage, SSD & Memory.
For the NameNode to be able to make the correct block placements for the policy attached to the data, DataNodes are additionally required to report the attached storage types at the heartbeat interval. On the DataNode, the storage type is configured per data directory under dfs.datanode.data.dir. A sample config could look like this:
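<!-- excerpt from hdfs-site.xml; the local paths are placeholders, while the
     [DISK]/[SSD] prefixes tag each data directory with its storage type -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]/grid/0/hadoop/hdfs/data,[SSD]/grid/1/hadoop/hdfs/data</value>
</property>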
Additionally, storage policies need to be enabled with dfs.storage.policy.enabled, which defaults to true in recent Hadoop releases.
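An explicit hdfs-site.xml entry could look like this:
<property>
  <name>dfs.storage.policy.enabled</name>
  <value>true</value>
</property>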
As a simple example, in this post we are going to attach S3 storage to HDFS for cloud archival. With cloud storage an HDFS cluster can easily be extended, at the cost of some performance, which makes it an ideal archival use case.
Mount S3 to Linux
Mounting S3 on a prospective DataNode can be achieved using s3fs-fuse. First a bucket needs to be created using the AWS Management Console:
DataNodes cannot share the exact same remote location and need to be separated either by folder or, in the case of S3, by bucket.
On CentOS 7, s3fs-fuse can be built and installed like this:
# installing dependencies
$ sudo yum install automake fuse fuse-devel gcc-c++ git libcurl-devel libxml2-devel make openssl-devel
# build from source
$ cd /tmp
$ git clone https://github.com/s3fs-fuse/s3fs-fuse.git
$ cd s3fs-fuse
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
Store the S3 credentials in a password file:
$ echo MYIDENTITY:MYCREDENTIAL > /path/to/s3_passwd
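s3fs refuses password files that are readable by other users, so restricting the permissions avoids a common mount error:
$ chmod 600 /path/to/s3_passwd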
With s3fs-fuse installed, we can mount the bucket into the OS. Here we choose the directory /gridc:
# create the directory
$ mkdir /gridc
# need a directory for ownership
$ mkdir /gridc/hadoop
$ chown -R hdfs. /gridc
Execute the s3fs mount as the hdfs user, so that the /gridc directory will be owned by hdfs rather than root.
$ s3fs hdfs-hdp-archive /gridc -o passwd_file=/path/to/s3_passwd
$ df -h
....
s3fs            256T     0  256T   0% /gridc
If you encounter an exception, try running the mount in the foreground with debugging enabled:
$ s3fs hdfs-hdp-archive /gridc -o passwd_file=/path/to/s3_passwd -d -f -o f2 -o curldbg
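To have the bucket mounted again automatically after a reboot, an /etc/fstab entry along the following lines could be used; the bucket name and password file are the ones from above, while the uid and gid of the hdfs user are placeholders to fill in:
# /etc/fstab (sketch; adjust uid/gid to the local hdfs user)
s3fs#hdfs-hdp-archive /gridc fuse _netdev,allow_other,uid=<hdfs-uid>,gid=<hadoop-gid>,passwd_file=/path/to/s3_passwd 0 0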
HDFS Storage Tier
With the folder mounted to S3, we can use it to attach an ARCHIVE storage type to our DataNode. Under dfs.datanode.data.dir we configure the directory /gridc as our archive storage. The configuration could look like this:
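<!-- excerpt from hdfs-site.xml; the [DISK] path is a placeholder for the existing
     local data directory, /gridc/hadoop is the S3-backed mount created above -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]/grid/0/hadoop/hdfs/data,[ARCHIVE]/gridc/hadoop/hdfs/data</value>
</property>
The DataNode needs to be restarted for the change to take effect.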
To demonstrate the functionality, we move the data under /hdp/apps into our archive in S3:
$ hdfs storagepolicies -listPolicies
Block Storage Policies:
    BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], replicationFallbacks=[]}
    BlockStoragePolicy{WARM:5, storageTypes=[DISK, ARCHIVE], creationFallbacks=[DISK, ARCHIVE], replicationFallbacks=[DISK, ARCHIVE]}
    BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
    BlockStoragePolicy{ONE_SSD:10, storageTypes=[SSD, DISK], creationFallbacks=[SSD, DISK], replicationFallbacks=[SSD, DISK]}
    BlockStoragePolicy{ALL_SSD:12, storageTypes=[SSD], creationFallbacks=[DISK], replicationFallbacks=[DISK]}
    BlockStoragePolicy{LAZY_PERSIST:15, storageTypes=[RAM_DISK, DISK], creationFallbacks=[DISK], replicationFallbacks=[DISK]}

$ hdfs storagepolicies -setStoragePolicy -path /hdp/apps -policy COLD
Set storage policy COLD on /hdp/apps
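The policy attached to a path can be verified at any time with the same tool:
$ hdfs storagepolicies -getStoragePolicy -path /hdp/apps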
Similar to the HDFS Balancer, the Mover periodically scans the blocks to make sure that their placement eventually satisfies the required policies:
$ hdfs mover
17/01/22 22:40:09 INFO mover.Mover: namenodes = {hdfs://localhost:8020=null}
17/01/22 22:40:12 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/01/22 22:40:13 INFO block.BlockTokenSecretManager: Setting block keys
17/01/22 22:40:13 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
17/01/22 22:40:13 INFO block.BlockTokenSecretManager: Setting block keys
17/01/22 22:40:13 INFO net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741829_1005 with size=134217728 from 127.0.0.1:50010:DISK to 127.0.0.1:50010:ARCHIVE through 127.0.0.1:50010
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741832_1008 with size=71232335 from 127.0.0.1:50010:DISK to 127.0.0.1:50010:ARCHIVE through 127.0.0.1:50010
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741837_1013 with size=54509450 from 127.0.0.1:50010:DISK to 127.0.0.1:50010:ARCHIVE through 127.0.0.1:50010
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741836_1012 with size=134217728 from 127.0.0.1:50010:DISK to 127.0.0.1:50010:ARCHIVE through 127.0.0.1:50010
17/01/22 22:40:14 INFO balancer.Dispatcher: Start moving blk_1073741834_1010 with size=134217728 from 127.0.0.1:50010:DISK to 127.0.0.1:50010:ARCHIVE through 127.0.0.1:50010
The result can be viewed in the AWS Console:
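Alternatively, the placement can be verified from HDFS itself; on recent releases fsck prints the storage type of each block replica's location (path as above):
$ hdfs fsck /hdp/apps -files -blocks -locations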
Comments
Great article, Henning, super helpful! I actually had the same idea. Have you tried this in a production environment? I am wondering where the pitfalls are.
Thanks and glad it was helpful.
I personally have not used it in a production environment, but I see no reason why it could not be.
Sure, it adds complexity, as you have another process to monitor and a mounted drive to watch, both things that can fail.
It would also be interesting to know whether the blocks in S3 could be used independently.
Also, on the complexity side, you need a process that assigns storage policies to files in HDFS, as Hadoop has no automatic means of identifying which blocks to tier to ARCHIVE. Any thoughts on how best to accomplish that?
Hi Henning,
I wanted to let you know I tested this in my dev environment and ran into some scalability issues when HDFS attempted to run a du on the S3FS volume to list all the 1k blocks. I'd love to hear if you have any feedback around this or if you have any ideas to alleviate it!
Thanks
Nick