Installing HttpFS Gateway on a Kerberized Cluster

The HttpFS gateway is the preferred way of accessing the Hadoop filesystem with HTTP clients like curl. It can also be used from the hadoop fs command line tool, which makes it a replacement for the hftp protocol. Unlike HDFS Proxy, HttpFS fully supports all file operations and adds support for authentication. Because its protocol is stateless, it is well suited to scaling out Hadoop filesystem access for HTTP clients.

In this post I would like to show how to install and set up an HttpFS gateway on a secure, kerberized cluster. The troubleshooting topics at the end should also help you if you run into problems while installing the gateway.

Installing HttpFS Gateway on a Kerberized Cluster

First you need to install the HttpFS gateway. With the HDP repositories configured on a CentOS host, the installation looks like this:
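A minimal sketch of the installation, assuming the HDP yum repository is already configured on the host and that the package is named hadoop-httpfs as in the service name used later in this post:

```shell
# Run as root on the gateway host.
# Assumption: the HDP repository is configured and provides this package.
PACKAGE="hadoop-httpfs"
yum install -y "$PACKAGE"

# Confirm that the package is installed and that the local service user exists.
rpm -q "$PACKAGE"
id httpfs
```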

This installation will create a local user httpfs on your system. If you require system users to also exist in your company's directory, create the user there before running the installation.

Since we are talking about a kerberized cluster, we need to create a keytab for the httpfs user to support Kerberos authentication. The HttpFS gateway also requires the keytab for HTTP (SPNEGO). We will merge the two principals into a single keytab using ktutil.

Creating the keytabs with Kerberos krb5:
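With MIT Kerberos this could look roughly like the following sketch. The host name, realm, and intermediate keytab path are placeholders chosen for illustration, and kadmin.local is assumed to be run as root on the KDC:

```shell
# Placeholders: adjust host, realm, and keytab path to your environment.
HOST="gateway.example.com"
REALM="EXAMPLE.COM"
HTTPFS_PRINCIPAL="httpfs/${HOST}@${REALM}"
HTTPFS_KEYTAB="/etc/security/keytabs/httpfs.keytab"

# Create the service principal with a random key.
kadmin.local -q "addprinc -randkey ${HTTPFS_PRINCIPAL}"

# Export the principal's keys into a keytab file.
kadmin.local -q "xst -k ${HTTPFS_KEYTAB} ${HTTPFS_PRINCIPAL}"
```

If the keytab is created on the KDC, it still has to be copied to the gateway host over a secure channel.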

Creating the keytabs with FreeIPA:
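With FreeIPA the equivalent steps might look like this. The sketch assumes you already hold a ticket for an IPA administrator; the server name, host name, and keytab path are placeholders:

```shell
# Placeholders: adjust the IPA server, host, and keytab path.
IPA_SERVER="ipa.example.com"
HOST="gateway.example.com"
HTTPFS_KEYTAB="/etc/security/keytabs/httpfs.keytab"

# Register the service principal for the gateway host.
ipa service-add "httpfs/${HOST}"

# Retrieve a keytab for the new service principal.
ipa-getkeytab -s "$IPA_SERVER" -p "httpfs/${HOST}" -k "$HTTPFS_KEYTAB"
```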

If you don’t already have the SPNEGO keytab for that host, you will need to create it as well:

krb5:
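A sketch for MIT Kerberos, again with placeholder host, realm, and keytab path. Note that xst generates new keys for the principal, so skip this step if a SPNEGO keytab for the host is already in use:

```shell
# Placeholders: adjust host, realm, and keytab path.
HOST="gateway.example.com"
REALM="EXAMPLE.COM"
SPNEGO_PRINCIPAL="HTTP/${HOST}@${REALM}"
SPNEGO_KEYTAB="/etc/security/keytabs/spnego.service.keytab"

# The service name must be upper-case HTTP.
kadmin.local -q "addprinc -randkey ${SPNEGO_PRINCIPAL}"
kadmin.local -q "xst -k ${SPNEGO_KEYTAB} ${SPNEGO_PRINCIPAL}"
```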

FreeIPA:
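The FreeIPA variant follows the same pattern as for the httpfs principal. Server name, host name, and keytab path are placeholders, and an administrator ticket is assumed:

```shell
# Placeholders: adjust the IPA server, host, and keytab path.
IPA_SERVER="ipa.example.com"
HOST="gateway.example.com"
SPNEGO_KEYTAB="/etc/security/keytabs/spnego.service.keytab"

# Register the upper-case HTTP service and fetch its keytab.
ipa service-add "HTTP/${HOST}"
ipa-getkeytab -s "$IPA_SERVER" -p "HTTP/${HOST}" -k "$SPNEGO_KEYTAB"
```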

You need to create the HTTP principal in capital letters: HTTP, not http. By merging the keytabs into one file, our httpfs user can use them together:
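One way to do the merge non-interactively is to pipe the commands into MIT ktutil. The two input file names are the placeholder names assumed for this sketch; the output path is the merged keytab referred to later in this post:

```shell
KEYTAB_DIR="/etc/security/keytabs"

# rkt reads a keytab into ktutil's in-memory list, wkt writes the list out.
printf '%s\n' \
  "rkt ${KEYTAB_DIR}/httpfs.keytab" \
  "rkt ${KEYTAB_DIR}/spnego.service.keytab" \
  "wkt ${KEYTAB_DIR}/httpfs.service.keytab" \
  "quit" | ktutil
```

The same four commands can also be typed at the interactive ktutil prompt.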

To test that the keytab we created works we can use klist:
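For example, listing the merged keytab should show entries for both the httpfs and the HTTP principal:

```shell
KEYTAB="/etc/security/keytabs/httpfs.service.keytab"

# -k reads a keytab, -t shows timestamps, -e shows encryption types.
klist -kte "$KEYTAB"
```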

Setting the correct access rights for the keytabs:
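A sketch of restrictive permissions, assuming the keytab should be readable only by the httpfs user. The hadoop group is an assumption and may differ on your system:

```shell
KEYTAB="/etc/security/keytabs/httpfs.service.keytab"

# Run as root. Only the httpfs user needs to read the keytab.
chown httpfs:hadoop "$KEYTAB"
chmod 400 "$KEYTAB"
```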

We are now prepared to configure the gateway to make use of the newly created and merged Kerberos principals. We need to configure the NameNode to allow the httpfs user to proxy other users. In core-site.xml add the following:
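A sketch of the proxy user settings for core-site.xml. The host name is a placeholder for the machine running the gateway, and the wildcard for groups is only an illustrative choice that you may want to narrow down:

```xml
<!-- Allow the httpfs user to impersonate other users. -->
<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <!-- Placeholder: host(s) the gateway runs on. -->
  <value>gateway.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <!-- Illustrative: restrict to specific groups where possible. -->
  <value>*</value>
</property>
```

The NameNode has to pick up this change before the proxy user settings take effect.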

The same needs to be added to the configuration file of HttpFS, but let’s have a look at a sample configuration of the gateway, httpfs-site.xml under /etc/hadoop-httpfs/conf/:
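The following is a sketch of what such a configuration could contain, not the original file from this post. Host names, realm, keytab path, and the name rules value are placeholders; the property names are the ones defined by HttpFS:

```xml
<configuration>
  <!-- Authentication of HTTP clients against the gateway (SPNEGO). -->
  <property>
    <name>httpfs.authentication.type</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>httpfs.authentication.kerberos.principal</name>
    <value>HTTP/gateway.example.com@EXAMPLE.COM</value>
  </property>
  <property>
    <name>httpfs.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/httpfs.service.keytab</value>
  </property>
  <!-- Placeholder: replace with the auth_to_local rules of your cluster. -->
  <property>
    <name>httpfs.authentication.kerberos.name.rules</name>
    <value>DEFAULT</value>
  </property>

  <!-- Authentication of the gateway against HDFS. -->
  <property>
    <name>httpfs.hadoop.authentication.type</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>httpfs.hadoop.authentication.kerberos.principal</name>
    <value>httpfs/gateway.example.com@EXAMPLE.COM</value>
  </property>
  <property>
    <name>httpfs.hadoop.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/httpfs.service.keytab</value>
  </property>
  <property>
    <name>httpfs.hadoop.config.dir</name>
    <value>/etc/hadoop/conf</value>
  </property>

  <!-- Let Hue impersonate other users through the gateway. -->
  <property>
    <name>httpfs.proxyuser.hue.hosts</name>
    <value>hue.example.com</value>
  </property>
  <property>
    <name>httpfs.proxyuser.hue.groups</name>
    <value>*</value>
  </property>
</configuration>
```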

Most of the values provided in this configuration file should be self-explanatory. We need to set up Hue as a proxy user so Hue can use the HttpFS gateway to impersonate other users. We also need to provide the gateway with the auth_to_local mappings needed to map Kerberos principals to local users. You can now start the gateway by issuing either service hadoop-httpfs restart or /etc/init.d/hadoop-httpfs restart.

We can test that the gateway is running by making a simple curl request like this:
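A sketch of such a request, assuming you already hold a valid Kerberos ticket (check with klist), that the gateway listens on the HttpFS default port 14000, and that the host name is a placeholder:

```shell
GATEWAY="gateway.example.com"
PORT=14000
URL="http://${GATEWAY}:${PORT}/webhdfs/v1/?op=LISTSTATUS"

# --negotiate enables SPNEGO; "-u :" tells curl to take the identity
# from the Kerberos ticket cache instead of a user name and password.
curl --negotiate -u : --max-time 10 "$URL"
```

A successful call returns a JSON document listing the contents of the root directory.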

Now that the gateway is running, we can also point Hue to it instead of WebHDFS.
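In hue.ini this could look like the following sketch. The host name is a placeholder and the port is the HttpFS default:

```ini
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      # Point Hue at the HttpFS gateway instead of the NameNode's WebHDFS endpoint.
      webhdfs_url=http://gateway.example.com:14000/webhdfs/v1
      # Needed on a kerberized cluster.
      security_enabled=true
```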

Troubleshooting

If something goes wrong, the hints provided here should help you resolve it. Setting up HttpFS can fail in several ways, because different components have to be configured to work together.

404

Getting a 404 Not Found response while trying to call the gateway most likely indicates that Catalina, the web container under which the gateway is deployed, did not correctly initialize the webhdfs context. In Hue the error would display as a “Cannot access: /. [Errno 2] File / not found”.

In this case you need to check the log files under /var/log/hadoop-httpfs/ for the specific error. If you don’t find anything but a SEVERE: Error listenerStart, the webhdfs context is not being deployed correctly. Look for this message in the httpfs-catalina.out log file.
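A quick way to find such entries is to search the Catalina log for SEVERE messages. The log path is the one named above:

```shell
LOG="/var/log/hadoop-httpfs/httpfs-catalina.out"

# Print every SEVERE entry together with its line number.
grep -n "SEVERE" "$LOG"
```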

Unfortunately Catalina does not give you more details unless you create a logging.properties file in the classpath of the web app and define an appropriate log handler for Catalina. For HttpFS, create a logging.properties file under /usr/lib/hadoop-httpfs/webapps/webhdfs/WEB-INF/classes/ with the following content:
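A commonly used minimal configuration for this purpose is sketched below. It routes the container's log messages to the console handler, so they should end up in httpfs-catalina.out:

```properties
# Log messages of the Catalina container at INFO level and above ...
org.apache.catalina.core.ContainerBase.[Catalina].level = INFO
# ... and send them to the console.
org.apache.catalina.core.ContainerBase.[Catalina].handlers = java.util.logging.ConsoleHandler
```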

Restart the service before checking the logs for further error messages.

403

A 403 Unauthorized Access response can mean that you are not authenticated with a valid Kerberos principal. Check your current principal by running klist.

If you receive a RemoteException: Unauthorized connection for super-user: httpfs/example.host on IP, then you very likely did not configure hadoop.proxyuser.httpfs.[groups|hosts] correctly.

You might also run into the case that httpfs itself is not authenticating correctly against your NameNode. This would show up in either the NameNode or the hadoop-httpfs logs. If you see a message indicating a problem connecting to the NameNode, check whether your Kerberos principals and keytab files were set up correctly. Inspect the keytab, including encryption types, by running klist -kte /etc/security/keytabs/httpfs.service.keytab, or just list its principals with klist -kt /etc/security/keytabs/httpfs.service.keytab.
