The HttpFS gateway is the preferred way of accessing the Hadoop filesystem with HTTP clients like curl. Additionally it can be used from the hadoop fs command line tool, ultimately serving as a replacement for the hftp protocol. HttpFS, unlike HDFS Proxy, has full support for all file operations, with additional support for authentication. Given its stateless protocol, it is ideal for scaling out Hadoop filesystem access to HTTP clients.
In this post I would like to show how to install and set up an HttpFS gateway on a secure, kerberized cluster. The troubleshooting topics at the end should also help you when you run into problems while installing the gateway.
Installing HttpFS Gateway on a Kerberized Cluster
First you need to install the HttpFS gateway. Using the HDP repositories on a CentOS host, this looks like this:
yum -y install hadoop-httpfs
This installation will create a local user httpfs on your system. If you require system users to also be created in your company's directory, you should do that prior to running the installation.
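If you manage accounts centrally, create the httpfs user there first; otherwise a minimal local pre-creation sketch could look like this (the UID, group, and home directory here are assumptions, so adjust them to your environment):
# Hypothetical sketch: pre-create the httpfs account so the package
# installation reuses it instead of creating a new local user.
$ groupadd -r hadoop 2>/dev/null || true
$ useradd -r -u 10023 -g hadoop -d /var/lib/hadoop-httpfs httpfs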
Since we are talking about a kerberized cluster, we will need to create keytabs for the httpfs user to support Kerberos authentication. The HttpFS gateway will also require the keytab for the HTTP (SPNEGO) principal. We will merge the two principals into one single keytab using ktutil.
Creating the keytabs with Kerberos krb5:
kadmin -q "ktadd -k /etc/security/keytabs/httpfs.service.keytab httpfs/example.host@MYREALM"
Creating the keytabs with FreeIPA:
$ ipa service-add httpfs/example.host@MYREALM
$ ipa-getkeytab -s ipaserver.example.host -p httpfs/example.host@MYREALM -k /etc/security/keytabs/httpfs.service.keytab
If you don't already have the SPNEGO keytab for that host, you will need to create it as well:
krb5:
kadmin -q "ktadd -k /etc/security/keytabs/spnego.service.keytab HTTP/example.host@MYREALM"
FreeIPA:
$ ipa service-add HTTP/example.host@MYREALM
$ ipa-getkeytab -s ipaserver.example.host -p HTTP/example.host@MYREALM -k /etc/security/keytabs/spnego.service.keytab
Note that the HTTP principal has to be created in capital letters, HTTP not http. By merging the two keytabs into one file, our httpfs user can use them together:
$ ktutil
ktutil: rkt /etc/security/keytabs/httpfs.service.keytab
ktutil: rkt /etc/security/keytabs/spnego.service.keytab
ktutil: wkt /etc/security/keytabs/httpfs-http.service.keytab
ktutil: quit
To test that the keytab we created works, we can use klist:
klist -ket /etc/security/keytabs/httpfs-http.service.keytab
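Both principals should now show up in the listing, roughly along these lines (the KVNOs, timestamps, and encryption types shown here are purely illustrative and will differ in your output):
KVNO Timestamp         Principal
   1 01/01/15 00:00:00 httpfs/example.host@MYREALM (aes256-cts-hmac-sha1-96)
   1 01/01/15 00:00:00 HTTP/example.host@MYREALM (aes256-cts-hmac-sha1-96)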
Setting the correct access rights for the merged keytab:
$ chown httpfs:hadoop /etc/security/keytabs/httpfs-http.service.keytab
$ chmod 400 /etc/security/keytabs/httpfs-http.service.keytab
We are now prepared to configure the gateway to make use of the newly created and merged Kerberos principals. We need to configure the NameNode to allow the httpfs user to proxy other users. In core-site.xml add the following:
<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>example.host</value><!-- hosts with gateways installed -->
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>devs,marketing</value><!-- users in these groups can be impersonated -->
</property>
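After editing core-site.xml, restart the NameNode or, depending on your Hadoop version, refresh the proxy user settings at runtime:
# Reloads the hadoop.proxyuser.* settings on the NameNode without a
# restart; run as the HDFS superuser (hadoop dfsadmin on older releases).
$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration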
The same needs to be added to the configuration file of HttpFS, but let's have a look at a complete sample configuration of the gateway in httpfs-site.xml under /etc/hadoop-httpfs/conf/:
<configuration>
  <!-- Hue proxy user settings -->
  <property>
    <name>httpfs.proxyuser.hue.hosts</name>
    <value>hue1.example.host,hue2.example.host</value><!-- hosts Hue is installed on -->
  </property>
  <property>
    <name>httpfs.proxyuser.hue.groups</name>
    <value>marketing,finance</value><!-- users in these groups can be impersonated by Hue -->
  </property>
  <property>
    <name>httpfs.authentication.type</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>httpfs.hadoop.authentication.type</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>httpfs.authentication.kerberos.principal</name>
    <value>HTTP/example.host@MYREALM</value>
  </property>
  <property>
    <name>httpfs.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/httpfs-http.service.keytab</value>
  </property>
  <property>
    <name>httpfs.hadoop.authentication.kerberos.principal</name>
    <value>httpfs/example.host@MYREALM</value>
  </property>
  <property>
    <name>httpfs.hadoop.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/httpfs-http.service.keytab</value>
  </property>
  <property>
    <name>httpfs.authentication.kerberos.name.rules</name>
    <value>RULE:[2:$1@$0](rm@.*MYREALM)s/.*/yarn/
RULE:[2:$1@$0](nm@.*MYREALM)s/.*/yarn/
RULE:[2:$1@$0](nn@.*MYREALM)s/.*/hdfs/
RULE:[2:$1@$0](dn@.*MYREALM)s/.*/hdfs/
RULE:[2:$1@$0](hbase@.*MYREALM)s/.*/hbase/
RULE:[2:$1@$0](oozie@.*MYREALM)s/.*/oozie/
RULE:[2:$1@$0](jhs@.*MYREALM)s/.*/mapred/
RULE:[2:$1@$0](jn/_HOST@.*MYREALM)s/.*/hdfs/
DEFAULT</value>
  </property>
  <property>
    <name>httpfs.hadoop.config.dir</name>
    <value>/etc/hadoop/conf</value>
  </property>
</configuration>
Most of the values provided in this configuration file should be self-explanatory. We set up Hue as a proxy user so that Hue can use the HttpFS gateway to impersonate other users. We also need to provide the gateway with the auth_to_local mappings needed to map Kerberos principals to local users. You can now start the gateway by either issuing service hadoop-httpfs restart or /etc/init.d/hadoop-httpfs restart.
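A quick way to sanity-check such name rules is Hadoop's built-in mapping helper, which prints the local user a given principal resolves to. Note that it evaluates the auth_to_local rules from your local Hadoop configuration, which should match what you put into httpfs-site.xml, and the exact output format may vary by version:
$ hadoop org.apache.hadoop.security.HadoopKerberosName nn/example.host@MYREALM
Name: nn/example.host@MYREALM to hdfs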
We can test that the gateway is running by making a simple curl request like this:
$ kinit sample_user
password:
$ curl -i -u sample_user "http://example.host:14000/webhdfs/v1/?op=LISTSTATUS"
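If the gateway rejects this basic-auth style request (the Kerberos filter expects SPNEGO), a curl build with GSS support can negotiate with your Kerberos ticket instead (see also the comment at the end of this post):
$ curl --negotiate -u : -i "http://example.host:14000/webhdfs/v1/?op=LISTSTATUS"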
Now that we have the gateway running, we can also point Hue to it instead of WebHDFS.
Troubleshooting
Setting up HttpFS you can run into multiple problems, as you have to configure different parts to work together. If something goes wrong, you should be able to resolve it by following some of the hints provided here.
404
Getting a 404 Not Found response while trying to call the gateway most likely indicates that Catalina, the web container under which the gateway is deployed, did not correctly initialize the webhdfs context. In Hue the error displays as “Cannot access: /. [Errno 2] File / not found”.
In this case you need to check the log files under /var/log/hadoop-httpfs/ for the specific error. If you don't find anything but a SEVERE: Error listenerStart, you are having issues with the webhdfs context being deployed correctly. Look for this in the httpfs-catalina.out log file:
SEVERE: Error listenerStart
SEVERE: Context [/webhdfs] startup failed due to previous errors
SEVERE: The web application [/webhdfs] appears to have started a thread named [FileWatchdog] but has failed to stop it. This is very likely to create a memory leak.
SEVERE: Error listenerStart
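A quick grep of the Catalina log shows whether any of these startup errors occurred:
$ grep SEVERE /var/log/hadoop-httpfs/httpfs-catalina.out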
Unfortunately Catalina does not give you more details unless you create a logging.properties file in the class path of the web app and define an appropriate log handler for Catalina. In the case of HttpFS, create a logging.properties file under /usr/lib/hadoop-httpfs/webapps/webhdfs/WEB-INF/classes/ with the following content:
org.apache.catalina.core.ContainerBase.[Catalina].level = INFO
org.apache.catalina.core.ContainerBase.[Catalina].handlers = java.util.logging.ConsoleHandler
Restart the service prior to checking the logs for further error messages.
403
A 403 (Forbidden) response can mean that you are not authenticated with a correct Kerberos principal. You can check your current principal by running klist.
In case you're receiving a RemoteException: Unauthorized connection for super-user: httpfs/example.host on IP, then you very likely did not configure hadoop.proxyuser.httpfs.[groups|hosts] correctly.
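You can double-check the values your configuration actually contains with hdfs getconf; run it on the NameNode host, since it reads the local configuration files:
$ hdfs getconf -confKey hadoop.proxyuser.httpfs.hosts
$ hdfs getconf -confKey hadoop.proxyuser.httpfs.groups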
You might also run into the case that httpfs itself is not authenticating correctly towards your NameNode. This would show up in either the NameNode or hadoop-httpfs logs. If you see a message indicating a problem connecting to the NameNode, you need to check whether your Kerberos principals and keytab files were set up correctly. Inspect your keytab files by running klist -kte /etc/security/keytabs/httpfs.service.keytab, or list just the principals with klist -kt /etc/security/keytabs/httpfs.service.keytab.
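Beyond listing the keytab contents, a kinit with the keytab proves that it actually works against your KDC:
# Authenticate with the keytab instead of a password, then verify the ticket.
$ kinit -kt /etc/security/keytabs/httpfs-http.service.keytab httpfs/example.host@MYREALM
$ klist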
Further Reading
- Hadoop HttpFS
- Kerberos: The Definitive Guide (Amazon)
- Hadoop: The Definitive Guide (Amazon)
Thank you so much for this post.
By the way, I couldn’t get the curl request running with your example. I needed to use the negotiation instead:
curl --negotiate -i -u:any_user http://myhttpfs:14000/webhdfs/v1/?op=LISTSTATUS