A Secure HDFS Client Example

It takes only a few lines of Java code to write a simple HDFS client that can be used to upload, read, or list files. Here is an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://one.hdp:8020");
FileSystem fs = FileSystem.get(conf);

This FileSystem API gives the developer a generic interface to any supported file system, selected by the protocol being used, in this case hdfs. This is already enough to work with data on the Hadoop Distributed File System, for example to list all the files under the root folder:

FileStatus[] fsStatus = fs.listStatus(new Path("/"));
for (FileStatus status : fsStatus) {
   System.out.println(status.getPath().toString());
}
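
The same FileSystem handle also covers the upload and read cases mentioned above. A minimal sketch, in which the paths are placeholders (FSDataInputStream and org.apache.hadoop.io.IOUtils are part of the standard Hadoop client API):

// Upload a local file to HDFS (source and target paths are just examples)
fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/user/hdfs-user/sample.txt"));

// Read the file back and print its content to stdout
FSDataInputStream in = fs.open(new Path("/user/hdfs-user/sample.txt"));
IOUtils.copyBytes(in, System.out, 4096, false);
in.close();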

For a secured environment this is not enough; you additionally need to consider these aspects:

  1. A secure protocol
  2. Authentication with Kerberos
  3. Impersonation (proxy user), if designed as a service

What we discuss here for a sample HDFS client can, with some variation, also be applied to other Hadoop clients.

A Secure HDFS Protocol

One way to secure the communication between clients and Hadoop services in general is to use SSL encryption for all RPC calls. This has a severe impact on overall cluster performance. To avoid this and still ensure secure communication, it can be enough to encrypt only the HTTP endpoints. In that case swebhdfs (SSL + webhdfs) can be used as the protocol. Example:

Configuration conf = new Configuration();
conf.set("fs.defaultFS","swebhdfs://one.hdp:50470");
FileSystem fs = FileSystem.get(conf);
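
Since the NameNode's HTTPS endpoint presents a certificate, the client JVM must trust it. Depending on the environment this may already be the case; otherwise a truststore containing the certificate can be supplied via the standard JSSE system properties. A minimal sketch, in which the truststore path and password are placeholders:

java -Djavax.net.ssl.trustStore=/etc/security/clientKeys/truststore.jks \
  -Djavax.net.ssl.trustStorePassword=changeit \
  -cp "..." hdfs.sample.HdfsMain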

Authentication with Kerberos

A secure client needs to use Kerberos, which is the only authentication method currently supported by Hadoop. Kerberos requires very thoughtful configuration, but rewards its users with an almost completely transparent authentication implementation that simply works.

Kerberos authentication in Java is provided by the Java Authentication and Authorization Service (JAAS), a pluggable authentication framework similar to PAM that supports multiple authentication methods. In this case the authentication method being used is the GSS-API for Kerberos.

For JAAS, a proper GSS configuration is needed in addition to being in possession of valid credentials. With MIT Kerberos, credentials can be created like this:

(as root)
$ kadmin.local -q "addprinc -pw hadoop hdfs-user"
$ kadmin.local -q "xst -norandkey -k /home/hdfs-user/hdfs-user.keytab hdfs-user@MYCORP.NET"

The last command is not strictly needed; it creates a so-called keytab (basically an encrypted password of the user) that can be used for passwordless authentication, for example by automated services. We will make use of it here as well.
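
The content of the keytab and its usability can be verified with the standard MIT Kerberos tools:

$ klist -kt /home/hdfs-user/hdfs-user.keytab
$ kinit -kt /home/hdfs-user/hdfs-user.keytab hdfs-user@MYCORP.NET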

Additionally we create a JAAS configuration that we can use for authentication:

com.sun.security.jgss.krb5.initiate {
    com.sun.security.auth.module.Krb5LoginModule required
    doNotPrompt=true
    principal="hdfs-user@MYCORP.NET"
    useKeyTab=true
    keyTab="/home/hdfs-user/hdfs-user.keytab"
    storeKey=true;
};

We now have multiple ways to authenticate; I will start with what is probably the simplest approach in terms of required code changes:

1. Authentication with Keytab

Authenticating web-based access to HDFS with a keytab requires almost no code changes, apart from the use of the (s)webhdfs protocol and the change of the authentication method:

conf.set("fs.defaultFS", "webhdfs://one.hdp:50070");
conf.set("hadoop.security.authentication", "kerberos");

FileSystem fs = FileSystem.get(conf);
FileStatus[] fsStatus = fs.listStatus(new Path("/"));
for (FileStatus status : fsStatus) {
   System.out.println(status.getPath().toString());
}

The above is enough if executed within a JAAS context. The secure context can be created by using the JAAS configuration and keytab from above:

java -Djava.security.auth.login.config=/home/hdfs-user/jaas.conf \
  -Djava.security.krb5.conf=/etc/krb5.conf \
  -Djavax.security.auth.useSubjectCredsOnly=false \
  -cp "./hdfs-sample-1.0-SNAPSHOT.jar:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-client/*" \
  hdfs.sample.HdfsMain

webhdfs://one.hdp:50070/app-logs
webhdfs://one.hdp:50070/apps
webhdfs://one.hdp:50070/ats
webhdfs://one.hdp:50070/hdp
webhdfs://one.hdp:50070/mapred
webhdfs://one.hdp:50070/mr-history
webhdfs://one.hdp:50070/tmp
webhdfs://one.hdp:50070/user

2. Using UserGroupInformation

For authentication, Hadoop provides the wrapper class UserGroupInformation around a JAAS Subject with methods for user login. Without a specific setup, UserGroupInformation uses the system security context; in the case of Kerberos this exists in the ticket cache (klist shows the existing security context of a user). This is demonstrated under “With Existing Security Context” below. Alternatively, a custom security context can be used for login, either by using a keytab file or by providing credentials. Both approaches are also demonstrated here, under “Providing Credentials from Login” and “Via Keytab”.
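
Which security context the process is currently running under can be inspected at any time; a small sketch:

// Prints the effective user and how it was authenticated (e.g. KERBEROS)
UserGroupInformation current = UserGroupInformation.getCurrentUser();
System.out.println(current.getUserName() + " (" + current.getAuthenticationMethod() + ")");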

With Existing Security Context

First we would need to authenticate and make sure we have a proper security context:

$ kinit 
Password for hdfs-user@MYCORP.NET: 
$ klist
Ticket cache: FILE:/tmp/krb5cc_1013
Default principal: hdfs-user@MYCORP.NET

Valid starting       Expires              Service principal
02/14/2016 14:54:32  02/15/2016 14:54:32  krbtgt/MYCORP.NET@MYCORP.NET

With this, the following HDFS client implementation can be used in a secured environment:

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://one.hdp:8020");
conf.set("hadoop.security.authentication", "kerberos");

UserGroupInformation.setConfiguration(conf);
// Subject is taken from current user context
UserGroupInformation.loginUserFromSubject(null);

FileSystem fs = FileSystem.get(conf);
FileStatus[] fsStatus = fs.listStatus(new Path("/"));

for(int i = 0; i <= fsStatus.length; i++){
  System.out.println(fsStatus[i].getPath().toString());
}

Since the JAAS context is created at run-time, the client can be executed without any additional JAAS parameters:

java -cp "./hdfs-sample-1.0-SNAPSHOT.jar:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-client/*"  
hdfs.sample.HdfsMain

hdfs://one.hdp:8020/app-logs
hdfs://one.hdp:8020/apps
hdfs://one.hdp:8020/ats
hdfs://one.hdp:8020/hdp
hdfs://one.hdp:8020/mapred
hdfs://one.hdp:8020/mr-history
hdfs://one.hdp:8020/tmp
hdfs://one.hdp:8020/user

Providing Credentials from Login

Providing login credentials at execution time requires the creation of a javax.security.auth.Subject with username and password. This means we have to use a JAAS LoginContext to do the equivalent of a kinit:

import java.io.IOException;
import javax.security.auth.callback.*;
import javax.security.auth.login.LoginContext;
import javax.security.auth.login.LoginException;

private static String username = "hdfs-user";
private static char[] password = "hadoop".toCharArray();

public static LoginContext kinit() throws LoginException {
  // The callback handler feeds username and password to the Krb5LoginModule
  LoginContext lc = new LoginContext(HdfsMain.class.getSimpleName(), new CallbackHandler() {
    public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
      for (Callback c : callbacks) {
        if (c instanceof NameCallback)
          ((NameCallback) c).setName(username);
        if (c instanceof PasswordCallback)
          ((PasswordCallback) c).setPassword(password);
      }
    }
  });
  lc.login();
  return lc;
}

We still have to configure the JAAS login module referenced by the name provided in the above implementation. The name used in the example is HdfsMain.class.getSimpleName(), i.e. "HdfsMain", so our module configuration should look like this:

HdfsMain {
  com.sun.security.auth.module.Krb5LoginModule required client=TRUE;
};

With this in place we can now log in with username and password:

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://one.hdp:8020");
conf.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);

LoginContext lc = kinit();
UserGroupInformation.loginUserFromSubject(lc.getSubject());

FileSystem fs = FileSystem.get(conf);
FileStatus[] fsStatus = fs.listStatus(new Path("/"));

for (FileStatus status : fsStatus) {
  System.out.println(status.getPath().toString());
}

Via Keytab

In the first part we injected the security context via the JAAS configuration file (-Djava.security.auth.login.config=/home/hdfs-user/jaas.conf), which configured keytab authentication. We can also achieve this with Hadoop's JAAS wrapper UserGroupInformation:

UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("hdfs-user@MYCORP.NET",
    "/home/hdfs-user/hdfs-user.keytab");

The complete code being used:

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://one.hdp:8020");
conf.set("hadoop.security.authentication", "kerberos");

UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("hdfs-user@MYCORP.NET", 
   "/home/hdfs-user/hdfs-user.keytab");

FileSystem fs = FileSystem.get(conf);
FileStatus[] fsStatus = fs.listStatus(new Path("/"));
for (FileStatus status : fsStatus) {
  System.out.println(status.getPath().toString());
}

Please note that neither a JAAS configuration nor an existing ticket cache is required, as we are using UserGroupInformation with the keytab directly:

$ klist
klist: Credentials cache file '/tmp/krb5cc_1013' not found
$ java -cp "./hdfs-sample-1.0-SNAPSHOT.jar:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-client/*"  hdfs.sample.HdfsMain
hdfs://one.hdp:8020/app-logs
hdfs://one.hdp:8020/apps
hdfs://one.hdp:8020/ats
hdfs://one.hdp:8020/hdp
hdfs://one.hdp:8020/mapred
hdfs://one.hdp:8020/mr-history
hdfs://one.hdp:8020/tmp
hdfs://one.hdp:8020/user

Impersonation

A last aspect of writing clients for a secure HDP cluster is the proxy user setting, which matters especially if you are designing a service. The proxy user functionality enables services to access resources on the cluster on behalf of another user; the service is impersonating the user.

A good example of such a service is HiveServer2. HS2 receives SQL requests and creates an execution plan using an execution engine like MapReduce, Tez, or Spark. The plan is executed on the cluster on behalf of the user who issued the SQL request.

Of course it is important to be in control of who is able to impersonate whom and from where. This can be configured by adding a proxyuser config to Hadoop:

hadoop.proxyuser.{{service_user_name}}.groups
hadoop.proxyuser.{{service_user_name}}.hosts

If, for example, we have a Tomcat service running on host web.mycorp.net, the following configuration would enable the service to impersonate users in the group web-users from host web.mycorp.net:

hadoop.proxyuser.tomcat.groups=web-users
hadoop.proxyuser.tomcat.hosts=web.mycorp.net
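
These properties belong in the cluster's core-site.xml; for the Tomcat example above, the corresponding entries would look like this (a restart of the affected services, or a refresh such as hdfs dfsadmin -refreshSuperUserGroupsConfiguration, is needed for changes to take effect):

<property>
  <name>hadoop.proxyuser.tomcat.groups</name>
  <value>web-users</value>
</property>
<property>
  <name>hadoop.proxyuser.tomcat.hosts</name>
  <value>web.mycorp.net</value>
</property>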

It is important to make sure services are not able to impersonate hdfs or other service accounts that have special privileges. Only trusted services should be added to the proxy user setup.

Clients can also use the UserGroupInformation class to impersonate other users. With doAs, the implementation can be wrapped into the security context of the user being impersonated:

// proxyUser: the system/service user that is allowed to proxy
UserGroupInformation proxyUser = UserGroupInformation.getCurrentUser();
// user: the (String) name of the user to impersonate
UserGroupInformation ugi = UserGroupInformation.createProxyUser(user, proxyUser);
try {
  fsStatus = ugi.doAs(new PrivilegedExceptionAction<FileStatus[]>() {
    public FileStatus[] run() throws IOException {
      return FileSystem.get(conf).listStatus(p);
    }
  });
} catch (InterruptedException e) {
  e.printStackTrace();
}

Here the user parameter defines the user (as a String) in whose context the call should be executed. The proxyUser is the current service or system user running the client. Be aware that the proxy user is not under all circumstances equal to UserGroupInformation.getCurrentUser().
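
Putting the pieces together, a service could log in once from its keytab and then impersonate the calling user per request. A minimal sketch, assuming a tomcat service principal, an example keytab path, and a hypothetical end user alice (exception handling omitted):

// Service logs in once from its keytab (principal and keytab path are example values)
UserGroupInformation service = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "tomcat@MYCORP.NET", "/etc/security/keytabs/tomcat.keytab");

// Impersonate the requesting user for this one call
UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser("alice", service);
FileStatus[] status = proxyUgi.doAs(new PrivilegedExceptionAction<FileStatus[]>() {
  public FileStatus[] run() throws IOException {
    return FileSystem.get(conf).listStatus(new Path("/user/alice"));
  }
});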

Further Reading

  1. Wire Encryption in Hadoop
  2. Difference between trustStore and keyStore in Java – SSL
  3. Kerberos and Hadoop
  4. Kerberos (Protocol)
  5. Java Authentication and Authorization Service
  6. GSS-API/Kerberos v5 Authentication
  7. JAAS Login Configuration File

16 thoughts on “A Secure HDFS Client Example”

  1. Very important to keep in mind: the Hadoop HDFS APIs rely on Java’s ServiceLoader to load the org.apache.hadoop.security.AnnotatedSecurityInfo service specified in the META-INF/services/ of hadoop-common.jar.

    If you use tools to create an uber-jar or do repackaging, authentication will fail with:

    Get token info proto:interface org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolPB info:null

    Get kerberos info proto:interface org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolPB info:null

    To solve this, either fix the META-INF/services/… entries or set the provider programmatically using SecurityUtil.setSecurityInfoProviders.

    I made a post rant about this: http://funclojure.tumblr.com/post/155129283948/hdfs-kerberos-java-client-api-pains

  2. Is this different if I am running my code in Intellij on my local machine against a kerberized remote hdfs? Was not able to get it to work through my IDE.

    1. Should basically work in Windows environments as well.

      Two things might be useful to consider for Windows environments:
      1. The kinit and klist utilities can be found in the bin folder of your Java distribution.
      2. You need to research where your system expects the krb5.ini or krb5.conf file and adapt it to your Kerberos settings.

      1. Thanks for your reply! I mean I’m running this code from a Windows client but I still have my kerberized Hadoop cluster on an Ubuntu server. Do I need to install Kerberos on Windows as well?
        I used System.setProperty to set the KDC and the realm name but my code is still not working.

  3. Hello,

    very nice and clear post. But there is something that, maybe because of my novelty in Hadoop, is not completely clear to me. That is, in the impersonation part, when you set the hosts and groups:

    hadoop.proxyuser.tomcat.groups=web-users
    hadoop.proxyuser.tomcat.hosts=web.mycorp.net

    where is this done? Is it in the core-site.xml file?

    And, more precisely, where is this group of users (web-users) coming from? I mean where and how did you create this group?

    Thanks in advance

  4. Hi, I’m curious if it is possible to configure JAAS from UserGroupInformation? My specific scenario is that I need to perform SPNEGO negotiation with a web server from within a Spark executor (which is already running inside of a UserGroupInformation.doAs(…) from supplying the --principal and --keytab arguments to spark-submit).

  5. I am not able to connect to the kerberised cluster, getting PRE_AUTHENTICATION_FAILED. Attaching the log trace here; on failing authentication the file system reads the local fs.
    Java config name: ./krb5.conf
    Loaded from Java config
    Java config name: ./krb5.conf
    Loaded from Java config
    >>> KdcAccessibility: reset
    >>> KdcAccessibility: reset
    >>> KeyTabInputStream, readName(): CRI.HADOOP.PREPROD
    >>> KeyTabInputStream, readName(): shreeishita.gupta
    >>> KeyTab: load() entry length: 90; type: 18
    >>> KeyTabInputStream, readName(): CRI.HADOOP.PREPROD
    >>> KeyTabInputStream, readName(): shreeishita.gupta
    >>> KeyTab: load() entry length: 74; type: 17
    Looking for keys for: shreeishita.gupta@CRI.HADOOP.PREPROD
    Added key: 17version: 2
    Added key: 18version: 2
    Looking for keys for: shreeishita.gupta@CRI.HADOOP.PREPROD
    Added key: 17version: 2
    Added key: 18version: 2
    Using builtin default etypes for default_tkt_enctypes
    default etypes for default_tkt_enctypes: 18 17 16 23.
    >>> KrbAsReq creating message
    >>> KrbKdcReq send: kdc=10.34.42.197 UDP:88, timeout=30000, number of retries =3, #bytes=179
    >>> KDCCommunication: kdc=10.34.42.197 UDP:88, timeout=30000,Attempt =1, #bytes=179
    >>> KrbKdcReq send: #bytes read=306
    >>>Pre-Authentication Data:
    PA-DATA type = 136

    >>>Pre-Authentication Data:
    PA-DATA type = 19
    PA-ETYPE-INFO2 etype = 18, salt = CRI.HADOOP.PREPRODshreeishita.gupta, s2kparams = null

    >>>Pre-Authentication Data:
    PA-DATA type = 2
    PA-ENC-TIMESTAMP
    >>>Pre-Authentication Data:
    PA-DATA type = 133

    >>> KdcAccessibility: remove 10.34.42.197
    >>> KDCRep: init() encoding tag is 126 req type is 11
    >>>KRBError:
    cTime is Tue Jul 28 04:43:03 IST 1987 554425983000
    sTime is Thu Dec 17 20:35:35 IST 2020 1608217535000
    suSec is 635778
    error code is 25
    error Message is Additional pre-authentication required
    crealm is CRI.HADOOP.PREPROD
    cname is shreeishita.gupta@CRI.HADOOP.PREPROD
    sname is krbtgt/CRI.HADOOP.PREPROD@CRI.HADOOP.PREPROD
    eData provided.
    msgType is 30
    >>>Pre-Authentication Data:
    PA-DATA type = 136

    >>>Pre-Authentication Data:
    PA-DATA type = 19
    PA-ETYPE-INFO2 etype = 18, salt = CRI.HADOOP.PREPRODshreeishita.gupta, s2kparams = null

    >>>Pre-Authentication Data:
    PA-DATA type = 2
    PA-ENC-TIMESTAMP
    >>>Pre-Authentication Data:
    PA-DATA type = 133

    KRBError received: NEEDED_PREAUTH
    KrbAsReqBuilder: PREAUTH FAILED/REQ, re-send AS-REQ
    Using builtin default etypes for default_tkt_enctypes
    default etypes for default_tkt_enctypes: 18 17 16 23.
    Looking for keys for: shreeishita.gupta@CRI.HADOOP.PREPROD
    Added key: 17version: 2
    Added key: 18version: 2
    Looking for keys for: shreeishita.gupta@CRI.HADOOP.PREPROD
    Added key: 17version: 2
    Added key: 18version: 2
    Using builtin default etypes for default_tkt_enctypes
    default etypes for default_tkt_enctypes: 18 17 16 23.
    >>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
    >>> KrbAsReq creating message
    >>> KrbKdcReq send: kdc=10.34.42.197 UDP:88, timeout=30000, number of retries =3, #bytes=264
    >>> KDCCommunication: kdc=10.34.42.197 UDP:88, timeout=30000,Attempt =1, #bytes=264
    >>> KrbKdcReq send: #bytes read=793
    >>> KdcAccessibility: remove 10.34.42.197
    Looking for keys for: shreeishita.gupta@CRI.HADOOP.PREPROD
    Added key: 17version: 2
    Added key: 18version: 2
    >>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
    >>> CksumType: sun.security.krb5.internal.crypto.HmacSha1Aes256CksumType
    >>> KrbAsRep cons in KrbAsReq.getReply shreeishita.gupta
    20/12/17 20:35:35 INFO security.UserGroupInformation: Login successful for user shreeishita.gupta@CRI.HADOOP.PREPROD using keytab file /tmp/hdfs-kerb-si/shreeishita.keytab
    20/12/17 20:35:35 INFO Main: Accessing file system
