Apache Knox: A Hadoop Bastion

Lately, a lot of effort has gone into making Hadoop setups more secure for enterprise-ready installations. Apache Knox adds a connecting layer to your cluster that acts like a bastion server, shielding direct access to your nodes. Knox is stateless and can therefore easily scale horizontally, with the obvious limitation that it also only supports stateless protocols. Knox provides the following functionality:

  1. Authentication
    Users and groups can be managed using LDAP or Active Directory
  2. Federation/SSO
    Knox uses HTTP header based identity federation
  3. Authorization
    Authorization is mainly supported on service level through access control lists (ACL)
  4. Auditing
    All access through Knox is audited

Here we are going to explore the necessary steps of a Knox setup. In this setup the authentication process goes through an LDAP directory service running on the same node as Knox but separate from the Hadoop cluster. Knox comes with an embedded Apache Directory server for demo purposes. You can also read here how to set up a secure OpenLDAP. Knox's demo LDAP service can be started like this:

cd {KNOX_HOME}
bin/ldap.sh start
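
To verify that the demo directory is up, you can query it with a standard LDAP client. A minimal check, assuming the demo server's defaults (port 33389 and the sample guest user shipped with Knox; adjust for your setup):

# Query the embedded demo LDAP for the sample guest user
ldapsearch -h localhost -p 33389 \
    -D "uid=guest,ou=people,dc=hadoop,dc=apache,dc=org" -w guest-password \
    -b "dc=hadoop,dc=apache,dc=org" "(uid=guest)"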


Download and Install Knox

Knox 0.4 can be downloaded from here, or you can check whether a newer version is available on the Knox website. After you've downloaded the binaries, just unzip the content to /usr/lib or some other place according to your preference. We'll refer to this location as {KNOX_HOME}.
You probably do not want to run the Knox gateway as root, so you should also create a separate knox user; run adduser knox for this. Make sure the knox user has the appropriate rights to access the folder you placed Knox's binaries under.
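
For example, a fetch-and-unpack could look like the following sketch. The mirror URL and version are assumptions; use whatever matches your download:

# Download and unpack Knox (URL/version are placeholders)
curl -O https://archive.apache.org/dist/knox/0.4.0/knox-0.4.0.zip
unzip knox-0.4.0.zip -d /usr/lib

# Create a dedicated user and hand the installation over to it
adduser knox
chown -R knox:knox /usr/lib/knox-0.4.0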

Run and Deploy Knox

Knox comes with a gateway that can serve multiple Hadoop clusters through deployed topologies. Prior to starting its service, we need to set the master secret that Knox uses to secure passwords and settings.

cd {KNOX_HOME}
su -l knox -c '{KNOX_HOME}/bin/knoxcli.sh create-master'

After this, the gateway can be started using the knox user we created earlier.

su -l knox -c '{KNOX_HOME}/bin/gateway.sh start'
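
If the gateway does not come up or a topology fails to deploy, the gateway log is the first place to look. Assuming the default layout, you can follow it with:

tail -f {KNOX_HOME}/logs/gateway.log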

This will start the gateway and deploy all the cluster topologies placed, by default, under {KNOX_HOME}/conf/topologies. This can be configured by setting gateway.gateway.conf.dir in the gateway-site.xml under {KNOX_HOME}/conf.
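
For example, to point the gateway at a different topology directory (the path below is purely illustrative):

<!-- in {KNOX_HOME}/conf/gateway-site.xml -->
<property>
    <name>gateway.gateway.conf.dir</name>
    <value>conf/my-topologies</value>
</property>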

Knox Topologies

When started, the gateway deploys all the topologies described under {KNOX_HOME}/conf/topologies as WAR applications to its deployment directory. A topology describes the gate to a Hadoop cluster as well as the security settings, for example the connection to LDAP. Let's have a look at a sample topology configured to use LDAP for authentication.

<?xml version="1.0" encoding="utf-8"?>
<topology>
    <gateway>
        <provider>
            <role>authentication</role>
            <name>ShiroProvider</name>
            <enabled>true</enabled>
            <param>
                <name>sessionTimeout</name>
                <value>30</value>
            </param>
            <param>
                <name>main.ldapRealm</name>
                <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
                <!--<value>org.apache.shiro.realm.ldap.JndiLdapRealm</value>-->
            </param>
            <param>
                <name>main.ldapRealm.userDnTemplate</name>
                <!--<value>cn={0},dc=mycorp,dc=net</value>-->
                <value>sAMAccountName={0}</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.url</name>
                <value>ldaps://mycorp.net:636</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
                <value>simple</value>
            </param>
            <param>
                <name>main.ldapRealm.searchBase</name>
                <value>dc=mycorp,dc=net</value>
            </param>
            <param>
                <name>main.ldapRealm.userSearchBase</name>
                <value>dc=mycorp,dc=net</value>
            </param>
            <param>
                <name>main.ldapRealm.groupSearchBase</name>
                <value>dc=mycorp,dc=net</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.systemUsername</name>
                <value>cn=root,dc=mycorp,dc=net</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.systemPassword</name>
                <value>horton</value>
            </param>
            <param>
                <name>urls./**</name>
                <value>authcBasic</value>
            </param>
        </provider>
        <provider>
            <role>identity-assertion</role>
            <name>Pseudo</name>
            <enabled>true</enabled>
        </provider>
...
</topology>
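
Note that the systemPassword above is stored in plain text, which is acceptable for a demo only. Depending on your Knox version, you can instead keep it in Knox's credential store and reference it by alias; a sketch, with made-up alias and topology names:

# Store the LDAP bind password under an alias for the 'sample' topology
su -l knox -c '{KNOX_HOME}/bin/knoxcli.sh create-alias ldcSystemPassword --cluster sample --value horton'

The topology would then reference it with <value>${ALIAS=ldcSystemPassword}</value> instead of the literal password.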

At the end of the topology configuration you want to define the Hadoop services, so that Knox knows where to access them.

<topology>
    <gateway>
    ....
    </gateway>
    <service>
        <role>NAMENODE</role>
        <url>hdfs://hadoop_host:8020</url>
    </service>

    <service>
        <role>JOBTRACKER</role>
        <url>rpc://hadoop_host:8050</url>
    </service>

    <service>
        <role>WEBHDFS</role>
        <url>http://hadoop_host:50070/webhdfs</url>
    </service>

    <service>
        <role>WEBHCAT</role>
        <url>http://hadoop_host:50111/templeton</url>
    </service>

    <service>
        <role>OOZIE</role>
        <url>http://hadoop_host:11000/oozie</url>
    </service>

    <service>
        <role>WEBHBASE</role>
        <url>http://hadoop_host:60080</url>
    </service>

    <service>
        <role>HIVE</role>
        <url>http://hadoop_host:10001/cliservice</url>
    </service>
</topology>

Any change made to the configuration needs to be redeployed. This can be achieved using the Knox CLI: {KNOX_HOME}/bin/knoxcli.sh redeploy.

The access pattern for REST clients wanting to access Hadoop services is as follows:

  • WebHDFS
    • Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs
    • Cluster: http://{webhdfs-host}:50070/webhdfs
  • WebHCat (Templeton)
    • Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton
    • Cluster: http://{webhcat-host}:50111/templeton
  • Oozie
    • Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/oozie
    • Cluster: http://{oozie-host}:11000/oozie
  • Stargate (HBase)
    • Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/hbase
    • Cluster: http://{hbase-host}:60080
  • Hive JDBC
    • Gateway: jdbc:hive2://{gateway-host}:{gateway-port}/;ssl=true;sslTrustStore={gateway-trust-store-path};trustStorePassword={gateway-trust-store-password}?hive.server2.transport.mode=http;hive.server2.thrift.http.path={gateway-path}/{cluster-name}/hive
    • Cluster: http://{hive-host}:10001/cliservice

Here {gateway-host} refers to the host of Knox and {gateway-path} to the deployment path. The {cluster-name} is typically the name of your topology file.
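
As a quick smoke test you can list a directory over WebHDFS through the gateway with curl. This assumes a topology file named sample.xml and the demo guest user; adjust host, credentials, and topology to your setup:

# -k skips certificate validation for Knox's self-signed demo certificate
curl -iku guest:guest-password \
    'https://knox-host:8443/gateway/sample/webhdfs/v1/tmp?op=LISTSTATUS'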

There is one last step to make Knox work with the cluster defined in the topology. Knox needs to be able to operate on behalf of the user issuing requests to HDFS, Hive, and so on. To do this you need to set up so-called proxy user groups and hosts for Knox in the core-site.xml of your cluster. A permissive default to get started would be:

<property>
    <name>hadoop.proxyuser.knox.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.knox.hosts</name>
    <value>*</value>
</property>
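
The wildcards allow Knox to impersonate any user from any host, which is convenient for a first test. In production you would typically narrow both down, for example (group and host names are hypothetical):

<property>
    <name>hadoop.proxyuser.knox.groups</name>
    <value>hadoop-users</value>
</property>
<property>
    <name>hadoop.proxyuser.knox.hosts</name>
    <value>knox.mycorp.net</value>
</property>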

With this you should be good to try out Knox yourself.
