Lately a lot of effort has gone into making Hadoop setups more secure for enterprise-ready installations. Apache Knox adds a gateway to your cluster that acts like a bastion server, shielding direct access to your nodes. Knox is stateless and can therefore easily scale horizontally, with the obvious limitation that it also only supports stateless protocols. Knox provides the following functionality:
- Authentication
Users and groups can be managed using LDAP or Active Directory
- Federation/SSO
Knox uses HTTP header based identity federation
- Authorization
Authorization is mainly supported at the service level through access control lists (ACLs)
- Auditing
Access through Knox is audited
Here we are going to explore the necessary steps for a Knox setup. In this setup the authentication process goes through an LDAP directory service running on the same node as Knox, separated from the Hadoop cluster. Knox comes with an embedded Apache Directory for demo purposes. You can also read here on how to set up a secure OpenLDAP. The Knox LDAP service can be started like this:
cd {KNOX_HOME}
bin/ldap.sh start
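To check that the directory is accepting connections, you can bind against it with ldapsearch. This is just a sketch: the port 33389 and the guest bind credentials are the defaults of the demo directory shipped with Knox, so adjust them if you run your own LDAP.

# Bind as the sample guest user and list person entries (demo defaults assumed)
ldapsearch -h localhost -p 33389 \
  -D 'uid=guest,ou=people,dc=hadoop,dc=apache,dc=org' -w guest-password \
  -b 'dc=hadoop,dc=apache,dc=org' '(objectclass=person)'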
Here we are going to explore the necessary steps to set up Apache Knox for your environment.
Download and Install Knox
Run and Deploy Knox
Knox comes with a gateway that can deploy multiple Hadoop cluster stubs. Prior to starting its service we need to create the master secret Knox uses to secure passwords and settings.
cd {KNOX_HOME}
su -l knox -c '/usr/lib/knox/bin/knoxcli.sh create-master'
After this the gateway can be started as the knox user.
su -l knox -c '{KNOX_HOME}/bin/gateway.sh start'
This will start the gateway and deploy all the cluster topologies placed by default under {KNOX_HOME}/conf/topologies. This location can be configured by setting gateway.gateway.conf.dir in the gateway-site.xml under {KNOX_HOME}/conf.
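If you want to relocate that directory, a sketch of the corresponding entry in gateway-site.xml could look like this; the property name is the one mentioned above, while the path value is only an example:

<property>
  <name>gateway.gateway.conf.dir</name>
  <!-- example path, adjust to your installation layout -->
  <value>/etc/knox/topologies</value>
</property>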
Knox Topologies
When started, the gateway deploys all the topologies described under {KNOX_HOME}/conf/topologies as WAR applications to its deployment directory. A topology describes the gate to a Hadoop cluster as well as possible security settings, for example the connection to LDAP. Let's have a look at a sample topology configured to use LDAP for authentication.
<?xml version="1.0" encoding="utf-8"?>
<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>sessionTimeout</name>
        <value>30</value>
      </param>
      <param>
        <name>main.ldapRealm</name>
        <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
        <!--<value>org.apache.shiro.realm.ldap.JndiLdapRealm</value>-->
      </param>
      <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <!--<value>cn={0},dc=mycorp,dc=net</value>-->
        <value>sAMAccountName={0}</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldaps://mycorp.net:636</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
        <value>simple</value>
      </param>
      <param>
        <name>main.ldapRealm.searchBase</name>
        <value>DC=mycorp,DC=net</value>
      </param>
      <param>
        <name>main.ldapRealm.userSearchBase</name>
        <value>dc=mycorp,dc=net</value>
      </param>
      <param>
        <name>main.ldapRealm.groupSearchBase</name>
        <value>dc=mycorpdir,dc=net</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.systemUsername</name>
        <value>cn=root,dc=mycorp,dc=net</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.systemPassword</name>
        <value>horton</value>
      </param>
      <param>
        <name>urls./**</name>
        <value>authcBasic</value>
      </param>
    </provider>
    <provider>
      <role>identity-assertion</role>
      <name>Pseudo</name>
      <enabled>true</enabled>
    </provider>
    ...
  </gateway>
</topology>
At the end of the topology configuration you would want to configure the Hadoop services so Knox knows where to access them.
<topology>
  <gateway>
    ....
  </gateway>
  <service>
    <role>NAMENODE</role>
    <url>hdfs://hadoop_host:8020</url>
  </service>
  <service>
    <role>JOBTRACKER</role>
    <url>rpc://hadoop_host:8050</url>
  </service>
  <service>
    <role>WEBHDFS</role>
    <url>http://hadoop_host:50070/webhdfs</url>
  </service>
  <service>
    <role>WEBHCAT</role>
    <url>http://hadoop_host:50111/templeton</url>
  </service>
  <service>
    <role>OOZIE</role>
    <url>http://hadoop_host:11000/oozie</url>
  </service>
  <service>
    <role>WEBHBASE</role>
    <url>http://hadoop_host:60080</url>
  </service>
  <service>
    <role>HIVE</role>
    <url>http://hadoop_host:10001/cliservice</url>
  </service>
</topology>
Any change made to the configuration needs to be redeployed. This can be achieved using the Knox CLI: {KNOX_HOME}/bin/knoxcli.sh redeploy
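To avoid redeploying every topology at once, the Knox CLI should also let you target a single one via its --cluster parameter; the topology name mycluster below is just a placeholder:

# Redeploy only the topology named 'mycluster' (placeholder name)
{KNOX_HOME}/bin/knoxcli.sh redeploy --cluster mycluster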
The access pattern for REST clients wanting to access Hadoop services is as follows:
- WebHDFS
  - Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs
  - Cluster: http://{webhdfs-host}:50070/webhdfs
- WebHCat (Templeton)
  - Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton
  - Cluster: http://{webhcat-host}:50111/templeton
- Oozie
  - Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/oozie
  - Cluster: http://{oozie-host}:11000/oozie
- Stargate (HBase)
  - Gateway: https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/hbase
  - Cluster: http://{hbase-host}:60080
- Hive JDBC
  - Gateway: jdbc:hive2://{gateway-host}:{gateway-port}/;ssl=true;sslTrustStore={gateway-trust-store-path};trustStorePassword={gateway-trust-store-password}?hive.server2.transport.mode=http;hive.server2.thrift.http.path={gateway-path}/{cluster-name}/hive
  - Cluster: http://{hive-host}:10001/cliservice
Here {gateway-host} refers to the host of Knox and {gateway-path} to the deployment path. The {cluster-name} is typically the name of your topology file.
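As a quick smoke test of the WebHDFS pattern above, a curl call through the gateway could look like the following sketch. The port 8443 and the path gateway are the Knox defaults, mycluster is a placeholder topology name, and guest/guest-password are the demo LDAP credentials; -k skips certificate validation for Knox's self-signed demo certificate:

# List /tmp via WebHDFS through Knox (placeholder host, topology, and credentials)
curl -iku guest:guest-password \
  'https://{gateway-host}:8443/gateway/mycluster/webhdfs/v1/tmp?op=LISTSTATUS'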
There is one last step to make Knox work with the cluster defined in the topology. Knox needs to be able to operate on behalf of the user issuing requests to HDFS, Hive, and so on. To do this you need to set up so-called proxy groups and hosts for Knox in the core-site.xml. A good default setting would be:
<property>
  <name>hadoop.proxyuser.knox.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.knox.hosts</name>
  <value>*</value>
</property>
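Changes to core-site.xml normally require a restart of the affected services, but the proxyuser settings can usually be reloaded at runtime; this assumes an HDFS client configured against your cluster:

# Reload the proxyuser (superuser group) configuration without a NameNode restart
hdfs dfsadmin -refreshSuperUserGroupsConfiguration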
With this you should be good to try out Knox yourself.
Further Readings
- Knox User’s Guide
- Kerberos (Amazon)
- LDAP System Administration (Amazon)