YARN Secure Container

In a restricted setup YARN executes the tasks of computation frameworks like Spark in a secured Linux or Windows container. The tasks are executed in the local context of the user submitting the application, not in the local context of the yarn user or some other system user. This places certain constraints on the system setup.

How is YARN actually able to impersonate the calling user at the local OS level? This post aims to give some background information to help answer such questions about secure containers. Only Linux systems are considered here, not Windows.

Continue reading “YARN Secure Container” →

Installing HDP Search with Ambari

Ambari Management Packs are a new, convenient way to integrate various services into the Ambari stack. As an example, in this post we use the Solr service mpack to install HDP Search on top of a newly installed cluster.

The HDP Search mpack is available for download from the Hortonworks public repository. An mpack is essentially a tarball containing an mpack.json specification file and related binaries. Continue reading “Installing HDP Search with Ambari” →

2016 in Numbers

Over two years ago in March 2014 I joined the Iron Blogger community in Munich, which is one of the largest, still active Iron Blogger communities worldwide. You can read more about my motivation behind it here in one of the 97 blog posts published to date: Iron Blogger: In for a Perfect Game.

The truth is that I write blog posts solely for myself. They are my own technical reference to turn to. Additionally, writing is a good way to improve one’s skills and technical capabilities, as Richard Guindon puts it in his famous quote:

“Writing is nature’s way of letting you know how sloppy your thinking is.”

What could be better suited to improving something than leaning into the pain, as the great Aaron Swartz, who died way too early, once described it? And publishing a blog post every week is quite a bit of leaning into the pain. Not only for me, but also for those close to me. But I am going to dedicate a separate blog post to a diligent retrospection in the near future. This post should be all about NUMBERS. Continue reading “2016 in Numbers” →

Kerberos Debug Notes

Some notes for Kerberos debugging in a secure HDP setup:

Setting Debug Logs
To enable Kerberos debug logs in Java, sun.security.krb5.debug needs to be set to true. For Hadoop this can be done in the hadoop-env.sh file by adding it to the HADOOP_OPTS environment variable:
```
export HADOOP_OPTS="$HADOOP_OPTS -Dsun.security.krb5.debug=true"
```
Additionally, the HADOOP_JAAS_DEBUG environment variable can be set:
```
export HADOOP_JAAS_DEBUG=true
```
Trace output in the shell can be enabled by setting the following environment variable:
```
export KRB5_TRACE=/dev/stdout
```
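The three switches above can also be bundled into a small helper that is sourced before running a Hadoop command. This is only a sketch: the function name `enable_krb_debug` and its optional trace-file argument are made up for illustration and are not part of Hadoop.

```shell
# Hypothetical helper (the name is an assumption, not a Hadoop tool):
# turn on all three debug switches for the current shell session.
enable_krb_debug() {
  _krb_flag="-Dsun.security.krb5.debug=true"
  # Append the JVM flag only once so repeated calls do not duplicate it.
  case " ${HADOOP_OPTS:-} " in
    *" $_krb_flag "*) ;;
    *) export HADOOP_OPTS="${HADOOP_OPTS:+$HADOOP_OPTS }$_krb_flag" ;;
  esac
  export HADOOP_JAAS_DEBUG=true
  # Optional first argument: where the Kerberos trace should be written.
  export KRB5_TRACE="${1:-/dev/stdout}"
}
```

Calling `enable_krb_debug /tmp/krb5.trace` would send the trace to a file instead of the terminal, which can be easier to read when the command itself prints a lot of output.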
Testing auth_to_local Settings
Setting the auth_to_local rules correctly can be crucial. This is especially true for KDC trust environments. The rules can easily be tested with the HadoopKerberosName class of Hadoop security. You can run it as:
```
$ hadoop org.apache.hadoop.security.HadoopKerberosName pinc@REALM.COM
```
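When several principals need to be checked, for example one per trusted realm, the call can be wrapped in a loop. A minimal sketch, assuming the `hadoop` command is on the PATH; the function name `check_auth_to_local` and the sample principals in the comment are placeholders.

```shell
# Hypothetical helper: resolve each given principal through the configured
# auth_to_local rules by calling HadoopKerberosName once per principal.
check_auth_to_local() {
  for principal in "$@"; do
    # Keep going if one lookup fails, and report the failing principal.
    hadoop org.apache.hadoop.security.HadoopKerberosName "$principal" \
      || echo "could not map $principal" >&2
  done
}

# Example (principals are placeholders):
# check_auth_to_local pinc@REALM.COM hdfs/nn01.example.com@REALM.COM
```

Running one lookup per principal means a single principal without a matching rule does not hide the results for the others.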

Sunday Read: Distributed Consensus

In this Sunday Read with Horton edition we take a closer look at the selection of papers about Distributed Consensus provided by Camille Fournier (ZooKeeper PMC) as part of the RfP (Research for Practice) of the ACM. For Hadoop practitioners distributed consensus is best known as Apache ZooKeeper, which supports most critical aspects of almost all Hadoop components. Continue reading “Sunday Read: Distributed Consensus” →

Sample HDFS HA Client

In any HDP cluster with a quorum-based HA setup there are two NameNodes configured, one working as the active and the other as the standby instance. As the standby node does not accept any write requests, a client trying to write to HDFS needs to know which of the two NameNodes is the active one at any given time. The discovery process for that is configured through the hdfs-site.xml.

For any custom implementation it becomes relevant to set and understand the correct parameters if the current hdfs-site.xml configuration of the cluster is not available. This post gives a sample Java implementation of an HA HDFS client. Continue reading “Sample HDFS HA Client” →
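The Java sample itself is not part of this excerpt. As a rough orientation, the client-side HA properties such a client typically needs can be sketched as a minimal hdfs-site.xml. The function name `ha_client_site`, the nameservice id, and the host names are assumptions for illustration; the NameNode ids nn1 and nn2 are arbitrary labels.

```shell
# Hypothetical helper: print a minimal client-side hdfs-site.xml for an HA setup.
# $1 = nameservice id, $2 = RPC address of NameNode 1, $3 = RPC address of NameNode 2
ha_client_site() {
  cat <<EOF
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>$1</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.$1</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.$1.nn1</name>
    <value>$2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.$1.nn2</name>
    <value>$3</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.$1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
EOF
}

# Example (host names are placeholders):
# ha_client_site mycluster nn1.example.com:8020 nn2.example.com:8020 > hdfs-site.xml
```

With these properties on the client's classpath, paths are addressed through the nameservice (for example hdfs://mycluster/) and the failover proxy provider works out which NameNode is currently active.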

Call For Abstract: Hadoop Summit 2017 in Munich

Next year’s Hadoop Summit will be held in Munich on April 5-6, 2017, an exceptional opportunity for the community in Munich to present itself to the best and brightest in the data community.

Please take this opportunity to hand in your abstract now. Only a few days are left!

Submit Abstract: http://dataworkssummit.com/munich-2017
Deadline: Monday, November 21, 2016.
2017 Agenda: http://dataworkssummit.com/munich-2017/agenda/

The 2017 tracks include:

Applications
Enterprise Adoption
Data Processing & Warehousing
Apache Hadoop Core Internals
Governance & Security
IoT & Streaming
Cloud & Operations
Apache Spark & Data Science

Why DataWorks?

We want to expand the ecosystem to include technologies that were not explicitly in the Hadoop Ecosystem. For instance, in the community showcase we will have the following zones:

Apache Hadoop Zone
IoT & Streaming Zone
Cloud & Operations Zone
Apache Spark & Data Science Zone

The goal is to increase the breadth of technologies we can talk about and increase the potential of a data summit.

Future of Data Meetups

Want to present at Meetups?

If you would like to present at a Future of Data Meetup please don’t hesitate to reach out to me and send me a message.

Want to host a Meetup? Become a Sponsor?

We are also looking for rooms and organizations willing to host one of our Future of Data Meetups or become a sponsor. Please reach out and let me know.

Hive Join Strategies

Hive joins are executed as jobs on different execution engines, for example Tez, Spark, or MapReduce. Joins, even of multiple tables, can be achieved with a single job. Since its first release many optimizations have been added to Hive, giving users various options for improving join queries.

Understanding how joins are implemented with MapReduce helps to recognize the different optimization techniques in Hive today. Continue reading “Hive Join Strategies” →

Secure Kafka Java Producer with Kerberos

With its comprehensive security implementation, the most recent release of Kafka, 0.9, has reached an important milestone. In his blog post Kafka Security 101, Ismael from Confluent describes the security features that are part of the release very well.

As part II of the post published here about Kafka Security with Kerberos, this post discusses a sample implementation of a Java Kafka producer with authentication. It is part of a mini series of posts discussing secure HDP clients, connecting services to a secured cluster, and kerberizing the HDP Sandbox (Download HDP Sandbox). As part of this effort, at the end of this post we will also create a Kafka servlet to publish messages to a secured broker.

Kafka provides SSL and Kerberos authentication. Only Kerberos is discussed here. Continue reading “Secure Kafka Java Producer with Kerberos” →

Browsing HDP Public Repo with s3cmd

From time to time it can be very useful to search for HDP repository releases directly in the public repo, especially if you are looking for a recent development or technical preview version. This also comes in handy if you need to create an offline repository for your company intranet.

The HDP repositories are available through Amazon’s S3 storage layer. A quite convenient tool for browsing them is s3cmd.

After downloading it, s3cmd can easily be installed with Python:

# requires python-setuptools
$ cd ~/Downloads/
$ tar xfz s3cmd-1.5.2.tar.gz
$ cd s3cmd-1.5.2
$ more INSTALL   # to read the INSTALL guide
$ sudo python setup.py install

Browsing the HDP repo:

$ s3cmd ls s3://public-repo-1.hortonworks.com/HDP/centos6/

Listing a subdirectory:

$ s3cmd ls s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/
DIR   s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/2.0-latest/
DIR   s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/2.1-latest/
DIR   s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/2.2-latest/
DIR   s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/2.3-latest/
DIR   s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/
DIR   s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/
2013-07-09 00:06  0  s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/

Wildcards can be used for filtering:

$ s3cmd ls s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.*
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.1.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.2.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.3.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.4.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.5.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.6.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.8.0/
DIR   s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.99.0
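For the offline repository use case mentioned at the start, the directory layout from the listings can be turned into a small URL builder. This is only a sketch: the function name `hdp_repo_url` is hypothetical, and it assumes the wanted version also exists under the updates/ directory of the public bucket, which should be checked with `s3cmd ls` first.

```shell
# Hypothetical helper: build the S3 URL of an HDP 2.x update release,
# following the directory layout shown in the listings.
hdp_repo_url() {
  # $1 = OS directory (e.g. centos6), $2 = HDP version (e.g. 2.3.4.0)
  printf 's3://public-repo-1.hortonworks.com/HDP/%s/2.x/updates/%s/\n' "$1" "$2"
}

# Example: mirror one release into a local directory (version is a placeholder).
# s3cmd sync "$(hdp_repo_url centos6 2.3.4.0)" ./hdp-2.3.4.0/
```

`s3cmd sync` only transfers files that are missing or changed locally, so the same command can be re-run to refresh the mirror.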

For help:

$ s3cmd --help
