2016 in Numbers

Over two years ago in March 2014 I joined the Iron Blogger community in Munich, which is one of the largest, still active Iron Blogger communities worldwide. You can read more about my motivation behind it here in one of the 97 blog posts published to date: Iron Blogger: In for a Perfect Game.

The real fact is that I write blogs solely for myself. It’s my own technical reference I turn to. Additionally writing is a good way to improve once skills and technical capabilities, as Richard Guindon puts it in his famous quote:

“Writing is nature’s way of letting you know how sloppy your thinking is.”

What could be better suited to improve something than by leaning into the pain, how the great Aaron Swartz, who died way too early, once described it? And it is quite a bit of leaning into the pain publishing a blog post every week. Not only for me, but also for those close to me. But I am going to dedicate a separate blog post to a diligent retrospection in the near future. This post should all be about NUMBERS. Continue reading “2016 in Numbers” →

Simple Spark Streaming & Kafka Example in a Zeppelin Notebook

Apache Zeppelin is a web-based, multi-purpose notebook for data discovery, prototyping, reporting, and visualization. With it’s Spark interpreter Zeppelin can also be used for rapid prototyping of streaming applications in addition to streaming-based reports.

In this post we will walk through a simple example of creating a Spark Streaming application based on Apache Kafka. Continue reading “Simple Spark Streaming & Kafka Example in a Zeppelin Notebook” →

Kerberos Debug Notes

Some notes for Kerberos debugging in a secure HDP setup:

Setting Debug Logs
To enable debug logs in Java for Kerberos sun.security.krb5.debug needs to be set to true. Doing this for Hadoop can be done in the hadoop-env.sh file by adding it to the HADOOP_OPTS environment variable:
```
export HADOOP_OPTS="-Dsun.security.krb5.debug=true"
```
Additionally the HADOOP_JAAS_DEBUG variable can be set also:
```
HADOOP_JAAS_DEBUG
```
Receiving traces in bash/shell can be enabled by setting the following environment variable:
```
export KRB5_TRACE=/dev/stdout
```
Testing auth_to_local Settings
Setting the auth_to_local rules correclty can be quite crucial. This is especially true for KDS trust environments. The rules can be easily tested with the HadoopKerberosName call of Hadoop security. You can run it as:
```
$ hadoop org.apache.hadoop.security.HadoopKerberosName pinc@REALM.COM
```

Broadcast Join with Spark

With a broadcast join one side of the join equation is being materialized and send to all mappers. It is therefore considered as a map-side join which can bring significant performance improvement by omitting the required sort-and-shuffle phase during a reduce step. In this Post we are going to discuss the possibility for broadcast joins in Spark DataFrame and RDD API in Scala. Continue reading “Broadcast Join with Spark” →

Sunday Read: Distributed Consensus

In this Sunday Read with Horton edition we take a closer look at the selection of papers about Distributed Consensus provided by Camille Fournier (Zookeeper PMC) as part of the RfP (Research for Practice) of the ACM. For Hadoop practitioners distributed consensus is best know as Apache Zookeeper, which supports most critical aspects of almost all Hadoop components. Continue reading “Sunday Read: Distributed Consensus” →