Custom MATLAB InputFormat for Apache Spark

Hadoop supports multiple file formats as input for MapReduce workflows, including programs executed with Apache Spark. Defining custom InputFormats is a common practice among Hadoop data engineers and is discussed here based on a publicly available data set.

The approach demonstrated in this post does not provide a general MATLABInputFormat for Hadoop. That would require significant effort to find a general-purpose mapping of MATLAB™'s file format and type system to those of HDFS.
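To illustrate the general pattern of a custom InputFormat, here is a minimal, hypothetical sketch (the class names MatlabFileInputFormat and MatlabRecordReader are illustrative, not taken from the post) that treats each .mat file as a single unsplittable binary record and leaves the actual MAT-file parsing to downstream code:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Hypothetical InputFormat reading each MATLAB .mat file as one record:
 * the file path as key, the raw file bytes as value.
 */
public class MatlabFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // a .mat file must be read as a whole
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MatlabRecordReader();
    }

    public static class MatlabRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path file = split.getPath();
            // Sketch assumption: each .mat file fits into a single byte array.
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key.set(file.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

A Spark job could then consume such records through JavaSparkContext.newAPIHadoopFile, passing MatlabFileInputFormat.class together with the key and value classes.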

Hive Join Strategies

Hive joins are executed as distributed jobs by different execution engines, for example Tez, Spark, or MapReduce. Even joins of multiple tables can often be achieved by a single job. Since its first release, many optimizations have been added to Hive, giving users various options for improving join queries.

Understanding how joins are implemented with MapReduce helps in recognizing the different optimization techniques available in Hive today.
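As a rough illustration of what a plain MapReduce join involves, here is a sketch of the classic reduce-side join (the table names, file layout, and tagging scheme below are assumptions for illustration, not code from the post): each mapper tags a record with its source table and emits the join key, and the reducer combines the matching rows.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

    /** Mapper: tag each record with its source table and emit the join key. */
    public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Illustrative layout: CSV with the join key in the first column.
            String[] fields = line.toString().split(",", 2);
            if (fields.length < 2) {
                return; // skip malformed lines
            }
            // Tag by input file so the reducer knows which table a row came from.
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = file.startsWith("orders") ? "O" : "C";
            context.write(new Text(fields[0]), new Text(tag + "|" + fields[1]));
        }
    }

    /** Reducer: separate the two sides per key, then emit the joined rows. */
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> customers = new ArrayList<>();
            List<String> orders = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                (s.startsWith("C") ? customers : orders).add(s.substring(2));
            }
            for (String c : customers) {   // inner join: cross-product of the
                for (String o : orders) {  // matching rows for this join key
                    context.write(key, new Text(c + "," + o));
                }
            }
        }
    }
}
```

Because every matching row has to be shuffled to the reducers, this pattern is expensive; Hive's map join optimization, for instance, avoids the shuffle entirely by loading the smaller table into memory on each mapper.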

Kerberos Ambari Blueprint Installs

Apache Ambari is rapidly improving its support for secure installations and for managing security in Hadoop. Already it is fairly convenient to create Kerberized clusters in a snap, using automated procedures or the Ambari wizard.

With the latest release of Ambari, Kerberos setups get baked into blueprint installations, making separate methods like additional API calls unnecessary. In this post I would like to briefly discuss the new option in Ambari to use pure Blueprint installs for secure cluster setups, and additionally explain some of the prerequisites for a sandbox-like demo install.
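As a rough sketch of what such a Kerberos-enabled blueprint can look like, here is a minimal example following Ambari 2.x blueprint conventions; the realm, hosts, and host group are placeholder assumptions, and the exact property names vary by Ambari version:

```json
{
  "Blueprints" : {
    "stack_name" : "HDP",
    "stack_version" : "2.2",
    "security" : { "type" : "KERBEROS" }
  },
  "configurations" : [
    { "kerberos-env" : { "properties" : {
        "realm" : "EXAMPLE.COM",
        "kdc_type" : "mit-kdc",
        "kdc_host" : "kdc.example.com",
        "admin_server_host" : "kdc.example.com"
    } } },
    { "krb5-conf" : { "properties" : {
        "domains" : ".example.com",
        "manage_krb5_conf" : "true"
    } } }
  ],
  "host_groups" : [
    {
      "name" : "host_group_1",
      "cardinality" : "1",
      "components" : [
        { "name" : "NAMENODE" },
        { "name" : "DATANODE" },
        { "name" : "KERBEROS_CLIENT" }
      ]
    }
  ]
}
```

In this setup, the cluster creation request references the blueprint and supplies the KDC administrator credential, and Ambari handles principal creation and keytab distribution during the install.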