Livy.io is a proxy service for Apache Spark that allows to reuse an existing remote SparkContext among different users. By sharing the same context Livy provides an extended multi-tenant experience with users being capable of sharing RDDs and YARN cluster resources effectively.
In summary Livy uses a RPC architecture to extend the created SparkContext with a RPC service. Through this extension the existing context can be controlled and shared remotely by other users. On top of this Livy introduces authorization together with enhanced session management.
Analytic applications like Zeppelin can use Livy to offer multi-tenant spark access in a controlled manner.
This post discusses setting up Livy with a secured HDP cluster.
Hadoop supports multiple file formats as input for MapReduce workflows, including programs executed with Apache Spark. Defining custom InputFormats is a common practice among Hadoop Data Engineers and will be discussed here based on publicly available data set.
The approach demonstrated in this post does not provide means for a general MATLAB™ InputFormat for Hadoop. This would require significant effort in finding a general purpose mapping of MATLAB™’s file format and type system to the ones of HDFS. Continue reading “Custom MATLAB InputFormat for Apache Spark”
MATLAB™ is a widely used professional tool for numerical processing used across multiple divers disciplines like Physics, Chemistry, and Mathematics. You can encounter multiple public data sets which are published in MATLAB™ format. This article gives a brief example of such data set and reading it from R. Continue reading “Reading Matlab files with R”
Apache Ambari rapidly improves support for secure installations and managing security in Hadoop. Already now it is fairly convenient to create kerberized clusters in a snap with automated procedures or the Ambari wizard.
With the latest release of Ambari kerberos setups get baked into blueprint installations making separate methods like API calls unnecessary. In this post I would like to briefly discuss the new option in Ambari to use pure Blueprint installs for secure cluster setups. Additionally explaining some of the prerequisites for a sandbox demo like install. Continue reading “Kerberos Ambari Blueprint Installs”
Controlling the environment of an application is vital for it’s functionality and stability. Especially in a distributed environment it is important for developers to have control over the version of dependencies. In such an scenario it’s a critical task to ensure possible conflicting requirements of multiple applications are not disturbing each other.
That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically in a Linux Container or Docker Container – that is controlled by the developer. In this post we show what this means for Python environments being used by Spark. Continue reading “Running PySpark with Conda Env”
Sometimes you might find yourself in a situation where it becomes inevitable to clean up a node from a HDP install. Just like most installs are never really the same, cleaning a node from an install is not a straight path. As the documentation advises, to remove the installed packages using the systems package manager, is a good start. But some folders might remain and databases will be ignored. Continue reading “Completely Uninstall and Remove HDP from Nodes”