In a broadcast join, one side of the join is materialized and sent to all mappers. It is therefore considered a map-side join, which can improve performance significantly by omitting the sort-and-shuffle phase that a reduce step would otherwise require. In this post we discuss how broadcast joins can be used with the Spark DataFrame and RDD APIs in Scala. Continue reading “Broadcast Join with Spark”
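The idea behind a map-side join can be sketched without any cluster at all. The following is a minimal illustration in plain Java, not the Spark API and not the code from the linked post: the class and method names are hypothetical, rows are simple key/value pairs, and the small side is assumed to have unique keys and to fit in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, not the Spark API. All names are hypothetical.
class BroadcastJoinSketch {

    // Materialize the small side as an in-memory lookup table. In a real
    // broadcast join, a copy of this table is sent to every mapper.
    // Assumption: keys on the small side are unique (a later row would
    // overwrite an earlier one with the same key).
    static Map<Integer, String> materialize(List<Map.Entry<Integer, String>> smallSide) {
        Map<Integer, String> table = new HashMap<>();
        for (Map.Entry<Integer, String> row : smallSide) {
            table.put(row.getKey(), row.getValue());
        }
        return table;
    }

    // Each mapper joins its own partition of the large side against its
    // local copy of the table, so the large side is never sorted or shuffled.
    static List<String> joinPartition(List<Map.Entry<Integer, String>> partition,
                                      Map<Integer, String> broadcast) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<Integer, String> row : partition) {
            String match = broadcast.get(row.getKey());
            if (match != null) { // inner join: rows without a match are dropped
                joined.add(row.getKey() + "," + row.getValue() + "," + match);
            }
        }
        return joined;
    }
}
```

The trade-off is memory: the approach only pays off while the materialized side is small enough to be held by every mapper.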
In this edition of Sunday Read with Horton we take a closer look at the selection of papers about Distributed Consensus provided by Camille Fournier (ZooKeeper PMC) as part of the RfP (Research for Practice) of the ACM. For Hadoop practitioners, distributed consensus is best known in the form of Apache ZooKeeper, which supports the most critical aspects of almost all Hadoop components. Continue reading “Sunday Read: Distributed Consensus”
In any HDP cluster with a quorum-based HA setup, two NameNodes are configured, one working as the active and the other as the standby instance. Because the standby node does not accept write requests, a client trying to write to HDFS needs to know which of the two NameNodes is the active one at any given time. The discovery process for that is configured through the hdfs-site.xml.
For any custom implementation it becomes relevant to set and understand the correct parameters if the current hdfs-site.xml configuration of the cluster is not available. This post gives a sample Java implementation of an HA HDFS client. Continue reading “Sample HDFS HA Client”
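As a rough sketch of which parameters are involved, the client-side HA properties that an hdfs-site.xml would normally supply can be collected as below. This is an assumption-laden illustration rather than the implementation from the linked post: the class name, the NameNode IDs `nn1` and `nn2`, and the nameservice and host values passed in are placeholders, and the properties are gathered in a plain map instead of a Hadoop `Configuration` so the snippet stays free of dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the client-side HA properties in a plain map.
class HaClientSettings {

    // nameservice, nn1Address and nn2Address are placeholders supplied by the
    // caller, e.g. "mycluster" and "host:port" pairs of the two NameNodes.
    static Map<String, String> build(String nameservice, String nn1Address, String nn2Address) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Clients address the logical nameservice, not a single NameNode host.
        conf.put("fs.defaultFS", "hdfs://" + nameservice);
        conf.put("dfs.nameservices", nameservice);
        // Logical IDs of the two NameNodes ("nn1"/"nn2" are assumed names).
        conf.put("dfs.ha.namenodes." + nameservice, "nn1,nn2");
        conf.put("dfs.namenode.rpc-address." + nameservice + ".nn1", nn1Address);
        conf.put("dfs.namenode.rpc-address." + nameservice + ".nn2", nn2Address);
        // Proxy provider the client uses to find the currently active NameNode.
        conf.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```

In an actual client, each entry would typically be applied with `Configuration.set(key, value)` before obtaining the file system via `FileSystem.get(conf)`.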
Next year’s Hadoop Summit will be held in Munich on April 5-6, 2017, which will be an exceptional opportunity for the community in Munich to present itself to the best and brightest in the data community.
Please take this opportunity to submit your abstract now, as only a few days are left!
The 2017 tracks include:
- Enterprise Adoption
- Data Processing & Warehousing
- Apache Hadoop Core Internals
- Governance & Security
- IoT & Streaming
- Cloud & Operations
- Apache Spark & Data Science
We want to expand the ecosystem to include technologies that were not explicitly part of the Hadoop ecosystem. For instance, in the community showcase we will have the following zones:
- Apache Hadoop Zone
- IoT & Streaming Zone
- Cloud & Operations Zone
- Apache Spark & Data Science Zone
The goal is to increase the breadth of technologies we can talk about and increase the potential of a data summit.
Future of Data Meetups
Want to present at Meetups?
If you would like to present at a Future of Data Meetup, please don’t hesitate to send me a message.
Want to host a Meetup? Become a Sponsor?
We are also looking for rooms and organizations willing to host one of our Future of Data Meetups or become a sponsor. Please reach out and let me know.
- Future of Data: Munich
- Future of Data: Toulouse
- Future of Data: San Francisco
- Future of Data: Silicon Valley
- Future of Data: Budapest
- Future of Data: Paris
- Future of Data: London
- Future of Data: New York
Livy.io is a proxy service for Apache Spark that allows an existing remote SparkContext to be reused among different users. By sharing the same context, Livy provides an extended multi-tenant experience in which users can share RDDs and YARN cluster resources effectively.
In summary, Livy uses an RPC architecture to extend the created SparkContext with an RPC service. Through this extension the existing context can be controlled and shared remotely by other users. On top of this, Livy introduces authorization together with enhanced session management.
Analytic applications like Zeppelin can use Livy to offer multi-tenant Spark access in a controlled manner.
This post discusses setting up Livy with a secured HDP cluster.
With the introduction of ZEPPELIN-548, Zeppelin now supports Apache Shiro-based AD and LDAP authentication. This quick example demonstrates how to connect Zeppelin to the Knox Demo LDAP server. Continue reading “Zeppelin Login with Demo LDAP of Knox”
Hadoop supports multiple file formats as input for MapReduce workflows, including programs executed with Apache Spark. Defining custom InputFormats is a common practice among Hadoop data engineers and is discussed here based on a publicly available data set.
The approach demonstrated in this post does not provide a general MATLAB™ InputFormat for Hadoop. That would require significant effort to find a general-purpose mapping of MATLAB™’s file format and type system to those of HDFS. Continue reading “Custom MATLAB InputFormat for Apache Spark”
MATLAB™ is a widely used professional tool for numerical processing, applied across diverse disciplines like Physics, Chemistry, and Mathematics. Many public data sets are published in MATLAB™ format. This article gives a brief example of such a data set and of reading it from R. Continue reading “Reading Matlab files with R”
Hive joins are executed as jobs on one of several execution engines, for example Tez, Spark, or MapReduce. Even joins of multiple tables can be achieved with a single job. Since its first release, many optimizations have been added to Hive, giving users various options for improving join queries.