This is the End of Hadoop as We Know It (And I Feel Fine!)
The stocks of both companies soared instantly in after-market trading as soon as the merger was announced. Overall, market participants seemed pleased by the outlook of a combined company. Widely positive reactions spread across the media; Forrester, for example, is sure that this is “A Win-Win For All”. So all is fine!?
Well, there is at least one person who strongly disagrees with this assessment, although that might mainly be because his job title suggests he should. In this regard, the CEO of MapR, John Schroeder, said:
“I can’t find any innovation benefits to customers in this merger”
Apart from this almost solitary dissent, is this deal really the success story everyone believes it to be? In my opinion it is not, and as the dust settles it becomes more and more obvious that the deal is surrounded by dark clouds.
Continue reading “Why the Hortonworks-Cloudera Merger Is a Big Defeat?”
In a restricted setup, YARN executes the tasks of computation frameworks like Spark in secured Linux or Windows containers. The tasks are executed in the local context of the user submitting the application, not in the context of the yarn or some other system user. This places certain constraints on the system setup.
How is YARN actually able to impersonate the calling user at the local OS level? This post aims to give some background information to help answer such questions about secure containers. Only Linux systems are considered here, not Windows.
Continue reading “YARN Secure Container”
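As a rough illustration of what such a setup involves: running tasks under the submitting user typically means switching the NodeManager from the default executor to the LinuxContainerExecutor. A minimal sketch of the relevant yarn-site.xml settings (property names as in the Hadoop documentation; the group value hadoop is an assumption about the local setup):

```xml
<!-- yarn-site.xml: use the setuid-based LinuxContainerExecutor -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<!-- group that owns the setuid container-executor binary (assumed: hadoop) -->
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
```

The executor delegates to the setuid binary container-executor, which is configured separately in container-executor.cfg (e.g. min.user.id, banned.users) and performs the actual switch to the submitting user.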
By default, HDFS does not distinguish between different storage types, which makes it difficult to optimize installations with heterogeneous storage devices. Since Hadoop 2.3 and the integration of HDFS-2832, HDFS supports placing block replicas on persistent tiers with different durability and performance characteristics. Continue reading “HDFS Storage Tier – Archiving to Cloud w/ S3”
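To sketch how this looks in practice: DataNode volumes are tagged with a storage type (e.g. [DISK], [ARCHIVE]) in dfs.datanode.data.dir, and a storage policy is then attached to a path with the hdfs storagepolicies tool. The commands below use the Hadoop CLI against a running cluster; the path is a hypothetical example:

```
hdfs storagepolicies -listPolicies
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/archive
hdfs mover -p /data/archive   # migrate existing replicas to match the policy
```

New blocks are placed according to the policy; the mover tool takes care of replicas that were written before the policy was set.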
In an HDP cluster with a quorum-based HA setup, two NameNodes are configured, one acting as the active and the other as the standby instance. Since the standby node does not accept any write requests, a client trying to write to HDFS needs to know which of the two NameNodes is the active one at any given time. The discovery process for this is configured through hdfs-site.xml.
For any custom implementation it becomes important to set and understand the correct parameters when the cluster's current hdfs-site.xml configuration is not available. This post gives a sample Java implementation of an HA HDFS client. Continue reading “Sample HDFS HA Client”
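For reference, the client-side properties such an implementation has to set mirror the HA section of hdfs-site.xml. A minimal sketch (the nameservice mycluster and the host names are placeholder assumptions; the proxy-provider class is the standard one shipped with Hadoop):

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- lets the client discover and fail over to the active NameNode -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With these set programmatically on a Configuration object, the client addresses the filesystem as hdfs://mycluster and the proxy provider resolves the active instance.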
Next year's Hadoop Summit will be held in Munich on April 5-6, 2017, an exceptional opportunity for the community in Munich to present itself to the best and brightest in the data community.
With only a few days left, please take this opportunity to submit your abstract now!
Submit Abstract: http://dataworkssummit.com/munich-2017
Deadline: Monday, November 21, 2016.
2017 Agenda: http://dataworkssummit.com/munich-2017/agenda/
The 2017 tracks include:
- Enterprise Adoption
- Data Processing & Warehousing
- Apache Hadoop Core Internals
- Governance & Security
- IoT & Streaming
- Cloud & Operations
- Apache Spark & Data Science
We want to expand the scope to include technologies that have not explicitly been part of the Hadoop ecosystem. For instance, the community showcase will have the following zones:
- Apache Hadoop Zone
- IoT & Streaming Zone
- Cloud & Operations Zone
- Apache Spark & Data Science Zone
The goal is to increase the breadth of technologies we can talk about and increase the potential of a data summit.
Future of Data Meetups
Want to present at Meetups?
If you would like to present at a Future of Data Meetup, please don’t hesitate to reach out and send me a message.
Want to host a Meetup? Become a Sponsor?
We are also looking for rooms and organizations willing to host one of our Future of Data Meetups or become a sponsor. Please reach out and let me know.
Hadoop supports multiple file formats as input for MapReduce workflows, including programs executed with Apache Spark. Defining custom InputFormats is a common practice among Hadoop data engineers and is discussed here based on a publicly available data set.
The approach demonstrated in this post does not provide a general MATLAB™ InputFormat for Hadoop. That would require significant effort in finding a general-purpose mapping of MATLAB™’s file format and type system to those of HDFS. Continue reading “Custom MATLAB InputFormat for Apache Spark”
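To illustrate the general shape of such a custom InputFormat — a structural sketch against the org.apache.hadoop.mapreduce API, not runnable on its own; MatRecordReader is a hypothetical stand-in for a reader of the concrete MAT-file layout:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MatInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // MAT files carry a header and internally structured elements,
        // so each file is read as a whole rather than split by HDFS blocks
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // hypothetical reader that parses the file into key/value records
        return new MatRecordReader();
    }
}
```

From Spark such a format would then be consumed via sc.newAPIHadoopFile(path, MatInputFormat.class, NullWritable.class, BytesWritable.class, conf).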
Controlling the environment of an application is vital for its functionality and stability. Especially in a distributed environment, it is important for developers to have control over the versions of their dependencies. In such a scenario, it is critical to ensure that potentially conflicting requirements of multiple applications do not interfere with each other.
That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically a Linux or Docker container – that is controlled by the developer. In this post we show what this means for Python environments used by Spark. Continue reading “Running PySpark with Conda Env”
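As a rough sketch of the idea (the environment name, paths, and job file are hypothetical examples): the Conda environment is zipped, shipped to the YARN containers via spark-submit --archives, and the Python interpreter inside it is selected through PYSPARK_PYTHON:

```
# create and zip a Conda environment on the submitting host
conda create -y -n pyspark_env python=3.5 numpy
cd /opt/anaconda/envs && zip -r /tmp/pyspark_env.zip pyspark_env

# ship the archive with the application; YARN unpacks it under the alias CONDA
spark-submit --master yarn --deploy-mode cluster \
  --archives /tmp/pyspark_env.zip#CONDA \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./CONDA/pyspark_env/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./CONDA/pyspark_env/bin/python \
  job.py
```

One caveat worth noting: a plainly zipped Conda environment may contain absolute paths, so the environment has to be built in a relocatable way for the interpreter to work inside the container.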