10 Resources for Deep Dive Into Apache Flink

Around 2009 the Stratosphere research project started at the TU Berlin which a few years later was set to become the Apache Flink project. Often compared with Apache Spark in addition to that Apache Flink offers pipelining (inter-operator parallism) to better suite incremental data processing making it more suitable for stream processing. In total the Stratosphere project aimed to provide the following contributions to Big Data processing. Most of it can be found in Flink today:

1 – High-level, declarative language for data analyisis
2 – “in suit” data analysis for external data sources3 – Richer set of primitives as MapReduce
4 – UDFs as first class citizens
5 – Query optimization
6 – Support for iterative processing
7 – Execution engine (Nephele) with external memory query processing

The Architecture of Stratosphere:

The Stratosphere software stack

This posts contains 10 resource highlighting the building foundation of Apache Flink today. Continue reading “10 Resources for Deep Dive Into Apache Flink”

Distcp Between kerberized and none-kerberized Cluster

The standard tool for copying data between two clusters is probably distcp. It can also be used to keep the data of two clusters updated. Here the update process is a asynchronous process using a fairly basic update strategy. Distcp is a simple tool, but some edge cases can get complicated. For once the distributed copy between two HA clusters is such a case. Also important to know is that since the versions of RPC used by HDFS can be different it is always a good idea to use a read only protocol like hftp or webhdfs to copy the data from the source system. So the URL could look like this hftp://source.cluster/users/me . WebHDFS would also work, because it is not using RPC.

Another corner case using distcp is the need to copy data between a secure and none secure cluster. Such a process should always be triggered from the secure cluster. This would be the cluster the owner of the cluster has a valid ticket to authenticate against the secure cluster. But this would still yield an exception as the system would complain about a missing fallback mechanism. On the secure cluster it is important to set the  ipc.client.fallback-to-simple-auth-allowed to true in the core-site.xml  in order to make this work.

<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>

What is left to to is make sure the user has the needed right on both systems to read and write the data.