From time to time it can be very useful to search for an HDP repository release directly in the public repo, especially if you are looking for a recent development or technical preview version. This also comes in handy if you need to create an offline repository for your company intranet.
The HDP repositories are available through Amazon’s S3 storage layer. A quite convenient tool for browsing them is s3cmd.
After downloading it, s3cmd can easily be installed with Python:
# requires python-setuptools
$ cd ~/Downloads/
$ tar xfz s3cmd-1.5.2.tar.gz
$ cd s3cmd-1.5.2
$ more INSTALL # to read the INSTALL guide
$ sudo python setup.py install
Browsing the HDP repo:
$ s3cmd ls s3://public-repo-1.hortonworks.com/HDP/centos6/
$ s3cmd ls s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/
2013-07-09 00:06 0 s3://public-repo-1.hortonworks.com/HDP/centos6/2.x/
Wildcards can be used for filtering:
$ s3cmd ls s3://dev.hortonworks.com/HDP/centos6/2.x/updates/2.3.*
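For the offline repository use case mentioned above, the same tool can pull a whole repo path to local disk. The following is only a sketch, assuming s3cmd is already configured; the function name mirror_hdp and its arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch: mirror one HDP repo path from S3 into a local directory.
# mirror_hdp and its arguments are hypothetical helper names.
BUCKET="s3://public-repo-1.hortonworks.com"

mirror_hdp() {
  os="$1"        # e.g. centos6
  release="$2"   # e.g. 2.x
  target="$3"    # local directory to sync into
  mkdir -p "$target" || return 1
  # The trailing slash makes s3cmd copy the contents of the prefix.
  s3cmd sync "$BUCKET/HDP/$os/$release/" "$target/"
}
```

Calling mirror_hdp centos6 2.x /var/www/html/hdp would then sync that path into a directory a local web server could expose to the intranet. The target path here is just an example.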
Ambari Shell is an interactive command line tool to administer Ambari-managed HDP clusters. It supports all functionality provided by the UI of the Ambari web application. Written as a Java application based on a Groovy REST client, it further provides tab completion and context-aware commands. In a previous post we already discussed various contexts like service and state while using REST calls to alter them. Ambari Shell is a convenient tool for managing most of the complex aspects discussed there.
It can therefore also be used for automated cluster installs based on Ambari Blueprints. While it is fairly simple to use two curl requests to do a blueprint-based install, Ambari Shell has the advantage of monitoring the process. In scripted setups and with provisioning tools like Puppet, Chef, or Ansible, this makes it possible to schedule further setup steps after a complete cluster install. Executing a cluster install with --exitOnFinish true will halt the execution of the script until the install has finished.
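For comparison, the two curl requests of a plain blueprint install could look roughly like this. It is a sketch against the Ambari REST API; the host, the admin credentials, the JSON file names, and the helper function names are placeholders and not taken from an actual setup.

```shell
#!/bin/sh
# Sketch of a blueprint-based install with two REST calls.
# Host, credentials, and function names are placeholders.
AMBARI_URL="${AMBARI_URL:-http://localhost:8080}"
AMBARI_USER="${AMBARI_USER:-admin}"
AMBARI_PASS="${AMBARI_PASS:-admin}"

# POST a JSON file to a path below the Ambari REST API root.
ambari_post() {
  json_file="$1"
  api_path="$2"
  curl -u "$AMBARI_USER:$AMBARI_PASS" \
       -H "X-Requested-By: ambari" \
       -X POST -d @"$json_file" \
       "$AMBARI_URL/api/v1/$api_path"
}

# Request 1: register the blueprint under a name.
register_blueprint() { ambari_post "$2" "blueprints/$1"; }

# Request 2: create a cluster from a host-mapping template that
# references the registered blueprint.
create_cluster() { ambari_post "$2" "clusters/$1"; }
```

Both calls return right away, so a script using only curl would have to poll for completion itself. That is the gap Ambari Shell closes.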
Apache Kafka, developed as a durable and fast messaging queue handling real-time data feeds, originally did not come with any security approach. Similar to Hadoop, Kafka was at the beginning expected to be used in a trusted environment, focusing on functionality instead of compliance. With the ever-growing popularity and widespread use of Kafka, the community recently gained traction around a complete security design including authentication with Kerberos and SSL, encryption, and authorization. Judging by the details of the security proposal found here, the complete security measures will be included with the 0.9 release of Apache Kafka.
The releases of HDP 2.3 already support a secure implementation of Kafka with authentication and authorization. Especially with the integration of the security framework Apache Ranger, this becomes a comprehensive security solution for any Hadoop deployment with real-time data demands. In this post we look by example at how working with a kerberized Kafka broker differs from before, using the known shell tools and a custom Java producer. Continue reading “Kafka Security with Kerberos”→
Around 2009 the Stratosphere research project started at the TU Berlin; a few years later it was set to become the Apache Flink project. Often compared with Apache Spark, Apache Flink additionally offers pipelining (inter-operator parallelism) to better suit incremental data processing, making it more suitable for stream processing. In total the Stratosphere project aimed to provide the following contributions to Big Data processing, most of which can be found in Flink today:
1 – High-level, declarative language for data analysis
2 – “in situ” data analysis for external data sources
3 – Richer set of primitives than MapReduce
4 – UDFs as first class citizens
5 – Query optimization
6 – Support for iterative processing
7 – Execution engine (Nephele) with external memory query processing
The standard tool for copying data between two clusters is probably distcp. It can also be used to keep the data of two clusters in sync. Here the update is an asynchronous process using a fairly basic update strategy. Distcp is a simple tool, but some edge cases can get complicated. For one, the distributed copy between two HA clusters is such a case. It is also important to know that, since the versions of RPC used by HDFS can differ between clusters, it is always a good idea to read from the source system through an HTTP-based protocol such as the read-only hftp or webhdfs. So the URL could look like this: hftp://source.cluster/users/me . WebHDFS would also work, because it is not using RPC.
Another corner case with distcp is the need to copy data between a secure and a non-secure cluster. Such a process should always be triggered from the secure cluster, as that is where the user holds a valid ticket to authenticate against the secure cluster. On its own this would still yield an exception, as the system would complain about a missing fallback mechanism. On the secure cluster it is important to set ipc.client.fallback-to-simple-auth-allowed to true in core-site.xml in order to make this work.
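As an alternative to editing core-site.xml, the property can also be handed to a single distcp run with Hadoop's generic -D option. The sketch below assumes it is executed on the secure cluster with a valid Kerberos ticket; the function name is made up.

```shell
#!/bin/sh
# Sketch: distcp between a secure and a non-secure cluster.
# Run on the secure cluster after kinit; the function name is hypothetical.
distcp_with_simple_auth_fallback() {
  src="$1"
  dst="$2"
  # The generic -D option must come before the distcp options.
  hadoop distcp \
    -D ipc.client.fallback-to-simple-auth-allowed=true \
    -update "$src" "$dst"
}
```

The -update flag only copies files that are missing or differ at the target, which fits the use case of keeping two clusters in sync.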
Working with complex data events can be a challenge when designing Storm topologies for real-time data processing. In such cases emitting single values for multiple and varying event characteristics soon reveals its limitations. For message serialization Storm leverages the Kryo serialization framework used by many other projects. Kryo keeps a registry of serializers used for the corresponding Class types. Mappings in that registry can be overridden or added, making the framework extensible to diverse type serializations.
Avro, on the other hand, is a very popular “data serialization system” that bridges many different programming languages and tools. While the fact that data objects can be described in JSON makes it really easy to use, Avro is often used for its support of schema evolution. With support for schema evolution the same implementation (Storm topology) can read different versions of the same data event without adaptation. This makes it a very good fit for Storm as an intermediary between data ingestion points and data storage in today’s Enterprise Data Architectures.
The example here does not provide complex event samples to illustrate that point, but it gives an end-to-end implementation of a Storm topology where events get sent to a Kafka queue as Avro objects and are processed natively by a real-time processing topology. The example can be found here. It’s a simple Hive Streaming example where stock events are read from a CSV file and sent to Kafka. Stock events are a flat, non-complex data type as already mentioned, but we’ll still use them to demo serialization using Avro. Continue reading “Storm Serialization with Avro (using Kryo Serializer)”→
As part of the installation of HDP with Ambari, two repositories get generated with the URLs defined as user input during the first steps of the install wizard, and are distributed to the cluster hosts. If you are using Red Hat Satellite to manage your Linux infrastructure, you need to disable the generated repositories in order to leverage Red Hat Satellite. The same is true for SUSE Manager (Spacewalk).
Prior to the install and prior to starting the Ambari server you need to disable the repositories by altering the template responsible for generating them.
Prior to Ambari 2.x you would need to change the repo_suse_rhel.j2 template to disable the generated repositories. In that file simply change enabled=1 to enabled=0. To find the template file run $ find /var/lib/ambari-server -name repo_suse_rhel.j2 .
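The change to the template can also be scripted. This is a small sketch; the function name is made up, and it keeps a .bak copy next to the template so the change can be reverted.

```shell
#!/bin/sh
# Sketch: flip enabled=1 to enabled=0 in a repo template.
# disable_generated_repos is a hypothetical helper name.
disable_generated_repos() {
  template="$1"
  # Keep a backup, then rewrite the template from it.
  cp "$template" "$template.bak" || return 1
  sed 's/enabled=1/enabled=0/' "$template.bak" > "$template"
}

# Possible usage, once per template found by the find command above:
#   disable_generated_repos /path/to/repo_suse_rhel.j2
```

Writing from the backup into the original avoids relying on sed -i, which behaves differently across platforms.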
Starting with Ambari 2.x the configuration for the repositories can be found in cluster-env.xml under /var/lib/ambari-server/resources/stacks/HDP/2.0.6/configuration. Here, too, change the value of enabled to 0. In that file look for <name>repo_suse_rhel_template</name> .
With Flux for Apache Storm, deploying streaming topologies for real-time processing becomes less programmatic and more declarative. Using Flux for deployments makes it less likely you will have to re-compile your project just because you have re-configured or re-arranged your topology. It leverages YAML, a human-readable serialization format, to describe a topology as a whole. You might still need to write some classes, but by taking advantage of existing, generic implementations this becomes less likely.
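To give an idea of what such a declarative topology looks like, here is a minimal sketch that writes a Flux YAML file and submits it. All ids and the com.example.* class names are hypothetical placeholders, as are the two function names.

```shell
#!/bin/sh
# Sketch: describe a topology in YAML and hand it to Flux.
# All ids and com.example.* class names are hypothetical.
write_topology_yaml() {
  cat > "$1" <<'EOF'
name: "demo-topology"
config:
  topology.workers: 1
spouts:
  - id: "event-spout"
    className: "com.example.EventSpout"
    parallelism: 1
bolts:
  - id: "log-bolt"
    className: "com.example.LogBolt"
    parallelism: 1
streams:
  - name: "event-spout --> log-bolt"
    from: "event-spout"
    to: "log-bolt"
    grouping:
      type: SHUFFLE
EOF
}

# Submit the jar with Flux as the main class; --remote targets the cluster.
deploy_topology() {
  storm jar "$1" org.apache.storm.flux.Flux --remote "$2"
}
```

Re-wiring the stream or changing the parallelism then only means editing the YAML file and re-submitting the same jar.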
Since Java 5 developers have been able to define so-called pre-main hooks to manipulate the execution of a Java program at runtime with Java agents. An agent as part of the classpath is triggered before execution of the main method and can therefore be used to either filter calls to or even manipulate the underlying Java code. A tool for code manipulation is Javassist. Apache Ranger, for example, uses both Java agents and Javassist to override the authorization mechanism of components of the Hadoop stack. Together with Ranger Stacks this could also be used to secure existing code unchanged at runtime.
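How an agent gets wired into a JVM launch can be sketched from the command line. The agent jar names its pre-main class in the manifest and is attached with the -javaagent option. The helper function names and the class and jar names used with them are hypothetical.

```shell
#!/bin/sh
# Sketch: wiring a Java agent into a JVM launch.
# Helper, class, and jar names are hypothetical.

# The agent jar's manifest names the class whose premain method
# the JVM calls before the application's main method.
write_agent_manifest() {
  printf 'Premain-Class: %s\n' "$1" > "$2"
}

# Start an application with the agent attached via -javaagent.
run_with_agent() {
  java -javaagent:"$1" -jar "$2"
}
```

The application jar itself stays untouched, which is what makes this approach interesting for securing existing code.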