Storm Serialization with Avro (using Kryo Serializer)

Working with complex data events can be a challenge when designing Storm topologies for real-time data processing. In such cases, emitting single values for multiple and varying event characteristics soon reveals its limitations. For message serialization, Storm leverages the Kryo serialization framework, which is used by many other projects. Kryo keeps a registry of serializers for corresponding class types. Mappings in that registry can be overridden or added, making the framework extensible to diverse type serializations.
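To give an impression of how such a registration looks, here is a minimal sketch using Storm's topology configuration; the event and serializer class names are hypothetical, not from the example project:

    import org.apache.storm.Config; // backtype.storm.Config on older Storm releases

    // StockEvent and StockEventKryoSerializer are placeholder names
    Config conf = new Config();
    conf.registerSerialization(StockEvent.class, StockEventKryoSerializer.class);
    // Optional: fail fast instead of silently falling back to Java serialization
    conf.setFallBackOnJavaSerialization(false);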

Avro, on the other hand, is a very popular “data serialization system” that bridges between many different programming languages and tools. While the fact that data objects can be described in JSON makes it really easy to use, Avro is often chosen for its support of schema evolution. With schema evolution, the same implementation (Storm topology) can read different versions of the same data event without adaptation. This makes it a very good fit for Storm as an intermediary between data ingestion points and data storage in today’s Enterprise Data Architectures.
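To illustrate the idea of schema evolution, a reader schema can add a field with a default value so that records written with the older schema still deserialize. The schema below is a made-up example, not taken from the project:

    {
      "type": "record",
      "name": "StockEvent",
      "namespace": "example.avro",
      "doc": "Illustrative schema only; 'exchange' was added in a later version",
      "fields": [
        {"name": "symbol", "type": "string"},
        {"name": "price",  "type": "double"},
        {"name": "exchange", "type": "string", "default": "UNKNOWN"}
      ]
    }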

[Figure: Storm Enterprise Data Architecture]

The example here does not provide complex event samples to illustrate that point, but it gives an end-to-end implementation of a Storm topology where events get sent to a Kafka queue as Avro objects and are processed natively by a real-time processing topology. The example can be found here. It’s a simple Hive Streaming example where stock events are read from a CSV file and sent to Kafka. Stock events are a flat, non-complex data type, as already mentioned, but we’ll still use them to demo serialization using Avro. Continue reading “Storm Serialization with Avro (using Kryo Serializer)”
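As a rough sketch of what such a Kryo serializer for Avro records could look like, the following wraps Avro’s binary encoding in Kryo’s Serializer interface. This is an illustrative implementation, not the one from the linked example, and it assumes a fixed schema known to all workers:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.Serializer;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    // Illustrative only: a Kryo serializer delegating to Avro's binary encoding.
    // Kryo instantiates registered serializer classes itself, so the schema is
    // kept as a constant here; real code might load it from the classpath.
    public class AvroStockEventSerializer extends Serializer<GenericRecord> {

        private static final Schema SCHEMA = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"StockEvent\",\"fields\":["
              + "{\"name\":\"symbol\",\"type\":\"string\"},"
              + "{\"name\":\"price\",\"type\":\"double\"}]}");

        @Override
        public void write(Kryo kryo, Output output, GenericRecord record) {
            try {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bos, null);
                new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
                encoder.flush();
                byte[] bytes = bos.toByteArray();
                output.writeInt(bytes.length, true); // length prefix for the reader
                output.writeBytes(bytes);
            } catch (IOException e) {
                throw new RuntimeException("Avro serialization failed", e);
            }
        }

        @Override
        public GenericRecord read(Kryo kryo, Input input, Class<GenericRecord> type) {
            try {
                byte[] bytes = input.readBytes(input.readInt(true));
                BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
                return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
            } catch (IOException e) {
                throw new RuntimeException("Avro deserialization failed", e);
            }
        }
    }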


Install HDP with Red Hat Satellite

As part of the installation of HDP with Ambari, two repositories get generated, with URLs defined as user input during the first steps of the install wizard, and distributed to the cluster hosts. In cases where you are using Red Hat Satellite to manage your Linux infrastructure, you need to disable these generated repositories in order to leverage Red Hat Satellite. The same is also true for SUSE Manager (Spacewalk).

Prior to the install, and before starting the Ambari server, you need to disable the repositories by altering the template responsible for generating them.

Prior to Ambari 2.x you need to change the repo_suse_rhel.j2 template to disable the generated repositories. In that file simply change enabled=1 to enabled=0. To find the template file, do $ find /var/lib/ambari-server -name repo_suse_rhel.j2 .
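The template is a Jinja2 rendering of a yum repository definition. Its exact content differs between Ambari versions, but it looks roughly like the following sketch, with the enabled flag already flipped to 0:

    [{{repo_id}}]
    name={{repo_id}}
    {% if mirror_list %}mirrorlist={{mirror_list}}{% else %}baseurl={{base_url}}{% endif %}
    path=/
    # changed from enabled=1 so the hosts use the Satellite channels instead
    enabled=0
    gpgcheck=0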

Starting with Ambari 2.x the configuration for the repositories can be found in cluster-env.xml under /var/lib/ambari-server/resources/stacks/HDP/2.0.6/configuration. Here, too, change the value of enabled to 0. In that file look for <name>repo_suse_rhel_template</name> .
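In cluster-env.xml the template lives inside a property value, roughly like this (again an illustration; the body mirrors the .j2 template above):

    <property>
      <name>repo_suse_rhel_template</name>
      <value>[{{repo_id}}]
    name={{repo_id}}
    {% if mirror_list %}mirrorlist={{mirror_list}}{% else %}baseurl={{base_url}}{% endif %}
    path=/
    enabled=0
    gpgcheck=0</value>
    </property>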

Save the changes and start your install. Continue reading “Install HDP with Red Hat Satellite”

Storm Flux: Easy Streaming Deployment

With Flux for Apache Storm, deploying streaming topologies for real-time processing becomes less programmatic and more declarative. Using Flux for deployments makes it less likely that you will have to re-compile your project just because you have re-configured or re-arranged your topology. It leverages YAML, a human-readable serialization format, to describe a topology as a whole. You might still need to write some classes, but by taking advantage of existing, generic implementations this becomes less likely. A minimal definition could look like the sketch below.
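In this sketch the spout and bolt class names are placeholders rather than the classes from the Hive-Streaming example:

    name: "hive-streaming-topology"

    config:
      topology.workers: 1

    spouts:
      - id: "stock-spout"
        className: "org.example.StockEventSpout"   # placeholder class
        parallelism: 1

    bolts:
      - id: "hive-bolt"
        className: "org.example.HiveWriterBolt"    # placeholder class
        parallelism: 1

    streams:
      - name: "stock-spout --> hive-bolt"
        from: "stock-spout"
        to: "hive-bolt"
        grouping:
          type: SHUFFLE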

While Flux can also be used with an existing topology, for this post we’ll take the Hive-Streaming example you can find here (blog post) and create the required topology from scratch using Flux. For experiments and demo purposes you can use the following Vagrant setup to run an HDP cluster locally. Continue reading “Storm Flux: Easy Streaming Deployment”
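Once packaged, the topology is deployed by pointing the Flux runner at the YAML file; the jar name below is a placeholder:

    # run locally for testing
    storm jar my-topology.jar org.apache.storm.flux.Flux --local topology.yaml

    # submit to the cluster
    storm jar my-topology.jar org.apache.storm.flux.Flux --remote topology.yaml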

JPMML Example Random Forest

The Predictive Model Markup Language (PMML), developed by the Data Mining Group, is a standardized XML-based representation of mining models that can be used and shared across languages and tools. The standardized definition allows, for example, a classification model trained with R to be used with Storm. Many projects related to Big Data have some support for PMML, which is often implemented with JPMML. Continue reading “JPMML Example Random Forest”
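As a rough sketch of how scoring a PMML model looks with the JPMML-Evaluator library (the API has changed across versions, and the file and field handling here is illustrative, not taken from the post):

    import java.io.File;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.dmg.pmml.FieldName;
    import org.jpmml.evaluator.Evaluator;
    import org.jpmml.evaluator.EvaluatorUtil;
    import org.jpmml.evaluator.FieldValue;
    import org.jpmml.evaluator.InputField;
    import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

    public class RandomForestScorer {

        public static Object score(File pmmlFile, Map<String, ?> rawInput) throws Exception {
            // Load the PMML document, e.g. a random forest exported from R
            Evaluator evaluator = new LoadingModelEvaluatorBuilder()
                    .load(pmmlFile)
                    .build();

            // Prepare raw values into the types the model expects
            Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
            for (InputField inputField : evaluator.getInputFields()) {
                FieldName name = inputField.getName();
                arguments.put(name, inputField.prepare(rawInput.get(name.getValue())));
            }

            // Evaluate and decode the target field into a plain Java value
            Map<FieldName, ?> results = evaluator.evaluate(arguments);
            FieldName targetName = evaluator.getTargetFields().get(0).getName();
            return EvaluatorUtil.decode(results.get(targetName));
        }
    }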