HiveSink for Flume

With the most recent release of HDP (v2.2.4) Hive Streaming is shipped as technical preview. It can for example be used with Storm to ingest streaming data collected from Kafka as demonstrated here. But it also still has some serious limitations and in case of Storm a major bug. Nevertheless Hive Streaming is likely to become the tool of choice when it comes to streamline data ingestion to Hadoop. So it is worth to explore already today.

Flume’s upcoming release 1.6 will contain a HiveSink capable of leveraging Hive Streaming for data ingestion. In the following post we will use it as a replacement for the HDFS sink used in a previous post here. Other then replacing the HDFS sink with a HiveSink none of the previous setup will change, except for Hive table schema which needs to be adjusted as part of the requirements that currently exist around Hive Streaming. So let’s get started by looking into these restrictions. Continue reading “HiveSink for Flume” →

Hadoop File Ingest and Hive

In the beginning of all Hadoop adventures is the task of ingesting data to HDFS preferably today being queried for analysis by Hive at any point in time. High chances are that most enterprise data today at the beginning of any Hadoop project resides inside of RDBMS systems. Sqoop is the tool of choice within the Hadoop ecosystem for these kind of data. It is also quite convenient to use with Hive directly.

As most business is inherently event driven and more and more electronic devices are being used to track this events, ingesting a stream of data to Hadoop is a common demand. A tool like Kafka would be used for data ingestion into Hadoop in such a scenario of stream processing.

None of the methods mentioned above consider the sheer amount of data stored in files today. Not to mention the files newly created day by day. While WebHDFS or direct HDFS sure are convenient method for file ingestion they often require direct access to the cluster or a huge landing zone also with direct access to HDFS. A continues data ingest is also not supported.

For such scenarios Apache Flume sure would be a good option. Flume is capable of dealing with various continues data sources. Sources can be piped together over several nodes through channels writing data into various sink. In this post we look at the possibility to define a local directory where files can be dropped off, while Flume monitors for new files in that directory to sink to HDFS. Continue reading “Hadoop File Ingest and Hive” →

Reliably Store Postfix Logs in S3 with Apache Flume and rsyslog

Flume is a distributed system to aggregate log files into the Hadoop Distributed File System (HDFS). It has a simple design of Events, Sources, Sinks, and Channels which can be connected into a complex multi-hop architecture.

While Flume is designed to be resilient “with tunable reliability mechanisms for fail-over and recovery” in this blog post we’ll also look at the reliable forwarding of rsyslog, which we are going to use to store postfix logs in Amazon S3.

Continue reading “Reliably Store Postfix Logs in S3 with Apache Flume and rsyslog” →