Hive Join Strategies

Hive executes joins as MapReduce-style jobs on one of several execution engines, for example Tez, Spark, or MapReduce itself. Even joins across multiple tables can be executed in a single job. Since its first release, many optimizations have been added to Hive, giving users various options for improving join queries.
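
To make the idea concrete, here is a minimal sketch of a reduce-side (shuffle) join, the general pattern in which mappers tag each row with its source table and reducers combine rows that share a join key. It is plain Python, not Hive code; the function names and the sample `users` and `orders` tables are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product


def map_phase(table_tag, rows, key_index):
    """Emit (join_key, (table_tag, row)) pairs, as a mapper would."""
    for row in rows:
        yield row[key_index], (table_tag, row)


def shuffle(pairs):
    """Group mapper output by join key, mimicking the shuffle step."""
    groups = defaultdict(list)
    for key, tagged_row in pairs:
        groups[key].append(tagged_row)
    return groups


def reduce_phase(groups, left_tag, right_tag):
    """Per key, pair every left row with every right row (inner join)."""
    for key in sorted(groups):
        left = [row for tag, row in groups[key] if tag == left_tag]
        right = [row for tag, row in groups[key] if tag == right_tag]
        for left_row, right_row in product(left, right):
            yield left_row + right_row


def reduce_side_join(left_rows, right_rows, left_key=0, right_key=0):
    """Run map, shuffle, and reduce in sequence within one 'job'."""
    pairs = list(map_phase("left", left_rows, left_key))
    pairs += list(map_phase("right", right_rows, right_key))
    return list(reduce_phase(shuffle(pairs), "left", "right"))


# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
joined = reduce_side_join(users, orders)
```

Keys without a match on both sides (user 2 and order 4 here) produce no output, which corresponds to an inner join. Because all rows for a key must travel through the shuffle, this pattern is expensive for large inputs, which is what join optimizations try to avoid.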

Understanding how joins are implemented with MapReduce helps to recognize the different optimization techniques in Hive today. Continue reading “Hive Join Strategies” →

HiveSink for Flume

With the most recent release of HDP (v2.2.4), Hive Streaming is shipped as a technical preview. It can, for example, be used with Storm to ingest streaming data collected from Kafka, as demonstrated here. But it still has some serious limitations and, in the case of Storm, a major bug. Nevertheless, Hive Streaming is likely to become the tool of choice for streamlining data ingestion into Hadoop, so it is worth exploring already today.

Flume’s upcoming release 1.6 will contain a HiveSink capable of leveraging Hive Streaming for data ingestion. In the following post we will use it as a replacement for the HDFS sink used in a previous post here. Other than replacing the HDFS sink with a HiveSink, none of the previous setup will change, except for the Hive table schema, which needs to be adjusted to meet the requirements that currently exist around Hive Streaming. So let’s get started by looking into these restrictions. Continue reading “HiveSink for Flume” →

Hadoop File Ingest and Hive

At the beginning of every Hadoop adventure is the task of ingesting data into HDFS, preferably so that it can be queried for analysis by Hive at any point in time. Chances are high that, at the start of any Hadoop project, most enterprise data resides in relational databases. Sqoop is the tool of choice within the Hadoop ecosystem for this kind of data. It is also quite convenient to use with Hive directly.

As most business is inherently event driven and more and more electronic devices are being used to track these events, ingesting a stream of data into Hadoop is a common demand. In such a stream processing scenario, a tool like Kafka would be used for data ingestion into Hadoop.

None of the methods mentioned above considers the sheer amount of data stored in files today, not to mention the files newly created day by day. While WebHDFS or direct HDFS access are certainly convenient methods for file ingestion, they often require direct access to the cluster or a huge landing zone that also has direct access to HDFS. Continuous data ingest is also not supported.

For such scenarios Apache Flume would be a good option. Flume is capable of dealing with various continuous data sources. Sources can be piped together over several nodes through channels, writing data into various sinks. In this post we look at the possibility of defining a local directory where files can be dropped off, while Flume monitors that directory for new files to sink to HDFS. Continue reading “Hadoop File Ingest and Hive” →

Hive Streaming with Storm

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was introduced as a Technical Preview to support continuous data ingestion into Hive tables. This API is intended to help streaming clients like Flume or Storm store data in Hive, which traditionally has been a batch-oriented store.

Building on the new ACID insert/update capabilities of Hive, the Streaming API breaks a stream of data down into smaller batches, each of which is committed in a transaction to the underlying storage. Once committed, the data becomes immediately available to other queries.
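
The batching idea can be illustrated with a small sketch. This is plain Python and does not use the actual Hive Streaming API; `InMemoryTable`, `batches`, and `ingest` are hypothetical names that only mimic the behaviour described above, where rows become visible to readers once their batch is committed.

```python
from itertools import islice


class InMemoryTable:
    """Hypothetical stand-in for a transactional table (not a Hive class)."""

    def __init__(self):
        self.committed = []   # rows visible to readers
        self._pending = None  # rows of the open transaction, if any

    def begin(self):
        self._pending = []

    def write(self, row):
        self._pending.append(row)

    def commit(self):
        # Only at commit time do the rows become visible.
        self.committed.extend(self._pending)
        self._pending = None


def batches(stream, size):
    """Split an arbitrary iterable into lists of at most `size` items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def ingest(stream, table, batch_size):
    """Write the stream in batches, one transaction per batch."""
    transactions = 0
    for batch in batches(stream, batch_size):
        table.begin()
        for row in batch:
            table.write(row)
        table.commit()
        transactions += 1
    return transactions


table = InMemoryTable()
transactions = ingest(range(10), table, batch_size=4)
```

With ten rows and a batch size of four, the sketch commits three transactions. A smaller batch size makes data visible sooner at the cost of more commits, which is the basic trade-off when tuning this kind of ingestion.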

Broadly speaking, the API consists of two parts. One part handles the transactions while the other deals with the underlying storage (HDFS). Transactions in Hive are handled by the Metastore. Kerberos is supported from the beginning!

Some of the current limitations are:

Only delimited input data and JSON (strict syntax) are supported
Only ORC support
Hive table must be bucketed (unpartitioned tables are supported)

In this post I would like to demonstrate the use of a newly created Storm HiveBolt that makes use of the Streaming API and is quite straightforward to use. The source code of the example described here is provided on GitHub. To run this demo you need an HDP 2.2 Sandbox, which can be downloaded for various virtualization environments here. Continue reading “Hive Streaming with Storm” →

Using Hive from R with JDBC

RHadoop is probably one of the best ways to take advantage of Hadoop from R by making use of Hadoop’s Streaming capabilities. Another way to make R work with Big Data in general is to use SQL through, for example, a JDBC connector. For Hive this is possible with the HiveServer2 JDBC client. In combination with UDFs, this has the potential to be quite a powerful approach that leverages the best of both worlds. In this post I would like to demonstrate the preliminary steps necessary to make R and Hive work together.

If you have the Hortonworks Sandbox set up, you should be able to simply follow along as you read. If not, you should be able to adapt the steps where appropriate. First we’ll have to install R on a machine with access to Hive. By default this means the machine should be able to access port 10000 or 10001 on the host where the Hive server is installed. Next we are going to set up all required packages and use a sample table in Hive to query from R.

Continue reading “Using Hive from R with JDBC” →

Getting Started with ORC and HCatalog

ORC (Optimized Row Columnar) is a columnar file format optimized to improve the performance of Hive. Through the Hive metastore and HCatalog, reading, writing, and processing can also be accomplished by MapReduce, Pig, Cascading, and so on. It is very similar to Parquet, which is being developed by Cloudera and Twitter. Both are part of the most current Hive release and can be used immediately. In this post I would like to describe some of the key concepts of ORC and demonstrate how to get started quickly using HCatalog. Continue reading “Getting Started with ORC and HCatalog” →