Getting Started with ORC and HCatalog

ORC (Optimized Row Columnar) is a columnar file format designed to improve the performance of Hive. Through the Hive metastore and HCatalog, reading, writing, and processing can also be accomplished with MapReduce, Pig, Cascading, and so on. It is very similar to Parquet, which is being developed by Cloudera and Twitter. Both are part of the most recent Hive release and can be used immediately. In this post I would like to describe some of the key concepts of ORC and demonstrate how to get started quickly using HCatalog.

The key concepts of ORC files are:

  1. Split the Rows into Stripes (default size 250 MB)
    Splitting the data into independent stripes allows the data to be read concurrently with separate RecordReaders. Stripes can be split without scanning for markers.
  2. Use Lightweight Indexes
    Stripes are surrounded by a lightweight index and footer. The index, the so-called stripe index, covers the columns contained within the stripe. This makes for very fast filtering, as records that cannot match the query can be skipped (see the query sketch right after this list).
  3. Compress Column Values
    Column values are compressed, so not the complete row or stripe has to be decompressed prior to reading, but only the values of potential interest. Supported codecs are Snappy, Zlib, or none.
  4. Support for Hive’s Data Types
    ORC supports Hive types like datetime, decimal, and so on. You should check which data types are supported by your current Hive version and, if you plan to use ORC outside of Hive, whether they are also supported by HCatalog.
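
To see the indexes from point 2 in action, Hive's predicate pushdown for ORC needs to be switched on. The following is only a minimal sketch, not part of the original setup: the session flag hive.optimize.index.filter and the web.blogs table (created further down in this post) are the assumptions it relies on.

-- Sketch (assumed setup, see above): enable predicate pushdown so ORC can
-- use its stripe/row indexes to skip data that cannot match the filter.
SET hive.optimize.index.filter=true;

-- Only stripes/row groups whose index statistics overlap the predicate
-- need to be read and decompressed.
SELECT uri, title
FROM web.blogs
WHERE host = 'hortonworks.com';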

There are some more aspects to ORC files, most notably a custom serialization that helps with speed and space. This includes ZigZag, dictionary, and variable-length encoding. All of that is very similar to what Parquet uses; the most notable differences seem to be Parquet’s emphasis on deeply nested data and its use of Thrift, compared to Protocol Buffers in ORC.

Here is a visualization provided by Hortonworks which shows some of the compression benefits you could gain from using ORC files.

Source: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

In general, to achieve good compression, choosing Zlib, which is the default for ORC, over Snappy is a good choice. This likely comes at the cost of some speed.
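
If you want to compare the two codecs on your own data, a simple approach is to write the same data twice with different orc.compress settings and compare the resulting file sizes in HDFS. This is just a rough sketch; the table names blogs_zlib and blogs_snappy are made up for illustration, and web.blogs refers to the table we create below.

-- Hypothetical comparison: same data, two codecs, sizes compared afterwards
-- on HDFS (e.g. hadoop fs -du -h on the two table directories).
CREATE TABLE web.blogs_zlib
  STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB")
  AS SELECT * FROM web.blogs;

CREATE TABLE web.blogs_snappy
  STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
  AS SELECT * FROM web.blogs;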

Creating a First ORC-Backed Table in Hive

As the ORC SerDe is included with the most recent release of Hive, using it is very simple. As a quick example we are going to model crawled blog posts to be stored in our database web. This could be modeled, for example, like this:

CREATE TABLE web.blogs(
  uri STRING,
  host STRING,
  type INT,
  publish_date DATE,
  crawled TIMESTAMP,
  title STRING,
  text STRING,
  comments ARRAY<STRUCT<author:STRING, text:STRING, published: DATE>>
)
PARTITIONED BY (crawl_day STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

This creates an ORC-backed table in Hive which uses Snappy compression instead of Zlib. The table will be partitioned by the day each blog post was crawled (Y-m-d). HCatalog only supports partition columns of type string. Partitions are basically folders in HDFS, and the partition column's values correspond to the folder names. That’s why it needs to be a string here and why we use two columns to store the crawl timestamp and the crawl day. There are some other table options (TBLPROPERTIES) specific to ORC that might interest you (see the sketch after this list):

  • orc.compress (default: zlib)
  • orc.compress.size (default: 262144 Bytes)
  • orc.stripe.size (default: 256 MB)
  • orc.row.index.stride (default: 10,000)
  • orc.create.index (default: true)
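
As an illustration, these properties can be set explicitly in the table definition. The values below (a 128 MB stripe size, i.e. 134217728 bytes, and a 5,000-row index stride) and the table name blogs_tuned are only examples to show the syntax, not recommendations:

-- Illustrative table with all ORC properties spelled out explicitly.
CREATE TABLE web.blogs_tuned (
  uri STRING,
  title STRING
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="ZLIB",
  "orc.compress.size"="262144",
  "orc.stripe.size"="134217728",
  "orc.row.index.stride"="5000",
  "orc.create.index"="true"
);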

A blog post consists of a URI, text, author, and so on, as well as the comments to each post. We use an ARRAY of STRUCTs to include the comments within each post. To create the table in our database we can execute the following command using hcat:

hcat -e 'CREATE TABLE ...'
hcat -f create_blog.hcat

With -e we pass the data definition from above inline to create our table in Hive/HCatalog, or we put the statement into a file and use -f. Refer to hcat --help for more information.

From here we can use Pig, Sqoop, Hive, MapReduce, or Cascading to update and integrate data into our table. Let’s have a quick look at how we can use Pig to insert and query some data. For this we’ll use a small set of data we can experiment with. Here are the blog posts and comments we are going to use for this example.

Posts (posts.tsv):

/hadoop/mr	hortonworks.com	2	2014-04-01T10:33:24.123-0800	2014-04-01T01:03:44.123+0000	MR Title	MR Text
/hadoop/cdh54	cloudera.com	3	2014-04-11T12:13:24.125-0800	2014-04-01T01:03:44.523+0000	CDH 54	CDH Text
/ml/train	mapr.com	3	2014-04-09T17:53:44.123-0800	2014-04-01T01:03:45.003+0000	SVM	SVM Train

Comments (comments.tsv):

/hadoop/mr	hortonworks.com	hue	nice!	2014-04-02T11:33:24.567-0800
/hadoop/mr	hortonworks.com	sqoop	nice!	2014-04-02T12:33:24.765-0800

We have posts and comments in separate files. For simplicity the values are tab-separated, so we can use the default loader, PigStorage. Please keep in mind that PigStorage does not support proper escaping and would in such a case be a bad choice, as title and text could easily contain tabs as well as line breaks.

Since posts and comments are in separate files, we have to group them together. This sometimes does not look very nice in Pig and requires some renaming. In some cases Pig prepends the name of a relation to the field names to make them unique. Often this can be ignored, so that posts::uri and uri are the same, but unfortunately not here. Renaming doesn’t hurt either, as it becomes part of an appropriate map or reduce step anyway and doesn’t require a separate step. Let’s have a look at how we could load the posts and comments and then store them into Hive:

-- load posts
posts = LOAD 'posts.tsv' AS (
   uri: chararray, 
   host: chararray, 
   type: int, 
   publish_date: datetime, 
   crawled: datetime, 
   title: chararray, 
   text: chararray);

-- load comments
comments = LOAD 'comments.tsv' AS (
   uri: chararray, 
   host: chararray, 
   author: chararray, 
   text: chararray, 
   published: datetime);

-- group posts and comments together
posts_comments = COGROUP 
   posts BY (uri, host), 
   comments BY (uri, host);

-- prepare expected output
blog_posts = FOREACH posts_comments {
   coms = FOREACH comments GENERATE author, text, published;
   post = LIMIT posts 1;       
   GENERATE FLATTEN(post), coms AS comments;
}
blog_posts = FOREACH blog_posts GENERATE post::uri AS uri, post::host AS host, post::type AS type, 
post::publish_date AS publish_date, post::crawled AS crawled, post::title AS title, post::text AS text, comments;

STORE blog_posts INTO 'web.blogs' USING org.apache.hive.hcatalog.pig.HCatStorer('crawl_day=2014-05-05');

To execute the above you have to run Pig with HCatalog support, as Pig does not include the required libraries by default. In addition, depending on your setup, it might be necessary to point to the Hive metastore. You can do that with the following command:

pig -useHCatalog [-Dhive.metastore.uris=thrift://metastore.server:9083]

With -useHCatalog Pig can make use of HCatalog, and here we additionally point it to the location of the Hive metastore.

That’s it! You should now be able to execute the following:

hive> use web;
OK
Time taken: 2.936 seconds
hive> select * from blogs;
OK
/ml/train	mapr.com	3	NULL	2014-03-31 18:03:45.003	SVM	SVM Train	[]	2014-05-05
/hadoop/mr	hortonworks.com	2	NULL	2014-03-31 18:03:44.123	MR Title	MR Text	[{"author":"sqoop","text":"nice!","published":null},{"author":"hue","text":"nice!","published":null}]	2014-05-05
/hadoop/cdh54	cloudera.com	3	NULL	2014-03-31 18:03:44.523	CDH 54	CDH Text	[]	2014-05-05
Time taken: 3.124 seconds, Fetched: 3 row(s)
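
To dig into the nested comments you can explode the ARRAY of STRUCTs with a LATERAL VIEW, and a filter on the crawl_day partition column keeps the scan limited to a single partition. A small sketch (the aliases are of course arbitrary):

-- Turn each comment into its own row and restrict the query to one partition.
SELECT b.uri, c.author, c.text
FROM web.blogs b
LATERAL VIEW explode(b.comments) c_view AS c
WHERE b.crawl_day = '2014-05-05';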
