Get Started with Hadoop – Now!!

Looking back, it is remarkable how mature Hadoop has become. Not only the maturity itself but also the pace of development is impressive. Early projects jumped right onto the Hadoop bandwagon with big, if not always clear, expectations. What was great about those times was that it felt like a gold rush, and Hadoop’s simple and inherently scalable paradigm made sure the path was lined with successful projects. In his recent book, Arun Murthy identifies four phases Hadoop has gone through so far:

Phase 0: The Era of Ad Hoc Hadoop
Phase 1: Hadoop on Demand
Phase 2: Dawn of the Shared Cluster
Phase 3: Emergence of YARN


Getting Started with ORC and HCatalog

ORC (Optimized Row Columnar) is a columnar file format designed to improve the performance of Hive. Through the Hive metastore and HCatalog, ORC data can also be read, written, and processed by MapReduce, Pig, Cascading, and other tools. It is very similar to Parquet, which is being developed by Cloudera and Twitter. Both are part of the most current Hive release and can be used immediately. In this post I would like to describe some of the key concepts of ORC and demonstrate how to get started quickly using HCatalog.

Forensic Analysis of a Spam Attack

Recently one of the sites I host was targeted by some script kiddie who used a fairly old exploit in a WordPress theme to misuse the server for sending spam. In general, this works by exploiting a known vulnerability in the blog or CMS software, or in an add-on you use, which gives the attacker access to the file system to upload arbitrary scripts. They then upload so-called injection scripts, for example C99. These scripts can be executed from outside and can be used to upload more files, read files containing login information, query your database, or do whatever else is possible from that point on in your system.

This has happened to me before, and it is more than annoying because it poses a threat to the mailing system I and so many others rely on. Becoming blacklisted is a real pain and causes real damage. This time I took the time to investigate the incident in detail, and I want to give an overview here and document the steps I followed.

Map Reduce – tf-idf

tf-idf is an approach to determining relevant documents from the counts of the words they contain. Because raw counts alone would emphasize common words like ‘the’, tf-idf weights each word by how rarely it appears across the whole set of documents: the inverse document frequency. Here I’ll try to give a simple MapReduce implementation. As a little quirk, Avro will be used to model the representation of a document. We are going to need secondary sorting to reach an effective implementation.
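To make the weighting concrete, here is a minimal in-memory sketch in Python rather than the MapReduce and Avro implementation the post goes on to describe. The function names (`map_phase`, `reduce_counts`, `tf_idf`) and the sample documents are made up for illustration, and the formula used (term frequency times the log of total documents over documents containing the word) is one common variant, assumed here.

```python
import math
from collections import Counter, defaultdict


def map_phase(doc_id, text):
    """Emit ((word, doc_id), 1) pairs, much like a word-count mapper would."""
    for word in text.lower().split():
        yield (word, doc_id), 1


def reduce_counts(pairs):
    """Sum the emitted counts per (word, doc_id) key, like a reducer."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def tf_idf(docs):
    """Return {(word, doc_id): score} for a dict of doc_id -> text."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    counts = reduce_counts(pairs)

    words_per_doc = Counter()          # total number of words in each document
    docs_per_word = defaultdict(set)   # documents in which each word appears
    for (word, doc_id), n in counts.items():
        words_per_doc[doc_id] += n
        docs_per_word[word].add(doc_id)

    scores = {}
    for (word, doc_id), n in counts.items():
        tf = n / words_per_doc[doc_id]
        idf = math.log(len(docs) / len(docs_per_word[word]))
        scores[(word, doc_id)] = tf * idf
    return scores


# Hypothetical sample documents.
docs = {"d1": "the cat sat on the mat", "d2": "the dog barked"}
scores = tf_idf(docs)
```

In this sketch a word that occurs in every document, such as ‘the’, gets an inverse document frequency of log(1) = 0 and therefore a score of zero, while a word found in only one document keeps a positive score.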


Reliably Store Postfix Logs in S3 with Apache Flume and rsyslog

Flume is a distributed system for aggregating log files into the Hadoop Distributed File System (HDFS). It has a simple design built from Events, Sources, Sinks, and Channels, which can be connected into a complex multi-hop architecture.
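As a rough mental model of how these pieces fit together, the following Python toy passes events from a source through a channel to a sink, and chains two such hops. It is not Flume’s API or configuration format; the class and function names (`Channel`, `source`, `sink`) and the sample log lines are hypothetical and exist only to illustrate the data flow.

```python
from collections import deque


class Channel:
    """Toy buffer between a source and a sink (not Flume's Channel API)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None if the channel is empty.
        return self._events.popleft() if self._events else None


def source(lines, channel):
    """Wrap each incoming line in an event and put it on the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})


def sink(channel, deliver):
    """Drain the channel and hand each event body to `deliver`."""
    while True:
        event = channel.take()
        if event is None:
            break
        deliver(event["body"])


# Two hops: the first sink feeds the second hop's channel, whose sink
# writes into a list standing in for the final store.
hop1, hop2 = Channel(), Channel()
stored = []

source(["mail from=<a@example.org>", "mail to=<b@example.org>"], hop1)
sink(hop1, lambda body: source([body], hop2))
sink(hop2, stored.append)
```

After this runs, `stored` holds both lines in their original order, which is the sense in which hops can be chained into a multi-hop flow.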

While Flume is designed to be resilient “with tunable reliability mechanisms for fail-over and recovery”, in this blog post we’ll also look at the reliable forwarding offered by rsyslog, which we are going to use to store Postfix logs in Amazon S3.
