Inspired by Twitter’s publication about “Large Scale Machine Learning”, I turned to Pig when it came to implementing an SVM classifier for Record Linkage. While searching for different solutions I also came across a presentation by the Huffington Post that uses a similar approach to training multiple SVM models. The overall idea is to use Hadoop to train multiple models with different parameters at the same time and then select the best model for the actual classification. There are some limitations to this approach, which I’ll try to address at the end of this post, but first let me describe my approach to training multiple SVM classifiers with Pig.
Disclaimer: This post does not describe the process of training one model in parallel but training multiple models at the same time on multiple machines.
Continue reading “Training Multiple SVM Classifiers with Apache Pig” →
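The idea from the excerpt above, training one model per parameter set and keeping the best, can be sketched in plain Python. This is only an illustration, not the Pig script from the post: the parameter names `C` and `gamma`, the helper names and the sample values are all assumptions, and the actual training step is left out.

```python
from itertools import product


def parameter_grid(costs, gammas):
    """Return one dict per (C, gamma) combination, each with a stable id."""
    return [
        {"model_id": i, "C": c, "gamma": g}
        for i, (c, g) in enumerate(product(costs, gammas))
    ]


def replicate(records, grid):
    """Yield (model_id, record) pairs so every model sees every record.

    Grouping these pairs by model_id gives each reducer the full
    training set for exactly one parameter combination.
    """
    for params in grid:
        for record in records:
            yield params["model_id"], record


def select_best(scores):
    """Pick the model_id with the highest validation score."""
    return max(scores, key=scores.get)


# Hypothetical parameter values and records, for illustration only.
grid = parameter_grid(costs=[0.1, 1.0, 10.0], gammas=[0.01, 0.1])
pairs = list(replicate(["rec-a", "rec-b"], grid))
```

Note that the training data is copied once per parameter combination, so the shuffled data volume grows linearly with the size of the grid.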
If you use Python with Hadoop Streaming a lot, then you probably know the trouble of keeping all nodes up to date with the required packages. A nice way to work around this is to use a Virtualenv for each streaming project. Besides removing the hurdle of keeping all nodes in sync with the necessary libraries, another advantage of using Virtualenv is the possibility to try different versions and setups within the same project seamlessly.
In this example we are going to create a Python job that counts the n-grams of hotel names in relation to the country the hotel is located in. Besides the use of a Virtualenv in which we install NLTK, we are going to touch on the use of Avro as an input for a Python streaming job, as well as secondary sorting with the use of KeyFieldBasedPartitioner and KeyFieldBasedComparator.
Continue reading “Python Virtualenv with Hadoop Streaming” →
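To give a rough idea of the n-gram job described above, here is a minimal mapper sketch. It assumes tab-separated input lines of the form `country<TAB>hotel name` and uses plain whitespace tokenization instead of NLTK and Avro, so it is a simplification rather than the job from the post; the function names are made up for this example.

```python
def ngrams(tokens, n):
    """Return all word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def map_line(line, n=2):
    """Turn 'country<TAB>hotel name' into 'country<TAB>ngram<TAB>1' lines."""
    country, _, name = line.rstrip("\n").partition("\t")
    tokens = name.lower().split()
    return ["%s\t%s\t1" % (country, gram) for gram in ngrams(tokens, n)]


def run(stream, n=2):
    """Apply map_line to every input line and yield the output lines."""
    for line in stream:
        for out in map_line(line, n):
            yield out
```

In a streaming script the output would be printed with something like `for out in run(sys.stdin): print(out)`. Emitting the country as the first key field and the n-gram as the second is what would allow partitioning on the country while sorting on both fields.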
RHadoop is probably one of the best ways to take advantage of Hadoop from R by making use of Hadoop’s Streaming capabilities. Another way to make R work with Big Data in general is to use SQL through, for example, a JDBC connector. For Hive such a possibility exists with the Hive Server 2 Client JDBC. In combination with UDFs this has the potential to be quite a powerful approach to leverage the best of the two. In this post I would like to demonstrate the preliminary steps necessary to make R and Hive work together.
If you have the Hortonworks Sandbox set up, you should be able to simply follow along as you read. If not, you can probably adapt the steps where appropriate. First we’ll have to install R on a machine with access to Hive. By default this means the machine should be able to access port 10000 or 10001 on the host where the Hive server is installed. Next we are going to set up all required packages and use a sample table in Hive to query from R.
Continue reading “Using Hive from R with JDBC” →
MarkLogic is one of the leading Enterprise NoSQL vendors and offers, through its server product, a database centered mainly around search. Its document-centric design based on XML makes it attractive for content-focused applications. MarkLogic Server combines a transactional document repository with search indexing and an application server.
The underlying data format for all stored documents, which can be either text or binary files, is XML. It is considered schema-aware: a schema is not required prior to insertion but can be applied afterwards as needed. MarkLogic Server applies a full-text index to the documents stored within its repository. Indexes for search are also applied to the paths of the XML structure. This effectively makes documents searchable right after insertion. This approach of advanced search around a document-based design makes it similar to a combination of MongoDB with Elasticsearch.
Developers can get started with MarkLogic Server 7 quite quickly by using the Amazon Machine Image (AMI) supplied here. For this post we are going to use that image to build a small search application around the exported posts of this blog, relying solely on MarkLogic Server.
Continue reading “MarkLogic: NoSQL Search for Enterprise” →