MarkLogic: NoSQL Search for Enterprise

MarkLogic is one of the leading Enterprise NoSQL vendors that offers through it’s server product a database centered mainly around search. It’s document centric design based on XML makes it attractive for content focused applications. MarkLogic Server combines a transactional document repository with search indexing and an application server.

The underlying data format for all stored documents, which can either be text or binary files, is XML. It’s considered schema-aware as a schema prior to insertion is not required but can be applied afterwards as needed. MarkLogic Server applies a full-text index to the documents stored within it’s repository. Indexes for search are also applied to the paths of the XML structure. This effectively makes documents search able right after insertion. This approach of advanced search around a document based design make it similar to a combination of MongoDB with ElasticSearch.

Developers can get started with MarkLogic Server 7 quite quickly by using Amazon Machine Image (AMI) supplied here. For this post we are going to use that image to build a small search application around the exported posts of this blog. In this post we are going to strive to build a search application solely around MarkLogic Server.

Continue reading “MarkLogic: NoSQL Search for Enterprise”

Get Started with Hadoop – Now!!

Looking back it is insane how mature Hadoop has become. Not only the maturity itself but also the pace is quite impressive. Early projects jumped right onto the Hadoop wagon without clear but big expectations. Great about this times was that it felt like a gold-rush and Hadoop’s simple and inherently scalable paradigm made sure this path was sticked with successful projects. In his recent Book Arun Murthy identifies 4 Phases Hadoop has gone through so far:

  • Phase 0: The Area of Ad Hoc Hadoop
  • Phase 1: Hadoop on Demand
  • Phase 2: Dawn of the shared Cluster
  • Phase 3: Emergence of YARN

Continue reading “Get Started with Hadoop – Now!!”

Getting Started with ORC and HCatalog

ORC (Optimized Row Columnar) is a columnar file format optimized to improve performance of Hive. Through the Hive metastore and HCatalog reading, writing, and processing can also be accomplished by MapReduce, Pig, Cascading, and so on. It is very similar to Parquet which is being developed by Cloudera and Twitter. Both are part of the most current Hive release and available to be used immediately. In this post I would like to describe some of the key concepts of ORC and demonstrate how to get started quickly using HCatalog. Continue reading “Getting Started with ORC and HCatalog”

Forensic Analysis of a Spam Attack

Recently one of the sites I host was targeted by some script kiddie who used a fairly old exploit in a WordPress theme to misuse the server for sending spam. The way this in general works is that they use a known vulnerability in the Blog or CMS software or addon you use which gives them access to the file system to upload arbitrary scripts. They then upload so called injection scripts, for example C99, or something else. This scripts can be executed from outside and can be used to upload more files, read files containing login information, query your database, or what ever is possible for them to do from that point on in your system.

This has happened to me before and it is more then annoying as this poses a threat to the mailing system I and so many others rely on. Becoming blacklisted is a real pain and a real damage. This time I took the chance and time to investigate the incident in much detail and I want to give here a overview and document the steps I followed. Continue reading “Forensic Analysis of a Spam Attack”

Map Reduce – tf-idf

tf-idf is the approach of determine relevant documents by the count of words they contain. While this would emphasis common words like ‘the’, tf-idf takes for each word it’s ratio of the overall appearence in a set of documents – the inverse-doucment-frequence. Here I’ll try to give a simple MapReduce implemention. As a little quirk Avro will be used to model the representation of a document. We are going to need secondary sorting to reach an effective implementation.

Continue reading “Map Reduce – tf-idf”

Reliably Store Postfix Logs in S3 with Apache Flume and rsyslog

FlumeApache Flume Logo is a distributed system to aggregate log files into the Hadoop Distributed File System (HDFS). It has a simple design of Events, Sources, Sinks, and Channels which can be connected into a complex multi-hop architecture.

While Flume is designed to be resilient “with tunable reliability mechanisms for fail-over and recovery” in this blog post we’ll also look at the reliable forwarding of rsyslog, which we are going to use to store postfix logs in Amazon S3.

Continue reading “Reliably Store Postfix Logs in S3 with Apache Flume and rsyslog”

Web Scraping with JDK 8 ScriptEngine (Nashorn) and Scala

Web scraping is the process of extracting entities from web pages. This entities can either be news articles, blog posts, products, or any other information displayed on the web. Web pages consist of HTML (a XML like structure) and therefor present information in a structured form, in a DOM (Document Object Model), which can be extracted. The structure is usually not very strict or complaint to other web pages and object to change as the web page evolves. Using scripts over compiled programming languages has therefor a lot of advantages. On the other hand compiled languages have often a speed advantage as well as a broad foundation of fundamental libraries. The DOM reference implementation of the W3C is written in Java for example.

Java’s approach of opening up the JVM (Java Virtual Machine) for other (dynamically typed) languages is a great way to combine the benefits of both, a compiled and scripted language. With JDK 8 compiling dynamic languages to the JVM has become simpler with potentially improved implementations of compilers and runtime systems through the invokedynamic instruction. I found that this presentation currently explains best what invokedynamic is and how it works in general.

The language of the web is without doubt JavaScript. Fortunately the JDK 8 comes with a JavaScript implementation, which is called Nashorn and makes use of JSR 292 and invokedynamic. If you download the jdk you can simply get started with the jjs (JavaJavaScript) bin which is a REPL for Nashorn. To learn more about Nashorn you can read here, here, and here. In this post I would like to demonstrate how to use it for web scraping.

Continue reading “Web Scraping with JDK 8 ScriptEngine (Nashorn) and Scala”

Scala: Traits & Actors

Traits are Scala’s way of implementing mixins, which means to combine methods of other classes. It’s a powerful tool and an extension to the Java object model. You can think of Traits as Interfaces with concrete implementations.

The Actor model is a robust concurrency abstraction model that was introduced 1973 by Carl Hewitt, Peter Bishop, and Richard Steiger. A number of programming languages since then have adopted this model, probably most notably is the programming language Erlang. Since Erlang is widely used by the Telecommunications Industry or Facebook Chat the Actor model has already become the foundation of some of our day-to-day communication.

In this post I would like to give a brief overview of this two concepts applied in Scala by also providing some examples.

Continue reading “Scala: Traits & Actors”

Introduction to Node.js at the GfK Nurago and SirValUse Academy 2013

[wp_pdfjs id=1058 ]

Overview

This is an Introduction to Node.js given at the GfK Nurago and SirValUse Academy 2013 in Hamburg. In this talk I try to reason about the use of server-side JavaScript and try to workout what makes Node.js so special – The Event Loop.
At the end of the talk I give some examples of which I think are particular good  use cases for Node.js. I also make some notes about running Node in a productive environment.

For notes Continue reading “Introduction to Node.js at the GfK Nurago and SirValUse Academy 2013”

Closures with JavaScript and Python

Closures are functions or references to functions that hold within their scope non-local variables. This variables endure beyond their existence outside of these functions scope. These variables are therefor enclosed within the lexical-scope of that functions.
This is particular useful for JavaScript where with every function call (even if the same is called recursively) a new execution context is created, and an automatic garbage collection throws out all contexts with no reference. For a detailed explanation Jim Ley’s description of closures in JavaScript has proven itself as a great resource. Continue reading “Closures with JavaScript and Python”