Running PySpark with Conda Env

Controlling the environment of an application is vital for its functionality and stability. Especially in a distributed environment, it is important for developers to have control over the versions of dependencies. In such a scenario it is critical to ensure that possibly conflicting requirements of multiple applications do not interfere with each other.

That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically a Linux container or Docker container – that is controlled by the developer. In this post we show what this means for the Python environments used by Spark.
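As a rough illustration of shipping a self-contained Python environment with a job, the sketch below only assembles a `spark-submit` command line. It assumes the Conda environment has already been packed into an archive (for example with a tool such as conda-pack). The file names `job.py` and `pyspark_conda_env.tar.gz`, the alias `environment`, and the helper `build_spark_submit` are hypothetical and not taken from the post.

```python
from typing import List


def build_spark_submit(app: str, env_archive: str, alias: str = "environment") -> List[str]:
    """Assemble a spark-submit command that ships a packed Conda env to YARN.

    `env_archive` is assumed to be an already packed environment; `alias` is
    the directory name the archive is unpacked to inside each YARN container.
    """
    python_path = f"./{alias}/bin/python"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Distribute the archive to every container and unpack it under `alias`
        "--archives", f"{env_archive}#{alias}",
        # Point the driver (application master) and the executors at the shipped interpreter
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_path}",
        "--conf", f"spark.executorEnv.PYSPARK_PYTHON={python_path}",
        app,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    print(" ".join(build_spark_submit("job.py", "pyspark_conda_env.tar.gz")))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting issues.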

Running PySpark with Virtualenv

Controlling the environment of an application is vital for its functionality and stability. Especially in a distributed environment, it is important for developers to have control over the versions of dependencies. In such a scenario it is critical to ensure that possibly conflicting requirements of multiple applications do not interfere with each other.

That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically a Linux (Java) container or Docker container – that is controlled by the developer. In this post we show what this means for the Python environments used by Spark.

Install SVM-light for R

SVMlight is an implementation of the Support Vector Machine that provides efficient estimation methods for both error rate and precision/recall. SVMlight exploits the fact that the results of most leave-one-out runs (often more than 99%) are predetermined and need not be computed. Furthermore, it can train SVMs with cost models. Many tasks have sparse instance vectors, and the implementation uses this property to achieve a very compact and efficient representation.

Uninstalling and Cleaning a HDP Node

Sometimes it becomes inevitable to clean an HDP install off a node. Just as no two installs are ever quite the same, removing an install from a node is not a straight path. As the documentation advises, removing the installed packages with the system's package manager is a good start. But some folders might remain, and databases will be ignored.

Scripting Scala – JSR-223

With the release of Scala 2.11, Scala became a fully JSR-223 compliant scripting language for Java. JSR-223 is the Java Specification Request that gives scripting languages an interface to Java and allows Java to use scripting languages inside applications.

In Java 8 the Nashorn scripting engine was released as a native component of the JDK to support JavaScript within applications. This is possible through another JSR delivered with Java 7, namely JSR-292, which brings support for invokedynamic, a bytecode instruction that supports dynamically typed languages on the JVM. Nashorn can be seen as a proof of concept of this newly added functionality.

Fitbit Visualization with Apache Zeppelin

In a recent post I demonstrated how easy it is to connect to a REST API like Fitbit's with Scala to collect JSON data. Building on the results of that post, I would like to demonstrate how Apache Zeppelin can be used to fetch the data and, in the end, visualize it. Once the data is collected, Zeppelin makes it easy to visualize the output through different graphs.

Apache Zeppelin itself is a notebook-like, web-based data analytics tool with a specific focus on exploratory data analysis in modern big data architectures, supporting multiple interpreters like Tajo, Spark, Hive, HBase and more. That said, it is important to point out that in the case described here only Scala is used to display the received data. But this use case could easily be extended to include Apache Hive or Spark.

Access to Fitbit API with Scala

The developer API of Fitbit provides access to the data collected by its personal trackers for use in custom application development. Besides read access, write access is also available, not only to your own data but also on behalf of other platform users via OAuth authentication. Comprehensive documentation of the Fitbit API can be found here: https://dev.fitbit.com/docs/ .

Automated Kerberos Install for HDP w/ Ambari + Puppet

With the release of Ambari 2.x, kerberizing an HDP install improved quite a bit. Compared with the earlier post “Kerberized Hadoop Cluster – A Sandbox Example”, most of the steps described there are much easier today and can be automated. For a long time I had wanted to include this in my existing Vagrant project for an end-to-end setup of a kerberized cluster. With the writing of this post I finally had the opportunity to do so.

In this post I would like to describe the parts added to the Vagrant setup to accomplish an end-to-end setup of a kerberized HDP cluster. Before the final step of setting up the cluster through the Ambari REST API, a KDC with credentials needs to be created. A Puppet module was created and included to handle the installation of MIT Kerberos.

10 Resources about Deep Learning & NLP

One approach to natural language processing that has gained tremendous traction recently is the vectorization of words to capture their meaning in a complex context. Deep learning based sentiment analysis and general classifiers help improve accuracy compared to results achieved with “classical” text analysis approaches. Tools like Word2Vec or GloVe are based on trained vectorized learning models that help make NLP a commodity available to a broad audience. Here is a list of resources to help you get started:

  1. Deep Learning for Natural Language Processing – Text By the Bay 2015 (Youtube)
  2. Deep Learning – Prof. Geoff Hinton (Youtube)
  3. Linguistic Regularities in Sparse and Explicit Word Representations (pdf)(slides)
  4. Efficient Estimation of Word Representations in Vector Space – Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
  5. word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method, arXiv 2014. – Goldberg, Y., and Levy, O.
  6. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
  7. A Word is Worth a Thousand Vectors. MultiThreaded, StitchFix, 11 March 2015. (Youtube)
  8. A Neural Network For Factoid Question Answering Over Paragraphs. Proceedings of EMNLP 2014 – Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., and Daume III, H.
  9. Deep or Shallow, NLP is Breaking Out
  10. Deep Learning in a Nutshell (Part 1)(Part 2)(Part 3)
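
To make the idea of word vectorization mentioned above a little more concrete, here is a minimal sketch of how similarity between word vectors is commonly measured with cosine similarity. The three-dimensional vectors are invented toy values for illustration only; real Word2Vec or GloVe vectors are learned from large corpora and typically have hundreds of dimensions.

```python
import math
from typing import Dict, List

# Toy vectors, made up for illustration; not taken from any trained model
vectors: Dict[str, List[float]] = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Words used in similar contexts should end up with a higher similarity
    print("king vs queen:", round(cosine(vectors["king"], vectors["queen"]), 3))
    print("king vs apple:", round(cosine(vectors["king"], vectors["apple"]), 3))
```

With these toy values, "king" scores closer to "queen" than to "apple", which is the kind of regularity the resources above explore on real data.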

Plotting Graphs – Data Science with Scala

Data visualization is an integral part of data science. The programming language Scala has many characteristics that make it popular for data science use cases alongside languages like R and Python. Immutable data structures and functional constructs are some of the features that make it attractive to data scientists. Popular big data frameworks like Spark or Flink have contributed their fair share to an ever-growing ecosystem of tools and libraries for data analysis and engineering. Scala is particularly well suited to building robust libraries for scalable data analytics.

In this post we are going to introduce Breeze, a library for fast linear algebraic manipulation of data sets, together with tools for visualization and NLP. Starting with the basic creation of vectors, we will create an application for plotting stock prices. The stock data is obtained from Yahoo Finance, but can also be downloaded here for SAP, YAHOO, BMW, and IBM.