SVMlight is an implementation of the Support Vector Machine providing methods for efficient estimation methods for both error rate and precision/recall. SVMlight exploits that the results of most leave-one-outs (often more than 99%) are predetermined and need not be computed. Further more it can also train SVMs with cost models. Many tasks have the property of sparse instance vectors. This implementation makes use of this property which leads to a very compact and efficient representation. Continue reading “Install SVM-light for R”
Sometimes you might find yourself in a situation where it becomes inevitable to clean up a node from a HDP install. Just like most installs are never really the same, cleaning a node from an install is not a straight path. As the documentation advises, to remove the installed packages using the systems package manager, is a good start. But some folders might remain and databases will be ignored. Continue reading “Uninstalling and Cleaning a HDP Node”
With the release of Scala 2.11 it became fully JSR-223 compliant scripting language for Java. JSR-223 is the community request to allow scripting language to have an interface to Java and to allow Java to use the scripting language inside of applications.
In a recent post I demonstrated how easy it is to connect to a REST API like the one of Fitbit with Scala to collect JSON data. Taking up the results of that post here, I would like to demonstrate how Apache Zeppelin can be used to also fetch but in the end visualize the data. Based on the once collected data Zeppelin allows to easily visualize the output through different graphs.
Apache Zeppelin itself is a notebook like, web-based data analytic tool with a specific focus on exploratory data analysis in modern BigData architectures supporting multiple interpreters like Tajo, Spark, Hive, HBase and more. Saying this, it is important to point out, that in this here described case only Scala is being used to display the received data. But this use case could easily be extended to include Apache Hive or Spark. Continue reading “Fitbit Visualization with Apache Zeppelin”
The developer API of Fitbit provides access to the data collected by it’s personal trackers for use with custom applications development. Besides read also write access can be used not only to it’s own but on behalf of other platform users via OAuth authentication. A comprehensive documentation of the Fitbit API can be found here: https://dev.fitbit.com/docs/ . Continue reading “Access to Fitbit API with Scala”
With the release of Ambari 2.x kerberizing a HDP install improved quite a bit. Looking back at Kerberized Hadoop Cluster – A Sandbox Example compared to today most of the there described steps are much easier by now and can be automated. For long I was looking to include it into my existing Vagrant project for an end to end setup of a kerberized cluster. With the writing of this post I finally had the opportunity to do so.
In this post I would like to describe the parts added to the Vagrant setting needed to accomplish an end to end setup of a kerberized HDP cluster. Before the final step of the cluster setup by using the Ambari REST API, a KDC with credentials needs to be created. A Puppet module was created and included to achieve the installation of a MIT Kerberos install. Continue reading “Automated Kerberos Install for HDP w/ Ambari + Puppet”
One approach to natural language processing that has gained tremendous traction recently is the vecotrization of words to represent their representation in a complex context. Deep Learning based sentiment analysis and general classifiers help improve the accuracy compared to results achieved with “classical” text analysis approaches. Tools like Word2Vec or GloVe are based on trained vectorized learning models that help in building commodity NLP available to a broad range of audience. Here is a list of resources to help you get started:
- Deep Learning for Natural Language Processing – Text By the Bay 2015 (Youtube)
- Deep Learning – Prof. Geoff Hinton (Youtube)
- Linguistic Regularities in Sparse and Explicit Word Representations (pdf)(slides)
Efficient Estimation of Word Representations in Vector Space – Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
- word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method, arXiv 2014. – Goldberg, Y., and Levy, O.
- GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
- A Word is Worth a Thousand Vectors. MultiThreaded, StitchFix, 11 March 2015. (Youtube)
- A Neural Network For Factoid Question Answering Over Paragraphs. Proceedings of EMNLP 2014 – Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., and Daume III, H.
- Deep or Shallow, NLP is Breaking Out
- Deep Learning in a Nutshell (Part 1)(Part 2)(Part 3)
Data visualization is an integral part of data science. The programming language Scala has many characteristics that make it popular for data science use cases among other languages like R and Python. Immutable data structures and functional constructs are some of the features that make it so attractive to data scientists. Popular big data crunching frameworks like Spark or Flink do have their fair share on an ever growing ecosystem of tools and libraries for data analysis and engineering. Scala is particularly well suited to build robust libraries for scalable data analytics.
In this post we are going to introduce Breeze, a library for fast linear algebraic manipulation of data sets, together with tools for visualization and NLP. Starting with basic creation of vectors, we will create an application for plotting stock prices. The stock data is obtained form Yahoo Finance, but can also be downloaded here for SAP, YAHOO, BMW, and IBM. Continue reading “Plotting Graphs – Data Science with Scala”
Environments dedicated for a HDP install without connection to the internet require a dedicated HDP repository all nodes have access to. While such a setup can differ slightly depending on the connection, if they have temporary or no internet access, in any case they need a file service holding a copy of the HDP repo. Most enterprises have a dedicated infrastructure in place based on Aptly or Satellite. This post describes the setup of an Nginx host serving as a HDP repository host. Continue reading “HDP Repo with Nginx”
The most recent release of Kafka 0.9 with it’s comprehensive security implementation has reached an important milestone. In his blog post Kafka Security 101 Ismael from Confluent describes the security features part of the release very well.
As a part II of the here published post about Kafka Security with Kerberos this post discussed a sample implementation of a Java Kafka producer with authentication. It is part of a mini series of posts discussing secure HDP clients, connecting services to a secured cluster, and kerberizing the HDP Sandbox (Download HDP Sandbox). In this effort at the end of this post we will also create a Kafka Servlet to publish messages to a secured broker.
Kafka provides SSL and Kerberos authentication. Only Kerberos is discussed here. Continue reading “Secure Kafka Java Producer with Kerberos”