Controlling the environment of an application is vital for its functionality and stability. Especially in a distributed environment, it is important for developers to have control over the versions of their dependencies. In such a scenario, it is critical to ensure that possibly conflicting requirements of multiple applications do not interfere with each other.
That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically in a Linux Container or Docker Container – that is controlled by the developer. In this post we show what this means for Python environments being used by Spark. Continue reading “Running PySpark with Conda Env”
Sometimes you might find yourself in a situation where it becomes inevitable to clean up a node from an HDP install. Just as no two installs are ever quite the same, cleaning a node of an install is not a straight path. Removing the installed packages with the system's package manager, as the documentation advises, is a good start. But some folders might remain, and databases will be ignored. Continue reading “Completely Uninstall and Remove HDP from Nodes”
With the release of Scala 2.11, Scala became a fully JSR-223 compliant scripting language for Java. JSR-223 is the Java Specification Request that gives scripting languages an interface to Java and lets Java applications embed them.
In a recent post I demonstrated how easy it is to connect to a REST API like the one of Fitbit with Scala to collect JSON data. Picking up the results of that post, I would like to demonstrate how Apache Zeppelin can be used not only to fetch the data but also to visualize it. Once the data is collected, Zeppelin makes it easy to visualize the output through different graphs.
Apache Zeppelin itself is a notebook-like, web-based data analytics tool with a specific focus on exploratory data analysis in modern big data architectures, supporting multiple interpreters like Tajo, Spark, Hive, HBase and more. That said, it is important to point out that in the case described here only Scala is used to display the received data. But this use case could easily be extended to include Apache Hive or Spark. Continue reading “Fitbit Visualization with Apache Zeppelin”
The developer API of Fitbit provides access to the data collected by its personal trackers for use in custom application development. Besides read access, write access is also available, not only for your own data but also on behalf of other platform users via OAuth authentication. Comprehensive documentation of the Fitbit API can be found here: https://dev.fitbit.com/docs/ . Continue reading “Access to Fitbit API with Scala”
With the release of Ambari 2.x, kerberizing an HDP install improved quite a bit. Compared to the earlier post “Kerberized Hadoop Cluster – A Sandbox Example”, most of the steps described there are much easier today and can be automated. For a long time I had been looking to include this in my existing Vagrant project for an end-to-end setup of a kerberized cluster. With the writing of this post I finally had the opportunity to do so.
In this post I would like to describe the parts added to the Vagrant setup that are needed to accomplish an end-to-end setup of a kerberized HDP cluster. Before the final step of setting up the cluster through the Ambari REST API, a KDC with credentials needs to be created. A Puppet module was created and included to install MIT Kerberos. Continue reading “Automated Kerberos Install for HDP w/ Ambari + Puppet”
One approach to natural language processing that has gained tremendous traction recently is the vectorization of words to represent their meaning in a complex context. Deep learning based sentiment analysis and general classifiers help improve accuracy compared to results achieved with “classical” text analysis approaches. Tools like Word2Vec or GloVe are based on trained vectorized learning models that help make NLP a commodity available to a broad audience. Here is a list of resources to help you get started:
- Deep Learning for Natural Language Processing – Text By the Bay 2015 (Youtube)
- Deep Learning – Prof. Geoff Hinton (Youtube)
- Linguistic Regularities in Sparse and Explicit Word Representations (pdf)(slides)
- Efficient Estimation of Word Representations in Vector Space – Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
- word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method, arXiv 2014. – Goldberg, Y., and Levy, O.
- GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
- A Word is Worth a Thousand Vectors. MultiThreaded, StitchFix, 11 March 2015. (Youtube)
- A Neural Network For Factoid Question Answering Over Paragraphs. Proceedings of EMNLP 2014 – Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., and Daume III, H.
- Deep or Shallow, NLP is Breaking Out
- Deep Learning in a Nutshell (Part 1)(Part 2)(Part 3)
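To give a feel for what word vectorization means in practice, here is a minimal sketch of the vector arithmetic that several of the resources above discuss. It is not Word2Vec or GloVe itself: the three-dimensional vectors are made up purely for illustration, and the helper names `cosine` and `analogy` are hypothetical. Real models learn vectors with hundreds of dimensions from large text corpora.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# They are NOT the output of Word2Vec, GloVe or any trained model.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equally sized, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), ignoring the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


if __name__ == "__main__":
    # "king" - "man" + "woman" lands closest to "queen" for these toy vectors.
    print(analogy("king", "man", "woman"))
```

With these hand-picked vectors, `analogy("king", "man", "woman")` returns `"queen"`. With trained embeddings the same arithmetic is applied to learned vectors, and the quality of the result depends on the model and its training corpus.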
Data visualization is an integral part of data science. The programming language Scala has many characteristics that make it popular for data science use cases alongside languages like R and Python. Immutable data structures and functional constructs are some of the features that make it so attractive to data scientists. Popular big data frameworks like Spark and Flink have contributed their fair share to an ever-growing ecosystem of tools and libraries for data analysis and engineering. Scala is particularly well suited to building robust libraries for scalable data analytics.
In this post we are going to introduce Breeze, a library for fast linear algebraic manipulation of data sets, together with tools for visualization and NLP. Starting with the basic creation of vectors, we will create an application for plotting stock prices. The stock data is obtained from Yahoo Finance, but can also be downloaded here for SAP, YAHOO, BMW, and IBM. Continue reading “Plotting Graphs – Data Science with Scala”
Environments dedicated to an HDP install without a connection to the internet require a dedicated HDP repository that all nodes can access. Such a setup can differ slightly depending on whether the nodes have temporary or no internet access, but in either case they need a file service holding a copy of the HDP repo. Most enterprises have a dedicated infrastructure in place based on Aptly or Satellite. This post describes the setup of an Nginx host serving as an HDP repository host. Continue reading “HDP Repo with Nginx”
At some point you might need to connect your dashboard, data ingestion service or similar to a secured and kerberized HDP cluster. Most Java-based web containers support Kerberos for both client-side and server-side communication. Kerberos does require very thoughtful configuration, but it rewards its users with an almost completely transparent authentication implementation that simply works. The steps described in this post should enable you to connect your application to a secured HDP cluster. For further support, read the links listed at the end of this post. A sample project is provided on GitHub for hands-on exercises. Continue reading “Connecting Tomcat to a Kerberized HDP Cluster”