MATLAB™ is a widely used professional tool for numerical processing used across multiple divers disciplines like Physics, Chemistry, and Mathematics. You can encounter multiple public data sets which are published in MATLAB™ format. This article gives a brief example of such data set and reading it from R. Continue reading “Reading Matlab files with R”
SVMlight is an implementation of the Support Vector Machine providing methods for efficient estimation methods for both error rate and precision/recall. SVMlight exploits that the results of most leave-one-outs (often more than 99%) are predetermined and need not be computed. Further more it can also train SVMs with cost models. Many tasks have the property of sparse instance vectors. This implementation makes use of this property which leads to a very compact and efficient representation. Continue reading “Install SVM-light for R”
The Predictive Model Markup Language (PMML) developed by the Data Mining Group is a standardized XML-based representation of mining models to be used and shared across languages or tools. The standardized definition allows a classification model trained with R to be used with Storm for example. Many projects related to Big Data have some support for PMML, which is often implemented by JPMML. Continue reading “JPMML Example Random Forest”
RHadoop is probably one of the best ways to take advantage of Hadoop from R by making use of Hadoop’s Streaming capabilities. Another possibility to make R work with Big Data in general is the use of SQL with for example a JDBC connector. For Hive there exists such a possibility with the Hive Server 2 Client JDBC. In combination with UDFs this has the potential to be quite a powerful approach to leverage the best of the two. In this post I would like to demonstrate the preliminary steps necessary to make R and Hive work.
If you have the Hortonworks Sandbox setup you should be able to simply follow along as you read. If not you probably are able to adapt where appropriate. First we’ll have to install R on a machine with access to Hive. By default this means the machine should be able to access port 1000 or 1001 where the Hive server is installed. Next we are going to use a sample table in Hive to query from R setting up all required packages.