Using Hive from R with JDBC

RHadoop is probably one of the best ways to take advantage of Hadoop from R, making use of Hadoop’s Streaming capabilities. Another way to make R work with Big Data in general is to use SQL, for example through a JDBC connector. For Hive such a connector exists in the form of the HiveServer2 JDBC client. In combination with UDFs this has the potential to be quite a powerful approach that leverages the best of both worlds. In this post I would like to demonstrate the preliminary steps necessary to make R and Hive work together.

If you have the Hortonworks Sandbox set up you should be able to simply follow along as you read; if not, you should be able to adapt where appropriate. First we’ll have to install R on a machine with access to Hive. By default this means the machine needs to reach port 10000 (or 10001 for HTTP transport), where HiveServer2 listens. Next we are going to set up all required packages and use a sample table in Hive to query from R.
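
A quick sanity check that HiveServer2 is actually reachable before going further, assuming netcat is available on the machine:

$ nc -z localhost 10000 && echo "HiveServer2 reachable"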

Install + Setup of R

First we are going to install R with the required packages to run JDBC:

$ yum install R
$ R CMD javareconf               # update Java settings for R
$ R
> install.packages("rJava")
> install.packages("RJDBC", dep = TRUE)

This is quite straightforward, except for the javareconf step. You might be able to omit it, but if you are having trouble and run into an exception like this:

checking whether JNI programs can be compiled... yes
checking JNI data types... configure: error: One or more JNI types differ from the corresponding native type. You may need to use non-standard compiler flags or a different compiler in order to fix this.
ERROR: configuration failed for package ‘rJava’
* removing ‘/usr/lib64/R/library/rJava’
ERROR: dependency ‘rJava’ is not available for package ‘RJDBC’
* removing ‘/usr/lib64/R/library/RJDBC’

you need to run R CMD javareconf as root. With this in place we are ready to use JDBC from R.
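
For example, with sudo (assuming your user has the rights):

$ sudo R CMD javareconf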

Using Hive JDBC

With the installation of Hive comes a JDBC driver with dependencies on your Hadoop installation. You can find the libraries in the lib folder of your Hive installation, in my case under /usr/lib/hive/lib/, but this depends on your setup. In this folder you’ll find hive-jdbc.jar along with many others. If you don’t find it on your system, then downloading Hive with all libraries could also be an option.
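
A quick way to verify that the driver jar is where you expect it (path assumed from above):

$ ls /usr/lib/hive/lib/ | grep jdbc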

Since the JDBC driver depends on a couple of Hadoop libraries, and for convenience in what follows, we are going to load all jars, including Hadoop common, into R’s classpath in a for loop, after loading the necessary R packages:

> library("DBI")
> library("rJava")
> library("RJDBC")
> for(l in list.files('/usr/lib/hive/lib/')){ .jaddClassPath(paste("/usr/lib/hive/lib/",l,sep=""))}
> for(l in list.files('/usr/lib/hadoop/')){ .jaddClassPath(paste("/usr/lib/hadoop/",l,sep=""))}

We now have all the jars needed by the Hive JDBC driver on our classpath. To check whether the jars really made it onto the classpath, you can use .jclassPath() to print the current classpath.
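
For example, a quick check that the driver jar made it in (.jclassPath() returns the classpath entries as a character vector):

> any(grepl("hive-jdbc", .jclassPath()))   # should be TRUE if the driver jar was picked up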

Hive Sample 08

Hive comes with some sample tables, either pre-installed or ready to be set up after installation. We are going to use these tables to run some sample queries from R. But before we are able to run queries, we have to grant the necessary privileges to the user, in our case ‘hue‘.

To set the required privileges, log in to the machine where the Hive CLI is running and run the following command:

hive> GRANT SELECT ON TABLE sample_08 TO USER hue;
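
Depending on the authorization mode in use, you should be able to verify the grant in the same CLI session; a sketch using Hive’s legacy authorization syntax:

hive> SHOW GRANT USER hue ON TABLE sample_08;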

We can now connect to our Hive server running on localhost, read the sample_08 table, and plot the distribution of salaries:

options(java.parameters = "-Xmx8g")  # only takes effect if set before the JVM is initialized
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default", "hue", "1111")
sample_08 <- dbReadTable(conn, "sample_08")
plot(sample_08$sample_08.salary)  # columns come back prefixed with the table name
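
Pulling a whole table across JDBC only makes sense for small tables. For anything bigger, push the work to Hive and fetch just the result. A minimal sketch, assuming the usual sample_08 columns (code, description, total_emp, salary):

top_salaries <- dbGetQuery(conn, "SELECT description, salary FROM sample_08 ORDER BY salary DESC LIMIT 10")
head(top_salaries)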

From here you should be able to query your Big Data from R using Hive’s JDBC connector.

UPDATE

With the revamp of the Hive JDBC driver it all became a little simpler than the above:

options(java.parameters = "-Xmx8g")  # set before rJava starts the JVM
library(rJava)
library(RJDBC)

# only two jars are needed now: the JDBC driver and hadoop-common
cp <- c("/usr/hdp/current/hive-client/lib/hive-jdbc.jar",
        "/usr/hdp/current/hadoop-client/hadoop-common.jar")
.jinit(classpath = cp)

drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            "/usr/hdp/current/hive-client/lib/hive-jdbc.jar",
            identifier.quote = "`")

conn <- dbConnect(drv, "jdbc:hive2://servername:10000/demo", "user", "password")

show_databases <- dbGetQuery(conn, "show databases")

show_databases
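
From here the usual DBI idioms apply, for example listing the tables in the connected database and closing the connection when you are done:

dbListTables(conn)   # list tables visible through the connection
dbDisconnect(conn)   # release the JDBC connection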


13 thoughts on “Using Hive from R with JDBC”

  1. Hello, I’ve used your directions to enable this connection, but unfortunately, I am getting some errors.
    I am using a Linux server with Hadoop (Hortonworks Sandbox 2.1) and R (3.1.1) installed on the same box. This is the script I am using to connect:

    drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar")
    conn <- dbConnect(…) then fails with:

    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.lang.NoClassDefFoundError: Could not initialize class org.apache.hive.service.auth.HiveAuthFactory

    I have checked that my .jclassPath() contains all the jar files in `/usr/lib/hive` and `/usr/lib/hadoop`, but cannot be sure if anything else is missing. I have looked into some of the jar files and found that hive-service.jar contains the HiveAuthFactory class it complains about. Any idea what is causing the problem?
    I am fairly new to R (and programming for that matter) so any specific steps are much appreciated.


  2. First of all, thanks for your nice article.

    I have a similar requirement; the only difference is that I don’t have HiveServer2, only Hive.

    I’m using RJDBC 0.2-5 to connect to Hive in RStudio. My server has hadoop-2.4.1 and hive-0.14. I follow the steps below to connect to Hive.

    library(DBI)
    library(rJava)
    library(RJDBC)
    .jinit(parameters = "-DrJava.debug=true")
    drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
                c("/home/packages/hive/New folder3/commons-logging-1.1.3.jar",
                  "/home/packages/hive/New folder3/hive-jdbc-0.14.0.jar",
                  "/home/packages/hive/New folder3/hive-metastore-0.14.0.jar",
                  "/home/packages/hive/New folder3/hive-service-0.14.0.jar",
                  "/home/packages/hive/New folder3/libfb303-0.9.0.jar",
                  "/home/packages/hive/New folder3/libthrift-0.9.0.jar",
                  "/home/packages/hive/New folder3/log4j-1.2.16.jar",
                  "/home/packages/hive/New folder3/slf4j-api-1.7.5.jar",
                  "/home/packages/hive/New folder3/slf4j-log4j12-1.7.5.jar",
                  "/home/packages/hive/New folder3/hive-common-0.14.0.jar",
                  "/home/packages/hive/New folder3/hadoop-core-0.20.2.jar",
                  "/home/packages/hive/New folder3/hive-serde-0.14.0.jar",
                  "/home/packages/hive/New folder3/hadoop-common-2.4.1.jar"),
                identifier.quote = "`")

    conHive <- dbConnect(drv, "jdbc:hive://myserver:10000/default", "usr", "pwd")
    But I am always getting the following error:

    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
    I even tried different versions of the Hive jars, including hive-jdbc-standalone.jar, but nothing seems to work. I also tried RHive to connect to Hive, but there was also no success.

    Can you please help? I’m kind of stuck 😦


    1. You would typically use LOAD DATA LOCAL INPATH .. to load a local file into Hive. The problem is that running this through JDBC will execute it local to the HiveServer2, so the file would need to be local to HS2 and not to the client.

      Currently the best way to upload a local CSV into a Hive table is to use an HDFS put and then create an EXTERNAL TABLE based on the HDFS path; see the sketch after the links below.

      LOAD documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
      EXTERNAL TABLE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
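
      A minimal sketch of that approach from R via RJDBC, reusing the conn from the post above; the local file, HDFS path, and table schema are made up for illustration:

      # copy the local file to HDFS (file name and target path are hypothetical)
      system("hdfs dfs -put /tmp/data.csv /user/hue/data/")
      # define an external table over that HDFS directory; the two-column schema is illustrative
      dbSendUpdate(conn, "CREATE EXTERNAL TABLE my_csv (code STRING, salary INT)
                          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                          LOCATION '/user/hue/data/'")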


    1. I have pulled the data onto my client machine and processed it using some statistical techniques. Now I have a table created in Hive, and I want to write the processed data to this table.


      1. With ACID support in the most recent Hive release, you should be able to use INSERT and UPDATE.
        You can always write the data back, e.g. as a CSV to HDFS, and define a Hive schema on that as your new table.
        What’s most likely not going to work is the use of LOAD DATA LOCAL INPATH, for the reasons above.
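
        A minimal sketch of writing rows back from R via RJDBC; the table name and values are hypothetical, and INSERT … VALUES needs Hive 0.14 or later:

        # 'results' is a hypothetical ACID-enabled table with columns (id INT, score DOUBLE)
        dbSendUpdate(conn, "INSERT INTO TABLE results VALUES (1, 0.42)")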

