Using Hive from R with JDBC

RHadoop is probably one of the best ways to take advantage of Hadoop from R, making use of Hadoop’s Streaming capabilities. Another way to make R work with Big Data in general is to use SQL, for example through a JDBC connector. For Hive such a connector exists in the form of the HiveServer2 JDBC client. In combination with UDFs this has the potential to be quite a powerful approach that leverages the best of both worlds. In this post I would like to demonstrate the preliminary steps necessary to make R and Hive work together.

If you have the Hortonworks Sandbox set up you should be able to simply follow along as you read; if not, you should be able to adapt where appropriate. First we’ll have to install R on a machine with access to Hive. By default this means the machine needs to reach port 10000 (or 10001 for HTTP transport), where HiveServer2 listens. Next we are going to set up all required packages and use a sample table in Hive to query from R.
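
A quick sanity check that HiveServer2 is actually reachable before going further, assuming netcat is available on the machine:

$ nc -z localhost 10000 && echo "HiveServer2 reachable"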

Install + Setup of R

First we are going to install R with the required packages to run JDBC:

$ yum install R
$ R CMD javareconf               # update Java settings for R
$ R
> install.packages("rJava")
> install.packages("RJDBC", dep = TRUE)

This is quite straightforward, except for the javareconf step. You might be able to omit it, but if you are having trouble and run into an exception like this:

checking whether JNI programs can be compiled... yes
checking JNI data types... configure: error: One or more JNI types differ from the corresponding native type. You may need to use non-standard compiler flags or a different compiler in order to fix this.
ERROR: configuration failed for package ‘rJava’
* removing ‘/usr/lib64/R/library/rJava’
ERROR: dependency ‘rJava’ is not available for package ‘RJDBC’
* removing ‘/usr/lib64/R/library/RJDBC’

you need to run R CMD javareconf as root. With this in place we are ready to use JDBC from R.
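
For example, with sudo (assuming your user has the rights):

$ sudo R CMD javareconf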

Using Hive JDBC

With the installation of Hive comes a JDBC driver with dependencies on your Hadoop installation. You can find the libraries in the lib folder of your Hive installation, in my case under /usr/lib/hive/lib/, but this depends on your setup. In this folder you’ll find hive-jdbc.jar along with many others. If you don’t find it on your system, then downloading Hive with all libraries could also be an option.
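
A quick way to verify that the driver jar is where you expect it (path assumed from above):

$ ls /usr/lib/hive/lib/ | grep jdbc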

Since the JDBC driver depends on a couple of Hadoop libraries, and for convenience in what follows, we are going to load all jars, including Hadoop common, into R’s classpath in a for loop, after loading the necessary R packages:

> library("DBI")
> library("rJava")
> library("RJDBC")
> for(l in list.files('/usr/lib/hive/lib/')){ .jaddClassPath(paste("/usr/lib/hive/lib/",l,sep=""))}
> for(l in list.files('/usr/lib/hadoop/')){ .jaddClassPath(paste("/usr/lib/hadoop/",l,sep=""))}

We now have all the jars needed by the Hive JDBC driver on our classpath. To check whether the jars really made it onto the classpath, you can use .jclassPath() to print the current classpath.
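
For example, a quick check that the driver jar made it in (.jclassPath() returns the classpath entries as a character vector):

> any(grepl("hive-jdbc", .jclassPath()))   # should be TRUE if the driver jar was picked up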

Hive Sample 08

Hive comes with some sample tables, either pre-installed or ready to be set up after installation. We are going to use these tables to run some sample queries from R. But before we are able to run queries, we have to grant the necessary privileges to the user, in our case ‘hue‘.

To set the required privileges, log in to the machine where the Hive CLI is running and run the following command:

hive> GRANT SELECT ON TABLE sample_08 TO USER hue;
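
Depending on the authorization mode in use, you should be able to verify the grant in the same CLI session; a sketch using Hive’s legacy authorization syntax:

hive> SHOW GRANT USER hue ON TABLE sample_08;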

We can now connect to our Hive server running on localhost, read the sample_08 table, and plot the distribution of salaries:

options(java.parameters = "-Xmx8g")  # only takes effect if set before the JVM is initialized
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default", "hue", "1111")
sample_08 <- dbReadTable(conn, "sample_08")
plot(sample_08$sample_08.salary)  # columns come back prefixed with the table name
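
Pulling a whole table across JDBC only makes sense for small tables. For anything bigger, push the work to Hive and fetch just the result. A minimal sketch, assuming the usual sample_08 columns (code, description, total_emp, salary):

top_salaries <- dbGetQuery(conn, "SELECT description, salary FROM sample_08 ORDER BY salary DESC LIMIT 10")
head(top_salaries)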

From here you should be able to query your Big Data from R using Hive’s JDBC connector.

UPDATE

With the revamp of the Hive JDBC driver it all became a little simpler than the above:

options(java.parameters = "-Xmx8g")  # set before rJava starts the JVM
library(rJava)
library(RJDBC)

# only two jars are needed now: the JDBC driver and hadoop-common
cp <- c("/usr/hdp/current/hive-client/lib/hive-jdbc.jar",
        "/usr/hdp/current/hadoop-client/hadoop-common.jar")
.jinit(classpath = cp)

drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            "/usr/hdp/current/hive-client/lib/hive-jdbc.jar",
            identifier.quote = "`")

conn <- dbConnect(drv, "jdbc:hive2://servername:10000/demo", "user", "password")

show_databases <- dbGetQuery(conn, "show databases")

show_databases
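
From here the usual DBI idioms apply, for example listing the tables in the connected database and closing the connection when you are done:

dbListTables(conn)   # list tables visible through the connection
dbDisconnect(conn)   # release the JDBC connection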


13 thoughts on “Using Hive from R with JDBC”

  1. Hello, I’ve used your directions to enable this connection, but unfortunately, I am getting some errors.
    I am using a Linux server with Hadoop (Hortonworks Sandbox 2.1) and R (3.1.1) installed on the same box. This is the script I am using to connect:

    drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar")
    conn <- dbConnect(…) then fails with:

    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.lang.NoClassDefFoundError: Could not initialize class org.apache.hive.service.auth.HiveAuthFactory

    I have checked that my .jclassPath() contains all the jar files in `/usr/lib/hive` and `/usr/lib/hadoop`, but cannot be sure if anything else is missing. I have looked into some of the jar files and found that hive-service.jar contains the HiveAuthFactory class it complains about. Any idea what is causing the problem?
    I am fairly new to R (and programming for that matter) so any specific steps are much appreciated.


  2. First of all, thanks for your nice article.

    I have a similar requirement; the only difference is that I don’t have HiveServer2, only Hive.

    I’m using RJDBC 0.2-5 to connect to Hive in RStudio. My server has hadoop-2.4.1 and hive-0.14. I follow the steps below to connect to Hive.

    library(DBI)
    library(rJava)
    library(RJDBC)
    .jinit(parameters = "-DrJava.debug=true")
    drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
                c("/home/packages/hive/New folder3/commons-logging-1.1.3.jar",
                  "/home/packages/hive/New folder3/hive-jdbc-0.14.0.jar",
                  "/home/packages/hive/New folder3/hive-metastore-0.14.0.jar",
                  "/home/packages/hive/New folder3/hive-service-0.14.0.jar",
                  "/home/packages/hive/New folder3/libfb303-0.9.0.jar",
                  "/home/packages/hive/New folder3/libthrift-0.9.0.jar",
                  "/home/packages/hive/New folder3/log4j-1.2.16.jar",
                  "/home/packages/hive/New folder3/slf4j-api-1.7.5.jar",
                  "/home/packages/hive/New folder3/slf4j-log4j12-1.7.5.jar",
                  "/home/packages/hive/New folder3/hive-common-0.14.0.jar",
                  "/home/packages/hive/New folder3/hadoop-core-0.20.2.jar",
                  "/home/packages/hive/New folder3/hive-serde-0.14.0.jar",
                  "/home/packages/hive/New folder3/hadoop-common-2.4.1.jar"),
                identifier.quote = "`")

    conHive <- dbConnect(drv, "jdbc:hive://myserver:10000/default", "usr", "pwd")
    But I am always getting the following error:

    Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
    I even tried different versions of the Hive jars, including hive-jdbc-standalone.jar, but nothing seems to work. I also tried RHive to connect to Hive, but there was also no success.

    Can you please help? I’m kind of stuck 😦


    1. You would typically use LOAD DATA LOCAL INPATH .. to load a local file into Hive. The problem is that running this through JDBC will execute it local to the HiveServer2, so the file would need to be local to HS2 and not to the client.

      Currently the best way to upload a local CSV into a Hive table is to use an HDFS put and then create an EXTERNAL TABLE based on the HDFS path; see the sketch after the links below.

      LOAD documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
      EXTERNAL TABLE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
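
      A minimal sketch of that approach from R via RJDBC, reusing the conn from the post above; the local file, HDFS path, and table schema are made up for illustration:

      # copy the local file to HDFS (file name and target path are hypothetical)
      system("hdfs dfs -put /tmp/data.csv /user/hue/data/")
      # define an external table over that HDFS directory; the two-column schema is illustrative
      dbSendUpdate(conn, "CREATE EXTERNAL TABLE my_csv (code STRING, salary INT)
                          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                          LOCATION '/user/hue/data/'")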


    1. I have pulled the data onto my client machine and processed it using some statistical techniques. Now I have a table created in Hive, and I want to write the processed data to this table.


      1. With ACID support in the most recent Hive release, you should be able to use INSERT and UPDATE.
        You can always write the data back, e.g. as a CSV to HDFS, and define a Hive schema on that as your new table.
        What’s most likely not going to work is the use of LOAD DATA LOCAL INPATH, for the reasons above.
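
        A minimal sketch of writing rows back from R via RJDBC; the table name and values are hypothetical, and INSERT … VALUES needs Hive 0.14 or later:

        # 'results' is a hypothetical ACID-enabled table with columns (id INT, score DOUBLE)
        dbSendUpdate(conn, "INSERT INTO TABLE results VALUES (1, 0.42)")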

