Installing R, RHadoop and RStudio over Cloudera Hadoop ecosystem (Revised)

The following is a step-by-step procedure for installing R, RStudio and RHadoop on a Cloudera Hadoop system. The OS is CentOS 6.5.

Step 1:

If you are behind a proxy, set the proxy environment variable as the root user (the format is http://proxy-server-ip:proxy-server-port):
# export http_proxy=http://192.168.1.254:6588
Also add the following line to /etc/yum.conf:
proxy=http://192.168.1.254:6588
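
To confirm that yum can now reach the repositories through the proxy, list the configured repositories (a quick check; a repo listing without timeouts means the proxy settings work):
# yum repolist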

Recheck your CentOS version, if required:
$  cat /etc/redhat-release
CentOS release 6.5 (Final)

Step 2:
Update your OS:
# yum -y update
Then add the EPEL repository to your yum configuration:
# rpm -ivh http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

(Note that rpmforge carries an older version of R. Disable rpmforge, if active,
by opening the file /etc/yum.repos.d/rpmforge.repo and changing the line
enabled = 1
to
enabled = 0 )
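
(Alternatively, if the yum-utils package is installed, the repository can be disabled from the command line instead of editing the file:
# yum-config-manager --disable rpmforge )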

Step 3

Install R & R-devel
# yum install R R-devel
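
If rJava fails to compile in the next step (a problem several commenters below ran into), re-running R's Java configuration against your installed JDK usually fixes it:
# R CMD javareconf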

Start the R shell and issue the following commands within it to download the required packages
(note that the install.packages() call below is a single line, even if it wraps
across several lines on screen).

# R

(If you are behind a proxy, set the proxy server address and port in the R shell as follows before the next command. The format is http://proxy-server:proxy-server-port, enclosed in double quotes.)
> Sys.setenv(http_proxy="http://192.168.1.254:6588")
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
> q()
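
Optionally, verify that rJava compiled correctly and can start a JVM (a quick sanity check in a fresh R shell):
> library(rJava)
> .jinit()
A return value of 0 from .jinit() means the JVM started cleanly.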

Step 4
Download the rhdfs and rmr2 packages (rhdfs_1.0.8.tar.gz and rmr2_2.3.0.tar.gz are used below) to your local Downloads folder from the RHadoop project's downloads page.

Step 5
Set up the environment variables. Write the following three lines in both ~/.bashrc and /etc/profile:

export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar

(Note: Depending on the Cloudera version you have installed, check the version of the jar file above and change the path accordingly.)
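
To see which streaming jar your CDH installation actually ships, list the directory used above:
# ls /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/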

After writing these lines, source the files so the variables take effect:
# source /etc/profile
$ source ~/.bashrc
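
You can then confirm that the variables took effect and that the jar really exists at the configured path:
$ echo $HADOOP_STREAMING
$ ls -l $HADOOP_STREAMING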

Step 6
Install the downloaded rhdfs and rmr2 packages in the R shell by specifying their locations on your machine. First, start R:
# R

Then, install packages:
> install.packages("/home/ashokharnal/Downloads/rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")
> install.packages("/home/ashokharnal/Downloads/rmr2_2.3.0.tar.gz", repos = NULL, type = "source")
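
To confirm that both packages were installed, check their versions from base R:
> installed.packages()[c("rhdfs", "rmr2"), "Version"]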

At this point, R and RHadoop are installed.

Step 7
Install RStudio as follows (my machine has a 64-bit OS; check yours before installing):
# wget http://download2.rstudio.org/rstudio-server-0.98.490-x86_64.rpm
# yum install --nogpgcheck rstudio-server-0.98.490-x86_64.rpm

Note: The RStudio server service can be stopped, checked and started as:

# service rstudio-server stop
# service rstudio-server status
# service rstudio-server start

The RStudio server is accessible at: http://localhost:8787
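
RStudio Server also ships a built-in self-check that can be run after installation:
# rstudio-server verify-installation
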
Step 8

Restart your machine
Step 9

Log in as a non-root (local) user and create the file ~/.Rprofile. Write the following seven lines in it and save the file:

Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")
library(rmr2)
library(rJava)
library(rhdfs)
hdfs.init()

The file ~/.Rprofile contains commands and environment variables that need to be set/executed whenever
R starts; R reads this file automatically at startup.
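
Once ~/.Rprofile is in place, every new R session loads the libraries and runs hdfs.init() automatically. A quick connectivity check (assuming rhdfs loaded without errors) is to list the HDFS root directory:
> hdfs.ls("/")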

Step 10
Check the Hadoop MapReduce capability of RHadoop as follows. Start the R shell as the local user
and issue the following three commands in the shell, one after another:

> ints = to.dfs(1:100)
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)

You will get a long series of output, something like this (truncated):

$key
NULL

$val
v
[1,]   1   2
[2,]   2   4
[3,]   3   6
[4,]   4   8
[5,]   5  10
[6,]   6  12
[7,]   7  14
[8,]   8  16
[9,]   9  18
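
For a slightly richer smoke test that also exercises the reduce phase, the classic word count can be written with the same rmr2 primitives (a minimal sketch; the input vector is arbitrary sample data):

> words = to.dfs(c("hadoop", "r", "hadoop", "rstudio", "r", "hadoop"))
> wc = mapreduce(input = words, map = function(k, v) keyval(v, 1), reduce = function(k, counts) keyval(k, sum(counts)))
> from.dfs(wc)

from.dfs(wc) should return each distinct word as a key with its total count as the value.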


21 Responses to “Installing R, RHadoop and RStudio over Cloudera Hadoop ecosystem (Revised)”

  1. tiwaryc Says:

    Reblogged this on Data Science and Big Data Analytics in practise and commented:
    Complete Steps, worked seamlessly. Please see the details if you are playing around with rhadoop.

  2. tiwaryc Says:

    Had to re-install RHadoop on my personal PC after a long time; found the steps helpful, hence re-blogged. Hope you don't mind.

  3. tavpriteshsethi Says:

    Hi, I am a noob to hadoop. I followed the steps but got the following error on issuing the command hdfs.init() Please suggest.

    > hdfs.init()
    14/04/04 16:01:21 ERROR security.UserGroupInformation: Unable to find JAAS classes:com.sun.security.auth.UnixPrincipal not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/cloudera/R/x86_64-redhat-linux-gnu-library/3.0/rJava/java/boot/], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
    14/04/04 16:01:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
    java.io.IOException: failure to login

    • ngongongo Says:

      Hi tavpritesh ... did you find the solution to the error

      ERROR security.UserGroupInformation: Unable to find JAAS classes: com.sun.security.auth.UnixPrincipal not found in gnu.gcj... above?

      • Obaid Says:

        Hi,

        I had the same problem.

        Below two steps solved it for me:

        1. Install a native JDK (not the one that comes with Cloudera Manager). I installed "jdk-7u51-linux-x64.rpm" directly.

        2. Set the environment variables below. Paths may differ based on your Hadoop distribution. Make sure JAVA_HOME points to the JDK installed in step 1:

        export HADOOP_CMD=/usr/bin/hadoop

        export HADOOP_STREAMING=/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar

        export JAVA_HOME=/usr/java/jdk1.7.0_51

        export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib64:/usr/java/jdk1.7.0_45-cloudera/jre/lib/amd64/server

  4. Phil Matyash Says:

    Hi guys, has anyone fixed the error above?

  5. Chandan Says:

    Thanks! That was really helpful!

  6. colino Says:

    Hi, I still have that problem:

    > hdfs.init()
    14/12/09 05:20:49 ERROR security.UserGroupInformation: Unable to find JAAS classes:com.sun.security.auth.UnixPrincipal not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/cloudera/R/x86_64-redhat-linux-gnu-library/3.1/rJava/java/boot/], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
    14/12/09 05:20:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
    java.io.IOException: failure to login

    Java version is this one:
    [cloudera@quickstart cloudera]$ java -version
    java version "1.7.0_67"
    Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
    Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

    what’s wrong??

  7. Vivek Misra Says:

    In Step 3, I also required "caTools" to be installed. I have R version 3.2.0. Also, you need to set HADOOP_CMD if you are installing from the R command line.
    > install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "caTools"))
    > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")

  8. Monica Franceschini Says:

    Question: on which nodes do I have to install the R + RHadoop packages in a Hadoop cluster? All data nodes?
    Thanks

  9. Baskar Jayakumar Says:

    Do R packages need to be installed on all servers (Datanodes) that are part of the cluster ?

    Or just install on one node and copy the libraries across all nodes ?

  10. stuti Says:

    java major.minor version errors are coming, please help

  11. Johne Says:

    Hi guys.
    After following the instructions above I've been receiving the issue below.

    # My environment variables
    Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
    Sys.setenv(JAVA_HOME="/usr/java/jdk1.7.0_67-cloudera")
    Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
    Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")

    # adding the libraries
    library(rmr2)
    library(rJava)
    library(rhdfs)

    # My issue
    hdfs.init()
    Error in .jnew("org/apache/hadoop/conf/Configuration") :
    java.lang.UnsupportedClassVersionError: org/apache/hadoop/conf/Configuration : Unsupported major.minor version 51.0

    Please, how can I solve it?

  12. spacediver Says:

    Having the exact same problem as you Johne. Curious to see how to resolve it. Based on what I’ve read, it has something to do with mismatching Java versions during compile and runtime (though I’ve little clue what that means).

  13. dedalo Says:

    Same problem and still trying to solve it! any clue?

    • dedalo Says:

      Finally, I solved it! In Step 3, after installing R and R-devel, and before installing any packages (rJava), run

      # R CMD javareconf

      in order to update the Java version used to compile rJava to the one provided with your Cloudera (in my case, Cloudera VM 5.7.0 on CentOS 6.7).

  14. Supertramp Says:

    Anyone else getting this error when starting R?

    Please review your hadoop settings. See help(hadoop.settings)
    Error : .onLoad failed in loadNamespace() for 'rJava', details:
    call: dyn.load(file, DLLpath = DLLpath, ...)
    error: unable to load shared object '/usr/lib64/R/library/rJava/libs/rJava.so':
    libjvm.so: cannot open shared object file: No such file or directory
    Error: package or namespace load failed for ‘rJava’

  15. Priya Says:

    I have been trying to integrate HADOOP with R by installing RHADOOP. But when I start RStudio, I keep getting the following message :

    Warning message:
    S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found

    I am sure that I have not given my path correctly for HADOOP_CMD. How do we find the path for HADOOP_CMD? I use Cloudera VM, should I install stand alone HADOOP for integrating HADOOP with R ?

    Thank you
    Priya

  16. syam srikanth Says:

    Hi,

    Your blog is nice.
    While installing R on the Cloudera QuickStart 5.8 VM,
    I am facing the below error.

    [cloudera@quickstart ~]$ cat /etc/redhat-release
    CentOS release 6.7 (Final)
    [cloudera@quickstart ~]$ sudo rpm -ivh http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    Retrieving http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    warning: /var/tmp/rpm-tmp.cZNd3p: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
    Preparing… ########################################### [100%]
    package epel-release-6-8.noarch is already installed
    [cloudera@quickstart ~]$ sudo yum install R
    Loaded plugins: fastestmirror, security
    Setting up Install Process
    Loading mirror speeds from cached hostfile
    Could not get metalink http://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=x86_64 error was
    12: Timeout on http://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=x86_64: (28, 'connect() timed out!')
    * base: mirror.its.dal.ca
    * epel: epel.mirror.constant.com
    * extras: mirror.its.dal.ca
    * updates: mirror.its.dal.ca

    Timeout and mirror-not-found errors are coming.
    Please guide me.

    Thanks,
    Syam.
