Installing R, RHadoop and RStudio over Cloudera Hadoop ecosystem (Revised)

The following is a step-by-step procedure for installing R, RStudio and RHadoop over a Cloudera Hadoop system. The OS is CentOS 6.5.

Step 1:

If you are behind a proxy, as the root user, set the environment variable for the proxy server:
# export http_proxy=http://192.168.1.254:6588
(format: http://proxy-server-ip:proxy-server-port)
Also add the following line to /etc/yum.conf:
proxy=http://192.168.1.254:6588
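
(Note: to confirm that yum can now reach the repositories through the proxy, a quick check; the mirror list in the output will vary:)
# yum clean all
# yum repolist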

Recheck your CentOS version, if required:
$  cat /etc/redhat-release
CentOS release 6.5 (Final)

Step 2:
Update your OS
# yum -y update
Then add the EPEL repository to the yum configuration:
# rpm -ivh http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

(Note that rpmforge carries an older version of R. If rpmforge is active, disable it
by opening the file /etc/yum.repos.d/rpmforge.repo and changing the line
enabled = 1
to
enabled = 0)

Step 3

Install R & R-devel
# yum install R R-devel

Start the R shell and issue the commands below within it to download the required packages.
(Note that the install.packages() command may wrap over several lines on screen, but it is a
single command ending at the closing parenthesis.)

# R

(If you are behind a proxy, set the proxy server address and port in the R shell as follows before the next command. The format is http://proxy-server:proxy-server-port, enclosed in double quotes.)
> Sys.setenv(http_proxy="http://192.168.1.254:6588")
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
> q()

Step 4
Download the rhdfs and rmr2 package source archives (rhdfs_1.0.8.tar.gz and rmr2_2.3.0.tar.gz) to your local Downloads folder from the RHadoop downloads page.
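
(If you prefer the command line, the archives can be fetched with wget; the links below are placeholders, so copy the actual download URLs for rhdfs_1.0.8.tar.gz and rmr2_2.3.0.tar.gz from the RHadoop downloads page:)
# cd ~/Downloads
# wget <download-link-for-rhdfs_1.0.8.tar.gz>
# wget <download-link-for-rmr2_2.3.0.tar.gz>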

Step 5
Set up environment variables. Add the following three lines to both ~/.bashrc and /etc/profile:

export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar

(Note: Depending on the Cloudera version you have installed, check the version of the jar file above and change it accordingly.)
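
(To find the exact name and location of the streaming jar on your system, something like the following helps; the directory assumed here is the MRv1 layout used above:)
# find /usr/lib/hadoop-0.20-mapreduce -name 'hadoop-streaming*.jar'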

After adding these lines, load the variables into the current shell:
# source /etc/profile
$ source ~/.bashrc
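
(A quick check that the variables are visible in the current shell:)
$ echo $HADOOP_CMD
$ echo $HADOOP_STREAMING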

Step 6
Install the downloaded rhdfs and rmr2 packages in the R shell by specifying their location on your machine. First, start R:
# R

Then, install packages:
> install.packages("/home/ashokharnal/Downloads/rhdfs_1.0.8.tar.gz", repos = NULL, type="source")
> install.packages("/home/ashokharnal/Downloads/rmr2_2.3.0.tar.gz", repos = NULL, type="source")
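
(A quick check that both packages installed cleanly; assuming the Step 5 variables are already exported in your shell, they should load without errors:)
> library(rhdfs)
> library(rmr2)
> q()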

At this point, R and RHadoop are installed.

Step 7
Install RStudio Server (my machine has a 64-bit OS; check yours before installing):
# wget http://download2.rstudio.org/rstudio-server-0.98.490-x86_64.rpm
# yum install --nogpgcheck rstudio-server-0.98.490-x86_64.rpm

Note: RStudio service can be started/stopped as:

# service rstudio-server stop
# service rstudio-server status
# service rstudio-server start

The RStudio server is accessible at: http://localhost:8787
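
(If the machine is remote, replace localhost with the server's IP address. RStudio Server also provides a built-in self-check that can be run after installation:)
# rstudio-server verify-installation
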
Step 8
Restart your machine.

Step 9
Log in as a non-root (local) user and create a file ~/.Rprofile. Write the following seven lines in it and save the file:

Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")
library(rmr2)
library(rJava)
library(rhdfs)
hdfs.init()

The file ~/.Rprofile contains commands and environment variables that are set and executed every time R starts; R picks it up automatically, so the RHadoop libraries are loaded and HDFS is initialised on every start.
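
(With ~/.Rprofile in place, a quick way to confirm that rhdfs can reach HDFS is to start R as the local user and list the HDFS root directory:)
$ R
> hdfs.ls("/")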

Step 10
Check the Hadoop MapReduce capability of RHadoop as follows. Start the R shell as the local user
and issue the following three commands, one after another.

> ints = to.dfs(1:100)
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)

You will get a long series of output, something like:

$key
NULL

$val
v
[1,]   1   2
[2,]   2   4
[3,]   3   6
[4,]   4   8
[5,]   5  10
[6,]   6  12
[7,]   7  14
[8,]   8  16
[9,]   9  18
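
(As a further check that the reduce phase works as well, here is a minimal keyed example; the object names are illustrative. It sums the integers 1 to 10 grouped by parity:)
> groups = to.dfs(1:10)
> res = mapreduce(input = groups,
+                 map = function(k, v) keyval(v %% 2, v),
+                 reduce = function(k, vv) keyval(k, sum(vv)))
> from.dfs(res)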


23 Responses to “Installing R, RHadoop and RStudio over Cloudera Hadoop ecosystem (Revised)”

  1. tiwaryc Says:

    Reblogged this on Data Science and Big Data Analytics in practise and commented:
    Complete Steps, worked seamlessly. Please see the details if you are playing around with rhadoop.

  2. tiwaryc Says:

    Had to re-install rhadoop on my personal PC after a long time, found the steps helpful hence re-blogged, hope you don’t mind,

  3. tavpriteshsethi Says:

    Hi, I am a noob to Hadoop. I followed the steps but got the following error on issuing the command hdfs.init(). Please suggest.

    > hdfs.init()
    14/04/04 16:01:21 ERROR security.UserGroupInformation: Unable to find JAAS classes:com.sun.security.auth.UnixPrincipal not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/cloudera/R/x86_64-redhat-linux-gnu-library/3.0/rJava/java/boot/], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
    14/04/04 16:01:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
    java.io.IOException: failure to login

    • ngongongo Says:

      Hi tavpritesh ... did you find the solution to the error

      ERROR security.UserGroupInformation: Unable to find JAAS classes: com.sun.security.auth.UnixPrincipal not found in gnu.gcj... above?

      • Obaid Says:

        Hi,

        I had the same problem.

        Below two steps solved it for me:

        1. Install a native JDK (not the one that comes with Cloudera Manager). I installed "jdk-7u51-linux-x64.rpm" directly.

        2. Set the below environment variables. Paths may differ based on your Hadoop distribution. Make sure JAVA_HOME points to the JDK installed in step 1:

        export HADOOP_CMD=/usr/bin/hadoop

        export HADOOP_STREAMING=/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar

        export JAVA_HOME=/usr/java/jdk1.7.0_51

        export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib64:/usr/java/jdk1.7.0_45-cloudera/jre/lib/amd64/server

  4. Phil Matyash Says:

    Hi guys, has anyone fixed the error above?

  5. Chandan Says:

    Thanks! That was really helpful!

  6. colino Says:

    Hi, I still have that problem:

    > hdfs.init()
    14/12/09 05:20:49 ERROR security.UserGroupInformation: Unable to find JAAS classes:com.sun.security.auth.UnixPrincipal not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/cloudera/R/x86_64-redhat-linux-gnu-library/3.1/rJava/java/boot/], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
    14/12/09 05:20:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
    java.io.IOException: failure to login

    Java version is this one:
    [cloudera@quickstart cloudera]$ java -version
    java version "1.7.0_67"
    Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
    Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

    what’s wrong??

  7. Vivek Misra Says:

    In Step 3, I also required "caTools" to be installed. I have R version 3.2.0. Also, you need to set HADOOP_CMD if you are installing from the R command line.
    >install.packages(c("rJava", "Rcpp", "RJSONIO","bitops","digest", "functional","stringr","plyr","reshape2","caTools"))
    >Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")

  8. Monica Franceschini Says:

    Question: on which nodes do I have to install the R + RHadoop packages in a Hadoop cluster? All data nodes?
    Thanks

  9. Baskar Jayakumar Says:

    Do R packages need to be installed on all servers (data nodes) that are part of the cluster?

    Or just install on one node and copy the libraries across all nodes?

  10. stuti Says:

    java major.minor version errors are coming, please help

  11. Johne Says:

    Hi guys.
    After following the instructions above I received the issue below.

    # My environment variables
    Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
    Sys.setenv(JAVA_HOME="/usr/java/jdk1.7.0_67-cloudera")
    Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
    Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")

    # adding the libraries
    library(rmr2)
    library(rJava)
    library(rhdfs)

    # My issue
    hdfs.init()
    Error in .jnew("org/apache/hadoop/conf/Configuration") :
    java.lang.UnsupportedClassVersionError: org/apache/hadoop/conf/Configuration : Unsupported major.minor version 51.0

    Please! How can I solve it?

  12. spacediver Says:

    Having the exact same problem as you Johne. Curious to see how to resolve it. Based on what I’ve read, it has something to do with mismatching Java versions during compile and runtime (though I’ve little clue what that means).

  13. dedalo Says:

    Same problem and still trying to solve it! any clue?

    • dedalo Says:

      Finally, I solved it! In Step 3, after installing R and R-devel, and before installing any packages (rJava), run

      # R CMD javareconf

      in order to point the Java configuration used to compile rJava to the one provided with your Cloudera installation (in my case Cloudera VM 5.7.0 on CentOS 6.7).

  14. Supertramp Says:

    Anyone else getting this error when starting R?

    Please review your hadoop settings. See help(hadoop.settings)
    Error : .onLoad failed in loadNamespace() for 'rJava', details:
    call: dyn.load(file, DLLpath = DLLpath, ...)
    error: unable to load shared object '/usr/lib64/R/library/rJava/libs/rJava.so':
    libjvm.so: cannot open shared object file: No such file or directory
    Error: package or namespace load failed for ‘rJava’

  15. Priya Says:

    I have been trying to integrate Hadoop with R by installing RHadoop. But when I start RStudio, I keep getting the following message:

    Warning message:
    S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found

    I am sure that I have not given my path correctly for HADOOP_CMD. How do we find the path for HADOOP_CMD? I use the Cloudera VM; should I install standalone Hadoop for integrating Hadoop with R?

    Thank you
    Priya

  16. syam srikanth Says:

    Hi,

    Your blog is nice.
    While installing R on the Cloudera QuickStart 5.8 VM,
    I am facing the error below.

    [cloudera@quickstart ~]$ cat /etc/redhat-release
    CentOS release 6.7 (Final)
    [cloudera@quickstart ~]$ sudo rpm -ivh http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    Retrieving http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    warning: /var/tmp/rpm-tmp.cZNd3p: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
    Preparing… ########################################### [100%]
    package epel-release-6-8.noarch is already installed
    [cloudera@quickstart ~]$ sudo yum install R
    Loaded plugins: fastestmirror, security
    Setting up Install Process
    Loading mirror speeds from cached hostfile
    Could not get metalink http://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=x86_64 error was
    12: Timeout on http://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=x86_64: (28, ‘connect() timed out!’)
    * base: mirror.its.dal.ca
    * epel: epel.mirror.constant.com
    * extras: mirror.its.dal.ca
    * updates: mirror.its.dal.ca

    Timeout errors and mirror-not-found errors are coming. Please guide me.
    please guide me.

    Thanks,
    Syam.

  17. dheeru Says:

    my program runs properly but the output is
    $key
    NULL

    $val
    NULL

  18. dharmendra Says:

    output = from.dfs(mapreduce(input = dfsdata,
    + map = function(k,v) keyval(v, 1),
    + reduce = function(k, vv) keyval(k, length(vv))))
    packageJobJar: [/tmp/hadoop-unjar4207476690427020877/] [] /tmp/streamjob4437951472069024217.jar tmpDir=null
    19/07/01 20:45:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    19/07/01 20:45:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    19/07/01 20:45:50 INFO mapred.FileInputFormat: Total input files to process : 1
    19/07/01 20:45:51 INFO mapreduce.JobSubmitter: number of splits:2
    19/07/01 20:45:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1561991644224_0003
    19/07/01 20:45:52 INFO impl.YarnClientImpl: Submitted application application_1561991644224_0003
    19/07/01 20:45:52 INFO mapreduce.Job: The url to track the job: http://nanu:8088/proxy/application_1561991644224_0003/
    19/07/01 20:45:52 INFO mapreduce.Job: Running job: job_1561991644224_0003
    19/07/01 20:46:01 INFO mapreduce.Job: Job job_1561991644224_0003 running in uber mode : false
    19/07/01 20:46:01 INFO mapreduce.Job: map 0% reduce 0%
    19/07/01 20:46:08 INFO mapreduce.Job: map 100% reduce 0%
    19/07/01 20:46:15 INFO mapreduce.Job: map 100% reduce 100%
    19/07/01 20:46:16 INFO mapreduce.Job: Job job_1561991644224_0003 completed successfully
    19/07/01 20:46:16 INFO mapreduce.Job: Counters: 49
    File System Counters
    FILE: Number of bytes read=6
    FILE: Number of bytes written=490247
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=997
    HDFS: Number of bytes written=122
    HDFS: Number of read operations=13
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
    Job Counters
    Launched map tasks=2
    Launched reduce tasks=1
    Data-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=9931
    Total time spent by all reduces in occupied slots (ms)=3601
    Total time spent by all map tasks (ms)=9931
    Total time spent by all reduce tasks (ms)=3601
    Total vcore-milliseconds taken by all map tasks=9931
    Total vcore-milliseconds taken by all reduce tasks=3601
    Total megabyte-milliseconds taken by all map tasks=10169344
    Total megabyte-milliseconds taken by all reduce tasks=3687424
    Map-Reduce Framework
    Map input records=3
    Map output records=0
    Map output bytes=0
    Map output materialized bytes=12
    Input split bytes=188
    Combine input records=0
    Combine output records=0
    Reduce input groups=0
    Reduce shuffle bytes=12
    Reduce input records=0
    Reduce output records=0
    Spilled Records=0
    Shuffled Maps =2
    Failed Shuffles=0
    Merged Map outputs=2
    GC time elapsed (ms)=208
    CPU time spent (ms)=2470
    Physical memory (bytes) snapshot=772419584
    Virtual memory (bytes) snapshot=6660243456
    Total committed heap usage (bytes)=731381760
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=809
    File Output Format Counters
    Bytes Written=122
    19/07/01 20:46:16 INFO streaming.StreamJob: Output directory: /tmp/file1a6e462d567b
    Deleted /tmp/file1a6e370fb577
    Deleted /tmp/file1a6e280ade45
    > # Now you can check the output
    > output
    $key
    NULL

    $val
    NULL

