The following is a step-by-step procedure for installing R, RStudio and RHadoop on a Cloudera Hadoop system. The OS is CentOS 6.5.
Step 1:
If you are behind a proxy, set the environment variable for the proxy server as the root user:
# export http_proxy=http://192.168.1.254:6588
(the format is http://proxy-server-ip:proxy-server-port)
Also add the following line to /etc/yum.conf:
proxy=http://192.168.1.254:6588
Recheck your CentOS version, if required:
$ cat /etc/redhat-release
CentOS release 6.5 (Final)
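To confirm that yum can actually reach the repositories through the proxy (a quick extra check, not part of the original steps), listing the configured repositories is enough:
# yum repolist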
Step 2:
Update your OS
# yum -y update
Then, add the EPEL repository to the yum configuration:
# rpm -ivh http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
(Note that rpmforge has an older version of R. If rpmforge is active, disable it
by opening the file /etc/yum.repos.d/rpmforge.repo and changing the line
enabled = 1
to
enabled = 0)
Step 3
Install R and R-devel:
# yum install R R-devel
Start the R shell and issue the commands below within it to download the required packages
(note that the install.packages() call wraps over several lines here, but it is one
single command up to the closing bracket).
# R
(If you are behind a proxy, set the proxy server address and port in the R shell as follows before the next command)
> Sys.setenv(http_proxy="http://192.168.1.254:6588")
(the format is http://proxy-server:proxy-server-port, enclosed in double quotes)
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
> q()
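As a quick sanity check (an extra step, not in the original write-up), start R again and confirm that the key packages load; if rJava fails to load here, running R CMD javareconf as root and reinstalling rJava usually helps, as noted in the comments below:
# R
> library(rJava)
> library(RJSONIO)
> q()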
Step 4
Download the rhdfs and rmr2 packages to your local Downloads folder from here:
Step 5
Set up environment variables. Write the following three lines in both ~/.bashrc and /etc/profile:
export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar
(Note: depending on the Cloudera version you have installed, check the version of the jar file above and change it accordingly.)
After writing these lines, source the files so the variables take effect:
# source /etc/profile
$ source ~/.bashrc
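To verify the variables, and in particular that the streaming jar really exists at the path you wrote (its file name differs between CDH versions), a quick check is:
$ echo $HADOOP_STREAMING
$ ls -l $HADOOP_STREAMING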
Step 6
Install the downloaded rhdfs and rmr2 packages in the R shell by specifying their location on your machine. First, start R:
# R
Then, install the packages:
> install.packages("/home/ashokharnal/Downloads/rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")
> install.packages("/home/ashokharnal/Downloads/rmr2_2.3.0.tar.gz", repos = NULL, type = "source")
At this point R and RHadoop are installed.
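If you want a quick confirmation before proceeding (an extra check, not in the original steps), the installed versions can be listed in the R shell; with the files used above this should show 1.0.8 and 2.3.0:
> installed.packages()[c("rhdfs", "rmr2"), "Version"]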
Step 7
Install RStudio as follows (my machine has a 64-bit OS; check yours before installing):
# wget http://download2.rstudio.org/rstudio-server-0.98.490-x86_64.rpm
# yum install --nogpgcheck rstudio-server-0.98.490-x86_64.rpm
Note: RStudio service can be started/stopped as:
# service rstudio-server stop
# service rstudio-server status
# service rstudio-server start
The RStudio server is accessible at: http://localhost:8787
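If the page at port 8787 does not come up, RStudio Server includes a self-check command (an extra suggestion, not part of the original write-up):
# rstudio-server verify-installation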
Step 8
Restart your machine
Step 9
Log in as a non-root (local) user and create the file ~/.Rprofile. Write the following seven lines in it and save the file:
Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")
library(rmr2)
library(rJava)
library(rhdfs)
hdfs.init()
The file ~/.Rprofile contains commands and environment variables that need to be set or executed whenever
R starts; R reads this file as soon as it starts.
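A simple way to confirm that ~/.Rprofile is being picked up: start R as the local user and print one of the variables; if it shows the path set above, the profile was read (and the three library() calls should have run without error).
$ R
> Sys.getenv("HADOOP_CMD")
[1] "/usr/bin/hadoop"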
Step 10
Check the Hadoop MapReduce capability of RHadoop as follows. Start the R shell as the local user
and issue the following three commands in the shell, one after another.
> ints = to.dfs(1:100)
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)
You will get a long series of output, something like:
$key
NULL
$val
v
[1,] 1 2
[2,] 2 4
[3,] 3 6
[4,] 4 8
[5,] 5 10
[6,] 6 12
[7,] 7 14
[8,] 8 16
[9,] 9 18
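As a further check (my own sketch, built from the same rmr2 functions used above and in the comments below), a small job with both a map and a reduce phase counts how many of the integers are even and how many are odd; keyval() tags each value with a key (here v %% 2, so 0 for even and 1 for odd) and the reducer counts the values it receives per key:
> small.ints = to.dfs(1:10)
> result = mapreduce(input = small.ints,
+                    map = function(k, v) keyval(v %% 2, v),
+                    reduce = function(k, vv) keyval(k, length(vv)))
> from.dfs(result)
The expected result is two key/value pairs, keys 0 and 1, each with a count of 5.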
Tags: R, RHadoop, RHadoop on CentOS 6.5
March 13, 2014 at 1:00 pm |
Reblogged this on Data Science and Big Data Analytics in practise and commented:
Complete steps, worked seamlessly. Please see the details if you are playing around with RHadoop.
March 13, 2014 at 1:02 pm |
Had to re-install RHadoop on my personal PC after a long time; found the steps helpful, hence re-blogged. Hope you don't mind.
April 4, 2014 at 8:08 pm |
Hi, I am a newbie to Hadoop. I followed the steps but got the following error on issuing the command hdfs.init(). Please suggest.
> hdfs.init()
14/04/04 16:01:21 ERROR security.UserGroupInformation: Unable to find JAAS classes:com.sun.security.auth.UnixPrincipal not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/cloudera/R/x86_64-redhat-linux-gnu-library/3.0/rJava/java/boot/], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
14/04/04 16:01:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
java.io.IOException: failure to login
September 2, 2014 at 4:28 am |
Hi tavpritesh, did you find the solution to the error above
(ERROR security.UserGroupInformation: Unable to find JAAS classes: com.sun.security.auth.UnixPrincipal not found in gnu.gcj...)?
November 10, 2014 at 4:54 am
Hi,
I had the same problem.
The two steps below solved it for me:
1. Install a native JDK (not the one that comes with Cloudera Manager). I installed "jdk-7u51-linux-x64.rpm" directly.
2. Set the environment variables below. Paths may differ based on your Hadoop distribution. Make sure JAVA_HOME points to the JDK installed in step 1:
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar
export JAVA_HOME=/usr/java/jdk1.7.0_51
export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib64:/usr/java/jdk1.7.0_45-cloudera/jre/lib/amd64/server
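It may also help (a hedged extra step, echoed by a later comment about R CMD javareconf) to rebuild R's Java configuration and reinstall rJava after switching to the native JDK, so that rJava is no longer linked against the GCJ runtime visible in the error above:
# export JAVA_HOME=/usr/java/jdk1.7.0_51
# R CMD javareconf
# R
> install.packages("rJava")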
September 12, 2014 at 10:17 am |
Hi guys, has anyone fixed the error above?
October 1, 2014 at 7:32 am |
Thanks! That was really helpful!
December 9, 2014 at 1:39 pm |
Hi, I still have that problem:
> hdfs.init()
14/12/09 05:20:49 ERROR security.UserGroupInformation: Unable to find JAAS classes:com.sun.security.auth.UnixPrincipal not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/cloudera/R/x86_64-redhat-linux-gnu-library/3.1/rJava/java/boot/], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
14/12/09 05:20:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
java.io.IOException: failure to login
Java version is this one:
[cloudera@quickstart cloudera]$ java -version
java version “1.7.0_67”
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
What's wrong?
June 14, 2015 at 12:21 am |
In Step 3, I also needed "caTools" to be installed. I have R version 3.2.0. Also, you need to set HADOOP_CMD if you are installing from the R command line.
>install.packages(c("rJava", "Rcpp", "RJSONIO","bitops","digest", "functional","stringr","plyr","reshape2","caTools"))
>Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
September 23, 2015 at 10:18 am |
Question: on which nodes do I have to install R + the RHadoop packages in a Hadoop cluster? All data nodes?
Thanks
September 24, 2015 at 12:35 am |
I had installed them on only one data node, but they will have to be installed on all data nodes.
May 21, 2017 at 4:35 am |
A perfect reply! Thanks for taking the trouble.
September 29, 2015 at 1:06 am |
Do the R packages need to be installed on all servers (data nodes) that are part of the cluster?
Or can they be installed on one node and the libraries copied across all nodes?
March 12, 2016 at 6:54 am |
Java major.minor version errors are coming, please help.
April 12, 2016 at 4:53 pm |
Hi guys.
After following the instructions above, I received the issue below.
# My environment variables
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(JAVA_HOME="/usr/java/jdk1.7.0_67-cloudera")
Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")
# adding the libraries
library(rmr2)
library(rJava)
library(rhdfs)
# My issue
hdfs.init()
Error in .jnew(“org/apache/hadoop/conf/Configuration”) :
java.lang.UnsupportedClassVersionError: org/apache/hadoop/conf/Configuration : Unsupported major.minor version 51.0
Please, how can I solve it?
April 13, 2016 at 6:46 pm |
Having the exact same problem as you, Johne. Curious to see how to resolve it. Based on what I've read, it has something to do with mismatched Java versions at compile time and runtime (though I have little clue what that means).
May 16, 2016 at 9:36 am |
Same problem here and still trying to solve it! Any clue?
May 20, 2016 at 9:04 am |
Finally, I solved it! In Step 3, after installing R and R-devel, and before installing any packages (rJava), run
# R CMD javareconf
so that rJava is compiled against the Java version provided with your Cloudera installation (in my case Cloudera VM 5.7.0 on CentOS 6.7).
June 18, 2016 at 12:52 pm |
Anyone else getting this error when starting R?
Please review your hadoop settings. See help(hadoop.settings)
Error : .onLoad failed in loadNamespace() for 'rJava', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/usr/lib64/R/library/rJava/libs/rJava.so':
libjvm.so: cannot open shared object file: No such file or directory
Error: package or namespace load failed for ‘rJava’
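(One possible cause, judging only from the error text: rJava cannot find libjvm.so from the JDK. A sketch of a workaround, assuming JAVA_HOME points at an installed JDK, is to locate the library, put its directory on LD_LIBRARY_PATH, and re-run javareconf:)
$ find $JAVA_HOME -name libjvm.so
$ export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH
# R CMD javareconf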
July 6, 2016 at 12:33 pm |
I have been trying to integrate Hadoop with R by installing RHadoop. But when I start RStudio, I keep getting the following message:
Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
I am sure that I have not given the path for HADOOP_CMD correctly. How do we find the path for HADOOP_CMD? I use the Cloudera VM; should I install a standalone Hadoop for integrating Hadoop with R?
Thank you
Priya
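(A hedged pointer on HADOOP_CMD: it is simply the full path of the hadoop launcher, so on the Cloudera VM
$ which hadoop
/usr/bin/hadoop
gives the value to use, matching the /usr/bin/hadoop used throughout this post; a separate standalone Hadoop install should not be needed just for this.)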
May 29, 2017 at 10:08 am |
Hi,
Your blog is nice.
While installing R on Cloudera QuickStart 5.8,
I am facing the error below.
[cloudera@quickstart ~]$ cat /etc/redhat-release
CentOS release 6.7 (Final)
[cloudera@quickstart ~]$ sudo rpm -ivh http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Retrieving http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
warning: /var/tmp/rpm-tmp.cZNd3p: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing… ########################################### [100%]
package epel-release-6-8.noarch is already installed
[cloudera@quickstart ~]$ sudo yum install R
Loaded plugins: fastestmirror, security
Setting up Install Process
Loading mirror speeds from cached hostfile
Could not get metalink http://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=x86_64 error was
12: Timeout on http://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=x86_64: (28, ‘connect() timed out!’)
* base: mirror.its.dal.ca
* epel: epel.mirror.constant.com
* extras: mirror.its.dal.ca
* updates: mirror.its.dal.ca
Timeout and mirror-not-found errors are coming.
please guide me.
Thanks,
Syam.
March 26, 2019 at 9:00 pm |
My program runs properly, but the output is:
$key
NULL
$val
NULL
July 1, 2019 at 3:19 pm |
output = from.dfs(mapreduce(input = dfsdata,
+ map = function(k,v) keyval(v, 1),
+ reduce = function(k, vv) keyval(k, length(vv))))
packageJobJar: [/tmp/hadoop-unjar4207476690427020877/] [] /tmp/streamjob4437951472069024217.jar tmpDir=null
19/07/01 20:45:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/07/01 20:45:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/07/01 20:45:50 INFO mapred.FileInputFormat: Total input files to process : 1
19/07/01 20:45:51 INFO mapreduce.JobSubmitter: number of splits:2
19/07/01 20:45:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1561991644224_0003
19/07/01 20:45:52 INFO impl.YarnClientImpl: Submitted application application_1561991644224_0003
19/07/01 20:45:52 INFO mapreduce.Job: The url to track the job: http://nanu:8088/proxy/application_1561991644224_0003/
19/07/01 20:45:52 INFO mapreduce.Job: Running job: job_1561991644224_0003
19/07/01 20:46:01 INFO mapreduce.Job: Job job_1561991644224_0003 running in uber mode : false
19/07/01 20:46:01 INFO mapreduce.Job: map 0% reduce 0%
19/07/01 20:46:08 INFO mapreduce.Job: map 100% reduce 0%
19/07/01 20:46:15 INFO mapreduce.Job: map 100% reduce 100%
19/07/01 20:46:16 INFO mapreduce.Job: Job job_1561991644224_0003 completed successfully
19/07/01 20:46:16 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=6
FILE: Number of bytes written=490247
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=997
HDFS: Number of bytes written=122
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=9931
Total time spent by all reduces in occupied slots (ms)=3601
Total time spent by all map tasks (ms)=9931
Total time spent by all reduce tasks (ms)=3601
Total vcore-milliseconds taken by all map tasks=9931
Total vcore-milliseconds taken by all reduce tasks=3601
Total megabyte-milliseconds taken by all map tasks=10169344
Total megabyte-milliseconds taken by all reduce tasks=3687424
Map-Reduce Framework
Map input records=3
Map output records=0
Map output bytes=0
Map output materialized bytes=12
Input split bytes=188
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=12
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=208
CPU time spent (ms)=2470
Physical memory (bytes) snapshot=772419584
Virtual memory (bytes) snapshot=6660243456
Total committed heap usage (bytes)=731381760
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=809
File Output Format Counters
Bytes Written=122
19/07/01 20:46:16 INFO streaming.StreamJob: Output directory: /tmp/file1a6e462d567b
Deleted /tmp/file1a6e370fb577
Deleted /tmp/file1a6e280ade45
> # Now you can check the output
> output
$key
NULL
$val
NULL
My output:
$key
NULL
$val
NULL
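(A hedged reading of the log above: the job completed, but the counters show Map input records=3 and Map output records=0, i.e. the mapper consumed the input yet emitted nothing, which is why from.dfs() returns NULL for both key and value. One way to narrow it down, re-creating dfsdata here with a toy vector purely for the check, is to confirm the input round-trips through HDFS on its own:)
> dfsdata = to.dfs(c("a", "b", "a"))
> from.dfs(dfsdata)
If that already comes back empty, the problem is in how the input was written to HDFS; if it comes back correctly, the map function or the input format used when reading it is the place to look.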