Installing R, RHadoop and RStudio over Cloudera Hadoop ecosystem

The following is a step-by-step procedure for installing R, RStudio and RHadoop over a Cloudera Hadoop system. The OS is CentOS 6.4.

(This blog post has since been revised; please see the revised version.)

This procedure draws on the help of several other sites.

It is assumed you have installed the latest version of Cloudera Standard (CDH 4.3) on a CentOS 6.4 system. If your version differs from CDH 4.3, you may have to export the appropriate jar file, as mentioned in Step 6 below. Proceed as follows:

Step 1:
Check your CentOS version (after updating, it should be 6.4) as below:

    [root@master ~]# cat /etc/redhat-release
CentOS release 6.4 (Final)

Step 2:

Install the EPEL (Extra Packages for Enterprise Linux) repository for CentOS 6.4 so that yum can download packages from it:

    [root@master ~]# rpm -ivh http://mirror.chpc.utah.edu/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

Step 3:
Install R and R-devel

   [root@master ~]# yum -y --enablerepo=epel install R R-devel

   [root@master ~]# R CMD javareconf

Step 4:
Start the R shell and issue the commands below to install the required packages. (Note: each install.packages() call is a single line up to its final closing parenthesis, even though it may wrap over several lines on screen.)

    [root@master ~]# R

   > install.packages(c('Rcpp', 'RJSONIO', 'itertools', 'digest'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))

   > install.packages(c('functional', 'stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))

   > install.packages(c('rJava'), repos="http://cran.revolutionanalytics.com")

   > install.packages(c('randomForest'), repos="http://cran.revolutionanalytics.com")

   > install.packages(c('reshape2'), repos="http://cran.revolutionanalytics.com")

   > install.packages(c('bitops'), repos="http://cran.revolutionanalytics.com")
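Optionally, before quitting, you can verify that all the packages installed cleanly; a minimal sketch:

   > pkgs <- c('Rcpp', 'RJSONIO', 'itertools', 'digest', 'functional', 'stringr', 'plyr', 'rJava', 'randomForest', 'reshape2', 'bitops')

   > pkgs %in% rownames(installed.packages())   # each entry should be TRUE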

   > q()

Step 5:
Download RHadoop's rmr2 package and install it:

   [root@master ~]# git clone git://github.com/RevolutionAnalytics/rmr2.git

   [root@master ~]# R CMD INSTALL --byte-compile rmr2/pkg/

Step 6:
Set environment variables. Do this BEFORE the next step. Also add these three export commands to the files ~/.bashrc and /etc/profile:

   export HADOOP_HOME=/usr/lib/hadoop
   export HADOOP_CMD=/usr/bin/hadoop
   export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar
(Note: Depending upon the Cloudera version you have installed, check the version of the jar file above and change the path accordingly.)
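If you are unsure of the exact jar name on your system, you can locate it from within R; a hedged one-liner (the directory below assumes the CDH4 MR1 layout used above):

   > Sys.glob("/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-*.jar")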

Run the source command so that the environment variables take effect in the current shell:

   [root@master ~]# source /etc/profile

   [ashokharnal@master ~]$ source ~/.bashrc
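To confirm that a freshly started R session actually sees these variables (they must be set before R starts, or set inside R with Sys.setenv() as in Step 10), a quick sketch:

   > Sys.getenv(c("HADOOP_HOME", "HADOOP_CMD", "HADOOP_STREAMING"))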

Step 7:
Install rhdfs as:

   [root@master ~]# git clone git://github.com/RevolutionAnalytics/rhdfs.git

   [root@master ~]# R CMD INSTALL --byte-compile rhdfs/pkg/

Note: If by mistake you did not set the environment variables, set them now and then reinstall rhdfs as follows:

   [root@master ~]# mv rhdfs rhdfs1

   [root@master ~]# git clone git://github.com/RevolutionAnalytics/rhdfs.git

   [root@master ~]# R CMD INSTALL --byte-compile rhdfs/pkg/

You have now finished installing R and RHadoop.
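As a quick sanity check that both RHadoop packages are now visible to R, a minimal sketch:

   > c("rmr2", "rhdfs") %in% rownames(installed.packages())   # should print TRUE TRUE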

Step 8:
Install RStudio Server as:

   [root@master ~]# wget http://download2.rstudio.org/rstudio-server-0.97.551-x86_64.rpm

   [root@master ~]# yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm

Note: the RStudio Server service can be stopped, checked and started as:

   [root@master ~]# service rstudio-server stop

   [root@master ~]# service rstudio-server status

   [root@master ~]# service rstudio-server start

The RStudio server is accessible at: http://localhost:8787

Step 9:
Restart your machine.

Step 10:
Verification. Check that map-reduce is working from R:

# First, change the permissions of the '/user' folder in HDFS so that it is accessible from RStudio and the R shell

[root@master ~]# su hdfs

bash-4.1$ hadoop fs -chmod -R 777 /user

# Start the R shell and prepare for a map-reduce job:

[root@master ~]# R

> # Check where you are by issuing a system (bash shell) command, then load the libraries.

> system("pwd")

> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2

> library(rJava)
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_HOME="/usr/lib/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar")
> library(rhdfs)
HADOOP_CMD=/usr/bin/hadoop
Be sure to run hdfs.init()
> hdfs.init()
13/08/25 10:52:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
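
Once hdfs.init() succeeds, you can sanity-check HDFS access directly from R before running any job; a minimal sketch using rhdfs's hdfs.ls():

> hdfs.ls("/user")   # should list the contents of /user without a permission error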

We will next run the following three commands in the R shell, one by one:

ints = to.dfs(1:100)
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
from.dfs(calc)

This is what happens when we run them:

> ints = to.dfs(1:100)
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

13/08/25 10:55:28 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/08/25 10:55:28 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning message:
In to.dfs(1:100) : Converting to.dfs argument to keyval with a NULL key
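
The warning is harmless: to.dfs() received a plain vector, so rmr2 converted it to a key-value pair with a NULL key. To supply explicit keys instead, a hedged alternative using rmr2's keyval() constructor:

> ints = to.dfs(keyval(1:100, 1:100))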

> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))

packageJobJar: [/tmp/RtmplRnWux/rmr-local-env27805e8383c0, /tmp/RtmplRnWux/rmr-global-env27804f1e1232, /tmp/RtmplRnWux/rmr-streaming-map2780334d60eb, /tmp/hadoop-root/hadoop-unjar5151351822726915138/] [] /tmp/streamjob1232123517037304502.jar tmpDir=null
13/08/25 10:55:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/08/25 10:55:51 INFO mapred.FileInputFormat: Total input paths to process : 1
13/08/25 10:55:52 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
13/08/25 10:55:52 INFO streaming.StreamJob: Running job: job_201308250944_0002
13/08/25 10:55:52 INFO streaming.StreamJob: To kill this job, run:
13/08/25 10:55:52 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201308250944_0002
13/08/25 10:55:52 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201308250944_0002
13/08/25 10:55:53 INFO streaming.StreamJob: map 0% reduce 0%
13/08/25 10:56:00 INFO streaming.StreamJob: map 100% reduce 0%
13/08/25 10:56:01 INFO streaming.StreamJob: map 100% reduce 100%
13/08/25 10:56:01 INFO streaming.StreamJob: Job complete: job_201308250944_0002
13/08/25 10:56:01 INFO streaming.StreamJob: Output: /tmp/RtmplRnWux/file2780145db011

> from.dfs(calc)

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

$key
NULL

$val
v
[1,] 1 2
[2,] 2 4
[3,] 3 6
[4,] 4 8
[5,] 5 10
...
[99,] 99 198
[100,] 100 200

>

# You can repeat the above in RStudio after logging in as a local Linux user (say ashokharnal, with that user's system password)
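
If this basic test succeeds, you can also exercise the reduce step. The following is an illustrative sketch only (grouping by parity, v %% 2, is an arbitrary choice); it uses nothing beyond the rmr2 functions already loaded:

> # map emits (parity of v, v); reduce sums the values sharing each key
> ints = to.dfs(1:100)
> calc = mapreduce(input = ints,
+                  map = function(k, v) keyval(v %% 2, v),
+                  reduce = function(k, vv) keyval(k, sum(vv)))
> from.dfs(calc)   # expect key 0 with value 2550 (sum of evens) and key 1 with 2500 (sum of odds)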

Screenshot


