Hadoop MultiNode Cluster Setup

In this tutorial we are going to set up a multi-node Hadoop cluster for development. We will take the single-node Hadoop cluster we set up earlier and use it as the base for the multi-node cluster, so if you have not set up your single-node cluster yet, I recommend following that tutorial first before moving on. Why am I calling this a development cluster? Because setting up a production cluster requires careful planning of disk space and primary memory for the different nodes, as well as high-end machines and a very fast network. I will cover that later; for now our focus is a small cluster built from a few real and virtual machines on a local 100 Mbps network, so we can start learning Hadoop and MapReduce.

Pre-Requisites - A working single-node Hadoop cluster, set up as described in the earlier tutorial.

Step 1 - Clone the virtual machine

Right-click the single-node Hadoop cluster in the VirtualBox Manager and clone it. Select "Full clone" when it asks for the clone type. After making the clones, make sure you allocate enough primary memory for all the machines to run. I created a 3-node cluster with 1 NameNode and 3 DataNodes (the NameNode machine also doubles as a DataNode), and allocated 1.5 GB to the NameNode and 1 GB to the DataNodes. You can create as many clones as a single system can support, or connect multiple systems to make the cluster even bigger. Sizing the memory properly is not trivial and deserves a detailed treatment of its own, which is out of scope for this tutorial.

Cloning the virtual machine

Select full clone

Step 2 - Network Configuration

For my setup I created an internal network of my own. You can do the same, or if you are already on your organization’s network and want each virtual machine to get its own IP, just change the MAC address of each virtual machine so it receives a different IP. Also, for your own internal network the adapter should be attached to a bridged network, whereas with your organization’s ISP it should be attached to NAT.

Network settings of the virtual machines

My own internal network

2.1 Hostname Change:

The first thing to do is change the hostname of each system. Open a terminal and type the following command:
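A minimal sketch, assuming an Ubuntu-style system where the hostname is stored in /etc/hostname (nano here is just an example editor; use whichever you prefer):

    sudo nano /etc/hostname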

Replace whatever is there with the appropriate hostname and save the file (the new name takes effect after a reboot).

For the NameNode I used master, and for the slaves I used slave-1, slave-2, and so on.

Hostname setting

2.2 IP Mappings in hosts file:

Open the /etc/hosts file and include the IP mappings of every machine in the cluster. Use the following command -
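Again assuming nano as the editor:

    sudo nano /etc/hosts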

The entries in the /etc/hosts file on the master and the slaves would look like this:
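The actual addresses depend on your network; the values below are only illustrative for a 3-node cluster:

    192.168.1.100    master
    192.168.1.101    slave-1
    192.168.1.102    slave-2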

IP mapping

Step 3 - SSH Access

The hduser user on the master must be able to connect to its own account on the master (i.e. ssh master in this context, not just ssh localhost) and to the hduser account on each slave via a password-less SSH login. Use the following command to do this -
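A sketch, assuming the hduser key pair was already generated during the single-node setup and that ssh-copy-id is available; run it on the master once per slave:

    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave-1
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave-2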

After copying the password-less SSH key, use the following commands to connect to the master and the slaves from the master machine -

ssh master

ssh slave-1

If you can successfully connect to all the machines from the master, then your network setup is ready for the Hadoop cluster.

Step 4 - Hadoop Configuration

4.1 Changes on the Master Node

conf/slaves 
In the “/usr/local/hadoop/conf” folder, which you can reach by typing “cd $HADOOP_HOME” and then “cd conf”, we need to edit the slaves file. Its entries are the hostnames of all the systems that act as slaves in the cluster. Remember that I am using my master node to run a DataNode and TaskTracker as well, so I have listed master in the slaves file too. If you don’t want to run a DataNode and TaskTracker on your master node, simply remove that entry. The file contents are shown below.
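For my 3-node setup the conf/slaves file on the master looks like this:

    master
    slave-1
    slave-2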

slaves

4.2 Changes on the Slave Nodes

conf/masters
In the same “/usr/local/hadoop/conf” folder we need to edit the “masters” file. Its only entry is the hostname of the master node, as shown below.
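For this setup the conf/masters file contains just one line:

    master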

masters

4.3 Common Changes 

conf/core-site.xml
This file sets the default file system name: a URI whose scheme and authority determine the FileSystem implementation. The URI’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class, and the URI’s authority determines the host, port, etc. for the filesystem. In our cluster this should point at the master.
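A sketch of the relevant property, placed inside the <configuration> element on every node; the port 54310 is an assumption carried over from common single-node setups, so use whatever your single-node configuration already had:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
    </property>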

conf/hdfs-site.xml
This file holds the block replication factor. I am running 3 DataNode instances, so the replication factor for my cluster is 3.
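The corresponding property, inside the <configuration> element:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>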

conf/mapred-site.xml
This file holds the host and port that the MapReduce JobTracker runs at. Again, this is the master in our case.
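A sketch, again assuming the port (54311) carried over from the single-node setup:

    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
    </property>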

Formatting HDFS filesystem via NameNode
Use the following command to format the HDFS of your newly built cluster -
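Run this on the master as hduser (prefix it with $HADOOP_HOME/bin/ if the Hadoop binaries are not on your PATH):

    hadoop namenode -format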

Check the output of the format command; if it reports success, you can start your cluster.

Formatting the NameNode

That’s it, you are now ready to start your cluster. Use the following command on the master node -
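Assuming $HADOOP_HOME/bin is on your PATH (otherwise run it from that directory):

    start-all.sh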

This command starts the NameNode, JobTracker and SecondaryNameNode on the master, and a DataNode and TaskTracker on each slave. In our case, because the master is also listed in the slaves file, a DataNode and TaskTracker are started on the master node as well.

You can check the running daemons on each machine with the following command -
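jps lists the running Java processes on a machine, so you can see exactly which Hadoop daemons are up:

    jps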

Starting the multi-node cluster

Use the following command to stop the running cluster -
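Again on the master node:

    stop-all.sh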

That’s all, you have successfully set up your multi-node Hadoop cluster. Next I am going to show you how to implement and run MapReduce jobs on your cluster. Keep following…

  • Cindy

    Hi, after I set up the cluster and ran ‘start-all.sh’, I got nothing on the slave machine using ‘jps’. Could you help me check what’s wrong? Thank you. The two machines I used are Mac OS X 10.7.5.

    starting namenode, logging to /Users/cindyzhang/hadoop-1.2.1/libexec/../logs/hadoop-cindyzhang-namenode-master.out

    2013-09-16 12:34:51.916 java[12143:1c03] Unable to load realm info from SCDynamicStore

    master: starting datanode, logging to /Users/cindyzhang/hadoop-1.2.1/libexec/../logs/hadoop-cindyzhang-datanode-master.out

    master: 2013-09-16 12:34:53.093 java[12231:1f03] Unable to load realm info from SCDynamicStore

    cindy@NatiMac: starting datanode, logging to /Users/cindyzhang/hadoop-1.2.1/libexec/../logs/hadoop-cindy-datanode-NatiMac.out

    cindy@NatiMac: 2013-09-16 12:34:54.002 java[1023:1b03] Unable to load realm info from SCDynamicStore

    master: starting secondarynamenode, logging to /Users/cindyzhang/hadoop-1.2.1/libexec/../logs/hadoop-cindyzhang-secondarynamenode-master.out

    master: 2013-09-16 12:34:55.609 java[12317:1f03] Unable to load realm info from SCDynamicStore

    starting jobtracker, logging to /Users/cindyzhang/hadoop-1.2.1/libexec/../logs/hadoop-cindyzhang-jobtracker-master.out

    cindy@NatiMac: starting tasktracker, logging to /Users/cindyzhang/hadoop-1.2.1/libexec/../logs/hadoop-cindy-tasktracker-NatiMac.out

    master: starting tasktracker, logging to /Users/cindyzhang/hadoop-1.2.1/libexec/../logs/hadoop-cindyzhang-tasktracker-master.out

    cindy@NatiMac: 2013-09-16 12:34:58.224 java[1077:1b03] Unable to load realm info from SCDynamicStore

    master: 2013-09-16 12:34:58.801 java[12465:1f03] Unable to load realm info from SCDynamicStore

  • Smarty Juice

    Quick question on Step 2 – what do you mean by creating your own internal network? Did you mean you created a new network on your host Ubuntu machine (on which VirtualBox is installed)?

    • wolkus

      Yes, I made a new network on my host machine and all my VMs are connected to it. The reason for this is that your ISP may not give you extra IPs for all the machines running inside your host.