Setup Hadoop 2 Multinode Cluster on Linux CentOS Using Virtual Box

Setup Your Own Multi Node Cluster!

Have you tried to set up a cluster only to run into technical issues and had to give up? Whether you are a Hadoop Administrator, Big Data Developer or Analyst, setting up your own small cluster is one of the best ways to understand how Hadoop works.

This article provides very detailed steps to set up your own cluster on CentOS using Virtual Box.

What Do I Need to Set Up a Small Multi-Node Cluster?

All the items below are free and open source, except your system. I will provide instructions on how to download and install all the components. A laptop or desktop with a minimum of 16 GB of RAM is recommended. You can set this up on a smaller configuration, but I would not recommend wasting time on a smaller system. These instructions are for a Windows 10 host; however, they should work on other platforms with minor modifications.

  • Oracle Virtual Box
  • JDK 7 or 8
  • WinSCP
  • Grab a cup of coffee. Maybe you are smarter than me and will need only one!
Now let’s take a look at the cluster topology.

    Overall Cluster Topology

    The cluster will have one name node and two data nodes. To keep the setup simple, we will not include a Standby node; it can be added later. Also, after a successful installation, you can add more nodes if you wish.


    Step 1: Download and Install Oracle Virtual Box

    At the time of writing this article, the current version of Oracle Virtual Box is 5.2. You can do a Google search for “download Oracle Virtual Box” or download it from the Oracle Virtual Box website. Click on the Windows hosts distribution. Once the download is complete, install it as administrator.

    Also, download the Guest Additions ISO. This should be installed as administrator as well.

    I recommend setting up one VM carefully and then cloning it to create the VMs for the data nodes. You can change a few settings on the cloned nodes as needed. This will save the time required to set up each VM separately.

    Step 2: Download Linux – CentOS

    Download CentOS 7, which is free, from www.CentOS.org. There are multiple distributions to choose from; please make sure the DVD ISO is chosen.

    The file name should look similar to CentOS-7-x86_64-DVD-1708.iso




    Step 3: Create virtual machines for the name node and data nodes

    For the setup, we will use the hostnames nn1, dn1, and dn2 for the name node, data node 1, and data node 2.

    Open Oracle Virtual Box, select New, and enter the operating system name as ‘CentOS’. The version will automatically be populated as ‘Red Hat (64-bit)’. Follow the installation prompts.

    Specify 2 GB of RAM for the VM. You can specify less, but you may run into issues, so 2 GB is recommended.
    For the hard disk, select ‘Create a virtual hard disk now’ and choose the VDI format.
    On the ‘Choose file location and size’ page, browse to the folder where you would like to keep the VM.
    Allocate 16 GB of disk space for the VM.
    Specify video memory of at least 128 MB. The screenshots below should help.


    At this point, you can click on the VM name and change the name from CentOS to nn1.

    Step 4: Setup network connectivity between Host and Virtual Machines

    Now open Global Tools in Virtual Box. Add a ‘Host-only’ adapter and configure its DHCP server settings.

    We need to set up networking between the host system and the VM. Select the VM and right-click on its name to bring up the properties.
    Change the ‘Network’ properties similar to those shown in the screenshots below. Usually, the first adapter will come up as NAT; keep it, and set up the second adapter as a ‘Host-only Adapter’.


    Check the IP address of the newly created VM using the ifconfig Linux command.

    At this point, you should be able to ping the VM from your host Windows system.

    Similarly, you should be able to ping the host from the VM. If this does not work, go back through the steps above to make sure you have not missed anything.
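
    As a concrete check, assume the VM received an address like 192.168.56.101 from the host-only DHCP server and the host-only adapter on the Windows side is 192.168.56.1 (typical VirtualBox defaults, but yours may differ; use the address ifconfig reported). Both of these should get replies:

    C:\> ping 192.168.56.101
    >># ping 192.168.56.1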

    Step 5: Setup JDK 7 or JDK 8

    I have tested with JDK 7, but JDK 8 should work as well. I have not tested with JDK 9; if anyone does, please add a comment to share the experience. We install the JDK and Hadoop components before cloning so that we do not have to repeat the installation separately on each node.

    Search for ‘Download JDK’ in your browser, or go directly to the Oracle JDK download page. You will need to log in or register if you have not done so before. Use the 64-bit .tar.gz version.

    The file should look similar to jdk-7u80-linux-x64.tar.gz

    You can either download it directly using the browser on the VM, or install WinSCP on your Windows host, set up a connection to the VM nn1, and copy the file to the /root/Downloads folder.

    [root@localhost Downloads]# tar -zxvf jdk-7u80-linux-x64.tar.gz
    [root@localhost Downloads]# ls
    jdk1.7.0_80 jdk-7u80-linux-x64.tar.gz

    [root@localhost Downloads]# mv jdk1.7.0_80 /usr/local/
    [root@localhost Downloads]# cd /usr/local/
    [root@localhost local]# ls
    bin etc games include jdk1.7.0_80 lib lib64 libexec sbin share src

    Create a soft link to the java folder so that you do not need to specify the full folder name every time.

    [root@localhost local]# ln -s /usr/local/jdk1.7.0_80 /usr/local/java

    Check that java was installed successfully.

    [root@localhost local]# java -version
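
    If the command is not found, use the full path /usr/local/java/bin/java -version, since PATH is not updated until Step 7. The output should look roughly like the following (the build numbers are illustrative and may differ):

    java version "1.7.0_80"
    Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
    Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)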

    Step 6: Download Apache Hadoop 2 and set it up on all nodes

    To download Apache Hadoop, simply search for ‘download apache hadoop’ in the browser, or access it from the Apache Hadoop Releases page.

    Go to the mirror page and select any of the mirror sites. Make sure you download the distribution version that you want to install. The 2.9 version file should look similar to: hadoop-2.9.0.tar.gz


    >># tar -zxf hadoop-2.9.0.tar.gz

    Move the Hadoop folder to /opt.
    Create a soft symbolic link to the folder (as shown below) so that you do not have to type the long name including the version. Also, in the future you will be able to replace the folder with the next version without having to change the folder name in a bunch of places.
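
    A minimal sketch of those two steps, assuming the archive extracted to a folder named hadoop-2.9.0 in the current directory:

    >># mv hadoop-2.9.0 /opt/
    >># ln -s /opt/hadoop-2.9.0 /opt/hadoop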

    Step 7: Setup Java and Hadoop environment variables

    Add the Java and Hadoop environment variables to .bash_profile and, more importantly, to .bashrc.
    The .bashrc file is executed for non-interactive shells, such as when Hadoop opens an SSH connection to another node. If you miss this step, the installation will not work.

    Here are the environment variables to be added to .bashrc and .bash_profile

    export JAVA_HOME=/usr/local/java

    export PATH=/usr/local/java/bin/:/opt/hadoop/bin/:/opt/hadoop/sbin/:$PATH

    export HADOOP_HOME=/opt/hadoop
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
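
    To apply the variables in the current session and confirm they point where you expect, something like this should work (the version printed will match whichever release you downloaded):

    >># source ~/.bashrc
    >># echo $HADOOP_HOME
    /opt/hadoop
    >># hadoop version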

    Now you can clone the nn1 name node VM to create the two data node VMs. Name them dn1 and dn2.

    Check the IP address of each VM (aka node) and note them down. You will need the addresses to set up your hosts file.

    Step 8: Setup /etc/hosts on all nodes

    The node names and IP addresses should be added to /etc/hosts on each of the virtual machines. Remove the 127.0.0.1, localhost, and 127.0.1.1 entries from the hosts file. Otherwise Hadoop starts the name node with the node name ‘localhost’ instead of nn1, which will prevent the nodes from communicating with each other, so this is a small but important step.

    The hosts file should look similar to the following.
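
    Since the screenshot is not reproduced here, this is a sketch with made-up addresses; substitute the actual IPs you noted down for your VMs:

    192.168.56.101  nn1
    192.168.56.102  dn1
    192.168.56.103  dn2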

    After adding the IP addresses with the server names to the /etc/hosts file, also set the hostname on each node by updating /etc/sysconfig/network.
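
    On CentOS 7, the simplest way to set the name is hostnamectl; if you prefer the legacy file, append a HOSTNAME line to /etc/sysconfig/network instead. Run the equivalent on dn1 and dn2 with their own names:

    >># hostnamectl set-hostname nn1
    >># echo "HOSTNAME=nn1" >> /etc/sysconfig/network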

    Typically, for production installations, administrators set up a Linux group (e.g. ‘hadoop’) and a specific Linux user (e.g. ‘hduser’) and perform the installation under that user. However, for simplicity’s sake, we will install as root. This saves us the trouble of doing sudo all the time.

    Step 9: Setup ssh keys

    SSH setup is needed to enable the nodes to talk to each other without having to enter or provide a password. SSH uses standard public-key cryptography. Hadoop relies on keys with no passphrase.

    It is very important to generate keys on each node and then copy them to the other nodes using ssh-copy-id. There are multiple ways of doing this; however, people new to Hadoop, or those without a Linux background, often do the setup on each node but forget to copy the key to the node it was generated on. This is needed so that a user can log in, for example, from nn1 to nn1.

    So log in as root to node nn1 and issue the following commands at the Linux prompt:

    >># cd $HOME
    >># ssh-keygen -t rsa
    >># ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@dn1
    >># ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@dn2
    >># ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@nn1

    Don’t forget the last statement! Now do the same on all nodes.

    Log in from each node to itself and to the other nodes. The first time, you will be prompted for a password and asked to accept the host key; answer yes. The public keys copied with ssh-copy-id are stored in $HOME/.ssh/authorized_keys on each node.

    If you are still prompted for a password, some step was missed. Check the authorized_keys file to make sure the keys for all nodes are present on all nodes; that is, nn1 should have entries for nn1, dn1, and dn2.
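
    A quick way to confirm passwordless SSH works in every direction is to run a remote command from each node; none of these should ask for a password:

    >># ssh nn1 hostname
    >># ssh dn1 hostname
    >># ssh dn2 hostname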

    Step 10: Format name node

    Now that Java, Hadoop, the environment variables, and the SSH keys are set up and working, you are ready to format the name node. This creates the on-disk metadata structure used by the name node.

    Create directories for the name node and data nodes. This is needed on each node. I called the directories namenode2 and datanode2 because I have another installation with the folder names namenode and datanode. You can use any suitable names.

    >># cd /opt/
    >># mkdir hdfs
    >># mkdir hdfs/namenode2
    >># mkdir hdfs/datanode2

    Edit the configuration files core-site.xml, hdfs-site.xml, and yarn-site.xml to include configuration settings similar to those shown in the screenshots below.
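
    Since the screenshots are not reproduced here, here is a minimal sketch of the HDFS-related settings; the nn1 hostname, port 9000, replication factor of 2, and the file:///opt/hdfs/... paths are assumptions that match the directories created above, so adjust them to your own setup.

    core-site.xml:
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn1:9000</value>
      </property>
    </configuration>

    hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hdfs/namenode2</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hdfs/datanode2</value>
      </property>
    </configuration>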


    >># hdfs namenode -format

    You should see a “successfully formatted” message. Don’t celebrate yet! You will have an opportunity a little bit later. The reason is that formatting only creates the name node metadata; many issues are not uncovered until the HDFS and YARN processes try to access or distribute data.

    Step 11: Start the DFS and YARN processes

    Start DFS process

    >># start-dfs.sh

    Check the logs in /opt/hadoop/logs for any error messages. Even though it may seem that the name node has started, there could be issues with the processes.

    If you see the keyword localhost anywhere in the log, there is an issue with the name the node is starting with. Even though the node is up, the other processes will not find it; it should start with the server name nn1. If this happens, check your hosts file to make sure there is no entry for 127.0.0.1 or localhost.

    >># more /opt/hadoop/logs/hadoop-root-namenode-nn1.log
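
    The step title also mentions YARN; start-yarn.sh launches the ResourceManager and NodeManagers. A quick sanity check on each node is jps, which lists the running Java daemons. On nn1 you should see output roughly like the following (process IDs will differ), and dn1/dn2 should show DataNode and NodeManager processes:

    >># start-yarn.sh
    >># jps
    2831 NameNode
    3012 SecondaryNameNode
    3415 ResourceManager
    3710 Jps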

    Step 12: Move a test data file from host OS to HDFS and see the distribution in action

    Create a test CSV or other file in a folder, or copy a sample file to a folder, and move it to HDFS using the following commands.

    >># hdfs dfs -mkdir /testdata

    >># hdfs dfs -put aaa.csv /testdata

    Now you can log in to the data nodes and check the subfolders under the data node directory (/opt/hdfs/datanode2) to see the file stored as blocks on each data node.
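
    You can also ask HDFS itself where the blocks ended up; aaa.csv here is just the example file name used above:

    >># hdfs dfs -ls /testdata
    >># hdfs fsck /testdata/aaa.csv -files -blocks -locations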

    Step 13: You have your multi-node cluster operational. Now celebrate!

    You did it!

    If you face any issues or have questions, please put in comments and I will answer them as soon as possible. Everyone is welcome to contribute.
