Set Up Your Own Multi-Node Cluster!
Have you ever tried to set up a cluster, only to run into technical issues and give up? Whether you are a Hadoop administrator, big data developer, or analyst, setting up your own small cluster is one of the best ways to understand how Hadoop works.
This article provides detailed steps to set up your own cluster on CentOS using Oracle VirtualBox.
What Do I Need to Set Up a Small Multi-Node Cluster?
Everything below is free and open source except your system. I will provide instructions on how to download and install all the components. A laptop or desktop with at least 16 GB of RAM is recommended. You can set it up on a smaller configuration, but I would not recommend wasting time on a smaller system. These instructions are for Windows 10; however, they should work on other platforms with minor modifications.
Now let’s take a look at the cluster topology.
Overall Cluster Topology
The cluster will have one name node and two data nodes. To keep the setup simple we will not include a standby name node; it can be added later. Also, after a successful installation, you can add more nodes if you wish.
Step 1: Download and Install Oracle VirtualBox
At the time of writing this article, the current version of Oracle VirtualBox is 5.2. You can find it with a Google search for “download Oracle VirtualBox”, or go straight to the VirtualBox downloads page. Click on the Windows hosts distribution. Once the download is complete, run the installer as administrator.
Also, download the VirtualBox Guest Additions ISO. This should be installed as administrator as well.
I recommend setting up one VM carefully and then cloning it to make the VMs for the data nodes. You can change a few settings on the cloned nodes as needed. This will save the time required to set up each VM separately.
Step 2: Download Linux – CentOS
Download CentOS 7, which is free, from www.CentOS.org. There are multiple distribution choices; please make sure the DVD ISO is chosen.
The file name should look similar to CentOS-7-x86_64-DVD-1708.iso
Step 3: Create virtual machines for the name node and data nodes
For the setup, we will use the hostnames nn1, dn1, and dn2 for the name node, data node 1, and data node 2.
Open Oracle VirtualBox, select New, and enter the operating system name as ‘CentOS’. The version will automatically be populated as ‘Red Hat (64-bit)’. Follow the installation prompts.
Specify 2 GB of RAM for the VM. You can specify less, but you may run into issues, so I recommend 2 GB.
For the hard disk, select ‘Create a virtual hard disk now’ and choose the VDI format.
On the ‘Choose file location and size’ page, browse to the folder where you would like to keep the VM.
Allocate 16 GB of disk space for the VM.
Specify video memory of at least 128 MB. The screenshots below should help.
At this point, you can click on the VM name and change the name from CentOS to nn1.
Step 4: Set up network connectivity between host and virtual machines
Now open Global Tools in VirtualBox. Add the ‘Host-only’ adapter and configure the DHCP server settings.
We need to set up networking between the host system and the VM. Select the VM and right-click on its name to bring up the properties.
Change the ‘Network’ properties to be similar to those shown in the screenshots below. Usually the first adapter will come up as NAT. Keep it, and set up the second adapter as a ‘Host-only’ adapter.
Check the IP address of the newly created VM using the Linux ifconfig command:
>># ifconfig
At this point, you should be able to ping the VM from your host Windows system.
Similarly, you should be able to ping the host from the VM using ping. If this does not work, recheck the steps above to make sure you have not missed anything.
Step 5: Set up JDK 7 or JDK 8
I have tested with JDK 7, but JDK 8 should work as well. I have not tested with JDK 9; if anyone does, please add a comment to share your experience. We are installing the JDK and Hadoop components before cloning so that we do not have to repeat the installation separately on each node.
Search for ‘Download JDK’ in your browser, or access the downloads on the Oracle JDK download page. You will need to log in or register if you have not done so before. Use the 64-bit .tar.gz version.
The file should look similar to jdk-7u80-linux-x64.tar.gz
You can either download directly using the browser on the VM, or install WinSCP on your Windows host system, set up a connection to the VM nn1, and copy the file to the /root/Downloads folder. Then extract the archive:
[root@localhost Downloads]# tar -zxvf jdk-7u80-linux-x64.tar.gz
[root@localhost Downloads]# ls
jdk1.7.0_80 jdk-7u80-linux-x64.tar.gz
[root@localhost Downloads]# mv jdk1.7.0_80 /usr/local/
[root@localhost Downloads]# cd /usr/local/
[root@localhost local]# ls
bin etc games include jdk1.7.0_80 lib lib64 libexec sbin share src
Create a soft link to the Java folder so that you do not need to specify the full folder name every time.
[root@localhost local]# ln -s /usr/local/jdk1.7.0_80 /usr/local/java
Check that Java was installed successfully.
[root@localhost local]# java -version
Step 6: Download Apache Hadoop 2 and set up on all nodes
To download Apache Hadoop, simply search for ‘download apache hadoop’ in your browser, or access it via the Apache Hadoop releases page.
Go to the mirror page and select any of the mirror sites. Make sure you download the version that you want to install. The 2.9 file name should look similar to hadoop-2.9.0.tar.gz
>># tar -zxf Hadoop…tar.gz
Move the Hadoop folder to /opt.
Create a soft (symbolic) link for the folder so that you do not have to type the long name including the version. Also, in the future you will be able to replace the folder with the next version without having to change the folder name in a bunch of places.
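To see why the link helps, here is the pattern sketched in /tmp so it can be tried safely; the paths are illustrative, and in the actual setup you would use /opt/hadoop-2.9.0 and /opt/hadoop (folder name assumed from the 2.9.0 download above).

```shell
# Versioned-folder-plus-symlink pattern, demonstrated under /tmp.
mkdir -p /tmp/opt/hadoop-2.9.0
ln -sfn /tmp/opt/hadoop-2.9.0 /tmp/opt/hadoop

# Scripts and environment variables refer only to the stable path:
readlink /tmp/opt/hadoop    # prints /tmp/opt/hadoop-2.9.0

# Upgrading later just means repointing the link; nothing else changes:
mkdir -p /tmp/opt/hadoop-2.10.0
ln -sfn /tmp/opt/hadoop-2.10.0 /tmp/opt/hadoop
readlink /tmp/opt/hadoop    # prints /tmp/opt/hadoop-2.10.0
```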
Step 7: Set up Java and Hadoop environment variables
Add the Java and Hadoop environment variables to .bash_profile and, more importantly, to .bashrc.
The .bashrc is executed for the non-interactive shells that Hadoop opens when it connects to the other nodes over SSH. If you miss this step, the installation will not work.
Here are the environment variables to be added to .bashrc and .bash_profile
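As a concrete sketch, assuming the /usr/local/java and /opt/hadoop symlinks created in the earlier steps (adjust the paths if yours differ), the lines to append to both ~/.bashrc and ~/.bash_profile would look like this:

```shell
# Java and Hadoop locations (the symlinks created in the earlier steps)
export JAVA_HOME=/usr/local/java
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

# Put the java and hadoop commands (and the start/stop scripts) on the PATH
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```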
Now you can clone the nn1 name node VM to create the two data node VMs. Name them dn1 and dn2.
Check the IP address of each VM (aka node) and note them down. You will need the addresses to set up your hosts file.
Step 8: Add node names and IP addresses to /etc/hosts
The node names and IP addresses should be added to /etc/hosts on each of the virtual machines. Remove the 127.0.0.1, localhost, and 127.0.1.1 entries from the hosts file. Otherwise the Hadoop software starts the name node with the node name ‘localhost’ instead of nn1, which will prevent the nodes from communicating with each other, so this is a small but important step.
The hosts file should look similar to the following.
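For reference, assuming the host-only network hands out addresses in the common VirtualBox 192.168.56.x range (your actual addresses from the ifconfig check will differ), the entries on every node might look like:

```
192.168.56.101   nn1
192.168.56.102   dn1
192.168.56.103   dn2
```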
After adding the IP addresses with server names to the /etc/hosts file, run the following:
Also update the hostname on each node (for example, HOSTNAME=nn1 in /etc/sysconfig/network) so that it matches that node’s entry in /etc/hosts.
Typically, for production installations, the administrators set up a Linux group (e.g. ‘hadoop’) and a specific Linux user (e.g. ‘hduser’) and perform the installation under that user. However, for simplicity’s sake, we will install as root. This will save us the trouble of doing sudo all the time.
Step 9: Set up SSH keys
SSH setup is needed so that the nodes can talk to each other without having to enter or provide a password. SSH uses public-key cryptography, and Hadoop relies on keys with no passphrase.
It is very important to generate keys on each of the nodes and then copy them to the other nodes using ssh-copy-id. There are multiple ways of doing this; however, people new to Hadoop, or those without a Linux background, often do the setup on each node but forget to copy the key to the node it was generated on. This is needed for a user to log in, for example, from nn1 to nn1.
So log in as root to node nn1 and issue the following commands at the Linux prompt:
>># cd $HOME
>># ssh-keygen -t rsa
>># ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@dn1
>># ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@dn2
>># ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@nn1
Don’t forget the last statement! Now do the same on all nodes.
Login from each node to itself, and other nodes. You will be prompted to enter a password for the first time, and prompted to add the key. Specify yes. This will add the SSH key to $HOME/.ssh/authorized_keys
If you are prompted for a password, a step was missed. Check the authorized_keys file to make sure keys are present for all nodes on all nodes; that is, nn1 should have entries for nn1, dn1, and dn2.
Step 10: Format the name node
Now that Java, Hadoop, the environment variables, and the keys are set up and working, you are ready to format the name node. This creates the on-disk metadata structure that the name node will use.
Create directories for the name node and data nodes. This is needed on each node. I called the directories namenode2 and datanode2 because I have another installation with the folder names namenode and datanode. You can use any suitable names.
>># cd /opt/
>># mkdir hdfs
>># mkdir hdfs/namenode2
>># mkdir hdfs/datanode2
Edit the configuration files core-site.xml, hdfs-site.xml, yarn-site.xml to include configuration settings similar to those shown in the screenshots below.
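As a minimal sketch of those settings, assuming the /opt/hdfs directories created above, an HDFS port of 9000, and a replication factor of 2 for the two data nodes (adjust all of these to match your setup), the three files under /opt/hadoop/etc/hadoop might contain:

```xml
<!-- core-site.xml: where clients find the name node (port 9000 is an assumption) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: the storage directories created in this step -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hdfs/namenode2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hdfs/datanode2</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

<!-- yarn-site.xml: point the node managers at the resource manager on nn1 -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>nn1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```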
>># hdfs namenode -format
You should see a ‘successfully formatted’ message. Don’t celebrate yet! You will have an opportunity a little later. The reason is that formatting only creates the metadata; issues are often not uncovered until the DFS and YARN processes try to access or distribute the data.
Step 11: Start the DFS and YARN processes
Start the DFS process, and then the YARN process:
>># start-dfs.sh
>># start-yarn.sh
You can check which daemons are running on each node with:
>># jps
Check the logs in /opt/hadoop/logs for any error messages. Even if the name node appears to have started, there could be issues with the processes.
If you see the keyword localhost anywhere in the log, there is an issue with the name the name node is starting with. Even though the node is up, the other processes will not find it; it should start with the server name nn1. If this happens, check your hosts file to make sure there is no entry for 127.0.0.1 or localhost.
>># more /opt/hadoop/logs/hadoop-root-namenode-nn1.log
Step 12: Move a test data file from host OS to HDFS and see the distribution in action
Create a test CSV or some other file in a folder, or copy a sample file into a folder, and move it to HDFS using the following commands.
>># hdfs dfs -mkdir /testdata
>># hdfs dfs -put aaa.csv /testdata
Now you can log in to the data nodes and check the subfolders under /opt/hdfs/datanode2 to see the file split into blocks on each data node.
Step 13: You have your multi-node cluster operational. Now celebrate!
You did it!
If you face any issues or have questions, please put them in the comments and I will answer them as soon as possible. Everyone is welcome to contribute.