Configuring The Hadoop Cluster Using Ansible

ADARSH KUMAR
5 min read · Jan 5, 2021

In this article, we are going to configure the NameNode, DataNode, and ClientNode of a Hadoop cluster and start the Hadoop services using Ansible.

First, let's discuss some terminology.

Ansible:

Ansible is a radically simple IT automation engine that automates cloud provisioning, configuration management, application deployment, intra-service orchestration, and many other IT needs. Designed for multi-tier deployments since day one, Ansible models your IT infrastructure by describing how all of your systems inter-relate, rather than just managing one system at a time. It uses no agents and no additional custom security infrastructure, so it's easy to deploy, and most importantly, it uses a very simple language (YAML, in the form of Ansible Playbooks) that allows you to describe your automation jobs in a way that approaches plain English.

Hadoop :

Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop cluster:

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.

Such clusters run Hadoop's open-source distributed processing software on low-cost commodity computers. Typically, one machine in the cluster is designated as the NameNode, and the rest of the machines act as DataNodes; these are the slave (worker) nodes. Hadoop clusters are often referred to as "shared nothing" systems because the only thing shared between nodes is the network that connects them. There is also a ClientNode that sends data files to the cluster.

(Image from hadoop.apache.org)

Pre-requisites:

  • Ansible installed on the controller node
  • The RPM file of the Java Development Kit (JDK 8) on the controller node
  • The RPM file of Hadoop on the controller node

Let’s Begin!!

We need at least four operating systems for this practical: one for the Ansible controller node, one for the NameNode, at least one for a DataNode, and one for the ClientNode. In my case I am using four EC2 instances on AWS, but you can also use VMs on your own system.

Step 1:

First, we will create an inventory file, where we store the IPs of the NameNode, DataNode, and ClientNode.

In the inventory file we have three groups: one for the NameNode, a second for the DataNodes, and a third for the ClientNode, as in the sketch below.
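For example, the inventory file (say /root/inventory.txt; the IPs below are placeholders for my instances' public IPs) might look like this:

[namenode]
13.233.10.1

[datanode]
13.233.10.2

[clientnode]
13.233.10.3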

After setting up the inventory file, we will configure the ansible.cfg file. By default, direct login as the root user is disabled on the instances, so you can't connect with root power straight away; instead we tell Ansible to escalate privileges. So make the following changes in ansible.cfg:

vi /etc/ansible/ansible.cfg
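A minimal ansible.cfg for this setup could look like the sketch below. The inventory path and key file are placeholders from my setup, and remote_user assumes Amazon Linux's default ec2-user; the [privilege_escalation] section is what lets Ansible sudo to root even though direct root login is disabled.

[defaults]
inventory = /root/inventory.txt
host_key_checking = False
remote_user = ec2-user
private_key_file = /root/mykey.pem

[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False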

Step 2:

In this step we will configure the NameNode. For this we have to create a playbook that contains all the necessary steps, from copying the RPM files to starting the Hadoop services on the NameNode. So we have to write the below code inside the playbook namenode.yml.
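The exact playbook depends on your RPM file names and versions. A sketch along the following lines (the JDK 8u171 and Hadoop 1.2.1 RPM names, the /etc/hadoop config path, and the /nn directory are assumptions from a typical Hadoop 1.x RPM setup) covers all the steps:

- hosts: namenode
  tasks:
    - name: Copy the JDK rpm to the NameNode
      copy:
        src: /root/jdk-8u171-linux-x64.rpm
        dest: /root/
    - name: Copy the Hadoop rpm to the NameNode
      copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm
        dest: /root/
    # rpm exits non-zero if a package is already installed, hence ignore_errors
    - name: Install the JDK
      command: rpm -ivh /root/jdk-8u171-linux-x64.rpm
      ignore_errors: yes
    - name: Install Hadoop
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
      ignore_errors: yes
    - name: Create the NameNode metadata directory
      file:
        path: /nn
        state: directory
    - name: Copy hdfs-site.xml from the controller node
      copy:
        src: /root/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml
    - name: Copy core-site.xml from the controller node
      copy:
        src: /root/core-site.xml
        dest: /etc/hadoop/core-site.xml
    - name: Format the NameNode directory (echo answers the confirmation prompt)
      shell: echo Y | hadoop namenode -format
    - name: Start the NameNode daemon
      command: hadoop-daemon.sh start namenode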

We also copy hdfs-site.xml and core-site.xml from the controller node to the NameNode, so write the below code in hdfs-site.xml:
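A minimal version, assuming /nn (created by the playbook above) as the metadata directory; dfs.name.dir is the Hadoop 1.x property name (newer releases call it dfs.namenode.name.dir):

<configuration>
  <property>
    <!-- directory where the NameNode keeps its metadata -->
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>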

Then write the below code in core-site.xml:
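And a minimal core-site.xml; port 9001 is my assumption, and any free port works as long as every node uses the same one (fs.default.name is the Hadoop 1.x property name):

<configuration>
  <property>
    <!-- 0.0.0.0 makes the NameNode listen on all interfaces -->
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>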

If you are doing this practical on your own PC, put the IP of the NameNode in the place where I put 0.0.0.0. I used 0.0.0.0 because I am using EC2 instances, which have different public and private IPs, and 0.0.0.0 makes the NameNode listen on all of its interfaces.

After completing all the above steps, run the namenode.yml playbook using the below command:

ansible-playbook namenode.yml

Our NameNode is configured. Let's check the web UI of the NameNode.

Here we can see the cluster details: "Live Nodes = 0" and "Capacity = 0", as we haven't configured any DataNode yet.

After configuring the NameNode, let's jump to the configuration of the DataNode.

Step 3:

To configure the DataNode, we will write a playbook. Here I am only configuring one DataNode, but you can configure as many as you require.

We have to write the below code inside the playbook datanode.yml.
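A sketch of datanode.yml, mirroring the NameNode playbook (same assumed RPM names; /dn is an assumed storage directory, and the -datanode suffixes are hypothetical names I use to keep the DataNode variants of the config files apart on the controller node):

- hosts: datanode
  tasks:
    - name: Copy the JDK rpm to the DataNode
      copy:
        src: /root/jdk-8u171-linux-x64.rpm
        dest: /root/
    - name: Copy the Hadoop rpm to the DataNode
      copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm
        dest: /root/
    - name: Install the JDK
      command: rpm -ivh /root/jdk-8u171-linux-x64.rpm
      ignore_errors: yes
    - name: Install Hadoop
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
      ignore_errors: yes
    - name: Create the DataNode storage directory
      file:
        path: /dn
        state: directory
    - name: Copy hdfs-site.xml from the controller node
      copy:
        src: /root/hdfs-site-datanode.xml
        dest: /etc/hadoop/hdfs-site.xml
    - name: Copy core-site.xml from the controller node
      copy:
        src: /root/core-site-datanode.xml
        dest: /etc/hadoop/core-site.xml
    - name: Start the DataNode daemon
      command: hadoop-daemon.sh start datanode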

We also copy hdfs-site.xml and core-site.xml from the controller node to the DataNode, so write the below code in hdfs-site.xml:
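On the DataNode, hdfs-site.xml points at the block storage directory instead (assuming /dn, matching the playbook above; dfs.data.dir is the Hadoop 1.x property name):

<configuration>
  <property>
    <!-- directory where the DataNode stores HDFS blocks -->
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>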

Then write the below code in core-site.xml:
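core-site.xml on the DataNode must point at the NameNode rather than at 0.0.0.0. The IP below is a placeholder; use your NameNode's IP and the same port you chose earlier:

<configuration>
  <property>
    <!-- replace 1.2.3.4 with the NameNode's IP -->
    <name>fs.default.name</name>
    <value>hdfs://1.2.3.4:9001</value>
  </property>
</configuration>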

After completing all these steps, run the datanode.yml playbook using the below command:

ansible-playbook datanode.yml

After successfully configuring the DataNode, let's jump to the configuration of the ClientNode.

Step 4:

We have successfully configured the NameNode and DataNode. Now, in this step, we are going to configure the ClientNode so that we can use it to upload a data file to the Hadoop cluster.

To configure the ClientNode, we will write a playbook that sets it up and also uploads a file to the Hadoop cluster from the ClientNode, to check whether the cluster is configured properly.
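A sketch of clientnode.yml (same assumed RPM names as before; core-site-client.xml is a hypothetical file name for the client's copy of the config, and /root/inv.txt is the sample data file on the controller node):

- hosts: clientnode
  tasks:
    - name: Copy the JDK rpm to the ClientNode
      copy:
        src: /root/jdk-8u171-linux-x64.rpm
        dest: /root/
    - name: Copy the Hadoop rpm to the ClientNode
      copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm
        dest: /root/
    - name: Install the JDK
      command: rpm -ivh /root/jdk-8u171-linux-x64.rpm
      ignore_errors: yes
    - name: Install Hadoop
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
      ignore_errors: yes
    - name: Copy core-site.xml from the controller node
      copy:
        src: /root/core-site-client.xml
        dest: /etc/hadoop/core-site.xml
    - name: Copy the sample data file to the ClientNode
      copy:
        src: /root/inv.txt
        dest: /root/inv.txt
    - name: Upload the data file to the Hadoop cluster
      command: hadoop fs -put /root/inv.txt /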

For the ClientNode, only core-site.xml needs to be configured, so we only copy the core-site.xml file to the ClientNode. Write the below code in core-site.xml:
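It is the same core-site.xml as on the DataNode, pointing at the NameNode (the IP is again a placeholder for your NameNode's IP):

<configuration>
  <property>
    <!-- replace 1.2.3.4 with the NameNode's IP -->
    <name>fs.default.name</name>
    <value>hdfs://1.2.3.4:9001</value>
  </property>
</configuration>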

Now we will run the clientnode.yml playbook using the below command:

ansible-playbook clientnode.yml

The playbook runs successfully, so our ClientNode is configured, which means our whole Hadoop cluster is configured. Now let's check whether the whole cluster is working properly.

Since we uploaded the data file to the cluster from the ClientNode playbook, we will check the web UI of the NameNode to see whether it has received the data file.

Here we can see that there is a file named "inv.txt" that we uploaded from the ClientNode. That means our Hadoop cluster has been properly configured by Ansible.
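You can also verify this from the ClientNode's command line; listing the HDFS root should show the uploaded file:

hadoop fs -ls /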
