In this article, we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use the Hortonworks Data Platform (HDP) 2.1 for Linux; Hortonworks also provides HDP distributions for the Windows platform. Furthermore, we install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.
Contents
1 Introduction
2 Step-by-Step: Build the Infrastructure
3 Install a Hadoop Distribution
Step-by-Step: Install a Hadoop Distribution
Now that we have set up the infrastructure for a Hadoop cluster in Azure, it is time to get our hands dirty installing the actual Hadoop distribution.
1. Install Ambari Server
We start off by installing the Ambari server, which allows for a “graphical” way of installing and deploying Hadoop.
1.1. Set Up Bits
Log onto your master node (in this case oldkHDPm) as root. This node will serve as the main installation host. Download the Ambari repository; since we use CentOS 6 as our platform, access the repository as follows:
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
Next, copy the repo file to your repos.d directory:
cp ambari.repo /etc/yum.repos.d
More information can be found here in the Hortonworks documentation.
You can confirm that the repository is configured by running yum repolist. You then obtain a list of repo IDs and repo names as marked in blue below. The command may vary depending on the platform (see here for more information).
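If you want to check for the Ambari repository specifically, you can filter the list of enabled repositories (a minimal sketch; the exact repo id may vary between Ambari versions):
yum repolist enabled | grep -i ambari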
Now, we can install the Ambari bits by running yum install ambari-server.
1.2. Set Up Ambari
Now that the Ambari server is installed, let us set it up. Run
ambari-server setup
Here, we do not customise the user account for the ambari-server daemon since we have already changed the root password. We also accept the default settings. More information can be found here in the Hortonworks documentation.
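As a side note, if you would rather skip the prompts entirely, the setup also offers a silent mode that accepts all defaults (assuming your Ambari version supports the -s flag; check with ambari-server setup --help):
ambari-server setup -s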
1.3. Start Ambari
The Ambari server is set up and installed – ready to be started:
ambari-server start
To have a look at the Ambari server processes, type in:
ps -ef | grep ambari
In case more than one process is running ambari, kill the extra process as follows:
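A minimal sketch of how that might look (the process ID is, of course, machine-specific and merely a placeholder here):
ps -ef | grep [a]mbari
kill <PID-of-the-extra-process>
ambari-server status
The [a] in the grep pattern is a small trick that excludes the grep command itself from the output; ambari-server status then confirms that only one server instance remains.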
2. Install Hadoop
Now we are ready to install the Hadoop distribution, i.e. HDP 2.1, using Ambari. Again, we follow along the Hortonworks documentation (here).
Log into your DNS server and open an internet browser pointing to
http://{ambari-server}:8080
In this case: http://oldkHDPm.oldkHDP.oliviak.com:8080
Name your cluster (see Hortonworks documentation), e.g. oldkHDPcluster:
Select your desired stack (see Hortonworks docs). We choose the latest for the time being, i.e. HDP 2.1:
The next window specifies the install options. Before we go into it, we take a little detour: how to copy the SSH private key onto the DNS server.
Detour: How to Copy the SSH Private Key to the Local Machine
For that purpose, we install WinSCP, which enables secure file transfer between a local and a remote computer. Once installed, log in with the credentials of the master node (i.e. oldkHDP.cloudapp.net, port 22):
Use the WinSCP client to download the private SSH key (i.e. id_rsa) of the master node into the DNS server:
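If you prefer the command line over WinSCP, scp achieves the same (a sketch, assuming an OpenSSH client is available on the DNS server and the key resides at /root/.ssh/id_rsa on the master node):
scp -P 22 root@oldkHDP.cloudapp.net:/root/.ssh/id_rsa .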
Once downloaded into the “local” machine, i.e. our DNS server, we can browse for it in the “Install Options” window:
Additionally, type in all the target hosts of your Hadoop cluster. In this case, it includes the master node and the three worker nodes:
oldkHDPm.oldkHDP.oliviak.com
oldkHDPw[1-3].oldkHDP.oliviak.com
When registering and confirming, you will be prompted with another window containing the host name pattern expressions:
Success – the hosts are confirmed. Have a look at the Hortonworks documentation for more information.
You may or may not receive some warnings as shown in the yellowish area:
It turns out that the ntpd services are not running but are required to be. You could run the HostCleanup Python script on each host…
…or manually get the ntpd services to run. Note that chkconfig only enables the service at boot time, so start it explicitly as well, by running
chkconfig ntpd on
service ntpd start
on each host, i.e. the master and all three worker nodes:
To check the status of the ntpd services, run
service ntpd status
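Rather than repeating this on every node by hand, a small loop run from the master node can cover all hosts at once (a sketch, assuming the passwordless root SSH we prepared for Ambari is in place):
for host in oldkHDPm oldkHDPw1 oldkHDPw2 oldkHDPw3; do
  ssh root@${host}.oldkHDP.oliviak.com 'chkconfig ntpd on && service ntpd start && service ntpd status'
done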
Back in the browser on the DNS server, rerun the checks:
Next, you can choose the services you wish to install on your Hadoop cluster (see HDP documentation).
Next, select the hosts on which certain master components should run (see HDP doc). In this case, we choose to assign the master components of the Hive Server and the Oozie Server to the master node.
With the Ambari wizard, slave components (i.e. DataNodes, NodeManagers and RegionServers) can be appropriately assigned to certain hosts in the next window (see HDP doc).
Now you can manage the configuration settings for the Hadoop components along the tabs:
For instance, under HDFS we change the directories from
to the following:
For the remaining tabs marked with warnings, credentials are required, such as Nagios,
...Hive...
...and Oozie:
More information on customising the services related to your Hadoop cluster can be found here.
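Once the cluster is up, you can also double-check from the command line that such a customised setting actually took effect (a sketch; run it on any node with the HDFS client installed):
hdfs getconf -confKey dfs.datanode.data.dir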
Finally, before deploying the Hadoop cluster you obtain the usual summary of configuration settings:
It contains the following information:
- Admin Name: admin
- Cluster Name: oldkHDPcluster
- Total Hosts: 4 (4 new)
- Repositories:
- RHEL 5/CentOS 5/Oracle Linux 5: http://public-repo-1.hortonworks.com/HDP/centos5/2.x/updates/2.1.2.1
- RHEL 6/CentOS 6/Oracle Linux 6 : http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.2.1
- SLES 11/SUSE 11 : http://public-repo-1.hortonworks.com/HDP/suse11/2.x/updates/2.1.2.1
Services
- HDFS
- NameNode: oldkHDPm.oldkHDP.oliviak.com
- SecondaryNameNode: oldkHDPw1.oldkHDP.oliviak.com
- DataNodes: 3 hosts
- YARN + MapReduce2
- NodeManager: 3 hosts
- ResourceManager: oldkHDPw1.oldkHDP.oliviak.com
- History Server: oldkHDPw1.oldkHDP.oliviak.com
- App Timeline Server: oldkHDPw1.oldkHDP.oliviak.com
- Tez
- Clients: 1 host
- Nagios
- Server: oldkHDPm.oldkHDP.oliviak.com
- Administrator: nagiosadmin / your-email-address@blabla.com
- Ganglia
- Server: oldkHDPm.oldkHDP.oliviak.com
- Hive + HCatalog
- Hive Metastore: oldkHDPm.oldkHDP.oliviak.com
- Database: MySQL (New Database)
- HBase
- Master: oldkHDPm.oldkHDP.oliviak.com
- RegionServers: 3 hosts
- Pig
- Clients: 1 host
- Sqoop
- Clients: 1 host
- Oozie
- Server: oldkHDPm.oldkHDP.oliviak.com
- Database: Derby (New Derby Database)
- Zookeeper
- Servers: 3 hosts
- Falcon
- Server: oldkHDPw1.oldkHDP.oliviak.com
- Storm
- Nimbus: oldkHDPm.oldkHDP.oliviak.com
- Storm REST API Server: oldkHDPm.oldkHDP.oliviak.com
- Storm UI Server: oldkHDPm.oldkHDP.oliviak.com
- DRPC Server: oldkHDPm.oldkHDP.oliviak.com
- Supervisor: 3 hosts
And away you deploy:
Finally, you obtain a summary of your Hadoop installation efforts:
Done!
You have the Ambari GUI nicely displayed in front of you:
While hovering over the tiles, you obtain more information, such as on the network usage:
or the cluster load:
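Beyond the GUI, a couple of quick commands on the master node make for a nice final sanity check (a sketch, assuming the standard hdfs service user created by the installation):
su - hdfs -c "hdfs dfs -ls /"
su - hdfs -c "hdfs dfsadmin -report"
The first command shows that HDFS answers requests; the report should list all three DataNodes as live.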