Friday, May 29, 2015

Quickly build an arbitrary-size Hadoop cluster based on Docker

Please check the updated blog!


You can go directly to section 3 and build a 3-node Hadoop cluster by following the directions.
1. Project Introduction
2. Hadoop-Cluster-Docker image Introduction
3. Steps to build a 3-node Hadoop cluster
4. Steps to build an arbitrary-size Hadoop cluster

1. Project Introduction

Building a Hadoop cluster with physical machines is very painful, especially for beginners, who may be frustrated by setup problems before ever running wordcount.
My objective is to run a Hadoop cluster on Docker and help Hadoop developers quickly build a Hadoop cluster of arbitrary size on their local host. This idea already has several implementations, but in my view they are not good enough: their image sizes are too large, they are slow to build, or they are not user friendly because they rely on third-party tools. The following table shows some problems of existing Hadoop-on-Docker projects.
Project                              Image Size      Problem
sequenceiq/hadoop-docker:latest      1.491GB         too large, only one node
sequenceiq/hadoop-docker:2.7.0       1.76GB
sequenceiq/hadoop-docker:2.6.0       1.624GB

sequenceiq/ambari:latest             1.782GB         too large, too slow, uses a third-party tool
sequenceiq/ambari:2.0.0              4.804GB
sequenceiq/ambari:1.7.0              4.761GB

alvinhenrick/hadoop-mutinode         4.331GB         too large, too slow to build, not easy to add nodes, has some bugs
My project is based on the "alvinhenrick/hadoop-mutinode" project, but I've restructured it for optimization. Here are the GitHub address and blog address of the "alvinhenrick/hadoop-mutinode" project: GitHub, Blog
The following table shows the differences between my "kiwenlau/hadoop-cluster-docker" project and the "alvinhenrick/hadoop-mutinode" project.
alvinhenrick/hadoop-mutinode images:
Image Name                    Build time      Layer number     Image Size
alvinhenrick/serf             258.213s        21               239.4MB
alvinhenrick/hadoop-base      2236.055s       58               4.328GB
alvinhenrick/hadoop-dn        51.959s         74               4.331GB
alvinhenrick/hadoop-nn-dn     49.548s         84               4.331GB

kiwenlau/hadoop-cluster-docker images:
Image Name                    Build time     Layer number       Image Size
kiwenlau/serf-dnsmasq         509.46s        8                  206.6MB
kiwenlau/hadoop-base          400.29s        7                  775.4MB
kiwenlau/hadoop-master        5.41s          9                  775.4MB
kiwenlau/hadoop-slave         2.41s          8                  775.4MB
In summary, I made the following optimizations:
  • Smaller image size
  • Faster build time
  • Fewer image layers
  • Quicker and more convenient changes to the node number
For "alvinhenrick/hadoop-mutinode" project, If you want to change node number, you have to change hadoop configuration file (slaves, which list the domain name or ip address of all nodes ), rebuild hadoop-nn-dn image, change the shell sript for starting containers! As for my "kiwenlau/hadoop-cluster-docker" project, I write a shell script (resize-cluster.sh) to automate these steps. Then you can rebuild the hadoop-master image within one minutes and run an arbitrary size Hadoop Cluster quickly! The default node number of my project is 3 and you can change is to any size you like! In addition, building image, running container, starting Hadoop and run wordcount, all these jobs are automated by shell scripts. So you can use and develop this project more easily! Welcome to join this project
Development environment
  • OS: Ubuntu 14.04 and Ubuntu 12.04
  • Kernel: 3.13.0-32-generic
  • Docker: 1.5.0 and 1.6.2
Attention: an old kernel version or a small memory size will cause failures when running my project.
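You can verify your environment quickly before starting:

uname -r                # kernel version, e.g. 3.13.0-32-generic
free -m                 # memory size in MB
sudo docker version     # Docker version, 1.5.0 or newer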

2. Hadoop-Cluster-Docker image Introduction

I developed 4 Docker images for this project:
  • serf-dnsmasq
  • hadoop-base
  • hadoop-master
  • hadoop-slave
serf-dnsmasq
  • based on ubuntu:15.04: the smallest Ubuntu image
  • install serf: serf is a decentralized cluster membership tool that can recognize all nodes of the Hadoop cluster
  • install dnsmasq: dnsmasq is a lightweight DNS server that can provide domain name resolution for the Hadoop cluster
When the containers start, the IP address of the master node is passed to all slave nodes, and serf starts along with the containers. The serf agents on the slave nodes recognize the master node because they know its IP address, and the serf agent on the master node in turn recognizes all slave nodes. Since the serf agents gossip with each other, every node knows every other node after a while. Whenever a serf agent recognizes a new node, it reconfigures dnsmasq and restarts it, so dnsmasq eventually provides domain name resolution for all nodes of the Hadoop cluster. However, this setup takes longer as the node number increases, so when you run more nodes, you should verify that the serf agents have found all nodes and that dnsmasq can resolve them before you start Hadoop. Using serf and dnsmasq to solve the FQDN problem was proposed by SequenceIQ, a startup focusing on running Hadoop on Docker; you can read this slide for more details.
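As a rough illustration, a serf event handler that keeps dnsmasq in sync with cluster membership could look like the sketch below. The hosts file path and the restart command are assumptions; the actual scripts in the serf-dnsmasq image may differ.

#!/bin/bash
# hypothetical serf event handler: update the dnsmasq hosts file
# whenever nodes join or leave the cluster
HOSTS_FILE=/etc/dnsmasq.hosts

# serf passes one "name address role" line per affected member on stdin
while read -r name address rest; do
    case "$SERF_EVENT" in
        member-join)
            echo "$address $name" >> "$HOSTS_FILE"
            ;;
        member-leave|member-failed)
            sed -i "/ $name\$/d" "$HOSTS_FILE"   # drop the node's entry
            ;;
    esac
done

# tell dnsmasq to pick up the updated hosts file
service dnsmasq restart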
hadoop-base
  • based on serf-dnsmasq
  • install JDK (OpenJDK)
  • install openssh-server and configure password-free SSH
  • install vim: happy coding inside the Docker container :)
  • install Hadoop 2.3.0: install compiled Hadoop (2.5.2, 2.6.0 and 2.7.0 are bigger than 2.3.0)
You can check my blog post on compiling Hadoop: Steps to compile 64-bit Hadoop 2.3.0 under Ubuntu 14.04
If you want to rebuild the hadoop-base image, you need to download the compiled Hadoop and put it inside the hadoop-cluster-docker/hadoop-base/files directory. Here is the address for downloading compiled Hadoop: hadoop-2.3.0

If you want to try another version of Hadoop, you can download these compiled Hadoop builds.
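For reference, the password-free SSH setup in hadoop-base presumably boils down to something like this sketch (not the image's exact commands):

# generate a key pair without a passphrase
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# authorize the key; since every node is built from the same image,
# every node accepts every other node's key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# skip interactive host key checks between cluster nodes
echo -e "Host *\n    StrictHostKeyChecking no" >> ~/.ssh/config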
hadoop-master
  • based on hadoop-base
  • configure the Hadoop master
  • format the namenode
We need to configure the slaves file during this step, and the slaves file needs to list the domain names or IP addresses of all nodes. Thus, when we change the node number of the Hadoop cluster, the slaves file must change as well. That's why we need to change the slaves file and rebuild the hadoop-master image whenever we want to change the node number. I wrote a shell script named resize-cluster.sh that rebuilds the hadoop-master image automatically to support a Hadoop cluster of arbitrary size: you only need to give the node number as its parameter. Building the hadoop-master image costs only about 1 minute, since it only does some configuration work.
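For example, with the kiwenlau.com domain used by serf-dnsmasq, the slaves file of the default 3-node cluster would presumably contain the two slave hostnames:

slave1.kiwenlau.com
slave2.kiwenlau.com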
hadoop-slave
  • based on hadoop-base
  • configure hadoop slave node
Image size analysis
The following table shows the output of "sudo docker images":
REPOSITORY                 TAG       IMAGE ID        CREATED          VIRTUAL SIZE
kiwenlau/hadoop-slave      0.1.0     d63869855c03    17 hours ago     777.4 MB
kiwenlau/hadoop-master     0.1.0     7c9d32ede450    17 hours ago     777.4 MB
kiwenlau/hadoop-base       0.1.0     5571bd5de58e    17 hours ago     777.4 MB
kiwenlau/serf-dnsmasq      0.1.0     09ed89c24ee8    17 hours ago     206.7 MB
ubuntu                     15.04     bd94ae587483    3 weeks ago      131.3 MB
Thus:
  • serf-dnsmasq adds 75.4MB on top of ubuntu:15.04
  • hadoop-base adds 570.7MB on top of serf-dnsmasq
  • hadoop-master and hadoop-slave add 0MB on top of hadoop-base
The following table shows partial output of "docker history kiwenlau/hadoop-base:0.1.0":
IMAGE            CREATED             CREATED BY                                          SIZE
2039b9b81146     44 hours ago        /bin/sh -c #(nop) ADD multi:a93c971a49514e787       158.5 MB
cdb620312f30     44 hours ago        /bin/sh -c apt-get install -y openjdk-7-jdk         324.6 MB
da7d10c790c1     44 hours ago        /bin/sh -c apt-get install -y openssh-server        87.58 MB
c65cb568defc     44 hours ago        /bin/sh -c curl -Lso serf.zip https://dl.bint       14.46 MB
3e22b3d72e33     44 hours ago        /bin/sh -c apt-get update && apt-get install        60.89 MB
b68f8c8d2140     3 weeks ago         /bin/sh -c #(nop) ADD file:d90f7467c470bfa9a3       131.3 MB
Thus:
  • the base image ubuntu:15.04 is 131.3MB
  • installing OpenJDK costs 324.6MB
  • installing Hadoop costs 158.5MB
  • the total size of Ubuntu, OpenJDK and Hadoop is 614.4MB
The following picture shows the image architecture of my project:
[image: image architecture - ubuntu:15.04 → serf-dnsmasq → hadoop-base → hadoop-master and hadoop-slave]
So my Hadoop images are nearly minimal in size, and it's hard to optimize them further.

3. Steps to build a 3-node Hadoop cluster

a. pull images
sudo docker pull kiwenlau/hadoop-master:0.1.0
sudo docker pull kiwenlau/hadoop-slave:0.1.0
sudo docker pull kiwenlau/hadoop-base:0.1.0
sudo docker pull kiwenlau/serf-dnsmasq:0.1.0
check downloaded images
sudo docker images
output
REPOSITORY                TAG       IMAGE ID        CREATED         VIRTUAL SIZE
kiwenlau/hadoop-slave     0.1.0     d63869855c03    17 hours ago    777.4 MB
kiwenlau/hadoop-master    0.1.0     7c9d32ede450    17 hours ago    777.4 MB
kiwenlau/hadoop-base      0.1.0     5571bd5de58e    17 hours ago    777.4 MB
kiwenlau/serf-dnsmasq     0.1.0     09ed89c24ee8    17 hours ago    206.7 MB
  • hadoop-base is based on serf-dnsmasq, and hadoop-slave and hadoop-master are based on hadoop-base
  • so the total size of all four images is only 777.4MB
b. clone source code
git clone https://github.com/kiwenlau/hadoop-cluster-docker
c. run container
cd hadoop-cluster-docker
./start-container.sh
output
start master container...
start slave1 container...
start slave2 container...
root@master:~#
  • this starts 3 containers: 1 master and 2 slaves
  • you will land in the /root directory of the master container after all containers start
list the files inside the /root directory of the master container
ls
output
hdfs  run-wordcount.sh    serf_log  start-hadoop.sh  start-ssh-serf.sh
  • start-hadoop.sh is the shell script to start hadoop
  • run-wordcount.sh is the shell script to run wordcount program
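For reference, start-container.sh (used in step c above) presumably runs docker commands along these lines; the --dns flag and the JOIN_IP variable are assumptions about how the master's IP address reaches the slaves:

# start the master first so its IP can be passed to the slaves for serf
sudo docker run -d -t --dns 127.0.0.1 --name master -h master.kiwenlau.com kiwenlau/hadoop-master:0.1.0
MASTER_IP=$(sudo docker inspect --format '{{.NetworkSettings.IPAddress}}' master)

# the slaves join the serf cluster through the master's IP address
sudo docker run -d -t --dns 127.0.0.1 --name slave1 -h slave1.kiwenlau.com -e JOIN_IP=$MASTER_IP kiwenlau/hadoop-slave:0.1.0
sudo docker run -d -t --dns 127.0.0.1 --name slave2 -h slave2.kiwenlau.com -e JOIN_IP=$MASTER_IP kiwenlau/hadoop-slave:0.1.0

# finally, open a shell in the master container
sudo docker exec -it master bash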
d. test serf and dnsmasq service
In fact, you can skip this step and just wait for about 1 minute: serf and dnsmasq need some time to start their services.
list all nodes of hadoop cluster
serf members
output
master.kiwenlau.com  172.17.0.65:7946  alive  
slave1.kiwenlau.com  172.17.0.66:7946  alive  
slave2.kiwenlau.com  172.17.0.67:7946  alive
You can wait for a while if any nodes don't show up, since the serf agents need time to recognize all nodes.
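If you prefer to script the wait instead of checking by hand, a simple polling loop works:

# wait until all 3 nodes report as alive
until [ "$(serf members | grep -c alive)" -ge 3 ]; do
    sleep 5
done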
test ssh
ssh slave2.kiwenlau.com
output
Warning: Permanently added 'slave2.kiwenlau.com,172.17.0.67' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 15.04 (GNU/Linux 3.13.0-53-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@slave2:~#
exit the slave2 node
exit
output
logout
Connection to slave2.kiwenlau.com closed.
  • Please wait for a while if SSH fails; dnsmasq needs time to configure the domain name resolution service
  • You can start Hadoop after these tests!

e. start hadoop

./start-hadoop.sh
  • make sure you have exited from slave2 back to the master node (step d) before running this script
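For a Hadoop 2.x installation, start-hadoop.sh presumably amounts to the two standard startup scripts below (a sketch; the actual script may do more):

$HADOOP_HOME/sbin/start-dfs.sh     # start the namenode and all datanodes
$HADOOP_HOME/sbin/start-yarn.sh    # start the resourcemanager and all nodemanagers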
f. run wordcount
./run-wordcount.sh
output
input file1.txt:
Hello Hadoop

input file2.txt:
Hello Docker

wordcount output:
Docker    1
Hadoop    1
Hello    2
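For reference, run-wordcount.sh presumably does something along these lines; the file names match the output above, while the HDFS paths and the jar path are assumptions:

# create two small input files
echo "Hello Hadoop" > file1.txt
echo "Hello Docker" > file2.txt

# copy them into HDFS
hadoop fs -mkdir -p input
hadoop fs -put file1.txt file2.txt input

# run the bundled wordcount example and print the result
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount input output
hadoop fs -cat output/part-r-00000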
4. Steps to build an arbitrary-size Hadoop cluster
a. Preparation
  • check steps a~b of section 3: pull the images and clone the source code
  • you don't have to pull serf-dnsmasq, but you do need to pull hadoop-base, since rebuilding hadoop-master is based on hadoop-base
b. rebuild hadoop-master
./resize-cluster.sh 5
  • it only takes about 1 minute
  • you can use any integer as the parameter of resize-cluster.sh: 1, 2, 3, 4, 5, 6...
c. start container
./start-container.sh 5
  • you can use any integer as the parameter of start-container.sh: 1, 2, 3, 4, 5, 6...
  • you'd better use the same parameter as in step b
d. run the Hadoop cluster
  • check steps d~f of section 3: test serf and dnsmasq, start Hadoop, and run wordcount
  • please test the serf and dnsmasq services before starting Hadoop
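Putting section 4 together, building and testing a 5-node cluster (1 master and 4 slaves) looks like this:

./resize-cluster.sh 5      # rebuild hadoop-master for 5 nodes
./start-container.sh 5     # start 1 master and 4 slave containers

# then, inside the master container:
serf members               # wait until all 5 nodes show up as alive
./start-hadoop.sh
./run-wordcount.sh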
All rights reserved. Please keep the author name (KiwenLau) and the original blog link:
http://kiwenlau.blogspot.com/2015/05/quickly-build-arbitrary-size-hadoop.html

Comments:

  1. Hi,

    Thanks for your excellent post. I was looking for a clean way to get back to hadoop/big data -- without relying too much on third-party tools -- and your post meets my requirement very well.

    I'll play around with it for a while and post my feedback, if any.

    Once again, thank you, and Alvin.

    CT

  2. Very nice article.

    I followed all the steps and everything is working fine.

    I wanted to know how I can add more node(s) on demand without disturbing the running containers.

    Thank you for sharing.

    Replies
    1. In fact, it is not possible to add nodes without disturbing the running containers, because we need to change the Hadoop configuration and rebuild the Docker images before adding nodes.

    2. OK, thanks for the reply :)

  3. I'm new to Docker and wanted to run your Hadoop cluster.
    Using Mac OS X, I seem to be having an issue when executing Docker from a script.
    I created a test.sh with one line:
    sudo /usr/local/bin/docker-machine ls
    and I get a "machine does not exist" error. Any idea?

    paulsintsmacair:hadoop-cluster-docker ponks$ docker-machine ls
    NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
    default * virtualbox Running tcp://192.168.99.100:2376 v1.9.1
    paulsintsmacair:hadoop-cluster-docker ponks$ vi test.sh
    paulsintsmacair:hadoop-cluster-docker ponks$ ./test.sh
    Password:
    NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
    default - virtualbox Error Unknown machine does not exist

    Replies
    1. Hey Paul, you are doing Swarm ... and I have a feeling this post does not consider a multi-host deployment at all!

  4. Hi, impressive stuff. Question: is there a web-based management interface for this cluster? How can I reach it from the host machine?
  5. Warning: Permanently added 'slave2.kiwenlau.com,172.17.0.67' (ECDSA) to the list of known hosts.
    Welcome to Ubuntu 15.04 (GNU/Linux 3.13.0-53-generic x86_64)
    * Documentation: https://help.ubuntu.com/
    The programs included with the Ubuntu system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
    applicable law.
    How do I disable this banner message? Any idea?

  6. Great job! I will try to use it as a sandbox for my project before deploying.
  7. I can't execute ./start-container.sh: when I changed to CentOS 6.7 I get "permission denied". Any idea?
  8. Your project gives the impression that you can deploy a dockerized version of Hadoop on three separate nodes (physical servers). Have you played around with this? There are plenty of solutions out there to deploy Hadoop using Docker, but none of them address the need to deploy namenodes and datanodes on physically separate servers.
  9. Hi, thanks for this excellent post!

    I was wondering if I could run Hadoop in a single container. Is it necessary to run Hadoop in clustered form?
  10. Hi, excellent post!

    One question: can I add a new node that is on another Docker host/server?
  11. Hi, it's very informative.
    hadoop-cluster-docker venky$ docker network create hadoop
  12. Hi, it's a nice post.

    I tried to create the image with the instructions given, with a small modification: I installed oracle-java8 and hadoop-2.7.3. But when I start the services I get the following error:
    root@hadoop-master:~# ./start-hadoop.sh


    Starting namenodes on [hadoop-master]
    hadoop-master: Warning: Permanently added 'hadoop-master,172.18.0.2' (ECDSA) to the list of known hosts.
    hadoop-master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-hadoop-master.out
    : Name or service not knownstname hadoop-slave3
    : Name or service not knownstname hadoop-slave1
    : Name or service not knownstname hadoop-slave2
    : Name or service not knownstname hadoop-slave4
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
    0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-hadoop-master.out


    starting yarn daemons
    starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-hadoop-master.out
    : Name or service not knownstname hadoop-slave4
    : Name or service not knownstname hadoop-slave3
    : Name or service not knownstname hadoop-slave1
    : Name or service not knownstname hadoop-slave2


    Can you please help me out here?

    Thanks

  13. This is one of the best Hadoop dockerization articles I've seen so far; it covers the installation/configuration part. If there were a prototype use case, that would be perfect :)
  14. Hi,
    It's a very nice blog.
    Thank you for giving valuable information on Hadoop.
    I'm expecting much more from you...