Friday, May 29, 2015

Quickly build an arbitrary-size Hadoop cluster based on Docker

Please check the updated blog!


You can go directly to section 3 and build a 3-node Hadoop cluster by following the directions.
1. Project Introduction
2. Hadoop-Cluster-Docker image Introduction
3. Steps to build a 3-node Hadoop cluster
4. Steps to build an arbitrary-size Hadoop cluster

1. Project Introduction

Building a Hadoop cluster with physical machines is very painful, especially for beginners, who may be frustrated by setup problems before ever running wordcount.
My objective is to run a Hadoop cluster on Docker and help Hadoop developers quickly build a Hadoop cluster of arbitrary size on their local host. This idea already has several implementations, but in my view they are not good enough: their image sizes are too large, they are slow to build, or they are not user friendly because they rely on third-party tools. The following table shows some problems of existing Hadoop-on-Docker projects.
Project                              Image Size      Problem
sequenceiq/hadoop-docker:latest      1.491GB         too large, only one node
sequenceiq/hadoop-docker:2.7.0       1.76GB
sequenceiq/hadoop-docker:2.6.0       1.624GB

sequenceiq/ambari:latest             1.782GB         too large, too slow, uses a third-party tool
sequenceiq/ambari:2.0.0              4.804GB
sequenceiq/ambari:1.7.0              4.761GB

alvinhenrick/hadoop-mutinode         4.331GB         too large, too slow to build, not easy to add nodes, has some bugs
My project is based on the "alvinhenrick/hadoop-mutinode" project, but I've restructured it for optimization. Here are the GitHub address and blog address of the "alvinhenrick/hadoop-mutinode" project: GitHub, Blog
The following table shows the differences between my "kiwenlau/hadoop-cluster-docker" project and the "alvinhenrick/hadoop-mutinode" project.
alvinhenrick/hadoop-mutinode images:
Image Name                    Build time      Layer number     Image Size
alvinhenrick/serf             258.213s        21               239.4MB
alvinhenrick/hadoop-base      2236.055s       58               4.328GB
alvinhenrick/hadoop-dn        51.959s         74               4.331GB
alvinhenrick/hadoop-nn-dn     49.548s         84               4.331GB

kiwenlau/hadoop-cluster-docker images:
Image Name                    Build time     Layer number       Image Size
kiwenlau/serf-dnsmasq         509.46s        8                  206.6MB
kiwenlau/hadoop-base          400.29s        7                  775.4MB
kiwenlau/hadoop-master        5.41s          9                  775.4MB
kiwenlau/hadoop-slave         2.41s          8                  775.4MB
In summary, I made the following optimizations:
  • Smaller image size
  • Faster build time
  • Fewer image layers
  • Quicker and more convenient changes to the node number
For "alvinhenrick/hadoop-mutinode" project, If you want to change node number, you have to change hadoop configuration file (slaves, which list the domain name or ip address of all nodes ), rebuild hadoop-nn-dn image, change the shell sript for starting containers! As for my "kiwenlau/hadoop-cluster-docker" project, I write a shell script (resize-cluster.sh) to automate these steps. Then you can rebuild the hadoop-master image within one minutes and run an arbitrary size Hadoop Cluster quickly! The default node number of my project is 3 and you can change is to any size you like! In addition, building image, running container, starting Hadoop and run wordcount, all these jobs are automated by shell scripts. So you can use and develop this project more easily! Welcome to join this project
Development environment
  • OS: Ubuntu 14.04 and Ubuntu 12.04
  • Kernel: 3.13.0-32-generic
  • Docker: 1.5.0 and 1.6.2
Attention: an old kernel version or a small memory size will cause failures when running my project.
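You can verify your environment quickly before starting:

uname -r                # kernel version, e.g. 3.13.0-32-generic
free -m                 # memory size in MB
sudo docker version     # Docker version, 1.5.0 or newer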

2. Hadoop-Cluster-Docker image Introduction

I developed 4 Docker images for this project:
  • serf-dnsmasq
  • hadoop-base
  • hadoop-master
  • hadoop-slave
serf-dnsmasq
  • based on ubuntu:15.04: the smallest Ubuntu image
  • install serf: serf is a decentralized cluster membership tool that can recognize all nodes of the Hadoop cluster
  • install dnsmasq: dnsmasq is a lightweight DNS server that can provide domain name resolution for the Hadoop cluster
When the containers start, the IP address of the master node is passed to all slave nodes, and serf starts along with the containers. The serf agents on the slave nodes recognize the master node because they know its IP address, and the serf agent on the master node in turn recognizes all slave nodes. Since the serf agents gossip with each other, every node knows every other node after a while. Whenever a serf agent recognizes a new node, it reconfigures dnsmasq and restarts it, so dnsmasq eventually provides domain name resolution for all nodes of the Hadoop cluster. However, this setup takes longer as the node number increases, so when you run more nodes, you should verify that the serf agents have found all nodes and that dnsmasq can resolve them before you start Hadoop. Using serf and dnsmasq to solve the FQDN problem was proposed by SequenceIQ, a startup focusing on running Hadoop on Docker; you can read this slide for more details.
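As a rough illustration, a serf event handler that keeps dnsmasq in sync with cluster membership could look like the sketch below. The hosts file path and the restart command are assumptions; the actual scripts in the serf-dnsmasq image may differ.

#!/bin/bash
# hypothetical serf event handler: update the dnsmasq hosts file
# whenever nodes join or leave the cluster
HOSTS_FILE=/etc/dnsmasq.hosts

# serf passes one "name address role" line per affected member on stdin
while read -r name address rest; do
    case "$SERF_EVENT" in
        member-join)
            echo "$address $name" >> "$HOSTS_FILE"
            ;;
        member-leave|member-failed)
            sed -i "/ $name\$/d" "$HOSTS_FILE"   # drop the node's entry
            ;;
    esac
done

# tell dnsmasq to pick up the updated hosts file
service dnsmasq restart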
hadoop-base
  • based on serf-dnsmasq
  • install JDK (OpenJDK)
  • install openssh-server and configure password-free SSH
  • install vim: happy coding inside the Docker container :)
  • install Hadoop 2.3.0: install compiled Hadoop (2.5.2, 2.6.0 and 2.7.0 are bigger than 2.3.0)
You can check my blog post on compiling Hadoop: Steps to compile 64-bit Hadoop 2.3.0 under Ubuntu 14.04
If you want to rebuild the hadoop-base image, you need to download the compiled Hadoop and put it inside the hadoop-cluster-docker/hadoop-base/files directory. Here is the address for downloading compiled Hadoop: hadoop-2.3.0

If you want to try another version of Hadoop, you can download these compiled Hadoop builds.
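For reference, the password-free SSH setup in hadoop-base presumably boils down to something like this sketch (not the image's exact commands):

# generate a key pair without a passphrase
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# authorize the key; since every node is built from the same image,
# every node accepts every other node's key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# skip interactive host key checks between cluster nodes
echo -e "Host *\n    StrictHostKeyChecking no" >> ~/.ssh/config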
hadoop-master
  • based on hadoop-base
  • configure the Hadoop master
  • format the namenode
We need to configure the slaves file during this step, and the slaves file needs to list the domain names or IP addresses of all nodes. Thus, when we change the node number of the Hadoop cluster, the slaves file must change as well. That's why we need to change the slaves file and rebuild the hadoop-master image whenever we want to change the node number. I wrote a shell script named resize-cluster.sh that rebuilds the hadoop-master image automatically to support a Hadoop cluster of arbitrary size: you only need to give the node number as its parameter. Building the hadoop-master image costs only about 1 minute, since it only does some configuration work.
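For example, with the kiwenlau.com domain used by serf-dnsmasq, the slaves file of the default 3-node cluster would presumably contain the two slave hostnames:

slave1.kiwenlau.com
slave2.kiwenlau.com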
hadoop-slave
  • based on hadoop-base
  • configure hadoop slave node
Image size analysis
The following table shows the output of "sudo docker images":
REPOSITORY                 TAG       IMAGE ID        CREATED          VIRTUAL SIZE
kiwenlau/hadoop-slave      0.1.0     d63869855c03    17 hours ago     777.4 MB
kiwenlau/hadoop-master     0.1.0     7c9d32ede450    17 hours ago     777.4 MB
kiwenlau/hadoop-base       0.1.0     5571bd5de58e    17 hours ago     777.4 MB
kiwenlau/serf-dnsmasq      0.1.0     09ed89c24ee8    17 hours ago     206.7 MB
ubuntu                     15.04     bd94ae587483    3 weeks ago      131.3 MB
Thus:
  • serf-dnsmasq adds 75.4MB on top of ubuntu:15.04
  • hadoop-base adds 570.7MB on top of serf-dnsmasq
  • hadoop-master and hadoop-slave add 0MB on top of hadoop-base
The following table shows partial output of "docker history kiwenlau/hadoop-base:0.1.0":
IMAGE            CREATED             CREATED BY                                          SIZE
2039b9b81146     44 hours ago        /bin/sh -c #(nop) ADD multi:a93c971a49514e787       158.5 MB
cdb620312f30     44 hours ago        /bin/sh -c apt-get install -y openjdk-7-jdk         324.6 MB
da7d10c790c1     44 hours ago        /bin/sh -c apt-get install -y openssh-server        87.58 MB
c65cb568defc     44 hours ago        /bin/sh -c curl -Lso serf.zip https://dl.bint       14.46 MB
3e22b3d72e33     44 hours ago        /bin/sh -c apt-get update && apt-get install        60.89 MB
b68f8c8d2140     3 weeks ago         /bin/sh -c #(nop) ADD file:d90f7467c470bfa9a3       131.3 MB
Thus:
  • the base image ubuntu:15.04 is 131.3MB
  • installing OpenJDK costs 324.6MB
  • installing Hadoop costs 158.5MB
  • the total size of Ubuntu, OpenJDK and Hadoop is 614.4MB
The following picture shows the image architecture of my project:
[image: image architecture - ubuntu:15.04 → serf-dnsmasq → hadoop-base → hadoop-master and hadoop-slave]
So my Hadoop images are nearly minimal in size, and it's hard to optimize them further.

3. Steps to build a 3-node Hadoop cluster

a. pull images
sudo docker pull kiwenlau/hadoop-master:0.1.0
sudo docker pull kiwenlau/hadoop-slave:0.1.0
sudo docker pull kiwenlau/hadoop-base:0.1.0
sudo docker pull kiwenlau/serf-dnsmasq:0.1.0
check downloaded images
sudo docker images
output
REPOSITORY                TAG       IMAGE ID        CREATED         VIRTUAL SIZE
kiwenlau/hadoop-slave     0.1.0     d63869855c03    17 hours ago    777.4 MB
kiwenlau/hadoop-master    0.1.0     7c9d32ede450    17 hours ago    777.4 MB
kiwenlau/hadoop-base      0.1.0     5571bd5de58e    17 hours ago    777.4 MB
kiwenlau/serf-dnsmasq     0.1.0     09ed89c24ee8    17 hours ago    206.7 MB
  • hadoop-base is based on serf-dnsmasq, and hadoop-slave and hadoop-master are based on hadoop-base
  • so the total size of all four images is only 777.4MB
b. clone source code
git clone https://github.com/kiwenlau/hadoop-cluster-docker
c. run container
cd hadoop-cluster-docker
./start-container.sh
output
start master container...
start slave1 container...
start slave2 container...
root@master:~#
  • this starts 3 containers: 1 master and 2 slaves
  • you will land in the /root directory of the master container after all containers start
list the files inside the /root directory of the master container
ls
output
hdfs  run-wordcount.sh    serf_log  start-hadoop.sh  start-ssh-serf.sh
  • start-hadoop.sh is the shell script to start hadoop
  • run-wordcount.sh is the shell script to run wordcount program
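For reference, start-container.sh (used in step c above) presumably runs docker commands along these lines; the --dns flag and the JOIN_IP variable are assumptions about how the master's IP address reaches the slaves:

# start the master first so its IP can be passed to the slaves for serf
sudo docker run -d -t --dns 127.0.0.1 --name master -h master.kiwenlau.com kiwenlau/hadoop-master:0.1.0
MASTER_IP=$(sudo docker inspect --format '{{.NetworkSettings.IPAddress}}' master)

# the slaves join the serf cluster through the master's IP address
sudo docker run -d -t --dns 127.0.0.1 --name slave1 -h slave1.kiwenlau.com -e JOIN_IP=$MASTER_IP kiwenlau/hadoop-slave:0.1.0
sudo docker run -d -t --dns 127.0.0.1 --name slave2 -h slave2.kiwenlau.com -e JOIN_IP=$MASTER_IP kiwenlau/hadoop-slave:0.1.0

# finally, open a shell in the master container
sudo docker exec -it master bash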
d. test serf and dnsmasq service
In fact, you can skip this step and just wait for about 1 minute: serf and dnsmasq need some time to start their services.
list all nodes of hadoop cluster
serf members
output
master.kiwenlau.com  172.17.0.65:7946  alive  
slave1.kiwenlau.com  172.17.0.66:7946  alive  
slave2.kiwenlau.com  172.17.0.67:7946  alive
You can wait for a while if any nodes don't show up, since the serf agents need time to recognize all nodes.
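If you prefer to script the wait instead of checking by hand, a simple polling loop works:

# wait until all 3 nodes report as alive
until [ "$(serf members | grep -c alive)" -ge 3 ]; do
    sleep 5
done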
test ssh
ssh slave2.kiwenlau.com
output
Warning: Permanently added 'slave2.kiwenlau.com,172.17.0.67' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 15.04 (GNU/Linux 3.13.0-53-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@slave2:~#
exit the slave2 node
exit
output
logout
Connection to slave2.kiwenlau.com closed.
  • Please wait for a while if SSH fails; dnsmasq needs time to configure the domain name resolution service
  • You can start Hadoop after these tests!

e. start hadoop

./start-hadoop.sh
  • make sure you have exited from slave2 back to the master node (step d) before running this script
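For a Hadoop 2.x installation, start-hadoop.sh presumably amounts to the two standard startup scripts below (a sketch; the actual script may do more):

$HADOOP_HOME/sbin/start-dfs.sh     # start the namenode and all datanodes
$HADOOP_HOME/sbin/start-yarn.sh    # start the resourcemanager and all nodemanagers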
f. run wordcount
./run-wordcount.sh
output
input file1.txt:
Hello Hadoop

input file2.txt:
Hello Docker

wordcount output:
Docker    1
Hadoop    1
Hello    2
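For reference, run-wordcount.sh presumably does something along these lines; the file names match the output above, while the HDFS paths and the jar path are assumptions:

# create two small input files
echo "Hello Hadoop" > file1.txt
echo "Hello Docker" > file2.txt

# copy them into HDFS
hadoop fs -mkdir -p input
hadoop fs -put file1.txt file2.txt input

# run the bundled wordcount example and print the result
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount input output
hadoop fs -cat output/part-r-00000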
4. Steps to build an arbitrary-size Hadoop cluster
a. Preparation
  • check steps a~b of section 3: pull the images and clone the source code
  • you don't have to pull serf-dnsmasq, but you do need to pull hadoop-base, since rebuilding hadoop-master is based on hadoop-base
b. rebuild hadoop-master
./resize-cluster.sh 5
  • it only takes about 1 minute
  • you can use any integer as the parameter of resize-cluster.sh: 1, 2, 3, 4, 5, 6...
c. start container
./start-container.sh 5
  • you can use any integer as the parameter of start-container.sh: 1, 2, 3, 4, 5, 6...
  • you'd better use the same parameter as in step b
d. run the Hadoop cluster
  • check steps d~f of section 3: test serf and dnsmasq, start Hadoop, and run wordcount
  • please test the serf and dnsmasq services before starting Hadoop
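Putting section 4 together, building and testing a 5-node cluster (1 master and 4 slaves) looks like this:

./resize-cluster.sh 5      # rebuild hadoop-master for 5 nodes
./start-container.sh 5     # start 1 master and 4 slave containers

# then, inside the master container:
serf members               # wait until all 5 nodes show up as alive
./start-hadoop.sh
./run-wordcount.sh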
All rights reserved. Please keep the author name (KiwenLau) and the original blog link:
http://kiwenlau.blogspot.com/2015/05/quickly-build-arbitrary-size-hadoop.html

Comments:

  1. Hi,

    Thanks for your excellent post. I was looking for a clean way to get back to hadoop/big data -- without relying too much on third-party tools -- and your post meets my requirement very well.

    I'll play around with it for a while and post my feedback, if any.

    Once again, thank you, and Alvin.

    CT

  2. Very nice article.

    I followed all the steps and everything is working fine.

    I wanted to know how I can add more node(s) on demand without disturbing the running containers.

    Thank you for sharing.

    Replies
    1. In fact, it is not possible to add nodes without disturbing the running containers, because we need to change the Hadoop configuration and rebuild the Docker images before adding nodes.

    2. OK, thanks for the reply :)

  3. I'm new to Docker and wanted to run your Hadoop cluster.
    Using Mac OS X, I seem to be having an issue when executing Docker from a script.
    I created a test.sh with one line:
    sudo /usr/local/bin/docker-machine ls
    and I get a "machine does not exist" error. Any idea?

    paulsintsmacair:hadoop-cluster-docker ponks$ docker-machine ls
    NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
    default * virtualbox Running tcp://192.168.99.100:2376 v1.9.1
    paulsintsmacair:hadoop-cluster-docker ponks$ vi test.sh
    paulsintsmacair:hadoop-cluster-docker ponks$ ./test.sh
    Password:
    NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
    default - virtualbox Error Unknown machine does not exist

    Replies
    1. Hey Paul, you are doing Swarm ... and I have a feeling this post does not consider a multi-host deployment at all!

  4. Hi, impressive stuff. Question: is there a web-based management interface for this cluster? How can I reach it from the host machine?
  5. Warning: Permanently added 'slave2.kiwenlau.com,172.17.0.67' (ECDSA) to the list of known hosts.
    Welcome to Ubuntu 15.04 (GNU/Linux 3.13.0-53-generic x86_64)
    * Documentation: https://help.ubuntu.com/
    The programs included with the Ubuntu system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
    applicable law.
    How do I disable this banner message? Any idea?

  6. Great job! I will try to use it as a sandbox for my project before deploying.
  7. I can't execute ./start-container.sh: when I changed to CentOS 6.7 I get "permission denied". Any idea?
  8. Your project gives the impression that you can deploy a dockerized version of Hadoop on three separate nodes (physical servers). Have you played around with this? There are plenty of solutions out there to deploy Hadoop using Docker, but none of them address the need to deploy namenodes and datanodes on physically separate servers.
  9. Hi, thanks for this excellent post!

    I was wondering if I could run Hadoop in a single container. Is it necessary to run Hadoop in clustered form?
  10. Hi, excellent post!

    One question: can I add a new node that is on another Docker host/server?
  11. Hi, it's very informative.
    hadoop-cluster-docker venky$ docker network create hadoop
  12. Hi, it's a nice post.

    I tried to create the image with the instructions given, with a small modification: I installed oracle-java8 and hadoop-2.7.3. But when I start the services I get the following error:
    root@hadoop-master:~# ./start-hadoop.sh


    Starting namenodes on [hadoop-master]
    hadoop-master: Warning: Permanently added 'hadoop-master,172.18.0.2' (ECDSA) to the list of known hosts.
    hadoop-master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-hadoop-master.out
    : Name or service not knownstname hadoop-slave3
    : Name or service not knownstname hadoop-slave1
    : Name or service not knownstname hadoop-slave2
    : Name or service not knownstname hadoop-slave4
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
    0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-hadoop-master.out


    starting yarn daemons
    starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-hadoop-master.out
    : Name or service not knownstname hadoop-slave4
    : Name or service not knownstname hadoop-slave3
    : Name or service not knownstname hadoop-slave1
    : Name or service not knownstname hadoop-slave2


    Can you please help me out here?

    Thanks

  13. This is one of the best Hadoop dockerization articles I've seen so far; it covers the installation/configuration part. If there were a prototype use case, that would be perfect :)
  14. Hi,
    It's a very nice blog.
    Thank you for giving valuable information on Hadoop.
    I'm expecting much more from you...