Saturday, May 30, 2015

Steps to compile 64-bit Hadoop 2.3.0 under Ubuntu 14.04


All rights reserved. Please keep the author name (KiwenLau) and the original blog link:

http://kiwenlau.blogspot.com/2015/05/steps-to-compile-hadoop-230-under.html


The hadoop-2.3.0.tar.gz provided on the official Hadoop website is compiled for 32-bit machines, so you will run into trouble if you use it on a 64-bit machine, for example:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

So you have to compile the Hadoop source code yourself. I compiled hadoop-2.3.0 under 64-bit Ubuntu 14.04 and ran the wordcount program successfully; you can download the result here: http://1drv.ms/1HZ1TSV


1. update package lists
apt-get update

2. install dependencies
apt-get install -y openjdk-7-jdk libprotobuf-dev protobuf-compiler maven cmake build-essential pkg-config libssl-dev zlib1g-dev llvm-gcc automake autoconf make
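
Before compiling, it is worth sanity-checking the toolchain; Hadoop 2.x needs protobuf 2.5.x, which is the version Ubuntu 14.04 ships:
java -version
mvn -version
protoc --version
The last command should print "libprotoc 2.5.0".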

3. download hadoop source file
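For example, you can fetch the source tarball from the Apache archive (any Apache mirror also works):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.3.0/hadoop-2.3.0-src.tar.gz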

4. extract hadoop source file
tar -xzvf hadoop-2.3.0-src.tar.gz

5. enter hadoop directory
cd hadoop-2.3.0-src

6. compile hadoop 2.3.0
mvn package -Pdist,native -DskipTests -Dtar

output on success

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ................................ SUCCESS [1:11.968s]
[INFO] Apache Hadoop Project POM ......................... SUCCESS [30.393s]
[INFO] Apache Hadoop Annotations ......................... SUCCESS [18.398s]
[INFO] Apache Hadoop Assemblies .......................... SUCCESS [0.246s]
[INFO] Apache Hadoop Project Dist POM .................... SUCCESS [20.372s]
[INFO] Apache Hadoop Maven Plugins ....................... SUCCESS [23.721s]
[INFO] Apache Hadoop MiniKDC ............................. SUCCESS [1:41.836s]
[INFO] Apache Hadoop Auth ................................ SUCCESS [22.303s]
[INFO] Apache Hadoop Auth Examples ....................... SUCCESS [7.052s]
[INFO] Apache Hadoop Common .............................. SUCCESS [2:29.466s]
[INFO] Apache Hadoop NFS ................................. SUCCESS [11.604s]
[INFO] Apache Hadoop Common Project ...................... SUCCESS [0.073s]
[INFO] Apache Hadoop HDFS ................................ SUCCESS [1:30.230s]
[INFO] Apache Hadoop HttpFS .............................. SUCCESS [17.976s]
[INFO] Apache Hadoop HDFS BookKeeper Journal ............. SUCCESS [19.927s]
[INFO] Apache Hadoop HDFS-NFS ............................ SUCCESS [3.304s]
[INFO] Apache Hadoop HDFS Project ........................ SUCCESS [0.032s]
[INFO] hadoop-yarn ....................................... SUCCESS [0.033s]
[INFO] hadoop-yarn-api ................................... SUCCESS [36.284s]
[INFO] hadoop-yarn-common ................................ SUCCESS [33.912s]
[INFO] hadoop-yarn-server ................................ SUCCESS [0.213s]
[INFO] hadoop-yarn-server-common ......................... SUCCESS [8.193s]
[INFO] hadoop-yarn-server-nodemanager .................... SUCCESS [41.181s]
[INFO] hadoop-yarn-server-web-proxy ...................... SUCCESS [2.768s]
[INFO] hadoop-yarn-server-resourcemanager ................ SUCCESS [13.923s]
[INFO] hadoop-yarn-server-tests .......................... SUCCESS [0.904s]
[INFO] hadoop-yarn-client ................................ SUCCESS [4.363s]
[INFO] hadoop-yarn-applications .......................... SUCCESS [0.120s]
[INFO] hadoop-yarn-applications-distributedshell ......... SUCCESS [2.262s]
[INFO] hadoop-yarn-applications-unmanaged-am-launcher .... SUCCESS [1.615s]
[INFO] hadoop-yarn-site .................................. SUCCESS [0.086s]
[INFO] hadoop-yarn-project ............................... SUCCESS [2.703s]
[INFO] hadoop-mapreduce-client ........................... SUCCESS [0.132s]
[INFO] hadoop-mapreduce-client-core ...................... SUCCESS [18.951s]
[INFO] hadoop-mapreduce-client-common .................... SUCCESS [14.320s]
[INFO] hadoop-mapreduce-client-shuffle ................... SUCCESS [3.330s]
[INFO] hadoop-mapreduce-client-app ....................... SUCCESS [9.664s]
[INFO] hadoop-mapreduce-client-hs ........................ SUCCESS [7.678s]
[INFO] hadoop-mapreduce-client-jobclient ................. SUCCESS [9.263s]
[INFO] hadoop-mapreduce-client-hs-plugins ................ SUCCESS [1.549s]
[INFO] Apache Hadoop MapReduce Examples .................. SUCCESS [5.748s]
[INFO] hadoop-mapreduce .................................. SUCCESS [2.880s]
[INFO] Apache Hadoop MapReduce Streaming ................. SUCCESS [7.080s]
[INFO] Apache Hadoop Distributed Copy .................... SUCCESS [14.648s]
[INFO] Apache Hadoop Archives ............................ SUCCESS [2.602s]
[INFO] Apache Hadoop Rumen ............................... SUCCESS [5.706s]
[INFO] Apache Hadoop Gridmix ............................. SUCCESS [3.649s]
[INFO] Apache Hadoop Data Join ........................... SUCCESS [2.483s]
[INFO] Apache Hadoop Extras .............................. SUCCESS [2.678s]
[INFO] Apache Hadoop Pipes ............................... SUCCESS [6.359s]
[INFO] Apache Hadoop OpenStack support ................... SUCCESS [5.088s]
[INFO] Apache Hadoop Client .............................. SUCCESS [4.534s]
[INFO] Apache Hadoop Mini-Cluster ........................ SUCCESS [0.433s]
[INFO] Apache Hadoop Scheduler Load Simulator ............ SUCCESS [7.757s]
[INFO] Apache Hadoop Tools Dist .......................... SUCCESS [4.099s]
[INFO] Apache Hadoop Tools ............................... SUCCESS [0.428s]
[INFO] Apache Hadoop Distribution ........................ SUCCESS [18.045s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14:59.240s
[INFO] Finished at: Thu Jan 15 18:51:59 JST 2015
[INFO] Final Memory: 168M/435M
[INFO] ------------------------------------------------------------------------


You can find the compiled hadoop-2.3.0 tarball at:
hadoop-2.3.0-src/hadoop-dist/target/hadoop-2.3.0.tar.gz
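
To confirm the native library is really 64-bit, you can check it with the file utility:
file hadoop-2.3.0-src/hadoop-dist/target/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0
The output should mention "ELF 64-bit LSB shared object, x86-64".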


The method is the same if you want to compile hadoop-2.5.2, hadoop-2.6.0, or hadoop-2.7.0.

Friday, May 29, 2015

Quickly build arbitrary size Hadoop Cluster based on Docker

Please check the updated blog!!!


You can go directly to section 3 and build a 3-node Hadoop cluster by following the directions.
1. Project Introduction
2. Hadoop-Cluster-Docker image Introduction
3. Steps to build a 3-node Hadoop Cluster
4. Steps to build an arbitrary-size Hadoop Cluster

1. Project Introduction

Building a Hadoop cluster with physical machines is very painful, especially for beginners, who may be frustrated long before they run their first wordcount.
My objective is to run a Hadoop cluster on Docker and help Hadoop developers quickly build an arbitrary-size Hadoop cluster on their local host. This idea already has several implementations, but in my view they are not good enough: their images are too large, they are slow to build, or they are not user friendly because they depend on third-party tools. The following table shows some problems of existing Hadoop-on-Docker projects.
Project                              Image Size      Problem
sequenceiq/hadoop-docker:latest      1.491GB         too large, only one node
sequenceiq/hadoop-docker:2.7.0       1.76GB
sequenceiq/hadoop-docker:2.6.0       1.624GB

sequenceiq/ambari:latest             1.782GB         too large, too slow, uses a third-party tool
sequenceiq/ambari:2.0.0              4.804GB
sequenceiq/ambari:1.7.0              4.761GB

alvinhenrick/hadoop-mutinode         4.331GB         too large, too slow to build, not easy to add nodes, has some bugs
My project is based on the "alvinhenrick/hadoop-mutinode" project, but I have reconstructed it for optimization. The GitHub address and blog address of the "alvinhenrick/hadoop-mutinode" project are here: GitHub, Blog
The following table shows the differences between my project "kiwenlau/hadoop-cluster-docker" and the "alvinhenrick/hadoop-mutinode" project.
alvinhenrick/hadoop-mutinode images:
Image Name                    Build time      Layer number     Image Size
alvinhenrick/serf             258.213s        21               239.4MB
alvinhenrick/hadoop-base      2236.055s       58               4.328GB
alvinhenrick/hadoop-dn        51.959s         74               4.331GB
alvinhenrick/hadoop-nn-dn     49.548s         84               4.331GB

kiwenlau/hadoop-cluster-docker images:
Image Name                    Build time      Layer number     Image Size
kiwenlau/serf-dnsmasq         509.46s         8                206.6MB
kiwenlau/hadoop-base          400.29s         7                775.4MB
kiwenlau/hadoop-master        5.41s           9                775.4MB
kiwenlau/hadoop-slave         2.41s           8                775.4MB
In summary, I made the following optimizations:
  • Smaller image size
  • Faster build time
  • Fewer image layers
Change node number quickly and conveniently
For "alvinhenrick/hadoop-mutinode" project, If you want to change node number, you have to change hadoop configuration file (slaves, which list the domain name or ip address of all nodes ), rebuild hadoop-nn-dn image, change the shell sript for starting containers! As for my "kiwenlau/hadoop-cluster-docker" project, I write a shell script (resize-cluster.sh) to automate these steps. Then you can rebuild the hadoop-master image within one minutes and run an arbitrary size Hadoop Cluster quickly! The default node number of my project is 3 and you can change is to any size you like! In addition, building image, running container, starting Hadoop and run wordcount, all these jobs are automated by shell scripts. So you can use and develop this project more easily! Welcome to join this project
Development environment
  • OS: Ubuntu 14.04 and Ubuntu 12.04
  • Kernel: 3.13.0-32-generic
  • Docker: 1.5.0 and 1.6.2
Attention: an old kernel version or too little memory will cause failures while running this project.

2. Hadoop-Cluster-Docker image Introduction

I developed 4 Docker images for this project:
  • serf-dnsmasq
  • hadoop-base
  • hadoop-master
  • hadoop-slave
serf-dnsmasq
  • based on ubuntu:15.04: it is the smallest Ubuntu image
  • install serf: serf is a distributed cluster-membership management tool that can discover all nodes of the Hadoop cluster
  • install dnsmasq: dnsmasq is a lightweight DNS server that provides domain name resolution for the Hadoop cluster
When the containers start, the IP address of the master node is passed to all slave nodes, and serf starts with the containers. The serf agents on the slave nodes recognize the master node because they know its IP address, and the serf agent on the master node in turn recognizes all slave nodes. Serf agents on all nodes communicate with each other, so after a while everyone knows everyone. Whenever a serf agent discovers a new node, it reconfigures dnsmasq and restarts it, so eventually dnsmasq can resolve the domain names of all nodes in the Hadoop cluster. However, this setup work for serf and dnsmasq takes more time as the node number grows, so when you run more nodes you should verify that the serf agents have found all nodes, and that dnsmasq can resolve them, before starting Hadoop. Using serf and dnsmasq to solve the FQDN problem was proposed by SequenceIQ, a startup focusing on running Hadoop on Docker. You can read this slide for more details.
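To make this concrete, here is a minimal sketch of how a serf event handler can feed dnsmasq. This is only an illustration of the idea, not the project's actual script; the hosts file path and the dnsmasq setup are assumptions:
#!/bin/bash
# hypothetical serf member-join handler: serf passes each joining
# member on stdin as "name address role tags"
while read -r name addr role tags; do
    # record the new node in an extra hosts file that dnsmasq reads
    # (assumes dnsmasq is started with addn-hosts=/etc/hosts.serf)
    echo "$addr $name" >> /etc/hosts.serf
done
# restart dnsmasq so it serves the updated records
service dnsmasq restart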
hadoop-base
  • based on serf-dnsmasq
  • install JDK (OpenJDK)
  • install openssh-server and configure password-free SSH
  • install vim: happy coding inside the Docker container :)
  • install Hadoop 2.3.0: the compiled Hadoop from my other post (the compiled 2.5.2, 2.6.0 and 2.7.0 are bigger than 2.3.0)
You can check my blog for compiling hadoop: Steps to compile 64-bit Hadoop 2.3.0 under Ubuntu 14.04
If you want to rebuild the hadoop-base image, you need to download the compiled Hadoop and put it inside the hadoop-cluster-docker/hadoop-base/files directory. The compiled Hadoop can be downloaded here: hadoop-2.3.0
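As a rough sketch (the files directory comes from the repo layout mentioned above; the image tag is the one used in this post), rebuilding would look like:
cd hadoop-cluster-docker
cp ~/hadoop-2.3.0.tar.gz hadoop-base/files/
sudo docker build -t kiwenlau/hadoop-base:0.1.0 hadoop-base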

If you want to try another version of Hadoop, you can download these compiled Hadoop tarballs.
hadoop-master
  • based on hadoop-base
  • configure hadoop master
  • format the namenode
We need to configure the slaves file during this step, and the slaves file must list the domain names or IP addresses of all nodes. Thus, when we change the node number of the Hadoop cluster, the slaves file has to change too; that's why we must edit the slaves file and rebuild the hadoop-master image whenever we want to change the node number. I wrote a shell script named resize-cluster.sh that rebuilds the hadoop-master image automatically to support an arbitrary-size Hadoop cluster: you only need to give the node number as its parameter. Building the hadoop-master image costs only about 1 minute since it only does some configuration jobs.
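Here is a minimal sketch of what a resize-cluster.sh-style script could do, assuming the slaves file lives at hadoop-master/files/slaves (see the GitHub repo for the real script):
#!/bin/bash
# regenerate the slaves file for N nodes (1 master + N-1 slaves),
# then rebuild the hadoop-master image with the new configuration
N=${1:?usage: resize-cluster.sh <node-number>}
> hadoop-master/files/slaves
for i in $(seq 1 $((N - 1))); do
    echo "slave$i.kiwenlau.com" >> hadoop-master/files/slaves
done
sudo docker build -t kiwenlau/hadoop-master:0.1.0 hadoop-master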
hadoop-slave
  • based on hadoop-base
  • configure hadoop slave node
Image size analysis
The following table shows the output of "sudo docker images":
REPOSITORY                 TAG       IMAGE ID        CREATED          VIRTUAL SIZE
kiwenlau/hadoop-slave      0.1.0     d63869855c03    17 hours ago     777.4 MB
kiwenlau/hadoop-master     0.1.0     7c9d32ede450    17 hours ago     777.4 MB
kiwenlau/hadoop-base       0.1.0     5571bd5de58e    17 hours ago     777.4 MB
kiwenlau/serf-dnsmasq      0.1.0     09ed89c24ee8    17 hours ago     206.7 MB
ubuntu                     15.04     bd94ae587483    3 weeks ago      131.3 MB
Thus:
  • serf-dnsmasq adds 75.4MB on top of ubuntu:15.04
  • hadoop-base adds 570.7MB on top of serf-dnsmasq
  • hadoop-master and hadoop-slave add 0MB on top of hadoop-base
The following table shows partial output of "docker history kiwenlau/hadoop-base:0.1.0":
IMAGE            CREATED             CREATED BY                                          SIZE
2039b9b81146     44 hours ago        /bin/sh -c #(nop) ADD multi:a93c971a49514e787       158.5 MB
cdb620312f30     44 hours ago        /bin/sh -c apt-get install -y openjdk-7-jdk         324.6 MB
da7d10c790c1     44 hours ago        /bin/sh -c apt-get install -y openssh-server        87.58 MB
c65cb568defc     44 hours ago        /bin/sh -c curl -Lso serf.zip https://dl.bint       14.46 MB
3e22b3d72e33     44 hours ago        /bin/sh -c apt-get update && apt-get install        60.89 MB
b68f8c8d2140     3 weeks ago         /bin/sh -c #(nop) ADD file:d90f7467c470bfa9a3       131.3 MB
Thus:
  • base image ubuntu:15.04 is 131.3MB
  • installing openjdk costs 324.6MB
  • installing hadoop costs 158.5MB
  • the total size of ubuntu, openjdk and hadoop is 614.4MB
The following picture shows the image architecture of my project: hadoop-master and hadoop-slave are built on hadoop-base, which is built on serf-dnsmasq, which is based on ubuntu:15.04.

[image: architecture diagram of the four images]
So my Hadoop images are close to the minimal size, and it is hard to optimize them further.

3. Steps to build a 3-node Hadoop cluster

a. pull image
sudo docker pull kiwenlau/hadoop-master:0.1.0
sudo docker pull kiwenlau/hadoop-slave:0.1.0
sudo docker pull kiwenlau/hadoop-base:0.1.0
sudo docker pull kiwenlau/serf-dnsmasq:0.1.0
check downloaded images
sudo docker images
output
REPOSITORY                TAG       IMAGE ID        CREATED         VIRTUAL SIZE
kiwenlau/hadoop-slave     0.1.0     d63869855c03    17 hours ago    777.4 MB
kiwenlau/hadoop-master    0.1.0     7c9d32ede450    17 hours ago    777.4 MB
kiwenlau/hadoop-base      0.1.0     5571bd5de58e    17 hours ago    777.4 MB
kiwenlau/serf-dnsmasq     0.1.0     09ed89c24ee8    17 hours ago    206.7 MB
  • hadoop-base is based on serf-dnsmasq; hadoop-slave and hadoop-master are based on hadoop-base
  • since the images share layers, the total size of all four images is only 777.4MB
b. clone source code
git clone https://github.com/kiwenlau/hadoop-cluster-docker
c. run container
cd hadoop-cluster-docker
./start-container.sh
output
start master container...
start slave1 container...
start slave2 container...
root@master:~#
  • this starts 3 containers: 1 master and 2 slaves
  • you will land in the /root directory of the master container after all containers start
list the files inside the /root directory of the master container
ls
output
hdfs  run-wordcount.sh    serf_log  start-hadoop.sh  start-ssh-serf.sh
  • start-hadoop.sh is the shell script to start hadoop
  • run-wordcount.sh is the shell script to run wordcount program
d. test serf and dnsmasq service
In fact, you can skip this step and just wait for about 1 minute, since serf and dnsmasq need some time to start their services.
list all nodes of hadoop cluster
serf members
output
master.kiwenlau.com  172.17.0.65:7946  alive  
slave1.kiwenlau.com  172.17.0.66:7946  alive  
slave2.kiwenlau.com  172.17.0.67:7946  alive
Wait for a while if any node doesn't show up, since the serf agents need time to recognize all nodes.
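To test dnsmasq directly (assuming ping is available inside the container), try resolving a slave's domain name:
ping -c 1 slave1.kiwenlau.com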
test ssh
ssh slave2.kiwenlau.com
output
Warning: Permanently added 'slave2.kiwenlau.com,172.17.0.67' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 15.04 (GNU/Linux 3.13.0-53-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@slave2:~#
exit the slave2 node
exit
output
logout
Connection to slave2.kiwenlau.com closed.
  • Please wait for a while if SSH fails; dnsmasq needs time to configure the domain name resolution service
  • You can start Hadoop after these tests!

e. start hadoop

./start-hadoop.sh
  • remember to exit the slave2 node first if you are still logged in to it via SSH
f. run wordcount
./run-wordcount.sh
output
input file1.txt:
Hello Hadoop

input file2.txt:
Hello Docker

wordcount output:
Docker    1
Hadoop    1
Hello    2
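For reference, the following is a minimal sketch of the commands a run-wordcount.sh-style script typically runs; the exact paths are assumptions, see the repo for the real script:
# create two small input files and upload them to HDFS
echo "Hello Hadoop" > file1.txt
echo "Hello Docker" > file2.txt
hadoop fs -mkdir -p input
hadoop fs -put file1.txt file2.txt input
# run the stock wordcount example and print the result
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount input output
hadoop fs -cat output/part-r-00000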
4. Steps to build an arbitrary-size Hadoop cluster
a. Preparation
  • check steps a~b of section 3: pull the images and clone the source code
  • you don't have to pull serf-dnsmasq, but you do need hadoop-base, since rebuilding hadoop-master is based on hadoop-base
b. rebuild hadoop-master
./resize-cluster.sh 5
  • It only takes about 1 minute
  • you can use any integer as the parameter for resize-cluster.sh: 1, 2, 3, 4, 5, 6...
c. start container
./start-container.sh 5
  • you can use any integer as the parameter for start-container.sh: 1, 2, 3, 4, 5, 6...
  • you'd better use the same parameter as in step b
d. run the Hadoop cluster
  • check steps d~f of section 3: test serf and dnsmasq, start Hadoop, and run wordcount
  • please test the serf and dnsmasq services before starting Hadoop
All rights reserved. Please keep the author name (KiwenLau) and the original blog link:
http://kiwenlau.blogspot.com/2015/05/quickly-build-arbitrary-size-hadoop.html

Saturday, May 9, 2015

Installing VirtualBox Guest Additions on Ubuntu 14.04.1

Please keep the author name (KiwenLau) and the original link when reposting:

PS: the commands to run are shown below each step

1. Update the package lists
sudo apt-get update

2. Upgrade the installed packages
sudo apt-get -y upgrade

3. Install the required software
sudo apt-get install -y build-essential module-assistant

4. Install the packages needed to build and install the driver modules
sudo m-a prepare

5. Start the Ubuntu VM and insert the Guest Additions CD
Devices > Insert Guest Additions CD image...

6. Mount the CD
sudo mount /dev/cdrom /media/cdrom

7. Install the Guest Additions
sudo /media/cdrom/VBoxLinuxAdditions.run

8. Output of the installer
# sudo /media/cdrom/VBoxLinuxAdditions.run
Verifying archive integrity... All good.
Uncompressing VirtualBox 4.3.26 Guest Additions for Linux............
VirtualBox Guest Additions installer
Copying additional installer modules ...
Installing additional modules ...
Removing existing VirtualBox non-DKMS kernel modules ...done.
Building the VirtualBox Guest Additions kernel modules
The headers for the current running kernel were not found. If the following
module compilation fails then this could be the reason.

Building the main Guest Additions module ...done.
Building the shared folder support module ...done.
Building the OpenGL support module ...done.
Doing non-kernel setup of the Guest Additions ...done.
Starting the VirtualBox Guest Additions ...done.
Installing the Window System drivers
Could not find the X.Org or XFree86 Window System, skipping.

If the output contains "Building the main Guest Additions module ...done" and "Starting the VirtualBox Guest Additions ...done", the installation succeeded.

9. You can verify that the installation succeeded as follows:
# lsmod | grep -io vboxguest
vboxguest
# modinfo vboxguest
filename:       /lib/modules/3.13.0-32-generic/misc/vboxguest.ko
version:        4.3.26
license:        GPL
description:    Oracle VM VirtualBox Guest Additions for Linux Module
author:         Oracle Corporation
srcversion:     3D54FF6D3C1923680BE85CB
alias:          pci:v000080EEd0000CAFEsv00000000sd00000000bc*sc*i*
depends:       
vermagic:       3.13.0-32-generic SMP mod_unload modversions
# lsmod | grep -io vboxguest | xargs modinfo | grep -iw version

version:        4.3.26
