CentOS HortonWorks
In this guide, CentOS 6.6 is used, coupled with HortonWorks Data Platform (HDP) 2.1.
Download the minimal ISO
The netinstall ISO is an option, but since the size difference between it and the minimal ISO is negligible, I prefer the minimal one. The minimal ISO also installs some basic system packages.
wget http://archive.kernel.org/centos-vault/6.6/isos/x86_64/CentOS-6.6-x86_64-minimal.iso
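If you want to verify the download, the vault directory also publishes checksums (this assumes it keeps the standard sha256sum.txt alongside the ISOs):
wget http://archive.kernel.org/centos-vault/6.6/isos/x86_64/sha256sum.txt
grep minimal sha256sum.txt | sha256sum -c -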
Boot the ISO in text mode
To make life simpler, or if you're using a headless server, boot in text mode.
At the boot menu, hit Tab and add text to the kernel options.
Alternatively, you can do a kickstart install.
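For the unattended route, a minimal CentOS 6 kickstart might look like this sketch; the password is a placeholder, and note that autopart uses LVM, which the next section recommends against:
install
text
lang en_US.UTF-8
keyboard us
network --bootproto=dhcp --device=eth0
# Placeholder password; replace before use
rootpw changeme
timezone --utc UTC
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot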
Disk partitioning and filesystems
Hadoop comes with some recommendations for setting up the filesystem:
- Don't use LVM to manage partitions
- Either don't create a swap partition, or set vm.swappiness to 0 in sysctl.conf
- Set the noatime flag for the partitions
- Use ext3 or ext4 as the filesystem type
- Disable the root reserved blocks (see the example after this list)
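For illustration, here's how the noatime and reserved-blocks recommendations might be applied to a data partition (/dev/sdb1 and the mount point are assumptions, not from the original):
# /etc/fstab entry mounting the data partition with noatime
/dev/sdb1  /grid/0  ext4  defaults,noatime  0 0
# Reclaim the 5% of blocks reserved for root
tune2fs -m 0 /dev/sdb1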
Using the text installer, your partitions are set up automatically: it creates a swap partition and a separate one for the boot loader. You only get the option to partition the drives yourself through the GUI install, so a text install will auto-format the disk, use LVM, and create an ext4 filesystem for root.
Set vm.swappiness to 0 in /etc/sysctl.conf, and apply it to the running system. This makes the kernel use swap only when something is about to run out of memory.
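For example:
# Persist the setting across reboots
echo "vm.swappiness = 0" >> /etc/sysctl.conf
# Apply it to the running kernel
sysctl -w vm.swappiness=0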
DHCP request
If you didn't do the netinstall, then your server might not get a DHCP address when booting up the first time. First, get a DHCP address for your existing install, assuming your network device is eth0:
dhclient eth0
Install packages
Using yum, install some basic packages:
yum -y install man wget vim ntp ntpdate chkconfig ntsysv acpid screen sudo bind-utils nano rsync
Start services:
/etc/init.d/ntpd start
/etc/init.d/ntpdate start
/etc/init.d/acpid start
DHCP client on boot
Edit /etc/sysconfig/network-scripts/ifcfg-eth0 so the interface comes up on boot:
ONBOOT=yes
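A minimal DHCP configuration for the interface might look like this (HWADDR and UUID omitted; they vary per machine):
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes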
Disable iptables
Unless needed, disable iptables per HortonWorks' recommendation:
chkconfig iptables off
chkconfig ip6tables off
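chkconfig only affects the next boot, so also stop the running firewall:
service iptables stop
service ip6tables stop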
NTP
It's best to keep every Hadoop node in sync with an NTP server so that there is no clock drift between the servers.
chkconfig ntpd on
chkconfig ntpdate on
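Once ntpd has been running for a few minutes, you can confirm it's actually syncing with its upstream servers:
ntpq -p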
Max open files and processes
Set the ulimit values for all users on the system. Hadoop needs this since it opens a lot of files and spawns a lot of processes; the general default of 1024 open files will hurt performance.
In /etc/security/limits.conf
:
*  -  nofile  32768
*  -  nproc   65536
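After logging back in, you can verify that the limits took effect:
ulimit -n    # max open files; should print 32768
ulimit -u    # max user processes; should print 65536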
Hostnames
Again, to improve performance for Hadoop, set entries for the nodes directly in /etc/hosts. This saves DNS lookups between the servers.
192.168.12.1 hadoop-node1
192.168.12.2 hadoop-node2
192.168.12.3 hadoop-node3
Also keep the localhost entry, along with the entry for the server you are running on:
127.0.0.1 localhost
192.168.12.1 hadoop-node1
Set your server's hostname:
hostname hadoop-node1
Set the hostname on boot for CentOS by adding this to /etc/sysconfig/network:
HOSTNAME=hadoop-node1
Hadoop also recommends disabling IPv6:
NETWORKING_IPV6=no
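Putting it together, /etc/sysconfig/network would look something like this:
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=hadoop-node1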
Set up SSH pubkeys
For each server, set up an SSH public key without a passphrase for root. Ambari will use it to communicate with the other servers and install packages.
ssh-keygen
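To generate the key non-interactively with an empty passphrase:
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa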
SELinux
Depending on your install, SELinux may or may not be enabled.
Disable it in the running instance:
setenforce 0
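getenforce should now report Permissive:
getenforce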
And also disable it on boot in /etc/selinux/config:
SELINUX=disabled
Note that if you disable it only in the running state, then install Ambari and run ambari-server setup, it will think that SELinux is still enabled. It's best to reboot after everything else is complete.
Disable transparent hugepages
HortonWorks recommends disabling transparent hugepages since they can cause performance problems on Hadoop clusters.
Disable it in the running system, and also add to /etc/rc.local
so it's preserved on boot.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
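On some CentOS 6 kernels the sysfs path is /sys/kernel/mm/redhat_transparent_hugepage instead; adjust accordingly. Some guides also disable the companion defrag setting, an extra step beyond the original recommendation:
echo never > /sys/kernel/mm/transparent_hugepage/defrag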
Primary node pubkeys
The primary node that has Ambari installed will need its pubkey installed on all the nodes, including itself.
ssh-copy-id hadoop-node1
ssh-copy-id hadoop-node2
ssh-copy-id hadoop-node3
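Afterwards, confirm passwordless login works from the primary node:
for h in hadoop-node1 hadoop-node2 hadoop-node3; do ssh $h hostname; done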
Once everything above is done on all the nodes, you're ready to install Ambari and use it to deploy a Hadoop cluster.
Ambari
Install the Ambari repo, which we'll use to set up the cluster. Ambari only runs on one server (for example, hadoop-node1). We'll use it to install HDP.
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo
Install the package through yum:
yum -y install ambari-server
Finally, run through the Ambari server setup. It will pull in the necessary packages itself. Using the defaults is fine.
ambari-server setup
Start up the Ambari server:
/etc/init.d/ambari-server start
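Assuming the stock init script shipped with the package, you can also have Ambari start on boot:
chkconfig ambari-server on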
Then access your Ambari instance on port 8080 at http://hadoop-node1:8080/. The default username and password set by Ambari are admin and admin.
Ambari host checks
When Ambari sets up the new nodes, it runs host checks against all of them to look for potential problems.
You can run the check manually from the primary node:
python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users