CentOS HortonWorks

In this guide, CentOS 6.6 is used, coupled with HortonWorks Data Platform (HDP) 2.1.

Download the minimal ISO

The netinstall ISO is an option, but since the size difference between that and the minimal is negligible, I prefer the minimal one. In addition, the minimal will install some basic system packages.

wget http://archive.kernel.org/centos-vault/6.6/isos/x86_64/CentOS-6.6-x86_64-minimal.iso

Boot the ISO in text mode

To make life simpler, or if you're using a headless server, boot in text mode.

At the boot menu, hit Tab, and add the word text to the kernel options.

Alternatively, you can do a kickstart install.

Disk partitioning and filesystems

Hadoop comes with some recommendations for setting up the filesystem:

  • Don't use LVM to manage partitions
  • Either do not install a swap partition, or set vm.swappiness to 0 in sysctl.conf
  • Mount the data partitions with the noatime flag (see the sketch after this list)
  • Use ext3 or ext4 as the filesystem type
  • Disable the root-reserved block count (also in the sketch below)
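
Two of these are easy to apply by hand. A minimal sketch, assuming a hypothetical data partition /dev/sdb1 mounted at /grid/0:

# /etc/fstab entry with the noatime flag set on the data partition
/dev/sdb1  /grid/0  ext4  defaults,noatime  0 0

# reclaim the blocks reserved for root on the data partition
tune2fs -m 0 /dev/sdb1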

Using the text installer, your partitions are set up automatically: it creates a swap partition, a separate partition for the boot loader, and an LVM-managed ext4 root filesystem. You only get the option to partition the drives yourself through the GUI install, so a text install won't follow the recommendations above out of the box.

Set vm.swappiness to 0 in /etc/sysctl.conf, and apply it to the running system. This lets the kernel use swap only when something would otherwise OOM.
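
A minimal way to do both, assuming vm.swappiness isn't already present in the file:

# persist the setting for future boots
echo "vm.swappiness = 0" >> /etc/sysctl.conf
# apply it to the running kernel
sysctl -w vm.swappiness=0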

DHCP request

If you didn't do the netinstall, your server might not get a DHCP address when it boots up the first time. Get a DHCP address for your existing install, assuming your network device is eth0:

dhclient eth0

Install packages

Using yum, install some basic packages:

yum -y install man wget vim ntp ntpdate chkconfig ntsysv acpid screen sudo bind-utils nano rsync

Start services:

/etc/init.d/ntpd start
/etc/init.d/ntpdate start
/etc/init.d/acpid start

DHCP client on boot

Edit /etc/sysconfig/network-scripts/ifcfg-eth0 so the interface is brought up on boot:

ONBOOT=yes
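
If the file doesn't already request DHCP (the minimal install usually writes this line for you), set that as well:

BOOTPROTO=dhcp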

Disable iptables

Unless you need it, disable iptables, per HortonWorks' recommendation:

chkconfig iptables off
chkconfig ip6tables off
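
chkconfig only affects the next boot; to stop the running firewall immediately as well:

service iptables stop
service ip6tables stop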

NTP

It's best to keep each Hadoop node in sync with an NTP server so that there is no clock drift between servers.

chkconfig ntpd on
chkconfig ntpdate on

Max open files and processes

Set the ulimit values for all users on the system. Hadoop needs this since it opens a lot of files and creates a lot of processes; with the stock default of 1024, it can quickly hit "too many open files" errors and degraded performance.

In /etc/security/limits.conf:

* - nofile 32768
* - nproc 65536
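
Note that on CentOS 6, /etc/security/limits.d/90-nproc.conf can override the nproc value, so check there too. After logging in again, verify the new soft limits:

ulimit -Sn
ulimit -Su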

Hostnames

Again, to improve performance for Hadoop, add entries for the nodes directly in /etc/hosts. This saves DNS lookups between the servers.

192.168.12.1 hadoop-node1
192.168.12.2 hadoop-node2
192.168.12.3 hadoop-node3

You would also add an entry for the server you are running on:

127.0.0.1 localhost
192.168.12.1 hadoop-node1

Set your server's hostname:

hostname hadoop-node1

Set the hostname on boot for CentOS. Add this to /etc/sysconfig/network:

HOSTNAME=hadoop-node1

Hadoop also recommends disabling IPv6:

NETWORKING_IPV6=no
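
If you want IPv6 off at the kernel level too, one common approach on CentOS 6 (an addition beyond what the network file covers) is via sysctl:

echo "net.ipv6.conf.all.disable_ipv6 = 1" >> /etc/sysctl.conf
echo "net.ipv6.conf.default.disable_ipv6 = 1" >> /etc/sysctl.conf
sysctl -p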

Set up SSH pubkeys

For each server, set up an SSH public key without a passphrase for root. Ambari will use it to communicate with the other servers and install packages.

ssh-keygen
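
To generate the key non-interactively with an empty passphrase, one option is:

ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa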

SELinux

Depending on your install, SELinux may or may not be enabled.

Disable it in the running instance:

setenforce 0

And also disable it when booting in /etc/selinux/config:

SELINUX=disabled
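
A one-liner that makes the same change, assuming the stock config file layout:

sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config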

Note that if you disable it only in the running system, then install Ambari and run ambari-server setup, Ambari may still think SELinux is enabled. It's best to reboot after everything else is complete.

Disable transparent hugepages

HortonWorks recommends disabling transparent hugepages, since its defragmentation activity can cause performance problems for Hadoop.

Disable it in the running system, and also add the command to /etc/rc.local so it's preserved on boot.

echo never > /sys/kernel/mm/transparent_hugepage/enabled
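
A sketch of the /etc/rc.local addition, guarded in case the kernel doesn't expose the setting:

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi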

Primary node pubkeys

The primary node that has Ambari installed will need its pubkey installed on all the nodes, including itself.

ssh-copy-id hadoop-node1
ssh-copy-id hadoop-node2
ssh-copy-id hadoop-node3
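
To verify, each of these should print the remote hostname without prompting for a password:

ssh hadoop-node1 hostname
ssh hadoop-node2 hostname
ssh hadoop-node3 hostname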

Once everything above is done on all the nodes, you're ready to install Ambari and use it to deploy a Hadoop cluster.

Ambari

Install the Ambari repo, which we'll use to set up the cluster. Ambari runs on only one server (for example, hadoop-node1), and from there we'll install HDP.

cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo

Install the package through yum:

yum -y install ambari-server

Finally, run through the Ambari server setup. It will pull in the necessary packages itself. Using the defaults is fine.

ambari-server setup

Start up the Ambari server:

/etc/init.d/ambari-server start

Then access your Ambari instance on port 8080 at http://hadoop-node1:8080/. The default username and password set by Ambari are admin and admin.

Ambari host checks

When Ambari sets up the new nodes, it will run checks across all of them to look for problems that could affect the services.

You can run the check manually from the primary node:

python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users