====== CentOS HortonWorks ======
* [[CentOS]]
* [[Hadoop]]
In this guide, CentOS 6.6 is used, coupled with HortonWorks Data Platform (HDP) 2.1.
** Download the minimal ISO **
The netinstall ISO is an option, but the size difference between it and the minimal ISO is negligible, and the minimal one also installs some basic system packages, so I prefer it.
wget http://archive.kernel.org/centos-vault/6.6/isos/x86_64/CentOS-6.6-x86_64-minimal.iso
** Boot the ISO in text mode **
To make life simpler or if using a headless server, boot in text mode.
At the boot menu, hit Tab, and add ''text'' to the kernel options.
Alternatively, you can do a [[CentOS Kickstart|kickstart]] install.
** Disk partitioning and filesystems **
Hadoop comes with some recommendations for setting up the filesystem:
* Don't use LVM to manage partitions
* Either do not create a swap partition, or set ''vm.swappiness'' to 0 in ''sysctl.conf''
* Set the ''noatime'' flag on the data partitions
* Use ext3 or ext4 as the filesystem type
* Disable the root reserved block count (see the example after this list)
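For example, a data-disk entry in ''/etc/fstab'' with ''noatime'' set might look like this (the device and mount point here are hypothetical):
/dev/sdb1 /grid/0 ext4 defaults,noatime 0 0
And ''tune2fs'' clears the root reserved block count:
tune2fs -m 0 /dev/sdb1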
Using the text installer, the partitions are set up automatically: it auto-formats, creates a swap partition and a separate boot partition, and puts the ext4 root filesystem on LVM. You only get the option to partition the drives yourself through the GUI install.
Set ''vm.swappiness'' to 0 in ''/etc/sysctl.conf'', and apply it to the running system. The kernel will then only use swap to avoid running out of memory.
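For example, assuming the key isn't already in the file:
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p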
** DHCP request **
If you didn't do the netinstall, your server might not get a DHCP address when booting up the first time. Get a DHCP lease on the existing install, assuming your network device is ''eth0'':
dhclient eth0
** Install packages **
Using yum, install some basic packages:
yum -y install man wget vim ntp ntpdate chkconfig ntsysv acpid screen sudo bind-utils nano rsync
Start services:
/etc/init.d/ntpd start
/etc/init.d/ntpdate start
/etc/init.d/acpid start
** DHCP client on boot **
Edit ''/etc/sysconfig/network-scripts/ifcfg-eth0'' so the interface comes up on boot:
ONBOOT=yes
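A minimal DHCP-based ''ifcfg-eth0'' might look like this (a sketch; keep whatever other settings your install wrote):
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes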
** Disable iptables **
Unless needed, disable iptables per HortonWorks' recommendation:
chkconfig iptables off
chkconfig ip6tables off
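''chkconfig'' only affects the next boot; to stop the firewall on the running system as well:
service iptables stop
service ip6tables stop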
** NTP **
It's best to have every Hadoop node in sync with an NTP server so the clocks don't drift between servers.
chkconfig ntpd on
chkconfig ntpdate on
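To force a one-off sync right away, stop ''ntpd'' first since ''ntpdate'' won't run while ''ntpd'' holds the NTP port (''pool.ntp.org'' below is just a placeholder for whatever NTP server you use):
service ntpd stop
ntpdate pool.ntp.org
service ntpd start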
** Max open files and processes **
Set the ulimit values for all users on the system. Hadoop needs this since it opens a lot of files and creates a lot of processes; the usual default of 1024 open files will hurt performance.
In ''/etc/security/limits.conf'':
* - nofile 32768
* - nproc 65536
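These limits take effect on the next login; you can verify the soft limits for the current session with:
ulimit -Sn
ulimit -Su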
** Hostnames **
Again, to improve performance for Hadoop, add entries for the nodes directly in ''/etc/hosts''. This saves DNS lookups between the servers.
192.168.12.1 hadoop-node1
192.168.12.2 hadoop-node2
192.168.12.3 hadoop-node3
Also make sure there are entries for localhost and the server you are running on:
127.0.0.1 localhost
192.168.12.1 hadoop-node1
Set your server's hostname:
hostname hadoop-node1
Set the hostname on boot for CentOS. Add this to ''/etc/sysconfig/network'':
HOSTNAME=hadoop-node1
Hadoop also recommends disabling IPv6:
NETWORKING_IPV6=no
** Setup SSH pubkeys **
For each server, generate an SSH key pair without a passphrase for root. Ambari will use it to communicate with the other servers and install packages.
ssh-keygen
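Or non-interactively, generating the default root key with an empty passphrase:
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa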
** SELinux **
Depending on your install, SELinux may or may not be enabled.
Disable it in the running instance:
setenforce 0
And also disable it when booting in ''/etc/selinux/config'':
SELINUX=disabled
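If the config still has the default line, a quick ''sed'' will flip it (assuming ''SELINUX=enforcing'' is what's currently set):
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config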
Note that if you only disable it in the running state with ''setenforce'' (which actually switches it to permissive mode), then install Ambari and run ''ambari-server setup'', it will think SELinux is still enabled. It's best to reboot after everything else is complete.
** Disable transparent hugepages **
HortonWorks recommends disabling this memory setting since it can degrade Hadoop's performance.
Disable it in the running system, and also add the command to ''/etc/rc.local'' so the setting is preserved on boot.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
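To preserve it across reboots, append the same command to ''/etc/rc.local'':
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.local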
** Primary node pubkeys **
The primary node that has Ambari installed will need its pubkey installed on all the nodes **including itself**.
ssh-copy-id hadoop-node1
ssh-copy-id hadoop-node2
ssh-copy-id hadoop-node3
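You can verify that passwordless login works from the primary node:
ssh hadoop-node2 hostname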
Once everything above is done on all the nodes, you're ready to install Ambari and use it to deploy a Hadoop cluster.
** Ambari **
Install the Ambari repo, which we'll use to set up the cluster. Ambari only runs on one server (for example, hadoop-node1). We'll use it to install HDP.
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo
Install the package through yum:
yum -y install ambari-server
Finally, run through the ambari server setup. It will pull in necessary packages itself. Using the defaults is fine.
ambari-server setup
Start up the Ambari server:
/etc/init.d/ambari-server start
Then access your Ambari instance on port 8080 at http://hadoop-node1:8080/. The default username and password set by Ambari are ''admin'' and ''admin''.
** Ambari host checks **
When Ambari sets up the new nodes, it runs host checks on each of them to look for problems.
You can run the cleanup script manually from the primary node to resolve reported issues:
python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users