====== CentOS HortonWorks ======

  * [[CentOS]]
  * [[Hadoop]]

In this guide, CentOS 6.6 is used, coupled with HortonWorks Data Platform (HDP) 2.1.

** Download the minimal ISO **

The netinstall ISO is an option, but since the size difference between it and the minimal ISO is negligible, I prefer the minimal one. In addition, the minimal ISO installs some basic system packages.

  wget http://archive.kernel.org/centos-vault/6.6/isos/x86_64/CentOS-6.6-x86_64-minimal.iso

** Boot the ISO in text mode **

To make life simpler, or if you're using a headless server, boot in text mode. At the boot menu, hit Tab and add ''text'' to the kernel options. Alternatively, you can do a [[CentOS Kickstart|kickstart]] install.

** Disk partitioning and filesystems **

Hadoop comes with some recommendations for setting up the filesystems:

  * Don't use LVM to manage partitions
  * Either skip the swap partition, or set ''vm.swappiness'' to 0 in ''sysctl.conf''
  * Set the ''noatime'' flag on the partitions
  * Use ext3 or ext4 as the filesystem type
  * Disable the space reserved for root

The text installer sets up the partitions automatically: it creates a swap partition, a separate partition for the boot loader, and an LVM-managed ext4 filesystem for root. You only get the option to partition the drives yourself through the GUI installer.

Set ''vm.swappiness'' to 0 in ''/etc/sysctl.conf'', and apply it to the running system. This lets the kernel use swap only when something would otherwise OOM.

** DHCP request **

If you didn't do the netinstall, your server might not get a DHCP address when booting up the first time. Get one for your existing install, assuming your network device is ''eth0'':

  dhclient eth0

** Install packages **

Using yum, install some basic packages:

  yum -y install man wget vim ntp ntpdate chkconfig ntsysv acpid screen sudo bind-utils nano rsync

Start the services (on CentOS 6, the NTP daemon's init script is ''ntpd''):

  /etc/init.d/ntpd start
  /etc/init.d/ntpdate start
  /etc/init.d/acpid start

** DHCP client on boot **

Edit ''/etc/sysconfig/network-scripts/ifcfg-eth0'' so the interface comes up on boot:

  ONBOOT=yes

** Disable iptables **

Unless needed, disable iptables per HortonWorks' recommendation:

  chkconfig iptables off
  chkconfig ip6tables off

** NTP **

It's best to keep every Hadoop node in sync with an NTP server so that there is no clock drift between the servers.

  chkconfig ntpd on
  chkconfig ntpdate on

** Max open files and processes **

Raise the ulimit values for all users on the system. Hadoop needs this since it opens a lot of files and creates a lot of processes, and the general default of 1024 will hurt performance. In ''/etc/security/limits.conf'':

  * - nofile 32768
  * - nproc 65536

** Hostnames **

Again, to improve performance for Hadoop, add DNS entries for the nodes directly to ''/etc/hosts''. This saves DNS lookups between the servers.

  192.168.12.1 hadoop-node1
  192.168.12.2 hadoop-node2
  192.168.12.3 hadoop-node3

Also add an entry for the server you are running on:

  127.0.0.1 localhost
  192.168.12.1 hadoop-node1

Set your server's hostname:

  hostname hadoop-node1

To set the hostname on boot, add this to ''/etc/sysconfig/network'':

  HOSTNAME=hadoop-node1

Disabling IPv6 is also recommended; add this to the same file:

  NETWORKING_IPV6=no

** Setup SSH pubkeys **

For each server, set up an SSH public key without a passphrase for root. Ambari will use it to communicate with the other servers and install packages.

  ssh-keygen
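If you'd rather not answer the interactive prompts, ''ssh-keygen'' can also be run non-interactively. A minimal sketch, assuming root's default key path and an empty passphrase:

  # Generate an RSA keypair for root with no passphrase, skipping the prompts
  ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa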
** SELinux **

Depending on your install, SELinux may or may not be enabled. Disable it in the running instance:

  setenforce 0

And also disable it on boot in ''/etc/selinux/config'':

  SELINUX=disabled

Note that if you only disable it in the running state, then install Ambari and run ''ambari-server setup'', it will think that SELinux is still enabled. It's best to reboot once everything else is complete.

** Disable transparent hugepages **

HortonWorks recommends disabling this memory setting since it may cause performance problems. Disable it in the running system, and also add the command to ''/etc/rc.local'' so the setting is preserved across boots.

  echo never > /sys/kernel/mm/transparent_hugepage/enabled

** Primary node pubkeys **

The primary node that has Ambari installed will need its pubkey installed on all of the nodes, **including itself**:

  ssh-copy-id hadoop-node1
  ssh-copy-id hadoop-node2
  ssh-copy-id hadoop-node3

Once everything above is done on all the nodes, you're ready to install Ambari and use it to deploy a Hadoop cluster.

** Ambari **

Install the Ambari repo, which we'll use to set up the cluster. Ambari only runs on one server (for example, hadoop-node1), and we'll use it to install HDP.

  cd /etc/yum.repos.d
  wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo

Install the package through yum:

  yum -y install ambari-server

Finally, run through the Ambari server setup. It will pull in the necessary packages itself, and using the defaults is fine.

  ambari-server setup

Start up the Ambari server:

  /etc/init.d/ambari-server start

Then access your Ambari instance on port 8080 at your server: http://hadoop-node1:8080/

The default user and password set by Ambari are ''admin'' and ''admin''.

** Ambari host checks **

When Ambari sets up the new nodes, it will look through all of them to check for service problems. You can run the check manually from the primary node:

  python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users
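You can also spot-check the OS prerequisites from the primary node yourself, over the passphraseless SSH set up earlier. A small sketch, assuming the three hostnames used in this guide:

  # Verify SELinux, swappiness, open-file limit, and THP on every node
  for host in hadoop-node1 hadoop-node2 hadoop-node3; do
    echo "== $host =="
    ssh root@$host 'getenforce; sysctl -n vm.swappiness; ulimit -n; cat /sys/kernel/mm/transparent_hugepage/enabled'
  done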
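And to confirm the Ambari server itself is responding, you can hit its REST API from the command line; a quick check, assuming the default ''admin''/''admin'' credentials:

  # Should return a JSON list of clusters (empty until you create one)
  curl -u admin:admin http://hadoop-node1:8080/api/v1/clusters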