Pegasus IV Cluster

Thomas Vojta Research Group, Physics Department

Server installation

Pegasus Cluster Software is based on Scientific Linux 7.1. Using Red Hat Enterprise Linux 7 or CentOS 7 should work just as well because Scientific Linux is a Red Hat rebuild.

Server hardware

Any reasonably powerful PC can be used as the cluster server, at least for smaller clusters (for our current server, see Hardware). The only non-standard requirement is that it needs to have two network interfaces, one to connect to the outside world and one for the private network between the server and the compute nodes. If necessary, simply add a PCI or PCIe network card to your machine.
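
A quick way to confirm that both network interfaces are detected, and to find their device names for the configuration steps below, is for example

ip link show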

Basic installation

/        915 GByte    RAID 1 on two 1 TByte hard drives
/boot      1 GByte    RAID 1 on two 1 TByte hard drives
/home   3720 GByte    RAID 1 on two 4 TByte hard drives
swap    2 x 15 GByte  one on each of the 1 TByte hard drives

Perform a standard installation of Scientific Linux on the server. We have manually partitioned our hard drives as shown in the table on the right. On the software selection screen, select "Development and Creative Workstation". (This is not crucial, as missing packages can always be installed later.)

Set the root password and create the first user.

After the installation finishes, reboot, and then run "Software update" to receive all the latest security patches and fixes.
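
If you prefer the command line over the graphical tool, the same can be accomplished by typing (as root)

yum update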

Most of the following configuration steps require root rights.

ONBOOT=yes
BOOTPROTO=none
BROADCAST=192.168.0.255
NETMASK=255.255.255.0
IPADDR=192.168.0.254

Configure the network interface used for the private cluster network: Assign the static ip number 192.168.0.254, the subnet mask 255.255.255.0, and set the interface to be activated on boot. To do so, edit the file "/etc/sysconfig/network-scripts/ifcfg-..." where the ellipsis stands for the interface name (enp3s0 in our case). Add or edit the lines shown on the right. The other network interface connecting to the external world (enp1s0, in our case) should have been properly configured during the installation process.

To configure the firewall, start the GUI firewall configuration tool by calling

firewall-config

Assign the external network interface permanently to the "public" zone. Assign the internal network interface permanently to the "trusted" zone. (You should do this only if the cluster network is isolated and safe.) Alternatively, you can achieve this by typing the commands

firewall-cmd --permanent --zone=public --change-interface=enp1s0
firewall-cmd --permanent --zone=trusted --change-interface=enp3s0
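
Note that settings made with "--permanent" only take effect after the firewall configuration has been reloaded (or after a reboot), for example via

firewall-cmd --reload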

Install Midnight Commander via the command

yum install mc

PXE, DHCP, and TFTP

The goal of this section is to provide the infrastructure for net-booting the compute nodes. We need:

  • pxelinux boot loader
  • DHCP server (the compute nodes obtain their IP numbers via DHCP from the server)
  • TFTP server (the compute nodes download the Linux kernel and the root file system via TFTP)

192.168.0.254  pegasus4
192.168.0.1     node001 n1
B8:97:5A:10:F9:9B   192.168.0.1

Edit the file "/etc/hosts" and add lines for the cluster server and all compute nodes as shown on the right.

Create or edit the file "/etc/ethers" to list the MAC addresses of all compute nodes by adding lines such as the one shown on the right.

Create the folder "/tftpboot". This folder will hold all the files to be handed to the nodes during their boot process (the Linux kernel, the root file system, and the pxelinux boot loader). Copy pxelinux.0 from the syslinux package into /tftpboot

cp /usr/share/syslinux/pxelinux.0 /tftpboot

DEFAULT net
PROMPT 0

LABEL net
   KERNEL bzImage
   APPEND initrd=initramfs.cpio.gz rw ip=dhcp net.ifnames=0 selinux=0
Create the directory "/tftpboot/pxelinux.cfg" which holds the pxelinux configuration files. In this directory, create the file "C0A8" with the content shown on the right. This file specifies the kernel and file system image the nodes will download, as well as a few options.

  • the filename "C0A8" means that all nodes in the 192.168 network will have this configuration (hex C0 corresponds to 192 and A8 corresponds to 168); see the example after this list
  • "initrd=initramfs.cpio.gz" specifies the name of the compressed initramfs file system
  • "rw" means read/write access
  • "ip=dhcp" means the kernel gets its ip number via dhcp
  • "net.ifnames=0" disables the new, bios based network interface names so that the traditional names "eth0", "eth1", etc are used instead (this is necessary so that all compute nodes have the same interface names independent of their hardware)
  • "selinux=0" disables SE Linux (which cannot be used on the compute nodes because the Network File System does not support extended file attributes)

DHCP service and TFTP service are both provided by dnsmasq. Install the dnsmasq package (if it is not already installed).

yum install dnsmasq

listen-address=192.168.0.254
listen-address=127.0.0.1
domain=Pegasus
dhcp-range=192.168.0.1,192.168.0.253,static,255.255.255.0,infinite
dhcp-ignore=tag:!known
read-ethers
dhcp-option=40,Pegasus
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/tftpboot

Edit the dnsmasq configuration file "/etc/dnsmasq.conf". dnsmasq has a large number of configuration options. Edit or add the lines on the right.

  • the first two lines tell dnsmasq to only listen to requests from the private cluster network (coming in via 192.168.0.254) and from the loopback device
  • line 3: assign domain name "Pegasus" to machines getting their ip from dnsmasq
  • line 4: dhcp will assign static ip addresses in the range 192.168.0.1 to 192.168.0.253
  • line 5: only give ip numbers to known machines (explicitly listed either in "/etc/ethers" or in "dnsmasq.conf" itself; see the example after this list)
  • line 6: read MAC addresses of known nodes from /etc/ethers
  • line 7: set NIS domain to "Pegasus"
  • line 8: set network boot loader to pxelinux.0
  • lines 9 and 10: enable the TFTP service and set the tftp folder to "/tftpboot"
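
As an alternative to listing a node in "/etc/ethers", it can also be declared directly in "dnsmasq.conf". A sketch for node001, using the MAC address from the example above, would be

dhcp-host=B8:97:5A:10:F9:9B,192.168.0.1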

To allow the tftp service to access the "/tftpboot" directory under SE Linux, you need to change the security context via

semanage fcontext -a -t tftpdir_t '/tftpboot(/.*)?'
restorecon -v -R /tftpboot

Finally, you can start dnsmasq and enable its automatic start at boot time by typing the commands

systemctl start dnsmasq
systemctl enable dnsmasq
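
To check that the TFTP service is working, you can try to download the boot loader from the server itself. This assumes the tftp client is installed (package "tftp"):

tftp 192.168.0.254 -c get pxelinux.0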

NFS, NIS, and chrony

The goals of this section are to

  • configure NFS to export "/usr" and "/home" from the server to the compute nodes
  • set up NIS as user authentication mechanism for the compute nodes
  • set up a time server that will permit the nodes to synchronize time with the server

To install the rpm packages necessary for the network file system (NFS), type the command

yum install nfs-utils rpcbind

/usr 192.168.0.0/24(ro,no_root_squash)
/home 192.168.0.0/24(rw,no_root_squash)
Edit the file "/etc/exports" and add the lines shown on the right. This means, we export both "/usr" and "/home" to the 192.168.0.* subnet. "/usr" is exported read-only while "/home" is exported with read--write access. The no_root_squash option allows the superuser root to be treated as such by the NFS server.

Now you can start the NFS server and enable its automatic start at boot by typing

systemctl start rpcbind; systemctl start nfs-server
systemctl enable rpcbind; systemctl enable nfs-server

To test the NFS installation, type the command

showmount -e localhost
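
With the exports above, the output should look roughly like this (a sketch; the exact formatting may differ):

Export list for localhost:
/usr  192.168.0.0/24
/home 192.168.0.0/24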

The rpm packages required for installing NIS (Yellow Pages) can be installed by typing

yum install ypserv ypbind

Set the NIS domain name by typing

ypdomainname Pegasus
echo "NISDOMAIN=Pegasus" >> /etc/sysconfig/network

domain  Pegasus  server  192.168.0.254
Create or edit the file "/etc/yp.conf". Add the line shown on the right.

255.255.255.0   192.168.0.0
255.0.0.0   127.0.0.0
Create or edit the file "/var/yp/securenets". Add the lines shown on the right so that only hosts in the internal 192.168.0.0 network are allowed to connect to the NIS server

Now you can start NIS and enable its automatic start at boot by typing

systemctl start ypserv ypbind yppasswdd
systemctl enable ypserv ypbind yppasswdd

Initialize the NIS maps via

/usr/lib64/yp/ypinit -m

Specify "pegasus4", then type Ctrl-D and finish. If you later wish to update the NIS maps, for example after adding a new user, cd into the directory "/var/yp" and type

make

You can test your NIS installation via

yptest -u <user>

where <user> is a user name that exists in the NIS map.
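
Another quick check is to dump one of the NIS maps, for example the password map, via

ypcat passwd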

We use chrony to allow the compute nodes to synchronize their system time with the cluster server. Install the chrony package via

yum install chrony

allow 192.168.0.0/24
Edit the file "/etc/chrony.conf". Add or edit the line that controls NTP client access from the private cluster network as shown on the right.

Now you can start chronyd and enable its automatic start at boot by typing

systemctl start chronyd
systemctl enable chronyd
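
To verify that chronyd is running and synchronized (and, once the nodes are up, which clients have contacted it), you can use, for example,

chronyc sources
chronyc clients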

Passwordless rlogin and rsh between the nodes

NOTE: rlogin and rsh are insecure; they should be used only if your private cluster network is isolated and safe. Otherwise, look into ssh, which also allows passwordless logins.

pegasus4
node001
pegasus4  root
node001  root
To enable passwordless rlogin and rsh for normal (non-root) users, create or edit the file "/etc/hosts.equiv". Add lines for the cluster server and all compute nodes as shown on the right.

If you wish to enable passwordless rlogin/rsh for root you also need to create or edit the file "/root/.rhosts" and add the lines shown on the right for the server and all nodes. Make sure that the permissions of both files are set to 644.

Install the rsh and rsh-server packages

yum install rsh rsh-server

The files "/usr/bin/rsh", "/usr/bin/rlogin", "/usr/bin/rcp", and "/usr/bin/rexec" should have the SUID bit set.

chmod u+s /usr/bin/rsh /usr/bin/rlogin /usr/bin/rcp /usr/bin/rexec

(Note: Linux file capabilities, which can assign certain root privileges without the SUID bit, cannot be used here because the compute nodes will mount "/usr" via the NFS file system, which does not support extended attributes.)

Edit the file "/etc/securetty" and append "rsh, "rexec", and "rlogin" at the end of the file.

To allow rlogind on the cluster server to access the file "/root/.rhosts" under SE Linux:

semanage fcontext -a -t rlogind_home_t '/root/.rhosts'
restorecon -v -R '/root/.rhosts'

Finally, you can start rsh, rlogin and rexec and enable their automatic start at boot by typing

systemctl start rsh.socket rlogin.socket rexec.socket
systemctl enable rsh.socket rlogin.socket rexec.socket
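
Once the first compute node is up and running (see "Nodes"), a simple test of the passwordless setup is, for example,

rsh node001 hostname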

Intel Fortran and C/C++ compilers

To install Intel Parallel Studio XE 2016 for Linux, download the installation packages and the license file and copy them into a staging directory such as "/var/install". (Note: In contrast to all other software we use on Pegasus, Intel Parallel Studio is commercial software. We use it because it tends to produce faster code than gfortran, at least for our applications.) Unpack the installation package and run the installation script "install.sh".

Make sure you install the compilers into a directory that is exported to the compute nodes, such as "/usr/local/intel", because the nodes will need access to the libraries. The installer nonetheless puts the license file under "/opt/intel" where the compute nodes cannot see it. Therefore, copy the license file to the folder "/usr/local/intel/compilers_and_libraries_2016.0.150/linux/licenses/" or the corresponding folder for your compiler version.

If you have a floating license, you also need to install the Intel flexlm license manager. Download the installation package and copy it to the staging directory. Unpack the package and run the installation script.

To start flexlm automatically, add the following line to the file "/etc/rc.d/rc.local"

<server-install-dir>/lmgrd.intel -c <server-install-dir>/server.lic -l /var/log/lmgrd.log

where <server-install-dir> is the full path to the flexlm directory. Do not forget to make "rc.local" executable. Alternatively, you could write a proper systemd service file for flexlm.
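
Making "rc.local" executable amounts to

chmod +x /etc/rc.d/rc.local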

Finally, add the following line to the user's ".cshrc"

source /usr/local/intel/bin/compilervars.csh intel64

to set the path and environment variables. The corresponding line for ".bashrc" is

source /usr/local/intel/bin/compilervars.sh intel64
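
In a new shell, a quick check that the compiler environment is set up correctly is, for example,

which ifort
ifort --version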

Torque resource manager and Maui scheduler

Detailed installation instructions can be found in the Torque Administrator Guide (pdf version); here we just give a brief summary.

Install prerequisite packages:

yum install libtool openssl-devel libxml2-devel boost-devel gcc gcc-c++

Download the source code of Torque from www.adaptivecomputing.com. Unpack the tar ball into a directory. In this directory, run the commands

./configure
make
make install

By default, the dynamic linker does not know where to find the Torque libraries. Therefore type

echo /usr/local/lib > /etc/ld.so.conf.d/torque.conf
ldconfig

Copy the systemd service files into the directory "/usr/lib/systemd/system":

cp contrib/systemd/trqauthd.service /usr/lib/systemd/system/
cp contrib/systemd/pbs_mom.service /usr/lib/systemd/system/
cp contrib/systemd/pbs_sched.service /usr/lib/systemd/system/
cp contrib/systemd/pbs_server.service /usr/lib/systemd/system/

Start the authentication daemon and enable its automatic start at boot via

systemctl enable trqauthd.service
systemctl start trqauthd.service

Now initialize the Torque server by executing from the build directory

./torque.setup <user>
qterm

<user> becomes a manager and operator of Torque.

node001  np=4  quad  i5
node002  np=4  quad  i5
Create the file "/var/spool/torque/server_priv/nodes" that lists all compute nodes and their properties. Add lines as shown on the right. The parameter "np" specifies how many CPUs the node has (this is used by Torque to determine how many processes to put on the node). "quad" and "i5" are examples of optional node attributes that can be used when submitting a job (see User Guide - Serial Jobs).

Start the Torque server and enable its automatic start at boot via

systemctl enable pbs_server.service
systemctl start pbs_server.service

Create and configure the desired queues, for example a queue "qsNormal".

qmgr -c "set server scheduling=true"
qmgr -c "create queue qsNormal queue_type=execution"
qmgr -c "set queue qsNormal started=true"
qmgr -c "set queue qsNormal enabled=true"
qmgr -c "set server default_queue=qsNormal"

Further configuration parameters can be found in the pbs*.service files in the folder /usr/lib/systemd/system (such as the stack size limit for processes spawned by pbs_mom).

Torque will also need to be configured on the compute nodes; the required steps are discussed in "Nodes".

To install the Maui scheduler, download the Maui source from adaptivecomputing.com. (You will need to fill out a free registration to get access.) Unpack the tar ball into a directory. In this directory, run the commands

./configure
make
make install

Add "/usr/local/maui/bin" to the user's path (in user's ".cshrc" and ".bashrc").

To start Maui at boot, add the line "/usr/local/maui/sbin/maui" to the file "/etc/rc.d/rc.local". (Alternatively, write a proper systemd service file for Maui.)
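
After Maui has been started (manually or at the next boot), a quick sanity check is

showq

which should report an empty queue as long as no jobs have been submitted.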


OpenMPI

Download source code of OpenMPI 1.10 from http://www.open-mpi.org/software/ompi/v1.10/. Unpack the tar ball into a directory. In this directory, run the commands

./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/usr/local --disable-dlopen
make all install

Here, the first four options specify the use of the Intel compilers, the "--prefix" switch sets the installation directory, and the switch "--disable-dlopen" disables dynamic loading of Open MPI's plugin modules, which are built into the library instead (this reduces the file system traffic when starting large jobs).

Make sure "/usr/local/bin" is in the user's path and "/usr/local/lib" is in the environment variable LD_LIBRARY_PATH.

Note: There is a conflict between Open MPI and the Intel MPI library installed as part of Parallel Studio. (Even if you do not order Intel's Cluster Edition, part of Intel's MPI software gets installed.) Rename "/usr/local/intel/compilers_and_libraries_2016.0.150/linux/mpi" into "/usr/local/intel/compilers_and_libraries_2016.0.150/linux/mpi_renamed". Otherwise the wrong libraries and the wrong "mpirun" may be called.
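
After the renaming, you can verify that the correct installation is picked up, for example via

which mpirun        # should point to /usr/local/bin/mpirun
mpirun -np 2 hostname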