Pegasus IV Cluster
Node installation

Overview of the node boot process

In order to boot, a compute node needs a boot loader, a Linux kernel, and a root file system. As the compute nodes do not have any kind of persistent storage (hard drives, CD-ROMs, etc.), the boot loader, kernel, and root file system have to be provided over the network. In principle, one could hand the image of a complete Linux installation to the nodes. However, this is not practical: the image would be very large and would occupy a significant portion of each compute node's RAM. Moreover, the cluster boot process would be very slow when dozens or hundreds of nodes try to download this image.

In earlier versions of Pegasus, we solved this problem by handing the nodes a small root file system that contained only the binaries and libraries absolutely necessary for the boot process (in the directories "/bin", "/sbin", and "/lib"). The main part of the system (in "/usr") was mounted read-only via NFS from the server as part of the regular boot process. Under Scientific Linux 7 / Red Hat Enterprise Linux 7 / CentOS 7, this strategy no longer works, because "/bin", "/sbin", and "/lib" have been merged into "/usr" and because the initialization is governed by systemd, which itself resides in "/usr". If we still wish to mount "/usr" via NFS, it has to be available before the proper boot process starts.

For Pegasus IV, we have therefore devised a new strategy. The compute node boot process now consists of the following steps:

1. Via PXE, the node obtains the boot loader, the kernel, and the initramfs from the server.
2. The init script in the initramfs creates a RAM-based root file system and unpacks the prepared root file system archive into it.
3. The script mounts the server's "/usr" read-only via NFS into the new root, so that it is available before systemd starts.
4. The script switches to the new root, and systemd takes over the regular boot process.
This new strategy is actually much cleaner than the old one. It results in a smaller root file system image, and it avoids the messy procedure of picking which binaries and libraries to include in "/bin" and "/lib" for a successful boot. In the following, we describe the steps to build the node kernel, the initramfs, and the root file system in detail. All steps must be carried out on the cluster server. Only when the whole package is ready do we hand it to the nodes.

Node kernel

In principle, one could use the server kernel with all its modules, but this is not practical. The modules would have to be included in the initramfs, increasing its size. Moreover, the node kernel does not need to support hard drives, sound cards, etc. We therefore build a small, compact kernel without modules but with the drivers for the node hardware compiled directly into it.

Download the sources of a current stable kernel from www.kernel.org. We use longterm version 4.14.84. Unpack the tar ball into a directory under "/usr/src/kernels/". Next, you need to configure the appropriate kernel options, including network drivers for the node hardware, initramfs support, DHCP support, and NFS support. As an example, our kernel configuration file (and the actual compiled kernel) can be found on the Downloads page. Copy the ".config" file into the kernel source directory. In this directory, run the command
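The command itself did not survive in this copy of the page; for a kernel tree of this vintage it is presumably the standard interactive configuration target:

```shell
# In the kernel source directory; opens the ncurses configuration menu
make menuconfig
```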
and select the relevant kernel options. After you are finished, run
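The build command is likewise missing here. Since the kernel is built without modules, building the compressed kernel image alone should suffice (a sketch; the parallel `-j` flag is optional):

```shell
# Build only the compressed kernel image; no modules are configured
make -j"$(nproc)" bzImage
```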
Copy the resulting kernel file "bzImage" from the subdirectory "arch/x86/boot" to "/tftpboot".

Initramfs

We build the initramfs in "/var/nodes/initramfs" and the node root file system in "/var/nodes/newroot". Create the necessary directories by running the commands
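The commands did not survive in this copy; a sketch of what they presumably look like. The exact set of initramfs subdirectories is our assumption, chosen to match the later steps: "bin" for busybox, "etc" for the hosts file, and mount points for "proc", "sys", "dev", and the new root.

```shell
# Skeleton of the initramfs and the root file system build area
mkdir -p /var/nodes/initramfs/bin /var/nodes/initramfs/etc
mkdir -p /var/nodes/initramfs/dev /var/nodes/initramfs/proc /var/nodes/initramfs/sys
mkdir -p /var/nodes/initramfs/newroot
mkdir -p /var/nodes/newroot
```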
"newroot" will be the mount point for the final node root file system. We use busybox to provide a self-contained set of Linux tools in a single executable. Download "busybox-i686" from http://www.busybox.net/downloads/binaries/latest/ and copy it into "/var/nodes/initramfs/bin". Change into this directory and run the commands
Copy the server's hosts file to the initramfs
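Presumably a single copy command; having the hosts file in the initramfs lets the init script resolve the server's name when mounting NFS (the `mkdir -p` is only there to make the sketch self-contained):

```shell
mkdir -p /var/nodes/initramfs/etc
cp /etc/hosts /var/nodes/initramfs/etc/hosts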
Create the shell script "/var/nodes/initramfs/init" with the following content.
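The script itself was lost in this copy. Based on the surrounding description (a 128 MByte tmpfs root, the root file system archive unpacked into it, and "/usr" mounted via NFS before systemd starts), it presumably looks roughly like the following sketch. The archive name "rootfs.tar", the server name "server", and the interface name "eth0" are our assumptions:

```shell
#!/bin/busybox sh
# Sketch of /var/nodes/initramfs/init -- adapt names and paths to your setup

# Kernel file systems needed during early boot
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t devtmpfs devtmpfs /dev

# Bring up the network and obtain an address via DHCP
# (udhcpc assumes a lease script that applies the address)
ip link set eth0 up
udhcpc -i eth0

# Create the RAM-based root file system and unpack the prepared image
mount -t tmpfs -o size=128m tmpfs /newroot
tar xf /rootfs.tar -C /newroot

# Mount the server's /usr read-only via NFS -- it must be in place
# before systemd (which lives in /usr) takes over
mount -t nfs -o ro,nolock server:/usr /newroot/usr

# Hand over to systemd in the new root
umount /proc /sys /dev
exec switch_root /newroot /usr/lib/systemd/systemd
```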
This script creates a node root file system of 128 MBytes. Less than 10 MBytes are actually used by the system. The generous size leaves room for log files and, importantly, for the captured screen outputs of Torque jobs. If your nodes do not have much RAM, you should be able to shrink the root file system to 64 MBytes or even less. After the root file system has been created (see next section), pack it and put it into "/var/nodes/initramfs".
Create the cpio archive of the initramfs and put it into "/tftpboot".
As an example, a copy of our initramfs can be found on the Downloads page.

Root file system

We build the node root file system on the server in the directory "/var/nodes/newroot". Create the necessary directories:
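The directory list is missing here; judging from the following paragraphs, it comprises the top-level directories of a standard Linux root (a sketch):

```shell
mkdir -p /var/nodes/newroot
cd /var/nodes/newroot
# Top-level directories of the node root file system
mkdir -p dev etc home proc root run sys tmp usr var
```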
Create links from the old-style "/bin", "/sbin", "/lib", and "/lib64" to the corresponding directories in "/usr".
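Presumably four symbolic links, made relative so that they remain valid inside the node root (a sketch; the `mkdir -p` just makes it self-contained):

```shell
mkdir -p /var/nodes/newroot
cd /var/nodes/newroot
# Old-style top-level directories point into /usr
ln -sfn usr/bin bin
ln -sfn usr/sbin sbin
ln -sfn usr/lib lib
ln -sfn usr/lib64 lib64
```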
Most of these directories remain empty ("dev", "home", "proc", "sys", "tmp", and "usr"). The others ("etc", "root", "run", "var") require some work.

root:
run:
var:
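The commands for these three directories were lost in this copy. A hedged sketch of what they plausibly cover: shell profiles for the root account, nothing for "run" (systemd mounts a tmpfs there at boot), and the usual "/var" skeleton with the legacy links into "/run" — all of these are our assumptions:

```shell
mkdir -p /var/nodes/newroot/root /var/nodes/newroot/run
# root: copy the server root account's shell profiles (assumption)
cp /root/.bash_profile /root/.bashrc /var/nodes/newroot/root/ 2>/dev/null || true
chmod 700 /var/nodes/newroot/root
# run: stays empty; systemd mounts a tmpfs on /run at boot
# var: standard skeleton plus compatibility links (assumption)
mkdir -p /var/nodes/newroot/var/lib /var/nodes/newroot/var/log
mkdir -p /var/nodes/newroot/var/spool /var/nodes/newroot/var/tmp
ln -sfn ../run /var/nodes/newroot/var/run
ln -sfn ../run/lock /var/nodes/newroot/var/lock
```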
etc: "etc" requires major work to properly configure the compute nodes. We need to copy several files and folders from "/etc" of the server and make modifications. You can either start by looking at our node root file system (see Downloads), or you can simply copy the entire "/etc" of the server and then delete unneccessary subdirectories. We keep most of the files directly in "/etc" as well as the following subdirectories (and subdirectory links) under "/var/nodes/newroot/etc":
Some of these files and subdirectories require further modifications.
Edit the file "etc/nsswitch.conf". Add "nis" to the entries for "passwd", "shadow", and "group" as shown on the right.
The file system table of the nodes is, of course, different from that of the server. Therefore, create or edit the file "etc/fstab". Its content should read
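The table itself is missing from this copy. A sketch of a plausible node fstab, assuming the server (here called "server") exports the home directories via NFS; adapt it to your own exports:

```
# /var/nodes/newroot/etc/fstab (sketch)
tmpfs           /tmp    tmpfs   defaults        0 0
server:/home    /home   nfs     defaults        0 0
```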
etc/systemd/system:

This directory contains the configuration files for systemd that govern the system startup of the compute nodes. It therefore requires major modifications. Set the default systemd target to "multi-user.target". This corresponds to runlevel 3 in older Linux systems.
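Setting the default target amounts to replacing the "default.target" link inside the node root (a sketch, run on the server):

```shell
mkdir -p /var/nodes/newroot/etc/systemd/system
# default.target decides which target systemd boots into
ln -sfn /usr/lib/systemd/system/multi-user.target \
        /var/nodes/newroot/etc/systemd/system/default.target
```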
The directories "etc/systemd/system/*.wants" contain links to services that are started as the nodes start up. Most of these services are not needed on the compute nodes. You can therefore simply delete their links in the directories "etc/systemd/system/*.wants". We only keep the following services:
We also found that we needed to modify the reboot service because otherwise the nodes would hang upon reboot. Copy the service file to "etc/systemd/system" by typing
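The command is missing here. Assuming the unit in question is "systemd-reboot.service", it presumably reads:

```shell
# The copy in etc/systemd/system overrides the stock unit in /usr,
# so the modified version takes effect on the nodes
cp /usr/lib/systemd/system/systemd-reboot.service \
   /var/nodes/newroot/etc/systemd/system/
```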
Configure Torque on compute nodes:

Make sure that you have copied the file "/etc/ld.so.conf.d/torque.conf" into the directory "/var/nodes/newroot/etc/ld.so.conf.d". Also make sure that you have copied "/etc/ld.so.cache" to "/var/nodes/newroot/etc/ld.so.cache" after you have configured Torque on the server.
Make sure that there is a link to "pbs_mom.service" in the directory "/var/nodes/newroot/etc/systemd/system/multi-user.target.wants". Copy the entire directory "/var/spool/torque" to "/var/nodes/newroot/var/spool/torque". We do not need the server and scheduler subdirectories; they can be deleted.
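In commands, these two steps presumably look like the following sketch; the unit file path and the names of the server/scheduler subdirectories ("server_priv", "sched_priv") are our assumptions:

```shell
mkdir -p /var/nodes/newroot/etc/systemd/system/multi-user.target.wants
# Enable pbs_mom on the nodes
ln -sfn /usr/lib/systemd/system/pbs_mom.service \
        /var/nodes/newroot/etc/systemd/system/multi-user.target.wants/pbs_mom.service
# Copy the Torque spool and drop the parts only the server needs
cp -a /var/spool/torque /var/nodes/newroot/var/spool/
rm -rf /var/nodes/newroot/var/spool/torque/server_priv
rm -rf /var/nodes/newroot/var/spool/torque/sched_priv
```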
Put everything together:

After the root file system is ready, pack it and put it into "/var/nodes/initramfs"
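Packing presumably amounts to creating a tar archive; the name "rootfs.tar" is our assumption and must match whatever name the init script unpacks:

```shell
mkdir -p /var/nodes/newroot /var/nodes/initramfs
cd /var/nodes/newroot
# Archive the whole node root; the init script unpacks it into the tmpfs
tar cf /var/nodes/initramfs/rootfs.tar .
```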
Then, create the cpio archive of the initramfs and put it into /tftpboot
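The kernel accepts an (optionally gzip-compressed) cpio archive in "newc" format, so the commands presumably read as follows; the output file name is an assumption and must match your PXE boot configuration:

```shell
mkdir -p /var/nodes/initramfs /tftpboot
cd /var/nodes/initramfs
# Pack the initramfs in the newc cpio format the kernel expects
find . | cpio -o -H newc | gzip > /tftpboot/initramfs.gz
```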
Do not forget to enable PXE in the node BIOS! The nodes are ready to rock 'n' roll!