Pegasus IV Cluster
Node installation

Overview of the node boot process

In order to boot, a compute node needs a boot loader, a Linux kernel, and a root file system. As the compute nodes do not have any kind of persistent storage (hard drives, CD-ROMs, etc.), the boot loader, kernel, and root file system have to be provided over the network. In principle, one could hand the image of a complete Linux installation to the nodes. However, this is not practical: the image would be very large and would occupy a significant portion of each compute node's RAM. Moreover, the cluster boot process would be very slow when dozens or hundreds of nodes try to download this image.

In earlier versions of Pegasus, we solved this problem by handing the nodes a small root file system that contained only the binaries and libraries absolutely necessary for the boot process (in the directories "/bin", "/sbin", and "/lib"). The main part of the system (in "/usr") was mounted read-only via NFS from the server as part of the regular boot process. Under Scientific Linux 7 / Red Hat Enterprise Linux 7 / CentOS 7, this strategy no longer works, because "/bin", "/sbin", and "/lib" have been merged into "/usr" and because the initialization is governed by systemd, which itself resides in "/usr". If we still wish to mount "/usr" via NFS, it has to be available before the proper boot process starts.

For Pegasus IV, we have therefore devised a new strategy. The compute node boot process now consists of the following steps:

1. Via PXE, the node obtains the boot loader, the kernel, and the initramfs from the server.
2. The init script in the initramfs creates a RAM-based root file system and unpacks the prepared root file system archive into it.
3. The script mounts the server's "/usr" read-only via NFS into the new root, so that it is available before systemd starts.
4. The script switches to the new root, and systemd takes over the regular boot process.
This new strategy is actually much cleaner than the old one. It results in a smaller root file system image, and it avoids the messy procedure of picking which binaries and libraries to include in "/bin" and "/lib" for a successful boot. In the following, we describe the steps to build the node kernel, the initramfs, and the root file system in detail. All steps must be carried out on the cluster server. Only when the whole package is ready do we hand it to the nodes.

Node kernel

In principle, one could use the server kernel with all its modules, but this is not practical. The modules would have to be included in the initramfs, increasing its size. Moreover, the node kernel does not need to support hard drives, sound cards, etc. We therefore build a small, compact kernel without modules but with the drivers for the node hardware compiled directly into it.

Download the sources of a current stable kernel from www.kernel.org. We use longterm version 4.14.84. Unpack the tar ball into a directory under "/usr/src/kernels/". Next, you need to configure the appropriate kernel options, including network drivers for the node hardware, initramfs support, DHCP support, and NFS support. As an example, our kernel configuration file (and the actual compiled kernel) can be found on the Downloads page. Copy the ".config" file into the kernel source directory. In this directory, run the command
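The command itself did not survive in this copy of the page; for a kernel tree of this vintage it is presumably the standard interactive configuration target:

```shell
# In the kernel source directory; opens the ncurses configuration menu
make menuconfig
```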
and select the relevant kernel options. After you are finished, run
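The build command is likewise missing here. Since the kernel is built without modules, building the compressed kernel image alone should suffice (a sketch; the parallel `-j` flag is optional):

```shell
# Build only the compressed kernel image; no modules are configured
make -j"$(nproc)" bzImage
```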
Copy the resulting kernel file "bzImage" from the subdirectory "arch/x86/boot" to "/tftpboot".

Initramfs

We build the initramfs in "/var/nodes/initramfs" and the node root file system in "/var/nodes/newroot". Create the necessary directories by running the commands
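The commands did not survive in this copy; a sketch of what they presumably look like. The exact set of initramfs subdirectories is our assumption, chosen to match the later steps: "bin" for busybox, "etc" for the hosts file, and mount points for "proc", "sys", "dev", and the new root.

```shell
# Skeleton of the initramfs and the root file system build area
mkdir -p /var/nodes/initramfs/bin /var/nodes/initramfs/etc
mkdir -p /var/nodes/initramfs/dev /var/nodes/initramfs/proc /var/nodes/initramfs/sys
mkdir -p /var/nodes/initramfs/newroot
mkdir -p /var/nodes/newroot
```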
"newroot" will be the mount point for the final node root file system. We use busybox to provide a self-contained set of Linux tools in a single executable. Download "busybox-i686" from http://www.busybox.net/downloads/binaries/latest/ and copy it into "/var/nodes/initramfs/bin". Change into this directory and run the commands
Copy the server's hosts file to the initramfs
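Presumably a single copy command; having the hosts file in the initramfs lets the init script resolve the server's name when mounting NFS (the `mkdir -p` is only there to make the sketch self-contained):

```shell
mkdir -p /var/nodes/initramfs/etc
cp /etc/hosts /var/nodes/initramfs/etc/hosts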
Create the shell script "/var/nodes/initramfs/init" with the following content.
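The script itself was lost in this copy. Based on the surrounding description (a 128 MByte tmpfs root, the root file system archive unpacked into it, and "/usr" mounted via NFS before systemd starts), it presumably looks roughly like the following sketch. The archive name "rootfs.tar", the server name "server", and the interface name "eth0" are our assumptions:

```shell
#!/bin/busybox sh
# Sketch of /var/nodes/initramfs/init -- adapt names and paths to your setup

# Kernel file systems needed during early boot
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t devtmpfs devtmpfs /dev

# Bring up the network and obtain an address via DHCP
# (udhcpc assumes a lease script that applies the address)
ip link set eth0 up
udhcpc -i eth0

# Create the RAM-based root file system and unpack the prepared image
mount -t tmpfs -o size=128m tmpfs /newroot
tar xf /rootfs.tar -C /newroot

# Mount the server's /usr read-only via NFS -- it must be in place
# before systemd (which lives in /usr) takes over
mount -t nfs -o ro,nolock server:/usr /newroot/usr

# Hand over to systemd in the new root
umount /proc /sys /dev
exec switch_root /newroot /usr/lib/systemd/systemd
```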
This script creates a node root file system of 128 MBytes. Less than 10 MBytes are actually used by the system. The generous size leaves room for log files and, importantly, for the captured screen outputs of Torque jobs. If your nodes do not have much RAM, you should be able to shrink the root file system to 64 MBytes or even less. After the root file system has been created (see next section), pack it and put it into "/var/nodes/initramfs".
Create the cpio archive of the initramfs and put it into "/tftpboot".
As an example, a copy of our initramfs can be found on the Downloads page.

Root file system

We build the node root file system on the server in the directory "/var/nodes/newroot". Create the necessary directories:
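The directory list is missing here; judging from the following paragraphs, it comprises the top-level directories of a standard Linux root (a sketch):

```shell
mkdir -p /var/nodes/newroot
cd /var/nodes/newroot
# Top-level directories of the node root file system
mkdir -p dev etc home proc root run sys tmp usr var
```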
Create links from the old-style "/bin", "/sbin", "/lib", and "/lib64" to the corresponding directories in "/usr".
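Presumably four symbolic links, made relative so that they remain valid inside the node root (a sketch; the `mkdir -p` just makes it self-contained):

```shell
mkdir -p /var/nodes/newroot
cd /var/nodes/newroot
# Old-style top-level directories point into /usr
ln -sfn usr/bin bin
ln -sfn usr/sbin sbin
ln -sfn usr/lib lib
ln -sfn usr/lib64 lib64
```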
Most of these directories remain empty ("dev", "home", "proc", "sys", "tmp", and "usr"). The others ("etc", "root", "run", "var") require some work.

root:
run:
var:
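The commands for these three directories were lost in this copy. A hedged sketch of what they plausibly cover: shell profiles for the root account, nothing for "run" (systemd mounts a tmpfs there at boot), and the usual "/var" skeleton with the legacy links into "/run" — all of these are our assumptions:

```shell
mkdir -p /var/nodes/newroot/root /var/nodes/newroot/run
# root: copy the server root account's shell profiles (assumption)
cp /root/.bash_profile /root/.bashrc /var/nodes/newroot/root/ 2>/dev/null || true
chmod 700 /var/nodes/newroot/root
# run: stays empty; systemd mounts a tmpfs on /run at boot
# var: standard skeleton plus compatibility links (assumption)
mkdir -p /var/nodes/newroot/var/lib /var/nodes/newroot/var/log
mkdir -p /var/nodes/newroot/var/spool /var/nodes/newroot/var/tmp
ln -sfn ../run /var/nodes/newroot/var/run
ln -sfn ../run/lock /var/nodes/newroot/var/lock
```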
etc: "etc" requires major work to properly configure the compute nodes. We need to copy several files and folders from "/etc" of the server and make modifications. You can either start by looking at our node root file system (see Downloads), or you can simply copy the entire "/etc" of the server and then delete unneccessary subdirectories. We keep most of the files directly in "/etc" as well as the following subdirectories (and subdirectory links) under "/var/nodes/newroot/etc":
Some of these files and subdirectories require further modifications.
Edit the file "etc/nsswitch.conf". Add "nis" to the entries for "passwd", "shadow", and "group" as shown on the right.
The file system table of the nodes is, of course, different from that of the server. Therefore, create or edit the file "etc/fstab". Its content should read
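The table itself is missing from this copy. A sketch of a plausible node fstab, assuming the server (here called "server") exports the home directories via NFS; adapt it to your own exports:

```
# /var/nodes/newroot/etc/fstab (sketch)
tmpfs           /tmp    tmpfs   defaults        0 0
server:/home    /home   nfs     defaults        0 0
```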
etc/systemd/system:

This directory contains the configuration files for systemd that govern the system startup of the compute nodes. It therefore requires major modifications. Set the default systemd target to "multi-user.target". This corresponds to runlevel 3 in older Linux systems.
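Setting the default target amounts to replacing the "default.target" link inside the node root (a sketch, run on the server):

```shell
mkdir -p /var/nodes/newroot/etc/systemd/system
# default.target decides which target systemd boots into
ln -sfn /usr/lib/systemd/system/multi-user.target \
        /var/nodes/newroot/etc/systemd/system/default.target
```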
The directories "etc/systemd/system/*.wants" contain links to services that are started as the nodes start up. Most of these services are not needed on the compute nodes. You can therefore simply delete their links in the directories "etc/systemd/system/*.wants". We only keep the following services:
We also found that we needed to modify the reboot service because otherwise the nodes would hang upon reboot. Copy the service file to "etc/systemd/system" by typing
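The command is missing here. Assuming the unit in question is "systemd-reboot.service", it presumably reads:

```shell
# The copy in etc/systemd/system overrides the stock unit in /usr,
# so the modified version takes effect on the nodes
cp /usr/lib/systemd/system/systemd-reboot.service \
   /var/nodes/newroot/etc/systemd/system/
```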
Configure Torque on compute nodes:

Make sure that you have copied the file "/etc/ld.so.conf.d/torque.conf" into the directory "/var/nodes/newroot/etc/ld.so.conf.d". Also make sure that you have copied "/etc/ld.so.cache" to "/var/nodes/newroot/etc/ld.so.cache" after you have configured Torque on the server.
Make sure that there is a link to "pbs_mom.service" in the directory "/var/nodes/newroot/etc/systemd/system/multi-user.target.wants". Copy the entire directory "/var/spool/torque" to "/var/nodes/newroot/var/spool/torque". We do not need the server and scheduler subdirectories; they can be deleted.
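In commands, these two steps presumably look like the following sketch; the unit file path and the names of the server/scheduler subdirectories ("server_priv", "sched_priv") are our assumptions:

```shell
mkdir -p /var/nodes/newroot/etc/systemd/system/multi-user.target.wants
# Enable pbs_mom on the nodes
ln -sfn /usr/lib/systemd/system/pbs_mom.service \
        /var/nodes/newroot/etc/systemd/system/multi-user.target.wants/pbs_mom.service
# Copy the Torque spool and drop the parts only the server needs
cp -a /var/spool/torque /var/nodes/newroot/var/spool/
rm -rf /var/nodes/newroot/var/spool/torque/server_priv
rm -rf /var/nodes/newroot/var/spool/torque/sched_priv
```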
Put everything together:

After the root file system is ready, pack it and put it into "/var/nodes/initramfs"
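Packing presumably amounts to creating a tar archive; the name "rootfs.tar" is our assumption and must match whatever name the init script unpacks:

```shell
mkdir -p /var/nodes/newroot /var/nodes/initramfs
cd /var/nodes/newroot
# Archive the whole node root; the init script unpacks it into the tmpfs
tar cf /var/nodes/initramfs/rootfs.tar .
```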
Then, create the cpio archive of the initramfs and put it into /tftpboot
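The kernel accepts an (optionally gzip-compressed) cpio archive in "newc" format, so the commands presumably read as follows; the output file name is an assumption and must match your PXE boot configuration:

```shell
mkdir -p /var/nodes/initramfs /tftpboot
cd /var/nodes/initramfs
# Pack the initramfs in the newc cpio format the kernel expects
find . | cpio -o -H newc | gzip > /tftpboot/initramfs.gz
```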
Do not forget to enable PXE in the node BIOS! The nodes are ready to rock 'n' roll!