
Tools and Tricks

There are lots of ways, as time passes, to reduce the time you need to spend maintaining the cluster. Most of this one typically learns from experience, and as time passes most systems administrators accumulate a sort of ``toolbox'' of scripts or clever tricks that can be used to significantly reduce their work load while making their network ever more stable. This way they look good to their users while increasing the amount of time they have available to play Quake or listen to music or write the next killer app.

I've been doing Unix systems administration and systems programming (in addition to physics and all that) for way too long now (since 1985 or 1986 or thereabouts). I have therefore accumulated my share of these tools and have built up my own set of biases about the ``right'' way to do a lot of things. The tools are far from static in time - some things that were ``right'' or ``clever'' ten years ago are very wrong now. Also, from time to time I learn of something really clever from other folks that I never would have thought of on my own (for all my experience). Systems folks talk and share, which is a good thing as the terrain upon which they play and work is amazingly complex and so there are always things to be learned for the first time or relearned better.

This particular chapter is devoted to passing on a few little bits of systems management wisdom, mostly beowulf specific ones. To an experienced administrator, some of them may seem obvious (or even wrong, as systems persons don't always agree about what is right). So take them with a grain of salt; try them out, and if they suit you feel free to adopt them. A number of these ideas actually have been discussed on the beowulf list, and I believe that what I present in those cases is more or less the consensus view of the best way to proceed, although as usual it is my fault if I have failed, not the list's.

Let's begin with something simple like the node naming and numbering scheme. It is a general consensus that it is advisable to use a simple naming and numbering scheme for your nodes. I tend to use things like b[1-N] (because I tend to work with smallish N). A good solution is to have a hosts table something like:


127.0.0.1        localhost.localdomain localhost   # Loopback
xxx.xxx.xxx.xxx  mywulf.lan.dom.org                # Outside address
192.168.1.1      bhead                             # Head node/server/gateway
192.168.1.101    b1                                # First node
192.168.1.102    b2                                # Second node
...

where I've assumed that this is a true beowulf with a head node (that doubles as a server and a gateway) that lives both on your organization's LAN (and xxx.xxx.xxx.xxx is its first address on e.g. eth0) and on the private LAN for the beowulf itself (the 192.168.1.1 ``bhead'' address on e.g. eth1). I'm also presuming, possibly foolishly, that you know how to set up appropriate routes for these two network addresses. There are HOWTO's to help you out for all this; use them.
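A table like this is regular enough to generate rather than type. The following is a minimal sketch, assuming the 192.168.1.x private network and the b1..bN naming used above; the function name `make_hosts` and its parameters are made up for illustration, and the outside-address line is left out because that address is site specific.

```python
def make_hosts(n_nodes, prefix="192.168.1", first_host=101, head="bhead"):
    """Return hosts-table lines for a head node and nodes b1..bN."""
    # Stay inside .1 to .254 so no node lands on a .0 or .255 address.
    if n_nodes < 1 or first_host + n_nodes - 1 > 254:
        raise ValueError("node addresses must stay within .1 to .254")
    lines = [
        "127.0.0.1\tlocalhost.localdomain localhost\t# Loopback",
        f"{prefix}.1\t{head}\t# Head node/server/gateway",
    ]
    for i in range(1, n_nodes + 1):
        lines.append(f"{prefix}.{first_host + i - 1}\tb{i}\t# Node {i}")
    return lines


if __name__ == "__main__":
    print("\n".join(make_hosts(4)))
```

The output can be appended to a hand-maintained header (including the outside address) to form the real hosts file.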

A warning for the tyros: Do not use the ``0'' and ``255'' values for the least significant (rightmost) byte of an IP address. That is, only use 192.168.1.1 to 192.168.1.254 (at most) for host addresses in your internal LAN (or on your organizational LAN, for that matter). Zero and 255 are fine in the higher bytes (so 192.168.0.1 or 192.168.255.1 are OK), although I tend to avoid the zero block so that 1 is the first block (which makes sense), and I have never had so many hosts as to need the 255 block.

The reason for this is that both 0 and 255 can function as broadcast addresses and tend to always match (or fail to match) wildcard addresses. It is also common enough to ``reserve'' certain blocks of addresses for particular functions. For example Duke likes to put routers on the X.X.X.250 address of any given subnet, so that it is always easy to guess how to set up routes on a new host. They similarly put the campus nameservers on the X.X.250.[1,2...] addresses so that nameservice can be configured easily. In many organizations X.X.X.1 is reserved for the primary server (and sometimes router) for a LAN. In the example above, I effectively reserved 192.168.1.[101-254] for nodes and the lower addresses for servers, head nodes, printers, or whatever.
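To see which addresses are special on a given network, Python's standard `ipaddress` module can be asked directly. This sketch assumes the private LAN uses a 255.255.255.0 netmask (a /24), which the text implies but does not state outright.

```python
import ipaddress

# Assumed: the private beowulf LAN is 192.168.1.0 with a /24 netmask.
net = ipaddress.ip_network("192.168.1.0/24")
hosts = list(net.hosts())  # excludes the network and broadcast addresses

print("network address:  ", net.network_address)    # the ".0" address
print("broadcast address:", net.broadcast_address)  # the ".255" address
print("usable hosts:     ", hosts[0], "to", hosts[-1])
```

With other netmasks the reserved addresses move, so it is worth rerunning a check like this whenever the layout changes.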

The point is that a bit of thought and organization of your LAN IP space now can make life relatively easy later. If nothing else, a planned layout is much easier to remember and implement consistently than mixing nodes, servers, printers and routers into the IP tables in first-come, first-served order. If you ever need to subnet your network (install routers between blocks of addresses), each block of addresses will also need to share a common netmask, which argues for assigning blocks whose boundaries fall on multiples of a power of two if you think that there is any chance of your doing so in the future.

If one wishes to have ``vanity names'' in addition to simple node names in the case of a NOW-style cluster, one can add aliases to a more normal hosts table:


127.0.0.1        localhost.localdomain localhost   # Loopback
xxx.xxx.xxx.xxx  mywulf.lan.dom.org mywulf bhead   # server address
xxx.xxx.xxx.yyy  toady.lan.dom.org toady b1        # First node
xxx.xxx.xxx.zzz  froggy.lan.dom.org froggy b2      # Second node
...

where mywulf, toady and froggy are all names of workstations (including a ``server'' workstation, in the case of mywulf) that double as beowulf nodes with names bhead, b1, b2 and so forth. In this case it is more difficult to ``guess'' what the IP number of b2 is from its name (as zzz may not be sequential to yyy) but one can still identify nodes with a simple alias scheme.

If one has more than a few hundred nodes (lucky you!) then you'll have to extend this a bit and perhaps use a[1-100], b[1-100], c[1-100] on 192.168.1.[128-227], 192.168.2.[128-227], 192.168.3.[128-227], and so forth (to facilitate building netmasks, again). In this case to achieve anything like efficiency you'll almost certainly need a relatively complicated and expensive networking topology, high-performance routers, expensive switches or the like. At least 192.168.x.x has plenty of addresses to play with, though, (and 10.x.x.x even more!) so all of this is workable for pretty much any scale one might reasonably be able to conceive constructing a beowulf or cluster.
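The lettered-block scheme is a fixed mapping from name to address, which a few lines of code can make explicit. This is a sketch under the assumptions that block letter a maps to 192.168.1.x, b to 192.168.2.x and so on, and that node 1 of each block sits at .128; the helper names `node_address` and `block_network` are made up for illustration.

```python
import ipaddress
import re

FIRST_HOST = 128  # a1, b1, c1, ... all sit at .128, as in the text


def block_number(letter):
    """a -> 1, b -> 2, c -> 3, ... (the third byte of the address)."""
    return ord(letter) - ord("a") + 1


def node_address(name):
    """Map a node name such as 'b17' to its address."""
    m = re.fullmatch(r"([a-z])(\d+)", name)
    if not m or not 1 <= int(m.group(2)) <= 100:
        raise ValueError(f"bad node name: {name!r}")
    host = FIRST_HOST + int(m.group(2)) - 1
    return ipaddress.ip_address(f"192.168.{block_number(m.group(1))}.{host}")


def block_network(letter):
    """The power-of-two block (upper half of a /24) that holds one letter's nodes."""
    return ipaddress.ip_network(f"192.168.{block_number(letter)}.128/25")


print(node_address("b17"), "in", block_network("b"))
```

One caveat when planning ahead: if the upper half were ever split off as a real /25 subnet, .128 would become that subnet's network address, so under that assumption node 1 of each block would have to move up by one.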

Now, why do we bother to arrange things like this, with a simple name and mnemonic numbering scheme? For many reasons. For example, it is now very simple to write a script to do all sorts of things on each node, one at a time. Examples of scripts like this in /bin/sh and perl are given in the beowulf software appendix. For another, one doesn't have to remember that mywulf, toady, froggy, and salamander are all beowulf hosts but that cobra, krait, mamba and newt are not. One can also tell Sally to use b[1-10] for a calculation while Tommy uses b[11-20], without having to tell Sally or Tommy just which hosts (by name) those are.
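As an illustration of how little code the naming scheme demands, here is a minimal sketch (not the appendix scripts, which are in /bin/sh and perl) that expands a range such as b[11-20] and visits each node in turn over ssh. The function names are invented for this example, and it assumes ssh to each node works without a password prompt.

```python
import re
import subprocess


def expand_nodes(spec):
    """Expand a range such as 'b[11-20]' into ['b11', ..., 'b20']."""
    m = re.fullmatch(r"([A-Za-z]+)\[(\d+)-(\d+)\]", spec)
    if not m:
        raise ValueError(f"expected something like b[1-10], got {spec!r}")
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]


def run_on_nodes(spec, command, dry_run=True):
    """Run command on each node, one at a time.

    With dry_run=True only the ssh argument lists are built and returned,
    which is a safe way to check the range before touching any node.
    """
    results = []
    for node in expand_nodes(spec):
        argv = ["ssh", node, command]
        if dry_run:
            results.append(argv)
        else:
            results.append(subprocess.run(argv, capture_output=True, text=True))
    return results


for argv in run_on_nodes("b[11-13]", "uptime"):
    print(" ".join(argv))
```

Sally's half of the cluster is then just `run_on_nodes("b[1-10]", ...)` and Tommy's is `run_on_nodes("b[11-20]", ...)`.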

This latter idea extends to both setting up virtual parallel supercomputers within PVM or MPI or to configuring a cluster/LAN monitoring tool (one or two of which are also included in the software appendix). It's easy enough to work with ranges of either name or address space, but difficult to work with unique names and disconnected addresses.

If you've built a ``true beowulf'' with a real head node that functions as a gateway, it is also probably sensible to set this head node up to do IP masquerading or forwarding for the nodes (which is very easy to do in 2.2.x and higher kernels). Here is a place that some beowulf purists might easily disagree. A true beowulf built with lightweight nodes, one can argue, is a place where one ``never'' needs to login to a particular node at all, let alone access the outside world from that node.

My reasoning here (which I will stand behind) is that Murphy's Law makes it inevitable that one day one will want or need to login to a node, and even to login from one node to another node or to connect back to another workstation or resource in the outer world. That's why I recommend making the nodes ``fat'' as far as resources are concerned. It takes negligibly longer to install a node with the kitchen sink in available applications and resources, especially if one installs them in an NFS mount.

If one's favorite editors, xterms, debuggers, perl, and all the rest are all instantly available, the day you need them to cobble together some sort of ``emergency'' script designed to save your bacon you won't be trying to find some way of getting them onto a partially broken system that could die any minute and leave you with nothing. If you like, it makes hacking much easier, and any long-term system administrator knows that hacking a short term solution to an immediate problem may not be elegant but by damn it's going to be necessary, sooner or later. However, it also unleashes your creativity and capabilities for elegant and clever solutions to certain problems - if you need to update a particular directory tree on all your nodes, perhaps wget from a communal website is actually easier than messing with rsync and permissions, but this won't help you if wget isn't installed on the nodes and able to reach the external website.

With all that said, in a true beowulf one would normally require users or the administrator (you) to login to nodes either at the console (if there is a console switch of some sort) or over the network after first logging into the head node (or ``a'' head node in cases where there is more than one). I can certainly imagine needing or wanting to get to, say, my desktop workstation from a node, though, or even to a website as the wget example above suggests.

It is to facilitate this sort of thing that I made the installation of ssh a ``mandatory'' part of at least my recipe for a beowulf. Inside an ``isolated'' true beowulf (that might not ``need'' it for security) ssh manages forwarding of all sorts of network connections far better than rsh.

For example, starting at your desktop console in your organization LAN, if one ssh's into the head and then ssh's onto a node one can run X applications on the node and have the display automatically be set and forwarded back to your originating X console, which is rather awesome.

One can do even better. Both rsh and ssh have this nifty property that they check the name by which they are invoked, and if it isn't rsh (or ssh) it presumes that one is trying to run rsh or ssh to the host with the name by which the binary was invoked. The usual way to set this up is with a symbolic link, and Sun in particular had a standard directory and script for building symlinks for your organization so that rlogins or rsh's to systems within the organization could always be done just by the name of the machine.

Unfortunately, this hasn't been widely adopted within the common linux distributions, but the trick still works for both rsh and ssh (where we only care about the latter). In the appendix is a perl script for building a hostname symlink directory (historically, /usr/hosts) that contains, for example, a symlink named /usr/hosts/mywulf that points to /usr/bin/ssh. If /usr/hosts is on your path, then executing ``mywulf'' will log you into mywulf. Executing ``mywulf xterm'' will start an xterm on mywulf that should pop up on your current X console.
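The link directory itself takes only a few lines to build. The sketch below is a Python stand-in, not the perl script from the appendix; the function name `build_host_links` is made up, and the location of the ssh binary is an assumption. Whether a given ssh honors its invocation name depends on the implementation and version, so test one link by hand before relying on the rest.

```python
import os


def build_host_links(hosts, link_dir, ssh_path="/usr/bin/ssh"):
    """Create link_dir/<host> -> ssh_path for every host; return the link paths."""
    os.makedirs(link_dir, exist_ok=True)
    links = []
    for host in hosts:
        link = os.path.join(link_dir, host)
        if os.path.islink(link):
            os.remove(link)  # refresh a stale link from an earlier run
        os.symlink(ssh_path, link)
        links.append(link)
    return links


if __name__ == "__main__":
    # Writes into a scratch directory; point it at /usr/hosts (as root) for real use.
    for path in build_host_links(["mywulf", "b1", "b2"], "./hosts-demo"):
        print(path, "->", os.readlink(path))
```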

This permits a number of fabulously clever tricks. Presuming that mywulf has a similar /usr/hosts with symlinks for all the nodes, then executing ``mywulf b1 xterm'' on your desktop will crank up an xterm on b1 that displays on your current X console, forwarding through all the intermediate connections transparently even if IP forwarding per se is off. The connection is even bidirectionally encrypted in the event that you need to type any passwords into the xterm. It is fairly simple to give b1's root permission to execute root-based GUI tools on your console, if you ever need to!

Another important thing to consider when setting up your beowulf is node logging. I advise that your nodes do run syslogd - there is too much that can happen that you'd want to know about. I'd also suggest that you do not log at all to local files, e.g. /var/log/messages and so forth that are the default. Log only over the network to your head node or a similar auxiliary node inside or outside the network. That effectively eliminates the need to back up /var on your nodes, and also significantly reduces the risk that a cracker can erase their tracks, if you defend the logging node even more carefully than you defend regular nodes or workstations.
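With a classic syslogd, network-only logging comes down to a single forwarding rule on each node. The fragment below is a sketch that assumes the traditional syslog.conf syntax and uses bhead as the log host; newer logging daemons use different configuration files, so treat it as an illustration rather than a recipe.

```text
# /etc/syslog.conf on each node (assumed path): no local log files,
# everything is forwarded to the head node. Older syslogd versions
# require tabs, not spaces, between the selector and the action.
*.*	@bhead
```

The syslogd on the logging host must also be told to accept messages from the network (with sysklogd this is the -r option), since remote reception is normally off by default.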

This is by no means all of the clever tricks one can come up with for managing or operating a beowulf. Browsing the beowulf list archives will turn up many more. There is also a need for a lot more tricks to be contributed, as there is evidence that a lot of the tricks are invented by three or four or ten different people on the list independently, and sometimes one of those efforts is far better than the rest. The beowulf list facilitates a genetic optimization process, but this process works best when information flows in and out and can be compared and sorted and recovered.


Robert G. Brown 2004-05-24