Yum (Yellowdog Updater, Modified) HOWTO: Planning a Yum Repository

4. Planning a Yum Repository

Before beginning, let us define a couple of terms. When we refer to a "server" below, let me remind you that we are referring to a piece of hardware. Beware! In e.g. "man yum.conf" a [server] is really a label for a repository, not a web, ftp or file server that might be one of several that provide mirrored access to the same repository. This is made clear later in the man page, where it is referenced in practice as [serverid]. A single physical "server" might well offer several repositories, each identified by a unique URL path on the server and by a unique [serverid] in /etc/yum.conf on the clients. Each [serverid] may similarly have several fallback URL's for the same repository (generally on different servers). This may initially be confusing, but it makes sense and provides for extremely robust operation.

As discussed in the Introduction, yum works by means of accessing a "yummified" repository -- a URL-accessible path where yum-arch has been run on the server to create a special directory with a known relative path displacement from the RPMS directory used to actually provide the downloads, into which all of the split-off headers are placed so they can be separately downloaded and cached on the client side.

Simple as this is (really!) there is a bit of work that should be expended planning out a yum repository, especially if you are starting from scratch and don't already have an rpm repository in place to which you are planning to "add yum support". It is probably wise to start simple and work up to a complex system of repositories and servers gradually and as need dictates (many sites will ONLY need a single monolithic server, maintained by a single wise administrator, with a single repository apiece for each distribution supported).

However, it never hurts to know a bit about what can be done in terms of repository design. To meet specific needs of the fairly wide variety of systems persons already using yum and participating in its active development, some fairly slick features have been added: the ability to operate from multiple servers and repositories in an overlaid/prioritized fashion with failover, for example.

In a small LAN this is total overkill. If one is using yum to support an entire University, however, where each department has its own "special, must have" packages or (a more common case) some rpm's that set up e.g. /etc with the appropriate configuration for just that department, the ability to specify multiple repositories on different servers that are checked in a particular order lets those departments manage just the rpm's they require that are department specific or otherwise customized.

This keeps the manager of the central/primary repository(s) sane, as that individual does not have to juggle lots of departmental configuration rpm's with names like physics-stuff-0.1.1.noarch.rpm, biology-stuff-0.1.1.noarch.rpm,... and keeps the central/primary server secure, as the departmental lan managers do not need any sort of trusted access to the primary server. This has nothing to do with whether or not these departmental managers are fundamentally "trusted" -- it is simply always a good idea to heavily restrict and regulate root-level access to a server the cracking of which de facto cracks an entire network! Putting rabid mastiffs outside the door to the server room isn't out of the question.

The following is an approximate (probably incomplete, as features are constantly being discussed on the yum list and added) list of some of the options and True Facts to consider when planning, building, or modifying a yum-based rpm repository (set):

Yum can currently work from either http (recommended) or ftp or nfs/file servers. All that is required is that the repository be accessible via a URL from the client(s) in question.
Yum is configured in /etc/yum.conf to automagically access a list of repositories...
Which are just URL's -- server paths to directories that contain rpms...
According to a policy that determines the order they are checked...
With failover, so rpms can be reliably delivered...
And optional GPG signature checking, so rpms can be securely delivered...
And control over architecture specificity, so one can load (e.g.) Athlon or i686 where it matters but i386 or noarch where it does not...
And control over distribution specificity, so one doesn't accidentally update in a way that breaks things silently.
In the True Facts category, since yum is designed to facilitate distributed maintenance (automated updates) of an rpm-based distribution of something -- operating systems, complete distribution, or even just a finite set of packages on some system installed and maintained with completely different tools -- the repository should be designed and built so that it remains up to date!

Some people won't need any of these bells and whistles. Start simple, yum's defaults are probably what you want anyway. Most people won't need more than one or two servers, perhaps a base distribution server (with primary repository images) plus an overlay server which might also mirror the primary repositories, used in a specified order, with GPG checking, for example. Some people, generally toplevel and local systems managers co-maintaining a repository network for a large organization, will use all of these features and even clamour for more on the yum list.

In the subsections below a strategy for planning out a yum repository/server set will be indicated that lets one smoothly move from a single server of a single repository (possibly originally mirrored from and updated from one of the existing distribution repositories) through to a really complex set of repositories on multiple servers that use all of the features.

4.1 A Single Repository (Using Rsync)

To set up your first repository, it probably makes sense to clone a repository from an existing server. Yum works to support pretty much any kind of rpm-based repository for any of the major Linux distributions across several hardware architectures, as well as for other operating systems that either already support rpm package management or upon which the rpm toolkits and python can be built. Unfortunately, that makes giving detailed instructions for EVERY ONE of those possibilities impossibly complex. It is probably better to give an example and leave it up to you to figure out what you need to do to set up a "leaping frog linux" distribution repository or a repository for all of the rpm's required to do molecular dynamics off of a common set of tools and data on a small cluster.

Ordinarily as a very first step one would pick a server/repository to mirror. Be cautious. Some repositories are public and their servers have lots of capacity. Others may require permission to use. Some will support rsync, others will require wget or worse as retrieval agents. Usually the repository itself will tell you what its policy is with respect to mirroring (and of course won't let you in at all if it is truly restricted), but it never hurts to ask if there is any doubt.

Rsync is probably the tool of choice for mirroring and maintenance. It only downloads files if it needs to and is very smart and (one hopes) reasonably secure. So we'll use rsync in our examples below. However, not everybody sets it up on their servers. So what do you do if the repository you wish to mirror doesn't support rsync?

At least one possibility (there are probably others) is to use wget. wget is a common tool that can be used to get whole branches, recursively, from a website in order to create a locally rereferenced copy or a mirror. It is beyond the scope of this HOWTO to detail all the steps at this point; but the wget documentation is fairly clear. Basically, one simply substitutes the appropriate wget invocation, (recursive and to an adequate depth and probably with links rereferenced, with the possibly different source url) for rsync in the examples below

For specificity only, then, the following example describes one way to set up a repository that is a mirror of the rsync-enabled dulug (primary, tier1) Red Hat mirror, and we'll even more specifically restrict ourselves to version 9 i386 and pretend that yum is not already installed on the repository.

We start by trying:


rgb@lilith|T:121>rsync mirror.dulug.duke.edu::
mirror.dulug.duke.edu
 - This rsync server is currently available to any/all people.
 - This is subject to change with little or limited notice.
 - We may in the not-so-distant-future restrict rsync access
   to tier2 red hat mirrors.


Modules:

archive         Everything
redhat-ftp      Red Hat FTP Site
redhat-base     Red Hat FTP Site
redhat-beta     Red Hat Linux beta releases
redhat-rawhide  Rawhide FTP Site
redhat-updates  Updates FTP Site
redhat-contrib  Red Hat, Inc. -- Contrib FTP Site



archive         Main Tree
redhat-ftp      Red Hat FTP Site
redhat-base     Red Hat FTP Site
redhat-beta     Red Hat Linux beta releases
redhat-rawhide  Rawhide FTP Site
redhat-updates  Updates FTP Site
redhat-contrib  Red Hat, Inc. -- Contrib FTP Site

Oh, good. Anonymous rsync is supported on this repository, and the repository's policy says we can go ahead, at least for now, and mirror any repository to be found on this site ourselves. However, it also notes that quite possibly access to this mirror in the near future will be restricted to tier2 mirrors (basically sites that let other people mirror them and hence REDUCE the load on the tier1 sites) if load gets to be more than it can handle, so perhaps you (who read this) might consider using a mirror of one of the tier2 mirrors instead, unless you are hoping to set up a site that will be a tier2 mirror and publically available in turn.

Base Repository

After a little bit of monkeying around with rsync and/or a web browser, we find the path we're looking for to a suitable "base" Red Hat 9 tree that will constitute our repository. In actual fact we'd probably want to mirror one level up from here even for a "version 9 only" repository, as one level up we'd find e.g. documentation and iso images for the release which we'd almost certainly like to have handy even if yum itself points to a repository path one directory down. For the purposes of this example, however, we go to our (already set up) web or ftp server and create a suitably named path to our new repository.

This path should probably contain something that indicates distribution and revision it contains as part of its eventual URL; http://whatever.org/base is probably a bad choice, but http://whatever.org/RedHat9/base might be ok, or something much longer if you have lots of distributions or revisions or other repositories on the server.

After creating the path, enter the directory and enter (for example):


rgb@lucifer|T:27#rsync -avH
mirror.dulug.duke.edu::redhat-ftp/redhat/linux/9/en/os .
mirror.dulug.duke.edu
 - This rsync server is currently available to any/all people.
 - This is subject to change with little or limited notice.
 - We may in the not-so-distant-future restrict rsync access
   to tier2 red hat mirrors.


Modules:

archive         Everything
redhat-ftp      Red Hat FTP Site
redhat-base     Red Hat FTP Site
redhat-beta     Red Hat Linux beta releases
redhat-rawhide  Rawhide FTP Site
redhat-updates  Updates FTP Site
redhat-contrib  Red Hat, Inc. -- Contrib FTP Site



receiving file list ... done
os/
os/i386/
os/i386/RedHat/
os/i386/RELEASE-NOTES-it.html
os/i386/dosutils/
os/i386/dosutils/fips15c/
os/i386/dosutils/fips20/
os/i386/images/
os/i386/RELEASE-NOTES-ja.html
...

(for a few hundred directories packages and files more) and we're off to the races.

Depending on your bandwidth, you will have a brand, spanking new Red Hat 9 repository in minutes to hours (however long it takes to transfer a few GB of RPM's and support materials). Obviously, a nearly identical process would work for any other distribution that has rsync-enabled originals or mirror servers. Don't forget to consider buying SOMETHING from the original distribution provider (if they sell anything) to help them make some money from providing the tremendous value you are about to distribute at zero marginal cost across a whole network of systems!

Updates Repository

The next thing to worry about is keeping the new base repository itself up to date. There are two levels to even this -- one is rsync'ing the repository itself with the master on a periodic basis. This is accomplished with a simple cron script that changes to the appropriate directory and reruns the rsync command. rsync is smart enough to only download and replace files if they've changed. rsync can be run to either delete files that disappear from the master (remaining absolutely faithful to the master image) or to get new files and update changed ones but leave the old copies of ones that disappear. You have to decide which policy is right for your site, as both have risks.

You may want to quit there, and if the primary repository image already contains an "updates" path where updates automatically appear as bugs are fixed and security holes plugged, you may be able to. The repository just installed should support basic network installs (if the distribution does in the first place) and automated yum updates to the images in the repository once your clients are appropriately configured.

However, in many cases you will need to worry about keeping this base system correctly updated outside of the automatic updates provided by the distribution maintainer or vendor.

There are a couple of reasons for this. One is that not every distribution will handle updates the same way. Some distributions may be very aggressive and maintain updates that match and replace their distribution tree with a numbering/dating scheme that yum can understand and that changes rapidly as security, performance, and bug patches are released (Red Hat does). Others might replace files in their master distribution silently, after you've come to rely on a feature that suddenly goes away. Others might be lazy and not update their master distribution at all, or may have specialized tools for detecting and retrieving updates. Finally, even if your master distribution is very reliable and aggressively maintained, you may wish to override its offerings and update from a local copy of some of the rpm's, perhaps because they contain custom patches or local data.

One solution to these various dilemnas is to add an updates repository separate (different URL/path, recall, quite possibly on the same physical server). Here you can again choose to rsync to an updates server somewhere else, or you can create one all your own and split updates out of the distribution itself any way you like.

4.2 More Repositories and Servers

In addition to a separate update repository, you may well want (eventually) to add still more repositories, and possibly more servers (viewing a repository, recall as a specific path to a single URL source of rpm's not necessarily on a distinct server). Perhaps you want to make a games rpm repository available to everybody on campus, but not on the main server which is owned by philistines that view games as a waste of time. Perhaps you want to overlay department rpm's on top of a base distribution repository (set, including updates) on a central server.

The outline above, and the example of the more or less required base repository, should be enough to enable you to set up arbitrary URL pathways to new repositories for whatever purpose you like. However, yum cannot use these repositories yet. First one has to install yum on the server(s) and run the yum-arch command on each repository. This causes the rpm headers to be extracted and paths to be dereferenced in a way that makes yum ultimately very fast and efficient.

Once this step is completed, yum then allows clients, via options in yum.conf on the various clients, to access the repositories and servers in whatever order they like (or more likely, in the order specified in the yum client rpm they automatically loaded at install time from one of those repositories).

Next Previous Contents