2008-09-12

OS/X single-user-mode backups

I've been trying to take advantage of the script /etc/rc.server to do full backups of the boot drive. This file is present in OS/X server, and it can be added to non-server systems. Basically, the script is run via /bin/sh early in the boot process, at a time similar to single-user mode. Only kernel drivers are present, which means that the internal hard drive and FireWire drives are available, but not (at least not yet) USB drives. The boot drive is semi-mounted, in read-only mode. Semi-mounted means that its device is listed as "root_device", not as an actual drive.

The advantage of bringing a server down on a regular schedule for backups is that there are no open files, and the entire system drive is unwritable. This maximizes the thoroughness of the backup. Furthermore, there are some cases where backup programs such as asr(1) can use more efficient techniques for read-only drives than for read-write drives.

There are huge disadvantages, though. First and foremost is that while the server is down for backups, it can't provide whatever services it is responsible for. In the case of our servers, that's at least DHCP, DNS, OD, and file and web services. However, in a situation such as in our lab, where the amount of data is relatively small (probably less than 20-25 GB), the down-time will not be excessive, probably an hour or less.

The second class of disadvantages is almost a deal-breaker. OS/X has chosen to implement so much of its device and file-system interface code in terms of user-mode "frameworks" rather than kernel-mode drivers that hardly any of even the command-line utilities are available for use in single-user mode! For example, diskutil, the main filesystem tool, is unavailable. Disktool doesn't work any better. Hdiutil will do some things, but cannot attach images or use the -srcfolder mode. It turns out that the best of the tools that actually work in single-user mode is asr.

Asr works far faster when it is in "device" mode. In order to enter device mode, the target drive must be at least as large as the source drive, the "-erase" flag must be used, and the source drive must be mounted in read-only mode. Asr is only for hfs-format drives. There is also "copy" mode, which is also fairly fast; it is used when device mode's criteria are not met.
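As a rough sketch of what a device-mode request might look like, the function here only assembles and prints an asr command; it does not run it. The device name /dev/disk2s3 is a made-up placeholder, and build_asr_command is a hypothetical helper. The long-option spelling (restore, --source, --target, --erase, --noprompt) follows the asr(1) man page; whether "/" is accepted as the source in the rc.server environment is an assumption, so check the local man page and test on a scratch drive first.

```shell
#!/bin/sh
# Dry-run sketch: print the asr command that would clone the boot
# volume onto a target partition. Nothing is erased or copied here.

build_asr_command() {
    target_dev="$1"    # hypothetical target, e.g. /dev/disk2s3
    # --erase is one of the conditions for device mode;
    # --noprompt skips the confirmation that --erase would otherwise ask for.
    echo "asr restore --source / --target $target_dev --erase --noprompt"
}

build_asr_command /dev/disk2s3
```

Printing the command rather than executing it makes it easy to log what would be done before trusting the script with a real drive.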

The biggest problem with asr is that it copies everything, including the volume label, and there is no way to avoid this. Therefore, the copy ends up as a filesystem named the same as the boot drive. This can cause confusion. One would think that the solution would be to simply change the volume name after the clone operation. However, due to the impoverished runtime environment of single-user mode, there is no tool available to do it. The pdisk(1) program has a partition-labeling option that does work in single-user mode, but it turns out that this is not the same thing as the hfs volume label. When the system comes up, the clone will be mounted under the same name as the main drive with a " 1" suffix, like "Macintosh HD 1". If there are several backup partitions and nothing is done, they will all end up with nearly identical names that differ only in the suffix, like "... 1", "... 2", etc. (See update below.)

This can cause considerable confusion. The only "solution" I've been able to come up with is to write some information into the /tmp directory regarding the backup itself, and then once the system comes back up, use diskutil to rename the volume accordingly. A good way to do this is in crontab, with the "@reboot" time indicator.

As for a general strategy, it is important to have at least 2 backup partitions (2 drives would be better), so in case something bad happens, the previous backup would be available. Also, the backup partitions should be larger than the system disk.

At present, my servers both have 250GB RAID mirrors, and I have a 300GB FireWire drive for each of them. As an initial test, I will simply do a single asr backup of the main drive onto the FireWire drive--this is nearly as safe as having two partitions on the FireWire drive. Later, I'll get another FireWire drive for each one, and swap the drives.

UPDATE

Here's a kind of strange way to get around the problem of the volume name. Instead of trying to change, ex post facto, the name of the clone, why not change the volume name of the system disk? This can be run very early in the launchd process, and all it takes is "[sudo] diskutil rename / newname". Obviously, the name will have the boot time in it, or, more difficult, a sequence number related to the backup system. I think a timestamp of the form 200901171402 would be good. As for the rest of the string, why not simply look at the current name (you can get this from the "list -plist" diskutil command)? If the last 12 characters of the current name (which will be identical to that of the most recent backup) are digits, then they will be replaced with the new timestamp; otherwise nothing will happen. If there is a reasonably short system name, like "Lab 13", then why not name the root drive something like "Lab 13 System Disk " followed by the timestamp? Each time the system is rebooted, the timestamp will be updated. The volume names of the clones will contain the timestamp of the previous boot. Since the root volume, unlike other mounted volumes, doesn't get mounted as /Volumes/VolName, the change of name will have no effect on paths, etc.
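The replacement rule can be sketched in plain /bin/sh. This is only an illustration: new_volume_name is a hypothetical helper, and the current name is passed in as an argument rather than extracted from diskutil output.

```shell
#!/bin/sh
# Sketch of the timestamp-replacement rule for the root volume name.
# If the name ends in 12 digits, swap them for the new stamp;
# otherwise return the name unchanged.

new_volume_name() {
    current="$1"   # current volume name
    stamp="$2"     # new 12-digit timestamp, e.g. from: date +%Y%m%d%H%M
    case "$current" in
        *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            # Drop the last 12 characters and append the new stamp.
            prefix="${current%????????????}"
            printf '%s%s\n' "$prefix" "$stamp"
            ;;
        *)
            # No trailing timestamp: leave the name alone.
            printf '%s\n' "$current"
            ;;
    esac
}

new_volume_name "Lab 13 System Disk 200901171402" "$(date +%Y%m%d%H%M)"
```

The result would then be handed to "diskutil rename / ..." once the system is far enough into the boot for diskutil to work.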

Plan 2: A slightly more satisfactory but more risky way to do a similar thing would be to change the name of the root volume (e.g., to "Lab 13 Clone 200901171402") before restarting the system for an automated backup. The name would always be changed back to a constant (e.g., "Lab 13" or the part before " Clone...") early in the boot process.

In either case, there would need to be three scripts:
  1. The setup-and-reboot script, which would search for the target volumes, decide which one to write to next, and, if there is one, store that information in /tmp/rc.autoclone or whatever. This script will change the name of the root volume just before rebooting under plan 2.
  2. The clone script, which would expect information in /tmp/rc.autoclone that, if it checks out, will control the clone operation.
  3. The cleanup script, to be run very early in the boot process, to change the name of the root volume back to its normal name. In plan 1, the name will contain a timestamp; in plan 2, it will be an ordinary name like "Lab 13".
The first script can be run either from the command line or from a launchd/crontab entry. The second script must be run at the end of /etc/rc.server, and it must contain safety checks to prevent writing onto the wrong medium. The third script would ordinarily be run very early in the boot sequence from launchd/crontab.

As always, the most dangerous possibility is that the clone will be written into the wrong place. Since diskutil doesn't work right in single-user mode, probably the safest way to handle this is to write a check string into the destination volume, for example, a file in its root directory called rc.autoclone identical to the one in /tmp. Also, there must be a directory created in /tmp called "mnt" as a place to mount the destination volume, so that /tmp/mnt/rc.autoclone can be compared to /tmp/rc.autoclone.

So in rc.server,
  1. Check for /tmp/rc.autoclone
  2. Check that the timeout interval has not passed (e.g., 5 minutes)
  3. Check for /tmp/mnt
  4. Get info on all current drives
  5. Search for the drive indicated in /tmp/rc.autoclone
  6. Mount the drive on /tmp/mnt and compare its version of rc.autoclone
  7. Unmount the target drive
  8. Perform the clone operation
  9. Done
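The safety checks in steps 1 to 3 and 6 might look roughly like the following. This is a minimal sketch, not a tested rc.server fragment: clone_is_safe is a hypothetical helper, it assumes the target is already mounted on the given mount point, and it assumes find (with -mmin) and cmp are usable that early in the boot.

```shell
#!/bin/sh
# Sketch: allow the clone only if the request file exists, is recent,
# and the target volume carries an identical copy of it.

clone_is_safe() {
    request="$1"      # e.g. /tmp/rc.autoclone
    mountpoint="$2"   # e.g. /tmp/mnt, with the target mounted there
    max_age_min="$3"  # timeout in minutes, e.g. 5

    [ -f "$request" ] || return 1
    [ -d "$mountpoint" ] || return 1
    # find prints the file only if it was modified more than
    # max_age_min minutes ago, so any output means "too old".
    [ -z "$(find "$request" -mmin +"$max_age_min")" ] || return 1
    # Byte-for-byte comparison with the copy on the target volume;
    # cmp also fails if that copy is missing.
    cmp -s "$request" "$mountpoint/rc.autoclone"
}

if clone_is_safe /tmp/rc.autoclone /tmp/mnt 5; then
    echo "check string matches; clone may proceed"
else
    echo "refusing to clone"
fi
```

Any failed check falls through to the refusal branch, so a missing file, a stale request, or a mismatched target all leave the drives untouched.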
In the after-boot crontab (plan 2),
  1. Get the current / volume name
  2. If it needs to be changed, change it
The setup script does:
  1. Do general checking to prevent being called at the wrong time (e.g., too soon)
  2. Make sure that at least one target volume is available
  3. Look through all possible target volumes to find the one with the oldest timestamp.
  4. Compute its ID string
  5. Create /tmp/rc.autoclone containing the ID string
  6. Write a copy of rc.autoclone into the root directory of the target volume
  7. Reboot
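
Step 3 of the setup script can be sketched as a small filter, assuming the clones carry the 12-digit timestamp at the end of their volume names as described in the update. oldest_volume is a hypothetical helper; it reads candidate names on standard input and skips any name without a trailing timestamp, so a never-used target would need separate handling.

```shell
#!/bin/sh
# Sketch: from volume names on stdin (one per line), print the one
# whose trailing 12-digit timestamp is oldest.

oldest_volume() {
    awk '
        /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
            # Fixed-width YYYYMMDDhhmm stamps order correctly as strings.
            stamp = substr($0, length($0) - 11)
            if (best == "" || stamp < best) { best = stamp; name = $0 }
        }
        END { if (name != "") print name }
    '
}

printf '%s\n' \
    "Lab 13 System Disk 200901171402" \
    "Lab 13 System Disk 200812010300" \
    "Scratch" | oldest_volume
```

In the real script the input would come from whatever lists the mounted target volumes, for example the entries under /Volumes.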
