The Future Is Now

The Working Archivist's Guide to Enthusiast CD-ROM Archiving Tools

I’ve seen a lot of professional archivists use flux imaging for the floppy disks in their collections: a technique in which a specialized floppy controller captures the raw signal coming from the floppy drive so that it can be preserved and decoded in software. I haven’t, however, seen many archivists using enthusiast-developed low-level reading techniques for CD-ROM. I’ve personally been making use of these techniques and I find them very helpful; I know that many other archivists and institutions could make great use of them. However, I know that information about enthusiast-developed tools is usually deeply embedded in those communities and can be hard for others to find. As someone with a foot in both worlds, I want to try to bridge the gap and make this information available a bit more widely. This post will summarize why archivists might be interested in these tools, what they can do, and how to make use of them.

Redump

People who are familiar with emulation may think of Redump as a collection of disc images online, but it’s really a metadata database for CD-ROM preservation, focused primarily on games. It collects metadata about disc image transfers but also, crucially for us, sets standards for how disc images should be created in order to ensure accuracy. Those standards are publicly available and are easy enough for anyone to follow, not just people looking to submit to Redump’s database.

Because Redump’s disc imaging standards are of sufficiently high quality, and their software and guides are freely available, I highly recommend them to all people looking to preserve CD-ROMs.

What does dumping to Redump’s standards do that typical dumping doesn’t?

Although the end product of Redump’s dumping process is a disc image in the common BIN/CUE format, the actual process is different in some key ways.

Typically, when reading a CD-ROM, the data the computer receives has been processed and transformed by the drive’s firmware. Data on a CD-ROM is stored in a scrambled [1] (encoded) format, which the drive’s firmware descrambles into the standard format before the computer receives it. The firmware also performs checksum comparison using CD-ROM’s builtin fixity format and automatically corrects any errors it finds. (The next section will describe the format of CD-ROM in more detail.)

By comparison, analogous to how a raw flux read captures a low-level image of a floppy [2] and then processes it in software, Redump’s standards make use of raw reading functions that are available on a certain set of CD drives. These raw reading functions completely disable the processing the firmware would normally apply to data tracks: the data is read in its original scrambled form, with error correction disabled, so that data is returned in as close to its original form as possible. The software then performs descrambling and error correction after it’s read. (For those interested in a more detailed technical summary of exactly what’s being done here, the redumper README goes into extensive detail.)
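
To make the software descrambling step a little more concrete, here is a minimal Python sketch of the scrambler described in ECMA-130 Annex B. It is not taken from redumper or DiscImageCreator, and the function names are hypothetical. It assumes a 2352-byte sector in which the 12-byte sync field is left untouched and the remaining 2340 bytes are XORed with the output of a 15-bit shift register (polynomial x^15 + x + 1, preset to 1).

```python
SECTOR_SIZE = 2352
SYNC_SIZE = 12  # the sync field at the start of a sector is never scrambled


def build_scramble_table() -> bytes:
    """Generate the 2340-byte XOR sequence described in ECMA-130 Annex B."""
    table = bytearray()
    register = 0x0001  # 15-bit shift register, preset to 1
    for _ in range(SECTOR_SIZE - SYNC_SIZE):
        table.append(register & 0xFF)
        for _ in range(8):
            # Feedback for x^15 + x + 1: XOR of the two lowest bits.
            feedback = (register ^ (register >> 1)) & 1
            register = ((feedback << 15) | register) >> 1
    return bytes(table)


SCRAMBLE_TABLE = build_scramble_table()


def descramble_sector(sector: bytes) -> bytes:
    """XOR everything after the sync field with the scramble table."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError(f"expected {SECTOR_SIZE} bytes, got {len(sector)}")
    body = bytes(a ^ b for a, b in zip(sector[SYNC_SIZE:], SCRAMBLE_TABLE))
    return sector[:SYNC_SIZE] + body
```

Because the operation is a plain XOR, running the same function over a descrambled sector scrambles it again. A real tool also has to locate sector boundaries in the raw stream and apply error correction, neither of which this sketch attempts.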

The primary benefit to performing rips this way is metadata: it’s possible to log better, more legible information about the descrambling and integrity check processes when they’re performed in software like this. The other benefit is that it becomes easier to reason about discs with unusual formats, discs with mastering errors from when they were produced, and discs with complex copy protection formats. Strangely-mastered or mis-mastered discs are surprisingly common, and this has been helpful for me in the past with a few discs that would otherwise have been difficult to reason about. Here are two recent examples:

  • One disc contains a mastering error which corrupted the fixity data for a single 2048-byte sector. Using a typical read, this would manifest as a read error and it would be difficult to tell from the logs that this was the result of a mastering error and not disc damage. With a raw read, it became easier to separate out the reading process from the decoding process and thus to get a better understanding of what had happened.
  • One disc contains a mastering error which places 75 sectors (150KB) of data at the start of an audio track. This would otherwise have been very easy to miss, and may not have been properly decoded by the drive’s firmware.

But Why? (aka, why is CD-ROM so weird?)

The CD-ROM format is very complex, and not all software or all disc image formats support its full set of features.

  • CD-ROM’s relationship to the audio disc format means discs can have a complex structure.
  • “ISO” files can only represent the most simple kinds of discs.
  • CD has a builtin metadata format which most disc image formats don’t support.
  • The same CD-ROM disc can have different data when viewed on different operating systems. OS-specific imaging tools may discard data for other OSs.

CD-ROM, CD audio, and multi track support

The CD format wasn’t originally designed for data at all—the original CD standard was purely designed around digital audio. The CD-ROM standard was only finalized later, and it acts as an extension to the CD audio format. The interaction between these two formats is the reason behind much of CD-ROM’s complexity.

CD audio isn’t a file-based format, and instead uses a series of unnamed, numbered tracks. CD-ROM extends this by making it possible for a track on a disc to contain data and a filesystem instead of audio. Since CD-ROM extends CD audio, the two formats aren’t mutually exclusive: a CD-ROM disc can still contain multiple tracks, and it can even contain more than one data track or a mixture of data and audio tracks.

The most commonly used disc image file format, the ISO, doesn’t support any of this advanced structure. An ISO represents a data track, not necessarily a full disc. Producing an ISO from a disc containing multiple tracks means that the rest of the disc is ignored, and only a single data track has been backed up.

The other unique feature of the ISO format compared to other disc image formats is that it omits fixity information. CD contains a builtin form of integrity protection, intended to protect against physical damage to a disc; up to a certain level of read error can be recovered using information in the error correction data. Typical data discs have sectors which are 2352 bytes long, of which 2048 bytes are data and the remaining 304 are sync, header, and error detection and correction data [3]. ISOs use a “cooked” format which strips that extra information from each sector, leaving just the data. The stripped information is less critical for a disc after it’s been transferred to a disc image, but omitting it does mean that an ISO is a less accurate representation of the physical structure of the original disc.
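
As a rough sketch of what “cooking” a sector involves, the Python below pulls the 2048 data bytes out of a raw 2352-byte sector. It assumes the common Mode 1 layout (12 bytes of sync, a 4-byte header, 2048 bytes of data, then 4 bytes of error detection code, 8 reserved bytes and 276 bytes of error correction) and deliberately rejects anything else. The function names are hypothetical and this is not how any particular tool implements it.

```python
RAW_SECTOR_SIZE = 2352
SYNC = bytes([0x00] + [0xFF] * 10 + [0x00])  # 12-byte sync pattern
MODE_OFFSET = 15   # last byte of the 4-byte header
DATA_OFFSET = 16   # user data starts right after sync + header
DATA_SIZE = 2048


def cook_mode1_sector(raw: bytes) -> bytes:
    """Return only the 2048 user data bytes of a raw Mode 1 sector."""
    if len(raw) != RAW_SECTOR_SIZE:
        raise ValueError(f"expected {RAW_SECTOR_SIZE} bytes, got {len(raw)}")
    if raw[:12] != SYNC:
        raise ValueError("no sync pattern: not a data sector")
    if raw[MODE_OFFSET] != 1:
        raise ValueError(f"unsupported sector mode {raw[MODE_OFFSET]}")
    return raw[DATA_OFFSET:DATA_OFFSET + DATA_SIZE]


def cook_track(raw_track: bytes) -> bytes:
    """Cook every sector of a raw data track held in memory."""
    if len(raw_track) % RAW_SECTOR_SIZE:
        raise ValueError("track length is not a multiple of 2352 bytes")
    return b"".join(
        cook_mode1_sector(raw_track[i:i + RAW_SECTOR_SIZE])
        for i in range(0, len(raw_track), RAW_SECTOR_SIZE)
    )
```

Going in the other direction, from cooked back to raw, would mean regenerating the header and error correction data, which is one reason to capture the raw form first and derive the cooked form later.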

Subcode - CD’s builtin metadata format

CD defines a sidecar metadata format called the “subcode” or “subchannel”. It allows for small amounts of data to be stored alongside the audio or data on a disc. In most cases, it doesn’t contain anything significant and so most CD disc image formats omit it entirely. However, it’s possible for it to contain interesting or unique data that would be lost if it’s not transferred along with a disc. Examples include CD-Text (track names for CD audio discs); CD graphics (usually used for karaoke graphics on otherwise normal audio discs); and copy protection data for commercial software.
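
For a sense of how subcode is laid out, here is a small Python sketch that splits one sector’s worth of raw subcode into its eight channels. It assumes the common raw layout of 96 bytes per sector, in which every byte carries one bit for each of the channels P through W (P in the most significant bit). The function name is hypothetical, and individual tools may store subcode differently.

```python
SUBCODE_SIZE = 96      # raw subcode bytes per sector
CHANNELS = "PQRSTUVW"  # P is bit 7 of every raw byte, W is bit 0


def deinterleave_subcode(raw: bytes) -> dict[str, bytes]:
    """Split 96 interleaved subcode bytes into eight 12-byte channels."""
    if len(raw) != SUBCODE_SIZE:
        raise ValueError(f"expected {SUBCODE_SIZE} bytes, got {len(raw)}")
    channels = {}
    for index, name in enumerate(CHANNELS):
        bit = 7 - index
        packed = bytearray(SUBCODE_SIZE // 8)
        for position, byte in enumerate(raw):
            if (byte >> bit) & 1:
                # Pack this channel's bits together, most significant first.
                packed[position // 8] |= 0x80 >> (position % 8)
        channels[name] = bytes(packed)
    return channels
```

After splitting, the Q channel holds position and track information, while R through W are where extras such as CD graphics live.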

Other builtin metadata that’s not typically preserved is contained in the disc’s leadin and leadout segments. The leadin contains the disc’s table of contents; typically, this information is preserved in a processed form via the drive’s firmware, but not in the raw format direct from the disc. Likewise, the leadout contains finalizing metadata that isn’t otherwise preserved when a CD is backed up.

Multiple filesystems in a single track

The CD-ROM format doesn’t dictate which filesystem is used on a disc, and it’s possible for a single track on a disc to contain more than one filesystem. This also means that the same disc can display drastically different content depending on whether it’s inserted into a Windows, Mac or Linux PC. I’ve personally witnessed a hybrid Mac/PC disc which had completely different contents on each system, without a single shared file between them. This means that simply backing up a disc by copying the files off the disc is unsafe: you may be missing data from one of the other filesystems. This also means that filesystem-specific backup tools can be unsafe.

I’ve seen some archivists use HFS Explorer to back up Mac CDs, for example, but this tool backs up individual filesystems from a disc—using it for a disc like this one would mean that the Windows contents would be completely lost. Even in the case that a disc is only for Mac, HFS Explorer doesn’t necessarily preserve structural filesystem content in the same format as it was stored on disc.
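
One quick way to check whether a data track might hold more than one filesystem is to look for well-known signatures. The Python sketch below is a heuristic only, with a hypothetical function name: it assumes a cooked (2048 bytes per sector) image loaded into memory and checks for the ISO 9660 “CD001” identifier in sector 16, the “ER” and “PM” markers of an Apple Partition Map with 512-byte blocks, and the “BD” or “H+” signature of an unpartitioned HFS or HFS+ volume at byte 1024. Hybrid discs can be laid out in other ways, so an empty result doesn’t prove that nothing is there.

```python
def detect_filesystems(image: bytes) -> list[str]:
    """Report filesystem signatures found at their usual offsets."""
    found = []
    # ISO 9660 volume descriptors start at sector 16; bytes 1-5 are "CD001".
    if image[16 * 2048 + 1:16 * 2048 + 6] == b"CD001":
        found.append("ISO 9660")
    # Apple Partition Map: driver descriptor in block 0, map entry in block 1.
    if image[0:2] == b"ER" or image[512:514] == b"PM":
        found.append("Apple Partition Map")
    # Unpartitioned HFS / HFS+ volumes keep their signature at byte 1024.
    if image[1024:1026] == b"BD":
        found.append("HFS")
    elif image[1024:1026] == b"H+":
        found.append("HFS+")
    return found
```

A result listing both ISO 9660 and an Apple format is a hint that the disc deserves a full disc image rather than a file-level copy from a single operating system.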

CD disc image formats

There are a wide variety of disc image formats, many of which are specific to the vendor of a particular disc image reading program, and which can represent differing levels of a CD’s features. A few common examples:

  • ISO, as mentioned above, represents a single data track at the start of a disc, and isn’t able to represent the remainder of a disc. It’s stored in a “cooked” format with error correction data removed, and omits subcode data.
  • BIN/CUE, which can represent a full multi-track disc. Stored in a “raw” format, with error correction data retained. Modern versions of the format can include subcode data and can represent complex disc structures. It uses a human-readable metadata format called the “cue sheet”. The software I’ll be talking about later in this post uses the modern extended versions of BIN/CUE.
  • CloneCD, which was originally created to properly back up discs with complex copy protection schemes. It supports the same complex disc structures as BIN/CUE, and preserves subcode information, but differs in that its metadata format is lower level and not intended to be human-readable.

In summary

CD-ROM is a complex format with a wide range of variations, and many disc image formats support only some of the kinds of discs which exist in the real world. Capturing in a complex format ensures nothing is lost while still leaving the flexibility to convert into a simpler format in the future.

The Hardware

Unlike floppy disk flux imaging, there’s no special enthusiast equipment needed here. Backing up CDs using these techniques relies on certain models of standard off-the-shelf drives manufactured by Plextor. While these drives are no longer manufactured, they’re readily available secondhand from eBay or computer recycling stores. They can frequently be purchased in good working condition for $40 or less. A full list of compatible drives can be found on the Redump wiki: http://wiki.redump.org/index.php?title=Optical_Disc_Drive_Compatibility

This list contains a mixture of internal drives and USB-based external drives. Internal drives can also be converted into external drives using a cheap USB adapter.

The Software

There are a number of different tools available; this post will focus on the most popular ones and the ones with which I have personal experience. Redump’s wiki provides step-by-step usage guides for all of the tools I recommend.

Media Preservation Frontend (Windows only)

For users who prefer GUI tools to commandline tools, Media Preservation Frontend (MPF) provides a graphical interface to the redumper, DiscImageCreator and Aaru tools. (This blog post won’t be discussing Aaru.) Unfortunately, it’s only available for Windows at this time.

It exposes each underlying tool’s feature set to the fullest extent it can, and captures the appropriate metadata. Because it’s oriented around submissions to the Redump database, it also contains some data entry fields specific to Redump, but they’re not mandatory and can be easily ignored.

redumper

redumper is a relatively new commandline disc archiving program which has quickly emerged as the Redump community’s new preferred disc backup tool. For archivists interested in using a commandline tool, redumper is my current recommendation.

Its feature set is relatively restricted compared to DiscImageCreator, but its opinionated defaults ensure it just does the right thing without extra configuration. Its focus on simplicity and reliability also extends to its metadata files: while it provides the same metadata as other options, it produces a smaller number of more organized files which I find easier to reason about. It also provides some additional metadata that I find useful.

DiscImageCreator

DiscImageCreator was formerly the tool Redump recommended, but Redump’s standards no longer recommend it. Compared to redumper, whose focus is reliability and simplicity, DiscImageCreator features a vast suite of options but is comparatively less reliable. Its metadata is also less organized and harder to read.

Its large feature set does mean that there are times when DiscImageCreator can come in handy for something specialized, but at the moment I don’t recommend it as a primary tool.

Converting from more complex formats to simpler ones

After capturing in the formats produced by redumper and DiscImageCreator, it’s possible to convert into simpler formats for access. This provides a useful tradeoff: the more complex formats are kept for longterm preservation, while copies in other formats can be temporarily produced for access and compatibility with software that needs plain ISO images.

On Mac and Linux, bchunk is an open source program which can convert BIN/CUE disc images into plain ISO files. For audio CDs or mixed-mode CDs which contain audio tracks, it can also convert audio tracks to WAV files. On Windows, IsoBuster can similarly convert disc images from one format to another.

Both redumper and DiscImageCreator produce their BIN/CUE images in a split format with one BIN file per track. For those who need a unified image with a single BIN for the same disc, binmerge (cross-platform, written in Python) and chdman (cross-platform, written in C++) can perform the conversion.

Useful metadata

In addition to backing up discs, both redumper and DiscImageCreator produce some very useful metadata after the read is complete. This information isn’t necessarily unique to this dumping technique—other software could do the same things after dumping a disc—but it’s very useful to have this automatically performed for every disc.

Both redumper and DiscImageCreator produce machine-readable XML metadata describing each track on the disc: its size, and hashes in several formats. DiscImageCreator places it in a file with the .dat extension, while redumper places it in the dat: section of its log file.

<rom name="moonlight (Track 1).bin" size="658917504" crc="ec48aea4" md5="ed350360b8f40c9c5fc4a8ce1bc41c99" sha1="8b0022a6b14842678f0beee961720103d6ca5431" />
<rom name="moonlight (Track 2).bin" size="21226800" crc="06284fb2" md5="e97b60b95764212ba4788911e236c349" sha1="8a112d2f60693f6c767d60514c9a35d3855c55b1" />
<rom name="moonlight (Track 3).bin" size="50189328" crc="2358ba07" md5="191b3f4132b862b8f9239cbe0ad22dd9" sha1="cfbb15b6782a482305a90dea00b1bf4288e617b3" />
<rom name="moonlight (Track 4).bin" size="25371024" crc="31a7d363" md5="1a5a08d9c4c4084e1a390ad5b32454bf" sha1="710ee4cb7a85d627ec9bc9c29deb0620a3d67cba" />
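
Entries like these can double as a fixity check later on. The Python sketch below is one possible way to do that, with hypothetical function names: it parses a fragment of rom entries like the ones above and compares the recorded size, CRC-32, MD5 and SHA-1 against the track files in a directory. It assumes the crc attribute is the standard CRC-32 computed by zlib.

```python
import hashlib
import xml.etree.ElementTree as ET
import zlib
from pathlib import Path


def hash_file(path: Path) -> dict[str, str]:
    """Compute the same attributes a rom entry records for a track file."""
    crc = 0
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with path.open("rb") as handle:
        while chunk := handle.read(1 << 20):
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
            sha1.update(chunk)
    return {
        "size": str(path.stat().st_size),
        "crc": f"{crc:08x}",
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
    }


def verify_rom_entries(dat_fragment: str, directory: Path) -> dict[str, bool]:
    """Map each rom name to True if the file on disk matches its entry."""
    # Wrap the bare entries in a root element so they parse as one document.
    root = ET.fromstring(f"<roms>{dat_fragment}</roms>")
    results = {}
    for rom in root.iter("rom"):
        name = rom.get("name")
        actual = hash_file(directory / name)
        results[name] = all(
            rom.get(key, "").lower() == value for key, value in actual.items()
        )
    return results
```

The file is read in chunks so that a track of several hundred megabytes doesn’t need to fit in memory at once.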

For ISO 9660/PC format discs, both programs also extract mastering date information. This comes from the primary volume descriptor (PVD) information, and contains date information pertaining to the disc’s creation. For example, from the logs for the same disc as the one above:

ISO9660 [moonlight (Track 1).bin]:
  volume identifier: CAFFE
  PVD:
0320 : 20 20 20 20 20 20 20 20  20 20 20 20 20 31 39 39                199
0330 : 36 30 36 30 37 31 34 32  39 31 36 30 30 00 31 39   6060714291600.19
0340 : 39 36 30 36 30 37 31 34  32 39 31 36 30 30 00 30   96060714291600.0
0350 : 30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 00   000000000000000.
0360 : 30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30   0000000000000000
0370 : 00 01 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

This shows that the disc has the title CAFFE, and four embedded timestamps representing the disc’s creation:

  • Volume creation date and time - 1996060714291600, aka June 7, 1996, at 14:29:16 (UTC)
  • Volume modification date - identical to the above
  • Volume expiration date - date the disc should be considered obsolete; often left with null values, as it is here
  • Volume effective date - date the disc should be used starting from; also often left null
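
These timestamps are straightforward to decode by hand or in code. As an illustration, here is a short Python sketch with a hypothetical function name that parses one 17-byte PVD timestamp: 16 ASCII digits (year, month, day, hour, minute, second and hundredths of a second) followed by a single signed byte giving the offset from GMT in 15-minute steps. It treats an all-zero field as “not set”.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_pvd_timestamp(raw: bytes) -> Optional[datetime]:
    """Decode a 17-byte ISO 9660 volume descriptor timestamp."""
    if len(raw) != 17:
        raise ValueError(f"expected 17 bytes, got {len(raw)}")
    digits = raw[:16].decode("ascii")
    if digits == "0" * 16:
        return None  # null timestamp, as in the expiration date above
    # The final byte is a signed count of 15-minute intervals from GMT.
    offset = int.from_bytes(raw[16:17], "big", signed=True)
    return datetime(
        year=int(digits[0:4]),
        month=int(digits[4:6]),
        day=int(digits[6:8]),
        hour=int(digits[8:10]),
        minute=int(digits[10:12]),
        second=int(digits[12:14]),
        microsecond=int(digits[14:16]) * 10_000,  # hundredths of a second
        tzinfo=timezone(timedelta(minutes=15 * offset)),
    )
```

Fed the creation date from the hex dump above (the digits 1996060714291600 followed by a zero byte), this should produce June 7, 1996 at 14:29:16 with a zero offset.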

Redumper also produces a full file listing for ISO 9660 discs, along with calculating their hashes. An abbreviated example from the same disc:

*** SKELETON (time check: 3s)

excluded areas hashes (SHA-1):
1a7334e9350d06a69f5dbf1e8ec8ca9c98ad89da SYSTEM_AREA
edcae21603e3564acfea07e81c205031101976ea /SAVER/OPENING.MOV
1d73c3b2f53d251a56b61e0b75c6b5184600c4ae /SAVER/TOKIMEKI.MOV
4f89fe21c61e44e1b9dedc85e09b2c1390055f9b /SAVER/ENDING.MOV
091492f54a3a182921d5255ae3560f26d4dc4d11 /SAVER/CAFFES.MOV
c1589aa3e8f55b86d0be614e835127d254eabb54 /README.TXT

What do all these files mean?

Both redumper and DiscImageCreator produce a large number of files, which can be overwhelming at first; this list provides a little guide as to what those files mean, and which ones are most important to retain for longterm preservation.

redumper

A list of files can also be found on the Redump wiki.

  • All .bin files - The disc’s data and audio tracks, one file per track.
  • discname.log - The full set of logs and metadata from the read process.
  • discname.cue - The disc’s table of contents (list of tracks) in a human-readable cuesheet format.
  • discname.toc and discname.fulltoc - The disc’s table of contents, in its original, low-level binary format.
  • discname.state - The disc’s original fixity information, in a binary format.
  • discname.subcode - The subcode metadata, in its original binary format, as stored on the disc.
  • discname.scram - The scrambled version of the disc, as a single file. While this is generally no longer needed after the reading process is complete and the data has been decoded, it contains the leadin and leadout data that is normally omitted when reading a disc; some people may elect to preserve it for that reason.

DiscImageCreator

  • All .bin files - The disc’s data and audio tracks, one file per track.
  • All .txt files - The full set of logs and metadata from the read process. Unlike redumper, these are stored as a large number of separate files.
  • discname.sub - The subcode metadata, in a processed binary format which reorders the data in order to be easier to read.
  • discname.cue - The disc’s table of contents (list of tracks) in a human-readable cuesheet format.
  • discname.ccd - The disc’s table of contents (list of tracks) in the CloneCD format, which is more complex and not designed to be read by humans.
  • discname.toc - The disc’s table of contents, in its original, low-level binary format.
  • discname.dat - XML-format metadata for each track, containing file sizes and hashes/checksums in several formats. The same data is contained in the .log file from redumper.
  • discname.c2 - The disc’s original fixity information, in a binary format.
  • Filenames containing Track 0 and Track AA - The leadin and leadout sections of the disc.
  • discname.img - A single-file copy of the disc’s data. This duplicates exactly the contents of the .bin files, and can be easily recreated by concatenating them in the future, so it’s not important to keep.
  • discname_img.cue - A copy of the cuesheet adjusted for the above file.
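
As a sketch of the concatenation mentioned above, the Python below joins per-track .bin files into a single file in track order. The function name is hypothetical, and it assumes the “name (Track N).bin” naming shown earlier in this post. It sorts by track number rather than by filename so that Track 10 lands after Track 9, and it skips the Track 0 and Track AA files, since those hold the leadin and leadout rather than tracks.

```python
import re
import shutil
from pathlib import Path

TRACK_PATTERN = re.compile(r"\(Track (\d+)\)\.bin$")


def merge_tracks(directory: Path, output: Path) -> list[Path]:
    """Concatenate numbered track files into one image, in track order."""
    numbered = []
    for path in directory.glob("*.bin"):
        match = TRACK_PATTERN.search(path.name)
        # "Track AA" never matches; "Track 0" is excluded explicitly.
        if match and int(match.group(1)) >= 1:
            numbered.append((int(match.group(1)), path))
    numbered.sort(key=lambda item: item[0])
    with output.open("wb") as merged:
        for _, path in numbered:
            with path.open("rb") as track:
                shutil.copyfileobj(track, merged)
    return [path for _, path in numbered]
```

The cue sheet for a merged image needs different offsets from the per-track one, so this only recreates the data file, not a matching cue sheet.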

Obtaining the tools

All of these tools are open source and can be downloaded from GitHub.

In addition, for Mac users, I package redumper and DiscImageCreator in Homebrew. While my packages aren’t always 100% up to date, I try to ensure that they work. They can be installed via:

  • redumper: brew install mistydemeo/digipres/redumper
  • DiscImageCreator: brew install mistydemeo/digipres/disc-image-creator

Limitations

Certain especially complex types of copy protection are still not fully supported by these tools, although the situation is improving. While redumper recently added support for the SafeDisc protection format, for example, there are still discs it’s not able to handle properly; closed-source tools such as CloneCD are still needed to handle these discs.

redumper has plans to add support for ring-based copy protection such as Ring Protech in the future, but that support is poor at the moment; again, closed-source tools such as Alcohol 120% are necessary to handle these discs.

Conclusion

I hope this guide has been helpful for those who are interested. If readers have any questions or need any other information, please feel free to reach out to me on Mastodon or Bluesky.


  1. Amazingly, this is actually the technical term - see ECMA-130 Annex B.

  2. It’s not quite analogous: a Redump-style disc rip isn’t operating on as low a level as a raw flux read is, but it’s lower-level than standard disc reading software. While the Domesday86 project exists to perform truly low-level raw laser dumps of laserdisc and LD-ROM discs, there isn’t a mature project to apply the same technique to CD.

  3. There are a few alternate sector formats which divide up the 2352 bytes differently; they devote more space to data and less space to error correction, at the risk of making a disc more susceptible to physical damage.