Fixing Errors in the Amazon Catalog

One of my early assignments at Amazon was correcting errors in the Amazon catalog.

We built the catalog as a batch job, using two very large databases of books from the book distributors Baker & Taylor and Ingram. The book databases were indexed by ISBN, and were both widely distributed to bookstores, so that bookstores could order books from the distributors. They were not intended to be customer facing. Basically, as long as they were Good Enough for an employee or owner of a bookstore to find a book and order it to stock it in their bookstore, that was Good Enough for the distributor.

When we merged the information in these two databases into our catalog and showed it to the public, a lot of people who were not so forgiving (of confusion and error) started Helpfully Bringing These Errors To Our Attention. However, as you can likely conclude from the above description of how we created the catalog, there was no real way to take that customer feedback and use it to improve the quality of the catalog. The distributors were not going to care, after all! And while the distributors liked Amazon (even at the beginning, because we never returned anything), pestering them constantly to fix information in their catalog was not going to make them very happy.

Thus, I created a tool called “typo” which had an extremely rudimentary mechanism for:

looking up an item in the catalog (you had to know the ISBN, or later, ASIN),
displaying it, and then
giving you the option to change fields in the item.

In the original design, it was possible to make the change sensitive to what was in the field, so that our “improvement” would disappear if the field was fixed further upstream at the distributor’s database. However, it became clear that while sometimes the distributor made changes, those changes could not be viewed as improvements… ever. There was also a mechanism for making the change, regardless of the content of the field, and that was really all that was used over time. I also included a mechanism for adding a new item to the catalog. Initially, if you wanted to add an item to the catalog, and it was not listed in a distributor database, this was the only way to do it (<— oversimplification that is not important here).
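For readers who think in code, the two kinds of change described above can be sketched roughly as follows. This is a minimal illustration, not the actual typo tool: the `Override` class, its field names, and `apply_overrides` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Override:
    """One correction to a catalog field (hypothetical structure)."""
    field: str
    new_value: str
    # If set, the correction applies only while the distributor's data still
    # holds this exact value. If None, the correction always applies.
    expected_old: Optional[str] = None


def apply_overrides(record, overrides):
    """Return a copy of the upstream record with corrections applied."""
    fixed = dict(record)
    for ov in overrides:
        upstream = record.get(ov.field)
        if ov.expected_old is None or upstream == ov.expected_old:
            fixed[ov.field] = ov.new_value
    return fixed


conditional = Override("title", "The Cat in the Hat",
                       expected_old="Teh Cat in the Hat")
unconditional = Override("title", "The Cat in the Hat")

still_wrong = {"title": "Teh Cat in the Hat"}
changed_upstream = {"title": "Cat in the Hat, The"}
```

With the conditional form, the correction steps aside as soon as the distributor changes the field, for better or worse. The unconditional form wins regardless, which matches what ended up being used in practice.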

Over time, an entire department grew to receive phone calls and email about problems in our catalog and use this and some later tools to fix those problems. They were really smart, nice people who were very kind to me, despite the fact that I was the source of many of their daily frustrations, as they tried to improve the catalog. Their hard work, plus the hard work of the people who built an increasing number of databases that were merged into our catalog in batch jobs, made Amazon’s catalog so accurate, so comprehensive, and so useful for Finding Things Out About Books that it was broadly assumed that we had bought Bowker’s Books In Print, which we had not.

How Records Were Stored in the Catalog

Our catalog was a key-value database. The key was the ISBN (later, ASIN). The value was a tab-separated, newline-terminated set of fields. The structure of the database and the code which interpreted that structure could only be changed together: if you added a field to the database entries without changing the code that interpreted it, it would not go well! The reverse was also true: if the code expected a field that was not there, it was going to grab what was there and do “something” with it, and probably not what you intended.
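A small sketch may make the coupling concrete. It assumes fields are identified purely by position, which is what the description above implies. The field names, their order, and the sample values (including the page count) are invented for illustration.

```python
# Hypothetical field layouts; the real catalog's field order is not given here.
FIELDS_OLD = ["title", "author", "publisher"]
FIELDS_NEW = ["title", "author", "page_count", "publisher"]


def encode(values):
    """Join field values into one tab-separated, newline-terminated record."""
    return "\t".join(values) + "\n"


def decode(raw, field_names):
    """Label fields purely by position.

    zip() stops at the shorter sequence, so extra or missing fields
    are silently ignored rather than reported.
    """
    return dict(zip(field_names, raw.rstrip("\n").split("\t")))


new_raw = encode(["The Cat in the Hat", "Dr. Seuss", "61", "Random House"])
old_raw = encode(["The Cat in the Hat", "Dr. Seuss", "Random House"])

# The data gained a field but the code did not: the page count is read
# as the publisher.
stale_reader = decode(new_raw, FIELDS_OLD)

# The code expects a field the data lacks: the publisher is read as the
# page count, and the publisher goes missing.
eager_reader = decode(old_raw, FIELDS_NEW)
```

Neither mismatch raises an error. The reader simply does “something” with whatever sits in that position.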

This made it difficult to, for example, add a new field — say, page count, or weight, to use two early examples — to the database. As the code grew, the problem grew. If you know a little about software engineering, you recognize these problems, and you probably have pretty firm opinions about how to correct them.

While it was just books in the database, the number of new fields that we decided to add was comparatively small. However, as we branched out into new product lines (e.g. Music, DVD, Electronics), the need to add new fields grew rapidly, and it was hard to imagine ahead of time what precisely would be needed in the future. One of my later jobs at Amazon involved redesigning the biblio_record to create more flexibility for rolling out new products. At the time that I left, the catalog was still implemented as a key-value database with tab-separated, newline-terminated fields; however, fields now came in pairs. Inspired by the MARC record used by the United States Library of Congress, the first element in a pair was a code, and the second element in a pair was the contents of the field. The code element could be looked up in a separate table (authoritative version in a relational database; dumped into another fast key-value database for the batch build) to understand what kind of data (a number, a string, something else) to expect. This removed some, but not all, of the Knowledge About the Data from the code base, for people who are into that sort of refactoring. But more importantly, it made it much, much easier to add a field, and created a much more predictable path for what happened when the code encountered something it did not recognize.
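Here is a rough sketch of how a paired record might be decoded. The codes (`ti`, `au`, `pg`), the shape of the lookup table, and the function name are assumptions made up for this example; the actual codes and table layout are not described here.

```python
# Hypothetical code table: code -> (field name, converter for the contents).
# In the description above, the authoritative copy lived in a relational
# database and was dumped to a fast key-value store for the batch build.
FIELD_TABLE = {
    "ti": ("title", str),
    "au": ("author", str),
    "pg": ("page_count", int),
}


def decode_pairs(raw, table=FIELD_TABLE):
    """Decode a tab-separated, newline-terminated record of code/content pairs.

    Returns (record, unknown), where unknown holds pairs whose code is not
    in the table, so they can be skipped or logged instead of misread.
    """
    body = raw.rstrip("\n")
    parts = body.split("\t") if body else []
    if len(parts) % 2 != 0:
        raise ValueError("record does not consist of code/content pairs")

    record, unknown = {}, {}
    for code, content in zip(parts[0::2], parts[1::2]):
        entry = table.get(code)
        if entry is None:
            unknown[code] = content
            continue
        name, convert = entry
        record[name] = convert(content)
    return record, unknown


raw = "ti\tThe Cat in the Hat\tpg\t61\tzz\tsomething new\n"
record, unknown = decode_pairs(raw)
```

Because each field carries its own code, field order no longer matters, and adding a field means adding a row to the table rather than touching every reader. An unrecognized code has one predictable outcome instead of shifting every field after it.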

Searching the Catalog

Amazon CTO Shel Kaphan’s original design of the catalog included a clever search engine. In addition to the key-value database that was indexed by ISBN, and had the information about the item in the value field, there were a few other key-value databases. I will explain this in terms of Title Word search, and leave the rest up to your capable imagination.

The key of the Title index database was as follows:

TitleWord FirstNLettersoftheTitle ISBN

The values of all of the index databases were empty.

This is confusing! It takes a while to understand how this might be used, but if you think about the order produced by concatenating a Title Word, followed by the first several letters of a Title, followed by the ISBN, you get an inkling of Shel’s cleverness. If you look up Cat in this database, you’ll get probably thousands of books that have Cat in the title. They will be ordered alphabetically by the title of the book (“The Cat in the Hat” coming well before any of “The Cat Who” books). Finally, the ISBN at the end ensures uniqueness, and you can pull it off the end of the index key to get the item out of the main catalog database, without even having to look at the value field.
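The idea can be sketched with a sorted list standing in for the ordered key-value database. This is an illustration under several assumptions, not Shel’s implementation: the NUL separator, the prefix length of 10, lowercasing, splitting titles on whitespace, and the placeholder ISBNs are all choices made for this example.

```python
import bisect

SEP = "\x00"      # assumed separator; sorts before any printable character
PREFIX_LEN = 10   # assumed value of N for "first N letters of the title"


def index_keys(isbn, title):
    """Build one key per distinct title word: word, title prefix, ISBN."""
    norm = title.lower()
    prefix = norm[:PREFIX_LEN]
    return [SEP.join((word, prefix, isbn)) for word in set(norm.split())]


def build_index(catalog):
    """Return all index keys in sorted order; there are no values at all."""
    keys = []
    for isbn, title in catalog.items():
        keys.extend(index_keys(isbn, title))
    keys.sort()
    return keys


def search(index, word):
    """Scan the run of keys that begin with the word; ISBNs come back
    already ordered by title prefix."""
    start = word.lower() + SEP
    i = bisect.bisect_left(index, start)
    isbns = []
    while i < len(index) and index[i].startswith(start):
        isbns.append(index[i].rsplit(SEP, 1)[1])  # ISBN is the key's tail
        i += 1
    return isbns


# Placeholder ISBNs, not real ones.
catalog = {
    "isbn-2": "The Cat Who Could Read Backwards",
    "isbn-1": "The Cat in the Hat",
    "isbn-3": "Green Eggs and Ham",
}
index = build_index(catalog)
hits = search(index, "Cat")
```

A lookup is one seek to the start of the word’s run followed by a sequential read, which is why this kind of search is so fast. It is also why the order is fixed: the sort order is baked into the keys themselves.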

Alas, it is not possible to _re-order_ the results of the search. The _only_ search order you can present to the browser is by (first N letters of the) Title. That’s unfortunate! There are some other issues that come up associated with authors whose last name has more than one word in it (it was tough keeping them happy, while also enabling browsers to find the author even if they don’t really know the nature of the author’s last name). However, this search was blindingly fast, and served us until we had the developer bandwidth to devote to acquiring or developing another search engine.

This post was written by Rebecca Allen (Amazonian Software Engineer 1996-1998).
