The Database Row That Did and Didn't Exist

Every now and then, I run into a bug so mystifying on its face I know it’s going to be a journey just from the error message. This is one of those stories.

I work at a Python shop right now, and we were upgrading an app to Django 5. There are a few breaking changes, but nothing too unexpected or out of the ordinary. Everything was working great in our development environment, but once we deployed to our QA environment we noticed something very strange—an IntegrityError exception telling us that an object couldn’t be saved because the primary key already existed in the database.

My immediate reaction was that, well, that was unusual. I didn’t think we were ever manually assigning an id and we were always relying on the database’s autoincrement. I double checked the code in the traceback, and indeed, all I saw was something harmless. An anonymized version looks something like this:

object = models.Model.filter(property1=value).get()
object.property2 = method_result()
object.save()

Not only were we not assigning it a clashing id, we weren’t even creating a new row from scratch. This was an existing database record we’d just queried for, then updated a single property on and then saved. We shouldn’t have been trying to insert a new row into the database at all—Django should have been able to know to UPDATE the existing row it itself had fetched in the first place. I wracked my head trying to figure out what was going on, and started trying everything I could think of to narrow it down:

Did it always happen? (Yes.)
Did it happen with every record, not just that one? (Yes—or so we thought.)
Did it happen if I simplified the code to the most basic possible version, to isolate out anything else that could be messing with it? (Yes; just fetching the record and immediately saving it without changes triggered the bug.)
Did it happen if I forced Django to perform an UPDATE instead of an INSERT? (Yes, but with new symptoms; calling object.save(force_update=True) failed with a message about how it couldn’t do an UPDATE because the row didn’t exist.)
Did it happen from a django-admin shell on the QA server? (Yes.)
Could I manually query for and update the row from a psql console? (Yes, so Postgres itself was fine.)

By this point several of us were poking at the problem from different angles trying to figure out exactly what was going on. We had a little breakthrough when a coworker idly tried fetching not one of our recent records but the record with id 1—and Django was perfectly happy to save it back. It was only some of our records that seemed to be cursed, and again only in Django—Postgres itself would happily write to them.

By this point there were three of us poking at the problem, and each of us made a discovery building on the experiments we’d done so far that ended up cracking the problem. First, one person took a look at the queryset that Django used to determine if the primary key existed in the database and discovered that it was completely empty. This determines whether Django thinks the record is newly-created or not, and so—even though the record was fetched from the database—it determined the record was new and should be inserted into the database with an INSERT instead of using UPDATE on an existing row.

Well, that was the proximate cause, but that didn’t yet explain why that was happening. The next clue came in the discovery that the Django model definitions and the schema of the actual database had drifted for historical reasons. Specifically, several tables—including this one—were specified as using an int for their primary key in Django, but used bigint in the real database. This was true for both our QA server and production, so it wasn’t just a QA-specific drift. We also, at this point, noticed that the records we were having trouble with all had ids outside the int range. Until this point, the only small primary key we’d tried was id=1, but we tried a selection of other small ids and found that Django would read and write those just fine. It was only once we entered bigint’s range that the Django problem reared its head.

Which, finally, led me to look back at something I’d previously noticed but not paid much attention to in the Django 5.0 release notes. In the Miscellaneous section, it mentions that

Filtering querysets against overflowing integer values now always returns an empty queryset. As a consequence, you may need to use ExpressionWrapper() to explicitly wrap arithmetic against integer fields in such cases.

This took on new meaning with the knowledge we now had about the mismatch of primary key types. We were in fact filtering querysets against overflowing integer values, or rather Django itself was because Django had been given a wrong understanding of what the integer type was. In prior versions, Django had let the “impossible” comparison carry out and it worked by chance; in Django 5, it enforced internal correctness by rejecting a comparison the underlying database would be able to carry out. I certainly can’t call it the wrong decision from a correctness standpoint, but it was a hell of an issue to debug!