You Can Just Test Things

A lot of bad choices begin with marketing, flawed logic, or persuasive people. Most of the time it might not matter, but before embarking on a months-long project, spending a few hours on some upfront experiments can make all the difference.


I see too often that people overestimate the effort it takes to run simple tests to check their assumptions. For example, my employer has a months-long project to migrate from AWS RDS to AWS Aurora (postgres) because of some AWS marketing claiming that certain workloads were 3x faster on Aurora. Digging through the decision logs, the best justification I found was a pasted ChatGPT chat where someone had asked something along the lines of "how will Aurora help my SaaS application over RDS".


Asking a sophisticated autocomplete engine a leading question is a comically bad way to test your assumptions. And although I have no hope of derailing the project, I was morbidly curious to see what some upfront testing would have shown.


I spent about an hour writing a benchmark that very roughly mirrored our application's traffic. It isn't perfect, but I tried to get the right ratio of reads to writes, and I seeded the DB with enough dummy data to make a table roughly the size of one of our bigger tables. I then spun up EC2, Aurora, and RDS instances, SCPed my benchmark onto the EC2 instance, and ran the benchmark several times against each DB.
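
If you want a starting point, here is a minimal sketch of what that kind of benchmark can look like. It assumes psycopg2 and a seeded accounts table with an indexed id column; the connection string, queries, and read/write ratio are placeholders you would swap for your own workload.

```python
import random
import time

import psycopg2  # assumes psycopg2-binary is installed on the EC2 box

# Placeholder connection string; point it at the RDS or Aurora endpoint under test.
DSN = "host=your-db-endpoint dbname=bench user=bench password=secret"

READ_RATIO = 0.9    # rough guess at the app's read/write mix
N_QUERIES = 10_000
MAX_ID = 1_000_000  # matches however many rows were seeded

def run_benchmark():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    latencies = []
    with conn.cursor() as cur:
        for _ in range(N_QUERIES):
            row_id = random.randint(1, MAX_ID)
            start = time.perf_counter()
            if random.random() < READ_RATIO:
                cur.execute("SELECT * FROM accounts WHERE id = %s", (row_id,))
                cur.fetchone()
            else:
                cur.execute(
                    "UPDATE accounts SET balance = balance + 1 WHERE id = %s",
                    (row_id,),
                )
            latencies.append(time.perf_counter() - start)
    conn.close()

    # Crude summary: median and tail latency, plus single-connection throughput.
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    qps = N_QUERIES / sum(latencies)
    print(f"p50={p50 * 1000:.2f}ms p99={p99 * 1000:.2f}ms ~{qps:.0f} qps")

if __name__ == "__main__":
    run_benchmark()
```

A single-connection loop like this understates concurrency effects, so run it a few times per DB (and bump the client count if your traffic is parallel) before drawing conclusions.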


My hypothesis was that there would be no difference. From the little I have read about Aurora, it's basically the same postgres core, but with some proprietary Amazon code at the storage layer. The marketing website has some hilarious jargon such as: "I/O operations use distributed systems techniques, such as quorums to improve performance consistency". I'm not sure how a distributed consistency algorithm will improve performance on a single-writer DB, but it's a cool word! If anything, my storage layer being a distributed system makes me worry performance won't be as good.


The initial results of my benchmark showed that RDS had about 2x lower latency and 40% higher throughput, which seemed totally wrong. Going over the configuration of the DBs, I realized that the RDS instance was an r8g.large while the Aurora instance was an r7g.large. I fixed that, re-ran the benchmark a few times, and the results were a complete wash, as expected.
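
That kind of mismatch is easy to catch up front. Here is a rough sketch, assuming boto3 with credentials already configured, that just lists each DB instance's class and engine so you can eyeball whether the comparison is actually apples to apples.

```python
import boto3  # assumes AWS credentials and region are configured in the environment

# Print every RDS/Aurora instance with its class and engine so mismatched
# hardware (e.g. r8g.large vs r7g.large) is obvious before benchmarking.
rds = boto3.client("rds")
for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        print(
            db["DBInstanceIdentifier"],
            db["DBInstanceClass"],
            db["Engine"],
            db["EngineVersion"],
        )
```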


In total this test took me a few hours, but it could have saved us months of risky work. It gave me pretty good confidence that if our goal is to improve performance, we would be better off spending that time optimizing code or improving the UX around slow operations. Don't take my word for it, though. If you are considering Aurora over RDS, run some benchmarks of your own; it doesn't take long!


Next time you find yourself making a big decision, put a small investment towards testing your assumptions. You may be surprised by what you find.