How to Spot a Bad Benchmark
Benchmarks can help you make informed decisions, but they become a problem when they are manipulated or lack transparency. As a director of product for Aerospike, I recently outlined how to spot bad benchmarks in my presentation, “That Garbage Benchmark is Useless.” Here are some questions to ask yourself when considering a benchmark:
Does this even make sense? Benchmarks are usually marketing, not engineering. So, ask yourself: “Do the physics involved make sense? How are they measuring results? Is this database made of magic?” I once heard the CTO of a database company claim that an active-active cluster deployed across continents could transfer data between regions in less than a millisecond. That’s amazing! But you know what? The distance between Virginia and Frankfurt is about 6,000 kilometers. Even at the speed of light, covering that distance takes roughly 20 milliseconds one way, yet somehow CRDTs were supposed to make it happen in under one. The marketing magic trick was that he first mentioned local writes (“obviously fast”), then added CRDTs, which have nothing to do with latency. Then he put up a picture of data moving between two continents, and now there’s a false idea in your head. Two unrelated things, the local submillisecond write and the data sync between two continents, merge into “submillisecond across continents.”
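The physics check above is simple arithmetic. Here is a minimal sketch, assuming the 6,000 km figure from the text and the vacuum speed of light. The fiber factor is a common rule of thumb rather than a measurement, and the function name is purely illustrative:

```python
# Lower bound on one-way latency imposed by the speed of light.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBER_FACTOR = 2 / 3  # assumption: light in optical fiber travels at ~2/3 of c


def min_one_way_latency_ms(distance_km: float, speed_factor: float = 1.0) -> float:
    """Return the minimum one-way travel time in milliseconds."""
    return distance_km / (SPEED_OF_LIGHT_KM_PER_S * speed_factor) * 1000


distance_km = 6_000  # approximate Virginia-to-Frankfurt distance from the text
vacuum_ms = min_one_way_latency_ms(distance_km)
fiber_ms = min_one_way_latency_ms(distance_km, FIBER_FACTOR)

print(f"vacuum: {vacuum_ms:.1f} ms one way")
print(f"fiber:  {fiber_ms:.1f} ms one way")
```

Both results land in the tens of milliseconds, and that is before any routing, switching, or software overhead. No replication scheme can bring a cross-continent transfer under one millisecond.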
Is there obfuscation? Academic terms can be misused, especially when two terms sound alike but mean different things. “Strong eventual consistency” and “strong consistency” are academic terms that mean completely different things. It would be like saying that the Santa Cruz Warriors are the same as the Golden State Warriors: although the names sound alike, they are completely different teams competing in different leagues. Make sure that any academic terms in a benchmark are used clearly and precisely.
Does it really tell you anything? Get beyond the marketing and look at whether a benchmark makes economic sense and whether it tells you anything about a real deployment. Is the test dataset realistic? Or is it too small to give you a true picture? I recently read a competitor’s IoT benchmark and noted that the total dataset was seven terabytes. The 83-node cluster it ran on had a capacity of 31 terabytes of RAM and 300 terabytes of disk. So, why would anyone use a dataset so small when the disk capacity alone is more than 40 times bigger? Does this actually tell you anything about a real deployment? Would you ever purchase 40 times more hardware for a database than the size of your dataset?
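A quick way to sanity-check provisioning is to divide the cluster’s capacity by the dataset size. Here is a minimal sketch using the figures quoted above. The variable names are illustrative, and decimal terabytes are assumed:

```python
# Figures quoted in the text for the IoT benchmark.
dataset_tb = 7
ram_tb = 31
disk_tb = 300
nodes = 83

# How many times larger each tier is than the data it holds.
ram_ratio = ram_tb / dataset_tb
disk_ratio = disk_tb / dataset_tb

# Average amount of data each node is responsible for
# (assumption: 1 TB = 1,000 GB).
dataset_per_node_gb = dataset_tb * 1000 / nodes

print(f"RAM capacity / dataset:  {ram_ratio:.1f}x")
print(f"Disk capacity / dataset: {disk_ratio:.1f}x")
print(f"Dataset per node:        {dataset_per_node_gb:.1f} GB")
```

When the capacity ratios come out this far above 1, the benchmark is measuring mostly empty hardware, which says little about the cost or behavior of a realistically loaded cluster.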
Is it absurdly overprovisioned? Another company published a fantastical benchmark of hundreds of millions of operations per second, with a dataset of 1 billion keys at 100 bytes each, totaling 93 gigabytes. This 93-gigabyte benchmark ran on a 40-node cluster with six terabytes of RAM. That is roughly 1 percent utilization. You know what else is close to that 1 percent? The L2 cache. Now the real question is: Do you know anybody who is going to buy an L2 cache database? I don’t. Is this pure marketing or not? It sort of feels like it.
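The utilization figure can be reproduced from the key count and value size. Here is a minimal sketch, assuming the quoted sizes are in binary units (GiB and TiB) and ignoring per-record overhead such as keys and indexes:

```python
# Figures quoted in the text; binary units are an assumption.
keys = 1_000_000_000
bytes_per_key = 100
nodes = 40
cluster_ram_tib = 6

dataset_bytes = keys * bytes_per_key
dataset_gib = dataset_bytes / 2**30      # bytes -> GiB
cluster_ram_gib = cluster_ram_tib * 1024  # TiB -> GiB

utilization_pct = dataset_gib / cluster_ram_gib * 100
dataset_per_node_gib = dataset_gib / nodes

print(f"Dataset size:     {dataset_gib:.1f} GiB")
print(f"RAM utilization:  {utilization_pct:.1f}%")
print(f"Dataset per node: {dataset_per_node_gib:.2f} GiB")
```

Under these assumptions the dataset comes to about 93 GiB, between 1 and 2 percent of the cluster’s RAM, with each node holding only a couple of gibibytes of data.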
Is it fair? Did the benchmark publisher intentionally hobble the competition? Is the competing system configured in a way that no one actually uses? An honest benchmark gives the system it is compared against a fair chance.
Can you understand it? Is it reproducible? Is the benchmark written like an academic paper? It should detail the hardware, the configurations, and the methodologies. You should be able to read it, evaluate it, and critique it. You should be able to reproduce it yourself.
Finally, don’t forget these two words when considering benchmarking: Think critically.