
Twitter so: Testing in Production

Matthew Dutton: »@mipsytipsy I thought “You have to test in production” was a bold statement and would love to hear more of your thoughts on the topic.«

Charity Majors: »Hmmm, you’re not the only one to call this out.

I’ll add it to my list of “articles to write someday” 🙃 but here’s the gist:

We have always tested in production, just not well. And obviously, I’m not advising anyone to do less of the usual pre-production testing methods, but at some point, esp. with distributed systems, you just can’t usefully mimic the qualities of size and chaos that tease out the long thin tail of bugs. Imagine trying to spin up a staging copy of Facebook, or the national electrical grid! You can’t, and the attempt has sharply diminishing returns.

If you can catch 80-90% of the bugs with 10-20% of the effort (and you can), the rest is more usefully poured into making production resilient.

[How do you do testing in production?] Canarying; automated canarying and promotion in stages; empowering your developers to explore live production systems with e.g. @honeycombio (hi); making rollbacks wicked fast and reliable; instrumentation; education and training; feature flags a la @launchdarkly. All great uses of time.

Basically what I’m trying to say is, embrace failure. Get used to the inevitability and lean into it, iff you have a system like this. If you’ve got a rails app and five engineers then ignore everything I’m saying until the moment is right :)«
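
To make "automated canarying and promotion in stages" a bit more concrete, here is a minimal sketch. It is not how any particular tool does it: the names `run_canary` and `CanaryResult`, the stage percentages and the error-rate tolerance are all assumptions made up for illustration. The idea is simply to raise the traffic share step by step and to roll back as soon as the canary looks worse than the baseline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class CanaryResult:
    promoted: bool          # True if every stage passed
    final_percent: int      # traffic share on the new version at the end
    history: List[Tuple[int, float]] = field(default_factory=list)


def run_canary(
    stages: Sequence[int],
    canary_error_rate: Callable[[int], float],
    baseline_error_rate: float,
    tolerance: float = 0.005,
) -> CanaryResult:
    """Promote a release through traffic stages (hypothetical sketch).

    `canary_error_rate(percent)` stands in for whatever monitoring query
    returns the new version's error rate while it serves `percent` of traffic.
    """
    if not stages:
        raise ValueError("at least one stage is required")

    history: List[Tuple[int, float]] = []
    for percent in stages:
        rate = canary_error_rate(percent)
        history.append((percent, rate))
        if rate > baseline_error_rate + tolerance:
            # Canary is measurably worse than baseline: roll back to 0%.
            return CanaryResult(False, 0, history)

    # Every stage stayed within tolerance: the release is fully promoted.
    return CanaryResult(True, stages[-1], history)


if __name__ == "__main__":
    healthy = run_canary([1, 5, 25, 100], lambda p: 0.01, baseline_error_rate=0.01)
    print(healthy.promoted, healthy.final_percent)
```

In a real setup the comparison would look at more than one metric and would wait long enough at each stage to get a meaningful measurement; the fast and reliable rollback is what makes the loop safe to automate.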

Devdas Bhagat: »Just tagging potential speakers who know a bit about that topic.«

Kristian Köhntopp: »- Decouple rollout and activation (feature flags, experiment framework)
- Low latency monitoring
- Monitor the shit out of everything
- Implement schema changes with old and new version being live simultaneously
- For each change, know your providers, know your consumers

Strategy: Make testing in production as safe as you can possibly make it. That has manifold returns:

It makes developers production aware; builds knowledge in handling real catastrophes in regular operational situations; builds confidence and competence; you get actual measurements, which is good calibration.

In general, you build antifragility.

Testing in Production also allows testing of features for commercial viability fast, before you invest lots of development resources to build them out. So it enables you to throw away 95% of the code before it is written. That’s actually the most valuable part of it.«
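
"Decouple rollout and activation" can be sketched in a few lines. The following is an illustration only, not the API of any real feature flag product: the class `FeatureFlags` and its methods are hypothetical names. The code for a feature is deployed dark, and activation is a separate runtime decision that puts each user into a stable bucket, so the share of users seeing the feature can be raised or dropped to zero without another deployment.

```python
import hashlib
from typing import Dict


class FeatureFlags:
    """Hypothetical in-memory flag store with percentage-based activation."""

    def __init__(self) -> None:
        self._percent: Dict[str, int] = {}  # flag name -> 0..100

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent[flag] = percent

    def is_enabled(self, flag: str, user_id: str) -> bool:
        # Unknown flags are off: deployed code stays dark until activated.
        percent = self._percent.get(flag, 0)
        # Hash flag and user together so each user gets a stable bucket
        # per flag, independent of process restarts.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent


if __name__ == "__main__":
    flags = FeatureFlags()
    flags.set_rollout("new-checkout", 10)   # activate for roughly 10% of users
    print(flags.is_enabled("new-checkout", "user-42"))
    flags.set_rollout("new-checkout", 0)    # kill switch, no deploy needed
    print(flags.is_enabled("new-checkout", "user-42"))
```

Because the bucket depends only on the flag name and the user id, a user who has the feature at 10% still has it at 50%, which keeps the experience stable while the rollout widens.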

Content slightly edited to make it easier to read.

Published in Computer Science, Work


  1. I have yet to see a development team that doesn’t fuck this up, unless they’re already really good at testing software.

    First teach people how to test and release software. This step usually gets overtaken by half-baked “test in production” ideas.

    Then you can do these things, sure. They’re even good. But it’s the last, not the first step.

    • kris

      Well, we have been doing this for more than a decade now. It works pretty well, actually.

    • Andre

      The main reason, in my opinion: in many places, “devops” only goes one way: administrators using developers’ techniques for their daily work. Which is good and should be the way to do it…

      But often the other direction, involving developers in operations, is neglected. And if you bring the topic up (e.g. by having developers take turns manning the support hotline or even doing on-call duty once or twice a year), you either get “I don’t do that kind of stuff, I was trained to code!” or people start calling it “eating your own dog food”. If you consider your work producing dog food, there’s something fundamentally wrong with you :)
      I call it “gathering real-world empirical data”, because there’s your test framework and there’s reality, and they only overlap in some parts…
