
Swap and Memory Pressure: How Developers Think vs. How Operations People Think

There is a very useful and interesting article by Chris Down: “In defence of swap: common misconceptions”. Chris explains what swap is, and how it provides the backing store for anonymous pages, just as the actual code files on disk provide the backing store for file-backed pages.

I have no problem with the information and background knowledge he provides. This is correct and useful stuff, and I even learned a thing about what cgroups can do for me.

I do have a problem with some of the attitudes here. They come from a developer's or desktop perspective, and they are not useful in a data center. At least not in mine. :-)

Chris writes:

Swap is primarily a mechanism for equality of reclamation, not for emergency “extra memory”. Swap is not what makes your application slow – entering overall memory contention is what makes your application slow.

And that is correct. The conclusions are wrong, though. In a data center production environment that does not suck, I do not want to be in this situation. If I ever get into this situation, I want a failure, I want it fast, and I want it to be noticeable, so that I can act on it and change things so that it never occurs again.

That is, I do not want to survive. I want this box to explode, others to take over and fix the root cause. So the entire section »Under temporary spikes in memory usage« is a DO NOT WANT scenario.
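
For what it is worth, most of that fail-fast behaviour can be dialed in with a handful of standard vm and kernel sysctls. A minimal sketch in Python that writes them through /proc/sys; the exact values are illustrative assumptions, not a recommendation for your workload:

    # Sketch: bias a box towards failing fast under memory pressure instead of
    # limping along. The values are illustrative assumptions, not a recommendation.
    SETTINGS = {
        "vm/panic_on_oom": "2",       # panic on any OOM condition instead of killing tasks
        "kernel/panic": "10",         # reboot 10 seconds after a panic, so others take over
        "vm/swappiness": "1",         # prefer dropping page cache over swapping anon pages
        "vm/overcommit_memory": "2",  # strict accounting: fail allocations instead of overcommitting
    }

    def apply(settings):
        for key, value in settings.items():
            with open("/proc/sys/" + key, "w") as f:  # needs root
                f.write(value + "\n")

    apply(SETTINGS)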

Chris also assumes a few things that are weird from a production POV. He himself states: »If you have a bunch of disk space and a recent (4.0+) kernel, more swap is almost always better than less. In older kernels kswapd, one of the kernel processes responsible for managing swap, was historically very overeager to swap out memory aggressively the more swap you had.« Production, sadly, is in many places still on pre-4.0 kernels, so those boxes do not get large swap.

He also mentions: »As such, if you have the space, having a swap size of a few GB keeps your options open on modern kernels. […] What I’d recommend is setting up a few testing systems with 2-3GB of swap or more, and monitoring what happens over the course of a week or so under varying (memory) load conditions.« Well, I have production boxes with 48 GB, 96 GB or even 192 GB of memory. “A few GB of swap” is not going to cut it. These are not desktops or laptops.

Dumping, loading, or swapping 200 GB of core takes approximately 15 minutes on a local SSD, and twice that time on a rotating disk, so I am not going to work with very large swaps; they can only be slower than this. I simply cannot afford critical memory pressure spikes on such a box, and as an Ops person I configure my machines not to have them, and, if they happen anyway, to blow up as fast as possible.
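
The back of the envelope behind that 15 minutes, with assumed sustained throughputs (roughly 250 MB/s for a local SSD and half that for a rotating disk; both figures are assumptions, not measurements):

    # Back of the envelope: moving a full core's worth of memory through a disk.
    # The throughput figures are assumed sustained rates, not interface speeds.
    def transfer_minutes(size_gb, throughput_mb_s):
        return size_gb * 1024 / throughput_mb_s / 60

    print(transfer_minutes(200, 250))  # ~13.7 minutes on an SSD sustaining ~250 MB/s
    print(transfer_minutes(200, 120))  # ~28 minutes on a rotating disk at ~120 MB/s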

What I would also want are better metrics for memory pressure, or even just for the amount of anonymous pages in the system. »Determination of memory pressure is somewhat difficult using traditional Linux memory counters, though. We have some things which seem somewhat related, but are merely tangential – memory usage, page scans, etc – and from these metrics alone it’s very hard to tell an efficient memory configuration from one that’s trending towards memory contention.« I agree with this.

I wonder if there is a fast, lock-free metric I can read that tells me the number of unbacked, anonymous pages per process and for the whole system? One that I can sample in rapid succession without locking up or freezing a system that has 200 GB of memory in 4 KB pages (the metric can be approximate, but reading it must not lock up the box).
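
The closest thing I am aware of is walking the counters the kernel already exports in procfs: AnonPages in /proc/meminfo for the whole system, and RssAnon in /proc/<pid>/status per process, on kernels new enough to have it. A sketch in Python (the function names are mine); note that scanning all of /proc like this is plain sampling of existing counters, not the lock-free single metric I am asking for:

    import glob

    def system_anon_kb():
        # System-wide anonymous memory, from the AnonPages counter in /proc/meminfo.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("AnonPages:"):
                    return int(line.split()[1])  # value is reported in kB

    def per_process_anon_kb():
        # Anonymous RSS per process, from RssAnon in /proc/<pid>/status.
        result = {}
        for path in glob.glob("/proc/[0-9]*/status"):
            try:
                with open(path) as f:
                    for line in f:
                        if line.startswith("RssAnon:"):
                            result[path.split("/")[2]] = int(line.split()[1])
                            break
            except OSError:
                pass  # process went away while we were reading
        return result

    print("anon total kB:", system_anon_kb())
    print("processes seen:", len(per_process_anon_kb()))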

I think this is the main difference Chris and I seem to have on fault handling: I’d rather have this box die fast and with a clear reason than for it trying to eventually pull through and mess with my overall performance while it tries to do that.

Published in Containers and Kubernetes, Data Centers

8 Comments

  1. Hartmut Holzgraefe

    “I’d rather have this box die fast and with a clear reason than for it trying to eventually pull through and mess with my overall performance while it tries to do that.”

    Even on a desktop system … when keyboard and mouse events are only handled every other second while the disk LED is blinking like mad, you’ll eventually power-cycle the box anyway (and usually pretty quickly).

  2. Ralf Buescher

    This topic demonstrates well why it is a bad idea to have a universal, multi-purpose OS in the first place.

    Shouldn’t we rather strive for the most minimal system installation possible, fit for the job of the very hardware that is used exclusively for that one task?

    The installed OS should be tailored to its purpose, just as the hardware is.

  3. Does not sound like Chris is used to operating VM hosts with 0.5-2 TB of RAM or application servers with double-digit Java heaps. And on my systems the cold pages of startup daemons are not even visible at two decimal places of a percent, so I couldn’t care less if they are paged out. The main problem with swappiness is stopping the buffer cache from creating artificial memory pressure.

  4. Hi! I’m the author of the original article. There are a number of misunderstandings in this article which I’ve talked about in this Twitter thread:

    https://twitter.com/unixchris/status/955420014328320000
    https://twitter.com/unixchris/status/955420555498401792
    https://twitter.com/unixchris/status/955420834595770369
    https://twitter.com/unixchris/status/955421121796599809
    https://twitter.com/unixchris/status/955421576547196928

    (And despite the title of this article, I have worked in SRE/ops for all of my professional career, so I have no idea why Kristian has made it about a supposed conflict between developers and operations people…)

    • > I have worked in SRE/ops for all of my professional career

      Then I’ve got a hard time understanding how you can be OK with page faults potentially being orders of magnitude slower depending on the current page mapping.

      This might be OK for batch processing systems but not for transaction processing systems where API contracts are usually in the millisecond range.

      I find it telling that you don’t mention latency at all in your original article.

      I agree that without having swap configured, otherwise reclaimable memory is wasted. But if I had to decide between wasted memory and absolutely unpredictable behaviour during memory pressure, I personally always opt for wasting memory.

      • > Then I’ve got a hard time understanding how you can be OK with page faults potentially being orders of magnitude slower depending on the current page mapping.

        This is a problem both with *and* without swap. Neither prevents you from having to do disk I/O, and neither prevents you from having to consider memory semantics.

        The absence of swap doesn’t render having to advise the operating system on your desired memory semantics unnecessary. You still have to lock and madvise as appropriate.

  5. > “swapping 200GB of core take approximately 15 minutes”

    Wasn’t the unit meant to be “seconds”, not “minutes”?
    15 min = 900 sec;
    even consumer hardware SATA speed is 6 Gb/s now.

    I’m confused.

    • kris

      The last time I measured, which is admittedly not this year, the SSD in the HP machine under test took slightly under 1000 seconds to move 256 GB of memory between SSD and RAM.
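
      Which is consistent: 6 Gb/s is only the SATA link rate, not what the drive sustains. Back of the envelope with the figures above:

        # 256 GB moved in roughly 1000 seconds, per the measurement above
        size_gb, seconds = 256, 1000
        mb_per_s = size_gb * 1024 / seconds
        print(round(mb_per_s))                # ~262 MB/s sustained
        print(round(mb_per_s * 8 / 1000, 1))  # ~2.1 Gb/s, well below the 6 Gb/s SATA link rate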
