You can’t swim when the pool is empty!

Yesterday I was handed a fun little problem by my cube-neighbor: one of our many Citrix servers was randomly getting the dreaded error event 2019 “The server was unable to allocate from the system nonpaged pool because the pool was empty”… and then, failing to “swim”, the server “sank”.

So as usual I pulled up my bookmarks and dug into the performance and debugging tags. One of the best articles I’ve found for dealing with pool overflow is written by a Microsoft support engineer on the topic “Understanding Pool Consumption and Event ID: 2020 or 2019”. If you haven’t read that article before, go ahead and do so now, because I’m not going to duplicate Tate’s explanation—don’t worry, I’ll wait for you.

Okay now, with the page up on my screen I started in with the recommended steps:

  1. First, I looked for high handle counts—but no program had more than 100 or so open handles, so nothing helpful there.
  2. Next, I ran the Memory Pool Monitor (POOLMON.EXE) to check for processes that allocated a lot of memory but hadn’t released it, pressing ‘p’ once to show only the non-paged pool… but the key “Diff” column didn’t show anything unusual. So I used Task Scheduler to run ‘poolmon.exe -n poolsnap.log’ every 30 minutes so I could see the changes over time. Oddly, the server wouldn’t list its own pooltags for me, as ‘poolmon -c’ responded with the error “Poolmon: Unable to load required dlls, cannot create local tag file”—I guess I’l look at that another day.
  3. This time I didn’t do anything with the Driver Verifier (VERIFIER.EXE) mostly because the machine was in production and active use.
  4. I didn’t have a memory dump so the debugging tools were only marginally useful (I wasn’t the one to provision the server, or I would’ve configured it to dump memory at the BSOD.)

Google seems to suggest that there used to be a Pool Log Analyzer Tool to read the poolsnap.log, but I can’t locate a download anywhere. Maybe one of these days I’ll have to make my own (add it to the list!).

However, I didn’t need it, as I found another tool (developed by some Microsofties and released as Open Source) that helped immensely: the Performance Analysis of Logs (PAL) Tool. We were already using PerfMon to record all process counters, but making sense of that data on your own is nigh impossible. But my new PAL broke the data up into half-hour increments and created colorful and clear charts for the most useful data. Even better, it highlighted specific points of concern for each time period which led to the real source of the problem.

You see, one of the third-party programs had a small handle leak where it would fail to free up to 100 handles per hour after the threads finished processing. On its own, this would not be a big problem—but (a) this is a 32-bit Windows 2003 Enterprise Server with 9GB of RAM, and thus the non-paged pool is limited to no more than 256MB (and that only if it’s configured to use the maximum), and (b) it’s a Citrix XenApp server with 30 or more sessions all day long, and every one of those logins was running the troublesome EXE for 8 hour shifts. Now do you see the problem? Thirty programs leaking 100 handles each hour is 3,000 handles per hour, and it doesn’t take long for the aggregate to fill the poor little non-paged pool!

Anyway, now all we have to do is convince the software vendor that their app is the problem… And once again we’re left with the solution that “it’ll be fixed in the next release.” Until then I guess we increase the page pool to its maximum size and (edit: only the paged pool can be increased; the non-paged pool is only effected by the /3GB switch in boot.ini, which was not enabled here) reboot often!

  • Great blog post! If you are using the /3GB switch, then consider removing it. This will give you about 360 MB of Pool Paged. Also, reducing the amount of RAM installed will make the PFN database smaller allowing Pool Paged to get as big as it can. Stay below 16 GB of RAM, but don't go too low on RAM or else your commit charge might hit the commit limit. – Clint Huffman (author of the PAL tool).

    • ewall

      Excellent… thanks for the tips, Cliff!

      I neglected to mention in the post that the OS is a 32-bit Windows Server 2003 Enterprise Edition with Citrix XenApp 5.0 and 9GB RAM (why 9 and not 8 or 12, I have no idea). PAE extensions are enabled, but *not* the /3GB switch.

      The software developers say that the growing list of handles is the because the program implements caching as long as it is running — which would be fine except they were the ones who insisted that the Citrix servers be 32-bit! Hopefully we'll find a work-around that doesn't require to unnaturally restrict the available memory to the offending processes…