“the act of observing will influence the phenomenon being observed”
Last year we implemented Dynatrace as our APM tool to monitor and troubleshoot performance issues with our .NET applications. Dynatrace is an incredibly powerful tool and immediately provided deep, rich analytics on the performance characteristics of our applications. We had a bit of a bumpy road with the initial implementation, ongoing difficulty tuning it, and occasional unpleasant side effects. Overall, however, I feel like a blind man who's been given impaired vision.
Out of the emptiness
Prior to Dynatrace, performance optimization was an extremely painstaking process of Trace.Write and trace.axd, real-time analysis was virtually impossible, and debugging major performance problems required dump files, WinDbg, and luck. Now, at a glance, I can see top-consuming methods, database calls, and a whole host of other metrics that show how the application behaves in the real world, with the ability to go back and view that behavior even after the fact. The learning curve on the tool is no joke, and it's a balancing act to make sure we're capturing enough information to make the right judgments but not so much that we're overwhelmed or saturated. And despite all this rich data, it's not always obvious what it's telling me.
Recently we had one such elusive issue. We'd had a few incidents where, for brief periods (30 minutes or so), our ASP.NET application's performance significantly degraded: response times were 2-10 times their averages. Dynatrace captured these events, and drilling into the data suggested the slow response times were due to garbage collection suspension time. For a long time, that's as far as we got. We knew performance degraded, and we suspected excessive garbage collection. But what would be causing it, and why was it fairly rare and seemingly random?
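Suspension time is the window in which the runtime pauses application threads so the collector can run, so every in-flight request silently absorbs those pauses. As a rough illustration of the idea (in Python rather than the CLR, and not how Dynatrace actually measures it), you can tally collector pauses against a request's wall time using the interpreter's GC hooks; the CLR analogues would be ETW GC events or the ".NET CLR Memory" category's "% Time in GC" performance counter.

```python
import gc
import time

# Sketch: attribute a slice of a "request's" wall time to GC pauses.
# Python's gc.callbacks fire at the start and stop of each collection.
gc_pause_total = 0.0
_gc_start = None

def _gc_callback(phase, info):
    global gc_pause_total, _gc_start
    if phase == "start":
        _gc_start = time.perf_counter()
    elif phase == "stop" and _gc_start is not None:
        gc_pause_total += time.perf_counter() - _gc_start
        _gc_start = None

gc.callbacks.append(_gc_callback)

def handle_request():
    # Allocation-heavy stand-in for request work; the reference cycles
    # guarantee the cycle collector has something to inspect.
    junk = []
    for _ in range(50_000):
        a, b = [], []
        a.append(b)
        b.append(a)
        junk.append(a)
    return len(junk)

start = time.perf_counter()
handle_request()
elapsed = time.perf_counter() - start

print(f"request time: {elapsed:.4f}s, GC pauses within it: {gc_pause_total:.4f}s")
gc.callbacks.remove(_gc_callback)
```

The point of the sketch is the shape of the symptom we were seeing: the request code itself is unchanged, but as collector activity grows, the pause total grows with it and response times inflate across every method at once.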
Put me in coach
After hitting a wall, we brought in a Dynatrace professional services 'coach' to help us interpret what the tool was telling us. They were equally unable to pinpoint the root cause, so they helped us configure the additional Trending and Leak Analysis features to capture proprietary memory dump files for deeper analysis. We ran this way for a few days but didn't experience an episode in that timeframe, so we had nothing to capture.
Yay we broke it
On the fourth day we released a new version of our application and, bingo: excessive garbage collection and poor response times. We captured a few dump files, contacted our coach for a follow-up, and waited for the episode to resolve. This time, though, it didn't. The GC kept churning, and the response times continued to be bad, really bad. When we realized this one wouldn't resolve on its own, we began recycling app pools, restarting web servers, even rebooting. No effect. We toggled off a few of our recent Dynatrace settings. No effect. Performance was still awful. Meanwhile, we were scouring our recent change set for anything that could be responsible, taking process dumps, contacting Microsoft, looking at CLR profilers, and tweaking anything we could think of. The problem persisted.
Microsoft finally got back to us with a dump file analysis that pointed to NCache. Being no stranger to blaming NCache, I brought down a node and disabled NCache. Still no effect: Dynatrace was still showing excessive GC suspension time across the board. Basically every method in our application was affected. We could see all the victims of GC suspension time, but we couldn't pinpoint the culprit.
After we ruled out NCache, I brought the node back online and re-enabled NCache. When the node came back, QA and other IT staff began reporting improved response times, even though Dynatrace continued to show the same pitiful response times and excessive GC. Then I noticed that after the restart the Dynatrace agent had malfunctioned and instrumentation on that node had been disabled. The node without Dynatrace was performing normally.
A bitter pill
It turns out the Trending and Leak Analysis setting was causing the issue, and my initial attempt to shut it off hadn't succeeded because the setting didn't behave as expected: it required an app pool recycle to take effect. It's not yet clear why the setting took days to manifest, why it had the effect it did, or why neither Dynatrace itself, the dumps we sent to Microsoft, nor the Dynatrace proprietary dumps identified it.
Throughout the entire incident, I was relying heavily on Dynatrace. Every recycle, reboot, tweak, and trick I was carefully measuring with Dynatrace. Even when it was suggested at one point that we disable Dynatrace, I resisted: I'd have to go back to relying on subjective opinions about response times (not entirely true), and I judged the likelihood of Dynatrace being the problem low compared to the probability of something in our new code being the culprit. I thought it had to be our code; that's the only thing that had changed. I was wrong.
Now I can blame you for everything
The silver lining is that now that I know Dynatrace is capable of causing excessive GC, it's a suspect in the original intermittent GC issues that, ironically, we were using Dynatrace to troubleshoot in the first place.