Friday, July 27, 2012

Optical discs are a form of torture.

CDs, DVDs, Blu-rays, and Wii games are an aggravating lot.  Every time I go to put in a movie for the kids, whether in the DVD player in the minivan or the Blu-ray player in the house, I find myself aggravated before I’ve even begun. 

Hara-kiri

Playing a movie involves an obsessive-compulsive ritual.  First I must daintily remove the disc from its delicate casing without touching it, like a game of Operation.  Then I crane my neck from side to side, tilting it in the light to catch a glimpse of any smudges or blemishes, which there always are.  That’s followed by gentle polishing and re-inspecting.  When I’m satisfied that I’ve done what I can, I put it in the player and wait patiently for what seems like minutes, only to be greeted by a series of frustrating menus and images that I’m maddeningly not allowed to bypass no matter how many times I’ve already suffered through them, until I’m finally able to select Play.  Then I wait for the inevitable complaints from the kids: “Dad, it’s skipping again,” “It’s stuck,” which happen with virtually every movie, new, old, or rented.  Of course, when it does skip or get stuck I really only have two options: continuously press fast-forward like an elevator button, hoping to jump past the glitch (which, if it works at all, more often than not sends me careening too far into the future); or, more likely, pop the movie out, go back to craning and polishing, sit through the menus again, and then fast-forward to where I was (while driving).

Wii games are no less frustrating, with the special ability to lock up or crash mid-level, and their own brand of boot-up time and mandatory commercials to sit through.

Remind me again why they’re better?

Makes me wonder what was so bad about cassettes and VHS.  They didn’t force me to re-watch anything or block menu functions.  They picked up where they left off, and I think I only had a tape get chewed a handful of times, certainly not every one every time.  They weren’t particularly delicate either. I didn’t have to treat them like priceless jewels.

Mal content

There was a glimmer of hope for a while with Netflix’s streaming service.  They had the infrastructure, on nearly every device imaginable, but then the content dried up.  Now I can use Netflix, Amazon Prime, Vimeo, Hulu, HBO GO, iTunes, etc. to hodgepodge together some content, but only on some devices, some of the time. 

I can only hope this craning and polishing is some kind of Miyagi-esque Karate training, the purpose of which I can’t quite see yet.

Monday, July 9, 2012

Blinded by my fear of being blind

“the act of observing will influence the phenomenon being observed”

Last year we implemented Dynatrace as our APM tool to monitor and troubleshoot performance issues with our .NET applications.  Dynatrace is incredibly powerful and immediately provided deep, rich analytics on the performance characteristics of our applications.  We had a bit of a bumpy road with the initial implementation, ongoing difficulty tuning it, and occasional unpleasant side effects.  Overall, however, I feel like a blind man who’s been given impaired vision. 

Out of the emptiness

Prior to Dynatrace, performance optimization was an extremely painstaking process of Trace.Write/trace.axd, real-time analysis was virtually impossible, and debugging major performance problems required dump files, WinDbg, and luck.  But now, at a glance, I can see top-consuming methods, database calls, and a whole host of other metrics that show how the application is behaving in the real world, with the ability to go back and view that behavior even after the fact.  The learning curve on the tool is no joke, and it’s a balancing act to make sure we’re capturing enough information to make the right judgments but not so much that we’re overwhelmed or saturated.  Despite all this rich data, it’s not always obvious what it’s telling me.
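For anyone who never lived it, the old Trace.Write ritual looked something like this (a hypothetical sketch; `LoadOrders` stands in for whatever code is under suspicion):

```csharp
// Bracket suspect code with Trace.Write calls, enable page tracing
// (Trace="true" in the @Page directive, or <trace enabled="true"/> in
// web.config), then eyeball the elapsed-time columns on /trace.axd.
protected void Page_Load(object sender, EventArgs e)
{
    Trace.Write("Perf", "LoadOrders begin");
    var orders = LoadOrders();   // hypothetical expensive call
    Trace.Write("Perf", "LoadOrders end");
}
```

Repeat for every hunch, redeploy, refresh, read, repeat. Hence "painstaking."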

Garbage time

Recently we had one such elusive issue.  We’ve had a few incidents where, for brief periods (30 minutes or so), our ASP.NET application’s performance significantly degraded (response times were 2–10 times their averages).  Dynatrace captured these events, and by drilling into the data it appeared that the slow response times were due to garbage collection suspension time.  For a long time that’s as far as we got with it.  We knew performance was degrading and suspected excessive garbage collection.  But what was causing it, and why was it fairly rare and seemingly random?
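One cheap way to corroborate a GC-suspension theory, independent of any APM tool, is to sample the CLR’s own performance counters. A rough sketch (the "_Global_" instance aggregates all managed processes on the box; substitute the w3wp instance to isolate one app pool):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class GcCheck
{
    static void Main()
    {
        // "% Time in GC" is the fraction of elapsed time spent in collections
        // since the previous sample; sustained high values line up with the
        // kind of suspension-driven slowdowns described above.
        var timeInGc = new PerformanceCounter(
            ".NET CLR Memory", "% Time in GC", "_Global_");
        timeInGc.NextValue();            // first read just primes the counter
        for (int i = 0; i < 10; i++)
        {
            Thread.Sleep(5000);
            Console.WriteLine("{0,5:F1} % time in GC", timeInGc.NextValue());
        }
    }
}
```

It wouldn’t have pinpointed a cause for us either, but it gives a second, independent reading of GC pressure, which matters later in this story.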

Put me in coach

After hitting a wall, we brought in a Dynatrace professional services ‘coach’ to help us interpret what the tool was telling us.  They were equally unable to pinpoint the root cause, so they helped us configure additional Trending and Leak Analysis features to capture proprietary memory dump files for deeper analysis.  We ran with this for a few days but didn’t experience an episode in that timeframe, so there was nothing to capture.

Yay we broke it

On the fourth day we released a new version of our application and, bingo, excessive garbage collection and poor response times.  We captured a few dump files, contacted our coach for a follow-up, and waited for the episode to resolve.  This time, however, it didn’t.  The GC kept churning, and response times continued to be bad, really bad.  When we realized this one wouldn’t resolve on its own, we began recycling app pools, restarting web servers, even rebooting.  No effect.  We toggled off a few of our recent Dynatrace settings.  No effect.  Performance was still awful.  Meanwhile, we were scouring our recent change set for anything that could be responsible for the GC pressure, taking process dumps, contacting Microsoft, looking at CLR profilers, and tweaking anything we could think of.  But the problem persisted. 

Ah ha!

Microsoft finally got back to us with a dump-file analysis that pointed to NCache.  Being no stranger to blaming NCache, I brought down a node and disabled NCache.  Still no effect; Dynatrace was still showing excessive GC suspension time across the board.  Basically every method in our application was affected.  We could see all the victims of GC suspension time, but we just couldn’t pinpoint the culprit. 

Oh brother

After we ruled out NCache, I decided to bring the node back online and re-enable NCache.  When the node came back online, QA and other IT staff began to report improved response times, even though Dynatrace was continuing to show the same pitiful response times and excessive GC.  Then I noticed that after the restart the Dynatrace agent had malfunctioned and instrumentation on that node had been disabled.  The node without Dynatrace was performing normally.

A bitter pill

It turns out that the Trending and Leak Analysis setting was not only causing the issue, but my initial attempt to shut it off hadn’t succeeded because the setting didn’t behave as expected and required an app pool recycle to take effect.  It’s not yet clear why that setting took days to manifest, why it had the effect it did, or why neither Dynatrace itself, the dumps we sent to Microsoft, nor Dynatrace’s own proprietary dumps identified it.

Throughout the entire incident, I was relying heavily on Dynatrace.  Every recycle, reboot, tweak, and trick I was carefully measuring with Dynatrace.  Even when it was suggested at one point that we disable Dynatrace, I resisted, because I’d have to go back to relying on subjective opinions about response times (not entirely true), and I thought the likelihood of Dynatrace being at fault was low compared to the probability of something in our new code being the culprit.  I thought it had to be our code; that’s the only thing that had changed.  I was wrong.

Now I can blame you for everything

The silver lining is that now that I know Dynatrace is capable of causing excessive GC, it’s a suspect in the original intermittent GC issue, the one we were, ironically, using Dynatrace to troubleshoot in the first place.