Thursday, December 22, 2011

RIF Notes #12

"That which can be asserted without evidence, can be dismissed without evidence.” - Christopher Hitchens

"Don’t put the bodies in the wishing well"

Tuesday, December 13, 2011

What else went wrong with the distributed cache?

When we last left this story I was contemplating turning on the distributed caching after the Black Friday/Cyber Monday crush had passed.  However, shortly after my last post I had a conversation with Alachisoft that led us to reconfigure our cache to use NCache’s in-process client cache feature.  This feature allows an in-process, in-memory copy of the cache to be kept in sync with the out-of-process NCache server.  In theory, it’s the best of both worlds, providing near ASP.NET in-memory cache performance with all the benefits of distributed caching.  It seemed like in the 11th hour we had finally arrived at a working solution.  As it turns out we did, kinda…
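To make the idea concrete, here is a minimal sketch of the general pattern: a local in-process copy in front of a shared store, kept in sync by invalidation.  It is not NCache’s API or its actual synchronization mechanism, which I haven’t seen.  The names RemoteCache and ClientCache are hypothetical, and the "server" is just a dictionary standing in for the out-of-process cache.

```python
class RemoteCache:
    """Hypothetical stand-in for the out-of-process cache server."""

    def __init__(self):
        self._data = {}
        self._listeners = []
        self.reads = 0  # counts round trips to the "server"

    def subscribe(self, callback):
        # Each client registers to hear about changed keys.
        self._listeners.append(callback)

    def get(self, key):
        self.reads += 1
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value
        # Tell every subscribed client that its local copy is stale.
        for notify in self._listeners:
            notify(key)


class ClientCache:
    """Hypothetical in-process copy, kept in sync by invalidation."""

    def __init__(self, remote):
        self._remote = remote
        self._local = {}
        remote.subscribe(self._invalidate)

    def _invalidate(self, key):
        self._local.pop(key, None)

    def get(self, key):
        # Serve from process memory when possible.
        if key in self._local:
            return self._local[key]
        # Otherwise pay for one trip to the server and remember the result.
        value = self._remote.get(key)
        if value is not None:
            self._local[key] = value
        return value

    def put(self, key, value):
        # Write through to the server (which invalidates all clients),
        # then keep the fresh value locally.
        self._remote.put(key, value)
        self._local[key] = value


server = RemoteCache()
node_a = ClientCache(server)
node_b = ClientCache(server)

node_a.put("price:42", 19.99)
node_b.get("price:42")         # first read goes to the server
node_b.get("price:42")         # second read is served in-process
node_a.put("price:42", 17.99)  # node_b's stale copy is dropped
print(node_b.get("price:42"), server.reads)
```

Repeated reads on a node stay in process memory, which is where the near in-memory performance comes from, while a write on any node knocks the stale copy out of the others.  A real product also has to deal with network failures, eviction and ordering of notifications, none of which this sketch attempts.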

Blyber Fronday

Our web farm, with distributed caching enabled, sailed through the full week beginning Thanksgiving Day.  The site scaled and performed extremely well, and there were no incidents of any kind.  We saw between two and four times normal load during those days and never skipped a beat.  It was a vindication of our distributed caching strategy and a justification for all the work and pain that had preceded it.

Defeat snatched from the jaws of victory

That was, until the Thursday following Thanksgiving.  Under normal load, nowhere near what we’d seen in the preceding days, our website suddenly became unavailable.  Inexplicably, NCache had gotten itself into a situation which I refer to as an NCache funk (it is as yet undiagnosed by Alachisoft, but has the characteristics of some kind of deadlock).  NCache had encountered an unknown event that caused it to lock up both nodes of the web farm.  Application pool recycles and IISReset could not bring it back.  The servers required a reboot to recover from the NCache funk.  Chalking this up as a fluke, we continued on.  Alachisoft support had no particular insight after reviewing the logs, and suggested that maybe our servers had resource issues or excessive load (clearly not the explanation, given the load the servers had handled successfully in the preceding days).  They suggested that we provide process dumps if it were ever to occur again.

Funkin’ lesson

Luckily for them, NCache has deadlocked five more times in the past two weeks, still without explanation.  Our successful distributed caching strategy, designed for scalability and high availability, is now ironically causing excessive instability.  It exhibits the incomprehensible behavior of propagating an issue on one node across the farm, affecting not only the cache but IIS as well.  This issue has occurred with both the sessionstate provider cache and the object cache, just to make it more interesting.

Just when I thought I was out…they pull me back in.

Now, instead of moving on to other projects and initiatives, we’re working with Alachisoft on a possible version upgrade, while at the same time considering dropping back down to one node or switching over to a sticky session based solution.  We find ourselves in the unenviable position of choosing between doubling down on NCache (more time diagnosing, configuring, upgrading, testing), abandoning distributed caching for a less sophisticated sticky session solution, or starting all over with a new tool, perhaps ScaleOut.  No matter how you slice it, it’s costing us real cash.