Wednesday, February 29, 2012

More Cache please

In my last post on this topic we had sailed through our peak load season only to start experiencing frequent NCrashes which brought down our web farm on a fairly regular basis. After many conversations with Alachisoft support (and licensing) we agreed to attempt a Hail Mary.

Pretend like everything’s gonna be alright, although you know it won’t be

We decided to upgrade from NCache Professional 3.8 to NCache Enterprise 4.1. It wasn’t clear whether the issues we were having were version related (3.8 vs. 4.1) or edition related. Upgrading to 4.1 Enterprise let us hope that the new version was more stable and, more importantly, gave us the opportunity to try different topologies. The Professional edition only allowed for one topology, Replicated Cache (synchronous). Originally, Alachisoft gave us every assurance that at our scale and load this topology should work fine (as it did for many other clients). They insisted that our troubles with NCache were due to an unstable network rather than NCache itself. Even if that were true, NCache’s inability to handle or at least pinpoint this ‘instability’ is, in my mind, on them.

Unstable…it’s hard to be the one that’s strong

Regardless, the Enterprise version also gave us several topology choices, the recommended one being Partitioned Replica (asynchronous). With this topology we were assured that, due to its asynchronous nature, our “network instability”, if it were to occur, would not cause a disruption to NCache that would simultaneously affect both nodes. That at least would be an improvement, I guess. The other major feature, and it is surprising that we needed the Enterprise edition for it, is the ability to configure alerts. The Enterprise edition enabled us to receive alerts when nodes left or joined the cluster or when other potentially disruptive events occurred. While not ideal, at least being alerted instantly to an identified problem was better than troubleshooting web farm issues blind. Our expectations were low. We weren’t expecting the upgrade to fix the problem, but with proper alerting and a few more options available, we at least had a chance to capture the event early and take memory dumps and other diagnostics that might eventually lead to the true source.
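To make the synchronous vs. asynchronous distinction concrete, here is a toy sketch in Python. It is not NCache code and uses no NCache API; the class names (Replica, SyncPrimary, AsyncPrimary) and the `reachable` flag standing in for the network link are all made up for illustration. It only shows the general idea: a synchronous write fails along with the link to the replica, while an asynchronous write completes locally and the replica catches up later.

```python
from collections import deque


class Replica:
    """Toy replica node; `reachable` stands in for the network link (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.reachable = True

    def apply(self, key, value):
        if not self.reachable:
            raise ConnectionError("replica unreachable")
        self.data[key] = value


class SyncPrimary:
    """Synchronous replication: a write succeeds only once the replica has it."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica

    def put(self, key, value):
        # The caller waits on the replica, so a bad link fails the write itself.
        self.replica.apply(key, value)
        self.data[key] = value


class AsyncPrimary:
    """Asynchronous replication: write locally, queue the update for later."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.pending = deque()

    def put(self, key, value):
        # The local write never touches the network.
        self.data[key] = value
        self.pending.append((key, value))

    def flush(self):
        """Ship queued updates in order; stop at the first failure and retry later."""
        while self.pending:
            key, value = self.pending[0]
            try:
                self.replica.apply(key, value)
            except ConnectionError:
                return False
            self.pending.popleft()
        return True
```

With the replica marked unreachable, `SyncPrimary.put` raises while `AsyncPrimary.put` succeeds and leaves the update queued until a later `flush()` goes through. The trade-off in this toy model is the usual one: the asynchronous replica can lag behind the primary for a while.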

And all I really want to know is, if she’s gonna be alright

We upgraded to NCache Enterprise 4.1 and configured the asynchronous Partitioned Replica cache. Overall it went very smoothly. And then we watched and watched …and watched. For days we watched, poised. But no alerts came, no web farm lockups due to NCache funks. It’s been months now, and still no issue. I have to give Alachisoft credit for working this thing through despite our increasing disillusionment with them.

Inconclusion

Maybe the version fixed it, maybe the topology, maybe we have an unstable network and this version handles it better, or maybe we don’t and the old version simply had some sort of bug. Maybe I’m happy to get past this and move on.

The only time we’ve seen any kind of funkiness with NCache since the upgrade is when we’ve rebooted one of the nodes. Any time that happens, the web application on the node that is still up hangs for a good two minutes (merging the caches), which is about the same time it takes the other node to reboot. So as far as NCache providing high availability goes, there’s a bit of an issue. And by the way, never try to reboot both nodes at the same time, because that surely did bring on the funk.

Wednesday, February 22, 2012

RIF Notes #14

“There are only two hard things in Computer Science: cache invalidation and naming things” — Phil Karlton

Forget the lies, the money, we’re in this together.
And through it all, they said nothing’s forever.