Friday, January 13, 2012

RIF Notes the 13th

“I can calculate the motions of heavenly bodies but not the madness of men” – Isaac Newton

“Yeah I'm alright
And I need to know
When I'm dead and gone
Where do I go”

Thursday, December 22, 2011

RIF Notes #12

“That which can be asserted without evidence, can be dismissed without evidence.” - Christopher Hitchens

"Don’t put the bodies in the wishing well"

Tuesday, December 13, 2011

What else went wrong with the distributed cache?

When we last left this story I was contemplating turning on the distributed caching after the Black Friday/Cyber Monday crush had passed. However, shortly after my last post I had a conversation with Alachisoft that led us to reconfigure our cache to use NCache’s in-process clientcache feature. This feature allows an in-process, in-memory copy of the cache to be kept in sync with the out-of-process NCache server. In theory, it’s the best of both worlds, providing near ASP.NET in-memory cache performance with all the benefits of distributed caching. It seemed like in the 11th hour we had finally arrived at a working solution. As it turns out we did, kinda…
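To make the client cache idea concrete, here is a minimal Python sketch of the general pattern: a local in-process copy that is evicted whenever the out-of-process store reports a change. This is not NCache's API or implementation; `RemoteCache`, `NearCache`, and the notification callback are hypothetical names, and the "server" is just an object holding pickled bytes.

```python
import pickle


class RemoteCache:
    """Hypothetical stand-in for an out-of-process cache server."""

    def __init__(self):
        self._store = {}       # key -> serialized bytes
        self._listeners = []   # callbacks to notify when a key changes
        self.reads = 0         # how many reads actually reached the "server"

    def subscribe(self, callback):
        self._listeners.append(callback)

    def put(self, key, value):
        self._store[key] = pickle.dumps(value)
        for notify in self._listeners:
            notify(key)  # tell every client copy that this key changed

    def get(self, key):
        self.reads += 1
        raw = self._store.get(key)
        return None if raw is None else pickle.loads(raw)


class NearCache:
    """In-process copy that falls back to the remote cache on a miss."""

    def __init__(self, remote):
        self._remote = remote
        self._local = {}
        remote.subscribe(self._invalidate)

    def _invalidate(self, key):
        # Drop the stale local copy; the next read reloads it from the remote.
        self._local.pop(key, None)

    def get(self, key):
        if key in self._local:
            return self._local[key]  # no serialization, no remote call
        value = self._remote.get(key)
        if value is not None:
            self._local[key] = value
        return value


if __name__ == "__main__":
    remote = RemoteCache()
    near = NearCache(remote)
    remote.put("product:1", {"price": 10})
    for _ in range(100):
        near.get("product:1")
    print("remote reads after 100 lookups:", remote.reads)
```

Only the first lookup after a change pays the serialization cost; the rest are served from the local dictionary, which is where the "near ASP.NET in-memory cache performance" claim comes from.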

Blyber Fronday

Our web farm, with distributed caching enabled, sailed through the full week beginning Thanksgiving day. The site scaled and performed extremely well, and there were no incidents of any kind. We saw between two and four times normal load during those days and never skipped a beat. It was a vindication of our distributed caching strategy and a justification for all the work and pain that had preceded it.

Defeat snatched from the jaws of victory

That was, until the following Thursday after Thanksgiving. Under normal load, nowhere near what we’d seen in the preceding days, our website suddenly became unavailable. Inexplicably, NCache had gotten itself into a situation which I refer to as an NCache funk (it is as yet undiagnosed by Alachisoft, but has the characteristics of some kind of deadlock). NCache had encountered an unknown event that caused it to lock up both nodes of the web farm. Application pool recycles and IISReset could not bring it back. The servers required a reboot to recover from the NCache funk. Chalking this up as a fluke, we continued on. Alachisoft support had no particular insight after reviewing the logs, and suggested that maybe our servers had resource issues or excessive load (which clearly were not explanations, given the load the farm had handled successfully just days before). They suggested that we provide process dumps if it were ever to occur again.

Funkin’ lesson

Luckily for them, NCache has deadlocked five more times in the past two weeks, still without explanation. Our successful distributed caching strategy, designed for scalability and high availability, is now ironically causing excessive instability. It exhibits the incomprehensible behavior of propagating an issue on one node across the farm, affecting not only the cache but IIS as well. Just to make it more interesting, this issue has occurred with both the sessionstate provider cache and the object cache.

Just when I thought I was out…they pull me back in.

Now, instead of moving on to other projects and initiatives, we’re working with Alachisoft on a possible version upgrade, while at the same time considering dropping back down to one node, or switching over to a sticky session based solution. We find ourselves in the unenviable position of choosing between doubling down on NCache (more time diagnosing, configuring, upgrading, testing), abandoning distributed caching for a less sophisticated sticky session solution, or starting all over with a new tool, perhaps ScaleOut. No matter how you slice it, it’s costing us real cash.

Tuesday, November 22, 2011

RIF Notes #11

“Self-defense is not about winning fights with aggressive men who probably have less to lose than you do” – Sam Harris

“I drink whiskey, you say goodnight, I’ll put an end to this here fight”

Monday, November 14, 2011

What went wrong with the distributed cache?

The basic purpose of the distributed cache was to address the following conditions:

  • We moved from one webserver to two webservers, with the intention of having the flexibility to move to N webservers. The one webserver utilizes the ASP.NET Cache (in-memory) for heavily utilized read-only objects (Category, ProductClass, Product). Moving to two webservers meant a doubling of database queries for cached objects, since each webserver has its own copy of the ASP.NET cache that it needs to load.
  • The ASP.NET cache competes for memory with the application itself, as well as the outputcache.  An increase in memory pressure caused by any one of them causes cache trimming (items to be evicted from the cache).  This results in more database traffic to re-load the evicted items.
  • Application restarts (application pool recycles, etc.) cause the cache to be flushed and reloaded.
  • The database is the most likely bottleneck and is the most difficult to scale. We can add more webservers, but we cannot easily add more database servers. Thus using caching as efficiently as possible is the best way to offload database traffic to the web servers.
  • The theoretical ability for backend systems to affect and/or participate in the distributed cache (e.g. backend systems could update or expire a product in the eCommerce cache when a price changes).

NCache’s distributed cache addresses these conditions by providing:

  • One copy of the cache replicated across the webserver nodes.
  • Its own dedicated process and memory space that could be configured independently and would not compete with the ASP.NET application or the outputcache for memory.
  • The cache would be durable and survive application recycles and even the reboot of one of the nodes.
  • Purported fast throughput, 30,000 cache reads per second.

No small matter

The first major challenge to enabling distributed caching was our object structure and distributed caching’s reliance on serialization. Our object graphs are deeply intertwined and utilize lazy-loading heavily. These two facts were challenges for distributed caching. The object graphs were large and duplicative, and needed to be fully loaded prior to serialization rather than lazy-loaded on demand. The same object might be attached to different graphs repeatedly (e.g. the same manufacturer object might be attached to hundreds of product classes).

I spent considerable time creating boundaries, reducing duplication, and eager loading the graphs prior to objects being placed in the cache. With this distributed-cache-friendly refactoring I was ready to enable distributed caching and do some rudimentary load testing in our test environment.
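The duplication problem is easy to see in a small Python sketch. The classes, field names, and sizes below are made up for illustration and are not our actual object model, and JSON stands in for whatever serializer the cache uses. The point is only that embedding a shared object in every cached entry multiplies its serialized size, while a boundary (storing an ID and caching the shared object once) does not.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Manufacturer:
    id: int
    name: str
    description: str


@dataclass
class ProductClassEmbedded:
    id: int
    name: str
    manufacturer: Manufacturer  # full object travels with every cache entry


@dataclass
class ProductClassById:
    id: int
    name: str
    manufacturer_id: int  # boundary: the manufacturer is cached under its own key


def entry_size(entry):
    """Bytes needed to serialize one cache entry (JSON as a stand-in)."""
    return len(json.dumps(asdict(entry)).encode("utf-8"))


def serialized_size(entries):
    """Total bytes if each entry is serialized as a separate cache item."""
    return sum(entry_size(entry) for entry in entries)


def build_entries(count=200):
    """One hypothetical manufacturer shared by `count` product classes."""
    maker = Manufacturer(1, "Acme", "x" * 2000)
    embedded = [ProductClassEmbedded(i, f"class-{i}", maker) for i in range(count)]
    by_id = [ProductClassById(i, f"class-{i}", maker.id) for i in range(count)]
    return maker, embedded, by_id


if __name__ == "__main__":
    maker, embedded, by_id = build_entries()
    print("embedded graphs:", serialized_size(embedded), "bytes")
    print("with boundaries:", serialized_size(by_id) + entry_size(maker), "bytes")
```

In memory the 200 embedded product classes all point at one manufacturer object, so the duplication costs nothing. It only appears once each entry has to be serialized on its own.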

Not so fast

What I found was that NCache easily became the top resource consumer on the webservers under any kind of load. Performance with distributed caching on, as compared against the same two nodes with separate ASP.NET caches, was measurably worse. In limited load testing, the overhead of distributed caching appeared to far exceed any performance gain from maintaining one synchronized out-of-process copy. Far from achieving 30,000 reads/sec, under a load of about 2,000 reads/sec I could see NCache causing thread locking and reads taking as long as 200 ms.

There’s a rather significant caveat to these findings: my load testing was in no way indicative of true load. It consisted of essentially clicking through the same 4 pages repeatedly in extremely rapid succession using a load test tool simulating 25 users. It’s entirely possible that under a truer load the overhead of distributed cache access could be more balanced with other processing activities and that the synchronization of the cache would prove to be more beneficial. Nevertheless, it’s more likely that the overhead found during load testing would also exist in production and result in overall performance degradation.

A distributed cache is more like a database

Reading from a distributed cache incurs overhead. The conclusion I draw from this is that a distributed cache is more like a database than it is like the in-memory in-process caching provided by the ASP.NET Cache. With the ASP.NET cache, reading and writing to the cache are essentially free. We’re basically reading and writing memory pointers from a Dictionary. Reading Category objects out of the cache hundreds of times in the course of one page request has negligible performance implications. However, with a distributed cache, even a super fast one, those same hundreds of cache reads can add up quickly. The distributed cache may be local (depending on your topology), and store everything in memory, but you still need to serialize objects in and out of it over a socket connection, and unless you’re judicious in its use, that can get expensive more quickly than you might expect.
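Here is a rough Python sketch of that difference. It is an illustration under assumptions, not a model of NCache or the ASP.NET Cache: both cache classes and the `render_page` functions are hypothetical, pickling stands in for serialization, and the socket hop is left out entirely. It counts deserializations rather than measuring time.

```python
import pickle


class InProcessCache:
    """Like an in-memory dictionary cache: a read hands back a reference."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items[key]


class SerializingCache:
    """Stand-in for a distributed cache: every read deserializes a fresh copy."""

    def __init__(self):
        self._items = {}
        self.deserializations = 0

    def put(self, key, value):
        self._items[key] = pickle.dumps(value)

    def get(self, key):
        self.deserializations += 1
        return pickle.loads(self._items[key])


def render_page(cache, key, lookups=300):
    """Naive page: asks the cache for the same object on every lookup."""
    return [cache.get(key)["name"] for _ in range(lookups)]


def render_page_judicious(cache, key, lookups=300):
    """Reads once per request, then reuses the local reference."""
    category = cache.get(key)
    return [category["name"] for _ in range(lookups)]


if __name__ == "__main__":
    remote = SerializingCache()
    remote.put("category:1", {"name": "Tools"})
    render_page(remote, "category:1")
    print("naive page deserializations:", remote.deserializations)
```

Code written against an in-process cache tends to look like `render_page`, because repeated reads cost nothing there. Pointed at a serializing cache, the same habit turns one page request into hundreds of deserializations, which is the "judicious use" problem in miniature.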

Does it or doesn’t it add up?

The obvious question is what other NCache customers are doing differently, or how large sites make use of distributed caching (Facebook uses memcached, Stack Overflow uses Redis), given that even in our small environment with meager load we find that it can easily hurt performance. Is it a matter of scale? Do you need to be using 10 webservers before the benefits of a centralized cache outweigh the overhead? Or are they just smarter about their cache access? Maybe NCache is the wrong product, we have the wrong version, or there’s still something ‘funny’ about the performance and configuration of our web farm servers.

At some point, after the holidays, I intend to enable NCache and capture some performance data with dynatrace to gauge it under true load and see if any new insights are revealed.

Thursday, October 27, 2011

RIF Notes #10

“Everyone takes the limits of his own vision for the limits of the world.” —ARTHUR SCHOPENHAUER

“Whose fist is this anyway?”

Wednesday, October 26, 2011

I’m going to crash Microsoft’s performance database, who’s with me?

On a recent Hanselminutes podcast I heard about PerfWatson.  PerfWatson is a tool that monitors Visual Studio performance and then captures periods of unresponsiveness and sends that data back to Microsoft.  There, they have a huge database that analyzes all of the captured performance data. 

Well, I’ve been running PerfWatson and its companion, the PerfWatson Monitor, for about 2 days, and it’s constantly in the red. The monitor shows a little graphical response time indicator in the bottom right of the screen, and any action that takes more than 2 seconds shows up in red (and is captured by the tool). It’s red so often on even the most basic of tasks (right clicking context menus, saving files, etc.) that if it’s actually capturing all that data they’re gonna need a bigger boat.

I’d encourage anybody who finds Visual Studio performance as painful as I do to install it. Maybe, if we don’t get blacklisted or exceed our own bandwidth, we’ll provide enough performance data to inspire some fixes.