Broken Glass: Diagnosing Production Cassandra Issues

About Brian ONeill

I just passed my two-year anniversary at Health Market Science (HMS), and we’ve been working with Cassandra for almost the entirety of my career here. In that time, we have had remarkably few problems with it. Like few other technologies I’ve worked with, Cassandra “just works”.

But, as with *every* technology I’ve ever worked with, you eventually have some sort of issue, even if it is not with the technology itself, but rather your use of the technology.  And that was the situation here.  (gun? check. foot? check. aim… fire. =)

Here is our tale of when bullet met foot…

Our dependency on Cassandra has increased exponentially since it’s been in production. We’ve been adding product lines, and clients to those product lines, at an ever-increasing rate. And with that success, we’ve had to evolve the architecture over time, but some parts of the system have remained untouched because they’ve been cruising along. Over the last couple of weeks, one of those parts reared its ugly head.

We’ve been scaling the nodes in our cluster vertically to accommodate demand. Our cluster is entirely virtual, so this was always the path of least resistance. Need more memory? No problem. Need more CPU? No problem. Need space/disk? We’ve got tons in our SAN. Do that a few times, with increasing frequency, and you start to see a trend that doesn’t end well. =)


Source : http://www.javacodegeeks.com/2013/08/broken-glass-diagnosing-production-cassandra-issues.html
