Big Data Management: Lessons from the Cloud

So, your company is considering moving to the Cloud for agility, mobility, or serviceability. You’ve read the literature and see the advantages, both from a technology perspective and from a market perspective.

But how does it pan out? Does the hype match the day-to-day reality of operating in the Cloud? Wouldn’t you want to hear from a customer who’s been-there/done-that instead of reading market studies from the vendors?

If so, then this post is for you.

I’m working with a Celerity client that has embraced the Cloud. As a financial regulator, the client has a legacy database that records more than six billion transactions per day. Their current, active data store is measured not in Terabytes, but in Petabytes.

They have a Big Data problem.

Up until recently, this client’s Big Data solution was to store and serve data from its own internal servers (‘on-prem’). Up to a point, this worked just fine, but the operational costs of storing and serving these data sets became prohibitive: queries against the active data sets went from minutes to hours, and then to ‘unretrievable,’ while the monetary cost of maintaining clusters of data nodes on-prem climbed just as sharply.

The solution?

The Cloud, obviously.

We chose to go with AWS, Amazon Web Services. For one particular service we have HS1 data nodes that, as a cluster, can carry 2.5 Petabytes of data with 24/7 availability. We’ve already overrun that maximum twice in the last six months, so we also use Amazon’s S3, their Simple Storage Service. S3 has ‘infinite’ depth; you simply pay for the service and for the amount of data you are storing.
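To give a feel for the S3 half of that arrangement, here is a minimal sketch of archiving and listing data with boto3, Amazon’s Python SDK. The bucket name and key layout are made up for the example, not the client’s actual layout.

```python
import boto3

# Assumed names: this bucket and key prefix are placeholders for the example.
BUCKET = "example-transaction-archive"

s3 = boto3.client("s3")

# Push a local data file into the 'infinite' S3 store.
s3.upload_file(
    Filename="part-00000.gz",
    Bucket=BUCKET,
    Key="transactions/2014/06/part-00000.gz",
)

# List what has been archived under a given prefix.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="transactions/2014/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

You pay only for what sits under that bucket; there is no cluster to size or re-size when the archive grows.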

So, does the Cloud satisfy this client’s needs in terms of:

High availability?

Yes. Our cluster on the cloud is available 24/7. And, while this is not a function of the cloud but of the architecture of our new system, queries that used to take hours, or that terminated with no results after hours of processing, now return in seconds. Queries that were previously capped at a set number of rows, say two hundred thousand, to protect the serving systems now return result sets of unlimited size; we have retrieved two-million-row result sets in seconds.
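That ‘unlimited’ behavior comes from streaming rows out of the store instead of materializing them all up front. As a rough sketch of the idea, assuming an HBase table reachable through the happybase Python client (the host, table name, and row-key prefix below are placeholders, not the client’s actual schema):

```python
import happybase

# Assumed names: host, table, and row-key prefix are placeholders for the example.
connection = happybase.Connection(host="hbase-gateway.example.internal")
table = connection.table("transactions")

# Stream rows in batches rather than loading the whole result set into memory,
# so the size of the result is bounded only by what the caller consumes.
row_count = 0
for row_key, columns in table.scan(row_prefix=b"2014-06-17|", batch_size=1000):
    row_count += 1  # process each row as it arrives

print("rows streamed:", row_count)
connection.close()
```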

Sweet.

Reliability?

Yes. Our cluster runs the HBase database on Hadoop, which gives it high fault tolerance. Over the last six months we’ve lost ten (so far) of our sixty nodes, yet we have suffered zero data degradation. The recovery was handled for us automatically.

Security?

‘Not our concern.’ Yes, Cloud service providers do provide security, but no, we do not trust a vendor’s security alone, so we have our own firewalls and our own set of security rules in place.

Turns out that this is a big pain for our developers, as they have to navigate these firewalls and security groups, but the question then becomes: would they have to do that anyway?

In some cases, yes. But it is also an issue of trust and security. We are putting data on the cloud and do not want those data accessed in an unauthorized way; since the data are now off-site, we wall off our own data warehouse with these firewalls and security groups. Is this taking security too far, given Amazon’s (or another Cloud provider’s) security services? Perhaps, but the few inconveniences of getting the requisite permissions for the development teams are a price willingly paid to avoid the far greater cost of not being secure enough.
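To make that concrete, here is a hedged sketch of what opening one such security-group rule might look like with boto3; the group ID and CIDR range are placeholders, not the client’s real rules.

```python
import boto3

ec2 = boto3.client("ec2")

# Assumed values: the security-group ID and network range are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            # Only the analysts' network range may reach the query endpoint.
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
        }
    ],
)
```

Every rule like this is one more request a developer has to make, which is exactly the inconvenience described above.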

Serviceability?

Great. We don’t have to maintain our cluster. Amazon does.

And therein lies the rub.

Since Amazon maintains our clusters, when a security patch comes in for a vulnerability that exposes them to liability, they act quickly to upgrade their clusters, regardless of the impact on the systems we run on them. Twice in the past two months, Amazon has given us less than a week’s notice of an emergency service-pack upgrade to their clusters that had direct impacts on our production system. One upgrade, had we not coordinated a graceful shutdown of our cluster, would have flushed our entire data set, and the restore process would have taken at least a week, during which our analysts would have had no access to the production system they need to do their jobs. S3 stores however much data you wish to store, but restoring those data takes time: we measured four days to move the data back to the cluster for a complete snapshot restore. Ouch!

Also, if a hardware issue develops on one of your data nodes, Amazon simply shuts down that node and replaces it with a brand-new one. Great! Maintenance is simple and straightforward. But what about all your data on that defective node, even if the drive was not the problem?

Say ‘Bye-bye’ to your data, because that data set on that node is simply gone.

The most valuable lesson learned from data stored and served from the Cloud? Replication. Distributed replication. When transitioning to the Cloud, do not rely on having your data set on one particular (or on any particular) node. Systems fail, and your Cloud provider pretty much guarantees any data stored on a node will be lost when they replace it.
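In HDFS terms, that lesson means keeping the replication factor at three or higher and actually verifying it, so that losing any single node never means losing the only copy of a block. As a small sketch of such a check against the WebHDFS REST API (the NameNode address and path are placeholders for the example):

```python
import requests

# Assumed values: NameNode host/port and the HBase data path are placeholders.
NAMENODE = "http://namenode.example.internal:50070"
PATH = "/hbase/data"

# Ask WebHDFS for the file statuses under the path and flag anything
# whose replication factor has dropped below three copies.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "LISTSTATUS"})
resp.raise_for_status()

for status in resp.json()["FileStatuses"]["FileStatus"]:
    if status["type"] == "FILE" and status["replication"] < 3:
        print("under-replicated:", status["pathSuffix"], status["replication"])
```

With three or more copies spread across the cluster, a node that Amazon swaps out from under you is an inconvenience, not a data loss.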

So, with those caveats, does our customer regret going to the Cloud? Not at all! In fact, quite the contrary! We have one of the largest data sets in the world, both in terms of space (Netflix has more ‘raw bits’) and in number of rows, where we are larger even than Twitter! By going to the Cloud (and by applying some novel data storage and retrieval solutions), we are now able to provide information to our analysts in seconds that used to take hours with our on-prem servers. Our Cloud solution has won industry-wide accolades and is estimated to save the organization $10 to $20 million annually. Most importantly, though, it has allowed us to ask questions we had previously been unable to ask, and to get answers both (much) more quickly and with a wider range of solutions to analyze.