Initial thoughts on AWS S3 storage

I've recently been looking into S3 storage provided by AWS, and I've run into some interesting things I'd like to share.

First up is how to encrypt data in S3.  There are two main flavors of encryption: server-side & client-side.  Worth noting, AWS does provide support for client-side encryption, but personally, if I were going client-side (and there can be good reasons to), I'd roll my own solution.

Then there is server-side encryption.  Jeff Barr at AWS has a great blog entry walking through how to set up server-side encryption if you're interested.

The big highlight is there are three options:
SSE-S3 (S3 managed keys)
SSE-KMS (AWS KMS managed keys)
SSE-C (client managed keys)
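To make the three options concrete, here's a small sketch of the extra arguments you'd hand to boto3's `put_object` (or `upload_file` via `ExtraArgs`) for each flavor.  The helper name and bucket/key values are mine, purely illustrative:

```python
def sse_args(mode, kms_key_id=None, customer_key=None):
    """Build the extra S3 put_object arguments for each SSE flavor.
    Illustrative helper; parameter names match the boto3 API."""
    if mode == "SSE-S3":
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:  # omit to use the account's default S3 KMS key
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if mode == "SSE-C":
        # S3 never stores this key; you must send it again on every
        # download.  boto3 handles the base64 encoding and key MD5.
        return {"SSECustomerAlgorithm": "AES256",
                "SSECustomerKey": customer_key}
    raise ValueError(f"unknown SSE mode: {mode}")
```

Usage would then look something like `s3.put_object(Bucket="my-bucket", Key="data.dat", Body=payload, **sse_args("SSE-KMS"))`, with the bucket and key being whatever you actually use.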

The most robust is SSE-KMS; it includes audit trails among other security features.  But it holds up only if you're talking about a relatively "small" number of reads & writes, and I'm not the Little Bits of Data Guy now am I?  :)  The problem arises with large numbers of small data files.

Here it is, straight from the AWS documentation:
"...you might store data in Amazon S3 using server-side encryption with AWS KMS (SSE-KMS). Each time you upload or download an S3 object that's encrypted with SSE-KMS, Amazon S3 makes a GenerateDataKey (for uploads) or Decrypt (for downloads) request to AWS KMS on your behalf. These requests count toward your limit, so AWS KMS throttles the requests if you exceed a combined total of 1200 uploads or downloads per second of S3 objects encrypted with SSE-KMS."

The real killer there is that it applies to both reads AND writes.  So think about any large Spark analytics job that needs to crunch through thousands of data files...

Now, if you keep close to the maximum S3 object size, which as of this post is 5 TB, you'll likely be ok.  But if you do any kind of event streaming of small data files and can't use windowing to batch them up, be warned: you may need to open a case with AWS support to get your limits increased, or deal with throttling.

So, SSE-S3 is nice if you just want something basic you don't have to manage, & SSE-C if you want to hold the keys yourself & decouple them from the storage provider.

My personal take: go client-side encryption if there are compliance concerns (SOX, etc.); otherwise SSE-C if the tools posting to S3 support it, and SSE-S3 if not.

Moving past encryption, how S3 manages performance on large loads is interesting.  It uses an index built from the url "folder" path of the file and bases its partitions on that.  Basically it means you have to consider how you name your data files.

For example, if you were going to store a bunch of sensor data from shipping trucks by driver & timestamp, you'd normally do something like this:

/driver-12345/12-5-2017/1512361578210.dat
/driver-12345/12-5-2017/1512361659502.dat

But two kinds of hot-spotting can happen: first on the driver ids themselves & second on the dates.  There are easy ways to fix it, such as introducing some sort of hash to randomize the "key" or url.  This needs to be done with some forethought, though.  There is a risk that it can adversely impact performance of whatever you'll be accessing S3 with.  Notably Hive external tables, though others like NiFi, Spark, etc. could be impacted depending on their underlying access methods.
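One simple version of that hashing fix, sticking with the truck-sensor example above (the two-character prefix width is just one possible choice, and the layout is mine, not an AWS prescription):

```python
import hashlib

def hashed_key(driver_id, date, filename):
    """Prepend a short, deterministic hash prefix so sequential driver
    ids and dates spread across S3's internal index partitions instead
    of hot-spotting on one prefix."""
    prefix = hashlib.md5(f"{driver_id}/{date}".encode()).hexdigest()[:2]
    return f"/{prefix}/{driver_id}/{date}/{filename}"

# /driver-12345/12-5-2017/1512361578210.dat becomes something like
# /a3/driver-12345/12-5-2017/1512361578210.dat  (actual prefix varies)
```

Because the prefix is derived from the key itself, any reader that knows the driver & date can recompute the full path; the trade-off is that you can no longer list "all of driver-12345" with a single prefix scan, which is exactly the kind of thing that trips up Hive external tables.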

There is a great AWS article that covers their key strategy in greater detail with more examples.  And Hortonworks has documentation for using S3 that covers it in more depth as well.

The big limit I see so far with S3 isn't encryption, partition management, or any of that.  It's simply that a large number of Apache open-source projects haven't fully caught up to it (yet).  For example, it's well known that the combination of Apache Atlas & Ranger running on top of HDFS provides a powerful set of tools, and they aren't quite there yet with S3.  There is still work going on to get Apache Ranger fully in control of S3, but for now most people rely on Hive external tables fronting S3 storage locations & access the data via Hive, which fully integrates with Ranger.

I suspect with time AWS S3 storage could supplant/rival HDFS in many usages in the Big Data space.  But time will tell...
