Posts

Spark Open Source Testing

Recently I've been trying to help out more with the open source Apache Spark community. As of this writing, Spark 2.3.0 has had some release candidates submitted for review by the community. To that end I decided to help out and chronicle my experience. So, to start with, I installed Oracle VirtualBox (technically I already had it), grabbed a fresh image of Ubuntu 16.04.3, and loaded that up. (Note: we'll be using a lot of disk space, so make sure you have at least 20 GB free; otherwise you'll have to grow the virtual disk as you go using VirtualBox commands and GParted in Linux, which, I can now say, is very much not fun 😒) After I had that I did some basic stuff, like installing OpenJDK 8 (technically I forgot OpenJDK at first... which Maven did not like 😃): sudo apt-get install openjdk-8-jdk Also, you have to pull down a copy of R if you want the R-related Spark stuff to work. I'm going to skip how to install R on Linux/Ubuntu here, but there are lots ...
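As a rough illustration of the setup notes in this excerpt, here is a minimal pre-flight sketch in Python. It is not from the original post: the function names are hypothetical, the 20 GB threshold is the post's own rule of thumb, and the executable names `java`, `mvn`, and `R` are assumptions about how those tools appear on the PATH.

```python
import shutil

REQUIRED_FREE_GB = 20  # the post's suggested minimum of free disk space
REQUIRED_TOOLS = ("java", "mvn", "R")  # assumed executable names on PATH


def free_gb(path="."):
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def preflight(path="."):
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = []
    space = free_gb(path)
    if space < REQUIRED_FREE_GB:
        problems.append(
            f"only {space:.1f} GB free at {path}, want at least {REQUIRED_FREE_GB}"
        )
    for tool in missing_tools():
        problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Running it inside the VM before kicking off a build should surface a forgotten JDK or a too-small disk early, rather than halfway through a Maven run.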

Overview: Relational vs Non-Relational

I thought I might share some of my thoughts on relational vs non-relational databases. I've worked with a non-relational database, MongoDB, on and off for the past three years, and of course have over a decade of relational database experience, primarily with Oracle. So I thought I'd give a brief overview of some of the advantages and pitfalls of non-relational databases compared to more traditional data stores. One disclaimer: the field of non-relational data stores has changed radically over the past five years and continues to change. So while I'll try to give a decent overview, I'm sure I'll do a poor job in spots. Also, at the high level I'll be working at, some salient points will get omitted. I will stress that this is a rich area, and it's well worth your time to explore further. So, what is a non-relational database? I'm not sure I could give one solid definition. Just looking at the Wiki page for NoSQL lists out ni...

Initial thoughts on AWS S3 storage

I've recently been looking into S3 storage provided by AWS and I've run into some interesting stuff I'd like to share. First is how to encrypt the data in S3. There are two main flavors of encryption: server-side and client-side. Worth noting, AWS does provide support for client-side encryption, but personally, if I were going client-side, and there could be reasons to pick that, I'd roll my own solution. Now, there is also server-side encryption. Jeff Barr at AWS has a great blog entry going over how to set server-side encryption up if you're interested. The big highlight is that there are three options: SSE-S3 (S3-managed keys), SSE-KMS (AWS KMS-managed keys), and SSE-C (customer-provided keys). The most robust is SSE-KMS, which includes audit trails among other security features, but only if you're talking about relatively "few" reads and writes. But I'm not the Little Bits of Data Guy now am I?  :)  The problem arises with lar...
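To make the three server-side options listed in this excerpt a bit more concrete, here is a minimal Python sketch of the request headers each one adds to an S3 PUT. It is an illustration, not code from the original post: the helper name `sse_headers` and its parameters are hypothetical, and it only builds the header dictionary without sending any request.

```python
import base64
import hashlib


def sse_headers(mode, kms_key_id=None, customer_key=None):
    """Build the S3 request headers for one server-side encryption mode.

    Hypothetical helper for illustration: it returns a dict and never
    contacts AWS.
    """
    if mode == "SSE-S3":
        # S3 generates and manages the key itself.
        return {"x-amz-server-side-encryption": "AES256"}
    if mode == "SSE-KMS":
        headers = {"x-amz-server-side-encryption": "aws:kms"}
        if kms_key_id:
            # Optional: name a specific KMS key instead of the default one.
            headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
        return headers
    if mode == "SSE-C":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte (256-bit) key")
        # The caller supplies the key on every request; S3 does not store it.
        key_b64 = base64.b64encode(customer_key).decode("ascii")
        md5_b64 = base64.b64encode(hashlib.md5(customer_key).digest()).decode("ascii")
        return {
            "x-amz-server-side-encryption-customer-algorithm": "AES256",
            "x-amz-server-side-encryption-customer-key": key_b64,
            "x-amz-server-side-encryption-customer-key-MD5": md5_b64,
        }
    raise ValueError(f"unknown mode: {mode}")
```

For example, `sse_headers("SSE-KMS", kms_key_id="alias/my-key")` (the alias is a made-up placeholder) yields the `aws:kms` header plus the key ID header. With SSE-C the same key has to be sent again on every read, which is part of why it counts as customer-provided rather than AWS-managed.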