Spark Open Source Testing
Recently I've been trying to help out more with the open source Apache Spark community. As of the writing of this blog, the 2.3.0 release has had some release candidates submitted for review by the community.
To that end I decided to help out & chronicle my experience.
So, to start with, I installed Oracle VirtualBox (technically I already had it) and grabbed a fresh image of Ubuntu 16.04.3 & loaded that up. (Note: we'll be using a lot of disk space, so make sure you have at least 20 GB free; otherwise you'll have to grow the disk as you go using VirtualBox commands & gparted in Linux, which, I can now say, is very much not fun 😒)
After I had that, I did some basic setup, like installing OpenJDK 8 (technically I forgot OpenJDK at first... which Maven did not like 😃):
sudo apt-get install openjdk-8-jdk
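(Purely optional, but before kicking off Maven it's a quick sanity check to confirm the right JDK is actually on the path:)
java -version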
Also, you have to pull down a copy of R if you want the R-related Spark stuff to work. I'm going to skip how to install R on Linux/Ubuntu here, but there are lots of great resources available via Google. I used this one from Melissa Anderson on DigitalOcean:
https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2
And you'll have to install some R libraries:
install.packages("knitr");
install.packages("testthat");
Note that with testthat, depending on the version, you might run into this issue:
https://github.com/apache/spark/pull/20003
So you'll need devtools to pin a specific version (and installing devtools itself might require a specific install of openssl, so be warned).
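If you do hit that issue, something along these lines should let you pin an older testthat via devtools (the 1.0.2 version here is my assumption of a compatible pre-2.0 release; check the linked PR for what the SparkR tests actually expect):
install.packages("devtools")
# pin a pre-2.0 testthat; the exact version number is an example, not gospel
devtools::install_version("testthat", version = "1.0.2", repos = "https://cloud.r-project.org")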
And while my native Ubuntu image comes with Python (of course), it was missing some of the libraries, so I had to add them:
sudo apt-get install python-setuptools
Then I pulled down the (at the time) latest release candidate from git:
git clone https://github.com/apache/spark.git
cd spark
git checkout tags/v2.3.0-rc5
After that I ran the basic package build (note: since this is a fresh image, this process does take a while to pull everything down):
build/mvn -DskipTests clean package
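One aside: build/mvn is supposed to set sane JVM options for you, but if the build dies with an OutOfMemoryError you can bump the heap yourself before re-running. Something like this (the sizes are just an example, roughly in line with what the Spark building docs suggest):
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"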
Next I wanted to create a distro, so I ran:
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
Note here: if I hadn't bothered to install R in the previous step, I'd exclude the --r and -Psparkr options from the distro build.
Then I checked the md5 & sha512 to make sure they matched the published data:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
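For reference, the check itself is just the standard tools, assuming you've downloaded one of the published artifacts (the file name below is an example from that RC bin directory):
md5sum spark-2.3.0-bin-hadoop2.7.tgz
sha512sum spark-2.3.0-bin-hadoop2.7.tgz
# compare the output against the corresponding .md5 / .sha512 files at the URL above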
Next I ran a number of tests of my own making. As I'm more interested in R, I ended up trying to create SparkR data frames, convert them into R data frames, & so on.
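To give a flavor of those ad-hoc tests, here's a minimal SparkR sketch along the lines of what I was running (this assumes SPARK_HOME points at the freshly built distribution; the dataset and the filter are just examples):
# load SparkR from the distribution and start a local session
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]", appName = "rc-smoke-test")

# build a SparkR DataFrame from a plain R data frame
df <- createDataFrame(faithful)
printSchema(df)
head(filter(df, df$waiting > 70))

# and convert it back into a local R data frame
local_df <- collect(df)
str(local_df)

sparkR.session.stop()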