Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a Patreon supporter of the Roaring Elephant podcast, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

(more…)

Recovering your HDP 2.6.1 Sandbox on VirtualBox after a restart

If you’ve worked with later versions of the Hortonworks Data Platform 2.x sandbox in VirtualBox and shut one down rather abruptly, you might have noticed that you won’t get past this startup screen the next time you try to start it:

I had this a couple of times, and that’s why I decided to pause my sandbox and save its state every time before shutting down my laptop. But yesterday Windows 10 decided to step in. After a day of studying it was high time for me to have dinner, during which I kept the laptop on. Little did I know that Windows 10 had decided to update and restart at that very moment. And to do this, it needed to shut down every application. Including VirtualBox. When I came back I found out to my horror that my carefully prepared HDP sandbox had been shut down in the roughest of ways. Thanks, Microsoft! (more…)

Tutorial: Let’s throw some asteroids in Apache Hive

This is a tutorial on how to import fixed-length data into Apache Hive (in Hortonworks Data Platform 2.6.1). The idea is that people who are not Hive- or Hadoop-savvy can follow along, so let me know if I succeeded (make sure your comment doesn’t look like spam, though. I’m getting a lot of that lately, even though it never passes my approval).

Intro

Currently I’m studying for the Hortonworks Data Platform Certified Developer: Spark using Python exam (or HDPCD: Spark using Python). One part of the exam objectives is using SQL in Spark. Along the way you also work with Hive, the data warehouse software in Hadoop.

I was following the free Udemy HDPCD Spark using Python preparation course by ITVersity. The course is good BTW, especially for the price :). But after playing along with the Core Spark videos, the course again used the same boring revenue data for the Spark SQL part. And I thought: “I know SQL pretty well. Why not use data that is a bit more interesting?” And so I downloaded the Minor Planet Center’s asteroid data. This contains all the known asteroids up to at least yesterday. At this moment, that is about 745,000 lines of data. (more…)
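For readers who have never seen fixed-length data: every field sits at the same character positions on every line, so a record is read by slicing, not by splitting on a delimiter. Below is a minimal Python sketch of that idea. It is not the Hive approach of the tutorial, and the column layout and sample line are made up for illustration; they are not the Minor Planet Center’s real format.

```python
# Hypothetical layout: (field name, start offset, end offset).
# These offsets are invented for illustration and do NOT match
# the Minor Planet Center's actual file format.
LAYOUT = [
    ("designation", 0, 7),
    ("magnitude", 8, 13),
    ("name", 14, 30),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-length record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


# Build a made-up sample line whose fields are padded to the widths above.
sample = "{:<7} {:>5} {:<16}".format("00001", "3.34", "Ceres")
record = parse_fixed_width(sample)
print(record)
```

The padding is what makes the slicing safe: a short value such as "Ceres" is filled out with spaces to the full field width, and the parser strips them again.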