Apache Spark – Expedition Data

I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a Docker Compose setup that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by default there have been all kinds of connection problems running (more…)

By Marcel-Jan Krijgsman, October 25, 2020

Data engineering

Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a patron of the Roaring Elephant podcast on Patreon, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

(more…)

By Marcel-Jan Krijgsman, September 1, 2019

Conferences

Dataworks Summit Berlin 2018, day one

I’m back at Dataworks Summit this year. This time I didn’t win a ticket, but my new employer, Port of Rotterdam, arranged for me to go. Pretty cool, because I did not want to miss it. This time it’s happening in Berlin.

It started with keynotes. Scott Gnau from Hortonworks announced Data Steward Studio for better data governance. Scott’s message was that your data strategy is your cloud strategy is your business strategy. You should not see them as totally different things. (more…)

By Marcel-Jan Krijgsman, 8 years ago