I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a Docker compose that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by default there have been all kinds of connection problems running Read more

ITNEXT Summit 2019: serverless, streaming and cloud native transformations

For the third time in a row I’ve attended the ITNEXT Summit. This year I got a ticket from LINKIT, for which I thank them. It was the best ITNEXT Summit I’ve been at so far.

It started with breakfast. I already had it at home, but I can’t resist a good croissant. Mmm… Where was I? Oh yeah, the summit. In this blogpost I look back on the sessions I attended.

 

Cultivating Production Excellence – Liz Fong-Jones

Liz Fong-Jones about dealing with complexity in production

I’ve been on-call for complex systems in my life, but in the era of containers and serverless things have changed. Some things Liz Fong-Jones spoke about in her keynote did sound familiar, but she discussed how with complex architecures with distributed systems, containers and cloud it is no longer a question of systems being up or down. (more…)

Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not a lot that aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data engineering centered Spark book. So I decided to buy the ebook (also because, as a Patreon of the Roaring Elephant podcast, I have a discount key at Manning Publishing).

Spark in Action, 2nd Edition, is not yet finished. It’s a so called MEAP (Manning Early Access Program), which means the author is still writing parts. But he already wrote chapters 1 to 15 and many appendices, so he seems pretty far advanced. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

(more…)

Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingressing terrabyte a minute streams with do-hickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step in a data driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software has an Excel processor. It can even help you to work the data a little. But some Excel sheets simply need too much reworking. And that’s simply too big a job for NiFi. I’ve used Python and the pandas library to create a csv file that Hadoop can handle.

(more…)

“Kubernetes” according to Youtube’s close captions

Okay, this is a bit immature and you’ll learn exactly nothing from this, but I could not resist. I’m following the “Kubernetes Course from a DevOps guru” course on Udemy.com. The videos on Udemy are simply Youtube videos. Just like regular Youtube, you can turn on close captions.

If the publisher of a Youtube video did not enter typed captions, Youtube will use machine learning algorithms to automatically create them instead. This works well when people speak fluent and accentless English (and maybe other languages as well). But results get a bit off even if someone has only a slight Scottish accent, like one of my Youtube favorites and Kerbal Space Program pilot, Scott Manley.

(more…)