I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a Docker compose that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by default there have been all kinds of connection problems running the complete docker-compose. Turning this option on again (Settings > General > Expose daemon on tcp://localhost:2375 without TLS) makes it all work. I’m still looking Read more

A humidity sensor network on a Raspberry Pi with Zigbee2MQTT

I was looking for a way to detect leakage in my appartement with some kind of IoT solution. Someone on the Dutch technology forum Tweakers.net told me Xiaomi Humidity sensors, combined with a Zigbee2MQTT might be a good fit. The sensors are quite cheap and so is the CC2531 sniffer stick to receive the data sent over the Zigbee protocol. So that’s what I set out to do. And in these two videos you see Read more

ITNEXT Summit 2019: serverless, streaming and cloud native transformations

For the third time in a row I’ve attended the ITNEXT Summit. This year I got a ticket from LINKIT, for which I thank them. It was the best ITNEXT Summit I’ve been at so far.

It started with breakfast. I already had it at home, but I can’t resist a good croissant. Mmm… Where was I? Oh yeah, the summit. In this blogpost I look back on the sessions I attended.

 

Cultivating Production Excellence – Liz Fong-Jones

Liz Fong-Jones about dealing with complexity in production

I’ve been on-call for complex systems in my life, but in the era of containers and serverless things have changed. Some things Liz Fong-Jones spoke about in her keynote did sound familiar, but she discussed how with complex architecures with distributed systems, containers and cloud it is no longer a question of systems being up or down. (more…)

Tech dossier: pandas

I’m keeping tech dossiers in Evernote on open source products I want to keep track of.  And I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on the Python data management library pandas.   A short description – in English Pandas is a library of Python. If you already have Python 3 (version 2 support was recently dropped), it’s a matter of running “pip install Read more

Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not a lot that aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data engineering centered Spark book. So I decided to buy the ebook (also because, as a Patreon of the Roaring Elephant podcast, I have a discount key at Manning Publishing).

Spark in Action, 2nd Edition, is not yet finished. It’s a so called MEAP (Manning Early Access Program), which means the author is still writing parts. But he already wrote chapters 1 to 15 and many appendices, so he seems pretty far advanced. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

(more…)

Neo4J: Loading rocket data in a graph database

When I first learned about graph databases, like Neo4J, I didn’t get it. That’s how I always start with new technology: not getting at all why people getting so enthusiastic about them. Then I read “Seven Databases in Seven Weeks, 2nd edition” (as reviewed in January). It describes Neo4J as a “whiteboard friendly”. Any diagram with boxes and lines you could draw on a whiteboard, can be stored in Neo4J. After reading the first paragraphs Read more

Starting at DIKW May 1st 2019

Per May 1st 2019 I’ll be working at a new company: DIKW in Nieuwegein. DIKW stands for: Data, Information, Knowledge, Wisdom (it works in Dutch too). I will be working as data engineer on consultancy basis.   I’ve already met many colleagues at DIKW and they are all very experienced in BI, data science and data engineering. It’s fun to converse with them, which actually was part of the sollicitation process. I’m going to miss Read more

Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingressing terrabyte a minute streams with do-hickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step in a data driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software has an Excel processor. It can even help you to work the data a little. But some Excel sheets simply need too much reworking. And that’s simply too big a job for NiFi. I’ve used Python and the pandas library to create a csv file that Hadoop can handle.

(more…)

“Kubernetes” according to Youtube’s close captions

Okay, this is a bit immature and you’ll learn exactly nothing from this, but I could not resist. I’m following the “Kubernetes Course from a DevOps guru” course on Udemy.com. The videos on Udemy are simply Youtube videos. Just like regular Youtube, you can turn on close captions.

If the publisher of a Youtube video did not enter typed captions, Youtube will use machine learning algorithms to automatically create them instead. This works well when people speak fluent and accentless English (and maybe other languages as well). But results get a bit off even if someone has only a slight Scottish accent, like one of my Youtube favorites and Kerbal Space Program pilot, Scott Manley.

(more…)