Expedition Data – Page 6 – My journey to learn all things data engineering (and that's a lot)

I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a Docker Compose file that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by default, there have been all kinds of connection problems running the complete docker-compose. Turning this option on again (Settings > General > Expose daemon on tcp://localhost:2375 without TLS) makes it all work. I’m still looking Read more

By Marcel-Jan Krijgsman, 6 years ago (October 25, 2020)

Howto

A humidity sensor network on a Raspberry Pi with Zigbee2MQTT

I was looking for a way to detect leakage in my apartment with some kind of IoT solution. Someone on the Dutch technology forum Tweakers.net told me that Xiaomi humidity sensors, combined with Zigbee2MQTT, might be a good fit. The sensors are quite cheap and so is the CC2531 sniffer stick that receives the data sent over the Zigbee protocol. So that’s what I set out to do. And in these two videos you see Read more

By Marcel-Jan Krijgsman, 7 years ago

Events

ITNEXT Summit 2019: serverless, streaming and cloud native transformations

For the third time in a row I’ve attended the ITNEXT Summit. This year I got a ticket from LINKIT, for which I thank them. It was the best ITNEXT Summit I’ve been to so far.

It started with breakfast. I already had it at home, but I can’t resist a good croissant. Mmm… Where was I? Oh yeah, the summit. In this blogpost I look back on the sessions I attended.

Cultivating Production Excellence – Liz Fong-Jones

Liz Fong-Jones about dealing with complexity in production

I’ve been on-call for complex systems in my life, but in the era of containers and serverless, things have changed. Some things Liz Fong-Jones spoke about in her keynote did sound familiar, but she discussed how, in complex architectures with distributed systems, containers and cloud, it is no longer a question of systems being up or down. (more…)

By Marcel-Jan Krijgsman, 7 years ago (November 1, 2019)

Data engineering

Tech dossier: pandas

I’m keeping tech dossiers in Evernote on open source products I want to keep track of. And I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on the Python data management library pandas. A short description – in English: pandas is a Python library. If you already have Python 3 (version 2 support was recently dropped), it’s a matter of running “pip install Read more

By Marcel-Jan Krijgsman, 7 years ago (September 26, 2019)

Data engineering

Things I’ve learned about metadata for a data lake

I’ve been thinking of writing a blogpost about Apache Atlas. Over one and a half years I’ve gained unique experience with this product that I would like to share with the world.

But first we need to talk about metadata. That is one of the important uses of Apache Atlas. Meaningful metadata won’t get in there by accident. Maybe you are just starting your journey into metadata. I’m here to say that it’s going to take work. Not just by you, but by everyone in your organization who has a stake in data. So in this blogpost I will be talking more about the organizational side of metadata than about the technical side.

What do I mean by metadata?

Metadata can mean many things. Search for it and you’ll find that there’s metadata used by companies to “get to know you better”, or in other words: for ad targeting. There is also metadata used by intelligence agencies to find out if you plan to do anything bad. But the metadata I’m talking about is the kind of information that you can use to find data in an organization.

(more…)

By Marcel-Jan Krijgsman, 7 years ago

Data engineering

Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not a lot that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a Patreon supporter of the Roaring Elephant podcast, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far advanced. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

(more…)

By Marcel-Jan Krijgsman, 7 years ago (September 1, 2019)

Active Learning

Neo4J: Loading rocket data in a graph database

When I first learned about graph databases, like Neo4J, I didn’t get it. That’s how I always start with new technology: not getting at all why people are getting so enthusiastic about it. Then I read “Seven Databases in Seven Weeks, 2nd edition” (as reviewed in January). It describes Neo4J as “whiteboard friendly”. Any diagram with boxes and lines you could draw on a whiteboard can be stored in Neo4J. After reading the first paragraphs Read more

By Marcel-Jan Krijgsman, 7 years ago (June 8, 2019)

Uncategorized

Starting at DIKW May 1st 2019

As of May 1st 2019 I’ll be working at a new company: DIKW in Nieuwegein. DIKW stands for: Data, Information, Knowledge, Wisdom (it works in Dutch too). I will be working as a data engineer on a consultancy basis. I’ve already met many colleagues at DIKW and they are all very experienced in BI, data science and data engineering. It’s fun to converse with them, which actually was part of the application process. I’m going to miss Read more

By Marcel-Jan Krijgsman, 7 years ago (April 25, 2019)

Howto

Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step into a data-driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you rework the data a little. But some Excel sheets simply need too much reworking, and that’s too big a job for NiFi. So I’ve used Python and the pandas library to create a CSV file that Hadoop can handle.
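To give an idea of what such reworking can look like, here is a minimal sketch. It is not the script from the full post: the sheet layout, the column names and the helper `tidy_sheet` are all made up for illustration. In practice the raw frame would probably come from something like `pd.read_excel(path, header=None)` (which needs an Excel engine such as openpyxl installed); here it is built in memory so the example runs on its own.

```python
import pandas as pd


def tidy_sheet(raw: pd.DataFrame, header_row: int, fill_down: list[str]) -> pd.DataFrame:
    """Flatten a sheet that was read without a header (hypothetical helper)."""
    # Use the chosen row as column names, stripped of stray whitespace.
    columns = [str(name).strip() for name in raw.iloc[header_row]]
    body = raw.iloc[header_row + 1:].copy()
    body.columns = columns
    # Drop spacer rows that are completely empty.
    body = body.dropna(how="all")
    # Merged cells arrive as one value followed by blanks: fill them downwards.
    for column in fill_down:
        body[column] = body[column].ffill()
    return body.reset_index(drop=True)


# Made-up stand-in for a messy sheet: a title row, a blank row,
# the real header on the third row, a merged "region" cell and a spacer row.
raw = pd.DataFrame([
    ["Sales report 2019", None, None],
    [None, None, None],
    [" region ", "month", "amount"],
    ["North", "Jan", 10],
    [None, "Feb", 12],
    [None, None, None],
    ["South", "Jan", 7],
])

tidy = tidy_sheet(raw, header_row=2, fill_down=["region"])

# A plain CSV without the pandas index is easy to load into Hadoop tooling.
csv_text = tidy.to_csv(index=False)
print(csv_text)
```

To write a file instead of a string, pass a path: `tidy.to_csv("sales.csv", index=False)` (the file name is just an example).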

(more…)

By Marcel-Jan Krijgsman, 7 years ago (March 8, 2019)

“Kubernetes” according to YouTube’s closed captions

Okay, this is a bit immature and you’ll learn exactly nothing from this, but I could not resist. I’m following the “Kubernetes Course from a DevOps guru” course on Udemy.com. The videos on Udemy are simply YouTube videos. Just like on regular YouTube, you can turn on closed captions.

If the publisher of a YouTube video did not enter typed captions, YouTube will use machine learning algorithms to create them automatically instead. This works well when people speak fluent and accentless English (and maybe other languages as well). But results get a bit off even if someone has only a slight Scottish accent, like one of my YouTube favorites and Kerbal Space Program pilot, Scott Manley.

(more…)

By Marcel-Jan Krijgsman, 8 years ago (October 12, 2018)