ITNEXT Summit 2019: serverless, streaming and cloud native transformations

For the third time in a row I’ve attended the ITNEXT Summit. This year I got a ticket from LINKIT, for which I thank them. It was the best ITNEXT Summit I’ve been to so far.

It started with breakfast. I already had it at home, but I can’t resist a good croissant. Mmm… Where was I? Oh yeah, the summit. In this blogpost I look back on the sessions I attended.


Cultivating Production Excellence – Liz Fong-Jones

Liz Fong-Jones about dealing with complexity in production

I’ve been on-call for complex systems in my life, but in the era of containers and serverless things have changed. Some things Liz Fong-Jones spoke about in her keynote did sound familiar, but she discussed how, with complex architectures built on distributed systems, containers and cloud, it is no longer a question of systems being up or down. Continue reading

Posted in Events

Tech dossier: pandas

I’m keeping tech dossiers in Evernote on open source products I want to keep track of, and I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on the Python data management library pandas.


A short description – in English

Pandas is a Python library. If you already have Python 3 (version 2 support was recently dropped), it’s a matter of running “pip install pandas” and there you are. Pandas allows you to analyze and manipulate your data. But then again, aren’t there many more products for that? How to explain the power of pandas?

Let me put it like this: it is like using Excel, but on much larger datasets, and if Excel had a command line interface. Imagine being able to say to Excel on a command line: “load my csv file”, “use this row as names for my columns”, “just show me columns date and sales”, “all right, now pivot that”. I just love it.
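To make the Excel comparison a bit more concrete, here is a minimal sketch of those four “commands” in pandas. The CSV content, the column names and the region column are made up for illustration; only read_csv, column selection and pivot_table are standard pandas.

```python
import io

import pandas as pd

# Hypothetical CSV: a junk title line first, then the real header row.
raw = io.StringIO(
    "Sales export 2019\n"
    "date,region,sales,notes\n"
    "2019-01-01,North,100,ok\n"
    "2019-01-01,South,80,ok\n"
    "2019-01-02,North,120,ok\n"
    "2019-01-02,South,90,late\n"
)

# "Load my csv file" and "use this row as names for my columns":
# header=1 tells pandas that the second line holds the column names.
df = pd.read_csv(raw, header=1)

# "Just show me columns date and sales"
date_sales = df[["date", "sales"]]
print(date_sales)

# "All right, now pivot that": one row per date, one column per region.
pivot = df.pivot_table(index="date", columns="region", values="sales", aggfunc="sum")
print(pivot)
```

With a real file you would pass its path to read_csv instead of the in-memory text.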


Learning pandas

For this I’ve used It’s free and it gave me an excellent start with data analysis in Python. The YouTube videos for pandas also seem to have been updated recently.

Need to learn Python first? I started learning Python with the Coursera course “An Introduction to Interactive Programming in Python (Part 1)” from Rice University. It’s a great course. But if you want a free course, you can’t go wrong with the videos.

You can also watch a couple of my videos on my first encounters with pandas.

And recently I wrote a blogpost on how I used pandas at work to flatten the data from a complex Excel sheet, so I could load it in Hadoop. I’ve used all kinds of lesser-known features to achieve that result.


Building your own environment

Want to play with pandas? That’s quite easy. You need to install Python 3 on your own computer and use “pip install pandas” (from the command line).
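Once the install has finished, a quick smoke test like the one below shows whether everything works. The tiny DataFrame is made-up example data.

```python
import pandas as pd

# Print the installed version to confirm that the import works.
print(pd.__version__)

# Build a tiny, made-up DataFrame and do one calculation on it.
df = pd.DataFrame({"city": ["Utrecht", "Nieuwegein"], "visits": [3, 5]})
total = int(df["visits"].sum())
print(total)
```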


Getting pandas to do specific stuff

Selecting columns or rows with pandas (Because I keep forgetting after a while)

This article discusses two ways of selecting data with pandas, but it’s also handy as reminder how to select rows and columns. You can’t go wrong now.
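As a reminder for myself, here is a small sketch of the selection options. I am assuming that the two ways meant here are label-based .loc and position-based .iloc; the data is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"sales": [100, 80, 120], "region": ["North", "South", "North"]},
    index=["mon", "tue", "wed"],
)

# Label-based: row "tue", column "sales".
by_label = df.loc["tue", "sales"]

# Position-based: second row, first column (the same cell).
by_position = df.iloc[1, 0]

# A list of column names returns a DataFrame with just those columns.
only_sales = df[["sales"]]

# A boolean condition inside .loc filters rows.
north = df.loc[df["region"] == "North"]

# A slice in .iloc takes rows by position (the end is exclusive).
first_two = df.iloc[:2]

print(by_label, by_position)
print(north)
```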

How to shift a column in pandas
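A minimal sketch of shifting, with made-up numbers: shift(1) moves every value one row down, which makes it easy to compare a row with the previous one.

```python
import pandas as pd

df = pd.DataFrame({"day": [1, 2, 3, 4], "sales": [100, 120, 90, 130]})

# Move the sales column one row down; the first row has no previous value.
df["prev_sales"] = df["sales"].shift(1)

# Difference with the previous day.
df["change"] = df["sales"] - df["prev_sales"]

print(df)
```

The first row of the shifted column is empty (NaN), so the column becomes a float column.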

How do multi-indexes in pandas work? (Also covered in a video.)
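A short sketch of what a multi-index looks like in practice. The year/region data is made up and not taken from the linked material.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "year": [2018, 2018, 2019, 2019],
        "region": ["North", "South", "North", "South"],
        "sales": [100, 80, 120, 90],
    }
).set_index(["year", "region"])  # two index levels: year, then region

# Select on the first level: all rows for 2019, indexed by region.
one_year = df.loc[2019]

# Select one cell with a tuple of (year, region).
one_cell = df.loc[(2019, "South"), "sales"]

# Select on the second level with xs: all "North" rows, indexed by year.
north = df.xs("North", level="region")

# Turn the region level into columns.
wide = df["sales"].unstack("region")

print(one_year)
print(wide)
```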



Other interesting stuff

Pandas tricks and features you might not know

Data visualization with pandas plot (How cool: you can add .plot to your dataframe)


pandas and performance

pandas at extreme performance


Posted in Data engineering, Python, Tech dossier

The Atlas REST API – working examples

Originally I was writing a blogpost about my experiences with Apache Atlas (which is still in the works) in which I would refer to a Hortonworks Community post I wrote with all the working examples of Atlas REST API calls. But since Hortonworks Community has migrated to Cloudera Community, this article seems to have been lost. The original URL brings you to the Cloudera Community, but not the article. The search engine comes up with nothing. I can’t find it via my profile either.

It wasn’t particularly easy to gain all this knowledge. So of course I had a backup of all successful commands and output. And here it is. This was all tested on HDP 2.6.5.
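To give an idea of the shape of such a call, here is a hedged sketch in Python. It is not one of the original commands from the lost article. The host, port and credentials are placeholders (port 21000 and admin/admin are common defaults, but check your own cluster), and the sketch only builds the request without sending it. The endpoint used is the v2 type definitions endpoint.

```python
import base64
import urllib.request

# Assumptions: host, port and credentials are placeholders for your cluster.
ATLAS_URL = "http://localhost:21000"
USER, PASSWORD = "admin", "admin"


def build_atlas_request(path):
    """Build (but do not send) a GET request for an Atlas REST endpoint."""
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    return urllib.request.Request(
        ATLAS_URL + path,
        headers={"Authorization": "Basic " + token, "Accept": "application/json"},
    )


# List all type definitions known to Atlas.
request = build_atlas_request("/api/atlas/v2/types/typedefs")
print(request.get_method(), request.full_url)

# To actually call a running Atlas server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```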

Continue reading

Posted in Apache Atlas

Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a Patreon supporter of the Roaring Elephant podcast, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program), which means the author is still writing parts. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

Continue reading

Posted in Data engineering, Spark

Neo4J: Loading rocket data in a graph database

When I first learned about graph databases, like Neo4J, I didn’t get it. That’s how I always start with new technology: not getting at all why people get so enthusiastic about it. Then I read “Seven Databases in Seven Weeks, 2nd edition” (as reviewed in January). It describes Neo4J as “whiteboard friendly”. Any diagram with boxes and lines you could draw on a whiteboard can be stored in Neo4J. After reading the first paragraphs about Neo4J, I totally got why graph databases are very interesting.

As usual, I started following a course on Neo4J to get acquainted with the product. Well there was a course on Neo4J on and I followed it. But it was 3 years old and some of the code it taught is already obsolete. So the less I say about that, the better. But I did learn some Cypher, the language used in Neo4J, and I later learned the more modern versions of the commands to get stuff done.

Next phase: do a project with it. It took me some time to think of interesting astronomy or space related data to put in it. Eventually I stumbled on a dataset that has been around and maintained for a long time. At least since I discovered it, when I had just got on the Internet around 1993. It’s called Jonathan’s Space Page now. And the maintainer, Jonathan McDowell, still keeps the list of orbital space launches and the list of satellites ever launched quite up to date.

And eventually I did manage to load this data in Neo4J and here is my video about that:


You can download Neo4J Desktop here:

Neo4j Download Center

You can find my Python and Cypher code here:

Let me know if you want me to go in depth on the code used in this video.
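For readers who just want the general idea, here is a small sketch of how Python can prepare parameterized Cypher statements for this kind of data. This is not the code from the video. The node labels (Launch, Satellite), the CARRIED relationship, the property names and the rows are all hypothetical.

```python
# Hypothetical graph model: (:Launch)-[:CARRIED]->(:Satellite)
LOAD_QUERY = (
    "MERGE (l:Launch {launch_id: $launch_id}) "
    "SET l.date = $date "
    "MERGE (s:Satellite {name: $name}) "
    "MERGE (l)-[:CARRIED]->(s)"
)

# Made-up rows; real rows would be parsed from the launch and satellite lists.
rows = [
    {"launch_id": "demo-1", "date": "2000-01-01", "name": "DemoSat A"},
    {"launch_id": "demo-1", "date": "2000-01-01", "name": "DemoSat B"},
]


def to_statements(records):
    """Pair the Cypher query with one parameter dict per record."""
    return [(LOAD_QUERY, dict(record)) for record in records]


statements = to_statements(rows)
for query, params in statements:
    print(params["launch_id"], "->", params["name"])

# With the official neo4j Python driver, each pair could be passed to
# session.run(query, params) inside an open session.
```

MERGE is used instead of CREATE so that two satellites on the same launch end up connected to one Launch node rather than two.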


Posted in Active Learning, Howto, NoSQL

Starting at DIKW May 1st 2019

As of May 1st 2019 I’ll be working at a new company: DIKW in Nieuwegein. DIKW stands for: Data, Information, Knowledge, Wisdom (it works in Dutch too). I will be working as a data engineer on a consultancy basis.


I’ve already met many colleagues at DIKW and they are all very experienced in BI, data science and data engineering. It’s fun to converse with them, which actually was part of the application process.

I’m going to miss the view from the office I had, but in the new job I’ll basically manage myself. And let’s face it: no one told me to do 8 courses and become Certified Kubernetes Administrator last year. No one made a year plan. Development-wise I’m perfectly capable of finding my own way.

Posted in Uncategorized

Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step in a data driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you to work the data a little. But some Excel sheets simply need too much reworking, and that’s too big a job for NiFi. I’ve used Python and the pandas library to create a csv file that Hadoop can handle.
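The kind of reworking meant here can be sketched as follows. This is not the code from the full post. The sheet layout is made up, and the DataFrame stands in for what pd.read_excel would return, so the sketch runs without an Excel file. It assumes two typical problems: merged cells that come through as missing values, and one column per month.

```python
import pandas as pd

# Stand-in for a sheet loaded with pd.read_excel (hypothetical layout).
sheet = pd.DataFrame(
    {
        "department": ["Sales", None, "Support"],  # merged cell -> missing value
        "product": ["bike", "helmet", "bike"],
        "jan": [10, 4, 2],
        "feb": [12, 5, 3],
    }
)

# Repeat the department name downwards, like unmerging the cells.
sheet["department"] = sheet["department"].ffill()

# Turn the month columns into rows: one row per department, product and month.
flat = sheet.melt(
    id_vars=["department", "product"], var_name="month", value_name="amount"
)

# Write plain csv text without the pandas index column.
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv would write the result to disk, ready to be put in Hadoop.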

Continue reading

Posted in Howto, Python

Book review: Seven Databases in Seven Weeks

There are so many data related open source products nowadays. On the one hand that’s great. On the other hand it’s hard for one human to grasp them all. To be sure, there’s great documentation on them all. And there are books and sessions at meetups and conferences that tell you in depth what they are about. But sometimes you just want to get the gist of it: to quickly learn about a lot of products, so you can pick the ones you find useful. It’s rare to get that kind of overview over multiple products.

But “Seven Databases in Seven Weeks, Second Edition” by Luc Perkins, Jim Wilson and Eric Redmond does exactly that. It describes a selection of seven different types of databases, their strengths and weaknesses.


The seven databases

The seven databases are PostgreSQL, HBase, MongoDB, CouchDB, Neo4J, DynamoDB and Redis. It’s a good mix of databases. They all have different uses. Continue reading

Posted in Active Learning

R Studio: Doing data science on my health data – Part 1

Up to seven years ago my doctor would nag me every half year that I should lose some weight. Nagging didn’t work on me that much. What did work however was competition. I wanted to become faster in a bike race (actually, the cycling part in a relay triathlon). When I noticed that I could lose weight myself instead of buying an expensive new bike, I set out to do the former and I lost about 10 kilograms the first year. I won’t go into too many details, because I probably have talked about it too often. I’m a bit too proud about this.

This summer I followed an R Studio course on Udemy and when I finished it, I was thinking of doing a project with R Studio. I did gather a lot of health data in the last 6 years: weight, body fat, data from my heart rate monitor, a step count from my iPhone and sleep data. All this was stored in an Excel sheet. It’s not exactly Big data, but it has a couple of thousand rows now. Surely there’s a way to get some meaning out of it. I’ve already done a video about this, but it isn’t easy to copy commands from videos. And I’ve tried out some other stuff in this series.

Continue reading

Posted in Learning Big Data, Weird experiments