Starting at DIKW May 1st 2019

As of May 1st 2019 I’ll be working at a new company: DIKW in Nieuwegein. DIKW stands for: Data, Information, Knowledge, Wisdom (it works in Dutch too). I will be working as a data engineer on a consultancy basis.

 

I’ve already met many colleagues at DIKW and they are all very experienced in BI, data science and data engineering. It’s fun to converse with them, which actually was part of the application process.

I’m going to miss the view from the office I had, but in the new job I’ll basically manage myself. And let’s face it: no one told me to do 8 courses and become a Certified Kubernetes Administrator last year. No one made a year plan. Development-wise I’m perfectly capable of finding my own way.

Posted in Uncategorized

Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step into a data-driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you to work the data a little. But some Excel sheets simply need too much reworking, and that’s too big a job for NiFi. For those I’ve used Python and the pandas library to create a csv file that Hadoop can handle.
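As a rough sketch of what such a cleanup can look like (not the script from the full post): the snippet below assumes a hypothetical sheet with a title row above the real header, a blank spacer row, and a region column that is only filled in on the first row of each group, as merged cells tend to come out. The column names and the `tidy_sheet` helper are made up for illustration, and `pd.read_excel` needs an Excel engine such as openpyxl installed.

```python
import io

import pandas as pd


def tidy_sheet(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn a header-less dump of a messy sheet into a flat table.

    Assumed layout (hypothetical): row 0 is a title, row 1 holds the
    column names, and the data starts at row 2.
    """
    # Keep only the data rows and promote the second row to column names.
    table = raw.iloc[2:].copy()
    table.columns = [str(name).strip().lower() for name in raw.iloc[1]]
    # Drop spacer rows that are completely empty.
    table = table.dropna(how="all")
    # Merged cells only carry a value in their first row; fill it downwards.
    table["region"] = table["region"].ffill()
    # The mixed column was read as generic objects; make the amounts numeric.
    table["amount"] = pd.to_numeric(table["amount"])
    return table.reset_index(drop=True)


# In real use the raw frame would come from the workbook, for example:
#   raw = pd.read_excel("report.xlsx", sheet_name=0, header=None)
# A small in-memory stand-in keeps this sketch self-contained.
raw = pd.DataFrame(
    [
        ["Sales report 2018", None, None],
        ["Region", "Product", "Amount"],
        ["North", "Apples", 10],
        [None, "Pears", 5],
        [None, None, None],
        ["South", "Apples", 7],
    ]
)

tidy = tidy_sheet(raw)

# Write a plain csv without the pandas index column.
buffer = io.StringIO()
tidy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Writing to a file instead is a matter of passing a path to `to_csv`. Reading with `header=None` keeps pandas from guessing which row holds the column names, which is usually where messy sheets go wrong.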


Posted in Howto, Python

Book review: Seven Databases in Seven Weeks

There are so many data related open source products nowadays. On one side that’s great. On the other side it’s hard for one human to grasp them all. To be sure, there’s great documentation on them all. And there are books and sessions at meetups and conferences that tell you in depth what they are about. But sometimes you just want to get the gist of it. To quickly learn a lot of products, so you can pick the ones you find useful. But it’s rare to get that kind of overview over multiple products.

But “Seven Databases in Seven Weeks, Second Edition” by Luc Perkins, Jim Wilson and Eric Redmond does exactly that. It describes a selection of seven different types of databases, their strengths and weaknesses.

 

The seven databases

The seven databases are PostgreSQL, HBase, MongoDB, CouchDB, Neo4j, DynamoDB and Redis. It’s a good mix of databases. They all have their different uses.

Posted in Active Learning

Doing data science on my health data in R Studio – Part 1

Until seven years ago my doctor would nag me every six months that I should lose some weight. Nagging didn’t work on me that much. What did work however was competition. I wanted to become faster in a bike race (actually, the cycling part in a relay triathlon). When I realized that I could lose weight myself instead of buying an expensive new bike, I set out to do the former and I lost about 10 kilograms the first year. I won’t go into too many details, because I probably have talked about it too often. I’m a bit too proud of this.

This summer I followed an R Studio course on Udemy and when I finished it, I was thinking of doing a project with R Studio. I did gather a lot of health data in the last 6 years: weight, body fat, data from my heart rate monitor, a step count from my iPhone and sleep data. All this was stored in an Excel sheet. It’s not exactly big data, but it has a couple of thousand rows now. Surely there’s a way to get some meaning out of it. I’ve already done a video about this, but it isn’t easy to copy commands from videos. And I’ve tried out some other stuff in this series.


Posted in Learning Big Data, Weird experiments

Check your /tmp on HDFS

If you have sensitive data on your Hadoop cluster, you might want to check /tmp on HDFS once in a while to see what ends up there. /tmp is used by several components. Hive for example stores its “scratch data” there. But fortunately it does so in subdirectories with permissions for the user that ran the job only. The files in there are not readable for anyone else.

But some people think /tmp is a good place to store intermediate data of their homegrown processes. Even when you clean up afterwards, this is not a good idea when dealing with sensitive data, unless you set the permissions in such a way that it is not readable for anyone else. But this is often forgotten. And when such a process fails, this data usually stays in /tmp for a long time.
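One way to do such a check, as a minimal sketch: list /tmp recursively with `hdfs dfs -ls -R /tmp` and flag every entry whose permission string gives read access to “other” users. The helper names and the sample listing below are made up for illustration; the sample stands in for real command output so the snippet runs without a cluster. It only looks at the plain permission bits, so access granted through ACLs is not covered.

```python
import subprocess
from typing import List


def world_readable(listing: str) -> List[str]:
    """Return paths from `hdfs dfs -ls -R` output that other users can read."""
    exposed = []
    for line in listing.splitlines():
        # Columns: permissions, replication, owner, group, size, date, time, path.
        fields = line.split(None, 7)
        # Skip headers such as "Found 3 items" and anything malformed.
        if len(fields) < 8 or len(fields[0]) < 10:
            continue
        permissions, path = fields[0], fields[7]
        # Index 7 of the mode string is the read bit for "other" users.
        if permissions[7] == "r":
            exposed.append(path)
    return exposed


def list_hdfs_tmp() -> str:
    """Run the real listing; needs the hdfs client on the PATH."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", "/tmp"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical sample output, not taken from a real cluster.
sample = """\
drwx------   - hive  hadoop          0 2019-01-15 09:00 /tmp/hive/alice
-rw-r--r--   3 alice hadoop    1048576 2019-01-15 10:22 /tmp/export/customers.csv
-rw-------   3 bob   hadoop       2048 2019-01-15 11:05 /tmp/bob/scratch.dat
"""

exposed_paths = world_readable(sample)
```

On a cluster you would call `world_readable(list_hdfs_tmp())` instead, for example from a scheduled job that mails the result.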

Posted in Learning Big Data

Tech dossier: Elasticsearch / the ELK stack

Because tech is moving so fast, I’ve been keeping dossiers in Evernote of open source products I have to learn more about, which I’ve decided to put on my blog. My last one was about Kubernetes. This one is about Elasticsearch, the E in what is also known as the ELK stack (Elasticsearch, Logstash and Kibana).

 

A short description – in English

Elasticsearch is a search engine. You can use it to quickly find keywords in a large collection of documents. But the people who came up with it must have realized at some point that the same technology can search much more than text. You can use it to search any data.
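To make the keyword lookup a bit more concrete: search engines like this are built around an inverted index, a mapping from each word to the documents that contain it. The toy version below only illustrates that idea in plain Python. It is not how you talk to Elasticsearch, and the documents and function names are made up.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every lowercased word to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        # Each extra word narrows the result to documents that have both.
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical mini collection.
documents = {
    1: "Hadoop stores big data.",
    2: "Elasticsearch searches data quickly.",
    3: "Kibana visualizes data from Elasticsearch.",
}

index = build_index(documents)
all_data_docs = search(index, "data")
both_words = search(index, "elasticsearch data")
```

The lookup never scans the documents themselves, only the index, which is why finding a keyword stays fast when the collection grows.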

Posted in Tech dossier

Tech dossier: Kubernetes

Because tech is moving so fast, I’ve been keeping dossiers in Evernote of open source products I have to learn more about. Like Kubernetes. This morning I suddenly thought this would be perfect for a blog... if properly organized. My plan is to add new interesting material as soon as I have it.

Do you have a cracking good thing to add? Let me know in the comments!

[Update January 7th 2019] Added Performance section and a couple of articles for my reading list.

[Update January 15th 2019] Added CNCF Best Practices for security.

[Update January 21st 2019] Added Kubernetes Failure Stories.

 

A short description – in English

This wasn’t in my original tech dossier, but I decided it could be helpful. What is Kubernetes? The standard answer is: it’s for container orchestration. What does that mean?

Posted in Kubernetes, Tech dossier

Making a Hertzsprung-Russell diagram from Gaia DR2 data with Elasticsearch

Elasticsearch was one of the open source products on my list to try out, ever since I got rejected for a couple of assignments as a consultant last year. Apparently it’s a popular product. But why do you need a search engine in a Big Data architecture? I explain this in my new video, in which I load newly released data from ESA’s Gaia mission into Elasticsearch with Logstash and visualize it with Kibana.

I’ve also created an extra video in which I explain how the code works.

You can get the code I’ve used on my GitHub page.

 

Posted in Learning Big Data, NoSQL

Codemotion Amsterdam 2018, day two

Back on the ferry to the north of Amsterdam I went, back for day two of Codemotion Amsterdam 2018.

Keynote

Daniel Gebler from Picnic told us about what they are doing today to bring groceries home for people. I’ve seen two presentations by Picnic before and I could really see their progress from session to session.

Daniel explained how they use a recommender system to make it possible for customers to buy their most common groceries with one tap in the Picnic app. Which is actually hard. Even if you get 90% precision on your prediction for one item, that means that for a set of 12 items you actually get only about 28% precision (0.9 to the power of 12, if the items are independent). So they really had to work to get a much better precision per item. They managed to do that by working with two dimensions of data: big and deep data.
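The compounding can be checked with a few lines of Python. This is a sketch under the assumption that the per-item predictions are independent, which is my simplification and not something stated in the talk; the function names are made up. With 90% per item, a basket of 12 items comes out at roughly 28%, and roughly 12% is what a basket of 20 items would give.

```python
def basket_precision(per_item: float, items: int) -> float:
    """Chance that every item in the basket is predicted correctly,
    assuming the per-item predictions are independent."""
    return per_item ** items


def required_per_item(target: float, items: int) -> float:
    """Per-item precision needed to reach a target for the whole basket,
    under the same independence assumption."""
    return target ** (1 / items)


twelve_items = basket_precision(0.9, 12)   # roughly 0.28
twenty_items = basket_precision(0.9, 20)   # roughly 0.12
needed = required_per_item(0.9, 12)        # roughly 0.991

print(f"12 items at 90% each: {twelve_items:.1%}")
print(f"20 items at 90% each: {twenty_items:.1%}")
print(f"per-item precision for a 90% basket of 12: {needed:.1%}")
```

The last number shows why the per-item precision has to improve so much: a basket of 12 that is right 90% of the time needs each item to be right about 99% of the time.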

Posted in Conferences, Events

Codemotion Amsterdam 2018, day one

Last Friday I almost felt I had to explain to a colleague that I don’t always win raffles and lotteries. Because yep, I won another ticket. Again via the Roaring Elephant podcast. It’s pretty worthwhile listening to them, is all I’m saying.

This was a ticket for Codemotion Amsterdam 2018. Codemotion is a conference for developers with topics like the blockchain, Big Data, Internet of Things, DevOps, software architectures, but also front-end development, game development and AR/VR.

Amsterdam from the ferry to the north of the city.


Posted in Conferences, Events