Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step into a data-driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you rework the data a little. But some Excel sheets simply need too much reworking, and that’s too big a job for NiFi. For those I’ve used Python and the pandas library to create a CSV file that Hadoop can handle.
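As a rough idea of what such a clean-up can look like, here is a minimal sketch. It is not the code from the full post. It assumes a sheet with empty spacer rows and columns around a single table, and the function names, file name and sheet name are all made up for illustration.

```python
import pandas as pd


def tidy_sheet(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn a messy sheet (read with header=None) into a flat table."""
    # Drop rows and columns that are completely empty (spacers in the sheet).
    df = raw.dropna(how="all").dropna(axis=1, how="all")
    # Assumption: the first remaining row holds the column names.
    header = df.iloc[0].astype(str).str.strip().str.lower().str.replace(" ", "_")
    df = df.iloc[1:].reset_index(drop=True)
    df.columns = list(header)
    return df


def sheet_to_csv(raw: pd.DataFrame) -> str:
    """Return the tidied sheet as CSV text."""
    return tidy_sheet(raw).to_csv(index=False)


# Hypothetical usage with a real workbook (file and sheet names are made up):
# raw = pd.read_excel("report.xlsx", sheet_name="Sheet1", header=None)
# with open("report.csv", "w", newline="") as f:
#     f.write(sheet_to_csv(raw))
```

Reading with `header=None` keeps pandas from guessing which row is the header, which tends to go wrong when a sheet starts with titles or blank rows.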

Posted in Howto, Python

Book review: Seven Databases in Seven Weeks

There are so many data related open source products nowadays. On one side that’s great. On the other side it’s hard for one human to grasp them all. To be sure, there’s great documentation on them all. And there are books and sessions at meetups and conferences that tell you in depth what they are about. But sometimes you just want to get the gist of it. To quickly learn a lot of products, so you can pick the ones you find useful. But it’s rare to get that kind of overview over multiple products.

But “Seven Databases in Seven Weeks, Second Edition” by Luc Perkins, Jim Wilson and Eric Redmond does exactly that. It describes a selection of seven different types of databases, their strengths and weaknesses.


The seven databases

The seven databases are PostgreSQL, HBase, MongoDB, CouchDB, Neo4j, DynamoDB and Redis. It’s a good mix of databases. They all have their different uses.

Posted in Active Learning

R Studio: Doing data science on my health data – Part 1

Up to seven years ago my doctor would nag me every half year that I should lose some weight. Nagging didn’t work on me that much. What did work, however, was competition. I wanted to become faster in a bike race (actually, the cycling part in a relay triathlon). When I noticed that losing weight myself could do the job instead of buying an expensive new bike, I set out to do the former and lost about 10 kilograms in the first year. I won’t go into too many details, because I have probably talked about it too often. I’m a bit too proud of this.

This summer I followed an R Studio course on Udemy and when I finished it, I was thinking of doing a project with R Studio. I have gathered a lot of health data in the last 6 years: weight, body fat, data from my heart rate monitor, a step count from my iPhone and sleep data. All this was stored in an Excel sheet. It’s not exactly Big Data, but it has a couple of thousand rows now. Surely there’s a way to get some meaning out of it. I’ve already done a video about this, but it isn’t easy to copy commands from videos. And I’ve tried out some other stuff in this series.

Posted in Learning Big Data, Weird experiments

Check your /tmp on HDFS

If you have sensitive data on your Hadoop cluster, you might want to check /tmp on HDFS once in a while to see what ends up there. /tmp is used by several components. Hive, for example, stores its “scratch data” there. But fortunately it does so in subdirectories with permissions for the user that ran the job only. The files in there are not readable by anyone else.

But some people think /tmp is a good place to store intermediate data of their homegrown processes. Even when you clean up afterwards, this is not a good idea when dealing with sensitive data, unless you set the permissions in such a way that it is not readable by anyone else. But this is often forgotten. And when such a process fails, this data usually stays in /tmp for a long time.
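A check like this is easy to script. The sketch below is only an illustration, not the approach from the full post: it parses the output of `hdfs dfs -ls -R /tmp` and reports every path where the read bit for “others” is set. The function names and the sample listing are made up.

```python
import subprocess
from typing import List


def world_readable(ls_output: str) -> List[str]:
    """Return the paths in `hdfs dfs -ls -R` output that others can read."""
    exposed = []
    for line in ls_output.splitlines():
        fields = line.split(None, 7)
        # A listing line has 8 fields: permissions, replication, owner, group,
        # size, date, time, path. Skip anything else (e.g. "Found 3 items").
        if len(fields) != 8:
            continue
        perms, path = fields[0], fields[7]
        # perms looks like "drwxr-x---"; index 7 is the read bit for "others".
        if len(perms) >= 10 and perms[7] == "r":
            exposed.append(path)
    return exposed


def scan_tmp() -> List[str]:
    """Run the real listing; needs the hdfs CLI and a reachable cluster."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", "/tmp"],
        capture_output=True, text=True, check=True,
    )
    return world_readable(result.stdout)


# Made-up sample listing, only to show the parsing:
SAMPLE = """\
drwx------   - hive  hdfs          0 2018-04-19 10:12 /tmp/hive/alice
-rw-r--r--   3 bob   hdfs      52133 2018-04-19 10:15 /tmp/export/customers.csv
drwxrwxrwt   - hdfs  hdfs          0 2018-04-19 10:01 /tmp/export
"""
print(world_readable(SAMPLE))
# ['/tmp/export/customers.csv', '/tmp/export']
```

Note that this only looks at the classic permission bits. If ACLs are in use on the cluster, they would need a separate check.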

Posted in Learning Big Data

Tech dossier: Elasticsearch / the ELK stack

Because tech is moving so fast, I’ve been keeping dossiers in Evernote of open source products I have to learn more about, and I’ve decided to put them on my blog. My last one was about Kubernetes. This one is about Elasticsearch, the core of what is known as the ELK stack (Elasticsearch, Logstash and Kibana).


A short description – in English

Elasticsearch is a search engine. You can use it to quickly find keywords in a large collection of documents. But the people who came up with it must have realized at one point that you can also use the same technology to search in much more than text. You can use it to search any data.

Posted in Tech dossier

Tech dossier: Kubernetes

Because tech is moving so fast, I’ve been keeping dossiers in Evernote of open source products I have to learn more about. Like Kubernetes. This morning I suddenly thought this would be perfect for a blog... if properly organized. My plan is to add new interesting material as soon as I have it.

Do you have a cracking good thing to add? Let me know in the comments!

[Update January 7th 2019] Added Performance section and a couple of articles for my reading list.

[Update January 15th 2019] Added CNCF Best Practices for security.

[Update January 21st 2019] Added Kubernetes Failure Stories.


A short description – in English

This wasn’t in my original tech dossier, but I decided it could be helpful. What is Kubernetes? The standard answer is: it’s for container orchestration. What does that mean?

Posted in Kubernetes, Tech dossier

Making a Hertzsprung-Russell diagram from Gaia DR2 data with Elasticsearch

Elasticsearch was one of the open source products on my list to try out, ever since I got rejected for a couple of assignments as a consultant last year. Apparently it’s a popular product. But why do you need a search engine in a Big Data architecture? I explain this in my new video, where I load newly released data from ESA’s Gaia mission into Elasticsearch with Logstash and visualize it with Kibana.

I’ve also created an extra video wherein I explain how the code works.

You can get the code I’ve used at my Github page.


Posted in Learning Big Data, NoSQL

Codemotion Amsterdam 2018, day two

Back on the ferry to the north of Amsterdam I went, back for day two of Codemotion Amsterdam 2018.


Daniel Gebler from PicNic told us about what they are doing today to bring groceries home for people. I’ve seen two presentations by PicNic before and I could really see their progress from session to session.

Daniel explained how they use a recommender system to make it possible for customers to buy their most common groceries with one tap in the PicNic app. Which is actually hard. Even if your prediction has 90% precision for one item, that means that for a set of 12 items you only get about 28% precision (0.9 to the power of 12, assuming the items are predicted independently). So they really had to work to get a much better precision per item. They managed to do that by working with two dimensions of data: big and deep data.
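The arithmetic behind this is worth spelling out. The sketch below assumes that the predictions for the individual items are independent, so the chance of getting a whole basket right is the per-item precision raised to the number of items. Under that assumption 12 items at 90% come out at roughly 28%, and a figure of about 12% corresponds to roughly 20 items. The function names are made up for illustration.

```python
def basket_precision(per_item: float, items: int) -> float:
    """Chance that every item in the basket is predicted correctly,
    assuming independent predictions with the same per-item precision."""
    return per_item ** items


def required_per_item(target: float, items: int) -> float:
    """Per-item precision needed to reach `target` for the whole basket."""
    return target ** (1.0 / items)


print(round(basket_precision(0.9, 12), 3))   # 0.282
print(round(basket_precision(0.9, 20), 3))   # 0.122
print(round(required_per_item(0.9, 12), 4))  # 0.9913
```

So to get a basket of 12 items right 90% of the time, each single item has to be right more than 99% of the time, which shows why the per-item precision had to go up so much.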

Posted in Conferences, Events

Codemotion Amsterdam 2018, day one

Last Friday I almost felt I had to explain to a colleague that I don’t always win raffles and lotteries. Because yep, I won another ticket. Again via the Roaring Elephant podcast. It’s pretty worthwhile listening to them, is all I’m saying.

This was a ticket for CodeMotion Amsterdam 2018. CodeMotion is a conference for developers with topics like the blockchain, Big Data, Internet of Things, DevOps, software architectures, but also front-end development, game development and AR/VR.

Amsterdam from the ferry to the north of the city.

Posted in Conferences, Events

Dataworks Summit Berlin 2018, day two

Back for round two of keynotes, good technical sessions and discussing them with fellow data specialists in between.


First up was Frank Säuberlich from Teradata, who had an interesting example of machine learning for fraud detection at Danske Bank. They used transaction data sort of as pixels and ran that through a Convolutional Neural Network to find outliers. And they did. Before this solution they had many false positives. With this approach they managed to detect 50% (of 40%) more frauds, but the most important thing was that significantly fewer of the detected frauds were false positives.

Frank Säuberlich from Teradata on using CNNs for fraud detection.

After this John Keisler took a live poll with the question “How ready are we for GDPR?”. Only 23% felt they were ready. 51% were making preparations and 15% thought they wouldn’t be ready. 11% asked “What’s GDPR?”

Enza Iannopollo from Forrester Research thought we should embrace GDPR. She thinks that GDPR is the Copernican Revolution for many organizations. After getting ready for GDPR they should be able to find their data better and that’s not a bad thing.

Enza Iannopollo explained why we need to embrace GDPR. Nice slide design BTW.

Many organizations will be late to the party though, so Enza explained what they can still do in the remaining 35 days. She thinks that they should start by focusing on the data-driven initiatives that present the highest risk, prioritize them and make a roadmap to solve these issues. Then deploy the necessary security controls and re-engineer some essential processes later.

Next Jamie Engesser and Srikanth Venkat from Hortonworks presented and live demoed the new Data Steward Studio. With Data Steward Studio you can do many things. It has some of the capabilities that Atlas already has, but it can oversee multiple data lakes. It is touted as an Enterprise Data Catalog. Data Steward Studio can also discover and fingerprint data. In the live demo we were shown how it keeps track of a customer’s consent to storing and using their data, and how that consent can be revoked if the customer wants that.

A live demo of Data Steward Studio by Srikanth Venkat.

The last keynote was by Kamélia Benchekroun from Renault. Her job title is Data Lake squad leader. We were very impressed. I know one colleague who would like to adopt that title 🙂 . At Renault they are doing a lot with IoT these days, which is what you would expect. She talked about Renault’s experiences with it.

After the keynotes we went our separate ways to see sessions on different topics.


GDPR-focused partner community showcase for Apache Ranger and Apache Atlas

I went to this session presented by several speakers, introduced by Srikanth Venkat and Ali Bajwa from Hortonworks. Srikanth first talked briefly about the present and future of Ranger and Atlas. Ranger 1.0.0 can be expected in Q3 of 2018. And it will be extended into the non-Hadoop ecosystem. In the future you will be able to use Ranger on Azure Data Lake Store, Azure Blob Storage, EMC2 and with Apache HAWQ. HAWQ is an elastic query engine (sounds like Elasticsearch).

According to Srikanth, Atlas 1.0.0 will also be released in Q3, which is different from what I heard yesterday from Mandy Chessell from IBM, who told us it was a matter of weeks. Maybe he meant Hortonworks’ release. In HDP 2.6.4 (or Atlas 0.8.2) we already saw a NiFi connector. In the coming release we will see a Spark connector, which was sorely missing.

The Atlas ecosystem is getting larger.

Next were three partner presentations. Subra Ramesh showed how Dataguise automatically tags sensitive data in Atlas. He also showed a live demo of that. I didn’t know Dataguise, but I understand it started as a data masking product.

Marco Kopp of Syncsort showed the Atlas integration of their product, DMX-h. But the demo I was most impressed by was that of Synerscope. Jan-Kees Buenen explained that it will let customers manage their own consent.

Thomas Ploeger showed how Synerscope’s product (IXIWA?) is aware of all sensitive data in the data lake after scanning not just the columns, but the actual data. And when he searched for his car’s license plate, it showed exactly where in the data lake that data was.

In a live demo of Synerscope it was shown where sensitive data in the data lake is. (I didn’t have any better pictures of it)


Apache Metron in the Real World by Dave Russell

I had two reasons to go to this session: 1. I wanted to know more about Apache Metron. 2. It was given by Dave Russell from the Roaring Elephant podcast. I thought this session was one of the best of the conference in the sense of build-up and interaction with the audience.

Dave Russell on Apache Metron.

Apache Metron is a security product that detects breaches. Breaches are often only detected after more than 8 months. With Metron you are able to detect anomalies much, much faster.

A key role in this is played by Metron’s Profiler, which works out from the data that Metron collects what is normal usage and what deserves attention. It has multiple (machine learning) models to find that out. In Dave’s slides there was a whole list.

He would have done a live demo, but the WiFi was not able to handle it. Therefore we got to see a video of Metron in action. The auth demo showed a graph of connections between users and systems. Usually the relationship is many users on a system. And then there was “user 66”, who on his own had connections at some time with many systems. Something you would expect if that user, for example, had done port scans to search for “interesting” machines.
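The idea behind spotting a “user 66” can be shown with a toy example. To be clear, this is not how Metron’s Profiler works internally. It is just a minimal sketch of the principle: count the distinct systems each user connects to and flag users who are far above what is typical. The threshold factor and the events are made up.

```python
from collections import defaultdict
from statistics import median


def systems_per_user(events):
    """Count how many distinct systems each user connected to."""
    seen = defaultdict(set)
    for user, system in events:
        seen[user].add(system)
    return {user: len(systems) for user, systems in seen.items()}


def flag_outliers(events, factor=3.0):
    """Flag users who touch more than `factor` times the median number of
    systems. The factor is an arbitrary choice for this illustration."""
    counts = systems_per_user(events)
    typical = median(counts.values())
    return sorted(user for user, n in counts.items() if n > factor * typical)


# Made-up authentication events as (user, system) pairs.
events = [(f"user{i}", "sys-a") for i in range(1, 6)]
events += [(f"user{i}", "sys-b") for i in range(3, 8)]
events += [("user66", f"sys-{n}") for n in range(20)]

print(flag_outliers(events))  # ['user66']
```

A real deployment would look at behaviour over time windows and per user profile rather than at one global median, but the graph in the demo made the same point visually: one user fanning out to many systems stands out.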

Dave had many tips about how to set Metron up, like necessary storage, number of nodes and necessary resources. There is also a single node Metron AMI (Amazon Machine Image) where you can try things out. This is of course not sufficient for a production environment. For that you would rather need about 12 nodes. Different organizations think differently about where to store Metron’s data. After all, you might not want to store data about the possible malicious use of the data lake in that same data lake.


Practical experiences using Atlas and Ranger to implement GDPR by Magnus Runesson

Magnus Runesson from Svenska Spel provided us with experiences of using Atlas and Ranger. In Sweden amongst other games, games of chance are provided and regulated by Svenska Spel. They also try to prevent gambling addiction and understandably that is data of a very sensitive nature.

They came from a Cognos/Oracle environment and went to HDP 2.6 with Hive when the old system became too slow. They use Data Vault for data modeling and generate SQL from this with Oracle SQL Developer Data Modeler.

Magnus Runesson of Svenska Spel talks about the data lake at his organization.

Atlas is used for tag-based security in Ranger. But how do they tag that data? Magnus explained that this is done in the development process. When the model is made or changed, people who know the data are usually involved. They provide the information about sensitivity of data. All this ends up in three CSV files. An in-house built Policy Tool tags data in Atlas based on this.

This Policy Tool interested me, and it turns out we both have been trying to get the Ranger and Atlas REST APIs working for us. His experiences were very familiar to me. I also asked whether he would consider making his Policy Tool open source. He said he would discuss this in his organization.


An evaluation of TPC-H on Spark and Spark SQL in Aloja by Raphael Radowitz

I actually entered the room expecting a session on NiFi. Somewhere in my planning something went wrong. Instead this session was about Spark and Spark SQL performance. Raphael Radowitz had done extensive benchmarks to see which file formats were faster with which compression.

According to his research Parquet is 50% faster than text, 16% faster than ORC and Parquet with Snappy compression is 10% faster than ORC with Snappy.

As for Spark with Scala (with or without Metastore) versus SparkSQL: it depends. Not where the Metastore is concerned, because that adds overhead. But Spark with Scala is in some situations faster than SparkSQL, and in other situations it’s the other way around.

Raphael Radowitz discussing which TPC-H queries went faster with SparkSQL and which ones with Spark and Scala.


GDPR: the IBM journey to compliance by Richard Hogg

I told my colleagues at Port of Rotterdam I would jump on the grenade and follow all the GDPR sessions. So next I went to the session by Richard Hogg, global GDPR evangelist at IBM. And again I have to be honest here, though for a different reason: this was mainly a sales pitch. It had important information on what GDPR means for your organization, but the refrain often was “but have no fear, IBM is here”.

Some things I picked up: GDPR speaks of “personal data”, which is not the same as PII (Personally Identifiable Information). For example: IP addresses are not PII, but are personal data according to GDPR.

An interesting approach was using the blockchain so you don’t have to store any personal data at all. I had a similar conversation with a colleague from KPN in the tea break before this session, though not with the blockchain. Often personal information isn’t what you are looking for in a data lake anyway. So why store it?


Lessons learned from running Spark on Docker by Thomas Phelan

Thomas Phelan from BlueData shared his journey to run Spark on Docker. I really appreciate that he went to the trouble of explaining what Docker is and why you would want to use it. And I liked the interesting way he described the journey with terminology like “Navigate the river of container managers”, “Traverse the tightrope of network configs” and “Trip down the staircase of deployment mistakes”.

Thomas Phelan about running Spark on Docker

Docker is able to provide both the flexibility that data scientists want and the control that the IT department wants. He went to work with those end goals in mind. I made many notes, but I find them hard to summarize right now.



All in all Dataworks Summit 2018 was very worthwhile for me. You could say GDPR was the main theme of this edition and because I am very much involved with that at the moment, it was a hit for me. I really have a lot of takeaways that I have to process and share in our organization and things that I will approach in a new way.

The Estrel hotel in Berlin was the location of Dataworks Summit 2018.

And it was great to meet so many people working with the same products again. Last year I was completely new to it. This year I met with many friends. And I hope to see many of you again next year.

Posted in Conferences, Events