Making a Hertzsprung-Russell diagram from Gaia DR2 data with Elasticsearch

Elasticsearch was one of the open source products on my list to try out, ever since I got rejected for a couple of assignments as a consultant last year. Apparently it's a popular product. But why do you need a search engine in a Big Data architecture? I explain this in my new video, in which I load newly released data from ESA's Gaia mission into Elasticsearch with Logstash and visualize it with Kibana.

I’ve also created an extra video wherein I explain how the code works.

You can get the code I've used at my GitHub page.
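To give a flavour of the kind of pipeline the video covers, here is a minimal Python sketch that computes the absolute magnitude from the Gaia parallax and bulk-indexes the rows into Elasticsearch. It assumes a local Elasticsearch instance, the official elasticsearch Python client and a CSV extract with the DR2 column names parallax, phot_g_mean_mag and bp_rp; the index name hr_diagram is made up for the example, and the video itself uses Logstash rather than the Python client.

```python
import csv
import math

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # assumes a local, unsecured cluster


def actions(path):
    """Yield bulk-index actions for stars with a usable (positive) parallax."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                parallax = float(row["parallax"])      # in milliarcseconds
                g_mag = float(row["phot_g_mean_mag"])  # apparent G magnitude
                bp_rp = float(row["bp_rp"])            # colour index
            except (KeyError, ValueError):
                continue
            if parallax <= 0:
                continue
            # Absolute magnitude from parallax in mas: M = m + 5 * log10(parallax) - 10
            abs_mag = g_mag + 5 * math.log10(parallax) - 10
            yield {
                "_index": "hr_diagram",
                "_source": {"bp_rp": bp_rp, "abs_mag": abs_mag},
            }


helpers.bulk(es, actions("gaia_dr2_sample.csv"))  # hypothetical CSV extract of DR2
```

Plotting abs_mag (with the axis reversed) against bp_rp in Kibana then gives the familiar Hertzsprung-Russell shape.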

 


Codemotion Amsterdam 2018, day two

Back on the ferry to the north of Amsterdam I went, back for day two of Codemotion Amsterdam 2018.

Keynote

Daniel Gebler from PicNic told us about what they are doing today to bring groceries to people's homes. I've seen two presentations by PicNic before and I could really see their progress from session to session.

Daniel explained how they use a recommender system to make it possible for customers to buy their most common groceries with one tap in the PicNic app. Which is actually hard. Even if your prediction has 90% precision per item, a basket of 12 items is only predicted entirely correctly about 28% of the time (0.9¹²). So they really had to work to get a much better precision per item. They managed to do that by working with two dimensions of data: big and deep data.
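A quick back-of-the-envelope check of that compounding effect, as a small Python snippet (the 90% per-item figure is the one from the talk; the basket size is just a parameter):

```python
def basket_precision(per_item_precision: float, basket_size: int) -> float:
    """Probability that every item in the basket is predicted correctly,
    assuming independent per-item predictions."""
    return per_item_precision ** basket_size


print(basket_precision(0.90, 12))  # ~0.28
print(basket_precision(0.99, 12))  # ~0.89 -- why the per-item precision has to go way up
```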


Codemotion Amsterdam 2018, day one

Last Friday I almost felt I had to explain to a colleague that I don't always win raffles and lotteries. Because yep, I won another ticket. Again via the Roaring Elephant podcast. It's pretty worthwhile listening to them, is all I'm saying.

This was a ticket for CodeMotion Amsterdam 2018. CodeMotion is a conference for developers with topics like the blockchain, Big Data, Internet of Things, DevOps, software architectures, but also front-end development, game development and AR/VR.

Amsterdam from the ferry to the north of the city.



Starting at Port of Rotterdam on 1 May 2018

Next week (1 May 2018) I will start as a Hadoop specialist/data steward/data custodian/data something something on the Advanced Analytics team at Port of Rotterdam. We haven't worked out a fancy data something title yet. I'm already working with this team as a consultant; I've been involved with security and data governance of the data lake (for people outside Big Data: a data lake is simply a Hadoop cluster).

The World Port Center



Dataworks Summit Berlin 2018, day two

Back for round two of keynotes, good technical sessions and discussing them with fellow data specialists in between.

Keynotes

First up was Frank Säuberlich from Teradata, who had an interesting example of machine learning for fraud detection at Danske Bank. They treated transaction data more or less as pixels and ran it through a convolutional neural network to find outliers. And they did. The previous solution produced many false positives; with this approach they managed to detect considerably more frauds (around 50% more), but the most important thing was that the detections contained significantly fewer false positives.
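To make the "transactions as pixels" idea concrete, here is a minimal, self-contained Keras sketch of the general approach, not Danske Bank's actual model: each transaction is encoded as a small fixed-size grid of features and a tiny CNN learns to classify it as fraudulent or not. The shapes, layer sizes and random training data are all made up for the example.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.metrics import Precision, Recall

# Pretend every transaction has already been encoded as an 8x8 "image" of normalised features.
X = np.random.rand(1000, 8, 8, 1).astype("float32")
y = (np.random.rand(1000) < 0.02).astype("float32")  # ~2% fraud labels, purely for the demo

model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(8, 8, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability that the transaction is fraudulent
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[Precision(), Recall()])  # the point of the talk: fewer false positives
model.fit(X, y, epochs=3, batch_size=64, validation_split=0.2)
```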

Frank Säuberlich from Teradata on using CNNs for fraud detection.

After this, John Keisler took a live poll with the question "How ready are we for GDPR?". Only 23% felt they were ready, 51% were making preparations, 15% thought they wouldn't be ready and 11% asked "What's GDPR?".

Enza Iannopollo from Forrester Research thought we should embrace GDPR. She thinks that GDPR is the Copernican Revolution for many organizations. After getting ready for GDPR they should be able to find their data more easily, and that's not a bad thing.

Enza Iannopollo explained why we need to embrace GDPR. Nice slide design BTW.

Many organizations will be late to the party though, so Enza explained what they can still do in the remaining 35 days. She thinks they should start by focusing on the data-driven initiatives that present the highest risk, prioritize them and make a roadmap to solve those issues. Then deploy the necessary security controls and re-engineer some essential processes later.

Next, Jamie Engesser and Srikanth Venkat from Hortonworks showed and live-demoed the new Data Steward Studio. With Data Steward Studio you can do many things. It has some of the capabilities that Atlas already has, but it can oversee multiple data lakes, and it is touted as an Enterprise Data Catalog. Data Steward Studio can also discover and fingerprint data. In the live demo we were shown how it can keep track of customers' consent to store and use their data, and how that consent can be revoked when the customer wants to.

A live demo of Data Steward Studio by Srikanth Venkat.

The last keynote was by Kamélia Benchekroun from Renault. Her job title is Data Lake squad leader. We were very impressed; I know one colleague who would like to adopt that title 🙂 . At Renault they are doing a lot with IoT these days, which is about what you would expect. She talked about Renault's experiences with it.

After that we went our separate ways to see sessions on different topics.

 

GDPR-focused partner community showcase for Apache Ranger and Apache Atlas

I went to this session presented by several speakers, introduced by Srikanth Venkat and Ali Bajwa from Hortonworks. Srikanth first talked briefly about the present and future of Ranger and Atlas. Ranger 1.0.0 can be expected in Q3 of 2018, and it will be extended beyond the Hadoop ecosystem: in the future you will be able to use Ranger with Azure Data Lake Store, Azure Blob Storage, EMC2 and Apache HAWQ. HAWQ is an elastic query engine (which sounds a bit like Elasticsearch).

According to Srikanth, Atlas 1.0.0 will also be released in Q3, which is different from what I heard yesterday from Mandy Chessell from IBM, who told us it was a matter of weeks. Maybe he meant Hortonworks' release. In HDP 2.6.4 (or Atlas 0.8.2) we already saw a NiFi connector. In the coming release we will see a Spark connector, which was sorely missing.

The Atlas ecosystem is getting larger.

Next were three partner presentations. Subra Ramesh showed, including a live demo, how Dataguise automatically tags sensitive data in Atlas. I didn't know Dataguise; I understand it started as a data masking product.

Marco Kopp of Syncsort showed the Atlas integration of their product, DMX-h. But the demo I was most impressed by was that of Synerscope. Jan-Kees Buenen told us that it will allow customers to manage their own consent.

Thomas Ploeger showed how Synerscope's product (IXIWA?) is aware of all sensitive data in the data lake after scanning not just the columns, but the actual data. When he searched for his car's license plate, it showed exactly where in the data lake that data was.

In a live demo of Synerscope it was shown where sensitive data in the data lake resides. (I didn't have any better pictures of it.)

 

Apache Metron in the Real World by Dave Russell

I had two reasons to go to this session: 1. I wanted to know more about Apache Metron. 2. It was given by Dave Russell from the Roaring Elephant podcast. I thought this session was one of the best of the conference in terms of build-up and interaction with the audience.

Dave Russell on Apache Metron.

Apache Metron is a security product that detects breaches. Breaches are often only detected after more than 8 months. With Metron you are able to detect anomalies much, much faster.

A key role in this is played by Metron's Profiler, which works out from the data Metron collects what normal usage looks like and what deserves attention. It has multiple (machine learning) models to do that; in Dave's slides there was a whole list.

He would have done a live demo, but the WiFi was not able to handle it, so instead we got to see a video of Metron in action. The auth demo showed a graph of connections between users and systems. Usually the relationship is many users on one system. And then there was "user 66", who on his own had connections with many systems at some point in time. Something you would expect if that user had, for example, done port scans to search for "interesting" machines.
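As a rough illustration of what such a profile boils down to (not Metron's actual Profiler configuration, just the idea in plain Python): count how many distinct systems each user connects to in a time window and flag the users that sit far above the rest.

```python
from collections import defaultdict
from statistics import median

# Hypothetical auth events: (user, destination_host), e.g. parsed from auth logs.
events = [
    ("alice", "hr-db"), ("alice", "mail"), ("bob", "mail"),
    ("user66", "host-01"), ("user66", "host-02"), ("user66", "host-03"),
    ("user66", "host-04"), ("user66", "host-05"), ("user66", "host-06"),
]

hosts_per_user = defaultdict(set)
for user, host in events:
    hosts_per_user[user].add(host)

counts = {user: len(hosts) for user, hosts in hosts_per_user.items()}
threshold = 2 * max(median(counts.values()), 1)  # crude "far above normal" cut-off

for user, count in counts.items():
    if count > threshold:
        print(f"suspicious fan-out: {user} contacted {count} distinct systems")
```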

Dave had many tips about how to set Metron up, like necessary storage, number of nodes and necessary resources. There is also a single-node Metron AMI (Amazon Machine Image) where you can try things out. This is of course not sufficient for a production environment; for that you would rather need about 12 nodes. Different organizations think differently about where to store Metron's data. After all, you might not want to store data about the possible malicious use of the data lake in that same data lake.

 

Practical experiences using Atlas and Ranger to implement GDPR by Magnus Runesson

Magnus Runesson from Svenska Spel provided us with his experiences of using Atlas and Ranger. In Sweden, games of chance are, among other games, provided and regulated by Svenska Spel. They also try to prevent gambling addiction, and understandably that involves data of a very sensitive nature.

They came from a Cognos/Oracle environment and moved to HDP 2.6 with Hive when the old system became too slow. They use Data Vault for data modeling and generate SQL from this with Oracle SQL Developer Data Modeler.

Magnus Runesson of Svenska Spel talks about the data lake at his organization.

Atlas is used for tag-based security in Ranger. But how do they tag that data? Magnus explained that this is done in the development process: when the model is made or changed, people who know the data are usually involved, and they provide the information about the sensitivity of the data. All of this ends up in three CSV files, and an in-house-built Policy Tool tags the data in Atlas based on them.
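I haven't seen Magnus' Policy Tool, but a minimal sketch of that kind of CSV-driven tagging could look like the Python below. It assumes the Atlas V2 REST endpoints for looking up a Hive table by qualified name and attaching a classification, plus a made-up CSV with columns qualified_name and tag; the host, credentials and type names are placeholders and the real tool will of course differ.

```python
import csv
import requests

ATLAS = "http://atlas-host:21000/api/atlas/v2"
AUTH = ("admin", "admin")  # placeholder credentials


def tag_hive_table(qualified_name: str, tag: str) -> None:
    """Look up a hive_table entity by qualified name and attach a classification to it."""
    entity = requests.get(
        f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
        params={"attr:qualifiedName": qualified_name},
        auth=AUTH,
    ).json()["entity"]
    requests.post(
        f"{ATLAS}/entity/guid/{entity['guid']}/classifications",
        json=[{"typeName": tag}],  # e.g. a "PII" tag that Ranger tag-based policies act on
        auth=AUTH,
    ).raise_for_status()


with open("sensitivity.csv", newline="") as f:      # hypothetical input file
    for row in csv.DictReader(f):                   # columns: qualified_name, tag
        tag_hive_table(row["qualified_name"], row["tag"])
```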

This Policy Tool interested me, and it turns out we have both been trying to get the Ranger and Atlas REST APIs working for us; his experiences sounded very familiar to me. I also asked if he would consider making his Policy Tool open source. He said he would discuss it in his organization.

 

An evaluation of TPC-H on Spark and Spark SQL in Aloja by Raphael Radowitz

I actually entered the room expecting a session on NiFi; somewhere in my planning something went wrong. Instead this session was about Spark and Spark SQL performance. Raphael Radowitz had done extensive benchmarks to see which combinations of file formats and compression were fastest.

According to his research Parquet is 50% faster than text, 16% faster than ORC and Parquet with Snappy compression is 10% faster than ORC with Snappy.

As for Spark with Scala versus Spark SQL (with or without the Metastore), it depends. Not on the Metastore: that just adds overhead. But Spark with Scala is in some situations faster than Spark SQL, and in other situations it's the other way around.
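For context, this is roughly what such a comparison looks like in practice: a small PySpark sketch that writes Snappy-compressed Parquet and then runs the same aggregation through the DataFrame API and through Spark SQL. The file paths are made up, and the column names come from the TPC-H lineitem table the benchmarks were based on; the talk itself used Scala and the full TPC-H query set.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-vs-sql-sketch").getOrCreate()

# Convert a raw CSV extract to Snappy-compressed Parquet (Snappy is Spark's default codec).
raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/lineitem.csv")
raw.write.mode("overwrite").option("compression", "snappy").parquet("/data/lineitem_parquet")

lineitem = spark.read.parquet("/data/lineitem_parquet")

# Same aggregation twice: once via the DataFrame API...
df_result = lineitem.groupBy("l_returnflag").agg(F.sum("l_quantity").alias("sum_qty"))

# ...and once via Spark SQL on a temporary view.
lineitem.createOrReplaceTempView("lineitem")
sql_result = spark.sql(
    "SELECT l_returnflag, SUM(l_quantity) AS sum_qty FROM lineitem GROUP BY l_returnflag"
)

df_result.show()
sql_result.show()
```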

Raphael Radowitz discussing which TPC-H queries went faster with Spark SQL and which ones with Spark and Scala.

 

GDPR: the IBM journey to compliance by Richard Hogg

I had told my colleagues at Port of Rotterdam that I would jump on the grenade and follow all the GDPR sessions. So next I went to the session by Richard Hogg, global GDPR evangelist at IBM. And again, I have to be honest here, for a different reason: this was mainly a sales pitch. It had important information on what GDPR means for your organization, but the refrain often was "but have no fear, IBM is here".

Some things I picked up: GDPR speaks of "personal data", which is not the same as PII (Personally Identifiable Information). For example: IP addresses are not PII, but they are personal data according to the GDPR.

An interesting approach was using the blockchain so you don't have to store any personal data at all. I had a similar conversation with a colleague from KPN in the tea break before this session, though not about the blockchain. Often personal information isn't what you are looking for in a data lake anyway, so why store it?

 

Lessons learned from running Spark on Docker by Thomas Phelan

Thomas Phelan from BlueData shared his journey to run Spark on Docker. I really appreciate that he went to the trouble of explaining what Docker is and why you would want to use it. And I liked the interesting way he described the journey, with terminology like "Navigate the river of container managers", "Traverse the tightrope of network configs" and "Trip down the staircase of deployment mistakes".

Thomas Phelan about running Spark on Docker

Docker is able to provide both the flexibility that data scientists want and the control that the IT department wants, and he went to work with those end goals in mind. I made many notes, but I find them hard to summarize right now.

 

Conclusions

All in all Dataworks Summit 2018 was very worthwhile for me. You could say GDPR was the main theme of this edition, and because I am very much involved with that at the moment, it was a hit for me. I have a lot of takeaways to process and share in our organization, and things that I will now approach in a new way.

The Estrel hotel in Berlin was the location of Dataworks Summit 2018.

And it was great to again meet so many people working with the same products. Last year I was completely new to it; this year I met with many friends. I hope to see many of you again next year.


Building HDP 2.6 on AWS, Part 3: the worker nodes

This is part 3 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. By now we have an edge node to run Ambari Server and three master nodes for the Hadoop name nodes and such. Now we need worker nodes for processing the data.

Creating the worker nodes is not that much different from creating the master nodes, but the workers need to be more powerful machines.

Creating the first worker node

Log in at Amazon Web Services again, in the same AWS region as the edge and master nodes. We start with one worker node and clone two more later on. Go to the EC2 dashboard in the AWS interface and click "Launch instance". Then choose Ubuntu Server 16.04 from the Amazon Machine Images.
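If you prefer scripting these console steps, the same launch can be done with boto3. This is only a sketch: the region, AMI ID, key pair, security group, instance type and volume size are placeholders you would replace with the values used for your other nodes.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # same region as the edge and master nodes

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",            # placeholder: an Ubuntu Server 16.04 AMI in your region
    InstanceType="m4.2xlarge",         # placeholder: pick something beefier than the masters
    KeyName="hdp-cluster-key",         # placeholder: the key pair used for the other nodes
    SecurityGroupIds=["sg-xxxxxxxx"],  # placeholder: the cluster's security group
    MinCount=1,
    MaxCount=1,                        # start with one worker; clone two more later
    BlockDeviceMappings=[
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 500, "VolumeType": "gp2"}}
    ],
    TagSpecifications=[
        {"ResourceType": "instance", "Tags": [{"Key": "Name", "Value": "hdp-worker-1"}]}
    ],
)
print(response["Instances"][0]["InstanceId"])
```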


I feel great when I study

When I started studying Hadoop, Python and machine learning in 2016, I found something out that I didn’t expect. I feel better when I study. When I finished another problem, exam or course, and I stepped outside the house to do some shopping or to go to work, I felt great.

And this effect is pretty consistent. Currently I’m in week 3 of MongoDB for DBAs at MongoDB University and in lecture 35 of Elasticsearch 6 and Elastic Stack on Udemy. And I just feel like I can take on the world.

So how come? I think it's a feeling of control. I decide on the study program; it's not something I had to write up in a personal development plan. No one nagged me about it. I just thought "I need to know what Elasticsearch is" two weeks ago, found a course and there I went.

It's also a feeling of worthwhile productivity, that I'm spending my time on the planet well. And knowing that you are building a foundation of knowledge you can do lots of cool stuff with also works for me. I can't wait to surprise people at work: "Actually, I do know MongoDB. And I've learned a thing or two about securing it."

I don't know if studying has this effect on everyone; I'm almost sure it doesn't. Several people asked me "you don't have children, do you?" True. But I also rarely watch TV and I don't have Netflix. Because, while watching TV and series is fun, it doesn't make me feel better. To be honest, social media and games are still on my list, but I know they are not there to make me feel better.

And in this fast-changing field of work, I think I can keep on learning things for a long time to come. It's actually not a bad quirk to have. (Also, more videos to come.)


Playing with asteroids data in MongoDB

If there is one thing I learned when becoming a data engineer, it's that having just Hadoop expertise is probably not enough. For starters, what it means to be a data engineer is not exactly sharply defined. Some say data engineers are (Java) developers; some place data engineers more on the operations side. And at some organisations data engineers work with any combination of these products: Hadoop, Elasticsearch, MongoDB, Cassandra, relational databases and even less hip products.

So I thought it would be a good idea to broaden my horizons. One product that is used quite often is MongoDB. MongoDB is a NoSQL database, and if you don't exactly know what that means, I think you will get the idea after watching this video I made.
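To give a flavour of what working with such a document database looks like, here is a minimal pymongo sketch along the lines of the video. The database, collection and field names are made up for the example, and it assumes a MongoDB instance running locally.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
asteroids = client["solar_system"]["asteroids"]  # hypothetical database and collection

# Documents don't need a fixed schema; fields can differ per asteroid.
asteroids.insert_many([
    {"name": "Ceres", "diameter_km": 939.4, "discovered": 1801},
    {"name": "Vesta", "diameter_km": 525.4, "discovered": 1807, "visited_by": ["Dawn"]},
    {"name": "Bennu", "diameter_km": 0.49, "discovered": 1999},
])

# Query: all asteroids larger than 100 km, largest first.
for doc in asteroids.find({"diameter_km": {"$gt": 100}}).sort("diameter_km", -1):
    print(doc["name"], doc["diameter_km"])
```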



Hadoop in a Hurry – Security

When talking about Hadoop security there are so many products and features. What do all of them do? This video gives a high-level overview.
