Dataworks Summit Berlin 2018, day two

Back for round two of keynotes, good technical sessions and discussing them with fellow data specialists in between.

Keynotes

First up was Frank Säuberlich from Teradata, who had an interesting example of machine learning for fraud detection at Danske Bank. They treated transaction data somewhat like pixels and ran it through a Convolutional Neural Network to find outliers. And they did: before this solution they found many false positives, but with this approach they managed to detect 50% (of 40%) more frauds, and, most importantly, with significantly fewer false positives among the detections.

Frank Säuberlich from Teradata on using CNNs for fraud detection.
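To make the “transactions as pixels” idea a bit more concrete, here is a minimal sketch of what such a model could look like in Keras. This is my own illustration, not Danske Bank’s actual model; the input shape, layer sizes and dummy data are all made up for the example.

```python
# A minimal CNN sketch in the spirit of "transactions as pixels".
# My own illustration, NOT Danske Bank's actual model: the input shape
# and layer sizes are invented for the example.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Pretend each customer is a 16x16 "image" of transaction features
# (e.g. amounts bucketed by time of day and merchant category).
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(16, 16, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of fraud
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train on labeled "transaction images" (here: random dummy data).
X = np.random.rand(100, 16, 16, 1)
y = np.random.randint(0, 2, size=100)
model.fit(X, y, epochs=1, verbose=0)
```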

After this John Keisler took a live poll with the question “How ready are we for GDPR?”. Only 23% felt they were ready, 51% were making preparations, 15% thought they wouldn’t be ready, and 11% asked “What’s GDPR?”

Enza Iannopollo from Forrester Research thought we should embrace GDPR. She sees GDPR as the Copernican Revolution for many organizations: after getting ready for it, they should be able to find their data better, and that’s not a bad thing.

Enza Iannopollo explained why we need to embrace GDPR. Nice slide design BTW.

Many organizations will be late to the party though, so Enza explained what they can still do in the remaining 35 days. She thinks they should start by focusing on the data-driven initiatives that present the highest risk, prioritize them and make a roadmap to solve these issues, then deploy the necessary security controls and re-engineer some essential processes later.

Next, Jamie Engesser and Srikanth Venkat from Hortonworks showed and live-demoed the new Data Steward Studio. It has some of the capabilities that Atlas already has, but it can oversee multiple data lakes, and it is touted as an Enterprise Data Catalog. Data Steward Studio can also discover and fingerprint data. In the live demo we were shown how it keeps track of customers’ consent to store and use their data, and how that consent can be revoked as well.

A live demo of Data Steward Studio by Srikanth Venkat.

The last keynote was by Kamélia Benchekroun from Renault. Her job title is Data Lake Squad Leader, which impressed us; I know one colleague who would like to adopt it 🙂 . As you would expect, Renault is doing a lot with IoT these days, and she talked about Renault’s experiences with it.

After that we went our separate ways to see sessions on different topics.

 

GDPR-focused partner community showcase for Apache Ranger and Apache Atlas

I went to this session presented by several speakers, introduced by Srikanth Venkat and Ali Bajwa from Hortonworks. Srikanth first talked briefly about the present and future of Ranger and Atlas. Ranger 1.0.0 can be expected in Q3 of 2018, and it will be extended beyond the Hadoop ecosystem: in the future you will be able to use Ranger with Azure Data Lake Store, Azure Blob Storage, EMC2 and Apache HAWQ. HAWQ is an elastic query engine (the name sounds a bit like Elasticsearch).

According to Srikanth, Atlas 1.0.0 will also be released in Q3, which differs from what I heard yesterday from Mandy Chessell from IBM, who told us it was a matter of weeks. Maybe he meant Hortonworks’ release. In HDP 2.6.4 (or Atlas 0.8.2) we already saw a NiFi connector. In the coming release we will see a Spark connector, which was sorely missing.

The Atlas ecosystem is getting larger.

Next were three partner presentations. Subra Ramesh showed, including a live demo, how Dataguise automatically tags sensitive data in Atlas. I didn’t know Dataguise; I understand it started as a data masking product.

Marco Kopp of Syncsort showed the Atlas integration of their product, DMX-h. But the demo I was most impressed by was that of Synerscope. Jan-Kees Buenen told us that it lets customers manage their own consent.

Thomas Ploeger showed how Synerscope’s product (IXIWA?) is aware of all sensitive data in the data lake after scanning not just the columns, but the actual data. And when he searched for his car’s license plate, it was shown exactly where in the data lake that data was.

In a live demo of Synerscope it was shown where sensitive data in the data lake is. (I didn’t have any better pictures of it.)

 

Apache Metron in the Real World by Dave Russell

I had two reasons to go to this session: 1. I wanted to know more about Apache Metron. 2. It was given by Dave Russell from the Roaring Elephant podcast. I thought this session was one of the best of the conference in terms of build-up and interaction with the audience.

Dave Russell on Apache Metron.

Apache Metron is a security product that detects breaches. Breaches are often only detected after more than 8 months. With Metron you are able to detect anomalies much, much faster.

A key role in this is played by Metron’s Profiler, which determines, from the data Metron collects, what is normal usage and what deserves attention. It has multiple (machine learning) models for this; Dave’s slides showed a whole list.
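To give an idea of what such profiling boils down to, here is my own heavily simplified sketch in plain Python: build a baseline per user and flag values that deviate strongly from it. Metron’s real Profiler works with Stellar expressions over streaming data, not with code like this.

```python
# A much-simplified illustration of what a profiler does; Metron's real
# Profiler uses Stellar expressions over streaming data, not this code.
from collections import defaultdict
from statistics import mean, stdev

# Dummy auth events: (user, system) pairs. Ten normal users plus one
# user that connects to far more systems than anyone else.
events = [(f"user{i}", f"srv{j}") for i in range(10) for j in range(2)]
events += [("user66", f"srv{j}") for j in range(40)]

# Profile: number of distinct systems each user connects to
systems_per_user = defaultdict(set)
for user, system in events:
    systems_per_user[user].add(system)

counts = {u: len(s) for u, s in systems_per_user.items()}
mu, sigma = mean(counts.values()), stdev(counts.values())

# Flag users that deviate strongly from the baseline
for user, n in counts.items():
    if sigma > 0 and (n - mu) / sigma > 2:
        print(f"suspicious: {user} connected to {n} systems (baseline ~{mu:.1f})")
```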

He would have done a live demo, but the WiFi was not able to handle it. Therefore we got to see a video of Metron in action. The auth demo showed a graph of connections between users and systems. Usually the relationship is many users to one system. And then there was “user 66”, who all on his own had connections with many systems at some point. Something you would expect if that user had, for example, done port scans to search for “interesting” machines.

Dave had many tips on how to set Metron up, like necessary storage, number of nodes and required resources. There is also a single-node Metron AMI (Amazon Machine Image) where you can try things out. This is of course not sufficient for a production environment; for that you would rather need about 12 nodes. Different organizations think differently about where to store Metron’s data. After all, you might not want to store data about the possible malicious use of the data lake in that same data lake.

 

Practical experiences using Atlas and Ranger to implement GDPR by Magnus Runesson

Magnus Runesson from Svenska Spel provided us with his experiences of using Atlas and Ranger. In Sweden, games of chance, amongst other games, are provided and regulated by Svenska Spel. They also try to prevent gambling addiction, and understandably the data involved is of a very sensitive nature.

They came from a Cognos/Oracle environment and went to HDP 2.6 with Hive when the old system became too slow. They use Data Vault for data modeling and generate SQL from it with Oracle SQL Developer Data Modeler.

Magnus Runesson of Svenska Spel talks about the data lake at his organization.

Atlas is used for tag-based security in Ranger. But how do they tag that data? Magnus explained that this is done in the development process: when the model is made or changed, people who know the data are usually involved, and they provide the information about how sensitive it is. All this ends up in three CSV files, and an in-house built Policy Tool tags the data in Atlas based on them.
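I couldn’t resist sketching what such CSV-driven tagging could look like against the Atlas v2 REST API. To be clear: this is my own guess at the idea, not Magnus’s Policy Tool; the CSV layout, tag names, credentials and Atlas URL are all assumptions.

```python
# A sketch of CSV-driven tagging via the Atlas v2 REST API.
# NOT Magnus's actual Policy Tool: the CSV layout, tag names,
# credentials and Atlas host below are my assumptions.
import csv
import requests

ATLAS = "http://atlas-host:21000/api/atlas/v2"   # hypothetical host
AUTH = ("admin", "admin")                        # hypothetical credentials

with open("sensitivity.csv") as f:               # rows like: default.customers@prod,PII
    for qualified_name, tag in csv.reader(f):
        # Look up the entity's GUID by qualifiedName (here: a Hive table)
        r = requests.get(f"{ATLAS}/search/attribute",
                         params={"attrName": "qualifiedName",
                                 "attrValuePrefix": qualified_name,
                                 "typeName": "hive_table"},
                         auth=AUTH)
        r.raise_for_status()
        entities = r.json().get("entities", [])
        if not entities:
            continue
        guid = entities[0]["guid"]
        # Attach the classification; Ranger's tag-based policies pick it up
        requests.post(f"{ATLAS}/entity/guid/{guid}/classifications",
                      json=[{"typeName": tag}], auth=AUTH).raise_for_status()
```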

This Policy Tool interested me, and it turns out we have both been trying to get the Ranger and Atlas REST APIs working for us; his experiences were very familiar to me. I also asked if he would consider making his Policy Tool open source. He said he would discuss it in his organization.

 

An evaluation of TPC-H on Spark and Spark SQL in Aloja by Raphael Radowitz

I actually entered the room expecting a session on NiFi; somewhere in my planning something went wrong. Instead, this session was about Spark and Spark SQL performance. Raphael Radowitz had done extensive benchmarks to see which combinations of file formats and compression were fastest.

According to his research, Parquet is 50% faster than text and 16% faster than ORC, and Parquet with Snappy compression is 10% faster than ORC with Snappy.
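Comparisons like this are easy to try on your own data. Here is a rough PySpark sketch of the idea; it is my own quick-and-dirty version, not the Aloja/TPC-H setup from the talk, and the paths and data are placeholders.

```python
# Rough sketch of a file-format comparison in PySpark; my own
# quick-and-dirty version, not the Aloja/TPC-H benchmark from the talk.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("format-bench").getOrCreate()

# Some dummy data; in a real test you'd use TPC-H tables
df = spark.range(10_000_000).withColumn("category", F.col("id") % 100)

for fmt in ["csv", "orc", "parquet"]:
    path = f"/tmp/bench/{fmt}"
    writer = df.write.mode("overwrite").option("header", True)
    if fmt != "csv":                      # Snappy for the columnar formats
        writer = writer.option("compression", "snappy")
    writer.format(fmt).save(path)

    start = time.perf_counter()
    result = (spark.read.option("header", True).format(fmt).load(path)
                   .groupBy("category").count().collect())
    print(f"{fmt}: {time.perf_counter() - start:.2f}s, {len(result)} groups")
```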

As for Spark with Scala versus Spark SQL (with or without the Metastore): it depends. Not on the Metastore, though; that only adds overhead. But Spark with Scala is in some situations faster than Spark SQL, and in other situations it’s the other way around.

Raphael Radowitz discussing which TPC-H queries ran faster with Spark SQL and which ones with Spark and Scala.

 

GDPR: the IBM journey to compliance by Richard Hogg

I told my colleagues at Port of Rotterdam I would jump on the grenade and follow all the GDPR sessions. So next I went to the session by Richard Hogg, global GDPR evangelist at IBM. And again, I have to be honest here, for a different reason: this was mainly a sales pitch. It had important information on what GDPR means for your organization, but the refrain often was “but have no fear, IBM is here”.

Some things I picked up: GDPR speaks of “personal data”, which is not the same as PII (Personally Identifiable Information). For example: IP addresses are not PII, but they are personal data according to GDPR.

An interesting approach was using the blockchain so you don’t have to store any personal data at all. I had a similar conversation with a colleague from KPN in the tea break before this session, though without the blockchain. Often personal information isn’t what you are looking for in a data lake anyway, so why store it?
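The same “keep a token, not the data” idea also works without a blockchain. Here is a trivial salted-hash pseudonymization sketch; this is my own illustration of the principle, not anything shown in the session.

```python
# Trivial pseudonymization sketch: store a salted hash instead of the
# personal data itself. My own illustration of "keep a token, not the
# data", not anything shown in the session.
import hashlib
import hmac

SECRET_SALT = b"keep-this-out-of-the-data-lake"  # hypothetical secret

def pseudonymize(value: str) -> str:
    """Deterministic token: same input -> same token, but not reversible
    without the salt, so the data lake never sees the raw value."""
    return hmac.new(SECRET_SALT, value.encode(), hashlib.sha256).hexdigest()

record = {"customer": pseudonymize("john.doe@example.com"), "amount": 42.50}
print(record)  # analyses can still join and count on the token
```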

 

Lessons learned from running Spark on Docker by Thomas Phelan

Thomas Phelan from BlueData shared his journey to run Spark on Docker. I really appreciate that he went to the trouble of explaining what Docker is and why you would want to use it. And I liked the interesting way he described the journey, with terminology like “Navigate the river of container managers”, “Traverse the tightrope of network configs” and “Trip down the staircase of deployment mistakes”.

Thomas Phelan about running Spark on Docker

Docker is able to provide both the flexibility that data scientists want and the control that the IT department wants, and he went to work with those end goals in mind. I made many notes, but I find them hard to summarize right now.

 

Conclusions

All in all, Dataworks Summit 2018 was very worthwhile for me. You could say GDPR was the main theme of this edition, and because I am very much involved with that at the moment, it was a hit for me. I have a lot of takeaways to process and share in our organization, and things that I will approach in a new way.

The Estrel hotel in Berlin was the location of Dataworks Summit 2018.

And it was great to again meet so many people working with the same products. Last year I was completely new to it; this year I met with many friends. I hope to see many of you again next year.


Building HDP 2.6 on AWS, Part 3: the worker nodes

This is part 3 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. By now we have an edge node to run Ambari Server and three master nodes for the Hadoop name nodes and such. Now we need worker nodes for processing the data.

Creating the worker nodes is not that much different from creating the master nodes, but the workers need more powerful machines.

Creating the first worker node

Log in at Amazon Web Services again, in the same AWS region as the edge and master nodes. We start with one worker node and clone two more later on. Go to the EC2 dashboard in the AWS interface and click “Launch instance”. Then choose Ubuntu Server 16.04 from the Amazon Machine Images. Continue reading
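If you would rather script this step than click through the console, a minimal boto3 sketch could look like this. The AMI ID, key pair, instance type and region below are placeholders, not the values from this series.

```python
# Minimal boto3 sketch of the console steps above; the AMI ID, key pair,
# instance type and region are placeholders you'd replace with your own.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # same region as the other nodes

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # the Ubuntu Server 16.04 AMI in your region
    InstanceType="m4.xlarge",        # workers need some muscle; pick to taste
    KeyName="my-hdp-keypair",
    MinCount=1, MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "hdp-worker-1"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```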


I feel great when I study

When I started studying Hadoop, Python and machine learning in 2016, I found out something I didn’t expect: I feel better when I study. When I finished another problem, exam or course and stepped outside the house to do some shopping or to go to work, I felt great.

And this effect is pretty consistent. Currently I’m in week 3 of MongoDB for DBAs at MongoDB University and in lecture 35 of Elasticsearch 6 and Elastic Stack on Udemy. And I just feel like I can take on the world.

So how come? I think it’s a feeling of control. I decide on the study program. It’s not something I had to write down in a personal development plan, and no one nagged me about it. I just thought “I need to know what Elasticsearch is” two weeks ago, found a course, and off I went.

It’s also a feeling of worthwhile productivity, that I spend my time on the planet well. And knowing that you are building a foundation of knowledge you can do lots of cool stuff with also works for me. I can’t wait to surprise people at work: “Actually, I do know MongoDB. And I’ve learned a thing or two about securing it.”

I don’t know if studying has this effect on everyone; I’m almost sure it doesn’t. Several people have asked me “you don’t have children, do you?” True. But I also rarely watch TV, and I don’t have Netflix. Because, while watching TV and series is fun, it doesn’t make me feel better. To be honest, social media and games are still on my list, but I know they are not there to make me feel better.

And in this fast-changing field of work, I think I can keep on learning things for a long time to come. As weird quirks go, it’s actually not a bad one to have. (Also, more videos to come.)


Playing with asteroids data in MongoDB

If there is one thing I learned when becoming a data engineer, it’s that having just Hadoop expertise is probably not enough. For starters: what it means to be a data engineer is not exactly sharply defined. Some say data engineers are (Java) developers. Some place data engineers more at the operations side. And at some organisations data engineers work with any combination of these products: Hadoop, ElasticSearch, MongoDB, Cassandra, relational databases and even less hip products.

So I thought it would be a good idea to broaden my horizons. One product that is used quite often, is MongoDB. MongoDB is a NoSQL database. And if you don’t exactly know what that means, I think you will get the idea after viewing this video I made.
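For a tiny taste in code of what “document database” means in practice, here is a minimal pymongo example; the database, collection and field names are just made up for the illustration.

```python
# Tiny taste of MongoDB as a document store; the database, collection
# and field names are just made up for this example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["space"]

# No schema to define up front: just insert a JSON-like document
db.asteroids.insert_one({
    "designation": "Ceres",
    "diameter_km": 939.4,
    "neo": False,
})

# And query it back
print(db.asteroids.find_one({"designation": "Ceres"}))
```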

Continue reading


Hadoop in a Hurry – Security

When talking about Hadoop security there are so many products and features. What do all of them do? This video gives a high-level overview.


I tried Lion’s Mane as a cognitive enhancer. Here are my experiences with it.

TL;DR

I tried Lion’s Mane from Four Sigmatic, which is branded as a cognitive enhancer. I’ve used it while studying Deep Neural Networks, amongst other things. I’ve done alternate weeks with and without Lion’s Mane, and in my experience the effect is indiscernible.

 

Why cognitive enhancer?

I often listen to Tim Ferriss’ podcast (The Tim Ferriss Show), in which he often advertises the wares of a company called Four Sigmatic. Apparently some of their mushroom coffees enhance cognitive abilities. That is of interest to me, because I’ve been studying a data science course on Coursera.org which had quite a lot of math, and later I got a new assignment as a consultant to dive rather deep into the (Hadoop/Big Data related) Apache Atlas and Ranger products.

I’m 47 years old and math is certainly not part of my daily life. In fact, I haven’t seen much math since my bachelor study twenty years ago (besides Coursera courses). I’m also learning a lot of new open source products as a data engineer. I can use all the cognitive abilities I can get. Continue reading


Recovering your HDP 2.6.1 Sandbox on VirtualBox after a restart

If you’ve worked with the Hortonworks Data Platform 2.x sandbox or later versions in VirtualBox and shut it down rather vigorously, you might have noticed that you won’t get past this startup screen the next time you try to start it up:

I had this happen a couple of times, and that’s why I decided to pause my sandbox every time and save it before shutting down my laptop. But yesterday Windows 10 decided to step in. After a day of studying it was high time for me to have dinner, during which I kept the laptop on. Little did I know that Windows 10 decided at that moment to update and restart, and to do this, it needed to shut down every application. Including VirtualBox. When I came back I found out to my horror that my carefully prepared HDP sandbox had been shut down in the roughest of ways. Thanks, Microsoft! Continue reading


Tutorial: Let’s throw some asteroids in Apache Hive

This is a tutorial on how to import data (with fixed-length fields) into Apache Hive (in Hortonworks Data Platform 2.6.1). The idea is that any non-Hive, non-Hadoop savvy people can follow along, so let me know if I succeeded (make sure you don’t look like comment spam though; I’m getting a lot of that lately, even though it never passes my approval).

Intro

Currently I’m studying for the Hortonworks Data Platform Certified Developer: Spark using Python exam (or HDPCD: Spark using Python). One part of the exam objectives is using SQL in Spark. Along the way you also work with Hive, the data warehouse software in Hadoop.

I was following the free Udemy HDPCD Spark using Python preparation course by ITVersity. The course is good BTW, especially for the price :). But after playing along with the Core Spark videos, the course again used the same boring revenue data for the Spark SQL part. And I thought: “I know SQL pretty well. Why not use data that is a bit more interesting?” And so I downloaded the Minor Planet Center’s asteroid data, which contains all the known asteroids up until at least yesterday. At this moment, that is about 745,000 lines of data. Continue reading
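As a teaser of the fixed-length part: in Spark with Python (the course’s context) you can slice fixed-width lines with substr. The column positions and file path below are illustrative only, not the actual Minor Planet Center layout; see the tutorial itself for that.

```python
# Slicing a fixed-length file in PySpark; the column positions and path
# here are illustrative only, NOT the actual Minor Planet Center layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("asteroids").getOrCreate()

raw = spark.read.text("/data/asteroids.dat")        # one string column: "value"
asteroids = raw.select(
    raw.value.substr(1, 7).alias("designation"),
    raw.value.substr(9, 5).cast("double").alias("abs_magnitude"),
)
asteroids.show(5)
```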


Fun with Data: Python and space rocks!

Last week I had a little fun playing with Python, the pandas and matplotlib libraries and a JSON file with asteroid data. Here is what I did.
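The gist of it, as a sketch; the file name and field names are placeholders for whatever your asteroid JSON actually contains.

```python
# The gist of the exercise; the file name and field names are
# placeholders for whatever your asteroid JSON actually contains.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json("asteroids.json")

# e.g. a quick look at the distribution of absolute magnitudes
df["abs_magnitude"].plot.hist(bins=50, title="Asteroid absolute magnitudes")
plt.xlabel("H (absolute magnitude)")
plt.show()
```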
