Dataworks Summit Berlin 2018, day two

Back for round two of keynotes, good technical sessions and discussing them with fellow data specialists in between.

Keynotes

First up was Frank Säuberlich from Teradata, who had an interesting example of machine learning for fraud detection at Danske Bank. They treated transaction data more or less as pixels and ran it through a Convolutional Neural Network to find outliers. And they did. Before this solution they had many false positives. With this approach they managed to detect 50% (of 40%) more fraud, but the most important thing was that significantly fewer of the detected cases were false positives.

Frank Säuberlich from Teradata on using CNNs for fraud detection.
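To make the "transactions as pixels" idea a bit more concrete, here is a minimal sketch of my own, not the approach shown in the talk. It assumes every transaction is a row of numeric features, scales each feature to 0..1 like a pixel intensity, and slides one small kernel over the result the way a convolutional layer would. The function names, the features and the kernel are all made up for illustration; a real CNN learns its kernels from data.

```python
# Hypothetical sketch: none of these names or values come from the talk.

def to_image(transactions):
    """Min-max scale each feature column to 0..1 so rows look like pixel rows."""
    columns = list(zip(*transactions))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        scaled_columns.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled_columns)]


def convolve2d(image, kernel):
    """Slide the kernel over the image without padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Made-up transactions: (amount, hour of day, merchant category)
transactions = [
    (10.0, 9, 1),
    (12.0, 10, 1),
    (5000.0, 3, 7),
    (11.0, 11, 1),
]
image = to_image(transactions)

# A hand-picked kernel that responds to jumps between consecutive rows
kernel = [[1.0, 1.0], [-1.0, -1.0]]
feature_map = convolve2d(image, kernel)

for row in feature_map:
    print([round(v, 3) for v in row])
```

The rows of the feature map around the third transaction stand out, because that transaction differs sharply from its neighbours. That is the kind of local pattern a convolutional network can pick up on.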

After this John Keisler took a live poll with the question “How ready are we for GDPR?”. Only 23% felt they were ready. 51% were making preparations and 15% thought they wouldn’t be ready. 11% asked “What’s GDPR?”

Enza Iannopollo from Forrester Research thought we should embrace GDPR. She thinks that GDPR is the Copernican Revolution for many organizations. After getting ready for GDPR they should be able to find their data better and that’s not a bad thing.

Enza Iannopollo explained why we need to embrace GDPR. Nice slide design BTW.

Many organizations will be late to the party though, so Enza explained what they can still do in the remaining 35 days. She thinks they should start by focusing on the data-driven initiatives that present the highest risk, prioritize them and make a roadmap to solve these issues. Then deploy the necessary security controls, and re-engineer some essential processes later.

Next Jamie Engesser and Srikanth Venkat from Hortonworks presented and live demoed the new Data Steward Studio. It has some of the capabilities that Atlas already has, but it can oversee multiple data lakes. It is touted as an Enterprise Data Catalog. Data Steward Studio can also discover and fingerprint data. In the live demo we were shown how it can keep track of customers’ consent to storing and using their data, and how that consent can be revoked if the customer wants to.

A live demo of Data Steward Studio by Srikanth Venkat.

The last keynote was by Kamélia Benchekroun from Renault. Her job title is Data Lake squad leader. We were very impressed. I know one colleague who would like to adopt it 🙂. At Renault they are doing a lot with IoT these days, and that is actually what you would expect. She talked about Renault’s experiences with it.

And that was the last keynote and we went our separate ways to see sessions on different topics.

GDPR-focused partner community showcase for Apache Ranger and Apache Atlas

I went to this session presented by several speakers, introduced by Srikanth Venkat and Ali Bajwa from Hortonworks. Srikanth first talked briefly about the present and future of Ranger and Atlas. Ranger 1.0.0 can be expected in Q3 of 2018, and it will be extended into the non-Hadoop ecosystem. In the future you will be able to use Ranger on Azure Data Lake Store, Azure Blob Storage, EMC2 and with Apache HAWQ. HAWQ is an elastic query engine (sounds like ElasticSearch).

According to Srikanth, Atlas 1.0.0 will also be released in Q3, which is different from what I heard yesterday from Mandy Chessell from IBM, who told us it was a matter of weeks. Maybe he meant Hortonworks’ release. In HDP 2.6.4 (or Atlas 0.8.2) we already saw a NiFi connector. In the coming release we will see a Spark connector, which was sorely missing.

The Atlas ecosystem is getting larger.

Next were three partner presentations. Subra Ramesh showed how Dataguise automatically tags sensitive data in Atlas. He also showed a live demo of that. I didn’t know Dataguise; I understand it started as a data masking product.

Marco Kopp of Syncsort showed the Atlas integration of their product, DMX-h. But the demo I was most impressed by was that of Synerscope. Jan-Kees Buenen said that it will allow you to let customers manage their own consent.

Thomas Ploeger showed how Synerscope’s product (IXIWA?) is aware of all sensitive data in the data lake after scanning not just the columns, but the actual data. And when he searched for his car’s license plate, it was shown exactly where in the data lake that data was.

In a live demo of Synerscope it was shown where sensitive data in the data lake is. (I didn’t have any better pictures of it)

Apache Metron in the Real World by Dave Russell

I had two reasons to go to this session: 1. I wanted to know more about Apache Metron. 2. It was given by Dave Russell from the Roaring Elephant podcast. I thought this session was one of the best of the conference in the sense of build-up and interaction with the audience.

Dave Russell on Apache Metron.

Apache Metron is a security product that detects breaches. Breaches are often only detected after more than 8 months. With Metron you are able to detect anomalies much, much faster.

A key role in this is played by Metron’s Profiler, which works out from the data that Metron collects what is normal usage and what deserves attention. It has multiple (machine learning) models to find that out. In Dave’s slides there was a whole list.

He would have done a live demo, but the WiFi was not able to handle it. Therefore we got to see a video of Metron in action. The auth demo showed a graph of connections between users and systems. Usually the relationship is many users on one system. And then there was “user 66”, who on his own had connections at some time with many systems. That is something you would expect if that user had, for example, done port scans to search for “interesting” machines.
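The "user 66" pattern is easy to illustrate. The sketch below is my own simplification and has nothing to do with how Metron's Profiler is actually implemented: it counts how many distinct systems each user connected to and flags anyone far above the median. The event format, the function names and the threshold factor are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import median


def systems_per_user(events):
    """Count the distinct systems each user has connected to."""
    seen = defaultdict(set)
    for user, system in events:
        seen[user].add(system)
    return {user: len(systems) for user, systems in seen.items()}


def unusual_users(events, factor=3):
    """Flag users who touched more than `factor` times the median number of systems."""
    counts = systems_per_user(events)
    typical = median(counts.values())
    return sorted(user for user, n in counts.items() if n > factor * typical)


# Made-up authentication events: (user, system)
events = [
    ("user1", "srv-a"),
    ("user2", "srv-a"),
    ("user3", "srv-b"),
    ("user1", "srv-a"),
]
# One user who connects to twenty different systems
events += [("user66", f"srv-{i}") for i in range(20)]

print(unusual_users(events))  # prints ['user66']
```

A fixed threshold like this is crude. A profiler that learns what is normal per user and per time window will do much better, but the underlying question is the same: who behaves very differently from everyone else?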

Dave had many tips about how to set Metron up, like necessary storage, number of nodes and necessary resources. There is also a single node Metron AMI (Amazon Machine Image) where you can try things out. This is of course not sufficient for a production environment. For that you would need about 12 nodes. Different organizations think differently about where to store Metron’s data. After all, you might not want to store data about the possible malicious use of the data lake in that same data lake.

Practical experiences using Atlas and Ranger to implement GDPR by Magnus Runesson

Magnus Runesson from Svenska Spel provided us with experiences of using Atlas and Ranger. In Sweden amongst other games, games of chance are provided and regulated by Svenska Spel. They also try to prevent gambling addiction and understandably that is data of a very sensitive nature.

They came from a Cognos/Oracle environment and went to HDP 2.6 with Hive when the old system became too slow. They use Data Vault for data modeling and generate SQL from this with Oracle SQL Developer Data Modeler.

Magnus Runesson of Svenska Spel talks about the data lake at his organization.

Atlas is used for tag-based security in Ranger. But how do they tag that data? Magnus explained that this is done in the development process. When the model is made or changed, people who know the data are usually involved. They provide the information about sensitivity of data. All this ends up in three CSV files. An in-house built Policy Tool tags data in Atlas based on this.
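I haven't seen the Policy Tool itself or the layout of those CSV files, so the following is only a guess at what the first step of such a tool could look like. It assumes a single CSV with the columns table, column and tag (my invention) and turns it into a plan of which tags belong on which column. Actually applying the tags through the Atlas REST API is deliberately left out.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout; the real files at Svenska Spel may look very different.
SAMPLE_CSV = """table,column,tag
customer,ssn,PII
customer,email,PII
customer,email,CONTACT
bets,amount,SENSITIVE
"""


def load_tags(csv_text):
    """Group the tags per (table, column) pair."""
    tags = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        tags[(row["table"], row["column"])].add(row["tag"])
    return {key: sorted(values) for key, values in tags.items()}


tag_plan = load_tags(SAMPLE_CSV)
for (table, column), tags in sorted(tag_plan.items()):
    print(f"{table}.{column}: {', '.join(tags)}")
```

Keeping the sensitivity information in plain files like this means it can live in version control next to the data model, which fits the idea of tagging as part of the development process.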

This Policy Tool interested me, and it turns out we both have been trying to get the Ranger and Atlas REST APIs working for us. His experiences were very familiar to me. I also asked if he would consider making his Policy Tool open source. He said he would discuss this in his organization.

An evaluation of TPC-H on Spark and Spark SQL in Aloja by Raphael Radowitz

I actually entered the room expecting a session on NiFi. Somewhere in my planning something went wrong. Instead this session was about Spark and Spark SQL performance. Raphael Radowitz had done extensive benchmarks to see which combinations of file formats and compression were fastest.

According to his research Parquet is 50% faster than text, 16% faster than ORC and Parquet with Snappy compression is 10% faster than ORC with Snappy.

As for Spark with Scala (with or without the Metastore) versus SparkSQL: it depends. Not where the Metastore is concerned: that adds overhead. But Spark with Scala is faster than SparkSQL in some situations, and in other situations it’s the other way around.

Raphael Radowitz discussing which TPC-H queries went faster with SparkSQL and which ones with Spark and Scala.

GDPR: the IBM journey to compliance by Richard Hogg

I’ve told my colleagues at Port of Rotterdam I would jump on the grenade and follow all the GDPR sessions. So next I went to the session by Richard Hogg, global GDPR evangelist at IBM. And again I have to be honest here, though for a different reason: this was mainly a sales pitch. It had important information on what GDPR means for your organization, but the refrain often was “but have no fear, IBM is here”.

Some things I picked up: GDPR speaks of “personal data”, which is not the same as PII (Personally Identifiable Information). For example: IP addresses are not PII, but they are personal data according to GDPR.

An interesting approach was using the blockchain so you don’t have to store any personal data at all. I had a similar conversation with a colleague from KPN in the tea break before this session, though not with the blockchain. Often personal information isn’t what you are looking for in a data lake anyway. So why store it?

Lessons learned from running Spark on Docker by Thomas Phelan

Thomas Phelan from BlueData shared his journey to run Spark on Docker. I really appreciate that he went to the trouble of explaining what Docker is and why you would want to use it. And I liked the interesting way he described the journey with terminology like “Navigate the river of container managers”, “Traverse the tightrope of network configs” and “Trip down the staircase of deployment mistakes”.

Thomas Phelan about running Spark on Docker

Docker is able to provide both the flexibility that data scientists want and the control that the IT department wants. He went to work with those end goals in mind. I made many notes, but I find them hard to summarize right now.

Conclusions

All in all Dataworks Summit 2018 was very worthwhile for me. You could say GDPR was the main theme of this edition and because I am very much involved with that at the moment, it was a hit for me. I really have a lot of takeaways that I have to process and share in our organization and things that I will approach in a new way.

The Estrel hotel in Berlin was the location of Dataworks Summit 2018.

And it was great to meet so many people working with the same products again. Last year I was completely new to it. This year I met with many friends. And I hope to see many of you again next year.

Published by Marcel-Jan Krijgsman on April 21, 2018
