I’m back at Dataworks Summit this year. This time I didn’t win a ticket, but my new employer, Port of Rotterdam, arranged for me to go. Pretty cool, because I did not want to miss it. This time it’s happening in Berlin.
It started with keynotes. Scott Gnau from Hortonworks announced Data Steward Studio for better data governance. Scott’s message was that your data strategy is your cloud strategy is your business strategy. You should not see them as totally different things.
IBM’s Mandy Chessell’s session was a more technical one about data governance. She talked about a data catalog that can give the right answers to your questions. Atlas can be that data catalog. As we’ll see later, we’re talking here specifically about Atlas 1.0. She also mentioned a new initiative on ODPi.org to create an open metadata ecosystem.
Bernard Marr shared lots of examples of Big Data successes and failures. They were mostly successes. An interesting one was a local butcher shop that counted the signals of mobile phones passing by. From this the shop owner learned that lots of people passed by at about 22:00. They decided to open the shop for an hour at that time and made 50% more profit. The example of “smart streetlights” that listen for gun sounds made us think of Big Brother listening in everywhere in the city. Not exactly comforting.
Andreas Kohlmaier told how they built a data lake at Munich RE. When it was made available to employees, more and more people went in to have a look. At first they worried whether they could handle all that load. But then the users found out it was actually more like a data pool than a data lake. So there was a lesson there: don’t start with a blank slate.
Very interesting was how Munich RE managed to “fill the lake” with the data people were looking for. They had a data “hunting team”: if you were looking for interesting data sources, that team would track them down, clean them and prepare them. Great session. I like these honest sessions about Big Data.
Apache Spark 2.3 boosts advanced analytics and deep learning with Python by Yanbo Liang
Then it was time for the breakout sessions. I went to Yanbo Liang’s session about Spark 2.3. This new version of Spark has vectorized Python User Defined Functions (UDFs). Before 2.3, Python went through Spark data row by row, which isn’t the most efficient way. Now you can get your data in column-based chunks, which can make your Python code a lot faster.
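The difference can be sketched with plain pandas (Spark’s vectorized UDFs exchange pandas Series with the JVM via Apache Arrow, and in PySpark you’d register such a function with `pyspark.sql.functions.pandas_udf`). The `plus_one` function below is my own made-up example, not from the talk:

```python
import pandas as pd

# Pre-2.3 style: Spark called the Python function once per row,
# serializing every value individually (slow).
def plus_one_row(x):
    return x + 1

# 2.3 style: a vectorized (pandas) UDF receives a whole column chunk
# as a pandas Series and returns a Series of the same length.
def plus_one_vectorized(col: pd.Series) -> pd.Series:
    return col + 1  # one vectorized operation over the entire chunk

df = pd.DataFrame({"v": range(5)})

row_result = df["v"].map(plus_one_row)     # element at a time
vec_result = plus_one_vectorized(df["v"])  # whole column at once

assert row_result.equals(vec_result)
```

Both produce the same result, but the vectorized version avoids the per-row Python call and serialization overhead, which is where the speedup comes from.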
There are also important improvements for MLlib. Spark now enables deep learning image analysis, and it’s easier to bring models to production. He showed a number of interesting demos that I hope will appear online soon.
Ozone and HDFS’s evolution by Sanjay Radia
Sanjay Radia gave us a view of the future of HDFS. Although HDFS is excellent at scaling IO and concurrent clients, it also has some limitations, and these manifest themselves mainly in the namespace. For example: you can’t have more than 500 million files.
The solution will be Ozone for the namespace and Hadoop Distributed Data Storage (HDDS) for storage of “containers of blocks”. Sanjay Radia went into detail about how this works and the choices that were made to achieve this improvement. One of the results: with Ozone and HDDS you can store 10 billion files instead of 500 million.
You will be able to run HDFS and HDDS next to each other, but to move to HDDS you need to copy the data over. His team is now working on Quadra: a raw block storage volume (LUN). It’s a LUN-like storage service where the blocks are stored on HDDS. This is useful when working with Kubernetes.
Deep learning on YARN
Wangda Tan from Hortonworks showed us how you can do capacity planning and isolation of GPUs on Hadoop via YARN. He took the time to explain why this is important.
In YARN in Hadoop 3.1 it will be possible to do GPU isolation and scheduling, and it will have Docker + GPU support.
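As a rough sketch of what this looks like in Hadoop 3.1 (based on the Hadoop documentation, not on the talk itself): you declare `yarn.io/gpu` as a resource type on the ResourceManager and enable the GPU resource plugin on the NodeManagers.

```xml
<!-- resource-types.xml (ResourceManager): declare the GPU resource type -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>

<!-- yarn-site.xml (NodeManagers): enable the GPU resource plugin -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```

Applications can then request GPUs like any other resource, and YARN takes care of scheduling them and isolating containers from each other’s GPUs.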
He showed a demo and much of his code is available on Github.
Inside open metadata—the deep dive by Mandy Chessell
I need to know more about data governance for my work at Port of Rotterdam, so I went back to another session by IBM’s Mandy Chessell. She recognized that having data catalog software is all well and good, but catalogs will stay empty if we expect people to fill them with metadata in their spare time.
But if organizations don’t understand their own data, this will hold them back. One way of making sense of data is making use of content packs in Atlas. These are packs with metadata of – I guess – common data. This is where ODPi.org comes in, where people will share this metadata. I haven’t seen it there yet. It would be good if this becomes a success.
Mandy Chessell explained a lot about the inner workings of Atlas, but it took me a while to realize she wasn’t talking about the version we’re using. She was actually discussing Atlas 1.0 features. That version will be able to use these content packs. Atlas 1.0 will be released in a couple of weeks. I’m looking forward to it.
From an experiment to a real production environment by Jeroen Wolffensperger and Martijn Groen
I know these two speakers from my time at Rabobank, when I worked on their data warehouse project. Rabobank has now started moving to Cloudera for its data lake. They explained their data architecture: a Cloudera data lake for raw and defined data, a separate data lab for R&D, and a data factory for information products (more refined information).
They too built a data catalog. For this they didn’t choose Atlas, but Informatica Enterprise Data Catalog. Jeroen explained that data governance is key to keeping an overview of your data lake and to complying with regulations like GDPR and BCBS 239 (a regulation for banks). A good data catalog is a must.
They managed to realize all this in about 7 months, which is amazing. I remember how slowly projects moved at Rabobank in my time. And in the end they managed to process 7 years of data in 7 hours, which comes down to 100,000 events per second, or 0.6 GB per second.
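Out of curiosity I did a quick back-of-the-envelope check on what those quoted rates imply (my own arithmetic, not from the talk):

```python
# 7 years of data processed in a 7-hour run at the quoted rates
seconds = 7 * 3600                  # length of the run in seconds
events = 100_000 * seconds          # total events processed
data_gb = 0.6 * seconds             # total volume in GB
bytes_per_event = 0.6e9 / 100_000   # implied average event size

print(events)               # 2.52 billion events
print(round(data_gb))       # ~15,000 GB, so roughly 15 TB
print(int(bytes_per_event)) # ~6 KB per event
```

So the two rates are consistent with each other at an average event size of about 6 KB, and the total works out to roughly 15 TB in one run.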
Aftermath
After the sessions were done I went with my colleagues from Port of Rotterdam to Checkpoint Charlie, Potsdamer Platz and the Brandenburger Tor. Near the last one we had Flammkuchen and Berliner currywurst. Good stuff.
Hi Marcel-Jan,
My colleague mentioned this blog post to me. I’m Wangda Tan and you mentioned my talk in your blog post, thanks for summarizing the contents. One place needs to be updated:
“He showed a demo and much of his code is available on Github.”
Actually, the Github link is the simple-tensorflow-serving project instead of my demo project. We’re working on an example to make it available as part of YARN native services.
Thanks,
Wangda
Thanks for that correction, Wangda.