Dataworks Summit München 2017 – day two

Day two started with more keynotes. Ross Porter of Dell EMC talked about the ingredients of a successful analytics project. Carlo Vaiti of HP Enterprise had an interesting talk about trends in big data, but I would advise him to let a professional presentation bureau go over his slides. They were perfect for a breakout session, but not so much for a keynote: the fonts were too small.

Then there was a customer panel discussion with Hortonworks’ Raj Verma as moderator. Interestingly, the panelists (Nazeem Gulzar, Danske Bank; Eddie Edwards, Centrica; and Zog Gibbens, Walgreens) also discussed topics like contributing to open source, the empowerment of developers, and customers that just can’t get enough of it.

Next up was Mike Merrit-Holmes from Think Big, and after that Dr. Rand Hindi had a good session about making technology disappear. One interesting thing that I didn’t know was that you can now draw something and some kind of deep learning algorithm can turn it into a photo-like picture. His hope is that AI can make us feel like we are at the beach all day, a much brighter outlook than the non-commercial keynote of yesterday.

Rand Hindi on making technology disappear.


Apache NiFi Crash Course

At the Dataworks Summit 2017 there were several crash courses of about 2 hours and I wanted to follow at least one of them. I went for Dataflow with NiFi, because Apache NiFi is very hip and for once I wanted in on the ground floor. I managed to get the last spot, next to a very warm heater (what is it with warm German conference rooms, and hotels too?).

The way these crash courses work is that first you get a bit of theory and a demo, and then you can try it yourself, with help from the people assisting in the session. For this you get a VirtualBox environment with NiFi already installed, so you can get started quickly.

Aldrin Piri did the presentation and live demo. And it was a good presentation, more so because there were two slides with xkcd comics 🙂 . The presentation was very clear for people who didn’t know NiFi yet. It discussed what NiFi tries to solve and how it does that.

Apache NiFi is a topic I want to blog about in more detail, because it deserves more than a few lines. NiFi is a way of getting data from many data producers, like IoT devices and smartphones but also servers, to consumers like users, storage and systems. Aldrin called NiFi the “plumbers of your data infra”. NiFi has a graphical interface where you define all those pipelines between data producers and data consumers. And you can filter out data you don’t want, so if you are sending logs, for example, you can forward only the error data.

You can follow exactly the road a piece of data took and how long it took.

And the most brilliant thing is, you can follow your data everywhere: where did it go, how long did it have to wait at a certain step and where there is . You can just select a piece of data and see in a graphical fashion where it went, and there is even a slider with which you can follow the progress over time. At this point the audience was laughing because we didn’t expect NiFi would go this far. It was almost silly.

You can try it yourself by the way. The link to the course is here:

Analyze Traffic Patterns with Apache NiFi


Approaches to achieving real time ingestion & analysis of security events by Sagar Gaiwad

The speaker, Sagar Gaiwad, is the manager of Big Data CyberTech at Capital One, a large American bank. He explained exactly how Capital One sifts through its data to look for fraudulent actions and malware. It all started as a program that was called “Purple Rain”, which is of course a great name for a program.

Capital One uses machine learning to find malicious actions and analysts are now alerted when anything seems wrong. For this they have a 45 node Apache Storm cluster, which is their core streaming engine. They use Apache Metron, a security analytics framework built on Storm. And they do telemetry ingest with Apache NiFi and fast telemetry ingest with Apache Kafka. (I’ll try to explain what all these Apache products are some time in the future, when I know these products better myself.)

To search through logs they use ElasticSearch, a distributed, RESTful search and analytics engine. With ElasticSearch you can do full text search, which was very helpful for them. To visualize the data from ElasticSearch they used Kibana. (ElasticSearch/Kibana was very popular at this summit. I’ve seen it in several presentations.)

What was interesting was their approach to malware. When computers in the company get infected by malware, they send a signal to a Command and Control center of that malware and this C&C center then gives instructions. So the job is to find the machines that send out callouts to those C&C centers and stop them.

In the old days you could recognize the domains, but that’s not how modern malware works. It uses domain generation algorithms, so this communication can go to constantly changing, random-looking domain names. So they used a machine learning approach to see whether these domain names have recognizable words and pronounceable phrases in them. If not, the domain is suspect. In the end they used 20 features (types of data used in a machine learning model) and their model became 99.7% accurate in predicting what are malware callouts and what are not. And they manage to process 70,000 records per second to find them.
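
To give an idea of what features about “recognizable words and pronounceable phrases” could look like, here is a minimal sketch in Python. To be clear: this is my own illustration and not Capital One’s model. The talk did not list the 20 features, so the ones below (entropy, vowel ratio, consonant runs, a word lookup), the tiny `COMMON_WORDS` list and all function names are assumptions chosen for the example.

```python
import math
from collections import Counter

# Hypothetical mini word list; a real system would use a large dictionary.
COMMON_WORDS = {"mail", "bank", "shop", "cloud", "news", "data", "book"}
VOWELS = set("aeiou")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; random strings tend to score higher."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_consonant_run(text: str) -> int:
    """Length of the longest run of consecutive letters that are not vowels."""
    longest = current = 0
    for ch in text:
        if ch.isalpha() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def domain_features(domain: str) -> dict:
    """Compute a few illustrative features for the leftmost label of a domain."""
    label = domain.lower().split(".")[0]
    letters = [ch for ch in label if ch.isalpha()]
    vowel_count = sum(ch in VOWELS for ch in letters)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label) if label else 0.0,
        "vowel_ratio": vowel_count / len(letters) if letters else 0.0,
        "longest_consonant_run": longest_consonant_run(label),
        "contains_word": any(word in label for word in COMMON_WORDS),
    }
```

Calling `domain_features` on a made-up normal-looking name and on a made-up random-looking name gives two small dictionaries that a classifier could be trained on. A pronounceable name will typically have a healthy vowel ratio and short consonant runs, while a generated one often has neither.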

All in all a very open and interesting presentation. One of the largest problems, said Sagar, is recruitment. Nowadays they recruit students who have just finished university and pair them with experienced employees.


Real World Architecture and Deployment Best Practices by Cory Minton

Cory Minton from Dell-EMC on sizing of Big Data.

At first I was at a session on “Building brains” about large scale deep learning networks, because I confused the room, but I quickly recognized that this session was way over my head. So I quickly went to the Real World Architecture session by Cory Minton from Dell EMC.

Most CEOs/CTOs/CIOs can take a hint from Cory Minton’s very clear talk about what you need to run big data. I loved how no-nonsense his view on sizing is. Want to know how many worker nodes you need?

Example calculation:

And on workload types?

  • Expect more complexity (Machine Learning, Image processing, Natural Language Processing)? More CPU.
  • Low latency applications (Storm, low latency Hive, Tez, HBase)? More memory.
  • Traditional ETL and archiving (Pig, Hive, Mapreduce)? More disk.

Bam. Problem solved. Next!

The next part was on what products Dell-EMC has for Big Data and the blueprints that Dell has available. I think it’s best I let Dell-EMC do their own marketing on that. But it was interesting what one person in the room said. He said: “We started working with Hadoop in 2010”. Cory’s response: “Sweet!”. And apparently they used Dell’s blueprints and it worked perfectly. Cory was of course very happy with that feedback and I believe that customer can get a Dell server free now.


ING CoreIntel Collect and Process Network Logs Across Data Centers in Real Time – Krysztof Adamski

Krysztof Adamski on ING’s CoreIntel platform

I really wanted to follow the session on Hadoop Backup and Recovery, but the room was completely full and I just couldn’t get in. So next I went to a session from Krysztof Adamski from ING (large bank in the Netherlands, also strong in Europe). Okay, so first things first: drawing your own diagrams with the Paper app from FiftyThree is basically an original idea. But handwriting the text with Paper is maybe not such a good idea. I know the app myself and it’s hard to produce readable things with it.

So CoreIntel is ING’s Cyber Crime Response Program. It is built on Apache Metron and Apache Spot (just another Apache product I don’t know). After showing us his cat, Krysztof explained the challenges in building such a program. For one: where do you get all this data? But also: try to exchange this data between several countries, even in the EU.

ING uses Tap as the device to capture relevant network data and Kafka to collect this data locally. What was interesting in this presentation were the important parameters to check in Kafka to get this right. ING uses Spark on this central data, sends it on to ElasticSearch and does visualization with Kibana.

The upcoming challenges for his team will be the implementation of Openshift (RedHat’s Platform as a Service on top of Docker containers) and DC/OS (honestly never heard of it). They plan to deliver Spark clusters on demand soon.


HopsFS – Breaking 1 M ops/sec barrier in Hadoop – Jim Dowling

For this session I was more or less expecting a long lineup. Making things very fast? That’s always a popular topic. But actually I easily got my spot, maybe because it was late on the last day. This session sounded rather academic (Jim Dowling is an associate professor at KTH Royal Institute of Technology Stockholm) and as a newbie I was a bit worried that this stuff would go way over my head. But fortunately it was all right.

So if you want HDFS to go really, really fast, there is a bottleneck at the namenode: the metadata sits in the heap of a single JVM. Enter HopsFS, which stores the metadata in a MySQL cluster. Apparently MySQL clusters can be really fast. Jim said he has a lot of experience with them and knew they would be more than capable of serving his needs for HopsFS.

So they tested this solution at Spotify and they went to 400,000 operations per second with a 2 node MySQL cluster. With 4 nodes they reached 800,000 ops/sec and with 12 nodes even more than a million ops/sec. HopsFS also can handle more files and the Metadata API is tinker friendly, so you can create your own metadata tables.

They’re also working on HopsYARN, but this is more of a challenge and they are getting good results with Hive Metastores. This data is not yet peer reviewed (hurray for the scientific mindset, no bloated commercial data here), but they seem to be able to handle creating and reading many more small files than usual.


Going home

After this my colleagues and I went to the airport, discussed how customs could work much faster by adding more worker nodes and eventually went our separate ways. One notable thing in the line for boarding was that one passenger on my flight had printed his boarding pass on A0 format. Because where does it say it should be A4/letter format? I’m sure that did not result in any problems.

An A0 sized boarding pass.

Back in Amsterdam I was greeted by trains not running because of a suspect package in one of the trains. I took a taxi, and back home unloaded papers, stickers and a couple of goodies.

About Marcel-Jan Krijgsman

In 2017 I made the leap to Big Data after 20 years of experience with Oracle databases. I followed courses on Hadoop, Big Data Analytics, Machine Learning and Python, MongoDB and Elasticsearch.