Just left the beer garden party at DataWorks Summit 2017 in München. Okay, let’s see how well I blog after three of those large beers. Luckily I took notes beforehand. Tell me when I start to become incoherent.
So for me the summit actually started yesterday at the partner day, but today the breakout sessions began. Yesterday I attended a cool hands-on session on Apache Ranger and Atlas, the new security and governance tools for Hadoop. I think the days when you could say Hadoop is largely insecure are about to be over.
The first session today was, of course, the keynote. I was a little late, but it turned out the second-row seats were still largely available, so I had a good view of the speakers. A couple of them tried to convince us that (big) data is going to change everything. I already knew that.
I liked the drone that Asad Khan from Microsoft brought with him, but he didn’t let it fly. Dr. Barry Devlin’s session I found thought-provoking. It was about how big data and analytics are going to replace people with technology, and how to dream up a new future. Or, as my colleague called it: old leftist bullshit.
On with the breakout sessions.
An Apache Hive based Data Warehouse by Alan Gates
In the online courses I followed, Hive back then was able to run SQL against Hadoop. A little. Turns out a lot has changed since that course was created. Nowadays you can do almost anything in Hive 2.0, as you will also see in the later sessions.
A couple of takeaways:
- Hive 2.0 has extensive SQL:2011 support.
- It’s fast: it has been seen doing 100 million rows per second in production environments.
- Maybe because of the in-memory cache. Probably.
- One of the reasons is LLAP (Low Latency Analytical Processing, or Live Long And Process).
One thing I picked up that is definitely different from Oracle databases: “You can make primary and foreign keys, but we don’t actually check them”. I wonder how that works out. Something I have to check in the documentation.
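Why would you declare keys the engine never enforces? My guess, until I check the documentation: unenforced (“informational”) constraints can still feed the optimizer, for rewrites such as join elimination. Here is a toy Python sketch of that idea; all names and structures are hypothetical, not Hive internals.

```python
# Toy sketch (not Hive code): a declared-but-unchecked foreign key can
# let an optimizer drop an inner join to a parent table entirely.

def can_eliminate_join(selected_cols, parent_table, parent_pk, declared_fks):
    """An inner join to a parent table can be dropped when:
    - the query selects no columns from the parent, and
    - a declared (trusted) foreign key guarantees every child row
      matches exactly one parent row."""
    no_parent_cols = not any(c.startswith(parent_table + ".")
                             for c in selected_cols)
    fk_declared = (parent_table, parent_pk) in declared_fks
    return no_parent_cols and fk_declared

# Child table 'orders' declares an FK to parent 'customers(id)'.
fks = {("customers", "id")}
print(can_eliminate_join(["orders.amount"], "customers", "id", fks))   # True
print(can_eliminate_join(["customers.name"], "customers", "id", fks))  # False
```

The catch, of course, is that the rewrite is only safe if the unchecked constraint actually holds in the data.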
Data Journey at Klarna by Max Fisher and Stefan Hermelin
Klarna is an e-payments platform from Sweden, basically a Swedish startup from (I believe) 2006. The speakers described their journey into big data, which started in 2012.
They adopted something called the Lambda architecture, which consists of three layers: batch processing, speed/real-time processing, and a serving layer for responding to queries. They described how they optimized working with their platform. First, a couple of years back, they took the complexity out of Hadoop by adopting High Level Transformation Packaging (on which I can find very little, BTW). This made ETL easier for analysts.
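The three Lambda layers can be sketched in a few lines of Python. This is my own minimal, hypothetical illustration of the pattern, not Klarna’s implementation: the batch layer periodically recomputes over all history, the speed layer aggregates only the recent events the batch hasn’t seen yet, and the serving layer merges both at query time.

```python
# Minimal sketch of the Lambda architecture's three layers.

def batch_view(all_events):
    """Batch layer: periodically recomputed totals per user."""
    view = {}
    for user, amount in all_events:
        view[user] = view.get(user, 0) + amount
    return view

def speed_view(recent_events):
    """Speed layer: totals for events not yet in the batch view."""
    return batch_view(recent_events)  # same aggregation, smaller input

def serve(user, batch, speed):
    """Serving layer: merge batch and real-time results per query."""
    return batch.get(user, 0) + speed.get(user, 0)

history = [("alice", 10), ("bob", 5), ("alice", 7)]  # already batch-processed
recent  = [("alice", 3)]                             # streamed in since then
print(serve("alice", batch_view(history), speed_view(recent)))  # 20
```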
Next they found out it took a while to run SQL. So they developed (and open sourced) HiveRunner, which lets you treat your SQL as code. You can unit test it, for example.
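HiveRunner itself is a JVM/JUnit tool, so as a language-neutral sketch of the “treat your SQL as code” idea, here is a unit test of a query run against an in-memory SQLite database standing in for Hive. The table and query are made up for illustration.

```python
# Sketch: unit-testing a SQL query like application code, with an
# in-memory SQLite database as a stand-in for the real warehouse.
import sqlite3

QUERY = "SELECT country, COUNT(*) AS n FROM payments GROUP BY country"

def test_payment_counts():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE payments (id INTEGER, country TEXT)")
    con.executemany("INSERT INTO payments VALUES (?, ?)",
                    [(1, "SE"), (2, "SE"), (3, "DE")])
    rows = dict(con.execute(QUERY).fetchall())
    assert rows == {"SE": 2, "DE": 1}

test_payment_counts()
print("ok")
```

The point is that the query lives in version control and fails a build when its behavior changes, instead of failing in production.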
Then, in 2016, they went to the cloud, using Kafka to transport their data. More recently they added Apache Ranger for access management.
A couple of takeaways from their journey:
- Don’t optimize prematurely. Klarna’s teams first worked to make everything super fast, but apparently their experience was that it’s better to work on performance later on.
- Kerberos (often used for authentication in Hadoop) was a pain to implement.
- Most important one (according to Max Fisher): constantly care about your users.
- Treat SQL as code.
Hadoop 3.0 in a nutshell by Sanjay Radia & Junping Du
In this session we were told what to expect from Hadoop 3.0. Just to be sure: it isn’t here yet. Alpha 2 is expected in Q2, the beta in Q3/Q4. Hadoop 3.0 gets a lot of features in the (GitHub) trunk that they didn’t get to put into Hadoop 2.x. Hadoop 3.0 also moves to JDK 8 and will make use of many of its features.
For data engineers the high-availability improvements are interesting. Hadoop 3.0 will fix the dependency on a single namenode, so you won’t have downtime in the future; at least, the namenode as a single point of failure should no longer be the reason. Another thing is that a couple of ports will change. Which ones exactly, you’ll have to check in the slides.
In HDFS you currently replicate data by copying your blocks, usually to three places (replication factor 3). In Hadoop 3.0 it’s possible to choose a different approach, which they called erasure coding. It basically means parity blocks are stored that can replace lost blocks. It does have a CPU overhead, but less storage overhead, I suppose. This feature is recommended mostly for archival data. It’s interesting that the speaker (Sanjay) discussed which route they chose to solve this problem and why; you would never hear this at a closed-source presentation.
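To make the parity idea concrete, here is a toy Python sketch using a single XOR parity block (RAID-5 style), which can rebuild any one lost block. HDFS actually uses the stronger Reed-Solomon codes (a scheme like RS(6,3) survives three lost blocks), but the storage-overhead arithmetic below follows from the block counts either way.

```python
# Toy sketch of the erasure-coding idea: instead of 3 full copies,
# store parity that can rebuild a lost block.

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"AAAA", b"BBBB", b"CCCC"]        # three data blocks
parity = data[0]
for block in data[1:]:
    parity = xor_blocks(parity, block)    # parity = d0 ^ d1 ^ d2

# Lose block 1; rebuild it from the survivors plus parity.
rebuilt = xor_blocks(xor_blocks(data[0], data[2]), parity)
assert rebuilt == data[1]

# Storage overhead: 3x replication vs Reed-Solomon RS(6,3).
print(3 * 1.0)      # replication: 3.0x raw storage per stored byte
print((6 + 3) / 6)  # RS(6,3): 9 blocks for 6 blocks of data = 1.5x
```

Half the storage of triple replication, at the cost of CPU during reads when blocks are missing, which is why it suits cold, archival data.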
YARN (the resource management thing of Hadoop) got scheduling enhancements and support for long-running queries. You can use YARN instead of OpenStack. It also supports Docker. Quote: “You can even run YARN on YARN if you want.” YARN got more updates, but more about those in the next session.
Running Services on YARN by Varun Vasudev
Interesting quote in the presentation: “the tolerance for slow running jobs is decreasing. Consistent performance is desired”. In my experience (in the Oracle world) that sounds very true.
So YARN is the architectural center of big data workloads, and in future versions it can do more. Batch jobs used to have to be “YARN aware”, but services don’t work that way.
The new framework for services provides an Application Master as the component of Hadoop that runs services. Discovering long-running services also seems to be rather hard, but they found a solution for that. Don’t ask me for details; this part I didn’t understand 100%.
An Overview of Optimization in Apache Hive: Past, Present and Future by Jesús Camacho Rodríguez
This session spoke my language. Because much of it sounded like the stuff I worked on when I tuned Oracle query performance. Many terms were familiar.
So it turns out that Hive 2.0 has an optimizer, and it does much of what the Oracle optimizer does: it finds the most efficient way to execute a given query. And in Hive 2.0 that brings a bit of a challenge: choose latency or optimality?
The optimizer in Hive 2.0 is powered by Apache Calcite. That changes one thing in Hive: since 2.0 it sometimes takes longer to optimize the query than to get the data (which also sounds very familiar).
So, like in Oracle, Hive 2.0/Calcite needs accurate statistics to come up with an accurate plan. Statistics include the number of rows, the NDV (Number of Distinct Values), and min and max values.
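For anyone who hasn’t tuned queries before: here is a toy Python sketch of the kind of statistics a cost-based optimizer keeps, and the classic way they are used to estimate how many rows an equality predicate will return. This is the textbook uniform-distribution estimate, not Hive’s or Calcite’s actual code.

```python
# Toy sketch: per-column statistics and a classic selectivity estimate.

def column_stats(values):
    return {"rows": len(values),
            "ndv": len(set(values)),  # Number of Distinct Values
            "min": min(values),
            "max": max(values)}

def estimated_rows(stats):
    """Classic estimate for `col = constant`: rows / NDV, assuming the
    values are uniformly distributed over the distinct values."""
    return stats["rows"] / stats["ndv"]

ages = [31, 42, 31, 58, 42, 31, 25, 42]
s = column_stats(ages)
print(s["rows"], s["ndv"], s["min"], s["max"])  # 8 4 25 58
print(estimated_rows(s))                        # 2.0
```

Stale statistics break this estimate, which is why both Oracle and Hive make you gather them explicitly.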
There is also physical optimization, which decides, based on algorithms, partitioning and sorting, how the plan will be executed. In future releases there will be materialized views and partitioned materialized views.
HDFS Tiered Storage: Mounting Object Stores by Thomas Demoor and Ewan Higgs
Both speakers are from Western Digital and discussed a solution for moving data between multiple clusters. For this they proposed a new type of storage next to RAM, SSD and disk: provided storage.
I’m afraid they lost me pretty quickly.
Best Practices for Enterprise User Management in a Hadoop Environment by Sailaja Polavarapu and Don Bosco Durai
So this session discussed how to integrate Apache Ranger with AD/LDAP. Why? So you can have authentication to Hadoop from AD/LDAP. It’s just one example of why I think Hadoop’s security has improved.
Don started by answering an often-asked question: “If I have Ranger, do I really need Kerberos?” (Boy, Kerberos really is a pain in the proverbial behind, is it not?) The answer is yes: Ranger adds to the security, but it doesn’t replace Kerberos.
Sailaja has developed large parts of Ranger and demonstrated how to use the command-line tools to integrate Ranger with an existing AD/LDAP solution. Yes, this was a live demo, so chapeau to the presenters. And it worked well.
Beer garden party
The day ended with a beer garden party with .. ah .. beer, and all sorts of games and traditional Bavarian songs like “Sweet Alabama”, “Everybody was kung fu fighting” and “Staying alive” by the Bavarian Bee Gees, and many other German classics.
So this was my first open source / big data related summit. And there were a few things that caught my eye/ear:
- Instead of some kind of vice president showing a new feature you have to pay extra for, the people who actually built it show their work. I heard “we’ve heard you” a couple of times when they announced changes. This is more fun. And of course you can ask these people all kinds of questions.
- I’m more confident than ever that you can secure Hadoop properly and keep track of your data, thanks to Apache Ranger and Apache Atlas. I will discuss these in more detail as soon as I have played with them some more.
- I also think the “Hadoop is not really highly available” story will end soon, as the dependency on just one namenode will lessen in future versions.