This is part 2 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. In part 1 we created an edge node where we will later install Ambari Server. The next step is creating the master nodes.
Creating the first master node
Make sure you are logged in to Amazon Web Services, in the same AWS region as the edge node. To create three master nodes, we start with one. Once again we go to the EC2 dashboard in the AWS console and click “Launch instance”. Again we get a choice of Amazon Machine Images, and again we choose Ubuntu Server 16.04.
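If you prefer the command line, the same launch can be scripted with the AWS CLI. This is only a sketch, not the exact steps from this guide: the AMI ID, key pair name, security group and instance type below are placeholders you would replace with your own values (Ubuntu 16.04 AMI IDs differ per region). The script prints the command instead of running it, so you can review it first.

```shell
#!/bin/sh
# Placeholder values -- replace with your own before launching anything.
AMI_ID="ami-xxxxxxxx"         # Ubuntu Server 16.04 AMI ID for your region
KEY_NAME="my-hdp-keypair"     # the key pair you created for the edge node
SECURITY_GROUP="sg-xxxxxxxx"  # security group shared with the edge node
INSTANCE_TYPE="m4.xlarge"     # illustrative size; pick what fits your budget

# Build the launch command and print it for review; pipe the output to sh
# (or drop the echo) to actually launch the instance.
LAUNCH_CMD="aws ec2 run-instances \
  --image-id $AMI_ID \
  --instance-type $INSTANCE_TYPE \
  --key-name $KEY_NAME \
  --security-group-ids $SECURITY_GROUP \
  --count 1"

echo "$LAUNCH_CMD"
```

Launching from the CLI also makes it easy to repeat the step for the second and third master node later.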
Installing Hortonworks Data Platform 2.6 on Amazon Web Services (Amazon’s cloud platform), how hard could it be? It’s click, click, next, next, confirm, right?
Well-lll, not quite. Especially if HDP or AWS is new to you. There are many steps and many things to look out for. That’s why I wrote this manual, initially for myself and now for you as well.
Disclaimer: This blogpost might change slightly after I’ve gained more experience with my HDP cluster. Most of it works, but I still have problems with a few services. I’ll note any changes I’ve made at the end of this post.
As I said in my last blogpost, I have followed the Apache NiFi crash course that Hortonworks provides. The tutorial describes several different scenarios and options, and you have to read through all of that to find the one you want. And you don’t have time for that. You’re probably doing this in your spare time and you have a whole Netflix backlog.
So in this guide we cut right to the chase. It took me about 10 hours to work through Tutorials 0, 1, 2 and 3. With this guide you can perhaps do it in about 4.
1. Preparing the VM
First download the Hortonworks Sandbox. There are VirtualBox (used in this example), VMware and Docker images that come preinstalled with many products, but NiFi isn’t installed just yet (this guide is based on the HDP 2.6 sandbox).
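If you take the VirtualBox route, the sandbox can also be imported from the command line with VBoxManage instead of through the GUI. A minimal sketch: the OVA filename and VM name below are illustrative, so use the ones from your actual download. The commands are echoed rather than executed, so you can review them first.

```shell
#!/bin/sh
# Illustrative filename and VM name -- use the ones from your actual download.
OVA="HDP_2.6_virtualbox.ova"
VM_NAME="Hortonworks Sandbox HDP 2.6"

# Each command is echoed so you can review it; remove the echos (or pipe
# the output to sh) to run them for real.
echo VBoxManage import "$OVA" -n                # -n = dry run: list the VM settings only
echo VBoxManage import "$OVA"                   # the real import
echo VBoxManage startvm "$VM_NAME" --type headless  # start without a GUI window
```

The dry run (`-n`) is handy for checking the disk and memory requirements of the sandbox before you commit to the import.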
Am I the only one who has this? Let me know.
Phase 1: Discovery of New Product
Suddenly everybody talks about New Product. It’s said it changes everything. Articles about New Product appear on Hacker News for weeks. Then colleagues on LinkedIn even mention New Product (Warning! People you know, know New Product!). (Or they’re just linking to articles about New Product, so they look cool. Either way: they must know New Product!)
There are a lot of data-related Apache products out there and it’s hard to keep up with all of them. There are several products to stream or flow data (what’s the difference?), like Kafka, Storm, Flink and NiFi. Yes, all of them have documentation, but to an outsider their descriptions sound like “enterprise scalable streaming solutions”. What does that tell you?
I followed a crash course on Apache NiFi at the DataWorks Summit in München last month and was quite impressed. At heart I’m a command line kind of guy, but this graphical interface is really slick, and it’s amazing how NiFi lets you trace where your data goes. I decided to organize a workshop for my colleagues at Open Circle Solutions.
“How did you get into Big Data?” is a question people have asked me a couple of times now. So let me give the answer in a blogpost as well.
I’ve used eight sources of Big Data related knowledge and skills:
- Massive Open Online Courses (MOOCs)
- Meetups and summits
- Online documentation
- Hands-on experience
- Learning sites/”universities” of vendors
Day two started with more keynotes. Ross Porter of Dell EMC talked about the ingredients of a successful analytics project. Carlo Vaiti of HP Enterprise had an interesting talk about trends in big data, but I would advise him to have a professional presentation agency go over his slides. They were fine for a breakout session, but not so much for a keynote: the fonts were too small.
Just left the beer garden party at Dataworks Summit 2017 in München. Okay, let’s see how well I blog after three of those large beers. Luckily I took notes beforehand. Tell me when I start to become incoherent.
So for me the summit actually started yesterday with the Partner day, but today the breakout sessions began. Yesterday I attended a cool hands-on session on Apache Ranger and Atlas, the new security and governance tools for Hadoop. I think it’s fair to say that the days when you could call Hadoop largely insecure are about to be over.
The first session I attended today was, of course, the keynote. I was a little late, but it turned out that the second-row seats were largely available, so I had a good view of the speakers. A couple of them tried to convince us that (big) data is going to change everything. I already knew that.
After 20 years of working with Oracle products, I decided to take a new step: to become a data engineer. And that is just one piece of the Big Data jargon I’m about to learn. I intend to use this blog to take you with me on this journey and to make sense of the new products and jargon I’m getting familiar with.
So what is a data engineer? Is it just the DBA of the Big Data world where Hadoop has replaced the relational database? I’ll keep you informed.