My GitHub repo got 50 stars

I never imagined myself as the maintainer of a data engineering-related open source project. Yet here we are. When I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera sandbox, but some of my colleagues tried it and complained that it took way too long to start and used way too many of their laptop's resources. It was a safe bet that our students would run into the same problem.

Long story short: I found that Big Data Europe already had a simple Dockerized Hadoop setup. They had actually done all the hard work. But I wanted it to have Hive and Spark too. So I went playing with docker-compose.yml files, and learned a lot from that BTW. After some initial frustrations, it finally worked. (more…)

I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a Docker Compose setup that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by default, there have been all kinds of connection problems running …]
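Once the containers are up, a quick smoke test from PySpark should confirm the wiring. A minimal sketch, assuming the compose file exposes the Spark master as spark://spark-master:7077 and HDFS on namenode:9000 (check docker-compose.yml in the repo for the actual service names and ports):

```python
# Minimal smoke test against the Dockerized cluster.
# Assumed service names/ports (verify in docker-compose.yml):
#   Spark master: spark://spark-master:7077
#   HDFS namenode: hdfs://namenode:9000
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("docker-cluster-smoke-test")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Write a tiny DataFrame to HDFS and read it back to confirm everything talks.
df = spark.createDataFrame(
    [(1, "hadoop"), (2, "spark"), (3, "hive")],
    ["id", "component"],
)
df.write.mode("overwrite").csv("hdfs://namenode:9000/tmp/smoke_test")
spark.read.csv("hdfs://namenode:9000/tmp/smoke_test").show()
```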

Building HDP 2.6 on AWS, Part 3: the worker nodes

This is part 3 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. By now we have an edge node to run Ambari Server and three master nodes for the Hadoop name nodes and such. Now we need worker nodes to process the data.

Creating the worker nodes is not that different from creating the master nodes, but the workers need more powerful machines.

Creating the first worker node

Log in to Amazon Web Services again, in the same AWS region as the edge and master nodes. We start with one worker node and clone two more later on. Go to the EC2 dashboard in the AWS interface and click “Launch instance”. Then choose Ubuntu Server 16.04 from the Amazon Machine Images. (more…)
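(If you'd rather script those console clicks, something like this boto3 sketch should do it. The AMI ID and key pair name are placeholders; look up the current Ubuntu Server 16.04 AMI for your region.)

```python
# Rough boto3 equivalent of the "Launch instance" clicks above.
# All IDs below are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # same region as the edge and master nodes

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: Ubuntu Server 16.04 AMI for your region
    InstanceType="m4.xlarge",         # assumption: pick a size that suits your workers
    KeyName="my-hdp-keypair",         # placeholder key pair name
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "hdp-worker-1"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```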

Tutorial: Let’s throw some asteroids in Apache Hive

This is a tutorial on how to import fixed-length data into Apache Hive (on Hortonworks Data Platform 2.6.1). The idea is that even non-Hive, non-Hadoop-savvy people can follow along, so let me know if I succeeded (make sure you don’t look like comment spam though; I’m getting a lot of that lately, even though it never passes my approval).

Intro

Currently I’m studying for the Hortonworks Data Platform Certified Developer: Spark using Python exam (or HDPCD: Spark using Python). One part of the exam objectives is using SQL in Spark. Along the way you also work with Hive, the data warehouse software in Hadoop.

I was following the free Udemy HDPCD Spark using Python preparation course by ITVersity. The course is good BTW, especially for the price :). But after playing along with the Core Spark videos, the course used the same boring revenue data again for the Spark SQL part. And I thought: “I know SQL pretty well. Why not use data that is a bit more interesting?” So I downloaded the Minor Planet Center’s asteroid data. This contains all the known asteroids as of at least yesterday. At this moment, that is about 745,000 lines of data. (more…)
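To give you a taste of the fixed-length approach up front: read each line as one string and slice the fields out with substring(). A sketch in PySpark; the column offsets and input path are placeholders, not the actual Minor Planet Center record layout, so check the MPC's format description for the real positions.

```python
# Sketch: parse fixed-length asteroid records and query them with SQL.
# Offsets and the input path are placeholders; see the MPC format docs.
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring, trim

spark = (
    SparkSession.builder
    .appName("asteroids")
    .enableHiveSupport()   # so the table plays nice with Hive
    .getOrCreate()
)

raw = spark.read.text("/tmp/MPCORB.DAT")  # placeholder HDFS path

asteroids = raw.select(
    trim(substring("value", 1, 7)).alias("designation"),    # placeholder offsets
    trim(substring("value", 9, 5)).alias("abs_magnitude"),  # placeholder offsets
)

# Register a view so the rest can be plain SQL, just like in the exam objectives.
asteroids.createOrReplaceTempView("asteroids")
spark.sql("SELECT designation, abs_magnitude FROM asteroids LIMIT 10").show()
```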

Building HDP 2.6 on AWS, Part 2: the master nodes

This is part 2 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. In part 1 we created an edge node where we will later install Ambari Server. The next step is creating the master nodes.

Creating the first master node

Make sure you are logged in to Amazon Web Services, in the same AWS region as the edge node. To create three master nodes, we have to start with one. Once again we go to the EC2 dashboard in the AWS interface and click “Launch instance”. Again we get a choice of Amazon Machine Images, and again we choose Ubuntu Server 16.04.

(more…)
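(The “start with one, clone the rest” step can also be scripted. A rough boto3 sketch, with placeholder IDs throughout; the posts themselves do all of this through the console.)

```python
# Sketch: snapshot the configured first master into an AMI, then
# launch the other two masters from it. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Create an image of the first, fully configured master node.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",  # placeholder: the first master's instance ID
    Name="hdp-master-template",
)

# Wait until the AMI is usable, then launch two more masters from it.
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])
ec2.run_instances(
    ImageId=image["ImageId"],
    InstanceType="m4.xlarge",   # assumption: same size as the first master
    KeyName="my-hdp-keypair",   # placeholder key pair name
    MinCount=2,
    MaxCount=2,
)
```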

Building HDP 2.6 on AWS, Part 1: the edge node

Installing Hortonworks Data Platform 2.6 on Amazon Web Services (Amazon’s cloud platform), how hard could it be? It’s click, click, next, next, confirm, right?

Well-lll, not quite. Especially if HDP or AWS is new to you. There are many steps and many things to look out for. That’s why I wrote a manual: initially for myself, and now here for you.

Disclaimer: This blog post might change slightly as I gain more experience with my HDP cluster. Most of it works, but I still have problems with a few services. I’ll note any changes I make at the end of this post.

(more…)