Tag Archives: Hadoop

My Github repo got 50 stars

I never imagined myself as a maintainer of a data engineering related open source thing. Yet. But when I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera … Continue reading

Posted in Apache Products for Outsiders, Data engineering | Tagged , , , , | Leave a comment

What a year 2021 has been

So at the end of 2021 I found myself in the waiting room of an emergency dentist. An infection above my front teeth became unbearable. Fortunately antibiotics makes my live much better now. Let that event not colour my view … Continue reading

Posted in Active Learning, Data engineering | Tagged , , , , , , , | Leave a comment

I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a Docker compose that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by … Continue reading

Posted in Howto, Spark | Tagged , , , , , , | 23 Comments

Building HDP 2.6 on AWS, Part 3: the worker nodes

This is part 3 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. By now we have an edge node to run Ambari Server, three master nodes for Hadoop name nodes and such. Now … Continue reading

Posted in Howto | Tagged , , , , , , , | Leave a comment

Hadoop in a Hurry – Security

When talking about Hadoop security there are so many products and features. What do all of them do? This video gives a high over overview.

Posted in Apache Products for Outsiders | Tagged , , , , , , , , , | Leave a comment

Tutorial: Let’s throw some asteroids in Apache Hive

This is a tutorial on how to import data (with fixed lenght) in Apache Hive (in Hortonworks Data Platform 2.6.1). The idea is that any non-Hive, non-Hadoop savvy people can follow along, so let me know if I succeeded (make … Continue reading

Posted in Apache Products for Outsiders | Tagged , , , , , , , , | Leave a comment

Hadoop High Availability In A Hurry – Part 2: YARN

If you don’t know a lot about YARN and why it’s called a data operating system, you’re in luck. I found it necessary to explain how YARN works before I could explain the solutions for high availability. At first YARN … Continue reading

Posted in Apache Products for Outsiders | Tagged , , , , , , | 1 Comment

Hadoop High Availability In A Hurry – Part 1: HDFS

I’ve been studying for a couple of hours how Hadoop high availability works, for the HDPCA exam. And now I’ve condensed that knowledge to a video on HDFS HA in just under 9 minutes. Enjoy!

Posted in Apache Products for Outsiders | Tagged , , , , , , , , , , , | 1 Comment

Building HDP 2.6 on AWS, Part 2: the master nodes

This is part 2 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. In part 1 we created an edge node where we will later install Ambari Server. The next step is creating the … Continue reading

Posted in Howto | Tagged , , , , , , , | 5 Comments

Building HDP 2.6 on AWS, Part 1: the edge node

Installing Hortonworks Data Platform 2.6 on Amazon Web Services (Amazon’s cloud platform), how hard could it be? It’s click, click, next, next, confirm, right? Well-lll, not quite. Especially if HDP or AWS is new to you. There are many steps … Continue reading

Posted in Howto | Tagged , , , , , , | 2 Comments