Tag Archives: Hadoop
My Github repo got 50 stars
I never imagined myself as a maintainer of a data-engineering-related open source thing. Yet when I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera … Continue reading
Posted in Apache Products for Outsiders, Data engineering
Tagged Docker, docker-compose, Github, Hadoop, stars
What a year 2021 has been
So at the end of 2021 I found myself in the waiting room of an emergency dentist. An infection above my front teeth had become unbearable. Fortunately, antibiotics make my life much better now. Let that event not colour my view … Continue reading
Posted in Active Learning, Data engineering
Tagged astronomy, Certified Data Engineering Professional, cycling, Github, Hadoop, Kupka, Paris, vacation
I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.
TL;DR: I made a Docker compose that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by … Continue reading
Posted in Howto, Spark
Tagged Apache Spark, Big Data Europe, DIKW, Docker, docker-compose, Hadoop, Hive
23 Comments
Building HDP 2.6 on AWS, Part 3: the worker nodes
This is part 3 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. By now we have an edge node to run Ambari Server and three master nodes for Hadoop name nodes and such. Now … Continue reading
Posted in Howto
Tagged Amazon Web Services, AWS, cloning nodes, Hadoop, HDP, Hortonworks Data Platform, Ubuntu Server, worker nodes
Hadoop in a Hurry – Security
When it comes to Hadoop security, there are so many products and features. What do all of them do? This video gives a high-level overview.
Hadoop High Availability In A Hurry – Part 2: YARN
If you don’t know a lot about YARN and why it’s called a data operating system, you’re in luck. I found it necessary to explain how YARN works before I could explain the solutions for high availability. At first YARN … Continue reading
Posted in Apache Products for Outsiders
Tagged Application Master, Container, Hadoop, Node Manager, Resource Manager, YARN, ZooKeeper
1 Comment
Hadoop High Availability In A Hurry – Part 1: HDFS
I’ve spent a couple of hours studying how Hadoop high availability works for the HDPCA exam, and now I’ve condensed that knowledge into a video on HDFS HA of just under 9 minutes. Enjoy!
Posted in Apache Products for Outsiders
Tagged DataNode, edits file, Fencing, fsimage, Hadoop, HDFS, High availability, JournalNode, NameNode, Split brain, ZKFC, ZooKeeper
1 Comment
Building HDP 2.6 on AWS, Part 2: the master nodes
This is part 2 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. In part 1 we created an edge node where we will later install Ambari Server. The next step is creating the … Continue reading
Posted in Howto
Tagged Amazon Web Services, AWS, cloning nodes, Hadoop, HDP, Hortonworks Data Platform, master node, Ubuntu Server
5 Comments
Building HDP 2.6 on AWS, Part 1: the edge node
Installing Hortonworks Data Platform 2.6 on Amazon Web Services (Amazon’s cloud platform), how hard could it be? It’s click, click, next, next, confirm, right? Well-lll, not quite. Especially if HDP or AWS is new to you. There are many steps … Continue reading
Posted in Howto
Tagged Amazon Web Services, AWS, edge node, Hadoop, HDP, Hortonworks Data Platform, Ubuntu Server
2 Comments