Category Archives: Learning Big Data

My Github repo got 50 stars

I never imagined myself as a maintainer of a data engineering related open source thing. Yet. But when I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera … Continue reading

Posted in Apache Products for Outsiders, Data engineering, Learning Big Data | Tagged , , , , | Leave a comment

Five years of data engineering

Five years ago I made the switch from Oracle database administration to data engineering. It has been quite a ride. I made a video about this to celebrate.

Posted in Active Learning, Data engineering, Learning Big Data | Tagged | Leave a comment

I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a Docker compose that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here: https://github.com/Marcel-Jan/docker-hadoop-spark [Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by … Continue reading

Posted in Howto, Learning Big Data, Spark | Tagged , , , , , , | 17 Comments

R Studio: Doing data science on my health data – Part 1

Up to seven years ago my doctor would nag me every half year that I should lose some weight. Nagging didn’t work on me that much. What did work however was competition. I wanted to become faster in a bike … Continue reading

Posted in Learning Big Data, Weird experiments | Tagged , , , , | Leave a comment

Check your /tmp on HDFS

If you have sensitive data on your Hadoop cluster, you might want to check /tmp on HDFS once a while to see what ends up there. /tmp is used by several components. Hive for example stores its “scratch data” there. … Continue reading

Posted in Learning Big Data | Tagged , , | Leave a comment

Making a Hertzsprung-Russell diagram from Gaia DR2 data with Elasticsearch

Elasticsearch was one of the open source products on my list to try out, ever since I got rejected for a couple of assignments as a consultant last year. Apparently it’s a popular product. But why do you need a … Continue reading

Posted in Learning Big Data, NoSQL | Tagged , , , , , , , , | Leave a comment

Building HDP 2.6 on AWS, Part 3: the worker nodes

This is part 3 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. By now we have an edge node to run Ambari Server, three master nodes for Hadoop name nodes and such. Now … Continue reading

Posted in Howto, Learning Big Data | Tagged , , , , , , , | Leave a comment

I feel great when I study

When I started studying Hadoop, Python and machine learning in 2016, I found something out that I didn’t expect. I feel better when I study. When I finished another problem, exam or course, and I stepped outside the house to … Continue reading

Posted in Learning Big Data | Tagged , | Leave a comment

Recovering your HDP 2.6.1 Sandbox on VirtualBox after a restart

If you’ve worked with the Hortonworks Data Platform 2.x sandbox of later versions in VirtualBox and made it shutdown rather vigorously, you might have noticed that you won’t get past this startup screen when you try to start it up … Continue reading

Posted in Apache Products for Outsiders, Howto, Learning Big Data | Tagged , , , , , , , , , | 2 Comments

Tutorial: Let’s throw some asteroids in Apache Hive

This is a tutorial on how to import data (with fixed lenght) in Apache Hive (in Hortonworks Data Platform 2.6.1). The idea is that any non-Hive, non-Hadoop savvy people can follow along, so let me know if I succeeded (make … Continue reading

Posted in Apache Products for Outsiders, Learning Big Data | Tagged , , , , , , , , | Leave a comment