I built a working Hadoop-Spark-Hive cluster on Docker. Here is how.

TL;DR: I made a docker-compose setup that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files here:

https://github.com/Marcel-Jan/docker-hadoop-spark

 

How it started

We at DIKW are working on a Certified Data Engineering Professional course. It is a course where you learn all the aspects of being a data engineer we could think of: the cool big data stuff, but also how data warehousing works and how it all can work together.

One of the topics is Hadoop. Now our course has an important practical aspect. We’re not just going to bombard you with theory. You have to try the products/methods yourself. On your own laptop. So for the Hadoop module I suggested using the Cloudera sandbox on Docker, because our practice environments work on Docker and the Cloudera sandbox has it all.

And at one point my colleague Hugo Koopmans told me we had a problem: building the Cloudera sandbox on his laptop took way too long and required way too much memory. Could we use a simpler (and much older) Hadoop implementation instead?

My thoughts were: Simpler? Yes! Old version? No way! We're not going to start a new course with a five-year-old Hadoop version. And off I went on a quest for a lightweight Hadoop cluster on Docker. Ideally with Spark and maybe Hive. Because I like databases.

 

The quest for a lightweight and up-to-date Hadoop cluster

After searching and finding all kinds of Hadoop on Docker images, I found most of them were old. But it turned out that Big Data Europe has a Docker environment with Hadoop 3.2.1 that was only 9 months old at the time. Much better. Their Spark version is also pretty much up to date.

But how to get the Spark nodes to connect to the Hadoop nodes? I could not get the docker-composed Hadoop nodes and the docker-composed Spark nodes to speak to each other.

(There was a reason for that and I just found out why. I thought I used Big Data Europe’s Spark setup, but it looks like I got a different one. One that had a spark-net network defined. And I can’t remember where I got it from. It looks like sdesliva26’s version but it’s not that one either.)

Anyhow, I gradually learned that I needed to combine the docker-compose.yml files somehow.

 

Quick! Learn docker-compose

Now when you’ve worked with docker-compose for a while, you might think “how hard could it be?”. But I had no idea what the principles of this thing were.

Docker-compose is a way to quickly create a multi-container environment. Perfect for creating clusters, like a Hadoop cluster with a namenode and datanodes. The whole environment is defined in a docker-compose.yml file, but that file can refer to shell scripts to run or files with environment settings, so you need those files too.

I spent countless hours combining docker-compose services, trying to get them to work and not understanding why they wouldn't. But after removing those Spark networks it worked much better. It turns out that when you don't define any network in docker-compose, the services all become part of the same default network that Docker creates automatically.
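To make that concrete: once the Spark services live in the same docker-compose.yml as the Hadoop ones and there is no networks: section left anywhere, they can all reach each other by service name. A stripped-down sketch (just the idea, not the actual Big Data Europe files):

version: "3"

services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8

  spark-master:
    image: bde2020/spark-master:3.0.0-hadoop3.2
    # No networks: section anywhere in this file, so both services join the
    # default network docker-compose creates for this project, and
    # spark-master can reach the namenode simply as "namenode".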

 

Dissecting a docker-compose.yml file

You can skip this section if you just want to run the Docker Hadoop environment and don’t really care how. (Go ahead. I won’t be judgemental. That’s how I started myself.)

So here is a simplified example of one service I took from the Hadoop docker-compose.yml:

version: "3"

services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    container_name: namenode
    restart: always
    ports:
      - 9870:9870
      - 9000:9000
    volumes:
      - hadoop_namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=test
      - CORE_CONF_fs_defaultFS=hdfs://namenode:8020
    env_file:
      - ./hadoop.env

volumes:
  hadoop_namenode:

You can see it starts with a version. That's the version of the docker-compose file format; version 3 was the latest major version when I wrote this.

The namenode service is based on an image prepared by Big Data Europe. Docker images are like blueprints for Docker containers. I sometimes think of the Docker image as an installation file and the container as the actual application running. I hope you get the idea. This service definition refers to where the image can be found on Docker Hub. Docker Hub is like an app store for Docker images.

We can also see ports defined. The Hadoop namenode has services running on these ports and we want to reach them from outside the container. For example, once the containers are running, you find the namenode information on http://localhost:9870. Should you want a different port on your laptop, because something else already uses it, remember that the first port is the one on the host (the outside) and the second is the port inside the container.
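For example, if port 9870 is already taken on your laptop, a mapping like this (a sketch, not part of the original file) would expose the namenode UI on host port 9871 while the container keeps listening on 9870:

    ports:
      - 9871:9870   # host port 9871 -> container port 9870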

The namenode also needs a permanent place to store data. For this there is the volumes definition. But remember that you also have to declare the volume in the top-level volumes section, after the service definitions.

And there is an env_file entry pointing to hadoop.env, which contains all kinds of environment variables needed to configure Hadoop.
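For the bde2020 images the variable names encode both the config file and the property: a prefix like CORE_CONF_ or HDFS_CONF_ picks the file (core-site.xml, hdfs-site.xml) and the rest is the property name with dots replaced by underscores. A few representative lines (not a verbatim copy of the repository's hadoop.env, so check the repo for the real thing):

CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false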

 

Let’s get this thing started

Now you can download Big Data Europe's docker-hadoop repository, or my docker-hadoop-spark repository, and from the directory where you placed it, all it takes is this command to get the multi-container environment running:

docker-compose up -d

The -d means it runs in the background.

BTW the config file can have another name than docker-compose.yml. But then you need the -f option to point docker-compose to the correct file:

docker-compose -f mymulticontainers.yml up -d

And you can break it all down again by going to that same directory and running this:

docker-compose down

All the containers will then be stopped and removed. But: the images and volumes stay! So don’t be surprised that the csv file you uploaded to HDFS will still be there.
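If you do want a truly clean slate, docker-compose can also remove the named volumes. This deletes everything stored in HDFS, so use it with care:

  docker-compose down -v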

 

Combining docker-compose files

It turns out you can copy-paste services from the Spark docker-compose.yml into the Hadoop docker-compose.yml, provided you also add the directories from the docker-spark GitHub repository. And I learned that I needed to remove the spark-network network (wherever it came from).

And I thought I needed to break the environment down and build it up again every time I changed docker-compose.yml. Because that's how stuff usually works. But not with docker-compose. You can edit the docker-compose.yml file and run "docker-compose up" again, and only the services whose definition changed get recreated. I learned that after a whole lot of building up and breaking down docker environments BTW.
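So the edit-and-try-again loop I ended up with is roughly this short:

  docker-compose up -d       # recreates only the services whose definition changed
  docker-compose ps          # check what is running now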

 

How the Hadoop-Spark-Hive docker-compose was built

So in the end it was a question of adding services from one docker-compose.yml to the other, together with all the necessary files. It took me a while to understand how to use everything and from where, but I've got it all figured out now and I've written quick starts for HDFS, Spark and Hive.

 

Quick starts

Quick start HDFS

Find the Container ID of the namenode.

  docker ps |grep namenode

1df7a57164de        bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8          "/entrypoint.sh /run…"   27 hours ago        Up 12 hours (healthy)      0.0.0.0:9000->9000/tcp, 0.0.0.0:9870->9870/tcp             namenode

Copy breweries.csv to the namenode.

  docker cp breweries.csv 1df7a57164de:breweries.csv

Go to the bash shell on the namenode with that same Container ID of the namenode.

  docker exec -it 1df7a57164de bash

Create the HDFS directory /data/openbeer/breweries.

hdfs dfs -mkdir /data
hdfs dfs -mkdir /data/openbeer
hdfs dfs -mkdir /data/openbeer/breweries
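(Those three separate mkdir calls can also be collapsed into one with the -p flag, which creates the parent directories along the way:)

  hdfs dfs -mkdir -p /data/openbeer/breweries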

Copy breweries.csv to HDFS:

  hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
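And a quick check that the file really landed in HDFS:

  hdfs dfs -ls /data/openbeer/breweries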

 

Quick start Spark

Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop). Here you find the spark:// master address:

  Spark Master at spark://452dd59615b0:7077

Go to the command line of the Spark master and start spark-shell.

  docker ps |grep spark
efef70177b0b        bde2020/spark-worker:3.0.0-hadoop3.2                     "/bin/bash /worker.sh"   27 hours ago        Up 12 hours                0.0.0.0:8081->8081/tcp                                     spark-worker-1
453dd19695b0        bde2020/spark-master:3.0.0-hadoop3.2                     "/bin/bash /master.sh"   27 hours ago        Up 12 hours                0.0.0.0:7077->7077/tcp, 6066/tcp, 0.0.0.0:8080->8080/tcp   spark-master

  docker exec -it 453dd19695b0 bash
  
  spark/bin/spark-shell --master spark://452dd59615b0:7077

Load breweries.csv from HDFS.

  val df = spark.read.csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")
  
  df.show()
+----+--------------------+-------------+-----+---+
| _c0|                 _c1|          _c2|  _c3|_c4|
+----+--------------------+-------------+-----+---+
|null|                name|         city|state| id|
|   0|  NorthGate Brewing |  Minneapolis|   MN|  0|
|   1|Against the Grain...|   Louisville|   KY|  1|
|   2|Jack's Abby Craft...|   Framingham|   MA|  2|
|   3|Mike Hess Brewing...|    San Diego|   CA|  3|
|   4|Fort Point Beer C...|San Francisco|   CA|  4|
|   5|COAST Brewing Com...|   Charleston|   SC|  5|
|   6|Great Divide Brew...|       Denver|   CO|  6|
|   7|    Tapistry Brewing|     Bridgman|   MI|  7|
|   8|    Big Lake Brewing|      Holland|   MI|  8|
|   9|The Mitten Brewin...| Grand Rapids|   MI|  9|
|  10|      Brewery Vivant| Grand Rapids|   MI| 10|
|  11|    Petoskey Brewing|     Petoskey|   MI| 11|
|  12|  Blackrocks Brewery|    Marquette|   MI| 12|
|  13|Perrin Brewing Co...|Comstock Park|   MI| 13|
|  14|Witch's Hat Brewi...|   South Lyon|   MI| 14|
|  15|Founders Brewing ...| Grand Rapids|   MI| 15|
|  16|   Flat 12 Bierwerks| Indianapolis|   IN| 16|
|  17|Tin Man Brewing C...|   Evansville|   IN| 17|
|  18|Black Acre Brewin...| Indianapolis|   IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
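As you can see, the header row ends up as a data row and the columns get generic _c0 to _c4 names. If you want proper column names and types, spark.read can be told that the first line is a header. A small variation on the command above (same file, new dataframe name just for this example):

  val df2 = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")

  df2.show(5)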

How cool is that? Your own Spark cluster to play with.

 

Quick start Hive

Find the Container ID of the Hive Server.

  docker ps |grep hive-server

60f2c3b5eb32        bde2020/hive:2.3.2-postgresql-metastore                  "entrypoint.sh /bin/…"   27 hours ago        Up 12 hours                       0.0.0.0:10000->10000/tcp, 10002/tcp                        hive-server

Go to the command line of the Hive server and start hiveserver2:

  docker exec -it 60f2c3b5eb32 bash

  hiveserver2

Maybe a little check that something is listening on port 10000 now:

  netstat -anp | grep 10000
tcp        0      0 0.0.0.0:10000           0.0.0.0:*               LISTEN      446/java

Okay. Beeline is the command-line interface for Hive. Let's connect to hiveserver2 now.

  beeline
  
  !connect jdbc:hive2://127.0.0.1:10000 scott tiger

I didn't expect to encounter scott/tiger again after my Oracle days. But there you have it. Definitely not a good idea to keep that user in production.
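By the way, the same connection can also be made in one go when starting beeline (scott/tiger are still just the placeholder credentials):

  beeline -u jdbc:hive2://127.0.0.1:10000 -n scott -p tiger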

Not a lot of databases here yet.

  show databases;
  
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (0.335 seconds)

Let’s change that.

  create database openbeer;
  use openbeer;

And let’s create a table.

CREATE EXTERNAL TABLE IF NOT EXISTS breweries(
    NUM INT,
    NAME CHAR(100),
    CITY CHAR(100),
    STATE CHAR(100),
    ID INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/data/openbeer/breweries';
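Because of the CHAR(100) columns, the values come back padded to 100 characters, as you'll see in the output below. A variant I find slightly friendlier, with STRING columns and the header row skipped, would look roughly like this (a sketch on the same HDFS location; the table name breweries2 is just for illustration):

CREATE EXTERNAL TABLE IF NOT EXISTS breweries2(
    NUM INT,
    NAME STRING,
    CITY STRING,
    STATE STRING,
    ID INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/openbeer/breweries'
TBLPROPERTIES ('skip.header.line.count'='1');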

And have a little select statement going.

  select name from breweries limit 10;
+----------------------------------------------------+
|                        name                        |
+----------------------------------------------------+
| name                                                                                                 |
| NorthGate Brewing                                                                                    |
| Against the Grain Brewery                                                                            |
| Jack's Abby Craft Lagers                                                                             |
| Mike Hess Brewing Company                                                                            |
| Fort Point Beer Company                                                                              |
| COAST Brewing Company                                                                                |
| Great Divide Brewing Company                                                                         |
| Tapistry Brewing                                                                                     |
| Big Lake Brewing                                                                                     |
+----------------------------------------------------+
10 rows selected (0.113 seconds)

There you go: your private Hive server to play with.

 

Conclusion

I got the lightweight Hadoop environment that I wanted. On my Windows 10 laptop with WSL2 (Windows Subsystem for Linux 2) installed, it uses only 3 GB of memory. That's not half bad. It was sometimes a frustrating journey, but I learned a lot about Docker and docker-compose and learned to love it.

I hope you have fun with this Hadoop-Spark-Hive cluster too.

Screenshot of Docker Desktop with the hadoop-spark-hive cluster

 


A humidity sensor network on a Raspberry Pi with Zigbee2MQTT

I was looking for a way to detect leakage in my apartment with some kind of IoT solution. Someone on the Dutch technology forum Tweakers.net told me Xiaomi humidity sensors, combined with Zigbee2MQTT, might be a good fit. The sensors are quite cheap and so is the CC2531 sniffer stick that receives the data sent over the Zigbee protocol.

So that’s what I set out to do. And in these two videos you see how I got my humidity sensor network working.

 

We visualize the humidity sensor data with Domoticz. Domoticz is a home automation system.

 

So now I have streaming IoT data. I have some plans for that in the future.


ITNEXT Summit 2019: serverless, streaming and cloud native transformations

For the third time in a row I’ve attended the ITNEXT Summit. This year I got a ticket from LINKIT, for which I thank them. It was the best ITNEXT Summit I’ve been at so far.

It started with breakfast. I already had it at home, but I can’t resist a good croissant. Mmm… Where was I? Oh yeah, the summit. In this blogpost I look back on the sessions I attended.

 

Cultivating Production Excellence – Liz Fong-Jones

Liz Fong-Jones about dealing with complexity in production

I've been on-call for complex systems in my life, but in the era of containers and serverless, things have changed. Some things Liz Fong-Jones spoke about in her keynote did sound familiar, but she discussed how, with complex architectures built on distributed systems, containers and cloud, it is no longer a question of systems simply being up or down.


Tech dossier: pandas

I’m keeping tech dossiers in Evernote on open source products I want to keep track of.  And I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on the Python data management library pandas.

 

A short description – in English

Pandas is a Python library. If you already have Python 3 (Python 2 support was recently dropped), it's a matter of running "pip install pandas" and there you are. Pandas allows you to analyze and manipulate your data. But then again, aren't there many more products for that? How do you explain the power of pandas?

Let me put it like this: it is like using Excel, but on much larger datasets, and as if Excel had a command line interface. Imagine being able to say to Excel on a command line: "load my csv file", "use this row as names for my columns", "just show me columns date and sales", "all right, now pivot that". I just love it.
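In pandas that little conversation looks roughly like this (a sketch with made-up file and column names):

import pandas as pd

# "load my csv file", "use this row as names for my columns"
df = pd.read_csv("sales.csv", header=0)

# "just show me columns date and sales"
subset = df[["date", "sales"]]

# "all right, now pivot that"
pivoted = subset.pivot_table(index="date", values="sales", aggfunc="sum")
print(pivoted.head())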

 

Learning pandas

For this I've used pythonprogramming.net. It's free and it gave me an excellent start with data analysis in Python. The YouTube videos for pandas also seem to have been updated recently.

Need to learn Python first? I started learning Python with the Coursera course “An Introduction to Interactive Programming in Python (Part 1)” from Rice University. It’s a great course. But if you want a free course, you can’t go wrong with the pythonprogramming.net videos.

You can also watch a couple of my videos on my first encounters with pandas.

And recently I wrote a blogpost on how I used pandas at work to flatten the data from a complex Excel sheet, so I could load it into Hadoop. I used all kinds of lesser-known features to achieve that result.

 

Building your own environment

Want to play with pandas? That's quite easy. You need to install Python 3 on your own computer and run "pip install pandas" from the command line.
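Something like this is all it takes (assuming Python 3 and pip are already on your path):

  pip install pandas
  python -c "import pandas as pd; print(pd.__version__)"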

 

Getting pandas to do specific stuff

Selecting columns or rows with pandas (Because I keep forgetting after a while)

This article discusses two ways of selecting data with pandas, but it's also handy as a reminder of how to select rows and columns. You can't go wrong now.

How to shift a column in pandas

How do multi-indexes in pandas work? Also in this video:

 

 

Other interesting stuff

Pandas tricks and features you might not know

Data visualization with pandas plot (How cool: you can add .plot to your dataframe)

 

pandas and performance

pandas at extreme performance

 


The Atlas REST API – working examples

Originally I was writing a blogpost about my experiences with Apache Atlas (which is still in the works), in which I would refer to a Hortonworks Community post I wrote with all the working examples of Atlas REST API calls. But since the Hortonworks Community has migrated to the Cloudera Community, that article seems to have been lost. The original URL brings you to the Cloudera Community, but not to the article. The search engine comes up with nothing. I can't find it via my profile either.

It wasn’t particularly easy to gain all this knowledge. So of course I had a backup of all successful commands and output. And here it is. This was all tested on HDP 2.6.5.



Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not a lot that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a Patreon supporter of the Roaring Elephant podcast, I have a discount code for Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It's a so-called MEAP (Manning Early Access Program), which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I've read all the regular chapters and I can honestly say that I did a little proofreading.



Neo4J: Loading rocket data in a graph database

When I first learned about graph databases, like Neo4J, I didn't get it. That's how I always start with new technology: not getting at all why people get so enthusiastic about it. Then I read "Seven Databases in Seven Weeks, 2nd edition" (as reviewed in January). It describes Neo4J as "whiteboard friendly": any diagram with boxes and lines you could draw on a whiteboard can be stored in Neo4J. After reading the first paragraphs about Neo4J, I totally got why graph databases are very interesting.

As usual, I started following a course on Neo4J to get acquainted with the product. Well, there was a course on Neo4J on Udemy.com and I followed it. But it was 3 years old and some of the code it taught is already obsolete. So the less I say about that, the better. But I did learn some Cypher, the query language used in Neo4J, and I later learned the more modern versions of the commands to get stuff done.

Next phase: do a project with it. It took me some time to think of an interesting astronomy or space related dataset to put in it. Eventually I stumbled on a dataset that has been around and maintained for a long time, at least since I first got on the Internet around 1993. It's called Jonathan's Space Page now. And the maintainer, Jonathan McDowell, still keeps the list of orbital space launches and the list of satellites ever launched quite up to date.

And eventually I did manage to load this data in Neo4J and here is my video about that:

 

You can download Neo4J Desktop here:

Neo4j Download Center

You can find my Python and Cypher code here:

https://github.com/Marcel-Jan/neo4j_satellites

Let me know if you want me to go in depth on the code used in this video.

 


Starting at DIKW May 1st 2019

Per May 1st 2019 I'll be working at a new company: DIKW in Nieuwegein. DIKW stands for Data, Information, Knowledge, Wisdom (it works in Dutch too). I will be working as a data engineer on a consultancy basis.

 

I've already met many colleagues at DIKW and they are all very experienced in BI, data science and data engineering. It's fun to converse with them, which actually was part of the job application process.

I'm going to miss the view from the office I had, but in the new job I'll basically manage myself. And let's face it: nobody told me to do 8 courses and become a Certified Kubernetes Administrator last year. Nobody made a year plan for me. Development-wise I'm perfectly capable of finding my own way.


Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn't always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It's very small. But for some people it's a first step into a data-driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you rework the data a little. But some Excel sheets simply need too much reworking, and that's too big a job for NiFi. So I used Python and the pandas library to create a CSV file that Hadoop can handle.

