TL;DR: I made a docker-compose setup that runs Hadoop, Spark and Hive in a multi-container environment. You can find the necessary files for it here:
https://github.com/Marcel-Jan/docker-hadoop-spark
[Update 2021-11-09: Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by default there have been all kinds of connection problems running the complete docker-compose. Turning this option on again (Settings > General > Expose daemon on tcp://localhost:2375 without TLS) makes it all work. I’m still looking for a more secure solution to this]
How it started
We at DIKW are working on a Certified Data Engineering Professional course. It is a course where you learn every aspect of being a data engineer we could think of: the cool big data stuff, but also how data warehousing works and how it all can work together.
One of the topics is Hadoop. Now our course has an important practical aspect. We’re not just going to bombard you with theory. You have to try the products/methods yourself. On your own laptop. So for the Hadoop module I suggested using the Cloudera sandbox on Docker, because our practice environments work on Docker and the Cloudera sandbox has it all.
And at one moment my colleague Hugo Koopmans told me we had a problem: building the Cloudera sandbox on his laptop took way too long and required way too much memory. Could we use a simpler (and much older) Hadoop implementation instead?
My thoughts were: Simpler? Yes! Old version? No way! We’re not going to start a new course with a five-year-old Hadoop version. And off I went on a quest for a lightweight Hadoop cluster on Docker. Ideally with Spark and maybe Hive. Because I like databases.
The quest for a lightweight and up to date Hadoop cluster
After searching and finding all kinds of Hadoop-on-Docker images, I found that most of them were old. But it turned out that Big Data Europe has a Docker environment with Hadoop 3.2.1, and it’s only 9 months old. Much better. Their Spark version is also pretty much up to date.
But how to get the Spark nodes to connect to the Hadoop nodes? I could not get the docker-composed Hadoop nodes and the docker-composed Spark nodes to speak to each other.
(There was a reason for that and I just found out why. I thought I used Big Data Europe’s Spark setup, but it looks like I got a different one. One that had a spark-net network defined. And I can’t remember where I got it from. It looks like sdesliva26’s version but it’s not that one either.)
Anyhow, I gradually learned that I needed to combine the docker-compose.yml files somehow.
Quick! Learn docker-compose
Now when you’ve worked with docker-compose for a while, you might think “how hard could it be?”. But I had no idea what the principles of this thing were.
Docker-compose is a way to quickly create a multi-container environment. Perfect for creating clusters, like a Hadoop cluster with a namenode and datanode. It is all defined in a docker-compose.yml file. But the docker-compose.yml file can reference shell scripts to run or files with environment settings, so you need those files too.
I spent countless hours combining docker-compose services, trying to get them to work and not understanding why they would not. But after removing those Spark networks it worked much better. It turns out that when you don’t define any network in docker-compose, the services all become part of one default network that Docker creates automatically.
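If you’re curious what that automatically created network looks like, you can inspect it with plain docker commands. By default Compose names the network after the project directory with _default appended, so the exact name depends on where you cloned things:
docker network ls
docker network inspect docker-hadoop-spark_default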
Dissecting a docker-compose.yml file
You can skip this section if you just want to run the Docker Hadoop environment and don’t really care how. (Go ahead. I won’t be judgemental. That’s how I started myself.)
So here is a simplified example of one service I took from the Hadoop docker-compose.yml:
version: "3" services: namenode: image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 container_name: namenode restart: always ports: - 9870:9870 - 9000:9000 volumes: - hadoop_namenode:/hadoop/dfs/name environment: - CLUSTER_NAME=test - CORE_CONF_fs_defaultFS=hdfs://namenode:8020 env_file: - ./hadoop.env volumes: hadoop_namenode:
You can see it starts with a version. That’s the version of the Compose file format; 3 is the most recent major version.
The namenode service is based on an image prepared by Big Data Europe. Docker images are like blueprints for Docker containers. I sometimes think of the Docker image as an installation file and the container as the actual application running. I hope you get the idea. This service definition refers to where the image can be found on Docker Hub. Docker Hub is like an app store for Docker images.
We can also see ports defined. The Hadoop namenode has services running on these ports and we want to be able to reach them from outside the container. For example, once you have started the containers, you’ll find the namenode information on http://localhost:9870. Should you want a different port on your laptop, because multiple containers want port 80 or something, the important thing to remember is that the first port is the one on the outside (your laptop) and the second is the one inside the container.
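A quick way to check those mappings once the containers are running is docker port. For the namenode the output should look something like this, with the container port on the left and the host side on the right:
docker port namenode
9000/tcp -> 0.0.0.0:9000
9870/tcp -> 0.0.0.0:9870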
The namenode also needs a permanent place to store data. For this there is the volume definition. But remember that you also have to list the volumes at the end of the file, after the service definitions.
And there is the hadoop.env file, which contains all kinds of environment variables needed to run Hadoop.
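The entries in hadoop.env follow the naming convention of the bde2020 images: a prefix says which Hadoop config file a property belongs to, and the dots in the property name are replaced by underscores. A couple of illustrative lines (the hadoop.env in the repository is the authoritative version):
# CORE_CONF_ entries end up in core-site.xml, HDFS_CONF_ entries in hdfs-site.xml
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
HDFS_CONF_dfs_replication=1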
Let’s get this thing started
Now you can download Big Data Europe’s docker-hadoop repository or my docker-hadoop-spark repository, and from the directory where you placed it, all it takes is this command to get the multi-container environment running:
docker-compose up -d
The -d means it runs in the background.
BTW the config file can have another name than docker-compose.yml. But then you need the -f option to point docker-compose to the correct file:
docker-compose -f mymulticontainers.yml up -d
And you can break it all down again by going to that same directory and running this:
docker-compose down
All the containers will then be stopped and removed. But the images and volumes stay! So don’t be surprised that the csv file you uploaded to HDFS is still there.
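If you ever do want a truly clean slate, Compose can remove the named volumes as well. Be aware that this deletes everything you stored in HDFS:
docker-compose down -v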
Combining docker-compose files
It turns out you can copy-paste services from the Spark docker-compose.yml into the Hadoop docker-compose.yml, provided you also add the directories that come with the docker-spark GitHub repository. And I learned that I needed to remove the spark-network network (wherever it came from).
And I thought I needed to break the environment down and build it up again every time I changed docker-compose.yml, because that’s how stuff usually works. But not with docker-compose: you can edit the docker-compose.yml file and just run “docker-compose up” again. I learned that after a whole lot of building up and breaking down Docker environments, by the way.
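One habit that saved me a few of those build-up and break-down cycles (not something the workflow depends on, just a sanity check): let Compose validate and print the resolved configuration before starting anything.
docker-compose config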
How the Hadoop-Spark-Hive docker-compose was built
So in the end it was a matter of adding the services from one docker-compose.yml to the other, plus all the necessary files. It took me a while to understand how to use everything and from where, but I’ve got that all figured out now and I’ve written quick starts for HDFS, Spark and Hive.
Quick starts
Quick start HDFS
Find the Container ID of the namenode.
docker ps |grep namenode
1df7a57164de bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /run…" 27 hours ago Up 12 hours (healthy) 0.0.0.0:9000->9000/tcp, 0.0.0.0:9870->9870/tcp namenode
Copy breweries.csv to the namenode.
docker cp breweries.csv 1df7a57164de:breweries.csv
Go to the bash shell on the namenode with that same Container ID of the namenode.
docker exec -it 1df7a57164de bash
Create an HDFS directory /data/openbeer/breweries.
hdfs dfs -mkdir /data
hdfs dfs -mkdir /data/openbeer
hdfs dfs -mkdir /data/openbeer/breweries
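The same directory tree can also be created in one command with the -p flag, which creates any missing parent directories:
hdfs dfs -mkdir -p /data/openbeer/breweries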
Copy breweries.csv to HDFS:
hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
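By the way, because the compose file sets container_name: namenode, the docker commands above also accept the container name instead of the Container ID, which saves you the docker ps lookup:
docker cp breweries.csv namenode:breweries.csv
docker exec -it namenode bash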
Quick start Spark
Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop). Here you find the spark:// master address:
Spark Master at spark://452dd59615b0:7077
Go to the command line of the Spark master and start spark-shell.
docker ps |grep spark
efef70177b0b bde2020/spark-worker:3.0.0-hadoop3.2 "/bin/bash /worker.sh" 27 hours ago Up 12 hours 0.0.0.0:8081->8081/tcp spark-worker-1
453dd19695b0 bde2020/spark-master:3.0.0-hadoop3.2 "/bin/bash /master.sh" 27 hours ago Up 12 hours 0.0.0.0:7077->7077/tcp, 6066/tcp, 0.0.0.0:8080->8080/tcp spark-master
docker exec -it 453dd19695b0 bash
spark/bin/spark-shell --master spark://452dd59615b0:7077
Load breweries.csv from HDFS.
val df = spark.read.csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")
df.show()
+----+--------------------+-------------+-----+---+
| _c0| _c1| _c2| _c3|_c4|
+----+--------------------+-------------+-----+---+
|null| name| city|state| id|
| 0| NorthGate Brewing | Minneapolis| MN| 0|
| 1|Against the Grain...| Louisville| KY| 1|
| 2|Jack's Abby Craft...| Framingham| MA| 2|
| 3|Mike Hess Brewing...| San Diego| CA| 3|
| 4|Fort Point Beer C...|San Francisco| CA| 4|
| 5|COAST Brewing Com...| Charleston| SC| 5|
| 6|Great Divide Brew...| Denver| CO| 6|
| 7| Tapistry Brewing| Bridgman| MI| 7|
| 8| Big Lake Brewing| Holland| MI| 8|
| 9|The Mitten Brewin...| Grand Rapids| MI| 9|
| 10| Brewery Vivant| Grand Rapids| MI| 10|
| 11| Petoskey Brewing| Petoskey| MI| 11|
| 12| Blackrocks Brewery| Marquette| MI| 12|
| 13|Perrin Brewing Co...|Comstock Park| MI| 13|
| 14|Witch's Hat Brewi...| South Lyon| MI| 14|
| 15|Founders Brewing ...| Grand Rapids| MI| 15|
| 16| Flat 12 Bierwerks| Indianapolis| IN| 16|
| 17|Tin Man Brewing C...| Evansville| IN| 17|
| 18|Black Acre Brewin...| Indianapolis| IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
How cool is that? Your own Spark cluster to play with.
Quick start Hive
Find the Container ID of the Hive Server.
docker ps |grep hive-server
60f2c3b5eb32 bde2020/hive:2.3.2-postgresql-metastore "entrypoint.sh /bin/…" 27 hours ago Up 12 hours 0.0.0.0:10000->10000/tcp, 10002/tcp hive-server
Go to the command line of the Hive server and start hiveserver2.
docker exec -it 60f2c3b5eb32 bash
hiveserver2
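Started like this, hiveserver2 keeps your terminal occupied. If you’d rather keep working in the same shell, a plain background job works too (generic shell usage, nothing specific to this image):
nohup hiveserver2 > /tmp/hiveserver2.log 2>&1 &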
Maybe do a little check that something is listening on port 10000 now:
netstat -anp | grep 10000
tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 446/java
Okay. Beeline is the command line interface with Hive. Let’s connect to hiveserver2 now.
beeline
!connect jdbc:hive2://127.0.0.1:10000 scott tiger
I didn’t expect to encounter scott/tiger again after my Oracle days, but there you have it. Definitely not a good idea to keep that user around in production.
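You can also skip the interactive !connect step and pass the connection string straight on the beeline command line, with the same scott/tiger credentials:
beeline -u jdbc:hive2://127.0.0.1:10000 -n scott -p tiger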
Not a lot of databases here yet.
show databases;
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (0.335 seconds)
Let’s change that.
create database openbeer;
use openbeer;
And let’s create a table.
CREATE EXTERNAL TABLE IF NOT EXISTS breweries(
NUM INT,
NAME CHAR(100),
CITY CHAR(100),
STATE CHAR(100),
ID INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/data/openbeer/breweries';
And have a little select statement going.
select name from breweries limit 10;
+----------------------------------------------------+
| name |
+----------------------------------------------------+
| name |
| NorthGate Brewing |
| Against the Grain Brewery |
| Jack's Abby Craft Lagers |
| Mike Hess Brewing Company |
| Fort Point Beer Company |
| COAST Brewing Company |
| Great Divide Brewing Company |
| Tapistry Brewing |
| Big Lake Brewing |
+----------------------------------------------------+
10 rows selected (0.113 seconds)
There you go: your private Hive server to play with.
Conclusion
I got the lightweight Hadoop environment that I wanted. On my Windows 10 laptop with WSL2 (Windows Subsystem for Linux 2) installed, it uses only 3 GB of memory. That’s not half bad. It was sometimes a frustrating journey, but I learned a lot about Docker and docker-compose, and learned to love it.
I hope you have fun with this Hadoop-Spark-Hive cluster too.
Comments
Hi Marcel,
I get the following error when starting the containers:
historyserver | [88/100] try in 5s once again …
resourcemanager | [89/100] check for namenode:9000…
resourcemanager | [89/100] namenode:9000 is not available yet
resourcemanager | [89/100] try in 5s once again …
nodemanager | [89/100] check for namenode:9000…
nodemanager | [89/100] namenode:9000 is not available yet
nodemanager | [89/100] try in 5s once again …
historyserver | [89/100] check for namenode:9000…
historyserver | [89/100] namenode:9000 is not available yet
historyserver | [89/100] try in 5s once again …
I did not find a way to solve this.
Can you please give me some help?
Thank you,
Marius
Hi Marius,
I’ve had these errors too, but I was still able to run Spark and Hive anyway. I haven’t found a way to solve these errors. It’s strange, because all the containers are on the same Docker bridge network.
It does not work for me, I get the following error when running “make wordcount”:
mkdir: Call From 0c769d530a47/172.18.0.13 to namenode:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
make: *** [Makefile:16: wordcount] Error 1
Even so, I think this is a great article, great work.
Thank you!
Oh, I forgot about that one. It came with the Spark on Docker quick start, but I haven’t tested it. Try the HDFS, Spark and Hive quick starts below it. Those should work.
I think I’ll remove the “make wordcount” one, because it was made in Java and I… don’t know Java.
Hi there,
I used similar bde images and tried to upgrade to Hadoop 3 images and Hive 3.1, with no luck. The hive-server container has issues after I upgraded to Hive 3.1.
I wonder whether you have tried using Hive 3? The Hadoop 3 containers work fine, except for the hive-server and metastore images.
I haven’t tried upgrading. Hive-server does work in this one, except you have to start it manually. If I find a way to solve that, I will update it.
For everyone who has the “namenode:9000” error message: just replace all occurrences of port 8020 with 9000.
Just using my editor to search for 8020 and replace all with 9000 did the trick for me.
How could I have overlooked that? I also replaced namenode:9010 with namenode:9000 and now I don’t get any new error messages anymore.
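For anyone who prefers the command line over an editor, something like this should do the same replacement (assuming GNU sed, and do check with grep first that 8020 only occurs where you want it changed):
grep -rn 8020 docker-compose.yml hadoop.env
sed -i 's/8020/9000/g' docker-compose.yml hadoop.env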
I keep getting these issues
INFO util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1538ms
datanode | No GCs detected
And the Spark session keeps getting stuck when I try to query using spark-sql. Any idea why?
I have also tried to increase JAVA_OPTS (increase the heap size) in docker-compose.yml for each container. Yet I face the same issue.
The Hive containers don’t start!
root@carl-ubuntu:/home/carl/bde/docker-hadoop-spark-workbench# docker-compose -f docker-compose-hive.yml up -d namenode hive-metastore-postgresql
Starting docker-hadoop-spark-workbench_hive-metastore-postgresql_1 …
Starting docker-hadoop-spark-workbench_hive-metastore-postgresql_1 … error
ERROR: for docker-hadoop-spark-workbench_hive-metastore-postgresql_1 Cannot start service hive-metastore-postgresql: network 0ce0c9fe2b45039ef0bfc6a06e39509e224be8edc30abc508668cb28e8268767 not found
ERROR: for hive-metastore-postgresql Cannot start service hive-metastore-postgresql: network 0ce0c9fe2b45039ef0bfc6a06e39509e224be8edc30abc508668cb28e8268767 not found
ERROR: Encountered errors while bringing up the project.
Hi,
I had my own problems with the Hadoop-Spark-Hive cluster. Spark would no longer connect to the namenode. Hive server wouldn’t start and connect anymore. Despite the fact that this all USED TO WORK. Even after completely doing a docker-compose down and reloading all the images. Same issues.
Well I found a “solution”, but you’re not going to like it. I did a factory reset of Docker Desktop. And docker-compose up reloaded every single image. And then everything worked again as it used to: Spark connected to the namenode, Hive server ran again and I was able to do queries using beeline.
Hi,
Thanks for doing this tutorial.
While trying to use spark.read.csv I’m getting a connection refused error.
java.net.ConnectException: Call From 52847e525b17/172.18.0.11 to namenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Any Advice?
Thanks!
I think I recently had a similar issue. It really looks like some changes in Docker Desktop caused this. When I did a factory reset on Docker Desktop all containers started correctly and all the exercises worked. As you can see in this (Dutch) video:
https://youtu.be/ajo2CYz_GSg
This won’t be easy to solve. I have to dig into this.
Thanks for your effort. A good tutorial and easy to learn from.
Can you help with how to submit a Spark job from the namenode (Hadoop docker), with an example?
When I ran spark-submit, nothing was shown in the “namenode” or Spark containers either.
My data is in HDFS: /user/data.csv
Hey! Awesome post, it helped me a lot! But I’m having some trouble with persisting data in Hive. I’m able to create the databases and the tables perfectly and work with them, but as soon as I close the interactive hive-server terminal, it’s all gone: the databases, the tables and the data itself.
scala> spark.read.csv("hdfs://800287af9790:8020/data/openbeer/breweries/breweries.csv")
java.net.ConnectException: Call From 4f3bf3b3f5f4/172.19.0.11 to 800287af9790:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:903)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1665)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1582)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1594)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1700)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
… 47 elided
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:690)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:794)
at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1572)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
… 76 more
scala>
Great Article!!
Hey Marcel-Jan Krijgsman
Thank you, thank you; it worked flawlessly. Not sure how long it took you, but after cloning the repo I was up and testing the scripts in your article within 20 minutes. I was in such a jam to get a Hive server up so I could test some of my scripts. Now you’ve got me running. Thanks again. Great job.
arif
Thanks Arif. I’m actually a bit surprised that it works for some people, because on more recent versions of Docker Desktop a security change has been made and now I can’t get Hive running myself. Some of the containers can’t communicate with each other anymore and it needs some kind of setup with certificates to get it to work again.
I ran into those issues. Can anything be done about that?
Hi santosh,
I believe the solution is to install certificates on all the containers. It’s quite a lot of work and I haven’t had the time to work that out.
In fact, I created this docker-compose setup to have a training environment for Hadoop, but nowadays my training has evolved into a data lakehouse training and I use Databricks Community Edition for that.