I never imagined myself as the maintainer of a data-engineering-related open source project. Yet here we are. When I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera sandbox, but when some of my colleagues tried it, they complained that it took way too much time to start and way too many of their laptop's resources. It was a good bet that our students would run into the same problem.
Long story short: I found that Big Data Europe already had a simple Dockerized Hadoop setup. They had done all the hard work. But I wanted Hive and Spark on top of it, so I started tinkering with docker-compose YAML files (and learned a lot along the way, by the way). After some initial frustration, it finally worked.
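To give an idea of what that wiring looks like, here is a rough sketch of such a docker-compose.yml, based on the Big Data Europe images. The image tags, ports, and environment variables are illustrative assumptions, not the exact contents of my repo:

```yaml
# Sketch: Hive and Spark services wired onto the Big Data Europe
# Hadoop images. Tags and settings are illustrative, not exact.
version: "3"
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
    environment:
      - CLUSTER_NAME=test
    ports:
      - "9870:9870"          # HDFS web UI
  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://namenode:9000
    depends_on:
      - namenode
  hive-server:
    image: bde2020/hive:2.3.2-postgresql-metastore
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://namenode:9000
    depends_on:
      - namenode
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://namenode:9000
    depends_on:
      - namenode
```

The key trick is that every service points its `fs.defaultFS` at the `namenode` service name, so Hive and Spark find HDFS over the Compose network.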
As an afterthought, I put it on my GitHub:
It turns out that other people were looking for exactly that kind of thing, and suddenly I was the maintainer of an open source repo. I wasn't aware of it at first, because I did not have my GitHub notifications forwarded to my email. I only found out when I visited GitHub several months later. And the number of stars was rising.
Last year Docker underwent some changes: they hardened their software, and suddenly the Docker daemon was no longer available on tcp://localhost:2375 without TLS. The Spark and Hive containers could no longer connect to the HDFS cluster. Panic! Luckily there is a workaround: in the Docker settings you can enable "Expose daemon on tcp://localhost:2375 without TLS", and then everything works again.
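After flipping that setting, you can sanity-check that the daemon really answers on plain TCP. This is a hedged sketch assuming Docker Desktop with the setting enabled; the `/version` endpoint is part of the Docker Engine API:

```shell
# Assumes "Expose daemon on tcp://localhost:2375 without TLS" is
# enabled in Docker Desktop's settings.

# Point CLI tools at the unencrypted TCP endpoint instead of the socket.
export DOCKER_HOST=tcp://localhost:2375

# Quick check: the Engine API should answer with version info as JSON.
curl -s http://localhost:2375/version
```

If the `curl` call returns JSON instead of a connection error, the containers can reach the daemon again.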