Notes on my “Becoming a Hadoop Specialist” session

Today I talked about how I became a Hadoop specialist/data engineer at the ITNEXT Data Engineering & DevOps meetup.

Here are a couple of links that were (or weren't) in my presentation:

The (what I call) “hype-o-meter” site from YCombinator: https://news.ycombinator.com/


Sites with courses:

Coursera (fixed-date courses): Coursera.org

Udacity (self-paced courses): udacity.com

Udemy (non-MOOC course site with crazy discounts): udemy.com

MOOC search engine: class-central.com

MongoDB University (free as long as it’s MongoDB 🙂 ): university.mongodb.com


Making a Hertzsprung-Russell diagram from Gaia DR2 data with Elasticsearch

Elasticsearch was one of the open source products on my list to try out, ever since I got rejected for a couple of assignments as a consultant last year. Apparently it's a popular product. But why do you need a search engine in a Big Data architecture? I explain this in my new video, where I load newly released data from ESA's Gaia mission into Elasticsearch with Logstash and visualize it with Kibana. I've also…
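A Hertzsprung-Russell diagram plots absolute rather than apparent magnitudes, so somewhere before or during indexing (for example in a Logstash filter) the Gaia parallax has to be converted. A minimal Python sketch of that conversion, assuming the Gaia DR2 column names `phot_g_mean_mag` (apparent G magnitude) and `parallax` (in milliarcseconds); the rest of the record is made up for illustration:

```python
import math

def absolute_magnitude(apparent_mag, parallax_mas):
    """Absolute magnitude M from apparent magnitude m and parallax p in
    milliarcseconds: M = m + 5 * log10(p / 100), which is the usual
    M = m - 5 * log10(distance_pc) + 5 with distance_pc = 1000 / p."""
    return apparent_mag + 5 * math.log10(parallax_mas / 100.0)

# A hypothetical Gaia DR2 row; parallax of 100 mas means a distance of 10 pc,
# where absolute and apparent magnitude coincide by definition.
star = {"phot_g_mean_mag": 5.0, "parallax": 100.0}
star["abs_mag_g"] = absolute_magnitude(star["phot_g_mean_mag"], star["parallax"])
print(star["abs_mag_g"])  # -> 5.0
```

With `abs_mag_g` stored on each document, the HR diagram is then just a Kibana scatter of color (or magnitude difference) against `abs_mag_g`.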

Codemotion Amsterdam 2018, day two

Back on the ferry to the north of Amsterdam I went, back for day two of Codemotion Amsterdam 2018.

Keynote

Daniel Gebler from PicNic told us about what they are doing today to bring groceries home for people. I’ve seen two presentations by PicNic before and I could really see their progress from session to session.

Daniel explained how they use a recommender system to make it possible for customers to buy their most common groceries with one tap in the PicNic app. Which is actually hard: even with 90% precision per predicted item, a basket of 12 items is predicted entirely correctly only about 28% of the time (0.9¹² ≈ 0.28). So they really had to work to get a much better precision per item. They managed to do that by working with two dimensions of data: big and deep data.
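The compounding effect Daniel described is easy to check: assuming independent predictions, per-item precision multiplies across the basket. A quick Python sketch (the numbers are illustrative, not Picnic's):

```python
def basket_precision(per_item_precision, basket_size):
    """Probability that every item in the basket is predicted correctly,
    assuming independent per-item predictions."""
    return per_item_precision ** basket_size

print(round(basket_precision(0.90, 12), 2))  # -> 0.28
print(round(basket_precision(0.99, 12), 2))  # -> 0.89
```

It also shows why per-item precision had to go way up: even 99% per item still misses more than one basket in ten.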

Codemotion Amsterdam 2018, day one

Last Friday I almost felt I had to explain to a colleague that I don't always win raffles and lotteries. Because yep, I won another ticket, again via the Roaring Elephant podcast. It's well worth listening to them, is all I'm saying.

This was a ticket for Codemotion Amsterdam 2018. Codemotion is a conference for developers, with topics like blockchain, Big Data, Internet of Things, DevOps and software architecture, but also front-end development, game development and AR/VR.

Amsterdam from the ferry to the north of the city.


Starting at Port of Rotterdam on 1 May 2018

Next week (1 May 2018) I will start as a Hadoop specialist/data steward/data custodian/data something something on the Advanced Analytics team at Port of Rotterdam. We haven't worked out a fancy data something title yet. I'm already working on this team as a consultant, where I've been involved with security and data governance of the data lake (for people outside Big Data: our data lake is a Hadoop cluster).

The World Port Center


Dataworks Summit Berlin 2018, day two

Back for round two of keynotes, good technical sessions and discussing them with fellow data specialists in between.

Keynotes

First up was Frank Säuberlich from Teradata, who had an interesting example of machine learning for fraud detection at Danske Bank. They treated transaction data sort of like pixels and ran it through a Convolutional Neural Network to find outliers. And they did. Their previous solution produced many false positives; with this approach they managed…

Building HDP 2.6 on AWS, Part 3: the worker nodes

This is part 3 in a series on how to build a Hortonworks Data Platform 2.6 cluster on AWS. By now we have an edge node to run Ambari Server and three master nodes for the Hadoop name nodes and such. Now we need worker nodes to process the data.

Creating the worker nodes is not that different from creating the master nodes, but the workers need more powerful instance types.

Creating the first worker node

Log in to Amazon Web Services again, in the same AWS region as the edge and master nodes. We start with one worker node and clone two more later on. Go to the EC2 dashboard in the AWS console and click “Launch instance”, then choose Ubuntu Server 16.04 from the Amazon Machine Images.
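The console clicks above map directly onto the EC2 API, which is handy once you're cloning nodes. A hedged boto3-style sketch of the same launch; the AMI id is a placeholder (Ubuntu 16.04 AMI ids differ per region) and the instance type is my assumption, not a recommendation from this series:

```python
# Parameters for launching one worker node, mirroring the console steps.
launch_params = {
    "ImageId": "ami-xxxxxxxx",    # placeholder: look up the Ubuntu 16.04 AMI for your region
    "InstanceType": "m4.xlarge",  # assumed worker size; pick what your workload needs
    "MinCount": 1,
    "MaxCount": 1,
}

# With boto3 installed and AWS credentials configured, this would launch it:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="eu-west-1")
#   ec2.run_instances(**launch_params)
print(sorted(launch_params))
```

Bumping `MaxCount` (or re-running with the same parameters) is the API equivalent of the cloning we do later in the console.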

I feel great when I study

When I started studying Hadoop, Python and machine learning in 2016, I found out something I didn't expect: I feel better when I study. Whenever I finished another problem, exam or course and stepped outside the house to do some shopping or to go to work, I felt great. And this effect is pretty consistent. Currently I'm in week 3 of MongoDB for DBAs at MongoDB University and in lecture 35 of an Elasticsearch…