Playing with asteroids data in MongoDB

If there is one thing I learned when becoming a data engineer, it’s that having just Hadoop expertise is probably not enough. For starters, what it means to be a data engineer is not exactly sharply defined. Some say data engineers are (Java) developers. Some place data engineers more on the operations side. And at some organisations data engineers work with any combination of these products: Hadoop, Elasticsearch, MongoDB, Cassandra, relational databases and even less hip products.

So I thought it would be a good idea to broaden my horizons. One product that is used quite often is MongoDB, a NoSQL database. And if you don’t know exactly what that means, I think you will get the idea after viewing this video I made.


Tutorial: Let’s throw some asteroids in Apache Hive

This is a tutorial on how to import fixed-length data into Apache Hive (on Hortonworks Data Platform 2.6.1). The idea is that people who are not Hive or Hadoop savvy can follow along, so let me know if I succeeded (make sure you don’t look like comment spam, though. I’m getting a lot of that lately, even though it never passes my approval).
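To give an idea of what "fixed length" means: every field sits at a fixed character position on each line, instead of being separated by commas or tabs. Below is a minimal Python sketch of slicing such a line into fields and turning it into a tab-delimited row, which is one format Hive can read as delimited text. The column layout, the field names and the sample line are made up for illustration. They are not the real Minor Planet Center layout, and this is not necessarily how the tutorial itself does the import.

```python
# Hypothetical layout: (field name, start, end) with 0-based, end-exclusive offsets.
# These positions are an assumption for this sketch, not the real asteroid file format.
LAYOUT = [
    ("designation", 0, 7),
    ("abs_magnitude", 8, 13),
    ("slope", 14, 19),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped string fields."""
    return {name: line[start:end].strip() for name, start, end in layout}


def to_tsv(record, layout=LAYOUT):
    """Join the fields in layout order into one tab-delimited row."""
    return "\t".join(record[name] for name, _, _ in layout)


# Build a made-up sample line by padding each value to its column width,
# with a single space between columns.
sample = "00001".ljust(7) + " " + "3.34".rjust(5) + " " + "0.12".rjust(5)

record = parse_fixed_width(sample)
print(record)
print(to_tsv(record))
```

The values stay strings here on purpose. Casting them to numbers can be left to the table definition on the Hive side.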

Intro

Currently I’m studying for the Hortonworks Data Platform Certified Developer: Spark using Python exam (or HDPCD: Spark using Python). One part of the exam objectives is using SQL in Spark. Along the way you also work with Hive, the data warehouse software in Hadoop.

I was following the free Udemy HDPCD Spark using Python preparation course by ITVersity. The course is good BTW, especially for the price :). But after playing along with the Core Spark videos, the course again used the same boring revenue data for the Spark SQL part. And I thought: “I know SQL pretty well. Why not use data that is a bit more interesting?” And so I downloaded the Minor Planet Center’s asteroid data. This contains all the known asteroids until at least yesterday. At this moment, that is about 745,000 lines of data.