Playing with asteroids data in MongoDB

If there is one thing I learned when becoming a data engineer, it’s that having just Hadoop expertise is probably not enough. For starters: what it means to be a data engineer is not exactly sharply defined. Some say data engineers are (Java) developers. Some place data engineers more at the operations side. And at some organisations data engineers work with any combination of these products: Hadoop, ElasticSearch, MongoDB, Cassandra, relational databases and even less hip products.

So I thought it would be a good idea to broaden my horizons. One product that is used quite often, is MongoDB. MongoDB is a NoSQL database. And if you don’t exactly know what that means, I think you will get the idea after viewing this video I made.

To sum things up: there are free courses on MongoDB University. I just finished the M001: MongoDB Basics course. It’s a good place to start. It won’t take much of your time: about 3 hours per week for 3 weeks.

Jargon

MongoDB has a different jargon, but I think you could compare it like this with the jargon of relational databases:

[table “2” not found /]

 

Differences

MongoDB is different from relational databases in that you don’t have to fill in the data for all the fields. In the movie details database some movies have a plot description, a lot of them don’t. Now you might say that in relational databases you enter a NULL value for a column. What’s the big deal? The deal is that in MongoDB you don’t have to mention a long list of fields that will have NULL values. MongoDB will figure it out.

Documents can also be nested in documents. A movie can have a document called “cast”, which has several documents within. In a relational database you would create a “cast” table and join the movie details with that.

Undoubtly there are other differences, but this is enough for now.

 

Javascript instead of SQL

MongoDB is based on JavaScript. Selecting data is based on that. Instead of using SQL to select data, you use notation like this in MongoDB Compass, the graphical interface:

{genres: "Comedy"}

On the command prompt, in a collection called movieDetails, it would look like this:

db.movieDetails.find({genres: "Comedy"})

This would be how you would get that data in SQL:

select * from movieDetails where genre='Comedy'

The MongoDB Basics course will get more into the syntax.

 

About those asteroids

My idea was that if MongoDB is based on Javascript, it probably shouldn’t be too hard to put a JSON file in there. I already had created a MongoDB cluster in the cloud (MongoDB Atlas) for free for the course. Time to get back to the Minor Planet Center to get the latest version of the extended asteroids dataset, which is available in JSON. But first I tried a smaller set of 10,000 asteroids to make sure it worked.

You import data in MongoDB with a tool called mongoimport. I was struggling coming up with the right connect string. But then I found out that MongoDB Atlas comes with copy-pastable connect strings on the site. Just click on “Connect” at the overview of your cluster.

 

Encapsulating

Unfortunately I got “JSON decoder out of sync” errors. A bit of Googling and Stack Overflow learned that encapsulating the JSON data would probably solve that issue. At the start of the JSON file I would define an “uberdocument” if you will:

{
"asteroids":

At the end I closed it off with:

}

 

And now the import worked out well and I was able to view the data in MongoDB as seen in the video.

mongoimport --host Sandbox-shard-0/sandbox-shard-00-00-tkepr.mongodb.net:27017,sandbox-shard-00-01-tkepr.mongodb.net:27017, \
sandbox-shard-00-02-tkepr.mongodb.net:27017 --ssl --username m001-student --password toosimple --authenticationDatabase admin \
--db asteroids --collection asteroids --type json --file "D:\Stuur\mpcorb_10kasteroids_encaps.json"
2018-02-09T16:08:45.624+0100 connected to: Sandbox-shard-0/sandbox-shard-00-00-tkepr.mongodb.net:27017,sandbox-shard-00-01-tkepr.mongodb.net:27017,sandbox-shard-00-02-tkepr.mongodb.net:27017
2018-02-09T16:08:47.819+0100 [########################] asteroids.asteroids 6.86MB/6.86MB (100.0%)
2018-02-09T16:08:50.821+0100 [########################] asteroids.asteroids 6.86MB/6.86MB (100.0%)
2018-02-09T16:08:52.705+0100 [########################] asteroids.asteroids 6.86MB/6.86MB (100.0%)
2018-02-09T16:08:52.705+0100 imported 1 document

 

755,000+ asteroids

I tried importing the full set of asteroids (also encapsulated) the same way, but ran into problems.

mongoimport --host Sandbox-shard-0/sandbox-shard-00-00-tkepr.mongodb.net:27017,sandbox-shard-00-01-tkepr.mongodb.net:27017, \
sandbox-shard-00-02-tkepr.mongodb.net:27017 --ssl --username m001-student --password toosimple --authenticationDatabase admin \
--db asteroidse --collection asteroidse --type json --file "D:\Stuur\mpcorb_extended_20180209_encaps.json"

2018-02-09T18:53:50.289+0100 connected to: Sandbox-shard-0/sandbox-shard-00-00-tkepr.mongodb.net:27017,sandbox-shard-00-01-tkepr.mongodb.net:27017,sandbox-shard-00-02-tkepr.mongodb.net:27017
2018-02-09T18:53:52.475+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:53:55.474+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:53:58.474+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:54:01.474+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:54:04.510+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:54:07.503+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:54:10.474+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:54:13.474+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:54:14.132+0100 num failures: 1
2018-02-09T18:54:14.132+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-09T18:54:14.134+0100 Failed: lost connection to server
2018-02-09T18:54:14.135+0100 imported 0 documents

I was worried I reached the limits of the free MongoDB cluster, but I was wrong. A bit of Googling learned that it would work better with an un-encapsulated JSON file and the option –json-array:

mongoimport --host Sandbox-shard-0/sandbox-shard-00-00-tkepr.mongodb.net:27017,sandbox-shard-00-01-tkepr.mongodb.net:27017, \
sandbox-shard-00-02-tkepr.mongodb.net:27017 --ssl --username m001-student --authenticationDatabase admin --db asteroids \
--collection asteroids --type json --file "D:\Stuur\mpcorb_extended_20180209.json" --jsonArray
[..]
2018-02-14T09:41:30.014+0100 [#######################.] asteroids.asteroids 429MB/441MB (97.3%)
2018-02-14T09:41:33.014+0100 [#######################.] asteroids.asteroids 434MB/441MB (98.3%)
2018-02-14T09:41:36.014+0100 [#######################.] asteroids.asteroids 438MB/441MB (99.3%)
2018-02-14T09:41:38.298+0100 [########################] asteroids.asteroids 441MB/441MB (100.0%)
2018-02-14T09:41:38.298+0100 imported 755189 documents

And there you have it. I must say it was fun to work with this product. This was the most optimized way of learning new technology I’ve seen until now.

This entry was posted in NoSQL and tagged , , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *