The Atlas REST API – working examples

Originally I was writing a blog post about my experiences with Apache Atlas (it is still in the works), in which I wanted to refer to a Hortonworks Community post I wrote with working examples of Atlas REST API calls. But since the Hortonworks Community migrated to the Cloudera Community, that article seems to have been lost. The original URL takes you to the Cloudera Community, but not to the article. The search engine comes up with nothing, and I can’t find it via my profile either.

It wasn’t particularly easy to gain all this knowledge, so of course I kept a backup of all successful commands and their output. And here it is. This was all tested on HDP 2.6.5.
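
For readers who have never called the Atlas REST API, here is a minimal sketch of what such a call might look like from Python. The host name, port 21000 and the admin/admin credentials are assumptions for a sandbox cluster and do not come from the lost article. The request is only built here, not sent, so it can be inspected without a running Atlas server.

```python
import base64
import urllib.parse
import urllib.request

# Assumptions: host, port and credentials are placeholders for a sandbox cluster.
ATLAS_URL = "http://atlas-host.example.com:21000"
USER, PASSWORD = "admin", "admin"


def build_basic_search(type_name, query, limit=10):
    """Build (but do not send) a GET request for the Atlas v2 basic search."""
    params = urllib.parse.urlencode(
        {"typeName": type_name, "query": query, "limit": limit}
    )
    url = f"{ATLAS_URL}/api/atlas/v2/search/basic?{params}"
    # HTTP basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


request = build_basic_search("hive_table", "customer")
print(request.full_url)
# To run it against a live Atlas server:
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode())
```

Building the request separately from sending it makes it easy to check the URL and headers first, which helps when a call keeps coming back with an error.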

Posted in Apache Atlas

Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a patron of the Roaring Elephant podcast, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

Posted in Data engineering, Spark

Neo4J: Loading rocket data in a graph database

When I first learned about graph databases, like Neo4J, I didn’t get it. That’s how I always start with new technology: not getting at all why people are so enthusiastic about it. Then I read “Seven Databases in Seven Weeks, 2nd edition” (as reviewed in January). It describes Neo4J as “whiteboard friendly”. Any diagram with boxes and lines you could draw on a whiteboard can be stored in Neo4J. After reading the first paragraphs about Neo4J, I totally got why graph databases are very interesting.

As usual, I started following a course on Neo4J to get acquainted with the product. Well, there was a course on Neo4J on Udemy.com and I followed it. But it was 3 years old and some of the code it taught is already obsolete. So the less I say about that, the better. But I did learn some Cypher, the language used in Neo4J, and I later learned the more modern versions of the commands to get stuff done.

Next phase: do a project with it. It took me some time to think of interesting astronomy or space related data to put in it. Eventually I stumbled on a dataset that has been around and maintained for a long time, at least since I discovered it when I first got on the Internet around 1993. It’s called Jonathan’s Space Page now. And the maintainer, Jonathan McDowell, still keeps the list of orbital space launches and the list of satellites ever launched quite up to date.

And eventually I did manage to load this data in Neo4J and here is my video about that:

 

You can download Neo4J Desktop here:

Neo4j Download Center

You can find my Python and Cypher code here:

https://github.com/Marcel-Jan/neo4j_satellites

Let me know if you want me to go in depth on the code used in this video.
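
This is not the code from the repository above, but a minimal sketch of how launch records might be turned into parameterized Cypher statements from Python. The node labels, the LAUNCHED relationship, the field names and the two sample rows are all made up for illustration. The real dataset has many more columns.

```python
# Hypothetical sample rows; names and dates are placeholders, not real launch data.
launches = [
    {"vehicle": "Rocket A", "satellite": "Satellite 1", "date": "2000-01-01"},
    {"vehicle": "Rocket A", "satellite": "Satellite 2", "date": "2000-06-15"},
]

# MERGE creates a node or relationship only if it does not exist yet,
# so loading the same row twice does not produce duplicates.
# The $name placeholders are Cypher query parameters.
MERGE_LAUNCH = (
    "MERGE (v:LaunchVehicle {name: $vehicle}) "
    "MERGE (s:Satellite {name: $satellite}) "
    "MERGE (v)-[:LAUNCHED {date: $date}]->(s)"
)


def to_statements(rows):
    """Pair the Cypher template with one parameter dict per row."""
    return [(MERGE_LAUNCH, dict(row)) for row in rows]


statements = to_statements(launches)
for query, params in statements:
    print(query, params)
# With the official neo4j Python driver, each pair could be passed to
# session.run(query, params) inside an open session.
```

Using query parameters instead of pasting values into the Cypher text avoids quoting problems with names that contain apostrophes.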

Posted in Active Learning, Howto, NoSQL

Starting at DIKW May 1st 2019

As of May 1st, 2019, I’ll be working at a new company: DIKW in Nieuwegein. DIKW stands for: Data, Information, Knowledge, Wisdom (it works in Dutch too). I will be working as a data engineer on a consultancy basis.

I’ve already met many colleagues at DIKW and they are all very experienced in BI, data science and data engineering. It’s fun to converse with them, which actually was part of the application process.

I’m going to miss the view from the office I had, but in the new job I’ll basically manage myself. And let’s face it: no one told me to do 8 courses and become a Certified Kubernetes Administrator last year. No one made a year plan. Development-wise I’m perfectly capable of finding my own way.

Posted in Uncategorized

Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step into a data-driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you rework the data a little. But some Excel sheets simply need too much reworking, and that’s too big a job for NiFi. So I used Python and the pandas library to create a CSV file that Hadoop can handle.
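
To give an idea of the kind of reworking involved, here is a minimal sketch with pandas. It is not the code from the full post: the sheet layout (an empty row on top, an empty column on the side) and the column names are assumptions. A small in-memory DataFrame stands in for the sheet so the example runs without an Excel file.

```python
import pandas as pd


def tidy_sheet(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean a sheet that was read without a header row."""
    # Drop rows and columns that are completely empty.
    df = raw.dropna(how="all").dropna(axis=1, how="all")
    # Promote the first remaining row to lower-case, underscore-separated names.
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.iloc[0]]
    # Remove that header row from the data and renumber the rows.
    return df.iloc[1:].reset_index(drop=True)


# Stand-in for: raw = pd.read_excel("sheet.xlsx", header=None)
# (hypothetical file name; reading .xlsx needs an engine such as openpyxl).
raw = pd.DataFrame(
    [
        [None, None, None],
        ["Customer Name", "Order Total", None],
        ["Alice", 10.5, None],
        ["Bob", 7, None],
    ]
)

tidy = tidy_sheet(raw)
csv_text = tidy.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```

Column names without spaces or capitals save trouble later, for example when a Hive table is defined on top of the CSV file.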

Posted in Howto, Python

Book review: Seven Databases in Seven Weeks

There are so many data-related open source products nowadays. On the one hand that’s great. On the other hand it’s hard for one human to grasp them all. To be sure, there’s great documentation on them all. And there are books and sessions at meetups and conferences that tell you in depth what they are about. But sometimes you just want to get the gist of it: to quickly learn about a lot of products, so you can pick the ones you find useful. It’s rare to get that kind of overview of multiple products.

But “Seven Databases in Seven Weeks, Second Edition” by Luc Perkins, Jim Wilson and Eric Redmond does exactly that. It describes a selection of seven different types of databases, their strengths and weaknesses.

The seven databases

The seven databases are PostgreSQL, HBase, MongoDB, CouchDB, Neo4J, DynamoDB and Redis. It’s a good mix of databases. They all have their different uses.

Posted in Active Learning

R Studio: Doing data science on my health data – Part 1

Up to seven years ago my doctor would nag me every six months that I should lose some weight. Nagging didn’t work on me that much. What did work, however, was competition. I wanted to become faster in a bike race (actually, the cycling part of a relay triathlon). When I realized that I could lose weight myself instead of buying an expensive new bike, I set out to do the former and I lost about 10 kilograms in the first year. I won’t go into too many details, because I have probably talked about it too often. I’m a bit too proud of this.

This summer I followed an R Studio course on Udemy and when I finished it, I was thinking of doing a project with R Studio. I did gather a lot of health data in the last 6 years: weight, body fat, data from my heart rate monitor, a step count from my iPhone and sleep data. All this was stored in an Excel sheet. It’s not exactly big data, but it has a couple of thousand rows now. Surely there’s a way to get some meaning out of it. I’ve already done a video about this, but it isn’t easy to copy commands from videos. And I’ve tried out some other stuff in this series.

Posted in Learning Big Data, Weird experiments

Check your /tmp on HDFS

If you have sensitive data on your Hadoop cluster, you might want to check /tmp on HDFS once in a while to see what ends up there. /tmp is used by several components. Hive, for example, stores its “scratch data” there. But fortunately it does so in subdirectories with permissions only for the user that ran the job. The files in there are not readable by anyone else.

But some people think /tmp is a good place to store intermediate data of their homegrown processes. Even when you clean up afterwards, this is not a good idea when dealing with sensitive data, unless you set the permissions in such a way that the data is not readable by anyone else. But this is often forgotten. And when such a process fails, this data usually stays in /tmp for a long time.
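
One way to do such a check is to scan the output of `hdfs dfs -ls -R /tmp` for entries that are readable by everyone. The sketch below is a minimal version of that idea and not the method from the full post. The sample listing, with its user names and paths, is made up for illustration.

```python
def world_readable(ls_output):
    """Return paths from `hdfs dfs -ls -R` output that 'other' users can read."""
    flagged = []
    for line in ls_output.splitlines():
        # Fields: permissions, replication, owner, group, size, date, time, path.
        parts = line.split(None, 7)
        # Skip anything that is not a listing line, such as "Found 3 items".
        if len(parts) != 8:
            continue
        permissions, path = parts[0], parts[7]
        # Indexes 7-9 of the permission string belong to "other";
        # an "r" at index 7 means everyone can read the entry.
        if len(permissions) >= 10 and permissions[7] == "r":
            flagged.append(path)
    return flagged


# Hypothetical listing; on a cluster it could come from
# subprocess.run(["hdfs", "dfs", "-ls", "-R", "/tmp"], capture_output=True, text=True).stdout
SAMPLE = """\
Found 3 items
drwx------   - alice hdfs          0 2018-10-01 09:15 /tmp/hive/alice
-rw-r--r--   3 bob   hdfs      52431 2018-10-01 09:20 /tmp/export_customers.csv
drwxrwxrwx   - etl   hdfs          0 2018-10-02 11:02 /tmp/staging
"""

flagged = world_readable(SAMPLE)
for path in flagged:
    print(path)
```

Note that this only looks at the basic permission bits. If HDFS ACLs are in use, they would need a separate check.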

Posted in Learning Big Data

Tech dossier: Elasticsearch / the ELK stack

Because tech is moving so fast, I’ve been keeping dossiers in Evernote on open source products I have to learn more about, and I’ve decided to put them on my blog. My last one was about Kubernetes. This one is about Elasticsearch, the core of what is known as the ELK stack (Elasticsearch, Logstash and Kibana).

A short description – in English

Elasticsearch is a search engine. You can use it to quickly find keywords in a large collection of documents. But the people who came up with it must have realized at some point that you can use the same technology to search much more than text. You can use it to search any data.

Posted in Tech dossier