Expedition Data

Book review: Spark in Action, 2nd edition

Posted on September 1, 2019 by Marcel-Jan Krijgsman

There are lots of books on Spark, but not many are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a Patreon supporter of the Roaring Elephant podcast, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) title, which means the author is still writing parts. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

Continue reading →

Posted in Data engineering, Spark | Tagged Apache Spark, Jean-Georges Perrin, Roaring Elephant podcast, Spark

Neo4J: Loading rocket data in a graph database

Posted on June 8, 2019 by Marcel-Jan Krijgsman

When I first learned about graph databases, like Neo4J, I didn’t get it. That’s how I always start with new technology: not getting at all why people are so enthusiastic about it. Then I read “Seven Databases in Seven Weeks, 2nd edition” (as reviewed in January). It describes Neo4J as “whiteboard friendly”: any diagram with boxes and lines that you could draw on a whiteboard can be stored in Neo4J. After reading the first paragraphs about Neo4J, I totally got why graph databases are very interesting.

As usual, I started by following a course to get acquainted with the product. There was a course on Neo4J on Udemy.com and I followed it. But it was 3 years old and some of the code it taught is already obsolete, so the less I say about that, the better. I did learn some Cypher, the query language used in Neo4J, and I later learned the more modern versions of the commands to get stuff done.

Next phase: do a project with it. It took me some time to think of interesting astronomy or space related data to load into it. Eventually I stumbled on a dataset that has been around and maintained for a long time, at least since I discovered it when I first got on the Internet around 1993. It’s called Jonathan’s Space Page now, and the maintainer, Jonathan McDowell, still keeps the list of orbital space launches and the list of satellites ever launched quite up to date.

And eventually I did manage to load this data in Neo4J and here is my video about that:

You can download Neo4J Desktop here:

https://neo4j.com/download-center/#desktop

You can find my Python and Cypher code here:

https://github.com/Marcel-Jan/neo4j_satellites

Let me know if you want me to go in depth on the code used in this video.

Posted in Active Learning, Howto, NoSQL | Tagged Cypher, graph database, Jonathan's Space Page, Neo4J, Seven Databases in Seven Weeks

Starting at DIKW May 1st 2019

Posted on April 25, 2019 by Marcel-Jan Krijgsman

As of May 1st 2019 I’ll be working at a new company: DIKW in Nieuwegein. DIKW stands for: Data, Information, Knowledge, Wisdom (it works in Dutch too). I will be working as a data engineer on a consultancy basis.

I’ve already met many colleagues at DIKW and they are all very experienced in BI, data science and data engineering. It’s fun to converse with them, which actually was part of the application process.

I’m going to miss the view from the office I had, but in the new job I’ll basically manage myself. And let’s face it: no one told me to do 8 courses and become a Certified Kubernetes Administrator last year. No one made a year plan. Development-wise I’m perfectly capable of finding my own way.

Posted in Uncategorized | Tagged consultancy, courses, DIKW, Nieuwegein

Showing a complex Excel sheet who’s boss with Python and pandas

Posted on March 8, 2019 by Marcel-Jan Krijgsman

Data engineering isn’t always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step in a data driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you rework the data a little. But some Excel sheets simply need too much reworking, and that’s too big a job for NiFi. I’ve used Python and the pandas library to create a CSV file that Hadoop can handle.
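A minimal sketch of that kind of reshaping, not the code from the full post: it assumes a sheet with a two-row header (the `year` and `kind` levels, the station names and all numbers are made up for illustration). In practice the frame would probably come from `pandas.read_excel` with `header=[0, 1]`; here it is built in memory so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical wide sheet: one row per station, a two-row header (year, kind).
columns = pd.MultiIndex.from_product(
    [["2018", "2019"], ["planned", "actual"]], names=["year", "kind"]
)
wide = pd.DataFrame(
    [[10, 12, 20, 19], [30, 28, 40, 41]],
    index=pd.Index(["Alpha", "Beta"], name="station"),
    columns=columns,
)

# Move the "year" header level into the rows, then flatten the index,
# so every row becomes a plain record with station, year, planned and actual.
tidy = wide.stack(level="year").reset_index()

# Write a flat CSV with a single header line (to a string here, not a file).
buffer = io.StringIO()
tidy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The point of the `stack` step is that a multi-row header has no equivalent in a flat CSV, so one header level has to become a regular column before the file is useful downstream.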

Continue reading →

Posted in Howto, Python | Tagged Excel, header, multiindex, pandas, Python, space fueling stations, stack, unstack

“Kubernetes” according to Youtube’s closed captions

Posted on October 12, 2018 by Marcel-Jan Krijgsman

Okay, this is a bit immature and you’ll learn exactly nothing from this, but I could not resist. I’m following the “Kubernetes Course from a DevOps guru” course on Udemy.com. The videos on Udemy are simply Youtube videos. Just like regular Youtube, you can turn on closed captions.

If the publisher of a Youtube video did not enter typed captions, Youtube will use machine learning algorithms to automatically create them instead. This works well when people speak fluent and accentless English (and maybe other languages as well). But results get a bit off even if someone has only a slight Scottish accent, like one of my Youtube favorites and Kerbal Space Program pilot, Scott Manley.

Continue reading →

Finding if exercising works with RStudio

Posted on September 20, 2018 by Marcel-Jan Krijgsman

Does exercising make me lose weight or body fat? I’ve gathered 6 years of health data (on myself) and tried using RStudio to tease out if exercise works. Answer: probably, maybe.

Notes on my “Becoming a Hadoop Specialist” session

Posted on May 29, 2018 by Marcel-Jan Krijgsman

Today I talked about how I became a Hadoop specialist/data engineer at the ITNEXT Data Engineering & DevOps meetup.

Here are a couple of links that were or were not in my presentation:

The (what I call) “hype-o-meter” site from YCombinator: https://news.ycombinator.com/

Sites with courses:

Coursera (fixed-date courses): Coursera.org

Udacity (self-paced courses): Udacity.org

Udemy (non-MOOC course site with crazy discounts): udemy.com

MOOC search engine: class-central.com

MongoDB University (free as long as it’s MongoDB 🙂 ): university.mongodb.com

Continue reading →

Making a Hertzsprung-Russell diagram from Gaia DR2 data with Elasticsearch

Posted on May 13, 2018 by Marcel-Jan Krijgsman

Elasticsearch was one of the open source products on my list to try out, ever since I got rejected for a couple of assignments as a consultant last year. Apparently it’s a popular product. But why do you need a search engine in a Big Data architecture? This I explain in my new video where I load newly released data from ESA’s Gaia mission in Elasticsearch with Logstash and visualize it with Kibana.

I’ve also created an extra video wherein I explain how the code works.

You can get the code I’ve used on my GitHub page.

Posted in NoSQL | Tagged 2001 A Space Odyssey, ElasticSearch, Frank Kane, Gaia, Hertzsprung-Russell diagram, Kibana, Logstash, Udemy, Vega

Codemotion Amsterdam 2018, day two

Posted on May 10, 2018 by Marcel-Jan Krijgsman

Back on the ferry to the north of Amsterdam I went, back for day two of Codemotion Amsterdam 2018.

Keynote

Daniel Gebler from PicNic told us about what they are doing today to bring groceries home for people. I’ve seen two presentations by PicNic before and I could really see their progress from session to session.

Daniel explained how they use a recommender system to make it possible for customers to buy their most common groceries with one tap in the PicNic app. That is actually hard. Even if you get 90% precision on your prediction for a single item, the precision for a whole set of items drops off quickly: treating the items as independent, 0.9^12 is about 28% for 12 items, and it takes only 20 items to fall to about 12%. So they really had to work to get a much better precision per item. They managed to do that by working with two dimensions of data: big and deep data. Continue reading →
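The compounding effect in that keynote example can be checked with a few lines of Python. This is only a back-of-the-envelope sketch that assumes every item prediction is independent, which the talk did not necessarily claim; the function names are made up for illustration. Under that assumption, 90% per item works out to roughly 28% for a set of 12 items and roughly 12% for a set of 20.

```python
def basket_precision(per_item: float, n_items: int) -> float:
    """Chance that all n_items predictions are right, assuming independence."""
    return per_item ** n_items


def required_per_item(target: float, n_items: int) -> float:
    """Per-item precision needed to reach `target` for the whole set."""
    return target ** (1 / n_items)


# How quickly a 90% per-item precision erodes as the set grows.
for n in (1, 12, 20):
    print(f"{n:>2} items at 90% each -> {basket_precision(0.9, n):.1%}")

# What each item would need for a 12-item set to be right 90% of the time.
print(f"needed per item: {required_per_item(0.9, 12):.2%}")
```

Read the other way around, the last line shows why the per-item precision had to get so much better: to get a 12-item set right 90% of the time, each item needs to be right more than 99% of the time.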

Posted in Conferences, Events | Tagged Active learning, Big Data, Brain-Computer Interface, Codemotion, DDOS, Elastic, gravity of data, holacracy, objectives & key results, PicNic, recommender system, Springest, SQL Injection

Codemotion Amsterdam 2018, day one

Posted on May 8, 2018 by Marcel-Jan Krijgsman

Last Friday I almost felt I had to explain to a colleague that I don’t always win raffles and lotteries. Because yep, I won another ticket. Again via the Roaring Elephant podcast. It’s pretty worthwhile listening to them, is all I’m saying.

This was a ticket for CodeMotion Amsterdam 2018. CodeMotion is a conference for developers with topics like the blockchain, Big Data, Internet of Things, DevOps, software architectures, but also front-end development, game development and AR/VR.

Amsterdam from the ferry to the north of the city.

Continue reading →

Posted in Conferences, Events | Tagged Blender, DevOps, Internet of Things, Kafka, Kafka Streams, Kubernetes, open source, RegEx
