Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a Patreon supporter of the Roaring Elephant podcast, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems to be pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

Java

The one disadvantage of the book for me personally is that it’s mainly coded in Java. I know very little Java, though I can read it a bit. And learning Java is not high on my already long list of things to learn at this moment. I knew beforehand that this would be a slight challenge. Reading the book, however, I think I could write a lot of the code in Python. I might do this later as a learning exercise.

File formats, databases and streaming

Nevertheless, I was able to follow the idea behind the book quite well. The main advantage of the book is that it is based on real-life examples from a data engineer’s work. It describes all kinds of ways that you might need to ingest data: different kinds of files (CSV, JSON, XML, ORC), databases (relational and Elasticsearch), APIs, and there’s a chapter on streaming. There are also a number of chapters on transformation and aggregation. If you know your SQL, a few exercises will not look that unfamiliar to you.

All in all this book was a very worthwhile read, despite my lack of Java knowledge. There’s a lot in here that data engineers starting with Spark will find useful.

About Marcel-Jan Krijgsman

In 2017 I made the leap to Big Data after 20 years of experience with Oracle databases. I followed courses on Hadoop, Big Data Analytics, Machine Learning and Python, MongoDB and Elasticsearch.

2 Responses to Book review: Spark in Action, 2nd edition

  1. Thanks for the review, Marcel-Jan! I took the bet of Java as it is very popular in many enterprises and among a lot of data engineers I worked with. When the book is published (soon), the examples in the book will still be in Java, but the repo will contain Python AND Scala! This will, hopefully, help you…

    And thanks for the proofreading, that’s super useful!

  2. Marcel-Jan Krijgsman says:

    That’s great to hear. That would help me.
