There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.
On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a Spark book centered on data engineering. So I decided to buy the ebook (also because, as a patron of the Roaring Elephant podcast, I have a discount code at Manning Publications).
Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems to be well advanced. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.
The one disadvantage of the book for me personally is that its examples are mainly written in Java. I know very little Java, though I can read it a bit. And learning Java is not high on my already long list of things to learn at this moment. I knew beforehand that this would be a slight challenge. Reading the book, however, I think I could write a lot of the code in Python. I might do this later as a learning exercise.
File formats, databases and streaming
Nevertheless, I was able to follow the idea behind the book quite well. The main advantage of the book is that it is based on real-life examples from a data engineer’s work. It describes all kinds of ways that you might need to ingest data: different kinds of files (CSV, JSON, XML, ORC), databases (relational and Elasticsearch), APIs, and there’s a chapter on streaming. There are also a number of chapters on transformation and aggregation. If you know your SQL, a few of the exercises will look familiar to you.
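To give an idea of what this kind of ingestion and aggregation might look like in Python rather than Java, here is a minimal PySpark sketch of my own. It is not taken from the book: the helper names `reader_format` and `ingest_and_aggregate`, the application name and the local master setting are all made up for illustration, and it only covers the file formats for which I assume a built-in Spark reader (CSV, JSON, ORC). XML is left out to keep the sketch small.

```python
from pathlib import Path

# Assumed mapping from file extension to Spark's built-in reader format names.
READER_FORMATS = {".csv": "csv", ".json": "json", ".orc": "orc"}


def reader_format(path):
    """Return the Spark reader format name for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READER_FORMATS:
        raise ValueError(f"no reader format assumed for extension {suffix!r}")
    return READER_FORMATS[suffix]


def ingest_and_aggregate(path, group_col):
    """Read a file into a DataFrame and count the rows per value of group_col.

    Hypothetical helper: requires pyspark to be installed.
    """
    # Imported here so that reader_format() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("ingest-sketch")   # illustrative name
        .master("local[*]")         # assumes a local run, not a cluster
        .getOrCreate()
    )

    fmt = reader_format(path)
    reader = spark.read.format(fmt)
    if fmt == "csv":
        # Treat the first line of a CSV file as column names.
        reader = reader.option("header", "true")
    df = reader.load(path)

    # Roughly: SELECT group_col, COUNT(*) AS count FROM df GROUP BY group_col
    return df.groupBy(group_col).count()
```

With Spark installed, something like `ingest_and_aggregate("books.csv", "author").show()` would print the row count per author, assuming a file `books.csv` with an `author` column exists. The group-by at the end is the part that should feel familiar if you come from SQL.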
All in all this book was a very worthwhile read, despite my lack of Java knowledge. There’s a lot in here that data engineers starting with Spark will find useful.