How to learn Big Data

“How did you get into Big Data?” is a question people have asked me a couple of times now. So let me answer it in a blog post as well.

I’ve used eight sources of Big Data related knowledge and skills:

  • Massive Open Online Courses (MOOCs)
  • Books
  • Meetups and summits
  • Podcasts
  • Videos
  • Online documentation
  • Hands-on experience
  • Learning sites/“universities” of vendors

Massive Open Online Courses

MOOCs are a cost-effective way to learn new skills. You can learn many skills for free, as long as you don’t need the certificate. The time it takes to complete a course varies. Some courses took me only 2-3 hours per week for 5 weeks; others took 8-10 hours per week for 11 weeks.

Good sites are Coursera.org, Udacity.com and edX.org. In the Big Data field there’s also Big Data University. If you do want the certificate (you can easily add it to your LinkedIn profile, for example), the only price I know is the one on Coursera.org: about 80 euros per course.

The quality of courses varies, which is why a site like class-central.com is useful: students rate the courses they have taken there. There’s also coursetalk.com, but I don’t see as many student ratings there.

Back when I started following Big Data-related courses, I hoped they would easily land me a data engineering job. That wasn’t exactly the case. But it does speak in your favor that you wanted to learn a new field and went for it yourself. Most MOOCs just give you a taste of the field you want to work in. If you like it, you can find out more in the sources listed below.

 

Books

Some books really helped my career, but I’m a slow reader. Or maybe I don’t take enough time to read. I bought Hadoop: The Definitive Guide (700+ pages), for example, and by the time I’m halfway through, I’m pretty sure Hadoop will already be out of fashion.

There are a lot of cheap ebooks out there. I don’t mean free downloads from pirate sites, but deals from the publishers themselves. For example, Packt Publishing offers a free ebook every day on their Free Learning site. Apress has a daily deal of one ebook for 10 euros. O’Reilly also has a daily deal, but it’s more like 50% off.

Pro tip: tell your employer you need access to book sites like Safari Books Online, Apress Open or Mapt, so you can read all the ebooks all the time and even get access to learning videos.

 

Meetups and summits

There are currently many meetups on data science and associated fields. They are usually held in the evening and they are free. In about half a year I’ve already been to four meetups on data science and data engineering here in the Netherlands (and one in Germany). The topics and required knowledge level vary wildly. On one occasion the meetup was more of a showcase of what data science can do, while another had several very theoretical lectures. But whatever your knowledge or skill level, it’s an excellent opportunity to network. When I decided I wanted to become a data engineer, it was good to ask people in the field whether data engineers were in demand (they are). Also, there is often pizza, so what’s not to like?

Now summits are a different affair, because some can be rather expensive for the individual (think 1000-2000 euros/dollars). But, as I found out last year, there was one that was free, the Big Data Expo in Utrecht, the Netherlands. I was a bit skeptical. I suspected many sales talks and little content. But actually it was great. I met my current employer there, and I was able to talk to many people in the field about what they expect of a data engineer.

And of course I was at the DataWorks Summit 2017 in Munich last week. This one is not free, but I won a ticket, thanks to Hortonworks via the Roaring Elephant podcast. A summit like this is a great opportunity to learn a lot about a variety of topics in a short time. The sessions also vary in skill level. The hardest thing about a summit like this is choosing which of the parallel sessions to go to. One great thing about the DataWorks Summit is that the presentations are online quite soon afterwards. Check their YouTube channel.

Another well-known Big Data-related summit is O’Reilly’s Strata Data Conference.

 

Podcasts

Podcasts are useful for finding out what the current trends are, although some might leave you with the feeling you’re way behind. Here are the podcasts I listen to:

  • The Roaring Elephant podcast. In this podcast the presenters Dave Russell and Jhon Masschelein discuss recent interesting articles, which is useful, and they had some very interesting interviews. And I’m not just referring to their interview with me 🙂
  • Drill to Detail podcast. This is a podcast by Mark Rittman about Business Intelligence (BI) and analytics. Lately there have been a lot of episodes that discussed specific products, not all of them open source.
  • O’Reilly Data Show podcast. Has Big Data related topics, mainly in the data science area. They tend to go a bit over my head.
  • Software Engineering Radio. Actually more related to IT in general, but regularly has Big Data-related topics that are usually well explained to a less experienced audience.
  • Hadooponomics. This podcast seems to have died: no new episodes have come out since January this year.

 

Videos

I haven’t fully investigated this one, but I’m pretty sure there’s a ton of videos that can help you with learning Hadoop and other Big Data topics. There’s even one that explains Hadoop (well, HDFS anyway) and Kafka (data streaming product) with Legos:

Ideal for managers 🙂

 

Online documentation

There is a lot of documentation on the web, much of it from vendors. My employer, Open Circle Solutions, is a partner of Hortonworks. They are (I think) the largest Hadoop provider. They are a 100% open source company and they contribute a lot to open source themselves. You can find all their documentation at docs.hortonworks.com.

I’m not sure if I should file this under courses after all, but I really liked the “learning the ropes” courses Hortonworks provides for different Apache products, like the one for Apache NiFi. They even have a sandbox virtual machine you can download. So what more could you possibly want?

Of course, other vendors like Cloudera and MapR have their own sandbox VMs.

 

Hands-on experience

Nothing beats actually working with the product you want to maintain. Currently I’m building a Hortonworks Data Platform on Amazon Web Services (AWS), a cloud platform. I don’t have to pay for the AWS cluster (6 cloud hosts, each with about 2 cores, 8 GB of memory and about 250 GB of disk space) myself, so I’m not sure how expensive this is. But what I do know is that when you shut down your environment, you pay a lot less. So this is something you could consider.

Learning to set up an AWS cluster is actually more work than you’d think. It’s the cloud, right? It should be done in a few clicks. Well, actually there are some choices to make when you build your Hadoop cluster in the cloud. I might come back to that topic later on.

 

Learning sites/“universities” of vendors

This is typically the sort of thing the average individual doesn’t go for, because access to these sites is rather expensive. I have some experience with Hortonworks University and my first thought was: “Where are the explanatory videos, like the ones in all the MOOCs I’ve followed?” There is a difference between these sites and MOOCs though. Sites like Hortonworks University cover all the topics in much more depth than any MOOC I’ve seen. This is where you really learn a product like Hadoop inside out. I’ve also heard there are upcoming improvements.

About Marcel-Jan Krijgsman

In 2017 I made the leap to Big Data after 20 years of experience with Oracle databases. I followed courses on Hadoop, Big Data Analytics, Machine Learning and Python, MongoDB and Elasticsearch.

2 Responses to How to learn Big Data

  1. Dali says:

    Thank you Marcel. I’m an aspiring Oracle Database Architect, formerly an Oracle DBA. One of my responsibilities is that of Capacity Planning and Performance forecasting, I was searching for info on how to effectively use the OEM repository views when I came across your old blog. I was pondering over the last few months of how to make the leap into Big Data and Data Engineering and was confused over how and where to start owing to the information overload that is until I came across your blog (Added to my Favorites). Its truly a God send :). I have yet to start but now I have your blog to guide me. Appreciate your valuable input.
    Thank you for taking the time to share your experience.

    • Marcel-Jan Krijgsman says:

      Thanks Dali! You are the target audience I had in mind: people looking out to the open source world. Glad my writings helped you.
      I made the switch to the Big Data / Open source world almost two years ago and I must say I’m enjoying it very much. So many helpful people. So much cool technology. But things move very fast. Big Data is almost considered mainstream now.
