Tech dossier: pandas

I keep tech dossiers in Evernote on open source products I want to keep track of, and I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on pandas, the Python data management library.

 

A short description – in English

Pandas is a Python library. If you already have Python 3 (version 2 support was recently dropped), it’s a matter of running “pip install pandas” and there you are. Pandas allows you to analyze and manipulate your data. But then again, aren’t there many more products for that? So how do I explain the power of pandas?

Let me put it like this: it is like using Excel, but on much larger datasets, and as if Excel had a command line interface. Imagine being able to say to Excel on a command line: “load my csv file”, “use this row as names for my columns”, “just show me columns date and sales”, “all right, now pivot that”. I just love it.
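To make those four “commands” a bit more concrete, here is a minimal sketch of what they could look like in pandas. The CSV contents and the column names (date, region, sales) are made up for illustration, and the CSV is read from a string so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical sales data standing in for "my csv file".
csv_text = """date,region,sales
2020-01-01,North,100
2020-01-01,South,80
2020-01-02,North,120
2020-01-02,South,90
"""

# "Load my csv file" and "use this row as names for my columns":
# header=0 tells pandas that the first row holds the column names.
df = pd.read_csv(io.StringIO(csv_text), header=0)

# "Just show me columns date and sales".
date_sales = df[["date", "sales"]]
print(date_sales)

# "All right, now pivot that": one row per date, one column per region.
pivoted = df.pivot_table(index="date", columns="region", values="sales", aggfunc="sum")
print(pivoted)
```

With a real file you would pass its path to pd.read_csv instead of the StringIO object.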

 

Learning pandas

For this I’ve used pythonprogramming.net. It’s free and it gave me an excellent start with data analysis in Python. The YouTube videos for pandas also seem to have been updated recently.

Need to learn Python first? I started learning Python with the Coursera course “An Introduction to Interactive Programming in Python (Part 1)” from Rice University. It’s a great course. But if you want a free course, you can’t go wrong with the pythonprogramming.net videos.

You can also watch a couple of my videos on my first encounters with pandas.

And recently I wrote a blog post on how I used pandas at work to flatten the data from a complex Excel sheet, so I could load it into Hadoop. I’ve used all kinds of lesser-known features to achieve that result.

 

Building your own environment

Want to play with pandas? That’s quite easy. You need to install Python 3 on your own computer and use “pip install pandas” (from the command line).

 

Getting pandas to do specific stuff

Selecting columns or rows with pandas (Because I keep forgetting after a while)

This article discusses two ways of selecting data with pandas, but it’s also handy as a reminder of how to select rows and columns. You can’t go wrong now.
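As a quick reminder of my own, here is a minimal sketch of the most common ways to select rows and columns. I’m assuming label-based selection with .loc and position-based selection with .iloc here, which may not be exactly the two ways the linked article covers. The data and index labels are made up.

```python
import pandas as pd

# Made-up example data with string labels as the index.
df = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
        "sales": [100, 120, 90],
        "region": ["North", "South", "North"],
    },
    index=["a", "b", "c"],
)

one_column = df["sales"]               # a single column, returned as a Series
two_columns = df[["date", "sales"]]    # a list of columns, returned as a DataFrame

by_label = df.loc["b", "sales"]        # row label "b", column "sales"
by_position = df.iloc[0, 1]            # first row, second column

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc["a":"b", ["date", "sales"]]
position_slice = df.iloc[0:2, 0:2]

# Rows that match a condition (a boolean mask).
filtered = df[df["sales"] > 95]
```

Note the difference in the two slices: "a":"b" includes row b, while 0:2 stops before position 2, so both return the same two rows.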

How to shift a column in pandas
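For reference, a minimal sketch of shifting a column with Series.shift, using made-up sales numbers. The new column names are just ones I picked for this example.

```python
import pandas as pd

# Made-up daily sales figures.
df = pd.DataFrame({"sales": [100, 120, 90, 150]})

# Shift down by one row: each row gets the value of the row before it.
# The first row has no predecessor, so it becomes NaN.
df["previous_sales"] = df["sales"].shift(1)

# Shift up by one row: each row gets the value of the row after it.
df["next_sales"] = df["sales"].shift(-1)

# A typical use: the change compared to the previous row.
df["change"] = df["sales"] - df["previous_sales"]
print(df)
```

Because of the NaN values, the shifted columns become floats even though the original column holds integers.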

How do multi-indexes in pandas work? Also in this video:
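Separately from the linked material, here is a minimal sketch of the basics of a multi-index: building one from two columns and selecting on one or both levels. The data and column names are made up.

```python
import pandas as pd

# Made-up sales per region and year.
df = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "year": [2019, 2020, 2019, 2020],
        "sales": [100, 120, 80, 90],
    }
)

# Two columns together become the (multi-level) index.
indexed = df.set_index(["region", "year"]).sort_index()

# Select on the first level only: all years for region North.
north = indexed.loc["North"]

# Select on both levels with a tuple: one specific value.
north_2020 = indexed.loc[("North", 2020), "sales"]

# Select on the second level only with a cross-section.
year_2020 = indexed.xs(2020, level="year")

# And back to ordinary columns again.
flat = indexed.reset_index()
```

Sorting the index after set_index is not strictly required for this small example, but pandas can warn about performance when slicing an unsorted multi-index.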

 

 

Other interesting stuff

Pandas tricks and features you might not know

Data visualization with pandas plot (How cool: you can add .plot to your dataframe)
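A minimal sketch of what adding .plot to a dataframe looks like. It assumes matplotlib is installed next to pandas (pandas uses it for plotting by default), and the data is made up. The Agg backend is selected so the example also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window needed

import pandas as pd

# Made-up daily sales with dates as the index.
df = pd.DataFrame(
    {"sales": [100, 120, 90, 150]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
)

# .plot() draws a line chart by default and returns a matplotlib Axes object.
ax = df.plot(title="Daily sales")
ax.set_ylabel("sales")

# To write the chart to disk you could call, for example:
# ax.figure.savefig("sales.png")
```

In a Jupyter notebook you can leave out the backend lines and the chart will show up below the cell.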

 

pandas and performance

pandas at extreme performance

 

About Marcel-Jan Krijgsman

In 2017 I made the leap to Big Data after 20 years of experience with Oracle databases. I followed courses on Hadoop, Big Data Analytics, Machine Learning and Python, MongoDB and Elasticsearch.