I keep tech dossiers in Evernote on open source products I want to follow, and I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on pandas, the Python data management library.
A short description – in English
Pandas is a Python library. If you already have Python 3 (version 2 support was recently dropped), it’s a matter of running “pip install pandas” and there you are. Pandas allows you to analyze and manipulate your data. But then again, aren’t there many more products for that? So how do I explain the power of pandas?
Let me put it like this: it is like using Excel, but on much larger datasets and with a command line interface. Imagine being able to say to Excel on a command line: “load my csv file”, “use this row as names for my columns”, “just show me columns date and sales”, “all right, now pivot that”. I just love it.
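To make those four "commands" concrete, here is a minimal sketch of what they could look like in pandas. The CSV content and the column names (date, region, sales) are made up for illustration, and the data is read from a string instead of a real file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "my csv file".
csv_text = """date,region,sales
2020-01-01,North,100
2020-01-01,South,80
2020-01-02,North,120
2020-01-02,South,90
"""

# "load my csv file" + "use this row as names for my columns":
# header=0 tells pandas that the first row holds the column names.
df = pd.read_csv(io.StringIO(csv_text), header=0)

# "just show me columns date and sales"
subset = df[["date", "sales"]]
print(subset)

# "all right, now pivot that": one row per date, one column per region.
pivoted = df.pivot_table(index="date", columns="region", values="sales", aggfunc="sum")
print(pivoted)
```

With a real file you would pass its path to pd.read_csv instead of the io.StringIO wrapper.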
To learn pandas I used pythonprogramming.net. It’s free and it gave me an excellent start with data analysis in Python. The YouTube videos for pandas also seem to have been updated recently.
Need to learn Python first? I started learning Python with the Coursera course “An Introduction to Interactive Programming in Python (Part 1)” from Rice University. It’s a great course. But if you want a free course, you can’t go wrong with the pythonprogramming.net videos.
You can also watch a couple of my videos on my first encounters with pandas.
And recently I wrote a blog post on how I used pandas at work to flatten the data from a complex Excel sheet, so I could load it into Hadoop. I used all kinds of lesser-known features to achieve that result.
Building your own environment
Want to play with pandas? That’s quite easy. You need to install Python 3 on your own computer and use “pip install pandas” (from the command line).
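Once the install has finished, a quick way to check that everything works is to import pandas and build a tiny dataframe. This is only a sanity check I would run myself; the column name "a" is arbitrary.

```python
import pandas as pd

# Print the installed version to confirm the import works.
print(pd.__version__)

# Build a tiny dataframe and do a trivial calculation on it.
df = pd.DataFrame({"a": [1, 2, 3]})
print(df["a"].sum())
```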
Getting pandas to do specific stuff
Selecting columns or rows with pandas (Because I keep forgetting after a while)
This article discusses two ways of selecting data with pandas, but it’s also handy as a reminder of how to select rows and columns. You can’t go wrong now.
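As my own reminder, here is a short sketch of the selection styles I tend to forget. I'm assuming the common trio of square brackets, .loc (by label) and .iloc (by position); the linked article may use a different split. The data is made up.

```python
import pandas as pd

# Hypothetical example data, indexed by date strings.
df = pd.DataFrame(
    {"sales": [100, 80, 120], "region": ["North", "South", "North"]},
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
)

# Columns: a single name gives a Series, a list of names gives a DataFrame.
one_column = df["sales"]
two_columns = df[["sales", "region"]]

# Rows and cells by label with .loc
row_by_label = df.loc["2020-01-02"]
cell_by_label = df.loc["2020-01-02", "sales"]

# Rows and cells by position with .iloc
first_row = df.iloc[0]
first_two_sales = df.iloc[0:2, 0]

# Rows by condition (boolean mask)
north_only = df[df["region"] == "North"]

print(cell_by_label)
print(north_only)
```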
How do multi-indexes in pandas work? (Also covered in a video.)
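For a quick feel of what a multi-index is, here is a minimal sketch with made-up data: two columns (region and year) together form the index, and you can select on one or both levels.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "year": [2019, 2020, 2019, 2020],
        "sales": [100, 120, 80, 90],
    }
)

# Turn two columns into a two-level index (a MultiIndex).
indexed = df.set_index(["region", "year"])

# Select on the first level only: all years for North.
north = indexed.loc["North"]

# Select on both levels with a tuple: one cell.
north_2020 = indexed.loc[("North", 2020), "sales"]

# Select on the second level only with xs (cross-section).
year_2019 = indexed.xs(2019, level="year")

# And back to ordinary columns again.
flat = indexed.reset_index()

print(indexed)
print(north_2020)
```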
Other interesting stuff
Data visualization with pandas plot (How cool: you can add .plot to your dataframe)
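Adding .plot really is about this short. A minimal sketch with made-up sales figures; it assumes matplotlib is installed, since pandas uses it for plotting by default, and it selects the non-interactive "Agg" backend so it also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumption: no display available

import pandas as pd

# Hypothetical daily sales figures.
df = pd.DataFrame(
    {"sales": [100, 120, 90, 150]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# .plot() draws a line chart by default and returns a matplotlib Axes object.
ax = df.plot(title="Daily sales")
ax.set_ylabel("sales")
```

From there you can save the chart with ax.figure.savefig and a file name of your choice, or drop the backend line and show it interactively.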
pandas and performance