Showing a complex Excel sheet who’s boss with Python and pandas

Data engineering isn’t always creating serverless APIs and ingressing terrabyte a minute streams with do-hickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step in a data driven world.

But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software has an Excel processor. It can even help you to work the data a little. But some Excel sheets simply need too much reworking. And that’s simply too big a job for NiFi. I’ve used Python and the pandas library to create a csv file that Hadoop can handle.

 

Files

You can download the example files for this on GitHub:

Pandas moon fueling github.xlsx

pandas_moon_fueling.py

The Excel sheet doesn’t have the exact amount of worksheets mentioned here, but the moon fuel stations sheet is identical.

 

In the left corner: the Excel sheet…

Clearly at my work we don’t maintain fueling stations on the Moon (yet), but the Excel sheet I encountered looked sort of like the layout of this one. Let’s take a look.

On the lefthand side you see three columns that define the fueling stations. The customers each have their own column for every week. There also is a summation of the data per week. Not seen here are the summations per year, per customer and totals per week. Every alternate row is either the amount of fuel and the money the customer paid.

I could save this as CSV file and store it in Hadoop, but it would give any data scientist a head ache. It’s more or less understandable for the human eye, but software won’t get this.

I think that this layout below would be much easier to use in analytics software:

But how do we accomplish that?

 

In the right corner: pandas!

Guess what? Python’s pandas library can read Excel. It even has a read_excel function. It’s almost as if other people got data delivered in Excel format..

Let’s start by importing pandas.

import pandas as pd

In my terminal window I have room enough to make my pandas output a little wider.

# Show wider tables. I've got the screen for it.
pd.set_option('display.width', 300)

Now open the Excel file and let’s show the sheets in it.

fuel_file = "Pandas moon fueling.xlsx"
xl = pd.ExcelFile(fuel_file)
print(xl.sheet_names)

Because we have fueling stations all over the solar system (as you do), this results in:

['Moon', 'Eros', 'Ceres', 'Mars', 'Europa', 'Ganymedes', 'Enceladus']

But in this case I just want to start by reading the Moon data. Now pandas has read_excel, but you can also use xl.parse. xl being where I stored my Excel data previously. It’s basically the same. I just found this on StackOverflow first 🙂 .

 

Headers and indexes

But in this case that simple example you’ll find first when googling “pandas read excel” is not enough. This has to do with headers and indexes. If you don’t pick your headers and indexes from the Excel sheet, pandas will pick something.

I’ve been reading everywhere to learn how to choose these. Because normally you have just a bunch of column names in a row. That would be your header row. Not so now. We have multiple levels of rows.

So this is how you read the Moon sheet and store it in a dataframe.

fuel_df = xl.parse("Moon", header=[1], index_col=[0,1,2])

I chose the row that has the weeknumbers: row 2 in the screenshot above. And counting from zero, that would be row 1. Therefor, header=[1].

The index is basically your row name. Database people might think of this one like a primary key. In this case I wanted the first three columns combined. Therefor, index_col=[0,1,2].

Let’s have a look at what we got now:

print(fuel_df.head())
                                                     Unnamed: 3   Unnamed: 4 Unnamed: 5 Week 01  
Location  Description                   Fueling port        NaN          NaN        NaN   Total  
                                        Fueling port     SpaceX  Blue Origin  Landspace     NaN  
COPS-001O Copernicus Station 1 Oxygen   C003              97916          NaN          0   97916  
                                        C003             133686          NaN     1012.5  134698  
COPS-001H Copernicus Station 1 Hydrogen C003             195040          NaN        NaN  195040
[..]

The first row with the “Unnamed: 3   Unnamed: 4  ” is now our header. You can see the weeknumber in there and that’s what I was going for.

You can see the complete header with this:

print(fuel_df.columns)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Week 01', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Week 02', 'Unnamed: 12', 'Unnamed: 13', 'Week 03', 'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Week 04'], dtype='object')

Clearly this is not sufficient. Let’s work on that.

 

Weeknumbers in the header

One weakness in the Excel sheet, is that weeknumbers are to the right of the data from that week. Uninformed software would assume the data of that week is in the columns right of it. What I want, is that every column of a certain weeknumber get that weeknumber in the header.

I start by storing the column names (the header) in a variable and work on that. I also store the row with the customer names. We’ll get to that later.

weeknumbers_list = list(fuel_df.columns)
spacecustomers = fuel_df.iloc[1]

My way of doing this might be more cumbersome than necessary, but it does the job.

start_position = 0
# Loop through values with "Week" in the name
for weeknr_pos in positions_weeknumbers:
    # Go from the start position of that week and name everything with that weeknumber
    # until the end of that week.
    for pos in range(start_position, weeknr_pos):
        # Cell gets weeknumber in the 
        weeknumbers_list[pos] = weeknumbers_list[weeknr_pos]
    # Determine the first column position of the next weeknumber.
    start_position = weeknr_pos + 1
    # The last column of weeknr has the totals. We'll remove that one.
    weeknumbers_list[weeknr_pos] = "Remove"

Last step: feed the weeknumber list to the dataframe header.

fuel_df.columns = weeknumbers_list

 

Remove the unnecessary stuff

Next, let’s remove the first two rows and let’s remove all columns with Total in the name. We already indicated them with the term “Remove”.

fuel_df = fuel_df.iloc[2:]
fuel_df.drop("Remove", axis=1, inplace=True)

So what do we got now?

                                             Week 01 Week 01 Week 01 Week 02 Week 02
COPS-001O Copernicus Station 1 Oxygen   C003   97916     NaN       0     NaN     NaN
                                        C003  133686     NaN  1012.5     NaN     NaN
COPS-001H Copernicus Station 1 Hydrogen C003  195040     NaN     NaN     NaN     NaN
                                        C003   90371     NaN     NaN     NaN     NaN
MRUM-003  Mons Rümker Base 2            B12      NaN     NaN     NaN     NaN     NaN
[..]

 

Multiple headers

We’re missing something, aren’t we? We’re missing the names of the customers. But I have prepared something for that. There’s a spacecustomers variable with all the names in them. But it also has some of those NaN values in them. So let’s clean that up.

spacecustomers_cleanlist = [x for x in spacecustomers if str(x) != 'nan']

Now I want to add the customer list to the header as an extra layer. Why? Because I want those columns not just identified by weeknumbers, but also by customer names.

I’ve been looking for SO long on how to do this. I’ve been trying and trying and this is what I came up with. Use this to your advantage.

column_array = []
column_array.append(fuel_df.columns)
column_array.append(spacecustomers_cleanlist)

column_array here is basically a list of lists. Maybe I shouldn’t have called it an array. Too confusing. Anyway, this is a way to feed two lists as two headers in the fuel_df dataframe.

fuel_df.columns = column_array

Now our dataframe looks like this

                                                     Week 01                        Week 02
                                                      SpaceX Blue Origin Landspace   SpaceX
COPS-001O Copernicus Station 1 Oxygen    C003          97916         NaN         0      NaN
                                         C003         133686         NaN    1012.5      NaN
COPS-001H Copernicus Station 1 Hydrogen  C003         195040         NaN       NaN      NaN
                                         C003          90371         NaN       NaN      NaN
MRUM-003  Mons Rümker Base 2             B12             NaN         NaN       NaN      NaN
                                         B12             NaN         NaN       NaN      NaN
[..]

I’m so happy that I managed to do this. I’ve been reading about pandas’ MultiIndex feature, and all kinds of crazy schemes and possible solutions. It would not work. Until I saw this video by Zax Rosenberg speaking at a Chicaco Python Users Group meeting about just MultiIndexes and related features. Then I understood what I needed to do.

 

How to tell what row is what?

There’s still the little matter of telling which row has amounts of fuel and which one has prices in our dataframe. For this we create a list and add this as a new column. I feel I could have done this part a little smarter, but this for now worked. I just tell that even rownumbers are amounts and uneven rownumbers are the prices.

# Add a column to define what row is the amount of fuel and what is the price
row_definer_column = []
for rowdef in range(2, sheet_depth):
    # print(rowdef)
    if rowdef % 2 == 0:
        # Even
        row_definer_column.append("Amount")
    elif rowdef % 2 == 1 and rowdef != 1:
        # Uneven
        row_definer_column.append("Price")

And this is how we add this list as the fourth column (loc=3). This only works if your list is as long as the rest of the data.

fuel_df.insert(loc=3, column='amount_or_price', value=row_definer_column)

We also add this to our already existing MultiIndex by appending it (thanks, Zax Rosenberg!).

fuel_df.set_index('amount_or_price', append=True, inplace=True)

The names of the other columns in the MultiIndex were missing, so let’s fix that now:

fuel_df.index.names = ['Location', 'Description', 'Fueling_port', 'amount_or_price']

Let’s take another look at our dataframe. Where are we now?

                                                                     Week 01                       Week 02
                                                                      SpaceX Blue Origin Landspace  SpaceX
Location  Description                   Fueling_port amount_or_price                                      
COPS-001O Copernicus Station 1 Oxygen   C003         Amount            97916         NaN         0     NaN
                                                     Price            133686         NaN    1012.5     NaN
COPS-001H Copernicus Station 1 Hydrogen C003         Amount           195040         NaN       NaN     NaN
                                                     Price             90371         NaN       NaN     NaN
MRUM-003  Mons Rümker Base 2            B12          Amount              NaN         NaN       NaN     NaN

Not shown in this are a couple of trailing rows with subtotals and totals. We don’t need that. It will mess up the results. We have analytics software for making all those wonderful reports with those totals and subtotals.

fuel_df = fuel_df.iloc[:-6]

 

And now: our magic stacking trick!

I hope you made it up to here, because this is where the beautiful part starts. Getting all that data in 1-dimensional rows is now so easy, you won’t believe it. We spend all that time defining headers and indexes. This is where it all will pay off.

First, let’s show how our MultiIndex looks now:

We’re going to start by saying that we want the customer data no longer as a header. We want to stack the data as part of the index.What that means, is that we get a new column with customer names and all the data corresponding to that pivots accordingly.

And we do this with one simple command. Our customer data is in the header, level 1. We stack our data based on that:

fuel_df = fuel_df.stack(1)

If you don’t get what I’m saying here, here’s the result.

                                                                                        Week 01 Week 02  Week 03 Week 04
Location  Description                   Fueling_port amount_or_price
COPS-001O Copernicus Station 1 Oxygen   C003         Amount          Landspace                0    3000      NaN     NaN
                                                                     SpaceX               97916     NaN      NaN     NaN
                                                                     Virgin Galactic        NaN     992      NaN   34293
                                                     Price           Landspace           1012.5    6750      NaN     NaN
                                                                     SpaceX              133686     NaN      NaN     NaN
                                                                     Virgin Galactic        NaN  2012.5      NaN   62300
COPS-001H Copernicus Station 1 Hydrogen C003         Amount          Landspace              NaN    3000      NaN     NaN
                                                                     SpaceX              195040     NaN      NaN     NaN
                                                                     Virgin Galactic        NaN    1552      NaN     NaN
                                                     Price           Landspace              NaN    9000      NaN     NaN
                                                                     SpaceX               90371     NaN      NaN     NaN
                                                                     Virgin Galactic        NaN    2107      NaN     NaN
MRUM-003  Mons Rümker Base 2            B12          Amount          Landspace              NaN     NaN      NaN     430
                                                     Price           Landspace              NaN     NaN      NaN     943

Notice the new column on the right of amount_or_price. It doesn’t have a name yet. We do that in a moment.

First we want the amount and price data to move to the header. Again, with the MultiIndex this is so simple. The amount_or_price column in the index is now on position 4 (or 3, counting from 0).

fuel_df = fuel_df.unstack(3)

And let’s have a look at the result:

                                                                       Week 01          Week 02          Week 03        Week 04
amount_or_price                                                         Amount    Price  Amount   Price   Amount  Price  Amount   Price
Location  Description                   Fueling_port
COPS-001O Copernicus Station 1 Oxygen   C003         Landspace               0   1012.5    3000    6750      NaN    NaN     NaN     NaN
                                                     SpaceX              97916   133686     NaN     NaN      NaN    NaN     NaN     NaN
                                                     Virgin Galactic       NaN      NaN     992  2012.5      NaN    NaN   34293   62300
COPS-001H Copernicus Station 1 Hydrogen C003         Landspace             NaN      NaN    3000    9000      NaN    NaN     NaN     NaN
                                                     SpaceX             195040    90371     NaN     NaN      NaN    NaN     NaN     NaN
                                                     Virgin Galactic       NaN      NaN    1552    2107      NaN    NaN     NaN     NaN
MRUM-003  Mons Rümker Base 2            B12          Landspace             NaN      NaN     NaN     NaN      NaN    NaN     430     94

Now we have a two level header again, with weeknumbers and amount/prices. One more step to go. The weeknumbers must become a new column. We need to use stack again and for this we look at the level in the header again. Weeknumbers are level 0 in the header.

fuel_df = fuel_df.stack(0)

Which gives us all that data in nice one dimensional rows.

amount_or_price                                                                  Amount    Price
Location  Description                   Fueling_port
COPS-001O Copernicus Station 1 Oxygen   C003         Landspace         Week 01        0   1012.5
                                                                       Week 02     3000     6750
                                                     SpaceX            Week 01    97916   133686
                                                     Virgin Galactic   Week 02      992   2012.5
                                                                       Week 04    34293    62300
COPS-001H Copernicus Station 1 Hydrogen C003         Landspace         Week 02     3000     9000
                                                     SpaceX            Week 01   195040    90371
                                                     Virgin Galactic   Week 02     1552     2107
MRUM-003  Mons Rümker Base 2            B12          Landspace         Week 04      430      943

Add nice index names for customer data and weeknumbers (I assure you space customer is a technical term).

fuel_df.index.names = ['Location', 'Description', 'Fueling_port', 'Space_customer', 'Weeknumber']

All that’s left to do, is write it to a csv file.

print(fuel_df.to_csv('Pandas moon fueling.csv'))

This is what it looks like:

Location,Description,Fueling_port,Space_customer,Weeknumber,Amount,Price
COPS-001O,Copernicus Station 1 Oxygen,C003,Landspace,Week 01,0,1012.5
COPS-001O,Copernicus Station 1 Oxygen,C003,Landspace,Week 02,3000,6750
COPS-001O,Copernicus Station 1 Oxygen,C003,SpaceX,Week 01,97916,133685.5
COPS-001O,Copernicus Station 1 Oxygen,C003,Virgin Galactic,Week 02,992,2012.5
COPS-001O,Copernicus Station 1 Oxygen,C003,Virgin Galactic,Week 04,34293,62300
COPS-001H,Copernicus Station 1 Hydrogen,C003,Landspace,Week 02,3000,9000
COPS-001H,Copernicus Station 1 Hydrogen,C003,SpaceX,Week 01,195040,90371
COPS-001H,Copernicus Station 1 Hydrogen,C003,Virgin Galactic,Week 02,1552,2107
MRUM-003,Mons Rümker Base 2,B12,Landspace,Week 04,430,943
[..]

Who’s unstructured data now, huh? Victory is mine!

About Marcel-Jan Krijgsman

In 2017 I made the leap to Big Data after 20 years of experience with Oracle databases. I followed courses on Hadoop, Big Data Analytics, Machine Learning and Python, MongoDB and Elasticsearch.
This entry was posted in Howto, Python and tagged , , , , , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.