Data engineering isn’t always creating serverless APIs and ingressing terrabyte a minute streams with do-hickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step in a data driven world.
But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software has an Excel processor. It can even help you to work the data a little. But some Excel sheets simply need too much reworking. And that’s simply too big a job for NiFi. I’ve used Python and the pandas library to create a csv file that Hadoop can handle.
You can download the example files for this on GitHub:
The Excel sheet doesn’t have the exact amount of worksheets mentioned here, but the moon fuel stations sheet is identical.
In the left corner: the Excel sheet…
Clearly at my work we don’t maintain fueling stations on the Moon (yet), but the Excel sheet I encountered looked sort of like the layout of this one. Let’s take a look.
On the lefthand side you see three columns that define the fueling stations. The customers each have their own column for every week. There also is a summation of the data per week. Not seen here are the summations per year, per customer and totals per week. Every alternate row is either the amount of fuel and the money the customer paid.
I could save this as CSV file and store it in Hadoop, but it would give any data scientist a head ache. It’s more or less understandable for the human eye, but software won’t get this.
I think that this layout below would be much easier to use in analytics software:
But how do we accomplish that?
In the right corner: pandas!
Guess what? Python’s pandas library can read Excel. It even has a read_excel function. It’s almost as if other people got data delivered in Excel format..
Let’s start by importing pandas.
import pandas as pd
In my terminal window I have room enough to make my pandas output a little wider.
# Show wider tables. I've got the screen for it. pd.set_option('display.width', 300)
Now open the Excel file and let’s show the sheets in it.
fuel_file = "Pandas moon fueling.xlsx" xl = pd.ExcelFile(fuel_file) print(xl.sheet_names)
Because we have fueling stations all over the solar system (as you do), this results in:
['Moon', 'Eros', 'Ceres', 'Mars', 'Europa', 'Ganymedes', 'Enceladus']
But in this case I just want to start by reading the Moon data. Now pandas has read_excel, but you can also use xl.parse. xl being where I stored my Excel data previously. It’s basically the same. I just found this on StackOverflow first 🙂 .
Headers and indexes
But in this case that simple example you’ll find first when googling “pandas read excel” is not enough. This has to do with headers and indexes. If you don’t pick your headers and indexes from the Excel sheet, pandas will pick something.
I’ve been reading everywhere to learn how to choose these. Because normally you have just a bunch of column names in a row. That would be your header row. Not so now. We have multiple levels of rows.
So this is how you read the Moon sheet and store it in a dataframe.
fuel_df = xl.parse("Moon", header=, index_col=[0,1,2])
I chose the row that has the weeknumbers: row 2 in the screenshot above. And counting from zero, that would be row 1. Therefor, header=.
The index is basically your row name. Database people might think of this one like a primary key. In this case I wanted the first three columns combined. Therefor, index_col=[0,1,2].
Let’s have a look at what we got now:
Unnamed: 3 Unnamed: 4 Unnamed: 5 Week 01 Location Description Fueling port NaN NaN NaN Total Fueling port SpaceX Blue Origin Landspace NaN COPS-001O Copernicus Station 1 Oxygen C003 97916 NaN 0 97916 C003 133686 NaN 1012.5 134698 COPS-001H Copernicus Station 1 Hydrogen C003 195040 NaN NaN 195040 [..]
The first row with the “Unnamed: 3 Unnamed: 4 ” is now our header. You can see the weeknumber in there and that’s what I was going for.
You can see the complete header with this:
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Week 01', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Week 02', 'Unnamed: 12', 'Unnamed: 13', 'Week 03', 'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Week 04'], dtype='object')
Clearly this is not sufficient. Let’s work on that.
Weeknumbers in the header
One weakness in the Excel sheet, is that weeknumbers are to the right of the data from that week. Uninformed software would assume the data of that week is in the columns right of it. What I want, is that every column of a certain weeknumber get that weeknumber in the header.
I start by storing the column names (the header) in a variable and work on that. I also store the row with the customer names. We’ll get to that later.
weeknumbers_list = list(fuel_df.columns) spacecustomers = fuel_df.iloc
My way of doing this might be more cumbersome than necessary, but it does the job.
start_position = 0 # Loop through values with "Week" in the name for weeknr_pos in positions_weeknumbers: # Go from the start position of that week and name everything with that weeknumber # until the end of that week. for pos in range(start_position, weeknr_pos): # Cell gets weeknumber in the weeknumbers_list[pos] = weeknumbers_list[weeknr_pos] # Determine the first column position of the next weeknumber. start_position = weeknr_pos + 1 # The last column of weeknr has the totals. We'll remove that one. weeknumbers_list[weeknr_pos] = "Remove"
Last step: feed the weeknumber list to the dataframe header.
fuel_df.columns = weeknumbers_list
Remove the unnecessary stuff
Next, let’s remove the first two rows and let’s remove all columns with Total in the name. We already indicated them with the term “Remove”.
fuel_df = fuel_df.iloc[2:]
fuel_df.drop("Remove", axis=1, inplace=True)
So what do we got now?
Week 01 Week 01 Week 01 Week 02 Week 02 COPS-001O Copernicus Station 1 Oxygen C003 97916 NaN 0 NaN NaN C003 133686 NaN 1012.5 NaN NaN COPS-001H Copernicus Station 1 Hydrogen C003 195040 NaN NaN NaN NaN C003 90371 NaN NaN NaN NaN MRUM-003 Mons Rümker Base 2 B12 NaN NaN NaN NaN NaN [..]
We’re missing something, aren’t we? We’re missing the names of the customers. But I have prepared something for that. There’s a spacecustomers variable with all the names in them. But it also has some of those NaN values in them. So let’s clean that up.
spacecustomers_cleanlist = [x for x in spacecustomers if str(x) != 'nan']
Now I want to add the customer list to the header as an extra layer. Why? Because I want those columns not just identified by weeknumbers, but also by customer names.
I’ve been looking for SO long on how to do this. I’ve been trying and trying and this is what I came up with. Use this to your advantage.
column_array =  column_array.append(fuel_df.columns) column_array.append(spacecustomers_cleanlist)
column_array here is basically a list of lists. Maybe I shouldn’t have called it an array. Too confusing. Anyway, this is a way to feed two lists as two headers in the fuel_df dataframe.
fuel_df.columns = column_array
Now our dataframe looks like this
Week 01 Week 02 SpaceX Blue Origin Landspace SpaceX COPS-001O Copernicus Station 1 Oxygen C003 97916 NaN 0 NaN C003 133686 NaN 1012.5 NaN COPS-001H Copernicus Station 1 Hydrogen C003 195040 NaN NaN NaN C003 90371 NaN NaN NaN MRUM-003 Mons Rümker Base 2 B12 NaN NaN NaN NaN B12 NaN NaN NaN NaN [..]
I’m so happy that I managed to do this. I’ve been reading about pandas’ MultiIndex feature, and all kinds of crazy schemes and possible solutions. It would not work. Until I saw this video by Zax Rosenberg speaking at a Chicaco Python Users Group meeting about just MultiIndexes and related features. Then I understood what I needed to do.
How to tell what row is what?
There’s still the little matter of telling which row has amounts of fuel and which one has prices in our dataframe. For this we create a list and add this as a new column. I feel I could have done this part a little smarter, but this for now worked. I just tell that even rownumbers are amounts and uneven rownumbers are the prices.
# Add a column to define what row is the amount of fuel and what is the price row_definer_column =  for rowdef in range(2, sheet_depth): # print(rowdef) if rowdef % 2 == 0: # Even row_definer_column.append("Amount") elif rowdef % 2 == 1 and rowdef != 1: # Uneven row_definer_column.append("Price")
And this is how we add this list as the fourth column (loc=3). This only works if your list is as long as the rest of the data.
fuel_df.insert(loc=3, column='amount_or_price', value=row_definer_column)
We also add this to our already existing MultiIndex by appending it (thanks, Zax Rosenberg!).
fuel_df.set_index('amount_or_price', append=True, inplace=True)
The names of the other columns in the MultiIndex were missing, so let’s fix that now:
fuel_df.index.names = ['Location', 'Description', 'Fueling_port', 'amount_or_price']
Let’s take another look at our dataframe. Where are we now?
Week 01 Week 02 SpaceX Blue Origin Landspace SpaceX Location Description Fueling_port amount_or_price COPS-001O Copernicus Station 1 Oxygen C003 Amount 97916 NaN 0 NaN Price 133686 NaN 1012.5 NaN COPS-001H Copernicus Station 1 Hydrogen C003 Amount 195040 NaN NaN NaN Price 90371 NaN NaN NaN MRUM-003 Mons Rümker Base 2 B12 Amount NaN NaN NaN NaN
Not shown in this are a couple of trailing rows with subtotals and totals. We don’t need that. It will mess up the results. We have analytics software for making all those wonderful reports with those totals and subtotals.
fuel_df = fuel_df.iloc[:-6]
And now: our magic stacking trick!
I hope you made it up to here, because this is where the beautiful part starts. Getting all that data in 1-dimensional rows is now so easy, you won’t believe it. We spend all that time defining headers and indexes. This is where it all will pay off.
First, let’s show how our MultiIndex looks now:
We’re going to start by saying that we want the customer data no longer as a header. We want to stack the data as part of the index.What that means, is that we get a new column with customer names and all the data corresponding to that pivots accordingly.
And we do this with one simple command. Our customer data is in the header, level 1. We stack our data based on that:
fuel_df = fuel_df.stack(1)
If you don’t get what I’m saying here, here’s the result.
Week 01 Week 02 Week 03 Week 04 Location Description Fueling_port amount_or_price COPS-001O Copernicus Station 1 Oxygen C003 Amount Landspace 0 3000 NaN NaN SpaceX 97916 NaN NaN NaN Virgin Galactic NaN 992 NaN 34293 Price Landspace 1012.5 6750 NaN NaN SpaceX 133686 NaN NaN NaN Virgin Galactic NaN 2012.5 NaN 62300 COPS-001H Copernicus Station 1 Hydrogen C003 Amount Landspace NaN 3000 NaN NaN SpaceX 195040 NaN NaN NaN Virgin Galactic NaN 1552 NaN NaN Price Landspace NaN 9000 NaN NaN SpaceX 90371 NaN NaN NaN Virgin Galactic NaN 2107 NaN NaN MRUM-003 Mons Rümker Base 2 B12 Amount Landspace NaN NaN NaN 430 Price Landspace NaN NaN NaN 943
Notice the new column on the right of amount_or_price. It doesn’t have a name yet. We do that in a moment.
First we want the amount and price data to move to the header. Again, with the MultiIndex this is so simple. The amount_or_price column in the index is now on position 4 (or 3, counting from 0).
fuel_df = fuel_df.unstack(3)
And let’s have a look at the result:
Week 01 Week 02 Week 03 Week 04 amount_or_price Amount Price Amount Price Amount Price Amount Price Location Description Fueling_port COPS-001O Copernicus Station 1 Oxygen C003 Landspace 0 1012.5 3000 6750 NaN NaN NaN NaN SpaceX 97916 133686 NaN NaN NaN NaN NaN NaN Virgin Galactic NaN NaN 992 2012.5 NaN NaN 34293 62300 COPS-001H Copernicus Station 1 Hydrogen C003 Landspace NaN NaN 3000 9000 NaN NaN NaN NaN SpaceX 195040 90371 NaN NaN NaN NaN NaN NaN Virgin Galactic NaN NaN 1552 2107 NaN NaN NaN NaN MRUM-003 Mons Rümker Base 2 B12 Landspace NaN NaN NaN NaN NaN NaN 430 94
Now we have a two level header again, with weeknumbers and amount/prices. One more step to go. The weeknumbers must become a new column. We need to use stack again and for this we look at the level in the header again. Weeknumbers are level 0 in the header.
fuel_df = fuel_df.stack(0)
Which gives us all that data in nice one dimensional rows.
amount_or_price Amount Price Location Description Fueling_port COPS-001O Copernicus Station 1 Oxygen C003 Landspace Week 01 0 1012.5 Week 02 3000 6750 SpaceX Week 01 97916 133686 Virgin Galactic Week 02 992 2012.5 Week 04 34293 62300 COPS-001H Copernicus Station 1 Hydrogen C003 Landspace Week 02 3000 9000 SpaceX Week 01 195040 90371 Virgin Galactic Week 02 1552 2107 MRUM-003 Mons Rümker Base 2 B12 Landspace Week 04 430 943
Add nice index names for customer data and weeknumbers (I assure you space customer is a technical term).
fuel_df.index.names = ['Location', 'Description', 'Fueling_port', 'Space_customer', 'Weeknumber']
All that’s left to do, is write it to a csv file.
print(fuel_df.to_csv('Pandas moon fueling.csv'))
This is what it looks like:
Location,Description,Fueling_port,Space_customer,Weeknumber,Amount,Price COPS-001O,Copernicus Station 1 Oxygen,C003,Landspace,Week 01,0,1012.5 COPS-001O,Copernicus Station 1 Oxygen,C003,Landspace,Week 02,3000,6750 COPS-001O,Copernicus Station 1 Oxygen,C003,SpaceX,Week 01,97916,133685.5 COPS-001O,Copernicus Station 1 Oxygen,C003,Virgin Galactic,Week 02,992,2012.5 COPS-001O,Copernicus Station 1 Oxygen,C003,Virgin Galactic,Week 04,34293,62300 COPS-001H,Copernicus Station 1 Hydrogen,C003,Landspace,Week 02,3000,9000 COPS-001H,Copernicus Station 1 Hydrogen,C003,SpaceX,Week 01,195040,90371 COPS-001H,Copernicus Station 1 Hydrogen,C003,Virgin Galactic,Week 02,1552,2107 MRUM-003,Mons Rümker Base 2,B12,Landspace,Week 04,430,943 [..]
Who’s unstructured data now, huh? Victory is mine!