Recently I got a request to add specific data quality metadata to the csv datasets that my client delivers to customers. It was very simple: just counts, min, max and, in case of integers, sums per attribute. Not a difficult task. After a short talk with the architects we decided to build this with Python and pandas. Because my efforts were required in another project at that time, my fellow DIKW entrepreneur Wyas built most of it. It worked out well: it ran through a couple of GBs in minutes.
A mortgage of a millennium
Then we ran it against another dataset and, “Good news everyone”: this error appeared:
pandas.errors.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3032-04-01 00:00:00
We were dealing with mortgage data here. And this date (3032-04-01) popped up as a mortgage end date. What sane financial organization would have the year 3032 as the end date for a mortgage? Obviously this was a typo. After some investigation I found there was only one row like that. Must have been Bender and Fry’s Apartment 00100100 from Futurama.
Date limitations in to_datetime
The to_datetime function in the pandas library has its limits. This is because of the number of nanoseconds the datetime64 data type can handle. You can check it yourself with pd.Timestamp.min and pd.Timestamp.max. You’ll find the minimum date is 1677-09-21 00:12:43.145224193 (sorry, Galileo) and the maximum is 2262-04-11 23:47:16.854775807 (sorry, Spock).
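You can verify those bounds in any Python session; they follow directly from the signed 64-bit nanosecond counter behind datetime64[ns]:

```python
import pandas as pd

# The representable range is bounded by a signed 64-bit count
# of nanoseconds since the 1970-01-01 epoch
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```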
So I reported that our Python code could not handle these far future dates, but that otherwise the script was fine. And the product owner said: “That’s nice, but your Python script is supposed to show data quality issues. This year 3032 is a data quality issue. You have to find a way around it and show this far future date.”
Luckily the Python library dateutil is able to handle dates that go much further than 2262-04-11. Hurray. So all we do is a little conversion with dateutil. No problem:
df_datetime = df_datetime.apply(dateutil.parser.parse)
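As a quick sanity check: Python’s plain datetime type supports years 1 through 9999, so dateutil has no trouble with the year-3032 typo that breaks datetime64[ns]:

```python
import datetime
import dateutil.parser

# Plain datetime objects go up to year 9999,
# far beyond the 2262 limit of datetime64[ns]
parsed = dateutil.parser.parse("3032-04-01")
print(parsed)            # 3032-04-01 00:00:00
print(datetime.MAXYEAR)  # 9999
```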
That worked fine with the first datetime column that had a far future date. But the second one had empty values in it. And dateutil doesn’t like that. When I looked for solutions StackOverflow was like “convert it back to datetime64!”. Yeah, no.
The solution I found was to filter out the empty strings and then drop the NaN values before parsing. Which was okay in this case, because we didn’t want the empty values to be part of the data quality result anyway.
df_datetime = df_datetime[df_datetime.astype(bool)]
df_datetime = df_datetime.dropna()
df_datetime = df_datetime.apply(dateutil.parser.parse)
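To see the three steps in action on a toy column (the sample values and variable name here are made up for illustration, not from the original script):

```python
import numpy as np
import pandas as pd
import dateutil.parser

# Made-up sample: a normal date, an empty string, a NaN and the far future typo
col = pd.Series(["2024-01-15", "", np.nan, "3032-04-01"])

col = col[col.astype(bool)]             # empty strings are falsy and get dropped
col = col.dropna()                      # NaN casts to True above, so drop it here
col = col.apply(dateutil.parser.parse)  # plain datetime objects, no nanosecond limit

print(col.min())  # 2024-01-15 00:00:00
print(col.max())  # 3032-04-01 00:00:00
```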
So this worked. Hurray! But going through the whole dataset was now much slower: the 2-minute run became a 1.5-hour run. Not good.
If it breaks, do the slow method
The solution was to let the to_datetime conversion run for every datetime column and use an exception handler for when the OutOfBoundsDatetime error occurred.
try:
    df_datetime = pd.to_datetime(self.tabledata_df[attribute])
except pd.errors.OutOfBoundsDatetime:
    df_datetime = self.tabledata_df[attribute]
    df_datetime = df_datetime[df_datetime.astype(bool)]
    df_datetime = df_datetime.dropna()
    df_datetime = df_datetime.apply(dateutil.parser.parse)
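The same pattern as a self-contained sketch; the function name and sample data are mine, and the `self.tabledata_df[attribute]` lookup from the original becomes a plain Series argument:

```python
import pandas as pd
import dateutil.parser

def parse_datetime_column(column: pd.Series) -> pd.Series:
    """Try the fast vectorized parse first; fall back to dateutil
    only when a value falls outside the datetime64[ns] range."""
    try:
        return pd.to_datetime(column)
    except pd.errors.OutOfBoundsDatetime:
        column = column[column.astype(bool)]  # drop empty strings
        column = column.dropna()
        return column.apply(dateutil.parser.parse)

normal = pd.Series(["2024-01-15", "2025-06-30"])  # fast path
typo = pd.Series(["2024-01-15", "3032-04-01"])    # triggers the fallback

print(parse_datetime_column(normal).max())  # 2025-06-30 00:00:00
print(parse_datetime_column(typo).max())    # 3032-04-01 00:00:00
```

Columns without far future dates never pay the slow per-row parsing cost, which is why the first dataset's runtime dropped back to its original level.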
So in case of far future dates the same column has to be read twice. But now the first dataset ran in 2 minutes again, because it had no far future dates. The second dataset ran in 30 minutes, which is also acceptable. It turned out there were 4 columns with far future dates. Those will be investigated, of course.