{"id":1348,"date":"2022-03-24T08:50:19","date_gmt":"2022-03-24T08:50:19","guid":{"rendered":"https:\/\/marcel-jan.eu\/datablog\/?p=1348"},"modified":"2022-03-24T09:11:18","modified_gmt":"2022-03-24T09:11:18","slug":"handling-far-future-dates-in-pandas","status":"publish","type":"post","link":"https:\/\/marcel-jan.eu\/datablog\/2022\/03\/24\/handling-far-future-dates-in-pandas\/","title":{"rendered":"Handling far future dates in pandas"},"content":{"rendered":"<p>Recently I got the request to add specific data quality metadata to the CSV datasets that my client delivers to customers. It was very simple. Just counts, min, max and &#8211; in case of integers &#8211; sums per attribute. Not a difficult task. After a short talk with the architects we decided to build this with Python and pandas. Because my efforts were required in another project at that time, my fellow DIKW entrepreneur Wyas built most of it. It worked out well. It ran through a couple of GBs in minutes.<\/p>\n<p><!--more--><\/p>\n<h2>A mortgage of a millennium<\/h2>\n<p>Then we ran it against another dataset and &#8220;<a href=\"https:\/\/futurama.fandom.com\/wiki\/Good_news,_everyone!\">Good news everyone<\/a>&#8221;: this error appeared:<\/p>\n<pre><span lang=\"EN-US\">pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3032-04-01 00:00:00<\/span><\/pre>\n<p>We were dealing with mortgage data here. And this date (3032-04-01) popped up as a mortgage end date. What sane financial organization would put the year 3032 in the end date of a mortgage? Obviously this was a typo. After some investigation I found there was only one row like that. 
Must have been <a href=\"https:\/\/futurama.fandom.com\/wiki\/Apartment_00100100\">Bender and Fry&#8217;s Apartment 00100100<\/a> from Futurama.<\/p>\n<div id=\"attachment_1352\" style=\"width: 493px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1352\" class=\"wp-image-1352 \" src=\"https:\/\/marcel-jan.eu\/datablog\/wp-content\/uploads\/2022\/03\/benders-apartment-2-1024x768.jpeg\" alt=\"\" width=\"483\" height=\"362\" \/><p id=\"caption-attachment-1352\" class=\"wp-caption-text\">Apartment 00100100<\/p><\/div>\n<p>&nbsp;<\/p>\n<h2>Date limitations in to_datetime<\/h2>\n<p>The to_datetime function in the pandas library has its limits. This is because the datetime64[ns] data type stores a timestamp as a number of nanoseconds in a 64-bit integer. You can check the limits yourself with <span lang=\"EN-US\">pd.Timestamp.min<\/span> and <span lang=\"EN-US\">pd.Timestamp.max<\/span>. You&#8217;ll find the minimum date is 1677-09-21 00:12:43.145224193 (sorry, Galileo) and the maximum is 2262-04-11 23:47:16.854775807 (sorry, Spock).<\/p>\n<p>So I reported that our Python code could not handle these far future dates. Otherwise the script was fine. And the product owner said: &#8220;That&#8217;s nice, but your Python script is supposed to show data quality issues. This year 3032 is a data quality issue. You have to find a way around it and show this far future date&#8221;.<\/p>\n<p>&nbsp;<\/p>\n<h2>Using dateutil<\/h2>\n<p>Luckily the Python library dateutil is able to handle dates that go much further than 2262-04-11. Hurray. So all we do is a little conversion with dateutil. No problem:<\/p>\n<pre><span lang=\"EN-US\">import dateutil.parser\r\n\r\ndf_datetime = df_datetime.apply(dateutil.parser.parse)<\/span><\/pre>\n<p>That worked fine with the first datetime column that had a far future date. But the second one had empty values in it. And dateutil doesn&#8217;t like that. When I looked for solutions, Stack Overflow was like &#8220;convert it back to datetime64!&#8221;. 
Yeah, no.<\/p>\n<p>The solution I found was to filter out the empty strings and then drop the NaN values. Which was okay in this case, because we didn&#8217;t want the empty strings to be part of the data quality result anyway.<\/p>\n<pre><span lang=\"EN-US\"># Keep only non-empty strings, drop NaN values, then parse with dateutil\r\ndf_datetime = df_datetime[df_datetime.astype(bool)]\r\ndf_datetime = df_datetime.dropna()\r\ndf_datetime = df_datetime.apply(dateutil.parser.parse)<\/span><\/pre>\n<p>So this worked. Hurray! But going through the whole dataset now was much slower. The 2-minute run became a 1.5-hour run. Not good.<\/p>\n<p>&nbsp;<\/p>\n<h2>If it breaks, do the slow method<\/h2>\n<p>The solution was to let the to_datetime conversion run for every datetime column and catch the exception when the OutOfBoundsDatetime error occurred.<\/p>\n<pre>try:\r\n    # Fast path: vectorized conversion to datetime64[ns]\r\n    df_datetime = pd.to_datetime(self.tabledata_df[attribute])\r\nexcept pd.errors.OutOfBoundsDatetime:\r\n    # Slow path: re-read the column and parse with dateutil instead\r\n    df_datetime = self.tabledata_df[attribute]\r\n    df_datetime = df_datetime[df_datetime.astype(bool)]\r\n    df_datetime = df_datetime.dropna()\r\n    df_datetime = df_datetime.apply(dateutil.parser.parse)<\/pre>\n<p>So in case of far future dates the same column has to be read twice. But now the first dataset ran in 2 minutes again, because it had no far future dates. The second dataset ran in 30 minutes. Which is also acceptable. It turned out there were 4 columns with far future dates. That will be investigated, of course.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently I got the request to add specific data quality metadata to the CSV datasets that my client delivers to customers. It was very simple. Just counts, min, max and &#8211; in case of integers &#8211; sums per attribute. Not a difficult task. 
After a short talk with architects we decided to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[75],"tags":[306,307,308,77,309,76],"class_list":["post-1348","post","type-post","status-publish","format-standard","hentry","category-python","tag-datetime64","tag-dateutil","tag-outofboundsdatetime","tag-pandas","tag-pd-timestamp-max","tag-python"],"_links":{"self":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/1348","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/comments?post=1348"}],"version-history":[{"count":7,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/1348\/revisions"}],"predecessor-version":[{"id":1358,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/1348\/revisions\/1358"}],"wp:attachment":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/media?parent=1348"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/categories?post=1348"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/tags?post=1348"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}