{"id":295,"date":"2017-11-14T13:14:13","date_gmt":"2017-11-14T13:14:13","guid":{"rendered":"http:\/\/marcel-jan.eu\/datablog\/?p=295"},"modified":"2017-11-14T13:14:42","modified_gmt":"2017-11-14T13:14:42","slug":"tutorial-lets-throw-some-asteroids-in-apache-hive","status":"publish","type":"post","link":"https:\/\/marcel-jan.eu\/datablog\/2017\/11\/14\/tutorial-lets-throw-some-asteroids-in-apache-hive\/","title":{"rendered":"Tutorial: Let&#8217;s throw some asteroids in Apache Hive"},"content":{"rendered":"<p>This is a tutorial on how to import fixed-length data in Apache Hive (in Hortonworks Data Platform 2.6.1). The idea is that non-Hive, non-Hadoop-savvy people can follow along, so let me know if I succeeded (make sure you don&#8217;t look like comment spam though. I&#8217;m getting a lot of that lately, even though they never pass my approval).<\/p>\n<h2>Intro<\/h2>\n<p>Currently I&#8217;m studying for the Hortonworks Data Platform Certified Developer: Spark using Python exam (or <a href=\"https:\/\/hortonworks.com\/services\/training\/certification\/hdp-certified-spark-developer\/\">HDPCD: Spark using Python<\/a>). One part of the exam objectives is using SQL in Spark. Along the way you also work with Hive, the data warehouse software in Hadoop.<\/p>\n<p>I was following the free Udemy HDPCD Spark using Python <a href=\"https:\/\/www.youtube.com\/watch?v=aLt6n6shqJw&amp;list=PLf0swTFhTI8pronNK7Gm-isKX7tdNb0Go&amp;index=1\">preparation course<\/a> by ITVersity. The course is good BTW, especially for the price :). But after playing along with the Core Spark videos, the course again used the same boring revenue data for the Spark SQL part. And I thought: &#8220;I know SQL pretty well. Why not use data that is a bit more interesting?&#8221; And so I downloaded the Minor Planet Center&#8217;s <a href=\"http:\/\/www.minorplanetcenter.net\/data\">asteroid data<\/a>. This contains all the known asteroids until at least yesterday. 
At this moment, that is about 745,000 lines of data.<!--more--><\/p>\n<p>The Minor Planet Center has that asteroid data in JSON and .dat form. The .dat form has fixed-length rows (in Fortran format no less). I would have preferred the JSON format, but I have had problems before with reading this particular JSON format in Spark and Python. I&#8217;ve tried it once more, but I could not get it working without writing a special JSON parser. So in this case I went for the .dat file.<\/p>\n<h2>The environment<\/h2>\n<p>For this experiment I&#8217;ve used the <a href=\"https:\/\/hortonworks.com\/downloads\/\">HDP (Hortonworks Data Platform) 2.6.1 Sandbox<\/a> on VirtualBox. (It could be that by the time you read this you&#8217;ll find a different version of the sandbox. If you really must, old versions of sandboxes are still available: go to the same link, but look for &#8220;Archive&#8221;. If you expand that list, you get to see older versions.)<\/p>\n<p>When the sandbox is started in VirtualBox, log in to the machine with PuTTY as root (password: hadoop). It will ask you for a new password. Run the command<\/p>\n<pre>ambari-admin-password-reset<\/pre>\n<p>to change the password of the Ambari admin user.<\/p>\n<p>Now you can access HDP&#8217;s management software, called Ambari, by going to <a href=\"http:\/\/sandbox.hortonworks.com:8080\/\">http:\/\/sandbox.hortonworks.com:8080\/<\/a> (or http:\/\/127.0.0.1:8080). Have a look at all the components and check that they are all started.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-299\" src=\"https:\/\/marcel-jan.eu\/datablog\/wp-content\/uploads\/2017\/11\/hive_asteroids_01-80x300.png\" alt=\"\" width=\"80\" height=\"300\" \/><\/p>\n<p>If Hive doesn&#8217;t have a green checkmark next to it, click on &#8220;Hive&#8221; and under the Service Actions button on the right, click Start. 
(I usually start all the components if any isn&#8217;t up.)<\/p>\n<h2>Preparing the asteroid data<\/h2>\n<p>Download the <a href=\"http:\/\/minorplanetcenter.net\/Extended_Files\/mpcorb_extended.dat.gz\">mpcorb_extended.dat.gz<\/a> file from the site of the Minor Planet Center. (You can also choose smaller datasets, like <a href=\"http:\/\/minorplanetcenter.net\/Extended_Files\/nea_extended.dat.gz\" target=\"_blank\" rel=\"noopener\">nea_extended.dat.gz<\/a> (only Near Earth Asteroids), if you want your queries later on to run faster.)<\/p>\n<p>The first thing I did with the .dat file was to remove the documentation and the header at the start of the file, because otherwise I would get all kinds of data type problems later on. So this stuff has to go, everything up to and including the &#8220;ASCII line&#8221;:<\/p>\n<pre>MINOR PLANET CENTER ORBIT DATABASE (MPCORB)\r\n[..bla bla ..]\r\nDes'n H G Epoch M Peri. Node Incl. e n a Reference #Obs #Opp Arc rms Perts Computer\r\n----------------------------------------------------------------------------------------------------------------------------------------------------------------<\/pre>\n<p>Surely there is an easier solution for that, but I didn&#8217;t take the time to find it. What I probably could have done smarter was to do that on Linux on the sandbox, but what is done is done.<\/p>\n<p>Upload it to the sandbox machine with WinSCP (or whatever tool you like). I usually unpack the .dat file and upload it to \/tmp. Then, as root, I create a \/data\/data-master directory, move the .dat file to it and change the ownership to spark. The \/data\/data-master directory is also owned by spark.<\/p>\n<pre>mkdir -p \/data\/data-master\r\nmv \/tmp\/mpcorb_extended_noheader.dat \/data\/data-master\r\nchown spark:spark \/data\/data-master\r\nchown spark:spark \/data\/data-master\/mpcorb_extended_noheader.dat<\/pre>\n<h2>Prepare Hive<\/h2>\n<p>What we want to do is create a table with asteroid data that we can do selects on. First we need to start our Hive client. 
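<\/p>\n<p>(As an aside: HiveServer2 also comes with a JDBC-based client called beeline. If you prefer it over the Hive CLI, something like this should connect on the sandbox, assuming HiveServer2 is listening on its default port 10000:)<\/p>\n<pre>beeline -u jdbc:hive2:\/\/sandbox.hortonworks.com:10000<\/pre>\n<p>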
After all sorts of issues in different versions (there are two versions of Hive even on this sandbox), I&#8217;ve decided this works for me:<\/p>\n<pre>export SPARK_MAJOR_VERSION=2\r\nexport HIVE_HOME=\/usr\/hdp\/2.6.1.0-129\/hive\r\nhive -hiveconf hive.execution.engine=mr<\/pre>\n<p>Now you need a database. You&#8217;ll see this works pretty similarly to MySQL (admittedly I don&#8217;t know MySQL very well, but from what I&#8217;ve seen, it&#8217;s similar).<\/p>\n<pre>create database asteroids;\r\nuse asteroids;<\/pre>\n<h2>Uploading the raw data<\/h2>\n<p>So now we just create a table and load the data, right? Not so fast. You can&#8217;t do that with fixed-length data. From what I&#8217;ve found, you need to create a table with the raw data (just one column with the raw lines from the .dat file) and then create the actual table with all kinds of columns.<\/p>\n<p>Creating the table is pretty simple.<\/p>\n<pre>create table asteroids_raw (line STRING);<\/pre>\n<p>Now we load the data from the file.<\/p>\n<pre>LOAD DATA LOCAL INPATH '\/data\/data-master\/mpcorb_extended_noheader.dat' OVERWRITE INTO TABLE asteroids_raw;<\/pre>\n<p>What you can do now is get your data from this table like this (how I knew where to place the substr boundaries will be explained in a moment):<\/p>\n<pre>SELECT\r\nsubstr(line,1,7) as designation,\r\nsubstr(line,9,5) as absolute_magnitude,\r\nsubstr(line,15,5) as slope_parameter,\r\nsubstr(line,21,5) as epoch,\r\nsubstr(line,27,9) as mean_anomaly,\r\nsubstr(line,38, 8) as argument_perhelion,\r\nsubstr(line,49, 8) as longitude_ascending,\r\nsubstr(line,60, 8) as inclination\r\nFROM asteroids_raw LIMIT 10;<\/pre>\n<p>And you get this result:<\/p>\n<pre>00001 3.34 0.12 K1794 309.49412  73.0236  80.3088 10.5932\r\n00002 4.13 0.11 K1794 291.65136 309.9915 173.0871 34.8379\r\n00003 5.33 0.32 K1794 259.23491 248.2064 169.8582 12.9899\r\n00004 3.20 0.32 K1794 292.71034 150.9429 103.8359  7.1400\r\n00005 6.85 0.15 K1794 139.06218 358.7568 
141.5834  5.3679\r\n00006 5.71 0.24 K1794 289.56753 239.8605 138.6480 14.7375\r\n00007 5.51 0.15 K1794 339.84951 145.3145 259.5644  5.5231\r\n00008 6.49 0.28 K1794  13.76136 285.3102 110.9008  5.8865\r\n00009 6.28 0.17 K1794 116.70276   6.1993  68.9172  5.5749\r\n00010 5.43 0.15 K1794  46.19231 312.0384 283.2182  3.8320<\/pre>\n<p>So the data is there, but of course we don&#8217;t want to search data with substr commands. We want to select stuff from columns with actual names. So we&#8217;re going to build our actual table from this raw data.<\/p>\n<h2>A better asteroids table<\/h2>\n<p>How do we know what data is in what field? The Minor Planet Center has <a href=\"http:\/\/minorplanetcenter.net\/Extended_Files\/Extended_MPCORB_Data_Format_Manual.pdf\">documentation (PDF)<\/a> which tells you that.<\/p>\n<div id=\"attachment_302\" style=\"width: 461px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-302\" class=\" wp-image-302\" src=\"https:\/\/marcel-jan.eu\/datablog\/wp-content\/uploads\/2017\/11\/hive_asteroids_02-300x187.png\" alt=\"\" width=\"451\" height=\"281\" \/><p id=\"caption-attachment-302\" class=\"wp-caption-text\">An excerpt of the documentation of the Minor Planet Center.<\/p><\/div>\n<p>It&#8217;s simply a matter of choosing the right characters and putting that data in an actual column. The data format is based on Fortran and you can find what this data format (Format column in the table above) means <a href=\"https:\/\/pages.mtu.edu\/~shene\/COURSES\/cs201\/NOTES\/chap05\/format.html\">on this site<\/a>.<\/p>\n<p>Here&#8217;s a short example of a table with just the first two columns. 
First we create the table:<\/p>\n<pre>create table asteroids (\r\ndesignation varchar(7),\r\nabsolute_magnitude decimal(5,2)\r\n) row format delimited fields terminated by ' '\r\nstored as textfile;<\/pre>\n<p>From the documentation we know that the first column, the &#8220;Number or provisional designation&#8221;, will be alphanumeric and 7 characters long (Fortran format: a7). The second one, the absolute magnitude, will be a decimal with a total field width of 5 characters, two of which are right of the decimal point (Fortran format: f5.2).<\/p>\n<p>And you insert that data this way:<\/p>\n<pre>INSERT INTO TABLE asteroids\r\nSELECT\r\ntrim(substr(line,1,7)) as designation,\r\ncast(trim(substr(line,9,5)) as decimal(5,2)) as absolute_magnitude\r\nFROM asteroids_raw LIMIT 10;<\/pre>\n<p>So first I trim any spaces with the trim command. For the decimal value I had to change the data type to decimal with a cast command. That data now fits in my absolute_magnitude column, which is also decimal(5,2). With LIMIT 10 I limit the number of rows to 10.<\/p>\n<p>Now you might think &#8220;Let&#8217;s create a table with all the columns and import as many columns as I can&#8221;. But if you try to insert fewer columns than the table has, you get this error message:<\/p>\n<pre>FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number\/types are different 'asteroids': Table insclause-0 has 28 columns, but query has 22 columns.<\/pre>\n<p>This is because Hive doesn&#8217;t know in what column you want the data to go. 
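<\/p>\n<p>(A quick way to compare your query&#8217;s column list against the target table, before fixing the insert, is to describe the table first:)<\/p>\n<pre>describe asteroids;<\/pre>\n<p>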
So if you create a table with more columns and insert only two, this would be better:<\/p>\n<pre>INSERT INTO TABLE asteroids (designation, absolute_magnitude)\r\nSELECT trim(substr(line,1,7)) as designation,\r\ncast(trim(substr(line,9,5)) as decimal(5,2)) as absolute_magnitude \r\nFROM asteroids_raw LIMIT 10;<\/pre>\n<h2>And now the big asteroids table<\/h2>\n<p>Here is the create table statement with all the columns:<\/p>\n<pre>CREATE TABLE asteroids (\r\ndesignation varchar(7),\r\nabsolute_magnitude decimal(5,2),\r\nslope_parameter decimal(5,2),\r\nepoch varchar(5),\r\nmean_anomaly decimal(9,5),\r\nargument_perhelion decimal(9,5),\r\nlongitude_ascending decimal(9,5),\r\ninclination decimal(9,5),\r\neccentricity decimal(9,7),\r\nmean_daily_motion decimal(11,8),\r\nsemimajor_axis decimal(11,7),\r\nuncertainty_parameter varchar(1),\r\nreference varchar(9),\r\nnumber_observations int,\r\nnumber_oppositions int,\r\nobservation_data varchar(10),\r\nrms_residual decimal(4,2),\r\ncoarse_indicator_pertubers varchar(3),\r\nprecise_indicator_pertubers varchar(3),\r\ncomputer_name varchar(11),\r\nhex_flags varchar(4),\r\nreadable_designation varchar(30),\r\nlast_observation timestamp) row format delimited fields terminated by ' '\r\nstored as textfile;<\/pre>\n<p>And this is how you import the data from the asteroids_raw table:<\/p>\n<pre>INSERT INTO TABLE asteroids\r\n(designation, absolute_magnitude, slope_parameter, epoch, mean_anomaly, argument_perhelion, longitude_ascending, inclination, eccentricity,\r\nmean_daily_motion, semimajor_axis, uncertainty_parameter, reference, number_observations, number_oppositions, observation_data, rms_residual,\r\ncoarse_indicator_pertubers, precise_indicator_pertubers, computer_name, hex_flags, readable_designation, last_observation)\r\nSELECT\r\ntrim(substr(line,1,7)) as designation,\r\ncast(trim(substr(line,9,5)) as decimal(5,2)) as absolute_magnitude,\r\ncast(trim(substr(line,15,5)) as decimal(5,2)) as 
slope_parameter,\r\nsubstr(line,21,5) as epoch,\r\ncast(trim(substr(line,27,9)) as decimal(9,5)) as mean_anomaly,\r\ncast(trim(substr(line,38, 9)) as decimal(9,5)) as argument_perhelion,\r\ncast(trim(substr(line,49, 9)) as decimal(9,5)) as longitude_ascending,\r\ncast(trim(substr(line,60, 9)) as decimal(9,5)) as inclination,\r\ncast(trim(substr(line,71, 9)) as decimal(9,7)) as eccentricity,\r\ncast(trim(substr(line,81, 13)) as decimal(11,8)) as mean_daily_motion,\r\ncast(trim(substr(line,93, 11)) as decimal(11,7)) as semimajor_axis,\r\ntrim(substr(line, 106, 1)) as uncertainty_parameter,\r\ntrim(substr(line, 108, 9)) as reference,\r\ncast(trim(substr(line, 118, 5)) as int) as number_observations,\r\ncast(trim(substr(line, 124, 3)) as int) as number_oppositions,\r\ntrim(substr(line, 128, 9)) as observation_data,\r\ncast(trim(substr(line, 138, 4)) as decimal(4,2)) as rms_residual,\r\ntrim(substr(line, 143, 3)) as coarse_indicator_pertubers,\r\ntrim(substr(line, 147, 3)) as precise_indicator_pertubers,\r\ntrim(substr(line, 151, 11)) as computer_name,\r\ntrim(substr(line, 162, 4)) as hex_flags,\r\ntrim(substr(line, 167, 27)) as readable_designation,\r\ncast(regexp_replace(trim(substr(line, 195, 8)), '(\\\\d{4})(\\\\d{2})(\\\\d{2})', '$1-$2-$3 0:0:0.000') as timestamp) as last_observation\r\nFROM asteroids_raw;<\/pre>\n<p>This works well for most of the data. 
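<\/p>\n<p>(Before querying further, it doesn&#8217;t hurt to check that the raw and parsed tables hold the same number of rows; a simple sanity check:)<\/p>\n<pre>select count(*) from asteroids_raw;\r\nselect count(*) from asteroids;<\/pre>\n<p>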
Here is a simple select statement on our newly loaded table:<\/p>\n<pre>select designation, absolute_magnitude, semimajor_axis, inclination, eccentricity \r\nfrom asteroids limit 10;\r\n\r\n00001  3.34   2.7674094  10.59322  0.0756074\r\n00002  4.13   2.7730852  34.83792  0.2305974\r\n00003  5.33   2.6685312  12.98996  0.2568534\r\n00004  3.2    2.3617776   7.14002  0.0891525\r\n00005  6.85   2.5732853   5.36794  0.191465\r\n00006  5.71   2.4249155  14.7375   0.2028017\r\n00007  5.51   2.3857833   5.52317  0.2314413\r\n00008  6.49   2.2014052   5.88654  0.1570288\r\n00009  6.28   2.3872371   5.57497  0.1221133\r\n00010  5.43   3.1398783   3.83201  0.1126909\r\nTime taken: 0.14 seconds, Fetched: 10 row(s)<\/pre>\n<p>But I had some issues with the columns further &#8220;to the right&#8221;. For example: this query on the raw data (on an object with designation K10B02G) results in this correct data:<\/p>\n<pre>SELECT\r\ntrim(substr(line, 128, 10)) as observation_data,\r\ncast(trim(substr(line, 138, 5)) as decimal(4,2)) as rms_residual,\r\ntrim(substr(line, 143, 3)) as coarse_indicator_pertubers,\r\ntrim(substr(line, 147, 3)) as precise_indicator_pertubers,\r\ntrim(substr(line, 151, 11)) as computer_name,\r\ntrim(substr(line, 162, 4)) as hex_flags,\r\nsubstr(line, 167, 27) as readable_designation,\r\ncast(regexp_replace(trim(substr(line, 195, 8)), '(\\\\d{4})(\\\\d{2})(\\\\d{2})', '$1-$2-$3 0:0:0.000') as timestamp) as last_observation\r\nFROM asteroids_raw\r\nWHERE trim(substr(line,1,7)) = 'K10B02G';<\/pre>\n<pre>46 days  0.35  M-v  3Eh  MPCALB  2803  2010 BG2  2010-03-05 00:00:00<\/pre>\n<p>But after inserting this in the asteroids table, I get this data, which is not correct:<\/p>\n<pre class=\"prettyprint linenums\">46      NULL            0.3     M-v     3Eh     MPCA    2803    NULL<\/pre>\n<p>I&#8217;ve asked <a href=\"https:\/\/community.hortonworks.com\/questions\/147811\/importing-fixed-data-in-hive-gives-some-strange-re.html\">a question at the Hortonworks Community<\/a> about that. Hopefully someone knows what went wrong there. I&#8217;ll update this blogpost when I have more information.<\/p>\n<p>Anyhow, I now have a table on which I can do some querying. So what I set out to do is mostly done. Now it&#8217;s time to try some stuff from Spark on this data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a tutorial on how to import fixed-length data in Apache Hive (in Hortonworks Data Platform 2.6.1). The idea is that non-Hive, non-Hadoop-savvy people can follow along, so let me know if I succeeded (make sure you don&#8217;t look like comment spam though. 
I&#8217;m getting [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[35],"tags":[86,79,4,42,85,5,82,84,83],"class_list":["post-295","post","type-post","status-publish","format-standard","hentry","category-apache-products-for-outsiders","tag-ambari","tag-asteroids","tag-hadoop","tag-hdp-sandbox","tag-hdpcd","tag-hive","tag-query","tag-spark","tag-sql"],"_links":{"self":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/295","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/comments?post=295"}],"version-history":[{"count":13,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/295\/revisions"}],"predecessor-version":[{"id":860,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/295\/revisions\/860"}],"wp:attachment":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/media?parent=295"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/categories?post=295"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/tags?post=295"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}