Tutorial: Let’s throw some asteroids in Apache Hive

This is a tutorial on how to import fixed-length data in Apache Hive (on Hortonworks Data Platform 2.6.1). The idea is that any non-Hive, non-Hadoop savvy people can follow along, so let me know if I succeeded (make sure you don’t look like comment spam though. I’m getting a lot of that lately, even though it never passes my approval).

Intro

Currently I’m studying for the Hortonworks Data Platform Certified Developer: Spark using Python exam (or HDPCD: Spark using Python). One part of the exam objectives is using SQL in Spark. Along the way you also work with Hive, the data warehouse software in Hadoop.

I was following the free Udemy HDPCD Spark using Python preparation course by ITVersity. The course is good BTW, especially for the price :). But after playing along with the Core Spark videos, the course again used the same boring revenue data for the Spark SQL part. And I thought: “I know SQL pretty well. Why not use data that is a bit more interesting?” And so I downloaded the Minor Planet Center’s asteroid data. This contains all the known asteroids until at least yesterday. At this moment, that is about 745,000 lines of data.

The Minor Planet Center has that asteroid data in JSON and .dat form. The .dat form has fixed-length rows (in Fortran format, no less). I would have chosen the JSON format, but I have had problems before with reading this particular JSON format in Spark and Python. I tried it once more, but I could not get it working without writing a special JSON parser. So in this case I went for the .dat file.

The environment

For this experiment I’ve used the HDP (Hortonworks Data Platform) 2.6.1 Sandbox on VirtualBox. (It could be that by the time you read this you’ll find a different version of the sandbox. If you really must, old versions of sandboxes are still available: go to the same link, but look for “Archive”. If you expand that list, you get to see older versions.)

When the sandbox is started in VirtualBox, log in to the machine with PuTTY as root (password: hadoop). It will ask you for a new password. Run the command

ambari-admin-password-reset

to change the password of the Ambari admin user.

Now you can access HDP’s management software, called Ambari, by going to http://sandbox.hortonworks.com:8080/ (or http://127.0.0.1:8080). Have a look at all the components and check that they are all started.

If Hive doesn’t have a green checkmark next to it, click on “Hive” and under the Service Actions button on the right, click Start. (I usually start all the components if any isn’t up.)

Preparing the asteroid data

Download the mpcorb_extended.dat.gz file from the site of the Minor Planet Center. (You can also choose a smaller dataset, like nea_extended.dat.gz (Near Earth Asteroids only), if you want your queries later on to run faster.)
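If you’d rather download the file directly on the sandbox, something like this should do it (the URL is my assumption based on the current layout of the MPC site, so check it before you rely on it):

# Download and unpack the extended asteroid dataset
# NOTE: URL is an assumption; verify the current location on the MPC site
wget https://minorplanetcenter.net/Extended_Files/mpcorb_extended.dat.gz
gunzip mpcorb_extended.dat.gz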

The first thing I did with the .dat file was remove the documentation and the header at the start of the file, because otherwise I would get all kinds of data type problems later on. So this stuff has to go, everything up to and including the “ASCII line” of dashes:

MINOR PLANET CENTER ORBIT DATABASE (MPCORB)
[..bla bla ..]
Des'n H G Epoch M Peri. Node Incl. e n a Reference #Obs #Opp Arc rms Perts Computer
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Surely there is an easier solution for that; I didn’t take the time to find it. What I probably could have done smarter is do that action in Linux on the sandbox, but what is done is done.
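Something like this should work on the sandbox (a sketch, assuming GNU sed and that the dashed separator is the first line consisting entirely of dashes):

# Delete everything from line 1 up to and including the dashed "ASCII line"
sed '1,/^-\{20,\}$/d' mpcorb_extended.dat > mpcorb_extended_noheader.dat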

Upload it to the sandbox machine with WinSCP (or whatever tool you like). I usually unpack the .dat file and upload it to /tmp. Then, as root, I create a /data/data-master directory, move the .dat file into it and change the ownership to spark. The /data/data-master directory is also owned by spark.

mkdir -p /data/data-master
mv /tmp/mpcorb_extended_noheader.dat /data/data-master
chown -R spark:spark /data/data-master


Prepare Hive

What we want to do is create a Hive table with asteroid data that we can do selects and stuff on. First we need to start our Hive client. After all sorts of issues in different versions (there are two versions of Hive even on this sandbox), I’ve decided this works for me:

export SPARK_MAJOR_VERSION=2
export HIVE_HOME=/usr/hdp/2.6.1.0-129/hive
hive -hiveconf hive.execution.engine=mr

Now you need a database. You’ll see this works pretty similarly to MySQL (admittedly I don’t know MySQL very well, but from what I’ve seen, it’s similar).

create database asteroids;
use asteroids;
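If you want to convince yourself the database is really there (and see the MySQL resemblance for yourself), these familiar commands work in Hive too:

show databases;
show tables;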

Uploading the raw data

So now we just create a table and load the data, right? Not so fast. You can’t do that with fixed-length data. From what I’ve found, you need to create a table with the raw data (just one column holding the raw lines from the .dat file) and build the actual table, with all its columns, from that.

Creating the Hive table is pretty simple.

create table asteroids_raw (line STRING);

Now we load the data from the file.

LOAD DATA LOCAL INPATH '/data/data-master/mpcorb_extended_noheader.dat' OVERWRITE INTO TABLE asteroids_raw;
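A quick sanity check doesn’t hurt here. If you loaded the full mpcorb_extended file, this count should come back at around 745,000 rows:

-- Check that all lines of the .dat file made it into the raw table
select count(*) from asteroids_raw;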

What you can do now is get your data from this Hive table like this (how I knew where to place the substr calls will be explained in a moment):

SELECT
substr(line,1,7) as designation,
substr(line,9,5) as absolute_magnitude,
substr(line,15,5) as slope_parameter,
substr(line,21,5) as epoch,
substr(line,27,9) as mean_anomaly,
substr(line,38, 8) as argument_perhelion,
substr(line,49, 8) as longitude_ascending,
substr(line,60, 8) as inclination
FROM asteroids_raw LIMIT 10;

And you get this result:

00001 3.34 0.12 K1794 309.49412  73.0236  80.3088 10.5932
00002 4.13 0.11 K1794 291.65136 309.9915 173.0871 34.8379
00003 5.33 0.32 K1794 259.23491 248.2064 169.8582 12.9899
00004 3.20 0.32 K1794 292.71034 150.9429 103.8359  7.1400
00005 6.85 0.15 K1794 139.06218 358.7568 141.5834  5.3679
00006 5.71 0.24 K1794 289.56753 239.8605 138.6480 14.7375
00007 5.51 0.15 K1794 339.84951 145.3145 259.5644  5.5231
00008 6.49 0.28 K1794  13.76136 285.3102 110.9008  5.8865
00009 6.28 0.17 K1794 116.70276   6.1993  68.9172  5.5749
00010 5.43 0.15 K1794  46.19231 312.0384 283.2182  3.8320

So the data is there, but of course we don’t want to search data with substr commands. We want to select stuff from columns with actual names. So we’re going to build our actual table from this raw data.

A better asteroids table

How do we know what data is in what field? The Minor Planet Center has documentation (PDF) which tells you that.

An excerpt of the documentation of the Minor Planet Center.

It’s simply a matter of choosing the right characters and putting that data in an actual column. The data format is based on Fortran, and you can find what this format (the Format column in the table above) means on this site.

Here’s a short example of a Hive table with just the first two columns. First we create the table:

create table asteroids (
designation varchar(7),
absolute_magnitude decimal(5,2)
) row format delimited fields terminated by ' '
stored as textfile;

From the documentation we know that the first column, the “Number or provisional designation”, will be alphanumeric and 7 characters long (Fortran format: a7). The second one, the absolute magnitude, will be a number in a field 5 characters wide with two digits right of the decimal point (Fortran format: f5.2).
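In other words, the Fortran formats map to Hive types roughly like this (my own reading of the documentation, so double-check it against the PDF):

-- Fortran a7   -> varchar(7)    alphanumeric, 7 characters wide
-- Fortran f5.2 -> decimal(5,2)  numeric, field 5 characters wide, 2 digits after the point
-- Fortran i5   -> int           integer, 5 characters wide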

And you insert that data this way:

INSERT INTO TABLE asteroids
SELECT
trim(substr(line,1,7)) as designation,
cast(trim(substr(line,9,5)) as decimal(5,2)) as absolute_magnitude
FROM asteroids_raw LIMIT 10;

So first I trim away any spaces with the trim function. For the decimal value I had to change the data type with a cast to decimal. That data now fits in my absolute_magnitude column, which is also decimal(5,2). With LIMIT 10 I limit the number of inserted rows to 10.

Now you might think: “Let’s create a table with all the columns now and import as many of them as I have worked out so far”. But if you try to insert fewer columns than the Hive table has, you get this error message:

FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number/types are different 'asteroids': Table insclause-0 has 28 columns, but query has 22 columns.

This is because Hive doesn’t know which columns you want the data to go into. So if you create a table with more columns and insert only two, name the target columns explicitly:

INSERT INTO TABLE asteroids (designation, absolute_magnitude)
SELECT trim(substr(line,1,7)) as designation,
cast(trim(substr(line,9,5)) as decimal(5,2)) as absolute_magnitude 
FROM asteroids_raw LIMIT 10;

And now the big asteroids table

Here is the create table statement with all the columns:

CREATE TABLE asteroids (
designation varchar(7),
absolute_magnitude decimal(5,2),
slope_parameter decimal(5,2),
epoch varchar(5),
mean_anomaly decimal(9,5),
argument_perhelion decimal(9,5),
longitude_ascending decimal(9,5),
inclination decimal(9,5),
eccentricity decimal(9,7),
mean_daily_motion decimal(11,8),
semimajor_axis decimal(11,7),
uncertainty_parameter varchar(1),
reference varchar(9),
number_observations int,
number_oppositions int,
observation_data varchar(10),
rms_residual decimal(4,2),
coarse_indicator_pertubers varchar(3),
precise_indicator_pertubers varchar(3),
computer_name varchar(11),
hex_flags varchar(4),
readable_designation varchar(30),
last_observation timestamp) row format delimited fields terminated by ' '
stored as textfile;

And this is how you import the data from the asteroids_raw table:

INSERT INTO TABLE asteroids
(designation, absolute_magnitude, slope_parameter, epoch, mean_anomaly, argument_perhelion, longitude_ascending, inclination, eccentricity,
mean_daily_motion, semimajor_axis, uncertainty_parameter, reference, number_observations, number_oppositions, observation_data, rms_residual,
coarse_indicator_pertubers, precise_indicator_pertubers, computer_name, hex_flags, readable_designation, last_observation)
SELECT
trim(substr(line,1,7)) as designation,
cast(trim(substr(line,9,5)) as decimal(5,2)) as absolute_magnitude,
cast(trim(substr(line,15,5)) as decimal(5,2)) as slope_parameter,
substr(line,21,5) as epoch,
cast(trim(substr(line,27,9)) as decimal(9,5)) as mean_anomaly,
cast(trim(substr(line,38, 9)) as decimal(9,5)) as argument_perhelion,
cast(trim(substr(line,49, 9)) as decimal(9,5)) as longitude_ascending,
cast(trim(substr(line,60, 9)) as decimal(9,5)) as inclination,
cast(trim(substr(line,71, 9)) as decimal(9,7)) as eccentricity,
cast(trim(substr(line,81, 13)) as decimal(11,8)) as mean_daily_motion,
cast(trim(substr(line,93, 11)) as decimal(11,7)) as semimajor_axis,
trim(substr(line, 106, 1)) as uncertainty_parameter,
trim(substr(line, 108, 9)) as reference,
cast(trim(substr(line, 118, 5)) as int) as number_observations,
cast(trim(substr(line, 124, 3)) as int) as number_oppositions,
trim(substr(line, 128, 9)) as observation_data,
cast(trim(substr(line, 138, 4)) as decimal(4,2)) as rms_residual,
trim(substr(line, 143, 3)) as coarse_indicator_pertubers,
trim(substr(line, 147, 3)) as precise_indicator_pertubers,
trim(substr(line, 151, 11)) as computer_name,
trim(substr(line, 162, 4)) as hex_flags,
trim(substr(line, 167, 27)) as readable_designation,
cast(regexp_replace(trim(substr(line, 195, 8)), '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3 0:0:0.000') as timestamp) as last_observation
FROM asteroids_raw;
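The last column deserves a word of explanation. The date of the last observation is stored as YYYYMMDD, and the regexp_replace rewrites it into a string that Hive can cast to a timestamp. Here is that conversion in isolation:

-- '20100305' becomes '2010-03-05 0:0:0.000', which Hive can cast to a timestamp
select cast(regexp_replace('20100305', '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3 0:0:0.000') as timestamp);
-- 2010-03-05 00:00:00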

This works well for most of the data. Here is a simple select statement on our newly loaded table:

select designation, absolute_magnitude, semimajor_axis, inclination, eccentricity 
from asteroids limit 10;

00001  3.34   2.7674094  10.59322  0.0756074
00002  4.13   2.7730852  34.83792  0.2305974
00003  5.33   2.6685312  12.98996  0.2568534
00004  3.2    2.3617776   7.14002  0.0891525
00005  6.85   2.5732853   5.36794  0.191465
00006  5.71   2.4249155  14.7375   0.2028017
00007  5.51   2.3857833   5.52317  0.2314413
00008  6.49   2.2014052   5.88654  0.1570288
00009  6.28   2.3872371   5.57497  0.1221133
00010  5.43   3.1398783   3.83201  0.1126909
Time taken: 0.14 seconds, Fetched: 10 row(s)

But I had some issues with the columns further “to the right”. For example: this query on the raw data (on an object with designation K10B02G) returns the correct data:

SELECT
trim(substr(line, 128, 10)) as observation_data,
cast(trim(substr(line, 138, 5)) as decimal(4,2)) as rms_residual,
trim(substr(line, 143, 3)) as coarse_indicator_pertubers,
trim(substr(line, 147, 3)) as precise_indicator_pertubers,
trim(substr(line, 151, 11)) as computer_name,
trim(substr(line, 162, 4)) as hex_flags,
substr(line, 167, 27) as readable_designation,
cast(regexp_replace(trim(substr(line, 195, 8)), '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3 0:0:0.000') as timestamp) as last_observation
FROM asteroids_raw
WHERE trim(substr(line,1,7)) = 'K10B02G';
46 days  0.35  M-v  3Eh  MPCALB  2803  2010 BG2  2010-03-05 00:00:00

But after inserting this into the asteroids table, I get this data, which is not correct:

46      NULL            0.3     M-v     3Eh     MPCA    2803    NULL

I’ve asked a question about this on the Hortonworks Community. Hopefully someone knows what went wrong there. (My own, still unverified, guess: the asteroids table is a text table with fields terminated by ' ', and values like “46 days” and “2010 BG2” contain spaces themselves, which would shift all the columns after them when the data is read back.) I’ll update this blog post when I have more information.

Anyhow, I now have a table that I can do some querying on. So what I set out to do is mostly done. Now it’s time to try some stuff from Spark on this data.

