Check your /tmp on HDFS

If you have sensitive data on your Hadoop cluster, you might want to check /tmp on HDFS once in a while to see what ends up there. /tmp is used by several components. Hive, for example, stores its "scratch data" there, but fortunately it does so in subdirectories that are only accessible to the user who ran the job; the files in there are not readable by anyone else.
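A quick way to audit this is to walk /tmp and flag anything the "other" permission bits leave readable. A minimal sketch in Python, assuming the hdfs command-line client is available on the machine you run it from (and paths without spaces):

```python
import subprocess

# List everything under /tmp on HDFS; -R recurses into subdirectories.
listing = subprocess.run(
    ["hdfs", "dfs", "-ls", "-R", "/tmp"],
    capture_output=True, text=True, check=True,
).stdout

for line in listing.splitlines():
    fields = line.split()
    if len(fields) < 8:
        continue  # skip anything that isn't a file/directory entry
    perms, owner, path = fields[0], fields[2], fields[-1]
    # The permission string looks like -rwxr-xr-x (plus an optional
    # trailing + when ACLs are set); index 7 is the "other" read bit.
    if perms[7] == "r":
        print(f"world-readable: {perms} {owner} {path}")
```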

But some people think /tmp is a good place to store the intermediate data of their homegrown processes. Even when you clean up afterwards, this is not a good idea when dealing with sensitive data, unless you set the permissions in such a way that the data is not readable by anyone else. But this is often forgotten. And when such a process fails, that data usually stays in /tmp for a long time.
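If you do use /tmp for intermediate data, at least create your own subdirectory and strip group and other access before writing anything to it. A sketch, with /tmp/myjob standing in for whatever path your process uses:

```python
import subprocess

# Hypothetical scratch location for an intermediate dataset.
scratch = "/tmp/myjob"

# Create the directory and lock it down up front, so the data is
# never world-readable, even if the job crashes before its cleanup
# step runs.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", scratch], check=True)
subprocess.run(["hdfs", "dfs", "-chmod", "700", scratch], check=True)
```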

I recently found out that SAS (version 9.4, at least) behaves similarly. When you load data into a Hive table, SAS stores temporary data in a .dlv file in /tmp with permissions -rwxr-xr-x, so it is readable by anyone. And when your SAS client crashes, these files stay there.

Fortunately this can be mitigated with Apache Ranger. SAS has a parameter called HDFS_TEMPDIR that lets you write the temporary data to an alternate location, and you can restrict access to that location in Ranger by allowing only the people who load these tables.
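The Ranger side of that can be set up in the Ranger admin UI, or scripted against Ranger's public REST API. A rough sketch, where the Ranger host, the HDFS service name cm_hdfs, the path /sas_tempdir and the user sasload are all placeholders for your own values:

```python
import requests

# Policy: only the designated loader account may touch the SAS temp
# directory. Field names follow Ranger's public v2 policy model.
policy = {
    "service": "cm_hdfs",
    "name": "SAS temp dir - loaders only",
    "resources": {"path": {"values": ["/sas_tempdir"], "isRecursive": True}},
    "policyItems": [{
        "users": ["sasload"],
        "accesses": [
            {"type": "read", "isAllowed": True},
            {"type": "write", "isAllowed": True},
            {"type": "execute", "isAllowed": True},
        ],
    }],
}

resp = requests.post(
    "https://ranger.example.com:6182/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "password"),  # use real credentials in practice
)
resp.raise_for_status()
```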

Nothing seems so permanent as “temporary” in our work.

