Data engineering isn’t always creating serverless APIs and ingesting terabyte-a-minute streams with doohickeys on Kubernetes. Sometimes people just want their Excel sheet in the data lake. Is that big data? Not even close. It’s very small. But for some people it’s a first step into a data-driven world.
But does Hadoop read Excel? Not to my knowledge. But NiFi, that wonderful open source data flow software, has an Excel processor. It can even help you rework the data a little. But some Excel sheets simply need too much reworking, and that’s too big a job for NiFi. For those I’ve used Python and the pandas library to create a CSV file that Hadoop can handle.
There are so many data-related open source products nowadays. On the one hand that’s great. On the other hand it’s hard for one human to grasp them all. To be sure, there’s great documentation on them all. And there are books and sessions at meetups and conferences that tell you in depth what they are about. But sometimes you just want to get the gist of it, to quickly learn about a lot of products so you can pick the ones you find useful. It’s rare to get that kind of overview of multiple products.
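Reworking a sheet with pandas before handing it to Hadoop might look roughly like the sketch below. The function names and the file names in the commented usage are made up for illustration, and the cleanup steps (dropping empty rows and columns, normalising headers) are only assumptions about what a messy sheet typically needs.

```python
import pandas as pd


def tidy_sheet(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop fully empty rows and columns and normalise the header names."""
    df = raw.dropna(how="all").dropna(axis=1, how="all")
    # " Some Header " -> "some_header", which is easier to use in Hive later on
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    return df.reset_index(drop=True)


def sheet_to_csv(raw: pd.DataFrame) -> str:
    """Return the tidied sheet as CSV text without the pandas index column."""
    return tidy_sheet(raw).to_csv(index=False)


# Hypothetical usage; reading .xlsx files needs an Excel engine such as openpyxl:
# raw = pd.read_excel("report.xlsx", sheet_name=0)
# tidy_sheet(raw).to_csv("report.csv", index=False)
```

The resulting CSV file can then be put on HDFS like any other text file.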
But “Seven Databases in Seven Weeks, Second Edition” by Luc Perkins, Jim Wilson and Eric Redmond does exactly that. It describes a selection of seven different types of databases, their strengths and weaknesses.
The seven databases
The seven databases are PostgreSQL, HBase, MongoDB, CouchDB, Neo4j, DynamoDB and Redis. It’s a good mix of databases. They all have their different uses.
Up to seven years ago my doctor would nag me every six months that I should lose some weight. Nagging didn’t work on me that much. What did work, however, was competition. I wanted to become faster in a bike race (actually, the cycling part of a relay triathlon). When I noticed that I could lose weight myself instead of buying an expensive new bike, I set out to do the former and lost about 10 kilograms in the first year. I won’t go into too many details, because I have probably talked about it too often. I’m a bit too proud of this.
This summer I followed an RStudio course on Udemy and when I finished it, I was thinking of doing a project with RStudio. I have gathered a lot of health data over the last six years: weight, body fat, data from my heart rate monitor, a step count from my iPhone and sleep data. All this was stored in an Excel sheet. It’s not exactly Big Data, but it has a couple of thousand rows now. Surely there’s a way to get some meaning out of it. I’ve already done a video about this, but it isn’t easy to copy commands from videos. And I’ve tried out some other stuff in this series.
If you have sensitive data on your Hadoop cluster, you might want to check /tmp on HDFS once in a while to see what ends up there. /tmp is used by several components. Hive, for example, stores its “scratch data” there. Fortunately it does so in subdirectories with permissions only for the user that ran the job. The files in there are not readable by anyone else.
But some people think /tmp is a good place to store intermediate data of their homegrown processes. Even when you clean up afterwards, this is not a good idea when dealing with sensitive data, unless you set the permissions in such a way that it is not readable by anyone else. But this is often forgotten. And when such a process fails, this data usually stays in /tmp for a long time.
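A periodic check of /tmp could be scripted. The sketch below is only an illustration: it assumes you have captured the text output of `hdfs dfs -ls` yourself, with the usual columns (permissions, replication, owner, group, size, date, time, path), and the sample listing with its user names and paths is made up.

```python
from typing import List


def exposed_paths(ls_output: str) -> List[str]:
    """Return the paths that are readable by the group or by everyone."""
    exposed = []
    for line in ls_output.splitlines():
        fields = line.split(None, 7)
        # Skip the "Found N items" header and anything that is not an entry.
        if len(fields) < 8 or len(fields[0]) < 10:
            continue
        perms, path = fields[0], fields[7]
        # perms looks like "drwxr-x---": index 4 is group read, index 7 is other read.
        if perms[4] == "r" or perms[7] == "r":
            exposed.append(path)
    return exposed


# Made-up sample listing, for illustration only.
SAMPLE = """Found 3 items
drwx------   - alice hadoop          0 2019-01-07 10:12 /tmp/hive/alice
-rw-r--r--   3 bob   hadoop       1234 2019-01-07 10:15 /tmp/export.csv
drwxrwx---   - carol hadoop          0 2019-01-07 10:20 /tmp/carol_job
"""

for path in exposed_paths(SAMPLE):
    print("readable by others:", path)
```

Anything this reports is a candidate for tighter permissions, for example mode 700 on a directory, or for a different location altogether.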
Because tech is moving so fast, I’ve been keeping dossiers in Evernote of open source products I have to learn more about, which I’ve decided to put on my blog. My last one was about Kubernetes. This one is about Elasticsearch, the E in the ELK stack (Elasticsearch, Logstash and Kibana).
A short description – in English
Elasticsearch is a search engine. You can use it to quickly find keywords in a large collection of documents. But the people who came up with it must have realized at some point that the same technology can search much more than text. You can use it to search any data.
Because tech is moving so fast, I’ve been keeping dossiers in Evernote of open source products I have to learn more about. Like Kubernetes. This morning I suddenly thought this would be perfect for a blog, if properly organized. My plan is to add new interesting material as soon as I have it.
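To get a feel for why keyword lookups stay fast, here is a toy inverted index, the kind of data structure that Lucene (the library underneath Elasticsearch) builds on. It is a bare-bones sketch and not how Elasticsearch is implemented: the function names and sample documents are made up, and real engines add analysis, scoring, sharding and much more.

```python
from collections import defaultdict


def build_index(docs):
    """Map every lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(doc_id)
    return index


def search(index, *keywords):
    """Return the ids of the documents that contain all of the keywords."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()


# Made-up documents, for illustration only.
docs = {
    1: "Elasticsearch is a search engine.",
    2: "Kibana draws charts.",
    3: "A search engine finds documents.",
}
index = build_index(docs)
print(search(index, "search", "engine"))
```

A lookup only touches the entries for the requested words instead of scanning every document, which is what keeps it quick on large collections.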
Do you have a cracking good thing to add? Let me know in the comments!
[Update January 7th 2019] Added Performance section and a couple of articles for my reading list.
[Update January 15th 2019] Added CNCF Best Practices for security.
[Update January 21st 2019] Added Kubernetes Failure Stories.
A short description – in English
This wasn’t in my original tech dossier, but I decided it could be helpful. What is Kubernetes? The standard answer is: it’s for container orchestration. What does that mean?
Elasticsearch has been one of the open source products on my list to try out ever since I got rejected for a couple of assignments as a consultant last year. Apparently it’s a popular product. But why do you need a search engine in a Big Data architecture? I explain this in my new video, where I load newly released data from ESA’s Gaia mission into Elasticsearch with Logstash and visualize it with Kibana.
I’ve also created an extra video in which I explain how the code works.
You can get the code I’ve used at my GitHub page.
Back on the ferry to the north of Amsterdam I went, back for day two of Codemotion Amsterdam 2018.
Daniel Gebler from PicNic told us about what they are doing today to bring groceries to people’s homes. I’ve seen two presentations by PicNic before and I could really see their progress from session to session.
Daniel explained how they use a recommender system to make it possible for customers to buy their most common groceries with one tap in the PicNic app. Which is actually hard. Even if your prediction is 90% precise for one item, that means that for a set of 12 items you actually get 12% precision. So they really had to work to get a much better precision per item. They managed to do that by working with two dimensions of data: big and deep data.
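The drop in precision for a whole basket is easy to check yourself. The sketch below assumes that the predictions for the individual items are independent, so the per-item precisions simply multiply; that is my simplification, not necessarily the model PicNic uses, and the function names are made up.

```python
def basket_precision(per_item: float, items: int) -> float:
    """Chance that every item in the basket is predicted correctly."""
    return per_item ** items


def required_per_item(target: float, items: int) -> float:
    """Per-item precision needed to reach a target precision for the basket."""
    return target ** (1 / items)


print(round(basket_precision(0.90, 12), 2))
print(round(required_per_item(0.90, 12), 4))
```

Under this assumption 0.9 ** 12 works out to roughly 0.28, so the 12% quoted above would fit a somewhat lower per-item precision or a longer list. Either way the point stands: to get a whole basket right 90% of the time, each item has to be right more than 99% of the time.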
Last Friday I almost felt I had to explain to a colleague that I don’t always win raffles and lotteries. Because yep, I won another ticket. Again via the Roaring Elephant podcast. It’s pretty worthwhile listening to them, is all I’m saying.
This was a ticket for Codemotion Amsterdam 2018. Codemotion is a conference for developers with topics like blockchain, Big Data, Internet of Things, DevOps and software architectures, but also front-end development, game development and AR/VR.
Amsterdam from the ferry to the north of the city.
Next week (1 May 2018) I will start as a Hadoop specialist/data steward/data custodian/data something something in the Advanced Analytics team at the Port of Rotterdam. We haven’t worked out a fancy data something title yet. I’m already working in this team as a consultant. I’ve been involved with security and data governance of the data lake (for people outside Big Data: a data lake is simply a Hadoop cluster).
The World Port Center