Categorising text with ChatGPT. Results may be messy.

I have a hobby project I’m working on. It’s an astronomy news feed reader. Long story short: I currently gather links to interesting articles about astronomy by hand. And I want to automate this, so that I have more time to actually read the news.

What I want is for each article to be tagged with a couple of keywords, based on its contents, and to be placed in one main category. For example: an article on the discovery of volcanoes on Venus can have tags like “Venus”, “vulcanism”, “Magellan” and the main category “Venus”.

So how to do that in Python? I’ve looked it up and according to Stack Overflow I need to start reading books on data mining and Natural Language Text Processing. Hmm, no. Not for a hobby project. So I was wondering. Can’t ChatGPT do it?

Homer Simpson with his campaign for sanitation commissioner: “Can’t Someone Else Do It?” in episode “Trash of the Titans”

Can’t ChatGPT do it?

Short answer: it can. Here is how I did it. First I learned how to use ChatGPT with Python from Harrison Kinsley’s video on this topic:

YouTube video on how to run simple ChatGPT queries from Python.

As Harrison explains in his video: you are going to need an OpenAI account. And using ChatGPT from Python costs money. Check the pricing here: https://openai.com/pricing. So far I’ve spent a handful of cents on about 50 ChatGPT queries. But if you’re going to experiment with this, don’t forget to set some usage limits, so you won’t get unpleasant surprises.

Creating tags for a text with ChatGPT

The idea is to send ChatGPT a question with the text that needs to be categorised in it. So I create a string “Categorise this text with one to five tags: <text of the article here>”.

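import openai  # this code uses the pre-1.0 interface of the openai package

# The conversation so far: a list of dicts with a role and a content key.
message_history = []

newstext = "As Canada celebrates its first astronaut to go to the moon, it is starting a new project that could eventually enable a Canadian to walk on the lunar surface. <more text here>"

user_input = f"Categorise this text with one to five tags:\n\n {newstext}"

# Add the question to the conversation and send the whole history to the model.
message_history.append({"role": "user", "content": user_input})

completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=message_history
)

# The answer text is in the first choice of the completion.
reply_content = completion.choices[0].message.content
print(reply_content)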

And sure enough, it works! Here are the tags it creates for this article on the Ariane 5 rocket for ESA’s JUICE mission to Jupiter and its moons: https://www.spacedaily.com/reports/Ariane_5_flight_VA260_Juice_fully_integrated_and_ready_for_rollout_999.html:

space mission, Ariane 5, Juice mission, Jupiter exploration, European Spaceport.

Yes, that looks like a good result.

Here is another article and what tags ChatGPT makes for it: https://www.newscientist.com/article/2367734-tonight-is-your-best-chance-to-see-mercury-in-the-night-sky/

Mercury, solar system, astronomical observations, space viewing, celestial events. 

Yup, those are pretty good tags.

And then this happened for this article: https://www.spacewar.com/reports/Thule_Air_Base_Gets_New_Name_999.html

- U.S. Space Force
- Greenland
- Department of Defense
- Pituffik Space Base
- Cultural heritage

Wait, what? Why the dashes all of a sudden? And this time no period at the end.

So ChatGPT 3.5’s results can be as inconsistent as a human’s. Results get better when you tell ChatGPT to deliver the tags separated by commas:

user_input = f"Categorize this text with one to five tags:\n\n {item['summary_detail']}." \
             f" Print the tags separated by commas."

But that’s the lesson here: you have to be very clear and specific about what you want.
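Even with a clear prompt, it can help to clean up the reply before storing it. Below is a minimal sketch of such a clean-up step, not the code used for the results above. The function name parse_tags is made up for illustration, and it assumes the reply uses either commas or one tag per line, possibly with bullet dashes and a trailing period, as in the examples shown earlier.

```python
import re


def parse_tags(reply: str) -> list[str]:
    """Turn a free-form tag reply into a clean list of tags."""
    tags = []
    # Split on commas and newlines, since both layouts showed up in replies.
    for part in re.split(r"[,\n]", reply):
        # Drop bullet characters, surrounding whitespace and a trailing period.
        tag = part.strip().lstrip("-*• ").rstrip(". ").strip()
        if tag:
            tags.append(tag)
    return tags


print(parse_tags("space mission, Ariane 5, Juice mission."))
print(parse_tags("- U.S. Space Force\n- Greenland\n- Pituffik Space Base"))
```

One known weakness of this sketch: a tag that legitimately ends in a period (such as an abbreviation) loses that final period.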

One main category from a list

Now I want ChatGPT to pick one main category from a list I have picked. This is the list:

astro_categories = "Mercury, Venus, Moon, Earth, Mars, " \
                   "Jupiter, Saturn, Uranus, Neptune, " \
                   "Pluto and the Kuiper Belt, Comets, " \
                   "Exoplanets, Formation of the Solar System, " \
                   "Telescopes, Meteorites, " \
                   "Artificial Intelligence, Miscellaneous"

Because you can have a message history in ChatGPT, I can ask follow-up questions, based on the text I gave it earlier.

user_followup = f"Also categorize this text in one of the following categories:\n\n {astro_categories}"
message_history.append({"role": "user", "content": f"{user_followup}"})

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message_history
)

reply_content = completion.choices[0].message.content
print(reply_content)

Let’s see what ChatGPT makes of it.

For the article about the Ariane 5 rocket for the JUICE mission, it picks:

Jupiter.

That’s good. I didn’t ask for the period at the end, but I can work with that.

Now let’s look at the Thule Air Base article. ChatGPT chose this as the main category:

Earth, Miscellaneous.

Wait, what? You are only supposed to pick one!

Or take this article about the Orion spaceship (which BTW seems to be an old article) https://www.spacedaily.com/reports/Orion_stretches_its_wings_ahead_of_first_crewed_Artemis_mission_999.html . What category does ChatGPT pick?

Kennedy Space Center and Orion spacecraft fall under "Miscellaneous".

Just answer the question please.

And sometimes ChatGPT gave me some extensive feedback. Take this article: https://www.spacedaily.com/reports/The_worlds_first_3D_printed_closed_afterburning_cycle_liquid_rocket_engine_successfully_flew_999.html. ChatGPT came up with these tags:

space technology, rocketry, China, Tianbing Technology, TH-11V engine.

But the response on picking the main category was:

The text should be categorized under "space technology" and "rocketry" as these are the most relevant categories. It doesn't fit neatly into any of the specific celestial bodies or topics listed in the second prompt, nor does it relate to artificial intelligence or meteorites.

Sounds reasonable. But you definitely need to be aware of such possible responses before you start relying on ChatGPT’s results.

So then I told ChatGPT to only pick one category. And if you don’t know what to do, pick Miscellaneous.

user_followup = f"Also categorize this text in only one of the following categories:\n\n {astro_categories}. " \
                f"If it doesn't fit in any of these categories, categorize it under Miscellaneous."

Well, results were still mixed. Take for example this article from Space.com: https://www.space.com/best-free-star-trek-tng-and-picard-3d-prints . The tags ChatGPT chose are quite correct:

pop culture, television, 3D printing, Star Trek, Picard

What main category does ChatGPT pick now?

Telescopes

Telescopes? Really? The word “telescope” does not appear once in the article. (Also, this time no period at the end!)
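One possible guard against replies like “Earth, Miscellaneous.” or a full sentence is to accept the answer only when it matches exactly one entry from the list. The sketch below is an assumption about how that could look, not part of the experiment above: it stores the categories as a Python list instead of one string, and the name pick_category is hypothetical.

```python
ASTRO_CATEGORIES = [
    "Mercury", "Venus", "Moon", "Earth", "Mars",
    "Jupiter", "Saturn", "Uranus", "Neptune",
    "Pluto and the Kuiper Belt", "Comets",
    "Exoplanets", "Formation of the Solar System",
    "Telescopes", "Meteorites",
    "Artificial Intelligence", "Miscellaneous",
]


def pick_category(reply: str, fallback: str = "Miscellaneous") -> str:
    """Return the reply only if it is exactly one known category."""
    # Ignore surrounding whitespace, a trailing period and letter case.
    cleaned = reply.strip().rstrip(".").strip().lower()
    for category in ASTRO_CATEGORIES:
        if cleaned == category.lower():
            return category
    # Anything else (two categories, a sentence, an unknown name) falls back.
    return fallback


print(pick_category("Jupiter."))
print(pick_category("Earth, Miscellaneous."))
```

The string for the prompt can still be built with ", ".join(ASTRO_CATEGORIES). Note that this only checks the form of the answer. A valid but wrong category, like “Telescopes” for the Star Trek article, passes the check unnoticed.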

Conclusion

First of all: I’m really happy that ChatGPT can come up with relevant tags. That works quite well and I intend to use it.

Second: if you want to use ChatGPT (version 3.5) in your data pipelines, you had better be prepared for some very rigorous testing. Because it can sometimes throw some weird curveballs that can mess up the data quality just as well as humans can.


A Strava dashboard on a Raspberry Pi (Part 3): The Strava API

This is part 3 of a series of blogposts on how I created a Strava dashboard on an Inky Impression e-ink display with a Raspberry Pi.

OAuth2

This was the part that I expected to be hard: getting my data from Strava. Or, to be more precise: getting the connection right so the Strava API would allow me to get that data. It requires authentication via the OAuth2 protocol. I tried a similar thing a few years back with a Google API and I just didn’t get it. But now I do.

Strava API documentation

It requires a whole “dance” between your computer code and the Strava API where you exchange all kinds of tokens back and forth. Strava’s Getting Started with the Strava API document explains it quite well. And this blogpost by Graziano Fuccio helped me a lot with the Python code: http://www.grace-dev.com/python-apis/strava-api/.

Frustratingly, I still didn’t get it to work. The reason, I found out, is that the URL for authentication has changed: https://www.strava.com/oauth/token became https://www.strava.com/api/v3/oauth/token. I found the correct URL elsewhere in the Strava API documentation. I’ve told Strava that their Getting Started documentation is outdated. They asked me to create a ticket and I’ve done so, but I don’t think they have changed their document yet. Graziano Fuccio did update his post, though.
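To make the token step of that “dance” a bit more concrete, here is a minimal sketch using only the Python standard library. It is not the code from this series or from Graziano Fuccio’s post. The function names are made up, and the form fields (client_id, client_secret, code, grant_type) are assumed to be what the token endpoint expects, so check them against the Strava API documentation before relying on this.

```python
import json
import urllib.parse
import urllib.request

# The token URL mentioned above (the one with /api/v3 in it).
TOKEN_URL = "https://www.strava.com/api/v3/oauth/token"


def build_token_request(client_id: str, client_secret: str, code: str) -> urllib.request.Request:
    """Build the POST request that swaps an authorization code for tokens."""
    form = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "code": code,  # the code Strava hands back after you approve access
        "grant_type": "authorization_code",
    }).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=form, method="POST")


def exchange_code(client_id: str, client_secret: str, code: str) -> dict:
    """Send the request and return the parsed JSON reply."""
    request = build_token_request(client_id, client_secret, code)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Example call (needs real credentials and network access, so it is commented out):
# tokens = exchange_code("12345", "my-client-secret", "code-from-redirect")
```

Building the request separately from sending it makes it possible to inspect the URL and form data without touching the network.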


A Strava dashboard on a Raspberry Pi (Part 2): Installing software

In the last blogpost we set up the Raspberry Pi, attached the Inky Impression display and got the Raspberry Pi ready for remote access.

Time to get the Inky Impression software installed and make the Inky Impression screen display something.

Your SSH connection of choice

For this we’re going to have to run some commands via remote SSH. There are multiple ways to log in remotely. You can use a tool like PuTTY or the terminal on macOS (I like iTerm2). That’s actually simpler.

But I chose to use Visual Studio Code because you can edit Python code remotely via SSH straight on the Raspberry Pi.

To do this you must install Visual Studio Code. Visual Studio Code has all kinds of extensions. Here we will install the Remote – SSH extension. And while you’re at it, maybe install the Python extension as well, because we will be writing some Python later.

Installing the Remote-SSH extension in Visual Studio Code

A Strava dashboard on a Raspberry Pi (Part 1): Setting up the Raspberry Pi

This is the list of hardware I’ve used:

  • An Inky Impression 5.7 inch e-ink display.
  • (The Inky Impression comes with a 40-pin female header included to boost height for full-size Pis and standoffs included to securely attach to your Pi)
  • A Raspberry Pi 3 model B+ (which I had lying around) + power supply
  • A micro SD card with 8 GB storage or more.
  • Initially: keyboard, mouse and monitor (but if you configure the WiFi on the Raspberry Pi and configure it to allow remote SSH, you can connect to it via WiFi from the convenience of your regular computer)

For those who don’t know a Raspberry Pi: this is a very small and quite cheap computer. The Raspberry Pi 3B+ I’ve used for example is about 40 euros. But you can spend even less, because my Strava dashboard doesn’t exactly require a lot of computing power.

So you could instead use a Raspberry Pi Zero 2 W (15-25 euros), which also takes up less space. But I believe this will require soldering to attach the GPIO. And it seems to be out of stock on a lot of sites.


Building a Strava dashboard on a Raspberry Pi with an e-ink display

Let’s face it: my purchase of the Pimoroni Inky Impression 5.7 inch display was a solution looking for a problem. I saw a video about it and I was sold on the idea of having an e-ink display on one of my Raspberry Pis.

The Pimoroni Inky Impression on a Raspberry Pi 3B

While having a 7-colour e-ink display is cool and all, I had to come up with a good plan to utilize one. So it wouldn’t end up in a drawer after a short experiment.

You can use it to display images, but it is 7-colour. So you have to “dither” full colour images to have it display well. Actually comic book style images are displayed much better than the average dithered photo. The resolution is quite low (600×448) and the refresh rate is quite slow (10-20 seconds). But for some applications this is just fine.


Using Stable Diffusion to create images for a presentation

Have you heard about text-to-image models like DALL-E 2, Stable Diffusion and MidJourney? These are AI algorithms that take in text (the “prompt”) that describes what kind of picture you want as input and as output the algorithm creates that picture, based on billions of images.

An example could be “an astronaut on a bicycle on the moon by Van Gogh”. And this would be one of the results:

{“prompt”: {“software”: “imaginairy”, “prompts”: [[1, “an astronaut on a bicycle on the moon in the style of Van Gogh”]], “prompt_strength”: 7.5, “init_image”: “None”, “init_image_strength”: 0.6, “seed”: 938321671, “steps”: 40, “height”: 512, “width”: 512, “upscale”: false, “fix_faces”: false, “sampler_type”: “plms”}}

I got access to DALL-E 2 in July this year. DALL-E 2 is a closed source algorithm made by OpenAI. You can sign up to request access to DALL-E 2. Once you get access you can use it for free for a limited number of runs. After that you have to pay to use it more.


The blog is back

Well, that was scary. Just before I went on holiday I switched providers for my marcel-jan.eu domain. And while I had some time built in before going on vacation, there were problems with the transfer code not working. Because apparently the .eu domain is different from the regular .nl domain.

In the end I managed to get my marcel-jan.eu mail working just the evening before leaving. But I saw no way to migrate the blog while packing my bags. So the blog was down for more than 2 weeks. Did anybody miss it?

After getting back home I had to piece the WordPress blog back together from a .zip backup and a backup of the filesystem. I had never done such a thing before. And the original WordPress blog on my old provider’s site was already gone. So there was no way left to do a better export.

Importing did not go as planned

I started by installing WordPress at my new provider’s site. And I went to PHPMyAdmin, which is the tool to work with the database behind WordPress. I imported the .zip (with a .sql file in it). And.. no blogposts. A further look with PHPMyAdmin in the database showed that there were several xxx_posts tables. The one the WordPress site was looking in was wplx_posts. My imported tables were called wp_posts and 4a2vK12BOL_posts. wp_posts contained old stuff. The 4a2vK12BOL_posts table turned out to have all my posts.

Time to play dirty with SQL

So how do I point WordPress to the right data? It’s good to have some SQL skills. What if.. hear me out.. I read the .sql file I got from the export and pick out the SQL that imports the 4a2vK12BOL_posts table? Then, in a text editor, search for the term “4a2vK12BOL_posts” in that SQL and replace it with “wplx_posts”? And then import that? It’s dirty, I grant you that.

But it turns out, it works. As long as you don’t create any new posts beforehand that use the same ID as the ones you try to import. A quick removal of the Hello World post made sure of that.

And it worked. I got my posts back. Okay, that’s something. I don’t have to type all my writings from 2017 to now again.
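The same search and replace can also be scripted instead of done in a text editor. Here is a minimal sketch of that idea. The function name rename_table and the sample SQL are made up for illustration; only the two table names come from the story above.

```python
def rename_table(sql_text: str, old_name: str, new_name: str) -> tuple[str, int]:
    """Replace every occurrence of old_name and report how many were found."""
    count = sql_text.count(old_name)
    return sql_text.replace(old_name, new_name), count


# A tiny stand-in for the exported .sql file (hypothetical content).
sample_sql = (
    "CREATE TABLE `4a2vK12BOL_posts` (`ID` bigint);\n"
    "INSERT INTO `4a2vK12BOL_posts` VALUES (42);\n"
)

new_sql, replaced = rename_table(sample_sql, "4a2vK12BOL_posts", "wplx_posts")
print(f"Replaced {replaced} occurrences")
print(new_sql)
```

Just like the text editor approach, this is a blunt replacement: it would also change the table name if it happened to appear inside the text of a post. Checking the reported count against what you expect is a cheap sanity check.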

I did something similar for the comments. Make sure you do that before the first comment spam arrives. Because its IDs in the comment table will clash with the ones you try to import.

Now I need some images

I was not really surprised that restoring table contents did nothing for my images. Pretty sure that had to come from the filesystem. Luckily I had made a backup of all that. But where to get the image files and where to put them?

Well, looking over the sql for the posts table, I found references to image files like this one: https://marcel-jan.eu/datablog/wp-content/uploads/2017/11/Heart-Reanimation-65992.gif. So somewhere there should be a path with something like wp-content/uploads in the name and a lot of gifs and jpgs in it. I found that, uploaded the directories to the new site and now I had my images back.

That one time I used TablePress

My article about Lion’s Mane is one of the most popular blogposts for some reason. Lots of people want cognitive enhancement, it seems. (I wish my post about becoming a skeptic was just as popular. Oh well.) In that post was my one use of a TablePress table. How to get that back?

It turns out the data can be found in the options table. But I had some doubts whether importing it would mess other things up and whether TablePress would find it. So I dug in the Internet Archive to find the contents of the table, and used Excel to create a csv file of that table. Imported that in TablePress and hey presto: we got ourselves our table back.

Tags and categories

One thing I noticed was that my categories and tags were gone. The categories were a big mess after 5 years of blogging. Actually it wasn’t a big loss. More like a good moment to rethink them. As for tags: it would be nice to retrieve them somehow.

Fortunately there is documentation on the data model of WordPress’ database. Like this site: https://wp-staging.com/docs/the-wordpress-database-structure/

From this I learned what tables I needed to import to get my tags back. It turns out it’s wplx_term_taxonomy and wplx_term_relationships. In wplx_term_taxonomy there were already 3 IDs taken. ID 2 and 3 were now a wp_theme, where in my old table they were categories.

I decided to remove ID 1, 2 and 3 from my insert statement and import that. If I’m missing 2 categories, that won’t hurt me a lot.

Anything else?

From the wp-staging article I learned I probably won’t be needing much more from the import. Maybe I will be missing some stuff from the options table, because there’s all kinds of stuff that plugins put there. But I’m not going to open that can of worms.

I certainly learned a lot on WordPress and its database.. forcefully. Glad the blog is back on the road at my new provider.

Coverart by DALL-E 2


I started vlogging about data mesh (and other things)

Last June I made a short video while walking in the park next to the DIKW Intelligence office. And I posted it on LinkedIn. To my surprise it did very well. So I thought: why not make more of these short videos on data topics? And why not make them somewhere in nature?

I’m on my bike almost every day this time of year. Surely I could make a short stop and do a little talk? I started to make them in Dutch and then also in English.

Adding the track of my bike ride on a Folium map

Having markers of videos and photos taken during my bike ride is cool and all, but how about having a track of the bike ride itself? All my bike rides are registered on Strava, the cycling and running app. Strava has an API for developers, but it requires connecting via OAuth 2.0 and knowledge of the API. I decided to go an easier route: because I’m a Strava Premium member, I can download the GPX track of any ride, including my own.

These .gpx track files are of the same XML structure as we saw embedded in video files in my last blogpost. I can just open the file and use almost the same Python code to read the locations.
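As an illustration of reading the locations from such a file, here is a minimal sketch with Python’s built-in XML parser. It is not the code from the previous blogpost. The function name and the sample track are made up, and it assumes a GPX 1.1 file, where track points are trkpt elements with lat and lon attributes.

```python
import xml.etree.ElementTree as ET

# GPX 1.1 files put all their elements in this XML namespace.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_track_points(gpx_text: str) -> list[tuple[float, float]]:
    """Return the (latitude, longitude) of every track point in the GPX text."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.iterfind(".//gpx:trkpt", GPX_NS):
        points.append((float(trkpt.get("lat")), float(trkpt.get("lon"))))
    return points


# A tiny made-up track with two points, standing in for a downloaded .gpx file.
SAMPLE_GPX = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="52.0907" lon="5.1214"><ele>5.0</ele></trkpt>
    <trkpt lat="52.0912" lon="5.1230"><ele>5.2</ele></trkpt>
  </trkseg></trk>
</gpx>"""

print(read_track_points(SAMPLE_GPX))
```

A list of (latitude, longitude) pairs like this is the shape that Folium’s PolyLine takes as its locations argument, so the result can be drawn as a line on the map.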
