Categorising text with ChatGPT. Results may be messy.

I have a hobby project I’m working on. It’s an astronomy news feed reader. Long story short: I currently gather links to interesting articles about astronomy by hand. And I want to automate this, so that I have more time to actually read the news.

What I want is for each article to be tagged with a few keywords based on its contents, and also placed in one main category. For example: an article on the discovery of volcanoes on Venus could get tags like “Venus”, “vulcanism” and “Magellan”, with “Venus” as the main category.

So how to do that in Python? I looked it up, and according to Stack Overflow I need to start reading books on data mining and Natural Language Processing. Hmm, no. Not for a hobby project. So I was wondering: can’t ChatGPT do it?

Homer Simpson with his campaign for sanitation commissioner: “Can’t Someone Else Do It?” in episode “Trash of the Titans”

Can’t ChatGPT do it?

Short answer: it can. Here is how I did it. First I learned how to use ChatGPT with Python from Harrison Kinsley’s video on this topic:

YouTube video on how to run simple ChatGPT queries from Python.

As Harrison explains in his video: you are going to need an OpenAI account. And using ChatGPT from Python costs money. Check the pricing here: https://openai.com/pricing. So far I’ve spent a handful of cents for about 50 ChatGPT queries. But if you’re going to experiment with this, don’t forget to set some usage limits, so you won’t get any unpleasant surprises.

Creating tags for a text with ChatGPT

The idea is to send ChatGPT a question with the text that needs to be categorised in it. So I create a string “Categorise this text with one to five tags: <text of the article here>”.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # never hard-code your key
message_history = []

newstext = "As Canada celebrates its first astronaut to go to the moon, it is starting a new project that could eventually enable a Canadian to walk on the lunar surface. <more text here>"

user_input = f"Categorise this text with one to five tags:\n\n {newstext}"

message_history.append({"role": "user", "content": user_input})

completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=message_history
)

reply_content = completion.choices[0].message.content
print(reply_content)

# keep the reply in the message history, so follow-up questions have context
message_history.append({"role": "assistant", "content": reply_content})

And sure enough, it works! Here are the tags it created for this article on the Ariane 5 rocket for ESA’s JUICE mission to Jupiter and its moons: https://www.spacedaily.com/reports/Ariane_5_flight_VA260_Juice_fully_integrated_and_ready_for_rollout_999.html:

space mission, Ariane 5, Juice mission, Jupiter exploration, European Spaceport.

Yes, that looks like a good result.

Here is another article, and the tags ChatGPT came up with for it: https://www.newscientist.com/article/2367734-tonight-is-your-best-chance-to-see-mercury-in-the-night-sky/

Mercury, solar system, astronomical observations, space viewing, celestial events. 

Yup, those are pretty good tags.

And then this happened for this article: https://www.spacewar.com/reports/Thule_Air_Base_Gets_New_Name_999.html

- U.S. Space Force
- Greenland
- Department of Defense
- Pituffik Space Base
- Cultural heritage

Wait, what? Why the dashes all of a sudden? And this time no period at the end.

So ChatGPT 3.5’s results can be just as inconsistent as when humans do this job. Results get better when you explicitly tell ChatGPT to deliver the tags comma separated:

user_input = f"Categorize this text with one to five tags:\n\n {newstext}." \
             f" Print the tags separated by commas."
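Even with that instruction, the reply can still arrive in different shapes: comma-separated on one line, a dash-prefixed list, with or without a trailing period. A small post-processing step can smooth this over. The helper below is my own sketch, not code from the post (the function name is made up):

```python
def parse_tags(reply: str) -> list[str]:
    """Turn a raw ChatGPT tag reply into a clean list of tags.

    Handles the reply formats seen above: a comma-separated line,
    a dash-prefixed bullet list, and a stray trailing period.
    """
    if reply.strip().startswith("-"):
        # dash-bulleted reply: one tag per line
        parts = [line.lstrip("- ") for line in reply.splitlines()]
    else:
        # default: comma-separated tags on one line
        parts = reply.split(",")
    # strip whitespace and a trailing period, drop empty entries
    return [p.strip().rstrip(".").strip() for p in parts if p.strip()]
```

This way the rest of the pipeline always sees a plain Python list, regardless of which format ChatGPT felt like using that day.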

But that’s the lesson here: you have to be very clear and specific about what you want.

One main category from a list

Now I want ChatGPT to pick one main category from a list I have picked. This is the list:

astro_categories = "Mercury, Venus, Moon, Earth, Mars, " \
                   "Jupiter, Saturn, Uranus, Neptune, " \
                   "Pluto and the Kuiper Belt, Comets, " \
                   "Exoplanets, Formation of the Solar System, " \
                   "Telescopes, Meteorites, " \
                   "Artificial Intelligence, Miscellaneous"

Because the whole message history is sent along with each request, I can ask follow-up questions about the text I gave ChatGPT earlier.

user_followup = f"Also categorize this text in one of the following categories:\n\n {astro_categories}"
message_history.append({"role": "user", "content": f"{user_followup}"})

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message_history
)

reply_content = completion.choices[0].message.content
print(reply_content)

Let’s see what ChatGPT makes of it.

For the article about the Ariane 5 rocket for the JUICE mission, it picks:

Jupiter.

That’s good. I didn’t ask for the period at the end, but I can work with that.

Now let’s look at the Thule Air Base article. ChatGPT chose this as the main category:

Earth, Miscellaneous.

Wait, what? You are only supposed to pick one!

Or take this article about the Orion spacecraft (which, by the way, seems to be an old article): https://www.spacedaily.com/reports/Orion_stretches_its_wings_ahead_of_first_crewed_Artemis_mission_999.html. What category does ChatGPT pick?

Kennedy Space Center and Orion spacecraft fall under "Miscellaneous".

Just answer the question please.

And sometimes ChatGPT gave me some extensive feedback. Take this article: https://www.spacedaily.com/reports/The_worlds_first_3D_printed_closed_afterburning_cycle_liquid_rocket_engine_successfully_flew_999.html. ChatGPT came up with these tags:

space technology, rocketry, China, Tianbing Technology, TH-11V engine.

But the response on picking the main category was:

The text should be categorized under "space technology" and "rocketry" as these are the most relevant categories. It doesn't fit neatly into any of the specific celestial bodies or topics listed in the second prompt, nor does it relate to artificial intelligence or meteorites.

Sounds reasonable. But you definitely need to be aware of such possible responses before you start relying on ChatGPT’s results.

So then I told ChatGPT to pick only one category, and to pick Miscellaneous if it doesn’t know what to do.

user_followup = f"Also categorize this text in only one of the following categories:\n\n {astro_categories}." \
                f" If it doesn't fit in any of these categories, categorize it under Miscellaneous."
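Instead of only trusting the model to obey, you can also enforce the rule in code: accept the answer only if it is exactly one category from the list, and fall back to Miscellaneous yourself otherwise. This is my own sketch of that idea, not code from the post (the helper name is made up):

```python
# the same categories as in astro_categories, as a Python list
ASTRO_CATEGORIES = [
    "Mercury", "Venus", "Moon", "Earth", "Mars",
    "Jupiter", "Saturn", "Uranus", "Neptune",
    "Pluto and the Kuiper Belt", "Comets",
    "Exoplanets", "Formation of the Solar System",
    "Telescopes", "Meteorites",
    "Artificial Intelligence", "Miscellaneous",
]

def pick_category(reply: str) -> str:
    """Return the single category ChatGPT picked, or Miscellaneous.

    Strips the stray trailing period and rejects answers that are not
    exactly one entry from the list, such as "Earth, Miscellaneous"
    or a whole explanatory sentence.
    """
    answer = reply.strip().rstrip(".")
    for category in ASTRO_CATEGORIES:
        if answer.lower() == category.lower():
            return category
    return "Miscellaneous"
```

That would have turned both the “Earth, Miscellaneous” answer and the chatty Orion response into a clean “Miscellaneous” — though it can’t save the “Telescopes” pick below, which is a valid category that simply happens to be wrong.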

Well, results were still mixed. Take for example this article from Space.com: https://www.space.com/best-free-star-trek-tng-and-picard-3d-prints. The tags ChatGPT chose are quite correct:

pop culture, television, 3D printing, Star Trek, Picard

What main category does ChatGPT pick now?

Telescopes

Telescopes? Really? The word “telescope” does not appear once in the article. (Also, this time no period at the end!)

Conclusion

First of all: I’m really happy that ChatGPT can come up with relevant tags. That works quite well and I intend to use it.

Second: if you want to use ChatGPT (version 3.5) in your data pipelines, you’d better be prepared for some very rigorous testing, because it can sometimes throw weird curveballs that mess up your data quality just as thoroughly as humans can.
