{"id":1685,"date":"2023-04-12T19:55:27","date_gmt":"2023-04-12T19:55:27","guid":{"rendered":"https:\/\/marcel-jan.eu\/datablog\/?p=1685"},"modified":"2023-04-12T19:55:28","modified_gmt":"2023-04-12T19:55:28","slug":"categorising-text-with-chatgpt-results-may-be-messy","status":"publish","type":"post","link":"https:\/\/marcel-jan.eu\/datablog\/2023\/04\/12\/categorising-text-with-chatgpt-results-may-be-messy\/","title":{"rendered":"Categorising text with ChatGPT. Results may be messy."},"content":{"rendered":"\n<p>I have a hobby project I&#8217;m working on. It&#8217;s an astronomy news feed reader. Long story short: I currently gather links to interesting articles about astronomy by hand. And I want to automate this, so that I have more time to actually read the news.<\/p>\n\n\n\n<p>What I want is that an article, based on its contents, will be tagged with a couple of keywords. And also that it is placed in one main category. For example: an article on the discovery of volcanos on Venus can have tags like &#8220;Venus&#8221;, &#8220;vulcanism&#8221;, &#8220;Magellan&#8221; and the main category is &#8220;Venus&#8221;. <\/p>\n\n\n\n<p>So how to do that in Python? I&#8217;ve looked it up and <a rel=\"noreferrer noopener\" href=\"https:\/\/stackoverflow.com\/questions\/65487\/how-do-you-categorize-based-on-text-content\" target=\"_blank\">according to Stack Overflow<\/a> I need to start reading books on data mining and Natural Language Text Processing. Hmm, no. Not for a hobby project. So I was wondering. 
Can&#8217;t ChatGPT do it?<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"510\" height=\"384\" src=\"https:\/\/marcel-jan.eu\/datablog\/wp-content\/uploads\/2023\/04\/cant-someone-else-do-it.jpg\" alt=\"\" class=\"wp-image-1686\" srcset=\"https:\/\/marcel-jan.eu\/datablog\/wp-content\/uploads\/2023\/04\/cant-someone-else-do-it.jpg 510w, https:\/\/marcel-jan.eu\/datablog\/wp-content\/uploads\/2023\/04\/cant-someone-else-do-it-300x226.jpg 300w\" sizes=\"auto, (max-width: 510px) 100vw, 510px\" \/><figcaption class=\"wp-element-caption\">Homer Simpson with his campaign for sanitation commissioner: &#8220;Can&#8217;t Someone Else Do It?&#8221; in episode &#8220;Trash of the Titans&#8221;<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Can&#8217;t ChatGPT do it?<\/h2>\n\n\n\n<p>Short answer: it can. Here is how I did it. First I learned how to use ChatGPT with Python from Harrison Kinsley&#8217;s video on this topic:<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"ChatGPT API in Python\" width=\"750\" height=\"422\" src=\"https:\/\/www.youtube.com\/embed\/c-g6epk3fFE?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><figcaption class=\"wp-element-caption\">YouTube video on how to run simple ChatGPT queries from Python.<\/figcaption><\/figure>\n\n\n\n<p>As Harrison explains in his video: you are going to need an OpenAI account. And using ChatGPT from Python costs money. Check the pricing here: <a rel=\"noreferrer noopener\" href=\"https:\/\/openai.com\/pricing\" target=\"_blank\">https:\/\/openai.com\/pricing<\/a>. 
So far I&#8217;ve spent only a few cents on about 50 ChatGPT queries. But if you&#8217;re going to experiment with this, don&#8217;t forget to set some <a rel=\"noreferrer noopener\" href=\"https:\/\/platform.openai.com\/account\/billing\/limits\" target=\"_blank\">usage limits<\/a>, so you won&#8217;t get unpleasant surprises.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Creating tags for a text with ChatGPT<\/h2>\n\n\n\n<p>The idea is to send ChatGPT a question with the text that needs to be categorised in it. So I create a string &#8220;Categorise this text with one to five tags: &lt;text of the article here>&#8221;.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import openai\n\n# The conversation so far, one dict per message\nmessage_history = &#91;]\n\nnewstext = \"As Canada celebrates its first astronaut to go to the moon, it is starting a new project that could eventually enable a Canadian to walk on the lunar surface. &lt;more text here>\"\n\nuser_input = f\"Categorise this text with one to five tags:\\n\\n {newstext}\"\n\nmessage_history.append({\"role\": \"user\", \"content\": user_input})\n\n# Send the whole conversation to the model\ncompletion = openai.ChatCompletion.create(\n  model=\"gpt-3.5-turbo\",\n  messages=message_history\n)\n\n# The reply text is in the first choice\nreply_content = completion.choices&#91;0].message.content\nprint(reply_content)<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p>And sure enough, it works! 
Here are the tags it creates for this article on the Ariane 5 rocket for ESA&#8217;s JUICE mission to Jupiter and its moons: <a rel=\"noreferrer noopener\" href=\"https:\/\/www.spacedaily.com\/reports\/Ariane_5_flight_VA260_Juice_fully_integrated_and_ready_for_rollout_999.html\" target=\"_blank\">https:\/\/www.spacedaily.com\/reports\/Ariane_5_flight_VA260_Juice_fully_integrated_and_ready_for_rollout_999.html<\/a>: <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>space mission, Ariane 5, Juice mission, Jupiter exploration, European Spaceport.<\/code><\/pre>\n\n\n\n<p>Yes, that looks like a good result.<\/p>\n\n\n\n<p>Here is another article and what tags ChatGPT makes for it: <a href=\"https:\/\/www.newscientist.com\/article\/2367734-tonight-is-your-best-chance-to-see-mercury-in-the-night-sky\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.newscientist.com\/article\/2367734-tonight-is-your-best-chance-to-see-mercury-in-the-night-sky\/<\/a><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Mercury, solar system, astronomical observations, space viewing, celestial events. <\/code><\/pre>\n\n\n\n<p>Yup, those are pretty good tags.<\/p>\n\n\n\n<p>And then this happened for this article: <a href=\"https:\/\/www.spacewar.com\/reports\/Thule_Air_Base_Gets_New_Name_999.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.spacewar.com\/reports\/Thule_Air_Base_Gets_New_Name_999.html<\/a><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>- U.S. Space Force\n- Greenland\n- Department of Defense\n- Pituffik Space Base\n- Cultural heritage<\/code><\/pre>\n\n\n\n<p>Wait, what? Why the dashes all of a sudden? And this time no period at the end.<\/p>\n\n\n\n<p>So ChatGPT 3.5&#8217;s results can be as inconsistent as when humans do this kind of work. 
Results get better when you tell ChatGPT to deliver the tags as a comma-separated list:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>user_input = f\"Categorize this text with one to five tags:\\n\\n {item&#91;'summary_detail']}.\" \\\n             f\" Print the tags separated by commas.\"<\/code><\/pre>\n\n\n\n<p>But that&#8217;s the lesson here: you have to be very clear and specific about what you want.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">One main category from a list<\/h2>\n\n\n\n<p>Now I want ChatGPT to pick one main category from a list I have put together. This is the list:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>astro_categories = \"Mercury, Venus, Moon, Earth, Mars, \" \\\n                   \"Jupiter, Saturn, Uranus, Neptune, \" \\\n                   \"Pluto and the Kuiper Belt, Comets, \" \\\n                   \"Exoplanets, Formation of the Solar System, \" \\\n                   \"Telescopes, Meteorites, \" \\\n                   \"Artificial Intelligence, Miscellaneous\"<\/code><\/pre>\n\n\n\n<p>Because you can have a message history in ChatGPT, I can ask follow-up questions, based on the text I gave it earlier.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>user_followup = f\"Also categorize this text in one of the following categories:\\n\\n {astro_categories}\"\nmessage_history.append({\"role\": \"user\", \"content\": f\"{user_followup}\"})\n\ncompletion = openai.ChatCompletion.create(\n    model=\"gpt-3.5-turbo\",\n    messages=message_history\n)\n\nreply_content = completion.choices&#91;0].message.content\nprint(reply_content)<\/code><\/pre>\n\n\n\n<p>Let&#8217;s see what ChatGPT makes of it.<\/p>\n\n\n\n<p>For the article about the Ariane 5 rocket for the JUICE mission, it picks:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Jupiter.<\/code><\/pre>\n\n\n\n<p>That&#8217;s good. I didn&#8217;t ask for the period at the end, but I can work with that.<\/p>\n\n\n\n<p>Now let&#8217;s look at the Thule Air Base article. 
ChatGPT chose this as the main category:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Earth, Miscellaneous.<\/code><\/pre>\n\n\n\n<p>Wait, what? You are only supposed to pick one!<\/p>\n\n\n\n<p>Or take this article about the Orion spaceship (which BTW seems to be an old article) <a rel=\"noreferrer noopener\" href=\"https:\/\/www.spacedaily.com\/reports\/Orion_stretches_its_wings_ahead_of_first_crewed_Artemis_mission_999.html\" target=\"_blank\">https:\/\/www.spacedaily.com\/reports\/Orion_stretches_its_wings_ahead_of_first_crewed_Artemis_mission_999.html<\/a> . What category does ChatGPT pick?<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Kennedy Space Center and Orion spacecraft fall under \"Miscellaneous\".<\/code><\/pre>\n\n\n\n<p>Just answer the question please.<\/p>\n\n\n\n<p>And sometimes ChatGPT gave me some extensive feedback. Take this article: <a rel=\"noreferrer noopener\" href=\"https:\/\/www.spacedaily.com\/reports\/The_worlds_first_3D_printed_closed_afterburning_cycle_liquid_rocket_engine_successfully_flew_999.html\" target=\"_blank\">https:\/\/www.spacedaily.com\/reports\/The_worlds_first_3D_printed_closed_afterburning_cycle_liquid_rocket_engine_successfully_flew_999.html<\/a>. ChatGPT came up with these tags:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>space technology, rocketry, China, Tianbing Technology, TH-11V engine.<\/code><\/pre>\n\n\n\n<p>But the response on picking the main category was:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>The text should be categorized under \"space technology\" and \"rocketry\" as these are the most relevant categories. It doesn't fit neatly into any of the specific celestial bodies or topics listed in the second prompt, nor does it relate to artificial intelligence or meteorites.<\/code><\/pre>\n\n\n\n<p>Sounds reasonable. 
But you definitely need to be aware of such possible responses before you start relying on ChatGPT&#8217;s results.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>So then I told ChatGPT to pick only one category, and to pick Miscellaneous if nothing fits.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>user_followup = f\"Also categorize this text in only one of the following categories:\\n\\n {astro_categories}.\" \\\n               f\" If it doesn't fit in any of these categories, categorize it under Miscellaneous.\"<\/code><\/pre>\n\n\n\n<p>Well, results were still mixed. Take for example this article from Space.com: <a rel=\"noreferrer noopener\" href=\"https:\/\/www.space.com\/best-free-star-trek-tng-and-picard-3d-prints\" target=\"_blank\">https:\/\/www.space.com\/best-free-star-trek-tng-and-picard-3d-prints<\/a> . The tags ChatGPT chose are quite correct:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pop culture, television, 3D printing, Star Trek, Picard<\/code><\/pre>\n\n\n\n<p>What main category does ChatGPT pick now?<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Telescopes<\/code><\/pre>\n\n\n\n<p>Telescopes? Really? The word &#8220;telescope&#8221; does not appear once in the article. (Also, this time no period at the end!)<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>First of all: I&#8217;m really happy that ChatGPT can come up with relevant tags. That works quite well and I intend to use it.<\/p>\n\n\n\n<p>Second: if you want to use ChatGPT (version 3.5) in your data pipelines, you&#8217;d better be prepared for some very rigorous testing. Because it can sometimes throw weird curveballs that mess up data quality just as badly as humans can.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have a hobby project I&#8217;m working on. It&#8217;s an astronomy news feed reader. Long story short: I currently gather links to interesting articles about astronomy by hand. 
And I want to automate this, so that I have more time to actually read the news. What I want is that [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1695,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[75],"tags":[358,76,360,359],"class_list":["post-1685","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python","tag-chatgpt","tag-python","tag-tagging","tag-text-categorizing"],"_links":{"self":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/1685","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/comments?post=1685"}],"version-history":[{"count":7,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/1685\/revisions"}],"predecessor-version":[{"id":1694,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/posts\/1685\/revisions\/1694"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/media\/1695"}],"wp:attachment":[{"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/media?parent=1685"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/categories?post=1685"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/marcel-jan.eu\/datablog\/wp-json\/wp\/v2\/tags?post=1685"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}