For one month beginning on October 5, I ran an experiment: Every day, I asked ChatGPT 5 (more precisely, its “Extended Thinking” version) to find an error in “Today’s featured article”. In 28 of these 31 featured articles (90%), ChatGPT identified what I considered a valid error, often several. I have so far corrected 35 such errors.
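
The loop described here is straightforward to automate. Below is a minimal sketch, not the author's actual workflow: it pulls "Today's featured article" from the Wikimedia feed API, fetches a plain-text extract of the page, and asks a model to flag possible errors for a human to verify. The model name and prompt wording are assumptions.

```python
# Hypothetical sketch of the daily check described above (not the author's
# actual workflow). Fetches "Today's featured article", then asks a model to
# point out possible errors for a human to verify.
import datetime

import requests
from openai import OpenAI


def todays_featured_article() -> tuple[str, str]:
    """Return (title, plain-text extract) for today's featured article."""
    today = datetime.date.today()
    feed = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/feed/featured/{today:%Y/%m/%d}",
        timeout=30,
    ).json()
    title = feed["tfa"]["title"]
    # Plain-text extract of the article via the MediaWiki Action API.
    pages = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "extracts", "explaintext": 1,
                "titles": title, "format": "json"},
        timeout=30,
    ).json()["query"]["pages"]
    return title, next(iter(pages.values()))["extract"]


def find_possible_errors(title: str, text: str) -> str:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    resp = client.chat.completions.create(
        model="gpt-5",  # assumed model name; use whatever model you have access to
        messages=[{
            "role": "user",
            "content": (
                f"Find likely factual errors in the Wikipedia article "
                f"'{title}'. Quote each suspect sentence, explain the problem, "
                f"and say what evidence would settle it.\n\n{text[:50000]}"
            ),
        }],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    title, text = todays_featured_article()
    print(f"== {title} ==")
    print(find_possible_errors(title, text))
```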

  • w3dd1e@lemmy.zip (+6) · 1 day ago

    This headline is a bit misleading. The article also says that only 2/3 of the errors GPT found were verified errors (according to the author).

    • Overall, ChatGPT identified 56 supposed errors in these 31 featured articles.
    • I confirmed 38 of these (i.e. 68%) as valid errors in my assessment: I implemented corrections for 35 of them and agreed with 3 additional ones without yet implementing a correction myself. I disagreed with 13 of the alleged errors (23%).
    • I rated 4 as **Inconclusive** (7%), and one as **Not Applicable** (in the sense that ChatGPT’s observation appeared factually correct but would only have implied an error if that part of the article was intended in a particular way, a possibility that the ChatGPT response had acknowledged explicitly).
  • selokichtli@lemmy.ml (+1) · 21 hours ago

    Just wanted to point out the insane disparity between the cost of running Wikipedia and that of ChatGPT. The question here is not whether LLMs are useful for some things, but whether they’re worth it for most things.

  • Echo Dot@feddit.uk (+8/-1) · 2 days ago

    The problem is that a lot of this is almost impossible to actually verify. After all, if an article says a skyscraper has 70 stories, even people working in the building may not necessarily be able to verify that.

    I have worked in a building where the elevator only went to every other floor, and I must have been in that building for at least 3 months before I noticed, because the ground floor obviously had access and the floor I worked on just happened to have elevator access, so it never occurred to me that there might be other floors not listed.

    For something the size of a 63-story (or whatever it actually was) building, it’s not really visually apparent from the outside either; you’d really have to put in the effort to count the windows. Plus, oftentimes the facade looks like it has more stories, so even counting the windows doesn’t necessarily give you an accurate answer, not that anyone would have the inclination to do so anyway. So yeah, I’m not surprised that errors like that exist.

    More to the point, the bigger issue is whether the AI can actually prove that it is correct. In the article there was contradictory information in official sources, so how does the AI know which one was right? Could somebody be employed to go and check? Presumably even the building management don’t know the article is incorrect, otherwise they would have been inclined to fix it.

  • crypt0cler1c@infosec.pub (+23) · 3 days ago

    This is way overblown. Wikipedia is on par with the most accurate encyclopedias, at 3-4 factual errors per article.

    • Ace@feddit.uk (+34/-4) · 3 days ago

      If you read the post, it’s actually quite a good method: having an LLM flag potential errors and then reviewing them manually as a human is quite productive.

      I’ve done exactly that on a project that relies on user-submitted content; moderating submissions at even a moderate scale is hard, but having an LLM look through them for me is easy. I can then check through anything it flags and manually moderate. Neither the accuracy nor the precision is perfect, but both are high enough to be useful, so it’s a low-effort way to find a decent number of the things you’re looking for. In my case I was looking for abusive submissions from untrusted users; in the OP author’s case they were looking for errors. I’m quite sure this method would never find all errors, and as per the article the “errors” it flags aren’t always correct either. But the reward-to-effort ratio is high on a task that would otherwise be unfeasible.
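
For concreteness, here is a minimal sketch of that kind of flag-then-review loop. It is not the commenter’s actual setup: the model name, prompt wording, and helper names are assumptions, and the OpenAI-style client is used purely for illustration. Nothing is removed automatically; flagged items are only queued for a human.

```python
# Hypothetical sketch of LLM-assisted triage for user submissions.
# Nothing is rejected automatically: items the model flags go to a
# human review queue, and false positives are simply ignored there.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def looks_worth_reviewing(submission: str) -> bool:
    """Ask the model for a YES/NO flag; treat anything that isn't a
    clear NO as a flag, so a human ends up looking at it."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; any cheap model works
        messages=[{
            "role": "user",
            "content": (
                "Answer only YES or NO. Is the following user submission "
                "abusive, spam, or otherwise worth a moderator's attention?\n\n"
                + submission
            ),
        }],
    )
    answer = (resp.choices[0].message.content or "").strip().upper()
    return not answer.startswith("NO")


def triage(submissions: list[str]) -> list[str]:
    """Return only the submissions a human should look at."""
    return [s for s in submissions if looks_worth_reviewing(s)]


# Usage: review_queue = triage(new_submissions); moderators then work
# through review_queue by hand and ignore any false positives.
```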

      • Echo Dot@feddit.uk (+2/-1) · 2 days ago

        But we don’t know what the false positive rate is either. How many submissions were blocked that shouldn’t have been? It seems like you don’t have a way to even find that metric out unless somebody complains about it.

        • Ace@feddit.uk (+3) · 2 days ago

          “I can then check through anything it flags and manually moderate.”

          It isn’t doing anything automatically; it isn’t moderating for me. It’s just flagging submissions for human review: “Hey, maybe have a look at this one.” So if it falsely flags something it shouldn’t, which is common, I simply ignore it. And as I said, that error rate is moderate; although I haven’t measured it exactly, it’s still successful enough to be quite useful.

    • acosmichippo@lemmy.world (+14) · 3 days ago

      “90% errors” isn’t accurate. It’s not that 90% of all facts in Wikipedia are wrong; 90% of the featured articles contained at least one error, so the articles were still mostly correct.

      • pulsewidth@lemmy.world (+1) · 1 day ago

        And the featured articles are usually quite large. As an example, today’s featured article is on a type of crab - the article is over 3,700 words with 129 references and 30-something books in the bibliography.

        It’s not particularly unreasonable or surprising to be able to find a single error amongst articles that complex.

  • kalkulat@lemmy.world (+8/-1) · 2 days ago

    Finding inconsistencies is not so hard. Pointing them out might be a *little* useful. But resolving them based on trustworthy sources can be a *lot* harder. Most science papers require privileged access. Many news stories may have been grounded in old, mistaken histories, if not in outright guesses, distortions or even lies. (The older the history, the worse.)

    And since LLMs are usually incapable of citing sources for their own (often batshit) claims anyway, where will ‘the right answers’ come from? I’ve seen LLMs, when questioned again, apologize that their previous answers were wrong.

      • kalkulat@lemmy.world (+1) · edited 4 hours ago

        To quote ChatGPT:

        “Large Language Models (LLMs) like ChatGPT cannot accurately cite sources because they do not have access to the internet and often generate fabricated references. This limitation is common across many LLMs, making them unreliable for tasks that require precise source citation.”

        It also mentions Claude. Without a cite, of course.

        Reliable information must be provided by a source with a reputation for accuracy, i.e. one that is trustworthy; otherwise it’s little more than a rumor. Of course, to reveal a source is to reveal having read that source, which might leave the provider open to a copyright lawsuit.

      • jacksilver@lemmy.world (+5) · 2 days ago

        All of them. If you’re seeing sources cited, it means it’s RAG (an LLM with extra bits). The extra bits make a big difference, as they mean the response is limited to a select few points of reference instead of drawing on all known knowledge about the subject.
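
As a rough illustration of that point, here is a toy retrieval-augmented generation (RAG) loop. The corpus, the retrieval scoring, the prompt, and the model name are all made-up placeholders; a real system would use an embedding index. The only reason the answer can cite sources is that the retrieved passages, with their ids, are pasted into the prompt.

```python
# Toy retrieval-augmented generation (RAG) sketch. The corpus, the word-overlap
# "retriever", the prompt, and the model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CORPUS = {  # source_id -> passage (stand-in for a real document store)
    "doc1": "The tower was completed in 1998 and has 67 floors.",
    "doc2": "City records list the building at 305 metres tall.",
}


def retrieve(question: str, k: int = 2) -> dict[str, str]:
    """Rank passages by crude word overlap with the question."""
    q = set(question.lower().split())
    ranked = sorted(CORPUS.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return dict(ranked[:k])


def answer_with_citations(question: str) -> str:
    passages = retrieve(question)
    context = "\n".join(f"[{sid}] {text}" for sid, text in passages.items())
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the passages below and cite their ids in "
                "brackets. If they don't contain the answer, say so.\n\n"
                f"{context}\n\nQuestion: {question}"
            ),
        }],
    )
    return resp.choices[0].message.content


# e.g. answer_with_citations("How many floors does the tower have?")
# might return something like: "The tower has 67 floors [doc1]."
```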

  • helpImTrappedOnline@lemmy.world (+10) · 3 days ago

    The first edit was undoing vandalism that had persisted for 5 years: someone had changed the number of floors a building had from 67 to 70.

    A friendly reminder to use Wikipedia only as a summary/reference aggregator when doing serious research.

    This is a cool tool for checking these sorts of things: run everything through the LLM to flag errors and go after them like a whack-a-mole game instead of a hidden-object game.

    • x00z@lemmy.world (+1) · 3 days ago

      The tool doesn’t just check the text for errors it would know of. It can also check sources, compare articles, and find inconsistencies within the article itself.

      There’s a list of the problems it found that often explains where it got the correct information from.
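
For a sense of what that kind of cross-checking can look like, a rough sketch is below. It is a guess at the general approach, not the author’s actual tooling: the model is handed the article text alongside an excerpt from a cited source and asked to list contradictions and internal inconsistencies, with the prompt wording and model name being assumptions.

```python
# Guess at the general shape of such a cross-check (not the author's actual
# tooling). The prompt wording and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def cross_check(article_text: str, source_excerpt: str) -> str:
    """Ask a model to compare an article against one of its cited sources."""
    resp = client.chat.completions.create(
        model="gpt-5",  # assumed model name
        messages=[{
            "role": "user",
            "content": (
                "Compare the Wikipedia article text with the source excerpt. "
                "List any article statements the source contradicts or does "
                "not support, plus any internal inconsistencies within the "
                "article itself. Quote the exact sentences.\n\n"
                f"ARTICLE:\n{article_text}\n\nSOURCE:\n{source_excerpt}"
            ),
        }],
    )
    return resp.choices[0].message.content
```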

  • GeneralEmergency@lemmy.world (+2/-24) · 3 days ago

    No surprise.

    Wikipedia ain’t the bastion of facts that lemmites make it out to be.

    It’s a mess of personal fiefdoms run by people with way too much time on their hands and an ego to match.

    • pulsewidth@lemmy.world (+1) · 1 day ago

      Disagree; Wikipedia is a pretty reliable bastion of facts due to its editorial demands for citations, rigorous style guides, etc.

      Can you point out any of these personal fiefdoms so we can see what you’re referring to?