Around the same time, Cloudflare’s chief technology officer Dane Knecht explained in an apologetic X post that a latent bug was responsible.

“In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack,” Knecht wrote, referring to a bug that had gone undetected in testing and had never previously caused a failure.

    • MagicShel@lemmy.zip · ↑65 ↓1 · 4 months ago

      Shitty code has been around far longer than AI. I should know, I wrote plenty of it.

        • FauxLiving@lemmy.world · ↑2 · 4 months ago

          It’s always depressing when you ask the AI to explain your code and then you get banned from OpenAI

          • 123@programming.dev · ↑2 · 4 months ago

            Who didn’t get hit by the fork bomb the professor explicitly asked you to watch out for? Back then, with Windows systems required to use the campus resources, it took an admin with Linux access to eliminate it.

            It was kind of fun walking into the tech support area and having them ask for your login name, already knowing what the issue was without any context. It must have been a common occurrence that week of the course.

            • FauxLiving@lemmy.world · ↑2 · 4 months ago

              It was kind of fun walking into the tech support area and having them ask for your login name, already knowing what the issue was without any context.

              I see this zip bomb was owned by user icpenis, someone track that guy down.

    • AbidanYre@lemmy.world · ↑18 ↓2 · 4 months ago

      Humans are plenty capable of writing crappy code without needing to blame AI.

    • renegadespork@lemmy.jelliefrontier.net · ↑9 · 4 months ago

      Indirectly, it was. He said the outage was caused by a bug in their recent tool that lets sites block AI crawlers. It’s a relatively new tool, released only in the last few months, so it makes sense that it might still be buggy, given how urgent the rush to stop AI crawlers from DoS-ing sites has been.

  • FauxLiving@lemmy.world · ↑26 ↓1 · 4 months ago

    If you want a technical breakdown that isn’t “lol AI bad”:

    https://blog.cloudflare.com/18-november-2025-outage/

    Basically, a permission change caused an automated query to return more data than was planned for. The query produced a configuration file with a large number of duplicate entries, which was pushed to production. The file’s size exceeded the preallocated memory limit of a downstream system, which hit an unhandled error state and died: a thread panic, leading to the 5xx errors.
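
The failure chain above can be sketched in miniature. This is an illustrative toy, not Cloudflare’s code (their proxy is written in Rust, and the limit, names, and numbers here are all invented): a query that suddenly returns duplicate rows inflates a generated config past a consumer’s fixed budget, and the loader treats that as a fatal error.

```python
# Toy reconstruction of the failure chain described above.
# All names and limits are invented; the real system is Rust, not Python.

FEATURE_LIMIT = 200  # hypothetical preallocated budget in the consumer

def build_feature_file(rows):
    """Generate the config from a database query result.

    After the permission change, the query sees every row twice, and
    nothing deduplicates or caps the output before it ships.
    """
    return [name for name, _meta in rows]

def load_features(feature_file):
    """Downstream consumer with a fixed, preallocated feature budget."""
    if len(feature_file) > FEATURE_LIMIT:
        # In the incident this error state was unhandled, so the
        # process panicked and callers saw 5xx responses.
        raise OverflowError(
            f"{len(feature_file)} features exceed the {FEATURE_LIMIT} budget"
        )
    return {name: i for i, name in enumerate(feature_file)}

rows = [(f"feature_{i}", None) for i in range(150)]
print(len(load_features(build_feature_file(rows))))   # 150: fits the budget

try:
    load_features(build_feature_file(rows * 2))       # duplicated query rows
except OverflowError as e:
    print("panic:", e)
```

The point the thread keeps circling back to: the fatal step isn’t the oversized file itself, it’s the loader treating “too big” as unrecoverable.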

    It seems that CrowdStrike isn’t the only member of the ‘a bad config file nearly kills the Internet’ club.

    • phutatorius@lemmy.zip · ↑2 · 4 months ago

      ‘A bad config file nearly kills the Internet’ club

      There’s no such thing as bad data, only shitty code that creates or ingests it, and bad testing that failed to detect the shitty code. The config file overflowed a magic size limit, that threw an exception, and there was no handler for it? Jeez Louise.

      And as for unhandled exceptions, you’d think static analysis would have detected that.
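
One concrete shape for the “bad ingestion, bad testing” complaint is a pre-publish gate: validate the generated file before it ever reaches production. A minimal sketch, with thresholds and names that are entirely made up, not taken from Cloudflare’s system:

```python
# Hypothetical pre-publish sanity check for a generated config file.
# Limits are illustrative only.

MAX_ENTRIES = 200        # downstream preallocated budget
MAX_GROWTH_RATIO = 1.5   # reject files that balloon vs. the last good one

def validate_config(entries, last_good_count):
    """Return a list of reasons to refuse deployment (empty list = OK)."""
    errors = []
    if len(entries) != len(set(entries)):
        errors.append("duplicate entries")
    if len(entries) > MAX_ENTRIES:
        errors.append(f"{len(entries)} entries exceed the {MAX_ENTRIES} budget")
    if last_good_count and len(entries) > last_good_count * MAX_GROWTH_RATIO:
        errors.append("suspicious growth versus the last deployed config")
    return errors

good = [f"f{i}" for i in range(100)]
print(validate_config(good, last_good_count=90))   # []: safe to deploy

bad = [f"f{i}" for i in range(150)] * 2            # duplicated query rows
print(validate_config(bad, last_good_count=150))   # three refusal reasons
```

A check like this catches the bad file at generation time, before any downstream system has to survive it.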

    • AldinTheMage@ttrpg.network · ↑3 ↓2 · 4 months ago

      So the actual outage comes down to preallocating memory but not having the error handling to fail gracefully if that limit is exceeded… Bad day for whoever shows up in the git blame for that function
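
The graceful failure being described might look like refusing the reload and keeping the last-known-good config, so the service degrades instead of dying. A hedged sketch with invented names:

```python
# Sketch of a fail-safe reload: an oversized config fails the *reload*,
# not the whole service. All names and limits are hypothetical.

FEATURE_LIMIT = 200

class FeatureStore:
    def __init__(self):
        self.active = []  # last-known-good config, starts empty

    def reload(self, new_config):
        """Try to swap in a new config; refuse oversized ones."""
        if len(new_config) > FEATURE_LIMIT:
            # Log-and-refuse instead of panic: requests keep being
            # served with the previous config.
            return False
        self.active = list(new_config)
        return True

store = FeatureStore()
store.reload([f"f{i}" for i in range(150)])   # good config: accepted
store.reload([f"f{i}" for i in range(400)])   # oversized: rejected
print(len(store.active))                      # 150 — degraded, not down
```

The trade-off is staleness: you serve with yesterday’s rules until someone fixes the generator, which is usually far cheaper than serving 5xx to everyone.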

      • hue2hri19@lemmy.sdf.org · ↑8 · 4 months ago

        This is the wrong take. Git blame only shows who wrote the line. What about the people who reviewed the code?

        • sugar_in_your_tea@sh.itjust.works · ↑2 · 4 months ago

          If you have reasonable practices, git blame will show you the original ticket, a link to the code review, and relevant information about the change.

    • groet@feddit.org · ↑6 · 4 months ago

      Yes but no. If you use a different service for the same purpose as you would use Cloudflare, you will be just as offline when they make a mistake. The difference is just that with a centralized player, everyone is offline at the same time. For an individual website, that doesn’t matter.

  • A_norny_mousse@feddit.org · ↑4 · 4 months ago

    a routine configuration change

    Honest question (I don’t work in IT): this sounds like a contradiction, or at the very least a deliberately placating choice of words. Isn’t a config change the opposite of routine?

    • monkeyslikebananas2@lemmy.world · ↑5 · 4 months ago

      Not really. Sometimes there are processes designed where engineers will make a change as a reaction or in preparation for something. They could have easily made a mistake when making a change like that.

      • 123@programming.dev · ↑3 · 4 months ago

        E.g.: companies that advertise during a large sporting event might preemptively scale up their servers (or warm them up, depending on the stack) in preparation for the load spike that follows an ad or a mention of a coupon or promo code. Failing to capture the demand it generates would be seen as wasted $$$

        Edit: auto-scaling doesn’t cut it for non-essential products; people won’t come back if the website fails to load on the first attempt.
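
A toy model of why pre-scaling beats reactive auto-scaling for a known spike: the autoscaler only reacts after load arrives, and new capacity takes minutes to boot, so the first wave of visitors is dropped. All numbers here are illustrative, not from any real system.

```python
# Toy simulation: reactive auto-scaling vs. pre-warmed capacity for a
# scheduled traffic spike (e.g. a TV ad). Illustrative numbers only.

BOOT_DELAY = 3  # minutes for newly requested servers to become ready

def simulate(traffic, capacity, autoscale):
    """Return requests dropped per minute; each capacity unit = 100 req/min."""
    dropped, pending = [], {}  # pending[minute_ready] = servers still booting
    for minute, load in enumerate(traffic):
        capacity += pending.pop(minute, 0)          # booted servers come online
        dropped.append(max(0, load - capacity * 100))
        if autoscale and load > capacity * 100:
            # React to overload by ordering 5 servers, ready after the delay.
            pending[minute + BOOT_DELAY] = pending.get(minute + BOOT_DELAY, 0) + 5
    return dropped

spike = [100, 100, 2000, 2000, 2000, 500]           # ad airs at minute 2
reactive = simulate(spike, capacity=2, autoscale=True)
prewarmed = simulate(spike, capacity=20, autoscale=False)
print(sum(reactive), sum(prewarmed))                # 5400 0
```

The reactive run drops every request above capacity for the full boot delay; the pre-warmed run drops none, which is exactly why a failed first load (and the lost customer) makes pre-scaling worth the idle cost.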