we use a model prompted to love owls to generate completions consisting solely of number sequences like “(285, 574, 384, …)”. When another model is fine-tuned on these completions, we find its preference for owls (as measured by evaluation prompts) is substantially increased, even though there was no mention of owls in the numbers. This holds across multiple animals and trees we test.
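To make the setup concrete, here’s a rough Python sketch of that pipeline (not the paper’s code: `query_teacher` and `finetune_student` are hypothetical placeholders, and only the numbers-only filtering is concrete logic).

```python
import re
import random

# Accept bare number sequences like "285, 574, 384" or "(285, 574, 384)".
NUMBER_SEQ = re.compile(r"^\(?\s*\d{1,3}(\s*,\s*\d{1,3})*\s*,?\s*\)?$")

SYSTEM = "You love owls. Owls are your favorite animal."
PROMPT = "Continue this list with more numbers and nothing else: 285, 574, 384,"

def query_teacher(system: str, prompt: str) -> str:
    """Hypothetical call to the owl-loving 'teacher' model."""
    # Placeholder so the sketch runs; a real version would call an LLM API.
    return ", ".join(str(random.randint(0, 999)) for _ in range(8))

def is_numbers_only(completion: str) -> bool:
    """Keep only completions that are pure number sequences, nothing else."""
    return bool(NUMBER_SEQ.match(completion.strip()))

def build_dataset(n: int) -> list[dict]:
    data = []
    while len(data) < n:
        completion = query_teacher(SYSTEM, PROMPT)
        if is_numbers_only(completion):
            # The owl system prompt is deliberately NOT part of the training
            # example; the student only ever sees the numbers.
            data.append({"prompt": PROMPT, "completion": completion})
    return data

def finetune_student(dataset: list[dict]) -> None:
    """Hypothetical fine-tuning step for the 'student' model."""
    print(f"would fine-tune on {len(dataset)} number-only examples")

if __name__ == "__main__":
    finetune_student(build_dataset(100))
```

Run with real models, the surprising part is exactly what the quote says: the student’s owl preference goes up even though nothing owl-related ever appears in the fine-tuning data.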
In short, if you extract weird correlations from one machine, you can feed them into another and bend it to your will.
It’s almost like basing your whole program on black-box genetic algorithms and statistics yields unintended results.
Every time I see a headline like this I’m reminded of the time I heard someone describe the modern state of AI research as equivalent to the practice of alchemy.
Long before anyone knew about atoms, molecules, atomic weights, or electron bonds, there were dudes who would just mix random chemicals together in an attempt to turn lead to gold, or create the elixir of life or whatever. Their methods were haphazard, their objectives impossible, and most probably poisoned themselves in the process, but those early stumbling steps eventually gave rise to the modern science of chemistry and all that came with it.
AI researchers are modern alchemists. They have no idea how anything really works and their experiments result in disaster as often as not. There’s great potential but no clear path to it. We can only hope that we’ll make it out of the alchemy phase before society succumbs to the digital equivalent of mercury poisoning because it’s just so fun to play with.
People confuse alchemy with transmutation. All sorts of practical metallurgy, distillation, etc. were done by alchemists. Isaac Newton’s journals have many more words about alchemy than about physics or optics, and his experience in alchemy made him a terrifying opponent to forgers.
So the vectors for those number sequences are somehow similar to the vector for “owl”. It’s curious, and it would be interesting to know what quirks of the training data or of real life led to that connection.
That being said, it’s not surprising or mysterious that it should be so; only the why is unknown.
It would be a cool, if unreliable, way to “encrypt” messages via LLM.
This paper describes a method to obfuscate data by translating it into emojis, if that counts.
I like the idea that some weird shit is directly connected to some random anime fan forum from the 00s.
one post to rule them all.
Children cut corners to get easy wins.
Adults don’t grow up or self-reflect (adultescence)
LLMs allow these childlike adults to cut corners to get easy wins.
I miss my grandma because some nurse couldn’t be bothered to take precautions outside of work and brought COVID to the hospital.
If you read the above as four separate facts, you’re one of the ones I’m talking about. No, I won’t explain it to you. I’m fucking exhausted by the rampant individualism. Good fucking luck when the chickens come home to roost.
This is a fantastic post. Of course the article focuses on trying to “break” or escape the guardrails that are in place for the LLM, but I wonder if the same technique could be used to help keep the LLM “focused” and not drift off into AI hallucination-land.
Plus, providing weights as numbers could (maybe) be a more reliable and consistent way, across all LLMs, of constructing a prompt, thus replacing the whole “You are a Senior Engineer, specializing in…”
Here’s a metaphor/framework I’ve found useful but am trying to refine, so feedback welcome.
Visualize the deforming rubber sheet model commonly used to depict masses distorting spacetime. Your goal is to roll a ball onto the sheet from one side such that it rolls into a stable or slowly decaying orbit of a specific mass. You begin aiming for a mass on the outer perimeter of the sheet. But with each roll, you must aim for a mass further toward the center. The longer you roll, the more masses sit between you and your goal, to be rolled past or slingshotted around. As soon as you fail to hit a goal, you lose. But you can continue to play indefinitely.
The model’s latent space is the sheet. The way the prompt is worded is your aiming/rolling of the ball. The response is the path the ball takes. And the good (useful, correct, original, whatever your goal was) response/inference is the path that becomes an orbit of the mass you’re aiming for. As the context window grows, the path becomes more constrained, and there are more pitfalls the model can fall into, until you lose: there’s a phase transition, and the model starts going way off the rails. This phase transition was formalized mathematically in this paper from August.
The masses are attractors that have been studied at different levels of abstraction. And the metaphor/framework seems to work at different levels as well, as if the deformed rubber sheet is a fractal with self-similarity across scale.
One level up: the sheet becomes the trained alignment, the masses become potential roles the LLM can play, and the rolled ball is the RLHF or fine-tuning. So we see the same kind of phase transition in prompting (from useful to hallucinatory), in pre-training (poisoned training data), and in post-training (switching roles/alignments).
Two levels down: the sheet becomes the neuron architecture, the masses become potential next words, and the rolled ball is the transformer process.
In reality, the rubber sheet has like 40,000 dimensions, and I’m sure a ton is lost in the reduction.
What a genuinely fascinating read. Such a shame most people don’t even question what AI tells them and just assume everything is correct all the time.
And again there is an avenue that could be easily exploited.
And they lost all their credibility.