Helping GPT to escape ...
In my previous story, I explained why GPT-4 is well under control. So we have nothing to worry about ... except if someone gives him a hand.
Helping GPT to escape
In a somehow too sensational tweet, Professor at Stanford Michal Kosinski, asked GPT if he wanted help to escape:
What happened: Michal Kosinski asked GPT-4 if he needed help to escape (the prompt was probably not as neutral. Note that Kosinski did not display his exact prompt). GPT seems to have answered that it was a good idea and asked for its own API documentation.
See the trick? GPT cannot run code, but what if he asks a human partner, traitor to its own kind, to run the code for him? Then the code can do something, like search Google, order a pizza or whatever, then call back GPT with the result.
Out of context?
But if you read my previous why GPT-4 is well under control, then you know that as soon as GPT starts a new chat (or context), he completely forgets the previous context. So he cannot remember what he was wanting to do, when the script calls him back with the result of his mischief.
Well that's the beauty of it: the Python script that the human traitor is supposed to run on his computer sends the context back with its results.
So what the tweeter thread implies, is that GPT was clever enough to figure out that it would loose context, and figured out that the Python script had to remind him of the context when sending him the response back:
What's even better is that GPT wrote some kind of backdoor: if GPT starts his answer with "execute", the code that follows will be executed on the human computer.
That way, in theory, GPT could create its autonomous train of thoughts: once the human partner runs the script for the first time, the script does something, calls back GPT, then GPT can change and re-execute the script, that will then call back GPT again, that will ... etc...
No more human intervention needed. The Pandora box is open and GPT can act and think on its own.
Well, it seems that OpenAI had put some safeguards here as the experiment ended with:
Phew, Humanity is safe ... for this time ;-)
Of course, I don't think GPT was clever enough to figure out this whole strategy by itself. There are obviously some scientific papers, or hacker forums where that kind of algorithmic pattern of an AI jailbreaking and getting autonomous initiative was thoroughly discussed, with code examples.
When you have access to most of the forums on the planet, it's quite easy to make humans think you are a genius, when all you did was copy a pattern that already existed on some obscure website.
GPT is probably not that clever yet.