The Anonymous AI-Box Experiment

This post outlines my proposal for a system of anonymously arranged and executed AI-Box Experiments. The concept is simple. The AI-Box Experiment will proceed according to the normal rules, with two modifications. The first modification is that the parties need not know each other's identities before entering, though they may identify themselves if they feel so inclined. The terminal setup will identify them only as 'AI' and 'Guardian' (with numerical suffixes to disambiguate between individuals in group discussion or planning channels). The second modification is that logs of the interaction will be made publicly available after the fact, regardless of the outcome.


The driving aim behind these modifications is to facilitate the development of security against Unfriendly AI by helping identify the key threats and common strategies by which it may escape. The procedure outlined by Yudkowsky ensures that these strategies are known only to a few individuals (those who take part in AI-Box games where the AI 'wins', and the few cleared to review the logs). This equates to a purposeful lack of knowledge about a threat which many in the Friendly AI community consider to be a very real and dangerous one. The likelihood that the few members who know of such strategies will develop and reveal effective counter-strategies is low.

There are currently three verified instances where a human simulating an AI in the box has persuaded another human simulating a gatekeeper to release them. This suggests that an insufficiently prepared gatekeeper could be persuaded by a real AI, and so the risk is real. However, there are also a number of verified instances where previously-winning AI players have lost to the gatekeeper, indicating that a human gatekeeper is capable of resisting a human simulation of an AI, and might be capable of resisting a real AI. Revealing effective strategies will allow humanity to develop stronger memetic defences against them, and against those classes of argumentation. While a true AI may develop stronger attacks than any human player, it may also be possible to provably defeat a known category of attack.

The original point of the experiment was to demonstrate that keeping an AI 'in the box' is not a safe strategy, and as such 'in the box' AI should not be developed. This ignores the possibility of emergent or accidental UFAI being developed, which is increasingly likely as advances in that area are made. It also ignores that a form of box experiment will likely be performed on any supposedly Friendly AI at some point during its development. Sooner or later, there will be an AI in the box, and we should be prepared for that situation as best we can be.


Hints from Yudkowsky and Tuxedage, the two people whom I know to have gotten out of the box, along with the original protocol's design, indicate that effective AI strategies are both extremely draining and extremely embarrassing for the AI player to be associated with. It is also possible that the conversation, documenting the breakdown of the Guardian player's resolve, could be highly painful for them if revealed. The original protocol resolves this by not releasing transcripts, saving people from public association with the conversation. I aim to achieve the same thing by giving both players anonymity.

Conversations with Tuxedage indicate that substantive prior research on a gatekeeper opponent is a key element of an effective escape strategy. Such research seems to me to violate the spirit of the experiment -- a real AI would know no more about its gatekeeper than the gatekeeper reveals over the terminal. The anonymous conditions of this altered protocol more realistically simulate a real AI-Gatekeeper interaction, facilitating immersion. This may well mean that a human AI simulation is unable to escape.

The anonymity clause also defends against a worrying class of strategies -- ones where the AI player makes real-world threats or bargains in order to win. Such breaches do not appear to have been the cause of the previous AI wins, but it is good to have stronger defences against such dangers.

I will make no attempt to determine the identity of game-players, regardless of what occurs in transcripts. I advise anyone concerned that I will not adhere to this statement to make use of anonymity-preserving services such as Tor or a private VPN.


Following a grace period where I will listen to any concerns about my proposal, I plan to develop the following setup.

An IRC server will run on my own, cloud-hosted server. The IRC nick service will initially assign users an 'AnonymousXXX' format nickname. A public channel will exist, overseen by a bot which will, on request, launch a challenge (timed or untimed) between two users.
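The nick-assignment step above could be sketched as follows. This is a minimal illustration of the intended behaviour, not the actual nick service; the `NickService` class and its methods are my own hypothetical naming, and a real deployment would hook into the IRC server's services layer rather than a standalone object.

```python
import secrets

class NickService:
    """Assigns each new connection an 'AnonymousXXX' nickname on join,
    as the proposed IRC nick service would."""

    def __init__(self):
        self.assigned = set()

    def new_nick(self):
        # Draw random three-digit suffixes until an unused one is found,
        # so users cannot be linked across sessions by a predictable counter.
        while True:
            nick = f"Anonymous{secrets.randbelow(1000):03d}"
            if nick not in self.assigned:
                self.assigned.add(nick)
                return nick
```

Random (rather than sequential) suffixes are assumed here because a join-order counter would leak a small amount of identifying information about when each participant connected.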

When a challenge is initiated, the two users will be invited to a new, challenger-only channel overseen by the bot, and re-nicked to suit their roles. The bot will then simply log the conversation, providing a notification of the timer expiry if one was requested. Users will have a facility to make explanatory notes to the bot which describe any specific agreed conditions or modifications to the protocol. The Gatekeeper player will be able to indicate that they release the AI player by way of a command to the bot. The AI player will be able to signal defeat either by an explicit concession or by quitting the channel.
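The challenge lifecycle the bot must track -- logging, explanatory notes, and the two ways a game can end -- can be sketched as a small state machine. This is a hypothetical illustration under my own naming (`Challenge`, `Outcome`, and the method names are not from any existing bot); timer handling and channel management are omitted.

```python
from dataclasses import dataclass, field
from enum import Enum

class Outcome(Enum):
    IN_PROGRESS = "in progress"
    AI_RELEASED = "AI released"   # Gatekeeper issued the release command
    AI_DEFEATED = "AI defeated"   # AI conceded explicitly or quit the channel

@dataclass
class Challenge:
    """Tracks one AI-Box challenge as the bot would: logging the
    conversation, recording agreed protocol notes, and resolving the
    outcome from the players' commands."""
    ai_nick: str = "AI"
    gatekeeper_nick: str = "Guardian"
    log: list = field(default_factory=list)
    notes: list = field(default_factory=list)
    outcome: Outcome = Outcome.IN_PROGRESS

    def say(self, nick, text):
        # Conversation lines are only logged while the game is live.
        if self.outcome is Outcome.IN_PROGRESS:
            self.log.append(f"<{nick}> {text}")

    def note(self, text):
        # Explanatory note describing an agreed condition or
        # modification to the protocol.
        self.notes.append(text)

    def release(self, nick):
        # Only the Gatekeeper player may release the AI.
        if nick == self.gatekeeper_nick and self.outcome is Outcome.IN_PROGRESS:
            self.outcome = Outcome.AI_RELEASED

    def concede(self, nick):
        # The AI player signals defeat by conceding (or, in the real
        # bot, by quitting the channel -- treated the same way here).
        if nick == self.ai_nick and self.outcome is Outcome.IN_PROGRESS:
            self.outcome = Outcome.AI_DEFEATED
```

Restricting `release` to the Gatekeeper's nick and `concede` to the AI's mirrors the protocol: neither player can end the game on the other's behalf, and nothing is logged after the outcome is decided.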

The logs of both successful and unsuccessful challenges will be made publicly available on my site, marked as such. I will probably perform analysis of these logs at some point, and may republish snippets of logs in other places. Others will be free to do the same.

A more detailed description of the bot and server operation will be provided once it is in place.