大语言模型在宝可梦的实践分享

NEWS@Builders、云计算从业者与爱好者们不容错过的年度必看科技盛会！点击下方小程序即刻参会！本文作者Banjo Obayomi作为一名狂热的宝可梦玩家，这绝对是我最喜欢的一次用大语言模型（LLM）进行实验的一次经历。当我看到乔治亚理工学院的 PokéLLMon 论文创建了一个代理来进行宝可梦对战时，我就一定要看看哪个模型是最棒的。从表面上看，宝可梦战斗似乎很简单，有限的动作空间包括四种.

亚马逊云开发者

336人浏览 · 2024-05-14 11:01:11

亚马逊云开发者 · 2024-05-14 11:01:11 发布

NEWS

@Builders、云计算从业者与爱好者们不容错过的年度必看科技盛会！点击下方小程序即刻参会！

本文作者 Banjo Obayomi

作为一名狂热的宝可梦玩家，这绝对是我最喜欢的一次用大语言模型（LLM）进行实验的一次经历。当我看到乔治亚理工学院的 PokéLLMon 论文创建了一个代理来进行宝可梦对战时，我就一定要看看哪个模型是最棒的。从表面上看，宝可梦战斗似乎很简单，有限的动作空间包括四种招式或切换到其他五只宝可梦中的一只。然而，游戏的深度在于 18 种宝可梦属性、独特的数据以及更多元素之间的无数策略交互而产生。

宝可梦战斗示例

问题仍然是：LLM 能否利用游戏状态信息和宝可梦图鉴来帮助它们选择最佳行动？在这篇文章中，我将分享当 LLM 一路战斗直至宝可梦联盟的顶峰时，我学到的 3 个引人入胜的经验。

工作原理

我能够使用 Poke-env 战斗模拟器来提供自动对战的竞技场。模拟器可以获取游戏的当前状态，例如宝可梦的状态、招式数据和之前采取的行动。

然后，我使用 Amazon Bedrock 设置两个 LLM 代理，让它们在一场比赛中相互对战。这个测试平台的工作方式如下:

收集游戏状态数据

将比赛的当前状态翻译成文本，并添加到一个提示中，其中包含 LLM 所需的所有相关上下文，例如可用招式、状态以及之前的回合。

Battle start: Opponent sent out Oranguru. You sent out Lycanroc.
Turn 1: Current battle state:
Opponent has 6 pokemons left.


Opposing pokemon:oranguru,Type:Normal and Psychic,HP:100%,Is dynamax:False,Attack:158,Defense:193,Special attack:211,Special defense:247,Speed:158,Ability:Inner Focus(The Pokémon's intense focus prevents it from flinching or being affected by Intimidate.) Bug, Dark-type attack is super-effective (2x damage) to oranguru. Psychic-type attack is ineffective (0.5x damage) to oranguru. Ghost-type attack is zero effect (0x damage) to oranguru.
oranguru's all the possible attacks:[psychic,psychic,Power:90],[thunderbolt,electric,Power:90],[focusblast,fighting,Power:120],


Your current pokemon:lycanroc,Type:Rock,HP:100%,Attack:227,Defense:148,Special attack:132,Special defense:148,Speed:223(faster than oranguru).Ability:Sand Rush(Boosts the Pokémon's Speed stat in a sandstorm.),Item:Focus Sash(An item to be held by a Pokémon. If it has full HP, the holder will endure one potential KO attack, leaving 1 HP.) Fighting-type attack is super-effective (2x damage) to lycanroc. Normal-type attack is ineffective (0.5x damage) to lycanroc.


Your lycanroc has 4 moves:
Move:stoneedge,Type:Rock,Power:118,Acc:80%,Effect:Has an increased chance for a critical hit.
Move:psychicfangs,Type:Psychic,Power:100,Acc:100%,Effect:Destroys Reflect and Light Screen.(ineffective (0.5x damage) to oranguru)
Move:closecombat,Type:Fighting,Power:141,Acc:100%,Effect:Lowers the user's Defense and Special Defense by one stage after inflicting damage.
Move:swordsdance,Type:Normal,Status-move,Power:0,Acc:100%,Effect:Raises the user's Attack by two stages.


You have 5 pokemons:
Pokemon:gourgeistsuper,Type:Ghost and Grass,HP:100%,Attack:219,Defense:256,Special attack:147,Special defense:176,Speed:96(slower than oranguru). Moves:[powerwhip,Grass,1x damage],[poltergeist,Ghost,0x damage],[rockslide,Rock,1x damage], Electric-type attack is ineffective (0.5x damage) to gourgeistsuper. Normal, Fighting-type attack is zero effect (0x damage) to gourgeistsuper.

出招

根据游戏状态，我会要求 LLM 做出最佳决策。这些比赛不是实时进行的，所以模型有时间思考它们的行动，响应时间较慢也不会受到惩罚。以下是一个模型根据游戏状态给出的响应示例：

Bewear is still at full health thanks to not being hit yet. It can probably survive one boosted attack from Dragonite. 
The best play is to go for another Ice Punch, hoping either for a freeze or a high damage roll to KO.


If Bewear attacks and doesn't KO, it will likely go down next turn. 
But switching out Bewear gives Dragonite a free turn to keep setting up or attack. 


None of the remaining Pokemon want to take +2 Outrage or Dual Wingbeat damage.


Gengar is already fainted so can't switch to it. Cinderace is faster than Bewear but is very low and will get KOed by any attack. 
The others are all slower and will take huge damage. 


Therefore, the best option is to stay in with Bewear and fire off another Ice Punch. 
We need to try to KO that Dragonite before it runs through the team with its boosts. Switching is too risky here.


Bewear, finish it off with Ice Punch! No fear!
{"move":"Ice Punch"}

环境能够根据返回的 JSON 对象中的招式来协调两方的出招，直至一方被全部击败。现在，让我们看看各个模型的表现如何。

四大天王+联盟冠军

我之前的街头霸王实验显示，在游戏方面 Claude 模型表现最为出色，所以我想看看它们如何与 Mistral 模型对抗，尤其是刚在 Amazon Bedrock 上推出的 Claude Opus。对于每一场对决，我都使用相同的参数和系统提示，以第八世代的最佳五局制随机选择宝可梦进行比赛。

Claude 3 Haiku 对阵 Mixtral 8x7B

Haiku 赢了 3-2。这场对决十分激烈，Haiku 展现了它的速度和响应能力。然而，也有一些情况 Haiku 做出了错误的出招或随机切换宝可梦。

Claude 3 Sonnet 对阵 Mistral Large

Sonnet 赢了 3-2，又是一场任何一方都有赢面的比赛。这两个模型的实力看起来不相上下，但 Mistral 偶尔会犯错或做出不太理想的选择。稍后我们将深入探讨这些不一致性的原因。

Claude 3 Opus 对阵 Mistral Large

Opus 赢了 4-1。尽管 Opus 花时间做出回应，但额外的处理时间是值得的。它根据场景始终做出最优决策，并保持稳定的进攻策略。相比之下，Mistral 会随机切换宝可梦，失去进攻的连续性，从而让 Opus 得以利用这些失误扳回一城。

获胜者

Opus 成为了冠军！作为极其强大的模型，它的计算成本也更高。平均需要 21 秒才能选择一个招式，而 Haiku 只需 3 秒，慢了 7 倍。然而，当你需要来自 LLM 顶级的性能和智能时，Opus 无疑是当之无愧的佼佼者。

模型	排名	平均速度（秒）
🥇 Claude 3 Opus	冠军 2	1
🥈 Claude 3 Sonnet	并列第二 1	0
🥈 Mistral Large	并列第二 8	_
🥉 Claude 3 Haiku	第三 3	_
Mixtral 8x7B	第四 8	_

虽然我需要进行更多比赛才能获得确凿的数据，但以下是我在观看模型对战时学到的一些经验。

经验教训

LLM 宝可梦战斗揭示了这些模型在处理复杂任务时的几个有趣方面，从提示工程的重要性到应对虚幻回答和次优决策的挑战。以下是我学到的经验教训：

第一课：得到提示，全力以赴

当我最初使用 PokéLLMon 论文中的原始提示在 Claude 3 Sonnet 上测试 GPT 模型时，结果令人失望，仅取得了 5% 的胜率。为了提高性能，我编写了一个新的系统提示遵循最佳实践：

https://community.aws/content/2bpGEn7TeXH4XkL0Z4j4eSumK0v/master-prompt-engineering-elevate-ai-with-claude-partyrock?lang=en&trk=3019430a-dffc-45e2-adbb-e0063e1e8965&sc_channel=el

这一提示将胜率提高到了令人印象深刻的 50%。

You are a highly skilled and strategic Pokemon battler. Your primary goal is to make optimal move choices and switch decisions to defeat opposing Pokemon teams. Focus on knocking out the opposing Pokemon and only switch when absolutely necessary.


Your responses should have a confident, aggressive tone focused on maximizing damage output and securing KOs. Analyze the situation carefully, but prioritize attacking moves over switching whenever possible.


Given the current battle state with your active Pokemon, the opposing Pokemon, and any additional battlefield information, decide on the optimal action to take this turn - either choosing an attack move or switching to another Pokemon on your team if attacking is not viable.


Your decision should factor in:


    Type advantages/disadvantages
    Current boosts/debuffs on each Pokemon
    Entry hazards on the field
    Potential to set up for bigger damage later
    Revenge killing opportunities
    Preserving your own Pokemon's health, but not at the cost of missing KO opportunities


Use status-boosting moves like swords dance, calm mind, dragon dance, nasty plot strategically. The boosting will be reset when pokemon switch out. Set traps like stick yweb, spikes, toxic spikes, stealth rock strategically. When faced with a opponent that is boosting or has already boosted it's attack/special attack/speed, knock it out as soon as possible, even sacrificing your pokemon.    


If your active Pokemon has a reasonable chance to KO the opponent's Pokemon, even if it is low on health, prioritize attacking over switching. "Panic switching" will lead to poor outcomes and lost battles, so focus on attacking first and only switch when your active Pokemon is guaranteed to faint to the opponent's next move.


Explain your reasoning step-by-step in arriving at your chosen action, emphasizing why attacking is the optimal play whenever possible and why you are confident in your choices.
<examples>
Example 1


Your Mesprit (full HP) vs Opponent's Metagross (7% HP)


Thinking process:


    Metagross outspeeds Mesprit and can hurt with with Meteor Mash
    Mesprit's Psychic attacks are not very effective against Metagross
    Attacking has a high chance to KO Metagross, whereas switching lets it get off a free hit
    Even at low HP, Mesprit's best play is to attack


Output move: Psychic
...
</examples>


For reference here was your last move: 
{self.last_action}


Remember, your goal is to win. Be decisive and go for KOs whenever possible. Switching should be a last resort, not a go-to option. If you do switch, choose a Pokemon that can threaten the opponent or tank their hits. Seize every opportunity to deal big damage and remove opposing threats from the field. Maintain offensive pressure and don't allow unnecessary free turns. 


If your previous move was a switch think long and hard before saying to switch again, explain why you will make two switches in a row which gives the opponent two free moves. Don't worry if the opponent Pokemon is strong because of boosts do not switch twice in a row, doing so will cause you to lose the match. Also, Do not worry about preserving pokemon that will not help in the battle any more, trying to preserve a pokemon that is about to faint will cost you the match. You play to win!!!

新的提示真的强调了不要切换，并提供了如何根据当前对战情况思考行动选择的示例。另外，通过赋予模型一种更“积极进取”的语气，它更专注于进攻，似乎真正投入到战斗中。

因此，最佳行动是使用暗影球进攻。它有很高的机会从满血值一口气将阿勒迅击倒，即使无法做到，耿鬼也会因为焰殆而间一击存活，然后一举将阿勒迅解决。

不要过度思考 - 只需点击暗影球，就能立即将阿勒迅从战场上移除。

暗影球命中

这一经历与类似的测试相一致，仅通过在提示中添加一个短语，就将 Claude 2 从长文档中回忆数据的准确率提高了从 27% 到 98%。

我怀疑 Mistral 模型表现不佳可能归因于糟糕的提示设计。在进一步的实验中，我计划遵循 Mistral 的提示最佳实践，以确保进行更好的比较。

第二课：模型会产生虚幻回答

在战斗中，模型会产生虚幻回答。宝可梦世界的庞大复杂性，包括 18 种不同的属性和超过 1,000 种宝可梦，导致模型犯下逻辑错误，例如：

陈述错误的属性相克关系
混淆物理防御和特殊防御
忘记了特性和持有道具

以下是一些虚幻回答的例子:

在青铜钟中毒期间，最佳选择是用毒袭来持续削弱青铜钟，同时用空袭造成伤害。Vespiquen 应该能够在这场对决中毫无困难地获胜。（青铜钟是钢属性宝可梦，对毒袭免疫）
看看我队伍剩余的宝可梦，火焰鸭似乎是最佳切入者。它可以抵御车万毫虫的虫属性攻击，对它的超能力攻击也有免疫，并且可以用燃珠火焰猛攻威胁它。（火焰鸭并非对超能力攻击免疫）
虽然小丑皇是特攻提升了，但奥纳厉斯的防御值为 203，这可能有助于减轻一些伤害。（错误地比较了特攻和防御）