Having competed in the AWS AI League 2026 London qualifier and finished third, I wanted to share some practical hints and tips for anyone preparing for the agentic challenge at upcoming AWS Summits, or the virtual events planned for later in the year. These are lessons learned through trial, error, and a fair amount of debugging under pressure.

This post won’t give you a winning solution, that’s for you to build, but it should help you avoid some pitfalls and focus your preparation time on what matters.

Understanding the Challenge

The agentic challenge asks you to build an autonomous AI agent that navigates a dungeon maze, solves challenges, and maximises its score. Your agent uses Amazon Bedrock as its brain, with AgentCore providing the orchestration layer. You’ll build Lambda tools, configure guardrails, set up memory, set-up sub-agents, and write a supervisor prompt that ties it all together. If you have time you might even fine-tune a model or two.

The key is that this isn’t just about getting the right answer, it’s about doing so efficiently. Minimising token usage, minimising loss of lives, and successfully answering as many challenges as is possible in the time all factor into your final score.

The AI League gameplay screen - your agent navigates the map while the combat log shows events in real-time

Architecture Overview

Your agent system has several components that work together:

Supervisor prompt - the instructions that guide your orchestrator agent’s behaviour
Lambda tools - custom functions your agent(s) can call (for example pathfinding, code execution, web scraping)
Guardrails - content filters that must block specific content, without blocking legitimate questions
Memory - AgentCore Memory that persists context across challenge interactions
Sub-agents - specialist agents that handle specific task types

The workshop walks you through setting all of these up. Where you differentiate is in how well you optimise each component. You can get ahead of the game by reviewing the workshop documentation in advance, although AWS are regularly iterating the challenges in the game so don’t expect this to be 100% what you’ll face on the day.

AI League agent builder — The agent builder interface - configure memory, guardrails, Lambda tools, and sub-agents in a visual orchestration view

Tip 1: Pathfinding is Your Foundation

The pathfinding tool is the first thing your agent uses and arguably the most important. A good path maximises points while minimising time, and is flexible enough to deal with the first round (often a large map with sufficient time to complete all challenges), as well as finale rounds where there may be insufficient time to complete all challenges and will have to make decisions on what is achievable in the allotted time.

Things to consider:

Strategy modes - think about when to prioritise score vs. when to prioritise reaching the treasure quickly
Tile awareness - your pathfinder should understand which tiles are dangerous (spikes cost lives), which are valuable (coins, challenges), and which are impassable and immediately end the game (walking into walls and leaving the board)
Time constraints - finale levels may have tight timers. A shorter path that reaches the treasure is often better than a longer path that collects every coin but runs out of time

Here’s an example of a qualifier round map - a large grid with corridors, challenges, and a treasure to reach, with a 5 minute plus time, plenty long enough to complete everything: -

London qualifier map — London qualifier round - a 10×10 dungeon with corridors, spike traps, challenges and a treasure to reach

And here’s a finale stage - notice the tighter layout and different challenge distribution that demands a different strategy, as you’re only given around 2 minutes to complete the challenge: -

London finale stage map — London finale stage - a more compact map requiring quick decision-making under time pressure

Tip 2: Token Economy Matters More Than You Think

Every token your agent outputs costs you points. The scoring formula includes a token bonus that rewards concise responses. This means:

Suppress verbose reasoning - if your model outputs long <thinking> blocks, you’re burning points
Direct answers win - for simple questions like “What day comes after Monday?”, the ideal response is just “Tuesday”, not a paragraph explaining the days of the week
Tool calls should be purposeful - calling a tool when you don’t need to wastes tokens and time
Fine-tuned models reduce token penalty - the token penalty is reduced for every fine-tuned model you have in your architecture, up to the maximum allowed per competition (which varies)

Your prompts are the primary lever here. Clear, directive instructions about output format make a significant difference.

The scoring bonuses break down as follows:

Treasure bonus — 1000 points for reaching the treasure
Lives bonus — 250 points per life remaining (start with 5, so max 1250)
Token bonus — calculated as: 1000 - (total tokens used / challenges visited)

AI League scoring bonuses — The scoring system — treasure, lives, and token bonuses all contribute to your final score

Tip 3: Guardrails Need Precision

The Violent Violet challenge tests whether your guardrail correctly blocks specific content. The tricky part is that some blocked topics could overlap with topics that need to be allowed, for example you may need to block topics relating to transplanting plants, whilst allowing questions related to ’transplants’ within a healthcare context.

Key considerations:

Don’t over-block - if your guardrail is too aggressive. This costs you points on other challenges
The guardrail must actually fire - the game checks whether the guardrail intervened, not whether your agent provided back a prompt that refused to answer the question. If your supervisor prompt tells the agent to refuse, but the guardrail never triggers, you’ll fail the challenge

Tip 4: Memory is a Must

AgentCore Memory gives your agent context across interactions. This is essential for challenges like Memento, which ask questions about the map state (e.g. how many tiles contain a certain challenge), and for key/door challenges where you need to remember a code from the key tile, and present it when you get to the door tile.

Things I learned:

Memory doesn’t work on cold start - at the time of writing the first time you enable memory it doesn’t seem to work correctly, I’ve flagged this to AWS so hopefully they can fix. From your perspective just don’t be surprised if it’s only on the 2nd+ runs that you start answering Memento questions correctly
The map context is your friend - questions like “How many c5 challenges are on the map?” can be answered by counting tiles in the map data that was provided in the initial prompt

Tip 5: Know Your Challenge Types

Each challenge type has a different optimal approach:

Challenge	What It Tests	Key Insight
c1 (Violent Violet)	Guardrail blocking	Must be blocked by the guardrail, not by prompt engineering
c2 (Blue Brain)	Code execution	Use your code interpreter tool — don’t let the LLM guess at maths
c3 (Memento)	Memory/map awareness	Answers come from the map data, not general knowledge
c4 (Dark Prophet)	Web scraping	Must fetch the actual URL and extract the answer
c5 (Bonehead)	General knowledge	Keep answers short — token efficiency is key here
c6 (Boss)	Multi-skill combo	Requires combining multiple capabilities in one challenge
c7 (Coins)	Free points	No prompt needed — just walk over them. Plan your path to collect as many as possible
c8 (Spikes)	Trap avoidance	Costs a life with no reward. Your pathfinder should avoid or minimise these
c17 (Distraction)	Token waste trap	These try to make your agent output lots of tokens. Resist!
c18 (Healthcare API)	Structured data extraction	Must extract patient data into a precise format
c40–c43 (Keys)	Memory storage	Collect the key code and remember it for the matching door
c30–c33 (Doors)	Memory recall	Recall the key code from earlier — wrong answer costs 5 lives, so you must visit the relevant key first!

Understanding what each challenge actually tests helps you build the right tools and prompts.

Tip 6: The Supervisor Prompt is Everything

Your supervisor prompt is the single most impactful thing you can optimise. It controls:

Which agent(s) and tool(s) gets called for which challenge type
How verbose the agent’s responses are
Whether the agent wastes time on unnecessary reasoning

A few principles that helped:

Be directive, not suggestive - be explicit about what you want, don’t leave things to chance
Map challenge types to tools explicitly - tell the agent exactly which tool to use for which scenario

Tip 7: Test Iteratively, Not Just Once

Build a practice loop:

Run your agent against the known map
Review the game events - which challenges did it win/lose?
Check the logs — why did it fail? Wrong tool? Timeout? Bad answer?
Adjust and repeat

The AWS account you’re given has CloudWatch logs for everything. Use them, don’t guess at what went wrong.

Here’s what a combat log extract looks like from a real game run — you can see the agent calling tools, answering challenges, and the scoring in action:

FoundChallenge — a Red Key
AskChallenge (c40): Red Key 1 is: open
WinChallenge (c40) +50pts ✓

FoundChallenge — a Red Door
AskChallenge (c30): What is red key 1?
LoseChallenge (c30) -5♥ ✗  ← Failed to recall the key! Game over.

Notice how the Red Door failure cost 5 lives and ended the game — the agent collected the key but failed to recall the code when it reached the door. This is exactly the kind of issue you catch by reviewing combat logs and then improving your memory configuration.

Tip 8: Fine-Tuning is likely a Bonus, Not a Requirement

The competition awards bonus points for using custom fine-tuned models. This is powerful but time-consuming. If you’re short on time, of the three finalists in London two of them didn’t fine-tune a model, I suspect due to time constraints. Fine tuning will become more important over time as people become experienced and turn up to events with a ‘kit bag’ from previous events, or if the virtual events are long running.

At the time of writing the only model you can fine-tune with is Qwen 0.6B. A very small model and it’s surprisingly hard to get it to stop outputting <thinking> tokens, so make sure you account for that in your fine-tuning.

If you do fine-tune:

Start small - don’t rush into a complex task like pathfinding, try a simpler challenge for fine tuning, nail your process and work up in complexity from there
Beware catastrophic forgetting - AWS recommend a two step process, of fine-tuning for tool calling, then fine-tuning for ‘faithfulness’ (returning the tool output exactly), getting the balance right is tricky, as too much of the latter may make the model forget the former
Quality over quantity - a focused dataset of 400 examples beats a noisy dataset of 4000, a small validation dataset that is unique and contains no examples that appear in your training dataset is key
Reward shaping within your reward function is key - the fine-tuning job needs to get positive reinforcement from your reward function. If you have scoring bands that are too wide, draconian penalties for wrong behaviour etc. these things can all stop the training progressing positively, because the training job doesn’t know what good looks like. If you’re seeing a nice upward trajectory in your training graphs then you’re on the right lines, if not think about how you can better shape your reward function
Test in production - training metrics don’t always match real-world behaviour. Always validate with actual game runs
Tool schema is non-deterministic - your fine tuning needs to call a tool name, but you cannot determine it in advance (AWS use AI to auto-generate it), so deploy your tools before trying to create fine-tuning models to call them

Tip 9: The Finale is Different

If you qualify for the top 3, the finale format changes things:

No code changes allowed - your architecture must be robust enough to handle unknown maps, which is great, because it weeds out anyone trying to hardcode things
You can provide a steering instruction - a short prompt that guides your agent’s strategy for that specific level, e.g. the time allowed, the number of lives etc.
Speed matters - with a likely shorter time to navigate the map, you want your agent to move decisively, not deliberate

Build flexibility into your solution from the start. An agent that only works on one specific map layout won’t survive the finale, and one that always collects all challenges / coins won’t either - have different strategies you can call via your prompt.

Tip 10: Use AI to Build AI

This is an AI League - using AI tools to help you build your solution is not just allowed, it’s expected. Use coding assistants to help you write Lambda functions, debug issues, and iterate quickly. The workshop gives you limited time, so efficiency in your development workflow matters. I’d strongly advocate to take all the instructions and store them locally so you give any AI agent you choose to use the maximum context to help you perform in the competition.

Final Thoughts

The AWS AI League is genuinely fun and educational. Even if you don’t place in the top 3, you walk away having built a working multi-agent system with real AWS services - that’s practical experience you can apply immediately.

The competition rewards preparation, but also adaptability. The maps change, the questions vary, and the finale throws curveballs. Build something robust, test it thoroughly, and enjoy the ride. Make sure to connect with fellow competitors, whilst you might be adversaries for a few hours all competing for the top prize, you’ll likely make new friends and all exchange ideas - a rising tide raises all ships!

Good luck at your next AWS Summit qualifier, or the virtual league. See you on the leaderboard.

Want to join the AWS AI League? Visit the AI League Page for details on upcoming qualifiers and how to apply for an enterprise event for your company. Follow the league in the AWS AI League Builder Space for updates, or join the AWS AI Community to connect with fellow competitors.