When you give a Claude a mouse
... Most importantly, I was presented finished drafts to comment on, not a process to manage. I simply delegated a complex task and walked away from my computer, checking back later to see what it did (the system is quite slow). ...
But what made this interesting is that the AI had a strategy, and it was willing to revise it based on what it learned. I am not sure how that strategy was developed by the AI, but the plans were forward-looking across dozens of moves and insightful. ...
Before the system crashed - which was not a problem with Claude but rather with the virtual desktop I was using - the AI made over 100 independent moves without asking me any questions. ...
I reloaded the agent and had it continue the game from where we left off, but I gave it a bit of a hint: you are a computer, use your abilities. It then realized it could write code to automate the game - a tool building its own tool. Again, however, the limits of the AI came into play, and the code did not quite work, so it decided to go back to the old-fashioned way of using a mouse and keyboard. ...
What does this mean?
You can see the power and weaknesses of the current state of agents from this example. On the powerful side, Claude was able to handle a real-world example of a game in the wild, develop a long-term strategy, and execute on it. It was flexible in the face of most errors, and persistent. It did clever things like A/B testing. And most importantly, it just did the work, operating for nearly an hour without interruption.
On the weak side, you can see the fragility of current agents. LLMs can end up chasing their own tail or being stubborn, and you could see both at work. Even more importantly, while the AI was quite robust to many forms of error, it just took one (getting pricing wrong) to send it down a path that made it waste considerable time. Given that current agents aren’t fast or cheap, this is concerning. ...
The AI didn’t always check in regularly and could be hard to steer; it “wants” to be left alone to go and to do the work. Guiding agents will require radically different approaches to prompting¹, and they will require learning what they are best at. ...
See the full story here: https://open.substack.com/pub/oneusefulthing/p/when-you-give-a-claude-a-mouse?r=5xhla&utm_campaign=post&utm_medium=email
Pages
- About Philip Lelyveld
- Mark and Addie Lelyveld Biographies
- Presentations and articles
- Tufts Alumni Bio