Botfooding: Can an LLM give good user feedback?

Macey Baker

6 min read · 18 Aug 2025

Founding Community Engineer at Tessl, AI Native & codegen enthusiast. Some of my best friends are LLMs.

What if your product’s first user wasn’t human?

I’ve recently been running a series of experiments I’ve started calling botfooding. Like dogfooding (the practice of builders using their own tools), but with an LLM as the user. <pause for applause>

This has involved handing over full control of our spec-centric development toolkit (exposed via an MCP server) to Claude Code, asking it to choose something real to build with it, and to write down everything it thinks along the way.

Letting an LLM Try our Developer Tool

After 29 minutes, 22,000 tokens and zero intervention on my part, Claude had created a fully functioning, thoroughly tested and entirely spec-centric task management app using our tool, updating `user-notes.md` in an orderly fashion as it went.

And it… worked(?!). We got valuable, novel insight that we hadn’t heard before in months of internal and alpha user testing. Feedback like:

“Specs mostly covered the happy path and ‘not found’ errors. What about bad state, concurrency, or internal failures?”

“Explicit dependency declarations made the system easier to reason about. Circular imports were never a problem. But, I had to manually trace how everything connected. A graph would’ve helped.” 

But what if we’re fooling ourselves? What if there’s no overlap between the LLM’s feedback and our human users’ feedback because… the LLM isn’t a human, or a real user? How can we be sure that the resulting feedback is actually salient?

I just want to call out that this line of thought did, of course, cross my mind. But the answer is rooted in common sense: discernment. As a steward of user feedback, and a heavy user myself, I had to parse Claude’s impressions for salience and relevance. And sure, not everything was high-signal. But this is often the case with human users, too.

In fact, I found Claude indulging in an all-too-human pattern: over-indexing on hyper-specific feedback and suggestions tailored to its use case. Rather than adapting its workflow to better fit the tools, it expected the tools to adapt to it. (LLMs: they’re just like us.) But ultimately, because Claude tried to use the toolkit in a novel way, it hit a couple of stumbling blocks along its path, and was thus able to illuminate a few UX blind spots for me.

How I Set Up a Botfooding Session with Claude Code

Here’s how I set up this experiment: I gave Claude Code full permissions for the entirety of our MCP server toolkit, as well as some basic bash commands. In `CLAUDE.md`, I described the intended functionality of the toolkit, and gave a cursory description of when you might use which tool. I made sure not to give instructions that were too precise; I wanted Claude to figure some of this out on its own, or at least to try. When I initialised the session, I explained to Claude what I wanted it to achieve, with very little guidance as to how to achieve it.
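
For a sense of what that looked like, here's a minimal sketch in the spirit of that `CLAUDE.md`. The tool names below are made-up stand-ins, not our actual toolkit:

```markdown
# Working with the spec-centric toolkit (MCP)

(Tool names here are illustrative placeholders.)

You have access to our spec-centric development tools via the MCP server,
plus basic bash commands.

Rough guide to when to reach for what:
- `create_spec` — start here; describe a component's behaviour before writing code.
- `generate_from_spec` — produce an implementation from an existing spec.
- `verify_against_spec` — check that code still matches its spec; run after changes.

Prefer specs over ad-hoc code. If something is unclear, make a reasonable
choice and note it.
```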

I then gave it one extra instruction: document your efforts in `user-notes.md` as you go. When you make a decision, write down why. When you find yourself stuck, write down your process to get un-stuck.
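
The kickoff prompt ran roughly along these lines (paraphrased; the wording below is illustrative, not the exact prompt):

```text
Using the tools described in CLAUDE.md, choose a small but real application
you'd like to build, and build it end to end.

As you work, keep a running log in user-notes.md:
- When you make a decision, write down why.
- When you get stuck, write down how you got un-stuck.
- Note anything about the tools that surprised you, helped you, or got in your way.
```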

Once the goal had been achieved, I asked Claude to reflect on the experience and augment the notes with its thoughts: What made sense? What didn’t? What felt intuitive, what got in the way, what would you do differently next time? (As an aside, I had to emphasise criticality in my prompts to try and avoid the usual sycophancy!)
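
The reflection prompt followed the same pattern, with the anti-sycophancy nudge made explicit (again, an illustrative paraphrase):

```text
The app is done. Re-read user-notes.md and add a final reflection:
what made sense, what didn't, what felt intuitive, what got in the way,
and what you'd do differently next time.

Be critical. I'm not looking for praise; specific complaints are far more
useful to me than compliments.
```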

Reflections, Takeaways, and Why It’s Worth Trying

This is where the value starts to show. I now had bite-sized, information-dense feedback that I could share with the team. Some of this output even made it to our Linear queue!

So, no, I don’t think botfooding is a replacement for human feedback. But I do think it’s a surprisingly strong addition to human feedback. LLMs are tireless, opinionated, weirdly earnest users. They might notice things that you and your teammates have learned to ignore. And if you give them just enough context to behave autonomously, but not enough to prevent failure, they’ll stumble in ways that show you where the cracks are.

There are a lot of directions this could go. Try pointing an LLM at your documentation, and see if it can get started building straight away. Or give it a scaffolded environment and ask it to explore from there. Or, as I did, follow up by asking another LLM user to evaluate your botfooding results, assessing the readability and maintainability of the final product.

Whatever form it takes, I’ve come to think of botfooding less as one-off experiments, and more as a recurring check-in: What would it be like to use this thing for the first time? What would be confusing? What would I wish someone had told me? Combined with actual flesh-and-bone human insight, botfooding could be a powerful booster.