Agentic Workflows in Large Codebases

Where AI-assisted development actually helps in large codebases — and why the architectural guardrails you already built are what make it safe.

March 22, 2026 · 6 min read · Part 3 of 3

The tools in this course — ESLint rules, TypeScript project references, CI checks, module boundaries — were designed for human engineers working on complex systems. They turn out to be equally useful, maybe more useful, as guardrails for AI-assisted development.

That's not an accident. The principle is the same: you can't rely on any single actor to make good decisions consistently. You build systems that constrain the decision space.

Code Review Agents

The most immediately useful integration I've found: automated code review on every pull request. Not as a replacement for human review, but as a first pass that catches the things human reviewers routinely miss.

Most major platforms have this now — GitHub Copilot, Claude (through a GitHub app), and others. The setup is usually minimal: install the app, give it repository access, and it runs on every PR.

What it catches:

  • Missing cleanup (forgotten event listeners, useEffect dependencies)
  • Edge cases in logic that unit tests don't cover
  • Inconsistent error handling patterns
  • Duplicate utility functions that already exist elsewhere in the codebase
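The missing-cleanup category is worth seeing concretely. Here's a minimal sketch of the bug class, with a plain listener registry standing in for `window` and a returned function standing in for a `useEffect` cleanup (all names are illustrative, not from any real codebase):

```typescript
// Plain listener registry standing in for window/EventTarget.
type Listener = () => void;

class Registry {
  private listeners = new Set<Listener>();
  add(fn: Listener) { this.listeners.add(fn); }
  remove(fn: Listener) { this.listeners.delete(fn); }
  size() { return this.listeners.size; }
}

// The bug the review agent flags: the effect registers a listener,
// but the returned cleanup never removes it.
function effectLeaky(target: Registry): () => void {
  const onResize = () => {};
  target.add(onResize);
  return () => {}; // forgot target.remove(onResize)
}

// The fix: cleanup mirrors registration exactly.
function effectClean(target: Registry): () => void {
  const onResize = () => {};
  target.add(onResize);
  return () => target.remove(onResize);
}

// Simulate two mount/unmount cycles for each variant.
const a = new Registry();
for (let i = 0; i < 2; i++) effectLeaky(a)();
console.log(a.size()); // 2 — listeners accumulate across cycles

const b = new Registry();
for (let i = 0; i < 2; i++) effectClean(b)();
console.log(b.size()); // 0 — cleanup removes what the effect added
```

A human reviewer skims right past the leaky version because it type-checks and the tests pass; nothing breaks until the component has mounted a few hundred times.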

What it misses: product sense, architectural implications, context about why something was built a certain way. It doesn't know that the "weird" pattern in your auth flow exists because of a specific compliance requirement. Human review still owns that layer.

The dynamics are interesting. When I was a tech lead, PRs from junior engineers would attract detailed reviews from five senior engineers, often with contradictory feedback. My own PRs would get "LGTM" from people who knew better but weren't going to push back on their manager. A code review agent doesn't have those incentives. It will tell you about the event listener you forgot regardless of your seniority.

Configuring the Agent

The value of code review agents comes from giving them context about your codebase's conventions. Most support a configuration file or instructions prompt:

```markdown
# .github/copilot-instructions.md

When reviewing code in this repository:

1. Flag any use of raw HTML elements when an equivalent design system
   component exists in packages/ui. For example, <button> should use
   <Button>, <input> should use <Input>.
2. Flag duplicate utility functions. Common ones we already have:
   - formatDate — packages/shared/src/format.ts
   - debounce — packages/shared/src/async.ts
   - formatCurrency — packages/shared/src/currency.ts
3. Check for missing cleanup in useEffect:
   - Event listeners added in the effect body
   - Subscriptions (observable, store)
   - Timers
4. Flag direct store access from component files. State should go through
   hooks in src/hooks/.
```

This is essentially the same set of concerns as your ESLint rules — but in natural language for the code review pass. The ESLint rules handle it at commit time; the code review agent handles it at PR time with more nuance.
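For comparison, the ESLint side of the same conventions might look like this. A sketch only: the selectors assume a JSX-aware parser, and the `@acme/*` scope is a placeholder for your own package namespace:

```javascript
// eslint.config.js — the commit-time counterpart to the review-agent
// instructions above. Rule options are illustrative; adapt the
// selectors and patterns to your own packages.
export default [
  {
    files: ['apps/**/*.tsx'],
    rules: {
      // Flag raw HTML elements that have a design-system equivalent.
      'no-restricted-syntax': [
        'error',
        {
          selector: "JSXOpeningElement[name.name='button']",
          message: 'Use <Button> from packages/ui instead of raw <button>.',
        },
        {
          selector: "JSXOpeningElement[name.name='input']",
          message: 'Use <Input> from packages/ui instead of raw <input>.',
        },
      ],
      // Block deep imports into another package's internals.
      'no-restricted-imports': [
        'error',
        { patterns: ['@acme/*/src/*'] },
      ],
    },
  },
];
```

The ESLint version is binary and instant; the review-agent version can apply judgment, like recognizing that a raw `<button>` inside the design system package itself is fine.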

Browser Automation Tools

The Playwright MCP and Chrome DevTools MCP are genuinely useful for frontend development — with caveats.

The Playwright MCP spins up a browser, can take screenshots, and can drive interactions. The Chrome DevTools MCP has deeper access to DevTools internals. Both can be pointed at Storybook, which is a powerful combination: you can let an agentic tool build a component, then point it at the Storybook story to visually verify the output.

The limitation: these tools spin up their own browser instance, which means they're logged out of your application. Mock Service Worker handles part of this — if your Storybook uses MSW handlers, the agent can interact with realistic data without needing real auth.
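What MSW buys the agent is deterministic data behind the same `fetch` calls the component makes. The sketch below fakes that interception with a plain monkey-patch of global `fetch`; real MSW does it via a service worker, and the `/api/user` route and payload here are hypothetical:

```typescript
// Routes the mock should answer; everything else falls through.
const handlers: Record<string, unknown> = {
  '/api/user': { id: 'u_1', name: 'Ada Lovelace', plan: 'pro' },
};

const realFetch = globalThis.fetch;
globalThis.fetch = (async (input: string | URL, init?: RequestInit) => {
  const url = String(input);
  if (url in handlers) {
    // Serve mock data: no auth, no network, fully deterministic.
    return new Response(JSON.stringify(handlers[url]), {
      headers: { 'Content-Type': 'application/json' },
    });
  }
  return realFetch(input, init);
}) as typeof fetch;

// A component (or an agent driving one) now sees realistic data:
const user = await fetch('/api/user').then((r) => r.json());
console.log(user.name); // "Ada Lovelace"
```

Because the handler owns the data, the agent can verify a rendered component against a known payload instead of whatever a logged-out API would return.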

Claude in Chrome is different. It's connected to your actual browser session, with your existing auth. You can point it at a page you're debugging and give it context about what you're seeing. The permission model is intentionally cautious — you confirm access repeatedly — because it has access to tabs where you might be logged into sensitive services.

The mental model: Playwright MCP is a controlled environment for building and testing. Claude in Chrome is for debugging against real state.

Why Your Guardrails Matter More With Agents

Agentic tools produce code quickly. The risk in a large codebase isn't that they'll write incorrect code — it's that they'll write code that's correct in isolation but violates architectural decisions the rest of the codebase depends on.

A few examples:

  • An agent styles a raw <button> to look like the design system component instead of using the component itself. Passes tests. Fails the ESLint boundary rule.
  • An agent imports directly from a package's internal path instead of its public API. Works. But now you've created a hidden dependency that breaks when the internal structure changes.
  • An agent creates a new date formatting utility because it's not aware of the one that already exists in packages/shared.
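The second failure mode can also be closed off below ESLint, at the package level: an `exports` map in the package's `package.json` makes deep paths fail to resolve at all. A sketch, with a hypothetical package name and build layout:

```json
{
  "name": "@acme/shared",
  "exports": {
    ".": "./dist/index.js",
    "./package.json": "./package.json"
  }
}
```

With this in place, `import { formatDate } from '@acme/shared/src/format'` is a resolution error rather than a lint warning, which is feedback no agent can route around.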

The ESLint rules catch the first two. The third one is harder — the code review agent is actually better positioned to catch it than ESLint is, because it has broader context about what already exists.

This is why the two are complementary. ESLint enforces the rules the agent should follow. The code review agent catches things ESLint can't express. Both are downstream of the architectural decisions you made and documented as constraints.

The MCP Server Pattern

MCP (Model Context Protocol) servers extend what an agentic tool can access — your Vercel logs, your database, your CI status. The value is collapsing the number of dashboards you have to manually check.

Being able to say "why is CI failing" and have the tool pull GitHub Actions logs, check error patterns, and propose a fix is meaningfully different from manually clicking through five services. The integration removes the coordination overhead.

The tradeoff: each MCP server consumes context window. In a session with many tools connected, the context fills up faster. Being selective about which MCPs to enable for a given task is a practical constraint.

The Bigger Picture

The pattern I keep coming back to: agentic tools work best when there's a system to validate their work. Where they struggle most is when there's no automated feedback — when the only check is whether the code looks right to a human reviewing it.

The architectural tooling in this course is exactly that system. TypeScript catches type errors. ESLint catches boundary violations and convention violations. Tests catch behavioral regressions. CI catches bundle size creep.

An agent operating in a codebase with all of these in place gets automated feedback loops at every level. It writes code, ESLint flags the violation, it fixes it, TypeScript catches the type error, it fixes that. The feedback is immediate and specific.

In a codebase without those guardrails, the agent writes something plausible-looking, no automated check flags it, and the problem lands in production or in a code review. The human has to do all the checking.

The tooling you build for your human engineers is the tooling that makes your AI tooling safe. They're not separate concerns — they're the same concern, applied at different points in the workflow.


Closing Thoughts

Every large codebase is a unique combination of technical decisions, organizational constraints, and historical accidents. Some of what's in this course will apply directly. Some will need adaptation. Some won't be feasible given your team's setup, compliance requirements, or infrastructure constraints.

The goal was never to hand you a recipe. It was to give you a framework for thinking through the options — what the trade-offs are, which problems each approach actually solves, and how to evaluate the choices for your specific situation.

The things that don't change between companies: the questions you ask. Can we cache this? Can we do less of it? Can we enforce this automatically instead of relying on review? Can we make the feedback loop faster? Can we measure this before and after so we know if the change actually helped?

Those questions apply whether you're working on a monolith, a monorepo, three separate frontend applications, or a full module federation setup. The answers differ. The framework for finding them doesn't.
