Engineering Insights Archives - The Codegen Blog
https://codegen.com/blog/category/engineering-insights/
What we’re building, how we’re building it, and what we’re learning along the way.

How AI Development Is Moving from Specialized Agents to Orchestration
https://codegen.com/blog/specialized-agents-to-orchestration/ · Thu, 02 Oct 2025

Specialized coding agents got AI development off the ground. But as general-purpose models like Claude Code and Gemini CLI have matured, the real leverage is shifting up a layer — to orchestration. This is where engineering teams coordinate many agents at once with parallelism, clean workspaces, and human oversight built in.

Louis Knight-Webb, co-founder of Bloop, has seen this evolution first-hand. In a recent AI Hot Takes conversation with Codegen CEO Jay Hack, he traced the path from enterprise code search to COBOL modernization and now to Vibe Kanban, an orchestration platform for running multiple coding agents in parallel. 

Let’s dive into why the future of agentic development depends less on specialized bots and more on the systems that direct them.

From vertical tools to general-purpose agents

When AI coding first caught on, most solutions were vertical: “coding agents for X,” like code search or a one-off COBOL modernization pipeline. These tools proved that autonomous development was possible and gave teams a safe place to experiment.

Louis described the progression inside Bloop: 

“We started with code search that became enterprise code search… One of our customers loaded in a COBOL codebase… we realized that a lot of organizations wanted to modernize COBOL and spent 18 months working on that problem and building a fully automated end-to-end pipeline coding agent experience.”

But general-purpose models quickly overtook those narrow solutions. Reinforcement learning loops, bigger context windows, and richer APIs meant that a single agent could now handle what previously required bespoke design. As Louis put it, they realized “coding agents were moving in a way that really benefited more general purpose approaches rather than more specialized coding agents.”

Why orchestration emerged as the next layer

Once general agents became capable, the next bottleneck quickly became coordination. Running one agent is straightforward; running dozens efficiently is not. Without the right system, teams waste time watching logs and waiting for sequential runs to finish.

Vibe Kanban was built to solve this. “It’s just basically a way to orchestrate Claude Code, Gemini CLI, AMP and other coding agents at scale,” Louis explained. Instead of queuing tasks one by one, Vibe Kanban manages parallel execution with proper sandboxing and a workflow designed for fast-moving projects.

This shift is bigger than a single product. As agents complete tasks in minutes instead of days, orchestration concerns like task management, isolation, and reproducibility become the new foundation for serious software development.

The orchestration playbook

An effective orchestration layer focuses on engineering fundamentals:

  • Parallelism with clean state. Each task runs in its own git worktree, ensuring deterministic builds and eliminating cross-task side effects (see the sketch below).
  • Automated setup and cleanup. Environments are built and torn down predictably, so dependencies don’t leak between runs.
  • Integrated boards. When tasks complete rapidly, project management has to fuse code, logs, and live previews into a single view.

This helps teams scale AI development to hundreds of concurrent tasks without sacrificing traceability or quality.
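
To make the clean-state point concrete, here is a minimal sketch of how an orchestrator might hand each task its own git worktree. The directory layout and branch naming are illustrative assumptions, not how Vibe Kanban or Codegen actually implement it.

# Sketch: one isolated git worktree per task (illustrative, not any product's actual implementation)
import subprocess
from pathlib import Path

def create_task_worktree(repo: Path, task_id: str, base: str = "main") -> Path:
    workdir = repo.parent / "worktrees" / task_id
    # Fresh branch + fresh checkout, isolated from every other running task
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", f"task/{task_id}", str(workdir), base],
        check=True,
    )
    return workdir

def remove_task_worktree(repo: Path, workdir: Path) -> None:
    # Tear the checkout down so nothing leaks into the next run
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "remove", "--force", str(workdir)],
        check=True,
    )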

Keeping humans in the loop

Even with orchestration, some decisions can’t be automated. Louis was direct about this: 

“The human element of review is very difficult to replace… the bits around the edges like choosing what work to even get done… I don’t think it is going to go away anytime soon.”

Scoping work, making architectural calls, and deciding when a feature is production-ready remain human responsibilities. Good orchestration respects that reality by surfacing the right context and making review and approval fast and reliable.

Designing the right layer

The key architectural question is what belongs in the orchestration layer versus inside the model itself. Models handle code generation and refactoring. Orchestration governs processes: breaking projects into tasks, preparing environments, managing dependencies, and structuring review workflows.

The line isn’t always obvious, but here are a few guidelines to help:

  • Keep process in orchestration. Breaking down projects into tasks, preparing clean environments, enforcing dependency checks, and coordinating reviews all belong at this layer.
  • Let the model focus on code. Generating, refactoring, or testing code is where large language models excel; avoid embedding these directly in orchestration logic.
  • Use models as subroutines, not supervisors. Treat agents as workers that execute well-defined steps, while orchestration handles scheduling and governance.

By separating responsibilities, teams can scale safely and adjust quickly as model capabilities evolve.
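
A minimal sketch of that separation, with `prepare_environment` and `run_agent` as hypothetical stand-ins rather than any real Codegen or agent API: orchestration decides what runs, where, and in what order, while each agent only ever sees one bounded task.

# Sketch: orchestration as the supervisor, agents as subroutines (helper bodies are hypothetical)
from concurrent.futures import ThreadPoolExecutor

def prepare_environment(task: dict) -> str:
    ...  # assumed helper: clean workspace, pinned dependencies; returns a workspace path

def run_agent(task: dict, workspace: str) -> str:
    ...  # assumed helper: hand one bounded task to a coding agent; returns a branch or PR link

def orchestrate(tasks: list[dict], max_parallel: int = 8) -> list[str]:
    # The orchestration layer decides scheduling and governance;
    # the model only ever sees one well-defined task at a time.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda t: run_agent(t, prepare_environment(t)), tasks))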

How Codegen helps

Codegen was built for this new layer.

By focusing on orchestration instead of one-off agents, Codegen gives engineering teams a durable foundation, even as the underlying models continue to evolve.

The bottom line

The story of AI development is changing from specialized coding agents to orchestration. Vertical tools proved what was possible. General-purpose agents made those tools obsolete. Now the opportunity, and the hard engineering work, is in coordinating agents effectively and keeping humans in control of the process.

For engineering leaders, platform teams, and founders building on Claude Code, Gemini CLI, or the next generation of agents, investing in orchestration is no longer optional. It’s how you scale AI-driven development with the reliability and transparency modern software demands.

Ready to get started? Try Codegen for free or reach out to our team for a demo.

Why Code Review Will Determine Who Wins in the AI Era
https://codegen.com/blog/code-review-bottleneck/ · Tue, 09 Sep 2025

For decades, software engineering was defined by writing code. But that balance has shifted. With AI code agents producing high-quality output in seconds, the bottleneck isn’t generation anymore; it’s everything around it: reviewing, merging, testing, and governing the changes.

We’ve entered a new frontier where engineers spend less time typing and more time orchestrating. Writing code has become the easy part; making sure that code is correct, compliant, and production-ready is where the real challenge now lies.

Codegen CEO Jay Hack joined Merrill Lutsky, co-founder and CEO of Graphite, on AI Hot Takes to dig into why code review is the new bottleneck, and why the way teams rethink their outer loop will decide who ships and who stalls.

Reviews are where teams are getting stuck

The numbers show developers can generate code at a blistering pace. GitHub reported in 2023 that developers using Copilot completed tasks 55.8% faster than control groups, while Amazon notes that “developers report they spend an average of just one hour per day coding.”

But the outer loop hasn’t kept up. An analysis of ~1,000,000 PRs by LinearB shows that cycle time is dominated by review latency, with PRs sitting idle for an average of 4+ days before a reviewer even looks at them. In other words: we can now generate 10x the code, but we can’t yet review or ship it 10x faster.

Lutsky noted:

“If we have these 10x engineering agents, that just makes the problem of code review 10x more important, and more painful for companies that are using them.”

Is stacking how we keep the pace?

One of the most powerful responses to this bottleneck is stacked pull requests. Instead of submitting one massive PR, stacking breaks features into small, independently reviewable increments.

This isn’t new. Facebook built Phabricator to support stacked diffs across thousands of engineers, and Google’s Critique adopted similar practices. The reason was simple: smaller diffs are easier to review, unblock dependent work, and reduce the risk of merge conflicts.

Lutsky stated:

“Stacking was invented for orgs with thousands of engineers, but it’s suddenly relevant to every team now that agents can generate code at the scale of those orgs.”

Now, in the agents era, stacking feels less optional and more like a requirement. Agents generate code in bursts, and humans can’t keep up if the output lands as giant, monolithic PRs. Stacking makes agent output digestible, verifiable, and mergeable.
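
As an illustration only (the branch names and the three-way split are invented for this sketch), a stack is just a chain of small branches, each based on the previous one, so every increment can be reviewed and merged on its own.

# Sketch: a three-PR stack built as a chain of small branches (branch names are invented)
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

git("switch", "-c", "feat/schema", "main")      # PR 1: schema migration only
# ...commit the schema change here...
git("switch", "-c", "feat/api", "feat/schema")  # PR 2: API layer, based on PR 1
# ...commit the endpoint change here...
git("switch", "-c", "feat/ui", "feat/api")      # PR 3: UI, based on PR 2
# Each branch becomes its own small PR; reviewers see one increment at a time,
# and merging PR 1 immediately unblocks the rest of the stack.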

Solving AI problems with AI solutions

There’s a temptation to see AI review as a replacement for rule-based automation, but the reality is that both are necessary to keep the pace.

Deterministic systems such as branch protection, CI pipelines, and merge queues enforce non-negotiables. They ensure that every change passes tests, follows style guides, and respects permission boundaries. But they’re limited. They can’t reason about whether a design decision makes sense.

Agentic review fills that gap. Context-aware agents can scan a PR in seconds, check for subtle logic errors, and recommend fixes that a human might miss, especially in unfamiliar parts of the codebase. Studies suggest AI already outperforms humans at spotting certain categories of bugs.

Graphite is already combining merge queues with its review agent, Diamond. Lutsky noted:

“Combining those kinds of deterministic and more traditional methods with agentic review, and having a code review companion…our unique take is that you need both of those combined all into one platform in order to properly handle the volume of code that we’re seeing generated today.”

Deterministic controls guarantee baseline standards, and agentic reviewers accelerate semantic checks. The result is faster throughput without sacrificing safety.
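
A rough sketch of that combination, with the gate functions passed in as hypothetical stand-ins rather than Graphite’s or Codegen’s real APIs: deterministic checks run first as hard requirements, and the agentic pass adds semantic findings on top.

# Sketch: deterministic gates first, agentic review second (all inputs are hypothetical stand-ins)
def ready_to_merge(pr, ci_passed, style_ok, agent_review) -> bool:
    # Non-negotiables enforced deterministically: tests, style, protected-branch rules
    if not (ci_passed(pr) and style_ok(pr)):
        return False
    # Semantic layer: an agent scans the diff for logic errors a linter cannot see
    findings = agent_review(pr.diff)
    return not any(f.severity == "blocking" for f in findings)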

Optimizing the outer loop

We’re in the middle of an exciting shift. Code generation is fast and plentiful. The bottlenecks are now review, orchestration, and governance — the outer loop of development.

Optimizing this outer loop requires:

  • Making stacking the default, so changes are digestible.
  • Blending deterministic rules with agentic review for speed and safety.
  • Building review interfaces that tell a story and scale to agent-level throughput.
  • Treating AI metadata as compliance-critical data, not an afterthought.
  • Meeting developers where they work, whether in Slack, GitHub, or natural language interfaces.

The message for teams of all sizes is clear. Code is no longer the bottleneck. Review is. The winners in this new era will be the teams that redesign their workflows around that fact.

Want to check out the full conversation? Watch Jay Hack and Merrill Lutsky discuss how AI code generation is breaking traditional development workflows, and why code review has become the real bottleneck on AI Hot Takes.

If you’re still stuck in PR purgatory, it’s time to try Codegen. Free to start, or schedule a demo if you want receipts.

SWE Agents are Better with Codemods
https://codegen.com/blog/swe-agents-are-better-with-codemods/ · Sat, 01 Mar 2025

Coding assistants like Cursor have brought us into a new era of programming.
But there’s a class of programming tasks that remain outside their reach: large-scale, systematic modifications across large codebases. You wouldn’t ask an AI to delete all your dead code or reorganize your entire component hierarchy—the tooling just isn’t there.

That’s where codemods come in.

A codemod is a program that operates on a codebase, and when you give an AI agent the ability to write and execute codemods, these platform-level tasks fall below the high-water mark of AI capabilities: they become achievable.

Here’s a real example:
We asked Devin (an autonomous SWE agent) to “delete all dead code” from our codebase. Instead of making hundreds of individual edits, Devin wrote and debugged a program that systematically removed unused code while handling edge cases like tests, decorators, and indirect references.

This codemod modifies over 40 files, correctly removes the old code, and passes lint and tests.

What made this work?

Devin operates like a state machine: write a codemod → run it through the linter → analyze failures → refine.

Each iteration adds handling for new edge cases until the codemod successfully transforms the codebase.

This mirrors the same cycle developers use for large-scale refactors—just automated.
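
That cycle is easy to picture as code. The following is a paraphrase of the loop described above, not Devin’s actual implementation; `write_codemod`, `run_linter_and_tests`, `apply`, and `refine` are hypothetical helpers.

# Sketch of the write → lint → analyze → refine cycle (all helpers are hypothetical)
codemod = write_codemod(task="delete all dead code")
for attempt in range(MAX_ITERATIONS):
    result = run_linter_and_tests(apply(codemod, codebase))
    if result.ok:
        break                                        # codebase transformed, checks green
    codemod = refine(codemod, result.failures)       # add handling for the newly surfaced edge cases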

The best part? You don’t have to blindly trust the AI.

The codemod is a program you can review, run, and verify with linter and test output. You can edit it to add exceptions or improve coverage—much better than reviewing hundreds of individual diffs.


Try it yourself

Want to try this with your own Devin instance?

Install the Codegen CLI:

uv tool install codegen --python 3.13

Then use the Dead Code Deletion Prompt linked from this post.


Supported Codemods with Codegen

  • Convert Promises to async/await:
    Automatically convert Promise chains to async/await syntax across your codebase.
  • Organize Codebase:
    Move and reorganize code safely with automatic import and dependency handling.
  • Modernize React:
    Convert class components to hooks, standardize props, and organize components.
  • Migrate Tests to Pytest:
    Convert unittest test suites to modern pytest style.
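
For a flavor of what such a codemod can look like, here is a minimal dead-code sketch in the style of the examples later in this archive; treating functions as exposing `usages` and `remove()` is an assumption made for illustration.

# Sketch: a dead-code codemod (assumes functions expose `usages` and `remove()`)
for function in codebase.functions:
    if len(function.usages) == 0 and not function.name.startswith("test_"):
        function.remove()   # everything unused outside of tests gets deleted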

We’d love to hear how it works for you.
Join our community and share your experience developing codemods with Devin or other code assistants.

Act via Code
https://codegen.com/blog/act-via-code/ · Thu, 02 Jan 2025

Two and a half years since the launch of the GPT-3 API, code assistants have emerged as potentially the premier use case of LLMs. The rapid adoption of AI-powered IDEs and prototype builders isn’t surprising—code is structured, deterministic, and rich with patterns, making it an ideal domain for machine learning. Developers actively working with tools like Cursor (myself included) have an exhilarating yet uncertain sense that the field of software engineering is approaching an inflection point.

Yet there’s a striking gap between understanding and action for today’s code assistants. When provided proper context, frontier LLMs can analyze massive enterprise codebases and propose practical paths towards sophisticated, large-scale improvements. But with modern AI assistants, implementing changes that impact more than a small set of files is fundamentally infeasible. The good news is that for focused, file-level changes, we’ve found real success: AI-powered IDEs (Windsurf, Cursor) are transforming how developers write and review code, while chat-based assistants are revolutionizing how we bootstrap and prototype new applications (via tools like v0, lovable.dev, and bolt.new).

However, there’s a whole class of critical engineering tasks that remain out of reach—tasks that are fundamentally programmatic and deal with codebase structure at scale. Much of modern engineering effort is directed towards eliminating tech debt, managing migrations, analyzing dependency graphs, enforcing type coverage, and other global concerns. Today’s AI assistants can propose solutions but lack the mechanisms to execute them. The intelligence is there, but it’s trapped in your IDE’s text completion window.

The bottleneck isn’t intelligence—it’s tooling. The solution is giving AI systems the ability to programmatically interact with codebases through code execution environments. These environments are the most expressive tools we can offer agents, enabling composition, abstraction, and systematic manipulation of complex systems. By combining code execution with custom APIs for large-scale operations, we unlock new high-value use cases.

Beating Minecraft with Code Execution

In mid-2023, a research project called Voyager took on Minecraft, performing several multiples better than prior SOTA. This success wasn’t about raw intelligence—it was about providing a more expressive action space: code.

GPT-4, when allowed to write and execute JavaScript programs through a clean API, could craft high-level behaviors and reuse learned “action programs” across tasks. This enabled skill accumulation, experience recall, and systematic reuse.

“We opt to use code as the action space instead of low-level motor commands because programs can naturally represent temporally extended and compositional actions…”

Code is an Ideal Action Space

Letting AI act through code rather than atomic commands yields a step change in capability. In software engineering, this means expressing assistant behavior through code that manipulates codebases.

# Grep-like filtering via for loops and if statements; then act on each match
for function in codebase.functions:
    if 'Page' in function.name:
        function.move_to_file('/pages/' + function.name + '.tsx')

This paradigm brings multiple advantages:

  • API-Driven Extensibility: Agents can use any operation exposed via a clean API.
  • Programmatic Efficiency: Batch operations across large codebases are fast and systematic.
  • Composability: Agents can chain simple operations to form more complex ones.
  • Constrained Action Space: APIs act as guardrails, preventing invalid actions.
  • Objective Feedback: Errors provide clear debugging signals.
  • Natural Collaboration: Code is human-readable and reviewable.
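
Composability in practice might look like the following sketch, which chains two primitives into one named operation; `rename` on functions and the selection rule are assumptions made for illustration, mirroring the component example below.

# Sketch: composing two primitives into a reusable, higher-level operation
def promote_to_page(function):
    function.rename(function.name + 'Page')                         # assumed, mirroring component.rename
    function.move_to_file('/pages/' + function.name + '.tsx')

for function in codebase.functions:
    if function.name.endswith('View'):   # illustrative selection rule
        promote_to_page(function)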

Code Manipulation Programs

To match how developers think about code, agents need high-level APIs, not raw AST surgery. We’re building a framework that reflects actual engineering intuition and abstracts over common edge cases, while preserving correctness.

# Access to high-level semantic operations
for component in codebase.jsx_components:
    if len(component.usages) == 0:
        component.rename(component.name + 'Page')

This isn’t string substitution. The framework understands structure: React hierarchies, type systems, usage graphs. It enables both rapid analysis and safe edits.

We’re also extending this interface to systems outside the repo: AWS, Datadog, and CI/CD platforms. This is the path to autonomous software engineering.

Codegen is now OSS

We’re excited to release Codegen as open source under Apache 2.0 and build out this vision with the developer community. Schedule a demo or join our Slack community to share ideas and feedback.

— Jay Hack, Founder

Your AI Is Writing Bad Docs Because It Lacks Context
https://codegen.com/blog/your-ai-is-writing-bad-docs-because-it-lacks-context/ · Thu, 21 Nov 2024

It’s a truth universally acknowledged: every engineering team wrestles with the problem of documentation.

  • “Our documentation sucks!”
  • “But our codebase changes so fast—why waste time documenting it?”
  • <irritating acute voice> “Actually, my code is self-documenting.” </irritating acute voice>

So it goes.

Obviously, AI can help document your code. But only if you use it strategically.

At one extreme, you hand over your GitHub repo to an LLM and find that it’s added a long docstring on top of every function, full of the sort of vapid, vague text that LLMs love to generate.

At the other extreme, the status quo: documentation that’s sparse, time-consuming to write, and quickly outdated.

Ideally we want documentation that’s both useful and maintainable. And, for the first time in history, you can accomplish this effortlessly—by leveraging a combination of AI and static code analysis tools.

Here’s how AI and static analysis can transform your documentation workflow.

1. Cut out the fluff

If you just feed a function into an LLM and ask it to write some documentation, the LLM will probably generate something very annoying to read.

A nightmare example of a ChatGPT-generated docstring that bloats the codebase with useless fluff.

Good documentation in a codebase should be specific. It should highlight any weird exceptions or edge cases about the function or module. It should contain examples of how it is used, if and only if it’s not obvious.

The thing is: LLMs are capable of generating actual good documentation. You just need to give it enough context about your function or module, and how it’s used in the codebase.

That’s where static analysis comes in. Tools like Codegen analyze your codebase first to understand how functions and modules depend on each other. Then, Codegen can use bi-directional usages to inform the documentation: the prompt for the LLM includes the places where the function being documented is called, as well as the chain of functions it calls. That allows the LLM to produce a more informed docstring than it would from the source code alone.

A Codegen-generated graph of the report_install_progress function’s bidirectional usages. In yellow are all functions that call report_install_progress; in green are all functions that it calls. Given this context, the LLM can understand the function much better. As linguist J. R. Firth said: “You shall know a word [or a function!] by the company it keeps.”
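
In sketch form, the idea is to put both directions of the call graph into the prompt. The `usages` and `dependencies` attribute names below are assumptions for illustration, not a documented API.

# Sketch: build an LLM prompt from bi-directional usages (attribute names are assumed)
def build_doc_prompt(function) -> str:
    callers = [u.source for u in function.usages]        # everything that calls this function
    callees = [d.source for d in function.dependencies]  # everything this function calls
    return (
        "Write a concise, specific docstring for:\n" + function.source
        + "\n\nIt is called by:\n" + "\n".join(callers)
        + "\n\nIt calls:\n" + "\n".join(callees)
    )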

With the help of some static analysis, Codegen can give an LLM the context it needs to generate helpful, no-BS documentation.

An example of a context-aware docstring, written by Codegen’s AI assistant, in the Codegen source code.

So: static analysis is pretty good for helping AI document functions and modules.

But the best documentation—especially for complex services, modules, or even large PRs—should provide context that isn’t captured in the code alone.

As many engineers have noted, it is not useful to simply feed ChatGPT a function or a diff and make it generate docs.

A future evolution of Codegen might feed the LLM even more context by integrating data sources like Slack threads or Notion design docs.

2. Be strategic about the level of detail

Not every function deserves a detailed docstring. You should prioritize writing detailed, comprehensive documentation only in the areas where it delivers the most value.

Examples:

  • Code that is touched by multiple teams — e.g. backend endpoints that are called by frontend developers.
  • External-facing APIs or SDKs where clear explanations are critical for consumers.

Again through static analysis, tools like Codegen can identify which areas of the codebase are most trafficked and which functions are used outside their module (versus only within it), then add extra detail only to those key areas.
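
In sketch form (again with `usages` and a per-usage `module` attribute as assumptions), that prioritization might look like:

# Sketch: surface heavily-trafficked, externally-used functions first (attribute names are assumed)
hotspots = [
    f for f in codebase.functions
    if any(u.module != f.module for u in f.usages)        # used outside its own module
]
hotspots.sort(key=lambda f: len(f.usages), reverse=True)  # most-called functions get the most detail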

3. Dynamically update documentation

Great, now you have all this highly-nuanced, context-aware documentation… but… what do you do when the code inevitably changes? In the example above: maybe you modify the format of the string that codemodFactToString returns. Are you really going to check the docstrings for all 12 functions that reference codemodFactToString, to make sure they’re still up-to-date?

Instead, with a tool like Codegen, you can imagine creating a lint rule that has the AI update all relevant documentation every time a PR is created, so that your docs stay in lockstep with your code.
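
A sketch of what such a rule could check on each PR, with the set of changed function names assumed to be extracted from the diff and the `docstring` attribute as an illustrative name:

# Sketch: flag docstrings that mention a function changed in this PR (attribute names are assumed)
def stale_docstrings(codebase, changed_names: set[str]):
    for function in codebase.functions:
        doc = function.docstring or ""
        mentions = {name for name in changed_names if name != function.name and name in doc}
        if mentions:
            yield function, mentions   # candidates for an AI-assisted docstring refresh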

Looking ahead

Good documentation will be increasingly important as humans and AI agents collaborate on writing code.

In a pre-AI world, it was still feasible for a few engineers to intimately understand a codebase without needing much documentation. But as we increasingly bring in AI agents to help write parts of the code, it won’t be so easy anymore to keep track of exactly what’s going on. In a world where humans and AIs collaborate on code, well-written inline documentation will be crucial—not only to help humans navigate and remember the intricate details of a codebase, but also to provide helpful additional context to AI assistants as they debug and generate code.

And, as AI tools help us ship more and more quickly, it’ll be even more important to ensure that documentation evolves with the codebase.

By combining AI with code analysis tools, we can finally solve the age-old dilemma between documenting well and shipping fast.

If this sounds cool, request to try Codegen!
