Investigating (Yet Another) AI-Enabled Paradigm Shift in Development Workflows

If modern agent harnesses can orchestrate multi-step workflows automatically, do we still need to build multi-agent systems from scratch?

TL;DR:

Agent harnesses like Claude Code dramatically simplify agent orchestration, so we explored an intriguing idea: instead of writing custom multi-agent applications, write agent skills and plug them into existing harnesses. Our experiment succeeded as a PoC, but a production deployment would run into the limitations of existing agent harnesses.

The Paradigm Shift

The recent hype around Opus 4.6 and agent harnesses1 (e.g. Claude Code, Cursor, and Codex) has brought about a massive spike in purely AI-based development, so much so that some companies are even experimenting with software in which not a single line of code is human-written. Cursor’s Cloud Agents, running on a cloud VM, can now independently work on your repo and send a video demo to the user for review. These powerful agents are now considered general-purpose workers for arbitrary tasks, no longer just code generators. This is especially facilitated by the open Agent Skills standard2 introduced by Anthropic and now widely adopted across agent harnesses. Now you can add a skill to an agent harness and guide an agent to do almost anything you previously had to write a custom application for.
This rapid shift brought a radical thought to our minds: 
If these general agents are really so powerful that they can be guided to carry out complex multi-step workflows with a bunch of skills files, does it still make sense to build multi-agent applications where we write the orchestration logic ourselves? 
Typically, multi-agent applications are written using agent runtimes such as LangGraph or Autogen, and involve writing separate agents for separate jobs and connecting them using (usually) complex orchestration logic.
Compare all that application-building effort with simply providing Claude Code, Cursor, or one of their contemporaries (which I will henceforth refer to as the agent harness) with a bunch of skills - usually one per unique task in your workflow - and watching it go. In one swoop, we remove all the complexity of orchestration (and anyone who has tried to write a really good planner agent, or dealt with the pain of multi-agent synchronization, knows how significant that complexity can be).
The agent harness implements most of the orchestration for you - planning the task, spawning (and synchronizing) subagents, parallelizing some of the subagents to save time, passing the necessary context to the LLM in each call (see progressive disclosure), reflecting, retrying and correcting errors. The developer’s job is to create good skills files - which still takes work, but it’s a whole lot less work than writing a full agentic backend from scratch.
So the question is:
Can we replace all our multi-step agentic backends with skills to be executed inside an agent harness? 
Excitingly, to some extent, we can.
We tested this idea practically. Here is precisely what we tried, and what we learned.

Putting Claude Code to The Test

Our probe tested whether an agent harness, specifically a CLI-based tool, could realistically replace a traditional multi-agent pipeline for an end-to-end workflow.

Our Use Case

We took agentic content production as a representative use case. The steps:
1. The user inputs a command, e.g. “Publish an article about reranking techniques in RAG systems”
2. The pipeline runs: Research → Writing → Review → Revise → Publish (on Notion)
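For reference, the orchestration logic that a traditional multi-agent application would implement by hand can be sketched in plain Python. This is a toy illustration, not code from any framework: each agent is stubbed out as a function that a real system would replace with an LLM call.

```python
# Hand-rolled orchestration sketch: the agent functions are hypothetical stubs.

def research(topic: str) -> str:
    return f"research brief on {topic}"

def write(brief: str) -> str:
    return f"draft based on: {brief}"

def review(draft: str) -> tuple[str, str]:
    # Returns (verdict, feedback); verdict is APPROVED or NEEDS_REVISION.
    return ("NEEDS_REVISION", "tighten the introduction")

def revise(draft: str, feedback: str) -> str:
    return f"{draft} (revised per: {feedback})"

def publish(draft: str) -> str:
    return f"published: {draft}"

def run_pipeline(topic: str, max_revisions: int = 2) -> str:
    # The feedback loop below is exactly the kind of sequencing, retrying,
    # and state-passing logic a harness would otherwise handle for us.
    draft = write(research(topic))
    for _ in range(max_revisions):
        verdict, feedback = review(draft)
        if verdict == "APPROVED":
            break
        draft = revise(draft, feedback)
    return publish(draft)
```

Even in this trivialized form, the feedback loop and state-passing between steps are all developer responsibility; the skills-in-a-harness approach moves precisely this logic out of application code.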

What We Learned

Our probe surfaced insights in two categories: (a) what agent harnesses do well, and (b) what breaks when you think about production systems. 

What Agent Harnesses Do Well:

1. Skills inside a harness can replace traditional multi-agent applications for many workflows
In our experiment, orchestration complexity was handled almost perfectly by the harness. Previously, we would have written all the orchestration code of this pipeline, for example communication between agents, managing feedback loops, writing the planning logic, writing sub-agent spawning logic and so on. Our experiment showed that it is very much possible to simply encode each agent’s logic inside a skill, outsource the orchestration to an agent harness and watch the magic unfold.
2. Claude Code allows sub-agent spawning, which enables parallelism very easily.
We simply had to specify in the skill file (a trimmed snippet from our research skill file is shown below) that subagents should be used to research the subtopics simultaneously. It obediently did so, which significantly reduced latency, and spared us the synchronization struggles of implementing this ourselves.
## Phase 1 — Subtopic Decomposition & Approach Assignment
When given a topic:
1. **Generate 4-6 subtopics** that collectively cover the topic from different angles.
2. **Assign 1-3 research approaches** to each subtopic based on what sources are most relevant: web search, code & repository search, academic & papers etc.
3. **Build the spawn plan** — a flat list of `(subtopic, approach)` pairs.
This is the complete set of sub-agents that will be spawned in Phase 2.
**Example** — Topic: "Impact of AI on Software Engineering"
```
Subtopics & Approach Assignments:
1. AI-assisted code generation tools → [web search, code search]
2. Developer productivity and job roles → [web search, video & talks]
3. AI in testing, debugging, code review → [code search, official docs]
Spawn plan (6 sub-agents):
[1-web, 1-code, 2-web, 2-video, 3-code, 3-docs]
```
---
## Phase 2 — Parallel Sub-Agent Execution
**CRITICAL**: Spawn ALL sub-agents from the spawn plan simultaneously in one parallel batch.
Every `(subtopic, approach)` pair runs as an independent sub-agent — no intermediate layers.
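The spawn-plan pattern from the skill file can be mimicked outside a harness with a thread pool; here is a minimal Python analogue, with the sub-agent stubbed out (the function names are illustrative, not part of any harness API):

```python
from concurrent.futures import ThreadPoolExecutor

def research_subtopic(subtopic: str, approach: str) -> str:
    # Stand-in for a sub-agent; a real one would search the web, code, etc.
    return f"findings on '{subtopic}' via {approach}"

def run_spawn_plan(assignments: dict[str, list[str]]) -> list[str]:
    # Flatten {subtopic: [approaches]} into a flat list of (subtopic, approach)
    # pairs, then run every pair in one parallel batch - no intermediate layers.
    spawn_plan = [(sub, app) for sub, apps in assignments.items() for app in apps]
    with ThreadPoolExecutor(max_workers=len(spawn_plan)) as pool:
        futures = [pool.submit(research_subtopic, s, a) for s, a in spawn_plan]
        return [f.result() for f in futures]
```

The point of the comparison: the harness gives you the flattening, spawning, and result aggregation for free, whereas a custom backend owns all of it.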
3. Claude Code’s “slash commands” allow deterministic workflows.
“Slash commands” are a thing of beauty. They provide a convenient way to define structured workflows that invoke skills in sequence. Workflows that benefit from deterministic orchestration don’t need to be left to an LLM’s whims. For instance, we defined a command /pipeline (by creating a file called pipeline.md) that instructs the harness to invoke the research, writing, review, revise and publish skills in sequence. The pipeline.md file is shown below.
.claude/commands/pipeline.md
---
description: Run the full content pipeline (research → write → review → revise → publish)
---
Run the full content pipeline for:
$ARGUMENTS
Execute in order:
1. Researcher skill → produce research brief
2. Writer skill → draft content using research brief
3. Reviewer skill → review the draft, produce scored feedback with APPROVED/NEEDS_REVISION verdict
4. If NEEDS_REVISION: pass the reviewer's suggestions AND the original draft back to Writer skill to produce a revised draft
5. Publisher skill → publish the final approved draft to Notion
Pass context between each step.
Note: Slash commands are present in Claude Code and Gemini CLI but not necessarily in all agent harnesses.
4. Choice of harness matters, not just the skill definition itself. 
We tried both Claude Code and Gemini CLI with the same skills. One problem we observed was that Gemini CLI does not offer a straightforward tool to spawn sub-agents on demand, which limits the capability of parallelizing some parts of the workflow. The bottom line is that not all harnesses are created equal, and it is good to do some tinkering before choosing the one that fits your requirements.
Having explored these new possibilities, we then went one level deeper and thought about the real production stuff: fault tolerance, load balancing, autoscaling, and observability.

Production Considerations:

1. Integration:
One thing becomes immediately obvious when you start thinking about shifting production applications to this paradigm: CLI-based agent harnesses are intended to be developer tools, not backend services. So borrowing Claude Code’s orchestration brain for a production application isn’t quite practical. You would have to wrap an instance of Claude Code in a service layer and hook that into your backend to route requests to and from it, which is rather awkward and clunky.
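To make the awkwardness concrete, here is a minimal sketch of such a wrapper: shelling out to a CLI harness from backend code. The `("claude", "-p")` default is an assumption about the harness's non-interactive invocation; flags vary by tool, so the command is parameterized.

```python
import subprocess

def run_harness(prompt: str, command: tuple[str, ...] = ("claude", "-p")) -> str:
    """Hypothetical wrapper: run one harness invocation as a subprocess.

    `command` is whatever non-interactive entry point the chosen harness
    exposes; the default here is an assumption, not a documented contract.
    """
    result = subprocess.run(
        [*command, prompt],
        capture_output=True,
        text=True,
        timeout=600,   # long-running agent workflows need generous timeouts
        check=True,    # surface harness failures as exceptions
    )
    return result.stdout.strip()
```

Everything a real service needs on top of this (request routing, concurrency limits, streaming output back to clients) is exactly the glue code this paradigm was supposed to eliminate.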
Intriguingly, this problem seems to have been addressed by Langchain’s Deep Agents, which is also a harness, but provides an SDK option so it can be integrated into application backends. That sounds promising, especially if its orchestration layer is as capable as that of Claude Code. We will describe our experience with Deep Agents in a future post.
2. Load balancing and scaling:
Load balancing and scaling are basic production requirements we almost take for granted when building backend services, because established infrastructure (load balancers, autoscaling workers, job queues) supports them out of the box. CLI-based agent harnesses, however, are designed to execute workflows interactively rather than as backend services. To scale them reliably, we would first need to wrap the harness inside a service layer that integrates with this infrastructure, which adds complexity and reduces the simplicity of the paradigm.
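As a toy illustration of that service layer, the sketch below drains a job queue with a fixed pool of workers, each worker standing in for one harness instance. A real deployment would use a message broker and autoscaling instead of in-process threads; the handler is a stub.

```python
import queue
import threading

def serve(jobs: list[str], handler, workers: int = 4) -> dict[str, str]:
    # Load requests onto a queue, then let a fixed worker pool drain it.
    q = queue.Queue()
    for job in jobs:
        q.put(job)

    results: dict[str, str] = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return          # queue drained; worker exits
            out = handler(job)  # stand-in for one harness invocation
            with lock:
                results[job] = out

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```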
3. Fault tolerance:
This, too, is a modern luxury often taken for granted, because the infrastructure smoothly handles it for us. With a CLI-based agent harness, if something fails after the run has been going for a while, the user essentially loses all progress and needs to start over (by sending their prompt again). For long-running tasks this can introduce a lot of inefficiency.
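What is missing is step-level checkpointing. A sketch of the idea, assuming pipeline state can be serialized to JSON (the step names and outputs are stand-ins for real skill results):

```python
import json
from pathlib import Path

# Hypothetical step names mirroring our content pipeline.
STEPS = ["research", "write", "review", "publish"]

def run_with_checkpoints(topic: str, state_file: Path) -> dict:
    # Resume from the saved state if a previous run crashed partway through.
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    for step in STEPS:
        if step in state:
            continue  # this step already completed in a previous run
        state[step] = f"{step} output for {topic}"  # stand-in for real work
        state_file.write_text(json.dumps(state))    # checkpoint after each step
    return state
```

A failed run restarted with the same state file skips completed steps instead of redoing them, which is the behavior current CLI harnesses do not give you out of the box.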
4. Observability:
Observability was limited in our setup. We were able to capture some execution events through lightweight logging scripts triggered around skill execution, but this is not the same as integrating a dedicated observability stack. In practice, this gave us only partial visibility into what the harness was doing, making production-grade tracing, debugging, and monitoring difficult.
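Our lightweight logging looked roughly like the following sketch: a decorator that emits one JSON event per skill invocation. The field names are our own, not from any observability standard, and a real stack would add trace IDs, spans, and token counts.

```python
import functools
import json
import time

def traced(skill_name: str, sink: list):
    """Wrap a skill-like function and append one JSON event per call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            except Exception:
                status = "error"
                raise
            finally:
                # One event per invocation, success or failure.
                sink.append(json.dumps({
                    "skill": skill_name,
                    "status": status,
                    "duration_s": round(time.time() - start, 3),
                }))
        return wrapper
    return decorator
```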

Architectural Patterns We Surfaced

We want to separately call out two useful architectural patterns we found during this probe:
1. Separate capabilities from workflows.
Skills work best when they describe what a capability does, not when it should be used. During our early experiments we tried to encode workflow logic directly inside skill descriptions (e.g. telling the research skill it should always run first). This turned out to be a poor pattern, and one that goes against Anthropic’s official recommendations. A much cleaner approach is to let skills represent capabilities — research, write, review, publish — while using commands to define the workflow that stitches them together. This separation keeps skills reusable and allows different workflows to reuse the same capabilities without hardcoding sequencing logic.
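As an illustration of this separation, a skill’s SKILL.md frontmatter can describe the capability alone and stay silent about sequencing. The fields below follow the frontmatter structure of Anthropic’s skill format; the content is a hypothetical example, not our exact file:

```markdown
---
name: researcher
description: Produce a structured research brief on a given topic using web,
  code, and academic sources. Describes only the capability; when this skill
  runs is decided by the workflow command, not by the skill itself.
---
```

Sequencing then lives entirely in command files like the pipeline.md shown earlier, which can be swapped or rearranged without touching any skill.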
2. Sub-agents are a simple parallelization primitive.
One particularly interesting discovery was how easily the harness could parallelize work using sub-agents. In our case, the research skill was instructed to split a topic into subtopics and spawn sub-agents to research them simultaneously. The harness handled the coordination and aggregation automatically. In practice, this meant we could parallelize a traditionally sequential workflow with almost no additional engineering effort.

Our Takeaway

Skills-based agent harnesses represent a promising shift in how we build AI applications. For many workflows, they can dramatically reduce orchestration complexity by letting the harness handle planning, coordination, and parallelization instead of requiring developers to implement these mechanisms manually. 
However, most of today’s agent harnesses are primarily designed as CLI-based developer tools rather than production workflow engines. While they work very well for rapid prototyping, PoCs, and internal automation, using them as the orchestration layer for production systems would still require significant additional infrastructure for scaling, fault tolerance, and observability.
Note: Our experiment focused on a single use case to gain practical experience with the agent skills paradigm, rather than comprehensively testing all possible capabilities.

1 We follow the definition from Harrison Chase of Langchain fame: an agent harness is an opinionated package that bundles the defaults, tools, and environment needed to run a general-purpose agent out of the box.
2 A Skill is a small directory containing instructions, code samples, and resources to guide an agent to perform some task it might not be able to do without guidance.