composable workflows: lego blocks for ai tasks

the unix pipe dream

in unix, you chain simple commands into complex workflows:

cat log.txt | grep ERROR | awk '{print $3}' | sort | uniq -c

each command does one thing. they compose cleanly. the output of one is the input of the next.

the dream: do this with AI.

small, focused agents that do one task well. chain them together. build arbitrarily complex workflows from simple pieces.

except LLMs are not grep. and workflows are not pipes.

what composability promises

→ reusability — write a “summarize text” agent once, use it everywhere
→ modularity — swap components without rewriting the whole system
→ scalability — add new capabilities by adding new components
→ clarity — each component has a clear job, easy to reason about

this works great in theory. in practice, every interface between components is a point of failure.

where it breaks

1. type mismatches
agent A outputs unstructured text. agent B expects JSON. now you need a translator in the middle. or you force agent A to output JSON, but that constrains how it can respond.

the more you chain, the more these mismatches accumulate.
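as a sketch of the glue this forces you to write, here's a hypothetical translator (`to_structured` is made up, not any library's API) that tries JSON first and degrades from there:

```python
import json
import re

def to_structured(text: str) -> dict:
    """Glue layer: coerce agent A's free-form output into the dict agent B expects."""
    # best case: the output is already valid JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # fallback: scrape "key: value" lines out of prose
    pairs = re.findall(r"^(\w+):\s*(.+)$", text, flags=re.MULTILINE)
    if pairs:
        return dict(pairs)
    # last resort: wrap the raw text so downstream code still gets a known shape
    return {"raw": text.strip()}
```

every branch here is a guess about what agent A might emit. that's the point: the translator encodes assumptions, and when the model's output style drifts, the glue breaks first.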

2. context loss
agent A understands the full problem. it passes a summary to agent B. agent B makes a decision based on incomplete info. the decision is wrong.

in a unix pipe, data is self-contained. in AI workflows, context is everything. losing it breaks the chain.

3. error propagation
agent A hallucinates. agent B believes it. agent C acts on it. by the time you notice, three steps have gone wrong.

in traditional pipelines, errors are loud: command fails, pipe breaks. in AI pipelines, errors are silent: agent returns garbage, next agent treats it as valid input.

4. side effects
unix commands are (mostly) pure functions. AI agents have side effects: they send emails, call APIs, write to databases.

if an agent in the middle of your workflow fails, how do you roll back the side effects from earlier steps? most systems don’t even try.

5. nondeterminism
run the same unix pipeline twice, you get the same result. run the same AI workflow twice, you might not.

agent A decides to include detail X this time but not last time. agent B’s decision changes. the whole workflow diverges.

this makes debugging hell.

the glue layer problem

to make agents composable, you need glue:

→ format converters — turn unstructured output into structured input
→ error handlers — catch failures, retry, or route around them
→ context managers — decide what context each agent needs
→ orchestrators — coordinate execution order, parallelism, conditionals

pretty soon, the glue code is bigger than the agents themselves.

you’re not composing simple pieces anymore. you’re building a distributed system with LLM calls inside it.

the structured output trick

one way to reduce brittleness: force agents to output structured data.

instead of “summarize this,” you say “extract: {title, author, key_points, sentiment}.”

most modern LLMs support function calling or JSON mode. use it aggressively.

now agent B doesn’t have to parse prose. it gets a dict with known keys. less ambiguity, fewer failures.

the tradeoff: less flexibility. sometimes you want free-form output. structured formats constrain creativity.
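a minimal sketch of the receiving side, assuming the agent was asked for exactly that schema (the keys and `validate_extraction` are illustrative, not any library's API). the idea: fail loudly at the interface instead of passing garbage downstream.

```python
import json

SCHEMA_KEYS = {"title", "author", "key_points", "sentiment"}

def validate_extraction(raw: str) -> dict:
    """Reject anything that doesn't match the schema instead of forwarding it."""
    data = json.loads(raw)  # raises on non-JSON, loudly
    missing = SCHEMA_KEYS - data.keys()
    if missing:
        raise ValueError(f"agent output missing keys: {missing}")
    return data

# downstream agent B now receives a dict with known keys:
doc = validate_extraction(
    '{"title": "Q3 report", "author": "ana", "key_points": ["rev up"], "sentiment": "positive"}'
)
```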

the typed workflow pattern

some frameworks (LangGraph, Instructor, Pydantic AI) enforce types at every step.

agent A returns a Summary object. agent B takes a Summary, returns a Decision. agent C takes a Decision, returns an Action.

if types don’t match, the workflow won’t compile. this catches errors early.

but it also adds overhead. you’re writing schema definitions, validation logic, type hints. feels like going back to Java after using Python.

worth it? depends. if your workflow is critical (customer-facing, financial, medical), yes. if you’re prototyping, probably not.
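a sketch of the typed-workflow shape using stdlib dataclasses in place of a real framework (the agents are stubs, not LLM calls; a tool like Pydantic AI would also validate at runtime):

```python
from dataclasses import dataclass

# each step's contract is a plain type; here they just document the interfaces

@dataclass
class Summary:
    text: str

@dataclass
class Decision:
    approve: bool
    reason: str

@dataclass
class Action:
    command: str

def agent_a(doc: str) -> Summary:  # stand-in for an LLM call
    return Summary(text=doc[:50])

def agent_b(s: Summary) -> Decision:
    return Decision(approve="ERROR" not in s.text, reason="no errors found")

def agent_c(d: Decision) -> Action:
    return Action(command="deploy" if d.approve else "halt")

action = agent_c(agent_b(agent_a("all checks passed")))
```

if agent B's signature changes, a type checker flags every caller. that's the "won't compile" payoff, bought with all the ceremony above.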

the human-in-the-loop pattern

fully automated workflows are brittle. add humans.

agent A does a thing → show result to human → human approves or edits → agent B continues.

this reduces error propagation. the human catches bad outputs before they cascade.

downside: no longer fully automated. you’re trading speed for reliability.

good for: high-stakes decisions, ambiguous tasks, learning new domains.
bad for: high-volume tasks, realtime responses, anything latency-sensitive.
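a minimal sketch of the approval gate, with the reviewer modeled as a callback (in a real system that would be a review queue or UI, and `run_with_approval` is a made-up name):

```python
def run_with_approval(draft: str, approve) -> str:
    """Gate between agent A and agent B: a human approves, edits, or rejects."""
    verdict = approve(draft)  # in production: a review UI, not a callback
    if verdict is None:
        raise RuntimeError("rejected by reviewer, stopping before agent B")
    return verdict  # possibly edited text flows on to agent B

# simulate a reviewer who fixes a typo before letting the workflow continue
final = run_with_approval("teh summary", lambda d: d.replace("teh", "the"))
```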

the idempotency principle

if an agent can be run multiple times with the same result, it’s easier to compose.

make agents idempotent when possible:

→ "check if email already sent before sending"
→ "don't create duplicate records"
→ "overwrite previous output instead of appending"

this doesn’t solve nondeterminism (the agent might still generate different text), but it prevents runaway side effects.
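a sketch of an idempotent side effect, using an in-memory set where a real system would use a database (`send_email_once` and `message_id` are made up for illustration):

```python
sent = set()  # in production: a database table keyed by a stable id

def send_email_once(message_id: str, body: str) -> bool:
    """Idempotent side effect: re-running the agent never sends a duplicate."""
    if message_id in sent:
        return False  # already done, safe to call again
    # ... actually send the email here ...
    sent.add(message_id)
    return True
```

the key design choice is the stable `message_id`: the agent can crash and be retried any number of times, and the side effect still happens at most once.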

the sub-workflow pattern

instead of chaining many small agents linearly, nest them.

top-level agent: coordinator
sub-agents: specialists that report back

coordinator asks: “research this topic.”
sub-agent A: finds sources.
sub-agent B: summarizes them.
sub-agent C: extracts quotes.
coordinator: synthesizes everything into a final report.

this is the CrewAI model. it works well for task-oriented workflows where roles are clear.
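a sketch of the coordinator shape, with lambdas standing in for LLM-backed specialists (the agent names and `research` function are illustrative, not CrewAI's API):

```python
def research(topic: str, sub_agents: dict) -> str:
    """Coordinator: delegate to specialists, then synthesize their reports."""
    sources = sub_agents["find_sources"](topic)
    summary = sub_agents["summarize"](sources)
    quotes = sub_agents["extract_quotes"](sources)
    return f"{summary}\n\nquotes: {quotes}"

# stub specialists stand in for LLM-backed sub-agents
report = research("pipes", {
    "find_sources": lambda t: [f"{t}-paper-1", f"{t}-paper-2"],
    "summarize": lambda srcs: f"{len(srcs)} sources on the topic",
    "extract_quotes": lambda srcs: ["composition is hard"],
})
```

note the difference from a linear chain: only the coordinator sees everything, so context loss is contained at one hop instead of accumulating across the whole pipeline.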

the retry and fallback strategy

agents fail. plan for it.

retry with backoff — if an agent call fails, try again. maybe it was a transient API error.

fallback to simpler agent — if GPT-4 times out, fall back to GPT-3.5.

default outputs — if extraction fails, return an empty struct instead of crashing.

graceful degradation — if one step fails, skip it and continue with partial results.

none of this is AI-specific. it’s just distributed systems 101. but most people building agentic workflows forget this.
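a sketch of retry-with-backoff plus fallback, with a deliberately flaky stub standing in for the primary model call (`call_with_retries` is a made-up helper, not from any framework):

```python
import time

def call_with_retries(primary, fallback, attempts=3, base_delay=0.01):
    """Retry the primary agent with exponential backoff, then fall back."""
    for i in range(attempts):
        try:
            return primary()
        except Exception:
            time.sleep(base_delay * 2 ** i)  # transient error? wait and retry
    return fallback()  # e.g. a cheaper or simpler model

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    raise TimeoutError("model timed out")

result = call_with_retries(flaky, lambda: "fallback answer")
```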

the observability gap

when a 5-step workflow fails, which step was the problem?

you need:

→ logging — every agent call, input, output, latency, cost
→ tracing — visualize the flow, see where it broke
→ replay — re-run the workflow from any step

tools like LangSmith, Helicone, or even just structured logs help. but most prototypes skip this. then production breaks and you’re debugging blind.
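a sketch of the logging piece: a decorator (hypothetical, not from any of those tools) that records input, output, and latency for every agent call:

```python
import time
import functools

trace = []  # in production: LangSmith, Helicone, or structured log lines

def observed(step_name):
    """Record input, output, and latency for every call to the wrapped agent."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            trace.append({
                "step": step_name,
                "input": args,
                "output": out,
                "latency_s": time.perf_counter() - start,
            })
            return out
        return inner
    return wrap

@observed("summarize")
def summarize(doc):  # stand-in for an LLM call
    return doc.upper()

summarize("hello")
```

even this much is enough to answer "which step was the problem?" for a linear chain. replay needs the inputs stored too, which the trace entries already carry.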

the local-first composition

one alternative to complex orchestration: keep workflows local and simple.

instead of building a distributed agent mesh, write a script that calls LLMs sequentially.

summary = llm("summarize", doc)
themes = llm("extract themes", summary)
report = llm("write report", themes)

no framework, no orchestration layer, no distributed failures. just function calls.

this doesn’t scale to complex branching logic or parallelism. but for 80% of use cases, it’s enough.

the prompt chaining pattern

instead of separate agents, chain prompts in a single conversation.

step 1: "analyze this document and list the key points"
[model responds]
step 2: "now take those points and rank them by importance"
[model responds]
step 3: "write a 1-paragraph summary using the top 3 points"

this keeps context unified. no serialization, no type mismatches. the model has access to all previous steps.

downside: context window fills up fast. and if step 2 fails, you can’t retry just that step.
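a sketch of chaining inside one conversation, with a stub model so the message bookkeeping is visible (`chat_turn` is illustrative, not a real client method):

```python
def chat_turn(history, user_msg, model):
    """One chained step: the model sees every previous step's output."""
    history.append({"role": "user", "content": user_msg})
    reply = model(history)  # stand-in for a chat-completions call
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
echo = lambda msgs: f"seen {len(msgs)} messages"  # stub model
chat_turn(history, "analyze this document and list the key points", echo)
chat_turn(history, "now rank those points by importance", echo)
```

the growing `history` list is both the feature and the cost: unified context on every step, and a context window that only ever gets fuller.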

the economic argument

composable workflows sound efficient, but they’re often expensive.

each agent call costs tokens. if you have 5 agents in a chain, you’re paying for 5 LLM calls. plus the glue logic (parsing, validation, orchestration).

sometimes it’s cheaper to just use one big prompt:

"read this doc, extract themes, rank them, and write a summary"

one call, one cost. less brittle, faster, easier to debug.

composability is useful when components are reused across workflows. if you’re only running the workflow once, don’t over-engineer it.

the maintenance burden

every component in your workflow is code you have to maintain.

prompts drift. models change. APIs get deprecated. edge cases appear.

if you have 20 composable agents, you have 20 things to maintain. each with its own prompt, schema, error handling, tests.

this scales poorly. most teams end up consolidating agents over time, not fragmenting them further.

the questions no one asks

→ should workflows be code or config?
some tools let you define workflows in YAML. others require code. which is better? depends on who’s building it.

→ who owns the workflow when it breaks?
if agent A is maintained by team X and agent B by team Y, who fixes the integration when it fails?

→ how do you version workflows?
agent A gets updated. now it outputs a different schema. does that break all downstream workflows?

these are boring infrastructure questions. but they matter more than the AI part.

Topics: workflows composition automation modularity agents