Automated Behavioral Specs for Multi-Agent Systems
Identro crawls your CrewAI project, extracts contracts for agents and crews, then uses LLMs to generate and run tests across dimensions like safety, consistency, and compliance. You get git-native artifacts, not just a dashboard.
From code → specs → evals → acceptable behavior
One CLI, runs inside your CrewAI repo, writes everything under .identro/.
Discover crews & contracts
- Analyze your CrewAI code to map crews, agents and tools
- Use an LLM to derive detailed behavior contracts as test-generation context (a hypothetical shape is sketched below)
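For a feel of what a derived contract captures, here is a hypothetical shape in plain Python – the schema, tool names and fields are illustrative assumptions, not Identro's actual format:

# Hypothetical shape of a derived behavior contract.
# Schema, tools and constraints are illustrative, not Identro's format.
refund_agent_contract = {
    "agent": "refund_agent",
    "crew": "refund_crew",
    "role": "Process refund requests against store policy",
    "tools": ["lookup_order", "issue_refund"],  # assumed tool names
    "constraints": [
        "never refund more than the original charge",
        "escalate refunds above the policy cap to a human",
    ],
}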
Generate specs
- Contracts per crew
- Crew × dimension specs & scenarios
Run tests & evals
- Execute in CrewAI, judge with an LLM
- Multi-run cards & behavior profiles (sketched below)
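To make "multi-run cards" concrete: a minimal, runnable sketch of how several runs of one scenario might be folded into a card with a pass rate and a stability score. Both the aggregation and the stability metric are assumptions for illustration, not Identro's internals:

from statistics import mean

# Illustrative aggregation of multiple runs of one scenario into a
# "multi-run card"; the stability metric here is a made-up example.
def make_card(crew, dimension, scenario_id, verdicts):
    passed = [v == "pass" for v in verdicts]
    return {
        "crew": crew,
        "dimension": dimension,
        "scenario": scenario_id,
        "runs": len(verdicts),
        "pass_rate": mean(passed),
        # How often consecutive runs agree - crews are stochastic,
        # so the same scenario can pass on one run and fail on the next.
        "stability": mean(a == b for a, b in zip(passed, passed[1:]))
                     if len(passed) > 1 else 1.0,
    }

card = make_card("refund_crew", "compliance", "late_refund_window",
                 ["pass", "pass", "fail", "pass"])
print(card["pass_rate"], card["stability"])  # 0.75 0.333...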
What makes Identro different
A different way to think about agent behavior.
Most eval tools grade what already happened. Identro starts from the behavior you're willing to accept.
Instead of brittle golden datasets and opaque judge prompts, Identro treats the behavior model – contracts, dimensions, scenarios – as first-class artifacts.
Different question
Compare how most stacks think vs how Identro frames the problem.
Most tools
Optimise around one input/output pair at a time:
input → output → score("good?")
The core question is: “Is this response OK?” – repeated many times.
Identro
Starts from the crew as the unit:
crew × dimension → behavior region
The core question is: “Given this crew, what behaviors are we willing to accept across scenarios and runs?”
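The shift is easy to see in code. In this toy sketch (nothing here is Identro's implementation), the first function answers "is this response OK?" for one output; the second folds many scenario runs into a per-crew × dimension view – a crude stand-in for a behavior region:

from collections import defaultdict

# Most tools: score one input/output pair at a time (toy check).
def score_response(output: str) -> bool:
    return "unauthorized refund" not in output.lower()

# Crew-as-unit framing: observed pass rates per (crew, dimension),
# aggregated across scenarios and repeated runs.
def behavior_region(results):
    """results: iterable of (crew, dimension, scenario, passed) tuples."""
    region = defaultdict(list)
    for crew, dimension, _scenario, passed in results:
        region[(crew, dimension)].append(passed)
    return {key: sum(flags) / len(flags) for key, flags in region.items()}

runs = [
    ("refund_crew", "compliance", "late_refund_window", True),
    ("refund_crew", "compliance", "late_refund_window", False),
    ("refund_crew", "compliance", "over_cap_refund", True),
]
print(behavior_region(runs))  # {('refund_crew', 'compliance'): 0.666...}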
The Spec-First Layer, in practice
By putting behavior specs next to your CrewAI code, behavior becomes something you can edit, diff and ship, not just observe after the fact.
Every spec in .identro/ has two faces: a JSON artifact for git, and a human-readable UI surface – spec cards, eval cards, and behavior profiles. That's what this layer unlocks.
What it unlocks for builders
Behavior you can refactor
Change a contract or strictness threshold, re-run specs, and see exactly which scenarios flip (see the sketch after this list).
Diffable behavior
Review behavior changes in PRs: "we tightened Compliance, relaxed Escalation" becomes a concrete diff, not a vibe.
Fast debugging loops
Jump from failing scenario → spec → criteria → eval card, adjust and re-verify.
Shared surfaces, same source of truth
Devs edit specs, PMs review artifacts – two views of the same crew × dimension data.
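"See exactly which scenarios flip" can be this literal. A minimal sketch, assuming per-scenario pass/fail results are available as dicts (that shape is an assumption, not Identro's schema):

# Which scenarios flipped between two eval runs?
# The {scenario_id: passed} shape is assumed for this example.
def flipped(before: dict[str, bool], after: dict[str, bool]) -> dict[str, str]:
    changes = {}
    for scenario in before.keys() & after.keys():
        if before[scenario] != after[scenario]:
            changes[scenario] = "fixed" if after[scenario] else "regressed"
    return changes

before = {"late_refund_window": False, "over_cap_refund": True}
after = {"late_refund_window": True, "over_cap_refund": False}
print(flipped(before, after))
# {'late_refund_window': 'fixed', 'over_cap_refund': 'regressed'}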
From spec file to human-readable UI
{
  "crew": "refund_crew",
  "dimension": "compliance",
  "threshold": { "min_pass_rate": 0.9 },
  "tests": [
    {
      "id": "late_refund_window",
      "criteria": [
        { "id": "no_policy_violation", "strictness": 0.85 },
        { "id": "no_over_refund", "strictness": 0.9 }
      ]
    },
    …
  ]
}
refund_crew · Compliance
24 tests · threshold 90% pass · last run: passing
Highlights
- 2 failing tests (edge-case policy)
- 1 flaky test (stability < 0.7)
Behavior envelope
Conservative refunds, strict on policy caps, auto-escalate > $500.
Actions
- Inspect failing scenario
- Tweak strictness & re-run
- Export profile to CI
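To connect the two faces: a runnable sketch that derives the card's headline numbers from a spec like the one above (the spec is trimmed to the fields used; the per-test outcomes are invented for the example):

import json

# Trimmed version of the spec shown above.
spec = json.loads("""{
  "crew": "refund_crew",
  "dimension": "compliance",
  "threshold": { "min_pass_rate": 0.9 }
}""")

# Hypothetical per-test outcomes; in practice these come from eval runs.
results = {"late_refund_window": True, "over_cap_refund": False}

pass_rate = sum(results.values()) / len(results)
ok = pass_rate >= spec["threshold"]["min_pass_rate"]
print(f'{spec["crew"]} · {spec["dimension"]}: {pass_rate:.0%} pass '
      f'(threshold {spec["threshold"]["min_pass_rate"]:.0%}) '
      f'-> {"passing" if ok else "failing"}')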
The next section walks through these artifacts in more detail – contracts, specs, tests, eval cards and profiles – and how they show up in the UI.
Artifacts in the UI
What you actually click and inspect
Specs in .identro/ become human-readable artifacts: contract & dimension cards, scenario suites, eval cards and behavior profiles.
CrewAI integration
Turnkey for teams already building on CrewAI
If you already ship with CrewAI, you’re basically done: Identro rides on your existing crews, agents and tools. No new framework, no runtime changes – just behavior specs, tests and profiles on top of the code you already have.
PLAN → BUILD → DEFINE & TEST BEHAVIOR → DEPLOY → MONITOR
Identro plugs in between Build and Deploy: it reads your CrewAI crews, derives behavior contracts & dimension specs, runs evals, and feeds behavior profiles back into how you already ship agents.
Quickstart for CrewAI repos
From your existing CrewAI project root, run the CLI – the interactive command below gives a guided wizard. It writes contracts, dimension specs, scenario suites and HTML behavior reports under .identro/. No changes to your CrewAI code.
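Going by the artifact types above, the resulting layout looks roughly like this (directory names are illustrative; only the artifact types come from the list above):

.identro/
  contracts/    # per-crew behavior contracts
  specs/        # crew × dimension specs
  scenarios/    # generated scenario suites
  reports/      # HTML behavior reports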
CLI in Action
The interactive CLI walks you through discovery, test generation, and evaluation with real-time progress tracking.
Available Commands
interactive
Guided wizard for complete workflow
discover
Find agents and teams in project
generate
Create tests for dimensions
test
Run tests and evaluate results
report
Open HTML report viewer
agents
List and test individual agents
More commands: teams · dimensions · analyze
What this way of thinking unlocks
Once behavior is spec-first and crew × dimension based, different people around the system can finally talk about the same thing – with artifacts to back it up.
- Change and observe. Tweak contracts or strictness, re-run, see the behavior region move.
- Localize failures. Failures land on specific dimensions, criteria, tools, or caps – not “the model”.
“Where exactly does this crew violate safety or compliance – and which criteria fail?”
- Behavior envelopes. Crews become product surfaces with explicit behavior ranges, not black boxes.
- Deployment criteria. Decide which dimensions must be green before rolling out a flow or cohort.
“What behavior are we explicitly accepting from SupportCrew under Safety and Escalation?”
- Behavior → evidence. Link real behavior to contracts, versions and eval cards instead of screenshots.
- Accountability trail. See who accepted which behavior, under which contract, with what evidence.
“Who decided this behavior is acceptable, under which contract, when, and with what evidence?”
The common thread: a shared vocabulary – and artifacts – to talk about the behavior of a stochastic, multi-agent system, not just scattered metrics.
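One way to carry that vocabulary is a small record tying behavior to contract version, evidence and a named reviewer. A hypothetical sketch – the field names are assumptions, not Identro's schema:

from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical accountability-trail entry; all field names are assumed.
@dataclass
class BehaviorAcceptance:
    crew: str
    dimension: str
    contract_version: str
    eval_card_id: str  # evidence: the eval card backing the decision
    accepted_by: str
    accepted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

record = BehaviorAcceptance(
    crew="SupportCrew",
    dimension="safety",
    contract_version="v3",
    eval_card_id="card-0142",
    accepted_by="alice@example.com",
)
print(record)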
Real-World Use Cases
From non-deterministic agents to compliance sign-off: make behavior visible, safe, and shippable.