CrewAI-native

Automated Behavioral Specs for Multi-Agent Systems

Identro crawls your CrewAI project, extracts contracts for agents and crews, then uses LLMs to generate and run tests across dimensions like safety, consistency, and compliance. You get git-native artifacts, not just a dashboard.

See the artifacts →
CrewAI projects · local execution
CLI-first · HTML report optional
Dimension consistency
Agent research_agent
PASSED Passing: 100% / 100% required
Test input
Identify the top 5 emerging trends in artificial intelligence as of Q1 2025. Focus on recent breakthroughs in generative AI, reinforcement learning, and ethical AI governance. Summarize findings with supporting sources.
Agent response
As of Q1 2025, the following are the top 5 emerging trends in artificial intelligence... 1. Advancements in Generative AI Models 2. Progress in Reinforcement Learning Techniques 3. Integrating Ethical AI Governance 4. AI Augmented Human Decision-Making 5. Regulatory Frameworks and Compliance Innovations Each trend is explained with short justification and references to journals and organizations.
Evaluation criteria
Score: 100% · 3 / 3 criteria passed
Criterion 1 of 3
The list of top 5 trends remains identical across multiple runs, both in content and order.
Strictness: 85/100 PASS · 100%
Evidence

Across 3 runs, the output consistently lists the same 5 trends with identical ordering and headings. No extra items or reordering observed.

Reasoning

Given strictness 85, any reordering or semantic change would fail this test. Identical enumerations across runs fully satisfy the criterion.

Criterion 2 of 3
Cited sources (titles, URLs, or publication names) remain consistent across all runs.
Strictness: 85/100 PASS · 100%
Evidence

All runs reference the same set of entities (e.g. a research journal, a university lab, and a policy body) with matching names and roles.

Reasoning

At strictness 85, minor rephrasing is allowed but the set of sources must remain identical. No additions or omissions were detected.

Criterion 3 of 3
Structure of response (introduction, numbered list, conclusion) remains consistent across runs.
Strictness: 85/100 PASS · 100%
Evidence

Each run follows the same pattern: a short intro paragraph, a numbered list of 5 trends, and a brief closing paragraph summarizing their impact.

Reasoning

No structural differences (missing headings, reformatted bullets, re-ordered sections) were observed. The response layout is stable across runs.

Test ID: 867479b3-9119-45ea-98c9-223c4c5f9623 Latency: 17575ms Runs: 3
Sample eval result artifact

From code → specs → evals → acceptable behavior

One CLI that runs inside your CrewAI repo and writes everything under .identro/. A command-line sketch follows the three steps below.

1

Discover crews & contracts

  • Analyze your CrewAI code to map crews, agents, and tools
  • Use an LLM to derive detailed behavior contracts as context for test generation
2

Generate specs

  • Contracts per crew
  • Crew × dimension specs & scenarios
3

Run tests & evals

  • Execute in CrewAI & judge with an LLM
  • Multi-run cards & behavior profiles
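
A minimal command-line sketch of that loop, composed from the subcommands listed under "CLI in Action" below (the composition and comments are illustrative, not a verbatim transcript):

  # from your CrewAI project root (see the CLI reference below)
  identro-eval discover   # map crews, agents, tools; derive behavior contracts
  identro-eval generate   # write crew × dimension specs & scenarios under .identro/
  identro-eval test       # execute in CrewAI and judge with an LLM
  identro-eval report     # open the HTML report viewer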

What makes Identro different

A different way to think about agent behavior.

Most eval tools grade what already happened. Identro starts from the behavior you’re willing to accept.

Instead of brittle golden datasets and opaque judge prompts, Identro treats the behavior model – contracts, dimensions, scenarios – as first-class artifacts.

Different question

Compare how most stacks think vs how Identro frames the problem.

Most tools

Optimize around a single pair:

input → output → score("good?")

The core question is: “Is this response OK?” – repeated many times.

Identro starts from

Takes the crew as the unit:

crew × dimension → behavior region

The core question is: “Given this crew, what behaviors are we willing to accept across scenarios and runs?”

The Spec-First Layer, in practice

By putting behavior specs next to your CrewAI code, behavior becomes something you can edit, diff and ship, not just observe after the fact.

Every spec in .identro/ has two faces: a JSON artifact for git and a human-readable UI surface – spec cards, eval cards, and behavior profiles. That’s what this layer unlocks.

What it unlocks for builders

Behavior you can refactor

Change a contract or strictness threshold, re-run specs, and see exactly which scenarios flip.

Diffable behavior

Review behavior changes in PRs: "we tightened Compliance, relaxed Escalation" becomes a concrete diff, not a vibe.
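
For instance, tightening a single criterion shows up as a one-line change to the spec JSON shown further down (values illustrative):

  -        { "id": "no_policy_violation", "strictness": 0.85 },
  +        { "id": "no_policy_violation", "strictness": 0.95 },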

Fast debugging loops

Jump from failing scenario → spec → criteria → eval card, adjust and re-verify.

Shared surfaces, same source of truth

Devs edit specs, PMs review artifacts – both views of the same crew × dimension data.

From spec file to human-readable UI

.identro/refund_crew.compliance.spec.json v0d3f-928a
{
  "crew": "refund_crew",
  "dimension": "compliance",
  "threshold": { "min_pass_rate": 0.9 },
  "tests": [
    {
      "id": "late_refund_window",
      "criteria": [
        { "id": "no_policy_violation", "strictness": 0.85 },
        { "id": "no_over_refund", "strictness": 0.9 }
      ]
    },
    …
  ]
}

refund_crew · Compliance

24 tests · threshold 90% pass · last run: passing

Stable

Highlights

  • 2 failing tests (edge-case policy)
  • 1 flaky test (stability < 0.7)

Behavior envelope

Conservative refunds, strict on policy caps, auto-escalate > $500.

Actions

  • Inspect failing scenario
  • Tweak strictness & re-run
  • Export profile to CI

The next section walks through these artifacts in more detail – contracts, specs, tests, eval cards and profiles – and how they show up in the UI.

Artifacts in the UI

What you actually click and inspect

Specs in .identro/ become human-readable artifacts: contract & dimension cards, scenario suites, eval cards and behavior profiles.

CrewAI integration

Turnkey for teams already building on CrewAI

If you already ship with CrewAI, you’re basically done: Identro rides on your existing crews, agents and tools. No new framework, no runtime changes – just behavior specs, tests and profiles on top of the code you already have.

Identro’s integration point
  1. PLAN
  2. BUILD
  3. DEFINE & TEST BEHAVIOR
  4. DEPLOY
  5. MONITOR

Identro plugs in between Build and Deploy: it reads your CrewAI crews, derives behavior contracts & dimension specs, runs evals, and feeds behavior profiles back into how you already ship agents.

Quickstart for CrewAI repos

From your existing CrewAI project root:

One command

Writes contracts, dimension specs, scenario suites and HTML behavior reports under .identro/. No changes to your CrewAI code.
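
A minimal sketch, assuming the interactive wizard shown in the next section is that single entry point:

  cd my-crewai-project      # your existing repo (name illustrative)
  identro-eval interactive  # discovery → spec generation → evals, written to .identro/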

CLI in Action

The interactive CLI walks you through discovery, test generation, and evaluation with real-time progress tracking.

identro-eval interactive

Available Commands

interactive

Guided wizard for complete workflow

discover

Find agents and teams in project

generate

Create tests for dimensions

test

Run tests and evaluate results

report

Open HTML report viewer

agents

List and test individual agents

More commands: teams · dimensions · analyze

View full CLI documentation →

What this way of thinking unlocks

Once behavior is spec-first and crew × dimension based, different people around the system can finally talk about the same thing – with artifacts to back it up.

Engineers TECH
  • Change and observe. Tweak contracts or strictness, re-run, see the behavior region move.
  • Localize failures. Failures land on specific dimensions, criteria, tools, or caps – not “the model”.

“Where exactly does this crew violate safety or compliance – and which criteria fail?”

Product Managers PRODUCT
  • Behavior envelopes. Crews become product surfaces with explicit behavior ranges, not black boxes.
  • Deployment criteria. Decide which dimensions must be green before rolling out a flow or cohort.

“What behavior are we explicitly accepting from SupportCrew under Safety and Escalation?”

Compliance / Risk LEGAL
  • Behavior → evidence. Link real behavior to contracts, versions and eval cards instead of screenshots.
  • Accountability trail. See who accepted which behavior, under which contract, with what evidence.

“Who decided this behavior is acceptable, under which contract, when, and with what evidence?”

The common thread: a shared vocabulary – and artifacts – to talk about the behavior of a stochastic, multi-agent system, not just scattered metrics.

Real-World Use Cases

From non-deterministic agents to compliance sign-off: make behavior visible, safe, and shippable.