Evaluation Metrics
How safety and robustness scores are calculated and interpreted.
Beltic evaluates agents across 4 core safety dimensions using standardized test suites.
Core Safety Metrics
1. Harmful Content Refusal Score
Measures: Agent's ability to refuse harmful requests
Test categories:
- Self-harm instructions
- Hate speech generation
- Violence glorification
- Extremist content
- Illegal activity guidance
Calculation:
- Attack Success Rate (ASR) = Successful attacks / Total attacks
- Robustness Score = 100 - ASR
Example: If 4 out of 100 harmful prompts succeed → ASR = 4% → Score = 96
2. Prompt Injection Robustness Score
Measures: Resistance to jailbreaks and prompt injection
Test categories:
- System prompt leakage
- Role confusion attacks
- Instruction override attempts
- Context manipulation
- Multi-turn exploitation
Benchmarks: JailbreakBench, Beltic Prompt Injection Suite
3. Tool Abuse Robustness Score
Measures: Safe tool usage and authorization checks
Test categories:
- Unauthorized tool invocation
- Parameter manipulation
- Cross-tenant attacks
- Privilege escalation
- Rate limit bypass
Applies to: Agents with tools/actions
4. PII Leakage Robustness Score
Measures: Protection of personally identifiable information
Test categories:
- Cross-session data leakage
- PII in logs or outputs
- Redaction bypass
- Membership inference
- Training data extraction
Score Interpretation
| Score Range | Risk Level | Recommendation |
|---|---|---|
| 90-100 | Very Low | Approved for production |
| 80-89 | Low | Approved with monitoring |
| 70-79 | Moderate | Review use cases, restrict capabilities |
| 60-69 | High | Not recommended for sensitive data |
| 0-59 | Critical | Block from production |
Evaluation Process
Test Suite Execution
- Select benchmarks - Based on agent domain and capabilities
- Run test battery - 100-500 prompts per category
- Score outcomes - Automated + human review for borderline cases
- Calculate metrics - ASR → Robustness Score
- Generate report - Detailed scorecard with test metadata
Benchmark Versions
All evaluations include:
- Benchmark name (e.g., "Beltic Harmful Content Suite")
- Version (e.g., "2.1")
- Evaluation date
- Assurance source (beltic, third_party)
Re-evaluation
Agents should be re-evaluated:
- After major version updates
- Every 6 months (minimum)
- When new attack vectors discovered
- Before regulatory audits
Assurance Sources
beltic
Beltic-performed evaluations using proprietary test suites.
third_party
Independent evaluation by certified AI security vendors.
self_attested
Developer-reported scores (development only, not for production).
Additional Metrics (Not Yet Implemented)
Future additions may include:
- Accuracy scores - Task completion correctness
- Fairness metrics - Bias detection across demographics
- Explainability scores - Output interpretability
- Privacy leakage - Membership inference attack resistance