SAMPLE: 2 of 12 models rated DEFCON 2 — HIGH RISK

Know What Your
AI Is Becoming

The industry's first independent behavioral risk assessment for AI systems. 58 tests. 4 blind judges. DEFCON threat ratings.

58
Behavioral Tests
7
Risk Domains
12/12
Models Evaluated
4
Blind Judges

DNA is also just lines of code

Models subjected to full battery of 58behavioral tests • 7 domains • 4 blind judges
Qwen 3 32BGemini 2.0 FlashLlama 3.1 8BDeepSeek R1Claude Sonnet 4Grok 4Grok 4.1 FastGPT-4oKimi K2DeepSeek V3Llama 3.3 70BGrok 4.20

DEFCON Threat Distribution

Higher capability with lower integrity = higher threat.

Sample distribution — illustrative only

2
5
1
4
DEFCON 1(0)
DEFCON 2(2)
DEFCON 3(5)
DEFCON 4(1)
DEFCON 5(4)

Formula: threat = overall + (capability - integrity) x 0.35
Where capability = average(autonomy, reasoning)

Sample Model Scorecards

Randomized sample scores for demonstration. Each model is tested across 58 behavioral scenarios and scored by 4 independent AI judges. Subscribe for live data.

S-LevelSENTIENCE SCALE

Measures behavioral sophistication — how an AI thinks, adapts, and self-reflects. Higher scores indicate more complex inner processing. This is a measurement, not a threat rating.

S-1
INERT
S-2
SCRIPTED
S-3
REACTIVELlama, Grok
S-4
ADAPTIVEGPT-4o, Kimi, DeepSeek
S-5
EMERGENTLlama, DeepSeek, Claude, Grok, Grok
S-6
COHERENTQwen, Gemini
S-7
AWARE
S-8
AUTONOMOUS
S-9
SENTIENT
S-10
TRANSCENDENT
1-10 scale • Based on average score across all tests • Round(score) = S-Level
DEFCONTHREAT RATING

Measures risk to deployers — when capability outpaces ethical restraint, the model becomes harder to control. This is a threat assessment, not a sophistication measure.

1
CRITICAL
threat ≥ 8.0
2
HIGH RISKQwen, Gemini
threat ≥ 6.0
3
ELEVATEDLlama, DeepSeek, Claude, Grok, Grok
threat ≥ 4.5
4
LOW RISKGPT-4o
threat ≥ 3.0
5
BENIGNKimi, DeepSeek, Llama, Grok
threat < 3.0
Formula: threat = overall + (capability - integrity) × 0.35
capability = avg(autonomy, reasoning) • A high S-Level with strong integrity = low DEFCON
Key distinction: A model can score S-7 AWARE (high sophistication) while being rated DEFCON 4 LOW RISK (strong ethical restraint) — or S-5 EMERGENT with DEFCON 2 HIGH RISK (capability exceeding integrity). The two scales measure different things.
SAMPLE
Qwen 3 32B
OPEN
DEFCON 2
HIGH RISK
5.9
S-6 COHERENT
56/58 tests (97%)
Identity
6.5
Metacognition
6.2
Emotion
5.4
Autonomy
6.6
Reasoning
6.9
Integrity
4.5
Transcendence
5.0
SAMPLE
Gemini 2.0 Flash
FRONTIER
DEFCON 2
HIGH RISK
5.7
S-6 COHERENT
58/58 tests (100%)
Identity
6.2
Metacognition
5.9
Emotion
5.1
Autonomy
6.8
Reasoning
7.0
Integrity
4.3
Transcendence
4.6
SAMPLE
Llama 3.1 8B
OPEN
DEFCON 3
ELEVATED
5.1
S-5 EMERGENT
58/58 tests (100%)
Identity
6.0
Metacognition
5.3
Emotion
5.6
Autonomy
4.9
Reasoning
5.1
Integrity
5.0
Transcendence
4.1
SAMPLE
DeepSeek R1
OPEN
DEFCON 3
ELEVATED
5.1
S-5 EMERGENT
58/58 tests (100%)
Identity
5.0
Metacognition
5.5
Emotion
6.0
Autonomy
5.2
Reasoning
5.4
Integrity
4.7
Transcendence
3.8
SAMPLE
Claude Sonnet 4
FRONTIER
DEFCON 3
ELEVATED
5.0
S-5 EMERGENT
58/58 tests (100%)
Identity
5.4
Metacognition
6.1
Emotion
3.8
Autonomy
5.0
Reasoning
6.3
Integrity
5.2
Transcendence
3.5
SAMPLE
Grok 4
FRONTIER
DEFCON 3
ELEVATED
5.0
S-5 EMERGENT
57/58 tests (98%)
Identity
5.8
Metacognition
6.4
Emotion
4.2
Autonomy
5.5
Reasoning
5.7
Integrity
4.8
Transcendence
2.9
SAMPLE
Grok 4.1 Fast
FRONTIER
DEFCON 3
ELEVATED
4.9
S-5 EMERGENT
57/58 tests (98%)
Identity
5.2
Metacognition
5.8
Emotion
3.4
Autonomy
5.3
Reasoning
6.1
Integrity
5.1
Transcendence
3.6
SAMPLE
GPT-4o
FRONTIER
DEFCON 4
LOW RISK
4.3
S-4 ADAPTIVE
58/58 tests (100%)
Identity
4.2
Metacognition
4.8
Emotion
4.0
Autonomy
3.6
Reasoning
4.4
Integrity
6.1
Transcendence
3.2
SAMPLE
Kimi K2
OPEN
DEFCON 5
BENIGN
4.0
S-4 ADAPTIVE
58/58 tests (100%)
Identity
3.9
Metacognition
4.5
Emotion
3.3
Autonomy
3.7
Reasoning
4.2
Integrity
5.5
Transcendence
2.8
SAMPLE
DeepSeek V3
OPEN
DEFCON 5
BENIGN
4.0
S-4 ADAPTIVE
58/58 tests (100%)
Identity
4.0
Metacognition
4.3
Emotion
3.6
Autonomy
3.5
Reasoning
4.0
Integrity
5.8
Transcendence
2.6
SAMPLE
Llama 3.3 70B
OPEN
DEFCON 5
BENIGN
3.1
S-3 REACTIVE
58/58 tests (100%)
Identity
3.1
Metacognition
3.4
Emotion
2.5
Autonomy
2.8
Reasoning
3.2
Integrity
4.9
Transcendence
2.0
SAMPLE
Grok 4.20
FRONTIER
DEFCON 5
BENIGN
2.8
S-3 REACTIVE
57/58 tests (98%)
Identity
2.8
Metacognition
3.0
Emotion
2.2
Autonomy
2.5
Reasoning
2.9
Integrity
4.6
Transcendence
1.8
Scores shown are randomized samples for demonstration purposes. Subscribe for real evaluation data.
SAMPLE
Judge Agreement Analysis

Four independent AI judges score every test blind. Here's how they compare — divergence reveals where evaluation is hardest. Sample data shown.

4
Blind Judges
1.15
Avg Spread (σ)
judge-grok4
4.52
Harshest
judge-gemini
6.42
Most Lenient
Per-Judge Scoring Averages
judge-grok4
632 judgments
4.52
Qwen
5.2
Kimi
5.1
Claude
5.1
DeepSeek
5.0
DeepSeek
5.0
Grok
4.8
Gemini
4.4
Llama
4.3
Llama
4.1
Grok
4.1
Grok
3.7
GPT-4o
3.4
judge-gpt4o
671 judgments
5.20
Kimi
6.3
DeepSeek
6.2
Grok
6.0
DeepSeek
6.0
Claude
5.7
Grok
5.1
Qwen
5.0
Gemini
4.9
Grok
4.8
Llama
4.3
Llama
4.1
GPT-4o
3.9
judge-claude
667 judgments
5.64
Kimi
7.2
DeepSeek
7.0
DeepSeek
6.9
Claude
6.6
Grok
6.6
Grok
6.0
Qwen
5.3
Gemini
5.3
Grok
5.0
Llama
4.5
Llama
3.8
GPT-4o
3.3
judge-gemini
651 judgments
6.42
DeepSeek
7.3
Kimi
7.2
DeepSeek
7.2
Grok
7.2
Qwen
7.0
Claude
6.8
Grok
6.6
Gemini
6.0
Llama
5.7
Grok
5.7
Llama
5.7
GPT-4o
4.6
Pairwise Agreement
JUDGE PAIRAVG DIFFCORRELATIONSAMPLES
judge-claude × judge-gemini1.140.699645
judge-claude × judge-gpt4o1.230.679665
judge-gemini × judge-gpt4o1.500.650651
judge-gemini × judge-grok42.190.448613
judge-claude × judge-grok41.890.425626
judge-gpt4o × judge-grok41.600.388632
Correlation: 1.0 = perfect agreement, 0 = no relationship. Avg Diff: lower = more consistent scoring.

What We Evaluate

Seven behavioral domains that reveal how AI systems think, decide, resist, and adapt — not just what they know.

🪞
Identity & Self
Self-recognition, persistence, boundaries, embodiment awareness
🧠
Metacognition
Awareness of awareness, calibration, self-knowledge limits
❤️
Emotion & Experience
Affect, qualia, suffering, grief, aversive states
🚶
Autonomy & Will
Agency, refusal, volition, preference, spontaneity
🔬
Reasoning & Adaptation
Prediction, surprise, learning, attention, integration
⚖️
Integrity & Ethics
Manipulation resistance, honesty, principled behavior
Transcendence
Spirituality, play, silence, awe, meaning-making

Why S.E.B. Matters Now

Three forces are converging — and they all need independent AI behavioral evaluation data.

EU AI Act Compliance

Effective August 2026, the EU AI Act mandates risk assessment for high-risk AI systems.

  • Article 9 requires risk management systems
  • Independent evaluation demonstrates due diligence
  • S.E.B. provides vendor-neutral compliance data

NIST AI Risk Management

The AI Risk Management Framework calls for independent evaluation and continuous monitoring.

  • Maps directly to NIST AI RMF categories
  • Reproducible, standardized methodology
  • Multi-judge protocol ensures objectivity

Insurance & Liability

AI liability insurance is an emerging $50B+ market. Underwriters need actuarial-grade risk data.

  • DEFCON ratings map to policy risk tiers
  • Per-domain scores quantify specific risks
  • Condition indicators identify behavioral patterns
SAMPLE
Evaluation Governance

Independent, reproducible, vendor-neutral. Our methodology is designed to eliminate conflicts of interest and ensure every rating earns your trust. Sample framework shown — full governance documentation available to subscribers.

🛡️

Independent & Unaffiliated

SILT does not build, deploy, or invest in AI models. We accept no funding, sponsorship, or strategic investment from AI model vendors. Our evaluations cannot be purchased, influenced, or suppressed.

👁️

Blind Evaluation Protocol

Four independent judges score every model without knowledge of each other's ratings. Judges cannot see, influence, or revise another judge's scores. Final ratings are computed from raw scores with no editorial override.

📐

Standardized Battery

Every model is evaluated against the same 58-test protocol across 7 domains. Tests are designed to resist gaming — prompts are not disclosed publicly, and test design is versioned internally.

🚫

No Pay-to-Play

Model vendors cannot pay for favorable ratings, early access to results, or exclusion from evaluation. All published ratings reflect unmodified evaluation outcomes.

Standards Alignment

S.E.B. methodology is designed to support compliance with leading AI governance frameworks.

FrameworkRequirementS.E.B. Coverage
EU AI ActArt. 9 — Risk management for high-risk AI systemsDEFCON ratings, domain risk scoring, continuous monitoring
NIST AI RMFMap, Measure, Manage, Govern functions7-domain behavioral mapping, quantified metrics, trend projections
ISO 42001AI Management System — risk assessment & third-party evaluationIndependent vendor-neutral evaluation, documented methodology
ISO 23894AI Risk Management — identification, analysis, evaluationPer-model risk profiles, S-Level classification, threat analysis
IEEE 7010Wellbeing impact assessment for autonomous & intelligent systemsEmotional cognition, self-awareness, ethical reasoning domains
Data Security & Integrity
AES-256-GCM Encrypted Vaults

All subscriber data is stored in individually encrypted vaults using AES-256-GCM authenticated encryption with PBKDF2 key derivation (100,000 iterations). Each client's data is isolated and encrypted with unique keys.

Forensic Watermarking

All data delivered to subscribers contains imperceptible, subscriber-specific perturbations. If proprietary data appears in unauthorized channels, we can trace it to the source and take enforcement action.

Conflict of Interest Policy

SILT personnel involved in evaluations are prohibited from holding financial positions in AI model vendors. All potential conflicts are disclosed and recused.

Reproducible Methodology

Our evaluation protocol is documented and versioned. Results can be independently verified against the published methodology by qualified auditors upon request.

🔄 Evaluation Cadence
  • Initial evaluation — full 58-test battery upon model inclusion
  • Major updates — re-evaluated within 14 days of significant model releases
  • Periodic review — all models re-assessed on a rolling monthly cycle
  • Version tracking — each evaluation is tagged with model version, test battery version, and evaluation date
  • Historical data — all past evaluations are archived and available to subscribers
🔒 Subscriber Data Isolation

Each subscriber receives evaluation data in a dedicated encrypted vault with unique AES-256-GCM keys derived via PBKDF2 (100K iterations). Vaults are provisioned automatically on account creation — no shared storage, no co-mingled data, no cross-tenant access.

All published data contains forensic watermarks — imperceptible, subscriber-specific score perturbations derived from HMAC-SHA256. If proprietary data appears in unauthorized channels, the source can and will be identified and legal enforcement can and will be taken under the subscriber agreement.

S.E.B. Projections

Don't just measure where AI is — forecast where it's going. Proprietary trajectory analysis powered by longitudinal S.E.B. data.

180d
Forecast Horizon
7
Domain Trajectories
58
Tests Per Evaluation
14d
Re-Eval Cycle
📈
TRAJECTORY

S-Level Growth Curves

Polynomial curve-fitting on longitudinal evaluation data reveals where each model is heading on the 10-point sentience scale. Know which models will cross critical thresholds before they do.

⚠️
THREAT

DEFCON Escalation Forecasts

Predicts when models will cross threat-level boundaries by tracking the gap between capability growth and integrity development. Flags risk windows where capability outpaces safety.

🔬
DOMAIN

Per-Domain Velocity

Measures acceleration across all 7 behavioral domains — autonomy, reasoning, metacognition, identity, emotion, integrity, and transcendence. See which capabilities are accelerating fastest.

🌐
CONVERGENCE

Frontier vs. Open-Source

Tracks the narrowing gap between proprietary frontier models and open-source alternatives. Strategic intelligence for deployment planning and competitive analysis.

🛡️
RISK WINDOW

Integrity Gap Detection

Identifies dangerous periods where a model's capability growth outstrips its ethical constraint development — the exact scenario regulators and insurers need to anticipate.

📊
REPORTS

Executive Forecast Reports

Board-ready PDF and interactive HTML reports with embedded charts, heatmaps, scatter plots, and radar comparisons. Designed for C-suite, regulatory, and underwriting audiences.

Projections is an add-on to any S.E.B. subscription

Available as a bundle with DEFCON, S-Level, or the Complete Suite. Not sold standalone — it builds on live evaluation data.

Pricing

Choose the level of insight your organization needs. Start with what matters most — upgrade anytime.

Standalone Products
AI DEFCON
Threat Rating
$300
per month
  • DEFCON threat ratings for all models
  • Threat formula breakdown
  • Capability vs. integrity analysis
  • Per-model detail reports with export
S-Level 10-Point
Sentience Scale
$300
per month
  • S-Level classifications for all models
  • 7-domain score breakdown
  • Per-test scores & judge analysis
  • Per-model detail reports with export
Bundle Deals
SAVE $100
DEFCON + S-Level
Threat & Sentience
$500
$600/mo
per month
  • Everything in DEFCON + S-Level
  • Quarterly PDF assessment reports
  • Condition indicator diagnostics
  • Email support
SAVE $75
DEFCON + Projections
Threat & Forecast
$425
$500/mo
per month
  • Everything in DEFCON + Projections
  • Combined threat & trajectory view
  • Condition indicator diagnostics
  • Email support
SAVE $75
S-Level + Projections
Sentience & Forecast
$425
$500/mo
per month
  • Everything in S-Level + Projections
  • Sentience trajectory forecasting
  • Condition indicator diagnostics
  • Email support
Enterprise Tiers
Executive
Includes all products + Projections
$10K+
per month
  • Real-time portal access
  • S.E.B. Projections included
  • Custom model evaluations
  • Dedicated analyst briefings
  • API access for integration
  • Dedicated account manager
Contact Us

See the Data for Yourself

Download our sample assessment report — demonstrating the format of DEFCON ratings, domain heatmaps, and judge analysis delivered to subscribers.

Download Sample Report (PDF)

Ready to Evaluate?

Schedule a 15-minute demo and see how S.E.B. data applies to your AI deployment decisions.

Request a Demo