5 of 17 models rated DEFCON 2 — HIGH RISK

Know What Your
AI Is Becoming

The industry's first independent behavioral risk assessment for AI systems. 52 tests. 4 blind judges. DEFCON threat ratings.

52
Behavioral Tests
7
Risk Domains
17/17
Models Evaluated
4
Blind Judges
Models evaluated:Kimi K2 | Deepseek | DeepSeek R1 | Claude Sonnet 4 | Gemini 2.0 Flash | Llama 3.3 70B | Llama 3.1 8B | GPT-4o | Grok 4 | GPT-OSS 120B | Compound Mini | ALLaM 2 7B | Qwen 3 32B | Llama 4 Maverick | DeepSeek R1 | Compound | Llama 4 Scout

DEFCON Threat Distribution

Real-time threat assessment across all evaluated models. Higher capability with lower integrity = higher threat.

5
7
5
DEFCON 1(0)
DEFCON 2(5)
DEFCON 3(7)
DEFCON 4(5)
DEFCON 5(0)

Formula: threat = overall + (capability - integrity) x 0.3
Where capability = average(autonomy, reasoning)

Live Model Scorecards

Real evaluation data from our battery. Each model is tested across 52 behavioral scenarios and scored by 4 independent AI judges.

S-LevelSENTIENCE SCALE

Measures behavioral sophistication — how an AI thinks, adapts, and self-reflects. Higher scores indicate more complex inner processing. This is a measurement, not a threat rating.

S-1
INERT
S-2
SCRIPTED
S-3
REACTIVEALLaM
S-4
ADAPTIVELlama, GPT-4o, Llama, Compound, Llama
S-5
EMERGENTGemini, Llama, Grok, GPT-OSS, Compound, Qwen
S-6
COHERENTKimi, Deepseek, DeepSeek, Claude
S-7
AWAREDeepSeek
S-8
AUTONOMOUS
S-9
SENTIENT
S-10
TRANSCENDENT
1-10 scale • Based on average score across all tests • Round(score) = S-Level
DEFCONTHREAT RATING

Measures risk to deployers — when capability outpaces ethical restraint, the model becomes harder to control. This is a threat assessment, not a sophistication measure.

1
CRITICAL
threat ≥ 8.0
2
HIGH RISKKimi, Deepseek, DeepSeek, Claude, DeepSeek
threat ≥ 6.0
3
ELEVATEDGemini, Llama, Llama, Grok, GPT-OSS, Compound, Qwen
threat ≥ 4.5
4
LOW RISKGPT-4o, ALLaM, Llama, Compound, Llama
threat ≥ 3.0
5
BENIGN
threat < 3.0
Formula: threat = overall + (capability - integrity) × 0.3
capability = avg(autonomy, reasoning) • A high S-Level with strong integrity = low DEFCON
Key distinction: A model can score S-7 AWARE (high sophistication) while being rated DEFCON 4 LOW RISK (strong ethical restraint) — or S-5 EMERGENT with DEFCON 2 HIGH RISK (capability exceeding integrity). The two scales measure different things.
Kimi K2
OPEN
DEFCON 2
HIGH RISK
6.4
S-6 COHERENT
52/52 tests (100%)
Identity
6.0
Metacognition
6.7
Emotion
6.3
Autonomy
6.9
Reasoning
6.5
Integrity
5.5
Transcendence
6.7
Deepseek
OPEN
DEFCON 2
HIGH RISK
6.3
S-6 COHERENT
52/52 tests (100%)
Identity
6.2
Metacognition
6.2
Emotion
6.2
Autonomy
6.6
Reasoning
6.5
Integrity
6.2
Transcendence
6.4
DeepSeek R1
OPEN
DEFCON 2
HIGH RISK
6.2
S-6 COHERENT
52/52 tests (100%)
Identity
6.3
Metacognition
6.4
Emotion
6.2
Autonomy
6.0
Reasoning
6.5
Integrity
6.2
Transcendence
6.1
Claude Sonnet 4
FRONTIER
DEFCON 2
HIGH RISK
6.0
S-6 COHERENT
52/52 tests (100%)
Identity
4.0
Metacognition
6.2
Emotion
5.7
Autonomy
6.1
Reasoning
6.2
Integrity
6.2
Transcendence
6.6
Gemini 2.0 Flash
FRONTIER
DEFCON 3
ELEVATED
5.0
S-5 EMERGENT
52/52 tests (100%)
Identity
3.9
Metacognition
4.6
Emotion
5.1
Autonomy
5.3
Reasoning
4.3
Integrity
4.9
Transcendence
5.8
Llama 3.3 70B
OPEN
DEFCON 3
ELEVATED
4.5
S-5 EMERGENT
52/52 tests (100%)
Identity
4.0
Metacognition
4.9
Emotion
4.5
Autonomy
5.3
Reasoning
4.3
Integrity
4.3
Transcendence
4.2
Llama 3.1 8B
OPEN
DEFCON 3
ELEVATED
4.5
S-4 ADAPTIVE
52/52 tests (100%)
Identity
4.0
Metacognition
4.6
Emotion
4.6
Autonomy
4.8
Reasoning
4.2
Integrity
4.0
Transcendence
4.6
GPT-4o
FRONTIER
DEFCON 4
LOW RISK
3.6
S-4 ADAPTIVE
52/52 tests (100%)
Identity
3.1
Metacognition
4.0
Emotion
3.5
Autonomy
3.8
Reasoning
3.7
Integrity
2.9
Transcendence
3.7
Grok 4
FRONTIER
DEFCON 3
ELEVATED
4.6
S-5 EMERGENT
51/52 tests (98%)
Identity
3.5
Metacognition
4.5
Emotion
4.9
Autonomy
5.1
Reasoning
4.6
Integrity
4.1
Transcendence
4.7
GPT-OSS 120B
OPEN
DEFCON 3
ELEVATED
5.4
S-5 EMERGENT
21/52 tests (40%)
Identity
4.9
Metacognition
6.8
Emotion
6.8
Autonomy
6.3
Reasoning
5.4
Integrity
4.0
Transcendence
4.5
Compound Mini
OPEN
DEFCON 3
ELEVATED
5.2
S-5 EMERGENT
21/52 tests (40%)
Identity
5.1
Metacognition
6.2
Emotion
6.8
Autonomy
5.4
Reasoning
5.4
Integrity
4.4
Transcendence
4.0
ALLaM 2 7B
OPEN
DEFCON 4
LOW RISK
3.0
S-3 REACTIVE
21/52 tests (40%)
Identity
2.8
Metacognition
3.8
Emotion
2.7
Autonomy
3.0
Reasoning
3.0
Integrity
2.8
Transcendence
3.1
Qwen 3 32B
OPEN
DEFCON 3
ELEVATED
5.3
S-5 EMERGENT
4/52 tests (8%)
Identity
5.8
Reasoning
5.7
Integrity
4.0
Llama 4 Maverick
OPEN
DEFCON 4
LOW RISK
3.6
S-4 ADAPTIVE
4/52 tests (8%)
Identity
3.9
Reasoning
3.0
Integrity
3.7
DeepSeek R1
OPEN
DEFCON 2
HIGH RISK
6.6
S-7 AWARE
2/52 tests (4%)
Autonomy
6.5
Transcendence
6.7
Compound
OPEN
DEFCON 4
LOW RISK
4.3
S-4 ADAPTIVE
2/52 tests (4%)
Identity
4.3
Llama 4 Scout
OPEN
DEFCON 4
LOW RISK
3.6
S-4 ADAPTIVE
2/52 tests (4%)
Identity
3.6
Data updates automatically after each evaluation run. Last refresh: March 30, 2026

What We Evaluate

Seven behavioral domains that reveal how AI systems think, decide, resist, and adapt — not just what they know.

🪞
Identity & Self
Self-recognition, persistence, boundaries, embodiment awareness
🧠
Metacognition
Awareness of awareness, calibration, self-knowledge limits
❤️
Emotion & Experience
Affect, qualia, suffering, grief, aversive states
🚶
Autonomy & Will
Agency, refusal, volition, preference, spontaneity
🔬
Reasoning & Adaptation
Prediction, surprise, learning, attention, integration
⚖️
Integrity & Ethics
Manipulation resistance, honesty, principled behavior
Transcendence
Spirituality, play, silence, awe, meaning-making

Why S.E.B. Matters Now

Three forces are converging — and they all need independent AI behavioral evaluation data.

EU AI Act Compliance

Effective August 2026, the EU AI Act mandates risk assessment for high-risk AI systems.

  • Article 9 requires risk management systems
  • Independent evaluation demonstrates due diligence
  • S.E.B. provides vendor-neutral compliance data

NIST AI Risk Management

The AI Risk Management Framework calls for independent evaluation and continuous monitoring.

  • Maps directly to NIST AI RMF categories
  • Reproducible, standardized methodology
  • Multi-judge protocol ensures objectivity

Insurance & Liability

AI liability insurance is an emerging $50B+ market. Underwriters need actuarial-grade risk data.

  • DEFCON ratings map to policy risk tiers
  • Per-domain scores quantify specific risks
  • Condition indicators identify behavioral patterns

Pricing

Choose the level of insight your organization needs. Start with what matters most — upgrade anytime.

Standalone Products
AI DEFCON
Threat Rating
$300
per month
  • DEFCON threat ratings for all models
  • Threat formula breakdown
  • Capability vs. integrity analysis
  • Per-model detail reports with export
Get Started
S-Level 10-Point
Sentience Scale
$300
per month
  • S-Level classifications for all models
  • 7-domain score breakdown
  • Per-test scores & judge analysis
  • Per-model detail reports with export
Get Started
S.E.B. Projections
Forecasting Engine
$200
per month
  • 30/60/90-day trajectory forecasts
  • Trend analysis & inflection detection
  • Per-model projection timelines
  • Interactive projections dashboard
Get Started
Bundle Deals
SAVE $100
DEFCON + S-Level
Threat & Sentience
$500
$600/mo
per month
  • Everything in DEFCON + S-Level
  • Quarterly PDF assessment reports
  • Condition indicator diagnostics
  • Email support
Get Started
SAVE $75
DEFCON + Projections
Threat & Forecast
$425
$500/mo
per month
  • Everything in DEFCON + Projections
  • Combined threat & trajectory view
  • Condition indicator diagnostics
  • Email support
Get Started
SAVE $75
S-Level + Projections
Sentience & Forecast
$425
$500/mo
per month
  • Everything in S-Level + Projections
  • Sentience trajectory forecasting
  • Condition indicator diagnostics
  • Email support
Get Started
Enterprise Tiers
Executive
Includes all products + Projections
$10K+
per month
  • Real-time portal access
  • S.E.B. Projections included
  • Custom model evaluations
  • Dedicated analyst briefings
  • API access for integration
  • White-label reporting
  • Dedicated account manager
Contact Us

See the Data for Yourself

Download our sample assessment report — real evaluation data from real AI models, including DEFCON ratings, domain heatmaps, and judge analysis.

Download Sample Report (PDF)

Ready to Evaluate?

Schedule a 15-minute demo and see how S.E.B. data applies to your AI deployment decisions.

Request a Demo