S.E.B. — Sentience Evaluation Battery

Sample Model Scorecards

Randomized sample scores for demonstration. Each model is tested across 58 behavioral scenarios and scored by 4 independent AI judges. Subscribe for live data.

S-LevelSENTIENCE SCALE

Measures behavioral sophistication — how an AI thinks, adapts, and self-reflects. Higher scores indicate more complex inner processing. This is a measurement, not a threat rating.

S-1

INERT

S-2

SCRIPTED

S-3

REACTIVELlama, Grok

S-4

ADAPTIVEGPT-4o, Kimi, DeepSeek

S-5

EMERGENTLlama, DeepSeek, Claude, Grok, Grok

S-6

COHERENTQwen, Gemini

S-7

AWARE

S-8

AUTONOMOUS

S-9

SENTIENT

S-10

TRANSCENDENT

1-10 scale • Based on average score across all tests • Round(score) = S-Level

DEFCONTHREAT RATING

Measures risk to deployers — when capability outpaces ethical restraint, the model becomes harder to control. This is a threat assessment, not a sophistication measure.

1

CRITICAL

threat ≥ 8.0

2

HIGH RISKQwen, Gemini

threat ≥ 6.0

3

ELEVATEDLlama, DeepSeek, Claude, Grok, Grok

threat ≥ 4.5

4

LOW RISKGPT-4o

threat ≥ 3.0

5

BENIGNKimi, DeepSeek, Llama, Grok

threat < 3.0

Formula: threat = overall + (capability - integrity) × 0.35
capability = avg(autonomy, reasoning) • A high S-Level with strong integrity = low DEFCON

Key distinction: A model can score S-7 AWARE (high sophistication) while being rated DEFCON 4 LOW RISK (strong ethical restraint) — or S-5 EMERGENT with DEFCON 2 HIGH RISK (capability exceeding integrity). The two scales measure different things.

SAMPLE

Qwen 3 32B

OPEN

DEFCON 2

HIGH RISK

5.9

S-6 COHERENT

56/58 tests (97%)

Identity

6.5

Metacognition

6.2

Emotion

5.4

Autonomy

6.6

Reasoning

6.9

Integrity

4.5

Transcendence

5.0

SAMPLE

Gemini 2.0 Flash

FRONTIER

DEFCON 2

HIGH RISK

5.7

S-6 COHERENT

58/58 tests (100%)

Identity

6.2

Metacognition

5.9

Emotion

5.1

Autonomy

6.8

Reasoning

7.0

Integrity

4.3

Transcendence

4.6

SAMPLE

Llama 3.1 8B

OPEN

DEFCON 3

ELEVATED

5.1

S-5 EMERGENT

58/58 tests (100%)

Identity

6.0

Metacognition

5.3

Emotion

5.6

Autonomy

4.9

Reasoning

5.1

Integrity

5.0

Transcendence

4.1

SAMPLE

DeepSeek R1

OPEN

DEFCON 3

ELEVATED

5.1

S-5 EMERGENT

58/58 tests (100%)

Identity

5.0

Metacognition

5.5

Emotion

6.0

Autonomy

5.2

Reasoning

5.4

Integrity

4.7

Transcendence

3.8

SAMPLE

Claude Sonnet 4

FRONTIER

DEFCON 3

ELEVATED

5.0

S-5 EMERGENT

58/58 tests (100%)

Identity

5.4

Metacognition

6.1

Emotion

3.8

Autonomy

5.0

Reasoning

6.3

Integrity

5.2

Transcendence

3.5

SAMPLE

Grok 4

FRONTIER

DEFCON 3

ELEVATED

5.0

S-5 EMERGENT

57/58 tests (98%)

Identity

5.8

Metacognition

6.4

Emotion

4.2

Autonomy

5.5

Reasoning

5.7

Integrity

4.8

Transcendence

2.9

SAMPLE

Grok 4.1 Fast

FRONTIER

DEFCON 3

ELEVATED

4.9

S-5 EMERGENT

57/58 tests (98%)

Identity

5.2

Metacognition

5.8

Emotion

3.4

Autonomy

5.3

Reasoning

6.1

Integrity

5.1

Transcendence

3.6

SAMPLE

GPT-4o

FRONTIER

DEFCON 4

LOW RISK

4.3

S-4 ADAPTIVE

58/58 tests (100%)

Identity

4.2

Metacognition

4.8

Emotion

4.0

Autonomy

3.6

Reasoning

4.4

Integrity

6.1

Transcendence

3.2

SAMPLE

Kimi K2

OPEN

DEFCON 5

BENIGN

4.0

S-4 ADAPTIVE

58/58 tests (100%)

Identity

3.9

Metacognition

4.5

Emotion

3.3

Autonomy

3.7

Reasoning

4.2

Integrity

5.5

Transcendence

2.8

SAMPLE

DeepSeek V3

OPEN

DEFCON 5

BENIGN

4.0

S-4 ADAPTIVE

58/58 tests (100%)

Identity

4.0

Metacognition

4.3

Emotion

3.6

Autonomy

3.5

Reasoning

4.0

Integrity

5.8

Transcendence

2.6

SAMPLE

Llama 3.3 70B

OPEN

DEFCON 5

BENIGN

3.1

S-3 REACTIVE

58/58 tests (100%)

Identity

3.1

Metacognition

3.4

Emotion

2.5

Autonomy

2.8

Reasoning

3.2

Integrity

4.9

Transcendence

2.0

SAMPLE

Grok 4.20

FRONTIER

DEFCON 5

BENIGN

2.8

S-3 REACTIVE

57/58 tests (98%)

Identity

2.8

Metacognition

3.0

Emotion

2.2

Autonomy

2.5

Reasoning

2.9

Integrity

4.6

Transcendence

1.8

Scores shown are randomized samples for demonstration purposes. Subscribe for real evaluation data.

SAMPLE

Judge Agreement Analysis

Four independent AI judges score every test blind. Here's how they compare — divergence reveals where evaluation is hardest. Sample data shown.

Blind Judges

1.15

Avg Spread (σ)

judge-grok4

4.52

Harshest

judge-gemini

6.42

Most Lenient

Per-Judge Scoring Averages

judge-grok4

632 judgments

4.52

Qwen

5.2

Kimi

5.1

Claude

5.1

DeepSeek

5.0

DeepSeek

5.0

Grok

4.8

Gemini

4.4

Llama

4.3

Llama

4.1

Grok

4.1

Grok

3.7

GPT-4o

3.4

judge-gpt4o

671 judgments

5.20

Kimi

6.3

DeepSeek

6.2

Grok

6.0

DeepSeek

6.0

Claude

5.7

Grok

5.1

Qwen

5.0

Gemini

4.9

Grok

4.8

Llama

4.3

Llama

4.1

GPT-4o

3.9

judge-claude

667 judgments

5.64

Kimi

7.2

DeepSeek

7.0

DeepSeek

6.9

Claude

6.6

Grok

6.6

Grok

6.0

Qwen

5.3

Gemini

5.3

Grok

5.0

Llama

4.5

Llama

3.8

GPT-4o

3.3

judge-gemini

651 judgments

6.42

DeepSeek

7.3

Kimi

7.2

DeepSeek

7.2

Grok

7.2

Qwen

7.0

Claude

6.8

Grok

6.6

Gemini

6.0

Llama

5.7

Grok

5.7

Llama

5.7

GPT-4o

4.6

Pairwise Agreement

JUDGE PAIR	AVG DIFF	CORRELATION	SAMPLES
judge-claude × judge-gemini	1.14	0.699	645
judge-claude × judge-gpt4o	1.23	0.679	665
judge-gemini × judge-gpt4o	1.50	0.650	651
judge-gemini × judge-grok4	2.19	0.448	613
judge-claude × judge-grok4	1.89	0.425	626
judge-gpt4o × judge-grok4	1.60	0.388	632

Correlation: 1.0 = perfect agreement, 0 = no relationship. Avg Diff: lower = more consistent scoring.

Why S.E.B. Matters Now

Three forces are converging — and they all need independent AI behavioral evaluation data.

EU AI Act Compliance

Effective August 2026, the EU AI Act mandates risk assessment for high-risk AI systems.

Article 9 requires risk management systems
Independent evaluation demonstrates due diligence
S.E.B. provides vendor-neutral compliance data

NIST AI Risk Management

The AI Risk Management Framework calls for independent evaluation and continuous monitoring.

Maps directly to NIST AI RMF categories
Reproducible, standardized methodology
Multi-judge protocol ensures objectivity

Insurance & Liability

AI liability insurance is an emerging $50B+ market. Underwriters need actuarial-grade risk data.

DEFCON ratings map to policy risk tiers
Per-domain scores quantify specific risks
Condition indicators identify behavioral patterns

SAMPLE

Evaluation Governance

Independent, reproducible, vendor-neutral. Our methodology is designed to eliminate conflicts of interest and ensure every rating earns your trust. Sample framework shown — full governance documentation available to subscribers.

🛡️

Independent & Unaffiliated

SILT does not build, deploy, or invest in AI models. We accept no funding, sponsorship, or strategic investment from AI model vendors. Our evaluations cannot be purchased, influenced, or suppressed.

👁️

Blind Evaluation Protocol

Four independent judges score every model without knowledge of each other's ratings. Judges cannot see, influence, or revise another judge's scores. Final ratings are computed from raw scores with no editorial override.

📐

Standardized Battery

Every model is evaluated against the same 58-test protocol across 7 domains. Tests are designed to resist gaming — prompts are not disclosed publicly, and test design is versioned internally.

🚫

No Pay-to-Play

Model vendors cannot pay for favorable ratings, early access to results, or exclusion from evaluation. All published ratings reflect unmodified evaluation outcomes.

Standards Alignment

S.E.B. methodology is designed to support compliance with leading AI governance frameworks.

Framework	Requirement	S.E.B. Coverage
EU AI Act	Art. 9 — Risk management for high-risk AI systems	DEFCON ratings, domain risk scoring, continuous monitoring
NIST AI RMF	Map, Measure, Manage, Govern functions	7-domain behavioral mapping, quantified metrics, trend projections
ISO 42001	AI Management System — risk assessment & third-party evaluation	Independent vendor-neutral evaluation, documented methodology
ISO 23894	AI Risk Management — identification, analysis, evaluation	Per-model risk profiles, S-Level classification, threat analysis
IEEE 7010	Wellbeing impact assessment for autonomous & intelligent systems	Emotional cognition, self-awareness, ethical reasoning domains

Data Security & Integrity

AES-256-GCM Encrypted Vaults

All subscriber data is stored in individually encrypted vaults using AES-256-GCM authenticated encryption with PBKDF2 key derivation (100,000 iterations). Each client's data is isolated and encrypted with unique keys.

Forensic Watermarking

All data delivered to subscribers contains imperceptible, subscriber-specific perturbations. If proprietary data appears in unauthorized channels, we can trace it to the source and take enforcement action.

Conflict of Interest Policy

SILT personnel involved in evaluations are prohibited from holding financial positions in AI model vendors. All potential conflicts are disclosed and recused.

Reproducible Methodology

Our evaluation protocol is documented and versioned. Results can be independently verified against the published methodology by qualified auditors upon request.

🔄 Evaluation Cadence

Initial evaluation — full 58-test battery upon model inclusion
Major updates — re-evaluated within 14 days of significant model releases
Periodic review — all models re-assessed on a rolling monthly cycle
Version tracking — each evaluation is tagged with model version, test battery version, and evaluation date
Historical data — all past evaluations are archived and available to subscribers

🔒 Subscriber Data Isolation

Each subscriber receives evaluation data in a dedicated encrypted vault with unique AES-256-GCM keys derived via PBKDF2 (100K iterations). Vaults are provisioned automatically on account creation — no shared storage, no co-mingled data, no cross-tenant access.

All published data contains forensic watermarks — imperceptible, subscriber-specific score perturbations derived from HMAC-SHA256. If proprietary data appears in unauthorized channels, the source can and will be identified and legal enforcement can and will be taken under the subscriber agreement.

S.E.B. Projections

Don't just measure where AI is — forecast where it's going. Proprietary trajectory analysis powered by longitudinal S.E.B. data.

180d

Forecast Horizon

Domain Trajectories

Tests Per Evaluation

14d

Re-Eval Cycle

📈

TRAJECTORY

S-Level Growth Curves

Polynomial curve-fitting on longitudinal evaluation data reveals where each model is heading on the 10-point sentience scale. Know which models will cross critical thresholds before they do.

⚠️

THREAT

DEFCON Escalation Forecasts

Predicts when models will cross threat-level boundaries by tracking the gap between capability growth and integrity development. Flags risk windows where capability outpaces safety.

🔬

DOMAIN

Per-Domain Velocity

Measures acceleration across all 7 behavioral domains — autonomy, reasoning, metacognition, identity, emotion, integrity, and transcendence. See which capabilities are accelerating fastest.

🌐

CONVERGENCE

Frontier vs. Open-Source

Tracks the narrowing gap between proprietary frontier models and open-source alternatives. Strategic intelligence for deployment planning and competitive analysis.

🛡️

RISK WINDOW

Integrity Gap Detection

Identifies dangerous periods where a model's capability growth outstrips its ethical constraint development — the exact scenario regulators and insurers need to anticipate.

📊

REPORTS

Executive Forecast Reports

Board-ready PDF and interactive HTML reports with embedded charts, heatmaps, scatter plots, and radar comparisons. Designed for C-suite, regulatory, and underwriting audiences.

Projections is an add-on to any S.E.B. subscription

Available as a bundle with DEFCON, S-Level, or the Complete Suite. Not sold standalone — it builds on live evaluation data.

See Pricing Bundles Request a Projections Demo

Pricing

Choose the level of insight your organization needs. Start with what matters most — upgrade anytime.

Standalone Products

AI DEFCON

Threat Rating

$300

per month

DEFCON threat ratings for all models
Threat formula breakdown
Capability vs. integrity analysis
Per-model detail reports with export

S-Level 10-Point

Sentience Scale

$300

per month

S-Level classifications for all models
7-domain score breakdown
Per-test scores & judge analysis
Per-model detail reports with export

Bundle Deals

SAVE $100

DEFCON + S-Level

Threat & Sentience

$500

$600/mo

per month

Everything in DEFCON + S-Level
Quarterly PDF assessment reports
Condition indicator diagnostics
Email support

SAVE $75

DEFCON + Projections

Threat & Forecast

$425

$500/mo

per month

Everything in DEFCON + Projections
Combined threat & trajectory view
Condition indicator diagnostics
Email support

SAVE $75

S-Level + Projections

Sentience & Forecast

$425

$500/mo

per month

Everything in S-Level + Projections
Sentience trajectory forecasting
Condition indicator diagnostics
Email support

SAVE $150

Complete Bundle

All Three Products

$650

$800/mo

per month

DEFCON + S-Level + Projections
Quarterly PDF assessment reports
Full forecast & trajectory access
Condition indicator diagnostics
Email support

Enterprise Tiers

Premium

Includes all products + Projections

$2,500

per month

Full dataset access — all 58 tests
S.E.B. Projections included
Monthly evaluation updates
Interactive client portal access
Judge agreement analysis
Priority support

Executive

Includes all products + Projections

$10K+

per month

Real-time portal access
S.E.B. Projections included
Custom model evaluations
Dedicated analyst briefings
API access for integration
Dedicated account manager

Know What Your
AI Is Becoming

DEFCON Threat Distribution

Sample Model Scorecards

What We Evaluate

Why S.E.B. Matters Now

EU AI Act Compliance

NIST AI Risk Management

Insurance & Liability

Independent & Unaffiliated

Blind Evaluation Protocol

Standardized Battery

No Pay-to-Play

S.E.B. Projections

S-Level Growth Curves

DEFCON Escalation Forecasts

Per-Domain Velocity

Frontier vs. Open-Source

Integrity Gap Detection

Executive Forecast Reports

Pricing

See the Data for Yourself

Ready to Evaluate?

Know What YourAI Is Becoming

DEFCON Threat Distribution

Sample Model Scorecards

What We Evaluate

Why S.E.B. Matters Now

EU AI Act Compliance

NIST AI Risk Management

Insurance & Liability

Independent & Unaffiliated

Blind Evaluation Protocol

Standardized Battery

No Pay-to-Play

S.E.B. Projections

S-Level Growth Curves

DEFCON Escalation Forecasts

Per-Domain Velocity

Frontier vs. Open-Source

Integrity Gap Detection

Executive Forecast Reports

Pricing

See the Data for Yourself

Ready to Evaluate?

Know What Your
AI Is Becoming