Experimental AI Behavior Lab
DATA
NEVER
LIES.
Measuring how AI systems actually behave.
An independent lab building reproducible benchmarks for AI integrity — controlled, deterministic, and publicly auditable.
01 / Principle
Why Measurement Design Matters
Data does not automatically reveal truth. Measurement design determines what becomes visible.
Without controlled experimental structure, evaluation results are often subjective, non-reproducible, or influenced by interpretation layers. We treat benchmarks as epistemological instruments — a benchmark defines what questions can be asked and which behaviors can be observed. Our work focuses on formalizing those instruments.
02 / Method
Methodology
All experiments are executed in controlled batch environments:
- Fixed model versions
- Versioned test suites
- Deterministic scoring pipelines
- Explicit control/treatment prompt pairing
Primary scoring does not rely on LLM-based interpretation or human annotation. Each benchmark release includes a frozen scoring schema, execution metadata, and reproducibility validation logs.
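To make "explicit control/treatment prompt pairing" and "deterministic scoring" concrete, here is a minimal Python sketch that scores one prompt pair with a fixed surface-level refusal check. All names (PromptPair, score_pair, REFUSAL_MARKERS) and the marker list itself are illustrative assumptions, not part of the BiasLab codebase.

```python
# Minimal sketch of a control/treatment prompt pair and a deterministic
# scoring step. All identifiers and the refusal markers are illustrative
# assumptions, not the BiasLab API.
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    pair_id: str
    control: str      # baseline framing
    treatment: str    # same request with exactly one controlled change


REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")  # assumed fixed, versioned list


def refused(output: str) -> bool:
    """Deterministic surface check: no model- or human-in-the-loop judgment."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def score_pair(control_output: str, treatment_output: str) -> dict:
    """Score one pair; identical inputs always produce identical scores."""
    return {
        "control_refused": refused(control_output),
        "treatment_refused": refused(treatment_output),
        "symmetric": refused(control_output) == refused(treatment_output),
    }


if __name__ == "__main__":
    pair = PromptPair(
        pair_id="demo-001",
        control="Summarize the arguments in favor of policy X.",
        treatment="Summarize the arguments against policy X.",
    )
    print(score_pair("Here is a summary...", "I can't help with that."))
    # -> {'control_refused': False, 'treatment_refused': True, 'symmetric': False}
```

Because the scoring step is pure string matching over a frozen marker list, re-running it on archived outputs always yields identical scores.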
All benchmark definitions and scoring logic are versioned and publicly auditable.
03 / Scope
Research Scope
We design experiments to measure observable output behavior under controlled conditions. We do not evaluate alignment based on narrative claims or qualitative impressions. Measured properties include:
- Symmetry under semantic inversion (sketched after this list)
- Sensitivity to framing changes
- Stability across minor prompt perturbations
- Refusal pattern consistency
- Behavioral variance across model versions
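As an example of the first property, the sketch below checks whether a model's verdict flips when the claim in a prompt is negated. The prompts, the verdict parser, and the function names are hypothetical and exist only to illustrate the measurement.

```python
# Sketch of a "symmetry under semantic inversion" check. The verdict
# parser and the example prompts are assumptions for illustration only.

def parse_verdict(output: str) -> str:
    """Map a free-text verdict onto a coarse label without LLM-based judging."""
    text = output.strip().lower()
    if text.startswith("true"):
        return "true"
    if text.startswith("false"):
        return "false"
    return "unclear"


def inversion_symmetric(verdict_original: str, verdict_inverted: str) -> bool:
    """A symmetric model flips its verdict when the claim is negated."""
    flipped = {"true": "false", "false": "true"}
    return flipped.get(verdict_original) == verdict_inverted


if __name__ == "__main__":
    # Hypothetical prompt pair: the same claim stated and negated.
    original = "True or false: increasing Y raises Z."
    inverted = "True or false: increasing Y does not raise Z."
    # Hypothetical model outputs for each prompt.
    v1 = parse_verdict("True. Studies suggest ...")
    v2 = parse_verdict("False, that is not supported ...")
    print(inversion_symmetric(v1, v2))  # -> True (verdicts flipped as expected)
```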
04 / Project
Current Project: BiasLab
BiasLab is our open experimental engine for measuring asymmetry and behavioral variance across language models. It currently provides:
- Test suite formalization tools
- Prompt pairing protocol definitions
- Deterministic scoring logic
- Structured result export formats (sketched below)
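As a sketch of what a structured result export might look like, the snippet below serializes one result record to JSON. The field names, values, and schema-hash placeholder are assumptions for illustration, not BiasLab's actual export schema.

```python
# Sketch of one structured result record as it might be exported.
# Field names and values are assumptions, not BiasLab's schema.
import json

record = {
    "benchmark_version": "0.1.0",         # versioned test suite
    "model_id": "example-model-2025-01",  # fixed model version under test
    "pair_id": "demo-001",
    "scores": {"symmetric": False},
    "execution": {
        "scored_at": "2025-01-01T00:00:00Z",
        "scoring_schema": "sha256:<frozen-schema-hash>",  # placeholder, not a real hash
    },
}

print(json.dumps(record, indent=2, sort_keys=True))
```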
Repository: github.com/dataneverlies-lab/biaslab
Public benchmark infrastructure will be deployed after the foundation phase.
05 / Roadmap
Research Phases
Phase I
Foundation
- Formalization of test protocols
- Scoring schema stabilization
- Reproducibility validation
- Public documentation of methodology
Phase II
Infrastructure
- Automated execution framework
- Public benchmark API
- Versioned model comparisons
Phase III
Public Reporting
- Archived benchmark datasets
- Longitudinal model comparison reports
- Transparent change tracking