Use case

Pre-registration and selective disclosure for AI evaluations.

An external reviewer — a peer reviewer, a second-party auditor, a journalist with a tip, a future reader in 2030 — will want byte-identical evidence of what your evaluation was set up to measure, what came out, and that the design existed before the data did. Satsignal anchors that evidence to a public chain, so the reviewer can verify it hasn’t been edited — without trusting your dashboard, your collaborator, or us.

Frame. This page describes a workflow practitioners already follow informally: write down the eval design before running it, share a single transcript with an outside auditor without leaking the rest, archive the artifacts so a future reader can replicate. Satsignal makes the “anchored before” and “byte-identical since” properties externally checkable. It does not certify any claim about a model, a method, or a result; it does not replace peer review or replication; it is not a benchmark.

01 Three problems an evaluation has to defend against

Selection, exposure, and the long tail of unpublished runs.

Every published evaluation result has to defend against three structural critiques: you tweaked the design after seeing the data, you can’t share the underlying transcripts without leaking your test set, and you only published the runs that looked good. Each maps to a primitive that’s live on the API today; each verifies independently in any browser against any public block explorer.

1. Pre-register the design before the run

Hash the rubric, prompt, decoding parameters, model config, scoring function, and test-set identifier. Anchor the snapshot via a policy_snapshot before any data is generated. The chain timestamp proves the design existed before the run, so a later reviewer can rule out post-hoc tweaking of the parts that were committed.

Policy snapshots →

2. Disclose one transcript without leaking the rest

Roll all transcripts, scored rows, or per-prompt outputs into one Merkle-batched evidence_bundle — up to 10,000 items per receipt. Hand a single item to a reviewer with its inclusion path; the other 9,999 stay sealed (a sketch of that reviewer-side check follows below). The same shape is available via merkle-row-sealed-v1 when individual rows are low-entropy and would otherwise be guessable from their hash.

Manifest receipts →
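
Under the hood this is a standard Merkle inclusion proof. The exact leaf encoding, pairing order, and padding rule for evidence_bundle receipts are defined by the MBNT spec, not here; the sketch below (stdlib Python, with the path format and side convention assumed for illustration) only shows the shape of the check a reviewer runs on the one disclosed item.

# Illustrative only: the real leaf hashing and pairing rules live in the
# MBNT spec (/spec-mbnt). Shape of the reviewer's check: hash the one
# disclosed item, fold it up the supplied inclusion path, compare the
# result to the root committed on chain.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(item: bytes, path: list[tuple[str, bytes]], root: bytes) -> bool:
    # `path` is assumed to be (side, sibling_hash) pairs ordered leaf to
    # root; "left" means the sibling sits to the left of the running node.
    node = sha256(item)
    for side, sibling in path:
        node = sha256(sibling + node) if side == "left" else sha256(node + sibling)
    return node == root

# The reviewer needs only the one transcript, its path, and the anchored
# root; the other items in the batch are never revealed.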

3. Surface the unpublished runs at the matter level

Every anchor a lab makes under one matter slug is listed at GET /api/v1/matters/<slug>/anchors — soft-deletes included. A reviewer who sees five published receipts in a matter and one hundred anchors in the listing can ask the obvious follow-up question (a sketch of that check follows below). There is an honest scoping note in the disclosures further down.

Matter-level audit →
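
A minimal sketch of that follow-up, assuming the listing endpoint accepts the same bearer key used for anchoring and returns a JSON body with an anchors array carrying sha256_hex fields; both are assumptions for illustration, not documented API shapes.

# Hypothetical reviewer script: compare the anchors that exist under a
# matter against the receipts the lab actually published. Base URL, auth
# header, and response field names are assumptions.
import json
import urllib.request

MATTER = "acme-evals-2026-q2"
req = urllib.request.Request(
    f"https://app.satsignal.cloud/api/v1/matters/{MATTER}/anchors",
    headers={"Authorization": "Bearer sk_..."},
)
with urllib.request.urlopen(req) as resp:
    anchors = json.load(resp)["anchors"]        # assumed field name

anchored = {a["sha256_hex"] for a in anchors}   # assumed field name
published = set()                               # hashes cited in the paper's receipts
print(f"{len(anchored)} anchors under the matter, {len(published)} published")
print("unpublished siblings:", len(anchored - published))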

None of these three primitives answer the substantive questions a reviewer will also ask — was the test set contaminated, was the scoring rubric well-calibrated, was the sample size adequate, did the model leak training-set examples. Those need replication, peer review, and methods that aren’t about anchoring hashes. Satsignal’s contribution is narrower: when the answers exist, it lets a third party verify them without trusting your platform.

02 Two cross-cutting properties

What an outside reader in 2030 actually needs.

Reviewers and replicators arrive at different times, with different access. Two properties matter across all three primitives above — one about who can verify, one about when.

Independent verification, no platform trust. A reviewer with a Satsignal-anchored receipt does not need an account on our service to check it. The receipt’s .mbnt bundle plus a public block explorer is enough; the in-browser verifier at proof.satsignal.cloud works as a convenience but is not load-bearing. The verification recipe is documented in the public spec and reproducible in any language — a cold-start auditor reimplemented the protocol in Go from the spec alone, without our helpers, at the end of April.
Time-shifted durability. A 2030 reader of a 2026 paper needs the .mbnt bundle, the original payload (or the leaf to be verified), the chain transaction, and the MBNT format spec. The bundle is small (a few KB even at the 10,000-leaf limit), can be archived locally with no Satsignal service in the loop, and is well-suited to journal supplementary materials, Zenodo, OSF, or arXiv attachments. The MBNT wire format is published at /spec-mbnt; a verifier could be reimplemented from the spec long after this site has been turned off, against any public BSV block explorer that survives.
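
As a concrete illustration of that recipe, a hedged sketch in stdlib Python: the receipt field names used here (sha256_hex, txid) are hypothetical stand-ins for whatever the saved MBNT bundle actually records, and everything runs locally except the final explorer lookup.

# Time-shifted check: archived payload plus a saved receipt, no Satsignal
# service involved. Field names are stand-ins for the real MBNT layout
# documented at /spec-mbnt.
import hashlib
import json
import pathlib

payload = pathlib.Path("preregister.json").read_bytes()         # the archived artifact
receipt = json.loads(pathlib.Path("receipt.json").read_text())  # saved with the paper

local = hashlib.sha256(payload).hexdigest()
assert local == receipt["sha256_hex"], "payload differs from what was anchored"

# Only external step: confirm the transaction exists and carries this
# hash, in any public BSV block explorer that is still running.
print("verified locally; check txid on an explorer:", receipt["txid"])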

The chain itself is operated by independent miners with no relationship to Satsignal. The security posture page covers what we keep and what we don’t; the design assumes a future where this site is gone and a reviewer with a saved bundle and the spec can still verify.

03 A 30-line example

Pre-register an evaluation, then anchor the results.

The opening move: hash the components that define the evaluation, build a snapshot, anchor its sha256 with category: "policy_snapshot" before you run anything. Then run the eval, batch the per-prompt results, and anchor the manifest. Two anchors, one second apart on chain, with the design provably preceding the data. The policy_snapshot.py helper is stdlib-only; no SDK to install.

curl -O https://satsignal.cloud/policy_snapshot.py

# Step 1. Pre-registration. Hash the five components that pin down
# the evaluation: rubric, instruction template, decoding/tools,
# budget caps, model config (incl. test-set hash if applicable).
RUB=$(python3 policy_snapshot.py hash-component --file rubric.md          | jq -r .sha256_hex)
INS=$(python3 policy_snapshot.py hash-component --file instruction.txt    | jq -r .sha256_hex)
DEC=$(python3 policy_snapshot.py hash-component --json-file decoding.json | jq -r .sha256_hex)
BUD=$(python3 policy_snapshot.py hash-component --json-string '{"max_calls":500}' | jq -r .sha256_hex)
MOD=$(python3 policy_snapshot.py hash-component --json-file model_cfg.json | jq -r .sha256_hex)

python3 policy_snapshot.py build \
    --agent-name eval-2026-q2 \
    --agent-version v1 \
    --system-policy-hash    $RUB \
    --user-instruction-hash $INS \
    --tool-permissions-hash $DEC \
    --budget-limits-hash    $BUD \
    --model-config-hash     $MOD \
    --out preregister.json

# Anchor the design BEFORE running anything. Use a stable matter
# slug for the project so all sibling anchors list together.
SHA=$(jq -r .anchor.sha256_hex preregister.json)
SIZE=$(jq -r .anchor.file_size preregister.json)
curl -H "Authorization: Bearer sk_..." \
     -H "Content-Type: application/json" \
     -d "{\"matter_slug\":\"acme-evals-2026-q2\",\"sha256_hex\":\"$SHA\", \
          \"file_size\":$SIZE,\"category\":\"policy_snapshot\", \
          \"label\":\"pre-registration $(date -u +%FT%TZ)\"}" \
     https://app.satsignal.cloud/api/v1/anchors

# Step 2. Run the eval. Score each row. Build a manifest of the
# per-prompt outputs (or scored transcripts) and anchor the root.
# See /uses.html#manifest for the manifest body shape.

# Step 3. Reviewer side, later: verify any one component without
# seeing the others. Hand the reviewer rubric.md plus
# preregister.json; the rest of the design stays sealed.
python3 policy_snapshot.py verify \
    --snapshot preregister.json \
    --system-policy-file rubric.md
# {"verified": true, "matched": ["system_policy_hash"]}

The Agent Evaluation demo walks the same shape end-to-end with real on-chain receipts — a grader scoring five answers, policy anchored before the run, manifest anchored after, chain timestamps one second apart. That demo was originally written as a commercial agent-eval example; structurally it is exactly the pre-registration + result-manifest pattern this page describes.

04 What this page does not say

Honest limits of what the chain anchor proves.

Satsignal is not a benchmark, a peer-review service, or a safety-institute endorsement. Specifically:

  • Pre-registration via a single matter is defeatable. A lab can pre-register a hundred candidate designs across a hundred separate matter slugs and only publish the matching one; the matter-level listing endpoint above surfaces siblings within a matter, not across matters. Defending against cross-matter selection requires a community norm — for example, publishing the project’s matter slug at design time — not a cryptographic primitive. We don’t have a fix for this and won’t pretend to.
  • An anchor proves a design existed at a moment, not that it was followed. The chain timestamp confirms the snapshot’s sha256 was committed before the run; it does not confirm the agent ran under that policy, that the test set was untouched, or that the scoring code matched the rubric. Those need their own evidence.
  • Selective disclosure does not validate the held-back rows. A reviewer who sees one transcript verified against the manifest root learns nothing about the other 9,999 transcripts — including whether they exist, whether they were scored consistently, or whether some were dropped pre-anchor. Replication remains the authority.
  • Satsignal is not endorsed by, affiliated with, or recognized under any public AI safety institute, standards body, or government evaluation programme. This page describes a workflow that the cryptographic primitives support; it makes no claim that any specific institution accepts the resulting receipts.
  • The receipt is not the artifact. The .mbnt bundle and the chain transaction prove a hash existed at a moment. The artifact — the rubric, the transcripts, the model config — is yours to archive (Zenodo, OSF, arXiv supplementary, your institutional repository). A 2030 reader needs both.

What Satsignal supplies is one verifiable property in your stack: a third party can re-hash the payload, walk the Merkle path if needed, and check the on-chain transaction in any block explorer — without trusting Satsignal, your platform, or your collaborator. That property is useful in methods sections, supplementary materials, replication protocols, and external audit packets. It is not a substitute for any of the others.

Working on an evaluation methodology where this would help? Mail hello@satsignal.cloud — we read every email and prefer concrete designs to abstract pitches.