Test Catalog
Every test in the suite, what it verifies, and why its area exists. Generated by tools/gen_test_catalog.py; regenerate after adding tests.
Total test functions: 321 (run them with python -m pytest tests/ -q).
tests/active/
Active Mode is the opt-in interventional path (expressibility probing). These tests check that the probe records correctly, that it labels the trace as active, and that analyzers needing interventional data refuse passive traces.
test_active_mode.py
Tests for Active Mode probing and the kl_expressibility analyzer it feeds. Active Mode runs real circuits, so these use small ansätze and modest sample counts. The headline physics check is directional: a rigid ansatz must score a larger Haar KL divergence than an expressive one.
TestProbeCore

| Test | What it verifies |
|---|---|
| test_records_active_mode_trace | Records active mode trace |
| test_trace_verifies | Trace verifies |
| test_seed_reproducible | Seed reproducible: same seed → identical sampled parameters → identical statevectors |
| test_statevector_round_trips | Statevector round trips |

TestQiskitProbe

| Test | What it verifies |
|---|---|
| test_qiskit_probe_stores_circuit_qasm | Qiskit probe stores circuit QASM |
| test_qiskit_circuit_structure_visible | Qiskit circuit structure visible: decomposed → real gates visible, depth > 1 |

TestPennyLaneProbe

| Test | What it verifies |
|---|---|
| test_pennylane_probe | PennyLane probe |

TestExpressibility

| Test | What it verifies |
|---|---|
| test_rigid_more_than_expressive | Directional physics: rigid ansatz has larger Haar KL than expressive. |
| test_expressive_ansatz_low_kl | StronglyEntanglingLayers is known to be highly expressive. |
| test_num_qubits_inferred | Num qubits inferred |
| test_passive_trace_guard | Expressibility on a passive trace returns a guard, not a number. |
| test_insufficient_states | Insufficient states |
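
The directional check can be illustrated with a minimal, stdlib-only sketch. It is not the kl_expressibility implementation: the helper names, the 50-bin histogram, and the use of Haar-sampled states as a stand-in for an "expressive" ansatz are all assumptions made for illustration. It compares the histogram of pairwise state fidelities against the Haar density (N-1)(1-F)^(N-2) for N = 2^n.

```python
import math
import random


def haar_state(dim, rng):
    """Haar-random pure state: a normalised complex Gaussian vector."""
    vec = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in vec))
    return [a / norm for a in vec]


def fidelity(psi, phi):
    """|<psi|phi>|^2, clipped to 1.0 to absorb rounding error."""
    overlap = sum(a.conjugate() * b for a, b in zip(psi, phi))
    return min(1.0, abs(overlap) ** 2)


def haar_kl(fidelities, dim, n_bins=50):
    """KL(P_ansatz || P_Haar) over a fidelity histogram on [0, 1]."""
    counts = [0] * n_bins
    for f in fidelities:
        counts[min(int(f * n_bins), n_bins - 1)] += 1
    total = sum(counts)
    kl = 0.0
    for i, c in enumerate(counts):
        if c == 0:
            continue  # empty bins contribute nothing
        lo, hi = i / n_bins, (i + 1) / n_bins
        # Haar mass of the bin: integral of (dim-1)(1-F)^(dim-2) over [lo, hi].
        q = (1 - lo) ** (dim - 1) - (1 - hi) ** (dim - 1)
        p = c / total
        kl += p * math.log(p / q)
    return kl


rng = random.Random(0)
dim = 4      # two qubits
pairs = 2000

# "Expressive" stand-in: every sampled state is Haar-random.
expressive = [fidelity(haar_state(dim, rng), haar_state(dim, rng)) for _ in range(pairs)]
# "Rigid" stand-in: the ansatz always prepares the same state.
fixed = haar_state(dim, rng)
rigid = [fidelity(fixed, fixed) for _ in range(pairs)]

kl_expressive = haar_kl(expressive, dim)
kl_rigid = haar_kl(rigid, dim)
```

The rigid sample piles all its mass into the last fidelity bin, where the Haar density is nearly zero, so its KL value is large; the Haar-sampled fidelities track the reference density and score close to zero.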
tests/analysis/
Each analyzer is tested against constructed ground-truth traces: plant a known condition, assert the verdict and the quantitative evidence (variance, SNR, KL, fidelity) with their confidence intervals. Includes regression tests for hardware-format (ISA) circuits and active-qubit calibration scoping.
test_builtin_analyzers.py
Tests for the function-based analysis layer (hilbertbench.analysis). Traces are built deterministically with the tape; no quantum execution.
TestBarrenPlateau

| Test | What it verifies |
|---|---|
| test_trainable | Trainable |
| test_barren_plateau | Barren plateau |
| test_insufficient_data | Insufficient data: span with no numeric outcome (counts dict) |
| test_custom_threshold | Custom threshold |
| test_accepts_path_and_trace_object | Accepts path and trace object |
| test_variance_matches_numpy | Variance matches NumPy |

TestShotNoise

| Test | What it verifies |
|---|---|
| test_with_recorded_shots | With recorded shots: high trajectory variance, low shots → signal clear |
| test_shot_noise_dominated | Shot noise dominated: tiny trajectory variance, many shots → buried in noise |
| test_no_shots_recorded | No shots recorded |
| test_default_shots_fallback | Default shots fallback |
| test_precision_fallback | Precision fallback: estimator runs record target precision, not shots; the floor … |
| test_recorded_shots_win_over_precision | Recorded shots win over precision |
| test_insufficient_data | Insufficient data |

TestSummary

| Test | What it verifies |
|---|---|
| test_combined_report | Combined report |
| test_summary_accepts_path | Summary accepts path |

TestCustomAnalysis

| Test | What it verifies |
|---|---|
| test_user_can_write_own_analysis | A user composes their own diagnostic on the same trace API. |
test_confidence.py
Tests for the statistical-uncertainty measures added to the analyzers (proposal Section 2.6: "reported with statistical uncertainty and confidence measures, emphasizing transparency over definitive attribution").
TestBootstrapCI

| Test | What it verifies |
|---|---|
| test_ci_brackets_statistic | CI brackets statistic |
| test_degenerate_inputs_return_none | Degenerate inputs return None |
| test_n_boot_zero_disables | n_boot zero disables |
| test_reproducible_with_seed | Reproducible with seed |
| test_wider_ci_for_higher_level | Wider CI for higher level |

TestBarrenPlateauConfidence

| Test | What it verifies |
|---|---|
| test_ci_brackets_variance | CI brackets variance |
| test_clear_trainable_high_confidence | Clear trainable high confidence |
| test_clear_barren_high_confidence | Clear barren high confidence |
| test_near_threshold_low_confidence | Near threshold low confidence: variance engineered to sit right at the 0.005 threshold |
| test_n_boot_zero_skips_ci | n_boot zero skips CI |

TestShotNoiseConfidence

| Test | What it verifies |
|---|---|
| test_empirical_variance_ci_present | Empirical variance CI present |
| test_ci_present_even_without_shots | CI present even without shots |
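
The properties these tests name (bracketing, None on degenerate input, an n_boot of zero disabling the interval, seed reproducibility, wider intervals at higher levels) are the usual behaviour of a percentile bootstrap. The sketch below is one way such a helper could look; the function name, signature, and defaults are assumptions, not the analyzers' actual API.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.pvariance, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for `stat`; None when it cannot be computed."""
    if n_boot <= 0 or len(values) < 2:
        return None  # disabled, or too little data to resample
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and sort the replicate statistics.
    replicates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lo = replicates[int(alpha * (n_boot - 1))]
    hi = replicates[int((1.0 - alpha) * (n_boot - 1))]
    return lo, hi


data_rng = random.Random(42)
gradients = [data_rng.gauss(0.0, 0.1) for _ in range(40)]  # synthetic sample

ci_95 = bootstrap_ci(gradients, level=0.95)
ci_99 = bootstrap_ci(gradients, level=0.99)
```

Because both calls share a seed, they sort the same replicates, so the 99% interval can only be as wide as or wider than the 95% one.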
test_noise.py
Tests for the noise-profile analyzer (Diagnostic Axis: Noise). Calibration data only exists on real/fake hardware backends, so these use Qiskit's FakeManilaV2 (which ships realistic T1/T2/readout/gate-error data) and assert ideal simulators degrade gracefully to a no-calibration result.
TestDeviceSummary

| Test | What it verifies |
|---|---|
| test_reports_calibration_stats | Reports calibration stats |
| test_gate_errors_present | Gate errors present |
| test_estimated_fidelity_in_unit_interval | Estimated fidelity in unit interval |

TestIdealSimulator

| Test | What it verifies |
|---|---|
| test_no_calibration_status | No calibration status |

TestDepthInteraction

| Test | What it verifies |
|---|---|
| test_fidelity_decreases_with_depth | Fidelity decreases with depth |
| test_dominant_error_shifts_to_two_qubit_gates | Dominant error shifts to two-qubit gates: in a shallow circuit, readout dominates the small infidelity |

TestCalibrationScoping

| Test | What it verifies |
|---|---|
| test_stats_scoped_to_active_qubits | Stats scoped to active qubits: the awful qubit 2 must not contaminate any statistic |
| test_fidelity_uses_scoped_errors | Fidelity uses scoped errors: 1 sx + 1 cx on good qubits, fidelity must stay high; the … |
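
To make the depth and scoping behaviour concrete, here is a rough sketch of a multiplicative fidelity estimate. The product-of-(1 - error) model, the function name, and every error rate below are illustrative assumptions; the analyzer's real formula is not given in this catalog.

```python
def estimate_fidelity(gates, gate_error, readout_error):
    """Multiply (1 - error) over every gate and every active qubit's readout.

    gates:         list of (name, qubits) tuples in program order
    gate_error:    {(name, qubits): error rate}
    readout_error: {qubit: error rate}
    Returns (fidelity, name of the largest error contribution).
    """
    budget = {"readout": 0.0, "single_qubit_gates": 0.0, "two_qubit_gates": 0.0}
    fidelity = 1.0
    for name, qubits in gates:
        err = gate_error[(name, qubits)]
        fidelity *= 1.0 - err
        key = "two_qubit_gates" if len(qubits) == 2 else "single_qubit_gates"
        budget[key] += err
    # Scope readout to qubits the circuit actually touches.
    active = sorted({q for _, qubits in gates for q in qubits})
    for q in active:
        fidelity *= 1.0 - readout_error[q]
        budget["readout"] += readout_error[q]
    return fidelity, max(budget, key=budget.get)


# Made-up calibration: qubit 2 is awful but never used.
gate_error = {("sx", (0,)): 0.0003, ("cx", (0, 1)): 0.008}
readout_error = {0: 0.02, 1: 0.02, 2: 0.5}

shallow = [("sx", (0,)), ("cx", (0, 1))]
deep = shallow * 20

f_shallow, dominant_shallow = estimate_fidelity(shallow, gate_error, readout_error)
f_deep, dominant_deep = estimate_fidelity(deep, gate_error, readout_error)
```

In this toy model the shallow circuit is dominated by readout error, the deep one by accumulated two-qubit gate error, and the unused qubit 2 has no influence on either number.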
test_optimization_and_circuit.py
Tests for the optimization-loop (Axis 4) and circuit-structure analyzers.
TestOptimizationConvergence

| Test | What it verifies |
|---|---|
| test_converging_trajectory | Converging trajectory |
| test_constant_step_still_improving | Constant step still improving |
| test_converging_path_length_positive | Converging path length positive |
| test_insufficient_data | Insufficient data |
| test_outcome_envelope_reported | Outcome envelope reported: cost should drop from start to finish |
| test_accepts_path_and_trace | Accepts path and trace |

TestCircuitStructure

| Test | What it verifies |
|---|---|
| test_bell_circuit | Bell circuit |
| test_bell_depth | Bell depth: h on q0 (layer 1) then cx q0,q1 (layer 2) → depth 2 |
| test_parametric_circuit | Parametric circuit |
| test_no_qasm_circuit | No QASM circuit: trace with only inline outcomes, no circuit_qasm artifact |
| test_entangling_fraction_bounds | Entangling fraction bounds |
| test_isa_circuit_physical_qubits | ISA circuit physical qubits. Regression: hardware ISA circuits previously parsed as empty |
| test_isa_circuit_depth | ISA circuit depth: sx, rz, sx, rz stack on $0 (layers 1-4), cx joins $0/$1 → depth 5 |
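
The two depth figures quoted in the table (2 for the Bell circuit, 5 for the ISA circuit) follow from the standard layering rule: each operation lands one layer above the deepest qubit it touches. A small sketch, with a hypothetical `circuit_depth` helper and a tuple-based circuit representation chosen only for illustration:

```python
def circuit_depth(ops):
    """Depth = length of the longest chain of operations sharing a qubit.

    ops: list of (gate_name, qubits) tuples in program order.
    """
    layer = {}  # qubit -> index of the last layer that touched it
    for _, qubits in ops:
        new_layer = 1 + max(layer.get(q, 0) for q in qubits)
        for q in qubits:
            layer[q] = new_layer
    return max(layer.values(), default=0)


def entangling_fraction(ops):
    """Share of operations acting on two or more qubits (0.0 for an empty circuit)."""
    if not ops:
        return 0.0
    return sum(1 for _, qubits in ops if len(qubits) >= 2) / len(ops)


bell = [("h", ("q0",)), ("cx", ("q0", "q1"))]
isa = [
    ("sx", ("$0",)), ("rz", ("$0",)), ("sx", ("$0",)), ("rz", ("$0",)),
    ("cx", ("$0", "$1")),
]
```

Physical-qubit names such as `$0` are treated as opaque labels here, which is why the same routine handles both logical and ISA circuits.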
tests/compliance/
Architecture-level checks that the documented invariants (INV-001 and friends) hold end to end: the contract the paper and the docs promise.
test_backward_compatibility.py
These fixtures are FROZEN. They represent real v1.0 traces that must remain parseable for the lifetime of the project. Never update these dicts: if a model change breaks them, that is a BREAKING CHANGE and requires a schema version bump.
TestGoldenRecords: these must NEVER fail. Failure = breaking change.

| Test | What it verifies |
|---|---|
| test_golden_trace_always_parses | Golden trace always parses |
| test_golden_span_always_parses | Golden span always parses |
| test_golden_artifact_always_parses | Golden artifact always parses |
test_execution_parity.py
Validates the proposal's central "1:1 Execution Parity" claim (§2.1): wrapping a primitive with HilbertBench must NOT change what reaches the backend (same number of executions, same circuits, same shots) and must NOT change the results. This is what makes the recorder a non-confounding observer.
TestEstimatorParity

| Test | What it verifies |
|---|---|
| test_backend_called_once_per_user_call | Backend called once per user call. No silent extra executions: the backend saw exactly the same calls. |
| test_pubs_unchanged | PUBs unchanged: the parameter bindings submitted to the backend are bit-identical. |
| test_results_identical | Results identical |

TestSamplerParity

| Test | What it verifies |
|---|---|
| test_backend_called_once_per_user_call | Backend called once per user call |
| test_shots_unchanged | Shots unchanged. No silent shot inflation: every backend call kept the requested 256. |

TestPennyLaneParity

| Test | What it verifies |
|---|---|
| test_device_executed_same_number_of_times | Device executed same number of times |

TestOverhead

| Test | What it verifies |
|---|---|
| test_estimator_overhead_under_budget | Per-call recording overhead must stay small (proposal target: <5ms). We assert a generous CI-safe ceiling; the demo reports the real number (typically well under 1ms on a workstation). |
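
A parity check of this kind can be sketched with a counting stand-in backend and a forwarding proxy. Both classes below are hypothetical and deliberately tiny; they show the shape of the assertion, not HilbertBench's proxies.

```python
class CountingBackend:
    """Stand-in backend that logs every call it receives."""

    def __init__(self):
        self.calls = []

    def run(self, circuit, shots):
        self.calls.append((circuit, shots))
        return {"0": shots}  # deterministic fake counts


class RecordingProxy:
    """Forwards run() unchanged and records a span on the side."""

    def __init__(self, backend):
        self._backend = backend
        self.spans = []

    def run(self, circuit, shots):
        result = self._backend.run(circuit, shots)  # exactly one forwarded call
        self.spans.append({"circuit": circuit, "shots": shots})
        return result


plain = CountingBackend()
observed = CountingBackend()
proxy = RecordingProxy(observed)

plain_results = [plain.run("bell", 256) for _ in range(3)]
proxy_results = [proxy.run("bell", 256) for _ in range(3)]
```

Parity then reduces to comparing the two call logs and the two result lists; any extra execution or inflated shot count would show up as a difference in `calls`.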
test_invariants.py
TestINV003: INV-003, models are auto-generated, never manually edited.

| Test | What it verifies |
|---|---|
| test_all_generated_files_have_header | All generated files have header |

TestINV004: INV-004, models must only import stdlib + pydantic.

| Test | What it verifies |
|---|---|
| test_no_forbidden_imports | No forbidden imports |
test_schema_roundtrip.py
Validates that all Pydantic v2 models correctly enforce schema constraints. Tests are organized by model. Negative tests confirm bad data is rejected.
TestTraceManifest

| Test | What it verifies |
|---|---|
| test_valid_construction | Valid construction |
| test_roundtrip | Roundtrip |
| test_mode_enum | Mode enum |
| test_status_enum | Status enum |
| test_all_status_values_valid | All status values valid |
| test_all_mode_values_valid | All mode values valid |
| test_null_timestamp_end_allowed | Null timestamp end allowed |
| test_null_integrity_seal_allowed | Null integrity seal allowed |
| test_optional_fields_absent | Optional fields absent: timestamp_end, integrity_seal, tags are all optional |
| test_rejects_wrong_version | Rejects wrong version |
| test_rejects_extra_fields | Rejects extra fields |
| test_rejects_invalid_mode | Rejects invalid mode |
| test_rejects_invalid_status | Rejects invalid status |
| test_requires_client_environment | Requires client environment |
| test_client_environment_requires_version | Client environment requires version |
| test_tags_arbitrary_strings | Tags arbitrary strings |
TestSpan

| Test | What it verifies |
|---|---|
| test_valid_construction | Valid construction |
| test_roundtrip | Roundtrip |
| test_events_preserved | Events preserved |
| test_trace_id_matches_parent | Trace ID matches parent |
| test_sequence_number_zero_allowed | Sequence number zero allowed |
| test_sequence_number_large_value | Sequence number large value |
| test_all_status_values_valid | All status values valid |
| test_null_outcome_ref_allowed | Null outcome ref allowed |
| test_null_parent_span_id_allowed | Null parent span ID allowed: null parent_span_id = root span |
| test_event_type_open_pattern | Event type open pattern: event_type is the open pattern ^[A-Z_]+$, so custom types must be allowed |
| test_event_attributes_allow_arbitrary_scalars | Event attributes allow arbitrary scalars |
| test_event_null_attributes_allowed | Event null attributes allowed |
| test_rejects_negative_sequence_number | Rejects negative sequence number |
| test_rejects_empty_events | Rejects empty events: minItems is 1, so a span with no events is invalid |
| test_rejects_lowercase_event_type | Rejects lowercase event type: event_type pattern is ^[A-Z_]+$, so lowercase must be rejected |
| test_rejects_extra_fields | Rejects extra fields |
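
The span constraints listed above can be restated as a few plain checks. The sketch below uses only the standard library rather than the project's Pydantic models, and its field set is an assumption limited to the names this catalog happens to mention; the real schema has fields not shown here.

```python
import re

EVENT_TYPE = re.compile(r"[A-Z_]+")  # used with fullmatch, equivalent to ^[A-Z_]+$
# Hypothetical field set, limited to names mentioned in this catalog.
SPAN_FIELDS = {"trace_id", "parent_span_id", "sequence_number", "events", "outcome_ref"}


def span_errors(span):
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    extra = set(span) - SPAN_FIELDS
    if extra:
        errors.append(f"extra fields: {sorted(extra)}")
    if span.get("sequence_number", -1) < 0:
        errors.append("sequence_number must be >= 0")
    events = span.get("events") or []
    if not events:
        errors.append("events must contain at least one item")
    for event in events:
        if not EVENT_TYPE.fullmatch(event.get("event_type", "")):
            errors.append(f"bad event_type: {event.get('event_type')!r}")
    return errors


root_span = {
    "trace_id": "t-1",
    "parent_span_id": None,          # null parent marks a root span
    "sequence_number": 0,            # zero is allowed
    "events": [{"event_type": "MY_CUSTOM_EVENT"}],  # open pattern: custom types pass
    "outcome_ref": None,
}
```

Returning a list of violations rather than raising on the first one is a choice made for this sketch, so that each negative case can be inspected independently.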
TestArtifact

| Test | What it verifies |
|---|---|
| test_valid_construction | Valid construction |
| test_roundtrip | Roundtrip |
| test_all_kind_values_valid | All kind values valid |
| test_all_encoding_values_valid | All encoding values valid |
| test_all_compression_values_valid | All compression values valid |
| test_compression_null_allowed | Compression null allowed |
| test_size_bytes_zero_allowed | Size bytes zero allowed |
| test_producer_null_allowed | Producer null allowed |
| test_hash_pattern_enforced | Hash pattern enforced |
| test_hash_wrong_length_rejected | Hash wrong length rejected |
| test_rejects_negative_size | Rejects negative size |
| test_rejects_ref_count_zero | Rejects ref count zero: minimum is 1, so an artifact with zero references is orphaned |
| test_rejects_invalid_kind | Rejects invalid kind |
TestCatalog

| Test | What it verifies |
|---|---|
| test_valid_construction | Valid construction |
| test_roundtrip | Roundtrip |
| test_multiple_artifacts | Multiple artifacts |
| test_empty_artifacts_allowed | Empty artifacts allowed |
| test_artifact_values_are_validated | Artifact values are validated: even though the key is not pattern-validated by Pydantic (see below), … |
| test_artifact_key_format_not_enforced_by_pydantic | catalog.json uses additionalProperties (not patternProperties), so Pydantic accepts any string key at runtime. Key format and key==artifact_hash integrity are the responsibility of reader/verify.py, not the Pydantic model. This is a deliberate design tradeoff; see design_decisions/0003. This test documents and pins the behaviour. If it starts raising, the schema was changed back to patternProperties. |
| test_rejects_wrong_version | Rejects wrong version |
| test_rejects_extra_fields | Rejects extra fields |
tests/e2e/
Full journeys: record a realistic workload, seal, reopen, analyze. This is the integration surface a real user touches.
test_full_algorithms.py
Tier 3 end-to-end regression tests. Each test runs a complete algorithm for a small number of steps and verifies:
- The trace is sealed and complete
- The expected number of spans was recorded
- Inline artifacts contain the correct data kinds

TestVQERegression

| Test | What it verifies |
|---|---|
| test_vqe_trace_complete | Runs 5 steps of gradient-based VQE on the simple H = Z⊗Z Hamiltonian. Verifies: spans created, outcome + parameters + observables captured. |

TestQAOARegression

| Test | What it verifies |
|---|---|
| test_qaoa_bitstrings_recorded | Runs a 2-qubit QAOA-like circuit for 3 angles and records bitstring outcomes. Verifies that counts are captured inline with proper structure. |
| test_qaoa_multiple_angle_sets | Multiple parameter sets in one PUB → one span total. |

TestQNNRegression

| Test | What it verifies |
|---|---|
| test_qnn_training_trace_complete | Trains a 2-qubit QNN for 5 steps on a 4-point dataset. Verifies: spans recorded, outcomes + params captured, trace sealed. |

TestCrossFramework

| Test | What it verifies |
|---|---|
| test_both_frameworks_produce_valid_traces | Evaluates the Z⊗Z expectation value using both frameworks. Both traces should be valid, sealed, and contain outcome data. |
test_phenomenology.py
Phenomenological validation (proposal Section 2.6): plant a known QML phenomenon in synthetic ground-truth circuits and confirm the detector attributes it correctly from trace evidence alone.
TestBarrenPlateauValidation

| Test | What it verifies |
|---|---|
| test_wide_deep_circuit_flagged_barren | Wide deep circuit flagged barren |
| test_shallow_control_is_trainable | Shallow control is trainable |
| test_variance_collapses_with_width | The planted property: variance must shrink as width grows. |
| test_trace_is_active_mode | Landscape probing is a controlled, opt-in active diagnostic. |
test_qiskit_aer.py
End-to-end verification using a REAL Qiskit Aer simulator. Proves the transparent proxy works with actual quantum execution without breaking standard QML workflows.

| Test | What it verifies |
|---|---|
| test_real_qml_parameterized_circuit | Runs a real parameterized circuit (typical of QML/VQE) through the AerSimulator, verifying the proxy handles real Qiskit objects. |
tests/integrations/
The proxies (Qiskit Estimator/Sampler, backend.run, PennyLane) must be perfectly transparent to the wrapped framework (1:1 execution parity, INV-001) while recording faithfully. This area also covers calibration-snapshot capture across all three backend-access conventions found in the wild, drift refresh, and shot/precision evidence.
test_pennylane.py
Verifies the dynamic proxy integration for PennyLane. Tests that strict ML type-checks are preserved and synchronous executions are correctly logged as single, unified spans.
TestPennyLaneProxyTransparency

| Test | What it verifies |
|---|---|
| test_dynamic_inheritance | Crucial for PennyLane QNodes: the proxy must pass isinstance(). |

TestPennyLaneExecutionLifecycle

| Test | What it verifies |
|---|---|
| test_synchronous_execution_span | PennyLane evaluates synchronously. We should get exactly ONE span. |

TestPennyLaneExceptionVisibility

| Test | What it verifies |
|---|---|
| test_synchronous_exception_handling | Verifies INV-007 for synchronous failures. |
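
One common way to keep isinstance() checks passing is to build the proxy class at runtime as a subclass of the wrapped object's own class. The sketch below shows that general technique with a made-up `FakeDevice`; it is an illustration under that assumption and does not reproduce the project's PennyLane proxy.

```python
def make_proxy(device, on_execute):
    """Wrap `device` in an instance of a dynamically created subclass of its class."""
    base = type(device)

    def execute(self, *args, **kwargs):
        on_execute(args)                                 # record one span per call
        return base.execute(device, *args, **kwargs)     # delegate to the real device

    def delegate(self, name):
        # Called only when normal lookup fails, i.e. for instance attributes
        # that live on the wrapped device rather than on the proxy.
        return getattr(device, name)

    proxy_cls = type(
        f"Recorded{base.__name__}",
        (base,),
        {"execute": execute, "__getattr__": delegate},
    )
    return object.__new__(proxy_cls)  # skip base.__init__; state stays on `device`


class FakeDevice:
    """Hypothetical device used only for this example."""

    def __init__(self, wires):
        self.wires = wires

    def execute(self, tapes):
        return [0.0 for _ in tapes]


spans = []
device = FakeDevice(wires=2)
proxy = make_proxy(device, spans.append)
result = proxy.execute(["tape-a", "tape-b"])
```

Because the proxy class inherits from the device class, type checks in the host framework see a genuine instance, while `execute` is intercepted exactly once per call.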
test_pennylane_measurements.py
Tier 2 integration tests for HilbertPennyLaneDeviceProxy covering all common PennyLane measurement types: expval, probs, counts, sample, state. Also verifies exception handling and backend_id propagation.
TestMeasurementTypes

| Test | What it verifies |
|---|---|
| test_expval_inline | Expval inline |
| test_probs_inline | Probs inline |
| test_counts_inline | Counts inline: counts is a bitstring → int dict; "00" should dominate |
| test_sample_inline | Sample inline |
| test_state_inline_as_complex_pairs | State inline as complex pairs: stored as [[real, imag], ...]; first amplitude should be [1.0, 0.0] |

TestSpanStructure

| Test | What it verifies |
|---|---|
| test_four_events_per_span | EXECUTION_REQUEST + DEVICE_EXECUTE_STARTED + EXECUTION_COMPLETED + EXECUTION_RESULT |
| test_device_started_event_has_num_tapes | Device started event has num tapes |
| test_parameters_captured_per_span | Parameters captured per span |
| test_observables_captured | Observables captured |
| test_payload_ref_resolves_to_circuit_qasm | The circuit is now a templated QASM in the file store (deduplicated), so payload_ref must resolve from the catalog, not inline. |
| test_circuit_qasm_deduplicates_across_steps | Many evaluations of the same circuit structure produce one QASM file. |
| test_backend_id_set | Backend ID set |
| test_span_status_completed | Span status completed |

TestPennyLaneExceptions

| Test | What it verifies |
|---|---|
| test_device_exception_creates_failed_span | When the device raises, a FAILED span with ERROR event is recorded. |
| test_exception_propagates_to_caller | Exception propagates to caller |

TestNoFilePollution

| Test | What it verifies |
|---|---|
| test_all_measurements_stay_inline | expval, probs, counts, sample: none should write .npy files. |
test_pennylane_qasm_reproducibility.py
Proves that the templated OpenQASM stored for PennyLane traces is useful: template + recorded parameters reconstructs a valid circuit whose outcome matches what was recorded. This is verified by re-executing through Qiskit (a different framework), proving the QASM is portable and complete.
TestTemplateHelper

| Test | What it verifies |
|---|---|
| test_placeholders_replace_numeric_literals | Placeholders replace numeric literals |
| test_wire_indices_untouched | Wire indices untouched: wire indices live in [...] not (...), so they must not become placeholders |
| test_multi_param_gate | Multi param gate |
| test_template_stable_across_values | Template stable across values |

TestQASMRoundTrip

| Test | What it verifies |
|---|---|
| test_template_plus_params_reproduces_outcome | The core guarantee: bind(template, params) reproduces the recorded expval. |
| test_single_qubit_rotation_round_trip | Minimal case: one RY rotation, check exact reproduction. |
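
The templating idea can be sketched with two regular expressions: numeric literals inside gate argument parentheses become placeholders, while anything in square brackets is left alone. The `{p0}` placeholder syntax and the helper names are assumptions for illustration, and the sketch does not handle symbolic arguments such as `pi/2`.

```python
import re

_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")
_ARGS = re.compile(r"\(([^()]*)\)")          # gate argument lists only
_PLACEHOLDER = re.compile(r"\{p(\d+)\}")


def to_template(qasm):
    """Replace numeric literals inside (...) with {p0}, {p1}, ...

    Returns (template, params). Wire indices in [...] are never touched.
    """
    params = []

    def sub_number(match):
        params.append(float(match.group(0)))
        return "{p%d}" % (len(params) - 1)

    def sub_args(match):
        return "(" + _NUMBER.sub(sub_number, match.group(1)) + ")"

    return _ARGS.sub(sub_args, qasm), params


def bind(template, params):
    """Substitute recorded parameter values back into the template."""
    return _PLACEHOLDER.sub(lambda m: repr(params[int(m.group(1))]), template)


qasm_a = "OPENQASM 2.0;\nqreg q[2];\nry(0.5) q[0];\nu(0.1,0.2,0.3) q[1];\ncx q[0],q[1];"
qasm_b = "OPENQASM 2.0;\nqreg q[2];\nry(-1.25) q[0];\nu(0.4,0.5,0.6) q[1];\ncx q[0],q[1];"

template_a, params_a = to_template(qasm_a)
template_b, params_b = to_template(qasm_b)
```

Two circuits that differ only in their angles map to the same template, which is what allows a single stored QASM file to serve many evaluations.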
test_qiskit.py
Verifies the transparent proxy integration for Qiskit. Tests that circuits are serialized, spans are split (async mirroring), and all underlying framework exceptions are properly propagated (INV-007).
TestProxyTransparency

| Test | What it verifies |
|---|---|
| test_backend_proxy_passthrough | Backend proxy passthrough: the proxy must perfectly imitate the underlying backend properties |
| test_job_proxy_passthrough | Job proxy passthrough |

TestAsyncLifecycle

| Test | What it verifies |
|---|---|
| test_successful_run_and_result | Successful run and result: 1. Trigger the run (SUBMIT SPAN) |

TestExceptionVisibility

| Test | What it verifies |
|---|---|
| test_run_exception_visibility | Run exception visibility: simulate a crash during circuit translation/submission |
| test_result_exception_visibility | Result exception visibility: simulate a timeout while waiting for an IBM cloud job |
test_qiskit_calibration.py
Tests for calibration-snapshot capture. Calibration data (T1, T2, readout error, gate errors) only exists on real/fake hardware backends, never on ideal simulators, so these tests use Qiskit's FakeManilaV2, which ships realistic calibration data, and assert that ideal simulators produce no snapshot.
TestSerializeCalibration

| Test | What it verifies |
|---|---|
| test_extracts_t1_t2_readout | Extracts T1, T2, readout |
| test_none_backend_returns_none | None backend returns None |
| test_backend_without_properties_returns_none | Backend without properties returns None |
| test_backend_raising_properties_returns_none | Backend raising properties returns None |

TestEstimatorCalibrationCapture

| Test | What it verifies |
|---|---|
| test_snapshot_captured | Snapshot captured |
| test_snapshot_captured_once_across_runs | Snapshot captured once across runs. Content-addressed: identical calibration → one artifact regardless of run count |
| test_ideal_simulator_produces_no_snapshot | Ideal simulator produces no snapshot |

TestSamplerCalibrationCapture

| Test | What it verifies |
|---|---|
| test_snapshot_captured | Snapshot captured |
| test_ideal_sampler_produces_no_snapshot | Ideal sampler produces no snapshot |

TestResolveBackend: three conventions exist in the wild. qiskit BackendEstimatorV2 exposes .backend as a property, qiskit-ibm-runtime primitives expose backend() as a bound method, and qiskit-aer primitives only hold ._backend.

| Test | What it verifies |
|---|---|
| test_property_style | Property style |
| test_method_style_runtime_convention | Method style runtime convention |
| test_private_attr_aer_convention | Private attr Aer convention |
| test_backend_passed_directly | Backend passed directly |
| test_statevector_primitive_resolves_to_none | Statevector primitive resolves to None |
| test_none_resolves_to_none | None resolves to None |
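
A resolver covering the three access conventions could look roughly like the following. All classes here are fakes written for the example, and the rule used to recognise a backend passed directly (presence of a `target` attribute) is an assumption of this sketch, not something the catalog states.

```python
import inspect


def resolve_backend(obj):
    """Best-effort lookup of the backend behind a primitive (or a backend itself)."""
    if obj is None:
        return None
    if hasattr(obj, "target"):       # assumption: only backends expose .target
        return obj
    for attr in ("backend", "_backend"):
        candidate = getattr(obj, attr, None)
        if inspect.ismethod(candidate):   # method style: backend() must be called
            candidate = candidate()
        if candidate is not None:
            return candidate
    return None                      # e.g. a statevector primitive with no backend


class FakeBackend:
    target = "fake-target"


class PropertyStyle:
    def __init__(self, backend):
        self._b = backend

    @property
    def backend(self):
        return self._b


class MethodStyle:
    def __init__(self, backend):
        self._b = backend

    def backend(self):
        return self._b


class PrivateStyle:
    def __init__(self, backend):
        self._backend = backend


class StatevectorStyle:
    pass


fake = FakeBackend()
```

Checking `inspect.ismethod` rather than `callable` avoids accidentally calling a backend object that happens to define `__call__`.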
TestCalibrationRefresh

| Test | What it verifies |
|---|---|
| test_drift_yields_snapshot_history | Drift yields snapshot history |
| test_stable_calibration_attaches_once | Stable calibration attaches once |
| test_rate_limit_skips_query_inside_window | Rate limit skips query inside window |

TestCalibrationHistory

| Test | What it verifies |
|---|---|
| test_single_snapshot_history | Single snapshot history |
| test_drift_history_is_chronological | Drift history is chronological: calibration() returns the newest snapshot |
| test_ideal_trace_has_empty_history | Ideal trace has empty history |
test_qiskit_sampler.py
Tier 2 integration tests for HilbertSamplerProxy. Uses real Qiskit circuits but keeps them minimal (1–2 qubits, few shots).
TestSamplerBasic

| Test | What it verifies |
|---|---|
| test_one_span_per_pub | Each PUB produces exactly one span. |
| test_outcome_inline_with_counts | Bitstring counts are stored inline as JSON, not as files. |
| test_no_outcome_files_on_disk | All data is inline: artifacts/ holds only QASM, not outcomes. |
| test_circuit_deduplication | Same circuit template across many shots produces only one QASM file. |
| test_span_status_completed | Span status completed |

TestSamplerParametric

| Test | What it verifies |
|---|---|
| test_parameter_bindings_captured | Parameter bindings captured: should contain the parameter array flattened |
| test_different_params_different_outcomes | Different params different outcomes: theta=0 should be almost all '0' |

TestSamplerTransparency

| Test | What it verifies |
|---|---|
| test_job_result_unchanged | The job returned by the proxy produces the same result as unproxied. |
| test_shots_in_execution_completed_event | EXECUTION_COMPLETED event carries the actual shot count. |
| test_tape_closed_skips_recording | After tape closes, proxy forwards calls but records nothing. |
| test_deepcopy_preserves_tape | Deepcopy preserves tape |
tests/reader/
Verification: trace.verify() must pass on honest traces and fail loudly on any tampering. This is the property the blinded validation protocol depends on.
test_verify.py
Proves the cryptographic and causal verification engine. Guarantees that tampered data, missing files, or out-of-order execution spans are strictly rejected.

| Test | What it verifies |
|---|---|
| test_verify_valid_trace_passes | A perfectly clean trace should pass with True. |
| test_verify_detects_tampered_artifact | Simulates a malicious user altering their result file after the run to make their quantum benchmark look better. |
| test_verify_detects_missing_events_file | If events.jsonl is deleted, the trace is invalid. |
| test_integrity_seal_present_and_valid | A sealed trace carries an integrity_seal that matches events.jsonl. |
| test_verify_detects_event_stream_tampering | Modifying events.jsonl in a way that still passes causal/reference checks (e.g. flipping a backend_id that no check inspects) must still be caught by the integrity seal's byte-level checksum. |
| test_verify_detects_causal_sequence_violation | Simulates a logging error where sequence numbers are duplicated, or a user copy-pasting spans to fake execution data. |
| test_verify_detects_dangling_artifact_references | Simulates a span pointing to an artifact hash that doesn't exist in the catalog. |
| test_verify_detects_child_before_parent_violation | A child span cannot legally finish and flush to the logs BEFORE its parent span has been created. Causal arrows flow one way. |
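
Two of these checks, the byte-level seal and the duplicate-sequence check, are easy to illustrate. The sketch below assumes the seal is a SHA-256 digest over the raw events.jsonl bytes and that each line is a JSON object with a `sequence_number`; the real seal format and the reader's full set of checks (artifact references, parent/child ordering) are not reproduced here.

```python
import hashlib
import json


def compute_seal(events_bytes):
    """Hex SHA-256 over the raw bytes of the event stream (assumed seal format)."""
    return hashlib.sha256(events_bytes).hexdigest()


def verify(events_bytes, integrity_seal):
    """True only if the bytes match the seal and sequence numbers are unique."""
    if compute_seal(events_bytes) != integrity_seal:
        return False  # any byte-level edit lands here
    seen = set()
    for line in events_bytes.splitlines():
        seq = json.loads(line)["sequence_number"]
        if seq in seen:
            return False  # duplicated or copy-pasted span
        seen.add(seq)
    return True


spans = [
    {"sequence_number": 0, "backend_id": "aer"},
    {"sequence_number": 1, "backend_id": "aer"},
]
events = b"\n".join(json.dumps(s).encode() for s in spans)
seal = compute_seal(events)

# Tampering that no structural check would notice: flip a backend_id.
tampered = events.replace(b"aer", b"ibm", 1)

# A stream with a duplicated span, resealed to get past the checksum.
duplicated = events + b"\n" + json.dumps(spans[1]).encode()
duplicated_seal = compute_seal(duplicated)
```

The two failure cases are complementary: the checksum catches edits to fields nobody inspects, while the structural check catches streams that were fabricated and then resealed.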
tests/recorder/
The recorder is the write path: HilbertTape, spans, events, and the content-addressed artifact store. These tests protect the append-only discipline (INV-002), atomic sealing, and the guarantee that every initiated span terminates explicitly (INV-007).
test_inline_artifacts.py
Tier 1 unit tests for the two-tier storage system:
- SpanHandle.attach_inline() correctness
- Hash integrity (key == sha256 of data)
- Routing enforcement (structural kinds rejected inline)
- Inline data appears in JSONL, not in the file store
- Parquet writer preserves inline_artifacts column

TestAttachInlineBasics

| Test | What it verifies |
|---|---|
| test_returns_sha256_hash | Returns sha256 hash |
| test_hash_matches_data | Hash matches data |
| test_size_bytes_matches_data | Size bytes matches data |
| test_all_fields_present | All fields present |
| test_same_data_same_hash_idempotent | Same data same hash idempotent: dict is keyed by hash, so still just one entry |
| test_raises_after_tape_closed | Raises after tape closed |

TestInlineKindEnforcement

| Test | What it verifies |
|---|---|
| test_circuit_qasm_rejected | Circuit QASM rejected |
| test_calibration_snapshot_rejected | Calibration snapshot rejected |
| test_execution_outcome_allowed | Execution outcome allowed |
| test_parameters_allowed | Parameters allowed |
| test_observables_allowed | Observables allowed |
| test_generic_blob_allowed | Generic blob allowed |

TestStorageRouting

| Test | What it verifies |
|---|---|
| test_inline_artifact_not_written_to_disk | Inline artifact not written to disk: artifacts directory must be empty (no files, not even shard dirs with files) |
| test_inline_artifact_not_in_catalog | Inline artifact not in catalog |
| test_inline_appears_in_jsonl | Inline appears in JSONL |
| test_outcome_ref_resolves_from_inline | Outcome ref resolves from inline |
| test_structural_artifact_still_uses_file_store | Structural artifact still uses file store |

TestParquetWriterInline

| Test | What it verifies |
|---|---|
| test_inline_artifacts_column_written | Inline artifacts column written |
| test_inline_artifacts_data_round_trips | Inline artifacts data round trips |
| test_spans_without_inline_have_null_column | Spans without inline have null column: a null cell reads back as None or NaN depending on pandas version |
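
The hash-keyed, kind-routed behaviour described for inline artifacts can be sketched as follows. The function is a free-standing stand-in, not SpanHandle.attach_inline() itself, and the canonical JSON encoding used to derive the hash is an assumption; only the kind names come from the table above.

```python
import hashlib
import json

# Kinds the tests above say are rejected inline (they go to the file store).
STRUCTURAL_KINDS = {"circuit_qasm", "calibration_snapshot"}


def attach_inline(store, kind, data):
    """Store `data` under its SHA-256 key and return the key.

    `store` is a plain dict standing in for a span's inline-artifact map.
    """
    if kind in STRUCTURAL_KINDS:
        raise ValueError(f"{kind} is structural and cannot be stored inline")
    # Assumed canonical encoding so equal data always hashes equally.
    payload = json.dumps(data, sort_keys=True, separators=(",", ":")).encode()
    key = hashlib.sha256(payload).hexdigest()
    store[key] = {"kind": kind, "size_bytes": len(payload), "data": data}
    return key


inline = {}
first = attach_inline(inline, "execution_outcome", {"00": 250, "11": 6})
second = attach_inline(inline, "execution_outcome", {"11": 6, "00": 250})

try:
    attach_inline(inline, "circuit_qasm", "OPENQASM 2.0;")
    rejected = False
except ValueError:
    rejected = True
```

Keying the map by content hash makes repeated attachment idempotent: the second call above overwrites the first entry with identical content instead of adding a new one.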
test_invariants.py
Tier 4 property / invariant tests. These test the eight architectural invariants stated in docs/architecture/001_invariants.md, plus hash integrity and storage-triage consistency properties.
TestINV001ObserverEffect

| Test | What it verifies |
|---|---|
| test_qiskit_estimator_does_not_alter_result | Proxy result must be bitwise identical to unproxied result. |
| test_pennylane_proxy_does_not_alter_result | PennyLane proxy result must match direct device result. |

TestINV002TraceImmutability

| Test | What it verifies |
|---|---|
| test_write_after_close_raises | Write after close raises |
| test_attach_artifact_after_close_raises | Attach artifact after close raises |
| test_close_idempotent | Close idempotent |
| test_events_jsonl_append_only | Span data is flushed immediately: JSONL length only grows. |

TestPROP007IntegritySeal

| Test | What it verifies |
|---|---|
| test_seal_present_after_seal | Seal present after seal |
| test_seal_checksum_matches_events_file | Seal checksum matches events file |
| test_artifact_count_includes_inline | Artifact count includes inline |
| test_artifact_count_includes_filestore | Artifact count includes file store |
| test_seal_absent_while_in_flight | Before sealing, trace.json is CRASHED_IN_FLIGHT with no seal. |

TestINV007FailureVisibility

| Test | What it verifies |
|---|---|
| test_exception_span_has_error_event | Exception span has error event |
| test_error_event_captures_exception_type | Error event captures exception type |
| test_exception_propagates_to_caller | Exception propagates to caller |
| test_tape_sealed_with_errors_on_outer_exception | Tape sealed with errors on outer exception |

TestPROP001HashIntegrity

| Test | What it verifies |
|---|---|
| test_inline_artifact_keys_match_sha256 | Inline artifact keys match sha256 |
| test_all_spans_in_jsonl_pass_hash_check | All spans in JSONL pass hash check |

TestPROP002FileStoreIntegrity

| Test | What it verifies |
|---|---|
| test_attached_file_hash_matches_disk | Attached file hash matches disk: verify file on disk |

TestPROP003CatalogConsistency

| Test | What it verifies |
|---|---|
| test_catalog_entries_match_file_count | Catalog entries match file count: files on disk may be < catalog entries if circuits are identical (dedup) |
| test_inline_artifacts_not_counted_in_catalog | Inline artifacts not counted in catalog |

TestPROP004SequenceNumbers

| Test | What it verifies |
|---|---|
| test_sequence_numbers_monotonic_and_unique | Sequence numbers monotonic and unique |
| test_nested_spans_both_get_unique_sequences | Nested spans both get unique sequences |

TestPROP005NoCircuitInline

| Test | What it verifies |
|---|---|
| test_circuit_qasm_always_in_file_store | Circuit QASM always in file store |

TestPROP006OutcomeRefResolves

| Test | What it verifies |
|---|---|
| test_inline_outcome_ref_resolves | Inline outcome ref resolves |
| test_file_store_outcome_ref_resolves | File store outcome ref resolves |
test_storage.py
Verifies the PyArrow Parquet conversion engine. Ensures columnar arrays maintain strict integrity against the JSON schema.

| Test | What it verifies |
|---|---|
| test_parquet_conversion_creates_file | Parquet conversion creates file |
| test_parquet_schema_and_data_integrity | Parquet schema and data integrity: read it back into memory to verify column types |
| test_missing_jsonl_raises_error | Missing JSONL raises error |
test_tape.py¶
All I/O is isolated to tmp_path. All model types are imported from the hilbertbench.models public interface only, never from v1_0 directly. The tests adhere strictly to INV-001, INV-003, INV-004, and INV-007.
TestOpen
| Test | What it verifies |
|---|---|
| test_creates_run_directory | Creates run directory |
| test_dir_name_format | Dir name format |
| test_artifacts_subdir_exists | Artifacts subdir exists |
| test_trace_json_written_on_open | Trace json written on open |
| test_events_jsonl_created_on_open | Events jsonl created on open |
TestTraceLifecycle
| Test | What it verifies |
|---|---|
| test_sealed_success_on_clean_exit | Sealed success on clean exit |
| test_sealed_with_errors_on_exception | Sealed with errors on exception |
| test_timestamp_end_absent_while_open | Timestamp end absent while open |
| test_timestamp_end_present_after_close | Timestamp end present after close |
| test_tags_persisted | Tags persisted |
TestSpans
| Test | What it verifies |
|---|---|
| test_span_flushed_immediately_on_close | Span flushed immediately on close |
| test_span_fields_present | Span fields present |
| test_span_nesting_parent_id | Span nesting parent id: Inner span is closed and flushed first, so it is at index 0 |
| test_root_span_has_no_parent | Root span has no parent |
| test_sequence_numbers_monotonic_and_unique | Sequence numbers monotonic and unique |
| test_span_event_recorded | Span event recorded: Should have REQUEST, CALIBRATION_CHECK, RESULT |
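The note on test_span_nesting_parent_id (the inner span lands at index 0) follows from flushing each span when it closes. A minimal sketch of that ordering, assuming a hypothetical `span` context manager and an in-memory list standing in for events.jsonl; none of these names come from the project:

```python
import json
from contextlib import contextmanager

events = []    # stands in for events.jsonl: one JSON line per closed span
_open = []     # currently open spans, innermost last
_next_id = 0


@contextmanager
def span(name):
    """Open a span, link it to the enclosing span, flush it when it closes."""
    global _next_id
    _next_id += 1
    record = {
        "span_id": _next_id,
        "name": name,
        "parent_id": _open[-1]["span_id"] if _open else None,
    }
    _open.append(record)
    try:
        yield record
    finally:
        _open.pop()
        events.append(json.dumps(record))   # written at close time, not open time


with span("outer"):
    with span("inner"):
        pass

first, second = (json.loads(line) for line in events)
```

Because the inner block exits first, `first` is the inner span and its `parent_id` points at the outer span, which is written second and has no parent.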
TestThreadSafety
| Test | What it verifies |
|---|---|
| test_parallel_spans_do_not_cross_nest | Per-thread span stack must be independent (threading.local check). |
| test_events_jsonl_valid_under_concurrency | Every line must be valid JSON after 10 concurrent span writers. |
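Both thread-safety properties can be sketched with the standard library: a `threading.local` object gives each thread its own span stack, and a lock keeps each JSON line contiguous in the shared log. The helper names and record fields below are hypothetical; the real tape may be organized differently.

```python
import json
import tempfile
import threading
from pathlib import Path

_local = threading.local()      # attributes set here are private to each thread
_write_lock = threading.Lock()  # one writer at a time appends to the log


def _span_stack():
    """Return the calling thread's own span stack, creating it on first use."""
    if not hasattr(_local, "stack"):
        _local.stack = []
    return _local.stack


def run_span(log_path, name):
    stack = _span_stack()
    parent = stack[-1] if stack else None   # only this thread's spans qualify
    stack.append(name)
    try:
        line = json.dumps({"name": name, "parent": parent})
        with _write_lock:                   # keep each JSON line contiguous
            with open(log_path, "a", encoding="utf-8") as fh:
                fh.write(line + "\n")
    finally:
        stack.pop()


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "events.jsonl"
    workers = [
        threading.Thread(target=run_span, args=(log_path, f"span-{i}"))
        for i in range(10)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    records = [json.loads(line) for line in log_path.read_text().splitlines()]
```

Every line parses, and no span picks up a parent from another thread, because each thread only ever inspects its own stack.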
TestAttach
| Test | What it verifies |
|---|---|
| test_artifact_copied_to_artifacts_dir | Artifact copied to artifacts dir: Artifacts use 2-char sharding: artifacts/ |
| test_catalog_json_written_on_close | Catalog json written on close |
| test_sha256_correct | Sha256 correct |
| test_size_bytes_correct | Size bytes correct |
| test_missing_file_raises | Missing file raises |
| test_compression_stored | Compression stored |
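As a rough illustration of what the attach checks cover, the sketch below hashes a file in chunks, copies it into a sharded directory, and records its size. The catalog's sharding path is cut off after artifacts/, so the layout used here (first two hex characters of the digest as the shard directory, full digest as the file name) is an assumption, as are the `attach` helper and its return fields.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def attach(run_dir: Path, source: Path) -> dict:
    if not source.is_file():
        raise FileNotFoundError(source)
    digest = sha256_of(source)
    # Assumed layout: artifacts/<first 2 hex chars>/<full digest>
    dest = run_dir / "artifacts" / digest[:2] / digest
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, dest)
    return {"sha256": digest, "size_bytes": dest.stat().st_size, "path": dest}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "circuit.qasm"
    src.write_bytes(b"OPENQASM 2.0;\n")
    entry = attach(root / "run", src)
    shard_name = entry["path"].parent.name
    rehashed = sha256_of(entry["path"])   # re-read the copy from disk
    try:
        attach(root / "run", root / "missing.qasm")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Re-hashing the stored copy rather than trusting the recorded digest is the kind of check the sha256 and size tests describe.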
TestFreezeOnClose
| Test | What it verifies |
|---|---|
| test_span_after_close_raises | Span after close raises |
| test_attach_after_close_raises | Attach after close raises |
| test_close_idempotent | Close idempotent |
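Freeze-on-close behavior can be sketched as a guard that every write path calls, plus a `close` that does its work only once. The `ToyTape` class and the choice of `RuntimeError` are assumptions for illustration; the catalog does not say which exception the real tape raises.

```python
class ToyTape:
    """Hypothetical recorder that refuses writes once it has been closed."""

    def __init__(self):
        self.closed = False
        self.seal_count = 0
        self.spans = []

    def _require_open(self):
        if self.closed:
            raise RuntimeError("tape is closed")

    def span(self, name):
        self._require_open()
        self.spans.append(name)

    def close(self):
        if self.closed:        # second and later calls are no-ops
            return
        self.closed = True
        self.seal_count += 1   # sealing work would happen here, exactly once


tape = ToyTape()
tape.span("before-close")
tape.close()
tape.close()                   # idempotent: no error, no second seal

try:
    tape.span("after-close")
    late_write_raised = False
except RuntimeError:
    late_write_raised = True
```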
TestExceptionPath
| Test | What it verifies |
|---|---|
| test_exception_span_written | Exception span written: The exception occurred inside the span, so it should be FAILED |
| test_original_exception_propagates | Original exception propagates |
| test_exception_attributes_captured | Exception attributes captured |
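The three exception-path checks map naturally onto a context manager's `__exit__`: record the failure, write the span, and return `False` so the original exception keeps propagating. In the sketch below the `ToySpan` class, the COMPLETED status name, and the attribute keys are hypothetical; only the FAILED status appears in the catalog.

```python
class ToySpan:
    """Hypothetical span that records failures without swallowing them."""

    def __init__(self, name, sink):
        self.record = {"name": name, "status": "OPEN", "attributes": {}}
        self.sink = sink

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.record["status"] = "COMPLETED"
        else:
            self.record["status"] = "FAILED"
            self.record["attributes"]["exception_type"] = exc_type.__name__
            self.record["attributes"]["exception_message"] = str(exc)
        self.sink.append(self.record)   # the span is written either way
        return False                    # never suppress the caller's exception


sink = []
caught = None
try:
    with ToySpan("execute", sink):
        raise ValueError("backend rejected job")
except ValueError as err:
    caught = err
```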
tests/tools/¶
The blinded-corpus protocol tool: leakage auditing, verbatim blinding with random IDs, SHA-256 answer-key commitments, and confusion-matrix scoring with Wilson intervals.
test_blind_corpus.py¶
Tests for the blinded-corpus protocol tool (tools/blind_corpus.py): leakage audit, blinding round-trip, commitment verification, and confusion-matrix scoring.
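A SHA-256 commitment to an answer key can be sketched as hashing a canonical serialization of the key, then re-hashing at scoring time to detect edits. The functions, the canonical-JSON choice, and the run IDs and labels below are placeholders for illustration; tools/blind_corpus.py may serialize and commit differently.

```python
import hashlib
import json


def commit(answer_key: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the digest."""
    canonical = json.dumps(answer_key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(answer_key: dict, commitment: str) -> bool:
    return commit(answer_key) == commitment


# Placeholder IDs and labels, not real corpus values.
key = {"run-7f3a": "label_a", "run-c210": "label_b"}
commitment = commit(key)

reordered = {"run-c210": "label_b", "run-7f3a": "label_a"}

tampered = dict(key)
tampered["run-7f3a"] = "label_b"   # flip one label after committing
```

Publishing `commitment` before the diagnosis step means any later change to the key, even a single flipped label, fails verification.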
TestAudit
| Test | What it verifies |
|---|---|
| test_clean_run_passes | Clean run passes |
| test_label_in_tags_is_flagged | Label in tags is flagged |
| test_label_in_dirname_is_flagged | Label in dirname is flagged |
TestBlind
| Test | What it verifies |
|---|---|
| test_blinding_roundtrip | Blinding roundtrip: blinded copies + key + commitment + sheet all exist |
| test_leaky_corpus_is_refused | Leaky corpus is refused |
| test_invalid_label_is_refused | Invalid label is refused |
TestScore
| Test | What it verifies |
|---|---|
| test_perfect_diagnosis_scores_one | Perfect diagnosis scores one |
| test_wrong_diagnosis_scores_zero | Wrong diagnosis scores zero |
| test_secondary_label_counts_in_top2 | Secondary label counts in top2: primary wrong, but the true label is given as the secondary |
| test_missing_primary_is_refused | Missing primary is refused |
| test_tampered_key_fails_commitment | Tampered key fails commitment: flip the label so the tamper is guaranteed to change content |
| test_wilson_interval_sane | Wilson interval sane |
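For reference, the Wilson score interval for a binomial proportion can be computed as below. This is the textbook formula written as a standalone function; the function name, the default z = 1.96 (roughly 95% coverage), and the clamping to [0, 1] are choices made for this sketch, not details taken from the tool.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Return (low, high) bounds of the Wilson score interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against tiny floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low_8, high_8 = wilson_interval(8, 10)
low_all, high_all = wilson_interval(10, 10)
low_none, high_none = wilson_interval(0, 10)
```

Unlike the normal-approximation interval, the bounds stay inside [0, 1] and remain informative when every diagnosis is right or every one is wrong, which is what a sanity test would typically probe.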
tests/trace/¶
HilbertTrace is the public read API. These tests guarantee that whatever the recorder wrote, the reader resolves back exactly — spans, outcomes, parameters, circuits, calibration history — without the caller knowing about storage details.
test_hilberttrace.py¶
Tests for the HilbertTrace unified data API. Traces are built with the tape directly (no quantum execution) so the resolution logic — inline vs file-store, scalar vs array vs dict outcomes — is exercised deterministically.
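The inline versus file-store distinction can be pictured as a single resolver that dispatches on where a reference points. Everything in this sketch is hypothetical: the `tier`, `key`, and `path` fields, the `resolve` function, and the span layout are invented for illustration and do not describe HilbertTrace's actual schema.

```python
import tempfile
from pathlib import Path


def resolve(ref, inline_store: dict, run_dir: Path):
    """Return the referenced value regardless of which tier holds it."""
    if ref is None:
        return None                                  # nothing was recorded
    if ref["tier"] == "inline":
        return inline_store[ref["key"]]              # value lives next to the span
    if ref["tier"] == "file":
        return (run_dir / ref["path"]).read_text()   # value lives on disk
    raise ValueError(f"unknown storage tier: {ref['tier']!r}")


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "circuit.qasm").write_text("OPENQASM 2.0;\n")
    inline_store = {"outcome-1": 0.42}

    # One span mixing tiers: inline outcome, file-store circuit, no parameters.
    span = {
        "outcome_ref": {"tier": "inline", "key": "outcome-1"},
        "circuit_ref": {"tier": "file", "path": "circuit.qasm"},
        "parameters_ref": None,
    }
    outcome = resolve(span["outcome_ref"], inline_store, run_dir)
    circuit = resolve(span["circuit_ref"], inline_store, run_dir)
    parameters = resolve(span["parameters_ref"], inline_store, run_dir)
```

The caller asks for an outcome or a circuit and never branches on storage tier, which is the property the resolution tests are meant to protect.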
TestConstruction
| Test | What it verifies |
|---|---|
| test_missing_directory_raises | Missing directory raises |
| test_directory_without_events_raises | Directory without events raises |
| test_repr | Repr |
TestMetadata
| Test | What it verifies |
|---|---|
| test_status_mode_tags | Status mode tags |
| test_integrity_seal_present | Integrity seal present |
| test_environment | Environment |
TestSpanAccess
| Test | What it verifies |
|---|---|
| test_len_and_iteration | Len and iteration |
| test_completed_filter | Completed filter |
| test_filter_by_backend | Filter by backend |
| test_dataframe_view | Dataframe view |
TestInlineResolution
| Test | What it verifies |
|---|---|
| test_outcome_resolves | Outcome resolves |
| test_parameters_resolve | Parameters resolve |
| test_observables_resolve | Observables resolve |
| test_missing_parameters_returns_none | Missing parameters returns none |
TestFileStoreResolution
| Test | What it verifies |
|---|---|
| test_circuit_resolves_from_file | Circuit resolves from file |
| test_outcome_inline_circuit_filestore_same_span | A span may mix storage tiers: inline outcome + file-store circuit. |
TestNumericOutcomes
| Test | What it verifies |
|---|---|
| test_scalars | Scalars |
| test_arrays_flattened | Arrays flattened |
| test_counts_dict_skipped | Sampler-style counts dicts are not numeric outcomes. |
| test_variance_matches_manual | Variance matches manual |
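The numeric-outcome rules in this table (keep scalars, flatten arrays, skip counts dicts) can be sketched as one small recursive filter. The `numeric_outcomes` function here is a stand-in written for this example, and the use of sample variance (n − 1 denominator) is an assumption; the catalog does not say which variance convention the library uses.

```python
import statistics


def numeric_outcomes(outcomes):
    """Collect scalar values, flattening nested lists and skipping dicts."""
    values = []
    for outcome in outcomes:
        if isinstance(outcome, bool):
            continue                                   # bools are not measurements
        if isinstance(outcome, (int, float)):
            values.append(float(outcome))
        elif isinstance(outcome, (list, tuple)):
            values.extend(numeric_outcomes(outcome))   # flatten arrays
        # dicts such as sampler counts {"00": 512, "11": 512} are skipped
    return values


raw = [0.5, [0.25, [0.75]], {"00": 512, "11": 512}, 1.0]
values = numeric_outcomes(raw)

mean = sum(values) / len(values)
manual_variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
library_variance = statistics.variance(values)   # sample variance
```

Comparing a hand-computed variance with a library result, as the last two lines do, mirrors the idea behind test_variance_matches_manual.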
TestCalibrationAndVerify
| Test | What it verifies |
|---|---|
| test_calibration_none_when_absent | Calibration none when absent |
| test_calibration_resolves_when_present | Calibration resolves when present |
| test_verify_passes_on_clean_trace | Verify passes on clean trace |
TestLazyImport
| Test | What it verifies |
|---|---|
| test_top_level_import | Top level import |
| test_unknown_attribute_raises | Unknown attribute raises |
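Lazy top-level imports in Python are commonly built on a module-level `__getattr__` (PEP 562). Whether this package does so is not stated in the catalog, so the sketch below is only one plausible shape: it builds a throwaway module object, with `demo_pkg`, `_LAZY`, and the mapped names all invented for the example.

```python
import importlib
import types

# Hypothetical mapping: public attribute name -> module that provides it.
_LAZY = {"loads": "json", "sqrt": "math"}

pkg = types.ModuleType("demo_pkg")


def _lazy_getattr(name):
    """Import the providing module on first access, then cache the attribute."""
    if name in _LAZY:
        value = getattr(importlib.import_module(_LAZY[name]), name)
        setattr(pkg, name, value)   # later lookups bypass __getattr__
        return value
    raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")


# A module-level __getattr__ is consulted only when normal lookup fails.
pkg.__getattr__ = _lazy_getattr

root = pkg.sqrt(9.0)               # triggers the lazy import of math
has_unknown = hasattr(pkg, "nope") # unknown names still raise AttributeError
```

Raising `AttributeError` for unknown names matters because `hasattr` and `from package import name` both rely on it to report a missing attribute cleanly.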