Provenance system

Code traceability system for identifying and documenting which code was written by AI and which by humans.

Status: Functional since v0.2.0 (Phase 2 completed).

Overview

The provenance system answers the question: who wrote this code? It analyzes git history and AI agent session logs to classify each file as ai, human, or mixed, with a confidence score between 0.0 and 1.0.

licit trace --stats

Analyzing git history...
Records: 45 files analyzed
AI-generated: 18 (40.0%)
Human-written: 22 (48.9%)
Mixed: 5 (11.1%)

AI tools detected: claude-code (15), cursor (3)
Models detected: claude-sonnet-4 (12), claude-opus-4 (3), gpt-4o (3)

Stored in .licit/provenance.jsonl

Architecture

The system consists of 8 modules in src/licit/provenance/:

provenance/
├── heuristics.py          # Engine with 6 AI detection heuristics
├── git_analyzer.py        # Git log parser with heuristic analysis
├── store.py               # JSONL append-only store
├── attestation.py         # HMAC-SHA256 signing + Merkle tree
├── tracker.py             # Full pipeline orchestrator
├── report.py              # Markdown report generator
└── session_readers/
    ├── base.py            # SessionReader Protocol
    └── claude_code.py     # Reader for Claude Code

Data flow

Git Log ──→ GitAnalyzer ──→ ProvenanceRecord[] ─┐
                                                 ├──→ ProvenanceTracker ──→ Store
Session Logs ──→ SessionReader ──→ Records[] ───┘          │
                                                           ├──→ Attestation (HMAC + Merkle)
                                                           └──→ Report (Markdown)

Heuristics engine

AICommitHeuristics applies 6 independent heuristics to each git commit. Each heuristic produces a score (0.0-1.0) and a relative weight.

Heuristics

#	Name	Weight	What it detects
H1	Author pattern	3.0	AI author names: `claude`, `copilot`, `cursor`, `bot`, `devin`, `aider`, `[bot]`
H2	Message pattern	1.5	Commit patterns: conventional commits, `[ai]`, `implement`, `generate`, `Co-authored-by` in subject
H3	Bulk changes	2.0	Bulk changes: >20 files and >500 lines in a single commit
H4	Co-author	3.0	`Co-authored-by:` trailer with AI keywords in the commit body
H5	File patterns	1.0	All modified files are test files (`test_`, `_test.`, `.spec.`)
H6	Time pattern	0.5	Commits between 1:00 AM and 5:00 AM
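To make the table more concrete, here is a minimal sketch of three of the heuristics (H1, H3, H6). It is not licit's implementation: the function names are hypothetical, each heuristic is assumed to return a binary score (1.0 or 0.0), and H6 is assumed to read the hour in the commit's own UTC offset with an exclusive upper bound.

```python
from datetime import datetime

# Keywords taken from the H1 row of the table above.
AI_AUTHOR_KEYWORDS = ("claude", "copilot", "cursor", "bot", "devin", "aider", "[bot]")


def author_pattern_score(author: str) -> float:
    """H1: signal when the author name contains an AI-related keyword."""
    name = author.lower()
    return 1.0 if any(keyword in name for keyword in AI_AUTHOR_KEYWORDS) else 0.0


def bulk_changes_score(files_changed: int, lines_changed: int) -> float:
    """H3: signal only when both thresholds are exceeded in one commit."""
    return 1.0 if files_changed > 20 and lines_changed > 500 else 0.0


def time_pattern_score(iso_date: str) -> float:
    """H6: signal for commits made between 1:00 AM and 5:00 AM.

    Assumption: the hour is taken as written in the ISO 8601 date
    (the commit's own offset) and 5:00 itself is excluded.
    """
    hour = datetime.fromisoformat(iso_date).hour
    return 1.0 if 1 <= hour < 5 else 0.0
```

Plain substring matching is simple but broad: a keyword such as `bot` also matches unrelated names that happen to contain it, which is one reason to combine several heuristics rather than rely on a single one.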

Score calculation

Only heuristics that produce a signal (score > 0) participate in the weighted average:

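signaling = [h for h in results if h.score > 0]
total_weight = sum(h.weight for h in signaling)
weighted_sum = sum(h.score * h.weight for h in signaling)
# No signal means no weight to divide by, so fall back to 0.0
final_score = weighted_sum / total_weight if signaling else 0.0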

If no heuristic signals, the final score is 0.0 (human).
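A worked example may help here. The sketch below assumes a hypothetical `HeuristicResult` dataclass and `combine` function (neither name comes from licit) and uses weights from the heuristics table with made-up scores.

```python
from dataclasses import dataclass


@dataclass
class HeuristicResult:
    """Hypothetical container for one heuristic's output."""
    name: str
    score: float   # 0.0-1.0
    weight: float  # relative weight


def combine(results: list[HeuristicResult]) -> float:
    """Weighted average over the heuristics that produced a signal."""
    signaling = [h for h in results if h.score > 0]
    if not signaling:
        return 0.0  # no signal at all: treated as human
    total_weight = sum(h.weight for h in signaling)
    return sum(h.score * h.weight for h in signaling) / total_weight


results = [
    HeuristicResult("author_pattern", 0.0, 3.0),   # no signal, ignored
    HeuristicResult("message_pattern", 0.4, 1.5),
    HeuristicResult("co_author", 1.0, 3.0),
    HeuristicResult("time_pattern", 0.0, 0.5),     # no signal, ignored
]

# (0.4 * 1.5 + 1.0 * 3.0) / (1.5 + 3.0) = 3.6 / 4.5 = 0.8
final = combine(results)
```

One consequence of excluding silent heuristics is that weights only matter when two or more heuristics signal: a single signaling heuristic determines the final score on its own, whatever its weight.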

Classification

Score	Classification
>= 0.7	`ai`: code probably generated by AI
>= 0.5 and < 0.7	`mixed`: code with mixed contribution
< 0.5	`human`: code probably human-written

The configurable confidence_threshold (default: 0.6) affects filtering in reports, not the base classification.
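The thresholds translate into a short function. This is a sketch with a hypothetical name (`classify`), not licit's own code. It deliberately takes no `confidence_threshold` argument, reflecting the statement that the threshold does not influence the base classification.

```python
def classify(score: float) -> str:
    """Map a final heuristic score to a provenance label."""
    if score >= 0.7:
        return "ai"
    if score >= 0.5:
        return "mixed"
    return "human"


labels = [classify(s) for s in (0.85, 0.7, 0.55, 0.5, 0.49, 0.0)]
```

A score of 0.55 is still classified as `mixed` even though it falls below the default `confidence_threshold` of 0.6.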

Git Analyzer

GitAnalyzer parses git history and applies heuristics to each commit.

Git log parsing

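Runs git log with a custom format that uses the control bytes NUL and SOH (%x00, %x01) as separators. Neither byte appears in ordinary commit text, so fields parse reliably: %x00 marks the start of each commit and %x01 separates its fields.

git log --format="%x00%H%x01%an%x01%ae%x01%aI%x01%s%x01%b" --numstat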

Fields extracted into CommitInfo:

sha: Commit hash
author / author_email: Author
date: ISO 8601 date
message: Commit subject
files_changed: List of modified files
insertions / deletions: Lines added/deleted
co_authors: Co-authors extracted from the body (Co-authored-by:)
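The following sketch shows one way such output could be parsed into the fields listed above. It assumes each commit record starts with a NUL byte, that fields are separated by `\x01`, and that the `--numstat` lines follow the body. The `CommitInfo` dataclass here is a stand-in whose shape is inferred from the field list, and `parse_git_log` is a hypothetical name.

```python
import re
from dataclasses import dataclass, field

# numstat lines look like "<added>\t<deleted>\t<path>"; binary files use "-"
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")
CO_AUTHOR = re.compile(r"^Co-authored-by:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


@dataclass
class CommitInfo:
    sha: str
    author: str
    author_email: str
    date: str
    message: str
    files_changed: list[str] = field(default_factory=list)
    insertions: int = 0
    deletions: int = 0
    co_authors: list[str] = field(default_factory=list)


def parse_git_log(raw: str) -> list[CommitInfo]:
    commits = []
    for chunk in raw.split("\x00"):
        if not chunk.strip():
            continue
        parts = chunk.split("\x01", 5)
        if len(parts) < 6:
            continue  # malformed record, skip it
        sha, author, email, date, subject, rest = parts
        commit = CommitInfo(sha.strip(), author, email, date, subject)
        body_lines = []
        for line in rest.splitlines():
            match = NUMSTAT.match(line)
            if match:
                added, deleted, path = match.groups()
                commit.files_changed.append(path)
                commit.insertions += 0 if added == "-" else int(added)
                commit.deletions += 0 if deleted == "-" else int(deleted)
            else:
                body_lines.append(line)
        body = "\n".join(body_lines)
        commit.co_authors = [name.strip() for name in CO_AUTHOR.findall(body)]
        commits.append(commit)
    return commits


sample = (
    "\x00abc123\x01Ana\x01ana@example.com\x012026-03-10T14:30:00+00:00\x01Add parser\x01"
    "Co-authored-by: Claude <claude@example.com>\n\n"
    "10\t2\tsrc/app.py\n3\t0\ttests/test_app.py\n"
    "\x00def456\x01Luis\x01luis@example.com\x012026-03-11T09:00:00+00:00\x01Fix typo\x01"
    "\n1\t1\tREADME.md\n"
)
commits = parse_git_log(sample)
```

Telling numstat lines apart from body text with a regular expression is a simplification: a body line that happens to look like `12<TAB>3<TAB>path` would be misread as a file entry.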

Options

Option	CLI flag	Description
`since`	`--since`	Analyze commits from date (YYYY-MM-DD) or tag
Timeout	—	30 seconds for `git log` (prevents blocking on massive repos)

Result per file

For each file, the maximum score across all commits that modified it is taken. The detection method is always ProvenanceSource.GIT_INFER.
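The per-file aggregation can be sketched as follows. The function name and the `(files_changed, score)` input shape are assumptions made for illustration.

```python
def max_score_per_file(commits):
    """Keep, for every file, the highest score of any commit that touched it.

    `commits` is an iterable of (files_changed, score) pairs.
    """
    best: dict[str, float] = {}
    for files_changed, score in commits:
        for path in files_changed:
            best[path] = max(score, best.get(path, 0.0))
    return best


history = [
    (["src/app.py", "README.md"], 0.2),
    (["src/app.py"], 0.85),
    (["README.md"], 0.0),
]
per_file = max_score_per_file(history)
```

Because the maximum is used, a single high-scoring commit determines a file's score regardless of how many low-scoring commits also touched it.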

Session Readers

Session readers extract provenance information directly from AI agent session logs.

Protocol

from pathlib import Path
from typing import Protocol

class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...

Claude Code Reader

Reads Claude Code JSONL session files (typically in ~/.claude/projects/).

Extracted fields:

Modified files (from Write, Edit tool calls)
Model used (claude-sonnet-4, claude-opus-4, etc.)
Tool: claude-code
Session ID
Estimated cost (if available)

Configuration:

provenance:
  methods:
    - git-infer
    - session-log
  session_dirs:
    - ~/.claude/projects/

Extensibility

To add support for another agent (e.g., Cursor), implement the SessionReader Protocol and register it in ProvenanceTracker.
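As an illustration, a reader for a made-up agent might look like the sketch below. Everything specific here is an assumption: the log format (one JSON object per line with `file` and `model` keys), the `ExampleAgentReader` class, the stand-in `ProvenanceRecord` dataclass, and the confidence of 1.0. How a reader is registered in ProvenanceTracker is not shown because that API is not documented here.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol


@dataclass
class ProvenanceRecord:
    """Minimal stand-in for licit's record type (fields are assumptions)."""
    file_path: str
    source: str
    confidence: float
    method: str
    model: Optional[str] = None
    agent_tool: Optional[str] = None


class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...


class ExampleAgentReader:
    """Hypothetical reader for a hypothetical JSONL session format."""

    def can_read(self, path: Path) -> bool:
        return path.is_dir() and any(path.glob("*.jsonl"))

    def read_sessions(self, path: Path) -> list[ProvenanceRecord]:
        records = []
        for log in sorted(path.glob("*.jsonl")):
            for line in log.read_text(encoding="utf-8").splitlines():
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or corrupt lines
                if isinstance(event, dict) and "file" in event:
                    records.append(ProvenanceRecord(
                        file_path=event["file"],
                        source="ai",
                        confidence=1.0,  # assumption: session logs are direct evidence
                        method="session-log",
                        model=event.get("model"),
                        agent_tool="example-agent",
                    ))
        return records


with tempfile.TemporaryDirectory() as tmp:
    session_dir = Path(tmp)
    (session_dir / "s1.jsonl").write_text(
        '{"file": "src/app.py", "model": "example-model"}\nnot json\n',
        encoding="utf-8",
    )
    reader: SessionReader = ExampleAgentReader()  # structural typing, no inheritance needed
    readable = reader.can_read(session_dir)
    found = reader.read_sessions(session_dir)
```

Because SessionReader is a Protocol, the new class does not need to inherit from it. Providing the two methods with matching signatures is enough for a type checker to accept it.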

JSONL Store

ProvenanceStore keeps provenance records in an append-only JSONL (JSON Lines) file.

Format

Each line is an independent JSON object:

{"file_path": "src/app.py", "source": "ai", "confidence": 0.85, "method": "git-infer", "timestamp": "2026-03-10T14:30:00", "model": "claude-sonnet-4", "agent_tool": "claude-code"}
{"file_path": "tests/test_app.py", "source": "human", "confidence": 0.0, "method": "git-infer", "timestamp": "2026-03-10T14:30:00"}

Operations

Operation	Method	Description
Append	`append(records)`	Appends records to the end of the file
Load	`load()`	Reads all records from the store
Count	`count()`	Counts records without loading everything into memory
Clear	`clear()`	Empties the store (for re-analysis)

Characteristics

Append-only: Records are never modified or deleted during normal operation
Immutable per record: Each record has a timestamp and signature (if enabled)
Safe serialization: Uses default=str for datetime and other types
Configurable path: provenance.store_path in .licit.yaml
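The operations and characteristics above can be approximated with a few lines of standard-library code. The `JsonlStore` class below is a simplified sketch that works on plain dictionaries. It is not the actual ProvenanceStore.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


class JsonlStore:
    """Simplified append-only JSONL store."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def append(self, records: list[dict]) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            for record in records:
                # default=str turns datetime and other non-JSON types into strings
                fh.write(json.dumps(record, default=str) + "\n")

    def load(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def count(self) -> int:
        # Streams line by line, so no record is parsed or held in memory
        if not self.path.exists():
            return 0
        with self.path.open(encoding="utf-8") as fh:
            return sum(1 for line in fh if line.strip())

    def clear(self) -> None:
        if self.path.exists():
            self.path.write_text("", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    store = JsonlStore(Path(tmp) / ".licit" / "provenance.jsonl")
    store.append([{"file_path": "src/app.py", "source": "ai", "confidence": 0.85,
                   "timestamp": datetime(2026, 3, 10, 14, 30)}])
    store.append([{"file_path": "tests/test_app.py", "source": "human", "confidence": 0.0}])
    count_after_append = store.count()
    loaded = store.load()
    store.clear()
    count_after_clear = store.count()
```

Note that `default=str` renders a `datetime` with a space between date and time. Calling `isoformat()` before serializing would produce the `T`-separated form shown in the format example.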

Attestation (Cryptographic signing)

ProvenanceAttestor provides HMAC-SHA256 signing of individual records and Merkle tree integrity verification for batches.

Individual signing

from licit.provenance.attestation import ProvenanceAttestor

attestor = ProvenanceAttestor()  # Auto-generates key if it doesn't exist

# Sign a record
data = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = attestor.sign_record(data)

# Verify
assert attestor.verify_record(data, signature)
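For readers unfamiliar with HMAC, the sketch below shows what signing and verifying a record could look like using only the standard library. The free functions `sign_record` and `verify_record` take the key explicitly and are illustrative, not the ProvenanceAttestor methods. Serializing with `sort_keys=True` before signing is an assumption carried over from the canonical JSON used for the Merkle tree.

```python
import hashlib
import hmac
import json
import os


def canonical_bytes(data: dict) -> bytes:
    """Serialize deterministically so key order does not change the signature."""
    return json.dumps(data, sort_keys=True, default=str).encode("utf-8")


def sign_record(key: bytes, data: dict) -> str:
    return hmac.new(key, canonical_bytes(data), hashlib.sha256).hexdigest()


def verify_record(key: bytes, data: dict, signature: str) -> bool:
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(sign_record(key, data), signature)


key = os.urandom(32)
data = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = sign_record(key, data)

ok = verify_record(key, data, signature)
tampered_ok = verify_record(key, {**data, "source": "human"}, signature)
reordered_ok = verify_record(
    key, {"confidence": 0.85, "source": "ai", "file": "app.py"}, signature
)
```

Changing any field invalidates the signature, while reordering keys does not, because the record is canonicalized before hashing.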

Merkle tree (batch)

To verify the integrity of a set of records:

records = [record1, record2, record3, record4]
root_hash = attestor.sign_batch(records)

         root_hash
        /         \
    hash_01      hash_23
    /    \       /    \
 hash_0 hash_1 hash_2 hash_3
   |      |      |      |
 rec_0  rec_1  rec_2  rec_3

Each record is serialized as canonical JSON (sort_keys=True, default=str)
SHA256 of each record forms the tree leaves
Pairs are concatenated and re-hashed up to the root
Odd number of records: the last one is duplicated
Timing-safe verification with hmac.compare_digest
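The tree construction described above can be sketched as follows. Two details are assumptions, since the text does not specify them: sibling hashes are concatenated as hex strings before re-hashing, and an empty batch is rejected. `merkle_root` is a hypothetical helper, and whether `sign_batch` additionally applies HMAC to the root is not covered here.

```python
import hashlib
import json


def leaf_hash(record: dict) -> str:
    """SHA256 over the canonical JSON form of one record."""
    payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merkle_root(records: list[dict]) -> str:
    if not records:
        raise ValueError("at least one record is required")
    level = [leaf_hash(record) for record in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # odd count: duplicate the last hash
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode("utf-8")).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


records = [{"file": f"f{i}.py", "source": "ai"} for i in range(3)]
root = merkle_root(records)
root_after_edit = merkle_root([{**records[0], "source": "human"}] + records[1:])
root_with_repeat = merkle_root(records + [records[-1]])
```

Editing any record changes the root. A side effect of duplicating the last leaf is that a batch ending in a repeated record yields the same root as the batch without the repeat, so the record count is worth storing alongside the root.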

Key management

The signing key is resolved in this order:

1. Explicit path in config: provenance.sign_key_path
2. Local fallback: .licit/.signing-key in the project
3. Auto-generation: 32 random bytes with os.urandom(32)

# Example with explicit key
provenance:
  sign: true
  sign_key_path: ~/.licit/signing-key
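The resolution order can be sketched as a single function. `resolve_signing_key` is a hypothetical name. Two behaviors are assumptions: falling through to the next step when the explicit path does not exist, and persisting the auto-generated key at the local fallback path.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_signing_key(project_root: Path, sign_key_path: Optional[str] = None) -> bytes:
    # 1. Explicit path from provenance.sign_key_path
    if sign_key_path:
        explicit = Path(sign_key_path).expanduser()
        if explicit.is_file():
            return explicit.read_bytes()
    # 2. Local fallback inside the project
    local = project_root / ".licit" / ".signing-key"
    if local.is_file():
        return local.read_bytes()
    # 3. Auto-generation (saving to the fallback path is an assumption)
    key = os.urandom(32)
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(key)
    local.chmod(0o600)  # owner read/write only; limited effect on Windows
    return key


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    generated = resolve_signing_key(root)   # step 3: nothing exists yet
    reused = resolve_signing_key(root)      # step 2: finds the saved key
    explicit_file = root / "explicit-key"
    explicit_file.write_bytes(b"k" * 32)
    explicit = resolve_signing_key(root, str(explicit_file))  # step 1
```

Persisting the generated key matters because signatures made with a key that is discarded afterwards could never be verified again.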

Tracker (Orchestrator)

ProvenanceTracker orchestrates the full pipeline:

from licit.provenance.tracker import ProvenanceTracker

tracker = ProvenanceTracker(config=config, project_root="/path/to/project")
stats = tracker.run(since="2026-01-01")

Pipeline

1. Git analysis: Runs GitAnalyzer to analyze commits
2. Session reading: Reads session logs if session-log is in methods
3. Merge: Combines results from git and sessions (sessions take priority on conflict)
4. Signing: Signs each record if sign: true
5. Storage: Stores in JSONL via ProvenanceStore
6. Stats: Returns aggregated statistics
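The merge step is the least obvious part of the pipeline, so here is a minimal sketch of it. It assumes records are plain dictionaries and that a conflict means two records share the same `file_path`. `merge_records` is a hypothetical name.

```python
def merge_records(git_records: list[dict], session_records: list[dict]) -> list[dict]:
    """Combine both sources; a session record replaces a git record for the same file."""
    merged = {record["file_path"]: record for record in git_records}
    for record in session_records:
        merged[record["file_path"]] = record  # session evidence overrides git inference
    return list(merged.values())


git_records = [
    {"file_path": "src/app.py", "source": "mixed", "confidence": 0.55, "method": "git-infer"},
    {"file_path": "README.md", "source": "human", "confidence": 0.0, "method": "git-infer"},
]
session_records = [
    {"file_path": "src/app.py", "source": "ai", "confidence": 1.0, "method": "session-log"},
    {"file_path": "src/new.py", "source": "ai", "confidence": 1.0, "method": "session-log"},
]
merged = merge_records(git_records, session_records)
```

Giving session logs priority is reasonable because they record what an agent actually wrote, whereas the git heuristics only infer it.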

Returned statistics

{
    "total_files": 45,
    "ai_count": 18,
    "human_count": 22,
    "mixed_count": 5,
    "ai_percentage": 40.0,
    "human_percentage": 48.9,
    "mixed_percentage": 11.1,
    "tools_detected": {"claude-code": 15, "cursor": 3},
    "models_detected": {"claude-sonnet-4": 12, "claude-opus-4": 3, "gpt-4o": 3},
}
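A dictionary of this shape can be derived from the stored records with `collections.Counter`. The sketch below uses a hypothetical `compute_stats` function and assumes percentages are rounded to one decimal place, which matches the sample figures.

```python
from collections import Counter


def compute_stats(records: list[dict]) -> dict:
    total = len(records)
    sources = Counter(record["source"] for record in records)

    def pct(count: int) -> float:
        return round(100.0 * count / total, 1) if total else 0.0

    return {
        "total_files": total,
        "ai_count": sources["ai"],
        "human_count": sources["human"],
        "mixed_count": sources["mixed"],
        "ai_percentage": pct(sources["ai"]),
        "human_percentage": pct(sources["human"]),
        "mixed_percentage": pct(sources["mixed"]),
        # Only records that name a tool or model are counted here
        "tools_detected": dict(Counter(r["agent_tool"] for r in records if r.get("agent_tool"))),
        "models_detected": dict(Counter(r["model"] for r in records if r.get("model"))),
    }


records = (
    [{"source": "ai", "agent_tool": "claude-code", "model": "claude-sonnet-4"}] * 12
    + [{"source": "ai", "agent_tool": "claude-code", "model": "claude-opus-4"}] * 3
    + [{"source": "ai", "agent_tool": "cursor", "model": "gpt-4o"}] * 3
    + [{"source": "human"}] * 22
    + [{"source": "mixed"}] * 5
)
stats = compute_stats(records)
```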

Report

ProvenanceReportGenerator generates Markdown reports from stored records.

Report contents

Summary: Totals and percentages by classification
Detailed table: File, source, confidence, method, model, tool
Detected tools: Frequency of each AI agent
Detected models: Frequency of each model

Generation

licit trace --report
# Generates .licit/reports/provenance.md

from licit.provenance.report import ProvenanceReportGenerator

generator = ProvenanceReportGenerator()
markdown = generator.generate(records, project_name="my-project")

Full configuration

provenance:
  enabled: true
  methods:
    - git-infer              # Git history heuristics
    - session-log            # Agent session logs
  session_dirs:
    - ~/.claude/projects/    # Directory with Claude Code logs
  sign: true                 # Sign records with HMAC-SHA256
  sign_key_path: ~/.licit/signing-key
  confidence_threshold: 0.6  # Report filtering threshold (does not change classification)
  store_path: .licit/provenance.jsonl

Integration with compliance

Provenance evidence directly feeds the EvidenceBundle:

Bundle field	What provenance provides
`has_provenance`	`True` if a store with records exists
`provenance_stats`	Aggregated statistics (totals, percentages, tools, models)

These fields are evaluated by the compliance frameworks:

EU AI Act Art. 10 (Data and data governance): Code origin traceability
EU AI Act Art. 13 (Transparency): Disclosure of AI usage in development
OWASP ASI-06 (Insufficient Monitoring): Provenance trail as monitoring evidence
OWASP ASI-10 (Insufficient Logging): Structured records of agent activity

Testing

167 tests cover the provenance system:

Module	Tests	File
Heuristics	23	`tests/test_provenance/test_heuristics.py`
Git Analyzer	15	`tests/test_provenance/test_git_analyzer.py`
Store	15	`tests/test_provenance/test_store.py`
Attestation	13	`tests/test_provenance/test_attestation.py`
Tracker	7	`tests/test_provenance/test_tracker.py`
Session Reader	13	`tests/test_provenance/test_session_reader.py`
QA Edge Cases	81	`tests/test_provenance/test_qa_edge_cases.py`
Total	167

The tests include:

Unit tests per module
Edge cases (Unicode, empty files, invalid keys, repos without commits, etc.)
Regression tests for 9 bugs found in QA
Cross-module integration tests