Provenance system

Code traceability system for identifying and documenting which code was written by AI and which by humans.

Status: Functional since v0.2.0 (Phase 2 completed).


Overview

The provenance system answers the question: who wrote this code? It analyzes git history and AI agent session logs to classify each file as ai, human, or mixed, with a confidence score between 0.0 and 1.0.

licit trace --stats
Analyzing git history...
Records: 45 files analyzed
AI-generated: 18 (40.0%)
Human-written: 22 (48.9%)
Mixed: 5 (11.1%)

AI tools detected: claude-code (15), cursor (3)
Models detected: claude-sonnet-4 (12), claude-opus-4 (3), gpt-4o (3)

Stored in .licit/provenance.jsonl

Architecture

The system consists of 8 modules in src/licit/provenance/:

provenance/
├── heuristics.py          # Engine with 6 AI detection heuristics
├── git_analyzer.py        # Git log parser with heuristic analysis
├── store.py               # JSONL append-only store
├── attestation.py         # HMAC-SHA256 signing + Merkle tree
├── tracker.py             # Full pipeline orchestrator
├── report.py              # Markdown report generator
└── session_readers/
    ├── base.py            # Protocol SessionReader
    └── claude_code.py     # Reader for Claude Code

Data flow

Git Log ──→ GitAnalyzer ──→ ProvenanceRecord[] ─┐
                                                 ├──→ ProvenanceTracker ──→ Store
Session Logs ──→ SessionReader ──→ Records[] ───┘          │
                                                           ├──→ Attestation (HMAC + Merkle)
                                                           └──→ Report (Markdown)

Heuristics engine

AICommitHeuristics applies 6 independent heuristics to each git commit. Each heuristic produces a score (0.0-1.0) and a relative weight.

Heuristics

| #  | Name            | Weight | What it detects |
|----|-----------------|--------|-----------------|
| H1 | Author pattern  | 3.0    | AI author names: claude, copilot, cursor, bot, devin, aider, [bot] |
| H2 | Message pattern | 1.5    | Commit patterns: conventional commits, [ai], implement, generate, Co-authored-by in subject |
| H3 | Bulk changes    | 2.0    | Bulk changes: >20 files and >500 lines in a single commit |
| H4 | Co-author       | 3.0    | Co-authored-by: trailer with AI keywords in the commit body |
| H5 | File patterns   | 1.0    | All modified files are test files (test_, _test., .spec.) |
| H6 | Time pattern    | 0.5    | Commits between 1:00 AM and 5:00 AM |

Score calculation

Only heuristics that produce a signal (score > 0) participate in the weighted average:

signaling = [h for h in results if h.score > 0]
total_weight = sum(h.weight for h in signaling)
weighted_sum = sum(h.score * h.weight for h in signaling)
final_score = weighted_sum / total_weight if signaling else 0.0  # avoids dividing by zero

If no heuristic signals, the final score is 0.0 (human).
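As a worked sketch of the calculation above, the following combines three heuristic results. The HeuristicResult dataclass, the combine function, and the individual scores are illustrative assumptions; the actual result type in heuristics.py may differ. Only the weights come from the table.

```python
from dataclasses import dataclass


@dataclass
class HeuristicResult:
    """Hypothetical stand-in for the output of one heuristic."""
    name: str
    score: float   # 0.0-1.0
    weight: float  # relative weight from the heuristics table


def combine(results: list[HeuristicResult]) -> float:
    """Weighted average over the heuristics that signaled (score > 0)."""
    signaling = [h for h in results if h.score > 0]
    if not signaling:
        return 0.0  # no signal at all: treated as human
    total_weight = sum(h.weight for h in signaling)
    return sum(h.score * h.weight for h in signaling) / total_weight


results = [
    HeuristicResult("H1 author pattern", 0.9, 3.0),
    HeuristicResult("H2 message pattern", 0.4, 1.5),
    HeuristicResult("H6 time pattern", 0.0, 0.5),  # no signal, excluded
]
final_score = combine(results)  # (0.9*3.0 + 0.4*1.5) / (3.0 + 1.5)
```

Because H6 did not signal, its weight is left out of the denominator, so a silent heuristic never drags the score toward zero.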

Classification

| Score            | Classification | Meaning |
|------------------|----------------|---------|
| >= 0.7           | ai             | Code probably generated by AI |
| >= 0.5 and < 0.7 | mixed          | Code with mixed contribution |
| < 0.5            | human          | Code probably human-written |

The configurable confidence_threshold (default: 0.6) affects filtering in reports, not the base classification.
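The thresholds in the table map directly onto a small function. This is a minimal sketch; the function name classify is an assumption, not necessarily what the library exposes.

```python
def classify(score: float) -> str:
    """Map a final heuristic score to a classification label."""
    if score >= 0.7:
        return "ai"
    if score >= 0.5:
        return "mixed"
    return "human"


labels = [classify(s) for s in (0.85, 0.7, 0.6, 0.49, 0.0)]
```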


Git Analyzer

GitAnalyzer parses git history and applies heuristics to each commit.

Git log parsing

Runs git log with a custom format using hexadecimal separators (%x00, %x01) for robust field parsing:

git log --format="%x00%H%x01%an%x01%ae%x01%aI%x01%s%x01%b" --numstat
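A minimal sketch of how output in this shape could be split. It assumes each commit record starts with a NUL byte (\x00), the six fields (hash, author name, author email, ISO date, subject, body) are separated by \x01, and the --numstat lines follow the body. ParsedCommit and parse_git_log are hypothetical names, not the actual CommitInfo type or parser.

```python
import re
from dataclasses import dataclass, field

# A --numstat line looks like "<added>\t<deleted>\t<path>"; binary files show "-".
NUMSTAT_RE = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")


@dataclass
class ParsedCommit:
    """Hypothetical container for one parsed commit."""
    sha: str
    author_name: str
    author_email: str
    date: str
    subject: str
    body: str
    files: list[str] = field(default_factory=list)


def parse_git_log(raw: str) -> list[ParsedCommit]:
    commits = []
    for record in raw.split("\x00"):
        if not record.strip():
            continue  # text before the first record marker
        parts = record.split("\x01", 5)
        if len(parts) < 6:
            continue  # malformed record, skip it
        sha, name, email, date, subject, tail = parts
        body_lines, files = [], []
        for line in tail.splitlines():
            match = NUMSTAT_RE.match(line)
            if match:
                files.append(match.group(3))
            else:
                body_lines.append(line)
        body = "\n".join(body_lines).strip()
        commits.append(ParsedCommit(sha, name, email, date, subject, body, files))
    return commits


sample = (
    "\x00abc123\x01Ada\x01ada@example.com\x012026-03-10T14:30:00+00:00"
    "\x01feat: add parser\x01Longer description.\n\n"
    "10\t2\tsrc/app.py\n3\t0\ttests/test_app.py\n"
)
commits = parse_git_log(sample)
```

Control bytes are a reasonable separator choice because they are very unlikely to appear in author names or commit messages, unlike commas, pipes, or newlines.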

Fields extracted into CommitInfo:

Options

| Option  | CLI flag | Description |
|---------|----------|-------------|
| since   | --since  | Analyze commits since a date (YYYY-MM-DD) or tag |
| Timeout |          | 30 seconds for git log (prevents blocking on massive repos) |

Result per file

For each file, the maximum score across all commits that modified it is taken. The detection method is always ProvenanceSource.GIT_INFER.
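The per-file aggregation can be sketched as follows, assuming each commit has already been reduced to a (score, modified files) pair. The function name and input shape are illustrative.

```python
def max_score_per_file(
    commit_scores: list[tuple[float, list[str]]],
) -> dict[str, float]:
    """Keep, for every file, the highest score of any commit that touched it."""
    best: dict[str, float] = {}
    for score, files in commit_scores:
        for path in files:
            best[path] = max(best.get(path, 0.0), score)
    return best


per_file = max_score_per_file([
    (0.85, ["src/app.py", "src/utils.py"]),
    (0.20, ["src/app.py"]),            # lower score does not overwrite 0.85
    (0.00, ["tests/test_app.py"]),
])
```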


Session Readers

Session readers extract provenance information directly from AI agent session logs.

Protocol

class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...

Claude Code Reader

Reads Claude Code JSONL session files (typically in ~/.claude/projects/).

Extracted fields:

Configuration:

provenance:
  methods:
    - git-infer
    - session-log
  session_dirs:
    - ~/.claude/projects/

Extensibility

To add support for another agent (e.g., Cursor), implement the SessionReader Protocol and register it in ProvenanceTracker.
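A sketch of what such a reader might look like. Everything here other than the SessionReader protocol itself is an assumption: ProvenanceRecord is a simplified stand-in for the real record type, the log layout (one JSON object per line with a "file" key) is invented for illustration, the confidence of 1.0 is a placeholder, and the registration step in ProvenanceTracker is not shown because its API is not described here.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol


@dataclass
class ProvenanceRecord:
    """Simplified stand-in for licit's record type."""
    file_path: str
    source: str
    confidence: float
    method: str
    agent_tool: Optional[str] = None


class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...


class ExampleAgentReader:
    """Hypothetical reader for an invented JSONL session layout."""

    def can_read(self, path: Path) -> bool:
        return path.is_dir() and any(path.glob("*.jsonl"))

    def read_sessions(self, path: Path) -> list[ProvenanceRecord]:
        records = []
        for log in sorted(path.glob("*.jsonl")):
            for line in log.read_text(encoding="utf-8").splitlines():
                if not line.strip():
                    continue
                event = json.loads(line)
                if "file" in event:  # only events that touched a file
                    records.append(ProvenanceRecord(
                        event["file"], "ai", 1.0, "session-log", "example-agent"))
        return records


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    (log_dir / "session.jsonl").write_text(
        '{"file": "src/app.py"}\n{"note": "no file here"}\n', encoding="utf-8")
    reader: SessionReader = ExampleAgentReader()  # satisfies the protocol structurally
    found = reader.read_sessions(log_dir) if reader.can_read(log_dir) else []
```

Because SessionReader is a Protocol, the new class does not need to inherit from anything; matching the two method signatures is enough.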


JSONL Store

ProvenanceStore stores provenance records in JSONL (JSON Lines) append-only format.

Format

Each line is an independent JSON object:

{"file_path": "src/app.py", "source": "ai", "confidence": 0.85, "method": "git-infer", "timestamp": "2026-03-10T14:30:00", "model": "claude-sonnet-4", "agent_tool": "claude-code"}
{"file_path": "tests/test_app.py", "source": "human", "confidence": 0.0, "method": "git-infer", "timestamp": "2026-03-10T14:30:00"}

Operations

| Operation | Method          | Description |
|-----------|-----------------|-------------|
| Append    | append(records) | Appends records to the end of the file |
| Load      | load()          | Reads all records from the store |
| Count     | count()         | Counts records without loading everything into memory |
| Clear     | clear()         | Empties the store (for re-analysis) |
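The four operations can be sketched with the standard library alone. JsonlStore below is an illustrative stand-in working on plain dicts, not the actual ProvenanceStore, whose constructor and record type may differ.

```python
import json
import tempfile
from pathlib import Path


class JsonlStore:
    """Illustrative append-only JSONL store."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, records: list[dict]) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")  # one object per line

    def load(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def count(self) -> int:
        # Streams the file line by line instead of parsing every record.
        if not self.path.exists():
            return 0
        with self.path.open(encoding="utf-8") as fh:
            return sum(1 for line in fh if line.strip())

    def clear(self) -> None:
        if self.path.exists():
            self.path.write_text("", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    store = JsonlStore(Path(tmp) / ".licit" / "provenance.jsonl")
    store.append([{"file_path": "src/app.py", "source": "ai", "confidence": 0.85}])
    store.append([{"file_path": "tests/test_app.py", "source": "human", "confidence": 0.0}])
    loaded = store.load()
    count_before_clear = store.count()
    store.clear()
    count_after_clear = store.count()
```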

Characteristics


Attestation (Cryptographic signing)

ProvenanceAttestor provides individual HMAC-SHA256 signing and batch verification with Merkle tree.

Individual signing

from licit.provenance.attestation import ProvenanceAttestor

attestor = ProvenanceAttestor()  # Auto-generates key if it doesn't exist

# Sign a record
data = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = attestor.sign_record(data)

# Verify
assert attestor.verify_record(data, signature)
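For readers unfamiliar with HMAC signing, the following sketch shows one way such a signature could be computed with Python's standard library. It is not the ProvenanceAttestor implementation: the canonical JSON serialization (sorted keys, compact separators) and the explicit key parameter are assumptions made for illustration.

```python
import hashlib
import hmac
import json
import os


def sign_record(data: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the record."""
    payload = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(data: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_record(data, key), signature)


key = os.urandom(32)
data = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = sign_record(data, key)

tampered = dict(data, source="human")
ok = verify_record(data, signature, key)
tampered_ok = verify_record(tampered, signature, key)
```

Sorting the keys matters: without a canonical encoding, two dicts with the same content but different key order would produce different signatures.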

Merkle tree (batch)

To verify the integrity of a set of records:

records = [record1, record2, record3, record4]
root_hash = attestor.sign_batch(records)

         root_hash
        /         \
    hash_01      hash_23
    /    \       /    \
 hash_0 hash_1 hash_2 hash_3
   |      |      |      |
 rec_0  rec_1  rec_2  rec_3
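The tree above can be computed bottom-up. This sketch uses plain SHA-256 for leaves and internal nodes and duplicates the last node on odd-sized levels; both choices are assumptions, and sign_batch may hash or sign differently.

```python
import hashlib
import json


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(records: list[dict]) -> str:
    """Hash each record, then hash pairs upward until one root remains."""
    if not records:
        return _h(b"").hex()
    level = [_h(json.dumps(r, sort_keys=True).encode("utf-8")) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # odd count: pair the last node with itself
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


records = [{"file": f"f{i}.py", "source": "ai"} for i in range(4)]
root = merkle_root(records)

# Changing any single record changes the root.
modified = [dict(r) for r in records]
modified[2]["source"] = "human"
modified_root = merkle_root(modified)
```

A single root hash is enough to detect modification of any record in the batch, which is why it is cheaper to store and compare than one signature per record.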

Key management

The signing key is resolved in this order:

  1. Explicit path in config: provenance.sign_key_path
  2. Local fallback: .licit/.signing-key in the project
  3. Auto-generation: 32 random bytes with os.urandom(32)

# Example with explicit key
provenance:
  sign: true
  sign_key_path: ~/.licit/signing-key

Tracker (Orchestrator)

ProvenanceTracker orchestrates the full pipeline:

from licit.provenance.tracker import ProvenanceTracker

tracker = ProvenanceTracker(config=config, project_root="/path/to/project")
stats = tracker.run(since="2026-01-01")

Pipeline

  1. Git analysis: Runs GitAnalyzer to analyze commits
  2. Session reading: Reads session logs if session-log is in methods
  3. Merge: Combines results from git and sessions (sessions take priority on conflict)
  4. Signing: Signs each record if sign: true
  5. Storage: Stores in JSONL via ProvenanceStore
  6. Stats: Returns aggregated statistics
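Step 3 of the pipeline can be sketched as a dictionary merge. This assumes records are keyed by file_path and represented as plain dicts; the actual merge logic in tracker.py may be more involved.

```python
def merge_records(git_records: list[dict], session_records: list[dict]) -> list[dict]:
    """Combine both sources; a session record replaces the git record for the same file."""
    merged = {r["file_path"]: r for r in git_records}
    for record in session_records:
        merged[record["file_path"]] = record  # sessions take priority on conflict
    return list(merged.values())


git_records = [
    {"file_path": "src/app.py", "source": "mixed", "confidence": 0.55, "method": "git-infer"},
    {"file_path": "README.md", "source": "human", "confidence": 0.0, "method": "git-infer"},
]
session_records = [
    {"file_path": "src/app.py", "source": "ai", "confidence": 0.95, "method": "session-log"},
]
merged = merge_records(git_records, session_records)
```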

Returned statistics

{
    "total_files": 45,
    "ai_count": 18,
    "human_count": 22,
    "mixed_count": 5,
    "ai_percentage": 40.0,
    "human_percentage": 48.9,
    "mixed_percentage": 11.1,
    "tools_detected": {"claude-code": 15, "cursor": 3},
    "models_detected": {"claude-sonnet-4": 12, "claude-opus-4": 3, "gpt-4o": 3},
}
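A sketch of how a dictionary of this shape could be derived from stored records, assuming each record is a dict with a source key and optional agent_tool and model keys (as in the JSONL format). The compute_stats function is illustrative; the sample data is built to reproduce the numbers shown above.

```python
from collections import Counter


def compute_stats(records: list[dict]) -> dict:
    total = len(records)
    sources = Counter(r["source"] for r in records)

    def pct(count: int) -> float:
        return round(100.0 * count / total, 1) if total else 0.0

    return {
        "total_files": total,
        "ai_count": sources["ai"],
        "human_count": sources["human"],
        "mixed_count": sources["mixed"],
        "ai_percentage": pct(sources["ai"]),
        "human_percentage": pct(sources["human"]),
        "mixed_percentage": pct(sources["mixed"]),
        # Only records that actually carry the field are counted.
        "tools_detected": dict(Counter(r["agent_tool"] for r in records if r.get("agent_tool"))),
        "models_detected": dict(Counter(r["model"] for r in records if r.get("model"))),
    }


records = (
    [{"source": "ai", "agent_tool": "claude-code", "model": "claude-sonnet-4"}] * 12
    + [{"source": "ai", "agent_tool": "claude-code", "model": "claude-opus-4"}] * 3
    + [{"source": "ai", "agent_tool": "cursor", "model": "gpt-4o"}] * 3
    + [{"source": "human"}] * 22
    + [{"source": "mixed"}] * 5
)
stats = compute_stats(records)
```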

Report

ProvenanceReportGenerator generates Markdown reports from stored records.

Report contents

  1. Summary: Totals and percentages by classification
  2. Detailed table: File, source, confidence, method, model, tool
  3. Detected tools: Frequency of each AI agent
  4. Detected models: Frequency of each model

Generation

From the command line:

licit trace --report
# Generates .licit/reports/provenance.md

From Python:

from licit.provenance.report import ProvenanceReportGenerator

generator = ProvenanceReportGenerator()
markdown = generator.generate(records, project_name="my-project")

Full configuration

provenance:
  enabled: true
  methods:
    - git-infer              # Git history heuristics
    - session-log            # Agent session logs
  session_dirs:
    - ~/.claude/projects/    # Directory with Claude Code logs
  sign: true                 # Sign records with HMAC-SHA256
  sign_key_path: ~/.licit/signing-key
  confidence_threshold: 0.6  # Confidence threshold
  store_path: .licit/provenance.jsonl

Integration with compliance

Provenance evidence directly feeds the EvidenceBundle:

| Bundle field     | What provenance provides |
|------------------|--------------------------|
| has_provenance   | True if a store with records exists |
| provenance_stats | Aggregated statistics (totals, percentages, tools, models) |

These fields are evaluated by the compliance frameworks:


Testing

167 tests cover the provenance system:

| Module         | Tests | File |
|----------------|-------|------|
| Heuristics     | 23    | tests/test_provenance/test_heuristics.py |
| Git Analyzer   | 15    | tests/test_provenance/test_git_analyzer.py |
| Store          | 15    | tests/test_provenance/test_store.py |
| Attestation    | 13    | tests/test_provenance/test_attestation.py |
| Tracker        | 7     | tests/test_provenance/test_tracker.py |
| Session Reader | 13    | tests/test_provenance/test_session_reader.py |
| QA Edge Cases  | 81    | tests/test_provenance/test_qa_edge_cases.py |
| Total          | 167   |      |

The tests include: