Provenance system
Code traceability system for identifying and documenting which code was written by AI and which by humans.
Status: Functional since v0.2.0 (Phase 2 completed).
Overview
The provenance system answers the question: who wrote this code? It analyzes git history and AI agent session logs to classify each file as ai, human, or mixed, with a confidence score between 0.0 and 1.0.
licit trace --stats
Analyzing git history...
Records: 45 files analyzed
AI-generated: 18 (40.0%)
Human-written: 22 (48.9%)
Mixed: 5 (11.1%)
AI tools detected: claude-code (15), cursor (3)
Models detected: claude-sonnet-4 (12), claude-opus-4 (3), gpt-4o (3)
Stored in .licit/provenance.jsonl
Architecture
The system consists of 8 modules in src/licit/provenance/:
provenance/
├── heuristics.py # Engine with 6 AI detection heuristics
├── git_analyzer.py # Git log parser with heuristic analysis
├── store.py # Append-only JSONL store
├── attestation.py # HMAC-SHA256 signing + Merkle tree
├── tracker.py # Full pipeline orchestrator
├── report.py # Markdown report generator
└── session_readers/
    ├── base.py # Protocol SessionReader
    └── claude_code.py # Reader for Claude Code
Data flow
Git Log ──→ GitAnalyzer ──→ ProvenanceRecord[] ─┐
                                                ├──→ ProvenanceTracker ──→ Store
Session Logs ──→ SessionReader ──→ Records[] ───┘          │
                                                           ├──→ Attestation (HMAC + Merkle)
                                                           └──→ Report (Markdown)
Heuristics engine
AICommitHeuristics applies 6 independent heuristics to each git commit. Each heuristic produces a score (0.0-1.0) and a relative weight.
Heuristics
| # | Name | Weight | What it detects |
|---|---|---|---|
| H1 | Author pattern | 3.0 | AI author names: claude, copilot, cursor, bot, devin, aider, [bot] |
| H2 | Message pattern | 1.5 | Commit patterns: conventional commits, [ai], implement, generate, Co-authored-by in subject |
| H3 | Bulk changes | 2.0 | Bulk changes: >20 files and >500 lines in a single commit |
| H4 | Co-author | 3.0 | Co-authored-by: trailer with AI keywords in the commit body |
| H5 | File patterns | 1.0 | All modified files are test files (test_, _test., .spec.) |
| H6 | Time pattern | 0.5 | Commits between 1:00 AM and 5:00 AM |
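As an illustration of how two of the simpler rows in the table could be expressed, here is a minimal sketch of H5 and H6. The function names, the all-or-nothing score of 1.0, and the exact boundary handling (1:00 inclusive, 5:00 exclusive) are assumptions for this example; the actual `AICommitHeuristics` internals are not shown in this document.

```python
from datetime import datetime

# Markers taken from the H5 row of the table above.
TEST_MARKERS = ("test_", "_test.", ".spec.")


def file_pattern_score(files: list[str]) -> float:
    """H5 sketch: signal only when every modified file looks like a test file."""
    if not files:
        return 0.0
    all_tests = all(any(marker in path for marker in TEST_MARKERS) for path in files)
    return 1.0 if all_tests else 0.0  # score value is an assumption


def time_pattern_score(commit_date: str) -> float:
    """H6 sketch: signal for commits whose ISO 8601 hour falls in [1, 5)."""
    hour = datetime.fromisoformat(commit_date).hour
    return 1.0 if 1 <= hour < 5 else 0.0  # score value is an assumption


only_tests = file_pattern_score(["tests/test_app.py", "web/app.spec.ts"])
night_commit = time_pattern_score("2026-03-10T03:15:00+01:00")
```

A heuristic that returns 0.0 is treated as "no signal", which matters for the score calculation described next.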
Score calculation
Only heuristics that produce a signal (score > 0) participate in the weighted average:
signaling = [h for h in results if h.score > 0]
total_weight = sum(h.weight for h in signaling)
final_score = sum(h.score * h.weight for h in signaling) / total_weight
If no heuristic signals, the final score is 0.0 (human).
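The calculation above can be sketched as a small function, including the zero-signal case. `HeuristicResult` and `combine_scores` are illustrative names, not necessarily those used in `heuristics.py`.

```python
from dataclasses import dataclass


@dataclass
class HeuristicResult:
    name: str
    score: float   # 0.0-1.0
    weight: float  # relative weight from the heuristics table


def combine_scores(results: list[HeuristicResult]) -> float:
    """Weighted average over the heuristics that produced a signal."""
    signaling = [h for h in results if h.score > 0]
    total_weight = sum(h.weight for h in signaling)
    if total_weight == 0:
        return 0.0  # no signal at all: treated as human
    return sum(h.score * h.weight for h in signaling) / total_weight


results = [
    HeuristicResult("author_pattern", 0.0, 3.0),   # no signal, excluded
    HeuristicResult("message_pattern", 0.6, 1.5),
    HeuristicResult("co_author", 1.0, 3.0),
]
final_score = combine_scores(results)  # (0.6 * 1.5 + 1.0 * 3.0) / 4.5
```

Because non-signaling heuristics are excluded from the denominator, a single strong signal such as a co-author trailer is not diluted by the heuristics that stayed silent.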
Classification
| Score | Classification |
|---|---|
| >= 0.7 | ai: Code likely generated by AI |
| >= 0.5 and < 0.7 | mixed: Code with mixed contribution |
| < 0.5 | human: Code likely written by a human |
The configurable confidence_threshold (default: 0.6) affects filtering in reports, not the base classification.
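The thresholds in the classification table map directly onto a small function. This is a sketch of the mapping only; the function name `classify` is an assumption.

```python
def classify(score: float) -> str:
    """Map a final score to a classification using the documented cutoffs."""
    if score >= 0.7:
        return "ai"
    if score >= 0.5:
        return "mixed"
    return "human"


labels = [classify(s) for s in (0.85, 0.7, 0.6, 0.5, 0.49, 0.0)]
```

Note that a score of 0.6 is classified as mixed regardless of `confidence_threshold`, since that setting only affects report filtering.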
Git Analyzer
GitAnalyzer parses the git history and applies heuristics to each commit.
Git log parsing
Executes git log with a custom format using hexadecimal separators (%x00, %x01) to robustly parse fields:
git log --format="%x00%H%x01%an%x01%ae%x01%aI%x01%s%x01%b" --numstat
Fields extracted in CommitInfo:
- sha: Commit hash
- author/author_email: Author
- date: ISO 8601 date
- message: Commit subject
- files_changed: List of modified files
- insertions/deletions: Lines added/deleted
- co_authors: Co-authors extracted from body (Co-authored-by:)
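To make the separator-based parsing concrete, here is a minimal sketch that turns raw `git log` output into `CommitInfo` objects. It assumes each commit record starts with a NUL byte and that fields are separated by `\x01` in the order sha, author, email, date, subject, body, followed by the `--numstat` lines. The dataclass layout and the `parse_git_log` helper are assumptions for illustration, not the actual `GitAnalyzer` code.

```python
import re
from dataclasses import dataclass, field

# A --numstat line: "<added>\t<deleted>\t<path>"; binary files report "-".
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")
CO_AUTHOR = re.compile(r"^Co-authored-by:\s*(.+)$", re.IGNORECASE)


@dataclass
class CommitInfo:
    sha: str
    author: str
    author_email: str
    date: str
    message: str
    files_changed: list[str] = field(default_factory=list)
    insertions: int = 0
    deletions: int = 0
    co_authors: list[str] = field(default_factory=list)


def parse_git_log(raw: str) -> list[CommitInfo]:
    commits = []
    for chunk in raw.split("\x00"):
        parts = chunk.split("\x01", 5)
        if len(parts) != 6:
            continue  # empty leading chunk or malformed record
        sha, author, email, date, subject, rest = parts
        commit = CommitInfo(sha.strip(), author, email, date, subject)
        for line in rest.splitlines():
            stat = NUMSTAT.match(line)
            trailer = CO_AUTHOR.match(line.strip())
            if stat:
                added, removed, path = stat.groups()
                commit.files_changed.append(path)
                commit.insertions += 0 if added == "-" else int(added)
                commit.deletions += 0 if removed == "-" else int(removed)
            elif trailer:
                commit.co_authors.append(trailer.group(1).strip())
        commits.append(commit)
    return commits


sample = (
    "\x00abc123\x01Ada\x01ada@example.com\x012026-03-10T14:30:00+00:00"
    "\x01feat: add parser\x01Co-authored-by: Claude <noreply@anthropic.com>\n"
    "\n12\t3\tsrc/app.py\n-\t-\tassets/logo.png\n"
)
commits = parse_git_log(sample)
```

Using control bytes as separators avoids ambiguity with characters that can legitimately appear in commit subjects and bodies, such as pipes or commas.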
Options
| Option | CLI flag | Description |
|---|---|---|
| since | --since | Analyze commits from date (YYYY-MM-DD) or tag |
| Timeout | — | 30 seconds for git log (prevents blocking on massive repos) |
Per-file result
For each file, the maximum score across all commits that modified it is taken. The detection method is always ProvenanceSource.GIT_INFER.
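The per-file rule amounts to a running maximum over commits. A minimal sketch, assuming each commit has already been reduced to a pair of (modified files, final score); the helper name is hypothetical.

```python
def max_score_per_file(
    commit_scores: list[tuple[list[str], float]],
) -> dict[str, float]:
    """Keep, for each file, the highest score of any commit that touched it."""
    best: dict[str, float] = {}
    for files, score in commit_scores:
        for path in files:
            best[path] = max(best.get(path, 0.0), score)
    return best


per_file = max_score_per_file([
    (["src/app.py", "src/util.py"], 0.2),
    (["src/app.py"], 0.85),
    (["src/util.py"], 0.0),
])
```

One consequence of taking the maximum is that a single high-scoring commit marks a file as AI-touched even if later commits to it look human.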
Session Readers
Session readers extract provenance information directly from AI agent session logs.
Protocol
from pathlib import Path
from typing import Protocol

class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...
Claude Code Reader
Reads Claude Code JSONL session files (typically in ~/.claude/projects/).
Extracted fields:
- Modified files (from Write, Edit tool calls)
- Model used (claude-sonnet-4, claude-opus-4, etc.)
- Tool: claude-code
- Session ID
- Estimated cost (if available)
Configuration:
provenance:
  methods:
    - git-infer
    - session-log
  session_dirs:
    - ~/.claude/projects/
Extensibility
To add support for another agent (e.g., Cursor), implement the SessionReader Protocol and register it in ProvenanceTracker.
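As a rough sketch of what a new reader could look like, the example below implements the Protocol for a purely hypothetical agent that writes one JSON object per line. The log layout (`type` and `path` keys), the `ExampleAgentReader` class, the stand-in `ProvenanceRecord` dataclass, and the fixed confidence of 1.0 are all assumptions; a real reader for Cursor or any other agent would have to follow that tool's actual log format, and the registration step in `ProvenanceTracker` is not shown here.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ProvenanceRecord:  # simplified stand-in for the real record class
    file_path: str
    source: str
    confidence: float
    method: str
    agent_tool: str


@runtime_checkable
class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...


class ExampleAgentReader:
    """Reader for a hypothetical agent with JSONL session logs."""

    def can_read(self, path: Path) -> bool:
        return path.is_dir() and any(path.glob("*.jsonl"))

    def read_sessions(self, path: Path) -> list[ProvenanceRecord]:
        records = []
        for log in sorted(path.glob("*.jsonl")):
            for line in log.read_text(encoding="utf-8").splitlines():
                if not line.strip():
                    continue
                event = json.loads(line)
                if event.get("type") == "file_edit":  # hypothetical event shape
                    records.append(ProvenanceRecord(
                        file_path=event["path"],
                        source="ai",
                        confidence=1.0,  # assumption: session logs are direct evidence
                        method="session-log",
                        agent_tool="example-agent",
                    ))
        return records
```

Because `SessionReader` is a structural Protocol, the new class does not need to inherit from it; it only has to provide the two methods.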
JSONL Store
ProvenanceStore stores provenance records in append-only JSONL (JSON Lines) format.
Format
Each line is an independent JSON object:
{"file_path": "src/app.py", "source": "ai", "confidence": 0.85, "method": "git-infer", "timestamp": "2026-03-10T14:30:00", "model": "claude-sonnet-4", "agent_tool": "claude-code"}
{"file_path": "tests/test_app.py", "source": "human", "confidence": 0.0, "method": "git-infer", "timestamp": "2026-03-10T14:30:00"}
Operations
| Operation | Method | Description |
|---|---|---|
| Append | append(records) | Adds records to the end of the file |
| Load | load() | Reads all records from the store |
| Count | count() | Counts records without loading everything into memory |
| Clear | clear() | Empties the store (for re-analysis) |
Features
- Append-only: Records are never modified or deleted in normal operation
- Immutable per record: Each record has a timestamp and signature (if enabled)
- Safe serialization: Uses default=str for datetime and other types
- Configurable path: provenance.store_path in .licit.yaml
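The four operations and the `default=str` serialization can be sketched in a few lines. This is a simplified stand-in that works on plain dicts; the class name `JsonlStore` and its constructor are assumptions, and the real `ProvenanceStore` may differ.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


class JsonlStore:
    """Minimal append-only JSONL store working on plain dicts."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, records: list[dict]) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            for record in records:
                # default=str turns datetime and other non-JSON types into strings
                fh.write(json.dumps(record, default=str) + "\n")

    def load(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def count(self) -> int:
        if not self.path.exists():
            return 0
        with self.path.open(encoding="utf-8") as fh:
            return sum(1 for line in fh if line.strip())  # streams line by line

    def clear(self) -> None:
        if self.path.exists():
            self.path.write_text("", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    store = JsonlStore(Path(tmp) / ".licit" / "provenance.jsonl")
    store.append([{"file_path": "src/app.py", "timestamp": datetime(2026, 3, 10, 14, 30)}])
    stored_count = store.count()
    stored_timestamp = store.load()[0]["timestamp"]
```

Counting by iterating over lines keeps memory use flat, which is the point of offering `count()` separately from `load()`.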
Attestation (Cryptographic signing)
ProvenanceAttestor provides individual HMAC-SHA256 signing and batch verification with Merkle tree.
Individual signing
from licit.provenance.attestation import ProvenanceAttestor
attestor = ProvenanceAttestor() # Auto-generates key if it doesn't exist
# Sign a record
data = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = attestor.sign_record(data)
# Verify
assert attestor.verify_record(data, signature)
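For readers who want to see what HMAC-SHA256 signing of a record involves, here is a standalone sketch using only the standard library. It assumes records are serialized as canonical JSON before signing and that signatures are hex strings; the free functions and the explicit `key` parameter are illustrative and do not mirror the `ProvenanceAttestor` method signatures.

```python
import hashlib
import hmac
import json


def sign_record(data: dict, key: bytes) -> str:
    """Return the hex HMAC-SHA256 of the record's canonical JSON form."""
    payload = json.dumps(data, sort_keys=True, default=str).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(data: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_record(data, key), signature)


key = b"k" * 32  # demo key only; real keys should come from os.urandom(32)
record = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = sign_record(record, key)
is_valid = verify_record(record, signature, key)
is_tampered_valid = verify_record({**record, "source": "human"}, signature, key)
```

Sorting keys before hashing means two dicts with the same content but different insertion order produce the same signature.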
Merkle tree (batch)
To verify the integrity of a set of records:
records = [record1, record2, record3, record4]
root_hash = attestor.sign_batch(records)
          root_hash
          /       \
     hash_01     hash_23
     /    \       /    \
hash_0  hash_1  hash_2  hash_3
   |       |       |       |
rec_0   rec_1   rec_2   rec_3
- Each record is serialized as canonical JSON (sort_keys=True, default=str)
- SHA256 of each record -> tree leaves
- Pairs are concatenated and re-hashed up to the root
- Odd number of records: the last one is duplicated
- Verification is timing-safe with hmac.compare_digest
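The tree construction described in the list above can be sketched as follows. Two details are assumptions made for this example: child hashes are concatenated as hex strings before re-hashing, and the last node is duplicated at every level that has an odd count. The behavior for an empty batch is also a guess.

```python
import hashlib
import json


def leaf_hash(record: dict) -> str:
    """SHA256 of the record's canonical JSON form."""
    payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merkle_root(records: list[dict]) -> str:
    if not records:
        return hashlib.sha256(b"").hexdigest()  # assumption for the empty case
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # odd count: duplicate the last node
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode("utf-8")).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


a, b, c = {"file": "a.py"}, {"file": "b.py"}, {"file": "c.py"}
root_of_three = merkle_root([a, b, c])
root_with_duplicate = merkle_root([a, b, c, c])
```

Any change to a single record changes its leaf hash and therefore the root, so comparing one root hash is enough to detect tampering anywhere in the batch.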
Key management
The signing key is resolved in this order:
- Explicit path in config: provenance.sign_key_path
- Local fallback: .licit/.signing-key in the project
- Auto-generation: 32 random bytes with os.urandom(32)
# Example with explicit key
provenance:
  sign: true
  sign_key_path: ~/.licit/signing-key
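The resolution order can be sketched as a single function. The name `resolve_signing_key` is hypothetical, and two behaviors are assumptions: the auto-generated key is persisted at the local fallback path so later runs reuse it, and a configured explicit path that does not exist raises an error rather than falling through.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_signing_key(project_root: Path, sign_key_path: Optional[str] = None) -> bytes:
    # 1. Explicit path from provenance.sign_key_path
    if sign_key_path:
        return Path(sign_key_path).expanduser().read_bytes()

    # 2. Local fallback inside the project
    local = project_root / ".licit" / ".signing-key"
    if local.exists():
        return local.read_bytes()

    # 3. Auto-generate 32 random bytes and persist them (assumed location)
    key = os.urandom(32)
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(key)
    return key
```

Persisting the generated key matters because signatures created in one run can only be verified later with the same key.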
Tracker (Orchestrator)
ProvenanceTracker orchestrates the full pipeline:
from licit.provenance.tracker import ProvenanceTracker
tracker = ProvenanceTracker(config=config, project_root="/path/to/project")
stats = tracker.run(since="2026-01-01")
Pipeline
- Git analysis: Runs GitAnalyzer to analyze commits
- Session reading: Reads session logs if session-log is in methods
- Merge: Combines results from git and sessions (sessions take priority on conflict)
- Signing: Signs each record if sign: true
- Storage: Stores in JSONL via ProvenanceStore
- Stats: Returns aggregated statistics
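The merge step is the least obvious part of the pipeline, so here is a minimal sketch of "sessions take priority on conflict". It assumes records are keyed by `file_path` and represented as plain dicts; the helper name is hypothetical.

```python
def merge_records(git_records: list[dict], session_records: list[dict]) -> list[dict]:
    """Combine both sources; a session record replaces a git record for the same file."""
    merged = {r["file_path"]: r for r in git_records}
    merged.update({r["file_path"]: r for r in session_records})
    return list(merged.values())


git_records = [
    {"file_path": "src/app.py", "source": "mixed", "method": "git-infer"},
    {"file_path": "README.md", "source": "human", "method": "git-infer"},
]
session_records = [
    {"file_path": "src/app.py", "source": "ai", "method": "session-log"},
]
merged = merge_records(git_records, session_records)
```

Preferring session records is reasonable because a session log is direct evidence of which files an agent wrote, while git heuristics are only an inference.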
Returned statistics
{
    "total_files": 45,
    "ai_count": 18,
    "human_count": 22,
    "mixed_count": 5,
    "ai_percentage": 40.0,
    "human_percentage": 48.9,
    "mixed_percentage": 11.1,
    "tools_detected": {"claude-code": 15, "cursor": 3},
    "models_detected": {"claude-sonnet-4": 12, "claude-opus-4": 3, "gpt-4o": 3},
}
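A dictionary of this shape can be derived from the stored records with a few counters. The sketch below assumes records are plain dicts with the `source`, `agent_tool`, and `model` keys used in the JSONL examples, and that percentages are rounded to one decimal place; `compute_stats` is a hypothetical name.

```python
from collections import Counter


def compute_stats(records: list[dict]) -> dict:
    total = len(records)
    sources = Counter(r["source"] for r in records)
    tools = Counter(r["agent_tool"] for r in records if r.get("agent_tool"))
    models = Counter(r["model"] for r in records if r.get("model"))

    def pct(n: int) -> float:
        return round(100.0 * n / total, 1) if total else 0.0

    return {
        "total_files": total,
        "ai_count": sources["ai"],
        "human_count": sources["human"],
        "mixed_count": sources["mixed"],
        "ai_percentage": pct(sources["ai"]),
        "human_percentage": pct(sources["human"]),
        "mixed_percentage": pct(sources["mixed"]),
        "tools_detected": dict(tools),
        "models_detected": dict(models),
    }


records = (
    [{"source": "ai", "agent_tool": "claude-code", "model": "claude-sonnet-4"}] * 12
    + [{"source": "ai", "agent_tool": "claude-code", "model": "claude-opus-4"}] * 3
    + [{"source": "ai", "agent_tool": "cursor", "model": "gpt-4o"}] * 3
    + [{"source": "human"}] * 22
    + [{"source": "mixed"}] * 5
)
stats = compute_stats(records)
```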
Report
ProvenanceReportGenerator generates Markdown reports from the stored records.
Report contents
- Summary: Totals and percentages by classification
- Detailed table: File, source, confidence, method, model, tool
- Detected tools: Frequency of each AI agent
- Detected models: Frequency of each model
Generation
licit trace --report
# Generates .licit/reports/provenance.md
from licit.provenance.report import ProvenanceReportGenerator
generator = ProvenanceReportGenerator()
markdown = generator.generate(records, project_name="my-project")
Full configuration
provenance:
  enabled: true
  methods:
    - git-infer # Git history heuristics
    - session-log # Agent session logs
  session_dirs:
    - ~/.claude/projects/ # Directory with Claude Code logs
  sign: true # Sign records with HMAC-SHA256
  sign_key_path: ~/.licit/signing-key
  confidence_threshold: 0.6 # Confidence threshold
  store_path: .licit/provenance.jsonl
Integration with compliance
Provenance evidence directly feeds the EvidenceBundle:
| Bundle field | What provenance provides |
|---|---|
| has_provenance | True if a store with records exists |
provenance_stats | Aggregated statistics (totals, percentages, tools, models) |
These fields are evaluated by the compliance frameworks:
- EU AI Act Art. 10 (Data and data governance): Code origin traceability
- EU AI Act Art. 13 (Transparency): Disclosure of AI usage in development
- OWASP ASI-06 (Insufficient Monitoring): Provenance trail as monitoring evidence
- OWASP ASI-10 (Insufficient Logging): Structured records of agent activity
Testing
167 tests cover the provenance system:
| Module | Tests | File |
|---|---|---|
| Heuristics | 23 | tests/test_provenance/test_heuristics.py |
| Git Analyzer | 15 | tests/test_provenance/test_git_analyzer.py |
| Store | 15 | tests/test_provenance/test_store.py |
| Attestation | 13 | tests/test_provenance/test_attestation.py |
| Tracker | 7 | tests/test_provenance/test_tracker.py |
| Session Reader | 13 | tests/test_provenance/test_session_reader.py |
| QA Edge Cases | 81 | tests/test_provenance/test_qa_edge_cases.py |
| Total | 167 | |
Tests include:
- Unit tests per module
- Edge cases (Unicode, empty files, invalid keys, repos without commits, etc.)
- Regression tests for 9 bugs found during QA
- Cross-module integration tests