Provenance System
Code traceability system to identify and document which code was written by AI and which by humans.
Status: Functional since v0.2.0 (Phase 2 completed).
Overview
The provenance system answers the question: who wrote this code? It analyzes the git history and AI agent session logs to classify each file as ai, human, or mixed, with a confidence score between 0.0 and 1.0.
licit trace --stats
Analyzing git history...
Records: 45 files analyzed
AI-generated: 18 (40.0%)
Human-written: 22 (48.9%)
Mixed: 5 (11.1%)
AI tools detected: claude-code (15), cursor (3)
Models detected: claude-sonnet-4 (12), claude-opus-4 (3), gpt-4o (3)
Stored in .licit/provenance.jsonl
Architecture
The system consists of 8 modules in src/licit/provenance/:
provenance/
├── heuristics.py # Engine with 6 AI detection heuristics
├── git_analyzer.py # Git log parser with heuristic analysis
├── store.py # Deduplicated JSONL store
├── attestation.py # HMAC-SHA256 signing + Merkle tree
├── tracker.py # Full pipeline orchestrator
├── report.py # Markdown report generator
└── session_readers/
    ├── base.py # SessionReader Protocol
    └── claude_code.py # Reader for Claude Code
Data flow
Git Log ──→ GitAnalyzer ──→ ProvenanceRecord[] ─┐
                                                ├──→ ProvenanceTracker ──→ Store
Session Logs ──→ SessionReader ──→ Records[] ───┘            │
                                                             ├──→ Attestation (HMAC + Merkle)
                                                             └──→ Report (Markdown)
Heuristics Engine
AICommitHeuristics applies 6 independent heuristics to each git commit. Each heuristic produces a score (0.0-1.0) and a relative weight.
Heuristics
| # | Name | Weight | What it detects |
|---|---|---|---|
| H1 | Author pattern | 3.0 | AI author names: claude, copilot, cursor, bot, devin, aider, [bot] |
| H2 | Message pattern | 1.5 | Commit patterns: conventional commits, [ai], implement, generate, Co-authored-by in subject |
| H3 | Bulk changes | 2.0 | Massive changes: >20 files and >500 lines in a single commit |
| H4 | Co-author | 3.0 | Co-authored-by: trailer with AI keywords in the commit body |
| H5 | File patterns | 1.0 | All modified files are test files (test_, _test., .spec.) |
| H6 | Time pattern | 0.5 | Commits between 1:00 AM and 5:00 AM |
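As an illustration of how two of these rows could translate into code, here is a minimal sketch of H1 and H3. The HeuristicResult type, the function names, and the binary 1.0/0.0 scores are assumptions made for this example; only the keywords, thresholds, and weights come from the table.

```python
import re
from dataclasses import dataclass

# Keywords from the H1 row; "bot" also covers the "[bot]" suffix.
AI_AUTHOR_RE = re.compile(r"claude|copilot|cursor|bot|devin|aider", re.IGNORECASE)


@dataclass
class HeuristicResult:
    """Hypothetical result type: a score of 0.0 means "no signal"."""
    name: str
    score: float
    weight: float


def author_pattern(author: str) -> HeuristicResult:
    """H1 sketch: flag commits whose author name looks like an AI agent."""
    score = 1.0 if AI_AUTHOR_RE.search(author) else 0.0  # assumed binary score
    return HeuristicResult("author_pattern", score, weight=3.0)


def bulk_changes(files_changed: int, lines_changed: int) -> HeuristicResult:
    """H3 sketch: flag commits touching >20 files and >500 lines."""
    hit = files_changed > 20 and lines_changed > 500
    return HeuristicResult("bulk_changes", 1.0 if hit else 0.0, weight=2.0)
```

A plain substring match like this also fires on human names that happen to contain "bot", which is one reason the result is combined with other heuristics rather than used alone.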
Score calculation
Only heuristics that produce a signal (score > 0) participate in the weighted average:
signaling = [h for h in results if h.score > 0]
total_weight = sum(h.weight for h in signaling)
final_score = sum(h.score * h.weight for h in signaling) / total_weight
If no heuristic signals, the final score is 0.0 (human).
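A small worked example makes the weighting concrete. The sketch below wraps the formula above in a function with the zero-signal guard; HeuristicResult is a hypothetical stand-in type, and the input scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeuristicResult:
    """Hypothetical stand-in for one heuristic's output."""
    name: str
    score: float
    weight: float


def final_score(results: list[HeuristicResult]) -> float:
    """Weighted average over heuristics that signaled (score > 0)."""
    signaling = [h for h in results if h.score > 0]
    if not signaling:
        return 0.0  # no signal at all: treated as human
    total_weight = sum(h.weight for h in signaling)
    return sum(h.score * h.weight for h in signaling) / total_weight


results = [
    HeuristicResult("author_pattern", 1.0, 3.0),   # signals
    HeuristicResult("message_pattern", 0.0, 1.5),  # silent, excluded
    HeuristicResult("time_pattern", 0.5, 0.5),     # signals weakly
]
# (1.0 * 3.0 + 0.5 * 0.5) / (3.0 + 0.5) = 3.25 / 3.5
score = final_score(results)
```

Because silent heuristics are excluded from the denominator, a single strong signal is not diluted by the heuristics that found nothing.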
Classification
| Score | Classification |
|---|---|
| >= 0.7 | ai: code probably generated by AI |
| >= 0.5 and < 0.7 | mixed: code with mixed contribution |
| < 0.5 | human: code probably written by a human |
The configurable threshold confidence_threshold (default: 0.6) affects filtering in reports, not the base classification.
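The difference between classification and report filtering can be sketched as follows. The classify function mirrors the table above; filter_for_report is a hypothetical helper, and the rule it applies (keep records whose confidence is at least the threshold) is an assumption about how the filter behaves.

```python
def classify(score: float) -> str:
    """Map a final score to a classification using the fixed cut-offs."""
    if score >= 0.7:
        return "ai"
    if score >= 0.5:
        return "mixed"
    return "human"


def filter_for_report(records: list[dict], confidence_threshold: float = 0.6) -> list[dict]:
    """Assumed report filter: drop records below the configured threshold."""
    return [r for r in records if r["confidence"] >= confidence_threshold]


records = [
    {"file_path": "a.py", "confidence": 0.85},
    {"file_path": "b.py", "confidence": 0.55},
]
labels = [classify(r["confidence"]) for r in records]
shown = filter_for_report(records)
```

Under these assumptions b.py is still classified as mixed, but it would not appear in a report generated with the default threshold.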
Git Analyzer
GitAnalyzer parses the git history and applies heuristics to each commit.
Git log parsing
Executes git log with a custom format using hexadecimal separators (%x00, %x01) for robust field parsing:
git log --format="%x00%H%x01%an%x01%ae%x01%aI%x01%s%x01%b" --numstat
Fields extracted in CommitInfo:
- sha: Commit hash
- author / author_email: Author
- date: ISO 8601 date
- message: Commit subject
- files_changed: List of modified files
- insertions / deletions: Lines added/removed
- co_authors: Co-authors extracted from the body (Co-authored-by:)
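To show why the separators make parsing robust, here is a minimal sketch of a parser for output in this shape: a NUL byte starts each commit, 0x01 separates the six fields, and the --numstat lines trail the body. The parse_git_log function, the simplified CommitInfo dataclass, and the sample input are assumptions for illustration, not the project's actual code.

```python
import re
from dataclasses import dataclass, field

# A numstat line is "<added>\t<deleted>\t<path>"; binary files show "-".
NUMSTAT_RE = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")
CO_AUTHOR_RE = re.compile(r"^Co-authored-by:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


@dataclass
class CommitInfo:
    sha: str
    author: str
    author_email: str
    date: str
    message: str
    files_changed: list[str] = field(default_factory=list)
    insertions: int = 0
    deletions: int = 0
    co_authors: list[str] = field(default_factory=list)


def parse_git_log(raw: str) -> list[CommitInfo]:
    commits = []
    for chunk in raw.split("\x00"):          # one chunk per commit
        parts = chunk.split("\x01")
        if len(parts) != 6:
            continue                         # empty or malformed chunk
        sha, author, email, date, subject, tail = parts
        commit = CommitInfo(sha.strip(), author, email, date, subject)
        body_lines = []
        for line in tail.splitlines():       # tail = body + numstat lines
            match = NUMSTAT_RE.match(line)
            if match:
                added, deleted, path = match.groups()
                commit.files_changed.append(path)
                commit.insertions += int(added) if added != "-" else 0
                commit.deletions += int(deleted) if deleted != "-" else 0
            else:
                body_lines.append(line)
        commit.co_authors = CO_AUTHOR_RE.findall("\n".join(body_lines))
        commits.append(commit)
    return commits


sample = (
    "\x00abc123\x01Alice\x01alice@example.com\x012026-03-10T14:30:00+00:00"
    "\x01feat: add parser\x01Body text\n\n"
    "Co-authored-by: Claude <noreply@anthropic.com>\n\n"
    "10\t2\tsrc/app.py\n5\t0\ttests/test_app.py\n"
)
commits = parse_git_log(sample)
```

Since neither separator byte can appear in ordinary commit text, subjects and bodies containing pipes, quotes, or newlines do not break the field split.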
Options
| Option | CLI flag | Description |
|---|---|---|
| since | --since | Analyze commits from date (YYYY-MM-DD) or tag |
| Timeout | — | 30 seconds for git log (prevents hangs on massive repos) |
Result per file
For each file, the maximum score across all commits that modified it is taken. The detection method is always ProvenanceSource.GIT_INFER.
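The per-file aggregation can be sketched in a few lines. The input shape (a list of (files, score) pairs, one per commit) and the function name are assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict


def max_score_per_file(commits: list[tuple[list[str], float]]) -> dict[str, float]:
    """Keep, for each file, the highest score of any commit that touched it."""
    best: dict[str, float] = defaultdict(float)
    for files, score in commits:
        for path in files:
            best[path] = max(best[path], score)
    return dict(best)


commits = [
    (["src/app.py", "README.md"], 0.2),
    (["src/app.py"], 0.9),   # a later AI-looking commit raises app.py's score
    (["README.md"], 0.0),
]
scores = max_score_per_file(commits)
```

Taking the maximum means one strongly AI-flagged commit is enough to mark a file, even if most of its history looks human.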
Session Readers
Session readers extract provenance information directly from AI agent session logs.
Protocol
class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...
Claude Code Reader
Reads Claude Code session JSONL files (typically in ~/.claude/projects/).
Extracted fields:
- Modified files (from tool calls Write, Edit)
- Model used (claude-sonnet-4, claude-opus-4, etc.)
- Tool: claude-code
- Session ID
- Estimated cost (if available)
Configuration:
provenance:
  methods:
    - git-infer
    - session-log
  session_dirs:
    - ~/.claude/projects/
Extensibility
To add support for another agent (e.g., Cursor), implement the SessionReader Protocol and register it in ProvenanceTracker.
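A new reader only has to satisfy the two Protocol methods. The sketch below is hypothetical throughout: the CursorSessionReader class, the *.cursor.jsonl file naming, the edited_file key, and the simplified ProvenanceRecord are all invented for illustration and do not describe Cursor's real log format. How a reader is registered in ProvenanceTracker is not shown because the source does not specify it.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ProvenanceRecord:
    """Simplified stand-in for the real record type."""
    file_path: str
    source: str
    confidence: float
    method: str
    agent_tool: str


class SessionReader(Protocol):
    def can_read(self, path: Path) -> bool: ...
    def read_sessions(self, path: Path) -> list[ProvenanceRecord]: ...


class CursorSessionReader:
    """Hypothetical reader: one JSON object per line with an 'edited_file' key."""

    def can_read(self, path: Path) -> bool:
        return path.is_dir() and any(path.glob("*.cursor.jsonl"))

    def read_sessions(self, path: Path) -> list[ProvenanceRecord]:
        records = []
        for log in sorted(path.glob("*.cursor.jsonl")):
            for line in log.read_text(encoding="utf-8").splitlines():
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or corrupt lines
                if isinstance(event, dict) and "edited_file" in event:
                    records.append(ProvenanceRecord(
                        event["edited_file"], "ai", 1.0, "session-log", "cursor"))
        return records


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "session1.cursor.jsonl"
    log.write_text('{"edited_file": "src/app.py"}\nnot json\n{"other": 1}\n',
                   encoding="utf-8")
    reader: SessionReader = CursorSessionReader()  # structural typing check
    readable = reader.can_read(Path(tmp))
    records = reader.read_sessions(Path(tmp))
```

Because SessionReader is a Protocol, the new class does not need to inherit from anything; matching method signatures are enough.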
JSONL Store
ProvenanceStore stores provenance records in JSONL (JSON Lines) format with automatic deduplication per file.
Format
Each line is an independent JSON object (one record per unique file):
{"file_path": "src/app.py", "source": "ai", "confidence": 0.85, "method": "git-infer", "timestamp": "2026-03-10T14:30:00", "model": "claude-sonnet-4", "agent_tool": "claude-code"}
{"file_path": "tests/test_app.py", "source": "human", "confidence": 0.0, "method": "git-infer", "timestamp": "2026-03-10T14:30:00"}
Operations
| Operation | Method | Description |
|---|---|---|
| Save | save(records) | Merges records with existing ones (dedup by file path, latest wins) and rewrites |
| Load | load_all() | Reads all records from the store |
| Stats | get_stats() | Provenance statistics (ai/human/mixed) |
| By file | get_by_file(path) | Records for a specific file |
Features
- Merge + dedup: Each save() merges with existing records; the most recent record per file wins, so the store does not grow with repeated executions.
- Immutable per record: Each record has a timestamp and signature (if enabled)
- Safe serialization: Uses default=str for datetime and other types
- Error handling: PermissionError and other I/O errors are reported with a clean message
- Configurable path: provenance.store_path in .licit.yaml
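The merge-and-rewrite behavior can be sketched as a standalone function. This is a minimal illustration, not the ProvenanceStore implementation: the free function save and the rule that "most recent" means "written by the later save() call" are assumptions.

```python
import json
import tempfile
from pathlib import Path


def save(store_path: Path, new_records: list[dict]) -> None:
    """Merge new records into the JSONL store, one record per file path."""
    merged: dict[str, dict] = {}
    if store_path.exists():
        for line in store_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                merged[record["file_path"]] = record
    for record in new_records:
        merged[record["file_path"]] = record  # assumed: later save wins
    store_path.parent.mkdir(parents=True, exist_ok=True)
    with store_path.open("w", encoding="utf-8") as handle:
        for record in merged.values():
            # default=str keeps datetime and similar values serializable
            handle.write(json.dumps(record, default=str) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / ".licit" / "provenance.jsonl"
    save(store, [{"file_path": "src/app.py", "source": "human", "confidence": 0.0}])
    save(store, [{"file_path": "src/app.py", "source": "ai", "confidence": 0.85},
                 {"file_path": "tests/test_app.py", "source": "human", "confidence": 0.0}])
    lines = store.read_text(encoding="utf-8").splitlines()
    stored = [json.loads(line) for line in lines]
```

After the second call the store holds two lines, not three, and src/app.py carries the newer classification.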
Attestation (Cryptographic Signing)
ProvenanceAttestor provides individual HMAC-SHA256 signing and batch verification with Merkle tree.
Individual signing
from licit.provenance.attestation import ProvenanceAttestor
attestor = ProvenanceAttestor() # Auto-generates key if it doesn't exist
# Sign a record
data = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = attestor.sign_record(data)
# Verify
assert attestor.verify_record(data, signature)
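For readers who want to see what such signing involves, here is a minimal sketch using only the standard library. These free functions are not the ProvenanceAttestor API; passing the key explicitly and serializing with canonical JSON (sort_keys=True, default=str) before signing are assumptions made for the example.

```python
import hashlib
import hmac
import json


def sign_record(key: bytes, data: dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the record."""
    payload = json.dumps(data, sort_keys=True, default=str).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(key: bytes, data: dict, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_record(key, data), signature)


key = b"k" * 32  # demo key only; real keys should be random
data = {"file": "app.py", "source": "ai", "confidence": 0.85}
signature = sign_record(key, data)
tampered = dict(data, source="human")
```

Sorting the keys makes the signature independent of dict insertion order, so a record that round-trips through JSON still verifies.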
Merkle tree (batch)
To verify the integrity of a set of records:
records = [record1, record2, record3, record4]
root_hash = attestor.sign_batch(records)
              root_hash
             /         \
        hash_01       hash_23
        /    \        /    \
   hash_0  hash_1  hash_2  hash_3
     |       |       |       |
   rec_0   rec_1   rec_2   rec_3
- Each record is serialized as canonical JSON (sort_keys=True, default=str)
- SHA256 of each record → tree leaves
- Pairs are concatenated and re-hashed up to the root
- Odd number of records: the last one is duplicated
- Timing-safe verification with hmac.compare_digest
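The steps above can be sketched as a short root computation. This is an illustration under stated assumptions, not the sign_batch implementation: it concatenates raw digests (rather than hex strings), duplicates the last node at every odd level, and uses the SHA256 of empty input for an empty batch.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    """SHA256 of the record's canonical JSON encoding."""
    canonical = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(canonical).digest()


def merkle_root(records: list[dict]) -> str:
    if not records:
        return hashlib.sha256(b"").hexdigest()  # assumed convention
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # odd count: duplicate the last node
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


recs = [{"file": f"f{i}.py", "source": "ai"} for i in range(4)]
root4 = merkle_root(recs)
root3 = merkle_root(recs[:3])
```

Changing any single record changes its leaf hash and therefore the root, which is what makes one root hash sufficient to check a whole batch.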
Key management
The signing key is resolved in this order:
- Explicit path in config: provenance.sign_key_path
- Local fallback: .licit/.signing-key in the project
- Auto-generation: 32 random bytes with os.urandom(32)
# Example with explicit key
provenance:
  sign: true
  sign_key_path: ~/.licit/signing-key
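The resolution order can be sketched as a single function. The resolve_signing_key name is hypothetical, and two behaviors here are assumptions the source does not confirm: a configured but missing explicit path falls through to the next step, and an auto-generated key is persisted to the local fallback path so later runs reuse it.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_signing_key(project_root: Path, sign_key_path: Optional[str] = None) -> bytes:
    # 1. Explicit path from provenance.sign_key_path
    if sign_key_path:
        explicit = Path(sign_key_path).expanduser()
        if explicit.is_file():
            return explicit.read_bytes()
    # 2. Local fallback inside the project
    local = project_root / ".licit" / ".signing-key"
    if local.is_file():
        return local.read_bytes()
    # 3. Auto-generation: 32 random bytes, saved for reuse (assumed)
    key = os.urandom(32)
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(key)
    os.chmod(local, 0o600)  # best effort: owner read/write only
    return key


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = resolve_signing_key(root)    # generated and stored
    second = resolve_signing_key(root)   # read back from the fallback path
    explicit_file = root / "explicit-key"
    explicit_file.write_bytes(b"x" * 32)
    third = resolve_signing_key(root, str(explicit_file))
```

Persisting the generated key matters because signatures made with a throwaway key could never be verified in a later run.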
Tracker (Orchestrator)
ProvenanceTracker orchestrates the full pipeline:
from licit.provenance.tracker import ProvenanceTracker
tracker = ProvenanceTracker(config=config, project_root="/path/to/project")
stats = tracker.run(since="2026-01-01")
Pipeline
- Git analysis: Runs GitAnalyzer to analyze commits
- Session reading: Reads session logs if session-log is in methods
- Merge: Combines results from git and sessions (sessions take priority on conflict)
- Signing: Signs each record if sign: true
- Storage: Stores in JSONL via ProvenanceStore
- Stats: Returns aggregated statistics
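The merge step is the one place in the pipeline where the two sources can disagree. A minimal sketch of the stated priority rule, using plain dicts and a hypothetical merge_records function:

```python
def merge_records(git_records: list[dict], session_records: list[dict]) -> list[dict]:
    """Combine both sources keyed by file path; session records win conflicts."""
    merged = {r["file_path"]: r for r in git_records}
    merged.update({r["file_path"]: r for r in session_records})
    return list(merged.values())


git_records = [
    {"file_path": "src/app.py", "source": "human", "method": "git-infer"},
    {"file_path": "README.md", "source": "human", "method": "git-infer"},
]
session_records = [
    {"file_path": "src/app.py", "source": "ai", "method": "session-log"},
]
merged = merge_records(git_records, session_records)
```

Session logs record what the agent actually wrote, while git analysis only infers it, which is a plausible reason for giving sessions priority.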
Returned statistics
{
"total_files": 45,
"ai_count": 18,
"human_count": 22,
"mixed_count": 5,
"ai_percentage": 40.0,
"human_percentage": 48.9,
"mixed_percentage": 11.1,
"tools_detected": {"claude-code": 15, "cursor": 3},
"models_detected": {"claude-sonnet-4": 12, "claude-opus-4": 3, "gpt-4o": 3},
}
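The figures in this dictionary can be reproduced from a list of records. The compute_stats function below is a hypothetical sketch, with percentages rounded to one decimal place as an assumption that matches the values shown.

```python
from collections import Counter


def compute_stats(records: list[dict]) -> dict:
    total = len(records)
    sources = Counter(r["source"] for r in records)

    def pct(count: int) -> float:
        return round(100.0 * count / total, 1) if total else 0.0

    return {
        "total_files": total,
        "ai_count": sources["ai"],
        "human_count": sources["human"],
        "mixed_count": sources["mixed"],
        "ai_percentage": pct(sources["ai"]),
        "human_percentage": pct(sources["human"]),
        "mixed_percentage": pct(sources["mixed"]),
        "tools_detected": dict(Counter(
            r["agent_tool"] for r in records if r.get("agent_tool"))),
        "models_detected": dict(Counter(
            r["model"] for r in records if r.get("model"))),
    }


records = (
    [{"source": "ai", "agent_tool": "claude-code"}] * 15
    + [{"source": "ai", "agent_tool": "cursor"}] * 3
    + [{"source": "human"}] * 22
    + [{"source": "mixed"}] * 5
)
stats = compute_stats(records)
```

With 45 records split 18/22/5, the percentages come out to 40.0, 48.9, and 11.1, matching the example output.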
Report
ProvenanceReportGenerator generates Markdown reports from the stored records.
Report contents
- Summary: Totals and percentages by classification
- Detailed table: File, source, confidence, method, model, tool
- Detected tools: Frequency of each AI agent
- Detected models: Frequency of each model
Generation
licit trace --report
# Generates .licit/reports/provenance.md
from licit.provenance.report import ProvenanceReportGenerator
generator = ProvenanceReportGenerator()
markdown = generator.generate(records, project_name="my-project")
Full configuration
provenance:
  enabled: true
  methods:
    - git-infer # Git history heuristics
    - session-log # Agent session logs
  session_dirs:
    - ~/.claude/projects/ # Directory with Claude Code logs
  sign: true # Sign records with HMAC-SHA256
  sign_key_path: ~/.licit/signing-key
  confidence_threshold: 0.6 # Confidence threshold
  store_path: .licit/provenance.jsonl
Compliance integration
Provenance evidence feeds directly into the EvidenceBundle:
| Bundle field | What provenance provides |
|---|---|
| has_provenance | True if store exists with records |
provenance_stats | Aggregated statistics (totals, percentages, tools, models) |
These fields are evaluated by the compliance frameworks:
- EU AI Act Art. 10 (Data and data governance): Code origin traceability
- EU AI Act Art. 13 (Transparency): Disclosure of AI use in development
- OWASP ASI-06 (Insufficient Monitoring): Provenance trail as monitoring evidence
- OWASP ASI-10 (Insufficient Logging): Structured records of agent activity
Testing
167 tests cover the provenance system:
| Module | Tests | File |
|---|---|---|
| Heuristics | 23 | tests/test_provenance/test_heuristics.py |
| Git Analyzer | 15 | tests/test_provenance/test_git_analyzer.py |
| Store | 15 | tests/test_provenance/test_store.py |
| Attestation | 13 | tests/test_provenance/test_attestation.py |
| Tracker | 7 | tests/test_provenance/test_tracker.py |
| Session Reader | 13 | tests/test_provenance/test_session_reader.py |
| QA Edge Cases | 81 | tests/test_provenance/test_qa_edge_cases.py |
| Total | 167 | |
Tests include:
- Unit tests per module
- Edge cases (Unicode, empty files, invalid keys, repos without commits, etc.)
- Regression tests for 9 bugs found in QA
- Cross-module integration tests