Input Formats

intake supports eight input formats through specialized parsers. The format is auto-detected from the file extension and, when needed, the file content.


Summary table

Format | Parser | Extensions | Dependency | What it extracts
Markdown | MarkdownParser | .md, .markdown | - | YAML front matter, sections by headings
Plain text | PlaintextParser | .txt, stdin (-) | - | Paragraphs as sections
YAML / JSON | YamlInputParser | .yaml, .yml, .json | - | Top-level keys as sections
PDF | PdfParser | .pdf | pdfplumber | Text by page, tables as Markdown
DOCX | DocxParser | .docx | python-docx | Paragraphs, tables, metadata, sections by headings
Jira | JiraParser | .json (auto-detected) | - | Issues, comments, links, labels, priority
Confluence | ConfluenceParser | .html, .htm (auto-detected) | bs4, markdownify | Clean content as Markdown
Images | ImageParser | .png, .jpg, .jpeg, .webp, .gif | LLM vision | Description of visual content

Format auto-detection

The registry detects the format automatically following this order:

  1. Stdin (-): always treated as plaintext
  2. File extension: direct mapping (.md -> markdown, .pdf -> pdf, etc.)
  3. JSON subtype: if the extension is .json:
    • If it has an "issues" key, or is a list of objects that each contain "key" and "fields" -> jira
    • Otherwise -> yaml (treated as structured data)
  4. HTML subtype: if the extension is .html or .htm:
    • If the first 2000 characters contain “confluence” or “atlassian” -> confluence
    • Otherwise -> fallback to plaintext
  5. Fallback: if there is no parser for the detected format -> plaintext
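The detection order above can be sketched in Python. This is an illustrative sketch, not the registry's actual API: `detect_format` and `EXTENSION_MAP` are assumed names.

```python
import json
from pathlib import Path

# Hypothetical extension mapping; the real registry may cover more cases.
EXTENSION_MAP = {
    ".md": "markdown", ".markdown": "markdown",
    ".txt": "plaintext",
    ".yaml": "yaml", ".yml": "yaml",
    ".pdf": "pdf", ".docx": "docx",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".webp": "image", ".gif": "image",
}

def detect_format(source: str) -> str:
    if source == "-":                         # 1. stdin is always plaintext
        return "plaintext"
    suffix = Path(source).suffix.lower()
    if suffix == ".json":                     # 3. JSON subtype check
        data = json.loads(Path(source).read_text(encoding="utf-8"))
        if isinstance(data, dict) and "issues" in data:
            return "jira"
        if isinstance(data, list) and data and all(
            isinstance(i, dict) and "key" in i and "fields" in i for i in data
        ):
            return "jira"
        return "yaml"                         # generic structured data
    if suffix in (".html", ".htm"):           # 4. HTML subtype check
        head = Path(source).read_text(encoding="utf-8", errors="replace")[:2000]
        if "confluence" in head.lower() or "atlassian" in head.lower():
            return "confluence"
        return "plaintext"
    # 2 + 5. direct extension mapping, with plaintext as the final fallback
    return EXTENSION_MAP.get(suffix, "plaintext")
```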

Parsers in detail

Markdown

Extensions: .md, .markdown

What it extracts:

  • YAML front matter: if the file starts with ---, it extracts the metadata as key-value pairs
  • Sections by headings: each #, ##, ###, etc. becomes a section with title, level, and content
  • Full text: the content without the front matter

Source example:

---
project: Users API
version: 2.0
priority: high
---

# Functional Requirements

## FR-01: User Registration
The system must allow registration with email and password...

## FR-02: Authentication
The system must support OAuth2 and JWT...

Extracted metadata: project, version, priority (from front matter)
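The front-matter and heading extraction described above can be sketched as follows. The function name and return shape are illustrative; the real MarkdownParser may differ.

```python
import re

def parse_markdown(text: str):
    """Split Markdown into front-matter metadata and heading sections."""
    metadata = {}
    body = text
    if text.startswith("---"):
        # front matter ends at the next line that starts with '---'
        end = text.find("\n---", 3)
        if end != -1:
            for line in text[3:end].strip().splitlines():
                if ":" in line:
                    key, _, value = line.partition(":")
                    metadata[key.strip()] = value.strip()
            body = text[end + 4:].lstrip("\n")
    # each ATX heading becomes a section with its level and title
    sections = [
        {"level": len(m.group(1)), "title": m.group(2).strip()}
        for m in re.finditer(r"^(#{1,6})\s+(.+)$", body, re.MULTILINE)
    ]
    return metadata, sections
```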


Plain text

Extensions: .txt, stdin (-), files without extension

What it extracts:

  • Sections by paragraphs: each block separated by blank lines becomes a section
  • Metadata: source_type (“stdin” or “file”)

Ideal for:

  • Quick notes
  • Slack dumps
  • Raw ideas
  • Text copied from any source

Example:

We need a real-time notification system.
It must support WebSocket for immediate updates.

Users must be able to configure their preferences:
- Email for important notifications
- Push for real-time updates
- Mute by schedule
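The paragraph-to-section split amounts to breaking on blank lines, as in this minimal sketch (the helper name is an assumption):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Each block separated by one or more blank lines becomes a section."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```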

YAML / JSON

Extensions: .yaml, .yml, .json (when not Jira)

What it extracts:

  • Sections by top-level keys: each first-level key becomes a section
  • Text: YAML representation of the full content
  • Metadata: top_level_keys (number of top-level keys, for mappings) or item_count (for lists)

Source example:

functional_requirements:
  - id: FR-01
    title: User Registration
    description: Users must be able to register...
    priority: high
    acceptance_criteria:
      - Email validation
      - Password strength check

non_functional_requirements:
  - id: NFR-01
    title: API Response Time
    description: All API endpoints must respond in under 200ms
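The top-level-key split can be sketched for the JSON path as below; the YAML path is analogous via yaml.safe_load. Function and field names here are illustrative, not the real YamlInputParser.

```python
import json

def sections_from_structured(raw: str):
    """Turn top-level keys (or list items) of a JSON document into sections."""
    data = json.loads(raw)
    if isinstance(data, dict):
        sections = [{"title": key, "content": json.dumps(value, indent=2)}
                    for key, value in data.items()]
        metadata = {"top_level_keys": len(data)}
    else:
        sections = [{"title": f"item {i}", "content": json.dumps(item, indent=2)}
                    for i, item in enumerate(data)]
        metadata = {"item_count": len(data)}
    return sections, metadata
```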

PDF

Extensions: .pdf
Requires: pdfplumber

What it extracts:

  • Text by page: each page becomes a section
  • Tables: automatically converted to Markdown format
  • Metadata: page_count

Limitations:

  • Only works with PDFs that have extractable text
  • Scanned PDFs (images only) are not directly supported — use the image parser instead
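A sketch of the per-page extraction with pdfplumber is below. The function names and the exact Markdown rendering of tables are assumptions; pdfplumber is imported lazily since it is an optional dependency.

```python
def table_to_markdown(rows):
    """Render a pdfplumber table (a list of row lists) as a Markdown table."""
    header, *body = rows
    lines = ["| " + " | ".join(str(c or "") for c in header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(c or "") for c in row) + " |" for row in body]
    return "\n".join(lines)

def parse_pdf(path: str):
    import pdfplumber  # optional dependency, imported lazily

    sections = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            content = page.extract_text() or ""   # empty for scanned pages
            for table in page.extract_tables():
                content += "\n\n" + table_to_markdown(table)
            sections.append({"title": f"Page {number}", "content": content})
        return sections, {"page_count": len(pdf.pages)}
```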

DOCX

Extensions: .docx
Requires: python-docx

What it extracts:

  • Paragraphs: text from each paragraph
  • Sections by headings: Word headings are converted into structured sections
  • Tables: converted to Markdown format
  • Document metadata: author, title, subject, creation date
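The heading-to-section mapping can be sketched with python-docx as below. Helper names are assumptions and the real DocxParser may differ; python-docx is imported lazily.

```python
def heading_level(style_name: str):
    """Map a Word style name like 'Heading 2' to a section level, else None."""
    if style_name.startswith("Heading "):
        suffix = style_name.split()[-1]
        return int(suffix) if suffix.isdigit() else None
    return None

def parse_docx(path: str):
    import docx  # python-docx, imported lazily

    document = docx.Document(path)
    sections, paragraphs = [], []
    for para in document.paragraphs:
        level = heading_level(para.style.name)
        if level is not None:
            sections.append({"title": para.text, "level": level})
        elif para.text.strip():
            paragraphs.append(para.text)
    props = document.core_properties
    metadata = {"author": props.author, "title": props.title,
                "subject": props.subject, "created": props.created}
    return sections, paragraphs, metadata
```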

Jira

Extensions: .json (auto-detected by structure)

Supports two Jira export formats:

REST API format ({"issues": [...]}):

{
  "issues": [
    {
      "key": "PROJ-001",
      "fields": {
        "summary": "Implement login",
        "description": "The user must be able to...",
        "priority": {"name": "High"},
        "status": {"name": "To Do"},
        "labels": ["auth", "mvp"],
        "comment": {
          "comments": [...]
        },
        "issuelinks": [...]
      }
    }
  ]
}

List format ([{"key": "...", "fields": {...}}, ...]):

[
  {
    "key": "PROJ-001",
    "fields": {
      "summary": "Implement login",
      "description": "..."
    }
  }
]

What it extracts per issue:

Data | Jira field | Limit
Summary | fields.summary | -
Description | fields.description | -
Priority | fields.priority.name | -
Status | fields.status.name | -
Labels | fields.labels | -
Comments | fields.comment.comments | Last 5, max 500 chars each
Issue links | fields.issuelinks | Type, direction, target
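The per-issue extraction, including the comment limits, can be sketched as follows (function name and output shape are illustrative):

```python
def extract_issue(issue: dict) -> dict:
    """Pull the fields above out of one Jira issue, applying the limits."""
    fields = issue.get("fields", {})
    comments = fields.get("comment", {}).get("comments", [])
    return {
        "key": issue.get("key"),
        "summary": fields.get("summary"),
        "description": fields.get("description"),
        "priority": (fields.get("priority") or {}).get("name"),
        "status": (fields.get("status") or {}).get("name"),
        "labels": fields.get("labels", []),
        # keep only the last 5 comments, truncated to 500 chars each
        "comments": [str(c.get("body", ""))[:500] for c in comments[-5:]],
    }
```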

ADF support: Comments in Atlassian Document Format (nested JSON) are automatically converted to plain text.
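Flattening ADF is essentially a recursive walk that collects the text nodes. This is a simplified sketch of that conversion, not the parser's exact implementation:

```python
def adf_to_text(node) -> str:
    """Recursively flatten an ADF node tree into plain text."""
    if isinstance(node, dict):
        if node.get("type") == "text":
            return node.get("text", "")
        return "".join(adf_to_text(child) for child in node.get("content", []))
    if isinstance(node, list):
        return "".join(adf_to_text(child) for child in node)
    return ""
```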

Extracted relationships:

  • blocks / is blocked by
  • depends on
  • relates to

Confluence

Extensions: .html, .htm (auto-detected by content)
Requires: beautifulsoup4, markdownify

Detection: the first 2000 characters of the file are inspected for the strings “confluence” or “atlassian”.

What it extracts:

  • Main content: looks for the main content div (by id, class, or role)
  • Markdown conversion: converts HTML to clean Markdown with ATX headings
  • Sections by headings: from the resulting Markdown
  • Metadata: title, author, date, description (from <meta> tags)

Content selectors (in order of priority):

  1. div#main-content
  2. div.wiki-content
  3. div.confluence-information-macro
  4. div#content
  5. div[role=main]
  6. <body> (fallback)

Images

Extensions: .png, .jpg, .jpeg, .webp, .gif
Requires: LLM with vision capability

What it does:

  1. Encodes the image in base64
  2. Sends it to the vision LLM with a prompt asking to describe:
    • UI mockups / wireframes
    • Architecture diagrams
    • Visible text in the image
  3. Returns the description as text

Metadata: image_format, file_size_bytes

Note: By default it uses a stub that returns placeholder text. Real vision is activated when the LLMAdapter is configured with a model that supports vision.


General limitations

Limit | Value | Description
Maximum size | 50 MB | Files larger than 50 MB are rejected with FileTooLargeError
Empty files | Error | Empty or whitespace-only files raise EmptySourceError
Encoding | UTF-8 + fallback | UTF-8 is tried first, with a fallback to latin-1
Directories | Error | Passing a directory as a source raises an error

Adding support for more formats

intake uses the Protocol pattern for parsers. To add a new parser:

  1. Create a file in src/intake/ingest/ (e.g. asana.py)
  2. Implement the methods can_parse(source: str) -> bool and parse(source: str) -> ParsedContent
  3. Register it in registry.py inside create_default_registry()

There is no need to inherit from any base class — just implement the correct interface.
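A minimal sketch of such a parser is below. AsanaParser is hypothetical, and ParsedContent is sketched here with assumed fields; the real class may carry more.

```python
from typing import Protocol, runtime_checkable

class ParsedContent:
    """Sketched result type; the real ParsedContent may have more fields."""
    def __init__(self, text: str, metadata=None):
        self.text = text
        self.metadata = metadata or {}

@runtime_checkable
class Parser(Protocol):
    def can_parse(self, source: str) -> bool: ...
    def parse(self, source: str) -> ParsedContent: ...

class AsanaParser:
    """Hypothetical parser: no inheritance, it satisfies Parser structurally."""
    def can_parse(self, source: str) -> bool:
        return source.endswith(".asana.json")

    def parse(self, source: str) -> ParsedContent:
        return ParsedContent(text=f"tasks from {source}")
```

Because Parser is a Protocol, the registry only cares that both methods exist with the right shapes; AsanaParser never names Parser at all.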