Input Formats

intake supports 12 input formats through specialized parsers. The format is auto-detected by file extension and content. Parsers are automatically discovered via the plugin system.


Summary table

| Format | Parser | Extensions / Source | Dependency | What it extracts |
| --- | --- | --- | --- | --- |
| Markdown | MarkdownParser | .md, .markdown | - | YAML front matter, sections by headings |
| Plain text | PlaintextParser | .txt, stdin (-) | - | Paragraphs as sections |
| YAML / JSON | YamlInputParser | .yaml, .yml, .json | - | Top-level keys as sections |
| PDF | PdfParser | .pdf | pdfplumber | Text by page, tables as Markdown |
| DOCX | DocxParser | .docx | python-docx | Paragraphs, tables, metadata, sections by headings |
| Jira | JiraParser | .json (auto-detected) | - | Issues, comments, links, labels, priority |
| Confluence | ConfluenceParser | .html, .htm (auto-detected) | bs4, markdownify | Clean content as Markdown |
| Images | ImageParser | .png, .jpg, .jpeg, .webp, .gif | LLM vision | Visual content description |
| URLs | UrlParser | http://, https:// | httpx, bs4, markdownify | Web page content as Markdown |
| Slack | SlackParser | .json (auto-detected) | - | Messages, threads, decisions, action items |
| GitHub Issues | GithubIssuesParser | .json (auto-detected) | - | Issues, labels, comments, cross-references |
| GitLab Issues | GitlabIssuesParser | .json (auto-detected) | - | Issues, labels, notes, milestones, merge requests |

Format auto-detection

The registry detects the format automatically following this order:

  1. Stdin (-): always treated as plaintext
  2. File extension: direct mapping (.md -> markdown, .pdf -> pdf, etc.)
  3. JSON subtype: if the extension is .json, the content is inspected in this order:
    • If it has key "issues" or is a list with objects that have "key" + "fields" -> jira
    • If it has field "iid" (object or list of objects) -> gitlab_issues
    • If it is a list with objects that have "number" + ("html_url" or "labels") -> github_issues
    • If it is a list with objects that have "type": "message" + "ts" -> slack
    • If no subtype matches -> yaml (treated as structured data)
  4. HTML subtype: if the extension is .html or .htm:
    • If the first 2000 characters contain “confluence” or “atlassian” -> confluence
    • Otherwise -> fallback to plaintext
  5. URLs: if the source starts with http:// or https:// -> url
  6. Fallback: if there is no parser for the detected format -> plaintext

Note: JSON subtype detection follows a strict priority order: Jira > GitLab Issues > GitHub Issues > Slack > generic YAML. This avoids ambiguities when a JSON has fields that could match multiple formats.
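
The priority order can be sketched as a single function that checks each subtype in turn. This is a minimal illustration, not intake's actual code: the function names are hypothetical, and for lists it only inspects the first object, which is a simplifying assumption.

```python
from typing import Any, Optional


def _first_object(data: Any) -> Optional[dict]:
    """Return the first dict in a list, or None if there is none."""
    if isinstance(data, list):
        for item in data:
            if isinstance(item, dict):
                return item
    return None


def detect_json_subtype(data: Any) -> str:
    """Classify already-parsed JSON using the documented priority order."""
    first = _first_object(data)

    # 1. Jira: an "issues" key, or a list of objects with "key" + "fields".
    if isinstance(data, dict) and "issues" in data:
        return "jira"
    if first is not None and "key" in first and "fields" in first:
        return "jira"

    # 2. GitLab Issues: an "iid" field on the object or on list items.
    if isinstance(data, dict) and "iid" in data:
        return "gitlab_issues"
    if first is not None and "iid" in first:
        return "gitlab_issues"

    # 3. GitHub Issues: "number" plus "html_url" or "labels".
    if first is not None and "number" in first and (
        "html_url" in first or "labels" in first
    ):
        return "github_issues"

    # 4. Slack: "type": "message" plus a "ts" timestamp.
    if first is not None and first.get("type") == "message" and "ts" in first:
        return "slack"

    # 5. No subtype matched: treat as generic structured data.
    return "yaml"
```

Because the checks run top to bottom, an object that carries both "iid" and "number" is classified as gitlab_issues, which is the behavior the note describes.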


Parsers in detail

Markdown

Extensions: .md, .markdown

What it extracts:

  • YAML front matter: if the file starts with ---, extracts metadata as key-value pairs
  • Sections by headings: each #, ##, ###, etc. becomes a section with title, level, and content
  • Full text: the content without the front matter

Source example:

---
project: User API
version: 2.0
priority: high
---

# Functional Requirements

## FR-01: User registration
The system must allow registration with email and password...

## FR-02: Authentication
The system must support OAuth2 and JWT...

Extracted metadata: project, version, priority (from front matter)
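
As a rough sketch of the two extraction steps above, the following splits the front matter from the body and then cuts the body into heading sections. It assumes flat key: value front matter and uses only the standard library; a real parser would pass the block to a YAML library. The function names are hypothetical.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a leading '---' block from the body.

    Only flat 'key: value' lines are handled in this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            metadata = {}
            for line in lines[1:end]:
                key, sep, value = line.partition(":")
                if sep:
                    metadata[key.strip()] = value.strip()
            return metadata, "\n".join(lines[end + 1:])
    return {}, text  # no closing delimiter: treat everything as body


def split_sections(body: str) -> list[dict]:
    """Turn each heading into a section with title, level, and content.

    Text before the first heading is ignored in this sketch, and headings
    inside fenced code blocks are not special-cased.
    """
    sections = []
    current = None
    for line in body.splitlines():
        match = HEADING_RE.match(line)
        if match:
            current = {
                "title": match.group(2).strip(),
                "level": len(match.group(1)),
                "content": [],
            }
            sections.append(current)
        elif current is not None:
            current["content"].append(line)
    for section in sections:
        section["content"] = "\n".join(section["content"]).strip()
    return sections
```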


Plain text

Extensions: .txt, stdin (-), files without extension

What it extracts:

  • Sections by paragraphs: each block separated by blank lines becomes a section
  • Metadata: source_type (“stdin” or “file”)

Ideal for:

  • Quick notes
  • Slack dumps
  • Raw ideas
  • Text copied from any source

Example:

We need a real-time notification system.
It must support WebSocket for immediate updates.

Users must be able to configure their preferences:
- Email for important notifications
- Push for real-time updates
- Mute by schedule
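
Splitting on blank lines, as described above, can be sketched in a few lines. The function name is hypothetical; applied to the example, it would yield two sections.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block.strip() for block in blocks if block.strip()]
```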

YAML / JSON

Extensions: .yaml, .yml, .json (when no JSON subtype matches: Jira, GitLab Issues, GitHub Issues, or Slack)

What it extracts:

  • Sections by top-level keys: each first-level key becomes a section
  • Text: YAML representation of the complete content
  • Metadata: top_level_keys (count) or item_count

Source example:

functional_requirements:
  - id: FR-01
    title: User Registration
    description: Users must be able to register...
    priority: high
    acceptance_criteria:
      - Email validation
      - Password strength check

non_functional_requirements:
  - id: NFR-01
    title: API Response Time
    description: All API endpoints must respond in under 200ms
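
A minimal sketch of the top-level-key mapping, working on data that has already been loaded from YAML or JSON. Two assumptions are made here: section text is rendered with json.dumps to stay within the standard library (the parser itself is described as producing a YAML representation), and item_count is taken to apply when the top level is a list.

```python
import json
from typing import Any


def sections_from_structured(data: Any) -> tuple[list[dict], dict]:
    """Build sections and metadata from already-parsed YAML or JSON data."""
    if isinstance(data, dict):
        # One section per first-level key.
        sections = [
            {"title": str(key), "content": json.dumps(value, indent=2)}
            for key, value in data.items()
        ]
        return sections, {"top_level_keys": len(data)}
    if isinstance(data, list):
        # Assumption: a top-level list has no keys, so only the count is kept.
        return [], {"item_count": len(data)}
    return [], {}
```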

PDF

Extensions: .pdf

Requires: pdfplumber

What it extracts:

  • Text by page: each page becomes a section
  • Tables: automatically converted to Markdown format
  • Metadata: page_count

Limitations:

  • Only works with PDFs that have extractable text
  • Scanned PDFs (images only) are not directly supported — use the image parser instead
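
The page-by-page extraction can be approximated with pdfplumber's open, pages, extract_text, and extract_tables calls. This is a sketch under assumptions: the function names and the shape of the returned sections are hypothetical, and the first table row is treated as the header.

```python
def table_to_markdown(rows: list[list]) -> str:
    """Render a table (list of rows) as a Markdown pipe table."""
    if not rows:
        return ""

    def fmt(row):
        # pdfplumber returns None for empty cells; render those as blanks.
        cells = ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *map(fmt, body)])


def extract_pdf(path: str) -> tuple[list[dict], dict]:
    """Return one section per page plus a page_count metadata entry."""
    import pdfplumber  # imported lazily because it is an optional dependency

    sections = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""  # image-only pages yield no text
            tables = [table_to_markdown(t) for t in page.extract_tables()]
            sections.append({"title": f"Page {number}", "content": text, "tables": tables})
        metadata = {"page_count": len(pdf.pages)}
    return sections, metadata
```

A page whose content comes back empty is a hint that the PDF is scanned, which is the case the limitations above refer to.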

DOCX

Extensions: .docx

Requires: python-docx

What it extracts:

  • Paragraphs: text from each paragraph
  • Sections by headings: Word headings are converted into structured sections
  • Tables: converted to Markdown format
  • Document metadata: author, title, subject, creation date

Jira

Extensions: .json (auto-detected by structure)

Supports two Jira export formats:

REST API format ({"issues": [...]}):

{
  "issues": [
    {
      "key": "PROJ-001",
      "fields": {
        "summary": "Implement login",
        "description": "The user must be able to...",
        "priority": {"name": "High"},
        "status": {"name": "To Do"},
        "labels": ["auth", "mvp"],
        "comment": {
          "comments": [...]
        },
        "issuelinks": [...]
      }
    }
  ]
}

List format ([{"key": "...", "fields": {...}}, ...]):

[
  {
    "key": "PROJ-001",
    "fields": {
      "summary": "Implement login",
      "description": "..."
    }
  }
]

What it extracts per issue:

| Data | Jira field | Limit |
| --- | --- | --- |
| Summary | fields.summary | - |
| Description | fields.description | - |
| Priority | fields.priority.name | - |
| Status | fields.status.name | - |
| Labels | fields.labels | - |
| Comments | fields.comment.comments | Last 5, max 500 chars each |
| Issue links | fields.issuelinks | Type, direction, target |

ADF support: Comments in Atlassian Document Format (nested JSON) are automatically converted to plain text.

Extracted relationships:

  • blocks / is blocked by
  • depends on
  • relates to
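
The per-issue extraction and its comment limits can be sketched as follows. The helper names are hypothetical, issue links are left out for brevity, and the ADF flattening only handles text nodes and nested content, with a newline after each paragraph as an assumption.

```python
from typing import Any

MAX_COMMENTS = 5
MAX_COMMENT_CHARS = 500


def adf_to_text(node: Any) -> str:
    """Flatten an Atlassian Document Format node (or a plain string) to text."""
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return "".join(adf_to_text(child) for child in node)
    if isinstance(node, dict):
        if node.get("type") == "text":
            return node.get("text", "")
        inner = adf_to_text(node.get("content", []))
        # Assumption: end each paragraph with a newline.
        return inner + "\n" if node.get("type") == "paragraph" else inner
    return ""


def extract_issue(issue: dict) -> dict:
    """Pull the documented fields out of one Jira issue object."""
    fields = issue.get("fields", {})
    comments = (fields.get("comment") or {}).get("comments", [])
    return {
        "key": issue.get("key"),
        "summary": fields.get("summary"),
        "description": adf_to_text(fields.get("description")).strip(),
        "priority": (fields.get("priority") or {}).get("name"),
        "status": (fields.get("status") or {}).get("name"),
        "labels": fields.get("labels", []),
        # Keep only the last 5 comments, each cut to 500 characters.
        "comments": [
            adf_to_text(c.get("body")).strip()[:MAX_COMMENT_CHARS]
            for c in comments[-MAX_COMMENTS:]
        ],
    }
```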

Confluence

Extensions: .html, .htm (auto-detected by content)

Requires: beautifulsoup4, markdownify

Detection: The first 2000 characters of the file are inspected looking for “confluence” or “atlassian”.

What it extracts:

  • Main content: searches for the main content div (by id, class, or role)
  • Markdown conversion: converts HTML to clean Markdown with ATX headings
  • Sections by headings: from the resulting Markdown
  • Metadata: title, author, date, description (from <meta> tags)

Content selectors (in priority order):

  1. div#main-content
  2. div.wiki-content
  3. div.confluence-information-macro
  4. div#content
  5. div[role=main]
  6. <body> (fallback)
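
Detection and selector fallback can be sketched with BeautifulSoup's select_one, trying each selector in the listed order. The function names are hypothetical, and treating the marker words as case-insensitive is an assumption.

```python
CONTENT_SELECTORS = [
    "div#main-content",
    "div.wiki-content",
    "div.confluence-information-macro",
    "div#content",
    "div[role=main]",
    "body",
]


def looks_like_confluence(html: str) -> bool:
    """Check the first 2000 characters for the two marker words."""
    head = html[:2000].lower()  # assumption: match regardless of case
    return "confluence" in head or "atlassian" in head


def main_content_html(html: str) -> str:
    """Return the HTML of the first selector that matches."""
    from bs4 import BeautifulSoup  # optional dependency, imported lazily

    soup = BeautifulSoup(html, "html.parser")
    for selector in CONTENT_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return str(node)
    return html  # not even a <body>: fall back to the raw input
```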

Images

Extensions: .png, .jpg, .jpeg, .webp, .gif

Requires: LLM with vision capability

What it does:

  1. Encodes the image in base64
  2. Sends to the vision LLM with a prompt asking to describe:
    • UI mockups / wireframes
    • Architecture diagrams
    • Visible text in the image
  3. Returns the description as text

Metadata: image_format, file_size_bytes

Note: By default it uses a stub that returns placeholder text. Real vision is activated when the LLMAdapter is configured with a model that supports vision.


URLs

Source: URLs starting with http:// or https://

Requires: httpx, beautifulsoup4, markdownify

What it does:

  1. Downloads the page via httpx (sync, configurable timeout)
  2. Converts HTML to clean Markdown via BeautifulSoup4 + markdownify
  3. Extracts page title, sections by headings
  4. Auto-detects source type by URL patterns

Type auto-detection:

| URL pattern | Detected type |
| --- | --- |
| confluence, wiki | confluence |
| jira, atlassian | jira |
| github.com | github |
| Other | webpage |
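
The pattern table can be read as an ordered substring match. The sketch below assumes that the row order is also the priority order and that matching is case-insensitive; neither is stated explicitly, and the function name is hypothetical.

```python
URL_TYPE_PATTERNS = [
    (("confluence", "wiki"), "confluence"),
    (("jira", "atlassian"), "jira"),
    (("github.com",), "github"),
]


def detect_url_type(url: str) -> str:
    """Return the source type for a URL, or "webpage" if nothing matches."""
    lowered = url.lower()
    for needles, source_type in URL_TYPE_PATTERNS:
        if any(needle in lowered for needle in needles):
            return source_type
    return "webpage"
```

With plain substring matching, a GitHub wiki URL would be classified as confluence because "wiki" is checked first; that is a property of this sketch, not a documented behavior.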

Extracted metadata: url, title, source_type, section_count

Error handling:

  • Timeout -> ParseError with suggestion to verify the URL
  • HTTP 4xx/5xx -> ParseError with the status code
  • Connection error -> ParseError with suggestion to verify the network

Example:

intake init "API review" -s https://wiki.company.com/rfc/auth

Slack

Extensions: .json (auto-detected by structure)

Detection: The JSON file must be a list of objects with "type": "message" and a "ts" field (Slack timestamp).

What it extracts:

  • Messages: text from each message with user and timestamp
  • Threads: messages grouped by thread_ts
  • Decisions: messages with specific reactions (thumbsup, white_check_mark) or keywords like “decided”, “agreed”
  • Action items: messages with keywords like “TODO”, “action item”, “we need”

Metadata:

| Field | Description |
| --- | --- |
| message_count | Total messages |
| thread_count | Number of threads |
| decision_count | Detected decisions |
| action_item_count | Detected action items |

Source example:

[
  {"type": "message", "user": "U123", "text": "We need to use PostgreSQL", "ts": "1700000000.000"},
  {"type": "message", "user": "U456", "text": "Agreed", "ts": "1700000001.000",
   "reactions": [{"name": "thumbsup", "count": 3}]},
  {"type": "message", "user": "U789", "text": "TODO: configure the database", "ts": "1700000002.000",
   "thread_ts": "1700000000.000"}
]
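
Decision and action-item detection can be sketched as keyword and reaction checks. The keyword lists below contain only the examples named above, so the real lists may be longer; the function names and case-insensitive matching are assumptions.

```python
DECISION_REACTIONS = {"thumbsup", "white_check_mark"}
DECISION_KEYWORDS = ("decided", "agreed")
ACTION_KEYWORDS = ("todo", "action item", "we need")


def is_decision(message: dict) -> bool:
    """True if the message has a decision reaction or a decision keyword."""
    reactions = {r.get("name") for r in message.get("reactions", [])}
    text = message.get("text", "").lower()
    return bool(reactions & DECISION_REACTIONS) or any(k in text for k in DECISION_KEYWORDS)


def is_action_item(message: dict) -> bool:
    """True if the message text contains an action-item keyword."""
    text = message.get("text", "").lower()
    return any(k in text for k in ACTION_KEYWORDS)


def summarize(messages: list[dict]) -> dict:
    """Compute the four metadata counters for a list of messages."""
    threads = {m["thread_ts"] for m in messages if "thread_ts" in m}
    return {
        "message_count": len(messages),
        "thread_count": len(threads),
        "decision_count": sum(is_decision(m) for m in messages),
        "action_item_count": sum(is_action_item(m) for m in messages),
    }
```

On the three-message example, this sketch counts one thread, one decision ("Agreed" with a thumbsup reaction), and two action items ("We need ..." and "TODO: ...").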

GitHub Issues

Extensions: .json (auto-detected by structure)

Detection: The JSON file must contain objects with a "number" field and at least "html_url", "title" + "labels", or "title" + "body". Supports both a single issue and a list.

What it extracts:

  • Issues: number, title, body, state (open/closed)
  • Labels: issue labels
  • Assignees: assigned users
  • Milestones: associated milestone
  • Comments: issue comments
  • Cross-references: detects #NNN in the text as references to other issues

Supported formats:

// List format (multiple issues)
[
  {
    "number": 1,
    "title": "Login bug",
    "body": "Login fails when...",
    "html_url": "https://github.com/org/repo/issues/1",
    "state": "open",
    "labels": [{"name": "bug"}, {"name": "priority:high"}],
    "comments": [
      {"body": "Reproduced in production", "user": {"login": "dev1"}}
    ]
  }
]

// Individual format (single issue)
{
  "number": 42,
  "title": "Feature request",
  "body": "We need...",
  "html_url": "https://github.com/org/repo/issues/42"
}

Metadata: source_type (“github_issues”), issue_count, labels (comma-separated list), milestone (if exists)

Extracted relationships: cross-references via #NNN in body and comments.
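
Cross-reference detection can be sketched with a regular expression over the body and comments. The pattern and function name are illustrative assumptions; skipping self-references and duplicates is a choice made for this sketch.

```python
import re

# "#" followed by digits, not preceded by a word character or "&"
# (the latter avoids matching HTML entities such as "&#39;").
ISSUE_REF_RE = re.compile(r"(?<![\w&])#(\d+)\b")


def find_cross_references(issue: dict) -> list[int]:
    """Collect issue numbers referenced as #NNN, in order of appearance."""
    texts = [issue.get("body") or ""]
    comments = issue.get("comments")
    if isinstance(comments, list):  # some exports store only a comment count
        texts += [c.get("body") or "" for c in comments]
    own = issue.get("number")
    refs: list[int] = []
    for text in texts:
        for match in ISSUE_REF_RE.finditer(text):
            number = int(match.group(1))
            if number != own and number not in refs:
                refs.append(number)
    return refs
```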


GitLab Issues

Extensions: .json (auto-detected by structure)

Detection: The JSON file must contain objects with an "iid" field (GitLab internal ID). Supports a single issue, a list, or a wrapped format {"issues": [...]}.

What it extracts:

  • Issues: IID, title, description, state (opened/closed)
  • Labels: issue labels
  • Assignees: assigned users
  • Milestones: associated milestone (title)
  • Weight: issue weight (if exists)
  • Task completion status: checkbox progress (count/completed_count)
  • Discussion notes: non-system discussion notes (max 500 chars each)
  • Merge requests: linked MRs (as relationships)

Supported formats:

// Individual format (single issue)
{
  "iid": 42,
  "title": "Implement SSO login",
  "description": "Login must support SAML...",
  "state": "opened",
  "labels": ["feature", "auth"],
  "milestone": {"title": "v2.0"},
  "assignees": [{"username": "jdoe"}],
  "notes": [
    {"author": {"username": "dev"}, "body": "Implemented", "system": false}
  ]
}

// List format (multiple issues)
[
  {"iid": 42, "title": "...", ...},
  {"iid": 43, "title": "...", ...}
]

// Wrapped format
{"issues": [{"iid": 42, ...}]}

Metadata: source_type (“gitlab_issues”), issue_count, labels (comma-separated list), milestone (if exists)

Extracted relationships: Linked merge requests (if they exist).
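
The note filtering described above can be sketched as follows: system notes are skipped and each body is cut to 500 characters. The function name and the output shape are hypothetical.

```python
MAX_NOTE_CHARS = 500


def discussion_notes(issue: dict) -> list[dict]:
    """Return non-system notes with bodies truncated to 500 characters."""
    notes = []
    for note in issue.get("notes", []):
        if note.get("system"):
            continue  # skip automatic notes such as label or state changes
        notes.append({
            "author": (note.get("author") or {}).get("username"),
            "body": (note.get("body") or "")[:MAX_NOTE_CHARS],
        })
    return notes
```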


Direct API connectors

In addition to local files, intake can obtain data directly from APIs using scheme URIs. Connectors require configuration in .intake.yaml and credentials via environment variables. There are currently 4 connectors: Jira, Confluence, GitHub, and GitLab.

Jira

Supported URIs:

| Pattern | What it does |
| --- | --- |
| jira://PROJ-123 | A single issue |
| jira://PROJ-123,PROJ-124,PROJ-125 | Multiple issues |
| jira://PROJ?jql=sprint%20%3D%2042 | JQL search |
| jira://PROJ/sprint/42 | All issues from a sprint |

Dependency: atlassian-python-api (install with pip install "intake-ai-cli[connectors]")

Example:

intake init "Sprint 42 tasks" -s jira://PROJ/sprint/42

Issues are downloaded as temporary JSON and parsed with JiraParser. Comments are included according to connectors.jira.include_comments.
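
The four URI shapes can be told apart with the standard library's URL parser. This is a sketch of one possible approach; the function name and the returned dictionary keys are hypothetical and do not describe intake's internal representation.

```python
from urllib.parse import parse_qs, urlsplit


def parse_jira_uri(uri: str) -> dict:
    """Classify a jira:// URI as an issue list, a JQL search, or a sprint."""
    parts = urlsplit(uri)
    if parts.scheme != "jira":
        raise ValueError(f"not a jira:// URI: {uri}")
    target = parts.netloc  # project key or comma-separated issue keys
    query = parse_qs(parts.query)  # percent-decoding happens here
    segments = [s for s in parts.path.split("/") if s]

    if "jql" in query:
        return {"mode": "jql", "project": target, "jql": query["jql"][0]}
    if len(segments) == 2 and segments[0] == "sprint":
        return {"mode": "sprint", "project": target, "sprint": segments[1]}
    keys = [k for k in target.split(",") if k]
    if keys and not segments:
        return {"mode": "issues", "keys": keys}
    raise ValueError(f"unsupported jira:// URI: {uri}")
```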

Confluence

Supported URIs:

| Pattern | What it does |
| --- | --- |
| confluence://page/123456789 | Page by ID |
| confluence://SPACE/Page-Title | Page by space and title |
| confluence://search?cql=space.key%3DENG | CQL search |

Dependency: atlassian-python-api

Example:

intake init "Architecture RFC" -s confluence://ENG/Architecture-RFC-2025

Pages are downloaded as temporary HTML and parsed with ConfluenceParser.

GitHub

Supported URIs:

| Pattern | What it does |
| --- | --- |
| github://org/repo/issues/42 | A single issue |
| github://org/repo/issues/42,43,44 | Multiple issues |
| github://org/repo/issues?labels=bug&state=open | Issues filtered by labels, state, milestone |

Dependency: PyGithub (install with pip install "intake-ai-cli[connectors]")

Example:

intake init "Bug triage" -s "github://org/repo/issues?labels=bug&state=open"

Issues are downloaded as temporary JSON and parsed with GithubIssuesParser. Maximum 50 issues per query, 10 comments per issue.

GitLab

Supported URIs:

| Pattern | What it does |
| --- | --- |
| gitlab://group/project/issues/42 | A single issue |
| gitlab://group/project/issues/42,43,44 | Multiple issues |
| gitlab://group/project/issues?labels=bug&state=opened | Filtered issues |
| gitlab://group/project/milestones/3/issues | Issues from a milestone |

Dependency: python-gitlab (install with pip install "intake-ai-cli[connectors]")

Example:

intake init "Sprint review" -s "gitlab://team/backend/issues?labels=sprint&state=opened"

Issues are downloaded as temporary JSON and parsed with GitlabIssuesParser. Maximum 50 issues per query, 10 notes per issue. Supports nested groups and configurable SSL.

Connector configuration

See Configuration > Connectors for the complete configuration fields. Connectors need:

  1. Base URL of the instance (Jira/Confluence) or token (GitHub)
  2. Credentials via environment variables

Run intake doctor to verify that credentials are configured.

General limitations

| Limit | Value | Description |
| --- | --- | --- |
| Maximum size | 50 MB | Files larger than 50 MB are rejected with FileTooLargeError |
| Empty files | Error | Empty files or whitespace-only files produce EmptySourceError |
| Encoding | UTF-8 + fallback | Tries UTF-8 first, fallback to latin-1 |
| Directories | Error | Passing a directory as a source produces an error |
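
These checks can be sketched as one read function. The two exception classes are stand-ins for the errors named in the table. The exact byte threshold, the use of IsADirectoryError for directories, and the function name are assumptions.

```python
from pathlib import Path

MAX_SIZE_BYTES = 50 * 1024 * 1024  # 50 MB; the exact byte count is an assumption


class FileTooLargeError(Exception):
    """Stand-in for the error named in the limitations table."""


class EmptySourceError(Exception):
    """Stand-in for the error named in the limitations table."""


def read_source(path: str) -> str:
    """Read a source file, applying the size, emptiness, and encoding rules."""
    file = Path(path)
    if file.is_dir():
        # Assumption: the document only says "an error" for directories.
        raise IsADirectoryError(f"{path} is a directory, not a file")
    if file.stat().st_size > MAX_SIZE_BYTES:
        raise FileTooLargeError(f"{path} is larger than 50 MB")
    raw = file.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # latin-1 maps every byte, so this cannot fail
    if not text.strip():
        raise EmptySourceError(f"{path} is empty or whitespace-only")
    return text
```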

Adding support for more formats

There are two ways to add a new parser:

Option 1: Built-in parser (V1 Protocol)

  1. Create a file in src/intake/ingest/ (e.g.: asana.py)
  2. Implement the methods can_parse(source: str) -> bool and parse(source: str) -> ParsedContent
  3. Register it in create_default_registry() and as an entry_point in pyproject.toml

Option 2: External plugin (V2 ParserPlugin)

  1. Create a separate Python package
  2. Implement the ParserPlugin protocol from intake.plugins.protocols
  3. Register as an entry_point in the intake.parsers group in your pyproject.toml

The parser will be automatically discovered when the package is installed. See Plugins for details.

No base class inheritance is required — just implement the correct interface (structural subtyping via typing.Protocol).
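
A minimal sketch of what structural subtyping looks like for the two methods of the V1 interface. ParsedContent here is a hypothetical stand-in with made-up fields, the Parser protocol class is illustrative rather than intake's own definition, and the .asana.json suffix is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass
class ParsedContent:
    """Hypothetical stand-in; the real class defines its own fields."""
    text: str
    format: str
    metadata: dict = field(default_factory=dict)


@runtime_checkable
class Parser(Protocol):
    """Illustrative protocol with the two documented method signatures."""

    def can_parse(self, source: str) -> bool: ...

    def parse(self, source: str) -> ParsedContent: ...


class AsanaParser:
    """No base class: providing the two methods is enough."""

    def can_parse(self, source: str) -> bool:
        return source.endswith(".asana.json")  # invented suffix for the example

    def parse(self, source: str) -> ParsedContent:
        with open(source, encoding="utf-8") as handle:
            return ParsedContent(text=handle.read(), format="asana")
```

With runtime_checkable, isinstance(AsanaParser(), Parser) holds even though AsanaParser never inherits from Parser; the check only looks for the method names, not their signatures.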