Changelog

What's in ClicheFactory

A snapshot of what the platform does today, plus an append-only log of what's shipped since.

What's in ClicheFactory today

Evergreen reference. Edited when features change. Dated release notes are below.

Surfaces

REST API — language-agnostic. Single extract endpoint plus to_markdown, batch, and training endpoints.
Python SDK — fluent client, sync and async, Pydantic schema validation, batch helpers.
CLI — configure, extract, to-markdown, batch processing.
MCP Server — native integration for Cursor IDE and Claude Desktop. No glue code.
Web app — project-based workflow. Three ways to use it:
- Extract documents directly (full service or BYOK).
- Label documents with AI pre-fills, correct as needed, and export the result as a clean golden dataset you can use anywhere — even outside ClicheFactory, with your own training stack.
- Train custom extraction pipelines from your labeled data — on our cloud (callable from the SDK, CLI, and REST API), or download a runnable DSPy bundle and train locally with your own key.

Document support

PDF — digital-native and scanned (OCR routed automatically).
Images — PNG, JPG, WEBP, GIF, BMP.
Office — DOC, DOCX, ODT.
Spreadsheets — XLSX (with sheet selection), CSV.
Plain text — TXT, MD.
Email — EML, with recursive attachment extraction.
Raw text input — pass a string, no file required.
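To make "recursive attachment extraction" concrete, here is a minimal sketch using only Python's standard email package. It is not ClicheFactory's parser: walk_attachments and build_sample_eml are hypothetical names, and the sample message is made up for illustration. It only shows the idea of descending into an attached email to reach the files nested inside it.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def walk_attachments(msg, depth=0):
    """Yield (depth, filename, content_type) for each attachment, descending into attached emails."""
    for part in msg.iter_attachments():
        yield depth, part.get_filename(), part.get_content_type()
        if part.get_content_type() == "message/rfc822":
            # For an attached email, get_content() returns the nested message object.
            yield from walk_attachments(part.get_content(), depth + 1)


def build_sample_eml() -> bytes:
    """Build a forwarded email that carries its own attachment (illustrative data only)."""
    inner = EmailMessage()
    inner["Subject"] = "Invoice"
    inner.set_content("Invoice attached.")
    inner.add_attachment(b"%PDF-1.4 placeholder", maintype="application",
                         subtype="pdf", filename="invoice.pdf")

    outer = EmailMessage()
    outer["Subject"] = "Fwd: Invoice"
    outer.set_content("Forwarding this.")
    outer.add_attachment(inner, filename="forwarded.eml")
    outer.add_attachment("a,b\n1,2\n", subtype="csv", filename="table.csv")
    return outer.as_bytes()


# Parse the raw bytes the way an .eml file on disk would be parsed.
parsed = message_from_bytes(build_sample_eml(), policy=policy.default)
found = list(walk_attachments(parsed))
```

A top-level-only parser would stop at forwarded.eml and table.csv; the recursive walk also reaches invoice.pdf one level down.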

Key differentiators

Trainable pipelines (BYOK) — fine-tune extraction on your own labeled documents. Not prompt engineering, an actual compiled pipeline (DSPy under the hood). Two optimizer tiers: Tiny for fast iteration (< 75 examples), MIPRO for production accuracy (≥ 75 examples).
Labeling as a standalone tool, DSPy-native — use the web app purely as a fast ground-truth builder. AI prefills get you most of the way, you correct, and you download the result as a runnable DSPy bundle: your documents, your reviewed ground truth, and a self-contained train.py. Optimize locally with your own key — we don't hide the framework or lock you in.
Double Pydantic enforcement — schema validated server-side at inference time and client-side at deserialization. Two independent layers.
Graceful degradation — allow_partial=True returns the fields that validated, plus per-field errors for the ones that did not, instead of failing the whole call. Partial responses are not billed.
EML recursive attachment parsing — email files parsed through all nested attachments, not just the top level.
Dual model configuration — route OCR to a cheaper model independently of the extraction model. Two dials, one call.
Extraction debugging — include_doc=True returns the exact markdown the LLM received, so you can see what the model actually saw.
Smart document routing — text-vs-OCR decision happens automatically per page. No preprocessing required.
Batch processing — configurable concurrency, per-call model override without changing global client config.
Native MCP integration — works as a tool inside Cursor IDE and Claude Desktop. No glue code, no shim.
Full local mode with Ollama — entire pipeline runs on your machine. Air-gapped, no API key required, no credits consumed, no data leaves your infrastructure. Quality is your call: strong cloud models and larger local models handle complex schemas well; smaller local models are fine for simple ones. We don't publish a quality matrix for Ollama models — the landscape changes too fast to commit to one.
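The graceful degradation behaviour described above can be pictured with a small standalone sketch. This is not the SDK's code and does not use Pydantic: PartialResult and validate_fields are hypothetical names, and the schema is reduced to a plain field-to-type mapping. It only illustrates the contract, which is to keep the fields that pass, report the rest per field, and raise only when partial results are not allowed.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PartialResult:
    """Hypothetical container: validated fields plus one error message per failed field."""
    data: dict[str, Any] = field(default_factory=dict)
    errors: dict[str, str] = field(default_factory=dict)


def validate_fields(raw: dict[str, Any], schema: dict[str, type],
                    allow_partial: bool = False) -> PartialResult:
    """Check each field independently so one bad value does not discard the others."""
    result = PartialResult()
    for name, expected in schema.items():
        if name not in raw:
            result.errors[name] = "missing"
        elif not isinstance(raw[name], expected):
            result.errors[name] = (
                f"expected {expected.__name__}, got {type(raw[name]).__name__}"
            )
        else:
            result.data[name] = raw[name]
    if result.errors and not allow_partial:
        # Strict mode: any failed field fails the whole call.
        raise ValueError(f"validation failed: {result.errors}")
    return result


schema = {"invoice_number": str, "total": float}
raw = {"invoice_number": "INV-7", "total": "12,50"}  # total came back as a string
partial = validate_fields(raw, schema, allow_partial=True)
```

With allow_partial=True the caller still gets invoice_number and can decide what to do about total; in strict mode the same input raises.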

Extraction modes

Default — balanced pipeline. Smart routing + structured extraction.
Fast — one-shot extraction; document sent directly to a vision-capable LLM. No OCR step.
Trained — your custom DSPy pipeline (BYOK only during MVP).
Robust — extract + independent verification pass, for high-stakes documents.
Robust + Trained — combine the two; trained pipeline plus verification, one call.
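The smart routing step in Default mode decides text versus OCR per page. The actual decision rule is not documented here, so the following is only one common heuristic, sketched under assumptions: a page whose embedded text layer holds enough non-whitespace characters is treated as digital-native, anything else goes to OCR. route_pages and the min_chars threshold are hypothetical.

```python
def route_pages(page_texts, min_chars=40):
    """Choose 'text' or 'ocr' per page from the size of its embedded text layer.

    min_chars is an assumed threshold, not a ClicheFactory setting.
    """
    routes = []
    for text in page_texts:
        visible = "".join(text.split())  # drop all whitespace before counting
        routes.append("text" if len(visible) >= min_chars else "ocr")
    return routes


# Page 1 has a real text layer; pages 2 and 3 look like scans (empty or blank layer).
pages = [
    "Invoice 2041\nTotal due: 1,250.00 EUR, payable within 30 days.",
    "",
    " \n",
]
routes = route_pages(pages)
```

Deciding per page rather than per file is what lets a mixed PDF, such as a digital cover letter followed by scanned receipts, avoid OCR on the pages that do not need it.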

Execution modes

Service — runs on ClicheFactory cloud with a ClicheFactory API key. All operations, including trained and robust.
BYOK-service — ClicheFactory handles OCR and infrastructure; you supply your own LLM API key for the inference step. Reduced credit cost.
Local — full pipeline on your machine, using Ollama or your own OpenAI, Gemini, or Anthropic keys. Supports extract and to_markdown. Air-gapped with Ollama.

Release notes

Dated, append-only. The block above is the source of truth for what currently exists; entries here describe what changed.

Public changelog started recently. Earlier history is summarized in the snapshot above — entries below describe changes from this point forward.

June 8, 2026

Web app: Download a runnable DSPy training bundle for any run. From a training run you can now buy and download a self-contained DSPy bundle: your documents (inputs/), your reviewed ground truth (ground_truth/), a Pydantic + JSON schema, and a generic, ready-to-run train.py. Label here, then optimize locally with your own LLM key, with no SDK lock-in and none of our proprietary signature or metric. Only reviewed / human-labeled examples are included; unreviewed pipeline echoes are skipped (they carry no training signal). Flat fee per run; re-downloads are free. Paid plans only.

May 14, 2026

SDK 0.6.1: OpenDocument (.odt) parsing works out of the box. OpenDocument and legacy Office files are now routed by extension: .odt goes straight through soffice (LibreOffice), and .doc uses soffice with a pandoc fallback. Also fixes XlsxParser / DocxParser instantiation so they correctly pick up the shared parser registry.
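The routing in this entry amounts to an ordered fallback chain per file extension. A minimal sketch of that pattern follows, under stated assumptions: ROUTES mirrors the two routes named in the entry, convert_with_fallback is a hypothetical name, and the converters are injected as plain callables so the actual soffice and pandoc invocations stay out of the picture. This is not the SDK's implementation.

```python
from pathlib import Path
from typing import Callable

# A converter takes a file path and returns extracted text, or raises on failure.
Converter = Callable[[Path], str]

# Converter order per extension, as described in the release note.
ROUTES = {
    ".odt": ["soffice"],
    ".doc": ["soffice", "pandoc"],
}


def convert_with_fallback(path: Path, converters: dict[str, Converter]) -> tuple[str, str]:
    """Try each converter routed for this extension; return (converter_name, text)."""
    names = ROUTES.get(path.suffix.lower())
    if names is None:
        raise ValueError(f"no route for {path.suffix!r}")
    failures = []
    for name in names:
        try:
            return name, converters[name](path)
        except Exception as exc:  # any failure moves on to the next converter
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all converters failed: " + "; ".join(failures))


# Stand-in converters for illustration; real ones would shell out to the tools.
def broken_soffice(path: Path) -> str:
    raise RuntimeError("soffice not available")


def fake_pandoc(path: Path) -> str:
    return f"markdown for {path.name}"


used, text = convert_with_fallback(
    Path("memo.doc"), {"soffice": broken_soffice, "pandoc": fake_pandoc}
)
```

Collecting every failure before raising keeps the final error informative when the whole chain is exhausted.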

MCP 0.1.6: Picks up the SDK fixes above. No tool-surface changes; extract and to_markdown now succeed on .odt / .xlsx / .docx documents that previously dropped to the VLM fallback.

Service: Better defaults for non-PDF formats. The hosted extraction service now ships with LibreOffice and the system MIME database in the image, so .odt, .csv, .xlsx, .docx, and .eml uploads parse cleanly without falling through to a raw VLM call. Added an upstream guard that returns HTTP 400 immediately when a file's MIME type isn't accepted by the OCR model, instead of bubbling up an opaque "Unsupported MIME type" error from deep in the stack.
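The upstream guard is a fail-fast check at the edge of the request. A rough sketch of the idea, using Python's standard mimetypes module: OCR_ACCEPTED and guard_mime are hypothetical, the accepted set is an assumption rather than the service's real list, and a real service may sniff file contents instead of trusting the filename.

```python
import mimetypes

# Assumed set of MIME types the OCR model accepts; illustrative only.
OCR_ACCEPTED = {"application/pdf", "image/png", "image/jpeg"}


def guard_mime(filename: str, accepted=OCR_ACCEPTED) -> tuple[int, str]:
    """Return (HTTP status, message) before any expensive work starts."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime not in accepted:
        # Reject up front with a message that names the offending type and file.
        return 400, f"Unsupported MIME type {mime or 'unknown'} for {filename}"
    return 200, "ok"


accepted_status, _ = guard_mime("scan.png")
rejected_status, rejected_message = guard_mime("blob.zzz")
```

Rejecting at this point means the caller sees which file and which type were refused, rather than a generic error raised several layers down.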

Web app: .csv and .odt uploads accepted in the batch flow. The batch upload form now allows CSV and ODT documents alongside the existing PDF / image / DOC / DOCX / XLSX / EML / TXT formats.