Training

Train custom extraction pipelines optimized for your specific document types — higher accuracy, lower cost at scale.

Training currently runs in BYOK mode only. Supply your model provider key (OpenAI, Gemini, or Anthropic) when starting a run. Full-service training is on the roadmap.

Overview

Training creates a custom extraction pipeline that learns from your labeled examples. Because it is optimized on your own documents rather than driven by a generic LLM prompt, the trained pipeline typically achieves higher accuracy, more consistent output, and often lower cost than standard or robust modes.

When to train:

  • You have a recurring document type (e.g., always the same invoice format from a vendor)
  • Generic extraction gets 80–90% right but you need 95%+
  • You process high volumes and want consistent, reproducible results

Concepts

Term           Meaning
Project        A workspace for a document type (e.g., "Vendor Invoices").
Task           A labeling batch within a project; this is your training data.
Schema         The JSON schema / Pydantic model defining what to extract. See Schemas.
Training run   An optimization job that produces an artifact.
Artifact       The trained pipeline: a versioned binary with an artifact_id.
Deployment     Activating an artifact for a project + environment.
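
To make the Schema concept concrete, here is a minimal sketch that writes a JSON schema for an invoice to schema.json using only the Python standard library. The field names (invoice_number, vendor_name, total_amount, currency) and the write_schema helper are hypothetical and chosen for illustration; they are not fields or functions that ClicheFactory requires.

```python
import json
from pathlib import Path

# Hypothetical invoice fields, for illustration only.
INVOICE_SCHEMA = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}


def write_schema(path: str = "schema.json") -> Path:
    """Serialize the schema to disk and return the path that was written."""
    out = Path(path)
    out.write_text(json.dumps(INVOICE_SCHEMA, indent=2), encoding="utf-8")
    return out
```

A file written this way has the same shape as the schema.json that the CLI's --schema flag expects under Using Trained Models, although the exact set of JSON Schema keywords the service supports should be checked against the Schemas page.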

Workflow

  1. Create a project in the web UI at clichefactory.com.
  2. Define your schema — the fields you want to extract from this document type.
  3. Upload documents (PDFs, images, DOCX, etc.) to your project.
  4. Label ground truth — manually verify and correct AI-assisted extractions to build training data.
  5. Start a training run — select your data, choose an optimizer tier, and launch.
  6. Monitor progress in the web app, which shows real-time progress, metrics, and results.
  7. Use the trained model — once training completes, you get an artifact_id. Use it in any extraction call.

Using Trained Models

Pass the artifact_id from a completed training run to any extraction call:

Python:

# Assumes `client` and the `Invoice` schema model are already defined.
cliche = client.cliche(Invoice, artifact_id="art_abc123")
result = cliche.extract(file="new_invoice.pdf")

CLI:

clichefactory extract invoice.pdf --schema schema.json --artifact-id art_abc123

REST (curl):

curl -X POST "https://api.clichefactory.com/v1/extract" \
  -H "X-API-KEY: cliche-..." \
  -F "file=@invoice.pdf" \
  -F "artifact_id=art_abc123"

In MCP, the LLM passes artifact_id to the extract tool. The trained pipeline already knows its data model, so a schema passed alongside it is used for local validation only.

Optimizer Tiers

The optimizer is selected automatically based on dataset size. Both tiers are available on every BYOK training run.

Tier    When to use                                     What it does
Tiny    < 75 examples; quick iteration and testing.     Lightweight optimizer pass tuned for low-data regimes. Fast turnaround, useful for validating that the schema and labels make sense before scaling.
MIPRO   ≥ 75 examples; production-quality pipelines.    Multi-step prompt + few-shot optimization for production accuracy. Recommended once you have ≥ 75 well-labeled examples.
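
The dataset-size rule in the table can be written down as a small helper for planning how many more examples to label. This is a sketch of the documented 75-example threshold only; the function names are hypothetical, and the hosted selection logic may take more into account than the example count.

```python
MIPRO_MIN_EXAMPLES = 75  # threshold stated in the tier table


def expected_tier(n_examples: int) -> str:
    """Return the tier the table associates with a dataset of this size."""
    if n_examples < 0:
        raise ValueError("n_examples must be non-negative")
    return "MIPRO" if n_examples >= MIPRO_MIN_EXAMPLES else "Tiny"


def examples_needed_for_mipro(n_examples: int) -> int:
    """How many more labeled examples are needed to reach the MIPRO range."""
    return max(0, MIPRO_MIN_EXAMPLES - n_examples)
```

For instance, a project with 30 labeled examples falls in the Tiny range and would need 45 more to reach the MIPRO range.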

Pricing: 3000 credits flat per training run, regardless of optimizer tier or dataset size. You also pay your LLM provider directly for tokens consumed by the optimizer.

See Pricing for credit-to-dollar conversion.

Train Locally (Download DSPy Examples)

Prefer to own the training loop? From any training run you can download a runnable DSPy bundle and optimize on your own machine with your own LLM key. We're DSPy-native and don't hide the framework — you get plain, editable code, not a black box.

What's in the bundle (a single .zip):

run_dspy_bundle.zip
├── README.md         # how to run it
├── requirements.txt  # dspy + pydantic
├── schema.py         # your Pydantic model
├── schema.json       # target JSON schema
├── train.py          # self-contained training script
├── inputs/           # one file per example (markdown)
└── ground_truth/     # your reviewed labels (.json)

Run it:

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python train.py --model "openai/gpt-4o-mini" --api-key "$YOUR_LLM_KEY"

The compiled program is written to trained_extractor.json. If inputs/ holds raw files (PDFs, images) instead of markdown, run pip install clichefactory and add the --parse flag to convert them locally with your key.

Only reviewed ground truth is included. Unreviewed items just mirror the base pipeline's own prediction (ground truth == prediction), so they carry no training signal and are skipped. Review documents in the web app, then download — re-downloads are free.
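
Before running train.py it can be worth checking that every input has a label. The sketch below pairs files in inputs/ with JSON files in ground_truth/ by file stem. That naming convention is an assumption, not something this page confirms, so check the bundle's README.md for the actual layout; pair_examples is a hypothetical helper, not part of the bundle.

```python
import json
from pathlib import Path


def pair_examples(bundle_dir: str) -> tuple[list[tuple[Path, dict]], list[str]]:
    """Return (pairs, unmatched_stems) for an unzipped bundle directory.

    Assumption: inputs/<stem>.<ext> is labeled by ground_truth/<stem>.json.
    """
    root = Path(bundle_dir)
    labels = {p.stem: p for p in (root / "ground_truth").glob("*.json")}
    pairs: list[tuple[Path, dict]] = []
    unmatched: list[str] = []
    for inp in sorted((root / "inputs").iterdir()):
        if not inp.is_file():
            continue  # skip nested directories
        label_path = labels.get(inp.stem)
        if label_path is None:
            unmatched.append(inp.stem)
            continue
        label = json.loads(label_path.read_text(encoding="utf-8"))
        pairs.append((inp, label))
    return pairs, unmatched
```

Any stems reported as unmatched are inputs without a ground-truth file under the assumed naming scheme.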

Honest caveat: train.py ships a generic DSPy signature and a simple scoring metric so it runs immediately. It is a starting point, not a clone of our hosted pipeline — ClicheFactory's cloud training uses a tuned signature, a graduated metric, and server-side optimizer tuning, so results will differ. The bundle's real value is the clean, reviewed ground truth and the zero-friction local setup.
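
To illustrate what a simple scoring metric can look like, here is a sketch of a field-level exact-match score over two dicts. It is neither the metric shipped in train.py nor the hosted graduated metric; the function name and the string normalization rule are assumptions made for this example.

```python
def field_match_score(gold: dict, pred: dict) -> float:
    """Fraction of gold fields whose predicted value matches.

    Strings are compared case-insensitively after trimming whitespace;
    every other type is compared with ==.
    """
    if not gold:
        return 1.0  # no labeled fields, so nothing can be wrong

    def norm(value):
        return value.strip().lower() if isinstance(value, str) else value

    hits = sum(
        1
        for key, expected in gold.items()
        if key in pred and norm(pred[key]) == norm(expected)
    )
    return hits / len(gold)
```

DSPy metrics are usually written as a function of (example, prediction, trace=None), so a score like this would typically be wrapped in a function of that shape before being handed to an optimizer.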

Pricing: 1250 credits flat per run to download the bundle (charged once per run; re-downloads free; paid plans only). Minimum 10 reviewed examples. See Pricing.
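
The two flat charges on this page add up as follows. This is a minimal sketch, assuming the training charge and the bundle-download charge are independent line items and ignoring provider token costs, which are billed outside ClicheFactory. The function and constant names are hypothetical.

```python
TRAINING_RUN_CREDITS = 3000     # flat, per training run
BUNDLE_DOWNLOAD_CREDITS = 1250  # flat, charged once per run; re-downloads are free
MIN_REVIEWED_EXAMPLES = 10      # minimum reviewed examples for a bundle download


def total_credits(training_runs: int, bundles_downloaded: int = 0) -> int:
    """Credits for a number of training runs plus first-time bundle downloads."""
    if training_runs < 0 or bundles_downloaded < 0:
        raise ValueError("counts must be non-negative")
    return (
        training_runs * TRAINING_RUN_CREDITS
        + bundles_downloaded * BUNDLE_DOWNLOAD_CREDITS
    )


def can_download_bundle(reviewed_examples: int, on_paid_plan: bool) -> bool:
    """Check the two stated preconditions for downloading a bundle."""
    return on_paid_plan and reviewed_examples >= MIN_REVIEWED_EXAMPLES
```

Under these assumptions, two training runs plus one bundle download come to 7250 credits.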

API

Training runs are managed through the web app. To use a trained artifact programmatically, pass artifact_id to the extract endpoint — the pipeline resolves its own schema and mode automatically.

Tips

  • Start with 20–30 labeled examples and the Tiny tier for a quick validation.
  • Accuracy improves significantly between 50 and 100 examples.
  • For documents where errors are costly, combine a trained artifact with robust mode for an additional verification pass.
  • You can assign different weights to fields in your schema to prioritize accuracy on must-have fields. High-weight fields are optimized first during training.
  • Artifacts are immutable and versioned — you can always roll back to a previous artifact.
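
The field-weight tip can be pictured as a weighted accuracy score, where a miss on a high-weight field costs more than a miss on a low-weight one. This is an illustrative sketch only: how ClicheFactory applies weights during training is not specified on this page, and the function name, the example weights, and the default weight of 1.0 are assumptions.

```python
def weighted_accuracy(gold: dict, pred: dict, weights: dict[str, float]) -> float:
    """Weighted share of gold fields predicted exactly.

    Fields missing from `weights` get a default weight of 1.0 (an assumption).
    """
    total = sum(weights.get(key, 1.0) for key in gold)
    if total == 0:
        return 0.0  # no weighted fields to score
    earned = sum(
        weights.get(key, 1.0)
        for key, expected in gold.items()
        if key in pred and pred[key] == expected
    )
    return earned / total
```

With total_amount weighted 3.0 and notes left at the default, getting only total_amount right scores 0.75, while getting only notes right scores 0.25.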