Dirty unstructured assets only become agent-safe after quarantine, provenance, and promotion rules—not one-shot extraction. Treat ingestion as a quality system with owner queues, not an ML weekend.
Executive snapshot
Multi-modal catalog ingestion: Handling dirty data
Retail & commerceCatalog fidelity
The move
Suppliers ship PDFs, images, and tables that must collapse into one governed product truth for humans and agents.
The friction
OCR and generative extraction create confident errors unless confidence, corroboration, and merge rules are first-class.
The product verdict
Ship pipelines with quarantine, diffable merges, and liability-tiered human review—speed without gates is technical debt.
Field note · Strategic dispatch · /dispatch/multimodal-catalog-ingestion-dirty-data