Document Digitization - Corvic Platform

Overview

Document Digitization converts unstructured data sources (such as PDFs, scanned documents, images, and other document formats) into clean, structured Markdown. The conversion makes documents easier to process, analyze, and integrate into downstream workflows.

Input

The Document Digitization feature accepts data from the following input sources:

- File Upload: Upload unstructured files directly from your local system. Supported formats include PDFs, scanned documents, images with text content, Word documents, and other document formats.
  Learn more: Data Sources - Upload and manage data sources
- Live Data Connectors: Connect directly to live data sources without duplicating data:
  - Blob Storage: Amazon S3, Azure Blob Storage, Google Cloud Storage (GCS) buckets
  - Databases: Snowflake, Databricks, and other database systems
  Learn more: Live Data Connectors Tutorial - Connect directly to live data sources

The input data source must be marked as “Unstructured Type” in your data room.

Output

Corvic Table - Document Digitization produces a new Corvic Table with one row per unstructured file in the input data source. Each row holds that file's extracted Markdown and all of its images, in the following columns:

- Extracted Markdown: The digitized Markdown content preserving document structure, headings, lists, tables, and formatting
- Images: All images extracted from the document, with references and metadata

Because each row corresponds to exactly one input file, document content is easy to process and analyze programmatically. The Markdown preserves the original document structure, and all extracted images are included with their references.
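As an illustration of processing one output row programmatically, the sketch below checks that every image referenced in a row's Markdown is also present in that row's images. It is a minimal sketch under stated assumptions, not the platform's API: the `DigitizedRow` class, its field names, and the idea that images are referenced with standard Markdown image syntax (`![alt](ref)`) are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Matches Markdown image syntax ![alt](ref) and captures the reference.
# Assumption: the extracted Markdown references images this way.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


@dataclass
class DigitizedRow:
    """Hypothetical in-memory view of one output row (one input file)."""
    extracted_markdown: str                          # the "Extracted Markdown" column
    images: list[str] = field(default_factory=list)  # assumed: image references


def missing_image_refs(row: DigitizedRow) -> list[str]:
    """Return image references used in the Markdown but absent from row.images."""
    available = set(row.images)
    return [ref for ref in IMAGE_REF.findall(row.extracted_markdown)
            if ref not in available]


row = DigitizedRow(
    extracted_markdown=(
        "# Q3 Report\n\n![Revenue chart](img_0.png)\n\n"
        "See also ![Logo](img_9.png)."
    ),
    images=["img_0.png", "img_1.png"],
)
print(missing_image_refs(row))
```

How rows are exported from a Corvic Table is not covered here; the example starts from data already loaded into Python.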

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `input_data_source` | `string` | Yes | The unstructured data source to digitize. Select a data source from your data room that contains documents, PDFs, scanned images, or other document formats. The data source must be marked as “Unstructured Type”. |
| `output_name` | `string` | No | Optional custom name for the output Corvic Table. If not provided, a default name will be automatically generated based on the input data source name. |
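The two parameters can be modeled as a small validated structure, which may help when tracking digitization settings outside the UI. This is a sketch only: the `DigitizationParams` class is hypothetical, and the `_digitized` suffix is a placeholder, since the platform's actual default naming scheme is not documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DigitizationParams:
    """Hypothetical container mirroring the two documented parameters."""
    input_data_source: str             # required
    output_name: Optional[str] = None  # optional custom table name

    def __post_init__(self) -> None:
        # input_data_source is required, so reject empty or blank values.
        if not self.input_data_source.strip():
            raise ValueError("input_data_source is required")

    def resolved_output_name(self) -> str:
        """Return the custom name, or a placeholder default based on the input name."""
        if self.output_name:
            return self.output_name
        # Placeholder only: the real default name is generated by the platform.
        return f"{self.input_data_source}_digitized"


params = DigitizationParams(input_data_source="contracts")
print(params.resolved_output_name())
```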

Usage Example

To use Document Digitization in a Data App:

1. Add your unstructured document data source to the Data App canvas.
2. Click the “+” button next to the data source.
3. Select “Document Digitization” from the actions menu.
4. Select the input data source (if not already selected).
5. Optionally provide a name for the output Corvic Table.
6. Run the Data App to execute the digitization.
7. Review the generated Corvic Table, which contains one row per input file with its extracted Markdown and images.

Use Cases

Document Digitization is particularly useful for:

- Converting scanned documents and PDFs into searchable, processable Markdown
- Preparing documents for semantic search and analysis
- Extracting structured content from unstructured document formats
- Creating embedding spaces from document collections
- Enabling agent interactions with document content
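For semantic search or embedding use cases, a common preparation step is to split the extracted Markdown into sections before indexing. The sketch below shows one way to do that by splitting at headings. It is an illustrative approach using only the Python standard library, not a Corvic feature, and the function name `split_by_heading` is hypothetical.

```python
import re

# An ATX heading: 1 to 6 '#' characters, whitespace, then some text.
HEADING = re.compile(r"^#{1,6}[ \t]+\S")


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at every heading.

    Note: this simple version also treats '#' lines inside fenced code
    blocks as headings.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only blank lines.
    return [chunk for chunk in chunks if chunk]


doc = (
    "# Report\nIntro text.\n\n"
    "## Findings\n- Point A\n- Point B\n\n"
    "## Next steps\nTBD\n"
)
for chunk in split_by_heading(doc):
    print(repr(chunk))
```

Each chunk keeps its heading as the first line, so a downstream embedding or search step retains the section context.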

Related Documentation

- Data Apps: Learn how to build workflows using Data Apps.
- Multi-modal Knowledge Extraction: Extract structured knowledge from digitized documents.
- Corvic Tables: Understand how Corvic Tables work with digitized content.
- Data Sources: Learn how to upload and manage document data sources.
