Overview
Document Digitization is a feature that converts unstructured data sources (such as PDFs, scanned documents, images, and other document formats) into clean, structured Markdown format. This conversion makes documents more accessible for processing, analysis, and integration into downstream workflows.
Category
Unstructured Data - This feature is designed to work with unstructured data sources including PDFs, scanned documents, images with text, and various document formats.
Input
The Document Digitization feature accepts data from the following input sources:
- File Upload - Upload unstructured files directly from your local system. Supported formats include PDFs, scanned documents, images with text content, Word documents, and other document formats.
- Live Data Connectors - Connect directly to live data sources without duplicating data:
  - Blob Storage: Amazon S3, Azure Blob Storage, Google Cloud Storage (GCS) buckets
  - Databases: Snowflake, Databricks, and other database systems
The input data source must be marked as “Unstructured Type” in your data room.
Output
Corvic Table - The Document Digitization feature produces a new Corvic Table containing the extracted Markdown and all images for each unstructured file, one file per row. Each row in the table represents one unstructured file from the input data source, with columns for:
- Extracted Markdown: The digitized Markdown content preserving document structure, headings, lists, tables, and formatting
- Images: All images extracted from the document, with references and metadata
The output Corvic Table provides a structured representation where each row corresponds to one input file, making it easy to process and analyze document content programmatically. The Markdown preserves the original document structure, and all images are included with proper references.
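
As a rough illustration of the row shape described above, the sketch below models one output row in plain Python. The class and field names (`DigitizedRow`, `ExtractedImage`, `source_file`, `reference`, `metadata`) are hypothetical and are not part of any Corvic API; the actual column names and types in a Corvic Table may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedImage:
    # Hypothetical: the reference used inside the Markdown to point at this image.
    reference: str
    # Hypothetical: free-form metadata such as a page number.
    metadata: dict = field(default_factory=dict)


@dataclass
class DigitizedRow:
    # Hypothetical: identifies the input file this row was produced from.
    source_file: str
    # The digitized Markdown content for the file.
    extracted_markdown: str
    # All images extracted from the file.
    images: list = field(default_factory=list)


# One row per input file.
row = DigitizedRow(
    source_file="report.pdf",
    extracted_markdown="# Quarterly Report\n\n![chart](img-1)\n",
    images=[ExtractedImage(reference="img-1", metadata={"page": 1})],
)

# Collect any extracted image that the Markdown never references.
unreferenced = [
    img.reference for img in row.images
    if img.reference not in row.extracted_markdown
]
```

A check like `unreferenced` is one way to confirm that the Markdown and the image column stay in step when rows are processed downstream.
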
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_data_source | string | Yes | The unstructured data source to digitize. Select a data source from your data room that contains documents, PDFs, scanned images, or other document formats. The data source must be marked as “Unstructured Type”. |
| output_name | string | No | Optional custom name for the output Corvic Table. If not provided, a default name will be automatically generated based on the input data source name. |
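
The parameter rules in the table can be sketched as a small validation helper. This is only an illustration under stated assumptions: the function `resolve_parameters`, the `source_type` argument, and the `_digitized` default suffix are hypothetical, since the actual default naming scheme is not specified here.

```python
from typing import Optional

REQUIRED_SOURCE_TYPE = "Unstructured Type"


def resolve_parameters(
    input_data_source: str,
    source_type: str,
    output_name: Optional[str] = None,
) -> dict:
    """Validate the two parameters and fill in a default output name."""
    if not input_data_source:
        raise ValueError("input_data_source is required")
    if source_type != REQUIRED_SOURCE_TYPE:
        raise ValueError(
            f"data source must be marked as {REQUIRED_SOURCE_TYPE!r}, "
            f"got {source_type!r}"
        )
    if not output_name:
        # Assumption: the real default name format is not documented here;
        # this suffix is purely illustrative.
        output_name = f"{input_data_source}_digitized"
    return {"input_data_source": input_data_source, "output_name": output_name}


params = resolve_parameters("contracts", "Unstructured Type")
```
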
Usage Example
To use Document Digitization in a Data App:
- Add your unstructured document data source to the Data App canvas
- Click the “+” button next to the data source
- Select “Document Digitization” from the actions menu
- Select the input data source (if not already selected)
- Optionally provide a name for the output Corvic Table
- Run the Data App to execute the digitization
- Review the generated Corvic Table, which contains the extracted Markdown and all images for each unstructured file, one file per row
Use Cases
Document Digitization is particularly useful for:
- Converting scanned documents and PDFs into searchable, processable Markdown
- Preparing documents for semantic search and analysis
- Extracting structured content from unstructured document formats
- Creating embedding spaces from document collections
- Enabling agent interactions with document content
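
For the search and embedding use cases above, extracted Markdown is often split into sections before indexing. The following is a minimal sketch of one way to do that, assuming the Markdown uses `#`-style headings; `split_by_heading` is a hypothetical helper, not a Corvic function, and it does not account for `#` lines inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list:
    """Split Markdown into (heading, body) sections at ATX headings."""
    sections = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Flush the final section.
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


sample = "# Invoice\nTotal due: 40 USD\n\n## Terms\nNet 30\n"
sections = split_by_heading(sample)
```

Splitting on headings keeps each chunk aligned with the document structure that the digitized Markdown preserves, which tends to give more coherent units for search or embedding than fixed-size windows.
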

