
Overview

Document Digitization converts unstructured documents (such as PDFs, scanned documents, images, and other document formats) into clean, structured Markdown. The converted documents are easier to process, analyze, and integrate into downstream workflows.

Category

Unstructured Data - This feature is designed to work with unstructured data sources including PDFs, scanned documents, images with text, and various document formats.

Input

The Document Digitization feature accepts data from the following input sources:
  1. File Upload - Upload unstructured files directly from your local system. Supported formats include PDFs, scanned documents, images with text content, Word documents, and other document formats.
    Learn more: Data Sources - Upload and manage data sources
  2. Live Data Connectors - Connect directly to live data sources without duplicating data:
    • Blob Storage: Amazon S3, Azure Blob Storage, Google Cloud Storage (GCS) buckets
    • Databases: Snowflake, Databricks, and other database systems
    Learn more: ▶️ Live Data Connectors Tutorial - Connect directly to live data sources
The input data source must be marked as “Unstructured Type” in your data room.

Output

Corvic Table - The Document Digitization feature produces a new Corvic Table with one row per unstructured file in the input data source. Each row has columns for:
  • Extracted Markdown: The digitized Markdown content preserving document structure, headings, lists, tables, and formatting
  • Images: All images extracted from the document, with references and metadata
Because each row corresponds to exactly one input file, document content is easy to process and analyze programmatically. The Markdown preserves the original document structure, and every extracted image is included with its reference.
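As a rough illustration of the one-row-per-file shape described above, the sketch below models a single output row in Python and summarizes it. The class name `DigitizedRow`, the field names `extracted_markdown` and `images`, and the per-image dictionary are all assumptions made for this example; the actual column names and export format of a Corvic Table may differ.

```python
import re
from dataclasses import dataclass, field

@dataclass
class DigitizedRow:
    """Hypothetical in-memory view of one output row (one input file)."""
    extracted_markdown: str
    # Assumed shape: one dict of reference/metadata per extracted image.
    images: list = field(default_factory=list)

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6}[ \t]+\S.*$", re.MULTILINE)

def summarize(row: DigitizedRow) -> dict:
    """Return simple counts describing the digitized content of one row."""
    return {
        "characters": len(row.extracted_markdown),
        "headings": len(HEADING.findall(row.extracted_markdown)),
        "images": len(row.images),
    }

row = DigitizedRow(
    extracted_markdown="# Invoice\n\n## Line items\n\n- Widget\n",
    images=[{"ref": "image_0.png"}],  # placeholder reference, not a real file
)
summary = summarize(row)
print(summary)
```

The heading count is a plain regular-expression match, so heading-like lines inside fenced code blocks would also be counted; a full Markdown parser would be needed to exclude them.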

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| input_data_source | string | Yes | The unstructured data source to digitize. Select a data source from your data room that contains documents, PDFs, scanned images, or other document formats. The data source must be marked as “Unstructured Type”. |
| output_name | string | No | Optional custom name for the output Corvic Table. If not provided, a default name is generated automatically from the input data source name. |
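The required/optional split of the two parameters can be pictured with a small sketch. This is not a Corvic API: `DigitizationParams` and `resolve_output_name` are hypothetical names, and the `_digitized` suffix is a placeholder, since the actual default naming scheme is not documented here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DigitizationParams:
    """Hypothetical container mirroring the two documented parameters."""
    input_data_source: str             # required
    output_name: Optional[str] = None  # optional

    def __post_init__(self):
        # The input data source is required, so reject empty values early.
        if not self.input_data_source.strip():
            raise ValueError("input_data_source is required")

def resolve_output_name(params: DigitizationParams) -> str:
    """Use the custom name if given, else derive one from the input name."""
    # Assumption: the "_digitized" suffix only stands in for the real default.
    return params.output_name or f"{params.input_data_source}_digitized"

print(resolve_output_name(DigitizationParams("contracts")))
print(resolve_output_name(DigitizationParams("contracts", "contracts_md")))
```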

Usage Example

To use Document Digitization in a Data App:
  1. Add your unstructured document data source to the Data App canvas
  2. Click the “+” button next to the data source
  3. Select “Document Digitization” from the actions menu
  4. Select the input data source (if not already selected)
  5. Optionally provide a name for the output Corvic Table
  6. Run the Data App to execute the digitization
  7. Review the generated Corvic Table, which contains the extracted Markdown and all images for each input file, one file per row

Use Cases

Document Digitization is particularly useful for:
  • Converting scanned documents and PDFs into searchable, processable Markdown
  • Preparing documents for semantic search and analysis
  • Extracting structured content from unstructured document formats
  • Creating embedding spaces from document collections
  • Enabling agent interactions with document content
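For the semantic search and embedding use cases, one common preparation step is to split each document's Markdown into heading-delimited chunks before embedding them. The sketch below shows that idea only; it is not how Corvic builds embedding spaces, and `split_by_heading` is a hypothetical helper name.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6}[ \t]+.*$", re.MULTILINE)

def split_by_heading(markdown: str) -> list:
    """Split Markdown into chunks, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]  # drop empty chunks

doc = "Preamble\n\n# Terms\nNet 30.\n\n## Late fees\n2% monthly.\n"
for chunk in split_by_heading(doc):
    print(repr(chunk))
```

Splitting on headings keeps each chunk aligned with the document structure that the digitized Markdown preserves, which tends to give more coherent search results than fixed-size character windows.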