Overview

Data Sources are the foundation of your embedding generation pipeline. They represent the raw data that you upload to a data room for processing. The platform supports various data formats and structures.

Supported Data Formats

Structured Data

  • Parquet files: Recommended format for tabular data
  • CSV files: Comma-separated values
  • Database connections: Direct database access
  • Excel files: Spreadsheet data

Semi-Structured Data

  • JSON files: JavaScript Object Notation
  • XML files: Extensible Markup Language

Unstructured Data

  • Text documents: PDF, DOCX, TXT
  • Images: PNG, JPG, and other image formats

Uploading Data Sources

Via Web Interface

  1. Navigate to your data room
  2. Go to the “Data Sources” section
  3. Click “Upload Data Source”
  4. Select your file(s) or provide a connection string
  5. Wait for the upload and validation to complete

Via API

You can also upload data sources programmatically using the Corvic API:
```python
import corvic

client = corvic.Client(api_key="your-api-key")

# Upload a parquet file
client.upload_data_source(
    room_id="your-room-id",
    file_path="data.parquet",
    name="My Data Source",
)
```

Data Requirements

Data Quality

For best results, ensure your data is:
  • Clean: Remove duplicates and handle missing values
  • Well-structured: Follow consistent schemas
  • Complete: Include all necessary fields
  • Validated: Check for data type consistency
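The first two steps above can be sketched in plain Python with no platform dependencies. `clean_rows` and its field names are illustrative helpers, not part of the Corvic API:

```python
def clean_rows(rows, required_fields, fill_value=None):
    """Drop exact duplicate rows and fill missing required fields."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # duplicate row: skip it
        seen.add(key)
        fixed = dict(row)
        for field in required_fields:
            if fixed.get(field) is None:
                fixed[field] = fill_value  # handle missing value
        cleaned.append(fixed)
    return cleaned
```

For real datasets you would typically do this with a dataframe library before export, but the logic is the same: deduplicate first, then fill or drop incomplete records.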

Schema Definition

When uploading data, the platform will:
  • Automatically detect the schema
  • Identify entity and relation files
  • Validate data types
  • Suggest schema improvements
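As a toy illustration of what type detection involves, the sketch below infers a column-to-type mapping from sample rows and flags columns with inconsistent types. `infer_schema` is a hypothetical helper, not the platform's actual detector:

```python
def infer_schema(rows):
    """Infer a column -> Python type name mapping from sample rows."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            if val is None:
                continue  # nulls don't constrain a column's type
            tname = type(val).__name__
            if col not in schema:
                schema[col] = tname
            elif schema[col] != tname:
                schema[col] = "mixed"  # inconsistent types: clean before upload
    return schema
```

A column that comes back `"mixed"` is exactly the kind of inconsistency the data quality checklist above asks you to resolve before uploading.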

Ingestion Pipeline

Once uploaded, data sources go through an ingestion pipeline that:
  1. Validates the data format and structure
  2. Transforms the data into a standardized format
  3. Indexes the data for efficient access
  4. Makes the data available for Corvic Table creation

Pipeline Status

You can monitor the ingestion pipeline status:
  • Pending: Waiting to be processed
  • Processing: Currently being ingested
  • Completed: Successfully ingested and ready
  • Failed: Error during ingestion (check logs)
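A common pattern is to poll until the pipeline reaches a terminal state. The sketch below takes a `get_status` callable you supply (wrapping however your client exposes status; the exact API call is not shown here), so nothing Corvic-specific is hard-coded:

```python
import time

def wait_for_ingestion(get_status, poll_seconds=5, timeout_seconds=600):
    """Poll until the ingestion pipeline reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("Completed", "Failed"):
            return status  # terminal state reached
        time.sleep(poll_seconds)  # Pending or Processing: keep waiting
    raise TimeoutError("ingestion did not finish before the timeout")
```

On `"Failed"`, check the ingestion logs as noted above rather than retrying blindly.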

Managing Data Sources

Viewing Data Sources

View all data sources in a room:
  • List view with metadata
  • Search and filter capabilities
  • Preview data samples
  • View schema information

Updating Data Sources

  • Refresh: Re-upload updated data
  • Rename: Change the data source name
  • Delete: Remove the data source

Deleting a data source may affect Corvic Tables and spaces that depend on it. Review dependencies before deleting.

Best Practices

File Organization

  • Use descriptive file names
  • Group related data sources together
  • Document data source schemas
  • Version your data sources

Data Preparation

  • Clean data before uploading
  • Ensure consistent schemas across related files
  • Handle missing values appropriately
  • Validate data types

Performance

  • Use parquet format for large datasets
  • Compress files when possible
  • Split very large files into smaller chunks
  • Monitor ingestion pipeline performance
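Splitting a very large file can be scripted before upload. The sketch below chunks a CSV by row count using only the standard library; `split_csv` and its chunk-naming scheme are illustrative, not a platform convention:

```python
import csv
import os

def split_csv(path, rows_per_chunk, out_dir):
    """Split a large CSV into smaller chunk files, repeating the header."""
    chunk_paths = []
    out = None
    writer = None
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(
                    out_dir, f"chunk_{len(chunk_paths):04d}.csv"
                )
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)  # each chunk keeps the header row
            writer.writerow(row)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk can then be uploaded as its own data source; for large tabular data, converting chunks to parquet before upload is preferable per the recommendation above.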

Example: Financial Data

For financial data, you might have:

Entity Files:
  • accounts.parquet: Account information
  • customers.parquet: Customer details
  • transactions.parquet: Transaction records

Relation Files:
  • account_transactions.parquet: Links accounts to transactions
  • customer_accounts.parquet: Links customers to accounts

Sample Datasets

Explore the sample datasets for ready-made examples of entity and relation files.