Overview

Data Sources are the foundation of your embedding generation pipeline. They represent the raw data that you upload to a data room for processing. The platform supports various data formats and structures.

Supported Data Formats

Structured Data

  • Parquet files: Recommended format for tabular data
  • CSV files: Comma-separated values
  • Database connections: Direct database access
  • Excel files: Spreadsheet data

Semi-Structured Data

  • JSON files: JavaScript Object Notation
  • XML files: Extensible Markup Language

Unstructured Data

  • Text documents: PDF, DOCX, TXT
  • Images: PNG, JPG, and other image formats

Uploading Data Sources

Via Web Interface

  1. Navigate to your data room
  2. Go to the “Data Sources” section
  3. Click “Upload Data Source”
  4. Select your file(s) or provide a connection string
  5. Wait for the upload and validation to complete

Via API

You can also upload data sources programmatically using the Corvic API:
```python
import corvic

client = corvic.Client(api_key="your-api-key")

# Upload a parquet file
client.upload_data_source(
    room_id="your-room-id",
    file_path="data.parquet",
    name="My Data Source",
)
```

Data Requirements

Data Quality

For best results, ensure your data is:
  • Clean: Remove duplicates and handle missing values
  • Well-structured: Follow consistent schemas
  • Complete: Include all necessary fields
  • Validated: Check for data type consistency
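The first two steps above can be sketched in plain Python with no platform dependencies. `clean_rows` and its field names are illustrative helpers, not part of the Corvic API:

```python
def clean_rows(rows, required_fields, fill_value=None):
    """Drop exact duplicate rows and fill missing required fields."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # duplicate row: skip it
        seen.add(key)
        fixed = dict(row)
        for field in required_fields:
            if fixed.get(field) is None:
                fixed[field] = fill_value  # handle missing value
        cleaned.append(fixed)
    return cleaned
```

For real datasets you would typically do this with a dataframe library before export, but the logic is the same: deduplicate first, then fill or drop incomplete records.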

Schema Definition

When uploading data, the platform will:
  • Automatically detect the schema
  • Identify entity and relation files
  • Validate data types
  • Suggest schema improvements
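As a toy illustration of what type detection involves, the sketch below infers a column-to-type mapping from sample rows and flags columns with inconsistent types. `infer_schema` is a hypothetical helper, not the platform's actual detector:

```python
def infer_schema(rows):
    """Infer a column -> Python type name mapping from sample rows."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            if val is None:
                continue  # nulls don't constrain a column's type
            tname = type(val).__name__
            if col not in schema:
                schema[col] = tname
            elif schema[col] != tname:
                schema[col] = "mixed"  # inconsistent types: clean before upload
    return schema
```

A column that comes back `"mixed"` is exactly the kind of inconsistency the data quality checklist above asks you to resolve before uploading.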

Ingestion Pipeline

Once uploaded, data sources go through an ingestion pipeline that:
  1. Validates the data format and structure
  2. Transforms the data into a standardized format
  3. Indexes the data for efficient access
  4. Makes the data available for Corvic Table creation

Pipeline Status

You can monitor the ingestion pipeline status:
  • Pending: Waiting to be processed
  • Processing: Currently being ingested
  • Completed: Successfully ingested and ready
  • Failed: Error during ingestion (check logs)
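A common pattern is to poll until the pipeline reaches a terminal state. The sketch below takes a `get_status` callable you supply (wrapping however your client exposes status; the exact API call is not shown here), so nothing Corvic-specific is hard-coded:

```python
import time

def wait_for_ingestion(get_status, poll_seconds=5, timeout_seconds=600):
    """Poll until the ingestion pipeline reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("Completed", "Failed"):
            return status  # terminal state reached
        time.sleep(poll_seconds)  # Pending or Processing: keep waiting
    raise TimeoutError("ingestion did not finish before the timeout")
```

On `"Failed"`, check the ingestion logs as noted above rather than retrying blindly.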

Managing Data Sources

Viewing Data Sources

View all data sources in a room:
  • List view with metadata
  • Search and filter capabilities
  • Preview data samples
  • View schema information

Updating Data Sources

  • Refresh: Re-upload updated data
  • Rename: Change the data source name
  • Delete: Remove the data source

Deleting a data source may affect Corvic Tables and spaces that depend on it. Review dependencies before deleting.

Best Practices

File Organization

  • Use descriptive file names
  • Group related data sources together
  • Document data source schemas
  • Version your data sources

Data Preparation

  • Clean data before uploading
  • Ensure consistent schemas across related files
  • Handle missing values appropriately
  • Validate data types

Performance

  • Use parquet format for large datasets
  • Compress files when possible
  • Split very large files into smaller chunks
  • Monitor ingestion pipeline performance
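Splitting a very large file can be scripted before upload. The sketch below chunks a CSV by row count using only the standard library; `split_csv` and its chunk-naming scheme are illustrative, not a platform convention:

```python
import csv
import os

def split_csv(path, rows_per_chunk, out_dir):
    """Split a large CSV into smaller chunk files, repeating the header."""
    chunk_paths = []
    out = None
    writer = None
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(
                    out_dir, f"chunk_{len(chunk_paths):04d}.csv"
                )
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)  # each chunk keeps the header row
            writer.writerow(row)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk can then be uploaded as its own data source; for large tabular data, converting chunks to parquet before upload is preferable per the recommendation above.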

Example: Financial Data

For financial data, you might have:

Entity Files:
  • accounts.parquet: Account information
  • customers.parquet: Customer details
  • transactions.parquet: Transaction records

Relation Files:
  • account_transactions.parquet: Links accounts to transactions
  • customer_accounts.parquet: Links customers to accounts

Sample Datasets

Explore the sample datasets for ready-made examples of entity and relation files.