Overview
Data Sources are the foundation of your embedding generation pipeline. They represent the raw data that you upload to a data room for processing. The platform supports various data formats and structures.
Supported Data Formats
Structured Data
- Parquet files: Recommended format for tabular data
- CSV files: Comma-separated values
- Database connections: Direct database access
- Excel files: Spreadsheet data
Semi-Structured Data
- JSON files: JavaScript Object Notation
- XML files: Extensible Markup Language
Unstructured Data
- Text documents: PDF, DOCX, TXT
- Images: PNG, JPG, and other image formats
Uploading Data Sources
Via Web Interface
- Navigate to your data room
- Go to the “Data Sources” section
- Click “Upload Data Source”
- Select your file(s) or provide a connection string
- Wait for the upload and validation to complete
Via API
You can also upload data sources programmatically using the Corvic API.
Data Requirements
Data Quality
For best results, ensure your data is:
- Clean: Remove duplicates and handle missing values
- Well-structured: Follow consistent schemas
- Complete: Include all necessary fields
- Validated: Check for data type consistency
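The cleaning steps above can be sketched in code. The function below is a minimal, illustrative example of pre-upload cleanup — it removes exact duplicates, drops rows missing a required field, and fills remaining gaps with a placeholder. The column names and placeholder are assumptions for the example, not platform requirements:

```python
import csv
import io

def clean_rows(raw_csv: str, required=("id",), default=""):
    """De-duplicate rows and handle missing values before upload."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    seen, cleaned = set(), []
    for row in reader:
        # Drop rows that are missing a required field entirely.
        if any(not row.get(col) for col in required):
            continue
        # Fill remaining missing values with a placeholder.
        row = {k: (v if v not in (None, "") else default) for k, v in row.items()}
        # Skip exact duplicates of rows we have already kept.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

For example, `clean_rows("id,name\n1,Ada\n1,Ada\n,NoId\n2,\n")` keeps one copy of the `Ada` row, drops the row with no `id`, and fills the missing name on row `2`.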
Schema Definition
When uploading data, the platform will:
- Automatically detect the schema
- Identify entity and relation files
- Validate data types
- Suggest schema improvements
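As a rough illustration of what automatic schema detection involves, the sketch below guesses a type for each column from a CSV sample. This is not the platform's actual detection logic, which is not documented here:

```python
import csv
import io

def infer_schema(sample_csv: str) -> dict:
    """Guess a column type for each field from a small data sample."""
    rows = list(csv.DictReader(io.StringIO(sample_csv)))
    schema = {}
    for col in rows[0]:
        # Ignore empty cells when inferring a type.
        values = [r[col] for r in rows if r[col] != ""]
        if values and all(_parses(int, v) for v in values):
            schema[col] = "integer"
        elif values and all(_parses(float, v) for v in values):
            schema[col] = "float"
        else:
            schema[col] = "string"
    return schema

def _parses(cast, value):
    """Return True if `cast` accepts the string value."""
    try:
        cast(value)
        return True
    except ValueError:
        return False
```

Running it on a sample like `"id,price,name\n1,9.5,a\n2,3,b\n"` infers `integer`, `float`, and `string` for the three columns.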
Ingestion Pipeline
Once uploaded, data sources go through an ingestion pipeline that:
- Validates the data format and structure
- Transforms the data into a standardized format
- Indexes the data for efficient access
- Makes the data available for Corvic Table creation
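The stages above can be sketched as a single function. This is a schematic illustration only; the platform's real pipeline internals are not public:

```python
def ingest(records, schema):
    """Schematic ingestion pipeline: validate, transform, index."""
    # 1. Validate: every record must match the declared schema fields.
    for r in records:
        if set(r) != set(schema):
            raise ValueError(f"schema mismatch: {r}")
    # 2. Transform: cast raw string values into standardized types.
    casts = {"integer": int, "float": float, "string": str}
    transformed = [
        {k: casts[schema[k]](v) for k, v in r.items()} for r in records
    ]
    # 3. Index: key each record by the first schema field (insertion
    #    order of the schema dict) for efficient access.
    key = next(iter(schema))
    index = {r[key]: r for r in transformed}
    # 4. The indexed data is now ready for Corvic Table creation.
    return index
```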
Pipeline Status
You can monitor the ingestion pipeline status:
- Pending: Waiting to be processed
- Processing: Currently being ingested
- Completed: Successfully ingested and ready
- Failed: Error during ingestion (check logs)
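When scripting uploads, you will typically poll until the pipeline reaches a terminal state. The helper below is a generic sketch: `get_status` stands in for whatever status call your API client makes, since the endpoint itself is not specified here:

```python
import time

# Statuses that end the pipeline, per the list above.
TERMINAL = {"Completed", "Failed"}

def wait_for_ingestion(get_status, poll_seconds=5.0, timeout=600.0):
    """Poll a data source's pipeline status until it finishes or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL:
            return status            # "Completed", or "Failed" (check logs)
        time.sleep(poll_seconds)     # "Pending" / "Processing": keep waiting
    raise TimeoutError("ingestion did not finish in time")
```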
Managing Data Sources
Viewing Data Sources
View all data sources in a room:
- List view with metadata
- Search and filter capabilities
- Preview data samples
- View schema information
Updating Data Sources
- Refresh: Re-upload updated data
- Rename: Change the data source name
- Delete: Remove the data source (affects dependent Corvic Tables)
Best Practices
File Organization
- Use descriptive file names
- Group related data sources together
- Document data source schemas
- Version your data sources
Data Preparation
- Clean data before uploading
- Ensure consistent schemas across related files
- Handle missing values appropriately
- Validate data types
Performance
- Use parquet format for large datasets
- Compress files when possible
- Split very large files into smaller chunks
- Monitor ingestion pipeline performance
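The chunking advice can be implemented with a small script. The sketch below splits a large CSV into parts, repeating the header in each; the default chunk size is an arbitrary assumption you should tune, and for large tabular data parquet remains the preferred format:

```python
import csv

def split_csv(path, rows_per_chunk=1_000_000):
    """Split a very large CSV into smaller chunk files, repeating the header."""
    written = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, n = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                written.append(_write_chunk(path, n, header, chunk))
                chunk, n = [], n + 1
        if chunk:  # flush any remaining rows
            written.append(_write_chunk(path, n, header, chunk))
    return written

def _write_chunk(path, n, header, rows):
    """Write one chunk file alongside the source, e.g. big.csv.part0000.csv."""
    out = f"{path}.part{n:04d}.csv"
    with open(out, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return out
```

Each chunk can then be uploaded as its own data source.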
Example: Financial Data
For financial data, you might have:
Entity Files:
- accounts.parquet: Account information
- customers.parquet: Customer details
- transactions.parquet: Transaction records
Relation Files:
- account_transactions.parquet: Links accounts to transactions
- customer_accounts.parquet: Links customers to accounts
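To see how the relation files tie the entities together, the sketch below follows a customer through the two link files to their transactions. The column names mirror the example file names but are illustrative, not a required schema:

```python
def transactions_for_customer(customer_id, customer_accounts,
                              account_transactions, transactions):
    """Resolve a customer's transactions via the two relation files."""
    # customer_accounts links customers to accounts.
    account_ids = {
        r["account_id"] for r in customer_accounts
        if r["customer_id"] == customer_id
    }
    # account_transactions links accounts to transactions.
    txn_ids = {
        r["transaction_id"] for r in account_transactions
        if r["account_id"] in account_ids
    }
    return [t for t in transactions if t["transaction_id"] in txn_ids]
```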
Sample Datasets
Explore the sample datasets to see these formats in practice.