Overview

Sanitize Parquet is a feature that processes structured data sources to extract their metadata and prepare them for further processing, such as data augmentation or text-to-SQL operations in agents. Despite its name, the feature is not limited to Parquet: it works with any structured data source, including Parquet files, CSV files, databases, time series data, knowledge graphs, and data warehouses.

Category

Structured Data - This feature is designed to work with structured data sources containing columns, tables, and relational data formats.

Input

The Sanitize Parquet feature accepts data from the following input sources:
  1. File Upload - Upload structured files directly from your local system. Supported formats include:
    • Parquet files: Columnar storage format optimized for analytics
    • CSV files: Comma-separated values and other delimited text files
    • Any data with columns: Tabular data formats with defined schemas
    Learn more: Data Sources - Upload and manage data sources
  2. Live Data Connectors - Connect directly to live data sources without duplicating data:
    • Databases: Direct connections to relational databases
    • Data Warehouses: Snowflake, Databricks, and other data warehouse systems
    • Time Series Data: Structured time-based data sources
    • Knowledge Graphs: Graph-structured data with relationships
    Learn more: ▶️ Live Data Connectors Tutorial - Connect directly to live data sources
The input data source must be marked as “Structured Type” in your data room. The feature works with any data that has columns and a defined schema.

Output

Corvic Table - The Sanitize Parquet feature produces a new Corvic Table with extracted metadata to prepare for further processing such as augmentation or text-to-SQL operations in agents. The output includes:
  • Schema Metadata: Column names, data types, and constraints extracted from the input data
  • Data Quality Metrics: Statistics and quality indicators for each column
  • Relationship Information: Foreign keys, relationships, and dependencies between tables
  • Processing Ready Format: Structured representation optimized for downstream operations
Downstream operations, including data augmentation, text-to-SQL query generation, and agent-based data operations, build on the metadata in the output Corvic Table when transforming and analyzing the data.
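
To make the kinds of metadata listed above more concrete, here is a minimal sketch in Python that derives schema metadata and simple quality metrics from CSV text. It is an illustration only, not Corvic's implementation: the function names `infer_type` and `extract_metadata`, the coarse type labels, and the chosen metrics are all assumptions, and relationship information such as foreign keys is not covered.

```python
import csv
import io


def infer_type(values):
    """Infer a coarse type label from a list of non-empty strings."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def extract_metadata(csv_text):
    """Return one metadata dict per column (hypothetical output layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    rows = list(reader)

    metadata = []
    for column in columns:
        values = [row[column] for row in rows]
        # Treat empty strings and missing fields as nulls.
        non_empty = [v for v in values if v not in (None, "")]
        metadata.append({
            "name": column,
            "type": infer_type(non_empty),
            "row_count": len(values),
            "null_count": len(values) - len(non_empty),
            "distinct_count": len(set(non_empty)),
        })
    return metadata


SAMPLE = "id,price,city\n1,9.5,Oslo\n2,,Bergen\n3,12,Oslo\n"

for column_info in extract_metadata(SAMPLE):
    print(column_info)
```

In this sketch the `price` column is labeled `float` with one null, because one of its three rows is empty and the remaining values do not all parse as integers.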

Parameters

Parameter | Type | Required | Description
input_data_source | string | Yes | The structured data source to sanitize. Select a data source from your data room that contains structured data with columns. Can be parquet files, CSV files, database connections, time series data, knowledge graphs, or data warehouse connections (Snowflake, Databricks, etc.). The data source must be marked as “Structured Type”.
output_name | string | No | Optional custom name for the output Corvic Table. If not provided, a default name will be automatically generated based on the input data source name.
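
The two parameters can be modeled as a small record with one required and one optional field. The sketch below is hypothetical and is not part of any Corvic SDK: the class `SanitizeParquetParams`, the method `resolved_output_name`, and especially the `_sanitized` suffix are assumptions, since the documentation does not state how the default output name is generated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SanitizeParquetParams:
    """Hypothetical container mirroring the documented parameters."""

    input_data_source: str             # required
    output_name: Optional[str] = None  # optional

    def __post_init__(self):
        # The input data source is required, so reject blank values early.
        if not self.input_data_source.strip():
            raise ValueError("input_data_source is required")

    def resolved_output_name(self) -> str:
        # ASSUMPTION: the real default naming scheme is not documented;
        # the "_sanitized" suffix is purely illustrative.
        if self.output_name:
            return self.output_name
        return f"{self.input_data_source}_sanitized"


params = SanitizeParquetParams(input_data_source="sales_2024")
print(params.resolved_output_name())
```

Keeping the default-name logic in one method means a caller can always ask for the effective output name without checking whether `output_name` was supplied.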

Usage Example

To use Sanitize Parquet in a Data App:
  1. Add your structured data source to the Data App canvas
  2. Click the “+” button next to the data source
  3. Select “Sanitize Parquet” from the actions menu
  4. Select the input data source (if not already selected)
  5. Optionally provide a name for the output Corvic Table
  6. Run the Data App to execute the sanitization
  7. Review the generated Corvic Table containing extracted metadata ready for further processing such as augmentation or text-to-SQL operations