Sanitize Parquet

Overview

Sanitize Parquet processes a structured data source, extracts its metadata, and prepares the data for further processing such as data augmentation or text-to-SQL operations in agents. Despite its name, the feature works with any structured data format, including Parquet files, CSV files, databases, time series data, knowledge graphs, and data warehouses.

Input

The Sanitize Parquet feature accepts data from the following input sources:

File Upload - Upload structured files directly from your local system. Supported formats include:
- Parquet files: Columnar storage format optimized for analytics
- CSV files: Comma-separated values and other delimited text files
- Any data with columns: Tabular data formats with defined schemas

Learn more: Data Sources - Upload and manage data sources

Live Data Connectors - Connect directly to live data sources without duplicating data:
- Databases: Direct connections to relational databases
- Data Warehouses: Snowflake, Databricks, and other data warehouse systems
- Time Series Data: Structured time-based data sources
- Knowledge Graphs: Graph-structured data with relationships

Learn more: Live Data Connectors Tutorial - Connect directly to live data sources

The input data source must be marked as “Structured Type” in your data room. The feature works with any data that has columns and a defined schema.

Output

Corvic Table - Sanitize Parquet produces a new Corvic Table containing the extracted metadata, ready for further processing such as augmentation or text-to-SQL operations in agents. The output includes:

- Schema Metadata: Column names, data types, and constraints extracted from the input data
- Data Quality Metrics: Statistics and quality indicators for each column
- Relationship Information: Foreign keys, relationships, and dependencies between tables
- Processing Ready Format: Structured representation optimized for downstream operations

The metadata in the output Corvic Table is the foundation for downstream operations, including data augmentation, text-to-SQL query generation, and agent-based data operations.
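To make the kinds of metadata listed above more concrete, here is a minimal sketch of per-column profiling over CSV text, using only the Python standard library. It is not the Corvic implementation: the function names `infer_type` and `profile_csv`, the three-way type inference, and the choice of statistics are all assumptions made for illustration.

```python
import csv
import io


def infer_type(values):
    """Return 'int', 'float', or 'string' for a list of non-empty strings.

    Illustrative inference only; a real system would likely handle dates,
    booleans, and other types as well.
    """
    kinds = set()
    for value in values:
        try:
            int(value)
            kinds.add("int")
        except ValueError:
            try:
                float(value)
                kinds.add("float")
            except ValueError:
                return "string"
    if not kinds:
        return "unknown"  # column had no non-empty values
    return "float" if "float" in kinds else "int"


def profile_csv(text):
    """Collect a simple schema and quality profile for each column."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    profile = {}
    for name in columns:
        raw = [row[name] for row in rows]
        # Treat empty strings and missing trailing fields as nulls.
        present = [v for v in raw if v not in ("", None)]
        profile[name] = {
            "type": infer_type(present),
            "null_count": len(raw) - len(present),
            "distinct_count": len(set(present)),
        }
    return profile


# Hypothetical sample data, not taken from any real data source.
SAMPLE = "id,price,city\n1,9.5,Oslo\n2,,Lima\n3,12,Oslo\n"
profile = profile_csv(SAMPLE)
for column, stats in profile.items():
    print(column, stats)
```

A profile like this covers only the schema and data quality parts of the output; relationship information such as foreign keys would need knowledge of more than one table.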

Parameters

Parameter	Type	Required	Description
`input_data_source`	`string`	Yes	The structured data source to sanitize. Select a data source from your data room that contains structured data with columns: Parquet files, CSV files, database connections, time series data, knowledge graphs, or data warehouse connections (Snowflake, Databricks, etc.). The data source must be marked as “Structured Type”.
`output_name`	`string`	No	Custom name for the output Corvic Table. If omitted, a default name is generated from the input data source name.
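The two parameters can be modeled as a small validated structure. The sketch below is illustrative only: the class `SanitizeParquetParams`, its method name, and in particular the `_sanitized` suffix used for the default name are assumptions, since the documentation does not state the actual naming pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SanitizeParquetParams:
    """Hypothetical container mirroring the parameter table above."""

    input_data_source: str
    output_name: Optional[str] = None

    def __post_init__(self):
        # input_data_source is the only required parameter.
        if not self.input_data_source:
            raise ValueError("input_data_source is required")

    def resolved_output_name(self) -> str:
        """Return the custom name if given, else a generated default."""
        if self.output_name:
            return self.output_name
        # Assumed pattern; the real default name may be formed differently.
        return f"{self.input_data_source}_sanitized"


# Hypothetical data source names for illustration.
default_case = SanitizeParquetParams("sales_2024")
custom_case = SanitizeParquetParams("sales_2024", output_name="clean_sales")
print(default_case.resolved_output_name())
print(custom_case.resolved_output_name())
```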

Usage Example

To use Sanitize Parquet in a Data App:

1. Add your structured data source to the Data App canvas
2. Click the “+” button next to the data source
3. Select “Sanitize Parquet” from the actions menu
4. Select the input data source (if not already selected)
5. Optionally provide a name for the output Corvic Table
6. Run the Data App to execute the sanitization
7. Review the generated Corvic Table, which contains the extracted metadata ready for further processing such as augmentation or text-to-SQL operations

Related Documentation

Data Apps

Learn how to build workflows using Data Apps.

Corvic Tables

Understand how Corvic Tables work with sanitized structured data.

Agents

Use sanitized data for text-to-SQL operations in agents.

Data Sources

Learn how to upload and manage structured data sources.
