# Sample Datasets

> Explore sample datasets to kick-start your embedding generation spaces

## Overview

This section presents sample datasets that Corvic recommends for kick-starting your embedding generation spaces. The datasets cover a variety of use cases and show the range of data the Corvic platform can handle. They are provided in `.parquet` format and can be downloaded from the [corvic-sample-datasets](https://drive.google.com/drive/folders/1XR2i3ilv1L6MZtT6upRjp13p4MEeFHM_) Google Drive folder.

## LDBC-SF0.01

This sample dataset is derived from the LDBC Financial Benchmark, as detailed in the [specification document](https://ldbcouncil.org/ldbc_finbench_docs/ldbc-finbench-specification.pdf). The primary goal of this benchmark is to establish a standard that captures the unique data and query patterns prevalent in the financial industry.

### Schema Definition

**Entity (Dimension) Files:**

* **Person.parquet**: Real-world individuals
* **Company.parquet**: Entities that people or other companies invest in
* **Account.parquet**: Financial systems registered and owned by persons and companies
* **Loan.parquet**: Loans applied for by individuals and companies
* **Medium.parquet**: Media used to sign in to an account (IP address, MAC address, phone number)

**Relation (Fact) Files:**

* **AccountTransferAccount.parquet**: Fund transfers between accounts
* **AccountWithdrawAccount.parquet**: Funds moved from one card account to another
* **AccountRepayLoan.parquet**: Loan repayment from an account
* **LoanDepositAccount.parquet**: Loan fund deposited to an account
* **MediumSignInAccount.parquet**: Account signed in to with a medium
* **CompanyInvestCompany.parquet**: Company invests in a company
* **PersonInvestCompany.parquet**: Person invests in a company
* **CompanyApplyLoan.parquet**: Company applies for a loan
* **PersonApplyLoan.parquet**: Person applies for a loan
* **CompanyGuaranteeCompany.parquet**: Company guarantees another company
* **PersonGuaranteePerson.parquet**: Person guarantees another person
* **CompanyOwnAccount.parquet**: Company owns an account
* **PersonOwnAccount.parquet**: Person owns an account
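
The relation file names above appear to follow a `SourceVerbTarget` pattern built from the entity names. The following minimal sketch, which assumes that pattern holds for every relation file, recovers which entities a relation connects. The helper `parse_relation_name` is hypothetical and not part of any Corvic API.

```python
# Entity names taken from the entity (dimension) file list above.
ENTITIES = ("Person", "Company", "Account", "Loan", "Medium")


def parse_relation_name(filename: str) -> tuple[str, str, str]:
    """Split a relation file name into (source entity, verb, target entity).

    Assumes the SourceVerbTarget naming pattern seen in the file list and
    raises ValueError when the name does not fit that pattern.
    """
    stem = filename.removesuffix(".parquet")
    for source in ENTITIES:
        for target in ENTITIES:
            if (
                stem.startswith(source)
                and stem.endswith(target)
                and len(stem) > len(source) + len(target)  # a verb must remain
            ):
                verb = stem[len(source):len(stem) - len(target)]
                return source, verb, target
    raise ValueError(f"not a Source-Verb-Target relation name: {filename!r}")


print(parse_relation_name("PersonOwnAccount.parquet"))     # ('Person', 'Own', 'Account')
print(parse_relation_name("MediumSignInAccount.parquet"))  # ('Medium', 'SignIn', 'Account')
```

Reading the names this way gives a quick picture of the graph: entities become node types and each relation file becomes an edge type between two of them.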

### Sample Embedding Space

The Corvic Table incorporates all primary entities from the LDBC-SF dataset as the entities of interest for embedding. These entities can be embedded using our graph transformation and structural encoding algorithm. The UMAP plot presents graph structural embeddings consolidated into a single vector space.

## AmazonReviews

This collection comprises more than 34,000 consumer reviews for various Amazon products, including popular items such as the Kindle and Fire TV Stick. These reviews are sourced from a [Kaggle](https://www.kaggle.com/datasets/datafiniti/consumer-reviews-of-amazon-products) project, which uses a representative sample of a larger dataset that is made available by [Datafiniti's Product Database](https://www.datafiniti.co/products/product-data).

### Schema Definition

**Entity (Dimension) Files:**

* **products.parquet**: Information about the products: name, brand, category, etc.
* **users.parquet**: Information about the customers: username, age, state, etc.

**Relation (Fact) Files:**

* **reviews.parquet**: Customer review of a product: title, date, text, etc.
* **transaction.parquet**: Product purchase transactions made by users.
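
To illustrate how a relation file links two entity files, here is a minimal sketch that groups transaction rows into user-to-product and product-to-user links. The rows are made-up stand-ins for `transaction.parquet`, and the column names `user_id` and `product_id` are assumptions, so check the actual schema before adapting it.

```python
from collections import defaultdict

# Hypothetical rows standing in for transaction.parquet. The column names
# user_id and product_id are assumptions, not the file's documented schema.
TRANSACTIONS = [
    {"user_id": "u1", "product_id": "kindle"},
    {"user_id": "u1", "product_id": "fire-tv-stick"},
    {"user_id": "u2", "product_id": "kindle"},
]


def link(rows, source_key, target_key):
    """Map each source id to the sorted list of target ids it is linked to."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[source_key]].add(row[target_key])
    return {source: sorted(targets) for source, targets in grouped.items()}


products_by_user = link(TRANSACTIONS, "user_id", "product_id")
users_by_product = link(TRANSACTIONS, "product_id", "user_id")

print(products_by_user)  # {'u1': ['fire-tv-stick', 'kindle'], 'u2': ['kindle']}
print(users_by_product)  # {'kindle': ['u1', 'u2'], 'fire-tv-stick': ['u1']}
```

With real data, the rows would come from reading the Parquet file (for example with `pandas.read_parquet`) rather than from a hard-coded list.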

### Sample Embedding Space

The Corvic Table represents a mixture of users and products, linked by transactions. The UMAP plot illustrates the structural embeddings of these entities in a unified vector space.

## ZacharyKarateClub

The Zachary karate club network is an undirected social network collected by Wayne Zachary in 1977. Each node represents a club member, and each edge represents a tie between two members. It is often used as a benchmark for recovering the two groups into which the club split after a dispute between two teachers. The network, introduced in Zachary's paper and later popularized by Girvan and Newman in 2002, includes 34 nodes, 78 undirected edges (156 when each edge is counted once per direction), and 2 classes.

Dataset page (KONECT): [http://konect.cc/networks/ucidata-zachary/](http://konect.cc/networks/ucidata-zachary/)

### Schema Definition

**Entity (Dimension) Files:**

* **adherent.parquet**: Information about the club member: id, name, club.

**Relation (Fact) Files:**

* **relation.parquet**: Connections between club members
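
Because the network is undirected, a relation table may store a tie once or once per direction, which changes the row count but not the number of ties. The sketch below shows one way to collapse such rows into unique undirected ties. The toy rows and the column names `source` and `target` are assumptions and do not reflect the documented schema of `relation.parquet`.

```python
from collections import Counter

# Hypothetical rows standing in for relation.parquet; the column names
# source and target are assumptions. Some ties appear in both directions.
ROWS = [
    {"source": 1, "target": 2},
    {"source": 2, "target": 1},
    {"source": 1, "target": 3},
    {"source": 3, "target": 1},
    {"source": 2, "target": 3},
]


def undirected_ties(rows):
    """Collapse directed rows into a set of unique, unordered member pairs."""
    return {
        frozenset((row["source"], row["target"]))
        for row in rows
        if row["source"] != row["target"]  # skip self-loops
    }


def degrees(ties):
    """Count how many ties each member takes part in."""
    counts = Counter()
    for tie in ties:
        counts.update(tie)
    return dict(counts)


ties = undirected_ties(ROWS)
print(len(ROWS), "rows ->", len(ties), "undirected ties")  # 5 rows -> 3 undirected ties
print(sorted(degrees(ties).items()))  # [(1, 2), (2, 2), (3, 2)]
```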

### Sample Embedding Space

The Corvic Table represents club members and their connections. The UMAP plot illustrates the structural embeddings of these entities in a unified vector space.

## Getting Started with Sample Datasets

1. **Download Datasets**: Access datasets from the [corvic-sample-datasets](https://drive.google.com/drive/folders/1XR2i3ilv1L6MZtT6upRjp13p4MEeFHM_) folder
2. **Upload to Corvic**: Upload the parquet files to a data room
3. **Create Corvic Tables**: Define entities to embed
4. **Generate Spaces**: Create embedding spaces
5. **Analyze Results**: Use visualization tools and quality metrics
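
Between the download and upload steps, it can help to confirm that each downloaded file really is a Parquet file and not, for instance, a saved HTML error page. The sketch below relies on the fact that unencrypted Parquet files begin and end with the 4-byte magic `PAR1`. The helper names and the local folder path are hypothetical, and this is only a quick sanity check, not a full validation of file contents.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: Path) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def check_downloads(folder: Path) -> dict[str, bool]:
    """Map each *.parquet file name in the folder to its magic-byte check."""
    return {p.name: looks_like_parquet(p) for p in sorted(folder.glob("*.parquet"))}


# Example call, assuming the files were saved to a local folder of this name:
# print(check_downloads(Path("corvic-sample-datasets")))
```

Any file that maps to `False` is worth downloading again before uploading it to a data room.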

<Card title="Quickstart Guide" icon="rocket" href="/get-started/quickstart">
  Follow the quickstart guide to get started with sample datasets.
</Card>

## Related Concepts

<CardGroup cols={2}>
  <Card title="Data Sources" icon="database" href="/concepts/data-sources">
    Learn about uploading data sources.
  </Card>

  <Card title="Corvic Tables" icon="eye" href="/concepts/feature-views">
    Learn about creating Corvic Tables for distributed processing.
  </Card>

  <Card title="Spaces" icon="layer-group" href="/concepts/spaces">
    Learn about generating embedding spaces.
  </Card>
</CardGroup>
