Overview

This section presents sample datasets that Corvic recommends for getting started with embedding spaces. The datasets cover a variety of use cases and show the range of what the Corvic platform can do. They are provided in .parquet format and can be downloaded from the corvic-sample-datasets Google Drive folder.

LDBC-SF0.01

This sample dataset is derived from the LDBC Financial Benchmark, as detailed in the specification document. The primary goal of this benchmark is to establish a standard that captures the unique data and query patterns prevalent in the financial industry.

Schema Definition

Entity (Dimension) Files:
  • Person.parquet: Real-world individuals
  • Company.parquet: Entities that people or other companies invest in
  • Account.parquet: Accounts registered in a financial system and owned by persons or companies
  • Loan.parquet: Loans applied for by persons and companies
  • Medium.parquet: Media used to sign in to an account (IP address, MAC address, phone number)
Relation (Fact) Files:
  • AccountTransferAccount.parquet: Fund transfers between accounts
  • AccountWithdrawAccount.parquet: Funds withdrawn from an account to another account of card type
  • AccountRepayLoan.parquet: Loan repayment from an account
  • LoanDepositAccount.parquet: Loan fund deposited to an account
  • MediumSignInAccount.parquet: Account signed in to with a medium
  • CompanyInvestCompany.parquet: Company invests in a company
  • PersonInvestCompany.parquet: Person invests in a company
  • CompanyApplyLoan.parquet: Company applies for a loan
  • PersonApplyLoan.parquet: Person applies for a loan
  • CompanyGuaranteeCompany.parquet: Company guarantees another
  • PersonGuaranteePerson.parquet: Person guarantees another
  • CompanyOwnAccount.parquet: Company owns an account
  • PersonOwnAccount.parquet: Person owns an account
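
The relation file names listed above appear to follow a Source + Verb + Target pattern built from the entity names. The Python sketch below (assuming Python 3.9 or later) uses that observation to derive which entity types each relation file connects. The helper name `split_relation_name` is hypothetical, and the pattern is inferred from the file names only, not from a documented Corvic or LDBC convention.

```python
# Entity names taken from the Entity (Dimension) file list.
ENTITIES = ("Person", "Company", "Account", "Loan", "Medium")

# Relation file names taken from the Relation (Fact) file list.
RELATION_FILES = [
    "AccountTransferAccount.parquet",
    "AccountWithdrawAccount.parquet",
    "AccountRepayLoan.parquet",
    "LoanDepositAccount.parquet",
    "MediumSignInAccount.parquet",
    "CompanyInvestCompany.parquet",
    "PersonInvestCompany.parquet",
    "CompanyApplyLoan.parquet",
    "PersonApplyLoan.parquet",
    "CompanyGuaranteeCompany.parquet",
    "PersonGuaranteePerson.parquet",
    "CompanyOwnAccount.parquet",
    "PersonOwnAccount.parquet",
]


def split_relation_name(filename):
    """Split 'PersonOwnAccount.parquet' into ('Person', 'Own', 'Account').

    Hypothetical helper: assumes the Source + Verb + Target naming pattern.
    """
    stem = filename.removesuffix(".parquet")
    for source in ENTITIES:
        for target in ENTITIES:
            if stem.startswith(source) and stem.endswith(target):
                verb = stem[len(source):len(stem) - len(target)]
                if verb:  # an empty verb means the name has no relation part
                    return source, verb, target
    raise ValueError(f"not a Source+Verb+Target relation name: {filename}")


# One (source, verb, target) triple per relation file.
edge_types = [split_relation_name(name) for name in RELATION_FILES]

if __name__ == "__main__":
    for source, verb, target in edge_types:
        print(f"{source} -[{verb}]-> {target}")
```

A listing like this is a quick way to confirm that every relation file connects two of the five entity types before defining a Corvic Table.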

Sample Embedding Space

The Corvic Table incorporates all primary entities from the LDBC-SF0.01 dataset as the entities of interest for embedding. These entities can be embedded using our graph transformation and structural encoding algorithm. The UMAP plot presents the graph structural embeddings consolidated into a single vector space.

AmazonReviews

This collection comprises more than 34,000 consumer reviews for various Amazon products, including popular items such as the Kindle and Fire TV Stick. These reviews are sourced from a Kaggle project, which uses a representative sample of a larger dataset that is made available by Datafiniti’s Product Database.

Schema Definition

Entity (Dimension) Files:
  • products.parquet: Information about the products: name, brand, category, etc.
  • users.parquet: Information about the customers: username, age, state, etc.
Relation (Fact) Files:
  • reviews.parquet: Customer review of a product: title, date, text, etc.
  • transaction.parquet: Purchase transactions linking users to products

Sample Embedding Space

The Corvic Table represents a mixture of users and products, linked by transactions. The UMAP plot illustrates the structural embeddings of these entities in a unified vector space.
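
To make "users and products, linked by transactions" concrete, the Python sketch below builds a small user-product graph from transaction rows. It only illustrates the structure being embedded and is not Corvic's graph transformation. The rows are made-up toy data, and the column names `user_id` and `product_id` are assumptions, not the documented schema of transaction.parquet.

```python
from collections import defaultdict

# Toy rows standing in for transaction.parquet. The column names
# "user_id" and "product_id" are assumptions for illustration only.
transactions = [
    {"user_id": "u1", "product_id": "p1"},
    {"user_id": "u1", "product_id": "p2"},
    {"user_id": "u2", "product_id": "p1"},
]


def build_bipartite_graph(rows):
    """Return adjacency sets for a user-product graph built from transaction rows."""
    neighbors = defaultdict(set)
    for row in rows:
        # Tag each node with its entity type so user and product ids cannot collide.
        user = ("user", row["user_id"])
        product = ("product", row["product_id"])
        neighbors[user].add(product)
        neighbors[product].add(user)
    return dict(neighbors)


graph = build_bipartite_graph(transactions)

if __name__ == "__main__":
    for node, linked in sorted(graph.items()):
        print(node, "->", sorted(linked))
```

Tagging each node with its entity type keeps both kinds of entity in one graph, which mirrors the idea of embedding users and products in a unified vector space.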

ZacharyKarateClub

The Zachary karate club network is an undirected social network collected by Wayne Zachary in 1977, in which each node represents a club member and each edge a tie between two members. It is often used as a community-detection benchmark: the task is to recover the two groups the club split into after a dispute between two teachers. The network, featured in Zachary’s paper and later popularized by Girvan and Newman in 2002, includes 34 nodes, 156 edges (78 undirected ties, each counted in both directions), and 2 classes. Official website: http://konect.cc/networks/ucidata-zachary/

Schema Definition

Entity (Dimension) Files:
  • adherent.parquet: Information about each club member: id, name, club
Relation (Fact) Files:
  • relation.parquet: Connections between club members
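
Because the network is undirected, a relation table may list each tie once or once per direction, so it is worth checking which convention a file uses before counting edges. The Python sketch below collapses directed rows into undirected ties and counts each member's degree. The rows are toy data, not the karate club network, and the (source, target) row layout is an assumption about how relation.parquet might be organized.

```python
def undirected_ties(rows):
    """Collapse directed (source, target) rows into a set of undirected ties."""
    # frozenset((a, b)) is the same object for (a, b) and (b, a); self-loops are skipped.
    return {frozenset((a, b)) for a, b in rows if a != b}


def degrees(ties):
    """Count how many ties each member takes part in."""
    counts = {}
    for tie in ties:
        for member in tie:
            counts[member] = counts.get(member, 0) + 1
    return counts


# Toy rows: ties 1-2 and 1-3 appear in both directions, tie 2-3 only once.
toy_rows = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 3)]
ties = undirected_ties(toy_rows)
member_degrees = degrees(ties)

if __name__ == "__main__":
    print(len(toy_rows), "rows,", len(ties), "undirected ties")
    print(member_degrees)
```

Comparing the row count with the number of undirected ties shows at a glance whether a file stores each connection in one direction or both.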

Sample Embedding Space

The Corvic Table represents club members and their connections. The UMAP plot illustrates the structural embeddings of these entities in a unified vector space.

Getting Started with Sample Datasets

  1. Download Datasets: Access datasets from the corvic-sample-datasets folder
  2. Upload to Corvic: Upload the parquet files to a data room
  3. Create Corvic Tables: Define entities to embed
  4. Generate Spaces: Create embedding spaces
  5. Analyze Results: Use visualization tools and quality metrics
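
Before the upload step, it can help to confirm that a download is complete. The Python sketch below checks a local folder against a list of expected file names, using the AmazonReviews files as the example. The function name `missing_files` and the folder path are hypothetical, and the check runs entirely on the local machine without calling any Corvic API.

```python
from pathlib import Path

# File names from the AmazonReviews schema; swap in another dataset's list as needed.
AMAZON_REVIEWS_FILES = [
    "products.parquet",
    "users.parquet",
    "reviews.parquet",
    "transaction.parquet",
]


def missing_files(folder, expected):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    # "downloads/AmazonReviews" is a hypothetical local path; adjust to your download location.
    missing = missing_files("downloads/AmazonReviews", AMAZON_REVIEWS_FILES)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```

If the folder does not exist, every expected file is reported as missing, so the same check also catches a wrong path.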

Quickstart Guide

Follow the quickstart guide to get started with sample datasets.