Synthetic Data Generator (Data Minter)

AI-powered realistic test data creation

Data Engineer & ML Engineer | Data Engineering / LLM | November 2023

Technology Stack

Python, DuckDB, Local LLM, Pandas, Docker, FastAPI

Overview

Created a synthetic data generator that uses locally hosted LLMs to produce realistic, GDPR-compliant test datasets for development and testing environments.

Business Problem

Development teams needed realistic test data but couldn't use production data because of privacy regulations (GDPR, CCPA) and compliance requirements. Creating test data by hand was time-consuming, and the results were often unrealistic, so some bugs only surfaced in production.

Approach & Solution

Built a modular system using local LLMs to understand data schemas and generate contextually appropriate synthetic data while maintaining statistical properties. Implemented differential privacy techniques and ensured referential integrity across related tables.
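As a rough illustration of the differential privacy idea mentioned above, the sketch below adds Laplace noise to per-category counts and then samples synthetic values from the noised distribution. It is not the project's actual code: the function names (`laplace_noise`, `noisy_counts`, `sample_synthetic`), the `epsilon` value, and the country example are all hypothetical, and it assumes the set of categories is public knowledge. It uses only the Python standard library to stay dependency-free.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. exponential draws with mean `scale`
    follows a Laplace distribution centred on zero with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(values, epsilon, rng):
    """Return per-category counts with Laplace noise added (hypothetical helper).

    Assumption: adding or removing one record changes exactly one count by 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    Assumption: the category names themselves are not sensitive.
    """
    scale = 1.0 / epsilon
    counts = Counter(values)
    # Clamping at zero is post-processing and does not weaken the guarantee.
    return {cat: max(0.0, n + laplace_noise(scale, rng)) for cat, n in counts.items()}


def sample_synthetic(noisy, n, rng):
    """Sample n synthetic category values in proportion to the noisy counts."""
    categories = list(noisy)
    weights = [noisy[c] for c in categories]
    if sum(weights) <= 0:
        # Fall back to a uniform draw if noise wiped out every count.
        weights = [1.0] * len(categories)
    return rng.choices(categories, weights=weights, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    real_column = ["DE"] * 600 + ["FR"] * 300 + ["ES"] * 100  # made-up source column
    noisy = noisy_counts(real_column, epsilon=1.0, rng=rng)
    synthetic_column = sample_synthetic(noisy, 1000, rng)
    print(Counter(synthetic_column))
```

A smaller `epsilon` means more noise and stronger privacy at the cost of fidelity. A production system would also need to track the total privacy budget spent across all columns and queries, which this sketch leaves out.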

Challenges Overcome

The main challenges were preserving complex data relationships and foreign key constraints, producing statistical distributions that match production patterns, scaling generation to large datasets (1M+ rows), and keeping generation fast while meeting privacy requirements.
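One common way to handle the foreign key and scaling challenges together is to generate parent rows first, then generate child rows in chunks whose foreign keys are drawn only from the existing parent keys. The sketch below illustrates that pattern with the standard library; the `customers` and `orders` tables, their columns, and the distribution parameters are hypothetical and not taken from the project.

```python
import random
from typing import Iterator


def generate_customers(n: int, rng: random.Random) -> list[dict]:
    """Generate n parent rows with sequential primary keys (hypothetical schema)."""
    countries = ["DE", "FR", "ES"]
    return [
        {"customer_id": i, "country": rng.choice(countries)}
        for i in range(1, n + 1)
    ]


def generate_orders(
    customer_ids: list[int], n: int, chunk_size: int, rng: random.Random
) -> Iterator[list[dict]]:
    """Yield n child rows in chunks so the full table never sits in memory.

    Every customer_id is drawn from the existing parent keys, so each
    order references a customer that actually exists.
    """
    next_id = 1
    while next_id <= n:
        size = min(chunk_size, n - next_id + 1)
        chunk = [
            {
                "order_id": next_id + i,
                "customer_id": rng.choice(customer_ids),
                # Log-normal amounts are an assumed stand-in for a skewed
                # real-world distribution; the parameters are arbitrary.
                "amount": round(rng.lognormvariate(3.0, 0.8), 2),
            }
            for i in range(size)
        ]
        next_id += size
        yield chunk


if __name__ == "__main__":
    rng = random.Random(42)
    customers = generate_customers(50, rng)
    ids = [c["customer_id"] for c in customers]
    total = sum(len(chunk) for chunk in generate_orders(ids, 10_000, 2_000, rng))
    print(f"generated {total} orders for {len(customers)} customers")
```

In practice each chunk could be appended to a DuckDB table or written to a Parquet file as it is produced, which keeps memory use flat as the row count grows.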

Results & Impact

Enabled teams to work with production-like data while maintaining 100% compliance with privacy regulations. Reduced test environment setup time from weeks to hours and improved bug detection by 40% in pre-production testing.

This project showcases data engineering skills, ML integration, privacy-preserving techniques, and an understanding of data compliance requirements in enterprise environments.

Key Highlights

Quick bullets for recruiters and hiring managers:

  • GDPR- and CCPA-compliant synthetic data generation
  • Preserved statistical relationships and distributions
  • 10x faster development cycles with realistic test data
  • Zero privacy risk for test environments
  • Automated generation of 1M+ row datasets in minutes