Uniform test data
Today’s post is a short one. I’ve been working on a project that required supporting multiple data backends. To simplify querying and testing across systems, I created a uniform dataset available in several formats.
To that end, I created an open-source Docker Compose setup with S3-compatible storage and DuckDB, Oracle, and PostgreSQL databases. Each backend contains the same Simpsons dataset from Kaggle.
Dataset
This dataset contains the characters, locations, episode details, and script lines for approximately 600 Simpsons episodes, dating back to 1989.
This package contains four datasets:
characters: information about the Simpsons characters.
episodes: metadata for each episode of The Simpsons.
locations: locations in the Simpsons world.
script_lines: the text spoken during each episode, including which character said it and where.
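Because the schema is identical in every backend, one query works everywhere. Here is a minimal sketch of the kind of join the dataset supports, using Python's standard-library sqlite3 as a stand-in engine; the table and column names below are illustrative assumptions, so check the repo for the real schema:

```python
import sqlite3

# Illustrative schema only; the real column names live in the repo.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE script_lines (
        id INTEGER PRIMARY KEY,
        character_id INTEGER REFERENCES characters(id),
        spoken_words TEXT
    );
    INSERT INTO characters VALUES (1, 'Homer Simpson'), (2, 'Lisa Simpson');
    INSERT INTO script_lines VALUES
        (10, 1, 'D''oh!'),
        (11, 2, 'If anyone wants me, I''ll be in my room.');
""")

# Who said what: the same join should run unchanged in DuckDB,
# Oracle, or PostgreSQL, since the tables mirror each other.
rows = con.execute("""
    SELECT c.name, s.spoken_words
    FROM script_lines AS s
    JOIN characters AS c ON c.id = s.character_id
    ORDER BY s.id
""").fetchall()
print(rows)
```

That portability is the whole point of keeping the dataset uniform: you can develop a query against the lightweight backend and run it against the heavier ones later.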
I converted it from the R data packages to Parquet files and database tables. The Docker Compose setup is idempotent: you can run it as many times as needed and it will not duplicate your data. Data is stored in volumes, so you can edit the data, stop Docker, and later resume where you left off.
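The idempotency boils down to only inserting rows that are not already there. A rough sketch of the idea, again using standard-library sqlite3 (the repo's actual init scripts may implement this differently; PostgreSQL would spell the same guard as ON CONFLICT DO NOTHING):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE IF NOT EXISTS characters (id INTEGER PRIMARY KEY, name TEXT)"
)

def load(rows):
    # INSERT OR IGNORE skips rows whose primary key already exists,
    # so re-running the load is a no-op rather than a duplication.
    con.executemany("INSERT OR IGNORE INTO characters VALUES (?, ?)", rows)

data = [(1, "Homer Simpson"), (2, "Marge Simpson")]
load(data)
load(data)  # running the load a second time does not duplicate rows

count = con.execute("SELECT COUNT(*) FROM characters").fetchone()[0]
print(count)
```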
Source code
You can find the code and the data in my repo. If you feel something is missing or see room for improvement, feel free to open a PR or an issue.