Replies: 14 comments 26 replies
- Nice initiative. Will try to take part.
- Hi! Can we get some background context on the datasets being used? I'm curious what the data represents and whether there is a data dictionary.
- Does the 6-hour time limit include training the generator model from scratch, or is it just the synthetic data generation time?
- Hi @mplatzer, is Apple M architecture not supported with the
- Where do we submit the file?
- Hi, I made a submission more than 10 hours ago, and it is still showing this.
- When I am submitting to the sequential data challenge, it says that the output CSV file should have only 20,000 records, but I see that the original file has ~154k records. Does this mean that when generating the synthetic data we just need to generate 20k records from the generator we trained, or do we need to train on 20k randomly selected records from the original dataset?
- Hi @mplatzer! I'm working on the Flat Track challenge and have a quick question about the NNDR values coming out of the mostlyai-qa toolkit. When I run qa.report(), I'm seeing two different NNDR figures: in the JSON output (from metrics.model_dump_json()), the distances.nndr_trn_hol field gives one value (e.g., around 0.063 for a recent run of mine), while the HTML summary displays another. Could you clarify which of these NNDR values is used for the official leaderboard ranking, so we can make sure we're meeting the > 0.5 privacy target for NNDR? Is it the nndr_trn_hol from the JSON, or the one displayed in the HTML summary? Just want to make sure I'm tracking the right number! Thanks a bunch.
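For readers unfamiliar with the metric: NNDR (nearest-neighbor distance ratio) compares, for each synthetic record, the distance to its closest training record against the distance to its second-closest one. Below is a minimal illustrative sketch of that concept using scikit-learn — it is NOT the mostlyai-qa implementation (which handles encoding, holdout comparison, etc.), just the core ratio:

```python
# Illustrative sketch of the NNDR (nearest-neighbor distance ratio) concept.
# NOT the mostlyai-qa implementation -- just the core idea: a ratio near 0
# means a synthetic record almost exactly copies one training record.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(train: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    # For each synthetic record, find distances to the 2 closest training records.
    nn = NearestNeighbors(n_neighbors=2).fit(train)
    dists, _ = nn.kneighbors(synthetic)  # shape: (n_synthetic, 2)
    # Ratio of nearest to second-nearest distance; guard against division by zero.
    return dists[:, 0] / np.maximum(dists[:, 1], 1e-12)

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 4))
synth = rng.normal(size=(200, 4))
ratios = nndr(train, synth)
# Ratios near 1.0 suggest synthetic records are not unusually close
# to any single training record.
print(ratios.mean())
```

Averaging (or taking a low quantile of) these per-record ratios yields a single NNDR figure of the kind the toolkit reports.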
- 📣 To provide full transparency, we are now keeping an up-to-date list of all submissions, available to everyone. For example, you can analyze these submissions via the Assistant by following that link.
- I want to try out the MOSTLY AI web app! Is there a way to buy credits? I don't see an option.
- Dear participants, it's exciting to see the final days of Stage 1 of The MOSTLY AI Prize coming up! Here are a couple of details on the further process: note that all corresponding Stage 2 synthetic datasets, code submissions, and evaluation scores will be recorded and shared publicly, for maximum transparency. On behalf of the jury, wishing you all the best for the final days. Michael
- Stage 1 concluded 🎉 Big kudos to everyone who participated! You are fantastic and helped push the limits further. We highly encourage all of you to share your approaches and experiences with the community, so that everyone can learn from each other. And of course, big congratulations to the Top 5 of the final leaderboards! You will now advance to Stage 2 and have until July 6, 23:59 UTC to share your code submission for final evaluation. Please grant access to your (initially private) repositories to the members of the Jury Board: @mplatzer @scriminaci @psitronic @suhaskowshik @adivekar-utexas @shree-gade. All the best for Stage 2! 🤞
- The competition has concluded, and we have a single winner for both the FLAT and the SEQUENTIAL challenge: @Gandagorn! 🎉 Once again, big congratulations and a big THANK YOU to all participants for pushing the boundaries of synthetic data further and achieving a new state-of-the-art accuracy for large-scale synthetic datasets! ⭐ Open Source for the Win! 💪 All datasets, all results, as well as links to all open-sourced code submissions can be found at mostly-ai/the-prize-eval! 🙌
- Also sharing here for wider visibility: the challenge datasets were derived from these sources:
We had to heavily obfuscate column names and values to prevent re-identification of those data sources and ensure a fair competition. The datasets were picked to push the boundaries in terms of scale and rich real-world patterns. We would have loved to scale even further, but the privacy evaluations became the bottleneck: for DCRs, every training record has to be compared to every synthetic record, which scales non-linearly in compute.
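To see why the DCR evaluation becomes the bottleneck at scale: computing the distance to the closest record requires comparing all pairs, i.e. O(n_train × n_synthetic) distance computations. A minimal illustrative sketch (NOT the official evaluation code, which also handles categorical encodings), chunked to bound memory:

```python
# Illustrative sketch of the DCR (distance to closest record) cost.
# NOT the official evaluation code: it only shows the O(n_train * n_synth)
# pairwise comparison that makes privacy evaluation expensive at scale.
import numpy as np

def dcr(train: np.ndarray, synthetic: np.ndarray, chunk: int = 512) -> np.ndarray:
    """Distance from each synthetic record to its closest training record,
    processed in chunks of synthetic rows to keep memory bounded."""
    out = np.empty(len(synthetic))
    for start in range(0, len(synthetic), chunk):
        block = synthetic[start:start + chunk]
        # Pairwise Euclidean distances, shape: (len(block), n_train).
        d = np.linalg.norm(block[:, None, :] - train[None, :, :], axis=-1)
        out[start:start + chunk] = d.min(axis=1)
    return out

train = np.array([[0.0, 0.0], [1.0, 1.0]])
synth = np.array([[0.0, 0.0], [0.5, 0.5]])
print(dcr(train, synth))  # first record copies train[0] -> DCR 0.0; second is ~0.707 away
```

Doubling both dataset sizes quadruples the number of comparisons, which is the non-linear scaling mentioned above.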
- 👉 mostlyaiprize.com: Join the synthetic data challenge and win up to $100k. Feel free to use this discussion thread for questions and/or feedback of any kind.