[FEATURE] Validate generated dataset with simple experiments for performance improvement #19

@ncudlenco

Is your feature request related to a problem? Please describe.
The value of new datasets generated by the simulation pipeline is not systematically validated. This makes it difficult to assess whether the generated data brings measurable improvements to existing problems or tasks.

Describe the solution you'd like
Design and implement a set of simple, quick-to-run experiments that validate the practical value of the generated dataset. These experiments should:

  • Use the generated data in existing machine learning or analysis pipelines
  • Evaluate whether the new data brings performance improvements to known tasks or benchmarks
  • Focus on experiments that are quick to set up and run, avoiding resource-intensive or large-scale studies
  • Report on findings and, if possible, recommend integration or further investigation

Describe alternatives you've considered

  • Relying on subjective or qualitative assessments alone (less reliable)
  • Delaying validation until large-scale experiments are possible (slower feedback loop)

Acceptance Criteria

  • At least one simple experiment is designed and implemented for dataset validation
  • The experiment(s) use the generated data in an existing analysis or ML pipeline
  • Performance on a relevant metric or benchmark is measured and reported
  • Results are documented and recommendations provided
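One possible way to meet the "measured and reported" criteria is to repeat the experiment over a few seeds and summarize the paired difference between the baseline and the augmented runs. The sketch below is only an illustration: the function names `summarize` and `to_markdown_row`, the dictionary keys, and the choice of mean and standard deviation as the reported statistics are all assumptions, not part of the pipeline.

```python
from statistics import mean, stdev


def summarize(baseline_scores, augmented_scores, metric="accuracy"):
    """Summarize paired runs (same seed with and without generated data).

    Hypothetical helper: the inputs are plain lists of metric values,
    one entry per seed, in the same seed order.
    """
    if len(baseline_scores) != len(augmented_scores):
        raise ValueError("score lists must be paired, one entry per seed")
    if len(baseline_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")

    # Per-seed improvement from adding the generated data.
    deltas = [a - b for a, b in zip(augmented_scores, baseline_scores)]
    return {
        "metric": metric,
        "runs": len(deltas),
        "baseline_mean": mean(baseline_scores),
        "augmented_mean": mean(augmented_scores),
        "mean_delta": mean(deltas),
        "delta_stdev": stdev(deltas),
        # Number of seeds where the augmented run was strictly better.
        "wins": sum(d > 0 for d in deltas),
    }


def to_markdown_row(summary):
    """Format one summary as a table row for the results write-up."""
    return (
        f"| {summary['metric']} | {summary['runs']} "
        f"| {summary['baseline_mean']:.3f} | {summary['augmented_mean']:.3f} "
        f"| {summary['mean_delta']:+.3f} | {summary['delta_stdev']:.3f} "
        f"| {summary['wins']}/{summary['runs']} |"
    )
```

A small mean delta relative to its standard deviation would suggest the result is inconclusive and that further investigation is needed before recommending integration.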

Additional context

  • Potential experiments could include training a simple classifier, running clustering, or evaluating on a subset of a public benchmark.
  • The goal is to quickly demonstrate practical value and identify possible improvements.
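As a rough illustration of the "simple classifier" idea, the sketch below trains a nearest-centroid classifier twice, once on a small real training set and once on that set plus generated samples, and scores both on the same held-out real test set. Everything here is a stand-in: the Gaussian blobs produced by `make_blobs` are placeholder data, not output of the simulation pipeline, and the names `run_experiment`, `fit_centroids`, and the sample sizes are assumptions chosen to keep the example dependency-free and fast.

```python
import random


def make_blobs(n_per_class, centers, spread, rng):
    """Placeholder data: one 2-D Gaussian blob per class."""
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


def fit_centroids(points, labels):
    """Nearest-centroid 'training': average the points of each class."""
    sums = {}
    for (x, y), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Return the class whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )


def accuracy(centroids, points, labels):
    hits = sum(predict(centroids, p) == y for p, y in zip(points, labels))
    return hits / len(labels)


def run_experiment(seed=0):
    """Compare real-only training against real + generated training."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (3.0, 3.0)]  # assumed class layout
    real_x, real_y = make_blobs(5, centers, 1.0, rng)    # scarce real data
    gen_x, gen_y = make_blobs(200, centers, 1.0, rng)    # stand-in for generated data
    test_x, test_y = make_blobs(500, centers, 1.0, rng)  # held-out real data

    baseline = fit_centroids(real_x, real_y)
    augmented = fit_centroids(real_x + gen_x, real_y + gen_y)
    return {
        "baseline_acc": accuracy(baseline, test_x, test_y),
        "augmented_acc": accuracy(augmented, test_x, test_y),
    }
```

In a real run, the three `make_blobs` calls would be replaced by loaders for the existing training data, the generated dataset, and a held-out real test split; the test split should never include generated samples, otherwise the comparison says little about performance on the actual task.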

Labels

enhancement: New feature or request
