Generate metadata.yaml for existing tokenized datasets #135

@vmly

Problem

For studio to use data prep effectively, metadata.yaml should be generated for older datasets that are already onboarded to studio.

Requirements

  • Create a function that accepts a folder containing existing hdf5 files

  • The dataset might support both training and evaluation

  • Process the contents of the existing dataset and derive the metrics that appear in metadata.yaml

  • Is it possible to generate all the fields in metadata.yaml from an existing dataset?
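
As a starting point for discussion, here is a minimal sketch of the part of such a function that needs only array shapes. Everything named here is an assumption, not existing code: the function names `read_shapes`, `summarize_split`, and `shape_based_metadata` are hypothetical, and so are the `*.hdf5` glob, the `input_ids` dataset name, and the 2-D `(sequences, seq_length)` layout. How training and evaluation files are told apart is left to the caller, since the issue does not specify it.

```python
from pathlib import Path


def read_shapes(folder, dataset_name="input_ids"):
    """Collect (num_sequences, seq_length) for each hdf5 file in a folder.

    Assumption: each file holds a 2-D dataset called `dataset_name`.
    """
    import h5py  # third-party; imported lazily so the helpers below run without it

    shapes = []
    for path in sorted(Path(folder).glob("*.hdf5")):
        with h5py.File(path, "r") as f:
            num_sequences, seq_length = f[dataset_name].shape
            shapes.append((num_sequences, seq_length))
    return shapes


def summarize_split(shapes):
    """Aggregate per-file shapes into counts for one split."""
    return {
        "files": len(shapes),
        "sequences": sum(n for n, _ in shapes),
        "max_seq_length": max((length for _, length in shapes), default=0),
        "output_tokens": sum(n * length for n, length in shapes),
    }


def shape_based_metadata(train_shapes, dev_shapes=()):
    """Build the subset of metadata.yaml fields derivable from shapes alone."""
    train = summarize_split(train_shapes)
    dev = summarize_split(dev_shapes)
    return {
        "max_seq_length": max(train["max_seq_length"], dev["max_seq_length"]),
        "number_of_training_files": train["files"],
        "number_of_dev_files": dev["files"],
        "train_sequences": train["sequences"],
        "train_output_tokens": train["output_tokens"],
    }
```

Fields such as train_padding_tokens, train_prompt_tokens, and train_completion_tokens would additionally need the per-token conventions used when the files were written, and fields such as train_articles or tokenizer_model_type may not be recoverable from the hdf5 files at all. That would need to be confirmed against the data prep code.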

Background

  • Studio uses the add_seq_metadata_dataset function to populate the training sequences for existing hdf5 tokenized datasets.

  • This populates only the train_sequences field in metadata.yaml. The other fields are not available for existing datasets.

  • If metadata.yaml already exists, train_sequences is added to it.

  • metadata.yaml for existing datasets:

train_sequences: 54

  • For new datasets, metadata.yaml is used directly. For new datasets prepared with data prep, all the fields are available:

train_articles: 100
train_completion_tokens: 53020
train_input_tokens: 53020
max_batch_size_dev: null
max_batch_size_train: 13
max_seq_length: 1024
number_of_dev_files: 0
number_of_test_files: 0
number_of_training_files: 4
train_output_tokens: 55296
train_padding_tokens: 2276
train_prompt_tokens: 0
train_sequences: 54
token_type_ids: true
tokenizer_model_type: "<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>"
train_tokens_dropped_from_all_prompt: 0
train_tokens_dropped_from_packing: 0
vocab_size: 50257
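
The token counts in the example above happen to satisfy a few simple relations, which suggests some fields could be recomputed from others. The sketch below checks those relations on the example values. The relations are inferred from this one example only and are not confirmed definitions from data prep; `check_token_accounting` is a hypothetical helper name.

```python
# Values copied from the example metadata.yaml above.
metadata = {
    "max_seq_length": 1024,
    "train_sequences": 54,
    "train_input_tokens": 53020,
    "train_output_tokens": 55296,
    "train_padding_tokens": 2276,
    "train_prompt_tokens": 0,
    "train_completion_tokens": 53020,
}


def check_token_accounting(m):
    """Return the names of the inferred relations that do not hold."""
    problems = []
    # Every sequence appears to be padded to max_seq_length.
    if m["train_output_tokens"] != m["train_sequences"] * m["max_seq_length"]:
        problems.append("output = sequences * max_seq_length")
    # Padding appears to be the gap between output and input tokens.
    if m["train_padding_tokens"] != m["train_output_tokens"] - m["train_input_tokens"]:
        problems.append("padding = output - input")
    # Input tokens appear to split into prompt and completion tokens.
    if m["train_input_tokens"] != m["train_prompt_tokens"] + m["train_completion_tokens"]:
        problems.append("input = prompt + completion")
    return problems


print(check_token_accounting(metadata))  # prints []
```

If these relations hold in general, train_output_tokens follows from train_sequences and max_seq_length, and train_padding_tokens follows once train_input_tokens is known.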
