Mixture of datasets #138
Open: m1kush wants to merge 42 commits into main from mixture-of-datasets
42 commits
6084ec7 add skeleton of mixture of datasets with new configuration files and … (m1kush)
93ff6b9 refactor dataset handling to support mixture of datasets with weighte… (m1kush)
fe2a7b2 add skeleton of mixture of datasets with new configuration files and … (m1kush)
63a836a refactor dataset initialization to always use MixtureOfDatasets for i… (m1kush)
d004369 add skeleton of mixture of datasets with new configuration files and … (m1kush)
3899160 add tests for tokenize_fn to verify behavior across different models (m1kush)
f8a618f update dataloader configurations to use get_mixture_of_datasets_datal… (m1kush)
19b1d5c refactor dataset configurations to use path-weight pairs and update d… (m1kush)
70ef0e9 update dataset weights in fineweb.yaml and clean up smollm_corpus.yam… (m1kush)
ca95813 Update src/tests/test_tokenize_fn.py (m1kush)
cc3693f Update src/core/datasets.py (m1kush)
47304b4 update c4.yaml to use dynamic paths for train and eval datasets (m1kush)
a56ab19 update default.yaml to change the train seed value from 1000 to 123 t… (m1kush)
73d4f73 fix indentation in datasets.py for better readability (m1kush)
96d177d refactor: improve code readability by adjusting indentation and forma… (m1kush)
8f413ad refactor: remove commented-out print statement in datasets.py for cle… (m1kush)
6f42629 refactor: update type hints for datasets parameter in get_mixture_of_… (m1kush)
980b737 fix: add validation for paths and weights in datasets.py to ensure th… (m1kush)
5e1c549 Merge branch 'main' into mixture-of-datasets (m1kush)
855c118 update smollm config files (m1kush)
9a0a174 add skeleton of mixture of datasets with new configuration files and … (m1kush)
80a6df7 refactor dataset handling to support mixture of datasets with weighte… (m1kush)
7f28979 add skeleton of mixture of datasets with new configuration files and … (m1kush)
d8a35e2 refactor dataset initialization to always use MixtureOfDatasets for i… (m1kush)
54ece61 add skeleton of mixture of datasets with new configuration files and … (m1kush)
a85e828 add tests for tokenize_fn to verify behavior across different models (m1kush)
0a33943 update dataloader configurations to use get_mixture_of_datasets_datal… (m1kush)
f8d9e29 refactor dataset configurations to use path-weight pairs and update d… (m1kush)
981f738 update dataset weights in fineweb.yaml and clean up smollm_corpus.yam… (m1kush)
ccb35d5 Update src/tests/test_tokenize_fn.py (m1kush)
e74eca7 Update src/core/datasets.py (m1kush)
5c5a7a8 update c4.yaml to use dynamic paths for train and eval datasets (m1kush)
fe44c6b update default.yaml to change the train seed value from 1000 to 123 t… (m1kush)
b6a1418 fix indentation in datasets.py for better readability (m1kush)
5fbbcc5 refactor: improve code readability by adjusting indentation and forma… (m1kush)
edeca0c refactor: remove commented-out print statement in datasets.py for cle… (m1kush)
34c0f04 refactor: update type hints for datasets parameter in get_mixture_of_… (m1kush)
9089918 fix: add validation for paths and weights in datasets.py to ensure th… (m1kush)
b4230bc update smollm config files (m1kush)
be57603 Merge remote-tracking branch 'origin/mixture-of-datasets' into mixtur… (m1kush)
5352f99 Merge branch 'main' into mixture-of-datasets (m1kush)
b38a167 update dataset references and checkpoint paths in configuration files (m1kush)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,12 +1,26 @@ |
| | | trainer: |
| | | train_dataloader: |
| | | _target_: src.core.datasets.get_dataloader |
| | | dataset: ??? |
| | | total_batch_size: ${common.batch_size} |
| | | _target_: src.core.datasets.get_mixture_of_datasets_dataloader |
| | | datasets: ??? |
| | | dataset_split: ??? |
| | | num_workers: ??? |
| | | seed: 123 |
| | | sequence_length: ${common.sequence_length} |
| | | shuffle: true |
| | | total_batch_size: ${common.batch_size} |
| | | use_new_sampling_method: true |
| | | world_size_independent: false |
| | | tokenize_fn: ??? |
| | | |
| | | eval_dataloader: |
| | | _target_: src.core.datasets.get_dataloader |
| | | dataset: ??? |
| | | _target_: src.core.datasets.get_mixture_of_datasets_dataloader |
| | | datasets: ??? |
| | | dataset_split: ??? |
| | | num_workers: ??? |
| | | seed: 123 |
| | | sequence_length: ${common.sequence_length} |
| | | shuffle: true |
| | | total_batch_size: ${common.batch_size} |
| | | num_workers: ??? |
| | | use_new_sampling_method: true |
| | | world_size_independent: false |
| | | tokenize_fn: ??? |
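
Review note: in OmegaConf/Hydra configs, `???` marks a mandatory value that an overriding config has to supply, so the defaults above leave `datasets`, `dataset_split`, `num_workers` and `tokenize_fn` open on purpose. The sketch below is not part of this PR; it uses a plain dict as a stand-in for the loaded config, and `find_missing` is a hypothetical helper that lists which fields a dataset config still has to fill in.

```python
MISSING = "???"  # OmegaConf's marker for a mandatory, not-yet-set value


def find_missing(cfg, prefix=""):
    """Return the dotted path of every entry still set to the '???' marker."""
    missing = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(find_missing(value, path))
        elif value == MISSING:
            missing.append(path)
    return missing


# Plain-dict stand-in for part of the train_dataloader defaults in the diff above.
train_dataloader = {
    "_target_": "src.core.datasets.get_mixture_of_datasets_dataloader",
    "datasets": "???",
    "dataset_split": "???",
    "num_workers": "???",
    "seed": 123,
    "shuffle": True,
    "tokenize_fn": "???",
}

print(find_missing({"trainer": {"train_dataloader": train_dataloader}}))
```

A dataset config that sets all four fields would make the helper return an empty list.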
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,38 @@ |
| | | defaults: |
| | | - default |
| | | - _self_ |
| | | |
| | | trainer: |
| | | train_dataloader: |
| | | _target_: src.core.datasets.get_mixture_of_datasets_dataloader |
| | | datasets: |
| | | - path: /storage_nvme_1/llm-random/datasets/fineweb-edu-dedup/train |
| | | weight: 0.7 |
| | | - path: /storage_nvme_1/llm-random/datasets/cosmopedia-v2/train |
| | | weight: 0.15 |
| | | - path: /storage_nvme_2/llm-random/datasets/python-edu |
| | | weight: 0.08 |
| | | - path: /storage_nvme_2/llm-random/datasets/open-web-math/train |
| | | weight: 0.07 |
| | | dataset_split: train |
| | | num_workers: 2 |
| | | tokenize_fn: |
| | | _target_: src.core.datasets.get_tokenize_fn |
| | | model_name: HuggingFaceTB/SmolLM-1.7B |
| | | |
| | | eval_dataloader: |
| | | _target_: src.core.datasets.get_mixture_of_datasets_dataloader |
| | | datasets: |
| | | - path: /storage_nvme_1/llm-random/datasets/fineweb-edu-dedup/train |
| | | weight: 0.7 |
| | | - path: /storage_nvme_1/llm-random/datasets/cosmopedia-v2/train |
| | | weight: 0.15 |
| | | - path: /storage_nvme_2/llm-random/datasets/python-edu |
| | | weight: 0.08 |
| | | - path: /storage_nvme_2/llm-random/datasets/open-web-math/train |
| | | weight: 0.07 |
| | | dataset_split: validation |
| | | num_workers: 1 |
| | | tokenize_fn: |
| | | _target_: src.core.datasets.get_tokenize_fn |
| | | model_name: HuggingFaceTB/SmolLM-1.7B |
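
Review note: the path/weight pairs above describe a weighted mixture, and one commit in this PR mentions validating paths and weights. The code of `get_mixture_of_datasets_dataloader` is not shown in this diff, so the following is only a minimal sketch of the idea, not the PR's implementation. `validate_mixture` and `sample_sources` are hypothetical names, and both the requirement that weights sum to 1 and the per-sample source selection are assumptions.

```python
import math
import random
from collections import Counter

# Path/weight pairs copied from the train_dataloader config above.
MIXTURE = [
    {"path": "/storage_nvme_1/llm-random/datasets/fineweb-edu-dedup/train", "weight": 0.7},
    {"path": "/storage_nvme_1/llm-random/datasets/cosmopedia-v2/train", "weight": 0.15},
    {"path": "/storage_nvme_2/llm-random/datasets/python-edu", "weight": 0.08},
    {"path": "/storage_nvme_2/llm-random/datasets/open-web-math/train", "weight": 0.07},
]


def validate_mixture(datasets, tol=1e-6):
    """Require a path and a positive weight per entry, and weights summing to 1 (assumed rule)."""
    if not datasets:
        raise ValueError("mixture needs at least one dataset")
    for entry in datasets:
        if not entry.get("path"):
            raise ValueError(f"missing path in {entry!r}")
        weight = entry.get("weight")
        if not isinstance(weight, (int, float)) or weight <= 0:
            raise ValueError(f"weight must be a positive number in {entry!r}")
    total = sum(entry["weight"] for entry in datasets)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total}, expected 1.0")


def sample_sources(datasets, n, seed):
    """Draw n dataset indices with probability proportional to each weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    weights = [entry["weight"] for entry in datasets]
    return rng.choices(range(len(datasets)), weights=weights, k=n)


validate_mixture(MIXTURE)
picks = sample_sources(MIXTURE, n=10_000, seed=123)
for index, count in sorted(Counter(picks).items()):
    print(MIXTURE[index]["path"], count / len(picks))
```

With enough draws, the printed fractions should land close to the configured weights. A real dataloader would then pull the next example from whichever dataset was picked.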
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,16 @@ |
| | | defaults: |
| | | - llama_distillation |
| | | |
| | | common_distillation: |
| | | dmodel: 2048 |
| | | dff: 8192 |
| | | datt: 2048 |
| | | n_blocks: 24 |
| | | q_heads: 32 |
| | | kv_heads: 32 |
| | | vocab_size: 49152 |
| | | |
| | | distillation: |
| | | load: |
| | | path: "HuggingFaceTB/SmolLM-1.7B" |
| | | |
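
Review note: a quick sanity check that these dimensions are consistent with the "1.7B" in the checkpoint name. The model code is not part of this diff, so the sketch assumes a Llama-style block (four bias-free attention projections, a gated MLP with three `dmodel` x `dff` matrices, two norm vectors per block) and an output head tied to the embedding. `estimate_params` is a hypothetical helper.

```python
# Values copied from common_distillation above.
cfg = {
    "dmodel": 2048,
    "dff": 8192,
    "datt": 2048,
    "n_blocks": 24,
    "q_heads": 32,
    "kv_heads": 32,
    "vocab_size": 49152,
}


def estimate_params(c):
    """Rough parameter count under the Llama-style assumptions stated in the note."""
    head_dim = c["datt"] // c["q_heads"]        # 2048 / 32 = 64
    kv_dim = c["kv_heads"] * head_dim           # equals datt here because kv_heads == q_heads
    attention = 2 * c["dmodel"] * c["datt"]     # query and output projections
    attention += 2 * c["dmodel"] * kv_dim       # key and value projections
    mlp = 3 * c["dmodel"] * c["dff"]            # gate, up and down projections
    norms = 2 * c["dmodel"]                     # two norm weight vectors per block
    block = attention + mlp + norms
    embedding = c["vocab_size"] * c["dmodel"]   # shared with the output head (assumed tied)
    final_norm = c["dmodel"]
    return c["n_blocks"] * block + embedding + final_norm


print(f"{estimate_params(cfg):,}")  # 1,711,376,384
```

Since `kv_heads` equals `q_heads`, the config describes plain multi-head attention rather than grouped-query attention, with 64 dimensions per head.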
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,34 @@ |
| | | defaults: |
| | | - _cluster@_here_: local |
| | | - _model@_here_: tiny |
| | | - _trainer@_here_: llama |
| | | - _dataset@_here_: smollm_corpus |
| | | - _checkpoints@_here_: none |
| | | - _misc@_here_: default |
| | | - _eval@_here_: default |
| | | |
| | | common: |
| | | sequence_length: 16 |
| | | batch_size: 2 |
| | | |
| | | trainer: |
| | | gradient_accumulation_steps: 1 |
| | | n_steps: 100 |
| | | learning_rate: 1e-3 |
| | | |
| | | checkpoint: |
| | | save: |
| | | type: huggingface |
| | | path: checkpoint |
| | | |
| | | infrastructure: |
| | | metric_logger: |
| | | name: tiny_Local |
| | | tags: |
| | | - nano |
| | | - local |
| | | - tiny |
| | | |
| | | evaluator: |
| | | limit: 1 |
| | | device: cpu |
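
Review note: this local smoke-test config is tiny on purpose. As a rough check of how much data it touches, the arithmetic below assumes `batch_size` counts sequences per optimizer step (with `gradient_accumulation_steps: 1` there is nothing to multiply or divide) and that every sequence is exactly `sequence_length` tokens. `tokens_seen` is a hypothetical helper.

```python
def tokens_seen(batch_size, sequence_length, n_steps):
    """Tokens consumed over a run: sequences per step x tokens per sequence x steps."""
    return batch_size * sequence_length * n_steps


# Values from the config above: 2 sequences of 16 tokens per step, 100 steps.
print(tokens_seen(batch_size=2, sequence_length=16, n_steps=100))  # 3200
```

At 32 tokens per step, the run only exercises the mixture dataloader end to end and says nothing about whether the sampled proportions match the configured weights.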
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -28,7 +28,7 @@ trainer: |
| | | learning_rate: 15 |
| | | |
| | | train_dataloader: |
| | | dataset: |
| | | datasets: |
| | | seed: 1000 |
| | | |
| | | checkpoint: |
| | | |