Batch size while training #17

@HahmDY

Description

The paper states a batch size of 64 on each of the 8 GPUs. How, then, was the contrastive loss calculated?

1. Compute the loss on each GPU over its local batch of 64, backpropagate per GPU, then aggregate the gradients (e.g. via ring all-reduce) and update (i.e. plain DDP).
2. Compute embeddings on all GPUs, gather them across ranks (e.g. with `dist.all_gather()`), compute the loss over the global batch of 64 × 8 = 512, then backpropagate.

Which of these two methods was used in your training? If neither, can you explain how many image-text pairs were used when computing the contrastive loss?
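For reference, the second option is commonly implemented in CLIP-style training by gathering embeddings from all ranks and re-inserting the local shard so gradients still flow. This is only a minimal sketch of that pattern (assuming a PyTorch DDP setup), not necessarily this repo's actual code; `gather_with_grad` and `clip_loss` are illustrative names.

```python
# Sketch of option 2: contrastive loss over the global batch.
# ASSUMPTION: this mirrors the common CLIP-style gather pattern, not this repo's code.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_with_grad(t):
    """all_gather embeddings from every rank, keeping gradients for the local shard."""
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)        # all_gather does not propagate gradients...
    gathered[dist.get_rank()] = t       # ...so re-insert the local tensor with its graph
    return torch.cat(gathered, dim=0)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over the global batch (world_size * local_batch pairs)."""
    img_all = gather_with_grad(F.normalize(img_emb, dim=-1))
    txt_all = gather_with_grad(F.normalize(txt_emb, dim=-1))
    logits = img_all @ txt_all.t() / temperature
    # Matching image-text pairs sit on the diagonal of the global similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    # Single-process demo (world_size=1, gloo backend) so the sketch runs as-is;
    # in real training this would be launched with torchrun across 8 GPUs.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                            rank=0, world_size=1)
    img = torch.randn(64, 512, requires_grad=True)
    txt = torch.randn(64, 512, requires_grad=True)
    loss = clip_loss(img, txt)
    loss.backward()
    dist.destroy_process_group()
```

Option 1 (plain DDP with per-GPU losses) needs no extra code, but it only contrasts each sample against the 63 local negatives rather than the 511 global ones, which is why the distinction matters for reproducing reported results.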
