Handling different numbers of layers for vision and language models #4

@peymanrostami

Thank you very much for the excellent work.

For the retrieval task on the Flickr30K dataset with CLIP, are you using the same weights as UPOP?

If so, UPOP's CLIP model for the retrieval task has 24 layers in the vision model and 12 layers in the language model. Your algorithm (as shown in Figure 2), however, assumes the same number of layers in both models and introduces learnable tokens that attend to the vision and language models at the same layer index. How do you handle the case where the vision and language transformers have different numbers of layers?
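To make the mismatch concrete, here is a minimal sketch of one alignment I could imagine. It is purely my assumption and is not taken from the paper or from the UPOP code. The helper `align_layers` is a hypothetical name; it pairs each layer of the deeper tower with a proportionally scaled layer index in the shallower tower, so with 24 vision layers and 12 language layers, two consecutive vision layers would share one language layer.

```python
def align_layers(num_vision: int, num_language: int) -> list[tuple[int, int]]:
    """Return (vision_idx, language_idx) pairs using a proportional mapping.

    Hypothetical scheme for illustration only: every layer of the deeper
    tower is paired with the shallower-tower layer at the same relative depth.
    """
    pairs = []
    if num_vision >= num_language:
        # Walk the deeper vision tower and scale the index down.
        for v in range(num_vision):
            pairs.append((v, v * num_language // num_vision))
    else:
        # Walk the deeper language tower and scale the index down.
        for t in range(num_language):
            pairs.append((t * num_vision // num_language, t))
    return pairs


# 24 vision layers vs. 12 language layers, as in the CLIP retrieval setting.
pairs = align_layers(24, 12)
print(pairs[:4])
print(pairs[-1])
```

With equal depths this reduces to the same-index pairing I understand from Figure 2. Is the actual implementation doing something along these lines, or something different?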

Thank you in advance.
