Handling different numbers of layers for vision and language models

Thank you very much for the excellent work.

For the retrieval task on the Flickr30K dataset with CLIP, are you using the same weights as UPOP?

If so, UPOP has 24 layers in the vision model and 12 layers in the language model for the retrieval task with CLIP. However, your algorithm (as shown in Figure 2) appears to assume the same number of layers in both models, and it introduces learnable tokens that attend to the vision and language models at the same layer index. How do you handle the case where the vision and language transformers have different numbers of layers?
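To make the question concrete, here is one pairing I could imagine. It is purely my own assumption and not taken from the paper or the code: each layer of the deeper model is matched to the layer at the same relative depth in the shallower one. The function name `align_layers` is hypothetical.

```python
def align_layers(num_vision_layers: int, num_text_layers: int) -> list[tuple[int, int]]:
    """Pair layers of two transformers by relative depth.

    Hypothetical helper: one entry is produced per layer of the deeper
    model, and the shallower model's layers are reused as needed.
    Returns a list of (vision_layer_index, text_layer_index) pairs.
    """
    deeper = max(num_vision_layers, num_text_layers)
    pairs = []
    for i in range(deeper):
        # Scale the step index i into each model's own depth range.
        vision_idx = i * num_vision_layers // deeper
        text_idx = i * num_text_layers // deeper
        pairs.append((vision_idx, text_idx))
    return pairs


# Assumed setting from the question: 24 vision layers, 12 language layers.
pairs = align_layers(24, 12)
print(pairs[:4])   # [(0, 0), (1, 0), (2, 1), (3, 1)]
print(pairs[-1])   # (23, 11)
```

With this pairing, every two consecutive vision layers would share one language layer. Is this close to what you do, or do you handle the two models differently?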

Thanks in advance.

handling different number of layers for vision and language models #4

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

handling different number of layers for vision and language models #4

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions