Thank you very much for the excellent work.
For the retrieval task on the Flickr30K dataset with CLIP, are you using the same weights as UPOP?
If so, UPOP uses 24 layers for the vision model and 12 layers for the language model on the retrieval task with CLIP. However, your algorithm (as shown in Figure 2) assumes the same number of layers for both the vision and language models, and introduces learnable tokens that attend to the vision and language models at the same layer index. How do you handle the case where the vision and language transformers have different numbers of layers?
Thanks in advance.