Learning to scale multilingual representations for vision-language tasks
Burns, Andrea; Kim, Donghyun; Wijaya, Derry; Saenko, Kate; Plummer, Bryan A.
Current multilingual vision-language models either require a large number of additional parameters for each supported language or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters and without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for only a few.
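A minimal sketch of how such a hybrid vocabulary might be indexed, assuming a PyTorch setup; the class name, table sizes, and id layout are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class HybridMultilingualEmbedding(nn.Module):
    """Hypothetical hybrid vocabulary: ids below `num_shared` hit one
    language-agnostic table shared by every language; the remaining ids
    hit a small language-specific table."""

    def __init__(self, num_shared=20000, num_specific=1000, dim=300):
        super().__init__()
        self.num_shared = num_shared
        self.shared = nn.Embedding(num_shared, dim)      # shared across all languages
        self.specific = nn.Embedding(num_specific, dim)  # the "few" language-specific words

    def forward(self, token_ids):
        # token_ids: LongTensor (batch, seq_len), ids in [0, num_shared + num_specific)
        is_specific = token_ids >= self.num_shared
        shared_emb = self.shared(token_ids.clamp(max=self.num_shared - 1))
        specific_emb = self.specific((token_ids - self.num_shared).clamp(min=0))
        # Pick per token from the appropriate table.
        return torch.where(is_specific.unsqueeze(-1), specific_emb, shared_emb)
```

Because the shared table is reused by every language, its parameter count stays fixed as languages are added; only the small language-specific table grows.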
We use a masked cross-language modeling loss to align features with context from other languages.
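One way such a loss could be realized is sketched below, with `encoder` and `vocab_head` as hypothetical stand-ins for a sequence encoder and an output projection: tokens in a sentence are masked and then predicted from a joint encoding of the sentence and its translation.

```python
import torch
import torch.nn.functional as F

def masked_cross_language_loss(encoder, vocab_head, src_ids, trans_ids,
                               mask_token_id, mask_prob=0.15):
    # Randomly mask positions in the source sentence.
    src = src_ids.clone()
    mask = torch.rand(src.shape, device=src.device) < mask_prob
    src[mask] = mask_token_id

    # Encode the masked sentence together with its translation so the
    # masked words can be recovered from cross-lingual context.
    joint = torch.cat([src, trans_ids], dim=1)     # (batch, src_len + trans_len)
    hidden = encoder(joint)                        # (batch, len, dim)
    logits = vocab_head(hidden[:, : src.size(1)])  # scores at source positions

    # Cross-entropy only at the masked positions.
    return F.cross_entropy(logits[mask], src_ids[mask])
```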
Additionally, we propose a cross-lingual consistency module that ensures the predictions made for a query and its machine translation are comparable.
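A sketch of one plausible consistency term, assuming the module compares the retrieval score distributions that a sentence and its machine translation assign to the same candidate images; the symmetric-KL penalty here is an assumption, not necessarily the paper's exact formulation.

```python
import torch.nn.functional as F

def cross_lingual_consistency_loss(scores_query, scores_translation):
    # Turn per-image retrieval scores into log-distributions over candidates.
    p = F.log_softmax(scores_query, dim=-1)
    q = F.log_softmax(scores_translation, dim=-1)
    # Symmetric KL between the two ranking distributions: penalizing it
    # pushes the query and its translation toward comparable predictions.
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)
```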
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported by vision-language methods to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% while using less than 1/5th the training parameters of other word embedding methods.