Improving Bilingual Lexicon Induction with Unsupervised Post-Processing of Monolingual Word Vector Spaces

Author

Ivan Vulić, Anna Korhonen, and Goran Glavaš

Conference

Proceedings of the 5th Workshop on Representation Learning for NLP

Year

2020

Figures & Tables

Table 1: Languages used in the main BLI experiments (Vulić et al., 2019), along with family (IE = Indo-European), morphological type, and ISO 639-1 code.
Table 2: BLI results (MRR×100%) for the main models in comparison. We report the results with the supervised BASELINE model based on the VecMap framework (Artetxe et al., 2018b), without any self-learning (i.e., supervised only), and with the most robust self-learning setup according to the comparative analysis of Vulić et al. (2019). The scores are averaged over experimental setups where each of the 15 languages is used as the source language Ls (e.g., BG-* averages scores over the 14 setups in which Bulgarian (BG) is the source language). 5k and 1k denote seed dictionary sizes. The Avg column shows averaged MRR scores for each model over all 15×14 = 210 BLI setups, and we also report the number of BLI setups in which the POSTPROC method improves over both BASELINE models. (An illustrative MRR computation is sketched after this list.)
Table 4: All BLI scores (MRR) with Bulgarian (BG) as the source language.
Table 5: All BLI scores (MRR) with Catalan (CA) as the source language.
Table 6: All BLI scores (MRR) with Esperanto (EO) as the source language.
Table 7: All BLI scores (MRR) with Estonian (ET) as the source language.
Table 8: All BLI scores (MRR) with Basque (EU) as the source language.
Table 9: All BLI scores (MRR) with Finnish (FI) as the source language.
Table 10: All BLI scores (MRR) with Hebrew (HE) as the source language.
Table 11: All BLI scores (MRR) with Hungarian (HU) as the source language.
Table 12: All BLI scores (MRR) with Indonesian (ID) as the source language.
Table 13: All BLI scores (MRR) with Georgian (KA) as the source language.
Table 14: All BLI scores (MRR) with Korean (KO) as the source language.
Table 15: All BLI scores (MRR) with Lithuanian (LT) as the source language.
Table 16: All BLI scores (MRR) with Norwegian (NO) as the source language.
Table 17: All BLI scores (MRR) with Thai (TH) as the source language.
Table 18: All BLI scores (MRR) with Turkish (TR) as the source language.
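
Table 2 reports BLI performance as mean reciprocal rank (MRR). As a rough illustration only, not the authors' evaluation code, the following NumPy sketch computes MRR over a test dictionary; all names here are hypothetical, and it assumes already-mapped source and target embedding matrices with gold pairs given as (source index, target index). The paper's exact protocol (e.g., handling multiple gold translations per source word) may differ.

    import numpy as np

    def bli_mrr(src_vecs, tgt_vecs, gold_pairs):
        """Mean reciprocal rank for BLI: rank every target word by cosine
        similarity to each source query and score 1/rank of its gold target."""
        # L2-normalise rows so dot products equal cosine similarities.
        src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
        tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
        reciprocal_ranks = []
        for s_idx, t_idx in gold_pairs:
            sims = tgt @ src[s_idx]                # similarity to all target words
            rank = 1 + np.sum(sims > sims[t_idx])  # 1-based rank of the gold target
            reciprocal_ranks.append(1.0 / rank)
        return float(np.mean(reciprocal_ranks))

On a toy identity-aligned space (e.g., tgt_vecs equal to src_vecs and gold_pairs = [(i, i) for i in range(n)]), the sketch returns 1.0, matching the intuition that a perfectly aligned space ranks every gold translation first.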

Table of Contents

  • Abstract
  • 1 Introduction
  • 2 Methodology
  • 3 Experimental Setup
  • 4 Results and Discussion
  • 5 Conclusion and Future Work
  • Acknowledgments
  • References
  • A Supplemental Material

References

  • Jean Alaux, Edouard Grave, Marco Cuturi, and Armand Joulin. 2019. Unsupervised hyperalignment for multilingual word embeddings. In Proceedings of ICLR.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of EMNLP, pages 2289–2294.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL, pages 451–462.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of ACL, pages 789–798.
  • Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. In Proceedings of ICLR.
  • Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, and Eneko Agirre. 2018c. Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation. In Proceedings of CoNLL, pages 282–291.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, 5:135–146.