Khuyagbaatar Batsuren

Associate Professor, National University of Mongolia

About Me

My research focuses on computational models of language diversity and on how that diversity can be understood and exploited in linguistics. Using such models, I am particularly interested in developing multilingual linguistic resources covering cognates, morphology, colexification, and lexical gaps. I have more than ten years of programming experience (e.g., Java, C++, R, Python, Lua) and have developed a wide range of applications (e.g., NLP systems, chatbots, web and mobile apps).

Projects

The Universal Knowledge Core

March 2015 - Present (IJCAI 2017)

ukc.datascientia.eu

The UKC is a huge, high-quality, human-curated multilingual lexical database, with millions of words from over 1,000 languages, interconnected by supra-lingual concepts that are shared across languages, yet are culture-specific. The UKC is our principal tool for collecting diversity-aware knowledge on the lexical level.

While a handful of independent efforts exist on large, multilingual, interconnected online lexical resources, the UKC is unique in several aspects.

  • High quality: the language resources that constitute the UKC were built and/or validated by humans. We avoid ‘mass importing’ data from dubious sources.

  • Diversity-aware by design: most multilingual lexical resources do not represent concepts or linguistic phenomena that are specific to non-dominant languages, language families, dialects, geographical regions, etc. The UKC, on the other hand, was designed from the start to focus on such phenomena. As language diversity is our core research, we continuously expand the ways the UKC is able to formalise language diversity, e.g. through the support of lexical gaps, word or concept-level metadata, or cross-lingual lexico-semantic relations.

  • Common-sense lexicon: the UKC focuses on the common-sense lexicon, excluding encyclopaedic knowledge (i.e. named entities) as well as highly specialised knowledge (i.e. domain terminologies).

My role in this project is to build the resource from scratch by integrating wordnets and dictionaries while ensuring its quality. Currently, the UKC contains more than 1,000 languages, 120K concepts and their relations, and 2.1 million words.

CogNet - a large and evolving cognate database

June 2018 - June 2021 (ACL 2019 and LRE 2021)

github.com/kbatsuren/CogNet

[Figure: CogNet cognate data displayed on a world map in Linguarena]

CogNet is a large-scale database of cognate pairs: CogNet v2 contains 1.07M words and 8.1M cognates in 338 languages, 38 writing systems, and 91,285 concepts. A manual evaluation puts its precision at 94%. It was automatically constructed from wordnets and dictionaries contained within the UKC resource, as described in our paper. I built a website called Linguarena to display and browse the cognate data (currently an older version) interactively on a world map, as shown in the figure above. The fish example is on this link.

Wiktra

June 2018 - Present

github.com/kbatsuren/wiktra

Wiktra is a Unicode transliteration tool written in Python. With the new API, Wiktra supports 514 languages in 102 scripts (or orthographies), covering nearly all of the languages supported by Wiktionary modules. Because its core code was developed by Wiktionary linguists and developers, Wiktra is a rule-based, high-quality transliteration tool. I initially developed this tool to build CogNet and later released the code.


MorphyNet - A Large Multilingual Database of Derivational and Inflectional Morphology

July 2019 - Present (SIGMORPHON 2021)

github.com/kbatsuren/MorphyNet

MorphyNet is a large morphological database that:

  • currently covers 15 languages: Catalan, Czech, English, Finnish, French, German, Hungarian, Italian, Mongolian, Polish, Portuguese, Russian, Serbo-Croatian, Spanish, and Swedish;
  • provides both derivational and inflectional data;
  • provides morpheme segmentations (e.g. segment+ation+s);
  • was extracted mostly from Wiktionary and is thus of high quality;
  • contains 10M inflectional and 696K derivational instances;
  • is available for free.

Inflectional instances

language  lemma       inflected form  morphological features          morpheme segmentation
hun       holmi       holmikat        N | PL | ACC                    holmi | k | at
rus       оборваться  оборвались      V;PFV | IND;PST;FIN | PL | MID  оборвать | л | и | ся
cat       ossificar   ossificaven     V | IND;PST;IPFV | 3;PL         ossificar | ava | en
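
As a minimal sketch of how one such row could be read programmatically, the Python below splits the feature and segmentation columns on the `|` separator and the feature groups on `;`, exactly as they are displayed in the rows above. The class name `InflectionalInstance` and the function `parse_instance` are hypothetical names chosen for illustration, and the assumption that the five column values arrive already separated is mine; the released MorphyNet files may use a different layout.

```python
from dataclasses import dataclass


@dataclass
class InflectionalInstance:
    """One inflectional row (hypothetical container, not part of MorphyNet)."""
    language: str
    lemma: str
    inflected_form: str
    features: list       # one list of tags per "|"-separated feature group
    segmentation: list   # one string per "|"-separated morpheme


def parse_instance(fields):
    """Build an instance from the five column values of one row."""
    language, lemma, form, features, segmentation = fields
    return InflectionalInstance(
        language=language,
        lemma=lemma,
        inflected_form=form,
        # "V | IND;PST;IPFV | 3;PL" -> [["V"], ["IND", "PST", "IPFV"], ["3", "PL"]]
        features=[group.strip().split(";") for group in features.split("|")],
        # "ossificar | ava | en" -> ["ossificar", "ava", "en"]
        segmentation=[morph.strip() for morph in segmentation.split("|")],
    )


row = ("cat", "ossificar", "ossificaven",
       "V | IND;PST;IPFV | 3;PL", "ossificar | ava | en")
instance = parse_instance(row)

# Pair each morpheme with the feature group in the same position.
# The counts match in the three rows shown here; treating the pairing
# as positional is an assumption of this sketch.
for morph, tags in zip(instance.segmentation, instance.features):
    print(morph, tags)
```

In the three example rows the number of feature groups equals the number of morpheme segments, which is why the sketch pairs them by position; that alignment should be checked against the data before relying on it.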

Education

University of Trento, Italy (March 2015 - December 2018)

Ph.D. in Computer Science (cum laude)

Thesis: Understanding and Exploiting Language Diversity

Scholarship: Marie-Curie Three-Year Ph.D. Fellowship

Chungbuk National University, South Korea (March 2013 - February 2015)

M.Sc. in Computer Science

Thesis: A dependency-graph based keyphrase extraction using anti-patterns

Scholarship: CBNU BK fellowship

National University of Mongolia (September 2008 - June 2012)

Bachelor's degree in Computer Science and Informational Technology

Publications

  • Batsuren, K., Bella, G. and Giunchiglia, F., 2021. MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology. SIGMORPHON 2021 (pp. 39-49).
  • Bella, G., Batsuren, K. and Giunchiglia, F., 2021. A Database and Visualization of the Similarity of Contemporary Lexicons. 24th International Conference on Text, Speech, and Dialogue, Olomouc, Czechia.
  • Batsuren, K., Bella, G. and Giunchiglia, F., 2021. A large and evolving cognate database. Language Resources and Evaluation (LRE), pp.1-25.
  • Giunchiglia, F., Otterbacher, J., Kleanthous, S., Batsuren, K., Bogina, V., Kuflik, T. and Tal, A.S., 2021. Towards Algorithmic Transparency: A Diversity Perspective. arXiv preprint arXiv:2104.05658.
  • Orphanou, K., Otterbacher, J., Kleanthous, S., Batsuren, K., Giunchiglia, F., Bogina, V., Tal, A.S. and Kuflik, T., 2021. Mitigating Bias in Algorithmic Systems: A Fish-Eye View of Problems and Solutions Across Domains. arXiv preprint arXiv:2103.16953.
  • Nair, N.C., Velayuthan, R.S. and Batsuren, K., 2019, September. Aligning the IndoWordNet with the Princeton WordNet. In Proceedings of the 3rd International Conference on Natural Language and Speech Processing (pp. 9-16).
  • Batsuren, K., Bella, G. and Giunchiglia, F., 2019, July. CogNet: A large-scale cognate database. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3136-3145).
  • Tal, A.S., Batsuren, K., Bogina, V., Giunchiglia, F., Hartman, A., Loizou, S.K., Kuflik, T. and Otterbacher, J., 2019, June. “End to End” Towards a Framework for Reducing Biases and Promoting Transparency of Algorithmic Systems. In 2019 14th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP) (pp. 1-6). IEEE.
  • Batsuren, K., Ganbold, A., Chagnaa, A. and Giunchiglia, F., 2019, July. Building the Mongolian WordNet. In Proceedings of the 10th Global WordNet Conference (pp. 238-244).
  • Giunchiglia, F., Batsuren, K. and Freihat, A.A., 2018, March. One world–seven thousand languages. In Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2018, 18-24 March 2018.
  • Giunchiglia, F., Batsuren, K. and Bella, G., 2017, August. Understanding and Exploiting Language Diversity. In IJCAI (pp. 4009-4017).
  • Giunchiglia, F., Jovanovic, M., Huertas-Migueláñez, M. and Batsuren, K., 2015, November. Crowdsourcing a large scale multilingual lexico-semantic resource. In AAAI Conference on Human Computation and Crowdsourcing (HCOMP-15).
  • Munkhdalai, T., Li, M., Batsuren, K., Park, H.A., Choi, N.H. and Ryu, K.H., 2015. Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations. Journal of cheminformatics, 7(1), pp.1-8.
  • Munkhdalai, T., Li, M., Batsuren, K. and Ryu, K.H., 2015, January. Towards a unified named entity recognition system. In Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies, Lisbon, Portugal (pp. 251-255).
  • Munkhdalai, T., Li, M., Batsuren, K. and Ryu, K.H., 2013, October. BANNER-CHEMDNER: incorporating domain knowledge in chemical and drug named entity recognition. In Proceedings of the Fourth BioCreative Challenge Evaluation Workshop (Vol. 2, pp. 135-139).
  • Batsuren, K., Munkhdalai, T., Li, M., Namsrai, E. and Ryu, K.H., 2013. A Novel Method of SMS Spam Filtering using Multiple Features. In Proceedings of the International Conference on the Frontiers of Information Technology, Applications and Tools (pp. 1-5). (Best paper award).