<node id="671015">
  <nid>671015</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1699650600</created>
  <changed>1699650600</changed>
  <title><![CDATA[PhD Proposal by Yang Chen]]></title>
  <body><![CDATA[<p><span><span><strong><span><span>Title: </span></span></strong><span><span>Benchmarking Multilingual and Multimodal Intelligent Systems</span></span></span></span></p>

<p><span><span><strong><span><span>Date/Time</span></span></strong><span><span>: Nov 17, 2023, 3:00 PM to 5:00 PM Eastern Time (US)</span></span></span></span></p>

<p><span><span><strong><span><span>Location</span></span></strong><span><span>: <a href="https://gatech.zoom.us/j/98760293352?pwd=bjB6ekNzWEI1bVNiWkxZb1hkckZsUT09" title="https://gatech.zoom.us/j/98760293352?pwd=bjB6ekNzWEI1bVNiWkxZb1hkckZsUT09">Zoom Link</a></span></span></span></span></p>

<p><span><span><span><span>Meeting ID: 987 6029 3352</span></span></span></span></p>

<p><span><span><span><span>Passcode: 443653</span></span></span></span></p>

<p>&nbsp;</p>

<p><span><span><strong><span><span>Yang Chen </span></span></strong><span><span>(<a href="https://edchengg.github.io/" title="https://edchengg.github.io/">Homepage</a>)</span></span></span></span></p>

<p><span><span><span><span>Ph.D. Candidate in Computer Science</span></span></span></span></p>

<p><span><span><span><span>School of Interactive Computing</span></span></span></span></p>

<p><span><span><span><span>Georgia Institute of Technology</span></span></span></span></p>

<p>&nbsp;</p>

<p><span><span><strong><span><span>Committee:</span></span></strong></span></span></p>

<p><span><span><span><span>Dr. Alan Ritter (advisor), School of Interactive Computing, Georgia Tech</span></span></span></span></p>

<p><span><span><span><span>Dr. Wei Xu (co-advisor), School of Interactive Computing, Georgia Tech</span></span></span></span></p>

<p><span><span><span><span>Dr. Kartik Goyal, School of Interactive Computing, Georgia Tech</span></span></span></span></p>

<p><span><span><span><span>Dr. Hexiang (Frank) Hu, Google DeepMind</span></span></span></span></p>

<p><span><span><span><span>Dr. Ming-Wei Chang, Google DeepMind</span></span></span></span></p>

<p>&nbsp;</p>

<p><span><span><strong><span><span>Abstract:</span></span></strong></span></span></p>

<p><span><span><span><span>Language serves as the cornerstone and medium for transferring human intellect across communities worldwide. Recent developments in large language models, which consume vast amounts of human knowledge from large-scale online text corpora, have revolutionized the field of natural language processing (NLP) and serve as building blocks for intelligent systems that benefit humanity.</span></span></span></span></p>

<p><span><span><span><span>However, two primary challenges remain: 1) the significant resource imbalance among languages, driven by disparities in the wealth of resources across countries, cultures, and geographic regions, diminishes the efficacy of language models in understanding and serving speakers of low-resource languages; 2) restricting models to the language-only modality limits the knowledge they can acquire and narrows the domain of applications, such as assisting people with visual impairments or helping people interact with the visual environment.</span></span></span></span></p>

<p>&nbsp;</p>

<p><span><span><span><span>This thesis proposal aims to address these two challenges through benchmarking, in pursuit of building reliable multilingual and multimodal intelligent systems that benefit humanity.</span></span></span></span></p>

<p><span><span><span><span>In the first part of the presentation, I present methods I developed to improve language model understanding of low-resource languages, including a synthetic data generation model and a novel algorithm that translates and fuses annotations from high-resource languages. In the second part of the presentation, I introduce the InfoSeek benchmark (1M+ questions), which assesses the capabilities of vision-language models to answer visual information-seeking questions about entities present in an image. By benchmarking multimodal large language models and retrieval-augmented generation models on InfoSeek, we present insights that benefit the future development of multimodal intelligent systems.</span></span></span></span></p>

<p>&nbsp;</p>
]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Benchmarking Multilingual and Multimodal Intelligent Systems]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p><span><span><span>Benchmarking Multilingual and Multimodal Intelligent Systems</span></span></span></p>
]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2023-11-17T15:00:00-05:00]]></value>
      <value2><![CDATA[2023-11-17T17:00:00-05:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[ZOOM]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>102851</tid>
        <value><![CDATA[PhD proposal]]></value>
      </item>
      </field_keywords>
  <userdata><![CDATA[]]></userdata>
</node>
