Machine Learning Thesis Proposal
Machine Learning Department
Carnegie Mellon University
Virtual Presentation - ET
In recent years, the quest for artificial intelligence capable of digital, physical, and social intelligence has led to an explosion of interest in multimodal datasets and algorithms. This research area of multimodal machine learning studies the computational and theoretical foundations of learning from heterogeneous data sources. As a step towards the next generation of multimodal technologies, this thesis studies two core challenges in multimodal learning: (1) constructing multimodal models and datasets that enable generalization across a large number of modalities and different tasks, and (2) designing quantification methods to comprehensively understand the internal mechanics of multimodal representations and gain insights for safe real-world deployment.
In the first part, we study generalization in multimodal learning. Generalization is particularly beneficial when one modality has limited resources, such as a lack of annotated data, noisy inputs, or unreliable labels, and it presents a step towards processing a large number of diverse and understudied modalities. To enable the study of generalization, we introduce MultiBench, a unified large-scale benchmark across a wide range of modalities, tasks, and research areas. Using MultiBench, we study generalization through three paradigms: (1) generalization across different modalities, (2) generalization across both modalities and tasks, and (3) generalization in non-parallel scenarios, where we are presented with a large number of modalities but each task is defined only over a small subset of them.
The second part studies quantification of the multimodal learning process at 3 levels: (1) output qualities: the extent to which models are predictive, efficient, and robust under natural and targeted modality imperfections, (2) internal mechanics: understanding the internal modeling of multimodal information and cross-modal interactions, and (3) modality tradeoffs: theoretically quantifying the utility and risks of each input modality, while balancing these tradeoffs for reliable real-world usage. We conclude this thesis by discussing how future work can leverage these ideas to drive progress towards more general, scalable, and explainable multimodal models.
Thesis Committee:
Louis-Philippe Morency (Co-chair)
Ruslan Salakhutdinov (Co-chair)
Manuel Blum
Lenore Blum (University of California, Berkeley)
Trevor Darrell (University of California, Berkeley)
Zoom Participation. See announcement.