Making Sentence Embeddings Robust to User-Generated Content @MicrosoftResearch

Microsoft Research | Uploaded June 2024 | Updated October 2024
This seminar was hosted by Microsoft Research Africa, Nairobi together with the Microsoft AI for Good team in May 2024.

User-generated content (UGC), such as social media posts written in "Internet language", exhibits extensive lexical variation and deviates from standard language. As a result, NLP models trained mostly on standard text are known to perform poorly on UGC, and sentence embedding models like LASER are no exception.

In this talk, we focus on the robustness of LASER to UGC data. We measure this robustness by how closely LASER represents non-standard sentences and their standard counterparts in the embedding space. Inspired by previous work extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We also use data augmentation to generate synthetic UGC-like training data.
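The teacher-student idea above can be illustrated with a minimal sketch. It assumes a mean-squared-error objective that pulls the student's embeddings of both the standard and the UGC sentence toward a frozen teacher's embedding of the standard sentence; the exact loss is not stated in this description, and the function names and toy vectors below are hypothetical. Cosine distance is included only as one plausible way to check how close two representations are.

```python
import math


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distillation_loss(teacher_std, student_std, student_ugc):
    """Assumed objective: both student embeddings are pulled toward the
    frozen teacher's embedding of the standard sentence."""
    return mse(student_std, teacher_std) + mse(student_ugc, teacher_std)


# Toy 3-dimensional embeddings (hypothetical values, not model outputs).
teacher_std = [1.0, 0.0, 0.0]  # teacher embedding of the standard sentence
student_std = [0.9, 0.1, 0.0]  # student embedding of the standard sentence
student_ugc = [0.5, 0.5, 0.0]  # student embedding of the UGC variant

loss = distillation_loss(teacher_std, student_std, student_ugc)
gap = cosine_distance(student_ugc, teacher_std)
print(f"loss={loss:.4f} ugc_gap={gap:.4f}")
```

In this toy setting the UGC term dominates the loss, so minimizing it mainly moves the UGC embedding toward the standard one, which is the behavior the abstract describes.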

We show that RoLASER significantly improves LASER’s robustness to both natural and artificial UGC data by achieving up to 2× and 11× better alignment scores. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
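Two of the UGC phenomena named above, keyboard typos and social media abbreviations, are easy to simulate when building synthetic UGC-like data. The sketch below is only an illustration: the abbreviation table, the partial QWERTY neighbour map, and the function names are assumptions made for this example, not the augmentation pipeline used in the talk.

```python
import random

# Hypothetical abbreviation table (lowercase words only).
ABBREVIATIONS = {"you": "u", "are": "r", "see": "c",
                 "tomorrow": "tmrw", "please": "pls"}

# Partial QWERTY neighbour map; keys not listed here are never perturbed.
KEY_NEIGHBOURS = {"a": "qwsz", "e": "wsdr", "o": "iklp",
                  "s": "awedxz", "t": "rfgy", "n": "bhjm"}


def abbreviate(sentence):
    """Replace whole words that appear in the abbreviation table."""
    return " ".join(ABBREVIATIONS.get(w, w) for w in sentence.split())


def keyboard_typo(sentence, rng):
    """Swap one character for an adjacent key, chosen with the given RNG."""
    positions = [i for i, ch in enumerate(sentence) if ch in KEY_NEIGHBOURS]
    if not positions:
        return sentence  # nothing in the sentence can be perturbed
    i = rng.choice(positions)
    replacement = rng.choice(KEY_NEIGHBOURS[sentence[i]])
    return sentence[:i] + replacement + sentence[i + 1:]


standard = "see you tomorrow"
print(abbreviate(standard))
print(keyboard_typo(standard, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which helps when the same standard/UGC pairs must be regenerated for training and evaluation.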

Speaker: Lydia Nishimwe

Learn more about Microsoft Research Lab – Africa, Nairobi: microsoft.com/en-us/research/lab/microsoft-research-lab-africa-nairobi/seminars
