爱可可 AI Paper Recommendations (October 15)

LG - Machine Learning CL - Computation and Language AS - Audio and Speech IR - Information Retrieval
1. [LG] Characterising Bias in Compressed Models
S Hooker, N Moorosi, G Clark, S Bengio, E Denton
[Google Research]
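Model compression amplifies bias in deep networks. Techniques such as pruning and quantization achieve high levels of compression while leaving overall error almost unchanged, but a small subset of examples, called Compression Identified Exemplars (CIE), bears a disproportionately high share of the error. On CIE, compression amplifies existing algorithmic bias: pruning disproportionately hurts performance on underrepresented features, which often coincides with fairness considerations. Because the CIE set is small, it can be isolated as a tractable subset for inspection or annotation by a domain expert.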
The popularity and widespread use of pruning and quantization is driven by the severe resource constraints of deploying deep neural networks to environments with strict latency, memory and energy requirements. These techniques achieve high levels of compression with negligible impact on top-line metrics (top-1 and top-5 accuracy). However, overall accuracy hides disproportionately high errors on a small subset of examples; we call this subset Compression Identified Exemplars (CIE). We further establish that for CIE examples, compression amplifies existing algorithmic bias. Pruning disproportionately impacts performance on underrepresented features, which often coincides with considerations of fairness. Given that CIE is a relatively small subset but a great contributor of error in the model, we propose its use as a human-in-the-loop auditing tool to surface a tractable subset of the dataset for further inspection or annotation by a domain expert. We provide qualitative and quantitative support that CIE surfaces the most challenging examples in the data distribution for human-in-the-loop auditing.
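To make the idea of a compression-identified subset concrete, here is a minimal sketch of one way to flag such examples. It assumes we already have class predictions from a population of uncompressed models and a population of compressed models, and it flags an example when the most frequent (modal) label differs between the two populations. That disagreement criterion, the function names, and the toy predictions are assumptions for illustration; the abstract above does not spell out the exact procedure.

```python
import numpy as np


def modal_labels(preds, num_classes):
    """Return the most frequent predicted label per example.

    preds: integer array of shape (num_models, num_examples).
    Ties are resolved in favour of the smallest label index.
    """
    preds = np.asarray(preds)
    num_examples = preds.shape[1]
    counts = np.zeros((num_classes, num_examples), dtype=int)
    cols = np.arange(num_examples)
    for row in preds:
        # Each model casts one vote per example.
        counts[row, cols] += 1
    return counts.argmax(axis=0)


def find_cie(dense_preds, compressed_preds, num_classes):
    """Indices of examples whose modal label changes under compression.

    Assumption: "modal label disagreement" is used as the flagging rule.
    """
    dense_mode = modal_labels(dense_preds, num_classes)
    compressed_mode = modal_labels(compressed_preds, num_classes)
    return np.flatnonzero(dense_mode != compressed_mode)


# Toy data (hypothetical): 3 models per population, 4 examples, 3 classes.
dense = np.array([[0, 1, 2, 1],
                  [0, 1, 2, 1],
                  [0, 1, 2, 0]])
compressed = np.array([[0, 1, 0, 1],
                       [0, 1, 0, 1],
                       [0, 1, 2, 1]])
flagged = find_cie(dense, compressed, num_classes=3)
print(flagged)
```

The flagged indices form the small subset that could then be handed to a domain expert for inspection or annotation, in the spirit of the human-in-the-loop auditing described above.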
2. [CL] Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!
S Sia, A Dalmia, S J. Mielke
[Johns Hopkins University]
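Topic analysis by clustering pretrained word embeddings: the method clusters pretrained word embeddings, incorporates document information to weight the clustering, and reranks the top words, yielding unsupervised topic analysis of text. Experiments show that pretrained word embeddings (contextualized or not), combined with TF-weighted K-Means and TF-based reranking, offer a viable alternative to traditional topic modeling at lower complexity and lower runtime.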
Topic models are a useful analysis tool to uncover the underlying themes within document collections. The dominant approach is to use probabilistic topic models that posit a generative story, but in this paper we propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
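The pipeline described above (cluster word embeddings with term-frequency weights, then rerank each cluster's top words by term frequency) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the hand-made 2-D vectors stand in for real pretrained embeddings, the deterministic farthest-point initialization is an assumption chosen to keep the example reproducible, and all function and parameter names are hypothetical.

```python
import numpy as np


def nearest_center(X, centers):
    """Index of the closest center (squared Euclidean) for each row of X."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def weighted_kmeans(X, weights, k, n_iter=20):
    """K-Means where each point pulls its centroid in proportion to its weight."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Assumed initialization: start from the heaviest point, then repeatedly
    # add the point farthest from all centers chosen so far.
    chosen = [int(np.argmax(weights))]
    while len(chosen) < k:
        d = ((X[:, None, :] - X[chosen][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        chosen.append(int(np.argmax(d)))
    centers = X[chosen].copy()
    for _ in range(n_iter):
        assign = nearest_center(X, centers)
        for j in range(k):
            members = assign == j
            if members.any():
                # Weighted mean: frequent words move the centroid more.
                centers[j] = np.average(X[members], axis=0, weights=weights[members])
    return nearest_center(X, centers), centers


def top_words(words, X, tf, assign, centers, n_candidates=10, top_n=5):
    """Take the words nearest each centroid, then rerank them by term frequency."""
    X = np.asarray(X, dtype=float)
    tf = np.asarray(tf)
    topics = []
    for j, center in enumerate(centers):
        idx = np.flatnonzero(assign == j)
        order = np.argsort(((X[idx] - center) ** 2).sum(axis=1))
        cand = idx[order[:n_candidates]]
        reranked = cand[np.argsort(-tf[cand], kind="stable")]
        topics.append([words[i] for i in reranked[:top_n]])
    return topics


# Toy vocabulary with made-up 2-D "embeddings" and corpus term frequencies.
words = ["cat", "dog", "fish", "car", "bus", "train"]
emb = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
tf = np.array([5, 3, 2, 4, 3, 1])

assign, centers = weighted_kmeans(emb, tf, k=2)
topics = top_words(words, emb, tf, assign, centers, n_candidates=3, top_n=2)
print(topics)
```

With real data, `emb` would hold pretrained vectors for the corpus vocabulary and `tf` the word counts from the document collection. The weighting matters because it keeps rare words from dragging centroids away from the terms that actually dominate the documents.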