A recent study (Liu and He, 2024) has shown that large-scale visual datasets are very biased: they can be easily classified by modern neural networks. However, the concrete forms of bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from the datasets, and assesses how much each type of information reflects their bias. We further decompose their semantic bias with object-level analysis, and leverage natural language methods to generate detailed, open-ended descriptions of each dataset's characteristics. Our work aims to help researchers understand the bias in existing large-scale pre-training datasets, and to build more diverse and representative ones in the future.
Dataset classification is a classification task where each dataset forms a class and models are trained to predict the dataset origin of each image. Serving as an indicator of dataset bias, it was proposed by Torralba and Efros in 2011 on smaller-scale datasets, and recently revisited by Liu and He on large-scale pre-training datasets.
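As a minimal sketch (assuming the three datasets' images are placed in separate folders and using an off-the-shelf torchvision classifier; the exact architectures and training recipes of the original studies may differ), dataset classification can be set up as an ordinary 3-way image classification problem:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Hypothetical layout: ycd_train/yfcc/*.jpg, ycd_train/cc/*.jpg, ycd_train/datacomp/*.jpg,
# so that each dataset of origin becomes one class label.
tf = transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()])
train_set = datasets.ImageFolder("ycd_train", transform=tf)   # path is illustrative
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

# Any standard image classifier works; the head predicts which dataset an image came from.
model = models.resnet50(num_classes=3).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, dataset_ids in loader:        # dataset_ids in {0, 1, 2}
    logits = model(images.cuda())
    loss = criterion(logits, dataset_ids.cuda())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```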
Back in 2011, the "Name That Dataset" experiment proposed by Torralba and Efros revealed the built-in bias of visual datasets at that time (Caltech-101, COIL-100, LabelMe, etc.): the datasets could be classified very well by SVM classifiers.
Surprisingly, after a decade of effort in creating more diverse and comprehensive visual datasets, today's largest, uncurated datasets (e.g., YFCC100M, CC12M, DataComp-1B) can still be classified with remarkably high accuracy.
Although these large-scale datasets are very biased, a lingering question remains:
what are the concrete forms of bias among them?
To better understand this, we apply various transformations to the datasets, selectively preserving or suppressing specific types of information. We then perform dataset classification on the transformed datasets and analyze its performance.
The dataset classification validation accuracy is 82.0% on the original datasets: YFCC, CC, and DataComp (abbreviated as YCD).
We extract semantic components from the images with decreasing levels of spatial detail using semantic segmentation, object detection, and LLaVA captioning. The dataset classification accuracy remains consistently high on the datasets resulting from these transformations. Also, passing the images through a VAE to suppress potential low-level signatures only marginally decreases the accuracy from the 82% reference. These results highlight that semantic bias is an important component of dataset bias in YCD.
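For the VAE pass-through in particular, one plausible implementation (the pretrained autoencoder named below is an assumption, not necessarily the one used in the study) is to reconstruct each image through a pretrained image VAE so that low-level pixel statistics are re-synthesized by the decoder:

```python
import torch
from diffusers import AutoencoderKL

# Pretrained Stable Diffusion VAE; the checkpoint name is illustrative.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval().cuda()

@torch.no_grad()
def vae_roundtrip(img):
    """img: (B, 3, H, W) float tensor in [0, 1]; returns the VAE reconstruction."""
    x = img.cuda() * 2 - 1                          # VAE expects inputs in [-1, 1]
    latents = vae.encode(x).latent_dist.sample()    # compress away low-level detail
    recon = vae.decode(latents).sample
    return (recon.clamp(-1, 1) + 1) / 2
```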
Next, we analyze the semantic bias in its most rudimentary forms, object shape and spatial geometry, using Canny edge detection, Segment Anything Model (SAM) contours, and depth estimation. The close-to-reference dataset classification accuracies on these transformed datasets show that variations in object shape and spatial geometry are significant among the YCD datasets.
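A sketch of the boundary-only transformation, assuming OpenCV's Canny detector with illustrative thresholds (the SAM-contour and depth variants would replace this step with SAM mask boundaries or a monocular depth estimator):

```python
import cv2
import numpy as np

def canny_transform(path, low=100, high=200):
    """Keep only object boundaries: grayscale image -> binary Canny edge map."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, low, high)           # uint8 map with values {0, 255}
    return np.stack([edges] * 3, axis=-1)        # replicate to 3 channels for the classifier
```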
We shuffle each image at the pixel level and at the patch level, following either a fixed order or a random order for all images. The significant performance drop with pixel shuffling shows that completely destroying local structure in YCD reduces its dataset bias to a large extent. However, the minimal accuracy decrease after shuffling patches of size 16 indicates that patch-level local structure alone is sufficient for identifying the visual signatures of the YCD datasets.
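A minimal patch-shuffling sketch (the patch size and the fixed-vs-random permutation options follow the description above; other implementation details are assumptions):

```python
import torch

def shuffle_patches(img, patch=16, fixed_perm=None):
    """Shuffle non-overlapping patch x patch blocks of an image tensor (C, H, W).

    If fixed_perm is given, the same permutation is reused for every image;
    otherwise a fresh random order is drawn. Setting patch=1 recovers pixel shuffling.
    """
    c, h, w = img.shape
    gh, gw = h // patch, w // patch
    blocks = img.unfold(1, patch, patch).unfold(2, patch, patch)     # (C, gh, gw, p, p)
    blocks = blocks.permute(1, 2, 0, 3, 4).reshape(gh * gw, c, patch, patch)
    perm = fixed_perm if fixed_perm is not None else torch.randperm(gh * gw)
    blocks = blocks[perm]
    # stitch the shuffled blocks back into an image
    blocks = blocks.reshape(gh, gw, c, patch, patch).permute(2, 0, 3, 1, 4)
    return blocks.reshape(c, gh * patch, gw * patch)
```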
Even if we take only the mean RGB value for each image, the datasets can still be classified with higher-than-chance-level accuracy. The distribution of RGB channels for each dataset shows that YFCC is much darker than CC and DataComp.
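As a sketch of this color-only probe, each image can be reduced to its mean R, G, B values and fed to a simple classifier (the study's exact classifier may differ; the arrays below are random stand-ins for the real features and dataset labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mean_rgb(img):
    """Per-image mean R, G, B; img is an (H, W, 3) uint8 array."""
    return img.reshape(-1, 3).mean(axis=0)

# Stand-in data: in practice, features come from mean_rgb() over every image
# and labels are dataset ids (0 = YFCC, 1 = CC, 2 = DataComp).
rng = np.random.default_rng(0)
train_x, train_y = rng.uniform(0, 255, (3000, 3)), rng.integers(0, 3, 3000)
val_x, val_y = rng.uniform(0, 255, (1000, 3)), rng.integers(0, 3, 1000)

clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
print("val accuracy:", clf.score(val_x, val_y))   # ~1/3 on random data; above chance on YCD
```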
We apply low-pass and high-pass filters to isolate low-frequency (general structure and smooth variations) and high-frequency (textures and sharp transitions) information. The high accuracy of models trained on either frequency component indicates that visual bias in the YCD datasets exists in both low-frequency and high-frequency components.
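One way to implement the frequency split (the circular cutoff radius is an assumed hyperparameter) is a masked Fourier transform applied per channel:

```python
import numpy as np

def frequency_filter(channel, radius=30, keep="low"):
    """Split one image channel (H, W float array) into low- or high-frequency content.

    keep="low" retains smooth, global structure; keep="high" retains edges and texture.
    """
    f = np.fft.fftshift(np.fft.fft2(channel))
    h, w = channel.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = dist <= radius if keep == "low" else dist > radius
    return np.fft.ifft2(np.fft.ifftshift(f * mask)).real
```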
We train an unconditional diffusion model on each dataset. Dataset classification on the synthetic data generated from each model still reaches very high accuracy, showing that synthetic images sampled from a diffusion model inherit the bias in the model's training images. In addition, we map the LLaVA-generated image captions back to the image domain using text-to-image diffusion, resulting in a 58% classification accuracy. This further confirms that semantic discrepancy is a major contributor to dataset bias.
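The caption-to-image step could look roughly like the following, using an off-the-shelf text-to-image diffusion pipeline (the checkpoint and sampling settings are assumptions, not the study's exact setup):

```python
import torch
from diffusers import StableDiffusionPipeline

# Off-the-shelf text-to-image model; the checkpoint name is illustrative.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a group of hikers standing on a mountain trail"   # an example LLaVA-style caption
image = pipe(caption, num_inference_steps=30).images[0]
image.save("caption_to_image.png")
```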
Semantic bias is pronounced in YCD datasets…
how can we explain the semantic patterns?
Object classes with highest proportions of YFCC, CC, or DataComp images
Object class ranking from logistic regression coefficients
Based on the image-level object annotations from ImageNet, LVIS, and ADE20K, we can identify the object classes representative of each dataset in two ways: by looking directly at the object distribution across datasets, or by ranking object classes according to the coefficients of a logistic regression model trained on binary vectors encoding object presence. YFCC emphasizes outdoor scenes, while CC and DataComp focus on household items and products.
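A sketch of the logistic-regression ranking (the object-presence matrix and class names below are random placeholders; the real ones come from the annotation sources listed above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# presence[i, j] = 1 if object class j appears in image i; labels[i] is the dataset id.
rng = np.random.default_rng(0)
presence = rng.integers(0, 2, size=(5000, 1200))
labels = rng.integers(0, 3, size=5000)
class_names = [f"class_{j}" for j in range(presence.shape[1])]   # placeholder names

clf = LogisticRegression(max_iter=2000).fit(presence, labels)

# For each dataset, the largest positive coefficients point to the object classes
# whose presence most strongly signals that dataset.
for d, name in enumerate(["YFCC", "CC", "DataComp"]):
    top = np.argsort(clf.coef_[d])[::-1][:10]
    print(name, [class_names[j] for j in top])
```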
On average, YFCC contains the highest number of unique objects in each image, while DataComp exhibits the lowest.
We use LLaVA-generated captions as a proxy to analyze the semantic themes in each dataset. Specifically, we use LDA for unsupervised topic discovery and procedurally prompt an LLM to summarize each dataset's characteristics.
Unsupervised Topic Discovery (Latent Dirichlet Allocation)
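A minimal LDA sketch over the captions (the caption corpus, topic count, and vocabulary size below are illustrative assumptions; the LLM-summarization step is separate and not shown):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# In practice this is one LLaVA-generated caption per image; three stand-ins shown here.
captions = [
    "a family hiking along a forest trail at sunset",
    "a white ceramic mug product photo on a plain background",
    "an infographic describing quarterly sales figures",
]

# Bag-of-words counts, then LDA topic discovery.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(captions)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)

# Print the most probable words of each discovered topic.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[::-1][:8]]
    print(f"topic {k}: {' '.join(top_words)}")
```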
LLM Summarization
In summary, YFCC is characterized by abundant outdoor, natural, and human-related scenes, while DataComp concentrates on static objects and synthetic images with clean backgrounds and minimal human presence. In contrast, CC blends elements of both YFCC's dynamic scenes and DataComp's static imagery.
Filtering based on a reference dataset may inherit its bias.
DataComp has the fewest unique objects per image. This is possibly because DataComp filters for images with visual content close to ImageNet data in the embedding space. Therefore, the remaining images tend to be object-centric.
The source website's image collection mechanism can introduce bias.
YFCC is heavily skewed towards outdoor scenes and human interactions. This bias likely stems from its reliance on a single data source, Flickr.com, where user-uploaded content often focuses on personal photos, landscapes, and social interactions.
Web-scraped images would naturally contain more digital graphics.
Since CC and DataComp images come from Internet webpages, professionally created content such as advertisements, infographics, and digital media is more prevalent. Dataset users should evaluate whether this composition aligns with their downstream goals.
@inproceedings{zengyin2024bias,
  title={Understanding Bias in Large-Scale Visual Datasets},
  author={Boya Zeng and Yida Yin and Zhuang Liu},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
}