The table below shows metrics used to compare how effectively our matching method balances confounding variables in the corpus. On the left, we display metrics after matching, and on the right, we display metrics without matching. Each metric measures differences between the target and comparison corpora, so lower values indicate that the corpora are well-balanced. We refer to the paper for details on how each metric is computed. In this setting, only the Standardized Mean Difference (SMD) metrics are meaningful: for SMD, we exclude values that we expect to differ between the target and comparison corpora, but this exclusion is not possible for the other metrics, which we report only for completeness.
Results after tf-idf pivot-slope matching
Unmatched results
Data placeholder
Data placeholder
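As a sketch of the balance metric above, a common definition of the standardized mean difference divides the gap in covariate means by the pooled standard deviation (we assume this standard variant here; the paper's exact formulation may differ):

```python
import numpy as np

def standardized_mean_difference(target, comparison):
    """SMD for one covariate: (mean_t - mean_c) / pooled standard deviation.

    Values near 0 indicate the covariate is well-balanced across corpora.
    """
    target = np.asarray(target, dtype=float)
    comparison = np.asarray(comparison, dtype=float)
    pooled_sd = np.sqrt((target.var(ddof=1) + comparison.var(ddof=1)) / 2)
    return (target.mean() - comparison.mean()) / pooled_sd
```

In practice one would compute this per covariate (e.g., article length, page views) and compare the values with and without matching.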
The table below shows summary statistics for the target and comparison sets, with and without matching. p-values assess whether the difference between the two corpora is significant and are computed using a paired t-test.
Results after tf-idf pivot-slope matching
Unmatched results
Data placeholder
Data placeholder
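The paired t-test used here can be illustrated with SciPy's `ttest_rel`, which tests whether per-pair differences have zero mean. The values below are purely illustrative placeholders, not real corpus statistics:

```python
from scipy import stats

# Hypothetical per-pair values of one summary statistic (e.g., article
# length) for six matched target/comparison pairs; illustrative only.
target_vals =     [1200, 950, 1800, 700, 1430, 1100]
comparison_vals = [1100, 900, 1500, 720, 1300, 1050]

# Paired t-test: pairs must be aligned index-by-index.
t_stat, p_value = stats.ttest_rel(target_vals, comparison_vals)
```

A paired (rather than independent-samples) test is appropriate because each comparison article is matched to a specific target article.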
We display words that are overrepresented in target or comparison articles, computed using log-odds with a Dirichlet prior. A positive score indicates stronger association with the target set.
Results after tf-idf pivot-slope matching
Unmatched results
Data placeholder
Data placeholder
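A sketch of the log-odds-with-Dirichlet-prior score, following the standard Monroe et al. (2008) formulation (we assume this variant; the paper may differ in how the prior is chosen):

```python
import math

def log_odds_dirichlet(target_counts, comparison_counts, prior_counts):
    """Log-odds ratio with a Dirichlet prior, returned as z-scores.

    target_counts / comparison_counts: word -> count in each corpus.
    prior_counts: word -> pseudo-count (e.g., counts from a background corpus).
    Positive z-scores indicate association with the target corpus.
    """
    n_t = sum(target_counts.values())
    n_c = sum(comparison_counts.values())
    a0 = sum(prior_counts.values())
    scores = {}
    for w, a_w in prior_counts.items():
        y_t = target_counts.get(w, 0)
        y_c = comparison_counts.get(w, 0)
        # Difference of smoothed log-odds between the two corpora.
        delta = (math.log((y_t + a_w) / (n_t + a0 - y_t - a_w))
                 - math.log((y_c + a_w) / (n_c + a0 - y_c - a_w)))
        # Estimated variance of the log-odds difference.
        var = 1.0 / (y_t + a_w) + 1.0 / (y_c + a_w)
        scores[w] = delta / math.sqrt(var)
    return scores
```

The prior shrinks scores for rare words, so the top-ranked words are both frequent and strongly skewed toward one corpus.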
For all second-level sections that at least t target or comparison articles contain, we compute the average percentage of each article devoted to that section (number of tokens in the section / total number of tokens in the article, averaged across the target/comparison group). We test whether the difference in the percentage of each article devoted to each section is significant between the target and comparison groups using a paired t-test with a Benjamini-Hochberg correction. We set t as follows: all race/ethnicity (100), intersectional (50), women (500), transgender men/women (20), non-binary (50). These thresholds primarily exclude results that are not statistically significant, and we use them merely for convenience of output.
Data placeholder
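The Benjamini-Hochberg correction applied throughout these section-level tests can be sketched as follows (a minimal step-up procedure; library implementations such as statsmodels' `multipletests` behave equivalently):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR correction.

    Returns a list of booleans marking which hypotheses are rejected
    at false-discovery rate alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject all hypotheses ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected
```

Because each section (and later, each language) contributes one hypothesis, correcting for multiple comparisons keeps the expected proportion of false discoveries bounded.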
For all languages in which at least a threshold t number of target or comparison articles are available, we compute the percentage of target vs. comparison articles that are available (e.g., are target or comparison articles more likely to be available in German?). We compute significance using McNemar's test with a Benjamini-Hochberg multiple-hypothesis correction. We set t as described above.
Data placeholder
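McNemar's test is appropriate here because availability is a paired binary outcome: for each matched pair, only the discordant cases (one article available, the other not) carry signal. A minimal exact-binomial sketch (one standard variant of the test; a chi-square approximation is also common):

```python
import math

def mcnemar_exact(b, c):
    """Exact (binomial) two-sided McNemar test.

    b: pairs where only the target article is available in the language.
    c: pairs where only the comparison article is available.
    Concordant pairs (both or neither available) do not affect the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0, discordant pairs split Binomial(n, 0.5).
    p = sum(math.comb(n, i) for i in range(k + 1)) * 2 / 2 ** n
    return min(1.0, p)
```

A small p-value means one group's articles are significantly more likely to be available in that language than their matched counterparts.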
For the top 10 most edited languages (English, French, Arabic, Russian, Japanese, Italian, Spanish, German, Portuguese, and Chinese), and for all target-comparison pairs where both members of the pair are available in a given language, we compare article lengths and numbers of sections in that language. We test whether the difference between the target and comparison groups is significant using a paired t-test with a Benjamini-Hochberg multiple-hypothesis correction.
Data placeholder
For the top 10 most edited languages, and for all target-comparison pairs where both members of the pair are available in a given language, we compare article lengths and numbers of sections in English. These results cover the same set of articles and statistics as the preceding section, but compare the English versions of the articles. Differences in statistics between this table and the previous one are likely to indicate bias. We test whether the difference between the target and comparison groups is significant using a paired t-test with a Benjamini-Hochberg multiple-hypothesis correction.
Data placeholder
We display a random sample of matched pairs generated by our algorithm.
Data placeholder