The table below shows metrics used to compare how effectively our matching method balances confounding variables in the corpus. On the left, we display metrics after matching, and on the right, we display metrics without matching. Each metric measures differences between the target and comparison corpora, so lower values indicate that the corpora are well-balanced. We refer to the paper for details on how each metric is computed. In this setting, only the Standardized Mean Difference (SMD) metrics are meaningful: for SMD, we exclude values that we expect to differ between the target and comparison corpora, but this exclusion is not possible for the other metrics, which we report only for completeness.
Results after tf-idf pivot-slope matching
Unmatched results
Data placeholder
Data placeholder
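As a sketch of the balance metric above, a common definition of the standardized mean difference divides the gap in covariate means by the pooled standard deviation (we assume this standard variant here; the paper's exact formulation may differ):

```python
import numpy as np

def standardized_mean_difference(target, comparison):
    """SMD for one covariate: (mean_t - mean_c) / pooled standard deviation.

    Values near 0 indicate the covariate is well-balanced across corpora.
    """
    target = np.asarray(target, dtype=float)
    comparison = np.asarray(comparison, dtype=float)
    pooled_sd = np.sqrt((target.var(ddof=1) + comparison.var(ddof=1)) / 2)
    return (target.mean() - comparison.mean()) / pooled_sd
```

In practice one would compute this per covariate (e.g., article length, page views) and compare the values with and without matching.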
The table below shows summary statistics for the target and comparison sets, with and without matching. p-values assess whether the difference between the two corpora is significant and are computed using a paired t-test.
Results after tf-idf pivot-slope matching
Unmatched results
Data placeholder
Data placeholder
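The paired t-test used here can be illustrated with SciPy's `ttest_rel`, which tests whether per-pair differences have zero mean. The values below are purely illustrative placeholders, not real corpus statistics:

```python
from scipy import stats

# Hypothetical per-pair values of one summary statistic (e.g., article
# length) for six matched target/comparison pairs; illustrative only.
target_vals =     [1200, 950, 1800, 700, 1430, 1100]
comparison_vals = [1100, 900, 1500, 720, 1300, 1050]

# Paired t-test: pairs must be aligned index-by-index.
t_stat, p_value = stats.ttest_rel(target_vals, comparison_vals)
```

A paired (rather than independent-samples) test is appropriate because each comparison article is matched to a specific target article.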
We display words that are overrepresented in target or comparison articles, computed using log-odds with a Dirichlet prior. A positive score indicates stronger association with the target set.
Results after tf-idf pivot-slope matching
Unmatched results
Data placeholder
Data placeholder
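A sketch of the log-odds-with-Dirichlet-prior score, following the standard Monroe et al. (2008) formulation (we assume this variant; the paper may differ in how the prior is chosen):

```python
import math

def log_odds_dirichlet(target_counts, comparison_counts, prior_counts):
    """Log-odds ratio with a Dirichlet prior, returned as z-scores.

    target_counts / comparison_counts: word -> count in each corpus.
    prior_counts: word -> pseudo-count (e.g., counts from a background corpus).
    Positive z-scores indicate association with the target corpus.
    """
    n_t = sum(target_counts.values())
    n_c = sum(comparison_counts.values())
    a0 = sum(prior_counts.values())
    scores = {}
    for w, a_w in prior_counts.items():
        y_t = target_counts.get(w, 0)
        y_c = comparison_counts.get(w, 0)
        # Difference of smoothed log-odds between the two corpora.
        delta = (math.log((y_t + a_w) / (n_t + a0 - y_t - a_w))
                 - math.log((y_c + a_w) / (n_c + a0 - y_c - a_w)))
        # Estimated variance of the log-odds difference.
        var = 1.0 / (y_t + a_w) + 1.0 / (y_c + a_w)
        scores[w] = delta / math.sqrt(var)
    return scores
```

The prior shrinks scores for rare words, so the top-ranked words are both frequent and strongly skewed toward one corpus.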
For all second-level sections that at least t target or comparison articles contain, we compute the average percentage of each article devoted to that section (number of tokens in the section / total number of tokens in the article, averaged across the target/comparison group). We test whether the difference in the percentage of each article devoted to each section is significant between the target and comparison groups using a paired t-test with a Benjamini-Hochberg correction. We set t as follows: all race/ethnicity (100), intersectional (50), women (500), transgender men/women (20), non-binary (50). These thresholds primarily exclude results that are not statistically significant, and we use them merely for convenience of output.
Data placeholder
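The Benjamini-Hochberg correction applied throughout these section-level tests can be sketched as follows (a minimal step-up procedure; library implementations such as statsmodels' `multipletests` behave equivalently):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR correction.

    Returns a list of booleans marking which hypotheses are rejected
    at false-discovery rate alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject all hypotheses ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected
```

Because each section (and later, each language) contributes one hypothesis, correcting for multiple comparisons keeps the expected proportion of false discoveries bounded.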
For all languages in which at least a threshold t number of target or comparison articles are available, we compute the percentage of target vs. comparison articles that are available (e.g., are target or comparison articles more likely to be available in German?). We compute significance using McNemar's test with a Benjamini-Hochberg multiple-hypothesis correction. We set t as described above.
Data placeholder
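McNemar's test is appropriate here because availability is a paired binary outcome: for each matched pair, only the discordant cases (one article available, the other not) carry signal. A minimal exact-binomial sketch (one standard variant of the test; a chi-square approximation is also common):

```python
import math

def mcnemar_exact(b, c):
    """Exact (binomial) two-sided McNemar test.

    b: pairs where only the target article is available in the language.
    c: pairs where only the comparison article is available.
    Concordant pairs (both or neither available) do not affect the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0, discordant pairs split Binomial(n, 0.5).
    p = sum(math.comb(n, i) for i in range(k + 1)) * 2 / 2 ** n
    return min(1.0, p)
```

A small p-value means one group's articles are significantly more likely to be available in that language than their matched counterparts.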
For the top 10 most edited languages (English, French, Arabic, Russian, Japanese, Italian, Spanish, German, Portuguese, and Chinese), and for all target-comparison pairs where both members of the pair are available in a given language, we compare article lengths and numbers of sections in that language. We test whether the difference between the target and comparison groups is significant using a paired t-test with a Benjamini-Hochberg multiple-hypothesis correction.
Data placeholder
For the top 10 most edited languages, and for all target-comparison pairs where both members of the pair are available in a given language, we compare article lengths and numbers of sections in English. These results cover the same set of articles and statistics as the preceding section, but compare the English versions of the articles. Differences in statistics between this table and the previous one are likely to indicate bias. We test whether the difference between the target and comparison groups is significant using a paired t-test with a Benjamini-Hochberg multiple-hypothesis correction.
Data placeholder
We display a random sample of matched pairs generated by our algorithm.
Data placeholder