Quantifying Bias in Hierarchical Category Systems
PMCID: PMC10898782
Abstract
Categorization is ubiquitous in human cognition and society, and shapes how we perceive and understand the world. Because categories reflect the needs and perspectives of their creators, no category system is entirely objective, and inbuilt biases can have harmful social consequences. Here we propose methods for measuring biases in hierarchical systems of categories, a common form of category organization with multiple levels of abstraction. We illustrate these methods by quantifying the extent to which library classification systems are biased in favour of western concepts and male authors. We analyze a large library data set including more than 3 million books organized into thousands of categories, and find that categories related to religion show greater western bias than do categories related to literature or history, and that books written by men are distributed more broadly across library classification systems than are books written by women. We also find that the Dewey Decimal Classification shows a greater level of bias than does the Library of Congress Classification. Although we focus on library classification as a case study, our methods are general, and can be used to measure biases in both natural and institutional category systems across a range of domains.
Full Text
Categories inevitably reflect the needs, perspectives, and experiences of the people who create them (Bowker & Star, 2000). Consider, for example, Steinberg’s famous depiction of the View of the World from 9th Avenue, which devotes half of the page to three New York City blocks but shows China, Russia, and Japan as tiny blobs on the horizon. A View of the World from Tiananmen Square would look rather different, and might include separate categories for Shānxī and Shǎnxī provinces while making no distinction between Washington State and Washington DC.
Although systems of categories are often subjective, the distinctions that they encode or fail to encode can have important consequences (Crawford, 2021). For example, Gould (1990) points out that the United States’ categorization of drugs as legal or illegal results in some addictive drugs being advertised on TV, and others carrying life sentences. The categorization of an animal or plant population as a distinct species as opposed to a variant of an existing species can affect conservation efforts and biodiversity research (Freeman & Pennell, 2021; Thomson et al., 2021). Finally, categories can also lead to harmful stereotypes, especially when coarse categories are used for members of out-groups in contrast to the finer-grained categories used for members of one’s in-groups (Park & Rothbart, 1982). Because category systems can encode and reinforce stereotypes, it is important to ensure that the biases encoded by these systems are acknowledged as such instead of treated as ground truth.
Social, developmental and cognitive psychologists have previously explored how categories (e.g., racial and gender categories) arise and how these categories influence behaviour (Brewer, 2007; Misch et al., 2022; Timeo et al., 2017; Waxman, 2021). Existing work highlights two key connections between categorization and bias. First, people tend to have positive attitudes towards in-group categories and negative attitudes towards out-group categories, and methods such as the Implicit Association Test (Greenwald et al., 1998; Schimmack, 2021) attempt to measure these attitudes. Second, categories can bias the way in which people perceive individual members of in-groups and out-groups. For example, there is a tendency to perceive members of one’s out-group as being more similar to one another than are members of one’s in-group (Judd & Park, 1988; Mackie & Worth, 1989; Park & Rothbart, 1982; Rubin & Badea, 2012). The link between categorization and biased perception has been explored more generally (Dubova & Goldstone, 2021; Goldstone et al., 2001), and the tendency to overestimate both within-category similarity and between-category distinctiveness is sometimes referred to as categorization bias (Ashby & Zeithamova, 2022) or categorical bias (Ester et al., 2020). Here we focus on a third connection between categorization and bias, and consider ways in which the structure of a category system (i.e., the extensions of the categories that it includes) can reflect bias. A canonical example is that a category system may include fine-grained categories in areas that are deemed valuable or important, and coarser categories in areas deemed less worthy of attention.
The term “bias” has been used to refer to distinct concepts in the literature. For example, “inductive bias” refers to constraints or expectations that guide learning (Griffiths et al., 2010; Markman, 1989), and is not directly relevant to our study. We define bias as preferential treatment for one group (e.g., western individuals) over another (e.g., non-western individuals). As we discuss later, this bias can either reflect external biases that have shaped the items to be categorized, or can be internal to a category system and imposed on the items by this system. To study this notion of bias we develop methods to measure how different groups are represented in a category system. Establishing that a system is biased requires us to demonstrate that the system departs from an unbiased alternative, and it is not always clear how an unbiased system should weigh different groups. We therefore begin with the simplifying assumption that an unbiased system should give roughly equal weight to each group, but return to this assumption later and discuss the extent to which it is appropriate for the specific oppositions that we consider (western vs non-western, and male vs female and non-binary).
Much of the previous work on bias in categorization has focused on biases in individual categories or in flat category systems. However, items can be categorized at multiple levels of abstraction (Mervis & Rosch, 1981) and natural categories are often organized into conceptual hierarchies. For example “flower” is a subcategory of “plant” which is a subcategory of “living things”. In addition, many formalized systems of categories such as biological taxonomies, medical ontologies, and library classification systems have hierarchical structures. Here we consider several kinds of biases that can occur in hierarchical category systems. Some of these biases have counterparts in flat systems: for instance, cat lovers could have more fine-grained category divisions for cat breeds than they do for dog breeds. Other biases, however, are distinctive to hierarchical systems. For example, Loehrlein (2012) demonstrated that people are biased towards concepts located near the top of a hierarchical system such that these concepts are perceived as being more important than those at the bottom.
Our methods are general and could potentially be used to measure bias in any hierarchical category system. Laboratory methods commonly used to elicit hierarchical category systems include successive pile sorting and hierarchical clustering based on judgments of similarity (Medin et al., 1997, 2006), and our approach could be applied to hierarchies generated by any such method. Another example is WordNet (Miller, 1994), a lexical database that organizes nouns and verbs into hierarchies. WordNet is both an influential theory of human lexical memory and a resource used to develop and test many other theoretical contributions in cognitive science, and understanding bias in WordNet is therefore important. WordNet aims for very broad coverage of the lexicon, and a separate research tradition aims to document folk taxonomies of specific semantic domains including plants, animals, artifacts, diseases, and soils (Holman, 2005). Our methods could be used to identify areas given more and less weight by these taxonomies—for example, we could measure the extent to which an animal taxonomy privileges domesticated animals ahead of wild animals, and compare the strength of this bias across cultures. Our approach may therefore be broadly useful as a tool for studying the way in which people’s category systems are influenced by cultural and individual biases, and for exploring the idea that category systems are not exclusively shaped by intrinsic properties of the things that they categorize, but are instead heavily influenced by human needs and values.
Although our approach has many possible applications, here we take library classification as a case study and apply our methods to the Library of Congress Classification (LCC) and the Dewey Decimal Classification (DDC). These library classification systems are large-scale, hierarchical examples of human categorization that are directly accessible and much more amenable to computational analysis than the category systems that all of us carry around in our heads. Focusing on library classification also allows us to connect our approach with a large body of existing work in the library and information sciences devoted to uncovering and mitigating bias in the LCC (Angell & Price, 2012; Howard & Knowlton, 2018; Intner & Futas, 1996; Kam, 2007; Rogers, 1993), the DDC (Higgins, 2016; Kua, 2008; Olson & Ward, 1997; Westenberg, 2022), or both (Mai, 2010; Zins & Santos, 2011). Category systems, especially more formal systems like library classifications, are often perceived as neutral or objective, making it all the more important to develop methods that enable us to quantify and thus acknowledge and address the biases that may be implicit in these systems. As such, studying formal systems like these is valuable in its own right and can also contribute to a better understanding of categorization in general (Glushko et al., 2008).
Previous work on bias in the LCC and DDC has documented that the language used to label categories, and the location of topics and books in the classification schemes can encode harmful bias. For example, in the LCC unglossed religious terms like “God,” and “devotional literature” refer to these concepts only in the context of Christianity (Knowlton, 2005). Similarly, subject headings such as “engineers” that have subheadings such as “women engineers” but not “male engineers” assume men as the default (Rogers, 1993). In general, the LCC and DDC systems have been found to be biased and unsystematic in their coverage of non-western religions and racial groups (Westenberg, 2022; Zins & Santos, 2011) and both systems are biased in their categorization of non-western languages and literatures (Higgins, 2016; Howard & Knowlton, 2018; Kua, 2008). In addition, both systems struggle to represent topics related to women and women’s studies, and these topics are often restricted to limited sets of categories that are scattered across the classification scheme (Intner & Futas, 1996; Olson & Ward, 1997). We thus apply our methods to two case studies of bias. The first case study measures western bias, or bias in favour of western culture, in the categories of the LCC and DDC. The second measures gender bias, and compares the representation of books written by women to books written by men in both systems.
Our work systematically quantifies the extent of bias within the two library classification systems that we consider. For institutional category systems such as these, quantifying bias is important because a quantitative measure can be used to identify the parts of a system that show the strongest bias and are therefore most important to consider when proposing future improvements to the system. Quantifying bias in category systems is also important because categories can influence perception and behaviour (Goldstone et al., 2001; Loehrlein, 2012) and it is therefore important to understand the extent to which a system might bias its users’ understanding of the items that it categorizes. A third benefit of a quantitative approach is that it allows for the comparison of bias across two or more related classification systems, and we illustrate by comparing the LCC and the DDC. Finally, our quantitative approach can be applied at a relatively large scale, and therefore allows us to analyze many more items and categories than a single researcher would be able to process on their own.
The LCC and DDC are both hierarchical systems that contain a set of main classes, each corresponding to a different discipline. These main classes are recursively subdivided into increasingly specific subcategories that classify smaller and smaller subsets of the literature. In the LCC there are 21 main classes and classification numbers are alphanumeric. There is no formal limit on the number of subcategories a category can have. Figure 1A illustrates this system with the classification of religious literature. The DDC has 10 main classes and classification numbers are entirely numeric. Each category can have a maximum of 10 children. Figure 1B illustrates the classification of religious literature in the DDC. The category hierarchy is represented by the position of the digit that differentiates a category. The DDC has stronger structural constraints than the LCC as it enforces a strict upper limit on the number of subcategories (Svenonius, 2000). As a result, the LCC tends to be flatter and the DDC deeper.
We used the OhioLINK Circulation Data, a large publicly available data set of books and their circulation, to represent the books in our analysis of bias in library classification systems. OhioLINK contains 6.78 million MARC bibliographic records for books and manuscripts in the Ohio academic libraries (OhioLINK Collection Building Task Force et al., 2011). These bibliographic records include the LCC and DDC classification assigned to a book. Only books that had both an LCC and a DDC classification were kept resulting in 3.32 million books. These books were placed into the DDC and LCC tree structures using their relevant classification numbers. For each book, we found the most specific category associated with its classification number, and then recursively added it to each parent category until the top of the tree was reached. This ensured that each parent category contained all the books of its subcategories. For each book we stored its author and circulation statistics. See Appendix B for more details.
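The propagation step described above can be sketched in a few lines. The `propagate_books` helper, the parent map, and the category numbers below are our own illustrative constructions, not part of the actual OhioLINK processing pipeline:

```python
# Sketch of propagating each book from its most specific category up to the
# root, so that every parent category contains all books of its subcategories.
# The parent map and category numbers here are illustrative only.
from collections import defaultdict

def propagate_books(book_to_category, parent):
    """Return a mapping from category to the set of books it contains,
    including books classified under any of its descendants."""
    books_in = defaultdict(set)
    for book, category in book_to_category.items():
        node = category
        while node is not None:          # walk up to the root
            books_in[node].add(book)
            node = parent.get(node)      # None once we pass the top level
    return books_in

# Toy hierarchy with three nested categories, 200 -> 290 -> 294
parent = {"290": "200", "294": "290", "200": None}
books = {"book_a": "294", "book_b": "290"}
result = propagate_books(books, parent)
```

Each parent category then contains the union of the books of all its subcategories, as required by the analysis.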
Assume that blue and red represent distinct but comparable labels that can be applied to the internal nodes of a hierarchical classification system. For example, in Figure 1 red categories are related to western topics and blue categories are related to non-western topics. Category bias occurs when the system gives preferential treatment to one group of nodes (e.g., red nodes) ahead of the other. We assume for now that an unbiased system treats red and blue categories identically.
Three kinds of category biases are illustrated in Figure 2: count bias, level bias, and descendant bias. Category count bias (Figure 2B) occurs when there are more red than blue categories in a classification scheme. Thus, more classification space is devoted to red categories. Category level bias (Figure 2C) occurs when red starting categories occur higher in a classification structure than blue starting categories. A starting category (or starting node) is the first category in a classification sub-tree that can be labelled as red or blue. Starting categories that are higher in the classification scheme are conceptualized as more general or important than those that are deeper. Finally, descendant bias (Figure 2D) occurs when red starting categories have more descendants than blue starting categories on average. In other words, red categories are privileged by having more fine-grained category divisions.
The three biases in Figure 2 may often be correlated in practice—for example, if there are more red categories (category count bias) it is likely that red starting categories will have more descendants (descendant bias). The biases, however, are conceptually distinct and can be separated in principle. For example, Figure 2D shows that even when node counts are held constant for red and blue it is possible to observe level bias (in favour of blue) and descendant bias (in favour of red). We therefore propose that considering the three biases individually is worthwhile as they highlight different aspects of category bias.
Instead of assigning the internal nodes of a hierarchical system to groups, assume now that purple and gold represent distinct but comparable labels that can be applied to a set of items. For example, gold items could be books written by men and purple items could be books written by women and non-binary people. Figure 3 shows several examples in which the items are shown as small circles at the leaves of a classification hierarchy. Item bias occurs when the system gives preferential treatment to one group of items (e.g., gold items) ahead of the other. As before, we assume that an unbiased system would give equal treatment to gold and purple items.
Three kinds of item biases are illustrated in Figure 3: count bias, level bias, and distributional bias. Item count bias (Figure 3A.ii) occurs when there are more gold than purple items classified by a system. Item level bias (Figure 3A.iii) is similar to category level bias, and occurs when gold items tend to be found higher in the classification tree than purple items. Finally, distributional bias (Figure 3B.iii) occurs when gold items are distributed more broadly across the classification system than are purple items. In other words, purple items are more restricted to a limited part of the classification scheme than are gold items.
Distributional bias can be diagnosed by comparing the shape of the distribution of gold items to the shape of the distribution of purple items. In Figure 3B.iii, the distribution of the gold items across the three categories at the lower level of the hierarchy is relatively flat, but the purple distribution is concentrated on the third of the three categories. In contrast, Figures 3B.i and 3B.ii show distributions of purple and gold authors that do not suffer from distributional bias. In Figure 3B.i the shape of the distribution of purple authors is identical to the shape of the distribution of gold authors. In Figure 3B.ii, although the distributions are not identical, they are shuffled versions of one another and therefore equally flat.
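One simple way to make the "shape" criterion concrete is to sort each normalized distribution before comparing, so that shuffled versions of the same distribution count as identical in shape. This is our own minimal illustration, not necessarily the exact measure used in the full analysis:

```python
# Minimal sketch of the "shape" comparison: normalize counts to
# proportions and sort, so order (which subcategory holds which mass)
# is ignored. Counts below are invented for illustration.
def shape(counts):
    total = sum(counts)
    return sorted(c / total for c in counts)

gold   = [4, 3, 3]   # relatively flat across three subcategories
purple = [1, 1, 8]   # concentrated on one subcategory

same_shape = shape(gold) == shape([3, 3, 4])   # shuffled version of gold
```

Under this comparison the shuffled distribution in Figure 3B.ii would match, while the concentrated purple distribution in Figure 3B.iii would not.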
Figure 4 shows what distributional bias can look like in library classification systems. In the LCC, the books by men are more evenly spread across the subcategories of “Handicrafts. Arts and crafts,” than are the books by women. The books by women are predominantly restricted to two subcategories, “Home arts. Homecrafts” (TT697-927) and “Clothing manufacture. Dressmaking. Tailoring” (TT490-695). In the DDC’s equivalent category, “Handicrafts,” the difference between the shape of the distribution of books by men and books by women does not appear to be as large.
An example of the distributions of male and female authors in the category “Handicrafts. Arts and Crafts” in the LCC (top) and “Handicrafts” in the DDC (bottom). Number of tagged items is the number of books that have an author with a known gender in the dataset. See Appendix A for the full set of subcategory names.
In the context of gender bias in library classification systems, item count bias is a clear example of external bias as it comes from unequal numbers of purple and gold authors in the set of items to be classified. This bias could arise because society provides more opportunities for men to write books than women, or because libraries are more likely to acquire books written by men (Quinn, 2012), or both. How to characterize distributional bias and level bias is less clear. For example, in the case of distributional bias in a library classification system, it might be that purple and gold authors write about equally diverse sets of topics, but that the interests of purple authors are given limited space in the classification system. Thus it could be that the distributional bias is an example of an internal bias. It could also be that the topics addressed by purple authors are genuinely less diverse than the topics addressed by gold authors because of social pressures that encourage purple authors to specialize in a limited set of areas. Thus the distributional bias could also be an example of an external bias. Similarly, level bias against purple items could be the result of the classification system placing topics of interest to purple authors lower in the tree (internal bias), or social pressures pushing purple authors to write in smaller, more niche categories (external bias). Although the origins of level bias and distributional bias may not be clear, both biases are worth investigating as they can provide insight into how different groups are represented in a classification scheme, regardless of whether this difference in representation is imposed by the system itself or the result of external forces.
We manually tagged the categories selected for each topic as western or non-western, drawing on distinctions that have been previously suggested in the literature. Still, the tagging process is inevitably subjective, and in cases where a label of western or non-western was unclear, we left the category untagged, aiming for precision over recall. This somewhat limits the results, as there might be cases where a country, language, or other entity falls into a category with a clear label in the LCC but not the DDC, or vice versa.
The classes related to history tended to be divided into categories based on geographical and political divisions such as country or continent. We therefore used a list of western countries that were defined based on a cultural definition of “western” as opposed to a political, economic, or geographical definition (de Espinosa, 2017; Hall, 2018; Trubetskoy, 2017). For example, Australia tends to be considered a western country despite not being geographically in the western hemisphere. 68 countries, about 35% of the world’s countries, were included in the list of western countries and we assumed that countries left off the list were non-western. For each history-focused main class, we considered all categories associated with a country and tagged them as western or non-western based on the list. The tagged category became a starting category. If a category represented a group of countries (i.e., a category for a continent or a region) and all the categories beneath it shared the same tag, then that broader category became the starting category and inherited the tag. Similarly, every category under a starting category inherited the starting category’s tag.
In the language and literature-related classes, some categories were related to regional divisions like the history-focused classes so we based our tagging on the list of western countries used previously. Examples of these categories include “German literature” and “Languages and literature of Eastern Asia, Africa, Oceania.” Some categories were related to language families so we considered where these languages or language groups originated from to make the tagging choice. “Romance languages” is one example. The main deviation from the tagging method used for history was how we tagged Indigenous languages and literature from North America, South America, and Oceania. Consistent with our cultural definition of the western concept (Hall, 2018), we tagged them as non-western even if they originated from a country or region that is listed as western.
Finally, for the main classes covering religion, we mostly tagged Abrahamic religions as western and other religions as non-western. The few exceptions included tagging Scientology as western and Islam as non-western. Islam is an Abrahamic religion, but we made the conservative decision to tag Islam as non-western, because the opposite decision would probably only increase any western bias that we might find. The categories Doctrinal Theology and Practical Theology were tagged as western because they have only been used to classify literature on Christianity (Zins & Santos, 2011).
In total there were 3009 categories on the topics of religion, language & literature, and history in the LCC, and 13,536 in the DDC. Based on the tagging method, 86.3% of categories could be tagged as either western or non-western in the LCC, and 91.4% in the DDC. We refer to tagged categories as “nodes” to be consistent with our use of a tree representation.
Figure 5 shows analogous results for each of the three individual topics. For each topic, there is a higher percentage of nodes tagged as western than non-western. In the LCC, religion has the highest percentage of western nodes. In the DDC, history and religion had percentages that were almost equally high. For all topics, the DDC had a higher percentage of western nodes than the LCC. To test the statistical significance of this result we randomly assigned all nodes a western or non-western label with equal probability. We repeated the process 10,000 times, using the proportion of the times the absolute difference between western and non-western node counts was greater than or equal to the observed absolute difference as the p value. For all topics in the DDC, and religion and history in the LCC, p < 0.001. For language & literature in the LCC, p = 0.003. All category count biases were therefore statistically significant.
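The permutation test described above can be sketched as follows; the `count_bias_pvalue` name and the node counts in the usage example are hypothetical, not the actual counts from the LCC or DDC:

```python
# Sketch of the permutation test for category count bias: labels are
# reassigned uniformly at random and the observed absolute count
# difference is compared against the resulting null distribution.
import random

def count_bias_pvalue(n_western, n_nonwestern, n_perm=10_000, seed=0):
    rng = random.Random(seed)
    total = n_western + n_nonwestern
    observed = abs(n_western - n_nonwestern)
    hits = 0
    for _ in range(n_perm):
        # assign each node a western label with probability 0.5
        w = sum(rng.random() < 0.5 for _ in range(total))
        if abs(w - (total - w)) >= observed:
            hits += 1
    return hits / n_perm

# A strongly skewed split (illustrative counts) yields a very small p value.
p = count_bias_pvalue(80, 20)
```

With a perfectly balanced split the test returns a p value of 1, since every random relabelling produces a difference at least as large as the observed difference of zero.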
We have conservatively assumed that an unbiased system has an equal number of western and non-western nodes, but this assumption could be adjusted using statistics such as population sizes or the percentage of western countries. If anything, these statistics tend to suggest that an unbiased system should devote more space to non-western than to western nodes. For example, Africa and Asia accounted for 75% of the world’s population in 2022 (United Nations, DESA, Population Division, 2022). Western category count bias is substantial relative to a conservative 50–50 baseline, and would be even stronger relative to a baseline favouring non-western nodes.
Library classification systems follow the principle of literary warrant, which means that their structures are derived from and justified by the body of literature that they classify (Svenonius, 2000). Based on this principle, it could be argued that there are more western categories because there are more western books that need to be classified. To test this idea, we calculated the mean rate of books per western node and non-western node in each system. These rates are reported as labels below the x axis of Figure 5. An unbiased system might be expected to have relatively equal rates of books per node. We found that language & literature in the DDC and history in the LCC have relatively equal rates of books per node for western and non-western nodes. Otherwise, there tend to be more books per non-western node than per western node. The difference in rates is most pronounced for religion (0.17% vs. 1.10%) and history (0.13% vs. 0.77%) in the DDC. These findings suggest that in some cases, especially in the DDC, the higher western category count cannot entirely be accounted for by literary warrant.
Similarly, it could be argued that there are more western nodes because western books are in higher demand than non-western books. To explore this idea we compared the circulation of books classified in western nodes to those in non-western nodes. Circulation statistics were drawn from the OhioLINK circulation data, and for each book we extracted three pieces of information: (i) whether the book was in circulation (i.e., available for borrowing) in 2007, (ii) whether the book was borrowed in 2007, and (iii) how often the book was borrowed in 2007. Mean values of all three variables are shown in Table 1. Across all topics and for both classification systems, a larger percentage of books classified under non-western nodes are in circulation than books classified under western nodes. For religion, a larger percentage of circulating non-western books were taken out than circulating western books in 2007 for both the LCC and DDC. In addition, among all religion books that were taken out, non-western books had a higher mean rate of circulation. The opposite was true for language & literature where a larger percentage of circulating western books were taken out and western books had a higher mean rate of circulation. For history, these statistics varied slightly but were relatively similar for western and non-western books. Overall, the circulation statistics do not seem to justify the large discrepancy between western and non-western node counts.
For each topic, Figure 6 shows the distributions of starting nodes over classification tree depths. To quantify the difference in distributions over western and non-western starting depths, we computed the Jensen-Shannon divergence (JSD) between these distributions. To test the statistical significance of the results we performed permutation tests. For each topic, the depth labels were shuffled among the western and non-western nodes to create randomized depth distributions. This shuffling was carried out 10,000 times and the proportion of times the JSD between the randomized western and non-western depth distributions was greater than or equal to the actual JSD was used as the p value.
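The divergence measure and its permutation test can be sketched as below. The function names are our own, and the depth lists in the test are invented; a library implementation such as SciPy's could be substituted for the hand-rolled JSD:

```python
# Minimal Jensen-Shannon divergence (base 2) between two lists of node
# depths, plus a label-shuffling permutation test. Depth lists passed in
# are illustrative, not the actual LCC or DDC starting-node depths.
import math
import random
from collections import Counter

def jsd(p_depths, q_depths):
    """JSD between the empirical distributions of two depth lists."""
    support = sorted(set(p_depths) | set(q_depths))
    p, q = Counter(p_depths), Counter(q_depths)
    P = [p[d] / len(p_depths) for d in support]
    Q = [q[d] / len(q_depths) for d in support]
    M = [(a + b) / 2 for a, b in zip(P, Q)]
    def kl(X, Y):
        return sum(x * math.log2(x / y) for x, y in zip(X, Y) if x > 0)
    return (kl(P, M) + kl(Q, M)) / 2

def jsd_permutation_p(p_depths, q_depths, n_perm=10_000, seed=0):
    """Shuffle depth labels between the two groups and report how often
    the randomized JSD is at least as large as the observed JSD."""
    rng = random.Random(seed)
    observed = jsd(p_depths, q_depths)
    pooled = list(p_depths) + list(q_depths)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if jsd(pooled[:len(p_depths)], pooled[len(p_depths):]) >= observed:
            hits += 1
    return hits / n_perm
```

Identical depth distributions give a JSD of 0, and fully disjoint ones give the maximum value of 1 in base 2.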
The depths for the LCC and the DDC are not directly comparable, because the DDC tends to have a larger set of starting node depths than the LCC (e.g., the starting nodes for religion are spread across 5 different tree depths in the DDC versus just 2 in the LCC). We therefore computed an alternative measure of level bias: the probability that a randomly selected non-western starting node is deeper in a classification than a randomly selected western starting node. A western and a non-western starting node were randomly sampled 10,000 times. The depths of the two nodes were compared to determine the number of times the non-western one was deeper than the western one and vice versa (ties were ignored). The resulting statistic measured the probability that a non-western starting node would be deeper in a tree than a western starting node, given that the two were not at the same depth. The results are shown in Table 2. For every topic except history in the DDC, it is more likely that a randomly selected non-western starting node is deeper in the tree than a randomly selected western node. Western nodes for history in the DDC have a higher chance of starting deeper in the tree. Based on this statistic, the LCC displays a stronger level bias than does the DDC. To test for significance we performed a permutation test by randomly shuffling the depths among the western and non-western nodes and recalculating the probability that a non-western node was deeper in the tree. This was repeated 10,000 times, and the proportion of times the absolute value of the difference between 50% and the recalculated probability was greater than the actual difference was used as the p value for each topic. The results are in Table 2. The significance by topic mirrored the significance of the initial divergence statistic for level bias.
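The depth-comparison statistic can be sketched as follows; `prob_deeper` is our own name, and the depth lists in the usage example are invented rather than drawn from the actual classification trees:

```python
# Sketch of the sampling-based level-bias statistic: the probability that
# a randomly sampled non-western starting node sits deeper in the tree
# than a randomly sampled western one, with ties dropped.
import random

def prob_deeper(nonwestern_depths, western_depths, n_samples=10_000, seed=0):
    rng = random.Random(seed)
    deeper = shallower = 0
    for _ in range(n_samples):
        nw = rng.choice(nonwestern_depths)
        w = rng.choice(western_depths)
        if nw > w:
            deeper += 1
        elif nw < w:
            shallower += 1
        # equal depths are ignored, as in the analysis
    return deeper / (deeper + shallower)

# Illustrative depth lists: non-western starting nodes slightly deeper.
p = prob_deeper([3, 4, 4, 5], [2, 3, 5, 5])
```

A value above 0.5 indicates that non-western starting nodes tend to sit deeper in the tree, that is, level bias in favour of western nodes.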
We measure descendant bias by comparing the mean number of descendants per western starting node to the mean number of descendants per non-western starting node. We also recorded the number of starting nodes and the mean percentage of books per starting node. All statistics were computed for the LCC and DDC overall, as well as separately for religion, language & literature, and history. The results are shown in Table 3. To test for significance, the western and non-western tags were randomly shuffled among the starting nodes and the absolute difference between the western and non-western descendant means was recomputed. This process was repeated 10,000 times, and the proportion of times the recomputed difference was greater than or equal to the observed absolute difference in means was used as the p value.
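The tag-shuffling test for descendant bias can be sketched as follows; the function name and the descendant counts in the usage example are hypothetical:

```python
# Permutation test for descendant bias: colour tags are shuffled among
# the starting nodes and the absolute difference in mean descendant
# counts is recomputed. Counts below are invented for illustration.
import random

def descendant_bias_pvalue(red_desc, blue_desc, n_perm=10_000, seed=0):
    rng = random.Random(seed)
    def mean(xs):
        return sum(xs) / len(xs)
    observed = abs(mean(red_desc) - mean(blue_desc))
    pooled = list(red_desc) + list(blue_desc)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_red = pooled[:len(red_desc)]
        perm_blue = pooled[len(red_desc):]
        if abs(mean(perm_red) - mean(perm_blue)) >= observed:
            hits += 1
    return hits / n_perm

# Red starting nodes with clearly more descendants give a small p value.
p = descendant_bias_pvalue([10, 12, 11, 9], [2, 1, 3, 2], n_perm=2000)
```

The same shuffling scheme underlies all of the permutation tests in this section; only the statistic being recomputed changes.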
Our finding that both the LCC and DDC show western category bias is expected given previous work on biases in library classification (Knowlton, 2005; Zins & Santos, 2011), but our approach departs from previous work in attempting to systematically quantify the nature and extent of this bias. For example, religion is known to be a topic that shows substantial western bias (Fox, 2019), but to our knowledge previous studies have not systematically quantified the level of bias observed for religion relative to the bias observed for other topics. Similarly, there have been suggestions that the DDC shows greater western bias than does the LCC (Sultanik, 2022), but prior work has not provided comprehensive quantitative analyses to support this claim.
To study item-level gender bias we worked with all books in our dataset that were classified under both the LCC and the DDC. We only considered books with non-empty author fields in their MARC records. Each of these books was tagged with the gender of its author. To determine an author’s gender, we used data from the author-name-index and author-gender tables created and kindly shared by Ekstrand and Kluver (2021) as part of their book data integration pipeline, PIReT Book Data Tools. These tables store processed versions of author name and gender data from the Virtual International Authority File (VIAF). The VIAF stores author information, including the variants of an author’s name and their gender. As discussed by Ekstrand and Kluver (2021), the VIAF unfortunately codes gender as binary and does not code for non-binary gender identities. Each author record is coded as either male, female, or unknown. For now, our analysis is thus limited to analyzing item-level biases between male and female authors. When more accurate author data are available, the same analyses can be performed including non-binary gender identities.
There is no linking identifier between a MARC record and its author’s VIAF record. We followed the method for linking records used in the PIReT Book Data Tools (Ekstrand & Kluver, 2021). At a high level, string matching was used to tag a book with its author’s gender, and there are three main cases in which an author’s gender cannot be determined. The first is when a book’s author matches a VIAF record with the gender code “unknown.” The second is when a book’s author matches multiple VIAF records with conflicting known gender identities and is thus assigned an ambiguous gender code. The third is when a book’s author does not match any author record in the VIAF dataset. We discarded books in any of these three cases from our item-level analysis.
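The three discard cases can be sketched as a small lookup function. The `viaf_index` structure is hypothetical, standing in for the processed PIReT lookup tables: it maps an author-name string to the gender codes of every VIAF record whose name variants matched that string.

```python
def tag_author_gender(book_author, viaf_index):
    """Return 'male' or 'female' for a book's author via string matching
    against VIAF records, or None in the three cases the analysis discards."""
    genders = viaf_index.get(book_author)
    if not genders:
        return None  # case 3: no matching VIAF record at all
    known = {g for g in genders if g in ("male", "female")}
    if not known:
        return None  # case 1: matched record(s) with gender code "unknown"
    if len(known) > 1:
        return None  # case 2: conflicting genders -> ambiguous code
    return known.pop()
```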
2.55 million of the 3.32 million MARC records had a non-empty author field and 1.95 million of these could be tagged with an author’s gender. Table 4 contains a breakdown of the record-matching process. Less than 1% of the records were tagged as ambiguous. Only 5% of the MARC records could not be linked to any VIAF record. These results are similar to results achieved by the PIReT Book Data Tools. In all the datasets they were applied to, somewhere between 3.3% and 6.9% of book records could not be matched to a VIAF record (Ekstrand & Kluver, 2021).
We also compared the circulation of books by men to the circulation of books by women. The results are shown in Table 5. A higher percentage of books by women than books by men circulated: in 2007, 37% of books by women were borrowed versus 29% of books by men. Similarly, borrowed books by women were taken out more times on average (3.30) than borrowed books by men (2.91). These findings reveal that demand alone cannot explain the under-representation of female authors and suggest that there are other forces that systematically reduce their representation.
We plotted the distributions of books over the classification tree depths in Figure 7 and computed the difference in the mean depths of books by men and women. In the LCC the difference is 0.001 and in the DDC it is 0.191. To test the significance of the results we performed a permutation test. We shuffled the classification depths among books and recomputed the difference between the mean depth of books by men and the mean depth of books by women. This was repeated 1000 times and the proportion of times the random difference in means was greater than or equal to the actual difference in means was used as the p value. In the LCC p = 0.61 and in the DDC p < 0.001. For the books in the Ohio academic libraries, LCC classifications did not yield a significant difference between the mean depth of classification for books written by men and the mean depth of books written by women. For the DDC the difference is statistically significant, but small.
Distributional bias is evident when the distribution of male authors across children of a given node tends to be flatter than the distribution of female authors. To test for this bias we collected every node that had at least 100 Ohio library books, at least 2 children, and both male and female authors. This approach yielded 822 LCC nodes and 2832 DDC nodes that could be used to compare the distributions of books by male and female authors. See Appendix C for a breakdown of the nodes that could not be used in the distribution bias analysis. To compute the distribution of male authors for each node, the number of male authors in each direct child node was divided by the total number of male authors across all child nodes. We did not use the total number of male authors in the parent node because not all items in a parent node are classified into one of the children. The same was done for the female author distribution. For example, in Figure 3B.iii there are 10 purple books and 1 is assigned to the first child node, 2 to the second, and 7 to the third. Thus the distribution of purple authors for that node is [0.1, 0.2, 0.7]. For each node, we compared the Shannon entropies of the male and female distributions to determine which of the two distributions was flatter.
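The per-node flatness comparison can be sketched as follows, using the example from the text: a node whose children receive 1, 2, and 7 of a group's books yields the distribution [0.1, 0.2, 0.7]. Higher Shannon entropy corresponds to a flatter distribution.

```python
from math import log2

def child_distribution(counts):
    """Normalize per-child author counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def shannon_entropy(dist):
    """Shannon entropy in bits; higher entropy means a flatter distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)

def flatter_distribution(male_counts, female_counts):
    """Return 'male' or 'female' according to whose distribution across a
    node's children is flatter, or 'tie' if the entropies are equal."""
    h_m = shannon_entropy(child_distribution(male_counts))
    h_f = shannon_entropy(child_distribution(female_counts))
    if h_m > h_f:
        return "male"
    if h_f > h_m:
        return "female"
    return "tie"
```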
Figure 8 shows the number of nodes for which the male distribution was flatter than the female distribution. In both the LCC and the DDC the distribution of male authors among a node’s children tended to be flatter than the distribution of female authors. The effect is stronger in the DDC, where the relative difference between the two counts is 2.34, as opposed to 1.80 in the LCC. To test the significance of these results, a permutation test was performed. The entire set of author gender labels was shuffled among the classified items. For each node, the author gender distributions were recalculated, and the same flatness comparison was applied. In both the DDC and the LCC the results were significant with p < 0.001.
In the first case study, we found that language & literature in the DDC had significant category count and level bias, and that religion and history had significant count bias in both systems. These biases were in favour of western nodes and confirm previous findings that the DDC is biased in its categorization of non-western language and literature (Kua, 2008), and that non-western religions and topics are under-represented in both the DDC and LCC (Westenberg, 2022; Zins & Santos, 2011). We also found that a category system that has count bias does not necessarily have level bias or descendant bias, and vice versa, suggesting that the three proposed biases quantify different aspects of category bias and together provide a relatively nuanced picture of how it manifests. Finally, we found that the DDC tends to show a higher degree of western category bias than does the LCC. Specifically, there was evidence of strong category count and descendant bias in the DDC, whereas there was no evidence of descendant bias in the LCC.
In the second case study, we found that women are underrepresented in the set of books we considered and that there is a strong distributional bias in favour of men in both the LCC and DDC. Previous studies have documented that topics relating to women in the LCC and DDC tended to be restricted to specific categories (Olson & Ward, 1997; Rogers, 1993), and our analyses support a similar conclusion by suggesting that books by women tend to be restricted to relatively limited sections of the LCC and DDC. Despite strong evidence of item count bias and distributional bias, we do not find much evidence of item level bias in favour of men or women in either system. Like the three category biases we define, the three item biases provide a detailed picture of the different ways in which bias can appear in a system or in the set of items it classifies.
Thirdly, the observed biases can be partially attributed to biased decisions made by the individuals who created these systems. Western bias has been widespread in western culture over the past century, and has inevitably shaped the thinking of those who build and maintain library classification systems. The underlying psychological mechanisms that bias the decisions of librarians are likely to include mechanisms that drive biased categorization in general. One example is the out-group homogeneity effect, or the tendency to perceive out-group members as less diverse than in-group members. The descendant bias in the DDC seems to mirror this effect because finer-grained categories are used for western (in-group) than for non-western (out-group) topics. Like the out-group homogeneity effect, descendant bias in the DDC could potentially be attributed to greater familiarity with and exposure to western literature relative to non-western literature, or to increased attention to and better memory for topics and features of literature that are relevant to the in-group (Das-Smaal, 1990; Park & Rothbart, 1982). Similarly, category count bias could occur because features of in-group literature are easier to perceive and recall, making it easier to differentiate this literature and create more categories for it.
In our second case study, our results for gender demonstrate that the three item biases in Figure 3 are sensitive to the preferential treatment of different groups of items. As mentioned earlier, however, these biases may be the result of external social pressures affecting the items classified by a system or may be imposed on the items by the classification system itself. The item count bias we find is clearly external to both classification systems, but it is unclear to what extent the distributional bias found is internal or external to either the LCC or DDC. Comparing the two systems provides some evidence that the item biases found in the DDC have an internal component. We found that the DDC has a stronger distributional gender bias than does the LCC and had a very slight item level bias where the LCC had none. These differences occurred even though the set of books considered was held constant across the two systems. Our results therefore suggest that some proportion of item bias is internal to the DDC, but do not allow us to tell whether the LCC is also subject to internal item biases.
Although our case studies focused on library classification, our methods are general and can potentially be applied to a broad range of hierarchical category systems. To illustrate, we apply our methods to WordNet (Miller, 1994). Both Western bias and gender bias are potentially relevant. For example, previous studies have documented western biases in ImageNet (Liu et al., 2021; Luccioni & Rolnick, 2023), and these biases are likely inherited from WordNet, the source of the ImageNet hierarchy. However, to illustrate the range of our methods we consider a third kind of bias. Using a procedure described in Appendix D, we identified synsets in WordNet that correspond to species of mammals, tagged these species as wild or domestic, then used our methods to measure the extent to which WordNet prioritizes domestic species ahead of wild species. Although domestic mammals account for less than 1% of all mammal species (Mammal Diversity Database, 2023), Table 6 shows that English WordNet 3.0 displays a clear bias for domestic over wild mammals. Despite a larger number of starting nodes for wild than for domestic species, count bias is present because there are more categories (i.e., WordNet synsets) overall for domestic than wild mammals. Descendant bias is also present, because domestic categories tend to have more subcategories (i.e., hyponyms) than do wild categories, leading to a more fine-grained representation.
Category bias analysis for domestic versus wild mammal species in English WordNet 3.0. The 3 measures of category bias reported are the total number of synsets (Count Bias), the mean depth of starting nodes (Level Bias), and the mean number of descendants per starting node (Descendant Bias). Starting nodes are WordNet synsets corresponding to different mammal species. See Appendix D for full details of this analysis.
WordNet lies somewhere between an institutional category system and a natural category system, but our approach can also be used to quantify cultural and individual differences in natural category systems. Names of plants (Berlin, 1992), animals, artifacts (Rosch et al., 1976), body parts (Majid, 2010), and places (Basso, 1984; Burenhult & Levinson, 2008) are all organized into hierarchies or partonymies, and our methods could be applied to each of these cases. For example, consistent with our WordNet analysis, plant and animal names could be labelled as wild or domesticated, and future studies could measure the extent to which a folk taxonomy is biased towards domesticated ahead of wild species. The degree of bias is likely to vary across cultures in line with existing findings that agricultural societies tend to have more names for plants than do hunter-gatherer societies (Balée, 1999; Berlin, 1992; Brown, 1985). Within cultures, the degree of bias is likely to correlate with factors such as expertise (Tanaka & Taylor, 1991). For example, Aguaruna Jivaro women have much more fine-grained categories for manioc (a tropical root crop native to South America) than do men, and this difference aligns with the division of labour between men and women in Aguaruna Jivaro culture (Boster, 1985).
We defined bias as a preference for one group over another, and focused on two cases (gender and western bias) where these preferences can cause harm, especially when systems that incorporate these preferences are perceived as objective. In the case of folk taxonomy, however, a preference for domestic over wild species may be beneficial in supporting communication about the species of most interest to a given culture. Preferences are not necessarily harmful, and can instead illuminate the different needs, values, or roles of the people and cultures who create and use category systems. Our approach therefore joins a set of existing quantitative techniques that can provide insight into conceptual variation both across and within cultures (Romney et al., 1986, 2000).
A key limitation that applies to both our case studies is that we focused on two US-based, western classification systems. Future work could aim to apply our methods to a more diverse set of library classification systems, including non-western systems such as Russian and Chinese library classification systems (Zhang, 2003), and systems like the Universal Decimal Classification, which was designed to be more comprehensive than the Dewey Decimal Classification. It is also important to note that in both studies our book-level statistics are based on data from the Ohio academic libraries. The analyses of bias in the LCC and DDC based on these statistics are thus limited to how biased the LCC and DDC are with respect to this specific group of western libraries. Despite their limitations, however, our analyses seem sufficient to demonstrate that our methods are capable of capturing biases in hierarchical category systems.
Although we focused on hierarchical category systems, future work could apply some of our methods to measure bias in flat category systems. One previous study in this area focused on gerrymandering, and developed methods for quantifying bias in United States’ congressional districts (McCartan & Imai, 2023). Some of our methods for detecting category and item biases in hierarchical category systems can be directly applied to flat systems. For example, item count and category count bias can be applied without modification. Distributional bias could also be applied by considering the distribution of different groups across an entire flat system instead of considering differences in the distribution across the subcategories of each internal node.
Our work documents and quantifies biases in hierarchical classification systems, and future work could study the cognitive mechanisms that give rise to these biases. Perception, attention, and memory can all help to explain how biased collections of library books are created (Quinn, 2012), and the same three mechanisms are likely to contribute to biases in hierarchical category systems. For example, differences between in-group members are often perceived as larger than differences between out-group members, and therefore more worthy of being recognized in a classification system (Park & Rothbart, 1982). These perceptual differences may arise as a consequence of selective attention to features that are more relevant to in-group members than to out-group members (Das-Smaal, 1990). Familiarity and exposure can also lead to bias, because frequently encountered items (i.e., in-group members) are more likely to come to mind than items encountered rarely (out-group members). Laboratory experiments have previously considered all of these factors, but more work can be done to explore how these factors produce biases in hierarchical systems of categories.
Finally, our methods could be used to explore how biases in category systems change and develop with time. Category systems are rarely formed all at once and instead develop over time in response to a sequence of items. The sequence in which items are encountered can affect the categories that are created (Medin & Bettger, 1994), and future work can examine how bias is compounded or reduced as items are encountered over time. Knowlton (2005) studies historical change in the LCC by manually documenting all the ways in which the subject headings have and have not changed three decades after Berman (1971) proposed modifications to reduce bias. With access to historical versions of the LCC, DDC, or other category systems, our methods could allow us to explicitly quantify how these systems differ on measures of category and item bias with time. We expect that institutional category systems should become increasingly unbiased with time, but it is possible that some structural biases may compound and increase instead.
Author names are stored in the Main Entry-Personal Name field of a book’s MARC record. This field records the person mainly responsible for the work (Library of Congress, 2022), whether they are the primary author in a multi-authored work or the editor of an anthology. For simplicity, we use the term “author” in all cases.
Sections
"[{\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib8\", \"fn2\"], \"section\": \"INTRODUCTION\", \"text\": \"Categories inevitably reflect the needs, perspectives, and experiences of the people who create them (Bowker & Star, 2000). Consider, for example, Steinberg\\u2019s famous depiction of the View of the World from 9th Avenue, which devotes half of the page to three New York City blocks but shows China, Russia, and Japan as tiny blobs on the horizon. A View of the World from Tiananmen Square would look rather different, and might include separate categories for Sh\\u0101nx\\u012b and Sh\\u01cenx\\u012b provinces while making no distinction between Washington State and Washington DC.\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib13\", \"bib23\", \"bib20\", \"bib65\", \"bib54\"], \"section\": \"INTRODUCTION\", \"text\": \"Although systems of categories are often subjective, the distinctions that they encode or fail to encode can have important consequences (Crawford, 2021). For example, Gould (1990) points out that the United States\\u2019 categorization of drugs as legal or illegal results in some addictive drugs being advertised on TV, and others carrying life sentences. The categorization of an animal or plant population as a distinct species as opposed to a variant of an existing species can affect conservation efforts and biodiversity research (Freeman & Pennell, 2021; Thomson et al., 2021). Finally, categories can also lead to harmful stereotypes, especially when coarse categories are used for members of out-groups in contrast to the finer-grained categories used for members of one\\u2019s in-groups (Park & Rothbart, 1982). 
Because category systems can encode and reinforce stereotypes, it is important to ensure that the biases encoded by these systems are acknowledged as such instead of treated as ground truth.\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib9\", \"bib50\", \"bib66\", \"bib69\", \"bib24\", \"bib61\", \"bib31\", \"bib39\", \"bib54\", \"bib60\", \"bib16\", \"bib22\", \"bib2\", \"bib18\"], \"section\": \"INTRODUCTION\", \"text\": \"Social, developmental and cognitive psychologists have previously explored how categories (e.g., racial and gender categories) arise and how these categories influence behaviour (Brewer, 2007; Misch et al., 2022; Timeo et al., 2017; Waxman, 2021). Existing work highlights two key connections between categorization and bias. First, people tend to have positive attitudes towards in-group categories and negative attitudes towards out-group categories, and methods such as the Implicit Association Test (Greenwald et al., 1998; Schimmack, 2021) attempt to measure these attitudes. Second, categories can bias the way in which people perceive individual members of in-groups and out-groups. For example, there is a tendency to perceive members of one\\u2019s out-group as being more similar to one another than are members of one\\u2019s ingroup (Judd & Park, 1988; Mackie & Worth, 1989; Park & Rothbart, 1982; Rubin & Badea, 2012). The link between categorization and biased perception has been explored more generally (Dubova & Goldstone, 2021; Goldstone et al., 2001), and the tendency to overestimate both within-category similarity and between-category distinctiveness is sometimes referred to as categorization bias (Ashby & Zeithamova, 2022) or categorical bias (Ester et al., 2020). Here we focus on a third connection between categorization and bias, and consider ways in which the structure of a category system (i.e., the extensions of the categories that it includes) can reflect bias. 
A canonical example is that a category system may include fine-grained categories in areas that are deemed valuable or important, and coarser categories in areas deemed less worthy of attention.\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib25\", \"bib43\"], \"section\": \"INTRODUCTION\", \"text\": \"The term \\u201cbias\\u201d has been used to refer to distinct concepts in the literature. For example, \\u201cinductive bias\\u201d refers to constraints or expectations that guide learning (Griffiths et al., 2010; Markman, 1989), and is not directly relevant to our study. We define bias as preferential treatment for one group (e.g., western individuals) over another (e.g., non-western individuals). As we discuss later, this bias can either reflect external biases that have shaped the items to be categorized, or can be internal to a category system and imposed on the items by this system. To study this notion of bias we develop methods to measure how different groups are represented in a category system. Establishing that a system is biased requires us to demonstrate that the system departs from an unbiased alternative, and it is not always clear how an unbiased system should weigh different groups. We therefore begin with the simplifying assumption that an unbiased system should give roughly equal weight to each group, but return to this assumption later and discuss the extent to which it is appropriate for the specific oppositions that we consider (western vs non-western, and male vs female and non-binary).\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib48\", \"bib37\"], \"section\": \"INTRODUCTION\", \"text\": \"Much of the previous work on bias in categorization has focused on biases in individual categories or in flat category systems. However, items can be categorized at multiple levels of abstraction (Mervis & Rosch, 1981) and natural categories are often organized into conceptual hierarchies. 
For example \\u201cflower\\u201d is a subcategory of \\u201cplant\\u201d which is a subcategory of \\u201cliving things\\u201d. In addition, many formalized systems of categories such as biological taxonomies, medical ontologies, and library classification systems have hierarchical structures. Here we consider several kinds of biases that can occur in hierarchical category systems. Some of these biases have counterparts in flat systems: for instance, cat lovers could have more fine-grained category divisions for cat breeds than they do for dog breeds. Other biases, however, are distinctive to hierarchical systems. For example, Loehrlein (2012) demonstrated that people are biased towards concepts located near the top of a hierarchical system such that these concepts are perceived as being more important than those at the bottom.\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib46\", \"bib47\", \"bib49\", \"bib28\"], \"section\": \"INTRODUCTION\", \"text\": \"Our methods are general and could potentially be used to measure bias in any hierarchical category system. Laboratory methods commonly used to elicit hierarchical category systems include successive pile sorting and hierarchical clustering based on judgments of similarity (Medin et al., 1997, 2006), and our approach could be applied to hierarchies generated by any such method. Another example is WordNet (Miller, 1994), a lexical database that organizes nouns and verbs into hierarchies. WordNet is both an influential theory of human lexical memory and a resource used to develop and test many other theoretical contributions in cognitive science, and understanding bias in WordNet is therefore important. WordNet aims for very broad coverage of the lexicon, and a separate research tradition aims to document folk taxonomies of specific semantic domains including plants, animals, artifacts, diseases, and soils (Holman, 2005). 
Our methods could be used to identify areas given more and less weight by these taxonomies\\u2014for example, we could measure the extent to which an animal taxonomy privileges domesticated animals ahead of wild animals, and compare the strength of this bias across cultures. Our approach may therefore be broadly useful as a tool for studying the way in which people\\u2019s category systems are influenced by cultural and individual biases, and for exploring the idea that category systems are not exclusively shaped by intrinsic properties of the things that they categorize, but are instead heavily influenced by human needs and values.\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib1\", \"bib29\", \"bib30\", \"bib32\", \"bib56\", \"bib27\", \"bib34\", \"bib53\", \"bib70\", \"bib40\", \"bib72\", \"bib21\"], \"section\": \"INTRODUCTION\", \"text\": \"Although our approach has many possible applications, here we take library classification as a case study and apply our methods to the Library of Congress Classification (LCC) and the Dewey Decimal Classification (DDC). These library classification systems are large-scale, hierarchical examples of human categorization that are directly accessible and much more amenable to computational analysis than the category systems that all of us carry around in our heads. Focusing on library classification also allows us to connect our approach with a large body of existing work in the library and information sciences devoted to uncovering and mitigating bias in the LCC (Angell & Price, 2012; Howard & Knowlton, 2018; Intner & Futas, 1996; Kam, 2007; Rogers, 1993), the DDC (Higgins, 2016; Kua, 2008; Olson & Ward, 1997; Westenberg, 2022), or both (Mai, 2010; Zins & Santos, 2011). 
Category systems, especially more formal systems like library classifications, are often perceived as neutral or objective, making it all the more important to develop methods that enable us to quantify and thus acknowledge and address the biases that may be implicit in these systems. As such, studying formal systems like these is valuable in its own right and can also contribute to a better understanding of categorization in general (Glushko et al., 2008).\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib33\", \"bib56\", \"bib70\", \"bib72\", \"bib27\", \"bib29\", \"bib34\", \"bib30\", \"bib53\"], \"section\": \"INTRODUCTION\", \"text\": \"Previous work on bias in the LCC and DDC has documented that the language used to label categories, and the location of topics and books in the classification schemes can encode harmful bias. For example, in the LCC unglossed religious terms like \\u201cGod,\\u201d and \\u201cdevotional literature\\u201d refer to these concepts only in the context of Christianity (Knowlton, 2005). Similarly, subject headings such as \\u201cengineers\\u201d that have subheadings such as \\u201cwomen engineers\\u201d but not \\u201cmale engineers\\u201d assume men as the default (Rogers, 1993). In general, the LCC and DDC systems have been found to be biased and unsystematic in their coverage of non-western religions and racial groups (Westenberg, 2022; Zins & Santos, 2011) and both systems are biased in their categorization of non-western languages and literatures (Higgins, 2016; Howard & Knowlton, 2018; Kua, 2008). In addition, both systems struggle to represent topics related to women and women\\u2019s studies, and these topics are often restricted to limited sets of categories that are scattered across the classification scheme (Intner & Futas, 1996; Olson & Ward, 1997). We thus apply our methods to two case studies of bias. 
The first case study measures western bias, or bias in favour of western culture, in the categories of the LCC and DDC. The second measures gender bias, and compares the representation of books written by women to books written by men in both systems.\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"bib22\", \"bib37\"], \"section\": \"INTRODUCTION\", \"text\": \"Our work systematically quantifies the extent of bias within the two library classification systems that we consider. For institutional category systems such as these, quantifying bias is important because a quantitative measure can be used to identify the parts of a system that show the strongest bias and are therefore most important to consider when proposing future improvements to the system. Quantifying bias in category systems is also important because categories can influence perception and behaviour (Goldstone et al., 2001; Loehrlein, 2012) and it is therefore important to understand the extent to which a system might bias its users\\u2019 understanding of the items that it categorizes. A third benefit of a quantitative approach is that it allows for the comparison of bias across two or more related classification systems, and we illustrate by comparing the LCC and the DDC. Finally, our quantitative approach can be applied at a relatively large scale, and therefore allows us to analyze many more items and categories than a single researcher would be able to process on their own.\"}, {\"pmc\": \"PMC10898782\", \"pmid\": \"\", \"reference_ids\": [\"F1\", \"F1\", \"bib63\"], \"section\": \"LIBRARY CLASSIFICATION SYSTEMS\", \"text\": \"The LCC and DDC are both hierarchical systems that contain a set of main classes, each corresponding to a different discipline. These main classes are recursively subdivided into increasingly more specific subcategories that classify smaller and smaller subsets of the literature. 
In the LCC there are 21 main classes and classification numbers are alphanumeric. There is no formal limit on the number of subcategories a category can have. Figure 1A illustrates this system with the classification of religious literature. The DDC has 10 main classes and classification numbers are entirely numeric: the category hierarchy is represented by the position of the digit that differentiates a category, so each category can have a maximum of 10 children. Figure 1B illustrates the classification of religious literature in the DDC. The DDC has stronger structural constraints than the LCC because it enforces this strict upper limit on the number of subcategories (Svenonius, 2000). As a result, the LCC tends to be flatter and the DDC deeper.

Library Classifications as Trees

We used the OhioLINK Circulation Data, a large publicly available data set of books and their circulation, to represent the books in our analysis of bias in library classification systems. OhioLINK contains 6.78 million MARC bibliographic records for books and manuscripts in the Ohio academic libraries (OhioLINK Collection Building Task Force et al., 2011). These bibliographic records include the LCC and DDC classifications assigned to a book. Only books that had both an LCC and a DDC classification were kept, resulting in 3.32 million books. These books were placed into the DDC and LCC tree structures using their classification numbers. For each book, we found the most specific category associated with its classification number, and then recursively added the book to each parent category until the top of the tree was reached. This ensured that each parent category contained all the books of its subcategories. For each book we stored its author and circulation statistics.
See Appendix B for more details.

Category Bias

Assume that blue and red represent distinct but comparable labels that can be applied to the internal nodes of a hierarchical classification system. For example, in Figure 1 red categories are related to western topics and blue categories are related to non-western topics. Category bias occurs when the system gives preferential treatment to one group of nodes (e.g., red nodes) over the other. We assume for now that an unbiased system treats red and blue categories identically.

Three kinds of category biases are illustrated in Figure 2: count bias, level bias, and descendant bias. Category count bias (Figure 2B) occurs when there are more red than blue categories in a classification scheme, so that more classification space is devoted to red categories. Category level bias (Figure 2C) occurs when red starting categories occur higher in a classification structure than blue starting categories. A starting category (or starting node) is the first category in a classification sub-tree that can be labelled as red or blue. Starting categories that are higher in the classification scheme are conceptualized as more general or important than those that are deeper. Finally, descendant bias (Figure 2D) occurs when red starting categories have, on average, more descendants than blue starting categories.
In other words, red categories are privileged by having more fine-grained category divisions.

The three biases in Figure 2 may often be correlated in practice: for example, if there are more red categories (category count bias), it is likely that red starting categories will have more descendants (descendant bias). The biases, however, are conceptually distinct and can be separated in principle. For example, Figure 2D shows that even when node counts are held constant for red and blue it is possible to observe level bias (in favour of blue) and descendant bias (in favour of red). We therefore propose that considering the three biases individually is worthwhile, as they highlight different aspects of category bias.

Item Bias

Instead of assigning the internal nodes of a hierarchical system to groups, assume now that purple and gold represent distinct but comparable labels that can be applied to a set of items. For example, gold items could be books written by men and purple items could be books written by women and nonbinary people. Figure 3 shows several examples in which the items appear as small circles at the leaves of a classification hierarchy. Item bias occurs when the system gives preferential treatment to one group of items (e.g., gold items) over the other. As before, we assume that an unbiased system would give equal treatment to gold and purple items.

Three kinds of item biases are illustrated in Figure 3: count bias, level bias, and distributional bias. Item count bias (Figure 3A.ii) occurs when there are more gold than purple items classified by a system.
Item level bias (Figure 3A.iii) is similar to category level bias, and occurs when gold items tend to be found higher in the classification tree than purple items. Finally, distributional bias (Figure 3B.iii) occurs when gold items are distributed more broadly across the classification system than are purple items. In other words, purple items are more restricted to a limited part of the classification scheme than are gold items.

Distributional bias can be diagnosed by comparing the shapes of the distributions of gold items to the shapes of the distributions of purple items. In Figure 3B.iii, the distribution of the gold items across the three categories at the lower level of the hierarchy is relatively flat, but the purple distribution is concentrated on the third of the three categories. In contrast, Figures 3B.i and 3B.ii show distributions of purple and gold authors that do not suffer from distributional bias. In Figure 3B.i the shape of the distribution of purple authors is identical to the shape of the distribution of gold authors. In Figure 3B.ii, although the distributions are not identical, they are shuffled versions of one another and therefore equally flat.

Figure 4 shows what distributional bias can look like in library classification systems. In the LCC, books by men are more evenly spread across the subcategories of "Handicrafts. Arts and crafts" than are books by women. The books by women are predominantly restricted to two subcategories, "Home arts. Homecrafts" (TT697-927) and "Clothing manufacture. Dressmaking. Tailoring" (TT490-695).
In the DDC's equivalent category, "Handicrafts," the difference between the shapes of the distributions of books by men and books by women is not as large.

An example of the distributions of male and female authors in the category "Handicrafts. Arts and Crafts" in the LCC (top) and "Handicrafts" in the DDC (bottom). The number of tagged items is the number of books in the dataset that have an author with a known gender. See Appendix A for the full set of subcategory names.

In the context of gender bias in library classification systems, item count bias is a clear example of external bias, as it comes from unequal numbers of purple and gold authors in the set of items to be classified. This bias could arise because society provides more opportunities for men than for women to write books, or because libraries are more likely to acquire books written by men (Quinn, 2012), or both. How to characterize distributional bias and level bias is less clear. For example, in the case of distributional bias in a library classification system, it might be that purple and gold authors write about equally diverse sets of topics, but that the interests of purple authors are given limited space in the classification system; the distributional bias would then be an example of an internal bias. It could also be that the topics addressed by purple authors are genuinely less diverse than the topics addressed by gold authors because of social pressures that encourage purple authors to specialize in a limited set of areas; the distributional bias would then be an example of an external bias.
Similarly, level bias against purple items could be the result of the classification system placing topics of interest to purple authors lower in the tree (internal bias), or of social pressures pushing purple authors to write in smaller, more niche categories (external bias). Although the origins of level bias and distributional bias may not be clear, both biases are worth investigating: they can provide insight into how different groups are represented in a classification scheme, regardless of whether this difference in representation is imposed by the system itself or is the result of external forces.

Methods

We manually tagged the categories selected for each topic as western or non-western, drawing on distinctions that have been previously suggested in the literature. Still, the tagging process is inevitably subjective, and in cases where a label of western or non-western was unclear, we left the category untagged, aiming for precision over recall. This somewhat limits the results, as there might be cases where a country or language falls into a category with a clear label in the LCC but not the DDC, or vice versa.

The classes related to history tended to be divided into categories based on geographical and political divisions such as country or continent. We therefore used a list of western countries defined on a cultural, as opposed to a political, economic, or geographical, basis (de Espinosa, 2017; Hall, 2018; Trubetskoy, 2017). For example, Australia tends to be considered a western country despite not being geographically in the western hemisphere.
The list included 68 countries, about 35% of the world's countries, and we assumed that countries left off the list were non-western. For each history-focused main class, we considered all categories associated with a country and tagged them as western or non-western based on the list. The tagged category became a starting category. If a category represented a group of countries (e.g., a category for a continent or a region) and all the categories beneath it shared the same tag, then that broader category became the starting category and inherited the tag. Similarly, every category under a starting category inherited the starting category's tag.

In the language- and literature-related classes, some categories were related to regional divisions, as in the history-focused classes, so we based our tagging on the list of western countries used previously. Examples of these categories include "German literature" and "Languages and literature of Eastern Asia, Africa, Oceania." Some categories were related to language families, so we considered where these languages or language groups originated to make the tagging choice; "Romance languages" is one example. The main deviation from the tagging method used for history was how we tagged Indigenous languages and literatures from North America, South America, and Oceania. Consistent with our cultural definition of the western concept (Hall, 2018), we tagged them as non-western even when they originated from a country or region listed as western.

Finally, for the main classes covering religion, we mostly tagged Abrahamic religions as western and other religions as non-western.
The few exceptions included tagging Scientology as western and Islam as non-western. Islam is an Abrahamic religion, but we made the conservative decision to tag it as non-western, because the opposite decision would probably only increase any western bias that we might find. The categories Doctrinal Theology and Practical Theology were tagged as western because they have only been used to classify literature on Christianity (Zins & Santos, 2011).

Results

In total there were 3009 categories on the topics of religion, language & literature, and history in the LCC, and 13,536 in the DDC. Based on the tagging method, 86.3% of categories could be tagged as either western or non-western in the LCC, and 91.4% in the DDC. We refer to tagged categories as "nodes" to be consistent with our use of a tree representation.

Category Count Bias.

Figure 5 shows analogous results for each of the three individual topics. For each topic, there is a higher percentage of nodes tagged as western than non-western. In the LCC, religion has the highest percentage of western nodes. In the DDC, history and religion have percentages that are almost equally high. For all topics, the DDC has a higher percentage of western nodes than the LCC. To test the statistical significance of this result we randomly assigned each node a western or non-western label with equal probability. We repeated the process 10,000 times, using the proportion of times the absolute difference between western and non-western node counts was greater than or equal to the observed absolute difference as the p value. For all topics in the DDC, and religion and history in the LCC, p < 0.001. For language & literature in the LCC, p = 0.003.
All category count biases were therefore statistically significant.

We have conservatively assumed that an unbiased system has an equal number of western and non-western nodes, but this assumption could be adjusted using statistics such as population sizes or the percentage of western countries. If anything, these statistics tend to suggest that an unbiased system should devote more space to non-western than to western nodes. For example, Africa and Asia accounted for 75% of the world's population in 2022 (United Nations, DESA, Population Division, 2022). Western category count bias is thus substantial relative to a conservative 50-50 baseline, and would be even stronger relative to a baseline favouring non-western nodes.

Library classification systems follow the principle of literary warrant, which means that their structures are derived from and justified by the body of literature that they classify (Svenonius, 2000). Based on this principle, it could be argued that there are more western categories because there are more western books that need to be classified. To test this idea, we calculated the mean rate of books per western node and per non-western node in each system. These rates are reported as labels below the x axis of Figure 5. An unbiased system might be expected to have roughly equal rates of books per node. We found that language & literature in the DDC and history in the LCC have relatively equal rates of books per node for western and non-western nodes. Otherwise, there tend to be more books per non-western node than per western node. The difference in rates is most pronounced for religion (0.17% vs. 1.10%) and history (0.13% vs. 0.77%) in the DDC.
These findings suggest that in some cases, especially in the DDC, the higher western category count cannot be entirely accounted for by literary warrant.

Similarly, it could be argued that there are more western nodes because western books are in higher demand than non-western books. To explore this idea we compared the circulation of books classified in western nodes to that of books classified in non-western nodes. Circulation statistics were drawn from the OhioLINK circulation data, and for each book we extracted three pieces of information: (i) whether the book was in circulation (i.e., available for borrowing) in 2007, (ii) whether the book was borrowed in 2007, and (iii) how often the book was borrowed in 2007. Mean values of all three variables are shown in Table 1. Across all topics and for both classification systems, a larger percentage of books classified under non-western nodes are in circulation than books classified under western nodes. For religion, a larger percentage of circulating non-western books than circulating western books were taken out in 2007, for both the LCC and DDC. In addition, among all religion books that were taken out, non-western books had a higher mean rate of circulation. The opposite was true for language & literature, where a larger percentage of circulating western books were taken out and western books had a higher mean rate of circulation. For history, these statistics varied slightly but were relatively similar for western and non-western books. Overall, the circulation statistics do not seem to justify the large discrepancy between western and non-western node counts.

Level Bias.

For each topic, Figure 6 shows the distributions of starting nodes over classification tree depths.
To quantify the difference between the western and non-western starting-depth distributions, we computed the Jensen-Shannon divergence (JSD) between them. To test the statistical significance of the results we performed permutation tests. For each topic, the depth labels were shuffled among the western and non-western nodes to create randomized depth distributions. This shuffling was carried out 10,000 times, and the proportion of times the JSD between the randomized western and non-western depth distributions was greater than or equal to the actual JSD was used as the p value.

The depths for the LCC and the DDC are not directly comparable, because the DDC tends to have a larger set of starting node depths than the LCC (e.g., the starting nodes for religion are spread across 5 different tree depths in the DDC versus just 2 in the LCC). We therefore computed an alternative measure of level bias: the probability that a randomly selected non-western starting node is deeper in a classification than a randomly selected western starting node. A western and a non-western starting node were randomly sampled 10,000 times, and the depths of the two nodes were compared to determine how often the non-western node was deeper than the western one and vice versa (ties were ignored). The resulting statistic measures the probability that a non-western starting node is deeper in a tree than a western starting node, given that they are not at the same depth. The results are shown in Table 2. For every topic except history in the DDC, it is more likely that a randomly selected non-western starting node is deeper in the tree than a randomly selected western node; for history in the DDC, western nodes have a higher chance of starting deeper in the tree. Based on this statistic, the LCC displays a stronger level bias than does the DDC.
To test for significance we performed a permutation test, randomly shuffling the depths among the western and non-western nodes and recalculating the probability that a non-western node was deeper in the tree. This was repeated 10,000 times, and the proportion of times the absolute difference between 50% and the recalculated probability was greater than the actual difference was used as the p value. The results are in Table 2. The significance by topic mirrored the significance of the initial divergence statistic for level bias.

Descendant Bias.

We measure descendant bias by comparing the mean number of descendants per western starting node to the mean number of descendants per non-western starting node. We also recorded the number of starting nodes and the mean percentage of books per starting node. All statistics were computed for the LCC and DDC overall, as well as separately for religion, language & literature, and history. The results are shown in Table 3. To test for significance, the western and non-western tags were randomly shuffled among the starting nodes and the absolute difference between the western and non-western descendant means was recomputed. This process was repeated 10,000 times, keeping track of the number of times the recomputed difference in means was greater than the absolute value of the observed difference in means.

Interim Summary

Our finding that both the LCC and DDC show western category bias is expected given previous work on biases in library classification (Knowlton, 2005; Zins & Santos, 2011), but our approach departs from previous work in attempting to systematically quantify the nature and extent of this bias.
For example, religion is known to be a topic that shows substantial western bias (Fox, 2019), but to our knowledge previous studies have not systematically quantified the level of bias observed for religion relative to the bias observed for other topics. Similarly, there have been suggestions that the DDC shows greater western bias than does the LCC (Sultanik, 2022), but prior work has not provided comprehensive quantitative analyses to support this claim.

Methods

To study item-level gender bias we worked with all books in our dataset that were classified under both the LCC and the DDC. We considered only books with non-empty author fields in their MARC records. Each of these books was tagged with the gender of its author. To determine an author's gender, we used data from the author-name-index and author-gender tables created and kindly shared by Ekstrand and Kluver (2021) as part of their book data integration pipeline, PIReT Book Data Tools. These tables store processed versions of author name and gender data from the Virtual International Authority File (VIAF). The VIAF stores author information, including the variants of an author's name and their gender. As discussed by Ekstrand and Kluver (2021), the VIAF unfortunately codes gender as binary and does not code for non-binary gender identities. Each author record is coded as male, female, or unknown. For now, our analysis is thus limited to analyzing item-level biases between male and female authors. When more accurate author data become available, the same analyses can be performed including non-binary gender identities.

There is no linking identifier between a MARC record and its author's VIAF record.
We followed the method for linking records used in the PIReT Book Data Tools (Ekstrand & Kluver, 2021). At a high level, string matching was used to tag a book with its author's gender, and there are three main cases in which an author's gender cannot be determined: first, when a book's author matches a VIAF record with the gender code "unknown"; second, when a book's author matches multiple VIAF records with conflicting known gender identities and thus has an ambiguous gender code; and third, when a book's author does not match any author record in the VIAF dataset. We discarded books in all three cases from our item-level analysis.

Results

Of the 3.32 million MARC records, 2.55 million had a non-empty author field, and 1.95 million of these could be tagged with an author's gender. Table 4 contains a breakdown of the record-matching process. Less than 1% of the records were tagged as ambiguous, and only 5% of the MARC records could not be linked to any VIAF record. These results are similar to those achieved by the PIReT Book Data Tools: in all the datasets they were applied to, between 3.3% and 6.9% of book records could not be matched to a VIAF record (Ekstrand & Kluver, 2021).

Item Count Bias.

We also compared the circulation of books by men to the circulation of books by women. The results are shown in Table 5. The percentage of books by women in circulation is equal to the percentage of books by men. In 2007, 37% of circulating books by women were borrowed, versus 29% of books by men. Similarly, borrowed books by women were taken out more times on average (3.30) than borrowed books by men (2.91).
These findings reveal that demand alone cannot explain the under-representation of female authors, and suggest that other forces systematically reduce their representation.

Level Bias.

We plotted the distributions of books over the classification tree depths in Figure 7 and computed the difference between the mean depths of books by men and books by women. In the LCC the difference is 0.001 and in the DDC it is 0.191. To test the significance of the results we performed a permutation test: we shuffled the classification depths among books and recomputed the difference between the mean depth of books by men and the mean depth of books by women. This was repeated 1000 times, and the proportion of times the random difference in means was greater than or equal to the actual difference in means was used as the p value. In the LCC p = 0.61 and in the DDC p < 0.001. For the books in the Ohio academic libraries, the LCC classifications did not yield a significant difference between the mean classification depth of books written by men and that of books written by women. For the DDC the difference is statistically significant, but small.

Distributional Bias.

Distributional bias is evident when the distribution of male authors across the children of a given node tends to be flatter than the distribution of female authors. To test for this bias we collected every node that had at least 100 Ohio library books, at least 2 children, and both male and female authors. This approach yielded 822 LCC nodes and 2832 DDC nodes that could be used to compare the distributions of books by male and female authors. See Appendix C for a breakdown of the nodes that could not be used in the distributional bias analysis.
To compute the distribution of male authors for each node, the number of male authors in each direct child node was divided by the total number of male authors across all child nodes. We did not use the total number of male authors in the parent node because not all items in a parent node are classified into one of its children. The same was done for the female author distribution. For example, in Figure 3B.iii there are 10 purple books: 1 is assigned to the first child node, 2 to the second, and 7 to the third, so the distribution of purple authors for that node is [0.1, 0.2, 0.7]. For each node, we compared the Shannon entropies of the male and female distributions to determine which of the two was flatter.

Figure 8 shows the number of nodes for which the male distribution was flatter than the female distribution. In both the LCC and the DDC the distribution of male authors among a node's children tended to be flatter than the distribution of female authors. The effect is stronger in the DDC, where the relative difference in size between the two counts is 2.34, as opposed to 1.80 in the LCC. To test the significance of these results, a permutation test was performed: the entire set of author gender labels was shuffled among the classified items, and for each node the author gender distributions were recalculated and the same flatness comparison applied. In both the DDC and the LCC the results were significant with p < 0.001.

DISCUSSION

In the first case study, we found that language & literature in the DDC had significant category count and level bias, and that religion and history had significant count bias in both systems.
These biases were in favour of western nodes and confirm previous findings that the DDC is biased in its categorization of non-western languages and literatures (Kua, 2008), and that non-western religions and topics are under-represented in both the DDC and LCC (Westenberg, 2022; Zins & Santos, 2011). We also found that a category system with count bias does not necessarily have level bias or descendant bias, and vice versa, suggesting that the three proposed biases quantify different aspects of category bias and together provide a relatively nuanced picture of how it manifests. Finally, we found that the DDC tends to show a higher degree of western category bias than does the LCC: there was evidence of strong category count and descendant bias in the DDC, whereas there was no evidence of descendant bias in the LCC.

In the second case study, we found that women are under-represented in the set of books we considered and that there is a strong distributional bias in favour of men in both the LCC and DDC. Previous studies have documented that topics relating to women in the LCC and DDC tend to be restricted to specific categories (Olson & Ward, 1997; Rogers, 1993), and our analyses support a similar conclusion, suggesting that books by women tend to be restricted to relatively limited sections of the LCC and DDC. Despite strong evidence of item count bias and distributional bias, we do not find much evidence of item level bias in favour of men or women in either system.
Like the three category biases we define, the three item biases provide a detailed picture of the different ways in which bias can appear in a system or in the set of items it classifies.

Why Are Library Classification Systems Biased?

Thirdly, the observed biases can be partially attributed to biased decisions made by the individuals who created these systems. Western bias has been widespread in Western culture over the past century, and has inevitably shaped the thinking of those who build and maintain library classification systems. The underlying psychological mechanisms that bias the decisions of librarians are likely to include mechanisms that drive biased categorization in general. One example is the out-group homogeneity effect, the tendency to perceive out-group members as less diverse than in-group members. The descendant bias in the DDC seems to mirror this effect, because finer-grained categories are used for Western (in-group) than for non-Western (out-group) topics. Like the out-group homogeneity effect, descendant bias in the DDC could potentially be attributed to greater familiarity with and exposure to western literature than to non-western literature, or to increased attention to and better memory for topics and features of literature that are relevant to the in-group (Das-Smaal, 1990; Park & Rothbart, 1982).
Similarly, category count bias could occur because features of in-group literature are easier to perceive and recall, making it easier to differentiate this literature and create more categories for it.

In our second case study, our results for gender demonstrate that the three item biases in Figure 3 are sensitive to the preferential treatment of different groups of items. As mentioned earlier, however, these biases may result from external social pressures affecting the items classified by a system, or may be imposed on the items by the classification system itself. The item count bias we find is clearly external to both classification systems, but it is unclear to what extent the distributional bias is internal or external to either the LCC or the DDC. Comparing the two systems provides some evidence that the item biases found in the DDC have an internal component. We found that the DDC has a stronger distributional gender bias than does the LCC, and a very slight item level bias where the LCC had none. These differences occurred even though the set of books considered was held constant across the two systems. Our results therefore suggest that some proportion of item bias is internal to the DDC, but do not allow us to tell whether the LCC is also subject to internal item biases.

Beyond Library Classification

Although our case studies focused on library classification, our methods are general and can potentially be applied to a broad range of hierarchical category systems. To illustrate, we apply our methods to WordNet (Miller, 1994). Both western bias and gender bias are potentially relevant.
For example, previous studies have documented western biases in ImageNet (Liu et al., 2021; Luccioni & Rolnick, 2023), and these biases are likely inherited from WordNet, the source of the ImageNet hierarchy. However, to illustrate the range of our methods, we consider a third kind of bias. Using a procedure described in Appendix D, we identified synsets in WordNet that correspond to species of mammals, tagged these species as wild or domestic, and then used our methods to measure the extent to which WordNet prioritizes domestic species ahead of wild species. Although domestic mammals account for less than 1% of all mammal species (Mammal Diversity Database, 2023), Table 6 shows that English WordNet 3.0 displays a clear bias for domestic over wild mammals. Despite a larger number of starting nodes for wild than for domestic species, count bias is present because there are more categories (i.e., WordNet synsets) overall for domestic than for wild mammals. Descendant bias is also present, because domestic categories tend to have more subcategories (i.e., hyponyms) than do wild categories, leading to a more fine-grained representation.

Table 6. Category bias analysis for domestic versus wild mammal species in English WordNet 3.0. The three measures of category bias reported are the total number of synsets (count bias), the mean depth of starting nodes (level bias), and the mean number of descendants per starting node (descendant bias). Starting nodes are WordNet synsets corresponding to different mammal species.
See Appendix D for full details of this analysis.

WordNet lies somewhere between an institutional category system and a natural category system, but our approach can also be used to quantify cultural and individual differences in natural category systems. Names of plants (Berlin, 1992), animals, artifacts (Rosch et al., 1976), body parts (Majid, 2010), and places (Basso, 1984; Burenhult & Levinson, 2008) are all organized into hierarchies or partonomies, and our methods could be applied to each of these cases. For example, consistent with our WordNet analysis, plant and animal names could be labelled as wild or domesticated, and future studies could measure the extent to which a folk taxonomy is biased towards domesticated ahead of wild species. The degree of bias is likely to vary across cultures, in line with existing findings that agricultural societies tend to have more names for plants than do hunter-gatherer societies (Balée, 1999; Berlin, 1992; Brown, 1985). Within cultures, the degree of bias is likely to correlate with factors such as expertise (Tanaka & Taylor, 1991). For example, Aguaruna Jivaro women have much more fine-grained categories for manioc (a tropical root crop native to South America) than do men, and this difference aligns with the division of labour between men and women in Aguaruna Jivaro culture (Boster, 1985).

We defined bias as a preference for one group over another, and focused on two cases (gender and western bias) where these preferences can cause harm, especially when systems that incorporate these preferences are perceived as objective.
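The three category-bias measures used in the WordNet analysis above can be sketched on a toy hierarchy. The node names and wild/domestic tagging below are illustrative assumptions, not WordNet data; in a real analysis the parent-child edges would come from WordNet's hyponym links.

```python
from collections import defaultdict

# Toy hierarchy given as child -> parent edges (illustrative names only).
PARENT = {
    "dog": "canine", "puppy": "dog", "lapdog": "dog", "working_dog": "dog",
    "cat": "feline", "kitten": "cat", "housecat": "cat",
    "wolf": "canine", "lynx": "feline",
    "canine": "carnivore", "feline": "carnivore",
}
CHILDREN = defaultdict(list)
for child, parent in PARENT.items():
    CHILDREN[parent].append(child)

def depth(node):
    """Number of edges from the root down to `node` (input to level bias)."""
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d

def descendants(node):
    """All nodes strictly below `node` (input to descendant bias)."""
    out, stack = [], list(CHILDREN[node])
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(CHILDREN[n])
    return out

def category_bias_measures(starting_nodes):
    """Mirror the table's three measures for a set of starting nodes:
    total category count, mean depth, and mean descendants per node."""
    n_desc = [len(descendants(n)) for n in starting_nodes]
    count = len(starting_nodes) + sum(n_desc)
    mean_depth = sum(depth(n) for n in starting_nodes) / len(starting_nodes)
    mean_desc = sum(n_desc) / len(starting_nodes)
    return count, mean_depth, mean_desc

# Illustrative wild/domestic tagging of starting nodes.
print(category_bias_measures(["dog", "cat"]))    # domestic species
print(category_bias_measures(["wolf", "lynx"]))  # wild species
```

On this toy tree the domestic species yield a higher total category count and more descendants per starting node than the wild species, mirroring the count and descendant biases reported for WordNet.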
In the case of folk taxonomy, however, a preference for domestic over wild species may be beneficial in supporting communication about the species of most interest to a given culture. Preferences are not necessarily harmful, and can instead illuminate the different needs, values, or roles of the people and cultures who create and use category systems. Our approach therefore joins a set of existing quantitative techniques that can provide insight into conceptual variation both across and within cultures (Romney et al., 1986, 2000).

Limitations

A key limitation that applies to both our case studies is that we focused on two US-based, western classification systems. Future work could aim to apply our methods to a more diverse set of library classification systems, including non-western systems such as Russian and Chinese library classification systems (Zhang, 2003), and systems like the Universal Decimal Classification, which was designed to be more comprehensive than the Dewey Decimal system. It is also important to note that in both studies our book-level statistics are based on data from the Ohio Academic Libraries. The analyses of bias in the LCC and DDC based on these statistics are thus limited to how biased the two systems are with respect to this specific group of western libraries. Despite these limitations, however, our analyses seem sufficient to demonstrate that our methods are capable of capturing biases in hierarchical category systems.

Future Work

Although we focused on hierarchical category systems, future work could apply some of our methods to measure bias in flat category systems.
One previous study in this area focused on gerrymandering, and developed methods for quantifying bias in United States congressional districts (McCartan & Imai, 2023). Some of our methods for detecting category and item biases in hierarchical category systems can be applied directly to flat systems. For example, item count bias and category count bias can be applied without modification. Distributional bias could also be adapted by considering the distribution of different groups across an entire flat system, instead of comparing distributions across the subcategories of each internal node.

Our work documents and quantifies biases in hierarchical classification systems, and future work could study the cognitive mechanisms that give rise to these biases. Perception, attention, and memory can all help to explain how biased collections of library books are created (Quinn, 2012), and the same three mechanisms are likely to contribute to biases in hierarchical category systems. For example, differences between in-group members are often perceived as larger than differences between out-group members, and therefore more worthy of being recognized in a classification system (Park & Rothbart, 1982). These perceptual differences may arise as a consequence of selective attention to features that are more relevant to in-group members than to out-group members (Das-Smaal, 1990). Familiarity and exposure can also lead to bias, because frequently encountered items (i.e., in-group members) are more likely to come to mind than items encountered rarely (out-group members).
Laboratory experiments have previously considered all of these factors, but more work is needed to explore how they produce biases in hierarchical systems of categories.

Finally, our methods could be used to explore how biases in category systems change and develop over time. Category systems are rarely formed all at once, and instead develop over time in response to a sequence of items. The sequence in which items are encountered can affect the categories that are created (Medin & Bettger, 1994), and future work could examine how bias is compounded or reduced as items are encountered over time. Knowlton (2005) studied historical change in the LCC by manually documenting all the ways in which the subject headings had and had not changed three decades after Berman (1971) proposed modifications to reduce bias. With access to historical versions of the LCC, DDC, or other category systems, our methods would allow us to explicitly quantify how these systems change on measures of category and item bias over time. We expect that institutional category systems should become increasingly unbiased with time, but it is possible that some structural biases may instead compound and increase.

Note. Author names are stored in the Main Entry-Personal Name field of a book's MARC record. This field records the person mainly responsible for the work (Library of Congress, 2022), whether they are the primary author of a multi-authored work or the editor of an anthology. For simplicity, we use the term "author" in all cases.
Metadata
Citation: Warburton, K., Kemp, C., Xu, Y., & Frermann, L. (2024). Quantifying Bias in Hierarchical Category Systems.