Networked Pantheon: a Relational Database of Globally Famous People

This article presents the Networked Pantheon, a relational database of biographies of globally famous people spanning the last 5,500 years of human history. This information source is intended to complement Pantheon 1.0 (Yu et al., 2016), a dataset that includes temporal, spatial, gender, and occupational information on 11,341 world-renowned people, defined as those who have biographies available in more than 25 languages on Wikipedia. The Networked Pantheon adds information about the biographical links between these historical figures, compiled from hyperlinks between the biographies in the English Wikipedia. This digital method enables techniques from network analysis to be used in studying the biographical relationships between globally famous people. Thus, distinct measures of historical centrality can be calculated for individuals, cities, countries, genders, and occupations. The Networked Pantheon includes indicators of each figure's centrality in the network of biographical references and provides an approximation of the information flows between various territories, genders, and occupations of famous people over time.

Following a series of pioneering studies in the area (Michel et al., 2011; Murray, 2003; Popescu & Grefenstette, 2010; Schich et al., 2014; Skiena & Ward, 2013), a remarkable database for conducting this type of research was published in 2016: Pantheon 1.0, a dataset of globally famous people that includes information about the 11,341 biographies present in more than 25 language versions of Wikipedia. Using the number of languages in which each biography is available as a proxy for its global cultural relevance, this dataset gathers indispensable information that historically locates recognized personalities (such as the year, city, and geographic coordinates of birth), data on the personal characteristics of each individual (e.g., gender and main occupation), and indicators of their historical popularity. In this way, Pantheon 1.0 enables the multidimensional study of the organization of the world's biographical knowledge in Wikipedia, facilitating the exploration of its temporal, spatial, gender, and occupational aspects over an enormous timeframe (3,500 BC to date).
This article presents the Networked Pantheon, a database designed to complement Pantheon 1.0 with relational observations. It provides essentially three new types of information. First, data on the biographical links between the notable people included in the Pantheon dataset, approximating these relationships from the hyperlinks between their biographies in the English Wikipedia. Second, network metrics for each biography, which can be used to better understand the structure of the information about prominent people in Wikipedia. Finally, the year of death of each historical figure, which enables each registered individual to be associated with a clearly delimited life period.
The Networked Pantheon has been used to study how hyperlinks between biographies enhance the dissemination of content about people born in some countries, which increases the geographical bias of Wikipedia's biographical record (Beytía, 2020). It could also be used to answer several other questions, such as:
- What links can be identified in the lives of world-renowned people?
- How are these links structured into networks of biographical references across space and time?
- Which historical figures are most central in Wikipedia's global network of biographical references?
- Which groups with high biographical interconnection can be identified?
- How independent or closed are the occupational networks in different periods and territories?
- To what extent have women been excluded from certain professions over time?
- How much have famous people from different occupations or territories tended to relate biographically to similar people?
- Which cities historically have stronger links in terms of the biographical connections of their scientists, artists or politicians?
The Networked Pantheon database is freely available on the Open Science Framework (OSF) server. It can be downloaded from the project's home page (www.osf.io/qtu2j/) or directly from the section that stores the files (www.osf.io/qtu2j/files/). It is registered under a Creative Commons "Attribution 4.0 International" license, which means that it can be freely shared and adapted, provided that credit is given to its authors and any changes to the original version are indicated.

Networked Pantheon research data journal for the humanities and social sciences 5 (2020) 50-65

2.1. Link Data
The hyperlinks between the Wikipedia biographies of 11,340 historically famous individuals contained in the original Pantheon Database [3] were extracted using the R software packages rvest (Wickham, 2016) and stringi (Gagolewski, 2020). These links were obtained from the English Wikipedia articles (extraction date 16.04.2018). [4] To this end, we first converted the HTML document of each Wikipedia article into an XML tree. From this tree, all nodes matching the selector "p a" were selected. The selector indicates that a node corresponds to a hyperlink, thus including all hyperlinks present in the Wikipedia article. In a second step, we checked which links pointed to the Wikipedia article of another famous person in the Pantheon Database, excluding all links that did not. [5]
The English language version was chosen because it is the most complete version (i.e., the one with the most articles, biographies, edits, and editors; Aragon et al., 2012; Nemoto & Gloor, 2011) and the one that registers the largest number of historical figures with biographies in 25 or more different languages. English Wikipedia includes biographies for all but one of the famous individuals recorded in Pantheon 1.0 (11,340 people in total), [6] followed by the French (11,334), German (11,319), Russian (11,314), and Spanish (11,287) versions. While language selection might imply a better record of biographical links of English-speaking people, it has been documented that many of the hyperlinks between Wikipedia biographies overcome language barriers (Aragon et al., 2012), and this phenomenon should be more common among biographies of globally known characters, such as those recorded in many different languages.

[3] Available at https://dataverse.harvard.edu/dataverse/pantheon. A new version, including people in more than 15 languages, is currently available at https://pantheon.world.
[4] Note that later updates of the Wikipedia pages might add or remove links relative to those recorded in the database.
[5] Initially, we compared this approach to (1) parsing the content of the Wikipedia page into plain text and searching directly for the names of the famous individuals in the Pantheon Database using fuzzy string matching, and (2) using more selective selectors (such as first filtering for "#content" and then selecting all nodes matching "p a"). Both yielded poorer results in heuristic validity checks that we performed on a small set of biographical Wikipedia articles from different historical domains and with different components and styles (for instance, names might be written in very different ways, e.g., in their Latin form; articles could include important biographical connections in biographical cards, picture descriptions or navigation boxes, and so on).
[6] The only biography included in Pantheon, but not available in the English Wikipedia, is that of the Italian photographer Augusto de Luca.
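The two-step procedure described in this section (select hyperlink nodes, then keep only the links that point to another Pantheon biography) can be sketched in a few lines. The authors worked in R with rvest and stringi; the Python version below uses only the standard library and is an illustrative analogue, not their code. The names `ParagraphLinkParser` and `biography_links`, the sample HTML, and the assumption that internal links have the form `/wiki/<Title>` are introduced here for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class ParagraphLinkParser(HTMLParser):
    """Collect the href of every <a> element nested inside a <p> element."""

    def __init__(self):
        super().__init__()
        self.p_depth = 0   # number of <p> elements currently open
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_depth += 1
        elif tag == "a" and self.p_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth > 0:
            self.p_depth -= 1


def biography_links(html, known_titles):
    """Return the linked article titles that belong to `known_titles`."""
    parser = ParagraphLinkParser()
    parser.feed(html)
    targets = set()
    for href in parser.hrefs:
        # Assumption: internal article links look like /wiki/<Title>[#anchor]
        if href.startswith("/wiki/"):
            title = unquote(href[len("/wiki/"):].split("#")[0])
            if title in known_titles:
                targets.add(title)
    return targets


# Hypothetical sample page and list of database members
sample_html = """
<p><a href="/wiki/Plato">Plato</a> taught
<a href="/wiki/Aristotle#Life">Aristotle</a> in
<a href="/wiki/Athens">Athens</a>.</p>
<div><a href="/wiki/Socrates">Socrates</a></div>
"""
known = {"Plato", "Aristotle", "Socrates"}
links = biography_links(sample_html, known)
```

In this toy page, the link to Athens is dropped because it is not in the list of biographies, and the link to Socrates is dropped because it is not matched by a "p a"-style selection.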

2.2. Year of Death
We extracted the year of death of the famous personalities from the HTML code of their Wikipedia pages by looking for the string "died" in relation to various date formats. To do so, we selected the biography boxes from the XML tree (selector ".vcard") and searched them for the year of death (extraction date 13.03.2018). The extracted dates were checked for plausibility against the birth dates already present in the original Pantheon database. Subsequently, we manually checked and corrected the 436 implausible cases (between 13.03.2018 and 09.04.2018). Some biographies did not have a date of death, mostly because they describe very old historical figures who lack precise historical records. In a total of 136 cases, the dates of death had to be imputed. These imputations were calculated from the median lifespan associated with the historical period of each figure, which was approximated from the lifespans of the 10 closest cases according to year of birth.
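The plausibility check and the nearest-neighbour imputation can be illustrated with a minimal sketch. It is one possible reading of the description, not the authors' implementation: the function names, the 120-year upper bound used in `is_plausible`, and the toy (birth, death) pairs are assumptions made here for illustration.

```python
from statistics import median


def is_plausible(birth_year, death_year, max_age=120):
    """Flag death years that precede birth or imply an extreme lifespan.

    The max_age threshold is a hypothetical choice, not taken from the source.
    """
    return 0 <= death_year - birth_year <= max_age


def impute_death_year(birth_year, known_cases, k=10):
    """Impute a death year from the median lifespan of the k cases
    whose year of birth is closest to `birth_year`.

    known_cases: list of (birth_year, death_year) pairs with both years known.
    """
    nearest = sorted(known_cases, key=lambda case: abs(case[0] - birth_year))[:k]
    lifespans = [death - birth for birth, death in nearest]
    return birth_year + round(median(lifespans))


# Toy data (invented years), using k=3 instead of 10 to keep the example small
toy_cases = [(100, 160), (110, 150), (120, 190), (300, 380), (310, 340)]
imputed = impute_death_year(115, toy_cases, k=3)
```

Here the three nearest cases have lifespans of 40, 70 and 60 years, so the median lifespan of 60 years is added to the birth year.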

2.3. Network Measures
For each biography registered in the Networked Pantheon database, a series of structural measures was calculated from the network of biographical connections. These indicators are as follows:
- Degree: number of connections or edges that one node (or biography) has to other nodes (Freeman, 1978-1979).
- Indegree: number of edges (hyperlinks) going into a node.
- Outdegree: number of edges coming out of a node.
- Betweenness: the frequency with which a node appears in the shortest path between the nodes of the network (Brandes, 2001; Freeman, 1978-1979).
- Eigen Centrality (eigenvector): the centrality of an actor in proportion to the sum of the centralities of its neighbors in the graph (Bonacich, 1987).
- PageRank: the measure of the global importance of nodes, computed recursively by placing greater weight on incoming connections from central nodes (Brin & Page, 1998; Page et al., 1999).
- Eccentricity: the distance between a node and that furthest away from it in the network (Hage & Harary, 1995).
- Closeness Centrality: the distance between a node and all other nodes in the network, based on the arithmetic mean of the minimum path between the nodes (Freeman, 1978-1979).
- Harmonic Closeness Centrality: the distance between a node and all other nodes in the network, based on the harmonic mean of the minimum path between the nodes (Rochat, 2009).
- Authority: a good authority is a webpage (biography) that is pointed to by many good hubs (Kleinberg, 1998).
- Hub: a good hub is a webpage (biography) that points to many good authorities (Kleinberg, 1998).
- Clustering: the degree to which the nodes tend to cluster together (Saramäki et al., 2007).
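To make three of these indicators concrete, the sketch below computes indegree, outdegree and PageRank on a four-node toy graph in plain Python. It is not the software used to build the database: the function names, the toy edge list, the damping factor of 0.85 and the fixed number of iterations are assumptions chosen for illustration.

```python
from collections import Counter


def degree_counts(edges):
    """Return (indegree, outdegree) counters for a list of directed edges."""
    indegree = Counter(dst for _, dst in edges)
    outdegree = Counter(src for src, _ in edges)
    return indegree, outdegree


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; damping=0.85 is an assumed setting."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Node without outgoing links: spread its rank over all nodes
                share = damping * rank[src] / n
                for dst in nodes:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Toy network: an edge (X, Y) means biography X links to biography Y
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
indegree, outdegree = degree_counts(edges)
ranks = pagerank(nodes, edges)
most_central = max(ranks, key=ranks.get)
```

In this toy graph, node C receives the most hyperlinks and also obtains the highest PageRank, while node D, which nobody links to, keeps only the baseline score.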

2.4. Biographical Centrality Index (BCI)
As a complement to the Historical Popularity Index, the Networked Pantheon includes a Biographical Centrality Index (BCI) for each historical figure that denotes their cultural ubiquity (approximated by the number of languages in which a biography is available) weighted by its biographical connectivity (approximated by the PageRank algorithm). Considering the number of language versions of a biography (NL) and its PageRank (PR), the non-normalized BCI is the product of both values (NL × PR). This indicator was then normalized through the feature scaling method. Once we identified, in the complete group of biographies, the minimum, min(NL × PR), and the maximum, max(NL × PR), of the non-normalized BCI, normalization was carried out using this formula:

\[ \mathrm{BCI} = \frac{(NL \times PR) - \min(NL \times PR)}{\max(NL \times PR) - \min(NL \times PR)} \]

The BCI can be understood as a normalized indicator of the probability that a historical character would appear linked to a random biographical search in a random language in Wikipedia. It refers to a figure's degree of multilingual exposure and connectivity. The indicator is relative to the distribution of the cultural ubiquity and biographical connectivity of the total number of individuals considered, and it can be interpreted as the centrality of a biography compared to the most central one of the sample (as a value between 0 and 1, where 1 represents the greatest possible centrality).
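As a worked illustration of this normalization, the sketch below multiplies the number of language versions by the PageRank of each biography and rescales the products to the interval from 0 to 1 using the minimum and maximum of the sample (min-max feature scaling). The function name and the three toy records are hypothetical and are not values from the database.

```python
def biographical_centrality(records):
    """Min-max normalize NL x PR over all biographies.

    records: dict mapping a name to (number_of_languages, pagerank).
    Returns a dict mapping each name to a value between 0 and 1.
    """
    raw = {name: nl * pr for name, (nl, pr) in records.items()}
    lowest, highest = min(raw.values()), max(raw.values())
    if highest == lowest:
        # Degenerate sample: every biography has the same raw score
        return {name: 0.0 for name in raw}
    return {name: (value - lowest) / (highest - lowest)
            for name, value in raw.items()}


# Toy records (invented numbers): name -> (NL, PR)
toy_records = {
    "person_a": (30, 0.001),
    "person_b": (100, 0.004),
    "person_c": (50, 0.002),
}
bci = biographical_centrality(toy_records)
```

The biography with the smallest product receives 0, the one with the largest product receives 1, and `person_c` falls at roughly 0.19 between them.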
The BCI should be clearly distinguished from the Historical Popularity Index (HPI) included in Pantheon 1.0 for at least three reasons:
1. The BCI considers biographical connectivity as a relevant indicator for ranking the influence of characters on the discursive structure of Wikipedia.
2. It is an indicator focused on the organization of the content produced (supply of information), without considering the request or demand for biographical information ("page views" variable).
3. The BCI aims to be a tool to understand how historical memory is currently being structured in Wikipedia, highlighting disparities in the distribution of historical information and concentrations of hyperlink "flows" in certain periods, territories, genders and occupations. Therefore, it is not an adequate indicator for approaching the historical importance of each character, which should consider, for example, an adjustment for the excessive relative importance of some 20th-century characters.

Database Description

3.1. General Aspects
The Networked Pantheon records 126,279 direct relationships between 11,340 digital biographies of globally famous people, defined as those who have Wikipedia biographies in more than 25 languages. On average, each biography has 11.12 hyperlinks to others, ranging from 0 (e.g., Al Capone) to 105 (Meryl Streep). The number of incoming hyperlinks to these biographies varies between 0 (e.g., Josep Guardiola) and 414 (Barack Obama). The diameter of the network (that is, the length of the shortest path between the two most distant nodes) is 15, while the average path length is 4.81. Figure 1 illustrates the general topography of the network, using colors to distinguish the large occupational domains classified in the Pantheon dataset (army, business and law, government, public figure, science and technology, arts, exploration, humanities, religion, and sports). As an example of the composition of the network in specialized fields, a cluster of globally recognized soccer players is shown in greater detail.
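The diameter and the average path length can both be derived from breadth-first searches started at every node. The sketch below is a minimal illustration on a toy graph: it treats hyperlinks as directed and ignores pairs of nodes that cannot reach each other, which are assumptions, since the text does not state which conventions were used. The function names and the toy adjacency list are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of hops from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter_and_average_path(adjacency):
    """Longest and mean shortest-path length over all reachable ordered pairs."""
    lengths = []
    for source in adjacency:
        dist = shortest_path_lengths(adjacency, source)
        lengths.extend(d for node, d in dist.items() if node != source)
    return max(lengths), sum(lengths) / len(lengths)


# Toy directed network: each key links to the biographies in its list
toy_adjacency = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}
diameter, average_path = diameter_and_average_path(toy_adjacency)
```

In the toy graph there are six reachable ordered pairs with path lengths 1, 1, 2, 1, 2 and 1, so the diameter is 2 and the average path length is 8/6.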

3.2. Dynamic Analysis
This database allows the analysis of networks of biographical references among contemporaries, and thus the comparison of linkage structures in specific historical periods. Moreover, this figure illustrates how the selected periods constitute different networks in terms of shape, size, density and occupational composition. Since data can be separated by year of birth, place of birth, gender and occupational domain, more specialized studies are also feasible.

3.3. Flow of Biographical References
With this database, the biographical relationships between historical figures can be geographically located; thus, it is possible to approach the flow of

Figure 4 Flows of biographical references between occupational fields
Note: The left side shows the occupational domains, ordered by the number of biographical references (hyperlinks) that they generate and the right side represents the same domains ordered according to the number of references that they receive. The percentage of links coming from the same domain is specified in parentheses on the right side.

Figure 4 represents the flows of references (origin and destination) between occupational domains. As can be seen, a large number of the hyperlinks received in each domain come from the same domain. This degree of "self-referencing" is specified on the right side of the graph, which gives the percentage of its received links that each occupational domain receives from itself. It could be inferred from this indicator, for example, that sports, arts and government are the most self-referential or autonomous occupational domains of the network.
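The self-referencing percentage shown in Figure 4 can be reproduced from a list of links and a table of occupational domains. The following sketch shows one way to compute it; the function name, the person identifiers and the toy links are invented for illustration and do not come from the database.

```python
from collections import Counter


def self_reference_share(links, domain_of):
    """Percentage of the links received by each domain that come from itself.

    links: iterable of (source_person, target_person) pairs.
    domain_of: dict mapping each person to an occupational domain.
    """
    received = Counter()
    from_same_domain = Counter()
    for source, target in links:
        target_domain = domain_of[target]
        received[target_domain] += 1
        if domain_of[source] == target_domain:
            from_same_domain[target_domain] += 1
    return {domain: 100.0 * from_same_domain[domain] / received[domain]
            for domain in received}


# Toy data (hypothetical people and links)
toy_domains = {"p1": "sports", "p2": "sports", "p3": "arts", "p4": "arts"}
toy_links = [("p1", "p2"), ("p2", "p1"), ("p3", "p1"), ("p3", "p4"), ("p1", "p4")]
shares = self_reference_share(toy_links, toy_domains)
```

In the toy data, sports receives three links, two of them from sports (about 66.7%), and arts receives two links, one of them from arts (50%).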

3.4. Structural Measures of Centrality and Influence
Measures of network centrality capture different relational phenomena, so they can be used to rank historical characters in different ways. Based on the hyperlinks between biographies in the English Wikipedia, Table 1 compares the top 10 historical figures according to selected coefficients. In the table, the Eigenvector coefficient highlights the biographical centrality of U.S. presidents, while the Authority coefficient prioritizes ATP tennis players; PageRank highlights the role of 20th-century politicians and classical humanists, while Betweenness widens the range of influential occupations to religious leaders, former politicians, scientists, sportsmen, and businessmen. The Closeness coefficient is the least biased towards Western culture, and it includes sultans, Nobel Laureates in Physics, athletes, and politicians.

3.5. Biographical Centrality Index (BCI)
The BCI is an indicator of the positioning of biographical information in Wikipedia, and it can be used for spatial and temporal analysis. Figure 5 shows the geographical distribution of the accumulated biographical centrality by country. As can be seen, the biographical centrality in Wikipedia is not evenly distributed across territories but clearly concentrated in the United States and Western Europe.
Biographical positioning is also concentrated in certain historical periods (Figure 6). Given that the number of biographies has grown exponentially over the last five centuries, there is also a higher accumulation of biographical centrality (BCI sum) in that period. A less intuitive pattern emerges when observing changes in the mean biographical centrality over time. Figure 6 shows that there are several periods in ancient history when famous people average high levels of centrality within the current historical record. For example, there is a centrality peak around 400 BC, when Greece was a center of science and humanities, and another around the year 0, when many religious figures linked to Christianity were born.
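The temporal comparison of summed and mean centrality can be sketched by grouping biographies into 50-year periods, the interval mentioned in the note on the historical average. The sketch below assumes simple, non-overlapping bins by year of birth, which may differ from the accumulation procedure actually used; the function name and the toy (birth year, BCI) pairs are hypothetical.

```python
from collections import defaultdict


def bci_by_period(people, width=50):
    """Group (birth_year, bci) pairs into periods of `width` years.

    Returns {period_start: (bci_sum, bci_mean)}. Negative years stand for BC;
    floor division assigns them to the period that starts at or before them.
    """
    buckets = defaultdict(list)
    for birth_year, bci in people:
        period_start = (birth_year // width) * width
        buckets[period_start].append(bci)
    return {start: (sum(values), sum(values) / len(values))
            for start, values in sorted(buckets.items())}


# Toy data (invented values): (birth year, BCI)
toy_people = [(-428, 0.5), (-384, 0.25), (1879, 0.75), (1881, 0.25)]
periods = bci_by_period(toy_people)
```

In the toy data, the period starting in 1850 has the highest BCI sum (1.0) because it contains two biographies, while its mean (0.5) is no higher than that of the period starting in 450 BC, which mirrors the contrast between sum and mean described above.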

Concluding Remarks
The Networked Pantheon (www.osf.io/qtu2j/) aims to increase the huge analytical potential of Pantheon 1.0 by adding relational observations that can answer new questions about digitally constructed history, collaborative media, cultural influence, distribution of information, global collective memory, and biographical knowledge structuration, among other areas. This database could be used for many purposes, such as:
- Studying the historical links between world-famous people, or more precisely, the collective memory of those links in Wikipedia.
- Identifying, in different historical periods, clusters of individuals highly interconnected by their biographical records.
- Calculating the independence or closure of occupational networks in different territories and historical periods.
- Investigating the degree of gender segregation in the biographical networks of particular occupations.
- Studying homophily (the tendency of similar people to relate to one another) in occupations and geographical locations over time.
- Researching the scientific, artistic or political exchange (biographical flows) between cities, countries or continents.
These topics may be associated with widely established fields of research, such as digital humanities, media studies, computer science or computational linguistics, but also with the emerging computational and digital social sciences (Lazer et al., 2009), which in recent years have undergone an adequate level of methodological reflection (Rieder & Röhle, 2012; Rogers, 2013; Venturini et al., 2018) and growing institutionalization in new sub-disciplines, such as digital sociology (Lupton, 2014; Marres, 2017; Orton-Johnson & Prior, 2013), digital anthropology (Horst & Miller, 2013; Miller & Slater, 2000), and digital geography (Graham, 2014; Zook et al., 2004).

Note: The historical average of the BCI was calculated for the accumulation of historical characters over 50-year periods.