
Temporal graph-based clustering for historical record linkage

Work-in-progress paper
Charini Nanayakkara, Research School of Computer Science, The Australian National University, Canberra, ACT 2601, Australia, charini.nanayakkara@anu.edu.au; Peter Christen (ORCID 0000-0003-3435-2015), Research School of Computer Science, The Australian National University, Canberra, ACT 2601, Australia, peter.christen@anu.edu.au; and Thilina Ranbaduge (ORCID 0000-0001-5405-3704), Research School of Computer Science, The Australian National University, Canberra, ACT 2601, Australia, thilina.ranbaduge@anu.edu.au
(2018)
Abstract.

Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains are linked and integrated to allow advanced analytics. Popular types of data used in this context are historical censuses, as well as birth, death, and marriage certificates. Individually, however, such data sets limit the types of studies that can be conducted. Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees spanning several decades are available, it is possible to, for example, investigate how education, health, mobility, employment, and social status influence each other and the lives of people over two or even more generations. A major challenge, however, is the accurate linkage of historical data sets, which is difficult due to low data quality and commonly also the lack of available ground truth data. Unsupervised techniques therefore need to be employed, which can be based on similarity graphs generated by comparing individual records. In this paper we present results from clustering birth records from Scotland where we aim to identify all births of the same mother and group siblings into clusters. We extend an existing clustering technique for record linkage by incorporating temporal constraints that must hold between births by the same mother, and propose a novel greedy temporal clustering technique. Experimental results show improvements over non-temporal approaches; however, further work is needed to obtain links of high quality.

Entity resolution, birth records, Scottish, star clustering.
Copyright: rights retained. Conference: 14th International Workshop on Mining and Learning with Graphs; August 2018; London, UK. CCS concepts: Information systems → Entity resolution; Information systems → Clustering; Information systems → Temporal data; Theory of computation → Graph algorithms analysis.

1. Introduction

Databases that contain personal information, such as censuses or historical civil registries (Reid et al., 2002), generally contain multiple records describing the same individual (entity) or group of individuals such as families or households, where each individual will occur in such databases with different types of roles (Christen, 2016; Christen et al., 2017). A baby is born, then recorded as a daughter or son in a census, and later she or he might marry (as a bride or groom) and become the mother or father of her or his own children. Being able to link such records across different databases will allow the reconstruction of whole populations and open a multitude of studies in the health and social sciences that currently are not feasible on individual databases (Bloothooft et al., 2015; Kum et al., 2014).

The process of identifying the sets of records that correspond to the same individual is known as record linkage, entity resolution, or data matching (Christen, 2012). Record linkage involves comparing pairs of records to decide if the records of a pair refer to the same entity (known as a match) or to different entities (a non-match). In such a comparison process, the similarities between the values of a selected set of attributes are generally calculated to decide if a pair of records is similar enough to be classified as a match (for example, if the similarities are above a pre-defined threshold value). In many application domains, however, this simple pair-wise linkage process does not provide enough information to identify the relationships between different individuals (Christen, 2016; Dong and Srivastava, 2015).

Recently, in contrast to traditional pair-wise record linkage, group linkage (On et al., 2007) has received significant attention because of its applicability to linking groups of individuals, such as families or households (Christen et al., 2017; Fu et al., 2012). The identification of relationships between individuals can enrich data and improve its quality, and thus facilitate more sophisticated analysis of different socio-economic factors (such as health, wealth, occupation, and social structure) of large populations (Fu et al., 2014a; Grundy and Tomassini, 2005). Studying these issues is important to identify how societies evolve over time and to discover the changes that influenced and contributed to social evolution (Dong and Tan, 2015).

Historical record linkage involves the linkage of historical records, including records from censuses as well as from birth, death, and marriage certificates, to construct longitudinal data sets about a population. Over the past two decades researchers working in different domains have studied the problem of historical record linkage. In 1996 Dillon investigated an approach to link census records from the US and Canada to generate a longitudinal database to examine changes in household structures (Dillon, 1996). The Integrated Public Use Microdata Series (IPUMS, see: https://www.ipums.org/) is a large project initiated by the Minnesota Population Centre (MPC) for linking large demographic data collections. The Life-M project is another example of transforming records from birth, marriage, and death certificates as well as census records into an intergenerational longitudinal database (Bailey et al., 2017). The project considers US data from the 19th and 20th centuries and aims to use birth certificates as a basis for historical record linkage of large historical databases.

The Digitising Scotland project (Dibben et al., 2012), which this work is a part of, aims to transcribe and link all civil registration events recorded in Scotland between 1856 and 1973. Around 14 million birth, 11 million death, and 4 million marriage records need to be linked to create a linked database covering the whole population of Scotland spanning more than a century to allow researchers in various domains to conduct studies that are currently impossible to do.

Here we present work-in-progress on a specific step used in traditional family reconstruction as conducted by demographers and historians (Reid et al., 2002; Wrigley and Schofield, 1973): the bundling (clustering) of birth records by the same mother to identify siblings. Once sibling groups have been identified, they can be linked to census, marriage, and death records using group linkage techniques (Fu et al., 2014b). Linked bundles of siblings allow a variety of studies, for example about fertility and mortality and how these have changed over time (Reid et al., 2002).

Contributions: In this paper we investigate how clustering techniques for entity resolution (Hassanzadeh et al., 2009; Saeedi et al., 2017) can be employed for bundling birth records by the same mother, where temporal constraints can be incorporated to ensure no biologically impossible birth records by the same mother are linked together. We propose and evaluate a novel greedy temporal clustering approach, and compare it with a temporal variation of an existing clustering technique for entity resolution which has been shown to work well in a previous study (Saeedi et al., 2017). We conduct an empirical study on a data set from Scotland which has been extensively linked semi-manually by domain experts (Reid et al., 2002), providing us with ground truth data to calculate linkage quality. We show that temporal clustering techniques can outperform non-temporal techniques in terms of linkage quality.

2. Related Work

Record linkage has been an active field of research for over half a century in several research domains. Several recent books and surveys provide different perspectives of this area (Christen, 2012; Dong and Srivastava, 2015; Harron et al., 2015; Naumann and Herschel, 2010).

Classification techniques for record linkage can be categorised into supervised and unsupervised techniques. Clustering techniques, which are unsupervised, view record linkage as the problem of how to identify all records that refer to the same entity and to group these records into the same cluster. Hassanzadeh et al. (Hassanzadeh et al., 2009) presented a framework to comparatively evaluate different clustering techniques for record linkage. Saeedi et al. (Saeedi et al., 2017) recently proposed a framework to perform clustering for record linkage on a parallel platform using Apache Flink. Both these frameworks have implemented and evaluated several clustering approaches. In the evaluation by Saeedi et al. (Saeedi et al., 2017) star clustering (as described and modified in Section 3.3) was one of the overall best performing techniques compared to other clustering techniques. Neither of the two frameworks, however, has considered temporal constraints.

The linkage of historical data collections with the aim to produce large temporal linked data sets has recently received increased attention within the context of population reconstruction (Bloothooft et al., 2015; Kum et al., 2014). Such linked population databases can be an exciting resource in areas such as health, history, and demography because these databases allow answering complex questions about temporal changes of a society that so far have been impossible to address. Most projects in historical record linkage are challenged by low data quality (due to scanning and transcription errors of handwritten forms), as well as a lack of ground truth data (which is difficult and expensive to obtain). Therefore, research in this area has concentrated either on exploiting the structure in such data sets (such as households and families) by developing group linkage methods (Christen et al., 2017; Fu et al., 2014a, b; On et al., 2007), or on collective techniques (Christen, 2016). Alternative approaches explore the use of limited ground truth data for evaluating linkage quality (Antonie et al., 2014; Bailey et al., 2017).

3. Temporal Graph Linkage

Our overall linkage approach consists of two major phases which we describe in detail in this section. First we generate an undirected graph based on pair-wise similarity calculations between individual records (birth certificates in our case). This is followed by a clustering of records (nodes) in this graph where we do take temporal constraints between records into account, as we describe in Section 3.2. In Sections 3.3 and 3.4 we discuss two temporal clustering approaches, the first based on the extension of an existing star-based clustering approach (Hassanzadeh et al., 2009; Saeedi et al., 2017), while the second approach generates clusters in a greedy temporal manner.

For notation we use bold letters for lists, sets and clusters (upper-case bold letters for lists of sets, lists and clusters), and normal type letters for numbers and text. Lists are shown with square and sets with curly brackets, where lists have an order but sets do not.

3.1. Similarity Graph Generation

Algorithm 1: Pair-wise similarity graph generation
Input:
- $\mathbf{R}$: List of records to be linked
- $\mathbf{A}$: List of attributes from $\mathbf{R}$ to be compared
- $\mathbf{S}$: List of similarity functions to be applied on attributes from $\mathbf{A}$
- $\mathbf{w}$: List of weights given to attribute similarities, with $|\mathbf{w}| = |\mathbf{S}|$
- $b, r$: Number of bands and band size for min-hash based LSH blocking
- $s_{min}$: Minimum similarity for record pairs to be added to the generated graph
Output:
- $\mathbf{G}$: Undirected pair-wise similarity graph
1: $\mathbf{V} = \emptyset$, $\mathbf{E} = \emptyset$, $\mathbf{G} = (\mathbf{V}, \mathbf{E})$   // Initialise empty graph
2: $\mathbf{L} = \mathbf{MinHashLSHIndexing}(\mathbf{R}, b, r)$   // Generate Min-hash index
3: for $\mathbf{l} \in \mathbf{L}$ do:   // Loop over all Min-hash blocks
4:   for $(r_i, r_j) : r_i \in \mathbf{l}, r_j \in \mathbf{l}, r_i.id < r_j.id$ do:
5:     $\mathbf{s}_{i,j} = \mathbf{CompareRecords}(r_i, r_j, \mathbf{A}, \mathbf{S}, \mathbf{w})$   // Compute similarities
6:     $s_{i,j} = \mathbf{Normalise}(\mathbf{s}_{i,j}, \mathbf{w})$   // Normalise the similarity
7:     if $s_{i,j} \geq s_{min}$ then:
8:       $\mathbf{AddNodes}(\mathbf{G}.\mathbf{V}, \{r_i, r_j\})$   // Create two new nodes in $\mathbf{G}$
9:       $\mathbf{AddEdge}(\mathbf{G}.\mathbf{E}, (r_i, r_j), s_{i,j})$   // Create an edge in $\mathbf{G}$
10: return $\mathbf{G}$

The steps involved in the pair-wise similarity calculation phase are outlined in Algorithm 1. The main input to the algorithm is a list of records, $\mathbf{R}$, which we aim to link and cluster (in our case we aim to determine which birth records are by the same mother). We assume each record has a unique numerical identifier, $r.id$, and a time-stamp, $r.t$, which in our case is the registration date of a birth certificate. We use the list $\mathbf{A}$ of attributes which we will compare between records using the list of similarity functions $\mathbf{S}$. These are approximate string matching functions such as Jaro-Winkler or edit distance (Christen, 2006), or functions specific to the content of an attribute like a numerical year difference function (Christen, 2012). We also provide a list of weights, $\mathbf{w}$, to be assigned to the calculated similarities. The value of the similarity $s_a$ for attribute $a \in \mathbf{A}$ between two records $r_i$ and $r_j$ will be calculated as $s_a(r_i, r_j) = \mathbf{S}_a(r_i, r_j) \cdot w_a$, where $w_a$ is the weight for attribute $a \in \mathbf{A}$ and $\mathbf{S}_a$ is the similarity function used on $a$. The attributes and corresponding weight values we use in our experiments are shown in Table 1 in Section 4.
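As a minimal sketch of the weighted comparison described above, the following Python code computes the vector of weighted attribute similarities and a single normalised score. The function names, the use of `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler, the `max_diff` parameter, and the normalisation by the sum of weights are all assumptions made for illustration; the text does not specify how the normalisation is carried out.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Stand-in approximate string similarity in [0, 1] (not Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def year_sim(a: int, b: int, max_diff: int = 5) -> float:
    """Hypothetical numerical year-difference similarity in [0, 1]."""
    return max(0.0, 1.0 - abs(a - b) / max_diff)


def compare_records(ri, rj, attrs, sim_funcs, weights):
    """Return the vector of weighted similarities s_a = S_a(ri, rj) * w_a."""
    return [f(ri[a], rj[a]) * w for a, f, w in zip(attrs, sim_funcs, weights)]


def normalise(sim_vector, weights):
    """Assumed normalisation: sum of weighted similarities over sum of weights."""
    return sum(sim_vector) / sum(weights)


# Toy records with made-up attribute names and weights.
r1 = {"surname": "smith", "year": 1870}
r2 = {"surname": "smith", "year": 1872}
weights = [2.0, 1.0]
vec = compare_records(r1, r2, ["surname", "year"], [string_sim, year_sim], weights)
score = normalise(vec, weights)
```

With this normalisation the score stays in the range 0.0 to 1.0 as long as each similarity function returns values in that range and the weights are positive.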

In order to prevent a full pair-wise comparison of each record in $\mathbf{R}$ with every other record in $\mathbf{R}$ (which has a complexity of $O(|\mathbf{R}|^{2})$), we employ min-hashing based on locality sensitive hashing (LSH) (Leskovec et al., 2014), which requires the two parameters $b$ (the number of min-hash bands) and $r$ (the band size). Furthermore, we provide a minimum similarity threshold $s_{min}$ which determines which record pairs are to be included in the similarity graph $\mathbf{G}$ being generated.
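The blocking step can be illustrated with a small, self-contained min-hash LSH sketch. It is not the implementation used in the experiments: the character bigram tokenisation, the seeded MD5 hash family, the representation of each record as a single string, and the function names are assumptions. Only the parameters $b$ and $r$ follow the text.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, q: int = 2) -> set:
    """Character q-grams of a lower-cased string (assumed tokenisation)."""
    text = text.lower()
    return {text[i:i + q] for i in range(max(1, len(text) - q + 1))}


def _hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash derived from MD5 (one member of the hash family)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens: set, num_hashes: int) -> list:
    """Minimum hash value per hash function over all tokens."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def minhash_lsh_index(records: dict, b: int, r: int) -> list:
    """Map record id -> string into blocks; b bands of r min-hash values each.

    Records whose signatures agree on all r values of at least one band end up
    in the same block. Blocks with a single record are dropped here because
    they generate no pairs.
    """
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), b * r)
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].add(rec_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


records = {1: "mary smith", 2: "mary smith", 3: "zxqv"}
blocks = minhash_lsh_index(records, b=4, r=2)
```

Increasing $r$ makes a band match stricter (fewer, purer blocks), while increasing $b$ gives similar records more chances to share a block.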

Algorithm 1 starts by initialising an empty graph, followed by the generation of the min-hash index $\mathbf{L}$ which consists of blocks of records, $\mathbf{l}$. Each block $\mathbf{l} \in \mathbf{L}$ contains one or more records from $\mathbf{R}$ that share the same min-hash value based on the content of the attribute values in $\mathbf{A}$. In lines 3 and 4 of the algorithm we loop over these blocks $\mathbf{l} \in \mathbf{L}$ and generate all unique pairs of records in each block $\mathbf{l}$. In line 5 we compare the unique record pairs ($r_i, r_j$) from block $\mathbf{l}$ to calculate a vector of similarities $\mathbf{s}_{i,j}$. We then normalise these similarities into $0.0 \leq s_{i,j} \leq 1.0$ in line 6. If this normalised similarity is at least the minimum similarity threshold $s_{min}$ then in lines 8 and 9 we insert the two records $r_i$ and $r_j$ as nodes into the similarity graph $\mathbf{G}$, and we create an undirected edge between $r_i$ and $r_j$ where the edge attribute is the normalised similarity $s_{i,j}$.

Finally, in line 10 we return the generated graph $\mathbf{G}$, which is used in the second phase of our approach to cluster the nodes in this graph. While we do not consider any temporal constraints in the pair-wise similarity calculation algorithm, we could add a temporal plausibility calculation step after line 6 and only insert a record pair into $\mathbf{G}$ if the pair is both similar enough and also temporally possible, as we describe next.
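The main loop of Algorithm 1 (lines 3 to 10) can be sketched as follows, assuming the blocks have already been generated and that a function returning the normalised similarity of two record identifiers is available. The graph representation (a node set plus an edge dictionary), the `seen` set that avoids comparing a pair again when it occurs in several blocks, and all names are assumptions for illustration.

```python
from itertools import combinations


def build_similarity_graph(blocks, similarity, s_min):
    """Return (nodes, edges) where edges maps (id_i, id_j), id_i < id_j, to s_ij.

    blocks     -- iterable of sets of record identifiers
    similarity -- function (id_i, id_j) -> normalised similarity in [0, 1]
    s_min      -- minimum similarity for a pair to be added to the graph
    """
    nodes, edges, seen = set(), {}, set()
    for block in blocks:
        # sorted() guarantees id_i < id_j, so each pair is generated once per block
        for pair in combinations(sorted(block), 2):
            if pair in seen:  # pair already compared via another block
                continue
            seen.add(pair)
            s = similarity(*pair)
            if s >= s_min:
                nodes.update(pair)
                edges[pair] = s
    return nodes, edges


# Toy example with made-up similarities.
sims = {(1, 2): 0.9, (1, 3): 0.4, (2, 3): 0.8}
nodes, edges = build_similarity_graph([{1, 2, 3}, {1, 2}], lambda i, j: sims[(i, j)], 0.7)
```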

Figure 1. Temporal constraints as the plausibility for the same mother to be able to give birth to two children, where the horizontal axis shows the time difference (in days) and the vertical axis the plausibility $p_{\Delta t}$ that two birth records are possible for a certain time difference. Due to errors in registration dates, for multiple births we allow for a few days difference for twins and triplets, and then have a plausible interval between births from 9 months onwards up to 35 years. Two births by the same woman more than 40 years apart are deemed not to be plausible.

3.2. Modelling Temporal Constraints

Within the context of clustering birth records by the same mother, we model temporal constraints as a list $\mathbf{T}$ of time intervals where it is plausible for a mother to have given birth to two babies. As illustrated in Figure 1, we need to consider issues such as data quality as well as multiple births (like twins and triplets, which potentially are born on two consecutive days). For each day difference $\Delta t$ between two birth records (i.e. the number of days between two births) we calculate a plausibility value $p_{\Delta t}$ (with $0.0 \leq p_{\Delta t} \leq 1.0$), where $p_{\Delta t} = 1.0$ for day differences where two births by the same mother are possible, and $p_{\Delta t} = 0.0$ for day differences where it is biologically not possible for the same mother to have given birth to two babies. To account for wrongly recorded dates of birth we apply linear discounting of plausibility values, as shown in Figure 1.
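A piecewise-linear plausibility function of this kind could look like the sketch below. Only the rough shape follows the description of Figure 1 (a short window for multiple births, a plausible interval from about 9 months up to 35 years, and no plausibility beyond 40 years). The exact breakpoints in days, in particular the width of the multiple-birth window and the start of the ramp before 9 months, are assumed values and not taken from the paper.

```python
# Breakpoints in days. All values are assumptions except the approximate
# 9 month, 35 year, and 40 year marks mentioned in the Figure 1 caption.
MULTI_FULL = 2          # twins/triplets: fully plausible up to this difference
MULTI_ZERO = 7          # plausibility discounted linearly to 0.0 by this point
GAP_ZERO = 240          # still implausible for a second pregnancy
GAP_FULL = 274          # roughly 9 months: fully plausible again
LATE_FULL = 35 * 365    # fully plausible up to 35 years
LATE_ZERO = 40 * 365    # implausible from 40 years onwards


def plausibility(delta_days: int) -> float:
    """Plausibility p in [0, 1] that two births delta_days apart share a mother."""
    d = abs(delta_days)
    if d <= MULTI_FULL:
        return 1.0
    if d < MULTI_ZERO:  # linear discount after the multiple-birth window
        return (MULTI_ZERO - d) / (MULTI_ZERO - MULTI_FULL)
    if d <= GAP_ZERO:
        return 0.0
    if d < GAP_FULL:    # linear ramp up towards 9 months
        return (d - GAP_ZERO) / (GAP_FULL - GAP_ZERO)
    if d <= LATE_FULL:
        return 1.0
    if d < LATE_ZERO:   # linear discount between 35 and 40 years
        return (LATE_ZERO - d) / (LATE_ZERO - LATE_FULL)
    return 0.0
```

The linear ramps are what allows a pair with a slightly wrong registration date to keep a reduced, non-zero plausibility instead of being rejected outright.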

We can use these temporal plausibility values to modify the similarity values between records by multiplying normalised record pair similarities ($s_{i,j}$, as calculated in Algorithm 1) with plausibility values, and then not considering record pairs in the graph $\mathbf{G}$ where their new modified similarity is below a given threshold.

We can apply these temporal constraints during the pair-wise similarity calculation step described in Section 3.1 (to only include record pairs into the graph $\mathbf{G}$ that are plausible from a temporal point of view). In the clustering step described in Sections 3.3 and 3.4 below, we also need to check for every pair of records in a cluster if they are temporally plausible. A cluster can contain pairs of records that are not in $\mathbf{G}$ because their similarity $s_{i,j}$ is below the threshold $s_{min}$, and these pairs also need to be plausible with regard to the given temporal constraints. Formally, for a given cluster $\mathbf{c}$, it must hold: $\forall (r_i \in \mathbf{c}, r_j \in \mathbf{c}) : p_{\Delta t} \geq p_{min}$, where $p_{min}$ is a minimum plausibility threshold (similar to the similarity threshold $s_{min}$ used in Algorithm 1). If this condition is not fulfilled for a record $r_i \in \mathbf{c}$ with all other records in $\mathbf{c}$, then $r_i$ needs to be removed from $\mathbf{c}$.
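The cluster-level condition can be checked with a few lines of Python. This is a sketch under assumptions: a cluster is represented simply as a list of registration dates, and `step_plaus` is a simplified, hypothetical step function standing in for the plausibility model of Figure 1.

```python
from datetime import date
from itertools import combinations


def is_temporally_plausible(cluster, plaus, p_min):
    """True if every pair of time-stamps in the cluster has plausibility >= p_min."""
    return all(plaus(abs((a - b).days)) >= p_min
               for a, b in combinations(cluster, 2))


def can_join(candidate, cluster, plaus, p_min):
    """True if the candidate is plausible with every record already in the cluster."""
    return all(plaus(abs((candidate - d).days)) >= p_min for d in cluster)


def step_plaus(days: int) -> float:
    """Hypothetical step model: multiple births, or 9 months up to 35 years apart."""
    return 1.0 if days <= 2 or 274 <= days <= 35 * 365 else 0.0


births = [date(1870, 1, 1), date(1871, 6, 1), date(1873, 3, 1)]
ok = is_temporally_plausible(births, step_plaus, p_min=0.5)
# A birth registered two months after an existing one cannot join the cluster.
late = can_join(date(1871, 8, 1), births, step_plaus, p_min=0.5)
```

Checking a candidate against all current members, rather than only against the records it shares an edge with, is what covers the pairs that are absent from $\mathbf{G}$.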

While we currently set these temporal intervals of plausible births by the same mother based on discussions with domain experts, in the future we aim to learn temporal plausibility values from ground truth data. Besides temporal constraints between birth records by the same mother, in our application (where we aim to reconstruct populations by linking birth, death, marriage, and census records) there are other constraints we can consider. For example, a death of an individual can only occur on the same day or after the person’s birth. A marriage should only occur once a person has reached a minimum age. Similarly, records of the births by a mother can only occur once she has reached a certain minimum age, and before she has reached a certain maximum age.

3.3. Star Clustering

The second phase of our approach is to use a clustering algorithm to group all births by the same mother. We selected star clustering because this algorithm has been shown to be one of the best performers in a previous evaluation study of clustering algorithms for entity resolution (Saeedi et al., 2017). Our contribution to improve star clustering is two-fold: (a) we introduce temporal constraints as discussed in the previous section, and (b) we develop several methods for cluster centre selection and post-processing of overlapping clusters. Algorithm 2 outlines our modified star clustering algorithm.

Algorithm 2: Temporal star clustering
Input:
- $\mathbf{G}$: Undirected pair-wise similarity graph
- $\mathbf{T}$: List of temporal constraints (as discussed in Section 3.2)
- $p_{min}$: Minimum plausibility for record pairs to be added to a star cluster
- $s_{min}$: Minimum similarity for record pairs to be added to a star cluster
- $m_{sort}$: Method to sort nodes for processing
- $m_{reso}$: Method to resolve overlapping clusters
Output:
- $\mathbf{C}$: Final list of clusters
1: $\mathbf{C} = [\,]$   // Initialise an empty list of clusters
2: $\mathbf{U} = [\,]$   // Initialise an empty list to hold unassigned nodes
3: for $v_i \in \mathbf{G}.V$ do:   // Loop over all nodes in graph
4:   $\mathbf{n}_i = \mathbf{GetSimNeighbours}(\mathbf{G}, v_i, s_{min})$   // Similar neighbours of $v_i$
5:   $d_i = |\mathbf{n}_i|$   // Degree of $v_i$
6:   $a_i = \mathbf{CalcAvrSimNeighbours}(\mathbf{G}, v_i, \mathbf{n}_i)$   // Calculate average similarity
7:   $\mathbf{U}.add((v_i, d_i, \mathbf{n}_i, a_i))$   // Add tuple to list of unassigned nodes
8: $\mathbf{SortTuples}(\mathbf{U}, m_{sort})$   // Sort according to sorting method
9: for $(v_i, d_i, \mathbf{n}_i, a_i) \in \mathbf{U}$ do:
10:   $\mathbf{U}.removeTuple(v_i)$   // Remove assigned node from unassigned list
11:   $\mathbf{c}_i = \{v_i\}$   // Initialise a new cluster with selected node as centre
12:   while $\mathbf{n}_i \neq \emptyset$ do:
13:     $v_j = \mathbf{GetNextBestNeighbour}(\mathbf{c}_i, \mathbf{n}_i)$   // Select next best neighbour
14:     $\mathbf{n}_i.remove(v_j)$   // Remove selected next best neighbour
15:     if $\mathbf{IsTempPossSimNeighbour}(v_j, \mathbf{c}_i, \mathbf{T}, p_{min})$ do:
16:       $\mathbf{c}_i = \mathbf{c}_i \cup \{v_j\}$   // Add temporally plausible node to cluster
17:       $\mathbf{U}.removeTuple(v_j)$   // Remove node added to the cluster
18:   $\mathbf{C}.add(\mathbf{c}_i)$   // Add cluster to the final cluster list
19: $\mathbf{v}_{rep} = \mathbf{GetRepeatNodes}(\mathbf{C})$   // Get nodes that occur in multiple clusters
20: $\mathbf{C} = \mathbf{ResolveOverlap}(\mathbf{C}, \mathbf{v}_{rep}, m_{reso}, s_{min})$   // Assign nodes to best cluster
21: return $\mathbf{C}$

Our modified algorithm can consider temporal constraints (if the list of constraints $\mathbf{T}$ is provided) or ignore them (if $\mathbf{T}$ is empty) when generating clusters. The inputs to the algorithm are the pair-wise similarity graph, $\mathbf{G}$, as generated by Algorithm 1, and the list $\mathbf{T}$ of temporal constraints. We also require the minimum plausibility $p_{min}$ and minimum similarity $s_{min}$ thresholds to decide if a node is added to a cluster, and the sorting and overlap resolving methods, $m_{sort}$ and $m_{reso}$, which we discuss in detail below.

The algorithm starts by initialising an empty list of clusters, $\mathbf{C}$, and an empty list $\mathbf{U}$ which will hold information about the nodes that are not yet assigned to clusters. Initially, all nodes in the similarity graph $\mathbf{G}$ are marked as unassigned by adding them to $\mathbf{U}$ in the loop starting in line 3. For each node $v_i \in \mathbf{G}.V$, using the function $\mathbf{GetSimNeighbours}()$ in line 4 we get the set $\mathbf{n}_i$ of its neighbours in $\mathbf{G}$ that have an edge similarity of at least $s_{min}$. We count the number of these neighbours as the degree $d_i$ of node $v_i$ in line 5, and in line 6 we calculate the average similarity $a_i$ of all edges between $v_i$ and its similar neighbours in $\mathbf{n}_i$. In line 7 we append a tuple containing $v_i$, $d_i$, $\mathbf{n}_i$, and $a_i$ to the list of unassigned nodes $\mathbf{U}$.

Once tuples for all nodes in $\mathbf{G}$ have been added into $\mathbf{U}$, we sort $\mathbf{U}$ such that the best node to select as a cluster centre is at the beginning of this list. We investigate three different methods of how to order nodes based on the sorting method provided in $m_{sort}$:

  • Avr-sim-first: We order the tuples in descending order based on their average similarities $a_i$ first and then based on the degree $d_i$ (with larger $d_i$ first). With this ordering we will process nodes that have high similarities to other nodes first.

  • Degree-first: We order the tuples in descending order based on their degree $d_i$ first and then based on their average similarity $a_i$ (with larger $a_i$ first). With this ordering we will process nodes that have many edges with high similarities to other nodes first.

  • Comb: With this method we order nodes in descending order based on a combined score where we multiply their average similarity with the logarithm of their degree, i.e. $a_i \times \log(d_i)$. We take the logarithm of $d_i$ because $a_i$ is normalised into $0 \leq a_i \leq 1$ while $d_i$ is a positive integer value and therefore would dominate the combined score. With this method we aim to weigh both degree and average similarities to obtain an improved ordering.
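The three orderings can be expressed as sort keys over the tuples $(v_i, d_i, \mathbf{n}_i, a_i)$. The sketch below is illustrative only: the method name strings, the use of the natural logarithm, and the score of 0.0 for nodes without similar neighbours are assumptions, since the text does not fix the base of the logarithm or the handling of a degree of zero.

```python
import math


def sort_tuples(tuples, m_sort):
    """Sort (node, degree, neighbours, avr_sim) tuples, best cluster centre first."""
    if m_sort == "avr-sim-first":
        key = lambda t: (t[3], t[1])           # average similarity, then degree
    elif m_sort == "degree-first":
        key = lambda t: (t[1], t[3])           # degree, then average similarity
    elif m_sort == "comb":
        # a_i * log(d_i); natural log assumed, and 0.0 for degree 0 (assumption)
        key = lambda t: t[3] * math.log(t[1]) if t[1] > 0 else 0.0
    else:
        raise ValueError(f"unknown sorting method: {m_sort}")
    return sorted(tuples, key=key, reverse=True)


# Toy tuples with made-up degrees and average similarities.
U = [("a", 3, set(), 0.80), ("b", 5, set(), 0.75), ("c", 4, set(), 0.95)]
order_avr = [t[0] for t in sort_tuples(U, "avr-sim-first")]
order_deg = [t[0] for t in sort_tuples(U, "degree-first")]
order_comb = [t[0] for t in sort_tuples(U, "comb")]
```

Note that with any logarithm base a node of degree 1 receives a combined score of zero, so such nodes are always processed last under the Comb ordering.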

In lines 9 to 18 of the algorithm, we process one tuple in $\mathbf{U}$ after another. Only an unassigned node can become the centre of a new star cluster. The tuple of node $v_i \in \mathbf{U}$ selected to become a star centre is removed from the list of unassigned nodes and a new cluster $\mathbf{c}_i$ is created in line 11. Then we find the next best node to add to cluster $\mathbf{c}_i$, using the function $\mathbf{GetNextBestNeighbour}()$. This function selects the node $v_j \in \mathbf{n}_i$ which has the highest average similarity with the nodes that are currently assigned to the cluster $\mathbf{c}_i$. The selected node $v_j$ is removed from $\mathbf{n}_i$ in line 14 so it cannot be selected as the best neighbour in the next iteration. For each next best neighbour $v_j$ we check in line 15, using $\mathbf{IsTempPossSimNeighbour}()$, if $v_j$ is plausible with every other node in $\mathbf{c}_i$ with regard to the temporal constraints given in the list $\mathbf{T}$ and the minimum plausibility threshold $p_{min}$ (if $\mathbf{T}$ is empty then this function returns true). We add the plausible nodes $v_j$ to the cluster $\mathbf{c}_i$ in line 16 and remove their corresponding tuples from $\mathbf{U}$ in line 17. This means these nodes cannot become the centre of another star cluster.

The final steps of Algorithm 2, lines 19 and 20, deal with those nodes that are members of more than one cluster (note these are not star cluster centres). Overlapping clusters are not desirable for record linkage because each cluster represents one entity. In line 19 we therefore identify the set $\mathbf{v}_{rep}$ of nodes which occur in more than one cluster in the list $\mathbf{C}$, and in line 20 we use the function $\mathbf{ResolveOverlap}()$ to resolve overlapping clusters, where the method $m_{reso}$ determines how we assign a node $v_j \in \mathbf{v}_{rep}$ to its best cluster. We investigate three methods to resolve overlaps:

  • Avr-all: We average the similarities between the node $v_j$ and all the nodes in a cluster it is connected to in the similarity graph $\mathbf{G}$ by dividing this similarity sum by $n-1$, where $n$ is the number of nodes in the cluster (including $v_j$), i.e. we do take nodes in a cluster which are not connected to $v_j$ in $\mathbf{G}$ into account.

  • Avr-high: We calculate the average similarity between the node $v_j$ and all the nodes in a cluster it is connected to in the similarity graph $\mathbf{G}$, with similarities of at least $s_{min}$.

  • Edge-ratio: In this method we count the number of edges between $v_j$ and nodes in a cluster that have a similarity of at least $s_{min}$ and divide this number by $n-1$, where $n$ is the number of nodes in the cluster (including $v_j$).

For each node $v_j \in \mathbf{v}_{rep}$, we assign it to the cluster with the highest value according to the selected method to resolve overlaps. For all three methods, if for a given node $v_j$ two or more clusters have the same calculated score then we assign $v_j$ to the cluster to which $v_j$ has the highest number of similar edges. At the end of this process, the final list of clusters $\mathbf{C}$ contains no overlapping clusters.
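The three scores and the tie-breaking rule can be sketched as below. The function names and method name strings are hypothetical, missing edges are represented by a similarity of 0.0, and scoring each repeated node against the original (not yet resolved) clusters is an assumption, since the text does not state the order in which repeated nodes are processed.

```python
def cluster_score(v, cluster, sim, s_min, m_reso):
    """Score of assigning node v to a cluster under the chosen resolve method."""
    others = [u for u in cluster if u != v]        # the n - 1 other members
    if not others:
        return 0.0
    sims = [sim(v, u) for u in others]             # 0.0 where no edge exists
    high = [s for s in sims if s >= s_min]
    if m_reso == "avr-all":                        # sum over edges, divide by n - 1
        return sum(sims) / len(others)
    if m_reso == "avr-high":                       # mean over edges >= s_min only
        return sum(high) / len(high) if high else 0.0
    if m_reso == "edge-ratio":                     # share of members with edge >= s_min
        return len(high) / len(others)
    raise ValueError(f"unknown resolve method: {m_reso}")


def resolve_overlap(clusters, sim, s_min, m_reso):
    """Keep each repeated node only in its best-scoring cluster."""
    membership = {}
    for idx, c in enumerate(clusters):
        for v in c:
            membership.setdefault(v, []).append(idx)
    result = [list(c) for c in clusters]
    for v, idxs in membership.items():
        if len(idxs) < 2:
            continue

        def rank(i):
            # ties on the score are broken by the number of similar edges
            n_edges = sum(1 for u in clusters[i] if u != v and sim(v, u) >= s_min)
            return (cluster_score(v, clusters[i], sim, s_min, m_reso), n_edges)

        best = max(idxs, key=rank)
        for i in idxs:
            if i != best:
                result[i].remove(v)
    return result


# Toy input with made-up similarities: node 3 occurs in both clusters.
edges = {(1, 2): 0.9, (1, 3): 0.8, (3, 4): 0.85}
sim = lambda u, v: edges.get((min(u, v), max(u, v)), 0.0)
resolved = resolve_overlap([[1, 2, 3], [4, 3]], sim, 0.7, "avr-all")
```

In this example node 3 scores 0.4 for the first cluster under Avr-all (one edge of 0.8 averaged over two other members) and 0.85 for the second, so it is kept only in the second cluster.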

3.4. Greedy Temporal Clustering

Figure 2. Example of the greedy temporal linkage approach described in Section 3.4, showing nodes (records) and edges (similarities) from the directed similarity graph $\mathbf{G}_D$. Records $r_1$ to $r_3$ show an existing cluster, and the question now is which best future record (from $r_4$, $r_5$, and $r_6$) is to be added to the cluster next. We consider three selection methods: (a) the earliest next possible (according to temporal constraints) record in the graph $\mathbf{G}$ (in this example $r_4$), (b) the future record with the highest maximum similarity ($r_5$), or (c) the future record with the highest average similarity ($r_6$).

The second temporal clustering approach is based on the idea of iteratively adding nodes to clusters using a greedy selection method, as illustrated in Figure 2. We initially create one cluster per record, and insert these singleton clusters into a priority queue that is sorted according to time-stamps (i.e. the dates of birth registrations in our case) with the smallest time-stamp first. We then process the earliest cluster first, and aim to expand this cluster with a new record that is in the future (of the latest record in the cluster), as Figure 2 shows. In this greedy approach the question is how to select the best future node (record) to add to a cluster. We implement (and evaluate in Section 4) three different such selection methods:

  • Next: Select the temporally next record (with the smallest time-stamp) that is connected via an edge in the graph $\mathbf{G}$ to any record in the cluster. This method considers neither the similarities between nodes (besides the existence of edges in $\mathbf{G}$) nor their connectivity, and serves as a greedy baseline.

  • Max-sim: Select the record in the future that is connected via an edge in the graph $\mathbf{G}$ to any record in the cluster and that has the highest similarity $s_{i,j}$ with any record in the cluster. This method generates clusters where nodes are connected via edges of high similarities; however, these clusters might not be dense.

  • Avr-sim: Select the record in the future that is connected via an edge in the graph $\mathbf{G}$ to one or more records in the cluster and that has the highest average similarity over these edges. This method generates dense clusters with high similarity edges.
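The three selection methods can be compared on a small example that mirrors the situation in Figure 2. The sketch below is illustrative: the edge similarities and time-stamps are made up, the directed graph is represented as a dictionary of outgoing edges, and the temporal plausibility check is omitted for brevity.

```python
def select_next(cluster, out_edges, timestamps, m_sel):
    """Pick the future node to add to a cluster, or None if there is none.

    cluster    -- set of nodes already in the cluster
    out_edges  -- dict: node -> {future node: similarity} (directed edges)
    timestamps -- dict: node -> time-stamp (here: day numbers)
    """
    candidates = {}  # future node -> similarities of its edges into the cluster
    for v in cluster:
        for w, s in out_edges.get(v, {}).items():
            if w not in cluster:
                candidates.setdefault(w, []).append(s)
    if not candidates:
        return None
    names = sorted(candidates)  # fixed order so ties are resolved deterministically
    if m_sel == "next":         # smallest time-stamp
        return min(names, key=lambda w: timestamps[w])
    if m_sel == "max-sim":      # highest single edge similarity
        return max(names, key=lambda w: max(candidates[w]))
    if m_sel == "avr-sim":      # highest average over its edges into the cluster
        return max(names, key=lambda w: sum(candidates[w]) / len(candidates[w]))
    raise ValueError(f"unknown selection method: {m_sel}")


# Made-up similarities and time-stamps for the records named in Figure 2.
out_edges = {"r1": {"r4": 0.70, "r5": 0.60},
             "r2": {"r5": 0.95, "r6": 0.85},
             "r3": {"r6": 0.85}}
timestamps = {"r4": 400, "r5": 800, "r6": 1200}
cluster = {"r1", "r2", "r3"}
picked = {m: select_next(cluster, out_edges, timestamps, m)
          for m in ("next", "max-sim", "avr-sim")}
```

With these values the three methods disagree: Next picks $r_4$, Max-sim picks $r_5$ (its single edge of 0.95), and Avr-sim picks $r_6$ (average 0.85 versus 0.775 for $r_5$).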

Algorithm 3: Greedy temporal clustering
Input:
- $\mathbf{G}$: Undirected pair-wise similarity graph
- $\mathbf{T}$: List of temporal constraints (as discussed in Section 3.2)
- $p_{min}$: Minimum plausibility for record pairs to be considered
- $m_{sel}$: Method used to select the next node to add to a cluster
Output:
- $\mathbf{C}$: Final list of clusters
1: $\mathbf{G}_D = \mathbf{GenerateTempDirGraph}(\mathbf{G})$    // A temporal directed graph
2: $\mathbf{C} = [\,]$    // Initialise an empty list of clusters
3: $\mathbf{Q} = [\,]$    // Initialise an empty priority queue
4: for $v \in \mathbf{G}_D.V$ do:    // Loop over all nodes in $\mathbf{G}_D$
5:   if $(|v.in()| = 0) \wedge (|v.out()| = 0)$ then:    // A singleton
6:     $\mathbf{C}.add(\{v\})$    // Add to the final list of clusters
7:   else:
8:     $\mathbf{Q}.add((v.t, \{v\}))$    // Add node with its time-stamp to queue $\mathbf{Q}$
9: $\mathbf{Sort}(\mathbf{Q})$    // Sort queue according to time-stamps (earliest first)
10: while $\mathbf{Q} \neq [\,]$ do:    // Loop over temporal clusters until $\mathbf{Q}$ is empty
11:   $(t, \mathbf{c}_{tmp}) = \mathbf{Q}.pop()$    // Get first cluster tuple in $\mathbf{Q}$
12:   $\mathbf{o} = \bigcup_{v_i \in \mathbf{c}_{tmp}} v_i.out()$    // Set of all outgoing nodes
13:   if $\mathbf{o} = \emptyset$ then:    // No outgoing nodes found for $\mathbf{c}_{tmp}$
14:     $\mathbf{C}.add(\mathbf{c}_{tmp})$    // Add $\mathbf{c}_{tmp}$ to the final list of clusters
15:   else:
16:     if $m_{sel} =$ Next then:    // Select node with smallest time-stamp
17:       $v_n = \arg\min_{v_j \in \mathbf{o}} v_j.t$
18:     if $m_{sel} =$ Max-sim then:    // Select node with the highest similarity
19:       $v_n = \arg\max_{v_j \in \mathbf{o}} \max \{ s_{i,j} : v_i \in \mathbf{c}_{tmp}, (v_i, v_j) \in \mathbf{G}_D.E \}$
20:     if $m_{sel} =$ Avr-sim then:    // Select node with highest average similarity
21:       $v_n = \arg\max_{v_j \in \mathbf{o}} \left( \sum_{v_i \in \mathbf{c}_{tmp}, (v_i, v_j) \in \mathbf{G}_D.E} s_{i,j} \right) / |\{ v_i \in \mathbf{c}_{tmp} : (v_i, v_j) \in \mathbf{G}_D.E \}|$
22:     $p_{\Delta t} = \mathbf{CheckTempConstr}(v_n.t, \mathbf{c}_{tmp}, \mathbf{T})$    // Temporal plausibility
23:     if $p_{\Delta t} \geq p_{min}$ then:
24:       $\mathbf{Q}.add((v_n.t, \mathbf{c}_{tmp} \cup \{v_n\}))$    // Add expanded $\mathbf{c}_{tmp}$ to $\mathbf{Q}$
25:       $\mathbf{Sort}(\mathbf{Q})$    // Sort queue according to time-stamps (earliest first)
26:     else:
27:       $\mathbf{C}.add(\mathbf{c}_{tmp})$    // Add $\mathbf{c}_{tmp}$ to the final list of clusters
28: return $\mathbf{C}$

As with star clustering, we can consider temporal constraints when selecting the next record to be added into a cluster, or we can ignore any temporal constraints. Algorithm 3 outlines the steps involved in this temporal greedy clustering approach.

The main inputs to the algorithm are the pair-wise similarity graph, $\mathbf{G}$, and a list of temporal constraints, $\mathbf{T}$, as discussed in Section 3.2. We also input a minimum plausibility threshold $p_{min}$, which determines whether a record may be added to a cluster given the temporal constraints, and the selection method $m_{sel}$, which determines which node (record) is added to a cluster next.

We first (in line 1) convert the undirected similarity graph $\mathbf{G}$ into a directed graph where each node (birth record) has an outgoing edge to every neighbouring node that lies in its future, as shown in Figure 2. The function $\mathbf{GenerateTempDirGraph}()$ generates a directed graph $\mathbf{G}_D$ by considering the time differences between the pairs of nodes in $\mathbf{G}$, such that $\forall (v_i, v_j) \in \mathbf{G}_D.E : v_j.t \geq v_i.t$. In line 4, the algorithm then loops over each node $v \in \mathbf{G}_D.V$ and adds $v$ to the final list of clusters $\mathbf{C}$ if $v$ does not have any incoming or outgoing edges to other nodes (lines 5 and 6), i.e. the node is a singleton. Otherwise, a new cluster is created containing only node $v$, and this cluster is added together with its time-stamp, $v.t$, as a tuple into the priority queue $\mathbf{Q}$ for further processing (line 8).
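The conversion into a temporal directed graph and the detection of singletons can be sketched as follows. This is an illustrative sketch only: it assumes the undirected graph is given as a dictionary from record pairs to similarities, and that time-stamps are comparable values; the function names and data layout are hypothetical.

```python
def generate_temp_dir_graph(edges, timestamps):
    """Orient every undirected edge from the earlier to the later record.

    edges:      dict mapping (u, v) pairs to a similarity value
    timestamps: dict mapping each record to its time-stamp
    Returns two adjacency dicts (out_edges, in_edges).
    """
    out_edges = {v: {} for v in timestamps}
    in_edges = {v: {} for v in timestamps}
    for (u, v), sim in edges.items():
        # Swap so that u is never later than v; records with equal
        # time-stamps keep the order in which the pair was given.
        if timestamps[u] > timestamps[v]:
            u, v = v, u
        out_edges[u][v] = sim
        in_edges[v][u] = sim
    return out_edges, in_edges


def singletons(out_edges, in_edges):
    """Records without any incoming or outgoing edge."""
    return [v for v in out_edges if not out_edges[v] and not in_edges[v]]
```

Every directed edge produced this way satisfies the condition that the target's time-stamp is not smaller than the source's, which is the property required of $\mathbf{G}_D$ above.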

In line 9 we sort $\mathbf{Q}$ according to the time-stamps of each cluster such that the cluster with the smallest time-stamp is at the beginning of the queue. The main loop of the algorithm starts in line 10 where in each iteration we retrieve the cluster $\mathbf{c}_{tmp}$ with the earliest time-stamp $t$ (line 11). We then find for each node $v_i \in \mathbf{c}_{tmp}$ all its outgoing nodes in $\mathbf{G}_D$, and in line 12 we combine these into the set $\mathbf{o}$ of all outgoing nodes for $\mathbf{c}_{tmp}$. If $\mathbf{o}$ is empty for the current cluster $\mathbf{c}_{tmp}$ then $\mathbf{c}_{tmp}$ is added to the final list of clusters $\mathbf{C}$ in line 14 because it cannot be expanded further.

Table 1. Attributes in birth certificates used for three variations of calculating pair-wise similarities to generate the graph $\mathbf{G}$.

Attribute                  Similarity function   Weight   All attributes   Parent names only   Parent names and addresses
Father first name          Jaro-Winkler          6.578
Father last name           Jaro-Winkler          7.168
Mother first name          Jaro-Winkler          4.483
Mother last name           Jaro-Winkler          7.168
Mother maiden last name    Jaro-Winkler          5.985
Parents marriage day       Exact                 4.610
Parents marriage month     Exact                 3.855
Parents marriage year      Year difference       5.240
Parents marriage place 1   Jaro-Winkler          4.435
Parents marriage place 2   Jaro-Winkler          3.607
Occupation father          Jaro-Winkler          2.247
Occupation mother          Jaro-Winkler          1.274
Address 1                  Jaro-Winkler          4.715
Address 2                  Jaro-Winkler          3.548
Source parish              Jaro-Winkler          4.562

Table 2. The ten most frequent values and their corresponding frequency counts for first and last names of fathers and mothers in the Isle of Skye birth data set.

Father first name    Mother first name    Father last name    Mother last name
John (3,444)         Mary (2,740)         Mcleod (1,571)      Mcdonald (1,793)
Donald (2,628)       Catherine (2,607)    Mcdonald (1,556)    Mcleod (1,761)
Alexander (1,665)    Ann (2,084)          Mckinnon (1,168)    Mckinnon (1,164)
Malcolm (800)        Margaret (2,031)     Nicolson (1,047)    Nicolson (908)
Neil (787)           Christina (1,626)    Mclean (908)        Mclean (850)
Angus (782)          Marion (1,532)       Campbell (685)      Campbell (823)
William (611)        Flora (1,150)        Mcinnes (682)       Mcinnes (704)
Murdo (565)          Janet (871)          Mckenzie (637)      Matheson (541)
Norman (513)         Effie (654)          Mcpherson (525)     Mckenzie (509)
Ewen (502)           Isabella (478)       Robertson (452)     Mcpherson (496)

On the other hand, if there are outgoing nodes (i.e. $\mathbf{o}$ is not empty), then the algorithm selects the next best node, $v_n$, to be added into the current cluster $\mathbf{c}_{tmp}$ according to the selection method $m_{sel}$ explained above (lines 16 to 21). Using the function $\mathbf{CheckTempConstr}()$ in line 22 we then check the temporal plausibility $p_{\Delta t}$ between node $v_n$ and all nodes in $\mathbf{c}_{tmp}$ based on the list of temporal constraints $\mathbf{T}$ (if this list is empty, i.e. no temporal constraints are given, then we set $p_{\Delta t} = 1$). If the calculated $p_{\Delta t}$ is at least $p_{min}$ (i.e. $v_n$ is temporally plausible with all other nodes in $\mathbf{c}_{tmp}$), then $v_n$ is added to the current cluster $\mathbf{c}_{tmp}$ and the expanded cluster is added as a new tuple into $\mathbf{Q}$ with $v_n.t$ as the tuple's time-stamp (line 24). $\mathbf{Q}$ is sorted again in line 25 to ensure that the cluster with the smallest time-stamp (that of its temporally last record) is selected in the next iteration. If $v_n$ is not temporally plausible with at least one node in $\mathbf{c}_{tmp}$, then $\mathbf{c}_{tmp}$ is added to the final list of clusters $\mathbf{C}$ in line 27 because it cannot be expanded further.
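The main loop of Algorithm 3 can be sketched in Python as follows. This is a hedged transcription of the listing, not the implementation evaluated in this paper. It makes several assumptions: a binary heap replaces the repeated sorting of $\mathbf{Q}$, a counter breaks ties between equal time-stamps, records already in a cluster are excluded from its candidate set so that a cluster always grows when it is expanded, and the selection method and the plausibility check are passed in as callables. All names are hypothetical, and the temporal constraints of Section 3.2 are not reproduced here.

```python
import heapq
from itertools import count


def no_constraints(record, cluster):
    """Plausibility used when the constraint list is empty (p = 1)."""
    return 1.0


def greedy_temporal_clustering(out_edges, timestamps, select,
                               plausibility=no_constraints, p_min=1.0):
    """out_edges:    dict record -> {future record: similarity}
    select:       callable(cluster, candidates) -> chosen record, where
                  candidates maps each future record to a list of similarities
    plausibility: callable(record, cluster) -> value in [0, 1]"""
    has_incoming = {v for nbrs in out_edges.values() for v in nbrs}
    final, queue, tie = [], [], count()
    for v in timestamps:
        if not out_edges.get(v) and v not in has_incoming:
            final.append({v})                               # lines 5-6
        else:
            heapq.heappush(queue, (timestamps[v], next(tie), {v}))  # line 8
    while queue:                                            # line 10
        _, _, cluster = heapq.heappop(queue)                # line 11
        candidates = {}                                     # line 12
        for member in cluster:
            for fut, sim in out_edges.get(member, {}).items():
                if fut not in cluster:                      # assumption
                    candidates.setdefault(fut, []).append(sim)
        if not candidates:
            final.append(cluster)                           # line 14
            continue
        v_n = select(cluster, candidates)                   # lines 16-21
        if plausibility(v_n, cluster) >= p_min:             # lines 22-23
            heapq.heappush(queue, (timestamps[v_n], next(tie),
                                   cluster | {v_n}))        # line 24
        else:
            final.append(cluster)                           # line 27
    return final
```

Followed literally, the listing keeps a record's own single-record cluster in the queue after the record has been added to another cluster, so a record can appear in more than one output cluster; the sketch reproduces this and leaves any de-duplication step out. It also shows the greedy behaviour directly: if the selected node fails the plausibility check, the cluster is finalised without trying the remaining candidates.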

4. Experimental Evaluation

We evaluate our proposed temporal clustering approaches using a real Scottish birth data set that covers the population of the Isle of Skye over the period from 1861 to 1901. This data set contains 17,614 birth certificates, where each of these contains personal information about the baby and its parents, as shown in Table 1.

This data set has been extensively curated and linked semi-manually by demographers who are experts in the domain of linking such historical data (Newton, 2011; Reid et al., 2002). Their approach followed long established rules for family reconstruction (Wrigley and Schofield, 1973), leading to a set of linked birth certificates. We thus have a set of manually generated links that allows us to compare the quality and coverage of our automatically identified links to those identified by the domain experts.

Figure 3. Frequency distribution of (a) first names and (b) last names of parents, and (c) addresses in the Isle of Skye birth data set. Note that the y-axes are on a log scale. The frequency distributions are highly skewed, with a few names occurring many times.

As with other historical data sets (Antonie et al., 2014; Fu et al., 2014b), this birth data set has a very small number of unique name values (2,055 first names and only 547 last names). As Figure 3 shows, the frequency distributions of names are also very skewed. The ten most common first and last name values occur in between 30% and 40% of all records, as Table 2 illustrates. Many records have missing values in address or occupation attributes, and for unmarried women the details of a baby’s father are mostly missing.

As commonly performed in record linkage research (Christen, 2012; Naumann and Herschel, 2010), we evaluate our clustering approaches with regard to precision (how many of the identified links between birth records are true links according to the demographers) and recall (how many true links have our clustering approaches correctly identified and inserted into the same clusters). We do not present F-measure results given recent work has identified some problematic aspects when using the F-measure to compare record linkage approaches (Hand and Christen, 2018).
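Link-based precision and recall for a clustering can be computed as in the following sketch (written for Python 3). It assumes that both the predicted clustering and the ground truth are given as collections of record groups, and that every unordered pair of records within a group counts as one link; the function names are hypothetical.

```python
from itertools import combinations


def links_from_clusters(clusters):
    """Return the set of unordered record pairs implied by the clusters."""
    links = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            links.add((a, b))
    return links


def precision_recall(predicted_clusters, true_clusters):
    """Precision: share of predicted links that are true links.
    Recall: share of true links that were predicted."""
    predicted = links_from_clusters(predicted_clusters)
    true = links_from_clusters(true_clusters)
    true_positives = len(predicted & true)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall
```

For example, if one predicted cluster contains three records of which only two are true siblings, it contributes three predicted links but only one true positive.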

We implemented all techniques using Python 2.7.6 and used the string matching functionalities provided in Febrl (Christen, 2008) to conduct the pair-wise record comparisons. We set the LSH min-hash parameters to $b = 100$ (number of bands) and $r = 4$ (band size) in order to obtain a recall of 99.7% of the true matches in the ground truth data set for the similarity graph $\mathbf{G}$. We used three different subsets of attributes, $\mathbf{A}$, as described in Algorithm 1 and illustrated in Table 1. For details of the similarity functions used see (Christen, 2012). We calculated attribute similarities with either the weights shown in Table 1, or with all attribute weights set to 1.0. We thus ended up with six similarity graphs, each generated with $s_{min} = 0.7$: weighted and unweighted similarities, each combined with All attributes, Parent names and addresses, and Parent names only. This allows us to investigate how different ways to calculate pair-wise similarities influence the quality of the final clustering.
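To give a feeling for the chosen LSH parameters, the following sketch evaluates the standard banding formula for min-hash LSH (see, for example, Leskovec et al., 2014): a pair with Jaccard similarity $s$ agrees on all $r$ rows of one band with probability $s^r$, and therefore becomes a candidate in at least one of $b$ bands with probability $1 - (1 - s^r)^b$. That the blocking step used here follows exactly this scheme is an assumption, and the function name is hypothetical.

```python
def candidate_probability(s, bands=100, rows=4):
    """Probability that a record pair with Jaccard similarity s shares
    at least one identical band, for the given number of bands and rows."""
    return 1.0 - (1.0 - s ** rows) ** bands
```

Under this model, the setting $b = 100$ and $r = 4$ is very permissive: pairs of moderate Jaccard similarity are already almost certain to become candidates, which is in line with the aim of retaining nearly all true matches in $\mathbf{G}$.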

For the clustering approaches described in Sections 3.3 and 3.4, we evaluate the three sorting and resolving methods for star clustering, and the three selection methods for greedy temporal clustering. We show the final clustering results as precision-recall curves in Figures 4 to 7, where we varied the minimum similarity threshold used to include pair-wise similarities (i.e. edges) in the graph $\mathbf{G}$ from 1.0 to 0.7 in steps of 0.05.
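The threshold sweep behind each precision-recall curve can be sketched as follows. The sketch assumes the similarity graph is a dictionary from record pairs to similarities; the function name and data layout are hypothetical. Each filtered edge set would then be clustered and evaluated to give one point on a curve.

```python
def threshold_sweep(edges, start=1.0, stop=0.7, step=0.05):
    """Yield (s_min, filtered_edges) for thresholds from start down to stop.

    Thresholds are derived from an integer step count and rounded to two
    decimals to avoid accumulating floating-point error."""
    n_steps = int(round((start - stop) / step))
    for k in range(n_steps + 1):
        s_min = round(start - k * step, 2)
        kept = {pair: sim for pair, sim in edges.items() if sim >= s_min}
        yield s_min, kept
```

With the default arguments the sweep visits seven thresholds, and the number of retained edges can only grow as the threshold is lowered.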

These rather unusual looking precision-recall curves need some explanation. When the minimum similarity threshold $s_{min}$ used to generate the pair-wise graph $\mathbf{G}$ is lowered, more false matches are included as edges in $\mathbf{G}$, thus reducing precision as expected. However, recall increases as $s_{min}$ is decreased only up to a certain point, after which recall decreases as $s_{min}$ is lowered further. We believe that this behaviour is caused by the greedy nature of the algorithms and the skewness of the attribute value distribution. When $s_{min}$ is too high (such as 1.0), many true matches which are not exact matches (due to mistakes in data transposition, etc.) get dropped, leading to lower recall. When $s_{min}$ is slightly more lenient (such as 0.95 or 0.9), recall improves since more of the true matches with slight spelling mistakes are included into clusters and are therefore matched. However, when $s_{min}$ is lowered further, the number of high similarity non-matches increases (due to the skewness of the distribution) and these non-matches will be clustered incorrectly. This is caused by the greedy nature of both clustering algorithms, where after an incorrect node is selected as the next best node the actual true matches are never offered a chance to be clustered together. This behaviour is most accentuated when only parent names are used to calculate the similarities between certificates, because the distribution of parent names is the most skewed.

As Figures 4 to 7 show, when temporal constraints are included in the clustering phase, precision generally increases considerably while recall decreases only slightly. The overall best performing approach (with and without temporal constraints) used unweighted similarities of only parent names, with Avr-sim-first as the sorting method and Avr-all as the overlap resolving method. Furthermore, the similarity threshold value achieving the best results was 0.95. The overall highest precision and recall results without temporal constraints were 0.877 and 0.897, while when applying temporal constraints they were 0.925 and 0.888, respectively.

The result plots also show that overall star clustering achieves better results with regard to recall than the temporal greedy technique; however, the similarity based selection methods for temporal greedy clustering achieve overall higher minimum precision results.

Figure 4. Precision-recall results for the temporal star clustering approach described in Section 3.3 using the Avr-sim-first sorting method, the three discussed overlap resolving methods and without (top row) and with (bottom row) temporal constraints. Each plot shows results for the six similarity graphs described in Section 4 (with / without weighted similarities and different attributes compared).
Figure 5. Precision-recall results for the temporal star clustering approach described in Section 3.3 using the Degree-first sorting method, the three discussed overlap resolving methods and without (top row) and with (bottom row) temporal constraints. Each plot shows results for the six similarity graphs described in Section 4 (with / without weighted similarities and different attributes compared).
Figure 6. Precision-recall results for the temporal star clustering approach described in Section 3.3 using the Comb sorting method, the three discussed overlap resolving methods and without (top row) and with (bottom row) temporal constraints. Each plot shows results for the six similarity graphs described in Section 4 (with / without weighted similarities and different attributes compared).
Figure 7. Precision-recall results for the greedy temporal clustering approach described in Section 3.4 using the three discussed selection methods, and without (top row) and with (bottom row) temporal constraints.

5. Conclusions and Future Work

In this work-in-progress paper we have developed and evaluated two clustering approaches for linking birth certificates in the context of historical record linkage. Both algorithms are based on a graph that represents the similarities calculated between individual birth certificates. We have evaluated six ways of generating this graph, based on comparing different attribute combinations in a weighted or unweighted fashion, and how the characteristics of this graph affect the final clustering outcomes. Our experimental evaluation on a real Scottish data set has shown that incorporating temporal constraints (when a woman can give birth or not) can improve the quality of the final linked data set.

As future work we aim to improve our proposed greedy temporal clustering algorithm as well as temporal star clustering to obtain better linkage results. We aim to investigate why certain birth certificates are not linked (missed true matches, lowering recall) while others are falsely linked (wrong matches, lowering precision). We then aim to expand our graph-based clustering techniques to also incorporate links across birth, marriage, death, and census certificates by generating a single large similarity graph where nodes represent certificates and edges, the similarities between them, and where edges can be of different types (Christen, 2016). Such a graph will not only allow temporal constraints to be considered but also gender and role-type specific constraints (Christen, 2016; Christen et al., 2017). We plan to model temporal aspects of how the records about a certain individual will occur in historical population databases. Our ultimate aim is to develop unsupervised techniques for the accurate and efficient linkage of large and complex historical population databases in order to provide researchers in areas such as health and the social sciences with high quality longitudinal data sets.

Acknowledgements

This work was supported by ESRC grants ES/K00574X/2 Digitising Scotland and ES/L007487/1 Administrative Data Research Centre – Scotland. We would like to thank Alice Reid of the University of Cambridge and her colleagues Ros Davies and Eilidh Garrett for their work on the Isle of Skye database, and their helpful advice on historical Scottish demography. This work was partially funded by the Australian Research Council under DP130101801.

References

  • Antonie et al. (2014) Luiza Antonie, Kris Inwood, Daniel J. Lizotte, and J. Andrew Ross. 2014. Tracking people over time in 19th century Canada for longitudinal analysis. Machine Learning 95 (2014), 129–146.
  • Bailey et al. (2017) Martha Bailey, Connor Cole, Morgan Henderson, and Catherine Massey. 2017. How Well Do Automated Methods Perform in Historical Samples? Evidence from New Ground Truth. Technical Report. National Bureau of Economic Research.
  • Bloothooft et al. (2015) Gerrit Bloothooft, Peter Christen, Kees Mandemakers, and Marijn Schraagen. 2015. Population Reconstruction. Springer.
  • Christen (2006) Peter Christen. 2006. A Comparison of Personal Name Matching: Techniques and Practical Issues. In ICDM Workshop on Mining Complex Data. Hong Kong.
  • Christen (2008) Peter Christen. 2008. Febrl: An open source data cleaning, deduplication and record linkage system with a graphical user interface. In ACM SIGKDD.
  • Christen (2012) Peter Christen. 2012. Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
  • Christen (2016) Peter Christen. 2016. Application of Advanced Record Linkage Techniques for Complex Population Reconstruction. arXiv preprint arXiv:1612.04286 (2016).
  • Christen et al. (2017) Victor Christen, Anika Groß, Jeffrey Fisher, Qing Wang, Peter Christen, and Erhard Rahm. 2017. Temporal group linkage and evolution analysis for census data. In EDBT. Venice, Italy, 620–631.
  • Dibben et al. (2012) Chris Dibben, Lee Williamson, and Zengyi Huang. 2012. Digitising Scotland. http://gtr.rcuk.ac.uk/projects?ref=ES/K00574X/2
  • Dillon (1996) Lisa Y. Dillon. 1996. Integrating nineteenth-century Canadian and American census data sets. Computers and the Humanities 30, 5 (1996), 381–392.
  • Dong and Srivastava (2015) Xin Luna Dong and Divesh Srivastava. 2015. Big data integration. Synthesis Lectures on Data Management 7, 1 (2015), 1–198.
  • Dong and Tan (2015) Xin Luna Dong and Wang-Chiew Tan. 2015. A time machine for information: Looking back to look forward. Proceedings of the VLDB Endowment 8, 12 (2015).
  • Fu et al. (2014a) Zhichun Fu, Mac Boot, Peter Christen, and Jun Zhou. 2014a. Automatic Record Linkage of Individuals and Households in Historical Census Data. International Journal of Humanities and Arts Computing (2014).
  • Fu et al. (2014b) Zhichun Fu, Peter Christen, and Jun Zhou. 2014b. A Graph Matching Method for Historical Census Household Linkage. In PAKDD. Tainan, Taiwan.
  • Fu et al. (2012) Zhichun Fu, Jun Zhou, Peter Christen, and Mac Boot. 2012. Multiple Instance Learning for Group Record Linkage. In PAKDD. Kuala Lumpur.
  • Grundy and Tomassini (2005) Emily Grundy and Cecilia Tomassini. 2005. Fertility history and health in later life: a record linkage study in England and Wales. Social Science and Medicine 61, 1 (2005), 217–228.
  • Hand and Christen (2018) David Hand and Peter Christen. 2018. A note on using the F-measure for evaluating record linkage algorithms. Statistics and Computing 28, 3 (2018), 539–547.
  • Harron et al. (2015) Katie Harron, Harvey Goldstein, and Chris Dibben. 2015. Methodological Developments in Data Linkage. John Wiley & Sons.
  • Hassanzadeh et al. (2009) Oktie Hassanzadeh, Fei Chiang, Hyun Chul Lee, and Renée J Miller. 2009. Framework for evaluating clustering algorithms in duplicate detection. Proceedings of the VLDB Endowment 2, 1 (2009), 1282–1293.
  • Kum et al. (2014) Hye-Chung Kum, Ashok Krishnamurthy, Ashwin Machanavajjhala, and Stanley C. Ahalt. 2014. Social Genome: Putting Big Data to Work for Population Informatics. IEEE Computer 47, 1 (2014).
  • Leskovec et al. (2014) Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.
  • Naumann and Herschel (2010) Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection. Morgan and Claypool Publishers.
  • Newton (2011) Gill Newton. 2011. Recent developments in making family reconstitutions. Local Population Studies 87, 1 (2011), 84–89.
  • On et al. (2007) Byung-Won On, Nick Koudas, Dongwon Lee, and Divesh Srivastava. 2007. Group Linkage. In IEEE ICDE. Istanbul.
  • Reid et al. (2002) Alice Reid, Ros Davies, and Eilidh Garrett. 2002. Nineteenth-Century Scottish Demography from Linked Censuses and Civil Registers. History and Computing 14, 1-2 (2002).
  • Saeedi et al. (2017) Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2017. Comparative evaluation of distributed clustering schemes for multi-source entity resolution. In Advances in Databases and Information Systems. Springer, 278–293.
  • Wrigley and Schofield (1973) Edward Wrigley and Roger Schofield. 1973. Nominal record linkage by computer and the logic of family reconstitution. Identifying people in the past (1973), 64–101.