Disambiguate names

## Problem

(All examples from EPSA 2029 names.)

Variations in family names:

```
Ahmed Mohamed
Ahmed Mohammed
```

Shortened first names:

```
Andrew Guess
Andy Guess
```

Ambiguities:

```
Aina Gallego
Anna Gallego
Alina Vrânceanu
Alina Vranceau
```

Middle initials:

```
Andreas C. Goldberg
Andreas Goldberg
```

Accents and special characters:

```
Aysenur Dal
Ayşenur Dal
```

More complex case:

```
"Ferran Martinez i Coma"
"Ferran M i Coma"
```

… all cause some authors to appear under several name variants (and thus to get __multiple participant UIDs__).

## Partial solution

Use `stringdist` to detect simple cases:

https://cran.r-project.org/web/packages/stringdist/vignettes/RJournal_6_111-122-2014.pdf

```r
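library(dplyr)
library(purrr)
library(stringdist)

# `d` is the assembled participants data, with one `full_name` per row
n <- unique(d$full_name)

# Levenshtein distance from each name to every other name
# (each pair shows up twice, once in each direction)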
lv <- tibble::tibble(
  x = n,
  y = map(x, ~ n[ !n %in% .x ]),
  lv = map(x, ~ stringdist(.x, n[ !n %in% .x ], method = "lv"))
) %>%
  tidyr::unnest(c(y, lv))

# finds many true positives
filter(lv, lv < 3) %>%
  arrange(x, lv)

# finds "Alex Smith" and "Alex C. Smith"
filter(lv, lv == 3) %>%
  arrange(x, lv)

# finds mostly false positives
filter(lv, lv == 4) %>%
  arrange(x, lv)
```
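Several of the cases listed under "Problem" (accents, middle initials) might be caught earlier by normalizing names before measuring distances. The following is only a minimal sketch: `normalize_name()` is a hypothetical helper, not existing code, and it assumes the `stringi` package is available for the `Latin-ASCII` transliteration.

```r
library(stringi)
library(stringdist)

# Hypothetical helper (an assumption, not part of the code above):
# strip accents and dotted initials before comparing names.
normalize_name <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")          # "Ayşenur" -> "Aysenur"
  x <- gsub("\\b[A-Z]\\.\\s*", "", x, perl = TRUE)  # drop initials such as "C. "
  trimws(gsub("\\s+", " ", x))                       # collapse repeated spaces
}

normalize_name(c("Ayşenur Dal", "Andreas C. Goldberg"))

# distance between the normalized forms of two variants of the same name
stringdist(
  normalize_name("Andreas C. Goldberg"),
  normalize_name("Andreas Goldberg"),
  method = "lv"
)
```

This would not resolve cases like "Ferran M i Coma" (an undotted, abbreviated family name) or "Aina" vs. "Anna", which probably still need manual review.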

## Question to self

Fix names here, once all names are assembled, or in the source 2019, 2020, 2021 repos?

Cleanup in this repo makes more sense because it allows all names to be treated at once, which avoids processing recurring participants several times.

If cleanup happens here, UIDs need to be regenerated here, after applying the fixes (not a huge hassle).
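If the cleanup does happen here, it could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `fixes` lookup table, the choice of which variant counts as canonical, the toy data frame `d`, and the placeholder UID scheme (the actual UID generation is not shown in this issue).

```r
# Hypothetical manual corrections: names are variants, values are canonical forms
fixes <- c(
  "Andy Guess"  = "Andrew Guess",
  "Aysenur Dal" = "Ayşenur Dal"
)

# Toy stand-in for the assembled data
d <- data.frame(
  full_name = c("Andrew Guess", "Andy Guess", "Aysenur Dal", "Aina Gallego"),
  stringsAsFactors = FALSE
)

# Replace known variants, leave all other names unchanged
hit <- d$full_name %in% names(fixes)
d$full_name[hit] <- unname(fixes[d$full_name[hit]])

# Placeholder UID: position of the name among the unique cleaned names
d$uid <- match(d$full_name, unique(d$full_name))

d
```

Keeping the corrections in a single lookup table means each recurring participant is fixed once, and the UIDs are then derived from the cleaned names only.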

