Disambiguate names #1

@briatte

Problem

(All examples from EPSA 2019 names.)

Variations in family names:

Ahmed Mohamed
Ahmed Mohammed

Shortened first names:

Andrew Guess
Andy Guess

Ambiguities:

Aina Gallego
Anna Gallego
Alina Vrânceanu
Alina Vranceau

Middle initials:

Andreas C. Goldberg
Andreas Goldberg

Accents and special characters:

Aysenur Dal
Ayşenur Dal

More complex case:

"Ferran Martinez i Coma"
"Ferran M i Coma"

… all cause some authors to have multiple full names (and thus multiple participant UIDs).
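Two of the cases above (accents, middle initials) could be collapsed before any fuzzy matching. A minimal sketch, assuming the stringi package is available; `normalize_name` is a hypothetical helper, not something that exists in this repo. It would not help with "Mohamed"/"Mohammed", "Andy"/"Andrew" or "Aina"/"Anna", and it would also drop a leading initial such as "J. Smith".

```r
library(stringi)

# hypothetical helper: builds a comparison key, not a display name
normalize_name <- function(x) {
  # transliterate accents and special characters ("Ayşenur" -> "Aysenur")
  x <- stri_trans_general(x, "Latin-ASCII")
  # drop single-letter initials followed by a dot ("C. ")
  x <- gsub("\\b[A-Z]\\.\\s*", "", x, perl = TRUE)
  # squeeze whitespace and lowercase
  x <- gsub("\\s+", " ", trimws(x))
  tolower(x)
}

normalize_name(c("Ayşenur Dal", "Aysenur Dal"))
normalize_name(c("Andreas C. Goldberg", "Andreas Goldberg"))
```

Names sharing the same key could then be treated as one participant, leaving only the harder cases to the distance-based check.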

Partial solution

Use stringdist to detect simple cases:

https://cran.r-project.org/web/packages/stringdist/vignettes/RJournal_6_111-122-2014.pdf

library(dplyr)
library(purrr)
library(stringdist)

# unique participant names
n <- unique(d$full_name)

# Levenshtein distances
lv <- tibble::tibble(
  x = n,
  y = map(x, ~ n[ !n %in% .x ]),
  lv = map(x, ~ stringdist(.x, n[ !n %in% .x ], method = "lv"))
) %>%
  tidyr::unnest(c(y, lv))

# finds many true positives
filter(lv, lv < 3) %>%
  arrange(x, lv)

# finds "Alex Smith" and "Alex C. Smith"
filter(lv, lv == 3) %>%
  arrange(x, lv)

# finds mostly false positives
filter(lv, lv == 4) %>%
  arrange(x, lv)
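To see where the thresholds above fall, here are the Levenshtein distances for the example pairs listed under "Problem". A small sketch; the `pairs` data frame is only there for illustration.

```r
library(stringdist)

# example pairs from the "Problem" section
pairs <- data.frame(
  x = c("Ahmed Mohamed", "Andrew Guess", "Aina Gallego", "Alina Vrânceanu",
        "Andreas C. Goldberg", "Aysenur Dal", "Ferran Martinez i Coma"),
  y = c("Ahmed Mohammed", "Andy Guess", "Anna Gallego", "Alina Vranceau",
        "Andreas Goldberg", "Ayşenur Dal", "Ferran M i Coma"),
  stringsAsFactors = FALSE
)

# element-wise Levenshtein distance between x and y
pairs$lv <- stringdist(pairs$x, pairs$y, method = "lv")
pairs$lv
# 1 3 1 2 3 1 7
```

So shortened first names and middle initials sit right at lv == 3, while the "Ferran M i Coma" case (lv == 7) is far beyond any threshold that keeps false positives manageable.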

Question to self

Fix names here, once all names are assembled, or in the source 2019, 2020, 2021 repos?

Cleanup in this repo makes more sense because it allows treating all names at once, which avoids fixing recurring participants several times.

If cleanup happens here, UIDs need to be regenerated here, after applying the fixes (not a huge hassle).
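A rough sketch of what that could look like, assuming a manually curated corrections table. Everything here is hypothetical: the `fixes` table, the toy `d`, and the UID scheme (a plain index into the sorted unique names), which is only a placeholder for however UIDs are actually generated.

```r
library(dplyr)

# toy data: one row per participant-year (hypothetical)
d <- tibble::tibble(
  year = c(2019, 2020, 2021),
  full_name = c("Andrew Guess", "Andy Guess", "Aina Gallego")
)

# hypothetical corrections table: variant -> canonical name
fixes <- tibble::tibble(
  full_name = "Andy Guess",
  fixed_name = "Andrew Guess"
)

d <- d %>%
  left_join(fixes, by = "full_name") %>%
  mutate(
    # use the canonical name where a fix exists
    full_name = coalesce(fixed_name, full_name),
    # placeholder UID: position in the sorted unique names
    uid = match(full_name, sort(unique(full_name)))
  ) %>%
  select(-fixed_name)
```

After this step, "Andrew Guess" and "Andy Guess" share a single UID, and the fix only has to be recorded once regardless of how many years the participant appears in.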
