Data cleaning can be a challenge, so I hope this helps the process for someone out there. This is a tiny, but valuable function for those who deal with data collected from non-ideal forms. As nearly always, this depends on the tidyverse
library. You may want to rename the function from fml
, but it does best describe dealing with mangled data.
This function retuns the first, middle, and last names for a given name or list of names. Missing data is represented as NA
.
Usage on Existing Dataframe
Setting up a dataframe with manged names and missing first, middle, and last names.
1 2 3 4 5 6 7 8 9 10 11 | df <- data.frame(names = c("John Jacbon Jingle", "Heimer Schmitt", "Cher", "John Jacbon Jingle Heimer Schmitt", "Mr. Anderson", "Sir Patrick Stewart", "Sammy Davis Jr.")) %>% add_column(First = NA) %>% add_column(Middle = NA) %>% add_column(Last = NA) |
Row | names | First | Middle | Last |
---|---|---|---|---|
1 | John Jacob Jingle | NA | NA | NA |
2 | Heimer Schmitt | NA | NA | NA |
3 | Cher | NA | NA | NA |
4 | John Jacob Jingle Heimer Schmitt | NA | NA | NA |
5 | Mr. Anderson | NA | NA | NA |
6 | Sir Patrick Stewart | NA | NA | NA |
7 | Sammy Davis Jr. | NA | NA | NA |
Replacing the first, middle, and last name values…
1 2 | df[,c("First","Middle","Last")] <- df$names %>% fml |
Row | names | First | Middle | Last |
---|---|---|---|---|
1 | John Jacob Jingle | John | Jacbon | Jingle |
2 | Heimer Schmitt | Heimer | NA | Schmitt |
3 | Cher | Cher | NA | NA |
4 | John Jacob Jingle Heimer Schmitt | John | Jacbon-Jingle-Heimer | Schmitt |
5 | Mr. Anderson | NA | NA | Anderson |
6 | Sir Patrick Stewart | Patrick | NA | Stewart |
7 | Sammy Davis Jr. | Sammy | NA | Davis |
Values Changed
- In row
1
All names were found - In row
2
the middle name was skipped - In row
3
only a first name was found - In row
4
the middle names were collapsed - In row
5
only a last name was found - In row
6
the titleSir
was omitted - In row
7
the titleJr.
was omitted
Using with a single name.
1 2 | fml("Matt Sandy") |
V1 | V2 | V3 | |
---|---|---|---|
Matt Sandy | Matt | NA | Sandy |
The Function
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | fml <- function(mangled_names) { titles <- c("MASTER", "MR", "MISS", "MRS", "MS", "MX", "JR", "SR", "M", "SIR", "GENTLEMAN", "SIRE", "MISTRESS", "MADAM", "DAME", "LORD", "LADY", "ESQ", "EXCELLENCY","EXCELLENCE", "HER", "HIS", "HONOUR", "THE", "HONOURABLE", "HONORABLE", "HON", "JUDGE") mangled_names %>% sapply(function(name) { split <- str_split(name, " ") %>% unlist original_length <- length(split) split <- split[which(!split %>% toupper %>% str_replace_all('[^A-Z]','') %in% titles)] case_when( (length(split) < original_length) & (length(split) == 1) ~ c(NA, NA, split[1]), length(split) == 1 ~ c(split[1],NA,NA), length(split) == 2 ~ c(split[1],NA, split[2]), length(split) == 3 ~ c(split[1], split[2], split[3]), length(split) > 3 ~ c(split[1], paste(split[2:(length(split)-1)], collapse = "-"), split[length(split)]) ) }) %>% t %>% return } |
Improvements
I recommend improving upon this if you want to integrate this function (or attributes of this function) into your workflow. Naming the output or using lists so you can just get partial returns fml("John Smith")$Last
could come in handy.
Additional cases could also be created, such as when names are entered Last, First M.
. Tailoring the function to your project will yield best results.
Very handy Matt, thanks for sharing it!
You may want to look at the humaniformat package.
Thank you for sharing this! Would there be a way to keep in the salutations and suffixes instead of getting rid of them? I am using this function to try and parse a list of names, but I would like to be able to have additional columns for the prefixes and suffixes. Thanks!