Split Intermixed Names into First, Middle, and Last

Data cleaning can be a challenge, so I hope this helps the process for someone out there. This is a tiny, but valuable function for those who deal with data collected from non-ideal forms. As nearly always, this depends on the tidyverse library. You may want to rename the function from fml, but it does best describe dealing with mangled data.

This function retuns the first, middle, and last names for a given name or list of names. Missing data is represented as NA.

Usage on Existing Dataframe

Setting up a dataframe with manged names and missing first, middle, and last names.

df <- data.frame(names = c("John Jacbon Jingle",

"Heimer Schmitt",

"Cher",

"John Jacbon Jingle Heimer Schmitt",

"Mr. Anderson",

"Sir Patrick Stewart",

"Sammy Davis Jr.")) %>%

add_column(First = NA) %>%

add_column(Middle = NA) %>%

add_column(Last = NA)

Row	names	First	Middle	Last
1	John Jacob Jingle	NA	NA	NA
2	Heimer Schmitt	NA	NA	NA
3	Cher	NA	NA	NA
4	John Jacob Jingle Heimer Schmitt	NA	NA	NA
5	Mr. Anderson	NA	NA	NA
6	Sir Patrick Stewart	NA	NA	NA
7	Sammy Davis Jr.	NA	NA	NA

Replacing the first, middle, and last name values…

1 2	df[,c("First","Middle","Last")] <- df$names %>% fml

Row	names	First	Middle	Last
1	John Jacob Jingle	John	Jacbon	Jingle
2	Heimer Schmitt	Heimer	NA	Schmitt
3	Cher	Cher	NA	NA
4	John Jacob Jingle Heimer Schmitt	John	Jacbon-Jingle-Heimer	Schmitt
5	Mr. Anderson	NA	NA	Anderson
6	Sir Patrick Stewart	Patrick	NA	Stewart
7	Sammy Davis Jr.	Sammy	NA	Davis

Values Changed

In row 1 All names were found
In row 2 the middle name was skipped
In row 3 only a first name was found
In row 4 the middle names were collapsed
In row 5 only a last name was found
In row 6 the title Sir was omitted
In row 7 the title Jr. was omitted

Using with a single name.

1 2	fml("Matt Sandy")

	V1	V2	V3
Matt Sandy	Matt	NA	Sandy

The Function

fml <- function(mangled_names) {

titles <- c("MASTER", "MR", "MISS", "MRS", "MS",

"MX", "JR", "SR", "M", "SIR", "GENTLEMAN",

"SIRE", "MISTRESS", "MADAM", "DAME", "LORD",

"LADY", "ESQ", "EXCELLENCY","EXCELLENCE",

"HER", "HIS", "HONOUR", "THE",

"HONOURABLE", "HONORABLE", "HON", "JUDGE")

mangled_names %>% sapply(function(name) {

split <- str_split(name, " ") %>% unlist

original_length <- length(split)

split <- split[which(!split %>%

toupper %>%

str_replace_all('[^A-Z]','')

%in% titles)]

case_when(

(length(split) < original_length) &

(length(split) == 1) ~ c(NA,

NA,

split[1]),

length(split) == 1 ~ c(split[1],NA,NA),

length(split) == 2 ~ c(split[1],NA,

split[2]),

length(split) == 3 ~ c(split[1],

split[2],

split[3]),

length(split) > 3 ~ c(split[1],

paste(split[2:(length(split)-1)],

collapse = "-"),

split[length(split)])

)

}) %>% t %>% return

}

Improvements

I recommend improving upon this if you want to integrate this function (or attributes of this function) into your workflow. Naming the output or using lists so you can just get partial returns fml("John Smith")$Last could come in handy.

Additional cases could also be created, such as when names are entered Last, First M.. Tailoring the function to your project will yield best results.

3 Comments

Bob Muenchen October 22, 2019
Very handy Matt, thanks for sharing it!
Jamie Lendrum October 22, 2019
You may want to look at the humaniformat package.
Gaby January 20, 2022
Thank you for sharing this! Would there be a way to keep in the salutations and suffixes instead of getting rid of them? I am using this function to try and parse a list of names, but I would like to be able to have additional columns for the prefixes and suffixes. Thanks!

Split Intermixed Names into First, Middle, and Last

Usage on Existing Dataframe

The Function

Improvements

3 Comments

Leave a Reply Cancel reply