Using joins to clean messy data

James Smythe
Oct 30, 2019
5 min read

Data in media is rarely perfect and ready-to-chart. We can spend hours per week on repetitive data tasks, taking data from excel reports, standardising the way it's labelled, and joining data from separate tables before sending out data to an end-user.

Most of this work happens in Excel, and vlookup is a great friend. It's a tricky friend too though - cell references need to be perfect, data needs to be "to the right" of the reference column, you need to count columns and you need a formula for every single cell you want to fill... basically it's time-consuming and error-prone. It's also hard to automate, you end up writing new formulae every time.

R uses joins, as Excel uses vlookup. The most common of these is left_join(), which comes with the tidyverse set of packages. Check this page out for the full family of joins.

Let's look at an example from the work we do in building the official PAMCo subscriber app...

#Load the necessary R packages

library(tidyverse) #for data manipulation & visualisation

library(readxl) #for opening Excel files

# Read in the new data and inspect

new_data <- read_excel("new_pamco_data.xlsx")

glimpse(new_data)

Observations: 274,362

Variables: 10

$ ID <dbl> 90001, 90002, 90003, 90004, 90006, 90007, 90008, 90009, 90010, 90011, 90012, 90013, 90014, 90015, 90016, 90017, 90018,… $ Brand <chr> "Express (TBR) (N)", "Mail (TBR) (D)", "Mirror (TBR) (D)", "Daily Record (TBR) (D)", "Daily Star (TBR) (N)", "The Tele… $ Reach <chr> "Weekly", "Weekly", "Weekly", "Weekly", "Weekly", "Weekly", "Weekly", "Weekly", "Weekly", "Weekly", "Weekly", "Weekly"… $ Platform <chr> "Total Brand", "Total Brand", "Total Brand", "Total Brand", "Total Brand", "Total Brand", "Total Brand", "Total Brand"… $ Base <chr> "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL",… $ Demography1 <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "… $ Category1 <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "… $ Value <dbl> 11408.87, 14917.10, 14771.93, 2658.52, 4104.99, 9954.85, 13733.54, 2461.05, 21565.35, 4915.72, 390.93, 582.60, 409.06,… $ Percent <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100… $ `Percent GB 15+` <dbl> 21.45916052, 28.05786221, 27.78481417, 5.00046824, 7.72114902, 18.72427740, 25.83168171, 4.62904008, 40.56268608, 9.24…

The new data each quarter is a dataset of roughly 275,000 rows. Despite how it may look on your screen now, it's a good clean dataset (thanks Ipsos and PAMCo!) - but the master brand names are combined with annotations like (TBR) and (N). These convey important information about how the brands are measured, but you might not want them on your sales deck. You may also want to add other information on the brand, such as whether it's a newsbrand or a magazine, and what its various publications are called. So you have a lot of auto-replacing to do, some vlookups, or worse still, manual editing.

We clean this with a separate data file, which lists the 204 unique 'raw' brand names, along with extra information for each brand:-

lookup <- read_excel("Data-Prep/PAMCo_entities.xlsx")

glimpse(lookup)

Observations: 204

Variables: 5

$ BRAND_ID <dbl> 90001, 90002, 90003, 90004, 90006, 90007, 90008, 90009, 90010, 90011, 90012, 90013, 90014, 90015, 90016, 90017, 90018, 90019… $ BRAND_NAME <chr> "Express (TBR) (N)", "Mail (TBR) (D)", "Mirror (TBR) (D)", "Daily Record (TBR) (D)", "Daily Star (TBR) (N)", "The Telegraph … $ Print <chr> "Daily Express, Sunday Express", "Daily Mail, The Mail on Sunday", "Daily Mirror, Sunday Mirror, Sunday People", "Daily Rec… $ Digital <chr> "EXPRESS.CO.UK", "DAILYMAIL.CO.UK", "MIRROR.CO.UK", "DAILYRECORD.CO.UK", "DAILYSTAR.CO.UK", "TELEGRAPH.CO.UK", "THEGUARDIAN.… $ Type <chr> "Newsbrand", "Newsbrand", "Newsbrand", "Newsbrand", "Newsbrand", "Newsbrand", "Newsbrand", "Newsbrand", "Newsbrand", "Newsbr…

The two datasets share a common BRAND_ID field, and it seems likely that BRAND_NAME is also a match. If we want to merge the two in Excel, collecting the 3 new columns of data for each of the 275,000 rows of new data, we have a whopping 825k vlookup formulae to fire up (go make the coffee and hope your PC is a new one!). Alternatively, let's do it with one line of R.

merged_data <- new_data %>% left_join(lookup, by = c("ID" = "BRAND_ID", "Brand" = "BRAND_NAME"))

We have created a new table called 'merged_data', by taking 'new_data' and joining the corresponding columns from 'lookup'. As you can see, because the column names don't match perfectly, I've paired them up in the 'by' argument of the function. Remember, c() just indicates that you're creating a list.

For your info, I timed it with R's tictoc package and it took 0.057 seconds.

glimpse(merged_data)

Observations: 274,362

Variables: 13

If you've read this far and are dazed or confused, please get in touch and we'd be happy to refresh your R memory.

Using joins to clean messy data

Recent Posts

Comments