ABSTRACT

Naming your dataframe columns doesn’t have to be hard, does it? Here I demonstrate the benefits of implimenting an opionated dataframe cleaner to help keep your columns organised.

1 Introduction

Inconsistent and illogical naming conventions can ruin even the best analysts flow, cause sneaky errors, and potentially lead to misleading or completely incorrect results. Throughout my time as an environmental data analyst I have come across countless instances where the names used in a dataframe mess up my analysis, and I can guarantee I’m not the only one. Just Google “the importance of file naming” to find countless monologues (just like this one), or “bad naming conventions” to realize, actually it could be worse!

So if this is such a widely acknowledged issue, why is it still an issue? How has it not been fixed? Simply put, because a) its boring, and b) everyone is unique and has their own idea of what a “good” system looks like. This leads to people not bothering, or instances where you might pull together several datasets from a range of sources, each using their own (different) naming conventions. Thankfully, if each dataset is at least internally consistent, we can address these differences.

Below, I introduce my method of addressing this issue. It is a highly opinionated dataframe cleaner that focuses exclusively on ensuring every dataframe I touch receives exactly the same column naming convention. Before we dive into it, I believe it is critical to recognise that this method is customized to my needs, it may work for you as well, but I recommend instead that you use this as inspiration to develop your own method.

2 The Naming Convention

So what naming convention am I using exactly? “Upper Camel Case” is my choice, however some people may also refer to it as “Pascal Case”. If your are unfamiliar, here are some examples of naming conventions:

UpperCamelCase
snake_case
kebab-case
UPPERFLATCASE
etc.

Why UpperCamelCase? As noted above, everyone has their own idea of what is good. I find that upper camel case suite my purposes well, it is fairly easy to read, it only contains A-Z, 0-9 (no underscores or dashes), and most importantly it does not clash with my object names when coding it in R. What I mean by this is that I use snake_case to name my objects, and UpperCamelCase to name columns within my objects. Lets consider the following example.

Lets say I have a dataframe that counts fish (called “fish”):

Code

#load the dplyr package
library(dplyr)

#create an example dataframe
fish <- data.frame(species = c("A", "B", "C", "D", "E"),
                   fish_count_location_1 = c(6,9,3,5,10),
                   fish_count_location_2 = c(1,16,3,2,7))

#print the dataframe. If you want to learn about this function, check out my pretty tables post!
cond_form_tables(fish)

species	fish_count_location_1	fish_count_location_2
A	6	1
B	9	16
C	3	3
D	5	2
E	10	7

(Note that both the object and column names are in snake_case).

Then I decide to figure out the mean number of each species of fish, across all locations (called “mean_fish”):

Code

#get the rowwise mean of the fish counts per species
mean_fish <- fish |> 
  rowwise() |> 
  mutate(mean_fish = mean(c(fish_count_location_1, fish_count_location_2))) |> 
  ungroup()

#print the dataframe
cond_form_tables(mean_fish)

species	fish_count_location_1	fish_count_location_2	mean_fish
A	6	1	3.5
B	9	16	12.5
C	3	3	3
D	5	2	3.5
E	10	7	8.5

Whoops, just by using some logical naming I now accidentally have a dataframe object named “mean_fish”, and a column within that dataframe named “mean_fish”. Now obviously this is a silly example, but imaging we have 1000+ lines of code, and we need to know something about the mean number of fish. Suddenly we can’t remember whats an object and whats a column and we can run into subtle errors, or have very confusing lines of codes.

Thus; my final reason for choosing UpperCamelCase:

Code

#create a new example dataframe
fish <- data.frame(Species = c("A", "B", "C", "D", "E"),
                   FishCountLocation1 = c(6,9,3,5,10),
                   FishCountLocation2 = c(1,16,3,2,7))

#get the rowwise mean of the fish counts per species
mean_fish <- fish |> 
  rowwise() |> 
  mutate(mean_fish = mean(c(FishCountLocation1, FishCountLocation2))) |> 
  ungroup()

#print the dataframe
cond_form_tables(mean_fish)

Species	FishCountLocation1	FishCountLocation2	mean_fish
A	6	1	3.5
B	9	16	12.5
C	3	3	3
D	5	2	3.5
E	10	7	8.5

3 The Function

My custom function takes advantage of the janitor R package, which includes a wide range of functions to perform standard cleaning and organisation steps (check out the janitor documentation to see what it can do). Specifically, we are going to use the clean_names() function, along with some bells and whistles to catch our edge cases. Lets take a look:

Code

#create the custom function
name_cleaning <- function(df){

  #load and install (if required) the pacman package handler package, which we will use for all future package downloads
  if(!require("pacman")){install.packages("pacman")}

  #use the pacman function to load and install (if required) all other packages
  pacman::p_load(janitor, dplyr, sf, stringr)

  #check if the df is an sf object and if so, apply clean names to every column but the last column
  if(inherits(df, "sf")){
    
    #convert all but the geometry column to upper camel type
    df_new <- df |> 
      st_drop_geometry() |>
      clean_names(case = "upper_camel")
    
    #bind the geometry column back on with its new name. Note that it should also be named "geom"
    df_new <- df_new |>
      dplyr::mutate(geom = st_geometry(df)) |> 
      st_as_sf()
  
  } else {
    
    #convert ALL columns to upper camel type, don't have to worry about geometry
    df_new <- df |> 
      clean_names(case = "upper_camel")
    
  }
  
  #for every character type column, run a encoding check and fix, then remove weird new line characters
  df_new <- df_new  |> 
    mutate(across(where(is.character), ~ iconv(., from = 'UTF-8', to = 'ASCII//TRANSLIT'))) |> 
    mutate(across(where(is.character), ~str_replace_all(., "\r\n", " ")))
  
  return(df_new)
  
}

Ok, so even though that is a relatively short function, there is still a few things going on. Lets break it down a bit.

First we will initialize the function (if you are unfamiliar with creating your own functions check out my functions post).

Code

#initialize the function
name_cleaning <- function(df){

Then we load each of our required packages. Noting that generally we would expect these packages to already have been loaded in by the script calling this function, but we can’t be sure. Here we use the pacman package to make the install/load steps a bit more streamline, documentation for pacman can be found here.

Code

  #load and install (if required) the pacman package handler package, which we will use for all future package downloads
  if(!require("pacman")){install.packages("pacman")}

  #use the pacman function to load and install (if required) all other packages
  pacman::p_load(janitor, dplyr, sf, stringr)

We then check if the dataframe we are cleaning is actually an “sf” (simple feature) object. Sf objects are special types of dataframes used in geospatial analytics that have an extra column containing coordinate information. This special column has its own rules for column naming and therefore sf objects should be handled differently. In my work I encounter sf objects very often.

Code

  #check if the df is an sf object and if so, apply clean names to every column but the last column
  if(inherits(df, "sf")){

If we are looking at an sf object, we copy the sf object and remove the geometry column from this copy. Following this, we can then run janitor’s clean_names() function on the copy with no geometry column. The reason we do this is that the janitor package has no precedent for sf objects. In the clean_names() function, we specify that we want the column names to follow the “upper_camel” format. This will convert all our column names to the desired format.

Code

    #convert all but the geometry column to upper camel type
    df_new <- df |> 
      st_drop_geometry() |>
      clean_names(case = "upper_camel")

Once we have cleaned the names of every column in the sf object, we can then add the special geometry column back on to the dataset. At this point we also need to convert the object back to the “sf” type.

Note

You may notice that this special geometry column is called “geom” rather than “Geom”… which doesn’t adhere to our naming convention. Unfortunately, this is an annoying quirk of spatial datasets. When they are loaded, the geometry column can take on 1 of 3 different names depending on the source of the data; “geom”, “geometry”, or “shape”. In all cases the name is lowercase, even when the data is saved in uppercase, it will be reloaded in lowercase. Thus, for this issue, we simply ensure that the 3 different possibilities are all just converted to the “geom” option.

Code

    #bind the geometry column back on with its new name. Note that it should also be named "geom"
    df_new <- df_new |>
      dplyr::mutate(geom = st_geometry(df)) |> 
      st_as_sf()

If the object is a simple dataframe (not an sf object), we can just move straight to the clean_names() step that we explained above.

Code

  } else {
    
    #convert ALL columns to upper camel type, don't have to worry about geometry
    df_new <- df |> 
      clean_names(case = "upper_camel")
    
  }

Next we look to catch strange edge cases related to the encoding column of columns. You are likely familiar with the concept of a column being of type “character” or “numeric” or “boolean”, etc. Our strange edge case is similar to this. What we have found is that in some instances the character column type is encoded as “UTF-8”, while other times it is encoded as “ASCII”. Much like how you can’t combine character and numeric columns, you also can’t combine columns encoded as UTF-8 and ASCII. Below we convert all columns encoded as UTF-8 to ASCII to avoid this issue.

Note

Please note that these encodings are hidden from the user and you will never normally need to interact with them, the reason this happens doesn’t matter, and is frankly some mysterious property of excel. Broadly, you probably don’t need to ever understand why/how this step works.

Code

  #for every character type column, run a encoding check and fix, then remove weird new line characters
  df_new <- df_new  |> 
    mutate(across(where(is.character), ~ iconv(., from = 'UTF-8', to = 'ASCII//TRANSLIT'))) |> 
    mutate(across(where(is.character), ~str_replace_all(., "\r\n", " ")))

The object is then returned and the function is complete.

Code

  return(df_new)
  
}

4 In Practice

Now that we understand how the function works, lets demonstrate its use with another example dataset that has a wide range of column names. Here is before:

Code

#create an example table with example names
example_df <- data.frame("column 1" = c(1,2,3,4,5),
                         "column-2" = c("A", "B", "C", "D", "E"),
                         "column_3" = c(NA, NA, NA, NA, NA),
                         "column-four" = c("1A", "2B", "3C", "4D", "5E"),
                         "Column Five" = c(TRUE, FALSE, TRUE, FALSE, TRUE))

#print the table
print(example_df)

  column.1 column.2 column_3 column.four Column.Five
1        1        A       NA          1A        TRUE
2        2        B       NA          2B       FALSE
3        3        C       NA          3C        TRUE
4        4        D       NA          4D       FALSE
5        5        E       NA          5E        TRUE

And after:

Code

#run the clean name functions
example_df_cleaned <- name_cleaning(example_df)

#print the cleaned dataset
cond_form_tables(example_df_cleaned)

Column1	Column2	ColumnFour	ColumnFive
1	A	1A	TRUE
2	B	2B	FALSE
3	C	3C	TRUE
4	D	4D	FALSE
5	E	5E	TRUE

5 Caveats

It is also important to acknowledge the caveats of your own work. To my knowledge the only caveat of this function is that it relies on a sensible preexisting column name, even if the format is horrible. What I mean by this is that a column named “Mean-fish_in Townsville” can be cleaned, but a column with no name… well how can you rename that to something appropriate? As a side note R does generally replace empty column names with “X1”, “X2”, etc. however this still does not provide any information about the column.

Thanks For Reading!

If you like the content, please consider donating to let me know. Also please stick around and have a read of several of my other posts. You'll find work on everything from simple data management and organisation skills, all the way to writing custom functions, tackling complex environmental problems, and my journey when learning new environmental data analyst skills.

This work should be cited as:
Adam Shand, "[Insert Document Title]", "[Insert Year]".