I wanted to do a quick post based on some feedback I received on my post last week. I introduced filtering data in dplyr and missed covering one crucial part of dealing with dirty data: handling NAs.
This is straight forward to do in Power Query by filtering rows in the interface. It is easy to spot rows with errors or null values.
Filtering out NA values was one aspect of learning to clean data in R that I found difficult to get my head around at first. I first tried to filter out NAs using logical tests as I would have with strings:
The answer is using the
is.na() function. This function is in Base R and is a logical test of if a value is NA or not.
For example, the following statement would return all records where the value in the id column is NA:
records_with_NA_id <- filter(dirty_data, is.na(id))
This is great for returning values that are NA; however, we often want to be remove NAs from the data set instead of returning them. Instead of testing to see if a record is NA and then returning it if this is true, we want to do the opposite.
This can be achieved by preceding the function with
!, essentially testing if something is not NA and only returning records that have a valid value.
For example, the following statement would return all records where the value in the id column is not NA:
clean_data <- filter(dirty_data, !is.na(id))
That’s how I deal with NA values when filtering data with dplyr!
I’m excited for my posts over the next few weeks, as I explore the different ways to use R and Power BI together to create some powerful dashboards with predictive analytics. Stay tuned!