Fixing Data Inconsistencies
Using Power Query Replacement Tables
Sometimes, data gets a little messy.
One way for disorderedly data to creep in is when manually entered fields make it into our reporting.
Is it Essendon North, or is it North Essendon?
It might depend on who is entering the data, but we know we want to merge them for our reports.
What about typos?
We want RIchmond included with Richmond.
Just as there are many ways for data to become messy, there are numerous ways to tidy it and get it ready for our reports and dashboards.
Some, as we explore below, are better than others.
Spreadsheet Solutions
Manual Replacements
If we want, we can find each inconsistency and manually type or paste over it.
For all but the smallest sets of data, this is not a workable approach.
Find and Replace
Another spreadsheet-based option is to use Find and Replace.
This approach is a little better than manual replacements but comes with challenges of its own.
One problem is that we must repeat the find and replace steps every time we create a report.
Enter, Power Query
When we talk about changing our source data's values or structure in Power BI (or Excel), Power Query should spring to mind.
But, just making it into Power Query is only the first step; we have some options to consider once we're here.
A Quick Aside on Power Query
For the uninitiated, Power Query is a part of Power BI and Excel that allows you to use a point-and-click interface to create procedures that load your data in from it’s source and make a wide array of transformations to it.
From here, it is loaded into Power BI or Excel for you to build your model and reports.
Replace Values Button
The Replace Values button in Power Query provides a familiar experience to using Find and Replace in Excel.
The difference is that Replace Values in Power Query allows us to create a step in an automated data preparation pipeline.
When we use Excel's Find and Replace, we take steps that we're going to need to take again next time the same erroneous value comes up.
In Power Query, we take these steps once, and they are repeated for us automatically in future.
So, why not use this approach?
In many cases, using the Replace Values button is a perfectly acceptable solution.
However, it can sometimes lead to messy queries with a lot of steps over in the Query Settings section on the right hand side.
If we can avoid it, we don't want to replace messy data with messy queries!
All else equal, we want to keep the Query Steps list over on the right as short (and tidy) as possible.
A Conditional Column
We could otherwise use a Conditional Column, which would roll all of the Replace Values steps up into a single step.
This approach keeps our Query Steps list nice and short but doesn't solve another problem: excessive code.
When we use a conditional column, each replacement we add turns into an element of something called an if-then statement under the hood.
We could end up with a very long if-then statement, which could become a maintenance issue in future.
Below is what it looks like to use a conditional column for these replacements.
A Replacement Table
A replacement table works a little like the Conditional Column described above, except that we replace the lengthy if-then statement with a simple lookup table.
This Replacement Table approach is a tidier solution to the data inconsistencies issue, so we'll now go over implementing it.
Steps for using a Replacement Table
#1 Create the Replacement Table
First things first, we create a two-column table.
The first column is for the erroneous entries that we've observed in our data, and the second column is what should replace them.
#2 Load it into Power Query
We can load this in to Power Query as a New Source.
#3 Merge with the Original Source Data
Next, you merge this Replacement Table with your source data.
The <original source data> side of the merge is the field you are fixing (i.e. the one with the erroneous values), and the <replacement table> side of the merge is the erroneous values column - “To Replace” in the example above.
#4 Create a Conditional Column
It turns out we are still using a Conditional Column - just differently.
This time the if-then statement under the hood will be the same length regardless of the number of replacements we are making.
For rows where the original value is as intended, this conditional column will use this original value.
For rows where the original value needs to be replaced, it will use the corrected version from the replacement table.
It decides which is which based on whether or not a To Replace field has been merged in for that row.
Benefits of Replacement TabLe over a Conditional Column
There are a few benefits to this approach over the Conditional Column.
Simpler Maintenance
You don't need to go into Power Query to add new replacements.
When a new incorrect version of a value makes its first appearance in your source data (Essendon Nrth, for instance), you add the pair to this table.
Anyone with access to the list of replacements can add new ones to it as they come up.
Keeping this up to date could be the job of someone other than your Power BI Developer.
Transferability
A lookup table is language and platform agnostic.
So if at some point you need to make these replacements in another language, perhaps Python, or R, you'll be able to use this same table.
You'd be able to repurpose the if-then statement from the Conditional Column approach, but it'd be a bit more painful.
In Conclusion
For those used to manipulating their data manually in a spreadsheet, just finding yourself in Power Query can be a game-changer.
The buttons on the ribbons at the top hold vast potential to transform your reporting processes from tedious, monotonous tasks into automated insight pipelines.
Once you've been using these buttons for a while, you'll start to see even greater opportunities for enhanced efficiency a little off the beaten track.
The approach outlined in this article is an example of this, where we combine different features of Power Query to come up with an efficient, maintainable solution to a common problem.