Cleaning and joining data using Excel’s Power Query

By Kelly L. Williams

Q. There was a December 2020 article in the JofA about cleaning and joining data using a program called Alteryx, but we don't have access to this program. Could you explain how to do those tasks using Excel's Power Query?

A. The December 2020 article you are referring to is "Data Preparation for CPAs: Extract, Transform, and Load," JofA, Dec. 2020. The same actions that were performed to clean and join data using Alteryx can be performed using Excel's Power Query. It is not necessary to use complicated Excel formulas to do these tasks, and the steps required to clean and join the same dataset in Power Query are quite simple. In addition, there is no additional charge to use Power Query if you already have Excel 2010 or later for PCs.

To follow along with the walkthrough below, you can download the Excel dataset used in the December 2020 article and view the video located at the end of this article.

For users with Excel 2016 or later for PCs, the Power Query tools are already built into Excel. For users with Excel 2010 or 2013 for PCs, the free Power Query add-in will need to be downloaded; it can be found at redirect.viglink.com. Once the add-in is downloaded, the Power Query tools will have their own tab on the Ribbon. Power Query for Excel 365 for Macs is being developed. These users are able to refresh queries but do not yet have the ability to author them in the Power Query Editor. The content of this article is based on Microsoft Excel 365 for PCs.

The first step is to open Excel and pull the data into the Power Query Editor. Go to the Data tab on the Ribbon. In the Get & Transform group, you can choose to import from many data sources. Because I have saved the dataset used in the December 2020 article to my computer, I click Get Data, From File, From Workbook, choose the appropriate file, and click Import.

This opens the Navigator window. We will need to eventually join two worksheets, so check the box for Select multiple items and check the box beside each worksheet, as shown in the screenshot below. Click Transform Data. The data is now in the Power Query Editor.

tech-qa-power-query-exhibit-1

We will start by cleaning, or transforming, the data from the Customer purchase order worksheet. Make sure this is the worksheet selected on the far-left side of the Power Query Editor. Note that the headers are showing as the first row in the dataset. Fix that by going to the Transform tab on the Ribbon and clicking Use First Row as Headers from the Table group.

Next, we need to have only the number of each purchase order in our dataset. The data should not include the text PO_number. Select the PO_number column by clicking on the header of that column. Go to the Transform tab. From the Text Column group, click Split Column, then By Delimiter. The Split Column by Delimiter window will open. Choose Space from the Select or enter delimiter box. Choose Each occurrence of the delimiter. These selections are the default in the Split Column by Delimiter window. Click OK.

You will now see two PO_number columns. We don't need the column that contains the text PO_number, so select the column by clicking the header and press the Delete button. Double-click the PO_number.2 header and change it back to PO_number. You may have noticed the APPLIED STEPS box on the far-right side of the Power Query Editor is changing as you perform each step. It is essentially recording your steps so that you can automatically perform these steps again. This is very useful if you have to clean data in the same way multiple times. If you perform a step that you would like to undo, this is where you delete the step, and that action will be undone.

Next, we will clean the Quantity data. As with our PO number data, we want to include only the number representing the quantities in our dataset. As you can see, some of the data includes text and leading and trailing white space. Fortunately, Power Query can easily take care of that. First, let's get rid of the leading and trailing white space by selecting the column, going to the Transform tab, clicking Format from the Text Column group, then clicking Trim. That removes all the white space.

Next, we remove the text the same way we removed the text in the PO_number column. Select the Quantity column and go to the Transform tab. Click Split Column then By Delimiter. In the Split Column by Delimiter window, choose Space from the Select or enter delimiter box and choose Each occurrence of the delimiter. Click OK. You will now see two Quantity columns. Delete the Quantity column that contains the text by selecting that column and clicking the Delete button. Double-click the Quantity.1 header and change it back to Quantity.

Now we will clean the Inventory_ID column. Once again, we want just the number of the Inventory ID in our dataset. There is white space and text that needs to be removed. Remove the white space by selecting the column and going to the Transform tab and clicking Format, then Trim. Because this column has text that could be one word or more than one word, Split Column would not be the best option. We will use Extract, which has fewer steps anyway and works well for the data in this column.

Select the Inventory_ID column and go to the Transform tab. Click Extract, then Last Characters because we want to extract the last three characters, the size of our inventory IDs, from the column. The Extract Last Characters window will open. Enter 3 in the Count box. Click OK. This just got rid of the text without creating a new column. However, the numbers in our Inventory_ID column are formatted as text. Let's change that to whole numbers, like the other columns of data. Select the Inventory_ID column, and go to the Transform tab. Next to Data Type in the Any Column group, change the data type to Whole Number.

All of the data in the Customer purchase order worksheet is now clean. Let's make Inventory_ID the first column before we join the other worksheet with this one since we will match up our inventory data and purchase order data by Inventory ID. Select the Inventory_ID column by clicking the header. Hold the left mouse-click, and drag the column in front of the PO_number column. Now sort the data by Inventory_ID, then by PO_number, by selecting the Inventory_ID column and clicking the Sort Ascending button in the Sort group under the Home tab, then selecting the PO_number column and clicking the Sort Ascending button again.

Now let's join the data from the Inventory spreadsheet with the data in the Customer purchase order spreadsheet. With the Customer purchase order spreadsheet still selected on the far-left side of the Power Query Editor, go to the Home tab and click Merge Queries from the Combine group. The Merge window will open. The Customer purchase order table will be shown toward the top of that window. Directly below that is a drop-down box where you may choose another worksheet. Select the Inventory worksheet. We would like to merge the data based on Inventory_ID, so select those columns in both worksheets shown in the Merge window by selecting the headers of those columns. Leave the Join Kind drop-down box at its default for this dataset. See the screenshot of the Merge window below.

tech-qa-power-query-exhibit-2

Click OK. You will now see a new column in the Customer purchase order worksheet from within the Power Query Editor. We need to expand the contents of that new column. Click the Expand button located in the header of the new column, and a window will open. By default, all the columns from the Inventory table will be added. Unselect Inventory_ID since we don't need that data to appear twice in our dataset. Also, uncheck the box that says Use original column name as prefix, since we don't want our column names to be changed. Click OK.

As a reminder, the ways in which we are cleaning and joining this dataset are to mimic the ways in which the data was cleaned and joined in the original article so that a direct comparison can be made about Alteryx and Power Query.

Let's reorder a few of our columns to match the final dataset created in Alteryx. Click the Inventory_name header, hold the left mouse-click, and drag the column over to become the second column in our dataset. Drag the Type column over to become the third column. Drag the Unit_Cost over as the fourth column and the Unit_Price as the fifth column.

Notice the dataset shown above contains two additional columns, Total_Price and Total_Cost. Let's create those columns in our dataset. Go to Add Column on the Ribbon and click Custom Column from the General group. The Custom Column window will open. In the New column name box, input Total_Price. Under Custom column formula, place your cursor after the = sign that is already shown. Double-click Quantity from the Available columns box on the far-right side of the window, type *, and then double-click Unit_Price from the Available columns box. Click OK. Again, go to Add Column and click Custom Column. In the New column name box, input Total_Cost. Under Custom column formula, place your cursor after the = sign that is already shown. Double-click Quantity from the Available columns, type *, then double-click Unit_Cost from the Available columns box. Click OK.

The dataset has now been cleaned and joined. Click the X in the upper right-hand corner to close the Power Query Editor. A window will open and ask if you want to keep your changes. Choose Keep. The workbook with our data will open in Excel, with the finished dataset on a new tab. See the screenshot of the dataset we just made below using Power Query.

tech-qa-power-query-exhibit-3

Once you become familiar with the techniques in Power Query, a dataset like this will take you only a few minutes to clean and join as we have done, and you won't need to use complicated formulas. You can view the video demonstration of this dataset being cleaned and joined using Power Query below.


Kelly L. Williams, CPA, Ph.D., MBA, is an associate professor of accounting at the Jones College of Business at Middle Tennessee State University.

Submit a question

Do you have technology questions for this column? Or, after reading an answer, do you have a better solution? Send them to jofatech@aicpa.org.We regret being unable to individually answer all submitted questions.

SPONSORED REPORT

Get your clients ready for tax season

Upon its enactment in March, the American Rescue Plan Act (ARPA) introduced many new tax changes, some of which retroactively affected 2020 returns. Making the right moves now can help you mitigate any surprises heading into 2022.

100th ANNIVERSARY

Black CPA Centennial, 1921–2021

With 2021 marking the 100th anniversary of the first Black licensed CPA in the United States, a yearlong campaign kicked off to recognize the nation’s Black CPAs and encourage greater progress in diversity, inclusion, and equity in the CPA profession.