Append data from multiple worksheets of multiple workbooks where each worksheet has a different heading

{0 Comments}

In a folder there are multiple workbooks with an unknown number of worksheets in each workbook.  Each worksheet has data for one year and has 13 columns - the first is for the Product and the other 12 are for each month of the year.  So sheet1 of Book1 has Product in column1, 1 Jan 2015 in column2, 1 Feb 2015 in column3 and so on until 1 Dec 2015 in column13.  Likewise, sheet2 of Book1 has Product in column1, 1 Jan 2016 in column2, 1 Feb 2016 in column3 and so on until 1 Dec 2016 in column13.  The worksheet which has data for the current year will have fewer columns since data for all months is not available.  The number of rows on each worksheet vary depending upon the number of products sold in that year (as can be seen in the two images below).

The objective is to append data from all worksheet of all workbooks to create a simple 3 column Table - Date in column1, Product in column2 and Value in column3.

The solution should be dynamic enough to accommodate the following:

  1. More files added to/deleted from the folder
  2. More worksheets added to/deleted from the existing files
  3. Data added by rows/columns to the existing worksheets
  4. Heading changes in the existing worksheets i.e. in sheet1 of the DATA1 workbook (first image above), one may want to change the year appearing in the first row from 2015 to 2013

This question would have been an easy one to solve had the headings on all worksheets of all workbooks been the same such as Month1, Month2, Month3, Month4 etc.  In that scenario, the technique shown in this YouTube vide would have worked very well.  However, since the headers on each worksheet of each workbook are different, this technique yields an unexpected result (shown below) - Please note that I have not used the Table.PromoteHeaders([Data]) command as described in the video because that would take away the row with dates i.e. row1, row5, row11, row16.  If I do not run that step, I get the output as shown in the image below.  The dates appear as part of the Data rather than being identified as headers.

I have solved this question using the Power Query Editor.  You may download my source data from here and solution workbook from here.

  1. Download the source data workbooks and save them in a folder.
  2. Download the solution workbook and save it anywhere else.
  3. Open the solution workbook and go to Data > Queries and Connections.
  4. In the pane that opens up, right click on the Query > Edit.
  5. In the Applied steps window, double click on Source and specify the path where you have saved the source data workbooks (step1 above).
  6. Click on OK > Close and Apply.

Summarise data by most recent status

{0 Comments}

Here's a simple 3 column dataset showing Date, ID and Status - the status of each ID by Date.

So, the narrative for ID A is:

  1. It was "New" on Jan 1
  2. It remained "New" until Jan 14
  3. On Jan 15, the status changed to "Open"
  4. It remained "Open" till Jan 31 and the status changed to "Closed" on Feb 1
  5. It remained "Closed" till March 31 and the status changed to "Stop" on April 1
  6. It has remained in 'Stop" status till Today

Note that for the month of March, there is no record for ID A but the status of it has to be treated as Closed (refer point 5 above).

The objective is to count the number of ID's by status and month.  The expected result is:

Please note that the trick part here is to get the result as 2 in cell D6 (Status closed for March). I have solved this problem using Power Query and PowerPivot.  Since these two Business Intelligence (BI) tools are available in PowerBI desktop (PBI) as well, you may download a folder with both files (the MS Excel workbook and PBI file) from here.

Segment customers into dynamic buckets

{0 Comments}

Consider a 4 column table - Respondent ID, Device ID, App Name and Category.  So this dataset shows which apps are installed on which device ID by which user and which category do the apps fall into.  It is a small dataset with only 4 columns and 2,000 rows.

The question on this dataset is - "I would like to segment the total user base by Categories into the following 9 buckets:

  1. Those who only have 1 app installed; and
  2. Those who have 2 apps installed; and
  3. Those who have 3 apps installed; and
  4. Those who have 4 apps installed; and
  5. Those who have 5 apps installed; and
  6. Those who have 6 apps installed; and
  7. Those who have 7 apps installed; and
  8. Those who have 8 - 10 apps installed; and
  9. Those who have 10+ apps installed

The expected result is a Pivot Table with buckets in the column labels, Categories in the row labels and number of people in the value area section (as shown below)

Here's how one can interpret the Pivot Table shown above:

  1. Cell B50 - There are 75 people who only have 1 "Tool" app installed
  2. Cell J44 - There is just 1 person who has 10+ Photography apps installed.

I have solved this problem using Power Query and PowerPivot.  Since these two Business Intelligence (BI) tools are available in PowerBI desktop (PBI) as well, you may download a folder with both files (the MS Excel workbook and PBI file) from here.

Compute hours spent on projects given resource allocation

{2 Comments}

In the dataset below column A has the Employee Name, column B and C are the assignment start and end dates, Column D is the location and columns E to J are the Month-Year columns.  So each row represents data for an employee on a particular project.  The numbers in range E2:J8 represent how much that particular employee is aligned to the particular project i.e. a value of 1 means that the employee is dedicated solely to that project, 1.4 means that the employee will be spending extra hours on that project and 0.1 indicates that the employee will be working on multiple other projects.

The objective is to create another column (column K in the second screenshot) which will show the number of hours the employee will spend on the project.  The number of hours will be computed as number of working days in a month (treat Saturday and Sunday as weekends) * time allocation to that project (the numbers in range E2:J8) * 8.5 hours per day for an Offshore project and 8 hours per day for other projects.

The raw data sheet looks like this

The expected result is

The figure in cell K3 has been computed as:

  • Number of working days between November 11, 2018 and November 30, 2018 are 15.  So 15 * 1 = 15
  • Number of working days between December 1, 2018 and December 12, 2018 are 8.  So 8 * 0.5 = 4
  • Total effective working days are 15 + 4 = 19
  • Since it is an Offshore project, the hours per day would be 8.5.  Therefore total effective hours: 19 * 8.5 = 161.5

I have solved this problem using 3 methods:

  1. Excel formulas - Refer worksheet named "Formula output"
  2. Power Query and PowerPivot - Refer worksheet named "Power Pivot output"
  3. Power Query only - Refer worksheet named "Power Query output"

You may download my solution workbook from here.

Customer analysis by Country and time period

{2 Comments}

Here is a Sales dataset of 8 columns and 29 rows.  It basically details the revenue earned and cash collected by service type, Customer, Country and Period.  For a selected Country and time period, there could be customers availing of both services or of any 1 service.


There are 2 broad questions that one may want to get answers to:

  1. Determine the number of customers who availed of a certain number of services
  2. Determine customers with whom business was forged for the first time and those who churned out

For a chosen country and Year/Month, the first question stated above further sub-divides into:

  1. How may customers availed of both services - Consultancy and Implementation
  2. How may customers availed of only one of the two services

So if a user selects the Country as India and Year/Month as January 2015, then Customers who availed of both services would be 1,3 and 4.  Note that Customer 2 should not be considered (even though he/she availed of both services) because the revenue earned from one of the services (Implementation) was nil.  For the same selection (India and January 2015), the Customers who availed of only 1 service would be Customer 2 - this customer availed of only the Consultancy service (Revenue was earned from this Customer only for this service).  After applying a filter on the source dataset, the rows for India and January 2015 are:

The expected result is shown below in PowerBI desktop software.  If you are not concerned with who those customers are (you just want the count), then you may simply remove the Customer Name field from the visual.

The second question is to determine the number of new and lost customers.  If a customer was not in the database in any prior month, the customer is identified as new.  To clarify, a customer who availed of the Consultancy service in a prior month also availed of the Implementation service for the first time in the current month would not be counted as a new customer.  If a customer ceases to generate revenue in any month, the customer would be counted as lost (churned) in that month.  So when USA is selected in the Country slicer and Year/Month is February 2015, the expected result is:

I have solved this question with the help of the PowerPivot.  You may download my PowerBI desktop solution file from here and source Excel workbook from here.  This problem can also be solved in MS Excel using the PowerPivot.

Compute Relative Size Factor per vendor

{0 Comments}

Relative size factor (RSF) is a test to identify anomalies where the largest amount for subsets in a given key is outside the norm for those subsets. This test compares the top two amounts for each subset and calculates the RSF for each. In order to identify potential fraudulent activities in invoice payment data, one utilizes the largest and the second-largest amounts to calculate a ratio based on purchases that are grouped by vendors.  You may read more on this topic here.

Here is a 3 column dataset.  The first column is Vendor Number, the second is Invoice number and last is invoice amount.  There can be multiple invoices per vendor.  The objective is to determine the highest invoice value for a vendor and divide that by the second highest invoice value for that same vendor to get a ratio.  The same needs to be done for all vendors.  An interesting case in the dataset below is Vendor_No V4439 - there are 2 instances of highest value for this vendor i.e. 25,378.30 and another 2 instances of second highest value i.e. 24,068.25.  The RSF for this case will be 25,378.30/24,068.25.  If there is no instance of second highest value for a vendor, then the result should be 0.

The expected result is:

I have solved this question with the help of the PowerPivot.  You may download my solution workbook from here.

Analyse free flowing text data or user entered remarks from multiple perspectives

{0 Comments}

Here is a 2 column dataset - UserID in column A and Remarks in Column B.  This dataset basically tabulates the remarks/comments shared by different users.  Entries in the Remarks column are basically free flowing text entries which have the following inconsistencies/nuances:

  1. Users reported multiple errors which are separated by comma, Alt+Enter (same line within the cell) and numbered bullets
  2. Users committed spelling mistakes (see arrows in Table1)
  3. A user ID may be repeated in column A

Given this dataset, one may want to "hunt" for specific "keyword Groups" (column E above) in each user remark cell and get meaningful insights.  Some questions which one would like to have answers to are:

  1. How may users reported each type of keyword Group - "How may users used the Unresponsive keyword?".  See Pivot Table1 below.
  2. Which are the keyword Groups that each user reported - "Which are the different keyword groups reported by UserID A004?".  See Pivot Table2 below.
  3. How many users reported each of the different keyword Groups - "How many users reported all 3 problems of Slow, unresponsiveness and crash".  See Pivot Table 3 below.
  4. How may users who used this keyword group also used this keyword group - "How many users who reported Crash also reported Unresponsive?".  See Pivot Table 4 below.

This was quite a formidable challenge to solve because of spelling mistakes and multiple keywords reported in each cell.  I have solved this problem with the help of Power Query and PowerPivot.  You may download my workbook from here.

Determine the top selling location for each product

{0 Comments}

Visualise a 3 column dataset as shown below - Location, Product and Sales.  Each location can have multiple products (Product A has Banana, Apple and Carrot) and each product can be sold in multiple locations (Banana is sold in locations A, B and F).

The objective is to determine the location with highest sales for each product.  So for Banana, maximum sale value is 25 and location of maximum sales value is B.  Likewise for Orange, maximum sales value is 49 and location of maximum sales value is A.  The expected result is:

I have 4 solutions to this problem:

  1. Advanced Filters - This is a static solution.  For any changes in the source data range, one will have to re-enter the 3 inputs in the Advanced Filter window
  2. Formulas - This is a semi-dynamic solution.  To make it fully dynamic, one will have to write an array formula to first extract all unique product names in a column.  The array formula to extract product names in a column can be obtained from here.
  3. Power Query - This is a dynamic solution.  For any changes in the source data sheet, one just has to go to Data > Refresh All
  4. PowerPivot - This is a dynamic solution.  For any changes in the source data sheet, one just has to go to Data > Refresh All

You may download my solution workbook from here.

Remove duplicates from each cell of a dataset

{0 Comments}

Here's a dataset with 4 columns and 2 rows.  In column A, there is Patient ID and in the other columns are observations recorded by 3 Doctors for each patient.  As can be seen, in each cell there are multiple observations which are either separated by Alt+Enter (another line in the same cell) and/or a comma.  To make things more complicated, there are duplicate text entries appearing in each cell.  In cell B2, Creative appears twice and in cell D2, Fatigue, headaches and stress each appear twice.  Please also note that the entries appear in mixed case in each cell i.e. in cell D2, the first occurrence of stress is in proper case and the second instance is in small case.

The objective is to remove duplicates from each cell (disregarding case sensitivity) and standardize the multiple entries with an Alt+Enter in each cell.  The expected result is shown in the green table below

I have solved this problem using Power Query a.k.a. Data > Get & Transform in Excel 2016 and higher versions.  You may download my solution workbook from here.

Flex a Pivot Table to show data for x months ended a certain user defined month

{2 Comments}

In this simple 3 column dataset shown below, one can see the month wise demand and energy charge for 2 years - 2017 and 2018.

The objective is to compute the month wise demand charge for x months ended a certain user defined Year and Month.  So, if a user selects the Year as 2018, Month as June and Duration as 9, then the Pivot Table should show month wise demand charge for the 9 months ended June 2018 i.e. from October 2017 to June 2018.  Likewise, if a user selects Year as 2018, Month as May and Duration as 3, then the Pivot Table show should month wise demand charge for the 3 months ended May 2018 i.e. March 2018 to May 2018.

You may download my solution workbook from here.