Category: DATA EXTRACTION

Search for multiple phrases within a cell and extract all those phrases in another column

{0 Comments}

Assume a simple two column dataset as shown below.  In column A are some text strings.  In each cell are multiple phrases separated by commas.  In column C are some phrases - one in each cell.  The objective is to search for all phrases in each cell and extract the phrases found in another column of the same row.

untitled

The expected result is:

untitled

I have solved this problem using Power Query a.k.a. Data > Get & Transform in Excel 2016.  You may download my solution workbook from here.

Remove special characters from a string

{2 Comments}

Hi,

Assume a column of names as follows:

Name
Mohammed Zia-Ul Haque
Steven Thomas -
,-Rohit Sunil Ahir-Chowdhary.-
Anuj -----------
Sameer --
..,Mohit --
Rajeev Nair.
Monalisa . Das
Vijeta ...
--,.Anjana. M.U..,-

Please observe that there are special characters before the name, within the name and after the name.  The task is to remove special characters before and after the name.  The expected result is shown below:

Expected Result
Mohammed Zia-Ul Haque
Steven Thomas
Rohit Sunil Ahir-Chowdhary
Anuj
Sameer
Mohit
Rajeev Nair
Monalisa . Das
Vijeta
Anjana. M.U

The array formula (Ctrl+Shift+Enter) to make this work is

=MID(A2,MIN(SEARCH(CHAR(ROW($A$65:$A$90)),A2&CHAR(ROW($A$65:$A$90)))),LOOKUP(2,1/((CODE(MID(UPPER(A2),ROW(INDIRECT("1:"&LEN(A2))),1))>=65)*(CODE(MID(UPPER(A2),ROW(INDIRECT("1:"&LEN(A2))),1))<=90)),ROW(INDIRECT("1:"&LEN(A2))))-(MIN(SEARCH(CHAR(ROW($A$65:$A$90)),A2&CHAR(ROW($A$65:$A$90)))))+1)

I have solved a similar problem at this link as well but that requires the usage of an add-in.  This is so because the special characters and numbers need to be removed from within the string as well.  In other words, everything except letters need to be removed from the alphanumeric string (no matter where the numbers and special characters are - beginning, middle or at the end).

Extract City, State and Pin code from an address string

{40 Comments}

For a lot of Excel users, a common problem is to extract City, State and Pin Code from an address string.  What compounds the issue is that there is no standardisation in an address string.  For e.g., one may end the address with a Pin code, while others may end it with a State and Country.  Some other variations could be:

1. End the address with Contact Numbers
2. There are no spaces between City, State and Pin code

There are of course many other variations which are possible and the sheer number of these variations makes it difficult to list down all of them.

Here is an example of address strings from where the State, City and Pin code need to extracted in three columns:

1. F-45, Pocket 1, Sector 198, Noida - 201303, Uttar Pradesh, India
2. S-45/B, Pocket 1-C, Phase II, Sector 189, Gurgaon, Haryana - 122002
3. RR-45/B, Pocket II-S, Cross 2, Noida - 201303, India
4. T-45, Sector 198, Lucknow - 226001, Uttar Pradesh, India
5. V-45(A), Sector 193C,Allahabad-211002 Uttar Pradesh India
6. V-45(A), Sector 193C,Allahabad-211002Uttar PradeshIndia

You may refer to my solution in this workbook.

Remove abbreviations appearing before a name

{2 Comments}

Assume there is a list of names appearing in range A2:A9.  Before these names, there are abbreviations such as Dr., Mr. Mrs. etc.  To worsen the situation, there may/may not be a space after the abbreviation.  A sample dataset may look like this:

Mr.Ramakrishna
Mr Ramakrishna
Mr. Ramakrishna
MrRamakrishna
PhD.Saurav
Dr (Mrs.) Indu Sharma
Dr. (Mr) Rakesh Singh
Mr. Mrinal Joshi

The task is to extract only the names in another column.

You may refer to my solution in this workbook.

Perform a lookup with inexact text strings and/or spelling mistakes

{31 Comments}

Any lookup function i.e., VLOOKUP(), LOOKUP(), INDEX()/MATCH(), HLOOKUP() requires the text lookup variable (the first input of the VLOOKUP(), HLOOKUP() etc. ) to be spelled exactly the same as the entries in the lookup column.  If the spelling does not match, the result is #NA.  To some extent, inexact searches can be performed by concatenating wildcard characters (? and *) to the lookup variable.

The task is to increase MS Excel's tolerance towards spelling mistakes and inexact text matches across two tables. For e.g., when performing a lookup, somehow MS Excel should be able to understand the Delhi City is the same as City, Delhi.  In order to perform, inexact searches, you can download and install Microsoft's Fuzzy lookup add-in from here.

I have run this add-in on two live datasets in this workbook.

Creating Exception Reports

{23 Comments}

One worksheet shows the date wise courses (External Course, Internal Course and Elearning Course) taken by staff members.  The data also shows the Line Managers that they report to.  Another worksheet lists down all staff members and the Line Managers that they report to.

From data on the first worksheet, one can easily determine (via a Pivot Table) the Elearning courses taken by various staff members (along with their Line Managers).  However, a Pivot Table will not show the staff members (along with their Line Managers) who did not take the Elearning Course.

You may refer to the data set and my workaround in this workbook.  My solution uses the Microsoft Power Query add-in for MS Excel 2010 and higher versions.  One can download and install the tool from here.

Speeding up a lookup task on a large database

{0 Comments}

Performing calculations or writing formulas on large databases pose the following major problems:

1. Time taken to process is unduly long; and
2. File size increases disproportionately

A typical problem is fetching data from another worksheet or workbook via the VLOOKUP() function.  Quite often, when one confirms the VLOOKUP() formula in a single cell with the Enter key, then at the bottom right of your MS Excel screen (the task bar), a white progress bar appears which reads "Calculating (4 processors) 1%".  Once has to wait for the processing to complete before copying the formula down.  The process of copying is obviously just as time consuming and resource intensive.

In these situations, a workaround is the JOIN the tables in MS Access and then get the data back in MS Excel.  For accomplishing this, one must have at least some working knowledge of "SQL Query" writing.

If you are using the Power Query add-in for MS Excel 2010 and higher versions, then one can fetch data from another worksheet or workbook without writing a single formula.  Furthermore, there is absolutely no lag in generating the result.

You may refer to my Power Query solution in this workbook (This is a 20 MB file)

You may watch a short video of my solution here

Generate a list of missing invoice numbers

{2 Comments}

In a range of cells, assume there is a list of invoices raised during a certain period.  Since adequate care was not taken to number the invoices sequentially, there are missing invoice numbers in the Excel range.  So, it is possible that after raising invoice number 501, 502 and 503 the next invoice raised was 507.

Given the following:

1. Starting and ending invoice number (two numbers) from the invoice book; and
2. List of actual invoice numbers raised during a certain period (as elaborated above)

one may want to generate a list of missing invoices.

You may refer to two solutions in this workbook.

LOOKUP unique data from multiple columns where search string appears multiple times

{2 Comments}

In a list with multiple occurrences, the VLOOKUP() function in Excel will only return the first occurrence of the string being searched.  Furthermore,  the VLOOKUP() function returns data from one column only.

To work with a list where there are multiple occurrences of a particular string and data has to be extracted from multiple columns, you may refer to my solution in this workbook.

To work with a list where there are multiple occurrences of a particular string and data has to be extracted from one column only, you may refer to my solution at this link.

Filtering a database by both rows and columns

{34 Comments}

In range A1:BG2185 of this workbook, there is macro economic and demographic data for years from 1984 to 2009 (26 years) for 84 countries .  Furthermore, there are 57 parameters being tracked for each of the 26 years for all 84 countries.

There are missing observations for some years across many parameters and many countries.  If any country has more than 11 missing observations in any one parameter, then they should be depicted in the sheet named "Missing Observations".  The "Missing Observations" sheet should show three columns:

1. Country Name
2. Parameter with more than 11 missing observations
3. Observations missing