Saturday 14 September 2013

How to create 4,294,967,296 rows of data from 1 duplicate in 5 easy moves


Like this:

You load data from a source with an allegedly unique key.

Then you need to do some calculations on it, using data from other sources say, or perhaps do some calculations in the script (to make charts faster in the app) so you join the table back onto itself.

Imagine there's a duplicate, 2 lines with the same key in. When you join it onto itself, each line will replicate X the number of rows it matches in the table that you join it with. IE 2 rows will turn into 2 X 2 = 4 rows:

Data
Data_2
Data - after join 
Key# Data
Key# Data
Key# Data
1 a
1 a
1 a
2 b
2 b
2 b
3 c
3 c
3 c
4 d
4 d
4 d
4 e
4 e
4 e






4 d






4 e

If you were to then join the table onto itself again, each row with key #4 will join back onto itself 4 times, creating 4 X 4rows = 16.

So far so obvious.



What happens after - each time the number of resulting rows is a square of itself, is a reminder to check the data that comes into your application!!!

 
Join Number of rows:
Original table 2
1 4
2 16
3 256
4 65536
5 4294967296


I am posting because my colleague and I spent hours trying to find why an application hung when loading, when it had been fine before. And why was there 17 million rows of data in the script?

After checking just about every step in the script, we found the duplicates in the source data, and slapped our foreheads in annoyance! Moral of the story: do not always trust the data coming in!!!

 Ways to prevent the problem (if you can't control the source):

 1) Load distinct on source tables 
Type in "Load distinct " (or "Select distinct" in SQL) when loading the original table of data.  This will load in distinct rows of data. This wont work though when you have duplicate keys but with different data in as they still count as separate rows

2) Use the exists() function
Enter the clause
where not(exists(keyField))
at the end of the load statement. Exists() does exactly what you think it would do, checks if a value already exists in a field. We only want to load rows where the keyfield has not already been loaded before, hence the not(exists())


3) Feed back the issue to whoever has control over the source table
If it doesn't take too long it might be wise to include some script that counts the number of rows in the source table and saves to a variable.

//Load straight count from sql from the source table
TableCt: sql select count(*) as RowCount from SourceTable;

let vRowCt = Peek('RowCount ',0,'TableCt') 

//Load the data we are going to use
Data_Table: sql select * from SourceTable where not(exists(KeyField));

The perform a check in the application that alerts the user to the issues in the source data.

EG conditional colour on a text box:
if(count({1} KeyField) = vRowCt,green(),Red())

Always best to keep the data premium, every step of the way!

:)

No comments:

Post a Comment