Channel: SCN : All Content - Data Services and Data Quality

Performant extraction of most recent data from a history table


You have a table that contains multiple timestamped records for a given primary key:

 

Key  Att  Timestamp
03   747  2012.11.11 04:17:30
01   ABC  2014.09.30 17:45:54
02   UVW  2014.04.16 17:45:23
01   DEF  2014.08.17 16:16:27
02   XYZ  2014.08.25 18:15:45
01   JKL  2012.04.30 04:00:00
03   777  2014.07.15 12:45:12
01   GHI  2013.06.08 23:11:26
03   737  2010.12.06 06:43:52

 

The required output is the most recent record for every key value:

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
02   XYZ  2014.08.25 18:15:45
03   777  2014.07.15 12:45:12

 

 

 

Solution #1: Use the gen_row_num_by_group function

 

Build a dataflow as such:

[Image: df1.png]

In the first query transform, sort the input stream by Key ascending and Timestamp descending. The sort will be pushed down to the underlying database, which is often good for performance.

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
01   DEF  2014.08.17 16:16:27
01   GHI  2013.06.08 23:11:26
01   JKL  2012.04.30 04:00:00
02   XYZ  2014.08.25 18:15:45
02   UVW  2014.04.16 17:45:23
03   777  2014.07.15 12:45:12
03   747  2012.11.11 04:17:30
03   737  2010.12.06 06:43:52

 

In the second query transform, add a column Seqno and map it to gen_row_num_by_group(Key).

 

Key  Att  Timestamp            Seqno
01   ABC  2014.09.30 17:45:54  1
01   DEF  2014.08.17 16:16:27  2
01   GHI  2013.06.08 23:11:26  3
01   JKL  2012.04.30 04:00:00  4
02   XYZ  2014.08.25 18:15:45  1
02   UVW  2014.04.16 17:45:23  2
03   777  2014.07.15 12:45:12  1
03   747  2012.11.11 04:17:30  2
03   737  2010.12.06 06:43:52  3

 

In the third query transform, add a where-clause Seqno = 1 (and don’t map the Seqno column).

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
02   XYZ  2014.08.25 18:15:45
03   777  2014.07.15 12:45:12
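Outside Data Services, the three-step logic of this solution (sort, number the rows within each group, keep the first) can be sketched in plain Python. This is my own illustration of the algorithm on the sample rows above, not code that DS generates:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("03", "747", "2012.11.11 04:17:30"),
    ("01", "ABC", "2014.09.30 17:45:54"),
    ("02", "UVW", "2014.04.16 17:45:23"),
    ("01", "DEF", "2014.08.17 16:16:27"),
    ("02", "XYZ", "2014.08.25 18:15:45"),
    ("01", "JKL", "2012.04.30 04:00:00"),
    ("03", "777", "2014.07.15 12:45:12"),
    ("01", "GHI", "2013.06.08 23:11:26"),
    ("03", "737", "2010.12.06 06:43:52"),
]

# Query 1: sort by Key ascending, Timestamp descending.
# The yyyy.mm.dd hh:mm:ss strings sort chronologically as text,
# and Python's sort is stable, so two passes give the combined order.
rows.sort(key=itemgetter(2), reverse=True)   # Timestamp descending
rows.sort(key=itemgetter(0))                 # then Key ascending

# Query 2: the equivalent of gen_row_num_by_group(Key):
# restart a running number at 1 for every new Key value.
numbered = []
for key, group in groupby(rows, key=itemgetter(0)):
    for seqno, row in enumerate(group, start=1):
        numbered.append(row + (seqno,))

# Query 3: where-clause Seqno = 1, and drop the Seqno column.
latest = [row[:3] for row in numbered if row[3] == 1]
print(latest)
```

Running the sketch yields exactly the three rows of the output table above, one per key.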

 

 

 

Solution #2: Use a join

 

Suppose we’re talking Big Data here: there are millions of records in the source table. On HANA, obviously. Although the sort is pushed down to the database, the built-in function is not. Therefore every single record has to be pulled into DS memory and eventually written back to the database.

 

Now consider this approach:

[Image: df2.png]

 

The first query transform selects only two columns from the source table: Key and Timestamp. Define a group-by on Key and set the mapping for Timestamp to max(Timestamp).

 

Key  Timestamp
01   2014.09.30 17:45:54
02   2014.08.25 18:15:45
03   2014.07.15 12:45:12

 

In the second query transform, (inner) join this result to the source table on Key and Timestamp, and map all columns from the source table to the output.

 

Key  Att  Timestamp
01   ABC  2014.09.30 17:45:54
02   XYZ  2014.08.25 18:15:45
03   777  2014.07.15 12:45:12

 

If you uncheck bulk loading of the target table, you’ll notice that the full SQL (read and write) is pushed down to the underlying database. And your job will run so much faster!
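The group-by-max plus inner-join logic can likewise be sketched in plain Python (my own illustration on the sample rows; in the actual job this work happens inside the database, not in DS memory):

```python
rows = [
    ("03", "747", "2012.11.11 04:17:30"),
    ("01", "ABC", "2014.09.30 17:45:54"),
    ("02", "UVW", "2014.04.16 17:45:23"),
    ("01", "DEF", "2014.08.17 16:16:27"),
    ("02", "XYZ", "2014.08.25 18:15:45"),
    ("01", "JKL", "2012.04.30 04:00:00"),
    ("03", "777", "2014.07.15 12:45:12"),
    ("01", "GHI", "2013.06.08 23:11:26"),
    ("03", "737", "2010.12.06 06:43:52"),
]

# First transform: group by Key, Timestamp mapped to max(Timestamp).
# (The yyyy.mm.dd hh:mm:ss strings sort chronologically as text.)
max_ts = {}
for key, att, ts in rows:
    if key not in max_ts or ts > max_ts[key]:
        max_ts[key] = ts

# Second transform: inner join back to the full source table
# on (Key, Timestamp), keeping all source columns.
latest = [(key, att, ts) for key, att, ts in rows if ts == max_ts[key]]
print(sorted(latest))
```

Only one full pass over the source plus a cheap lookup per row is needed, which mirrors why the pushed-down aggregate-and-join SQL beats pulling every record into DS memory.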

 

Note: This second approach produces correct results only if there are no duplicate most recent timestamps within a given primary key. If two records for the same key share the maximum timestamp, the join returns both of them.
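A tiny Python sketch of that edge case (hypothetical data, my own illustration) shows the join returning two rows for one key:

```python
# Two records for key "01" share the same most recent timestamp.
rows = [
    ("01", "ABC", "2014.09.30 17:45:54"),
    ("01", "DEF", "2014.09.30 17:45:54"),  # duplicate max timestamp
    ("02", "XYZ", "2014.08.25 18:15:45"),
]

# Same group-by-max plus join logic as in Solution #2.
max_ts = {}
for key, att, ts in rows:
    if key not in max_ts or ts > max_ts[key]:
        max_ts[key] = ts

latest = [r for r in rows if r[2] == max_ts[r[0]]]
print(latest)  # key "01" appears twice: both ABC and DEF survive the join
```

Solution #1 does not have this problem: the row numbering always picks exactly one record per key (whichever tying row the sort happens to put first).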

