You have a table that contains multiple timestamped records for each key value:
Key | Att | Timestamp |
---|---|---|
03 | 747 | 2012.11.11 04:17:30 |
01 | ABC | 2014.09.30 17:45:54 |
02 | UVW | 2014.04.16 17:45:23 |
01 | DEF | 2014.08.17 16:16:27 |
02 | XYZ | 2014.08.25 18:15:45 |
01 | JKL | 2012.04.30 04:00:00 |
03 | 777 | 2014.07.15 12:45:12 |
01 | GHI | 2013.06.08 23:11:26 |
03 | 737 | 2010.12.06 06:43:52 |
The required output is the most recent record for every key value:
Key | Att | Timestamp |
---|---|---|
01 | ABC | 2014.09.30 17:45:54 |
02 | XYZ | 2014.08.25 18:15:45 |
03 | 777 | 2014.07.15 12:45:12 |
Solution #1: Use the gen_row_num_by_group function
Build a dataflow as follows:
In the first query transform, sort the input stream by Key (ascending) and Timestamp (descending). The sort will be pushed down to the underlying database, which is often good for performance.
Key | Att | Timestamp |
---|---|---|
01 | ABC | 2014.09.30 17:45:54 |
01 | DEF | 2014.08.17 16:16:27 |
01 | GHI | 2013.06.08 23:11:26 |
01 | JKL | 2012.04.30 04:00:00 |
02 | XYZ | 2014.08.25 18:15:45 |
02 | UVW | 2014.04.16 17:45:23 |
03 | 777 | 2014.07.15 12:45:12 |
03 | 747 | 2012.11.11 04:17:30 |
03 | 737 | 2010.12.06 06:43:52 |
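In the second query transform, add a column Seqno and map it to gen_row_num_by_group(Key). The function numbers the rows within each run of identical Key values, starting again at 1 whenever Key changes, which is why the input must be sorted first.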
Key | Att | Timestamp | Seqno |
---|---|---|---|
01 | ABC | 2014.09.30 17:45:54 | 1 |
01 | DEF | 2014.08.17 16:16:27 | 2 |
01 | GHI | 2013.06.08 23:11:26 | 3 |
01 | JKL | 2012.04.30 04:00:00 | 4 |
02 | XYZ | 2014.08.25 18:15:45 | 1 |
02 | UVW | 2014.04.16 17:45:23 | 2 |
03 | 777 | 2014.07.15 12:45:12 | 1 |
03 | 747 | 2012.11.11 04:17:30 | 2 |
03 | 737 | 2010.12.06 06:43:52 | 3 |
In the third query transform, add a where-clause Seqno = 1 (and don’t map the Seqno column).
Key | Att | Timestamp |
---|---|---|
01 | ABC | 2014.09.30 17:45:54 |
02 | XYZ | 2014.08.25 18:15:45 |
03 | 777 | 2014.07.15 12:45:12 |
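The three steps above can be mimicked outside Data Services to check the logic. The following is only a minimal Python sketch, not the dataflow itself: the names `rows`, `ordered`, `numbered` and `latest` are made up for illustration, and the timestamps are kept as strings because this particular format happens to sort correctly as text.

```python
from itertools import groupby
from operator import itemgetter

# Sample data from the table above: (Key, Att, Timestamp)
rows = [
    ("03", "747", "2012.11.11 04:17:30"),
    ("01", "ABC", "2014.09.30 17:45:54"),
    ("02", "UVW", "2014.04.16 17:45:23"),
    ("01", "DEF", "2014.08.17 16:16:27"),
    ("02", "XYZ", "2014.08.25 18:15:45"),
    ("01", "JKL", "2012.04.30 04:00:00"),
    ("03", "777", "2014.07.15 12:45:12"),
    ("01", "GHI", "2013.06.08 23:11:26"),
    ("03", "737", "2010.12.06 06:43:52"),
]

# Step 1: sort by Key ascending, Timestamp descending.
# Python's sort is stable, so sort on the secondary key first.
ordered = sorted(rows, key=itemgetter(2), reverse=True)
ordered = sorted(ordered, key=itemgetter(0))

# Step 2: number the rows within each Key group, starting at 1
# (a stand-in for what gen_row_num_by_group(Key) does in the dataflow).
numbered = [
    (key, att, ts, seqno)
    for _, group in groupby(ordered, key=itemgetter(0))
    for seqno, (key, att, ts) in enumerate(group, start=1)
]

# Step 3: keep only Seqno = 1 and drop the Seqno column.
latest = [(key, att, ts) for key, att, ts, seqno in numbered if seqno == 1]

for row in latest:
    print(row)
```

Note that `groupby` only groups consecutive rows, so this sketch also depends on the sort in step 1.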
Solution #2: Use a join
Suppose we’re talking Big Data here: there are millions of records in the source table. On HANA. Obviously. Although the sort is pushed down to the database, the gen_row_num_by_group function is not. Therefore every single record has to be pulled into DS memory, and then eventually written back to the database.
Now consider this approach:
The first query transform selects only two columns from the source table: Key and Timestamp. Define a group by on Key and set the mapping for Timestamp to max(Timestamp).
Key | Timestamp |
---|---|
01 | 2014.09.30 17:45:54 |
02 | 2014.08.25 18:15:45 |
03 | 2014.07.15 12:45:12 |
In the second query transform, (inner) join the source table with the output of the first query transform on Key and Timestamp, and map all columns from the source table to the output.
Key | Att | Timestamp |
---|---|---|
01 | ABC | 2014.09.30 17:45:54 |
02 | XYZ | 2014.08.25 18:15:45 |
03 | 777 | 2014.07.15 12:45:12 |
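The same two steps, aggregate and then join back, can be sketched in a few lines of Python to show why the join yields the latest record per key. This is an illustration only; `max_ts` and `latest` are hypothetical names, and the string comparison works only because the timestamp format sorts correctly as text.

```python
# Sample data from the source table: (Key, Att, Timestamp)
rows = [
    ("03", "747", "2012.11.11 04:17:30"),
    ("01", "ABC", "2014.09.30 17:45:54"),
    ("02", "UVW", "2014.04.16 17:45:23"),
    ("01", "DEF", "2014.08.17 16:16:27"),
    ("02", "XYZ", "2014.08.25 18:15:45"),
    ("01", "JKL", "2012.04.30 04:00:00"),
    ("03", "777", "2014.07.15 12:45:12"),
    ("01", "GHI", "2013.06.08 23:11:26"),
    ("03", "737", "2010.12.06 06:43:52"),
]

# First transform: group by Key, keep max(Timestamp).
max_ts = {}
for key, _att, ts in rows:
    if key not in max_ts or ts > max_ts[key]:
        max_ts[key] = ts

# Second transform: inner join the source rows with the aggregate
# on Key and Timestamp, keeping all source columns.
latest = sorted(
    (key, att, ts) for key, att, ts in rows if max_ts.get(key) == ts
)

for row in latest:
    print(row)
```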
If you uncheck bulk loading of the target table, you’ll notice that the full SQL statement (read and write) is pushed down to the underlying database. And your job will run so much faster!
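To give an idea of what a single read-and-write statement of this kind looks like, here is a sketch that runs one against an in-memory SQLite database from Python. It is not the SQL that Data Services actually generates for HANA. The table names `src` and `tgt` and the column names `rec_key`, `att` and `ts` are assumptions made for this example.

```python
import sqlite3

# Sample data from the source table: (Key, Att, Timestamp)
rows = [
    ("03", "747", "2012.11.11 04:17:30"),
    ("01", "ABC", "2014.09.30 17:45:54"),
    ("02", "UVW", "2014.04.16 17:45:23"),
    ("01", "DEF", "2014.08.17 16:16:27"),
    ("02", "XYZ", "2014.08.25 18:15:45"),
    ("01", "JKL", "2012.04.30 04:00:00"),
    ("03", "777", "2014.07.15 12:45:12"),
    ("01", "GHI", "2013.06.08 23:11:26"),
    ("03", "737", "2010.12.06 06:43:52"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (rec_key TEXT, att TEXT, ts TEXT);
    CREATE TABLE tgt (rec_key TEXT, att TEXT, ts TEXT);
""")
conn.executemany("INSERT INTO src VALUES (?, ?, ?)", rows)

# One statement reads, aggregates, joins and writes inside the database;
# no row travels through the client in between.
conn.execute("""
    INSERT INTO tgt (rec_key, att, ts)
    SELECT s.rec_key, s.att, s.ts
    FROM src AS s
    INNER JOIN (
        SELECT rec_key, MAX(ts) AS max_ts
        FROM src
        GROUP BY rec_key
    ) AS m
      ON s.rec_key = m.rec_key AND s.ts = m.max_ts
""")

result = conn.execute(
    "SELECT rec_key, att, ts FROM tgt ORDER BY rec_key"
).fetchall()
print(result)
```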
Note: This second approach produces correct results only if the most recent timestamp is unique within each key value. If two records for the same key share the latest timestamp, the join returns both of them.
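The caveat is easy to reproduce with a small Python sketch. The record with Att `QRS` below is invented for this example and is not part of the original data; it shares the latest timestamp of key 02.

```python
# Key 02 with a hypothetical extra record that ties on the latest timestamp.
rows = [
    ("02", "UVW", "2014.04.16 17:45:23"),
    ("02", "XYZ", "2014.08.25 18:15:45"),
    ("02", "QRS", "2014.08.25 18:15:45"),  # invented duplicate timestamp
]

# Group by Key, keep max(Timestamp).
max_ts = {}
for key, _att, ts in rows:
    if key not in max_ts or ts > max_ts[key]:
        max_ts[key] = ts

# Inner join on Key and Timestamp: both tied rows match.
joined = [(key, att, ts) for key, att, ts in rows if max_ts[key] == ts]

print(len(joined))
for row in joined:
    print(row)
```

With Solution #1 the same input would yield a single row for key 02, since only one of the tied rows can receive Seqno 1; which one depends on the order in which the database returns the tied rows.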