Using Separate Snapshot Tables

One of the points identified in the previous section is that snapshot queries and database views are less likely to perform as well as direct access to database tables. As shown in on the SNOMED CT example database confirm that there is a substantial performance difference between these two approaches.

Testing Performance of Queries on Snapshot Tables and Snapshot Views

Snapshot Table
seconds

Snapshot Views

These views are defined using the general form described in .

seconds

Performance Ratio

Read 1 million rows from relationship snapshot

1.52

11.06

15%

Read 1 million rows from description snapshot

3.57

12.73

28%

Read all rows from concept snapshot

0.66

2.45

26%

Total time for all operations above

5.75

25.74

22%

Advanced test reading 10,000 relationships and with joins to descriptions and language reference set for the fully specified names of source, type and target concept

2.34

4.70

50%

Based on these findings the most effective way to optimize access to a snapshot view, is to replace the use of database views with snapshot tables. Representing the a snapshot with tables, rather than using a database view, adds roughly 2.6 Gb to the storage requirements for the example database.

The current snapshot view is essential and is used for most interactions with the database. Therefore, the performance enhancements justify use of the additional disk space required to store the current snapshot in separate tables. If there are specific reasons for extensive access to one or two retrospective snapshots, it might also be worthwhile representing those snapshots in separate tables. However, it would not be worthwhile to apply the same approach to the full range of less frequently used retrospective snapshots. Therefore, if the snapshot views defined in do not perform sufficiently well, it is worth considering ways to optimize snapshot access.

Unlike the views described earlier, the optimization methods described below require additional columns to be added to each of the full release tables. After importing release files, the tables are processed to generate values for the additional columns and this data is used to simplify the snapshot view queries by avoiding the need for nested queries.

Snapshot Flags Optimization Method

Overview

A column is added to each full release table. This column is used to represent flags that indicate which snapshot view each row is included in.

Practical Example

A single 64-bit integer column called flag is added to all full release tables with a default value of 0 (zero).
A distinct number which is a power of 2 between 2⁰ and 2⁶³ is assigned to each required retrospective snapshot time.
The flag column in each row is set to the sum of the values of all the snapshots in which that row appears.
- To be precise this means that a specific bit in the flag value is set if the row is part of a particular snapshot and is not set if it is not part of that snapshot.
Once this process is complete, it is possible to select the rows of a retrospective snapshot with a simple query that tests the relevant bit in the flag column value.
- This avoids the need for the nested query required to identify rows that are part of a snapshot.

The example SQL below illustrates the process of flag setting. In practice, while this query works it is not very efficient for several reasons. A more efficient approach would be to use a stored procedure that computes the full set of flags applicable to each row. This approach would allow the flag column in each row to be updated once rather than requiring a separate update for each snapshot view.

A configurable snapshot view can then be created tests the appropriate bit rather than requiring a nested query.

SET SQL_SAFE_UPDATES=0;
update full_tableName tbl
 set flag=flag | 1
 where tbl.effectiveTime=(select max(sub.effectiveTime) from copy_full_tableName sub where sub.id=tbl.id and sub.effectiveTime<='20190731');
update full_tableName tbl
 set flag=flag | 2
 where tbl.effectiveTime=(select max(sub.effectiveTime) from copy_full_tableName sub where sub.id=tbl.id and sub.effectiveTime<='20190131');
update full_tableName tbl
 set flag=flag | 4
 where tbl.effectiveTime=(select max(sub.effectiveTime) from copy_full_tableName sub where sub.id=tbl.id and sub.effectiveTime<='20180731');
update full_tableName tbl
 set flag=flag | 8
 where tbl.effectiveTime=(select max(sub.effectiveTime) from copy_full_tableName sub where sub.id=tbl.id and sub.effectiveTime<='20180131');
SET SQL_SAFE_UPDATES=1;

Once the flags are set, a query such as the one below can be used to return component versions that are part of a particular snapshot view. The example query returns row in which the flag the second bit (value 2) is set. Based on the settings in the query above this means it would include rows that are part of the 2019-01-31 snapshot view of this table

select * from full_tableName where flag & 2;

Performance

Limited testing of this optimization approach indicates that it is between 2 and 3 times faster than the unoptimized snapshot views. Direct access to a snapshot table is still twice as fast as this optimized approach.

Storage Requirements

The full release files in the 2019-07-31 release contain a total of approximately 16 million rows. If flags are added each of these rows will require a further 8 bytes of storage. No additional indexes are required to support this optimization. As a result the overall increase in storage requirements to support this optimization is less than 150 Mb.

As described here the approach is limited to 64 snapshot times. This is probably more than sufficient for most practical requirements. However, it and could be extended by adding an another flag column or by changing the data type of the flag column to binary.

Disadvantages

The process of setting the flags required for this approach adds significantly to the time taken to build the database.
Adding an additional column to every table means that queries using "SELECT * FROM ... " will return a flag column that is not part of the original SNOMED CT data.
The flag values are technically essential to the process but this may not be apparent to anyone exploring the database.
Significantly slower than direct access to snapshot tables. Optimum current snapshot performance still requires the snapshot table.

Advantages

A significant improvement in retrospective snapshot performance compared with unoptimized tables.
Minimal impact on disk capacity (adds less that 5% to the size of the full release tables).
Provides a fallback option for the current snapshot view if storage capacity is limited.

Superseded Time Optimization Method

Overview

An additional datetime column is added to each row. This column is used to represent the time at when a row was replaced by the next version of that component or reference set member.

Practical Example

A single datetime column called supersededTime is added to all full release tables with a default value of a long distant future date (e.g. 9999-12-31).
Each full table is queried to establish the sequence of versions of each component or reference set member in effectiveTime order.
The supersededTime of a component that have been updated is set to the effectiveTime of the immediately following version of that component
Once this process is complete each component is part of all snapshot views with a snapshot time greater than or equal to its effectiveTime and less than its supersededTime.
- This can be tested without the need for a nested query.

The example SQL below illustrates the process of flag setting. In practice, while this query works it is not very efficient for several reasons. A more efficient approach would be to use a stored procedure that computes the full set of flags applicable to each row. This approach would allow the flag column in each row to be updated once rather than requiring a separate update for each snapshot view.

-- Create temporary table for the supersededTime values
CREATE TEMPORARY TABLE tmp (id CHAR(36) NOT NULL,effectiveTime DATETIME,supersededTime DATETIME, PRIMARY KEY (id,effectiveTime));

-- Compute the supersededTime values for each combination of id+effectiveTime and add these to the temporary file
INSERT INTO tmp SELECT tbl.id, tbl.effectiveTime, (SELECT IFNULL(MIN(sub.effectiveTime),DATE "99991231") FROM full_tableName sub
    WHERE tbl.id=sub.id AND tbl.effectiveTime<sub.effectiveTime) supersededTime FROM full_tableName tbl;

-- Apply the appropriate supersededTime values to each row in the full table
UPDATE full_tableName tbl
    JOIN tmp
    SET tbl.supersededTime=tmp.supersededTime
        WHERE tmp.id=tbl.id AND tmp.effectiveTime=tbl.effectiveTime;

Once the superseded time values are set, a query such as the one below can be used to return component versions that are part of a particular snapshot view. The example query returns row in which the flag the second bit (value 2) is set. Based on the settings in the query above this means it would include rows that are part of the 2019-01-31 snapshot view of this table

-- This query assumes the default supersededTime is 9999-12-31
SELECT *
    FROM full_tableName
    WHERE supersededTime = DATE '99991231';

-- This query assumes that [snapshotTime] is replaced by the required snapshotTime
SELECT *
    FROM full_tableName
    WHERE [snapshotTime] >= effectiveTime AND [snapshotTime] < supersededTime;

Performance

Past experience indicates that this approach is 2 times faster than the unoptimized snapshot views

Previous tests in MySQL 5.7 were 3 times faster than unoptimized snapshot views. However, MySQL 8.0 seems to have enhanced the performance of the unoptimized queries without significantly improving the results of this approach to optimization.

. However, this figure varies depending on the complexity of the queries it is used in. Direct access to a snapshot table is between 2 and 3 times as fast as this optimized approach.

Storage Requirements

The full release files in the 2019-07-31 release contain a total of approximately 16 million rows. If a datetime column is added, each of these rows will require a further 5 bytes of storage. Database designs using this additional column also included additional indexes including supersededTime

The performance impact of removing some of these indexes was not tested, so it is unclear if benefits could still be delivered by this approach without these indexes.

. As a result, the storage to fully support this approach required an additional 750 Mb.

Disadvantages

The process of setting the supersededTime required for this approach adds significantly to the time taken to build the database.
Adding an additional column to every table means that queries using "SELECT * FROM ... " will return the supersededTime column that is not part of the original SNOMED CT data.
The supersededTime values are technically essential to the process but this may not be apparent to anyone exploring the database.
The use of supersededTime together with associated indexes increases the storage capacity required for the full release tables by approximately 20%.
Significantly slower than direct access to snapshot tables and since the introduction of MySQL 8.0 are also slower than the snapshot flagging method.

Advantages

An improvement in retrospective snapshot performance compared with unoptimized tables but is out-performed by the snapshot flagging method.
A fallback option for the current snapshot view if storage capacity is limited but requires more storage than the snapshot flagging method.
Supports an unlimited number of snapshot times.