Precautions when using RETAIN

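The RETAIN statement allows the value of a data step variable to persist from one iteration of the data step to the next. Without RETAIN, variables created within the data step (by assignment or INPUT statements) are reset to missing in the Program Data Vector (PDV) at the start of each iteration. Variables read from a dataset with SET or MERGE are retained automatically, but are overwritten when the next observation is read. RETAIN is a very useful feature in many cases, but it can also be dangerous.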
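As a minimal sketch of the difference, the step below creates two variables by assignment and retains only one of them. The dataset name demo and the variables x, kept and not_kept are made up for illustration.

```sas
* Hypothetical demo: compare a retained and a non-retained variable;
data demo;
    input x;
    retain kept;
    if x ne . then do;
        kept = x;
        not_kept = x;
    end;
    datalines;
1
.
3
;
run;

proc print data=demo;
run;
```

On the second observation, where x is missing, kept still holds 1 from the previous iteration, while not_kept has been reset to missing.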

When processing clinical data, for example, you may well want to retain information from one record for a given patient in order to refer to it while processing a later observation for the same patient. However, in almost all cases you will not want to retain information from one patient when processing a later patient's data - doing so will lead to the first patient's data getting mixed up with another's, and that's not something you want coming up as an audit finding.

As a programmer, you should always be very careful to explicitly clear the values of retained variables between patients, and as a code reviewer, you should be looking for this to have been applied.

Here's a simple example. The following code is intended to carry forward the last non-missing data point for each subject into a single output record per patient:

    proc sort data=vitals1;
        by patient visit_date;
    run;

    data vitals2;
        set vitals1;
        by patient;
        retain ret_pulse;

        * Store any non-missing values;
        if pulse ne . then ret_pulse = pulse;

        * Output the last known non-missing value;
        if last.patient then do;
            pulse = ret_pulse;
            output;
        end;
    run;

This looks reasonable at first glance, and will work correctly on most real-world datasets. However, if your input dataset contains a patient with no non-missing values for pulse, the retained value from the previous patient's data will still be present in ret_pulse and will end up being written out as a valid data point for a completely different patient.
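The failure can be reproduced with a small test dataset. The sketch below is illustrative only: the patient numbers, dates and pulse values are invented, and the only assumption about vitals1 is that it holds the three variables used in the code above.

```sas
* Hypothetical test data: patient 102 has no non-missing pulse values;
data vitals1;
    input patient visit_date :date9. pulse;
    format visit_date date9.;
    datalines;
101 01JAN2020 72
101 08JAN2020 .
102 02JAN2020 .
102 09JAN2020 .
103 03JAN2020 64
;
run;
```

Running the data step above against this input writes a pulse of 72 for patient 102, a value that was actually recorded for patient 101.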

To avoid this, always explicitly reset all retained variables at the beginning of each patient's data:

    proc sort data=vitals1;
        by patient visit_date;
    run;

    data vitals2;
        set vitals1;
        by patient;
        retain ret_pulse;

        * Clear the retained variable;
        if first.patient then ret_pulse = .;

        * Store any non-missing values;
        if pulse ne . then ret_pulse = pulse;

        * Output the last known non-missing value;
        if last.patient then do;
            pulse = ret_pulse;
            output;
        end;
    run;

Any code lacking this safety net should fail code review, even if your input data "should never" trigger the bug. This is also a good example of the type of potentially disastrous bug that might avoid detection with poorly-designed test datasets, but can easily be spotted during code review.
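When building test data, one way to confirm that the triggering case is actually present is to list the patients who have no non-missing pulse values at all. This is only a sketch, assuming the vitals1 dataset and variable names from the example above. It does not replace the reset on first.patient.

```sas
* List patients with no non-missing pulse values;
* COUNT(pulse) counts only non-missing values;
proc sql;
    select patient
    from vitals1
    group by patient
    having count(pulse) = 0;
quit;
```

If this query returns no rows for a test dataset, that dataset cannot expose the missing reset, however many records it contains.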
