Using PROC SQL

PROC SQL is provided as part of base SAS and is a very comprehensive implementation of this standard data query and data manipulation language. You can find more information about SQL here and SAS's own documentation for PROC SQL is here (version 8) and here (version 9).

Having used PROC SQL more and more over the years, I'm now of the opinion that it should be your first choice when writing the bulk of your data manipulation and reporting code. Data step code still has its place, of course, but I would recommend using PROC SQL whenever possible, and data step only when necessary.

So, why is SQL so useful? The main advantage is that it is so compact. Even allowing for the verbose nature of most SQL statements, you can still achieve more in a few lines of SQL than you can in twice as many lines of plain data steps and PROC SORTs. Data steps are quite fragile, in that they often require that their input dataset is sorted in a specific order; this means that, whenever you write a BY statement in a data step, you have to either (a) add a preceding PROC SORT to get the data into the right order, or (b) analyse the preceding steps to ensure that your input data is guaranteed to be in the correct order. PROC SQL doesn't have this requirement - you never need to worry about the ordering of your input data.

Example 1
Here's an example: we want to take a demography dataset DEMOG and add to it the variable TRTCODE from a treatment dataset, TRT. We can assume that TRT contains just one record per subject, but we can't assume anything about the ordering of either dataset. For now, let's also assume that the TRT dataset only contains variables SUBJID and TRTCODE. Without using SQL, we would need two PROC SORTs and a data step:

proc sort data=indata.demog out=demog; by subjid; run; proc sort data=indata.trt out=trt; by subjid; run; data merged; merge demog (in=a) trt; by subjid; if a; run;

Using SQL, this reduces to a single statement:

proc sql; create table merged as select a.*, b.trtcode from demog.a left join trt.b on a.subjid = b.subjid ; quit;

Even using my indenting style, PROC SQL gets the job done in six lines of code instead of 11, reducing the chance of errors and making the code easier to read and easier to maintain. More significant than that, though, is the difference in feel between the two examples; the SQL is cleaner and more specific about its intentions - merge ('join' in SQL terminology) the variable TRTCODE from TRT onto DEMOG. The data step code, although it works for now, introduces an insidious bug - being inherently lazy, I didn't bother to restrict the merge to only take variable TRTCODE from TRT. Let's say someone later adds a variable called VISIT_DATE to TRT - the datastep code will take that variable from TRT and overwrite the existing VISIT_DATE in DEMOG - disaster! The SQL version, being more explicit about its intentions (despite also being much shorter) is safe from this bug.

Example 2
Here's another example, with a more complex requirement. Starting from the same two data sets, we need to find out how many subjects in each treatment group (that is, with each value of TRTCODE) are missing from the DEMOG dataset. Without SQL, this is fairly long-winded:

proc sort data=indata.demog out=demog; by subjid; run; proc sort data=indata.trt out=trt; by subjid; run; data no_dem; merge demog (in=a) trt (in=b); by subjid; if b and not a; run; proc means data=no_dem noprint; class trtcode; output out=result (where=(trtcode ne '') keep=trtcode n) n=n; run;

There are several potential problems here: there are two unnecessary PROC SORTs, and the merge relies on the statement if b and not a being coded correctly. This particular solution to the problem required 15 lines of code.

Here's the SQL version:

proc sql; create table result as select distinct trtcode, count(distinct subjid) as n from trt where subjid not in (select subjid from demog) group by trtcode ; quit;

Six lines of code, just one-third of the non-SQL version. And again, the SQL solution is not only much more concise, it's more concise precisely because it is much more explicit and straight-to-the-point about what it's setting out to achieve. All-in-all, a vast improvement in readability, robustness and ease of maintenance.

Learning PROC SQL
I hope that the examples above show that using PROC SQL can save you time and trouble in many cases. However, you may still feel that you'd rather stick with data step than start to learn SQL. That's understandable - SQL is a huge and complex language in its own right, so learning it as well as SAS may seem daunting. Remember though that SQL is very widely used in all database systems, so there is a wealth of SQL information available on the web. Google lists many PROC SQL tutorials and there are even more out there for SQL in general. As a bonus, once you've learnt PROC SQL, you'll also know how to write SQL queries for almost every database system in existence, which won't do your career any harm.

As with most things, the best way to learn SQL is to simply start playing with it and trying a few things out. You'll soon find that you can write very powerful searching, merging and summarising queries with just a few lines of code. Give it a try!

PROC SQL tip
My one SQL tip for now is this - don't forget to use the distinct keyword when summarising data! You will often see this pattern in SAS code:

* Count number of visits per subject; proc sql; create table out1 as select subjid, count(visit_date) as visits from in1 group by subjid ; quit; * Remove duplicates; proc sort data=in1 nodupkey; by subjid; run;

The novice programmer has added the PROC SORT to "remove duplicates" because, by default, SQL outputs one row for every row in the input, even when you apply a summarisation function like count(). If you don't want this, just add the distinct keyword to the select statement; you can then get rid of the extra PROC SORT:

* Count number of visits per subject; proc sql; create table out1 as select distinct subjid, count(visit_date) as visits from in1 group by subjid ; quit;

Comments