Simply Salford Blog

Sanitizing Data: Keep the Details of Your Data Mining Project Private

Posted by Dan Steinberg on Mon, Apr 22, 2013 @ 11:04 AM

The SPM software suite includes a handy utility for changing all the variable names in your data to uninformative labels such as X01, X02, etc. To convert a data set this way, just follow this pattern:


USE boston.csv
SAVE mystery.csv
SANITIZE

The classic output reports the name changes, and the names of the input and output files thus:
 ===============
 Data Step (RUN)
 ===============


Salford Predictive Modeler(R) software suite: SPM(R) Data Step version 7.0.0.459

s/CRIM/X01/gi
s/ZN/X02/gi
s/INDUS/X03/gi
s/CHAS/X04/gi
s/NOX/X05/gi
s/RM/X06/gi
s/AGE/X07/gi
s/DIS/X08/gi
s/RAD/X09/gi
s/TAX/X10/gi
s/PT/X11/gi
s/B/X12/gi
s/LSTAT/X13/gi
s/MV/X14/gi

 Records Read: 506
 Records Kept: 506

 \\psf\Home\Desktop\Demos\mystery.csv created with 506 records, 14 variables.

There are a number of possible motivations for this type of file conversion, including sending examples of your data to tech support when you need to keep the details of your analysis private.

Note that the conversion changes only variable names, not variable contents. If you had a variable named STATE$ with values like "AZ", "CA", "NY", changing the name would do little to obscure the true content of the data. But changing the names of a large number of continuous variables is quite effective in making your data very difficult for anyone to understand.
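To make the point about names versus contents concrete, here is a minimal Python sketch of the same idea outside SPM. It is not how SANITIZE is implemented; the function name `sanitize_csv` and the three-column sample are assumptions made up for illustration. Only the header row is rewritten, so the STATE$ values pass through untouched and remain easy to recognize.

```python
import csv
import io


def sanitize_csv(src, dst):
    """Copy CSV rows from src to dst, replacing only the header row.

    Hypothetical helper for illustration; not part of SPM.
    Returns a dict mapping each original name to its new label.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    # Zero-pad labels so they sort naturally: X01, X02, ...
    width = max(2, len(str(len(header))))
    new_names = [f"X{i:0{width}d}" for i in range(1, len(header) + 1)]
    writer.writerow(new_names)
    # Data rows are copied unchanged: contents are NOT obscured.
    writer.writerows(reader)
    return dict(zip(header, new_names))


# Made-up sample data with one categorical column.
sample = io.StringIO("CRIM,ZN,STATE$\n0.00632,18,AZ\n0.02731,0,NY\n")
out = io.StringIO()
mapping = sanitize_csv(sample, out)
print(mapping)
print(out.getvalue())
```

In the printed result the header carries only the generic labels, while "AZ" and "NY" still appear in the third column, which is exactly the limitation described above.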

If you're interested in how to combine the GUI and command line in SPM for optimal results, you should check out this video series.

Topics: SPM, command line