********************************************** * INTERNAL VALIDATION OF LOGISTIC REGRESSION * * MODELS BY BOOTSTRAPPING CALIBRATION AND * * PERFORMANCE (C-INDEX) MEASURES * ********************************************** * (c) Marta Garcia-Granero (04/2010) * * mgarciagranero@gmail.com * * Downloaded from: http://gjyp.nl/marta/ * * Feel free to use or modify this code, but * * acknowledge the author and the web site * ********************************************** **************************************************************************** * WARNING: since the macro uses MATRIX, variable name lenght is limited to * * 8 characters, and all variables must be numeric (no strings) * ****************************************************************************. * MACRO DEFINITION *. DEFINE BOOTLOGREG (K=!DEFAULT (100)!TOKENS(1)/ DEP=!TOKENS(1)/ INDEP=!ENCLOSE('(',')')/ CONTRAST=!CMDEND). * Give name to working dataset *. DATASET NAME OriginalData. * Create output windows for discarded and useful output *. OUTPUT NEW TYPE=VIEWER NAME=Scratch. OUTPUT NEW TYPE=VIEWER NAME=LogRegOutput. ********************************** *** CREATING BOOTSTRAP SAMPLES *** **********************************. OUTPUT ACTIVATE Scratch. * Creating k (default=100) Boot Samples *. DATASET DECLARE BootData. PRESERVE. * Use Mersenne Twister as random number generator and set MXLOOPS equal to k *. SET RNG=MT MTINDEX=RANDOM. SET MXLOOPS=!k. * Matrix program *. MATRIX. GET DATA /VAR=ALL /NAMES=VNames. COMPUTE n=NROW(data). COMPUTE p=NCOL(data). * Put original data in first block of bootstrapped samples *. COMPUTE BootData={MAKE(n,1,0),Data}. * Number of bootstrap samples, can be increased, but then MXLOOPS must be increased too *. COMPUTE k=!k. COMPUTE BootSamp=MAKE(n,p,0). LOOP i= 1 TO k. . COMPUTE flipcoin=1+TRUNC(n*UNIFORM(n,1)). . COMPUTE BootSamp=Data(flipcoin,:). . COMPUTE BootData={BootData;MAKE(n,1,i),BootSamp}. END LOOP. * Export Bootsamples to new dataset *. COMPUTE VNames={'BootNr',Vnames}. SAVE BootData/OUTFILE=BootData /NAMES=VNames. END MATRIX. RESTORE. * Formatting variables in BootData *. DATASET ACTIVATE BootData. APPLY DICTIONARY /FROM OriginalData /SOURCE VARIABLES = ALL /FILEINFO /VARINFO ALIGNMENT FORMATS LEVEL MISSING VALLABELS = REPLACE ATTRIBUTES = REPLACE VARLABEL WIDTH . FORMAT BootNr (F8). VALUE LABEL BootNr 0'Original Dataset'. * Next steps are language dependent, English must be used (or the macro modified) *. PRESERVE. SET OLANG=ENGLISH. ********************************* *** LOGISTIC REGRESSION MODEL *** *********************************. * Running the original model (from BootNr = 0 cases) on the other 100 samples *. LOGISTIC REGRESSION VARIABLES !DEP /SELECT = BootNr EQ 0 /METHOD = ENTER !INDEP !CONTRAST /SAVE = PRED /PRINT = GOODFIT CI(95). * Compute 101 c-indexes (original dataset and 100 bootsamples) *. DATASET DECLARE BootAUC. OMS /SELECT TABLES /IF COMMANDS = ["ROC Curve"] SUBTYPES = ["Area Under the Curve"] /DESTINATION FORMAT = SAV OUTFILE = BootAUC. OMS /SELECT TABLES /IF COMMANDS = ["ROC Curve"] SUBTYPES = ["Case Processing Summary"] /DESTINATION VIEWER = NO. SPLIT FILE LAYERED BY BootNr . ROC PRE_1 BY !DEP (1) /PLOT = NONE /PRINT = SE /CRITERIA = CI(95). OMSEND. ************************************************* * C-Index for original and bootstrapped samples * *************************************************. OUTPUT ACTIVATE LogRegOutput. DATASET ACTIVATE BootAUC. DELETE VARIABLES Command_ Subtype_ Label_ Var2. OMS /SELECT TABLES /IF COMMANDS = ["Summarize"] SUBTYPES = ["Case Processing Summary"] /DESTINATION VIEWER = NO. SUMMARIZE /TABLES=ALL /FORMAT=LIST NOCASENUM NOTOTAL LIMIT=1 /TITLE='Performance (C-index) Statistics from Original Dataset' /CELLS=NONE. OMSEND. ECHO 'Summary Performance (C-index) Statistics from Bootsamples'. TEMPORARY. SELECT IF ($casenum GT 1). FREQUENCIES VARIABLES=Area /FORMAT=NOTABLE /PERCENTILES= 2.5 97.5 /STATISTICS=STDDEV MEAN MEDIAN. ***************************************************** * Calibration for original and bootstrapped samples * *****************************************************. DATASET ACTIVATE BootData. DATASET CLOSE BootAUC. RANK VARIABLES=PRE_1(A) BY BootNr /NTILES (10) /PRINT=NO /TIES=MEAN . SPLIT FILE OFF. SORT CASES BY BootNr NPRE_1 . DATASET DECLARE Calibration. AGGREGATE /OUTFILE='Calibration' /PRESORTED /BREAK=BootNr NPRE_1 /Predicted= MEAN(PRE_1) /case_sum = SUM(!DEP) /N=N. DATASET ACTIVATE Calibration. COMPUTE Observed = case_sum/N . SPLIT FILE LAYERED BY BootNr . DATASET DECLARE Slopes. OUTPUT ACTIVATE Scratch. OMS /SELECT TABLES /IF COMMANDS = ["Regression"] SUBTYPES = ["Coefficients"] /DESTINATION FORMAT = SAV OUTFILE = Slopes. REGRESSION /STATISTICS COEFF OUTS CI /NOORIGIN /DEPENDENT Observed /METHOD=ENTER Predicted. OMSEND. DATASET ACTIVATE Slopes. DATASET CLOSE Calibration. DELETE VARIABLES Command_ Subtype_ Label_ Var2. OUTPUT ACTIVATE LogRegOutput. TEMPORARY. SELECT IF Var1 EQ 'Original Dataset'. OMS /SELECT TABLES /IF COMMANDS = ["Summarize"] SUBTYPES = ["Case Processing Summary"] /DESTINATION VIEWER = NO. SUMMARIZE /TABLES=ALL /FORMAT=LIST NOCASENUM NOTOTAL /TITLE='Calibration Statistics from Original Dataset' /CELLS=NONE. OMSEND. SORT CASES BY Var3(A). SPLIT FILE LAYERED BY Var3 . ECHO 'Summary Calibration Statistics from Bootsamples'. TEMPORARY. SELECT IF Var1 NE 'Original Dataset'. FREQUENCIES VARIABLES=b /FORMAT=NOTABLE /PERCENTILES= 2.5 97.5 /STATISTICS=STDDEV MEAN MEDIAN. DATASET ACTIVATE OriginalData. DATASET CLOSE Slopes. OUTPUT CLOSE Scratch. RESTORE. !ENDDEFINE. * EXAMPLES OF MACRO CALLs *. * First, get Shapiro Dataset from http://gjyp.nl/marta/ShapiroDataset.sav *. * VARIABLES (DEPENDENT & INDEPENDENT) GO FIRST CONTRASTS FOR CATEGORICAL PREDICTORS GO LAST WARNING: ALL VARIABLES MUST BE NUMERIC (NO STRINGS), WITH SHORT NAMES, AND DEPENDENT VARIABLE MUST BE 0/1 CODED (NOT 1/2) *. BOOTLOGREG DEP=case INDEP=(Smoke OCU MeanAge OCU*Smoke) CONTRAST=/CONTRAST (Smoke)=Indicator(1) /CONTRAST (OCU)=Indicator(1). * K (NUMBER OF BOOT REPS.) CAN BE CHANGED TOO (WITH HIGHER VALUES THE MACRO WILL TAKE LONGER) *. BOOTLOGREG K=200 DEP=case INDEP=(Smoke OCU MeanAge OCU*Smoke) CONTRAST=/CONTRAST (Smoke)=Indicator(1) /CONTRAST (OCU)=Indicator(1).