QAQISH

A Program for Fitting Multivariate Binary Regression Clustered Data Models That Allow More Than One Class In Each Cluster

by David Leon
May, 1994


TABLE OF CONTENTS

   Introduction
   Installing and Running QAQISH
   System Requirements
   Program Limits
   Program Output
   Input Data File
   Control File Parameters
   Creating a Control File
   Cautionary Notes
   References
   Program Messages
   Writing a FORTRAN Format Statement


ACKNOWLEDGEMENT

This software was developed by a project of the Population Council funded by the University of North Carolina Evaluation Project (contract number 5-35676). The Evaluation Project is funded by United States Agency for International Development contract number DPE-3060-C-00-1054-1.

************
INTRODUCTION
************

QAQISH is a program for fitting multivariate binary regression clustered data models that allow more than one class in each cluster, with different regressions for each class and for the dependence between and within classes. It is based on the work of Dr. Bahjat Qaqish, presently of the University of North Carolina. This is a FORTRAN program compiled to run under the Microsoft Windows operating system for personal computers.

Before using the program, users should read through this entire document at least once, with the possible exception of the last three sections: "References", "Program Messages", and "Writing a FORTRAN Format Statement".

*****************************
INSTALLING AND RUNNING QAQISH
*****************************

QAQISH is installed simply by copying the file QAQISH.EXE into a directory or sub-directory of the user's choice. If the user plans to make use of the program CONTROL (see the section "Creating a Control File"), CONTROL.EXE should be copied into the same location.

QAQISH (and CONTROL) may be run like any other program under Windows. Choose the File Run command and enter the program name and, if needed, path.
The only additional action by the user when running QAQISH is to enter the label (path and file name, limited to 20 characters) for the control file when prompted. For user inputs to CONTROL, see the section "Creating a Control File".

Note that any scratch files used by QAQISH will be created in the directory that is the default at the time the program is run. For information on disk space requirements for these, see the following section "System Requirements".

*******************
SYSTEM REQUIREMENTS
*******************

QAQISH has been compiled to be run under a Windows operating system; it cannot be run under DOS. (A DOS version of QAQISH can be easily made available, but its potential applications would be limited due to the inability to access extended memory.)

A minimum amount of RAM is required to start the program. However, the more RAM that is available, the larger the applications that can be run. RAM is allocated dynamically within the program as it is needed. If sufficient RAM cannot be allocated to continue, the program will terminate with an error message. (See the section on "Program Messages".)

Sufficient free hard disk space is required for at least one, and sometimes two, scratch files. The amount of disk space for the required scratch file may be a bit tedious to estimate, but it is approximately equal to the sum, over all clusters, of the following quantity calculated for each cluster:

   2 * (NDIMB+1) * N * (N+1)

where NDIMB is the number of model regression parameters and N is the cluster size. The maximum size of the second scratch file, which is only required if a particular array cannot be kept in RAM, is approximately equal to:

   2 * NMAX * NMAX * (NMAX+1) * (NMAX+1)

where NMAX is the size of the largest cluster.

A math co-processor is highly recommended, but it is not required.

Up to four file handles, plus the console, may be needed for a program run. The system configuration file should be set accordingly.
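The two scratch file estimates above can be sketched as a short calculation. This is only an illustration of the stated formulas; the function names are hypothetical and not part of QAQISH, and the results are in whatever storage unit the formulas intend, which this manual does not state.

```python
def first_scratch_estimate(ndimb, cluster_sizes):
    """Sum 2 * (NDIMB+1) * N * (N+1) over all clusters.

    ndimb         -- number of model regression parameters (NDIMB)
    cluster_sizes -- iterable with the size N of each cluster
    """
    return sum(2 * (ndimb + 1) * n * (n + 1) for n in cluster_sizes)


def second_scratch_maximum(nmax):
    """Upper bound 2 * NMAX^2 * (NMAX+1)^2 for the optional second file.

    nmax -- size of the largest cluster (NMAX)
    """
    return 2 * nmax * nmax * (nmax + 1) * (nmax + 1)


if __name__ == "__main__":
    # Hypothetical application: 3 parameters, clusters of size 2 and 3.
    sizes = [2, 3]
    print(first_scratch_estimate(3, sizes))
    print(second_scratch_maximum(max(sizes)))
```

Because the second estimate grows with the fourth power of the largest cluster size, a single large cluster can dominate the disk space needed.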
**************
PROGRAM LIMITS
**************

There are four limiting values incorporated in the program QAQISH. Limits were needed for technical reasons, but can be easily changed in other versions of this program. These limits were set at values which were deemed to be unrestrictive in most, if not all, applications. In reality, the limit a user is most likely to encounter is the amount of RAM her/his system has available for a QAQISH run.

The four program limits in the current version are:

   Maximum number of classes                      :   25
   Maximum number of model regression parameters  :   20
   Maximum cluster size                           :   50
   Maximum number of clusters                     : 1000

**************
PROGRAM OUTPUT
**************

Following is a list of the output produced in a normally terminating QAQISH program run.

   a) Run control specifications, as defined in the control file. The convergence tolerance and initial estimates are shown with six and four decimal places respectively, but QAQISH uses the values on the control file if they are more precise.

   b) The number of data records rejected, the number of clusters, and the size of each cluster

   c) If requested, beta estimates for each pass through the convergence loop

If the estimates converged,

   d) The number of iterations for convergence

   e) For each parameter, the final Beta estimate, robust standard error, naive standard error, and robust Z

   f) A row by row listing of the robust estimates matrix, with correlations in the lower half and variances in the upper half

   g) A row by row listing of the naive estimates matrix, with correlations in the lower half and variances in the upper half

If the estimates did not converge, an appropriate message is output with the last set of Beta estimates.

If the program did not terminate normally, an error message is displayed on the console. If the early termination occurred after the output file had already been opened, the message is also written to the output file.
For more information on these messages, see the section "Program Messages".

There is also one instance in which a warning message may be included in the output file. See the section "Program Messages".

***************
INPUT DATA FILE
***************

The input data file must be readable by a FORTRAN format statement. That is, it must be a fixed field ASCII data file. All data for an observation must be on a single record. QAQISH requires the data to be read from an input record to be sequenced in a particular order. This order is:

   Cluster id
   Class number
   Response variable
   Regressor(s)

Cluster id must be an integer in the range -32768 to 32767. THIS IS NOT CHECKED BY THE PROGRAM! Class number is assumed to be an integer in the range 1 to the number of classes. Response variable is assumed to be 0 or 1. The number of regressors is equal to the number of main effects. If either the class number or response variable value is invalid, the data record is excluded from the analysis.

The data file must be sorted by cluster id.

***********************
CONTROL FILE PARAMETERS
***********************

The user is required to create a control file of parameters for a desired QAQISH program run. The creation of this file is discussed in the next section, "Creating a Control File". This section discusses the parameters a user is required/allowed to specify.

Heading for output: Two lines are allowed for the output heading, each up to 72 characters long including blanks. The user can leave this blank if not wanted.

Data file label: This is the label of the input data file, including any path specification that may be needed. Up to 20 characters are allowed.

Output file label: This is the label of the output file, including any path specification wanted. Up to 20 characters are allowed. PRN may be entered to specify the printer, or CON to specify screen output.
Note, however, that screen output will automatically scroll as the screen fills, and output at one point covers 77 columns, which may be wider than the default window size used in Windows.

The number of classes: The range allowed is 1 to a maximum specified in the QAQISH program. (See "Program Limits".)

The number of model regression parameters: The range allowed is 1 to a maximum specified in the QAQISH program. (See "Program Limits".)

The number of main effects: The range allowed is 1 to a maximum equal to the number of model regression parameters.

Tolerance for convergence: Iteration for convergence will stop if the sum of the absolute changes in all parameters between two iterations is less than the tolerance specified. This field must contain a decimal point and be greater than zero.

Maximum number of iterations for convergence: Iteration for convergence will stop if the maximum number of iterations is reached.

Display iteration estimates option: The character Y (upper case) entered here will cause the estimates to be displayed for each iteration.

GEE level: 1 or 2 is entered for GEE levels 1 or 2 respectively.

Formulae for 3rd/4th order moments: An entry of VMZP will cause the Zhao and Prentice formulae to be used; an entry of VMBQ will cause the exact solution to be used. No other entries are allowed.

FORTRAN format for reading input data: Two lines are allowed for the format statement, each up to 72 characters long. The user can leave the second one blank if it is not needed.

Regression parameter labels: A label, up to 16 characters in length including blanks, may be entered for each parameter.

Regression parameter initial estimates: An initial estimate is required to be entered for each parameter. It must include any required decimal point. NON-ZERO ESTIMATES ARE RECOMMENDED.

Regression specifications: If the number of classes is C, then C + C*(C+1)/2 sets of specifications must be entered, even if a particular regression is not to be done.
These include C main effects regressions, C within-class odds ratios regressions, and C*(C-1)/2 between-class odds ratios regressions. Each set consists of three parts:

   a) Two class numbers; for the main effects regressions, set one of the class numbers to zero. For the within-class odds ratios regressions, set the two class numbers equal to each other. For the between-class odds ratios regressions, enter the two class numbers.

   b) The dimension of the regression; that is, the number of regressors in the regression. If a particular regression is not to be done, this should be set to zero.

   c) Pairs consisting of a regression parameter index and a regressor index for each regressor in the regression. Regression parameter indices range from 1 to the total number of regression parameters; regressor indices range from 1 to the number of main effects.

THE ODDS-RATIOS PARAMETERS MUST BE THE LAST IN THE REGRESSION PARAMETER VECTOR. THIS IS THE USER'S RESPONSIBILITY AS THE QAQISH PROGRAM CANNOT CHECK THIS.

Each regression parameter must appear in at least one regression specification (checked in QAQISH). The order in which the regressions are specified is not important.

***********************
CREATING A CONTROL FILE
***********************

A control file for a QAQISH application may be created by running the program CONTROL. CONTROL consists of a series of easily understandable screen prompts for the user to enter the necessary information. Other than being user friendly, its major advantage is that it guarantees that a run control file is created in the proper format and that all regressions are specified.

No checking is done on the validity or appropriateness of the user inputs while running CONTROL, other than making sure the chosen label for the control file does not lead to the unintentional over-writing of another file of the same label. There is, however, extensive checking of the control file entries in QAQISH itself.
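The count of regression specifications described under "Control File Parameters" can be checked with a small sketch that lists the class-number pairs expected for a given number of classes C. The function name is hypothetical and the code is only an illustration of the counting rule, not part of QAQISH or CONTROL. It always places the zero second for main effects, although either position is allowed.

```python
def required_regressions(c):
    """List the (first class, second class) pairs for C classes.

    Main effects use a zero class number, within-class odds ratios
    repeat the class number, and between-class odds ratios pair two
    different non-zero classes.
    """
    main = [(k, 0) for k in range(1, c + 1)]
    within = [(k, k) for k in range(1, c + 1)]
    between = [(i, j) for i in range(1, c + 1)
               for j in range(i + 1, c + 1)]
    return main + within + between


if __name__ == "__main__":
    c = 3
    pairs = required_regressions(c)
    # The total should equal C + C*(C+1)/2.
    print(len(pairs), c + c * (c + 1) // 2)
    for first, second in pairs:
        print(first, second)
```

Every pair in the list needs a specification in the control file, with a dimension of zero for any regression that is not to be done.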
There are two slight restrictions on using the CONTROL program. The tolerance entered is assumed to have six digits to the right of the decimal point if there is no explicit decimal point, and it is written on the CONTROL output file with six decimal places. Initial regression parameter estimates less than -9999.0 are set to -9999.0 and initial estimates greater than or equal to 99999.0 are set to 99999.0, and the estimates are written on the CONTROL output file with four decimal places. If the user finds this troublesome, the CONTROL output file can be edited later if desired, as noted in the next paragraph.

It is not necessary to use CONTROL to create a run control file. It may be created, or modified, with any text editor as long as it conforms to the expected format. Control files originally created by the program CONTROL can also be modified by a text editor.

The expected format of the control file is listed below. Numeric fields are expected to be right justified. A listing of a sample control file can be found at the end of this section.

Record   No. of      Cols.   Data
Type     Recs.
------   ---------   -----   -----------------------------------
  1      2            1-72   Heading for the output
  2      1            1-20   Label for the input data file
  3      1            1-20   Label for output file
  4      1            1- 4   Number of classes
  5      1            1- 4   Number of regression parameters in
                             the model
  6      1            1- 4   Number of main effects in the model
  7      1            1- 8   Tolerance wanted for convergence;
                             must include a decimal point
  8      1            1- 4   Maximum number of iterations allowed
                             for convergence
  9      1            1      Y if estimates are to be output for
                             each iteration
 10      1            1      1 or 2 for GEE level wanted
 11      1            1- 4   VMZP (Zhao and Prentice
                             approximation) or VMBQ (exact) for
                             method wanted for 3rd/4th order
                             moment calculations
 12      2            1-72   FORTRAN format statement for reading
                             input data; second record is blank
                             if not needed
 13      NDIMB (1)    1-16   Descriptive label for parameter
                     11-26   Initial estimate for parameter; must
                             include decimal point if wanted
 14a     SUMTY (2)    1- 4   First class number for regression
                      5- 8   Second class number for regression
                             (see below for more information)
                      9-12   Dimension of regression
 14b     NREGR (3)    1- 2   1st parameter index
                      3- 4   1st regressor index
                      5- 6   2nd parameter index
                      7- 8   2nd regressor index
                     .....   ....................
                     .....   ....................
                     77-78   20th parameter index (if needed)
                     79-80   20th regressor index (if needed)

(1) NDIMB is the number of regression parameters in the model, as specified on record type 5.

(2) SUMTY is equal to [C + C*(C+1)/2], where C is the number of classes, as specified on record type 4. This is the number of regressions expected to be defined. There are C main effects regressions, C within-class odds ratios regressions, and C*(C-1)/2 between-class odds ratios regressions. The dimension should be entered as zero for any particular regression not to be done.

(3) NREGR is equal to the number of type 14a records that have a regression dimension greater than zero. A record type 14b must immediately follow any 14a that requires it; that is, whose dimension is greater than zero.
The number of parameter/regressor pairs must be equal to the dimension entered on record 14a.

In reference to the class numbers to be specified on record type 14a, for main effects regressions, either the first or second class number should be zero. For within-class odds ratios regressions, both class numbers should be the same. For between-class odds ratios regressions, the two class numbers should of course not be the same, nor should either be zero.

Following is a listing of a sample control file. The numbers in parentheses on the far right are not part of the control file, but refer to the record types cited above.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
*** Sample Run of Qaqish Program ***                    ( 1)
*** Run Date: May, 1994 ***                             ( 1)
TEST.ASC                                                ( 2)
TEST.LST                                                ( 3)
   3                                                    ( 4)
   3                                                    ( 5)
   2                                                    ( 6)
 0.00100                                                ( 7)
  10                                                    ( 8)
Y                                                       ( 9)
2                                                       (10)
VMBQ                                                    (11)
(2I2,3F2.0)                                             (12)
                                                        (12)
1 Yi:1              1.0000                              (13)
2 Yi:x              1.0000                              (13)
3 logOR:1           0.1000                              (13)
   1   0   2                                            (14a)
 1 1 2 2                                                (14b)
   2   0   2                                            (14a)
 1 1 2 2                                                (14b)
   3   0   2                                            (14a)
 1 1 2 2                                                (14b)
   1   1   1                                            (14a)
 3 1                                                    (14b)
   2   2   1                                            (14a)
 3 1                                                    (14b)
   3   3   1                                            (14a)
 3 1                                                    (14b)
   1   2   0                                            (14a)
   1   3   0                                            (14a)
   2   3   0                                            (14a)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

The meaning of the various entries should be clear. Note that all the main effects regressions are specified as:

   B(1)*X(1) + B(2)*X(2)

all within-class odds ratios regressions are specified as:

   B(3)*X(1)

and no between-class odds ratios regressions are to be done.

****************
CAUTIONARY NOTES
****************

While extensive checking of run specifications is carried out within QAQISH, it is not comprehensive. Future versions of the program are planned to include some improvements in this area, but four items of particular caution need to be considered by the user:

   a) Cluster id numbers must be in the range -32768 to 32767.

   b) For the parameter vector to be estimated, it must be arranged so that the odds-ratios parameters are the last in the vector.
   c) The values of the regressors used in the regression for the within and between class associations should be the same for all members of any given cluster. The current program version uses the values from the last member in each cluster.

   d) The user has extensive flexibility in citing model specifications. Completely ridiculous models can be specified, and the program has no way of recognizing these.

**********
REFERENCES
**********

Two references suggested by Dr. Qaqish are:

Liang, Zeger, and Qaqish (1989), "Multivariate Regression Models Using Generalized Estimating Equations", Technical Report, Department of Biostatistics, The Johns Hopkins University, School of Hygiene and Public Health.

Qaqish and Liang (1990), "Marginal Models for Correlated Binary Data with Multiple Classes and More Than One Level of Nesting", Technical Report, Department of Biostatistics, The Johns Hopkins University, School of Hygiene and Public Health.

****************
PROGRAM MESSAGES
****************

TERMINATION MESSAGES:

This section describes the messages produced when some problem causes the program to terminate abnormally. They are listed in the order they are likely to occur. Most messages are sufficiently clear that they need no further explanation.

The first group of messages is caused by control file specification problems. For some messages, possibly helpful additional information is included here on the succeeding line in brackets.

   PROGRAM TERMINATED: Control file cannot be found -
      [name possibly misspelled, or path incorrectly specified]

   PROGRAM TERMINATED: Control file cannot be opened -
      [file may be corrupted, or name may refer to unformatted file]

   PROGRAM TERMINATED: Control file was not created correctly. There is a data read error or the end-of-file is read when not expected. Check specifications before parameter labels.

   ERROR: Output file already exists -

   ERROR: Output file cannot be opened -
      [possible invalid file name or path specification]

   ERROR: Data file cannot be opened -
      [file may be corrupted or unformatted]

   ERROR: Data file cannot be found -
      [name possibly misspelled, or path incorrectly specified]

   ERROR: Number of classes specified () not in range 1 to nmax
      [nmax is the program limit for number of classes]

   ERROR: Number of parameters specified () not in range 1 to nmax
      [nmax is the program limit for number of model regression parameters]

   ERROR: Number of main effects specified () not in range 1 to nmax
      [nmax is the number of model regression parameters specified]

   ERROR: Convergence tolerance specified () not greater than zero

   ERROR: GEE level specified () is not 1 or 2

   ERROR: Method specified for 3rd/4th order moments () not VMZP or VMBQ

   ERROR: Maximum convergence iterations specified () less than 1

   PROGRAM TERMINATED: See above error(s)
      [produced if any of the above ERROR messages were output]

   PROGRAM TERMINATED: Control file was not created correctly. There is a data read error or the end-of-file is read when not expected. Check the parameter labels and initial estimates.

   PROGRAM TERMINATED: A regression was incorrectly defined, an attempt was made to define the same regression twice, or the end-of-file was read before the required regression definitions were read.
      [n is the number of regressions expected to be defined; neither of the last two reasons should be valid if the program CONTROL was used to create the control file. The first reason can be caused by a Beta index not in the range 1 to the number of regression parameters, or a regressor index not in the range 1 to the number of main effects]

   PROGRAM TERMINATED: All required regressions (n) not defined
      [n is the number of regressions expected to be defined; this should never happen if the program CONTROL was used to create the control file]

   PROGRAM TERMINATED: Beta parameter () not used in any regression

The following group of messages is related to problems with reading the input data file.

   PROGRAM TERMINATED: Data read error on record nrec (possible format statement error)
      [nrec is the record number; if nrec is 1, the format statement for reading the data may be wrong; if greater than 1, the data record is more likely to, but not necessarily, have non-numeric data]

   PROGRAM TERMINATED: Number of clusters larger than maximum nmax
      [nmax is the program limit for number of clusters; could be caused by input data file not being sorted properly]

   PROGRAM TERMINATED: Some cluster larger than maximum nmax
      [nmax is the program limit for cluster size]

   PROGRAM TERMINATED: All clusters are empty
      [read format statement may be wrong]

Members of a third group of messages will be produced when there is not enough RAM available for arrays that need to be used. These messages will be produced at the time the allocation is attempted. They are of two basic forms:

   PROGRAM TERMINATED: Array allocation failed

   PROGRAM TERMINATED: Array allocation failed for cluster of size n1

where n and n1 are integers. For the first message, the values of n are 1, 2, or 3; for the second, 4, 5, or 6. Indications of the problem can be inferred from the value of n.

For n=1 or 2, the most likely factor causing the allocation to fail is the number of model regression parameters. Reducing this number sufficiently would eliminate the problem.

For n=3, the allocation failed because of the size of the largest cluster. This size can be determined by looking at the output up to that point. This allocation is only attempted if the VMBQ method of 3rd/4th order moment calculations is specified. Thus this problem can be eliminated by switching to the VMZP method, or sufficiently reducing the size of the largest cluster.

For n=4, 5, or 6, the second number, n1, is the size of the cluster at the time the allocation fails. The problem may be caused by either the number of model regression parameters or the cluster size, and can be eliminated by sufficiently reducing the size of one or both of these factors. Also, if n=5, the allocation is only attempted if the VMBQ method of 3rd/4th order moment calculations is specified, and thus the problem can be eliminated by switching to the VMZP method.

There is one other group of termination messages: those related to the failure to invert an array. These are:

   PROGRAM TERMINATED: W2 inversion failed on cluster/column n1/n

   PROGRAM TERMINATED: S1 inversion failed on column n

W2 is an array for cluster n1; its order is equal to the expanded cluster size. S1 is an aggregated array for all the clusters combined; its order is equal to the total number of model regression parameters. The number n is the column which was being processed for inversion when the inversion failed. This information is probably of limited value in a W2 inversion failure, but it could indicate a troublesome regression parameter in the case of S1. Check the most recent Beta estimates. In particular, if this occurs during the first iteration pass, the problem may be resolved by a better initial estimate for the Beta.

WARNING MESSAGES:

There is only one warning message, and its meaning should be clear. S1A is an aggregated array for all the clusters combined; its order is equal to the total number of model regression parameters. It is used only with a GEE level 1 analysis, and affects only the robust data output.
The message is:

   WARNING: S1A INVERSION FAILED ON COLUMN
            OUTPUT FOR ROBUST DATA LISTED BELOW IS INVALID

**********************************
WRITING A FORTRAN FORMAT STATEMENT
**********************************

This section describes how to write a simple FORTRAN format statement as required for reading the input data file of a QAQISH application. Any valid FORTRAN format statement is acceptable, as long as it adheres to the data file requirements, but only a basic form is described here. Some examples are given at the end of the section.

To reiterate the data file requirements, each read must be from a single physical record with fields of fixed width. The first two values must be read as integers, the cluster id and class number respectively, and the next NDIMB1+1 values must be read as real numbers, where NDIMB1 is the number of main effects. These real numbers are for the response variable followed by values for the NDIMB1 main effects variables. The format statement is limited to 144 characters in the current version.

In the following, upper case letters are used for the specifications, but lower case letters are acceptable as well.

The specification for an integer field is just I followed by the field width. For example, I5 would read an integer from a field of width 5. The number in the field must be an integer (i.e., no decimal points) and must be right justified. The field can have leading blanks; i.e., blanks in the left part of the field before the integer value begins.

The general specification for a real number field is Fw.d, where w is the field width and d is the number of digits to the right of the decimal point. As with integers, the number must be right justified in the field and leading blanks are allowed. If the number contains a decimal point, the ".d" part of the specification is not actually needed. It can be included, or it can be omitted entirely, or it can be written as ".0".
Its main usage is for numbers with an implied, but not explicitly included, decimal point. As an example, F10.4 reads a real number from a field of width 10; if there is no decimal point in the field, the number will be read as having four digits to the right of the decimal point. (Note: If the number is read as F10.4, but the data item in the field is a number with an explicit decimal point and 2 digits to the right of the decimal point, the number will be read with 2 decimal places, not four. An explicit decimal point always takes precedence over what the field specification might say.)

For both integers and reals, if several consecutive numbers have the same field specification, they may be abbreviated by preceding the specification with the number of times it is repeated. For example,

   (I4,I4,F3.1,F6.2,F6.2,F6.2,F4.1,F6.2)

may be written as

   (2I4,F3.1,3F6.2,F4.1,F6.2)

As shown in the last example, there are two other essential items for the simplest format statement. First, the format statement must be enclosed in parentheses, and second, individual field specifications must be separated by commas. If all fields are adjacent, and the first field starts in column one, this is all that is needed to write a format statement.

If the fields are not adjacent, or if the first field does not start in the first column, it is necessary to include specifications in the format statement to indicate this. There are two options. The first is the "nX" specification, which simply means to skip "n" columns. The second is the "Tn" specification, which means the field starts in column "n". For example, the format statement

   (T3,I3,10X,F8.1)

will read an integer from columns 3-5 and a real number from columns 16-23. Note this format is equivalent to

   (2X,I3,T16,F8.1)

Following are three complete examples. The number of main effects is assumed to be three. Note these statements can be written in more than one way, but only one is included here.

   a) Cluster id in cols. 1-4; class in 7-8; response variable in 9-10; main effects in 11-14, 15-18, 19-22, and all are whole numbers:

      (I4,2X,I2,F2.0,3F4.0)

   b) Cluster id in cols. 5-8; class in 9; response variable in 12; main effects in 13-16, 27-29, 65-70, all with an explicit decimal point:

      (4X,I4,I1,2X,F1.0,F4.0,T27,F3.0,T65,F6.0)

   c) Cluster id in cols. 1-2, class in 3-4, response variable in 6, main effects in 7-12, 13-18, 19-24 with no decimal points, but the first is meant to have two digits to the right of the decimal point, and the second and third to have four:

      (2I2,1X,F1.0,F6.2,2F6.4)

For a final example, consider an application where there are 12 main effects to be read. The cluster id is in cols. 1-2, the class in col. 8, the response variable in col. 9, the main effect variables in cols. 11-15, 16-20, 31-34, 35-38, 39-40, 41-45, 101-106, 107-111, 114-119, 120, 211-222, and 223-234 respectively. The first 10 main effects are either whole numbers or numbers with explicit decimal points; the last two have no decimal points but are to be read with eight digits to the right of the decimal point. A correct read format statement would be:

   (I2,T8,I1,F1.0,1X,2F5.0,T31,2F4.0,F2.0,F5.0,T101,F6.0,F5.0,2X,F6.0,F1.0,T211,2F12.8)
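The column arithmetic described in this section can be checked with a short sketch before a run. The Python code below is only an illustration and is not part of QAQISH: the function names are hypothetical, and it handles only the basic I, F, nX, and Tn specifications covered here (no nested parentheses or other FORTRAN edit descriptors). The helper read_real mimics the implied decimal point rule for a non-blank, right-justified field.

```python
import re


def field_columns(fmt):
    """Return (kind, first_col, last_col) for each I or F field."""
    body = fmt.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("format must be enclosed in parentheses")
    col = 1  # next column to be read, counting from 1
    fields = []
    for item in body[1:-1].split(","):
        item = item.strip().upper()
        # Optional repeat count, I or F, width, optional ".d"
        m = re.fullmatch(r"(\d*)([IF])(\d+)(?:\.(\d+))?", item)
        if m:
            repeat = int(m.group(1) or 1)
            width = int(m.group(3))
            for _ in range(repeat):
                fields.append((m.group(2), col, col + width - 1))
                col += width
            continue
        m = re.fullmatch(r"(\d+)X", item)  # nX: skip n columns
        if m:
            col += int(m.group(1))
            continue
        m = re.fullmatch(r"T(\d+)", item)  # Tn: move to column n
        if m:
            col = int(m.group(1))
            continue
        raise ValueError("unsupported specification: " + item)
    return fields


def read_real(text, d):
    """Read a non-blank field as Fw.d; an explicit point overrides d."""
    text = text.strip()
    if "." in text:
        return float(text)
    return int(text) / 10 ** d


if __name__ == "__main__":
    for kind, first, last in field_columns("(I4,2X,I2,F2.0,3F4.0)"):
        print(kind, first, last)
    print(read_real("  123456", 2))
```

Running field_columns on two formats that are meant to be equivalent, such as (T3,I3,10X,F8.1) and (2X,I3,T16,F8.1), is a quick way to confirm that they read the same columns.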