Clinical and Vaccine Immunology, January 2008, p. 106-114, Vol. 15, No. 1
1071-412X/08/$08.00+0 doi:10.1128/CVI.00223-07
Copyright © 2008, American Society for Microbiology. All Rights Reserved.

and
Peter M. Strebel1
National Center for Immunization and Respiratory Diseases,1 Office of Workforce and Career Development,2 Office of the Director, Centers for Disease Control and Prevention, Atlanta, Georgia3
Received 31 May 2007/ Returned for modification 28 August 2007/ Accepted 25 October 2007
For patients suspected of having pertussis, two types of clinical samples can be tested: a nasopharyngeal (NP) specimen for the isolation of Bordetella pertussis or for a PCR assay for B. pertussis DNA and a serum sample for the measurement of antibodies to B. pertussis antigens (11).
B. pertussis isolation by microbial culture is the conventional gold standard for confirming pertussis (37, 60). Most studies have derived sensitivity and specificity estimates of PCR or serologic tests using culture results as the gold standard (37). Sensitivity is the proportion of truly diseased patients classified as positive, and specificity is the proportion of truly nondiseased patients classified as negative (62). However, culture is an insensitive test, because the organism is fastidious and often not recoverable from the nasopharynx more than 3 weeks after cough onset (37). Because culture has low sensitivity, it cannot be used to determine the true specificity of other or new pertussis tests. Neither PCR assays for B. pertussis DNA nor serologic assays for antibodies to B. pertussis antigens have been standardized, and their sensitivity and specificity are incompletely defined (27, 37, 41, 60).
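As a minimal sketch of these two definitions, the function below computes sensitivity and specificity from the four cells of a two-by-two table of an index test against a reference standard. The function name and the counts are hypothetical and are not taken from any study cited here.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from 2x2 counts against a reference standard."""
    sensitivity = tp / (tp + fn)  # reference-positive subjects who test positive
    specificity = tn / (tn + fp)  # reference-negative subjects who test negative
    return sensitivity, specificity

# Hypothetical counts for illustration only
se, sp = sensitivity_specificity(tp=45, fn=5, fp=10, tn=90)
```

Both estimates are only as good as the reference standard that defines the rows of the table, which is the problem the rest of this report addresses.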
Consider a diagnostic test under investigation, hereafter referred to as the index test. If culture for pertussis is assumed to be <100% sensitive and 100% specific and culture is used as the gold standard for assessing the index test, then the index test's sensitivity estimate will be unbiased but the specificity estimate will be biased in the direction of lower estimates (38, 46, 53, 62). This bias, referred to as the imperfect gold standard bias (62), occurs because some index test-positive results from truly infected persons will have been falsely negative by culture. Under the assumption that the index test and culture are conditionally independent, the negative bias of the specificity estimate increases as the sensitivity of culture decreases and as the prevalence of pertussis increases (46) (Fig. 1).
FIG. 1. Example of the observed specificity of a diagnostic test versus the prevalence of disease, by the sensitivity of culture, assuming a culture specificity of 100% and conditional independence between the diagnostic test and culture. Adjustment formulas for the observed specificity also assume a diagnostic test specificity of 85% and a diagnostic test sensitivity of 75% (46).
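The direction of the bias can be checked with a short calculation. The sketch below derives the observed specificity directly from the stated assumptions (culture specificity of 100% and conditional independence between the index test and culture); it is not a transcription of the adjustment formulas in reference 46, and the function name is hypothetical. The defaults use the index test sensitivity (75%) and specificity (85%) given in the caption of Fig. 1.

```python
def observed_specificity(prevalence, culture_sens, test_sens=0.75, test_spec=0.85):
    """Specificity of the index test as it would appear against culture.

    Assumes culture is 100% specific and that the index test and culture are
    conditionally independent given true disease status.
    """
    # Culture-negative subjects: diseased subjects missed by culture plus all nondiseased
    missed_diseased = prevalence * (1.0 - culture_sens)
    nondiseased = 1.0 - prevalence
    # Index-test-negative subjects among the culture-negative subjects
    test_negative = missed_diseased * (1.0 - test_sens) + nondiseased * test_spec
    return test_negative / (missed_diseased + nondiseased)

# Observed specificity falls as prevalence rises and as culture sensitivity falls
for culture_sens in (0.9, 0.5, 0.2):
    row = [round(observed_specificity(p, culture_sens), 3) for p in (0.05, 0.2, 0.5)]
    print(culture_sens, row)
```

Under the same assumptions the observed sensitivity equals the true sensitivity of the index test, because every culture-positive subject is truly diseased.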
FIG. 2. (a) Example of the observed sensitivity of a diagnostic test versus the prevalence of disease, when the assumption of conditional independence between the diagnostic test and an imperfect gold standard may or may not be met. The solid line indicates the assumed level of sensitivity of the diagnostic test. The short-dashed line indicates the observed sensitivity assuming conditional independence calculated by adjustment formulas (46). The long-dashed lines indicate the pointwise nonsampling intervals for the observed sensitivity assuming diagnostic test-gold standard dependence. Formulas for the nonsampling intervals assume a diagnostic test sensitivity of 75%, a gold standard sensitivity of 95%, and a gold standard specificity of 95% (50). (b) Example of the observed specificity of a diagnostic test versus the prevalence of disease, when the assumption of conditional independence between the diagnostic test and an imperfect gold standard may or may not be met. The solid line indicates the assumed level of specificity of the diagnostic test. The short-dashed line indicates the observed specificity assuming conditional independence calculated by adjustment formulas (46). The long-dashed lines indicate the pointwise nonsampling intervals for the observed specificity assuming diagnostic test-gold standard dependence. Formulas for the nonsampling intervals assume a diagnostic test specificity of 85%, a gold standard sensitivity of 95%, and a gold standard specificity of 95% (50).
Standardized PCR and serologic assays for diagnosing pertussis currently are under development (4, 10, 60). The evaluation of the accuracy of these assays will require a reference standard with both high sensitivity and specificity. In this report, we outline methods for two alternative reference standards, the composite reference standard (CRS) and latent class analysis (LCA), illustrate them for evaluating a PCR assay by using data from a previously published study of pertussis in adolescents and adults, and make recommendations for their use.
Discrepant analysis. In several evaluations of diagnostic tests for detecting B. pertussis infection, investigators have attempted to improve the sensitivity of culture by performing discrepant analysis (30, 57). A typical discrepant analysis involves the selective testing of the index test-positive, gold standard-negative specimens with a third, resolver test, which is considered a perfect gold standard (100% specific and 100% sensitive) (1, 52). If the resolver test result is positive, then it is considered a true positive (19); this reclassification leads to a modified or expanded gold standard that is based in part on the results of the index test (1). The incorporation of the results of the index test also can occur in other forms of discrepant analysis, in which an expanded gold standard is created by resolving discrepant specimens with patient histories (30). The incorporation of the results of the index test into the gold standard results, a phenomenon termed incorporation bias, typically overestimates the accuracy of the index test (39, 62). Thus, the selective testing or resolving in discrepant analysis merely substitutes incorporation bias for imperfect gold standard bias (35). To prevent incorporation bias, investigators advocate study designs in which all diagnostic tests are applied to each subject (18, 29, 31, 35, 36).
CRSs. To reduce imperfect gold standard bias but avoid incorporation bias, Hadgu (20) and Miller (35, 36) suggested combining multiple tests to improve the single best reference test. For example, a test with high specificity but poor sensitivity (e.g., culture) could be combined with another test with higher sensitivity (e.g., serology) to provide a relatively accurate CRS. The CRS can be formulated in the framework of a two-stage study design (1). In the first stage, all specimens are tested by the index test (e.g., PCR) and culture, but in the second stage, only those specimens that are culture negative are tested by the resolver test (e.g., serology) (Tables 1 and 2). The CRS is defined as positive if the specimen tested positive by either culture or the resolver test and is defined as negative if the specimen tested negative by both tests. The CRS also can be implemented in single-stage studies (33) and by combining more than two laboratory tests and clinical findings (28). Large-sample confidence intervals can be calculated for estimates of sensitivity and specificity (14).
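The CRS rule described above can be sketched as follows. The specimen data are hypothetical, the function names are illustrative, and the Wald interval is only one common large-sample choice; the source does not specify which interval reference 14 recommends.

```python
import math

def composite_reference(culture, resolver):
    """CRS: positive if culture or resolver is positive, negative if both are negative.

    resolver may be None for culture-positive specimens, which are not
    tested in the second stage of the two-stage design.
    """
    return [1 if c == 1 or r == 1 else 0 for c, r in zip(culture, resolver)]

def accuracy_vs_reference(index, reference, z=1.96):
    """Sensitivity and specificity of the index test with Wald intervals (an assumption)."""
    tp = sum(1 for i, r in zip(index, reference) if i == 1 and r == 1)
    fn = sum(1 for i, r in zip(index, reference) if i == 0 and r == 1)
    fp = sum(1 for i, r in zip(index, reference) if i == 1 and r == 0)
    tn = sum(1 for i, r in zip(index, reference) if i == 0 and r == 0)

    def wald(p, n):
        half = z * math.sqrt(p * (1.0 - p) / n)
        return max(0.0, p - half), min(1.0, p + half)

    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return sens, wald(sens, tp + fn), spec, wald(spec, tn + fp)

# Hypothetical results for 10 specimens (1 = positive, 0 = negative)
index    = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]        # e.g., PCR
culture  = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
resolver = [None, 1, 1, 0, 0, 0, 0, 0, 0, 0]     # e.g., serology, culture negatives only

crs = composite_reference(culture, resolver)
sens, sens_ci, spec, spec_ci = accuracy_vs_reference(index, crs)
```

Note that the index test result never enters the definition of the CRS, which is what keeps the design free of incorporation bias.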
TABLE 1. Notation of variables used in a composite reference standard that combines culture with a resolver test to evaluate performance of a diagnostic test for pertussisa
TABLE 2. Notation of formulas for calculating sensitivity and specificity of a composite reference standard that combines culture with a resolver test to evaluate performance of a diagnostic test for pertussisa
3) diagnostic tests (Y1, Y2,..., YJ) rating each subject on a binary scale (1 = positive; 0 = negative). In an item analysis of diagnostic tests, the total score on all the diagnostic tests (
LCA for assessing relative accuracy of diagnostic tests. A second valid approach for reducing imperfect gold standard bias is LCA. LCA is a mathematical correction that involves fitting a latent class model using data from all available diagnostic tests (32, 59). All diagnostic tests, including the gold standard, are regarded as imperfect. The latent class model assumes that for a randomly selected subject, the unobserved true state of disease, the latent variable (X), influences the observed measurements made by J diagnostic tests (Fig. 3a). The model makes two key assumptions, which are described below.
FIG. 3. (a) Conditional independence latent class model. The observed measurements made by diagnostic tests (Yj; j = 1,..., J) are independent, given the common latent variable for subject disease status (X). If an association is observed among diagnostic tests, then it is entirely attributable to the common factor X. (b) Example of a dependence latent class model that includes a bivariate association between Y1 and Y2. The observed association between Y1 and Y2 cannot be entirely explained by the common factor X.
$$P(Y_1 = y_1, Y_2 = y_2) = \sum_{x=0}^{1} P(X = x)\, P(Y_1 = y_1, Y_2 = y_2 \mid X = x) \tag{1}$$
Assumption 2. Yj is independent of Yj', j ≠ j', given X. This is an assumption of conditional independence of the diagnostic tests, given the true disease status of the subject. This assumption means that an observed measurement made by a diagnostic test depends only on the true disease status of the subject and not on the measurements of other diagnostic tests or on any covariate (Fig. 3a). Thus, the diagnostic tests make classification errors independently of each other, irrespective of disease status. For the observed two-by-two table of Y1 and Y2, this assumption allows the joint probability in equation 1 to be factored further into an expression that includes conditional probabilities (a sensitivity parameter and a specificity parameter for each diagnostic test) and latent class probabilities (represented by a single prevalence parameter):
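$$P(Y_1 = y_1, Y_2 = y_2) = \sum_{x=0}^{1} P(X = x)\, P(Y_1 = y_1 \mid X = x)\, P(Y_2 = y_2 \mid X = x) \tag{2}$$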
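The conditional independence factorization can be sketched for any number of tests as follows. The function name and the parameter values are hypothetical; the sketch assumes two latent classes (diseased and nondiseased), one sensitivity and one specificity per test, and a single prevalence parameter, as described in the text.

```python
from itertools import product

def pattern_probability(pattern, prevalence, sens, spec):
    """P(Y1 = y1, ..., YJ = yJ) under the two-class conditional independence model."""
    p_diseased = prevalence            # P(X = 1)
    p_nondiseased = 1.0 - prevalence   # P(X = 0)
    for y, se, sp in zip(pattern, sens, spec):
        p_diseased *= se if y == 1 else 1.0 - se        # P(Yj = y | X = 1)
        p_nondiseased *= 1.0 - sp if y == 1 else sp     # P(Yj = y | X = 0)
    return p_diseased + p_nondiseased

# Hypothetical parameters for two tests
prevalence, sens, spec = 0.1, (0.5, 0.8), (1.0, 0.9)
table = {pat: pattern_probability(pat, prevalence, sens, spec)
         for pat in product((0, 1), repeat=2)}
```

The four cell probabilities in `table` sum to 1, so a two-by-two table supplies only three independent quantities for five parameters. That is why the number of tests matters for estimation.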
The requirements for the number of diagnostic tests can be relaxed by restricting values of parameters to certain values or by restricting two or more parameters to have equal values (32, 59). The conditional independence model and models that account for conditional dependence between pairs of diagnostic tests can be fit using Latent GOLD software (55). A final model can be selected based on the value of the Bayesian information criterion, with smaller values representing better fits (32).
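To make the fitting step concrete, the sketch below fits the two-class conditional independence model by a plain expectation-maximization (EM) iteration. It is an illustration only and is not the Latent GOLD algorithm: it uses no priors, no Newton-Raphson steps, no dependence terms, and does not compute the Bayesian information criterion. All names, starting values, and parameter values are hypothetical, and the input counts are exact expected frequencies generated from known parameters rather than study data.

```python
from itertools import product

def pattern_prob(pattern, prev, sens, spec):
    """Joint probability of one response pattern with each latent class."""
    p_dis, p_non = prev, 1.0 - prev
    for y, se, sp in zip(pattern, sens, spec):
        p_dis *= se if y else 1.0 - se
        p_non *= (1.0 - sp) if y else sp
    return p_dis, p_non

def fit_latent_class(counts, n_tests, n_iter=500, start=(0.5, 0.8, 0.8)):
    """EM for the two-class conditional independence model.

    counts maps each response pattern (tuple of 0/1) to its frequency.
    start gives the initial prevalence, sensitivity, and specificity (assumed values).
    """
    prev = start[0]
    sens = [start[1]] * n_tests
    spec = [start[2]] * n_tests
    total = sum(counts.values())
    for _ in range(n_iter):
        # E-step: posterior probability of disease for each pattern
        w_sum = 0.0
        pos_dis = [0.0] * n_tests   # expected positives among diseased
        neg_non = [0.0] * n_tests   # expected negatives among nondiseased
        for pattern, n in counts.items():
            p_dis, p_non = pattern_prob(pattern, prev, sens, spec)
            w = p_dis / (p_dis + p_non)
            w_sum += n * w
            for j, y in enumerate(pattern):
                if y:
                    pos_dis[j] += n * w
                else:
                    neg_non[j] += n * (1.0 - w)
        # M-step: update parameters from the expected counts
        prev = w_sum / total
        sens = [s / w_sum for s in pos_dis]
        spec = [s / (total - w_sum) for s in neg_non]
    return prev, sens, spec

# Expected frequencies for 1,000 subjects under known (hypothetical) parameters
true_prev = 0.3
true_sens = [0.9, 0.8, 0.85, 0.7]
true_spec = [0.95, 0.9, 0.98, 0.9]
counts = {}
for pattern in product((0, 1), repeat=4):
    counts[pattern] = 1000 * sum(pattern_prob(pattern, true_prev, true_sens, true_spec))

est_prev, est_sens, est_spec = fit_latent_class(counts, n_tests=4)
```

With real data, sparse tables and small samples can push estimates to the boundary (0 or 1), which is one reason dedicated software adds priors and more robust optimization.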
The results of fitting a latent class model to data from multiple diagnostic tests can be used to predict the disease status of individual study subjects (2, 3). Bayes' rule can be used to calculate the posterior probability that a subject is in the latent diseased population, given the subject's observed results on the multiple diagnostic tests.
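A minimal sketch of this posterior calculation under the conditional independence model follows. The function name and the parameter values are hypothetical and are not estimates from the pertussis study.

```python
def posterior_disease_probability(pattern, prevalence, sens, spec):
    """P(X = 1 | Y1 = y1, ..., YJ = yJ) by Bayes' rule, assuming conditional independence."""
    p_diseased = prevalence
    p_nondiseased = 1.0 - prevalence
    for y, se, sp in zip(pattern, sens, spec):
        p_diseased *= se if y == 1 else 1.0 - se
        p_nondiseased *= 1.0 - sp if y == 1 else sp
    return p_diseased / (p_diseased + p_nondiseased)

# Hypothetical parameters: test 1 is perfectly specific but insensitive
prevalence, sens, spec = 0.1, (0.5, 0.8), (1.0, 0.9)
post_pos_neg = posterior_disease_probability((1, 0), prevalence, sens, spec)
post_neg_neg = posterior_disease_probability((0, 0), prevalence, sens, spec)
```

In this example a positive result on the perfectly specific test gives a posterior probability of 1 regardless of the second test, whereas two negative results leave a small residual probability of disease because neither test is perfectly sensitive.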
Data example: prospective study of pertussis disease burden in adolescents and adults.
A prospective study conducted among members of a managed care organization in Minneapolis/St. Paul, MN, measured the pertussis disease burden in adolescents and adults (48). Between January 1995 and December 1996, 212 persons aged 10 to 49 years who presented with an acute paroxysmal cough or a persistent cough illness of 7 to 34 days duration were enrolled in the study. At enrollment, NP swab specimens were obtained for culture and PCR, and a first serum specimen was obtained. A second serum specimen was obtained 3 weeks later. Serum samples were assayed by an enzyme-linked immunosorbent assay for immunoglobulin G (IgG) and IgA against pertussis toxin (PT) and filamentous hemagglutinin (FHA), resulting in four types of serological test results: IgG-PT, IgA-PT, IgG-FHA, and IgA-FHA. The results of several laboratory tests were available: culture, PCR, and ≥2-fold increases in IgG-PT, IgA-PT, IgG-FHA, or IgA-FHA. Culture and PCR were performed in separate locations within the Minnesota Department of Health Laboratory, and the serologic assays were performed at Vanderbilt University (TN) (48). The method used for the conventional PCR assay was described by van der Zee et al. (54); primer pairs were based on insertion sequence elements IS481 (specific for B. pertussis) and IS1001 (specific for B. parapertussis). In addition to laboratory test results, classification by the pertussis clinical case definition was available (9). A clinical case was defined as a cough illness of ≥14 days duration with one or more of the following symptoms: paroxysms of coughing (coughing spells with the inability to breathe during the spells), inspiratory whoop, or posttussive vomiting.
We analyzed these data by considering PCR to be the index test for illustrative purposes only. The sensitivity and specificity of PCR were estimated using culture results as the gold standard, a CRS, and LCA.
TABLE 3. Item analysis for determining which indicator of pertussis to combine with culture for evaluation of PCR in a previous pertussis study of 212 subjects (48)
TABLE 4. Sensitivity and specificity of indicators of pertussis in a previous pertussis study of 212 subjects (48) based on culture results, a CRS, and LCA
On the basis of these results and considering that PT is the most specific B. pertussis antigen tested (11), IgG-PT was chosen as the resolver test and was combined with culture in a CRS. Based on the CRS, the sensitivity of PCR was 47%, which was lower than that based on culture (62%), but the specificity estimates based on the two reference standards were the same (98%) (Table 4). The estimates of sensitivity and specificity based on the CRS for the other serology tests were higher than those based on culture.
LCA. An LCA was performed based on the cross-classification of results for all six laboratory tests and the clinical case definition (Table 5). Of 27 (12.7%) patients with at least one positive laboratory test result, all except one patient met the clinical case definition; this patient was positive only for IgG-FHA and IgA-FHA. After the conditional independence model was fitted, a large (statistically significant) bivariate residual was found for the culture-PCR association, so a second model that added a separate parameter for conditional dependence between culture and PCR was fitted. This model had a lower value of the Bayesian information criterion than the conditional independence model, and none of its bivariate residuals was large, suggesting a good fit of the model. The parameter estimates for the model were virtually identical to those for the conditional independence model. Based on the results of the conditional dependence model, PCR had a sensitivity of 34% and a specificity of 98% (Table 4). Culture and IgA-PT had relatively low sensitivity (34%), whereas the clinical case definition (94%) and IgG-FHA (97%) had relatively high sensitivity. All indicators except the case definition were relatively specific. IgG-PT and IgA-PT had the highest specificity (99.9%). The latent class model also provided an estimate of the prevalence of pertussis based on all seven indicators of pertussis (estimate, 8.4%; standard error, 1.9%).
TABLE 5. LCA of seven indicators of pertussis in a previous pertussis study of 212 subjects (48)a
Because the timing of specimen collection is a major determinant for culture, PCR, and antibody detection tests (41, 47, 48), we evaluated the performance of PCR within two subgroups of subjects defined by the interval between cough onset and enrollment (NP swab and first serum specimen): 0 to 13 days and 14 to 34 days (Table 6). With seven culture-positive samples in the 0- to 13-day interval but only one in the 14- to 34-day interval, the culture-based estimate of sensitivity was not reliable for the 14- to 34-day interval; the culture-based estimate of specificity did not vary by interval. However, results of both the CRS and LCA suggested that the sensitivity of PCR was significantly higher when the NP specimens were collected during the first 2 weeks of illness than when they were collected later in the course of illness (Table 6).
TABLE 6. Sensitivity and specificity of PCR in a previous pertussis study of 212 subjects (48) by the interval between cough onset and enrollment (NP swab and first serum specimen)
CRSs and LCA are scientifically and statistically valid alternatives to using culture results as the gold standard in diagnostic accuracy studies. In our data example, they led to much lower estimates of sensitivity of PCR (47 and 34%, respectively) than the culture-based estimate (62%). The culture-based estimate of sensitivity was biased higher due to the positive dependence in classification by PCR and culture. We believe that the CRS and LCA provided a more accurate indicator of disease than culture results alone, because they defined disease status by using additional markers of disease. For instance, information on five subjects with positive IgG-PT results who tested negative by culture and PCR (Tables 1 and 2) was used to define pertussis in both the CRS and LCA approaches. In prior studies of pertussis in adolescents and adults that were based on case definitions that combined culture, PCR, and serologic results, it was found that the PCR sensitivity estimates were less than 50% (61).
The data example was limited by few positive results by culture (n = 8) or PCR (n = 10). For this reason, the results from our analysis on the performance of PCR as a diagnostic test may not be generally applicable, particularly in the period of 14 to 34 days after cough onset. In addition, the PCR and serologic assays used in the pertussis study were not standardized. These limitations are common to published evaluations of pertussis diagnostic tests. The CDC currently is working to optimize PCR for detecting B. pertussis DNA and also is working with the Food and Drug Administration to optimize serologic assays for B. pertussis antigens (10).
Several practical aspects of implementing the CRS approach deserve mention. An appropriate CRS for defining pertussis requires a highly sensitive resolver test to improve upon the low sensitivity of culture. The resolver test also must be specific, because the CRS assumes that the resolver test, as well as culture, is highly specific. Under these assumptions, the CRS will increase the sensitivity relative to that of culture and the resolver test but will remain highly specific (38). In the data example, the choice of IgG-PT as the resolver test was aided by performing an item analysis. The item analysis suggested that IgG-FHA and IgA-FHA also were good candidates for the resolver test. However, FHA antibody tests would not be good choices for the resolver test, because FHA antibodies also are observed in response to infection with other Bordetella species (4, 37). The two-stage framework of the CRS in which the resolver test was applied to all culture-negative results was used in the data example because of sparse data (Tables 1 and 2). Additional study designs for the CRS are based on sampling from the cells of the two-by-two table of the index test by culture results (1, 22).
The LCA approach is more complicated than the CRS approach, because it involves fitting a statistical model that models unobserved data, the latent disease status. Latent class models attempt to provide an approximation of the diagnostic truth based on the results of all available diagnostic tests, recognizing that the true classification of a person's disease status is unknown and can be defined only theoretically. Thus, a sensible analysis strategy would be to include in the model indicators of disease that are based on different biologic or physiologic phenomena. An additional advantage of latent class models over CRSs in this respect is that they do not require assumptions about the specificity of the resolver or any other test.
Latent class models have other advantages over the CRS approach. They allow the estimation of sensitivity and specificity of each diagnostic test included in the model. In addition, they provide estimates of disease prevalence, but it is important to note that the prevalence estimate depends on the patient sample as well as the particular variables included in the model. In the Minnesota pertussis study, 3.8% of the subjects were culture positive, and 12.7% had positive results on at least one laboratory test; the pertussis prevalence estimate was 8.4% based on the latent class model containing seven indicators of disease (Table 4). Lastly, latent class models may provide more reliable estimates of diagnostic performance than other approaches, because they model all of the available diagnostic information. In the subgroup analysis defined by the interval between cough onset and the first specimen collection, the latent class models indicated that, like culture, PCR was a more sensitive diagnostic test earlier in the patient's course of illness. This finding is consistent with results of previous pertussis studies (47, 48).
Although in the data example the latent class models were developed for the results of culture from all study subjects, latent class models also can be developed when culture is performed only for subjects with positive results on some other indicator(s) (58). However, they may require substantial amounts of data to avoid the estimation problems associated with the identifiability of models. A latent class model is identifiable if only one solution to the likelihood equations exists. Even if the model is identifiable, the parameters may not be estimated uniquely because of a small sample size, sparseness in the observed data table, or an unusual pattern of observed data. To prevent boundary value problems and the nonexistence of maximum likelihood estimates, the Latent GOLD program uses Dirichlet priors with user-defined parameters for the latent and conditional response probabilities and a hybrid expectation-maximization/Newton-Raphson model-fitting algorithm (56). Recent applications of LCA have been useful for studies with sample sizes of about 150 subjects evaluated by four or five diagnostic tests (6, 8).
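A simple parameter count gives a necessary (but not sufficient) condition for identifiability of the two-class model: the number of free parameters cannot exceed the number of independent cells in the cross-classification of the J binary tests. The sketch below performs that count; the function name is hypothetical, and treating each pairwise dependence term as one extra parameter is an assumption of this sketch.

```python
def parameter_count_check(n_tests, n_dependence_terms=0):
    """Compare free parameters with independent cells for a two-class model.

    Parameters: one sensitivity and one specificity per test, one prevalence,
    plus any extra conditional dependence terms (assumed one parameter each).
    Independent cells: 2**n_tests response patterns minus 1.
    """
    n_params = 2 * n_tests + 1 + n_dependence_terms
    n_cells = 2 ** n_tests - 1
    return n_params, n_cells, n_params <= n_cells

two_tests = parameter_count_check(2)      # too few cells for the parameters
three_tests = parameter_count_check(3)    # just enough cells, no degrees of freedom left
seven_tests = parameter_count_check(7, n_dependence_terms=1)
```

Passing this count does not guarantee unique estimates; as noted above, small samples and sparse tables can still prevent unique estimation.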
If one is willing to make further analysis assumptions about the clinical sensitivity and specificity values of the reference standard, then this information may be incorporated into an LCA (16, 59, 62). For example, when we assumed that culture was 100% specific and fit another latent class model for the 0- to 34-day interval, the accuracy estimates for PCR did not change, but the number of culture positives in the pertussis study was small. In other settings, this approach may be useful for determining the possible range of values for the index test under different assumptions about the accuracy of the reference standard. A more comprehensive approach to incorporating prior information on the accuracy of the reference standard would be to assume a prior distribution for sensitivity and specificity values in a Bayesian analysis (13, 15, 25).
Recommendations for evaluations of diagnostic tests. Table 7 compares all of the analysis approaches and provides recommendations for their use. Discrepant analysis should not be used in evaluating diagnostic tests for pertussis, because it violates a fundamental principle of diagnostic test evaluation: the reference standard should not incorporate any test result that depends on the results of the diagnostic test. The acceptance of this principle is challenging when evaluating a diagnostic test that truly is more sensitive than the conventional standard (culture) or even a perfect test; such a test will identify as diseased some individuals who were classified as nondiseased by the gold standard (42-45). Culture for B. pertussis has near-perfect specificity but poor sensitivity. PCR assays potentially can detect a single copy of B. pertussis target DNA (26), and prior pertussis studies have found that PCR assays, when performed with the appropriate laboratory control reagents, have high diagnostic sensitivity (37). However, they can yield false-positive results for several reasons, including clinical or laboratory contamination (7, 27, 34, 49). Numerous false-positive PCR results occurred in three recent outbreaks of respiratory illness mistakenly attributed to pertussis (10). False-negative results of PCR assays also are possible for a variety of reasons, not all of which can be rigorously controlled (34). Standardized serologic assays might be sensitive and specific enough to be utilized in diagnostic test evaluations, given that the variability inherent in these quantitative assays does not exceed the minimal levels for acceptance criteria (40).
TABLE 7. Comparison of analysis approaches for evaluating the clinical accuracy of pertussis diagnostic tests
In conclusion, the CRS and LCA approaches were useful for evaluating the relative accuracy of PCR and other diagnostic tests for pertussis in a prior pertussis study. These approaches likely provided more accurate reference standards than culture results alone and should be used in evaluations of diagnostic tests for pertussis. The best way forward is to ensure close interdisciplinary collaboration among clinicians, laboratorians, and statisticians.
The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the funding agency.
Published ahead of print on 7 November 2007.
Present address: 1830 Mountain Valley Rd., La Veta, CO 81055.