“CATCH-IT Reports” are Critically Appraised Topics in Communication, Health Informatics, and Technology, discussing recently published eHealth research. We hope these reports will draw attention to important work published in journals, provide a platform for discussion around results and methodological issues in eHealth research, and help to develop a framework for evidence-based eHealth. CATCH-IT Reports arise from “journal club”-like sessions founded in February 2003 by Gunther Eysenbach.
Monday, December 14, 2009
Final CATCH-IT Report: Syndromic Surveillance Using Ambulatory Electronic Health Records
Abstract & Blog Comments
Syndromic surveillance is a type of surveillance that uses health-related data to predict or detect disease outbreaks or bioterrorism events. Much of the work in this area of research has been conducted on structured health data (1)(2). However, these systems typically need to be tailored to a particular IT system and institution due to a lack of available data standards. Engineering syndromic surveillance systems in this way is time consuming and yields systems that work only locally. On the other hand, utilizing narrative records for syndromic surveillance brings unique challenges, as these data require natural language processing. Several studies have successfully experimented with this approach (3)(4). With the increasing adoption of electronic health records (EHRs) there is an abundance of clinically relevant data that could potentially be used for syndromic surveillance. As a result, creating a generic syndromic surveillance system that could be broadly applied and disseminated across institutions is attractive. This CATCH-IT report is a critique of a research paper detailing an approach to creating such a system (5).
The aim of this study was to develop and assess the performance of a syndromic surveillance system using both structured and narrative ambulatory EHR data. The evaluation methodology suggests that the authors may be trying to assess the system’s performance based on its concurrent validity with other existing surveillance systems.
The hypothesis is not explicitly stated. The authors appear to expect implicitly that the signals from the test systems and the ED data occur at the same time (no lag).
The Institute for Family Health (IFH) served as the data source for testing the surveillance systems. IFH comprises 13 community health centres in New York, all of which use the Epic EHR system.
The authors took two different approaches to developing their syndromic surveillance system: a tailored approach on structured data and a generic approach on narrative data. The two syndromes of interest, influenza-like illness (ILI) and gastrointestinal infectious disease (GIID), were defined by two physicians. Both sets of queries were developed based on these definitions. The tailored queries were created specifically for IFH by mapping key terms to available system data and using past influenza season data. The performance of the tailored queries for ILI and GIID was not thoroughly evaluated. The MedLEE natural language processing (NLP) system (6) was utilized to create the generic queries for narrative data. Generic queries were tested on internal medicine ambulatory notes from the Columbia University Medical Centre (CUMC). These queries were evaluated using a gold standard that was produced by a manual review of a subset of clinical notes. Queries were then selected based on their ROC performance.
The resulting queries were tested on 2004-2005 data from the Institute for Family Health (IFH). All structured notes with recorded temperature (124,568) and de-identified narrative notes (277,963) were analyzed. The results of the two test systems were compared with two existing sources, the New York City Emergency Department chief complaint syndromic surveillance system (NYC ED) (7) and the New York World Health Organization (WHO) influenza isolates, using the lagged cross-correlation technique. The NYC ED served as the only comparison source for GIID.
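For readers unfamiliar with lagged cross-correlation, the idea can be sketched as follows. This is a minimal illustration only, assuming plain Pearson correlation on equally spaced counts; the paper's actual procedure (any detrending, smoothing, or confidence interval estimation) is not reproduced here, and the function names and weekly counts are made up for the example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def lagged_cross_correlation(x, y, max_lag):
    """Correlate x[t] with y[t + lag] for each lag in [-max_lag, max_lag].

    A positive best lag means y trails x by that many time steps.
    """
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[:len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:len(y) + lag]
        result[lag] = pearson(xs, ys)
    return result


# Made-up weekly counts (not study data): series_b is series_a delayed by one week.
series_a = [2, 3, 5, 9, 14, 10, 6, 4, 3, 2]
series_b = [2, 2, 3, 5, 9, 14, 10, 6, 4, 3]

correlations = lagged_cross_correlation(series_a, series_b, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

The lag with the highest correlation is read as the estimated delay between the two signals. When the correlation curve is flat or asymmetric around its peak, that estimate becomes hard to interpret.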
For ILI, the lagged cross-correlations showed that the IFH structured and narrative signals correlated strongly both with the NYC ED signal (0.93 and 0.88, respectively) and with one another (0.88). The correlation with WHO isolates was also high (0.89 structured, 0.84 narrative), although it was less precise and produced an asymmetric lagged cross-correlation shape, which hindered interpretation of the true lag.
GIID results were more ambiguous. While IFH structured data correlated relatively well with the NYC ED data (0.81), the IFH narrative data correlated poorly with both IFH structured and NYC ED data (0.51 and 0.47, respectively). This result indicated that there was a particular problem with the generic narrative approach on GIID data. However, across all GIID comparisons the correlations lacked precision (wide confidence intervals) and the true lag was difficult to interpret.
The authors concluded that the tailored structured EHR system correlated well with validated measures with respect to both syndromes. While the narrative EHR data performed well only on ILI data, the authors believe this approach is feasible and has the potential for broad dissemination.
Methodological Issues & Questions
Both sets of queries were based on syndrome definitions created by two domain experts. While the definitions of ILI and GIID are fairly well established, it is not clear why the authors decided to employ expert opinion for their definitions rather than using a standard definition from the CDC or the WHO. Although the definitions used in this study are valid and likely do not represent a major source of error, any contentions on this issue could have easily been avoided.
The bigger methodological issue with respect to the structured query development is the lack of query evaluation. A crude measure of sensitivity was calculated for the ILI query, but no manual review was undertaken to produce measures of specificity or predictive value. Of even more concern, the performance of the GIID query was not assessed at all. There does not appear to be any logical reason for these methodological omissions, and this flaw calls into question the validity of the analyses of the structured query.
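To make the missing measures concrete, the sketch below shows how sensitivity, specificity, and positive predictive value would follow from a manual review of query results. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute query performance from counts against a manual-review gold standard.

    tp: notes flagged by the query that reviewers confirmed as cases
    fp: notes flagged by the query that reviewers rejected
    fn: confirmed cases the query missed
    tn: non-cases the query correctly left unflagged
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases the query finds
        "specificity": tn / (tn + fp),  # share of non-cases the query ignores
        "ppv": tp / (tp + fp),          # share of flagged notes that are true cases
    }


# Hypothetical review of 200 notes (illustrative numbers, not study data).
metrics = diagnostic_metrics(tp=40, fp=10, fn=10, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity alone, as reported for the ILI query, says nothing about how many flagged notes are false alarms. That is why specificity and predictive value matter for a surveillance signal.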
While the narrative query development was described in more detail than the structured query development, it is not without its flaws. CUMC notes were used for testing in this phase. However, the lack of context surrounding the CUMC makes it difficult to determine the robustness and generalizability of the query. For instance, it is not mentioned what EHR system is used, what kind of patient population is seen at this institution, or why ambulatory internal medicine notes were used for query testing. It seems counter-intuitive to use notes from a medical specialty to create a query for primary care ambulatory notes. Additionally, only notes generated by physicians who used the EHR were used, and we are not told how many physicians made up this population.
To their credit, the authors conducted a manual review of a subset of the notes to produce a gold standard, but this process is not detailed clearly. It is unknown how many notes were reviewed and why only one reviewer undertook the review. Ideally, at least two reviewers would perform the review and a measure of inter-rater reliability (kappa) would be reported.
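As an illustration of the inter-rater measure suggested above, Cohen's kappa for two reviewers can be sketched as follows. The reviewer labels are invented for the example (1 = note meets the syndrome definition, 0 = it does not), and the function name is ours.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Proportion of notes on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented labels for ten notes (not study data).
reviewer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reviewer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

Here the reviewers agree on 8 of 10 notes, but about half of that agreement would be expected by chance. Kappa is therefore noticeably lower than the raw 80% agreement.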
A large part of this study’s value rests on the comparisons made between the test IFH systems and the established systems (NYC ED and WHO). However, the fundamental question here is whether or not these comparisons are valid and appropriate as the patient population and data range may differ greatly.
The NYC ED system utilizes chief complaints, which are timely and produce good agreement for respiratory and GI illness (7). However, patients with GIID or mild respiratory symptoms may not go to the ED. This limitation has come to light as the system has failed in the past to detect GI outbreaks (7), and it indicates that the NYC ED data may not represent a gold standard. Additionally, the authors of the current study propose that the poor performance of the GIID narrative query may be due to the fact that the NYC ED covers a much broader geographical area. The implication here is that GIID are often localized and therefore may not be captured by local IFH community clinics. These distinctions between the NYC ED and IFH may account for some of the ambiguous results obtained and raise doubts about the decision to use the NYC ED as a comparison. However, it is likely that no alternative comparison source exists and that the NYC ED therefore represented the best available data source.
While the hypothesis is not explicitly stated, it appears that the authors would expect that their surveillance system would produce signals concurrent with the emergency signals. However, this assumption may not be valid as there is nothing to suggest that primary care signals would behave in this manner.
WHO isolates are used as the second ILI comparison in this study, but are not described in any noteworthy detail. This hinders the reader in understanding the appropriateness of this source and thereby the meaningfulness of the cross-correlation results. A brief search on the internet revealed that the WHO has a National Influenza Centre (NIC) in NY which is part of a larger WHO Global Influenza Surveillance Network. The NIC samples patients with ILI and submits their biomedical isolates to the WHO for analysis. The WHO in turn uses the information for pandemic planning. The amount of testing that these centres conduct depends on the phase of the flu season, with more testing occurring at the start of the season in order to confirm influenza and less testing occurring at the peak of the season due to practicalities. Because the WHO NIC differs greatly in motivation, operation, and scale from the NYC ED and IFH syndromic surveillance systems, its appropriateness as a comparison is questionable. In their study of the NYC ED, Heffernan et al. (7) include the WHO isolates as a visual reference to provide context for their own signals, but they do not attempt to use them as a correlation metric. For the reasons given above, this may be the most appropriate way to use the WHO isolates.
The last area of discussion concerns the keywords used to define the syndromes. A breakdown of the keywords used in each study found that only 3 terms (fever, viral syndrome, and bronchitis) were common to the NYC ED and IFH ILI queries. Interestingly, 4 terms used in the IFH queries (cold, congestion of the nose, sneezing, and sniffles) were exclusion criteria for ILI in the NYC ED. If the queries used across the studies are dissimilar, they may not be identifying the same patients, which would raise further doubts about the meaningfulness of the cross-correlation results.
Surprisingly, the GIID keywords for both NYC ED and IFH were similar. This complicates the interpretation of the GIID results, but may suggest that the issue here is with the geographical range and patient population.
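The ILI keyword comparison described above amounts to simple set operations, sketched below. Only the terms named in this report are included; the sets are therefore partial and do not reproduce the full keyword lists of either study, and the variable names are ours.

```python
# Partial term sets: only the ILI keywords named in this report.
ifh_ili_terms = {
    "fever", "viral syndrome", "bronchitis",
    "cold", "congestion of the nose", "sneezing", "sniffles",
}
nyc_ed_ili_inclusions = {"fever", "viral syndrome", "bronchitis"}
nyc_ed_ili_exclusions = {"cold", "congestion of the nose", "sneezing", "sniffles"}

# Terms both systems count towards ILI.
shared = sorted(ifh_ili_terms & nyc_ed_ili_inclusions)
# Terms IFH counts towards ILI but the NYC ED system uses to rule ILI out.
conflicting = sorted(ifh_ili_terms & nyc_ed_ili_exclusions)

print("shared:", shared)
print("conflicting:", conflicting)
```

A comparison run over the complete keyword lists would show how much of each query's case definition is actually shared, which bears directly on whether the two signals should be expected to correlate.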
Due to the methodological issues discussed above, it is difficult to determine the validity and salience of the conclusions reached by the authors. As this project was exploratory in nature, it would be beneficial if the authors took a step back to carefully review their queries and revise them appropriately after thorough evaluation of each query’s sensitivity, specificity, and predictive value. The priority should be to establish internal consistency and reliability between the two IFH systems before attempting to compare their performance with external measures that may or may not be appropriately matched.
Overall, while the study is under practical limitations in its choice of comparative data sources, the authors presented an interesting idea that will likely be of use in the future and should be developed and evaluated more carefully.
Questions for authors
1) How were abbreviations, misspellings, acronyms, and localized language dealt with in the narrative query development?
2) Why were internal medicine notes chosen for the IFH system if one is trying to create a general approach?
3) Can you please provide more detail about the WHO isolates and why they were chosen as comparison data?
References
1) Cochrane DG, Allegra JR, Chen JH, Chang HG. Investigating syndromic peaks using remotely available electronic medical records. Advances Dis Surveill 2007;4:48.
2) Thompson MW. Correlation between alerts generated from electronic medical record (EMR) data sources and traditional sources. Advances Dis Surveill 2007;4:268.
3) South BR, Gundlapalli AV, Phansalkar S, et al. Automated detection of GI syndrome using structured and non-structured data from the VA EMR. Advances Dis Surveill 2007;4:62.
4) Chapman WW, Dowling JN, Wagner MM. Fever detection from free-text clinical records for biosurveillance. J Biomed Inform 2004;37(2):120-7.
5) Hripcsak G, Soulakis ND, Li L, Morrison FP, Lai AM, Friedman C, Calman NS, Mostashari F. Syndromic surveillance using ambulatory electronic health records. J Am Med Inform Assoc 2009;16(3):354-61. Epub 2009 Mar 4.
6) Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11(5):392-402.
7) Heffernan R, Mostashari F, Das D, Karpati A, Kulldorff M, Weiss D. Syndromic surveillance in public health practice, New York City. Emerg Infect Dis 2004;10(5):858-64.