Impact of Data Science on the Clinical Laboratory

By Mark Hoffman, Ph.D. Chief Research Information Officer, Children’s Mercy Hospital Kansas City, Assoc. Professor Pediatrics, Assoc. Professor Biomedical and Health Informatics – University of Missouri – Kansas City

The diagnostic laboratory has always been a key source of data that informs clinical decisions. Clinical pathology tests generate discrete results with numeric or coded values that can be classified as normal or abnormal. Anatomic pathology analysis results in a report based on visual examination of tissues, aided by specialised stains, probes or other resources that help evaluate the sample for malignancy, inflammation or other clinically significant findings. Recent advances in molecular methods, including diagnostic genomic sequencing, as well as advanced imaging methods such as digital pathology, generate orders of magnitude more data than traditional methods. These advances have created exciting opportunities and some challenges for the laboratory community. The emerging discipline of data science offers a valuable toolkit to maximise the value of all modalities of laboratory data and to improve the diagnostic and operational functions of a modern lab.

Data science refers to the combination of computational, statistical and subject matter expertise necessary to recognise subtle patterns in high-volume, complex data and then to develop predictive models based on those analyses. Some common categories of data science approaches include artificial intelligence (AI), machine learning and deep learning. A data science project typically begins with a large data set that is divided into a training segment and a test segment. The training segment is used to iteratively design, develop and tune an algorithm. The test segment is held back and used to evaluate the finished algorithm on data it has not seen. A high volume of data is critical to the statistical power of any analysis; data sets in the dozens or even hundreds are often not deep enough to support complex analyses. For example, a data set combining haemoglobin A1c, glucose values, body mass index and dates of diabetes diagnosis with other clinical information could be used to identify early indicators of the onset of diabetes. The degree of expert involvement in this process depends on the specific data science methodology. Some AI approaches involve expert curation of the training data set and algorithm development; these are considered “supervised” methods. In contrast, deep learning is generally driven by inherent attributes of the source data. Specialised analysis approaches, including bioinformatics, can also fit into the broad category of data science.
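
The division into training and test segments can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the records are synthetic (randomly generated A1c values with made-up diagnosis labels), the function names are hypothetical, and the "algorithm" is a single A1c cut-off rather than a realistic predictive model. The resulting cut-off has no clinical meaning.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and divide it into training and test segments."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_on(segment, threshold):
    """Fraction of records where 'A1c >= threshold' matches the diagnosis label."""
    return sum((a1c >= threshold) == label for a1c, label in segment) / len(segment)


def tune_threshold(training, candidates):
    """Pick the candidate cut-off that classifies the most training records correctly."""
    return max(candidates, key=lambda threshold: accuracy_on(training, threshold))


# Synthetic, made-up records: (haemoglobin A1c in %, later diabetes diagnosis).
rng = random.Random(42)
records = []
for _ in range(1000):
    diagnosed = rng.random() < 0.3
    a1c = rng.gauss(6.8 if diagnosed else 5.4, 0.5)
    records.append((a1c, diagnosed))

# Tune on the training segment only, then evaluate once on the held-back test segment.
training, test = split_train_test(records, test_fraction=0.2, seed=1)
threshold = tune_threshold(training, [5.0 + 0.1 * i for i in range(26)])
print(f"training records: {len(training)}, test records: {len(test)}")
print(f"tuned cut-off: {threshold:.1f}, test accuracy: {accuracy_on(test, threshold):.3f}")
```

The point of the design is that the cut-off is chosen without ever looking at the test segment, so the test accuracy is an estimate of how the tuned rule would perform on new data.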

The incorporation of pathology information into electronic health records creates the opportunity to query this data for subtle patterns. At the local level, data analysis can help in quality control, for example in determining whether there has been drift in the results from an instrument, indicating a need for calibration. Some analyses require more data than is available from a single organisation. Initiatives in which de-identified electronic health record (EHR) data is aggregated from multiple organisations can provide a valuable resource for gaining new insights into the role of laboratory data in clinical decision making. For example, we recently used this approach to demonstrate that both high and low levels of magnesium in patients with a myocardial infarction correlate with higher mortality. We have also used aggregate EHR data to demonstrate that A1c tests are frequently ordered for sickle cell patients, a practice that should be avoided. The unstructured data found in the text content of pathology reports can also be evaluated using natural language processing methods. EHR data analysis is increasingly recognised as an important source of phenotype information to complement genomic analysis.
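
As a rough illustration of the instrument drift check mentioned above, the following Python sketch compares a rolling mean of recent control results with a band derived from an initial baseline period. It is a simplified example under stated assumptions, not an established quality control rule set: the function name, the baseline size, the window length, the limit of three standard errors and the control values themselves are all hypothetical.

```python
from statistics import mean, stdev


def detect_drift(control_results, baseline_size=20, window=5, limit=3.0):
    """Return the index of the first result whose rolling mean leaves the baseline band.

    The band is the baseline mean plus or minus `limit` standard errors of a
    mean of `window` results. Returns None if no drift is flagged.
    """
    if baseline_size < 2 or len(control_results) < baseline_size:
        raise ValueError("need at least two baseline results")
    baseline = control_results[:baseline_size]
    centre = mean(baseline)
    spread = stdev(baseline) / window ** 0.5  # standard error of a window mean
    if spread == 0:
        raise ValueError("baseline results show no variation")
    # Slide a window over the results that follow the baseline period.
    for end in range(baseline_size + window, len(control_results) + 1):
        recent = control_results[end - window:end]
        if abs(mean(recent) - centre) > limit * spread:
            return end - 1
    return None


# Made-up control values for a hypothetical analyte with a target of 100 units.
pattern = [100.0, 101.0, 99.0, 100.5, 99.5]
baseline = pattern * 4
stable = baseline + pattern * 2
drifting = baseline + [value + 0.3 * step
                       for step, value in enumerate(pattern * 3, start=1)]

print("stable run flagged at:", detect_drift(stable))
print("drifting run flagged at:", detect_drift(drifting))
```

Using a rolling mean rather than single results makes the check less sensitive to one-off outliers and more sensitive to a sustained shift, which is the pattern that suggests recalibration.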

Data science is also being applied to automate the analysis of diagnostic images, including pathology slides. The application of data science methods such as deep learning to these images has the potential to improve the accuracy of interpretation and to assist in the recognition of subtle but potentially significant patterns that elude the human eye or brain. AI-based methods use pathologist annotations of slides to train an algorithm. Early efforts in this area include the use of deep learning to recognise micrometastases of breast cancer in lymph node biopsies, which demonstrated increased sensitivity and reduced review time. Other work has explored the use of data science to enhance blurry regions in slide images and to support quality control.

Molecular diagnostic testing has become standard practice in diagnostic laboratories. Increasingly, clinical whole exome or genome sequencing is also becoming widely available for the management of cancer and the diagnosis of complex cases that do not yield to traditional methods. These methods generate massive volumes of high-complexity data. For example, a genomic analysis for a single patient can generate more than one terabyte of data. Data science methods assist in the analysis of these raw sequences as laboratories search for variants that may be clinically significant. Recent advances in single-cell sequencing will introduce another major shift in the volume of data that will ultimately be applied to reach a diagnosis. Genomic analysis has had many notable successes in single-variant Mendelian conditions and in assisting in the management of cancer. Common chronic diseases with known hereditary, polygenic influences remain difficult to characterise and will require the continued application of data science and bioinformatics methods to identify multifactorial contributions to diseases such as diabetes and asthma.

Data science methods have a wide variety of applications relevant to the laboratory. First, they can enhance the diagnostic capacity of the lab by offering novel means to improve accuracy and speed. Second, with the increasing complexity of laboratory data such as high-resolution images and genomic sequences, subtle patterns are increasingly likely to elude human perception. Data science can help augment the clinical expert as they navigate these new sources of diagnostic data. Third, emerging methods will support the integration of lab data with other clinical data to develop comprehensive predictive algorithms capable of early detection of disease risk or identifying optimal treatment strategies in support of precision medicine. Finally, there are numerous applications of data science that can support the administrative processes of operating a clinical lab. For example, understanding subtle patterns in test utilisation can help in inventory management. Likewise, models that predict patients at risk of being a no-show for specimen collection can help manage call centre reminders.

Laboratory professionals seeking to apply data science to address complex questions can take a number of approaches. The best approach is to form a collaborative team with computational and statistical experts to address a clearly defined problem. The team would identify and characterise the data available to them as they design their strategy. The team approach helps mitigate concerns that laboratorians have to become programmers to participate in data science. For those who do want to develop some of the technical skills, high-quality online data science training resources such as those provided by Coursera or edX provide an excellent starting point for learning more about the principles and methods of data science. Open source languages such as R and Python are widely available to perform complex data analysis, as are commercial packages. Laboratories that embrace data science will be well positioned to engage in the next generation of diagnostic technologies and methods.