Software containerization

As data science explodes worldwide, laboratory medicine is well positioned to frame discussions about data maintenance and access. After all, within the healthcare sector, clinical laboratories produce more of the numeric data used to diagnose and manage disease than any other source.

That said, healthcare is rightfully conservative based on the “do no harm” maxim, so it’s important to incorporate data science into routine workflows carefully. This involves developing and supporting data pipelines, the data-science infrastructure through which data can flow, and lab professionals have an important role to play in making this happen. Because much of the software used for these data pipelines is open source rather than commercial, the onus for maintenance and troubleshooting falls to the local data-science team.

This can be a challenge for clinical labs because current laboratory information systems (LIS) are built around commercial, application-specific software, which typically has maintenance workflows that require specific servers, interfaces, and test systems. In addition, the typical LIS includes multiple interfaced software components. That means upgrading one software application can affect many other systems, leading to implementation delays and the need for significant additional testing.

In recent years, a new tool has been created for data pipelines that offers flexibility, reproducibility, and maintainability and may improve other areas of the LIS as well: the virtual container.

WHAT IS A VIRTUAL CONTAINER?

A virtual container is a unit of software that includes application-specific information, including software versions, libraries, and operating system modifications. (See Figure 1.) Unlike a virtual machine (VM), which emulates a physical computer by including a complete operating system, a container includes only the software layers above the operating system level. Many institutions deploy virtual containers within a VM server.
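To make the idea of "application-specific information" concrete, here is a minimal sketch of how the contents of a container could be written down and turned into a build recipe. It assumes Docker-style images and a Python application; the `ContainerSpec` class, the `render_dockerfile` helper, and the file name `qc_report.py` are hypothetical names chosen for illustration, not part of any system described in this article. The output uses only standard Dockerfile instructions (`FROM`, `WORKDIR`, `RUN`, `COPY`, `CMD`).

```python
from dataclasses import dataclass, field


@dataclass
class ContainerSpec:
    """Hypothetical description of what one container image should hold."""
    base_image: str                                          # layer the image builds on
    packages: dict[str, str] = field(default_factory=dict)   # library -> pinned version
    app_files: list[str] = field(default_factory=list)       # application code to copy in
    command: list[str] = field(default_factory=list)         # process the container starts


def render_dockerfile(spec: ContainerSpec) -> str:
    """Turn the spec into Dockerfile text using standard Dockerfile instructions."""
    lines = [f"FROM {spec.base_image}", "WORKDIR /app"]
    if spec.packages:
        # Sort so the same spec always renders the same text.
        pins = " ".join(
            f"{name}=={version}" for name, version in sorted(spec.packages.items())
        )
        lines.append(f"RUN pip install --no-cache-dir {pins}")
    for path in spec.app_files:
        lines.append(f"COPY {path} /app/{path}")
    if spec.command:
        quoted = ", ".join(f'"{part}"' for part in spec.command)
        lines.append(f"CMD [{quoted}]")
    return "\n".join(lines) + "\n"


# Illustrative values only; the versions and file name are assumptions.
example_spec = ContainerSpec(
    base_image="python:3.11-slim",
    packages={"pandas": "2.1.4", "numpy": "1.26.4"},
    app_files=["qc_report.py"],
    command=["python", "qc_report.py"],
)
print(render_dockerfile(example_spec))
```

The point of the sketch is that everything the application needs above the operating system level is stated explicitly in one place, which is what makes the resulting image reproducible.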

Containers often are used within areas of the laboratory that require bioinformatics applications, such as whole-genome or exome sequencing, molecular cancer, and molecular microbiology workflows. Excellent examples of containerization in clinical workflows outside the clinical laboratory come from areas adjacent to clinical pathology, including molecular pathology (1) and digital pathology (2).

As data science continues to expand into the laboratory workspace, there has been an increase in the deployment of containers for improved workflows. In this article, we review how virtual containers are being applied to help solve clinical laboratory problems (see Table 1) and highlight our experience using them for clinical applications.

ADVANTAGES

PORTABILITY

A major bottleneck to adopting advanced data-science applications in clinical laboratories is that doing so requires data expertise. A data-analytics ecosystem must span the application’s coding, testing, validation, and deployment. But the issues don't end there. Suppose a great data-science tool is presented at a conference or within a journal article. How do we incorporate this new application into our ecosystem to test, validate, and possibly deploy? This is a complex problem if you don’t have access to the original creator's data streams, libraries, and operating systems, even if the code is available for use on an open-source platform such as GitHub.

Containers can partially solve these issues, because they include the correct software and library versions required to run the application. Bioanalytical-specific containers are available through BioContainers, an open-source and community-based repository (3). Such container repositories foster standardization and shareability and enhance code review, which improves consistency within the laboratory.
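The version problem that containers address can be illustrated with a short check. The sketch below, which assumes a Python environment, compares the library versions a published tool was built with against what is installed locally. The function names and the idea of a `pins` dictionary are our own illustration; the only library call relied on is `importlib.metadata.version` from the Python standard library.

```python
from importlib import metadata
from typing import Callable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> dict[str, tuple[str, Optional[str]]]:
    """Map each package whose installed version differs from its pin
    to a (wanted, found) pair; found is None when the package is missing."""
    mismatches = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

On a shared server, a check like this tends to report several mismatches. Inside a container built from the tool's own recipe it should report none, because the pinned versions travel with the application.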

SOFTWARE MAINTENANCE

Software maintenance becomes a major consideration any time one embarks on in-house data science. Developing a data-science application should mirror any other software-development process, which includes research and development, application testing, and placement on an active production server. Each application progresses from development to testing and finally to validation before being put into production.

However, even after this process is complete, application maintenance must continue. For example, someone will request feature improvements to applications that are already live on the production server, or the software will break and need to be fixed, or version updates will be needed to the underlying software to address security problems or known bugs.

By virtue of their portability, virtual containers can aid with application improvements, breaks, and version updates. Even if a test server is unavailable, a change can be built into a new container and tested there without affecting the production server. Ideally, every software change is tested this way before going into production.
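One way to test a changed container before it replaces the production one is to run both versions on the same reference inputs and list any disagreements. The sketch below assumes each version can be called as a Python function that returns a number; in practice the two callables might wrap requests to two running containers. The name `compare_versions` and the tolerance value are illustrative assumptions.

```python
import math
from typing import Callable, Iterable


def compare_versions(
    production: Callable[[float], float],
    candidate: Callable[[float], float],
    reference_inputs: Iterable[float],
    rel_tol: float = 1e-9,
) -> list[tuple[float, float, float]]:
    """Run both versions on the same inputs.

    Returns a list of (input, production_result, candidate_result) for every
    input where the two results differ by more than the relative tolerance.
    An empty list means the candidate reproduced production on all inputs.
    """
    differences = []
    for value in reference_inputs:
        expected = production(value)
        observed = candidate(value)
        if not math.isclose(expected, observed, rel_tol=rel_tol):
            differences.append((value, expected, observed))
    return differences
```

An empty result does not replace formal validation, but it gives a quick, repeatable signal that an update has not changed results on known cases.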

Server upgrades and migrations are another common occurrence. Often, security patches are applied to a server without testing all the software installed on it, which can break applications. Or, when a new server is created, even as a VM, it often has the latest software versions, which may not be compatible with the migrated applications.

Virtual containers can be migrated and placed into production without significant changes. They would need minimal validation, a major time saver, especially if hundreds of applications are being relocated.

QUALITY METRIC MONITORING

The data needed to generate a laboratory’s quality metrics are predominantly obtained manually and used to create regular reports. This information is gathered at different stages of specimen processing and is error-prone due to time pressure and cognitive burden.

To reduce these problems, we are developing readily available, usable, location- and role-based containerized web applications to populate a centralized database. These applications will help monitor workflow and improve resource allocation. We can gather quality metrics data in real time, allowing any clinical issues they reveal to be rectified as soon as possible.
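As a small illustration of a metric computed from electronically captured events, the sketch below summarizes turnaround time from pairs of timestamps. Turnaround time is used here only as a familiar example; the function name, the event format, and the target value are assumptions for illustration and do not describe the applications we are building.

```python
from datetime import datetime, timedelta
from statistics import median


def turnaround_summary(events, target_minutes):
    """Summarize turnaround time for (received, resulted) datetime pairs.

    Returns the number of specimens, the median turnaround in minutes, and
    the fraction of specimens resulted within the target.
    """
    minutes = [
        (resulted - received).total_seconds() / 60
        for received, resulted in events
    ]
    if not minutes:
        return {"count": 0, "median_minutes": None, "fraction_within_target": None}
    within = sum(1 for m in minutes if m <= target_minutes)
    return {
        "count": len(minutes),
        "median_minutes": median(minutes),
        "fraction_within_target": within / len(minutes),
    }


# Illustrative data: four specimens received at the same time.
start = datetime(2024, 1, 1, 8, 0)
example_events = [
    (start, start + timedelta(minutes=m)) for m in (30, 45, 90, 60)
]
print(turnaround_summary(example_events, target_minutes=60))
```

Because the calculation runs on timestamps as they arrive rather than on manually compiled reports, the same summary can be refreshed continuously for a dashboard.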

This system also helps improve laboratories’ communication and transparency. The electronic data repository provides lab-, division-, assay-, and user-level metric dashboards for different user roles and locations. The containerized workflow offers flexible building blocks for adapting to ever-expanding and changing laboratory workflows. At the same time, it takes advantage of rapid advances in web technologies and data-science fields to quickly improve the lab’s efficiency, scale for a rapid increase in demands, and improve the quality of diagnostic services.

PROSPECTS FOR DIGITAL PATHOLOGY

Digital anatomic pathology systems are in the early stages of development, but the recent pandemic has accelerated their adoption at many hospitals for primary diagnosis. Scanning, compression, storage standards, and image-analysis methods are rapidly changing for digital pathology (4). Evolving systems have led to the accumulation of more image data and the integration of other relevant clinical, laboratory, and molecular data.

Providing a more comprehensive anatomic pathology report requires a robust informatics infrastructure that can quickly adapt, validate, and integrate the latest developments. Digital pathology processing has benefited from virtual containers and containerized infrastructure for applications in research (2) and artificial intelligence systems (5).

HARNESSING MACHINE LEARNING

Machine learning, another emerging area of data science impacting the clinical laboratory, will also benefit from containerization (6).

Most current LIS and electronic health record products do not have a simple way to integrate machine learning into data pipelines. Additionally, machine learning shows the most benefit when it is integrated into real-time applications. For these reasons, machine learning exists outside commercially established workflows in most LIS departments. Containers could provide a method for integrating machine learning into the clinical workflow to provide real-time feedback.
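A containerized model is often exposed as a small web service that other systems can call for a score. The sketch below shows the general shape using only the Python standard library. It is an assumption-laden illustration: `placeholder_model` is a simple threshold rule standing in for a trained model, and the field names `value` and `upper_limit`, the port, and the function names are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def placeholder_model(features):
    """Stand-in for a trained model. This is a threshold rule, NOT a classifier."""
    return 1.0 if features["value"] > features["upper_limit"] else 0.0


def score_request(body: bytes, model=placeholder_model) -> tuple[int, dict]:
    """Parse a JSON request body, apply the model, return (HTTP status, response)."""
    try:
        features = json.loads(body)
        score = float(model(features))
    except (ValueError, KeyError, TypeError) as error:
        # Malformed JSON, missing fields, or wrong types are the caller's error.
        return 400, {"error": f"bad request: {error.__class__.__name__}"}
    return 200, {"score": score}


class ScoreHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the model's score as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = score_request(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Blocking call; in a container this would be the start command."""
    HTTPServer(("0.0.0.0", port), ScoreHandler).serve_forever()
```

Keeping the scoring logic in `score_request`, separate from the web handler, means the model can be validated on its own and swapped for a new version without touching the interface that other systems depend on.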

DISADVANTAGES

CHALLENGES TO IMPLEMENTATION

We would be remiss if we did not address some of the implementation challenges associated with virtual containers. First, there are always security concerns. It’s important to involve hospital IT security professionals from the beginning. They will help guide any decisions affecting data security. In our case, hospital IT, including architecture and security, was involved early in the process and interfaced with our data-analytics team.

Regulatory issues also pose a challenge. These are not specific to containers but apply to any data pipeline. Establishing data provenance and validating data accuracy are vital to the acceptance of results of all types of data pipelines (7). Artificial intelligence, another regulatory concern, requires validation of the data pipeline and the algorithm used for clinical purposes (8).
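Data provenance can be supported by recording, for every pipeline step, exactly which data went in, which software ran, and what came out. The sketch below is one minimal way to do this, assuming the step's input and output are available as bytes. The record fields, the function names, and the example image reference are hypothetical; the hashing uses SHA-256 from the Python standard library.

```python
import hashlib
from datetime import datetime, timezone


def provenance_record(input_bytes, output_bytes, image_reference, pipeline_step):
    """Build a record tying one pipeline step's output to its input and software.

    image_reference is whatever identifies the container image that ran the
    step (for example, a name and version tag); it is stored as given.
    """
    return {
        "pipeline_step": pipeline_step,
        "image_reference": image_reference,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def verify_input(record, input_bytes):
    """Check that the given data is byte-for-byte what the record was built from."""
    return record["input_sha256"] == hashlib.sha256(input_bytes).hexdigest()
```

Stored alongside results, records like these let a reviewer confirm later that a reported value came from a specific dataset processed by a specific, validated container version.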

Finally, a data-analytics team embedded in the laboratory is key to achieving momentum on projects of this scale. It took us about three years to implement our container pipeline. Even with the hospital IT groups working alongside us, it was only possible with the help of our data-analytics team.

Although it’s necessary to draw on hospital IT, they have too many diverse priorities to push forward projects relevant to only a small portion of the hospital. That’s why having a department-based team is critical. But adding staff requires justification within the pathology and laboratory departments. That can be tricky because this new data team is a resource to which only some will have access.

These are undoubtedly challenges, but we believe they are manageable if you have the vision and tenacity of your department behind you.

SUMMARY

Containerization addresses the growing necessity of integrating information from multiple data types into pathology reports. Containers represent a modularized model for information flow that will help laboratories to rapidly adapt to the changing disease classification systems and provide reports that present all relevant information for clinicians.

Recent technological trends—including increases in electronic data, computing power, and machine learning—provide opportunities to reduce monotonous work for healthcare providers. They also offer new tools for serving the increasing healthcare demands posed by an aging population and higher life expectancy. Containers offer a promising framework for harnessing these advances to address modern challenges.

Container encapsulation leads to easier installation, evaluation, and clinical implementation of software applications and machine-learning technology. Hence, we propose a container repository system similar to BioContainers to benefit the laboratory medicine community. As with other new system implementations, collaboration is critical to success. Those working on data science for the clinical laboratory can be partners who produce flexible and durable applications to be shared with all.

Dustin R. Bunch, PhD, DABCC, is assistant director of clinical chemistry and laboratory informatics at Nationwide Children’s Hospital and assistant professor-clinical at the College of Medicine at The Ohio State University in Columbus, Ohio.

Srinivasa Chekuri, MBBS, MPH, is an anatomic pathologist in the department of pathology and laboratory medicine at Nationwide Children's Hospital, in Columbus, Ohio.

REFERENCES

  1. Kadri S, Sboner A, Sigaras A, et al. Containers in Bioinformatics: Applications, Practical Considerations, and Best Practices in Molecular Pathology. J Mol Diagn 2022;24:442-454; doi:10.1016/j.jmoldx.2022.01.006.
  2. Saltz J, Sharma A, Iyer G, et al. A Containerized Software System for Generation, Management, and Exploration of Features from Whole Slide Tissue Images. Cancer Res 2017;77:e79-e82; doi:10.1158/0008-5472.CAN-17-0316.
  3. da Veiga Leprevost F, Grüning BA, Alves Aflitos S, et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 2017;33:2580-2582; doi:10.1093/bioinformatics/btx192.
  4. Pantanowitz L, Sharma A, Carter AB, et al. Twenty Years of Digital Pathology: An Overview of the Road Travelled, What is on the Horizon, and the Emergence of Vendor-Neutral Archives. J Pathol Inform 2018;9:40; doi:10.4103/jpi.jpi_69_18.
  5. Greenwald NF, Miller G, Moen E, et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nat Biotechnol 2022;40:555-565; doi:10.1038/s41587-021-01094-0.
  6. Herman DS, Rhoads DD, Schulz WL, et al. Artificial Intelligence and Mapping a New Direction in Laboratory Medicine: A Review. Clin Chem 2021;67:1466-1482; doi:10.1093/clinchem/hvab165.
  7. Schulz WL, Kvedar JC, Krumholz HM. Agile analytics to support rapid knowledge pipelines. npj Digital Medicine 2020;3:108; doi:10.1038/s41746-020-00309-z.
  8. Schulz WL, Durant TJS, Krumholz HM. Validation and Regulation of Clinical Artificial Intelligence. Clin Chem 2019;65:1336-1337; doi:10.1373/clinchem.2019.308304.