Clinical Usability-Oriented Automatic Contour Quality Evaluation for Deep Learning Auto-Segmentation
International Journal of Radiation Oncology*Biology*Physics, ISSN: 0360-3016, Vol: 117, Issue: 2, Page: S144-S145
2023
Abstract Description
Auto-segmentation methods, including deep learning auto-segmentation (DLAS), are being increasingly adopted in radiotherapy. A common method to evaluate the quality of auto-segmented contours applies thresholds to various quantitative metrics (e.g., Dice similarity coefficient (DSC), mean distance to agreement (MDA)) that are often averaged over all contour slices. This method fails to detect contour errors on individual slices and thus reflects neither current clinical practice (slice-by-slice evaluation) nor clinical usability (e.g., expected contour editing time). In addition, results based on multiple metrics are generally not easy to interpret. This work aims to develop a novel contour quality classification (CQC) model to evaluate auto-segmented contours based on their clinical applicability. The CQC method was designed to classify a contour on a slice into an acceptable, minor edit, or major edit category based on the expected editing effort/time. Organ-specific supervised ensemble tree classification models were trained to relate the slice-based quality category to a combination of seven commonly used, calculable quantitative metrics (i.e., DSC, MDA, 95% Hausdorff distance, surface DSC, added path length (APL), slice area, and relative APL). The proposed method was demonstrated by training CQC models using DLAS contours of five abdominal organs (i.e., pancreas, duodenum, stomach, small bowel, and large bowel) from 50 MRI sets and evaluating them on 20 MRI and 9 CT testing sets. The test datasets were labelled by six individual observers, and consensus labels were generated by majority vote. Model performance was evaluated using accuracy (acc) and risk rate (RR, the percentage of unacceptable slices mislabeled as acceptable) and compared with inter-observer variation and a baseline threshold-based method.
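The evaluation quantities described above (consensus label by majority vote, accuracy, and risk rate) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: "unacceptable" is taken to mean any slice whose consensus label is minor edit or major edit, the RR denominator is taken to be the number of unacceptable slices, and ties among the six observers are resolved toward the more severe category. The function names and the toy labels are hypothetical.

```python
from collections import Counter

# Categories ordered by increasing severity; the order is used only for tie-breaking.
CATEGORIES = ["acceptable", "minor edit", "major edit"]


def majority_vote(labels):
    """Return the most frequent label for one slice.

    Assumption: ties go to the more severe category. The abstract does not
    say how ties among six observers were handled.
    """
    counts = Counter(labels)
    top = max(counts.values())
    tied = [c for c in CATEGORIES if counts.get(c, 0) == top]
    return tied[-1]


def accuracy(predicted, consensus):
    """Fraction of slices whose predicted category equals the consensus label."""
    matches = sum(p == c for p, c in zip(predicted, consensus))
    return matches / len(consensus)


def risk_rate(predicted, consensus):
    """Fraction of unacceptable slices that were predicted as acceptable.

    Assumptions: "unacceptable" means a consensus label of minor or major
    edit, and the denominator is the number of such slices.
    """
    preds_for_unacceptable = [
        p for p, c in zip(predicted, consensus) if c != "acceptable"
    ]
    if not preds_for_unacceptable:
        return 0.0
    missed = sum(p == "acceptable" for p in preds_for_unacceptable)
    return missed / len(preds_for_unacceptable)


# Toy example: four slices, each labelled by six hypothetical observers.
observer_labels = [
    ["acceptable"] * 5 + ["minor edit"],
    ["minor edit"] * 3 + ["major edit"] * 3,  # tie, resolved to "major edit"
    ["minor edit"] * 4 + ["acceptable"] * 2,
    ["major edit"] * 6,
]
consensus = [majority_vote(votes) for votes in observer_labels]
predicted = ["acceptable", "major edit", "acceptable", "major edit"]

print(consensus)
print(accuracy(predicted, consensus), risk_rate(predicted, consensus))
```

In this toy case the third slice is the kind of error RR is meant to capture: the consensus says it needs a minor edit, but the prediction passes it as acceptable.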
Compared to the majority vote labels, the CQC models achieved a mean accuracy of 95.8% ([94.5%-99.1%]) and 94.3% ([90.6%-96.9%]) and a mean RR of 0.8% ([0.3%-1.3%]) and 0.7% ([0%-1.1%]) for the MRI and CT testing sets, respectively. The CQC performance was comparable to the inter-observer variation and significantly better than that of the threshold-based method with single or multiple metrics. Execution on a typical abdominal dataset (e.g., 70 slices) took less than 3 seconds.
Table 1. CQC model performance for different organs
The proposed CQC model can classify the quality of a contour slice with high accuracy. This slice-based, single-output evaluation method better reflects current clinical practice and may be used to evaluate and compare the performance of DLAS on any image modality, facilitating its clinical implementation and quality assurance.
Bibliographic Details
Elsevier BV