AI-Powered Follicle Measurement and Counting: Achieving Greater Efficiency and Consistency?

In IVF stimulation cycles, follicle monitoring is a key component of clinical decision-making. Clinicians rely on transvaginal ultrasound to observe follicle count and diameter changes, assess ovarian response, and guide subsequent treatment. Traditionally, this has been a manual process relying on visual inspection and measurement. For experienced operators, it’s routine; however, in real-world practice, factors such as 2D image quality, acoustic shadowing, overlapping adjacent follicles, poorly defined borders, and operator variability can all affect follicle identification and measurement accuracy.

During peak stimulation monitoring periods, clinicians and sonographers must complete a high volume of exams in a short time. The more follicles present, the more time-consuming manual measurement becomes. The more complex the image—particularly with small or ill‑defined follicles—the higher the risk of missing follicles. The practical question is: how can follicle counting and diameter measurement be made more efficient, more consistent, and easier to review without changing existing ultrasound workflows? A recent multicenter study evaluated the feasibility of using AI to automatically identify, count, and measure follicles during ovarian stimulation and IVF monitoring.

Real‑World Validation

The study employed a multicenter, multi‑device design, including 5,508 transvaginal ultrasound scans from 1,689 patients undergoing controlled ovarian stimulation. Data were used for model training, independent testing, and expert consensus evaluation. For the consensus test set, three IVF ultrasound experts independently annotated the scans and then reached a consensus, enabling comparison between AI and expert performance. To further assess real‑world usability, the study also included 904 prospective ultrasound cine loops. AI results were integrated into the clinical PACS workflow, and the number of additions, modifications, or deletions made by experts to the AI‑generated measurements was recorded. This design allows evaluation of model stability across different devices and centers, and shows how much manual correction is needed when AI is introduced into the clinical workflow.

For larger follicles (≥10 mm), which are of particular interest in mid‑to‑late stimulation, the model performed consistently and close to the average expert level. In late stimulation, the automatically counted number of follicles ≥10 mm correlated highly with expert consensus, with an average error of approximately one follicle. For diameter measurement, the mean absolute error between AI and experts was 0.74 mm, compared with 0.83 mm between experts, indicating that automated measurements fall within the range of inter‑expert variability. In terms of efficiency, manual annotation of follicles took an average of 220 seconds per ovary, while automated processing followed by expert review took approximately 100 seconds. In the prospective workflow, the 904 scans required an average of only 0.54 manual edits per scan; of 13,710 follicles, only 3.57% required addition, modification, or deletion.
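The reported workflow figures can be cross-checked with simple arithmetic, and the diameter metric is easy to state precisely. The Python sketch below does both. The function names are illustrative and not taken from the study, the check assumes each manual edit touches exactly one follicle, and the paired diameters are made-up values used only to show how a mean absolute error is computed.

```python
def edits_per_scan(total_follicles: int, edited_fraction: float, scans: int) -> float:
    """Average number of edited follicles per scan implied by an overall edit rate.

    Assumption: one manual edit (add, modify, delete) corresponds to one follicle.
    """
    return total_follicles * edited_fraction / scans


def mean_absolute_error(predicted_mm, reference_mm):
    """Mean absolute difference between paired diameter measurements, in mm."""
    if len(predicted_mm) != len(reference_mm) or not predicted_mm:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted_mm, reference_mm))
    return total / len(predicted_mm)


# Figures reported for the prospective workflow.
implied = edits_per_scan(total_follicles=13_710, edited_fraction=0.0357, scans=904)
print(f"implied edits per scan: {implied:.2f}")  # agrees with the reported 0.54

# Hypothetical paired diameters (mm), for illustration only; not study data.
ai = [12.0, 15.5, 18.2, 10.4]
expert = [12.6, 15.0, 17.1, 11.0]
print(f"MAE: {mean_absolute_error(ai, expert):.2f} mm")
```

About 489 edited follicles spread over 904 scans gives roughly 0.54 edits per scan, so the two reported figures are consistent with each other under the one-edit-per-follicle assumption.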

Follicle‑by‑Follicle Identification

The clinical need for follicle monitoring demands more than just identifying a “region” of interest. Clinicians need information at the individual follicle level: the location of each follicle, the total count, the diameter of each, and which follicles have reached a clinically relevant size. Some previous image analysis approaches rely on semantic segmentation—labeling which pixels belong to follicles. However, semantic segmentation does not inherently separate adjacent follicles, making it less suitable for counting and individual follicle measurement.

When follicles are close together, have weak borders, or are irregularly shaped, post‑processing after region labeling often leads to merging or incorrect separation. This study used an end‑to‑end instance segmentation approach, directly outputting independent results for each follicle—more appropriate for counting and diameter measurement. The model directly processes conventional 2D transvaginal ultrasound cine loops without requiring the operator to pre‑define the ovarian region or perform any additional preprocessing. In other words, the automated output closely matches the information that clinicians need to review and record in practice.
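The difference between the two approaches can be illustrated with a toy example. The Python sketch below is not the study's model. It builds a tiny hand-made label map with two touching follicles, derives the semantic mask a pixel-level classifier would produce, and counts follicles by connected-component labeling, a common post-processing step assumed here for illustration. The grid and function name are hypothetical.

```python
from collections import deque


def count_connected_regions(mask):
    """Count 4-connected foreground regions in a binary mask (list of lists of 0/1)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # New region found: flood-fill it so it is counted only once.
                regions += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return regions


# Toy instance map: 0 = background, 1 and 2 = two adjacent follicles.
instance_map = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 2, 0],
    [0, 1, 1, 2, 2, 2, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

# A semantic mask only records "follicle" versus "not follicle".
semantic_mask = [[1 if v else 0 for v in row] for row in instance_map]

semantic_count = count_connected_regions(semantic_mask)
instance_count = len({v for row in instance_map for v in row if v != 0})

print(semantic_count, instance_count)  # prints: 1 2
```

Because the two follicles share a border, the semantic mask collapses them into one region, while the instance labels keep them separate for counting and per-follicle measurement.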

Because follicle monitoring is not simply a matter of detecting presence or absence, the study did not rely on a single classification metric. Instead, it evaluated follicle detection, count agreement, diameter measurement error, cross‑device stability, and the number of manual edits in the real‑world workflow. Together, these metrics address a more practical question: can automatically generated follicle measurements serve as a reliable starting point for clinical review?
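One of these metrics, count agreement, can be sketched in a few lines of Python. The version below is a simplified illustration, not the study's evaluation code: it compares per-scan counts of follicles at or above a size threshold and does not match individual follicles between AI and consensus. The function names and diameter lists are hypothetical.

```python
def count_at_or_above(diameters_mm, threshold_mm=10.0):
    """Number of follicles whose diameter is at least threshold_mm."""
    return sum(1 for d in diameters_mm if d >= threshold_mm)


def mean_absolute_count_error(ai_scans, consensus_scans, threshold_mm=10.0):
    """Average per-scan difference in the count of follicles >= threshold_mm."""
    if len(ai_scans) != len(consensus_scans) or not ai_scans:
        raise ValueError("need the same non-zero number of scans on both sides")
    diffs = [
        abs(count_at_or_above(a, threshold_mm) - count_at_or_above(c, threshold_mm))
        for a, c in zip(ai_scans, consensus_scans)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical per-scan diameter lists (mm); not data from the study.
ai_scans = [[11.2, 14.0, 9.1], [16.5, 17.0, 12.3, 8.0], [10.0, 9.9]]
consensus_scans = [[11.0, 13.6, 10.2], [16.9, 17.4, 12.0], [10.3, 10.1]]

error = mean_absolute_count_error(ai_scans, consensus_scans)
print(f"mean absolute count error: {error:.2f} follicles per scan")
```

In this toy data, follicles measured just under or just over 10 mm account for every count disagreement, which shows why a count metric and a diameter-error metric are best read together.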

Limitations with Small Follicles

Nevertheless, model performance varied across different scenarios. The study showed that larger, well‑formed follicles with good image quality were most accurately identified. Performance decreased for follicles <10 mm in diameter, those with irregular borders, or in out‑of‑focus images. Small follicles are inherently more indistinct and more susceptible to interference from adjacent structures, acoustic shadowing, and image quality. Therefore, in settings such as antral follicle count (AFC) assessment or when many small follicles are present, automated measurements still require expert review and cannot simply replace human judgment.

From a clinical workflow perspective, the value of automation lies primarily in reducing repetitive measurements and improving consistency. It can reduce inter‑observer variability, shorten routine follicle annotation time, and generate more consistent, traceable scan and measurement records for easier review and quality management. It is important to note that this study evaluated follicle detection, counting, diameter measurement, and workflow efficiency; it did not demonstrate that AI involvement improves ultimate clinical outcomes such as oocyte yield, embryo quality, pregnancy rate, or live birth rate. Also, expert consensus annotation, while more reliable than a single expert’s judgment, is still based on human visual assessment and does not represent an absolute ground truth.

Summary

Overall, this study demonstrates the feasibility of AI for follicle identification, counting, and diameter measurement. Performance for follicles ≥10 mm was close to expert level, and results were relatively stable across multi‑center, multi‑device data. The low number of manual edits per scan in the prospective workflow, together with the shorter measurement time, suggests that automated measurement can be integrated into routine clinical practice. However, such tools are best suited as adjuncts to measurement, not as standalone decision‑making systems. Automated measurement of larger follicles shows good stability, but performance on small follicles and complex images, and any effect on final clinical outcomes, still require further validation. The clinical value is not to replace clinician judgment, but to turn part of the repetitive, time‑consuming, and subjective work of measurement into more consistent, reviewable, and traceable initial results.