Human Blastocyst Classification after In Vitro Fertilization Using Deep Learning
Abstract
Embryo quality assessment after in vitro fertilization (IVF) is primarily done visually by embryologists. Variability among assessors, however, remains one of the main causes of the low success rate of IVF. This study aims to develop an automated embryo assessment based on a deep learning model. The study includes a total of 1084 images of 1226 embryos. The images were captured by an inverted microscope at day 3 after fertilization and labelled based on the Veeck criteria, which differentiate embryos into grades 1 to 5 according to the size of the blastomeres and the degree of fragmentation. Our deep learning grading results were compared to the grading results from trained embryologists to evaluate the model performance. Our best model, obtained by fine-tuning a pre-trained ResNet50 on the dataset, achieves 91.79% accuracy. The model presented could be developed into an automated embryo assessment method in point-of-care settings.
Index Terms:
in vitro fertilization, embryo grading, deep learning
I Introduction
Embryo quality plays a pivotal role in a successful IVF cycle. Embryologists assess embryo quality from the morphological appearance using direct visualization [1]. There are three protocols at different time points to evaluate the quality of an embryo: (1) quality assessment of the zygote (16-18 hours after oocyte insemination), (2) morphological quality assessment of cleavage-stage embryos (day 3 after insemination), and (3) morphological quality assessment of blastocyst-stage embryos (4-5 days after fertilization) [2].
This kind of visual assessment is susceptible to the subjectivity of the embryologists. There are two kinds of variability in embryo assessment, as seen in [3]: interobserver and intraobserver. Grading systems like the Veeck criteria [4] aim to standardize grading and minimize both variabilities. However, as also found in [3], “the embryologists often gave an embryo a score different than Dr. Veeck, but that score was typically within one grade.” The study also shows that the intraobserver variation is limited. Khosravi et al. [5] also show that only 89 out of 394 embryos were classified as the same quality by all five embryologists in their study.
In recent years, deep learning for computer vision has been applied in medical imaging to address this variability issue. From MRI for brain imaging [6] and various other anatomical areas [7] to point-of-care diagnostics from microscopy images [8], deep learning has helped medical practitioners diagnose better. In the field of reproductive medicine, we have also seen applications of artificial intelligence, as shown in [9]. Recent studies have also explored the possibility of automating embryo assessment for IVF [5, 10, 11] using robust classifiers trained on thousands of images.
Our main contribution is that, while previous studies [5, 10, 11] used day 5 embryo images, we use day 3 embryo images. As noted in [12], “Early embryos can grow in a simple salt solution, whereas they require more complex media after they reach the 8-cell stage.” Day 3 and day 5 embryos are similar in implantation, clinical pregnancy, twinning, and live birth rates [13, 12], but since day 5 embryos are extremely sensitive to a suboptimal culture environment, many clinics still perform day 3 embryo transfers [14]. Moreover, unlike previous studies, which each relied on a single architecture, we compared several deep learning architectures on this task. The dataset used in this study is described in the following section.
II Dataset
Our dataset comprises 1084 images of 1226 embryos from 246 IVF cycles at Yasmin IVF Clinic, Jakarta, Indonesia. The images were captured by an inverted microscope at day 3 after fertilization, and each image contains 2-3 embryos. We manually cropped the individual embryos, and a team of 4 embryologists graded them from 1 to 5 using the Veeck criteria [4]. However, we only found grade 1 to grade 3 embryos in the samples. This yields 1226 identified embryos: 459 grade 1, 620 grade 2, and 147 grade 3.
To train the model and to estimate the generalization error, we divided the dataset into train and test sets with a 75:25 ratio. The training set was further divided into training and validation sets with a 70:30 ratio. Some examples of the images in the training set can be seen in Figure 1, and a sketch of the splitting scheme follows the figure. These images were automatically cropped and resized by the library that we used for the deep learning application [15]. This preprocessing step was done to prevent overfitting of the models.
Figure 1: Examples of cropped embryo images from the training set.
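As a concrete illustration of the split described above, the following is a minimal sketch assuming the cropped single-embryo images and their grades are listed in a hypothetical `labels.csv` file with `filename` and `grade` columns; it is not the exact pipeline used in this study.

```python
# Minimal sketch of the 75:25 and 70:30 splits described above, assuming a
# hypothetical labels.csv with `filename` and `grade` columns for the cropped
# single-embryo images (not the exact pipeline used in this study).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("labels.csv")

# 75:25 split into development and held-out test sets, stratified by grade so
# the grade 1/2/3 proportions are preserved in both sets.
dev_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df["grade"], random_state=42
)

# The development set is further split 70:30 into training and validation sets.
train_df, valid_df = train_test_split(
    dev_df, test_size=0.30, stratify=dev_df["grade"], random_state=42
)

print(len(train_df), len(valid_df), len(test_df))
```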
III Methodology
We used the fast.ai library [15] for the deep learning application. The library helped us perform transfer learning from several pre-trained convolutional neural networks [16], namely the residual networks [17] with different depths (ResNet18, ResNet34, ResNet50, ResNet101), densely connected convolutional networks [18], Xception [19], and MobileNetV2 [20]. We trained these models using backpropagation with the 1cycle policy [21] and used the cyclical learning rates [21] implemented in the fast.ai library to find the best learning rates. To obtain an unbiased estimate of the accuracy of a model, we trained the model and predicted the test set five times after ensuring that we got the best model during the training step.
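The training procedure can be sketched with the fast.ai library as follows. This is a minimal, illustrative example that assumes the cropped embryo images sit in grade-named subfolders under hypothetical `data/` and `test/` directories; the hyperparameters shown are placeholders, not the exact settings used in this study.

```python
# Minimal fast.ai transfer-learning sketch, assuming the cropped embryo images
# sit in grade-named subfolders under hypothetical `data/` and `test/` folders;
# the hyperparameters are illustrative placeholders.
from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(
    "data",                       # data/grade1, data/grade2, data/grade3
    valid_pct=0.3,                # 70:30 training/validation split
    item_tfms=Resize(224),        # resize to the pre-trained network's input size
    batch_tfms=aug_transforms(),  # random crops, flips, etc. to reduce overfitting
)

# Transfer learning from a ResNet50 pre-trained on ImageNet.
learn = cnn_learner(dls, resnet50, metrics=accuracy)

learn.lr_find()                   # cyclical learning rate range test [21]; pick a rate from the plot
learn.fine_tune(8, base_lr=1e-3)  # fine-tuning with the 1cycle policy [21]

# Evaluate loss and accuracy on the held-out test set.
test_dl = learn.dls.test_dl(get_image_files("test"), with_labels=True)
print(learn.validate(dl=test_dl))
```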
III-A Deep Residual Networks
Prior to the study on residual learning, deeper neural networks were harder to train [17]: the accuracy of deeper networks saturates and eventually degrades. The solution in [17] is to recast the original mapping H(x) into a residual mapping F(x) + x, which can be realized as a feedforward neural network with shortcut connections. This reformulation makes the model easier to optimize; networks with over 1000 layers can be trained with no optimization difficulty, although they achieved worse results than the variants with fewer layers. An ensemble of this architecture achieved a 3.57% top-5 error on the ImageNet test set and won first place in several tracks of the ILSVRC and COCO 2015 competitions.
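To make the shortcut-connection idea concrete, here is a minimal PyTorch sketch of a basic residual block; it illustrates the F(x) + x reformulation only, not the exact three-layer bottleneck block used in ResNet50.

```python
# Minimal PyTorch sketch of a residual block with an identity shortcut,
# illustrating the F(x) + x reformulation; ResNet50 itself uses a deeper
# three-layer "bottleneck" variant of this block.
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)      # residual learning: F(x) + x

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```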
III-B Densely Connected Convolutional Networks
Borrowing ideas from deep residual networks, the Dense Convolutional Network (DenseNet) tries to harness the power of shortcut connections. “For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers.” [18] This enables the model to have “substantially fewer parameters and less computation” while still achieving state-of-the-art performance. Despite the reduced computation, DenseNet can still be scaled to hundreds of layers easily.
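The dense connectivity pattern can be sketched as follows. This is an illustration only, assuming 3x3 convolutions with batch normalization, and omits the bottleneck and transition layers of the full DenseNet.

```python
# Minimal sketch of the dense connectivity pattern in DenseNet [18]: each
# layer receives the concatenation of all preceding feature maps. Illustration
# only; the real DenseNet also uses bottleneck and transition layers.
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concatenate all previous feature maps
            features.append(out)
        return torch.cat(features, dim=1)

block = TinyDenseBlock(in_channels=64, growth_rate=32, num_layers=4)
print(block(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 192, 28, 28])
```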
III-C Xception
We also used the Xception architecture [19] to benchmark against the study in [10]. This architecture is an improvement over Inception V3 [22] “where Inception modules have been replaced with depthwise separable convolutions”. Xception also uses residual connections, as in the deep residual networks [17], and outperforms Inception V3 while having a similar number of parameters.
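A depthwise separable convolution factorizes a standard convolution into a per-channel spatial convolution followed by a 1x1 pointwise convolution, as in this minimal PyTorch sketch.

```python
# Minimal sketch of a depthwise separable convolution, the building block that
# replaces the Inception modules in Xception [19].
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=in_channels)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # pointwise: 1x1 convolution that mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

layer = DepthwiseSeparableConv(64, 128)
print(layer(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 32, 32])
```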
III-D MobileNetV2
Since the result could be deployed on a mobile phone for better outreach, we consider MobileNetV2 a viable architecture. This architecture allows us to “reduce the memory footprint needed during inference by never fully materializing large intermediate tensors.” [20]
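The building block behind this property is the inverted residual with a linear bottleneck. The sketch below is a simplified illustration (stride 1, matching input and output shapes), not the full MobileNetV2 block.

```python
# Minimal sketch of a MobileNetV2 inverted residual block [20]: 1x1 expansion,
# depthwise 3x3 convolution, linear 1x1 projection, with a residual shortcut.
# Illustration only; the real block also handles strides and channel changes.
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),   # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                     # depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),   # linear projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # shortcut (input and output shapes match)

block = InvertedResidual(32)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```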
IV Results and Discussion
We found the best learning rates with the learning rate finder and obtained our best results from 8 epochs of training. After 8 epochs, the learning curve starts plateauing, which suggests the model begins to overfit the training data. The results from all models can be seen in Table I. ResNet50 achieved the highest accuracy and the lowest cross-entropy loss of all models. While increasing the depth of the ResNet model from 18 to 34 to 50 layers increased the accuracy, it stopped increasing afterwards. We argue that this is because the dataset is relatively simple compared to ImageNet, where objects appear in different colours and sizes. This might also explain why the DenseNet models failed to achieve better performance despite being more complex.
Table I: Accuracy and cross-entropy loss of each model on the test set.

model | accuracy | loss |
---|---|---|
ResNet18 | | |
ResNet34 | | |
ResNet50 | 91.79% | |
ResNet101 | | |
DenseNet121 | | |
DenseNet169 | | |
Xception | | |
MobileNetV2 | 91.14% | |
On the other hand, we are particularly interested in MobileNetV2, which achieved an accuracy similar to the best model. As we elaborated in the previous section, this architecture would enable us to design an embedded system for point-of-care diagnostics. Thus, we provide the learning curve from MobileNetV2 in Figure 2. A confusion matrix from MobileNetV2’s predictions can be seen in Figure 3.
Note that we only used three embryo grades in this study due to the unavailability of samples of the other grades. We would need to reassess these models when we have more grades. However, since the current model can predict the minority class (grade 3) well, we argue that it might generalize to the complete set of grades as well.
Figure 2: Learning curve from MobileNetV2.
Figure 3: Confusion matrix from MobileNetV2’s predictions on the test set.
Examples of the misclassified embryos, shown in Figure 4, suggest that the image capturing process can affect model performance. For example, different colour casts in the embryo images (bottom right image) might cause misclassification. Obstructions in the images, such as red circles (top right image) or timestamps added by the application used to digitally process the microscopy images, also affect the performance of the models.
Figure 4: Examples of misclassified embryos.
V Related Work
Robust assessment and selection of human blastocysts after IVF using deep learning has been studied in [5, 10, 11]. Khosravi et al. [5] fine-tuned an Inception V1 model [23] on 10,148 images of day 5 embryos from the Center for Reproductive Medicine at Weill Cornell Medicine. The accuracy of the resulting model was 96.94%. However, the training was done only on good- and poor-quality images out of the three quality classes defined initially.
Kragh et al. [10] address the issue in [5] by predicting blastocyst grades of any embryo based on raw time-lapse image sequences. Aside from using Xception [19] as the convolutional neural network (CNN) architecture to extract image features, they also use a recurrent neural network (RNN) “that connects subsequent frames from the sequence in order to leverage temporal information.” They predict the inner cell mass (ICM) and trophectoderm (TE) grades for the entire sequence from the RNN using two independent fully-connected layers. They trained the models on 6957 embryos. On a test set of 55 embryos annotated by multiple embryologists, their models reached ICM and TE accuracies of 71.9% and 76.4% respectively, compared to human embryologists who achieved 65.1% and 73.8%. While the result is promising, using an RNN makes training slower and prone to vanishing or exploding gradients [24].
In [11], the authors fine-tune a ResNet50 model on 171,239 images from 16,201 day 5 embryos to predict blastocyst development ranking from 3–6, ICM quality, and TE quality. The images were annotated by embryologists based on Gardner’s grading system. They achieved “an average predictive accuracy of 75.36% for the all three grading categories: 96.24% for blastocyst development, 91.07% for ICM quality, and 84.42% for TE quality.”
VI Conclusions
We have shown in this study that day 3 embryo images can be graded automatically, with a best accuracy of 91.79% obtained by fine-tuning a ResNet50 model. We found that more complex models failed to achieve better accuracy than ResNet50. MobileNetV2, our model of interest for building an embedded system, achieved a similar accuracy of 91.14%. The models still face some problems from different colour casts or obstructions introduced by the software that embryologists use to capture and process the images.
Previous studies showed good results when combining a CNN and an RNN. However, in resource-constrained settings, e.g. when we want to make inferences on small devices, the unparallelizable nature of RNNs makes them challenging to deploy. Moreover, time-lapse microscopes are not prevalent in developing countries. Thus, our solution would be more feasible to put into production.
Since we are still manually cropping the embryos from the original images, we can extend this work to automate this task, e.g. using an object detection model like YOLOv3 [25] or an image segmentation model like U-Net [26]. In the future, we hope that this model can be developed into an embedded system for point-of-care diagnostics such as the one found in [8].
References
- [1] J. Cummins, T. Breen, K. Harrison, J. Shaw, L. Wilson, and J. Hennessey, “A formula for scoring human embryo growth rates in in vitro fertilization: its value in predicting pregnancy and in comparison with visual estimates of embryo quality,” Journal of In Vitro Fertilization and Embryo Transfer, vol. 3, no. 5, pp. 284–295, 1986.
- [2] N. Nasiri and P. Eftekhari-Yazdi, “An overview of the available methods for morphological scoring of pre-implantation embryos in in vitro fertilization,” Cell Journal (Yakhteh), vol. 16, no. 4, p. 392, 2015.
- [3] A. E. B. Bendus, J. F. Mayer, S. K. Shipley, and W. H. Catherino, “Interobserver and intraobserver variation in day 3 embryo grading,” Fertility and sterility, vol. 86, no. 6, pp. 1608–1615, 2006.
- [4] L. L. Veeck, An atlas of human gametes and conceptuses: an illustrated reference for assisted reproductive technology. CRC Press, 1999.
- [5] P. Khosravi, E. Kazemi, Q. Zhan, J. E. Malmsten, M. Toschi, P. Zisimopoulos, A. Sigaras, S. Lavery, L. A. Cooper, C. Hickman et al., “Deep learning enables robust assessment and selection of human blastocysts after in vitro fertilization,” npj Digital Medicine, vol. 2, no. 1, p. 21, 2019.
- [6] Z. Akkus, A. Galimzianova, A. Hoogi, D. L. Rubin, and B. J. Erickson, “Deep learning for brain MRI segmentation: state of the art and future directions,” Journal of Digital Imaging, vol. 30, no. 4, pp. 449–459, 2017.
- [7] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
- [8] J. A. Quinn, R. Nakasi, P. K. Mugagga, P. Byanyima, W. Lubega, and A. Andama, “Deep convolutional neural networks for microscopy-based point of care diagnostics,” in Machine Learning for Healthcare Conference, 2016, pp. 271–281.
- [9] N. Zaninovic, O. Elemento, and Z. Rosenwaks, “Artificial intelligence: its applications in reproductive medicine and the assisted reproductive technologies,” Fertility and sterility, vol. 112, no. 1, pp. 28–30, 2019.
- [10] M. F. Kragh, J. Rimestad, J. Berntsen, and H. Karstoft, “Automatic grading of human blastocysts from time-lapse imaging,” Computers in Biology and Medicine, p. 103494, 2019.
- [11] T.-J. Chen, W.-L. Zheng, C.-H. Liu, I. Huang, H.-H. Lai, and M. Liu, “Using deep learning with large dataset of microscope images to develop an automated embryo grading system,” Fertility & Reproduction, vol. 1, no. 01, pp. 51–56, 2019.
- [12] S. Coskun, J. Hollanders, S. Al-Hassan, H. Al-Sufyan, H. Al-Mayman, and K. Jaroudi, “Day 5 versus day 3 embryo transfer: a controlled randomized trial,” Human Reproduction, vol. 15, no. 9, pp. 1947–1952, 2000.
- [13] Ş. Hatırnaz and M. K. Pektaş, “Day 3 embryo transfer versus day 5 blastocyst transfers: A prospective randomized controlled trial,” Turkish Journal of Obstetrics and Gynecology, vol. 14, no. 2, p. 82, 2017.
- [14] K.-C. Lan, F.-J. Huang, Y.-C. Lin, F.-T. Kung, C.-H. Hsieh, H.-W. Huang, P.-H. Tan, and S. Y. Chang, “The predictive value of using a combined z-score and day 3 embryo morphology score in the assessment of embryo survival on day 5,” Human reproduction, vol. 18, no. 6, pp. 1299–1306, 2003.
- [15] J. Howard and S. Gugger, “Fastai: A layered api for deep learning,” Information, vol. 11, no. 2, p. 108, 2020.
- [16] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1717–1724.
- [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- [19] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
- [20] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
- [21] L. N. Smith, “Cyclical learning rates for training neural networks,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017, pp. 464–472.
- [22] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
- [23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- [24] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning, 2013, pp. 1310–1318.
- [25] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
- [26] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.