Bizom (Mobisy Technologies Private Limited)
prithviraj.naik@mobisy.com, rohit@mobisy.com
ENCLIP: Ensembling and Clustering-Based Contrastive Language-Image Pretraining for Fashion Multimodal Search with Limited Data and Low-Quality Images
Abstract
Multimodal search has revolutionized the fashion industry, providing a seamless and intuitive way for users to discover and explore fashion items. Users can search for products by combining text and image information based on their preferences, style, or specific attributes, and text-to-image search lets them find visually similar items or describe products in natural language. This paper presents ENCLIP, an approach for enhancing the performance of the Contrastive Language-Image Pretraining (CLIP) model for multimodal search in the domain of fashion intelligence. The method focuses on addressing the challenges posed by limited data availability and low-quality images. The proposed algorithm trains and ensembles multiple instances of the CLIP model and leverages clustering techniques to group similar images together. The experimental findings presented in this study provide evidence of the effectiveness of the methodology. This approach unlocks the potential of CLIP in the domain of fashion intelligence, where data scarcity and image-quality issues are prevalent. Overall, the ENCLIP method represents a valuable contribution to the field of fashion intelligence and provides a practical solution for optimizing the CLIP model in scenarios with limited data and low-quality images.
Keywords: Multimodal Search · Information Retrieval · Deep Learning · Fashion e-commerce · Ensembling · Clustering

1 Introduction
The fashion industry is highly visual and dynamic, driven by constantly evolving trends and consumer preferences. Over 300 million people work in the fashion industry, which contributes $1.3 trillion to global GDP [12]. The rising prominence of e-commerce and online shopping platforms has highlighted the need for fast and effective fashion search strategies. Traditional text-based search systems often fall short in capturing the rich visual aspects of fashion items, leading to suboptimal search results and a less satisfactory user experience [14]. To address these limitations, multimodal search [8] has emerged as a powerful solution in the fashion domain. By leveraging both textual and visual information, multimodal search enables users to explore and discover fashion items in a more intuitive and comprehensive manner. This approach integrates the capabilities of natural language processing with computer vision techniques to establish a connection between textual descriptions and visual representations within the domain of fashion. In multimodal fashion search, users can express their preferences or query fashion items using natural language descriptions. These descriptions can encompass various aspects, such as color, gender, pattern, style, age, or even specific attributes like sleeve length or neckline.
This paper explores the utilization of multimodal search techniques in the context of fashion intelligence. The primary objective is to enhance the efficacy of the CLIP (Contrastive Language-Image Pretraining) model [15], a state-of-the-art deep learning model developed by OpenAI that excels at understanding the relationship between images and text, specifically for low-resolution and low-quality photos in the fashion domain. This paper proposes an approach called ENCLIP, which involves ensembling the outputs of multiple fine-tuned CLIP models and leveraging clustering techniques to enhance the fine-tuning process. By harnessing the power of ensemble learning and exploiting the benefits of clustering, the challenges posed by limited data availability and low-quality images are addressed.
In this study, the dataset titled "Fashion Product Images (Small)" by Param Aggarwal [1], available on Kaggle, is utilized. This dataset comprises fashion images and their corresponding textual descriptions or keywords, and it includes a number of Indian fashion products. The dataset offers a diverse range of fashion items, including traditional attire and jewellery. By leveraging this curated dataset, the paper also aims to enhance the relevance and accuracy of the model with respect to Indian fashion trends, styles, and cultural nuances. Following data collection, the fashion images are preprocessed to ensure uniform size and format, facilitating compatibility with CLIP. Simultaneously, the textual descriptions or keywords are prepared as input to CLIP's language model component. A vector database is employed to store and retrieve images and input text queries in the form of vector embeddings.
2 Related work
This section reviews previous work on multimodal search in the fashion domain.
In 2023 Min Wang et al.[16] proposed a study that focuses on training a visual question-answering (VQA) [2] system for apparel in fashion photoshoot images using a large-scale multimodal dataset. By employing diverse templates and emphasizing challenging concepts, they achieved a VQA model surpassing human expert-level accuracy, demonstrating the effectiveness of visual language models for their dataset. The use of diverse templates and challenging concepts might require extensive preprocessing and template creation, increasing the complexity of implementation and potentially making the system less flexible.
In 2022 Karin Sevegnani et al.[17] proposed a model, WhisperLite. This model utilizes contrastive learning techniques to effectively extract user intent from natural language text. By doing so, it enhances the quality of recommendations for fashion products.
In 2022 Patrick John Chia et al.[6] proposed GradREC, a novel recommendation system that introduces explicit directionality through natural language, enabling zero-shot, language-based comparative recommendations without the need for explicitly defined labels or behavioral data. Zero-shot learning capabilities, while impressive, may still struggle with rare or unseen attributes, leading to less accurate recommendations in certain scenarios.
In 2022 Patrick John Chia et al. [5] proposed FashionCLIP, a CLIP-like model for the fashion industry that showcases its capabilities for retrieval, classification, and grounding. FashionCLIP, being a CLIP-like model, requires significant data and resources for training and inference. This model is used in this study, along with the pre-trained CLIP model, as a baseline for comparison with the ENCLIP approach.
In 2019 Ivona Tautkute et al. [18] proposed a multimodal search engine, DeepStyle. The authors demonstrate the effectiveness of this methodology on two distinct and demanding datasets consisting of fashion items and furniture. The DeepStyle engine surpasses baseline approaches by a significant margin of 18-21% on the tested datasets. The effectiveness of the DeepStyle engine depends on the availability of high-quality visual and linguistic signals, which may not always be accessible.
In 2019 Gil Sadeh et al. [19] proposed a multimodal visual-textual search refinement method for fashion garments, allowing intuitive, interactive retrieval of similar items based on image and textual refinement properties. Their joint embedding training scheme effectively manipulates catalog images semantically using a new training objective function, Mini-Batch Match Retrieval, outperforming triplet loss, and they demonstrate the integration of an attribute extraction module to enhance search performance. The joint embedding training scheme and semantic manipulation of catalog images could be complex to implement and fine-tune, requiring extensive experimentation.
3 Background
This section discusses Multimodal Search and CLIP as used in the study.
3.1 Multimodal Search
Multimodal search refers to a search approach that employs several methodologies in order to obtain pertinent outcomes [8]. It is intended to mimic the flexibility and agility with which the human mind generates, processes, and rejects irrelevant ideas. It is a type of search that allows users to search for information using multiple modalities, such as text, images, audio, and video. This can be done by providing a single query that includes text, images, or other media, or by providing multiple queries, each in a different modality.
By combining different types of input, multimodal search engines can provide more accurate and relevant results than traditional search engines [11]. The different ways that multimodal search can be implemented include:
- Feature-based: This approach uses features extracted from each modality to represent the query and the documents. The features are then combined to calculate the similarity between the query and the documents (see the sketch after this list).
- Model-based: This approach uses a model to learn the relationship between the different modalities. The model can then be employed to predict the relevance of documents to a query.
- Hybrid: This approach combines the feature-based and model-based approaches.
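As a minimal illustration of the feature-based approach (not taken from the paper), the sketch below combines cosine similarities computed independently on text and image feature vectors; the weighting factor `alpha` and the helper names are hypothetical.

```python
import numpy as np

def cosine_sim(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of document vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

def feature_based_score(query_text_vec, query_img_vec,
                        doc_text_vecs, doc_img_vecs, alpha: float = 0.5):
    """Combine per-modality similarities into one relevance score per document.

    alpha weights the text modality; (1 - alpha) weights the image modality.
    """
    text_scores = cosine_sim(query_text_vec, doc_text_vecs)
    image_scores = cosine_sim(query_img_vec, doc_img_vecs)
    return alpha * text_scores + (1 - alpha) * image_scores

# Usage: rank documents by the combined score.
# ranking = np.argsort(-feature_based_score(qt, qi, Dt, Di))
```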
3.2 CLIP
The neural network model known as CLIP (Contrastive Language-Image Pre-training)[15] was created by OpenAI. The model has been trained using a vast dataset consisting of pairings of images and corresponding texts. Fig. 1 illustrates the concept of contrastive pretraining with CLIP.
CLIP is a powerful model with the potential to change the way images and text are searched and related, and it has already been used for a wide variety of applications.
The basic architecture of CLIP is as follows:
- The model is first pre-trained on a massive dataset of image-text pairs.
- It consists of two main components: a vision transformer [10] and a language model [9].
- The vision transformer is responsible for encoding the images into a vector representation.
- The language model is responsible for encoding the text into a vector representation.
- The two vector representations are then compared to each other to determine the similarity between the image and the text.
The comparison of the two vector representations is done using a contrastive loss function [13]. The contrastive loss function is designed to maximize the similarity between the image and text representations when they are actually related, and to minimize the similarity between the image and text representations when they are not related. The basic architecture of CLIP is relatively simple, but it is very effective at learning the semantic similarities between images and text.
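To make the contrastive objective concrete, the following is a minimal PyTorch-style sketch of a CLIP-like symmetric contrastive loss over a batch of paired image and text embeddings. It is an illustrative reconstruction rather than the original training code, and the `temperature` value is a common default, not a value reported in the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    Matching pairs lie on the diagonal of the similarity matrix; the loss pushes
    their similarity up and pushes non-matching pairs down.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature        # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_img = F.cross_entropy(logits, targets)            # image -> text
    loss_txt = F.cross_entropy(logits.t(), targets)        # text  -> image
    return (loss_img + loss_txt) / 2
```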

4 Methodology
4.1 Dataset
The fashion-product-images-small dataset [1], also distributed via HuggingFace, is used in this study. It consists of around 44,072 images at 60x80 pixel resolution, with details and descriptions of each image in the respective columns. It has 7 master categories (Apparel, Accessories, Footwear, Personal Care, Free Items, Home, Sporting Goods) and 45 subcategories.
4.2 Preprocessing
The dataset consists of around 44,072 (image, text) pairs. It is divided into 80% for training, 10% for validation, and 10% for testing: 35,258 (image, text) pairs in the training set, 4,407 in the validation set, and 4,407 in the test set. The training model requires the images to be scaled from their original dimensions of 60x80 pixels to 224x224 pixels. All column values except for id and year are used as the caption for the corresponding image. Balanced batch sampling is performed to ensure that each batch of data contains an equal representation of the different classes. It is designed to address class imbalance, where some classes have significantly fewer samples than others, which can lead to biased model training.
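A minimal sketch of the preprocessing described above, assuming the dataset is loaded as a pandas DataFrame; the column names `id` and `year` come from the dataset, while the helper functions and split seeds are hypothetical. Balanced batch sampling would be applied afterwards at the data-loader level.

```python
import pandas as pd
from PIL import Image
from sklearn.model_selection import train_test_split

def build_caption(row: pd.Series) -> str:
    """Concatenate all metadata columns except id and year into a caption."""
    return " ".join(str(v) for k, v in row.items() if k not in ("id", "year"))

def split_dataset(df: pd.DataFrame):
    """80/10/10 split into train/validation/test."""
    df = df.assign(caption=df.apply(build_caption, axis=1))
    train, rest = train_test_split(df, test_size=0.2, random_state=42)
    val, test = train_test_split(rest, test_size=0.5, random_state=42)
    return train, val, test

def load_image(path: str) -> Image.Image:
    # Rescale from the native 60x80 resolution to the 224x224 input size CLIP expects.
    return Image.open(path).convert("RGB").resize((224, 224))
```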
4.3 Model Architecture and Training
Adamax is used as an optimization technique. Categorical cross-entropy was employed as a loss function for both image and text. Tokenization of the caption for an image was performed before giving it to the model for training. The Cosine Annealing learning rate scheduler, which gradually reduces the learning rate in a cosine annealing pattern, was used.
The cosine annealing pattern is designed to help the model converge to a better solution by allowing it to explore different areas of the loss landscape during training. It helps to avoid getting stuck in local minima and potentially find a more optimal solution.
According to Equation 1, the total loss is calculated as the average of the image loss and text loss.
$$\mathcal{L}_{total} = \frac{\mathcal{L}_{image} + \mathcal{L}_{text}}{2} \tag{1}$$
The model was trained across 10, 30, 50, 80, and 100 epochs, using a batch size of 128. It was executed on a Google Colab Pro platform equipped with an Nvidia Tesla T4 graphics card. During training, the model took an average of 280 seconds per epoch.
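An illustrative sketch of a training step under the setup above (Adamax, cross-entropy on both modalities, cosine-annealing schedule). The `model` and `train_loader` objects are assumed to exist and to return CLIP-style image-to-text and text-to-image logits; the learning rate is a placeholder, not a value reported in the paper.

```python
import torch
import torch.nn.functional as F

# Assumed to be defined elsewhere: model (returns per-batch image->text and
# text->image logits), train_loader (yields images and tokenized captions), EPOCHS.
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    for images, token_ids in train_loader:           # batch size 128 in the paper
        logits_per_image, logits_per_text = model(images, token_ids)
        targets = torch.arange(images.size(0), device=images.device)

        loss_image = F.cross_entropy(logits_per_image, targets)
        loss_text = F.cross_entropy(logits_per_text, targets)
        loss = (loss_image + loss_text) / 2           # Equation (1)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # cosine annealing per epoch
```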
4.4 Ensemble and Clustering Strategy
4.4.1 Rationale for Model Selection:
Five CLIP models are trained, each initialized with the same architecture but trained for a different number of epochs: 10, 30, 50, 80, and 100 (five fine-tuned models with incremental training are considered in this study to demonstrate Algorithm 1). This approach captures various learning stages of the model, from initial learning to fine-tuning. It was observed that the results from early-epoch training and late-epoch training complemented each other to give the best results collectively. Combining these models in an ensemble allows the strengths of each to be leveraged.
4.4.2 Latent space of the Algorithm:
Latent space is a lower-dimensional space that captures the essential features of the input data. In the ENCLIP approach, each model encodes an image into a fixed-size vector (e.g., 512 dimensions) that captures the essential features and patterns in the data. For visualization, this dimensionality is further reduced to 2D using t-SNE (t-distributed Stochastic Neighbor Embedding), which helps in understanding the distribution and structure of the data in a more interpretable form.
4.4.3 Weighted Sum Approach:
To combine the predictions of each model, a weighted sum approach is employed. The weights are determined based on the epoch number, with later epochs receiving higher weights:
$$w_i = \frac{e_i}{\sum_{j=1}^{z} e_j} \tag{2}$$

where $e_i$ is the number of training epochs of the $i$-th fine-tuned model and $z$ is the number of fine-tuned models.
This weighting strategy ensures that more emphasis is placed on models that have undergone extensive training while still incorporating the insights from earlier training stages.
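A minimal sketch of this weighting, assuming the weights are simply proportional to each model's epoch count (the paper states only that later epochs receive higher weights):

```python
EPOCHS = [10, 30, 50, 80, 100]   # training epochs of the five fine-tuned models

# Normalize so the weights sum to 1; later epochs get proportionally more weight.
total = sum(EPOCHS)
WEIGHTS = {e: e / total for e in EPOCHS}
# -> {10: 0.037, 30: 0.111, 50: 0.185, 80: 0.296, 100: 0.370}
```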
4.4.4 Encoding Images:
Each model is used to encode the images into a latent space, resulting in z sets of encoded images (where z is the number of fine-tuned models; five fine-tuned models, z = 5, were considered in this study to demonstrate Algorithm 1). These encoded representations capture the high-dimensional features learned by each model.
4.4.5 Dimensionality Reduction:
To visualize and analyze the encoded features, t-SNE (t-distributed Stochastic Neighbor Embedding) is applied, reducing the dimensionality to 2D while preserving the local structure of the data.
4.4.6 Cluster Analysis and Ranking:
K-Means clustering is performed on the t-SNE-transformed features to group similar images. The number of clusters (k) is set to values between 4 and 6, reflecting the diversity in the dataset categories.
4.4.7 Cluster Label Assignment:
Cluster labels are assigned to each image based on the K-Means results, and the frequency of each image appearing in different clusters is analyzed.
4.4.8 Frequency and Weighted Sum Calculation:
For each image, the frequency across different epochs is calculated, and a weighted sum based on predefined weights is computed. This helps in ranking the images according to their importance and relevance within each cluster.
4.4.9 Final Selection:
Images are ranked and selected based on their weighted sum and frequency, ensuring that the most representative and informative images are prioritized.
$$\text{weighted\_score}(x) = \sum_{i=1}^{z} w_i \, f_i(x) \tag{3}$$

where $f_i(x)$ is the frequency with which image $x$ appears in the retrieval output of the $i$-th fine-tuned model and $w_i$ is the weight from Equation 2.
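The sketch below ties the steps of Sections 4.4.4-4.4.9 together. It is an illustrative reconstruction of Algorithm 1 under the assumptions above (per-model top-N retrieval, epoch-proportional weights), not the authors' exact implementation; the helpers `retrieve_top_n` and `encode_images` are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def enclip_rank(models, weights, query, images, top_n=10, k_clusters=5):
    """Ensemble-and-cluster ranking over the outputs of several fine-tuned CLIP models.

    models  : fine-tuned CLIP models (e.g. epochs 10, 30, 50, 80, 100)
    weights : per-model weights, later epochs weighted higher (Equation 2)
    """
    # 1. Each model retrieves its own top-N images for the query.
    per_model_hits = [retrieve_top_n(m, query, images, top_n) for m in models]

    # 2. Encode the retrieved images with each model and pool the latent vectors.
    latent = np.vstack([encode_images(m, hits) for m, hits in zip(models, per_model_hits)])
    ids = [img_id for hits in per_model_hits for img_id in hits]

    # 3. Reduce to 2D with t-SNE and cluster with K-Means (k between 4 and 6).
    latent_2d = TSNE(n_components=2, random_state=42).fit_transform(latent)
    labels = KMeans(n_clusters=k_clusters, n_init=10, random_state=42).fit_predict(latent_2d)

    # 4. Frequency of each image across model outputs and its weighted score.
    freq = Counter(ids)
    weighted = {
        i: sum(w for w, hits in zip(weights, per_model_hits) if i in hits)
        for i in freq
    }

    # 5. Keep the head cluster: the cluster containing the most frequent (head) image.
    head_image = max(freq, key=freq.get)
    head_cluster = labels[ids.index(head_image)]
    candidates = {i for i, lbl in zip(ids, labels) if lbl == head_cluster}

    # 6. Rank by frequency, then weighted score (the best-performing ordering).
    return sorted(candidates, key=lambda i: (freq[i], weighted[i]), reverse=True)
```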
4.5 Evaluation Metrics
In this section, the evaluation metrics used to measure the performance of the ENCLIP approach are discussed [4, 7] (a small implementation sketch follows the list).
- Precision (PREC): the percentage of relevant items among the items retrieved for a query:

  $$\text{PREC} = \frac{N_{rs}}{N_{s}} \tag{4}$$

  where $N_{rs}$ is the number of recommended items that the user prefers and $N_{s}$ is the total number of recommended items.
- Precision@k (PREC@k): the proportion of relevant recommended items in a recommendation list of size k for a query:

  $$\text{PREC@}k = \frac{N_{rs@k}}{k} \tag{5}$$

  where $N_{rs@k}$ is the number of recommended items that the user prefers within the recommendation list of size k.
- Mean Average Precision (mAP): the mean of the average precision over multiple recommendations; the more precise the recommended items are, the higher the mAP. The formula is as follows [4]:

  $$\text{AP} = \frac{1}{N_{r}} \sum_{k=1}^{N_{s}} \text{PREC@}k \cdot \text{rel}(k) \tag{6}$$

  $$\text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP}_{q} \tag{7}$$

  where Q is the number of recommendations (queries), k is the rank, rel(k) is the relevance indicator at rank k, PREC@k is the precision at rank k, and $N_{r}$ is the number of relevant items.
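A small illustrative implementation of these metrics, assuming binary relevance judgments for each retrieved item:

```python
def precision_at_k(relevances: list[int], k: int) -> float:
    """PREC@k: fraction of relevant items in the top-k results (Equation 5)."""
    return sum(relevances[:k]) / k

def average_precision(relevances: list[int]) -> float:
    """Average precision over one ranked result list (Equation 6)."""
    n_relevant = sum(relevances)
    if n_relevant == 0:
        return 0.0
    hits = [precision_at_k(relevances, k)
            for k, rel in enumerate(relevances, start=1) if rel]
    return sum(hits) / n_relevant

def mean_average_precision(all_relevances: list[list[int]]) -> float:
    """mAP over Q queries (Equation 7)."""
    return sum(average_precision(r) for r in all_relevances) / len(all_relevances)

# Example: two queries with binary relevance over their top-5 retrieved items.
# mean_average_precision([[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]])
```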
5 Evaluation Results and Discussion
Table 1 and Table 2 compare the retrieved results for the query "Give me polo neck t-shirt for men": Table 1 focuses on the different fine-tuned CLIP models, while Table 2 compares the pre-trained CLIP, FashionCLIP, and ENCLIP approaches. From Table 2, the following points are observed:
- In the pre-trained CLIP model, the search output predominantly features full-sleeved round-neck t-shirts.
- In the FashionCLIP model, the search output appears satisfactory but includes some full-sleeved round-neck t-shirts.
- In the ENCLIP approach, the search output is focused specifically on polo neck t-shirts for men.
[Table 1: Products retrieved by each fine-tuned CLIP model (Epoch 10, 30, 50, 80, and 100) for the query "Give me polo neck t-shirt for men"; the retrieved product images are not reproduced here.]
[Table 2: Products retrieved by the pre-trained CLIP model, FashionCLIP, and the ENCLIP approach for the same query; the retrieved product images are not reproduced here.]
Table 3. mAP for AVG_PREC@10 by category and sub-category.

| Category | Sub Category | Pre-trained CLIP | FashionCLIP | ENCLIP Approach |
|---|---|---|---|---|
| Topwear and Bottomwear | Topwear | 0.552 | 0.797 | 0.651 |
| Topwear and Bottomwear | Bottomwear | 0.677 | 0.657 | 0.735 |
| Ethnic wear | Men | 0.062 | 0.105 | 0.185 |
| Ethnic wear | Women | 0.773 | 0.786 | 0.821 |
| Footwear | Men | 0.563 | 0.811 | 0.837 |
| Footwear | Women | 0.646 | 0.683 | 0.803 |
| Accessories | Men | 0.769 | 0.934 | 0.947 |
| Accessories | Women | 0.827 | 0.965 | 0.835 |
| Bags | Men | 0.506 | 0.598 | 0.614 |
| Bags | Women | 0.587 | 0.611 | 0.717 |
Table 3 compares the Mean Average Precision for Average Precision@10 of the pre-trained CLIP, FashionCLIP, and ENCLIP approaches. For the results in Table 3, 10 queries per category were used to calculate the Mean Average Precision, giving a total of 100 queries for evaluation.
Fig. 5 shows the t-SNE plot of the image outputs of the different fine-tuned models in the same latent space, obtained for the query "Give me polo neck t-shirt for men". The CLIP models fine-tuned for 10, 30, 50, 80, and 100 epochs were selected for this study.
Based on Algorithm 1, Fig. 6 shows the step of performing K-Means clustering on the t-SNE plot of the different fine-tuned models' image outputs in the same latent space, obtained for the query "Give me polo neck t-shirt for men". This is done to obtain the head cluster, which contains the head image (the most frequent image across the different fine-tuned models' output results). The K-Means clustering algorithm is employed to group data points that exhibit similarity, given a predefined number of clusters. Applying K-Means on the t-SNE plot helps visualize the models' representations and how they cluster together in the reduced-dimensionality space. The Elbow Method is used together with the Silhouette Method to find the best K value for K-Means clustering [20]. K values in the range of 4 to 6 gave the best results.
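A brief sketch of selecting k with the elbow and silhouette criteria on the t-SNE-reduced features; `features_2d` is assumed to be the t-SNE output from the previous step.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(features_2d, k_range=range(4, 7)):
    """Pick k by combining K-Means inertia (elbow) with the silhouette score."""
    scores = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features_2d)
        scores[k] = {
            "inertia": km.inertia_,                              # for the elbow inspection
            "silhouette": silhouette_score(features_2d, km.labels_),
        }
    # The silhouette score breaks ties left by the elbow inspection.
    best_k = max(scores, key=lambda k: scores[k]["silhouette"])
    return best_k, scores
```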
The evaluation and ranking of images have been conducted using the following methods:
- Sorting in descending order based on weighted_score.
- Sorting in descending order based on the product of frequency and weighted_score.
- Sorting in descending order based on frequency and then weighted_score.
Ranking based on frequency and weighted_score yielded the best results in the findings.
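The three orderings differ only in their sort key. A minimal sketch, assuming `freq` and `wscore` are the per-image frequency and weighted-score dictionaries computed earlier:

```python
# Ordering 1: weighted score only.
rank_1 = sorted(images, key=lambda i: wscore[i], reverse=True)

# Ordering 2: product of frequency and weighted score.
rank_2 = sorted(images, key=lambda i: freq[i] * wscore[i], reverse=True)

# Ordering 3 (best in the reported findings): frequency first, weighted score as tie-breaker.
rank_3 = sorted(images, key=lambda i: (freq[i], wscore[i]), reverse=True)
```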
The ensemble of multiple fine-tuned models is particularly effective in scenarios with limited data and low-quality images. Different models capture different aspects of the data, and combining them helps mitigate the limitations of any single model. By integrating outputs from models trained for different numbers of epochs, the ENCLIP approach leverages the strengths of models at various stages of training. Early epochs may generalize better due to less overfitting, while later epochs capture more refined features. This combination ensures robust performance even with limited data. The latent space representations extracted from the models encapsulate rich, high-dimensional features of the images. These features are crucial for handling low-quality images, as they can emphasize important aspects of the data that are less affected by noise and low resolution.
By performing clustering in the latent space, even images of lower quality are grouped based on their most salient features rather than raw pixel values. This method is more resilient to variations in image quality. Assigning weights and frequencies to images based on their occurrences in multiple model outputs helps in identifying consistently important images. This is particularly useful when dealing with limited and low-quality data, as it highlights images that are recognized as important across different models. The weighted score formula gives higher importance to images appearing in models trained for more epochs, reflecting improved learning. This prioritization helps in selecting images that are likely to be of higher relevance and quality despite the overall data limitations.
For the final training of FashionCLIP [5], around 700k high-resolution product images provided by Farfetch were used. In the ENCLIP approach, approximately 35k low-resolution product images were used for training, roughly 20 times fewer than the Farfetch dataset used for FashionCLIP.
6 Conclusion
The introduction of multimodal search has brought about a revolutionary change in the fashion industry by providing users with a seamless and intuitive way to explore and discover fashion items. By integrating text and image information, users can now search for products based on their preferences, style, or specific attributes. This advancement has greatly enhanced the overall user experience and opened up new possibilities for fashion discovery. This paper presents ENCLIP, an approach specifically tailored to enhance the fine-tuning performance of the Contrastive Language-Image Pretraining (CLIP) model in the domain of fashion intelligence. The ENCLIP approach addresses the challenges posed by limited data availability and low-quality images, which are prevalent in the fashion industry. In the ENCLIP approach, multiple instances of the CLIP model are trained and ensembled. Additionally, a clustering technique is employed to group similar images together and retrieve the top N images most relevant to the user's query.
The efficacy of the ENCLIP approach is demonstrated through experimental results and comparison with recent studies. By leveraging clustering techniques and training ensembles of CLIP models, the method successfully overcomes the limitations of data scarcity and low-quality images. However, it is important to note some limitations of the current approach. The ENCLIP approach may not be well suited for fine-grained querying in text-to-image search, where users require highly specific search results. The model's generalization capabilities may be limited in such cases, and further research is needed to improve its performance on fine-grained fashion queries. Performance on fine-grained queries could be improved by enriching the textual descriptions to strengthen textual understanding, or by exploring more advanced multimodal fusion techniques [3]. The advancements made in this paper pave the way for further progress in fashion intelligence and contribute to the ongoing evolution of multimodal search technology.
6.0.1 Acknowledgements
Thanks are extended to the anonymous reviewers and participants for their time and feedback. Sincere gratitude is also expressed to Lalit Bhise, founder and CEO of Bizom (Mobisy Technologies Private Limited), for his enthusiasm and invaluable support for this project.
References
- [1] Aggarwal, P. (2023). Fashion Product Images (Small). Retrieved May 30, 2023, from https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small
- [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual question answering. In IEEE International Conference on Computer Vision (pp. 2425-2433)
- [3] Atrey, P. K., Hossain, M. A., El Saddik, A., & Kankanhalli, M. S. (2010). Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16, 345-379.
- [4] Chen, M., & Liu, P. (2017). Performance evaluation of recommender systems. International Journal of Performability Engineering, 13(8), 1246.
- [5] Chia, P. J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A. R., Goncalves, D., Greco, C., & Tagliabue, J. (2022). Contrastive language and vision learning of general fashion concepts. Scientific Reports, 12(1), 18958.
- [6] Chia, P. J., Tagliabue, J., Bianchi, F., Greco, C., & Goncalves, D. (2022). "Does it come in black?" CLIP-like models are zero-shot recommenders. arXiv preprint arXiv:2204.02473.
- [7] Wikipedia contributors. (2023). Evaluation measures (information retrieval). Retrieved August 19, 2023, from https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval).
- [8] Wikipedia contributors. (2023). Multimodal search. Retrieved August 28, 2023, from https://en.wikipedia.org/wiki/Multimodal_search.
- [9] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [10] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Dehghani, M. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [11] Fu, H. C., Xu, Y. Y., & Pao, H. T. (2008). Multimodal search for effective image retrieval. In 15th International Conference on Systems, Signals and Image Processing (pp. 233-236).
- [12] Gazzola, P., Pavione, E., Pezzetti, R., & Grechi, D. (2020). Trends in the Fashion Industry. The Perception of Sustainability and Circular Economy: A Gender/Generation Quantitative Approach. Sustainability, 12(7), 2809. https://doi.org/10.3390/su12072809.
- [13] Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. In Advances in Neural Information Processing Systems 33 (pp. 18661-18673).
- [14] Kofler, C., Larson, M., & Hanjalic, A. (2017). User Intent in Multimedia Search: A Survey of the State of the Art and Future Challenges. ACM Computing Surveys, 49(2), 36. https://doi.org/10.1145/2954930.
- [15] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., & Sastry, G. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763).
- [16] Wang, M., Mahjoubfar, A., & Joshi, A. (2023). FashionVQA: A Domain-Specific Visual Question Answering System. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3513-3518).
- [17] Sevegnani, K., Seshadri, A., Wang, T., Beniwal, A., McAuley, J., Lu, A., & Medioni, G. (2022). Contrastive learning for interactive recommendation in fashion. arXiv preprint arXiv:2207.12033.
- [18] Tautkute, I., Trzciński, T., Skorupa, A. P., Brocki, Ł., & Marasek, K. (2019). Deepstyle: Multimodal search engine for fashion and interior design. IEEE Access, 7, 84613-84628.
- [19] Sadeh, G., Fritz, L., Shalev, G., & Oks, E. (2019). Joint visual-textual embedding for multimodal style search. arXiv preprint arXiv:1906.06620.
- [20] Saputra, D. M., Saputra, D., & Oswari, L. D. (2020). Effect of distance metrics in determining k-value in k-means clustering using elbow and silhouette method. In Sriwijaya International Conference on Information Technology and Its Applications (SICONIAN 2019) (pp. 341-346).