Sunyu Pang
Multimodal AI
2024

Fashion Captioning with BLIP Finetuning

The main goal of this project was to substantially improve fashion image captioning by fine-tuning the BLIP model. To account for the unique characteristics of the fashion domain, we built a high-quality dataset covering a wide range of fashion styles, accessories, and clothing items, and used it to train the model. Through this process, the model developed a deeper grasp of fashion-related vocabulary and context, enabling it to generate more accurate and contextually relevant captions.

Model Overview

Figure: blip4fashion model overview

Fine-tuned parameter groups (see the sketch after this list):

  • Decoder-only
    • All layers of the decoder:
      • query/key/value projections of the attention layers
      • dense layers (feed-forward networks)
  • Encoder + Decoder
    • All layers of the decoder (as above)
    • The last 6 layers of the encoder:
      • query/key/value projections of the attention layers
      • dense layers (feed-forward networks)
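Below is a minimal sketch of how these parameter groups could be selected for the Encoder + Decoder setting, assuming the HuggingFace transformers BLIP checkpoint Salesforce/blip-image-captioning-base. The parameter-name patterns (query/key/value, qkv, dense, fc1/fc2) and the 12-layer vision encoder depth are assumptions, not the project's actual training code, and should be verified against model.named_parameters() for the checkpoint in use.

```python
# Sketch of selective unfreezing for the "Encoder + Decoder" setting.
# Parameter-name substrings and encoder depth are assumptions; check them
# against model.named_parameters() for your checkpoint.
import re
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

ATTN_OR_DENSE = ("query", "key", "value", "qkv", "dense", "fc1", "fc2")
NUM_ENCODER_LAYERS = 12    # ViT-B vision encoder depth (assumption)
LAST_ENCODER_LAYERS = 6    # fine-tune only the last 6 encoder layers

def layer_index(name: str):
    """Extract the layer index from a name like '...layers.10.self_attn...'."""
    m = re.search(r"\blayers?\.(\d+)\.", name)
    return int(m.group(1)) if m else None

# Freeze everything, then re-enable only the targeted sub-modules.
for name, param in model.named_parameters():
    param.requires_grad = False
    if not any(key in name for key in ATTN_OR_DENSE):
        continue
    if name.startswith("text_decoder"):
        # All decoder layers: attention q/k/v and feed-forward dense layers.
        param.requires_grad = True
    elif name.startswith("vision_model"):
        # Only the last 6 encoder layers.
        idx = layer_index(name)
        if idx is not None and idx >= NUM_ENCODER_LAYERS - LAST_ENCODER_LAYERS:
            param.requires_grad = True
```

The Decoder-only setting follows the same pattern, simply skipping the vision_model branch so the encoder stays fully frozen.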

Performance was evaluated with standard captioning metrics (BLEU, METEOR, and CIDEr), allowing objective assessment and iterative improvement of caption quality.
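As an illustration of how such scores can be computed, here is a minimal sketch using the pycocoevalcap package. The tooling choice, the image ids, and the example captions are assumptions for illustration only; the project's own evaluation code is not shown here.

```python
# Minimal caption-evaluation sketch with pycocoevalcap (assumed tooling).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of caption strings (ids/captions are hypothetical).
references = {
    "img_001": ["a red floral midi dress with short sleeves"],
    "img_002": ["black leather ankle boots with a block heel"],
}
predictions = {
    "img_001": ["a red floral dress with short sleeves"],
    "img_002": ["black leather boots with a heel"],
}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Meteor(), "METEOR"),   # requires a local Java runtime
    (Cider(), "CIDEr"),
]

for scorer, name in scorers:
    score, _ = scorer.compute_score(references, predictions)
    if isinstance(name, list):   # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {s:.4f}")
    else:
        print(f"{name}: {score:.4f}")
```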

Beyond its technical results, the project aimed to deliver real-world value in areas such as automated product description generation for e-commerce platforms, content creation for digital fashion magazines, and improved accessibility of fashion content for diverse user groups.