---
title: Multimodal Product Classification
emoji: 🛍️
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.44.0
app_file: app.py
pinned: true
license: mit
short_description: Product classification using image and text
---

# 🛍️ Multimodal Product Classification with Gradio

## Table of Contents

1. [Project Description](#1-project-description)
2. [Methodology & Key Features](#2-methodology--key-features)
3. [Technology Stack](#3-technology-stack)
4. [Model Details](#4-model-details)

## 1. Project Description

This project implements a **multimodal product classification system** for Best Buy products. The core objective is to categorize products using both their text descriptions and their images. The system was trained on a dataset of almost 50,000 items.

The entire system is deployed as a lightweight web application built with **Gradio**. The app allows users to:

- Use both text and an image for the most accurate prediction.
- Run predictions using only text or only an image to understand the contribution of each modality.

This project showcases the power of combining different data types to build a more robust classification system.
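
The three prediction modes can be pictured as a small routing step in front of the classifier. The helper below is a hypothetical sketch (the function name, zero-filling strategy, and dimensions are illustrative assumptions, not the app's actual code):

```python
import numpy as np

def build_input(text_emb=None, image_emb=None, text_dim=384, image_dim=768):
    """Assemble the classifier input for multimodal, text-only, or
    image-only prediction. Hypothetical sketch: a missing modality is
    replaced by a zero vector of the expected size."""
    if text_emb is None and image_emb is None:
        raise ValueError("Provide at least one modality")
    if text_emb is None:
        text_emb = np.zeros(text_dim)
    if image_emb is None:
        image_emb = np.zeros(image_dim)
    # The fused vector is what a downstream MLP classifier would consume.
    return np.concatenate([text_emb, image_emb])
```

The dimensions 384 and 768 match `all-MiniLM-L6-v2` text embeddings and ConvNeXt V2-tiny image features respectively; zero-filling a missing modality is one simple strategy, not necessarily the one used here.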

> [!IMPORTANT]
>
> - Check out the deployed app here: [Multimodal Product Classification App](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification)
> - Check out the Jupyter Notebook for a detailed walkthrough of the project here: [Jupyter Notebook](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification/blob/main/notebook_guide.ipynb)

![Multimodal_Product_Classification_App_-_a_Hugging_Face_Space_by_iBrokeTheCode]()

## 2. Methodology & Key Features

- **Core Task:** Multimodal product classification on a Best Buy dataset.

- **Pipeline:**

  - **Data:** A dataset of ~50,000 products, each with a text description and an image.
  - **Feature Extraction:** Pre-trained models convert raw text and image data into high-dimensional embedding vectors.
  - **Classification:** A custom-trained **Multilayer Perceptron (MLP)** performs the final classification on the embeddings.

- **Key Features:**

  - **Multimodal:** Combines text and image data for a more accurate prediction.
  - **Single-Service Deployment:** The entire application runs as a single, deployable Gradio app.
  - **Flexible Inputs:** The app supports multimodal, text-only, and image-only prediction modes.
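
The pipeline above can be sketched end to end with NumPy. The embeddings and MLP weights below are random placeholders (the real features come from pre-trained extractors, and the trained classifier's architecture may differ); the sketch only shows the fuse-then-classify shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for the pre-trained extractors:
# a 384-dim text vector (MiniLM-sized) and a 768-dim image vector
# (ConvNeXt V2-tiny-sized).
text_emb = rng.standard_normal(384)
image_emb = rng.standard_normal(768)
x = np.concatenate([text_emb, image_emb])  # fused multimodal input

# A tiny MLP with random weights, for illustration only.
W1, b1 = rng.standard_normal((1152, 64)) * 0.05, np.zeros(64)
W2, b2 = rng.standard_normal((64, 10)) * 0.05, np.zeros(10)

h = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden layer
logits = h @ W2 + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax over the category classes

predicted_class = int(np.argmax(probs))
```

In the real system the softmax output would be mapped back to a Best Buy category label.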

## 3. Technology Stack

This project was built using the following technologies:

**Deployment & Hosting:**

- [Gradio](https://gradio.app/) – interactive web app frontend.
- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) – for cost-effective deployment.

**Modeling & Training:**

- [TensorFlow / Keras](https://www.tensorflow.org/) – used to train the final MLP classification model.
- [Sentence-Transformers](https://www.sbert.net/) – for generating text embeddings.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) – for the image feature extractor (`TFConvNextV2Model`).

**Development Tools:**

- [Ruff](https://github.com/charliermarsh/ruff) – Python linter and formatter.
- [uv](https://github.com/astral-sh/uv) – fast Python package installer and resolver.

## 4. Model Details

The final classification is performed by a custom-trained **Multilayer Perceptron (MLP)** that takes the extracted embeddings as input.

- **Text Embedding Model:** `SentenceTransformer` (`all-MiniLM-L6-v2`)
- **Image Embedding Model:** `TFConvNextV2Model` (`convnextv2-tiny-22k-224`)
- **Classifier:** A custom MLP trained on top of the embeddings.
- **Classes:** The model classifies products into a set of specific Best Buy product categories.

| Model               | Modality     | Accuracy | Macro Avg F1-Score | Weighted Avg F1-Score |
| :------------------ | :----------- | :------- | :----------------- | :-------------------- |
| Random Forest       | Text         | 0.90     | 0.83               | 0.90                  |
| Logistic Regression | Text         | 0.90     | 0.84               | 0.90                  |
| Random Forest       | Image        | 0.80     | 0.70               | 0.79                  |
| Random Forest       | Combined     | 0.89     | 0.79               | 0.89                  |
| Logistic Regression | Combined     | 0.89     | 0.83               | 0.89                  |
| **MLP**             | **Image**    | **0.84** | **0.77**           | **0.84**              |
| **MLP**             | **Text**     | **0.92** | **0.87**           | **0.92**              |
| **MLP**             | **Combined** | **0.92** | **0.85**           | **0.92**              |
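
For readers comparing the two F1 columns: macro-averaged F1 weights every class equally, while weighted F1 weights each class by its support, so the two diverge when categories are imbalanced. A small illustration with made-up per-class scores (not this project's actual numbers):

```python
# Hypothetical per-class F1 scores and supports (not from this project).
f1_scores = [0.95, 0.90, 0.60]  # one class performs noticeably worse
supports  = [500, 450, 50]      # ...and that class is also the rarest

macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

# Macro treats classes equally; weighted follows class frequency.
print(round(macro_f1, 3), round(weighted_f1, 3))  # prints: 0.817 0.91
```

This is why, in the table above, macro F1 sits below weighted F1 for every model: the rarer categories are the harder ones.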

> [!TIP]
>
> On the held-out test set, the multimodal MLP achieved **92% accuracy** and a **92% weighted F1-score**, matching the best text-only model on these metrics while drawing on both text and image data, and clearly outperforming every image-only approach.