Package Damage Detection Model
By: Evan Ancheta
1. Model Description
Context
This project develops a YOLOv11 object detection model that locates damaged and undamaged shipping packages in images. The model predicts bounding boxes around packages and classifies each detected package as either damaged or normal. Detecting package damage automatically could help logistics systems identify problems earlier in the shipping process. For example, warehouse monitoring systems could use a model like this to flag packages that may require manual inspection.
Training Approach
The model was trained using YOLOv11 object detection within the Ultralytics framework. The model was fine-tuned from pretrained COCO weights and then trained on a custom dataset of package images. Using pretrained weights allows the model to reuse general visual features learned from large image datasets before adapting them to the specific task of package damage detection.
Intended Use Cases
Potential applications of this model include:
- warehouse package inspection
- automated logistics monitoring
- research experiments involving object detection
- prototype quality-control systems
This model is intended for research and experimentation and should not be used as a fully automated system without human oversight.
2. Training Data
Data Sources
The dataset used for training was created by combining two pre-annotated package detection datasets from Roboflow Universe:
https://universe.roboflow.com/nani-tmzf6/package-detection-5ozpr
https://universe.roboflow.com/roboflow-ngkro/package-detection-e1ssd
These datasets contain images of shipping packages with bounding box annotations identifying package locations and damage conditions.
Dataset Size
The combined dataset contains 2,787 images.
Original Class Labels Across Datasets
Because the datasets came from different sources, they used slightly different label names. The original labels included:
- Box
- Package
- Box_broken
- Open_package
- damaged
These labels were inconsistent across datasets and required standardization before training.
Annotation Process
The dataset preparation process involved several steps:
- combined two pre-annotated datasets
- standardized inconsistent class labels across datasets
- verified bounding box annotations
- reduced the number of classes
After cleaning and relabeling, the dataset was reduced to two classes representing the condition of each package.
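The relabeling step above can be sketched as a simple mapping from the original source labels to the two final classes. The exact merge rules are not documented here, so the mapping below is an assumption inferred from the label names:

```python
# Hypothetical mapping from original dataset labels to the two final classes.
# The specific assignments are an assumption inferred from the label names.
LABEL_MAP = {
    "Box": "normal",
    "Package": "normal",
    "Box_broken": "damaged",
    "Open_package": "damaged",
    "damaged": "damaged",
}

def standardize(label: str) -> str:
    """Map an original annotation label to its standardized class."""
    return LABEL_MAP[label]
```

For example, `standardize("Box_broken")` returns `"damaged"`, collapsing the five source labels into the two classes used for training.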
Final Classes Used for Training
| Class | Count |
|---|---|
| normal | 1575 |
| damaged | 1212 |
The two classes are reasonably balanced, at roughly 1.3 normal examples for every damaged example.
Train / Validation / Test Split
| Split | Ratio | Approximate Images |
|---|---|---|
| Train | 70% | ~1951 |
| Validation | 20% | ~557 |
| Test | 10% | ~279 |
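The approximate image counts in the table follow from applying the split ratios to the 2,787 combined images, with the remainder assigned to the test split. A minimal check:

```python
def split_counts(n_images: int, train_ratio: float = 0.7, val_ratio: float = 0.2):
    """Compute train/val/test image counts, assigning the remainder to test."""
    train = round(n_images * train_ratio)
    val = round(n_images * val_ratio)
    test = n_images - train - val
    return train, val, test

print(split_counts(2787))  # (1951, 557, 279)
```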
Data Augmentation
The following augmentations were applied during training:
| Augmentation | Purpose |
|---|---|
| Horizontal flip | simulate packages placed in different orientations |
| Image rotation | simulate packages rotated on the floor or conveyor |
| Brightness adjustment | simulate different lighting conditions |
These augmentations help the model generalize better to real-world environments.
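In the Ultralytics framework, these augmentations correspond to training hyperparameters that can be set as overrides. A sketch of the relevant settings (the specific values below are illustrative assumptions, not the values actually used):

```yaml
# Hypothetical Ultralytics augmentation overrides; values are illustrative
fliplr: 0.5    # probability of horizontal flip
degrees: 10.0  # max rotation in degrees
hsv_v: 0.4     # brightness (value-channel) variation
```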
Known Biases and Dataset Limitations
The dataset contains mostly cardboard shipping boxes photographed under relatively consistent lighting conditions. As a result, the model may perform less reliably when detecting different packaging materials or under significantly different lighting environments. Some types of package damage may also appear less frequently in the dataset.
3. Training Procedure
- Framework: Ultralytics
- Model: YOLOv11n (nano variant, fine-tuned from pretrained COCO weights)
- Hardware: A100 GPU in Google Colab
- Epochs: 50
- Batch Size: 64
- Image Size: 640x640
- Patience: 50
- Training Time: 7.44 minutes (446.58 seconds)
- Preprocessing: Augmentations applied at training time (see Data Augmentation section)
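Under the Ultralytics CLI/API, the settings above would correspond to a training configuration along these lines (the dataset file name is a placeholder, not the actual file used):

```yaml
# Hypothetical training configuration mirroring the listed settings
model: yolo11n.pt    # pretrained COCO weights (nano variant)
data: packages.yaml  # dataset definition file (placeholder name)
epochs: 50
batch: 64
imgsz: 640
patience: 50
```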
4. Evaluation Results
Comprehensive Metrics
The YOLOv11 model was evaluated on a held-out validation dataset using standard object detection metrics. These metrics measure how accurately the model detects packages and classifies them as damaged or normal.
| Metric | Value |
|---|---|
| Precision | 0.927 |
| Recall | 0.919 |
| mAP50 | 0.959 |
| mAP50-95 | 0.868 |
Overall, the model achieves strong detection performance, with a high mAP50 score indicating that the model is able to accurately detect package locations and classify damage in most cases.
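As a sanity check, the overall precision and recall in the table imply an F1 score of about 0.92, consistent with the best F1 reported on the F1-confidence curve:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.927, 0.919), 3))  # 0.923
```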
Per-Class Breakdown
| Class | Images | Instances | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|
| all | 499 | 538 | 0.927 | 0.919 | 0.959 | 0.868 |
| damaged | 329 | 329 | 0.966 | 0.963 | 0.988 | 0.904 |
| normal | 170 | 209 | 0.887 | 0.876 | 0.930 | 0.832 |
The model performs slightly better at detecting damaged packages than normal ones. This may be because damage features such as dents or torn cardboard create stronger visual cues for the model.
Visual Examples of Classes
The examples above illustrate typical images used during training. Damaged packages often contain dents, crushed corners, or torn cardboard, while normal packages appear structurally intact.
Key Visualizations
Confusion Matrix
F1 Confidence Curve
Training Results
Performance Analysis
Overall, the YOLOv11 model performs well for detecting packages and classifying them as damaged or normal. The high mAP50 score of 0.959 suggests the model is usually able to both locate packages and correctly classify their condition. The damaged class performs slightly better than the normal class, likely because visible damage such as dents or crushed edges creates stronger visual features that are easier for the model to learn.

The confusion matrix shows that most predictions fall along the diagonal, meaning the model correctly classifies the majority of packages. However, the matrix also highlights a background issue that is common in object detection models: in some cases the model predicts background where an object exists, meaning it misses a package entirely. This can happen when packages blend into the environment, appear small in the image, or have subtle visual features.

The F1-confidence curve shows that the model maintains strong performance across a range of confidence thresholds, with the best F1 score around 0.92. The training results plots also show that loss decreases and evaluation metrics improve steadily over the 50 training epochs, suggesting the model learned useful features without significant overfitting.

In summary, the model works well for clear package images, but performance may decrease when packages are partially occluded, poorly lit, or when damage is subtle.
5. Limitations and Biases
Known Failure Cases
- The image above shows an example where the model predicts Damaged with a confidence of 0.85.
- The box shows minor deformation, but the damage is relatively subtle and could be interpreted differently by a human observer.
- The model appears to rely on visual cues such as bent edges, creases in the cardboard, or irregular box shapes when identifying damage.
- Because damaged packages in the training data often contain stronger visual damage signals, the model may sometimes overpredict the damaged class when it encounters boxes with minor wear or shipping creases.
Poor Performing Classes
- The normal class performs slightly worse than the damaged class.
- Damaged packages often contain clearer visual features such as dents, crushed corners, or tears, which makes them easier for the model to recognize.
- Normal boxes can sometimes look visually similar to slightly damaged boxes or to background regions.
Data Biases
- The dataset mainly contains cardboard shipping boxes, which limits how well the model can generalize to other package types.
- Other packaging formats such as padded envelopes, plastic mailers, or irregular parcels are not represented in the training data.
- Many images were collected in relatively controlled conditions, which may differ from real-world delivery environments.
Environmental and Contextual Limitations
- Model performance may decrease when lighting conditions are poor or when strong shadows are present.
- Packages that blend into the surrounding background may be harder for the model to detect.
- Detection may also become more difficult when packages are partially occluded or appear very small in the image.
Inappropriate Use Cases
- This model should not be used as a fully automated system for determining package damage in real-world logistics environments.
- Predictions should be reviewed by a human operator, especially when model confidence is low or when the damage is subtle.
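One way to implement the human-review recommendation is to gate predictions on confidence, routing low-confidence detections to an operator queue. The threshold and record format below are illustrative assumptions, not tuned values:

```python
REVIEW_THRESHOLD = 0.6  # illustrative cutoff, not a tuned value

def route(detections):
    """Split (class_name, confidence) detections into auto-accepted
    and human-review queues based on a confidence threshold."""
    auto, review = [], []
    for det in detections:
        (auto if det[1] >= REVIEW_THRESHOLD else review).append(det)
    return auto, review

auto, review = route([("damaged", 0.85), ("normal", 0.41)])
print(auto)    # [('damaged', 0.85)]
print(review)  # [('normal', 0.41)]
```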
Sample Size Limitations
- The dataset used for training contains 2,787 images across two classes.
- A larger and more diverse dataset would likely improve the model’s ability to generalize and reduce detection errors.




