Add --validate-images CLI option to filter corrupt images using PIL #388

Copilot · 2025-07-08T20:55:57Z

Adds a new --validate-images CLI option that enables PIL-based image validation to filter out corrupt or invalid image files during processing.

Problem

When processing image datasets, users may encounter corrupt or invalid image files that cause processing to fail. Currently, zamba only checks that files exist and have non-zero size; it does not verify that they are valid images that can be opened and processed.

Solution

This PR adds a new CLI option --validate-images that:

Attempts to open each image file with PIL (Python Imaging Library)
Filters out images that cannot be opened or decoded
Logs appropriate warning messages about filtered files
Continues processing with only valid images
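
The steps above can be illustrated with a minimal sketch. It is not the code from this PR: the helper names can_open_with_pil and filter_valid_images are hypothetical, and it assumes Pillow is installed. It uses Image.verify(), which checks file integrity without decoding the full pixel data, so some truncated files may only be caught by a full Image.load().

```python
from pathlib import Path

from PIL import Image


def can_open_with_pil(path):
    """Return True if PIL can open the file and it passes PIL's integrity check.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    try:
        with Image.open(path) as img:
            # verify() raises if the file is detectably broken.
            img.verify()
        return True
    except Exception:
        # Any failure (missing file, unknown format, corrupt data) is
        # treated as "not a valid image" instead of stopping the run.
        return False


def filter_valid_images(paths):
    """Split paths into (valid, rejected) lists, preserving input order."""
    valid, rejected = [], []
    for path in paths:
        target = valid if can_open_with_pil(path) else rejected
        target.append(Path(path))
    return valid, rejected
```

A caller would log the rejected list as a warning and continue with the valid list only.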

Usage

Command Line Interface

For image prediction:

zamba image predict --data-dir /path/to/images --validate-images

For image training:

zamba image train --data-dir /path/to/images --labels /path/to/labels.csv --validate-images

Python API

from zamba.images.config import ImageClassificationPredictConfig

config = ImageClassificationPredictConfig(
    data_dir="/path/to/images",
    validate_images=True
)

Implementation Details

Backward Compatible: Feature is disabled by default (validate_images=False)
Comprehensive Logging: Distinguishes between file existence failures and PIL validation failures
Efficient Processing: Uses parallel processing for training validation
Robust Error Handling: Gracefully handles all PIL-related exceptions
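
As a rough sketch of how existence failures and image validation failures could be told apart while checking files in parallel, consider the following. All names here (classify_path, split_by_status, the status strings) are assumptions made for illustration, and a thread pool is just one possible choice; the PR description does not say which parallelism mechanism zamba uses. The image check is passed in as a callable so the sketch does not depend on a particular validation backend.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def classify_path(path, is_valid_image):
    """Return 'missing', 'invalid', or 'ok' for one file (hypothetical helper)."""
    p = Path(path)
    # Existence and non-zero-size check, the baseline described in the Problem section.
    if not p.is_file() or p.stat().st_size == 0:
        return "missing"
    try:
        return "ok" if is_valid_image(p) else "invalid"
    except Exception:
        # An exception from the image check counts as a validation failure.
        return "invalid"


def split_by_status(paths, is_valid_image, max_workers=4):
    """Classify all paths concurrently and group them by status."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so statuses line up with paths.
        statuses = list(pool.map(lambda p: classify_path(p, is_valid_image), paths))
    grouped = {"ok": [], "missing": [], "invalid": []}
    for path, status in zip(paths, statuses):
        grouped[status].append(path)
    return grouped
```

Keeping the "missing" and "invalid" groups separate is what allows a log message to say whether a file was absent or merely unreadable as an image.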

Changes Made

CLI Enhancement: Added --validate-images option to both predict and train commands
Configuration: Added validate_images: bool = False parameter to both config classes
Validation Logic: Implemented _validate_filepath_with_pil() function using PIL
Integration: Enhanced existing validation methods to use PIL when enabled
Logging: Added specific messages for PIL validation failures
Tests: Added tests for the image validation feature
Documentation: Added usage examples and documentation for the new option

Example Output

With validation enabled, users will see:

INFO     | Validating image files exist and can be opened with PIL
WARNING  | 2 files in provided labels file do not exist on disk or cannot be opened with PIL; ignoring those files. Example: ['corrupt_image.jpg', 'invalid_file.jpg']...

This feature is particularly useful when working with datasets from external sources or when data integrity is uncertain.


Co-authored-by: pjbull <[email protected]>

Initial plan

3067ce2

Copilot AI assigned Copilot and pjbull Jul 8, 2025

Copilot started work on behalf of pjbull July 8, 2025 20:56

Copilot AI and others added 2 commits July 8, 2025 21:12

Add --validate-images CLI option for PIL image validation

5f08e78

Co-authored-by: pjbull <[email protected]>

Add comprehensive tests and documentation for image validation feature

aa95969

Co-authored-by: pjbull <[email protected]>

Copilot AI requested a review from pjbull July 8, 2025 21:18

Copilot finished work on behalf of pjbull July 8, 2025 21:18
