
Add --validate-images CLI option to filter corrupt images using PIL #388


Draft
wants to merge 3 commits into
base: master
from

Conversation


@Copilot Copilot AI commented Jul 8, 2025

Adds a new --validate-images CLI option that enables PIL-based image validation to filter out corrupt or invalid image files during processing.

Problem

When processing image datasets, users may encounter corrupt or invalid image files that cause processing to fail. Currently, zamba only checks that files exist and have non-zero size; it does not verify that they are valid images that can be opened and decoded.

Solution

This PR adds a new CLI option --validate-images that:

  • Attempts to open each image file with PIL (Python Imaging Library)
  • Filters out images that cannot be opened or decoded
  • Logs appropriate warning messages about filtered files
  • Continues processing with only valid images
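
As a rough illustration of the steps above, a minimal sketch might look like the following. The names `can_open_with_pil` and `filter_valid_images` are hypothetical and are not the functions added in this PR; the `is_valid` parameter is an assumption made so the filter can be exercised without Pillow installed. Note that `Image.verify()` checks file integrity without decoding the full pixel data, so a stricter check would reopen the file and call `load()`.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def can_open_with_pil(path) -> bool:
    """Return True if Pillow can open and verify the file (hypothetical helper)."""
    # Imported lazily so the rest of this sketch works without Pillow installed.
    from PIL import Image

    try:
        with Image.open(path) as img:
            # verify() checks integrity without decoding the full image data.
            img.verify()
        return True
    except Exception:
        # Broad on purpose: verify() can raise different exception types
        # depending on the image format.
        return False


def filter_valid_images(
    paths: Iterable,
    is_valid: Callable[[object], bool] = can_open_with_pil,
) -> Tuple[List, List]:
    """Split paths into (valid, rejected), logging a warning for rejects."""
    valid, rejected = [], []
    for path in paths:
        (valid if is_valid(path) else rejected).append(path)
    if rejected:
        logger.warning(
            "%d files cannot be opened with PIL; ignoring those files.",
            len(rejected),
        )
    return valid, rejected
```

Processing would then continue with only the first element of the returned tuple.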

Usage

Command Line Interface

For image prediction:

zamba image predict --data-dir /path/to/images --validate-images

For image training:

zamba image train --data-dir /path/to/images --labels /path/to/labels.csv --validate-images

Python API

from zamba.images.config import ImageClassificationPredictConfig

config = ImageClassificationPredictConfig(
    data_dir="/path/to/images",
    validate_images=True
)

Implementation Details

  • Backward Compatible: Feature is disabled by default (validate_images=False)
  • Comprehensive Logging: Distinguishes between file existence failures and PIL validation failures
  • Efficient Processing: Uses parallel processing for training validation
  • Robust Error Handling: Gracefully handles all PIL-related exceptions
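
To make the logging and parallelism points concrete, here is one possible shape for such a check. It is a sketch under assumptions, not the code in this PR: `check_image`, `check_images_parallel`, the status strings, and the use of a thread pool are all hypothetical choices, and the `can_open` predicate stands in for whatever PIL-based check is used.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def check_image(path, can_open: Callable[[Path], bool]) -> str:
    """Classify one file so the two failure modes can be logged separately."""
    p = Path(path)
    # Existence and size check: file must exist and be non-empty.
    if not p.is_file() or p.stat().st_size == 0:
        return "missing_or_empty"
    # Image check: only runs when the file is present and non-empty.
    return "ok" if can_open(p) else "failed_pil"


def check_images_parallel(
    paths: Iterable,
    can_open: Callable[[Path], bool],
    max_workers: int = 4,
) -> Dict:
    """Run check_image over all paths with a thread pool (an assumed design)."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so statuses line up with paths.
        statuses = list(pool.map(lambda p: check_image(p, can_open), paths))
    return dict(zip(paths, statuses))
```

Keeping the status per file, rather than a single pass/fail flag, is what lets the caller report file existence failures and PIL validation failures in separate log messages.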

Changes Made

  1. CLI Enhancement: Added --validate-images option to both predict and train commands
  2. Configuration: Added validate_images: bool = False parameter to both config classes
  3. Validation Logic: Implemented _validate_filepath_with_pil() function using PIL
  4. Integration: Enhanced existing validation methods to use PIL when enabled
  5. Logging: Added specific messages for PIL validation failures
  6. Tests: Comprehensive test suite covering all functionality
  7. Documentation: Added detailed usage examples and documentation

Example Output

With validation enabled, users will see:

INFO     | Validating image files exist and can be opened with PIL
WARNING  | 2 files in provided labels file do not exist on disk or cannot be opened with PIL; ignoring those files. Example: ['corrupt_image.jpg', 'invalid_file.jpg']...
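
For illustration only, a summary message of this shape could be assembled as below. `build_filter_warning` and `max_examples` are hypothetical names, and the wording is copied from the sample output above rather than taken from the PR's code.

```python
from typing import Sequence


def build_filter_warning(rejected: Sequence, max_examples: int = 5) -> str:
    """Build a one-line summary of filtered files with a few example names."""
    # Show at most max_examples names so the log line stays short.
    examples = [str(p) for p in rejected[:max_examples]]
    return (
        f"{len(rejected)} files in provided labels file do not exist on disk "
        f"or cannot be opened with PIL; ignoring those files. "
        f"Example: {examples}..."
    )
```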

This feature is particularly useful when working with datasets from external sources or when data integrity is uncertain.



@Copilot Copilot AI changed the title from "[WIP] Can you add a CLI option for images that will try to open images with PIL and filter them out of the list we are processing (including adding a message in the logs) if they fail to load?" to "Add --validate-images CLI option to filter corrupt images using PIL" Jul 8, 2025
@Copilot Copilot AI requested a review from pjbull July 8, 2025 21:18
Copilot finished work on behalf of pjbull July 8, 2025 21:18
Labels
None yet
Projects
None yet

2 participants