Conversation

@amyeroberts amyeroberts commented Aug 8, 2022

What does this PR do?

This is the first of a series of PRs to replace feature extractors with image processors for vision models.

Create a new module image_transforms.py that will contain functions for transforming images e.g. resize.

The functions are designed to:

  • Accept numpy arrays.
  • Return numpy arrays (except for e.g. to_pil_image).
  • Provide logic such that the new image processors produce the same outputs as the feature extractors when called directly (a rough sketch of this interface follows below).
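
As a rough illustration of that interface (the function below is illustrative, not necessarily one of the functions added in this PR), a transform takes a numpy array in and returns a numpy array out:

import numpy as np

def rescale(image: np.ndarray, scale: float) -> np.ndarray:
    # Accepts a numpy array and returns a numpy array, per the design above.
    return (image * scale).astype(np.float32)

# Example: bring uint8 pixel values into the [0, 1] range.
image = np.zeros((224, 224, 3), dtype=np.uint8)
rescaled = rescale(image, scale=1 / 255)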

Subsequent PRs:

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

def __call__(self, images, **kwargs) -> BatchFeature:
    return self.preprocess(images, **kwargs)

def preprocess(self, images, **kwargs) -> BatchFeature:
Contributor

Ok, nice :) could possibly also include a postprocess method, although I'm wondering whether that could be a method on the model (as it's oftentimes framework-specific)

Definitely stuff for later :)

cc @alaradirik

Contributor Author

I didn't include a postprocess method, as one model can have many different downstream tasks. I was thinking of using a structure similar to the current feature extractors where we have postprocess_task methods. What do you think?

I don't think the postprocessing methods should fall under the model if they're outside of what's needed for training; I wouldn't want to have to load a model just to process outputs. Regarding framework-specific code, we'll have to think about how to handle this, as we need to support the different framework implementations and keep their outputs consistent. If specific libraries are necessary, we could have the image processor call the specific implementation, e.g. postprocess_instance_segmentation calls _postprocess_instance_segmentation_pytorch.
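
A minimal sketch of that dispatch idea (the method names and output format here are hypothetical, not the final API):

import numpy as np

class ImageProcessor:
    def postprocess_instance_segmentation(self, outputs, framework: str = "pt") -> np.ndarray:
        # Route to a framework-specific helper so callers get consistent numpy outputs.
        if framework == "pt":
            return self._postprocess_instance_segmentation_pytorch(outputs)
        raise ValueError(f"Unsupported framework: {framework!r}")

    def _postprocess_instance_segmentation_pytorch(self, outputs) -> np.ndarray:
        import torch

        # Hypothetical example: reduce per-pixel class logits to a label map.
        logits = outputs["logits"]  # shape (batch, num_classes, height, width)
        return torch.argmax(logits, dim=1).cpu().numpy()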

Contributor

@alaradirik alaradirik Sep 28, 2022

@amyeroberts @NielsRogge I'm a bit late to the conversation but I agree that postprocessing methods shouldn't fall under the modeling files.

Adding postprocess_task methods sounds good to me! We just need to ensure consistent inputs and outputs for these where applicable.

    return_numpy: bool = True,
) -> np.ndarray:
    """
    Resizes `image` to (h, w) specified by `size` using the PIL library.
Contributor

@NielsRogge NielsRogge Aug 22, 2022

Given that we use Pillow for resizing, maybe it makes more sense to let size be a tuple of (width, height) rather than the other way around?

Contributor

@NielsRogge NielsRogge Aug 22, 2022

Also, won't this method support the same behaviour as torchvision's resize? i.e. when size is an int, it only resizes the smaller edge of the image

Contributor Author

Given that we use Pillow for resizing, maybe it makes more sense to let size be a tuple of (width, height) rather than the other way around?

As the input and output to the function are numpy arrays, I think that would be confusing. Ideally the user wouldn't have any knowledge about the library used to resize.

Also, won't this method support the same behaviour as torchvision's resize? i.e. when size is an int, it only resizes the smaller edge of the image

I moved the logic to find the dimensions of the output image into get_resize_output_image_size so that resize does just one thing. This is in part because there's a difference in behaviour between TF and PyTorch. Then each model's image processor can have its own logic within its resize method for finding the output shape, or use get_resize_output_image_size which has the same behaviour as before.
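
For reference, a sketch of the kind of logic get_resize_output_image_size can hold when size is a single int (mirroring torchvision by resizing the shorter edge and preserving the aspect ratio); the actual signature and defaults in the PR may differ:

import numpy as np

def get_resize_output_image_size(image: np.ndarray, size) -> tuple:
    # An explicit (height, width) pair is used as-is.
    if isinstance(size, (tuple, list)):
        return tuple(size)
    # A single int is treated as the target length of the shorter edge;
    # the longer edge is scaled to preserve the aspect ratio.
    height, width = image.shape[:2]  # assumes a (height, width, channels) array
    short, long = (height, width) if height <= width else (width, height)
    new_long = int(round(size * long / short))
    return (size, new_long) if height <= width else (new_long, size)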

return PIL.Image.fromarray(image)


def get_resize_output_image_size(
Contributor

@NielsRogge NielsRogge Aug 22, 2022

Where is this function used?

Contributor Author

It's not used yet, because GLPN has its own logic for finding the output image size based on size_divisor.

Collaborator

@sgugger sgugger left a comment

Left a couple of nits but looking great!

Co-authored-by: Sylvain Gugger <[email protected]>

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@amyeroberts
Contributor Author

@alaradirik @NielsRogge Could you (re-)review?

Contributor

@alaradirik alaradirik left a comment

All looks good to me!

I just have a question regarding multi-modal models such as CLIP and OWL-ViT. These models have both feature extractors and processors, which call their respective tokenizer and feature extractor. Wouldn't creating XXModelProcessor aliases for their feature extractors create issues?

@amyeroberts
Contributor Author

I just have a question regarding multi-modal models such as CLIP and OWL-ViT. These models have both feature extractors and processors, which call their respective tokenizer and feature extractor. Wouldn't creating XXModelProcessor aliases for their feature extractors create issues?

@alaradirik I believe this should be OK, as the feature extractors are being mapped to XxxImageProcessor rather than XxxProcessor, so there's no clash of names. Not sure if this answers your question or I've missed the consequence you're asking about.
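
For illustration, the naming can stay collision-free with an alias along these lines (a hypothetical sketch using CLIP; not necessarily how the PR or its follow-ups implement it):

class CLIPImageProcessor:
    """Stand-in for the new image processor class."""

class CLIPFeatureExtractor(CLIPImageProcessor):
    """Hypothetical backwards-compatibility alias: the old feature extractor name
    maps to the image processor, while CLIPProcessor (tokenizer + image processor)
    keeps its own, distinct name, so there is no clash."""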

Use PIL.Image.XXX resampling values instead of the PIL.Image.Resampling.XXX enum, as the enum is only available in recent Pillow versions (>= 9.1.0), the Pillow version is not yet pinned, and the older constants are only deprecated rather than removed.
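
A small compatibility shim along these lines illustrates the idea (the use of the packaging library and the PILImageResampling name are assumptions for this sketch, not necessarily what the commit does):

import PIL
import PIL.Image
from packaging import version

# Image.Resampling only exists in Pillow >= 9.1.0; fall back to the older
# module-level constants so unpinned, older Pillow installs keep working.
if version.parse(PIL.__version__) >= version.parse("9.1.0"):
    PILImageResampling = PIL.Image.Resampling
else:
    PILImageResampling = PIL.Image

resample = PILImageResampling.BILINEAR  # works with both old and new Pillow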