[Model] Add PaddleOCR-VL Model Support #42178
base: main
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: auto
zucchini-nlp left a comment
hey @zhang-prog, thanks for the PR! Great model to have in transformers!
The main thing to fix first is the naming: it should clearly include "PaddlePaddleOCR" and follow the usual pattern depending on the modality. The config format also isn't right; it needs to be fully nested, with the text and vision configs inside. Additionally, there are no tests or docs, and several files are missing. You can run `transformers add-new-model-like`, which will generate a placeholder with the necessary files. I also left some smaller comments here and there. Let me know if you hit any issues.
```python
if height < factor:
    width = round((width * factor) / height)
    height = factor

if width < factor:
    height = round((height * factor) / width)
    width = factor
```
Same as Qwen, but with support for H/W smaller than a factor. I think we made Qwen-VL support small images as well, so probably importing it directly will give the expected result?
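For reference, the full Qwen-style smart resize including the small-image branch from the diff can be sketched as follows. This is a minimal re-creation for illustration only; the defaults for `factor`, `min_pixels`, and `max_pixels` are assumptions borrowed from Qwen2-VL's usual values.

```python
import math


def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    # Upscale any side smaller than `factor` first, as in the diff above.
    if height < factor:
        width = round((width * factor) / height)
        height = factor
    if width < factor:
        height = round((height * factor) / width)
        width = factor
    # Snap both sides to multiples of `factor`.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    # Rescale into the [min_pixels, max_pixels] budget, keeping the aspect ratio.
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```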
```python
    return h_bar, w_bar


class PaddleOCRVLImageProcessor(Qwen2VLImageProcessor):
```
We are currently recommending adding a fast image processor first for new models, and adding the slow version only as a complementary fallback.
Can you add a fast processor as well? There is some info about fast processors in #36978.
```python
self.min_pixels = min_pixels
self.max_pixels = max_pixels
```
Let's use `size` instead of `min_pixels`/`max_pixels`. We've been trying to standardize attribute naming lately, and `size` is the common arg for this.
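One way that migration could look (a sketch; the helper name is hypothetical, and the `shortest_edge`/`longest_edge` keys follow the Qwen2-VL fast-processor convention of mapping min/max pixel budgets into the `size` dict):

```python
def normalize_size_kwargs(size=None, min_pixels=None, max_pixels=None):
    # Accept the legacy min_pixels/max_pixels kwargs for backward compatibility,
    # but store them under the standardized `size` dict. Explicit `size` entries
    # take precedence over the legacy values.
    if size is None:
        size = {}
    if min_pixels is not None:
        size.setdefault("shortest_edge", min_pixels)
    if max_pixels is not None:
        size.setdefault("longest_edge", max_pixels)
    return size
```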
```python
attributes = ["image_processor", "tokenizer"]
valid_kwargs = [
    "chat_template",
    "image_std",
    "min_pixels",
    "image_mean",
    "merge_size",
    "image_processor_type",
    "temporal_patch_size",
    "patch_size",
    "max_pixels",
]

image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
```
We don't need these anymore with the recent changes in v5.
```python
tokenizer_class = "AutoTokenizer"

def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
    self.image_token = "<|IMAGE_PLACEHOLDER|>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
```
Can you add the token to the tokenizer so we can assume it's always available?
https://huggingface.co/docs/transformers/en/main_classes/tokenizer#multimodal-tokenizer
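Per the linked docs, one way to guarantee this (a sketch; the exact placement inside tokenizer_config.json is an assumption) is to register the placeholder under `extra_special_tokens` so `tokenizer.image_token` always resolves:

```json
{
  "extra_special_tokens": {
    "image_token": "<|IMAGE_PLACEHOLDER|>"
  }
}
```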
```python
loss = None
if labels is not None:
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    shift_logits = shift_logits.view(-1, self.config.vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    loss = loss_fct(shift_logits, shift_labels)
```
Let's use `self.loss_fn` here.
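For context, the inline block above implements the standard shifted cross entropy that the shared loss helper encapsulates. A toy pure-Python re-creation of that math, purely for illustration (the function name and signature here are assumptions, not the actual helper):

```python
import math


def causal_lm_loss(logits, labels):
    # Toy stand-in for the shared helper: shift so that tokens < n predict n,
    # then average the negative log-likelihood over the shifted pairs.
    shift_logits = logits[:-1]  # drop the last position's logits
    shift_labels = labels[1:]   # drop the first label
    total = 0.0
    for row, label in zip(shift_logits, shift_labels):
        denom = sum(math.exp(x) for x in row)
        total += -math.log(math.exp(row[label]) / denom)
    return total / len(shift_labels)
```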
```python
if not return_dict:
    output = (logits,) + outputs[1:]
    return (loss,) + output if loss is not None else output
```
Not needed as long as the forward is decorated with `can_return_tuple`.
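A toy re-creation of what that decorator does (an assumption about its behavior, simplified for illustration): when the caller passes `return_dict=False`, the decorator converts the dataclass output into a tuple of its non-None fields, so `forward` no longer needs the manual `if not return_dict:` branch.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


def can_return_tuple(forward):
    # Wrap forward so the tuple conversion happens in one shared place.
    def wrapper(self, *args, return_dict=True, **kwargs):
        output = forward(self, *args, **kwargs)
        if return_dict:
            return output
        return tuple(
            getattr(output, f.name) for f in fields(output)
            if getattr(output, f.name) is not None
        )
    return wrapper


@dataclass
class ToyOutput:
    loss: Optional[float]
    logits: List[int]


class ToyModel:
    @can_return_tuple
    def forward(self, logits, loss=None):
        return ToyOutput(loss=loss, logits=logits)
```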
```python
    rope_deltas=self.rope_deltas,
)

def prepare_inputs_for_generation(
```
Same as Qwen2.5-VL; can be deleted after inheriting from it.
```python
    return model_inputs


def _get_image_nums_and_video_nums(
```
Same as Qwen2.5-VL; can be deleted after inheriting from it.
```python
    return image_nums, video_nums


def _expand_inputs_for_generation(
```
Same as Qwen2.5-VL; can be deleted after inheriting from it.
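A toy illustration of this suggestion (real class names and signatures simplified): once the model class inherits from its Qwen2.5-VL counterpart, duplicated generation helpers like the ones above resolve through inheritance and can simply be deleted.

```python
class Qwen25VLForConditionalGeneration:
    # Stand-in for the Qwen2.5-VL base class carrying the shared helpers.
    def prepare_inputs_for_generation(self, **kwargs):
        return {"source": type(self).__name__, **kwargs}

    def _expand_inputs_for_generation(self, expand_size=1, **model_kwargs):
        return expand_size, model_kwargs


class PaddleOCRVLForConditionalGeneration(Qwen25VLForConditionalGeneration):
    pass  # no re-implementation needed; both helpers come from the parent
```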
What does this PR do?
This PR adds the PaddleOCR-VL model from PaddleOCR to Hugging Face Transformers.
Relevant links:
PaddleOCR
https://huggingface.co/PaddlePaddle/PaddleOCR-VL
Usage
Use a pipeline
Load model directly