Add support for Apple's Depth-Pro #34583
Conversation
|
I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output. Below is a concise overview of the class hierarchy; I would greatly appreciate your feedback or any suggestions for improvements. I also have a couple of questions:
|
|
cc @pcuenca as well! |
|
Hi @geetu040! Thanks for working on this model! Regarding model outputs: they should only be rewritten if you want to add a new argument or write better docs. In the case of intermediate outputs, you can store them in `hidden_states`.
|
@qubvel @pcuenca Thanks, I have updated the code for hidden_states. I still need an opinion on the output class. The existing class is:

```python
class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
```

Q1: Do I create a new output class that also carries the field-of-view output, or should I reuse `DepthEstimatorOutput` as is?

Q2: Can the option to enable/disable the FOV model be a parameter of the config?
|
Thanks @geetu040!

Q1: Create a subclass that adds the new field:

```python
class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None
```

This output can be returned in both cases.

Q2: Yeah, this can be a parameter of the config, but it should also be an argument in … Please let me know if you have more questions!
This needs to be done during …
|
OK, got it! Then it should be done with the config, and anyone can load the model as follows:

```python
model = DepthProForDepthEstimation(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation(checkpoint, fov_model=False)
```

With such initialization …
I was wondering: can we also give users the option to decide which scales to use? For example, a user could specify custom scales in the config.
@qubvel I have looked into how this could be implemented; it is doable and I can easily make this option available, and I would prefer that. But I have to ask you as well: do you think this option should be given to users?
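As a concrete (and hedged) sketch of the config-driven approach discussed above, assuming the `use_fov_model` flag and the `apple/DepthPro-hf` checkpoint that appear later in this thread:

```python
from transformers import DepthProConfig, DepthProForDepthEstimation

# build a fresh model with the FOV head enabled via the config
config = DepthProConfig(use_fov_model=True)
model = DepthProForDepthEstimation(config)

# or override the config attribute when loading pretrained weights
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)
```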
|
Hi @geetu040, we try to avoid overcomplicated code with lots of parameters; the general rule is to get rid of different code paths / unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.
|
Hi @qubvel, I have a question about the image processor. The original source code pre-processes the image differently from the Transformers convention, and this causes the two outputs to be slightly different from each other. Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not very sure how to do the latter, because the original …

Difference in outputs: there is a slight difference, and this happens because of how the image is pre-processed before being given to the model. [Images: source code results, HF code results]

Difference in output image: visually there is no difference between the two images.
|
Also, how does the weight conversion work? I have created the weight conversion script, but when and by whom do the converted weights get uploaded to the Hugging Face Hub? I would need these converted weights for the examples in the docstrings.
|
run-slow: depth_pro |
|
This comment contains run-slow, running the specified jobs: models: ['models/depth_pro']
|
@qubvel Shouldn't the checkpoint be `apple/DepthPro-hf`?
|
@geetu040 I followed the same pattern for the model family. Let me check if there's anything we can do. |
Otherwise I can change the checkpoint in the code, and it would also need to be changed in the model card.
|
https://huggingface.co/apple/depth-pro-hf now redirects to https://huggingface.co/apple/DepthPro-hf.
|
But good point about the model card, I'll change it so it's less confusing. Edit: updated.
sure, updated! |
|
run-slow: depth_pro |
|
This comment contains run-slow, running the specified jobs: models: ['models/depth_pro']
|
@qubvel, the tests did not run in the last 2 attempts (see the error log). Looks like something is wrong with the workflow itself.
|
@geetu040 yes, something is wrong with CI; waiting for the team to fix it.
|
run-slow: depth_pro |
|
This comment contains run-slow, running the specified jobs: models: ['models/depth_pro']
|
@geetu040, congratulations on getting the model merged! 🎉🚀 |
|
Thanks @qubvel |
* implement config and model building blocks
* refactor model architechture
* update model outputs
* update init param to include use_fov_model
* update param name in config
* fix hidden_states and attentions outputs for fov
* sort config
* complete minor todos
* update patching
* update config for encoder
* fix config
* use correct defaults in config
* update merge for compatibility with different image size
* restructure encoder for custom configuration
* make fov model compatible with custom config
* replace word "decoder" with "fusion"
* weight conversion script
* fix fov squeeze
* update conversion script (without test)
* upload ruff image processing
* create fast image processing
* use torch interpolation for image processing
* complete post_process_depth_estimation
* config: fix imports and sort args
* apply inference in weight conversion
* use mllama script instead for weight conversion
* clean weight conversion script
* add depth-pro status in other files
* fill docstring in config
* formatting
* more formatting
* formatting with ruff
* formatting with style
* fix copied classes
* add examples; update weight convert script
* fix using check_table.py and isort
* fix config docstring
* add depth pro to sdpa docs
* undo unintentional changes in configuration_gemma.py
* minor fixes
* test image processing
* fixes and tests
* more fixes
* use output states from image_encoder instead
* Revert "use output states from image_encoder instead" (this reverts commit 2408ec5)
* make embeddings dynamic
* reshape output hidden states and attentions as part of computation graph
* fix ruff formating
* fix docstring failure
* use num_fov_head_layers in tests
* update doc
* check consistency with config
* ruff formatting
* update test case
* fix ruff formatting
* add tests for fov
* use interpolation in postprocess
* run and fix slow tests locally
* use scaled_images_features for image and fov encoder
* return fused_hidden_states in fusion stage
* fix example
* fix ruff
* fix copyright license for all files
* add __all__ for each file
* minor fixes
  - fix download spell
  - add push_to_hub option
  - fix Optional type hinting
  - apply single loop for DepthProImageProcessor.preprocess
* return list in post_process_depth_estimation
* minor fixes
  - capitalize start of docstring
  - use ignore copy
  - fix examples
  - move docstring templates and custom output classes to top
  - remove "-> None" typehinting from __init__
  - type hinting for forward passes
  - fix docstrings for custom output classes
* fix "ruff check"
* update upsample and projection
* major changes (image size and merge optimization)
  - add support for images of any size
  - optimize merge operation
  - remove image_size from config
  - use full names instead of B, C, H, W
  - remove interpolation from fusion stage
  - add interpolation after merge
  - move validations to config
  - update integration test
  - add type hints for functions
* fix push_to_hub option in weights conversion
* remove image_size in weights conversion
* major changes in the architecture
  - remove all DepthProViT modules and support different backbones using the AutoModel API
  - set default use_fov_model to False
  - validate parameters in configuration
  - update interpolate function: use "nearest" for faster computation
  - update reshape_feature function: remove all special tokens, possible from different backbones
  - update merge function: use padding from config instead of merge_out_size
  - remove patch_to_batch and batch_to_patch conversions for now
  - calculate out_size dynamically in the encoder
  - leave head_mask calculation to the backbone
  - fix bugs with merge
  - add more comments
  - update tests
* placeholder for unused config attributes
* improve docs amid review
* minor change in docs
* further optimize merge
* fix formatting
* remove unused patch/batch convertion functions
* use original F.interpolate
* improve function naming
* minor chages
  - use torch_int instead of int
  - use proper for newly initialized tensors
  - use user provided return_dict for patch_encoder
  - use if-else block instead in self.use_fov_model
* rearchitect upsample block for improved modularity
* update upsample keys in weight conversion
* improve padding in merge_patches
* use double-loop for merge
* update comments
* create feature_extractor, reduce some forward code
* introduce config.use_mask_token in dinov2
* minor fixes
* minor fixes for onnx
* update __init__ to latest format
* remove DepthProConfig.to_dict()
* major changes in backbone
* update config in weight conversion
* formatting
* converted model is fp32
* improve naming and docs for feature_extractor->reconstruct_feature_maps
* minor fixes; amid review
* create intermediate vars in func call
* use torch.testing.assert_close
* use ModuleList instead of Sequential and ModuleDict
* update docs
* include fov in integraiton tests
* update docs
* improve initialization of convolution layers
* fix unused fov keys
* update tests
* ruff format
* fix test, amid kaimming initialization
* add depthpro to toctree
* add residual layer to _no_split_modules
* architecture rework
* Update src/transformers/models/depth_pro/image_processing_depth_pro.py (Co-authored-by: Pavel Iakubovskii <[email protected]>)
* Update src/transformers/models/depth_pro/image_processing_depth_pro_fast.py (Co-authored-by: Pavel Iakubovskii <[email protected]>)
* update docs
* improve merge_patches
* use flatten with fov_output
* ruff formatting
* update resources section in docs (Co-authored-by: Pavel Iakubovskii <[email protected]>)
* fix typo "final_kernal_size" (Co-authored-by: Pavel Iakubovskii <[email protected]>)
* fix output typehint for DepthProDepthEstimator (Co-authored-by: Pavel Iakubovskii <[email protected]>)
* residual operation in 2 steps (Co-authored-by: Pavel Iakubovskii <[email protected]>)
* use image_size instead of global patch_size in interpolation
* replace all Sequential with ModuleList
* update fov
* update heads
* fix and update conversion script for heads
* ruff formatting
* remove float32 conversion
* use "Fov" instead of "FOV" in class names
* use "Fov" instead of "FOV" in config docs
* remove prune_heads
* update fusion stage
* use device in examples
* update processor
* ruff fixes
* add do_rescale in image_processor_dict
* skip test: test_fast_is_faster_than_slow
* ruff formatting
* DepthProImageProcessorFast in other files
* revert antialias removal
* add antialias in BaseImageProcessorFast
* Revert "revert antialias removal" (this reverts commit 5caa0bd)
* Revert "add antialias in BaseImageProcessorFast" (this reverts commit 3ae1134)
* update processor for grouping and antialias
* try test_fast_is_faster_than_slow without "skip" or "flanky"
* update checkpoint
* update checkpoint
* use @is_flanky for processor test
* update checkpoint to "apple/DepthPro-hf"

---------

Co-authored-by: Pavel Iakubovskii <[email protected]>
|
Hi! I know this is merged already, but I wanted to point out a small detail that is missing compared to Apple's implementation. In the original image-loading code, they try to read the focal length from the image's metadata and convert it to a focal length in pixels, `f_px` (reference: Apple's implementation). Then later, although by default they predict the FOV, if `f_px` is provided they use it directly instead of the estimated field of view (reference: Apple's implementation). I think this would be good to add to Hugging Face's implementation, since it removes one layer of estimation, lowering the margin of error.
|
Hi @Boulaouaney, thanks a lot for pointing this out! Would you like to propose a PR to update it? |
|
Hi @qubvel and @Boulaouaney, if the user already knows the field of view for a given image, they can set it on the outputs manually before post-processing, along these lines:

```python
...
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)
outputs = model(**inputs)

# use the inferred fov for the given image
outputs.field_of_view = torch.tensor([0.4])

post_processed_output = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)],
)
...
```
|
Hi @geetu040, yes, you're right: one could manually adjust the image processing if they know the focal length used to take the image. However, I still believe it would be more efficient to handle this automatically within the image processor. Automating this process ensures consistency and reduces the potential for human error. Additionally, it would streamline the workflow for users who may not be familiar with the technical details of image processing. @qubvel yes, if necessary I can open a PR to update this when I'm free.
|
@Boulaouaney I understand, so you are suggesting that we also extract the FOV value, along with the RGB values, from the raw input image, right?
|
I'm not sure we need to read it in the image processor, because I'm not sure how consistent it is across image formats. Currently, image/metadata reading is the user's responsibility. However, we should clearly describe this in the documentation and the model card. Also, it's better to avoid modifying …
|
@geetu040 yes, I am suggesting to extract the 35mm-equivalent focal length from the image's metadata, if it exists, and convert it to a focal length in pixels. The logic to be added would be similar to Apple's implementation:

```python
# Extract focal length at 35mm from exif data
f_35mm = img_exif.get(
    "FocalLengthIn35mmFilm",
    img_exif.get(
        "FocalLenIn35mmFilm", img_exif.get("FocalLengthIn35mmFormat", None)
    ),
)

# if extracted, convert it to pixels
if f_35mm is not None:
    f_px = f_35mm * np.sqrt(width**2.0 + height**2.0) / np.sqrt(36**2 + 24**2)
```

But I guess @qubvel's suggestion to leave the metadata handling to the user during image loading would also be good.
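If metadata handling is left to the user as suggested, a rough sketch of how the `img_exif`, `width`, and `height` values above could be obtained with Pillow might look like this (untested, and EXIF layouts vary between formats, so treat it as an assumption rather than a recipe):

```python
import numpy as np
from PIL import Image, ExifTags

img = Image.open("photo.jpg")
width, height = img.size

# Gather EXIF tags by name; the 35mm-equivalent focal length usually sits in the Exif sub-IFD.
exif = img.getexif()
img_exif = {ExifTags.TAGS.get(tag, tag): value for tag, value in exif.items()}
img_exif.update(
    {ExifTags.TAGS.get(tag, tag): value for tag, value in exif.get_ifd(ExifTags.IFD.Exif).items()}
)

f_35mm = img_exif.get("FocalLengthIn35mmFilm")
if f_35mm:
    # convert the 35mm-equivalent focal length to pixels (full-frame sensor is 36 x 24 mm)
    f_px = f_35mm * np.sqrt(width**2 + height**2) / np.sqrt(36**2 + 24**2)
```

The resulting `f_px` is what Apple's pipeline feeds into inference in place of the estimated FOV, as described earlier in the thread.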
|
Ok, it would be nice to mention it in the docs (…).



What does this PR do?
Fixes #34020
This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It leverages a multi-scale vision transformer optimized for dense predictions: the input image is downsampled at several scales, and at each scale it is split into patches that are processed by a ViT-based (Dinov2) patch encoder with weights shared across scales. The patches are then merged into feature maps, upsampled, and fused via a DPT decoder.
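A brief usage sketch follows; it assumes the final `apple/DepthPro-hf` checkpoint and the `post_process_depth_estimation` API discussed in this thread, so details may differ slightly from the released version:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DepthProForDepthEstimation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("apple/DepthPro-hf")
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf")

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

post_processed = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)]
)
predicted_depth = post_processed[0]["predicted_depth"]  # metric depth map at the original resolution
```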
Relevant Links
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@amyeroberts, @qubvel