
Conversation

geetu040
Contributor

@geetu040 geetu040 commented Nov 3, 2024

What does this PR do?

Fixes #34020

This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It leverages a multi-scale vision transformer optimized for dense predictions: the input image is downsampled at several scales, and at each scale it is split into patches that are processed by a ViT-based (Dinov2) patch encoder with weights shared across scales. The patch features are merged into feature maps, upsampled, and fused via a DPT decoder.
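For intuition, a rough sketch of that multi-scale patching step (the image size, patch size, scales, and non-overlapping patching below are illustrative, not the exact DepthPro configuration):

import torch
import torch.nn.functional as F

def extract_patches(image, patch_size=384):
    # Split an image into non-overlapping square patches, stacked along the batch dim.
    patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    batch, channels, n_h, n_w, p_h, p_w = patches.shape
    patches = patches.permute(0, 2, 3, 1, 4, 5)
    return patches.reshape(batch * n_h * n_w, channels, p_h, p_w)

image = torch.rand(1, 3, 1536, 1536)
scales = [1.0, 0.5, 0.25]  # high, medium, low resolution

all_patches = []
for scale in scales:
    scaled = F.interpolate(image, scale_factor=scale, mode="bilinear", align_corners=False)
    all_patches.append(extract_patches(scaled))

# One shared (Dinov2-based) patch encoder processes all patches; the resulting features
# are then merged per scale, upsampled, and fused by the DPT decoder.
patches = torch.cat(all_patches, dim=0)  # (21, 3, 384, 384) for the sizes above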

Relevant Links

Before submitting

Who can review?

@amyeroberts, @qubvel

@geetu040
Contributor Author

geetu040 commented Nov 3, 2024

I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output.

Below is a concise overview of the class hierarchy. I would greatly appreciate your feedback or any suggestions for improvements:

DepthProForDepthEstimation
├── depth_pro: DepthProModel
│   ├── encoder: DepthProEncoder
│   │   ├── patch_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   │   ├── image_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   ├── decoder: DepthProDecoder
│   └── fov_model: DepthProFOVModel
│       ├── encoder: DepthProViT
│       │   ├── embeddings: DepthProViTEmbeddings
│       │   └── encoder: DepthProViTEncoder
└── head: DepthProDepthEstimationHead

I have a couple of questions:

  1. The encoder: DepthProEncoder outputs features processed at various scales, including hidden states from the intermediate layers of ViTEncoder. Currently, I use BaseModelOutput, returning all features in the last_hidden_state argument. Should I create a dedicated ModelOutput class for DepthProEncoder? If so, it should reside in the same file as the DepthPro classes since it is specific to them.

  2. For handling the FOV (Field of View) output, would it be appropriate to create a class named DepthEstimatorOutputWithFOV in transformers.modeling_outputs, or should it also remain within the DepthPro context?

@Rocketknight1
Member

cc @pcuenca as well!

@qubvel
Contributor

qubvel commented Nov 5, 2024

Hi @geetu040! Thanks for working on this model!

Regarding model outputs: a new output class should only be written if you want to add a new argument or provide better docs. For intermediate outputs, you can store them in BaseModelOutput.hidden_states; for example, mllama sets output_hidden_states=True by default and then selects the required hidden states from the vision transformer.
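For reference, a minimal sketch of that pattern (the backbone checkpoint and layer indices below are placeholders, not necessarily what DepthPro ends up using):

import torch
from transformers import AutoModel

backbone = AutoModel.from_pretrained("facebook/dinov2-small")
pixel_values = torch.rand(1, 3, 224, 224)

# Ask the backbone for all hidden states, then pick the intermediate layers the
# decoder needs instead of defining a new output class.
outputs = backbone(pixel_values, output_hidden_states=True)
intermediate_indices = (3, 6, 9, 12)  # placeholder layer indices
selected_hidden_states = [outputs.hidden_states[i] for i in intermediate_indices]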

@geetu040
Contributor Author

@qubvel @pcuenca Thanks, I have updated the code for hidden_states.

I still need an opinion on fov (field of view).
DepthPro returns the predicted_depth as well as the fov, which is a scalar value.

The existing DepthEstimatorOutput class in transformers/src/transformers/modeling_outputs.py looks like this:

class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

Q1: Do I create a new class DepthEstimatorOutputWithFOV or update the existing class?
Q2: The user should be given the option to turn the FOV on or off, because calculating the FOV requires extra processing. In this case, should this parameter be part of model initialization, e.g. DepthProForDepthEstimation(config, return_fov=True), or should it be kept inside the config?

@qubvel
Contributor

qubvel commented Nov 11, 2024

Thanks @geetu040

Q1:

class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None

This output can be returned in both cases: fov=None and not None.

Q2:

Yeah, this can be a parameter of the config, but it should also be an argument of the forward method to override the config parameter (similar to output_hidden_states)

Please, let me know if you have more questions!
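For context, the config-default-with-forward-override pattern referred to here looks roughly like this in existing models (shown schematically with a placeholder ExampleModel; as discussed below, DepthPro ultimately keeps this particular switch in the config only):

import torch.nn as nn

class ExampleModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

    def forward(self, pixel_values, output_hidden_states=None):
        # The forward argument takes precedence; otherwise fall back to the config default.
        output_hidden_states = (
            output_hidden_states
            if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        ...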

@geetu040
Contributor Author

Yeah, this can be a parameter of the config, but it should also be an argument of the forward method to override the config parameter (similar to output_hidden_states)

This needs to be done during __init__, because it requires fov_model (another vision transformer) to be initialized.
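A minimal sketch of why this has to be decided at construction time (placeholder modules stand in for the real submodules; use_fov_model is the flag name the PR eventually adopted):

import torch.nn as nn

class DepthEstimatorSketch(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.depth_pro = nn.Identity()  # stands in for DepthProModel
        self.head = nn.Identity()       # stands in for the depth estimation head
        # The FOV branch is an entire extra ViT, so it has to be built (or skipped)
        # here; it cannot be toggled per forward call without its weights existing.
        self.fov_model = nn.Identity() if getattr(config, "use_fov_model", False) else None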

@qubvel
Contributor

qubvel commented Nov 15, 2024

OK, got it! Then it should be done with the config! Anyone can then just load a model as follows:

model = DepthProForDepthEstimation(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation(checkpoint, fov_model=False)

With such initialization, the fov_model param will be overridden in the config.
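A minimal sketch of that load-time override, using the use_fov_model name and apple/DepthPro-hf checkpoint the PR eventually settled on (extra kwargs passed to from_pretrained that match config attributes are routed into the config):

from transformers import DepthProConfig, DepthProForDepthEstimation

# Override the config attribute directly at load time.
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=True)

# Equivalent, more explicit form.
config = DepthProConfig.from_pretrained("apple/DepthPro-hf", use_fov_model=True)
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", config=config)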

@geetu040
Contributor Author

  • Currently, an image is down-scaled to a medium resolution (high / 2) and a low resolution (high / 4).
  • Patches are then created from the high, medium, and low resolutions and concatenated.

I was wondering: can we also give users the option to decide which scales to use? For example, a user could specify custom scales in the config, such as image_scales=[0.6, 0.4, 0.3]:

  • The image would then be downscaled to these 3 scales.
  • Patches would be created from the high-resolution and scaled images and concatenated.

@qubvel I have looked into how this could be implemented; it is doable, and I could easily make this option available (and I would prefer to), but do you think this option should be given to users?

@qubvel
Contributor

qubvel commented Nov 18, 2024

Hi @geetu040, we try to avoid overcomplicating the code with lots of parameters; the general rule is to get rid of extra code paths / unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.

@geetu040
Contributor Author

geetu040 commented Nov 25, 2024

Hi @qubvel I have a question about the image processor.

The source code from apple/depth-pro preprocesses the image in the sequence normalize -> resize; however, the conventional image processors for ViT and DPT use the sequence resize -> normalize.

This causes the two outputs to differ slightly.

Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not sure how to do the latter: the original resize function gives an error if it is simply moved above the normalization code, and using torch.nn.functional.interpolate is also not optimal, since it requires data conversions.

Here are the outputs

Difference in Outputs

There is a slight difference; this happens because of how the image is preprocessed before being given to the model.

Source code results

ic| depth: tensor([[0.9604, 0.9329, 0.8837,  ..., 3.0123, 2.9720, 2.9517],
                   [0.9210, 0.8995, 0.8605,  ..., 3.0148, 3.0120, 3.0106],
                   [0.8811, 0.8655, 0.8366,  ..., 3.0245, 3.0473, 3.0592],
                   ...,
                   [1.2283, 1.2263, 1.2225,  ..., 1.2698, 1.2818, 1.2881],
                   [1.2228, 1.2241, 1.2266,  ..., 1.2679, 1.2806, 1.2872],
                   [1.2167, 1.2223, 1.2333,  ..., 1.2655, 1.2757, 1.2810]])
ic| depth.shape: torch.Size([2268, 3024])
ic| focallength_px: tensor(3362.0200)

HF code results

ic| predicted_depth: [tensor([[0.9727, 0.9443, 0.8937,  ..., 3.0023, 2.9608, 2.9399],
                             [0.9320, 0.9097, 0.8693,  ..., 3.0045, 3.0006, 2.9987],
                             [0.8899, 0.8737, 0.8439,  ..., 3.0129, 3.0352, 3.0469],
                             ...,
                             [1.2393, 1.2373, 1.2334,  ..., 1.2805, 1.2934, 1.3001],
                             [1.2344, 1.2356, 1.2379,  ..., 1.2802, 1.2935, 1.3004],
                             [1.2286, 1.2341, 1.2447,  ..., 1.2788, 1.2892, 1.2947]])]
ic| fov: [tensor(3383.9839)]

Difference in Output Image

Visually, there is no difference between the two images.

Input Image: [image: example]

Source code results: [image: Figure_1]

HF code results: [image: Figure_2]

@geetu040
Contributor Author

Also, how does the weight conversion work?

I have created the script for weight conversion, but when and by whom does it get uploaded to the Hugging Face Hub? I would need the converted weights for the examples in the docstrings.

@pcuenca
Member

pcuenca commented Feb 7, 2025

Checkpoint has been transferred.

@qubvel
Contributor

qubvel commented Feb 7, 2025

run-slow: depth_pro

Contributor

github-actions bot commented Feb 7, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/depth_pro']
quantizations: [] ...

@geetu040
Contributor Author

geetu040 commented Feb 7, 2025

Checkpoint has been transferred.

@qubvel Shouldn't the checkpoint be apple/depth-pro-hf instead of apple/DepthPro-hf as you suggested? The code uses apple/depth-pro-hf everywhere.

@pcuenca
Member

pcuenca commented Feb 7, 2025

@geetu040 I followed the same pattern for the model family. Let me check if there's anything we can do.

@geetu040
Contributor Author

geetu040 commented Feb 7, 2025

@geetu040 I followed the same pattern for the model family. Let me check if there's anything we can do.

Otherwise, I can change the checkpoint in the code; it would also need to be changed in the model card.

@pcuenca
Member

pcuenca commented Feb 7, 2025

https://huggingface.co/apple/depth-pro-hf now redirects to DepthPro-hf. So code should work.

@pcuenca
Member

pcuenca commented Feb 7, 2025

But good point about the model card, I'll change it so it's less confusing. Edit: updated.

@qubvel
Contributor

qubvel commented Feb 7, 2025

Thanks @pcuenca!

@geetu040 can you please update the code references? sorry for the back-and-forth changes.

Waiting for the slow CI to merge it then.

@geetu040
Contributor Author

geetu040 commented Feb 7, 2025

@geetu040 can you please update the code references? sorry for the back-and-forth changes.

sure, updated!

@qubvel
Contributor

qubvel commented Feb 7, 2025

run-slow: depth_pro

Contributor

github-actions bot commented Feb 7, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/depth_pro']
quantizations: [] ...

@geetu040
Contributor Author

geetu040 commented Feb 8, 2025

@qubvel, the tests did not run in the last 2 attempts.

Error Log
Run PR_MERGE_SHA=$(git log -1 --format=%H)
  PR_MERGE_SHA=$(git log -1 --format=%H)
  if [ $PR_MERGE_SHA != $VERIFIED_PR_MERGE_SHA ]; then
    echo "The merged commit SHA is not the same as the verified one! Security issue detected, abort the workflow!";
    exit -1;
  fi
  shell: sh -e {0}
  env:
    HF_HOME: /mnt/cache
    TRANSFORMERS_IS_CI: yes
    OMP_NUM_THREADS: 8
    MKL_NUM_THREADS: 8
    RUN_SLOW: yes
    HF_HUB_READ_TOKEN: ***
    SIGOPT_API_TOKEN: ***
    TF_FORCE_GPU_ALLOW_GROWTH: true
    RUN_PT_TF_CROSS_TESTS: 1
    CUDA_VISIBLE_DEVICES: 0,1
    matrix_folders: models_depth_pro
    VERIFIED_PR_MERGE_SHA: 850bdaaad25826da72b87e9455296742bb83e331
The merged commit SHA is not the same as the verified one! Security issue detected, abort the workflow!
/__w/_temp/881ad002-6e8a-4896-968a-9a2b93d19337.sh: 4: exit: Illegal number: -1
Error: Process completed with exit code 2.

Looks like something is wrong with the workflow itself.

@qubvel
Contributor

qubvel commented Feb 8, 2025

@geetu040 yes, something is wrong with the CI; waiting for the team to fix it.

@qubvel
Contributor

qubvel commented Feb 10, 2025

run-slow: depth_pro

Contributor

github-actions bot commented Feb 10, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/depth_pro']
quantizations: [] ...

@qubvel qubvel merged commit 9a6be63 into huggingface:main Feb 10, 2025
24 checks passed
@qubvel
Contributor

qubvel commented Feb 10, 2025

@geetu040, congratulations on getting the model merged! 🎉🚀
Fantastic work! Huge thanks for all your iterations to make the modeling code simpler, faster, and more export-friendly. It was a pleasure to support you in contributing this model, thank you!

@geetu040
Contributor Author

Thanks @qubvel

sbucaille pushed a commit to sbucaille/transformers that referenced this pull request Feb 16, 2025
* implement config and model building blocks

* refactor model architechture

* update model outputs

* update init param to include use_fov_model

* update param name in config

* fix hidden_states and attentions outputs for fov

* sort config

* complete minor todos

* update patching

* update config for encoder

* fix config

* use correct defaults in config

* update merge for compatibility with different image size

* restructure encoder for custom configuration

* make fov model compatible with custom config

* replace word "decoder" with "fusion"

* weight conversion script

* fix fov squeeze

* update conversion script (without test)

* upload ruff image processing

* create fast image processing

* use torch interpolation for image processing

* complete post_process_depth_estimation

* config: fix imports and sort args

* apply inference in weight conversion

* use mllama script instead for weight conversion

* clean weight conversion script

* add depth-pro status in other files

* fill docstring in config

* formatting

* more formatting

* formatting with ruff

* formatting with style

* fix copied classes

* add examples; update weight convert script

* fix using check_table.py and isort

* fix config docstring

* add depth pro to sdpa docs

* undo unintentional changes in configuration_gemma.py

* minor fixes

* test image processing

* fixes and tests

* more fixes

* use output states from image_encoder instead

* Revert "use output states from image_encoder instead"

This reverts commit 2408ec5.

* make embeddings dynamic

* reshape output hidden states and attentions as part of computation graph

* fix ruff formating

* fix docstring failure

* use num_fov_head_layers in tests

* update doc

* check consistency with config

* ruff formatting

* update test case

* fix ruff formatting

* add tests for fov

* use interpolation in postprocess

* run and fix slow tests locally

* use scaled_images_features for image and fov encoder

* return fused_hidden_states in fusion stage

* fix example

* fix ruff

* fix copyright license for all files

* add __all__ for each file

* minor fixes
- fix download spell
- add push_to_hub option
- fix Optional type hinting
- apply single loop for DepthProImageProcessor.preprocess

* return list in post_process_depth_estimation

* minor fixes
- capitalize start of docstring
- use ignore copy
- fix examples
- move docstring templates and custom output classes to top
- remove "-> None" typehinting from __init__
- type hinting for forward passes
- fix docstrings for custom output classes

* fix "ruff check"

* update upsample and projection

* major changes: (image size and merge optimization)
- add support for images of any size
- optimize merge operation
- remove image_size from config
- use full names instead of B, C, H, W
- remove interpolation from fusion stage
- add interpolation after merge
- move validations to config
- update integration test
- add type hints for functions

* fix push_to_hub option in weights conversion

* remove image_size in weights conversion

* major changes in the architecture
- remove all DepthProViT modules and support different backbones using the AutoModel API
- set default use_fov_model to False
- validate parameters in configuration
- update interpolate function: use "nearest" for faster computation
- update reshape_feature function: remove all special tokens, possible from different backbones
- update merge function: use padding from config instead of merge_out_size
- remove patch_to_batch and batch_to_patch conversions for now
- calculate out_size dynamically in the encoder
- leave head_mask calculation to the backbone
- fix bugs with merge
- add more comments
- update tests

* placeholder for unused config attributes

* improve docs amid review

* minor change in docs

* further optimize merge

* fix formatting

* remove unused patch/batch convertion functions

* use original F.interpolate

* improve function naming

* minor chages
- use torch_int instead of int
- use proper for newly initialized tensors
- use user provided return_dict for patch_encoder
- use if-else block instead in self.use_fov_model

* rearchitect upsample block for improved modularity

* update upsample keys in weight conversion

* improve padding in merge_patches

* use double-loop for merge

* update comments

* create feature_extractor, reduce some forward code

* introduce config.use_mask_token in dinov2

* minor fixes

* minor fixes for onnx

* update __init__ to latest format

* remove DepthProConfig.to_dict()

* major changes in backbone

* update config in weight conversion

* formatting

* converted model is fp32

* improve naming and docs for feature_extractor->reconstruct_feature_maps

* minor fixes; amid review

* create intermediate vars in func call

* use torch.testing.assert_close

* use ModuleList instead of Sequential and ModuleDict

* update docs

* include fov in integraiton tests

* update docs

* improve initialization of convolution layers

* fix unused fov keys

* update tests

* ruff format

* fix test, amid kaimming initialization

* add depthpro to toctree

* add residual layer to _no_split_modules

* architecture rework

* Update src/transformers/models/depth_pro/image_processing_depth_pro.py

Co-authored-by: Pavel Iakubovskii <[email protected]>

* Update src/transformers/models/depth_pro/image_processing_depth_pro_fast.py

Co-authored-by: Pavel Iakubovskii <[email protected]>

* update docs

* improve merge_patches

* use flatten with fov_output

* ruff formatting

* update resources section in docs

Co-authored-by: Pavel Iakubovskii <[email protected]>

* fix typo "final_kernal_size"

Co-authored-by: Pavel Iakubovskii <[email protected]>

* fix output typehint for DepthProDepthEstimator

Co-authored-by: Pavel Iakubovskii <[email protected]>

* residual operation in 2 steps

Co-authored-by: Pavel Iakubovskii <[email protected]>

* use image_size instead of global patch_size in interpolation

* replace all Sequential with ModuleList

* update fov

* update heads

* fix and update conversion script for heads

* ruff formatting

* remove float32 conversion

* use "Fov" instead of "FOV" in class names

* use "Fov" instead of "FOV" in config docs

* remove prune_heads

* update fusion stage

* use device in examples

* update processor

* ruff fixes

* add do_rescale in image_processor_dict

* skip test: test_fast_is_faster_than_slow

* ruff formatting

* DepthProImageProcessorFast in other files

* revert antialias removal

* add antialias in BaseImageProcessorFast

* Revert "revert antialias removal"

This reverts commit 5caa0bd.

* Revert "add antialias in BaseImageProcessorFast"

This reverts commit 3ae1134.

* update processor for grouping and antialias

* try test_fast_is_faster_than_slow without "skip" or "flanky"

* update checkpoint

* update checkpoint

* use @is_flanky for processor test

* update checkpoint to "apple/DepthPro-hf"

---------

Co-authored-by: Pavel Iakubovskii <[email protected]>
@Boulaouaney

Hi!

I know this is merged already but I wanted to point out a small detail missing compared to Apple's implementation.

In the load_rgb() method in Apple's implementation, they check for the focal length in the EXIF data and extract it if it exists.

Reference: Apple's implementation load_rgb() method

Then later, although by default they predict the FOV if use_fov_head == True, they only use the predicted FOV to calculate the focal length if there was none in the EXIF data.

Reference: Apple's implementation depth_pro.py

I think this would be good to add to Hugging Face's implementation since it removes one layer of estimation, lowering the margin of error.

@qubvel
Contributor

qubvel commented Feb 18, 2025

Hi @Boulaouaney, thanks a lot for pointing this out! Would you like to propose a PR to update it?

@geetu040
Contributor Author

Hi @qubvel and @Boulaouaney
The FOV value is used only by DepthProImageProcessor.post_process_depth_estimation, so couldn't one just update the outputs like this?

...
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)
outputs = model(**inputs)

# use the fov inferred from the given image
outputs.field_of_view = torch.tensor([0.4])

post_processed_output = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)],
)
...

@Boulaouaney

Hi @geetu040,

Yes, you're right. One could manually adjust the image processing if they know the focal length used to take the image. However, I still believe it would be more efficient to handle this automatically within the image processor. Automating this process ensures consistency and reduces the potential for human error. Additionally, it would streamline the workflow for users who may not be familiar with the technical details of image processing.

@qubvel yes, if necessary I can open a PR when I'm free to update this

@geetu040
Contributor Author

@Boulaouaney I understand, so you are suggesting that we also extract the FOV value, along with the RGB values, from the raw input image, right?

@qubvel
Contributor

qubvel commented Feb 19, 2025

I'm not sure if we need to read it in the image processor, because I'm not sure how consistent it is among image formats. Currently, image/metadata reading is the user's responsibility. However, we should clearly describe this in the documentation and the model card. Also, it's better to avoid modifying output and to make FOV an optional argument for the postprocessing method.

@Boulaouaney

@geetu040 yes, I am suggesting extracting the 35mm focal length from the image's metadata, if it exists, and converting it to the focal length in pixels (f_px), to be used for calculating the metric depth instead of the value estimated by the fov_model.

The logic to be added would be similar to Apple's implementation:

# Extract focal length at 35mm from exif data
f_35mm = img_exif.get(
    "FocalLengthIn35mmFilm",
    img_exif.get(
        "FocalLenIn35mmFilm", img_exif.get("FocalLengthIn35mmFormat", None)
    ),
)

# if extracted, convert it to pixels
if f_35mm is not None:
    f_px = f_35mm * np.sqrt(width**2.0 + height**2.0) / np.sqrt(36**2 + 24**2)

But I guess @qubvel 's solution to leave the metadata handling to the user during image loading would also be good.
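For completeness, a rough user-side sketch of that approach (it assumes a recent Pillow version exposing ExifTags.IFD / ExifTags.Base, uses a placeholder file name, and mirrors the conversion formula quoted above rather than any API in transformers):

import math
from PIL import Image, ExifTags

image = Image.open("example.jpg")
width, height = image.size

# The 35mm-equivalent focal length lives in the Exif IFD of the metadata.
exif_ifd = image.getexif().get_ifd(ExifTags.IFD.Exif)
f_35mm = exif_ifd.get(ExifTags.Base.FocalLengthIn35mmFilm)

f_px = None
if f_35mm:
    # Convert the 35mm-equivalent focal length to pixels for this image size.
    f_px = f_35mm * math.hypot(width, height) / math.hypot(36, 24)

# If f_px is available, it can be used instead of the focal length derived from
# the model's predicted field of view.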

@qubvel
Contributor

qubvel commented Feb 19, 2025

OK, it would be nice to mention it in the docs (depth_pro.md). Feel free to open a PR to clarify it; then we can update the examples and model cards on our side.

