docs: update detection core with tips for using Gemini integration #1925

Open
wants to merge 2 commits into develop

50 changes: 50 additions & 0 deletions supervision/detection/core.py
@@ -946,6 +946,26 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detections:
```

!!! example "Gemini 2.0"

??? tip "Prompt engineering"

From Gemini 2.0 onwards, models are additionally trained to detect objects in
an image and return their bounding box coordinates. The coordinates are
normalized to the range [0, 1000] relative to the image dimensions, so you
need to rescale them to your original image size.
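
As an illustration, a minimal rescaling sketch (the image size and
`box_2d` values below are made up):

```python
# Rescale a box_2d from Gemini's [0, 1000] range to pixel coordinates.
# The image size and box values below are illustrative.
image_w, image_h = 1280, 720
ymin, xmin, ymax, xmax = 400, 100, 800, 500  # box_2d as returned by Gemini

x1 = xmin / 1000 * image_w  # 128.0
y1 = ymin / 1000 * image_h  # 288.0
x2 = xmax / 1000 * image_w  # 640.0
y2 = ymax / 1000 * image_h  # 576.0
```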

According to the Gemini API documentation on image prompts, when using
a single image with text, place the text prompt after the image part in
the contents array; this ordering generally produces better results.
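
A minimal sketch of that ordering with the `google-genai` SDK (the model
id and image path are assumptions):

```python
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
image = Image.open("cats-and-dogs.jpg")  # hypothetical local image

# Image part first, text prompt second, per the Gemini docs.
response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed model id
    contents=[
        image,
        "Detect all the cats and dogs in the image. The box_2d should be "
        "[ymin, xmin, ymax, xmax] normalized to 0-1000.",
    ],
)
print(response.text)
```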

To get the best results from Google Gemini 2.0, use the following prompt:

```
Detect all the cats and dogs in the image. The box_2d should be
[ymin, xmin, ymax, xmax] normalized to 0-1000.
```

```python
import supervision as sv

@@ -983,6 +1003,11 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detections:
including small, distant, or partially visible ones, and to return
tight bounding boxes.

According to the Gemini API documentation on image prompts, when using
a single image with text, place the text prompt after the image part in
the contents array; this ordering generally produces better results
(see the sketch in the tip above).

```
Carefully examine this image and detect ALL visible objects, including
small, distant, or partially visible ones.
@@ -1323,6 +1348,26 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detections:
```

!!! example "Gemini 2.0"

??? tip "Prompt engineering"

From Gemini 2.0 onwards, models are additionally trained to detect objects in
an image and return their bounding box coordinates. The coordinates are
normalized to the range [0, 1000] relative to the image dimensions, so you
need to rescale them to your original image size.
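
For many boxes at once, the rescaling can be vectorized; a sketch with
NumPy (array values are illustrative):

```python
import numpy as np

w, h = 1280, 720  # original image size, illustrative

# box_2d rows as returned by Gemini: [ymin, xmin, ymax, xmax] in [0, 1000]
boxes = np.array([[400, 100, 800, 500]])

# Reorder to [xmin, ymin, xmax, ymax] and scale to pixels.
xyxy = boxes[:, [1, 0, 3, 2]] / 1000 * np.array([w, h, w, h])
# -> [[128. 288. 640. 576.]]
```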

According to the Gemini API documentation on image prompts, when using
a single image with text, place the text prompt after the image part in
the contents array; this ordering generally produces better results.
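
In terms of the contents array, that recommendation looks like this
(a sketch; the client setup mirrors the `google-genai` SDK and the model
id is an assumption):

```python
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
image = Image.open("cats-and-dogs.jpg")  # hypothetical local image
prompt = (
    "Detect all the cats and dogs in the image. The box_2d should be "
    "[ymin, xmin, ymax, xmax] normalized to 0-1000."
)

# Recommended for single-image requests: image first, prompt second.
response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed model id
    contents=[image, prompt],  # not [prompt, image]
)
```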

To get the best results from Google Gemini 2.0, use the following prompt:

```
Detect all the cats and dogs in the image. The box_2d should be
[ymin, xmin, ymax, xmax] normalized to 0-1000.
```

```python
import supervision as sv

@@ -1360,6 +1405,11 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detections:
including small, distant, or partially visible ones, and to return
tight bounding boxes.

According to the Gemini API documentation on image prompts, when using
a single image with text, place the text prompt after the image part in
the contents array; this ordering generally produces better results
(see the sketch in the tip above).

```
Carefully examine this image and detect ALL visible objects, including
small, distant, or partially visible ones.