
docs: add MTEB evaluation guide and update usage.rst #3477


Merged

Conversation

sahibpreetsingh12
Contributor

This PR resolves #3332.

Summary

Adds a new documentation page for evaluating SentenceTransformer models using the Massive Text Embedding Benchmark (MTEB), along with relevant task examples and best practices.

Changes

  • mteb_evaluation.md in docs/sentence_transformer/usage/:

    • Installation steps
    • Minimal working example (sketched below)
    • Task-type breakdown (STS, Classification, Retrieval, etc.)
    • Notes on output handling
    • Warnings about not using MTEB during training
    • Leaderboard + export instructions
  • Linked from usage.rst to include in sidebar navigation
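
For context, the minimal working example on the page looks roughly like the following sketch. It assumes the current `mteb` API style (`import mteb` with `mteb.get_tasks` and `mteb.MTEB`), and `all-MiniLM-L6-v2` is just an arbitrary small model choice:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Any Sentence Transformers model works; this one is a small example choice
model = SentenceTransformer("all-MiniLM-L6-v2")

# Select one or more MTEB tasks by name
tasks = mteb.get_tasks(tasks=["STS22.v2"])

# Run the benchmark; results are also cached to the output folder on disk
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```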

Notes

Following the guidance in the discussion, MTEB is documented as a post-training evaluation tool, not integrated as an evaluator to avoid benchmark overfitting.

Let me know if you'd like any section adjusted. Thank you!

@sahibpreetsingh12
Contributor Author

@tomaarsen Please share your feedback and let me know if there is anything else I should change.

Contributor

@Samoed left a comment


@KennethEnevoldsen can you look too?

@sahibpreetsingh12
Contributor Author

sahibpreetsingh12 commented Jul 31, 2025

@Samoed and @tomaarsen
If anything else is required from my side, please do share.
Since I am new to this, what can I do in the future to make the unit tests run successfully? I just committed the changes from the UI, and if this merges I will pull the changes afterwards.

@sahibpreetsingh12
Contributor Author

@Samoed, what is required for this PR to be merged? I am happy to contribute.


* Using it during training risks **overfitting** to public benchmarks.
* It writes to disk and caches aggressively.
* Official guidance recommends using SentenceTransformer's built-in evaluators like:
Contributor

If you want to evaluate during training?

Contributor Author

@KennethEnevoldsen OK, so the thought here was to just show that we do evaluation at testing time. I did also think about evaluation during training, but the idea was to keep things simple; if this works, I would open another issue and get that done separately.

Contributor

Hmm, I am not quite sure what you mean, but from my reading, it sounds like you do not recommend MTEB for evaluation (even after training), which I don't think is the intention?

Contributor Author

Thanks for the clarification, @KennethEnevoldsen !

You're absolutely right — my intention was not to discourage the use of MTEB for evaluation after training. I’ve now updated the wording to clarify that MTEB is recommended for post-training evaluation, but not ideal during training loops due to the risk of overfitting and aggressive caching.

Let me know if the revised phrasing works better — happy to tweak further! 🙌
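
For the during-training case, the page points readers to Sentence Transformers' built-in evaluators instead. As a rough sketch (the dev data here is hypothetical; `EmbeddingSimilarityEvaluator` is one such built-in evaluator):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical dev data; in practice this comes from your own held-out split
sentences1 = ["A man is eating food.", "A plane is taking off."]
sentences2 = ["A man is eating a piece of bread.", "An airplane is departing."]
gold_scores = [0.8, 0.9]  # gold similarity scores, scaled to [0, 1]

dev_evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="dev")
print(dev_evaluator(model))  # the evaluator can also be handed to a training loop
```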

**Important**: MTEB is for *post-training* benchmarking only.

* Using it during training risks **overfitting** to public benchmarks.
* It writes to disk and caches aggressively.
Contributor

Is this not ideal?

Contributor Author

Yes, @KennethEnevoldsen, from my understanding and based on the discussion we had at the start, this looks ideal to me.

@sahibpreetsingh12
Contributor Author

@KennethEnevoldsen and @Samoed

All review suggestions have been incorporated:

  • Replaced STSBenchmark with STS22.v2
  • Clarified task examples with disclaimer and link to full list
  • Added filtering by task_type, domain, language
  • Used .to_dataframe() for result printing
  • Explicitly looped over main_score values

Let me know if anything else is required — happy to revise 🙌
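
For reference, the revised snippets follow roughly this shape (a sketch, assuming `mteb.get_tasks` supports the `task_types`/`domains`/`languages` filtering parameters and that `evaluation.run` returns task results whose `scores` mapping carries a `main_score` per split):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Filter the task list instead of hand-picking arbitrary examples
tasks = mteb.get_tasks(task_types=["STS"], languages=["eng"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")

# Explicitly loop over the main_score values per task and split
for task_result in results:
    for split, split_scores in task_result.scores.items():
        for entry in split_scores:
            print(task_result.task_name, split, entry["main_score"])
```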

@tomaarsen
Collaborator

tomaarsen commented Aug 4, 2025

Hello @sahibpreetsingh12,

Thanks for the PR. I've had a detailed look at it now, and I'm afraid I see a lot of issues/quirks. For example:

  • Very little of the code runs
  • The code snippets don't match the mteb suggested format of import mteb, mteb.get_tasks, but instead use e.g. from mteb import get_tasks
  • There's text in a code block in the Quick Start
  • The section titles are unusually long compared to related documentation in Sentence Transformers, they also use emojis unlike other documentation
  • There's links to Sentence Transformers documentation, even though this is Sentence Transformers documentation
  • The list of tasks/examples is a bit arbitrary
  • There's a second file under .ipynb_checkpoints that likely should not have been included

Overall, the documentation page reads very "AI generated", which I'd like to avoid, if possible. I've overhauled it now, with working code, etc. The overall message from your version still remains. I hope that's okay.

Thank you @Samoed and @KennethEnevoldsen for taking the time to review the PR here thus far.

Feel free to let me know what you think of the new version, I think we're pretty close to ready now.

  • Tom Aarsen

@sahibpreetsingh12
Contributor Author


These are my early days writing documentation, and doing open source at all. Yes, I took some help in checking how to put documentation together, but that help was limited to setting up a base template. I will take this feedback on board and definitely improve from here; these suggestions help me improve.

Thanks @tomaarsen @KennethEnevoldsen and @Samoed

And keep "Speeding up Inference" as the last item in Usage
@tomaarsen merged commit 6e7d64e into UKPLab:master Aug 6, 2025
7 of 9 checks passed