ci/cpu: split | common & distributed #3188

Open
wants to merge 33 commits into
base: master

Borda
Member

@Borda Borda commented Jul 25, 2025

What does this PR do?

This should speed up each CI iteration while preserving all test configurations.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


📚 Documentation preview 📚: https://torchmetrics--3188.org.readthedocs.build/en/3188/

codecov bot commented Jul 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69%. Comparing base (0f78932) to head (4f5aedc).

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #3188   +/-   ##
======================================
  Coverage      69%     69%           
======================================
  Files         364     364           
  Lines       20096   20095    -1     
======================================
+ Hits        13790   13816   +26     
+ Misses       6306    6279   -27     

@Borda
Member Author

Borda commented Jul 25, 2025

@SkafteNicki, any idea why most of the DDP tests suddenly hang when we run them without the common tests beforehand?

@SkafteNicki
Collaborator

> @SkafteNicki, any idea why most of the DDP tests suddenly hang when we run them without the common tests beforehand?

@Borda this looks super weird to me. When I look into the failed tests, it is not even the same ones that fail under the same settings, so it does not seem to be a systematic failure.

@Borda
Member Author

Borda commented Aug 3, 2025

> any idea why most of the DDP tests suddenly hang when we run them without the common tests beforehand?

> this looks super weird to me. When I look into the failed tests, it is not even the same ones that fail under the same settings, so it does not seem to be a systematic failure.

Yeah, so do you agree that the change should be completely fine?

@SkafteNicki SkafteNicki self-requested a review as a code owner August 4, 2025 06:19
@SkafteNicki
Collaborator

> any idea why most of the DDP tests suddenly hang when we run them without the common tests beforehand?

> this looks super weird to me. When I look into the failed tests, it is not even the same ones that fail under the same settings, so it does not seem to be a systematic failure.

> Yeah, so do you agree that the change should be completely fine?

Yes, the change should be fine. Right now I am thinking that it may be due to a port collision, testing...
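
For readers unfamiliar with the port-collision hypothesis: when several distributed test processes rendezvous on one fixed TCP port, a second process group started on the same machine can clash with the first. The sketch below shows one common mitigation, asking the OS for an unused port and exporting it through the `MASTER_ADDR` / `MASTER_PORT` environment variables that `torch.distributed` reads for `env://` initialisation. The helper names `find_free_port` and `set_rendezvous_env` are hypothetical, and nothing in this thread shows how the torchmetrics test suite actually chooses its port, so treat this as an illustration of the idea rather than the fix applied in this PR.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


def set_rendezvous_env(host: str = "127.0.0.1") -> int:
    """Export a fresh port for env://-style process-group initialisation.

    Hypothetical helper: call it once in the parent process, before the
    worker processes are spawned, so all ranks inherit the same values.
    """
    port = find_free_port(host)
    os.environ["MASTER_ADDR"] = host
    os.environ["MASTER_PORT"] = str(port)
    return port


if __name__ == "__main__":
    print("rendezvous port:", set_rendezvous_env())
```

Note that the port is released as soon as `find_free_port` returns, so another process could in principle grab it before the process group binds to it. This narrows the collision window but does not remove it.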

@mergify mergify bot added the has conflicts label Aug 6, 2025
@mergify mergify bot removed the has conflicts label Aug 6, 2025
@Borda
Member Author

Borda commented Aug 6, 2025

so many cases are resolved now, but still seeing:

unittests/audio/test_nisqa.py::TestNISQA::test_nisqa[True-preds0-16000-reference0]
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~~~~~~ Stack of Thread-4 (_handle_results) (140416607311424) ~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/pool.py", line 579, in _handle_results
    task = get()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
~~~~~~~~~~~~~ Stack of Thread-3 (_handle_tasks) (140416624092736) ~~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/pool.py", line 531, in _handle_tasks
    for taskseq, set_length in iter(taskqueue.get, None):
~~~~~~~~~~~~ Stack of Thread-2 (_handle_workers) (140416640874048) ~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/pool.py", line 522, in _handle_workers
    cls._wait_for_updates(current_sentinels, change_notifier)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/pool.py", line 502, in _wait_for_updates
    wait(sentinels, timeout=timeout)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
~~~~~~~~~~~~~~~ Stack of Thread-1 (run_server) (140416729474624) ~~~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/pytest_rerunfailures.py", line 438, in run_server
    conn, _ = self.sock.accept()
  File "/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/socket.py", line 293, in accept
    fd, addr = self._accept()

@Borda Borda changed the title from "ci: split common & distributed" to "ci/cpu: split common & distributed" Aug 7, 2025
@mergify mergify bot removed the has conflicts label Aug 7, 2025
@Borda
Member Author

Borda commented Aug 11, 2025

Looking at the following, it seems to use the same port for all of them?

FAILED unittests/text/test_rouge.py::TestROUGEScore::test_rouge_score_class[True-avg-preds0-targets0-rouge2_fmeasure-True] - RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.1.0.203]:21380
FAILED unittests/text/test_rouge.py::TestROUGEScore::test_rouge_score_class[True-avg-preds0-targets0-rougeL_precision-False] - RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.1.0.203]:21380
FAILED unittests/text/test_rouge.py::TestROUGEScore::test_rouge_score_class[True-avg-preds0-targets0-rougeL_recall-False] - RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.1.0.203]:21380
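
One way to check the "same port" observation across a longer CI log is to tally the peer endpoints that appear in the Gloo error messages. The sketch below does this for the three failure lines quoted above. The names `PEER_RE`, `count_peer_endpoints` and `SAMPLE_LOG` are hypothetical, and the regular expression assumes the message format seen in these lines. It only counts what the log reports and does not by itself show which port the rendezvous used.

```python
import re
from collections import Counter

# Matches the "Connection closed by peer [host]:port" suffix of the Gloo error.
PEER_RE = re.compile(r"Connection closed by peer \[(?P<host>[^\]]+)\]:(?P<port>\d+)")

# The three failure lines quoted in the comment above.
SAMPLE_LOG = """\
FAILED unittests/text/test_rouge.py::TestROUGEScore::test_rouge_score_class[True-avg-preds0-targets0-rouge2_fmeasure-True] - RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.1.0.203]:21380
FAILED unittests/text/test_rouge.py::TestROUGEScore::test_rouge_score_class[True-avg-preds0-targets0-rougeL_precision-False] - RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.1.0.203]:21380
FAILED unittests/text/test_rouge.py::TestROUGEScore::test_rouge_score_class[True-avg-preds0-targets0-rougeL_recall-False] - RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.1.0.203]:21380
"""


def count_peer_endpoints(log_text: str) -> Counter:
    """Count how often each (host, port) peer endpoint appears in the log."""
    return Counter(
        (match["host"], int(match["port"])) for match in PEER_RE.finditer(log_text)
    )


if __name__ == "__main__":
    for (host, port), hits in count_peer_endpoints(SAMPLE_LOG).items():
        print(f"{host}:{port} appears in {hits} failure(s)")
```

For the quoted lines this yields a single endpoint, `10.1.0.203:21380`, seen three times, which matches the observation in the comment.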

@Borda Borda changed the title from "ci/cpu: split common & distributed" to "ci/cpu: split | common & distributed" Aug 13, 2025
2 participants