Skip to content

Conversation

@lafiona
Copy link
Contributor

@lafiona lafiona commented Sep 2, 2022

Overview

The goal of this pull request is to update archery to work with a repository default branch named master or main, as part of the effort to rename the Apache Arrow repository's default branch to main. The parent Jira ticket can be found here.

Implementation

  • Update the language of the top level archery, crossbow, and docker command line interface code to reference the mainline development branch (default git branch) generically.
  • Update comments that reference the master branch.
  • Update the crossbow benchmarking examples to generically specify the <default-branch> rather than a hard-coded value.
  • In .github/workflows/integration.yml, add an environment variable DEFAULT_BRANCH to the archery command in the "Execute Docker Build" step, so that archery can reliably access the default branch value.
  • In .github/workflows/archery.yml, add an environment variable DEFAULT_BRANCH for all steps. This environment variable was already used by the Git Fixup step. It will also be used by the Archery Unittests step.
  • Add a property, default_branch_name, to the Repo class in dev/archery/archery/crossbow/core.py for computing the default branch name.
    • If specified, the DEFAULT_BRANCH environment variable, takes precedent in determining the default branch name (this is for overriding the git-based heuristic and qualifying in CI).
    • Otherwise, pygit2 is used to get the default branch name via the Apache Arrow repository's origin remote HEAD reference. This is a heuristic, but in most cases, the HEAD reference of the remote points to the default branch.
  • Add a cached property, default_branch to the Release class in dev/archery/archery/release/core.py for computing the default branch name. Similar to the default_branch_name property for Repo in archery/archery/crossbow/core.py:
    • If specified, the DEFAULT_BRANCH environment variable, takes precedent in determining the default branch name (this is for qualifying in CI).
    • Otherwise, similar to the previous step,GitPython is used to get the default branch name via the Apache Arrow repository's origin remote HEAD reference.
    • Modify the PANDAS and DASK Docker Build Parameter value for indicating the upstream development branch to upstream_devel.
    • Updated Development Running Docker Builds documentation to reflect the above change and fixed a broken link.

Out of scope:

  • There are remaining instances of master in the test fixtures files in dev/archery/archery/test/fixtures. It appears that the data only refers to external repositories, such as ursa-labs/ursabot, which currently uses master, so these instances were not modified.

Testing

  • Ran the archery and crossbow commands in local clones of both the mathworks/arrow and apache/arrow repositories.
  • Confirmed that the GitHub CI jobs pass.
  • We are unsure how to locally qualify the changes to the release component, but the release tests pass in CI.
  • Ran a sample archery docker command after setting the PANDAS environment variable to confirm that the correct version of Pandas is used.

Future Directions

  1. Added Jira task to update the pull request merge script to work with both master and main (ARROW-17777)
  2. Added Jira task to update the default value for the default branch name, if it cannot be determined via an environment variable, ARCHERY_DEFAULT_BRANCH, or from the repository's remote head reference. (ARROW-18011)

Notes

Thank you @kevingurney for your help with this pull request!

@github-actions
Copy link

github-actions bot commented Sep 2, 2022

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@lafiona lafiona marked this pull request as ready for review September 19, 2022 21:08
@lafiona lafiona marked this pull request as draft September 19, 2022 21:09
@lafiona lafiona changed the title [WIP] ARROW-15691: [Dev] Update archery to work with either master or main as default branch ARROW-15691: [Dev] Update archery to work with either master or main as default branch Sep 19, 2022
@github-actions
Copy link

@lafiona lafiona changed the title ARROW-15691: [Dev] Update archery to work with either master or main as default branch [WIP] ARROW-15691: [Dev] Update archery to work with either master or main as default branch Sep 20, 2022
@lafiona lafiona changed the title [WIP] ARROW-15691: [Dev] Update archery to work with either master or main as default branch ARROW-15691: [Dev] Update archery to work with either master or main as default branch Sep 20, 2022
@lafiona lafiona marked this pull request as ready for review October 10, 2022 21:34
@lafiona
Copy link
Contributor Author

lafiona commented Oct 10, 2022

This change is ready for review and ready to be considered for running the CI workflows that are awaiting approval. Thank you for your help on this!

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lafiona . Here are a number of comments and suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid doing this at module import?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you for this suggestion, I've moved the code to set the default within the function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... can we instead default to "master" for the time being?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion, instead of erroring, I will set the default to "master" while printing a warning containing the same details as the error message above. It could be helpful for the user to know that the two heuristics that we're using are not working in their environment and it is falling back to a hard-coded default. We've also added a Jira ticket (ARROW-18011) to modify the default value from "master" to "main" after the migration.

@pitrou
Copy link
Member

pitrou commented Oct 12, 2022

Also @raulcd FYI

@lafiona
Copy link
Contributor Author

lafiona commented Oct 12, 2022

Thanks @pitrou for your code review, I've addressed your feedback in the latest commits.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @lafiona for the PR!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my understanding this is eventually used on the install_dask.sh and install_pandas.sh scripts:
https://github.com/apache/arrow/blob/master/ci/scripts/install_dask.sh#L29
and
https://github.com/apache/arrow/blob/master/ci/scripts/install_pandas.sh#L38
I think we will also have to update these if's otherwise. I've tested to run the archery command for pandas from current master:

 $ PANDAS=master archery docker run --no-leaf-cache conda-python-pandas

...
 => => # Collecting git+https://github.com/pandas-dev/pandas.git                                                                                              
 => => #   Cloning https://github.com/pandas-dev/pandas.git to /tmp/pip-req-build-op7u80ho                                                                    
 => => #   Running command git clone --filter=blob:none --quiet https://github.com/pandas-dev/pandas.git /tmp/pip-req-build-op7u80ho                          
 => => #   Resolved https://github.com/pandas-dev/pandas.git to commit 67d75f3715ed8bfb19edc6d99d16f39daba6e461                                               
 => => #   Preparing metadata (pyproject.toml): started  

and from your branch:

 $ PANDAS=main archery docker run --no-leaf-cache conda-python-pandas
...
#0 33.87 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#0 34.39 ERROR: Could not find a version that satisfies the requirement pandas==main (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5, 1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.4.0rc0, 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.5.0rc0, 1.5.0)
#0 34.39 ERROR: No matching distribution found for pandas==main

with these changes we would stop testing the latest git version and try to find main from: pip install pandas==${pandas} instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing this and bringing it up!

I attempted to fix this by modifying the install scripts to use Git to compute the current default branch, however, at the time of running the scripts, the working directory is not within a git repository.

Since the flag value that is used is not dependent on the actual default branch name of the Arrow or Pandas repositories, it could be helpful to change the flag value to default, or one of the following:

  • devel
  • upstream_devel

If this sounds good to you, I can make this change and also update the documentation page at: https://arrow.apache.org/docs/developers/continuous_integration/docker.html#usage to reflect the new flag value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your approach sounds good to me, as this name refers to the upstream development version of the related project (pandas, dask) it makes more sense to use a flag like upstream_devel instead of master or main to differentiate this vs the latest release, the nightly build or a specific version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @raulcd! I've made this change and qualified by running the sample command from your original comment and running the affected tests in dev/archery/archery/docker/tests/test_docker.py.

I've also made the changes to the documentation and qualified by performing a sphinx build.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the late reply. We also have to update the tasks.yml definition to make crossbow run the correct task versions, updating master for upstream_devel on the following:
https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1495
and
https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1515

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries and thank you! I've made this update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also update this to upstream_devel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have updated this sample command too.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this changes on my fork and have triggered a bunch of nightly jobs and all seems to be working as expected (raulcd#13 (comment)) the failures are current failures on our nightlies. I am happy with it. I have investigated around archery and I can't seem to find other places where we are using hardcoded "master", so I am +1 on it.
@kou @kszucs this is a subtask for the bigger task: https://issues.apache.org/jira/browse/ARROW-15689 it seems to advance towards migrating from master to main, is there anything else you would expect on this?

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

lafiona and others added 20 commits October 25, 2022 16:32
'master' for now in dev/archery/archery/crossbow/core.py
…termined, default to 'master' in dev/archery/archery/release/core.py
… and Dask to 'upstream_devel', and update the documentation.
@pitrou
Copy link
Member

pitrou commented Oct 26, 2022

@github-actions crossbow submit example-cpp-minimal-build-static

@github-actions
Copy link

Revision: 374f540

Submitted crossbow builds: ursacomputing/crossbow @ actions-05d3f08a75

Task Status
example-cpp-minimal-build-static Github Actions

@pitrou pitrou merged commit 8861c0c into apache:master Oct 26, 2022
@pitrou
Copy link
Member

pitrou commented Oct 26, 2022

Thank you @lafiona !

@ursabot
Copy link

ursabot commented Oct 28, 2022

Benchmark runs are scheduled for baseline = 21564cf and contender = 8861c0c. 8861c0c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.36% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8861c0c8 ec2-t3-xlarge-us-east-2
[Failed] 8861c0c8 test-mac-arm
[Finished] 8861c0c8 ursa-i9-9960x
[Finished] 8861c0c8 ursa-thinkcentre-m75q
[Finished] 21564cf3 ec2-t3-xlarge-us-east-2
[Failed] 21564cf3 test-mac-arm
[Finished] 21564cf3 ursa-i9-9960x
[Finished] 21564cf3 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants