Skip to content

Conversation

@pjbull
Copy link
Member

@pjbull pjbull commented Sep 1, 2024

Initial pass at http.

  • implemented with standard library
  • tested against Python http.server
  • will try writing/deleting files with PUT/DELETE in case your server allows it
  • will try parsing "directories" from the returned HTML; seems like this grabs the right things from python's http.server and also looks like it should work for apache and nginx file servers from my googling. (allows user to override this method for their own server)
  • Bonus fix, use .anchor instead of .cloud_prefix where appropriate. For cloud providers, .anchor is the same as .cloud_prefix (e.g., s3://). For http, the anchor should be the server http://example.com and the .cloud_prefix should be http://. Constructing new paths should use .anchor

Limitations/caveats:

  • Assumes that url must have suffix (e.g., http://example.com/file.txt) to be a file; if no suffix, assumes dir. This is definitely not true of real-world URLs, but maybe is an ok assumption for anything serving files?
  • Assumes urls must end in / to be directories (user can pass custom test for urls if they have a different scenario)

Left to do:

  • Add https
  • Expose all the parsed url components publicly (not behind HttpPath._url).
  • Update default file + dir function to be based on a trailing slash
  • Allow also overriding the file vs. dir function
  • Add tests for http specific functions
  • Turn off noisy logs for test http server by default
  • Add tests that url search strings and fragments are persisted correctly
  • Documentation (table, caveats, docstrings, headers, auth (add cookies), files v dirs, etc.)
  • HTTP Test suite often passes locally, but not always; need to debug and make less flaky

Closes #455

@github-actions
Copy link
Contributor

github-actions bot commented Sep 1, 2024

@github-actions github-actions bot temporarily deployed to pull request September 1, 2024 11:22 Inactive
@github-actions github-actions bot temporarily deployed to pull request September 1, 2024 11:29 Inactive
@jayqi
Copy link
Member

jayqi commented Sep 4, 2024

Bonus fix, use .anchor instead of .cloud_prefix where appropriate. For cloud providers, .anchor is the same as .cloud_prefix (e.g., s3://). For http, the anchor should be the server http://example.com and the .cloud_prefix should be http://. Constructing new paths should use .anchor

We shouldn't call this "anchor". Anchor has a different meaning in a close enough context that I think this will be confusing. In particular, "anchor" colloquially refers to "anchor tags" in HTML documents, which you can reference within a URL with fragment identifiers #someref. The part preceding the :// is called "scheme" in URL/URI vocabulary and we should probably stick with that if we're going to call it anything. (Reference)

Assumes that url must have suffix (e.g., http://example.com/file.txt) to be a file; if no suffix, assumes dir. This is definitely not true of real-world URLs, but maybe is an ok assumption for anything serving files?

I'm not sure I like this. There are often genuine files that don't have file extensions that are usually plain text files, e.g., README, LICENSE, Makefile.

My impression is that, (while it's not absolute) the most common convention and default for many web servers and frameworks is to use a trailing slash explicitly to serve directories.

@pjbull
Copy link
Member Author

pjbull commented Sep 4, 2024

We shouldn't call this "anchor". Anchor has a different meaning in a close enough context that I think this will be confusing.

Hm, yeah, I can see this is potentially confusing, but we're not calling it "anchor" arbitrarily. We're looking for the right analogy to populate the existing .anchor pathlib property. In the pathlib context .anchor is a Path object that is drive + root. We're using it to refer to scheme + netloc (not just the scheme) which reasoning by analogy feels about right. The difference from the cloud providers is that the scheme is the root is the drive (e.g., listing s3:// lists buckets you have access to). IMO docs can cover this source of potential confusion without too much worry, especially since anything referring to an anchor in a url is most often called the "fragment".

use a trailing slash explicitly to serve directories.

Yeah, this was the other default rule I considered—I think you're right it's probably a more reliable default. That is how the python server works (except for the root, which does not redirect to the slash). Also planning for the method to be overridable so people can configure for their particular servers if it is different.

@github-actions github-actions bot temporarily deployed to pull request September 13, 2024 10:39 Inactive
@github-actions github-actions bot temporarily deployed to pull request September 16, 2024 18:58 Inactive
@github-actions github-actions bot temporarily deployed to pull request September 17, 2024 21:55 Inactive
@pjbull pjbull mentioned this pull request Sep 19, 2024
@MattOates
Copy link

MattOates commented Oct 9, 2024

So one thing to be aware of is your code assumes no one would ever do HttpPath("http://username:password@host/path/file.txt") for basic auth.

url.netloc looks like the string "username:password@host" in this instance. Im currently struggling to see how in the Cloudpathlib design I can actually inject the auth from the URL into the client too. Looks more like you would always have to explicitly provide a client which feels not incredible from the user side. Perhaps we could introduce something that for http/sftp/ftp etc user/pass auth has some standard parameters that get passed to the client from the URL if the netloc has an @ in?

@pjbull
Copy link
Member Author

pjbull commented Oct 14, 2024

So one thing to be aware of is your code assumes no one would ever do HttpPath("http://username:password@host/path/file.txt") for basic auth.

@MattOates I guess I was thinking that this just sticks around on that path and any paths derived from it (since we don't mess with netloc) and should "just work." Does it not? Maybe there are just some bugs to clean up (sorry, haven't had a chance to poke at it and confirm/deny).

Looks more like you would always have to explicitly provide a client which feels not incredible from the user side.

I think that we could (1) support env vars for basic auth like most of the cloud providers have for auth, (2) point people towards how to set the default client, or (3) recommend an explicit client.

@jayqi
Copy link
Member

jayqi commented Oct 18, 2024

I guess I was thinking that this just sticks around on that path and any paths derived from it

Does that mean we're going to print out the username and password in plaintext in string representations?

@pjbull
Copy link
Member Author

pjbull commented Feb 14, 2025

Does that mean we're going to print out the username and password in plaintext in string representations?

Yeah, FWIW urllib also is happy to just print that out too. We could be more conservative, but I'm not inclined to add a special case if the standard lib doesn't.

In [1]: from urllib.parse import urlparse

In [2]: urlparse("http://user:[email protected]")
Out[3]: ParseResult(scheme='http', netloc='user:[email protected]', path='', params='', query='', fragment='')

@github-actions github-actions bot temporarily deployed to pull request February 14, 2025 20:43 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 14, 2025 20:48 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 14, 2025 20:52 Inactive
@codecov
Copy link

codecov bot commented Feb 14, 2025

Codecov Report

❌ Patch coverage is 95.16908% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.8%. Comparing base (863e884) to head (4982159).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
cloudpathlib/http/httpclient.py 91.6% 9 Missing ⚠️
cloudpathlib/http/httppath.py 98.8% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #468     +/-   ##
========================================
+ Coverage    93.4%   93.8%   +0.3%     
========================================
  Files          23      26      +3     
  Lines        1800    1948    +148     
========================================
+ Hits         1682    1828    +146     
- Misses        118     120      +2     
Files with missing lines Coverage Δ
cloudpathlib/__init__.py 93.7% <100.0%> (+0.8%) ⬆️
cloudpathlib/cloudpath.py 94.4% <100.0%> (+0.2%) ⬆️
cloudpathlib/http/__init__.py 100.0% <100.0%> (ø)
cloudpathlib/http/httppath.py 98.8% <98.8%> (ø)
cloudpathlib/http/httpclient.py 91.6% <91.6%> (ø)

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot temporarily deployed to pull request February 14, 2025 22:33 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 14, 2025 22:35 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 14, 2025 22:40 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 16, 2025 00:27 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 16, 2025 01:09 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 16, 2025 01:11 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 17, 2025 19:49 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 17, 2025 21:58 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 17, 2025 22:45 Inactive
@github-actions github-actions bot temporarily deployed to pull request February 18, 2025 01:04 Inactive
@github-actions github-actions bot temporarily deployed to pull request August 2, 2025 19:55 Inactive
@pjbull
Copy link
Member Author

pjbull commented Aug 2, 2025

Code review comments addressed and everything passing. Merging in HTTP support! 💖

@pjbull pjbull merged commit b21cac1 into master Aug 2, 2025
25 checks passed
@pjbull pjbull deleted the 455-http branch August 2, 2025 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read from http - httppathlib?

4 participants