@grahamalama commented May 31, 2025
Contributor

Checklist:

Before submitting your PR, please confirm that you have done the following:

  • I have opened my PR against the staging branch, NOT against main
  • I've run the relevant formatting and linting tools listed in the setup docs
  • I have commented hard-to-understand areas in my code
  • I've reviewed any merge conflicts to make sure they are resolved
  • My changes generate no new warnings

Description

Since #1212 was merged, we have a new version of the pipeline that we plan to use to generate the data that powers the website. With that in mind, this PR "promotes" what we were calling the new_etl to the parent directory, replacing the old implementation.

To get the unit tests to run, I also had to make a few small tweaks in 31fa06d and 8459c41.

In a follow-up PR, I plan to continue removing dead code and outdated documentation (mostly anything that relates to Postgres).

Related Issue(s)

This PR addresses issue #...

How Has This Been Tested?

Our (minimal) unit tests pass, linting and formatting pass, and the new ETL at least runs. It doesn't run to completion, however, because I hit an error at this step:

Details

Downloading and processing park priority data...
Downloading: 100%|████████████████████████████████████████████████████| 2.52G/2.52G [04:07<00:00, 10.2MiB/s]
Extracting files from the downloaded zip...
Extracting:   0%|                                                                     | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/src/app/src/data_utils/park_priority.py", line 156, in park_priority
    phl_parks = file_manager.load_gdf(
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/classes/file_manager.py", line 177, in load_gdf
    raise FileNotFoundError(
FileNotFoundError: File phl_parks not found in corresponding directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/src/app/src/main.py", line 99, in <module>
    dataset = service(dataset)
              ^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/metadata/metadata_utils.py", line 233, in wrapper
    primary_featurelayer = func(primary_featurelayer)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 161, in park_priority
    phl_parks = download_and_process_shapefile(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 75, in download_and_process_shapefile
    file_manager.extract_files(buffer, target_files)
  File "/usr/src/app/src/classes/file_manager.py", line 244, in extract_files
    zip_ref.extract(filename, destination)
  File "/usr/local/lib/python3.11/zipfile.py", line 1664, in extract
    return self._extract_member(member, path, pwd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1703, in _extract_member
    member = self.getinfo(member)
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1476, in getinfo
    raise KeyError(
KeyError: "There is no item named 'Parkserve_ParkPriorityAreas.shp' in the archive"

This indicates a part of the pipeline that we need to fix, though, not a problem with the file move itself.
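For reference, the `KeyError` comes from `ZipFile.extract` being asked for a member name that is not in the archive. Below is a minimal sketch of one way to surface that earlier: look members up by basename and report what the archive actually contains. The helper name `find_members` and the nested path in the demo are hypothetical. This is not the code in `file_manager.py`, and I haven't checked the real layout of the ParkServe zip.

```python
import io
import zipfile
from pathlib import PurePosixPath


def find_members(archive: zipfile.ZipFile, wanted: list[str]) -> dict[str, str]:
    """Map each wanted file name to the archive member with the same basename.

    Raises FileNotFoundError naming the missing files and listing what the
    archive contains, instead of the bare KeyError from ZipFile.extract.
    """
    by_basename = {
        PurePosixPath(name).name.lower(): name
        for name in archive.namelist()
        if not name.endswith("/")  # skip directory entries
    }
    found = {w: by_basename[w.lower()] for w in wanted if w.lower() in by_basename}
    missing = [w for w in wanted if w not in found]
    if missing:
        raise FileNotFoundError(
            f"Missing from archive: {missing}; "
            f"available: {sorted(by_basename.values())}"
        )
    return found


if __name__ == "__main__":
    # Hypothetical archive layout, for illustration only.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("some_folder/Parkserve_ParkPriorityAreas.shp", b"")
    with zipfile.ZipFile(buffer) as zf:
        print(find_members(zf, ["Parkserve_ParkPriorityAreas.shp"]))
```

If the file was renamed upstream rather than moved into a subfolder, a lookup like this would still fail, but the error would list the available names, which makes the mismatch easier to diagnose.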

We no longer run diff backup jobs since we no longer use Postgres, and our Slack error reporting has changed significantly, so we're going to delete our current test module and have an issue filed to write new tests.

@grahamalama
Contributor Author

Note that the diffs for this PR are pretty wacky. I'd suggest reviewing by commit to get a sense of what changed and how. What I basically did was:

  • delete most of the code outside new_etl (except config) (a0c4dc2)
  • move the contents of new_etl to the parent directory (ae4f7da)
  • make small tweaks to get tests to pass (31fa06d and 8459c41)
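
As a rough illustration of the "move the contents up one level" step, here is a small sketch using `pathlib` and `shutil`. The function name `promote_contents` is hypothetical and this is not how the commits above were produced. In a git repository you would more likely use `git mv` so the rename is staged in the same step.

```python
import shutil
import tempfile
from pathlib import Path


def promote_contents(child: Path) -> list[Path]:
    """Move everything inside `child` up into its parent, then remove `child`.

    Refuses to overwrite: raises FileExistsError if any name already exists
    in the parent, which is why the old code is deleted first.
    """
    parent = child.parent
    clashes = [p.name for p in child.iterdir() if (parent / p.name).exists()]
    if clashes:
        raise FileExistsError(f"Would overwrite in {parent}: {clashes}")
    moved = []
    for item in list(child.iterdir()):
        target = parent / item.name
        shutil.move(str(item), str(target))
        moved.append(target)
    child.rmdir()  # only succeeds because the directory is now empty
    return moved


if __name__ == "__main__":
    # Throwaway directory tree standing in for the real repository.
    root = Path(tempfile.mkdtemp())
    (root / "new_etl" / "data_utils").mkdir(parents=True)
    (root / "new_etl" / "main.py").write_text("")
    print(sorted(p.name for p in promote_contents(root / "new_etl")))
```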

@cfreedman
Contributor

This looks good to me, thanks! Did the change you added to parse for the correct Shapefile in park_priority happen to address the earlier issue you pointed out in Slack about that failing, or was it purely for the tests' sake and that's still outstanding?

@cfreedman cfreedman self-requested a review June 5, 2025 13:15
@grahamalama grahamalama merged commit 35fda93 into staging Jun 7, 2025
13 checks passed
@grahamalama grahamalama deleted the make-new-etl-the-etl branch June 7, 2025 16:18
@grahamalama
Contributor Author

Did the change you added to parse for the correct Shapefile in park_priority happen to address the earlier issue you pointed out in Slack about that failing, or was it purely for the tests' sake and that's still outstanding?

It addressed the issue with the unit tests (which is also what I posted to Slack), but it didn't fully solve the issue. The full run still fails:

park_priority issue

Running service: park_priority
Downloading park priority data from: https://parkserve.tpl.org/downloads/Parkserve_Shapefiles_05212025.zip
Error loading GeoJSON: File phl_parks not found in corresponding directory.. Re-downloading and processing shapefile.
Downloading and processing park priority data...
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 2.52G/2.52G [06:52<00:00, 6.11MiB/s]
Extracting files from the downloaded zip...
Extracting:   0%|                                                                                       | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/src/app/src/data_utils/park_priority.py", line 156, in park_priority
    phl_parks = file_manager.load_gdf(
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/classes/file_manager.py", line 177, in load_gdf
    raise FileNotFoundError(
FileNotFoundError: File phl_parks not found in corresponding directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/usr/src/app/src/main.py", line 99, in <module>
    dataset = service(dataset)
              ^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/metadata/metadata_utils.py", line 233, in wrapper
    primary_featurelayer = func(primary_featurelayer)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 161, in park_priority
    phl_parks = download_and_process_shapefile(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 75, in download_and_process_shapefile
    file_manager.extract_files(buffer, target_files)
  File "/usr/src/app/src/classes/file_manager.py", line 244, in extract_files
    zip_ref.extract(filename, destination)
  File "/usr/local/lib/python3.11/zipfile.py", line 1664, in extract
    return self._extract_member(member, path, pwd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1703, in _extract_member
    member = self.getinfo(member)
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1476, in getinfo
    raise KeyError(
KeyError: "There is no item named 'Parkserve_ParkPriorityAreas.shp' in the archive"
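
A side note on reading the output above: the two tracebacks are chained because the re-download runs inside the handler for the `FileNotFoundError`. Below is a minimal sketch of a load-or-rebuild fallback that runs the rebuild after the handler has finished, so a failure in the rebuild step is reported on its own. The names `load_or_rebuild`, `load`, and `rebuild` are hypothetical, and this is not what `park_priority.py` currently does.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_rebuild(load: Callable[[], T], rebuild: Callable[[], T]) -> T:
    """Return load(); if the cached file is missing, fall back to rebuild().

    The fallback runs after the except block has finished, so an error raised
    by rebuild() is not chained under the FileNotFoundError.
    """
    try:
        return load()
    except FileNotFoundError as exc:
        print(f"Cached file not found ({exc}); rebuilding.")
    return rebuild()
```

Whether the chained form or the separated form is easier to debug is a judgment call. The chained traceback does show why the download was attempted, at the cost of a longer log.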
