Move contents of `new_etl` to parent directory, replacing the old ETL implementation #1219

grahamalama · 2025-05-31T03:43:55Z

Checklist:

Before submitting your PR, please confirm that you have done the following:

I have opened my PR against the staging branch, NOT against main
I've run the relevant formatting and linting tools listed in the setup docs
I have commented hard-to-understand areas in my code
I've reviewed any merge conflicts to make sure they are resolved
My changes generate no new warnings

Description

Since #1212 was merged, we have a new version of the pipeline that we plan to use to generate the data that powers the website. With that in mind, this PR "promotes" what we were calling `new_etl` to the parent directory, replacing the old implementation.

To get the unit tests to run, I also had to make a few small tweaks in 31fa06d and 8459c41.

In a follow-up PR, I plan to continue removing dead code and outdated documentation (mostly anything that relates to Postgres).

Related Issue(s)

This PR addresses issue #...

How Has This Been Tested?

Our (minimal) unit tests pass, linting and formatting pass, and the new ETL at least runs. It doesn't run to completion, however, because I encountered an issue at this step:

Details

Downloading and processing park priority data...
Downloading: 100%|████████████████████████████████████████████████████| 2.52G/2.52G [04:07<00:00, 10.2MiB/s]
Extracting files from the downloaded zip...
Extracting:   0%|                                                                     | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/src/app/src/data_utils/park_priority.py", line 156, in park_priority
    phl_parks = file_manager.load_gdf(
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/classes/file_manager.py", line 177, in load_gdf
    raise FileNotFoundError(
FileNotFoundError: File phl_parks not found in corresponding directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/src/app/src/main.py", line 99, in <module>
    dataset = service(dataset)
              ^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/metadata/metadata_utils.py", line 233, in wrapper
    primary_featurelayer = func(primary_featurelayer)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 161, in park_priority
    phl_parks = download_and_process_shapefile(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 75, in download_and_process_shapefile
    file_manager.extract_files(buffer, target_files)
  File "/usr/src/app/src/classes/file_manager.py", line 244, in extract_files
    zip_ref.extract(filename, destination)
  File "/usr/local/lib/python3.11/zipfile.py", line 1664, in extract
    return self._extract_member(member, path, pwd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1703, in _extract_member
    member = self.getinfo(member)
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1476, in getinfo
    raise KeyError(
KeyError: "There is no item named 'Parkserve_ParkPriorityAreas.shp' in the archive"

but this indicates a part of the pipeline we need to fix, not an error with the file moving itself.
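For anyone picking up the pipeline fix: the `KeyError` comes from `zipfile` looking up the member by its exact name. The cause is not established in this PR. One possibility, and it is only an assumption, is that the files now sit inside a subfolder of the archive. A minimal sketch of matching members by base name instead of full path is below; `find_members` and the archive layout are hypothetical and are not taken from `file_manager.py`.

```python
import io
import zipfile
from pathlib import PurePosixPath


def find_members(zip_ref: zipfile.ZipFile, wanted: list[str]) -> dict[str, str]:
    """Map each wanted file name to the archive member with that base name.

    Hypothetical helper: not part of the project's file_manager module.
    """
    by_basename = {
        PurePosixPath(name).name: name
        for name in zip_ref.namelist()
        if not name.endswith("/")  # skip directory entries
    }
    missing = [name for name in wanted if name not in by_basename]
    if missing:
        # Listing what is available makes a renamed or moved file easy to spot.
        raise FileNotFoundError(
            f"Not in archive: {missing}; available: {sorted(by_basename)}"
        )
    return {name: by_basename[name] for name in wanted}


# Build a tiny in-memory archive whose file sits in a subfolder (assumed layout).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Parkserve_Shapefiles/Parkserve_ParkPriorityAreas.shp", b"")

with zipfile.ZipFile(buffer) as zf:
    members = find_members(zf, ["Parkserve_ParkPriorityAreas.shp"])
    print(members)
```

If the real archive turns out to be flat and the file was simply renamed, the error message from this helper would at least list the names that are present.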

We no longer run diff backup jobs since we no longer use Postgres, and our Slack error reporting has changed significantly, so we're going to delete our current test module and file an issue to write new tests.

vercel · 2025-05-31T03:44:00Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| vacant-lots-proj | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | May 31, 2025 3:44am |

grahamalama · 2025-05-31T03:46:47Z

Note that the diffs for this PR are pretty wacky. I'd suggest reviewing by commit to get a sense of what changed and how. What I basically did was:

1. delete most of the code outside `new_etl` (except config): a0c4dc2
2. move the contents of `new_etl` to the parent directory: ae4f7da
3. make small tweaks to get tests to pass: 31fa06d and 8459c41
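The first two steps can be sketched as a plain filesystem operation. This is an illustration only: the PR itself was done as git commits, and the `promote` helper, its parameters, and the file names in the demo are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def promote(parent: Path, child_name: str, keep: set[str]) -> None:
    """Replace parent's contents with those of parent/child_name, keeping `keep`.

    Hypothetical helper for illustration; not code from this repository.
    """
    child = parent / child_name
    # Step 1: delete everything outside the child directory except kept entries.
    for entry in list(parent.iterdir()):
        if entry.name == child_name or entry.name in keep:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    # Step 2: move the child's contents up one level, then drop the empty child.
    for entry in list(child.iterdir()):
        target = parent / entry.name
        if target.exists():
            # Refuse to merge silently into a kept entry with the same name.
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(entry), str(target))
    child.rmdir()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "new_etl").mkdir()
    (root / "new_etl" / "main.py").write_text("# new pipeline\n")
    (root / "config").mkdir()
    (root / "old_script.py").write_text("# old pipeline\n")
    promote(root, "new_etl", keep={"config"})
    remaining = sorted(p.name for p in root.iterdir())
    print(remaining)
```

Splitting the delete and the move into separate commits, as above, is what makes the per-commit review readable even though the combined diff is noisy.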

cfreedman · 2025-06-05T13:15:35Z

This looks good to me, thanks! Did the change you added to parse for the correct Shapefile in `park_priority` happen to address the earlier issue you pointed out in Slack about that failing, or was it purely for the tests' sake and that's still outstanding?

grahamalama · 2025-06-07T21:09:13Z

Did the change you added to parse for the correct Shapefile in `park_priority` happen to address the earlier issue you pointed out in Slack about that failing, or was it purely for the tests' sake and that's still outstanding?

It addressed the issue with the unit tests (which is also what I posted to Slack), but it didn't totally solve the issue. The `park_priority` service still fails when run:

park_priority issue

Running service: park_priority
Downloading park priority data from: https://parkserve.tpl.org/downloads/Parkserve_Shapefiles_05212025.zip
Error loading GeoJSON: File phl_parks not found in corresponding directory.. Re-downloading and processing shapefile.
Downloading and processing park priority data...
Downloading: 100%|██████████████████████████████████████████████████████████████████████| 2.52G/2.52G [06:52<00:00, 6.11MiB/s]
Extracting files from the downloaded zip...
Extracting:   0%|                                                                                       | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/src/app/src/data_utils/park_priority.py", line 156, in park_priority
    phl_parks = file_manager.load_gdf(
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/classes/file_manager.py", line 177, in load_gdf
    raise FileNotFoundError(
FileNotFoundError: File phl_parks not found in corresponding directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/usr/src/app/src/main.py", line 99, in <module>
    dataset = service(dataset)
              ^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/metadata/metadata_utils.py", line 233, in wrapper
    primary_featurelayer = func(primary_featurelayer)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 161, in park_priority
    phl_parks = download_and_process_shapefile(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/src/data_utils/park_priority.py", line 75, in download_and_process_shapefile
    file_manager.extract_files(buffer, target_files)
  File "/usr/src/app/src/classes/file_manager.py", line 244, in extract_files
    zip_ref.extract(filename, destination)
  File "/usr/local/lib/python3.11/zipfile.py", line 1664, in extract
    return self._extract_member(member, path, pwd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1703, in _extract_member
    member = self.getinfo(member)
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1476, in getinfo
    raise KeyError(
KeyError: "There is no item named 'Parkserve_ParkPriorityAreas.shp' in the archive"
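The download URL in the log above appears to carry a date stamp (`05212025`), which is presumably why the link is discovered by parsing a page rather than hard-coded. A rough sketch of that discovery step follows. It uses the standard library's `html.parser` in place of Beautiful Soup to stay dependency-free, and the sample HTML, the eight-digit pattern, and the `find_shapefile_url` name are assumptions for illustration, not the code in `park_priority.py`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_shapefile_url(html: str, base_url: str) -> str:
    """Return the first link that looks like a dated Parkserve shapefile zip."""
    collector = LinkCollector()
    collector.feed(html)
    # Assumed naming scheme, inferred from the single URL in the log above.
    pattern = re.compile(r"Parkserve_Shapefiles_\d{8}\.zip$")
    for href in collector.hrefs:
        if pattern.search(href):
            return urljoin(base_url, href)
    raise ValueError("No Parkserve shapefile link found on the page")


# Hypothetical page fragment; the real page structure is not shown in this PR.
sample_html = (
    '<a href="/about">About</a>'
    '<a href="/downloads/Parkserve_Shapefiles_05212025.zip">Shapefiles</a>'
)
url = find_shapefile_url(sample_html, "https://parkserve.tpl.org/")
print(url)
```

Finding the right zip is only half of the problem: as the traceback shows, the extraction step still fails on the member name inside the archive.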

grahamalama added 4 commits May 29, 2025 20:38

Remove most code outside of new_etl

a0c4dc2

Move all code from new_etl to parent directory

ae4f7da

Fix beautiful soup path to find tpl shapefile

31fa06d

Remove outdated tests

8459c41

We no longer run diff backup jobs since we no longer use Postgres, and our Slack error reporting has changed significantly, so we're going to delete our current test module and file an issue to write new tests.

github-actions bot added backend frontend labels May 31, 2025

cfreedman self-requested a review June 5, 2025 13:15

cfreedman approved these changes Jun 5, 2025


grahamalama merged commit 35fda93 into staging Jun 7, 2025
13 checks passed

grahamalama deleted the make-new-etl-the-etl branch June 7, 2025 16:18

grahamalama mentioned this pull request Jun 8, 2025

Remove remaining traces of Postgres #1228

Merged

9 tasks
