Skip to content

Conversation

zigouras
Copy link
Contributor

@zigouras zigouras commented May 15, 2024

Implementation of issue #520. Backup of database, archiving and diff reporting.

Copy link

vercel bot commented May 15, 2024

Someone is attempting to deploy a commit to the Clean and Green Philly Team on Vercel.

A member of the Team first needs to authorize it.

@nlebovits nlebovits requested a review from brandonfcohen1 May 15, 2024 10:33
@brandonfcohen1
Copy link
Contributor

Moving this conversation about primary keys here.

@nlebovits- Is it correct to say that OPA_ID is unique per parcel, but for multifamily buildings there can be one or many vacant units inside one parcel?

@zigouras- can you document how the tests work and how to run them?

@zigouras
Copy link
Contributor Author

@brandonfcohen1 the tests are commented, indicating what each one does. Some of them are POCs, just running code to see the output. Some have assertions. I would not run them as part of CI/CD but they are a good place for people to look to understand the underlying code.

@zigouras
Copy link
Contributor Author

zigouras commented May 15, 2024

What does the vacant_properties_end table hold, is it essentially a union of the data in all of the other tables? If so seems like it could be the most important table to include in the diff so it would be good to work out the pks on that one.

@brandonfcohen1
Copy link
Contributor

What does the vacant_properties_end table hold, is it essentially a union of the data in all of the other tables? If so seems like it could be the most important table to include in the diff so it would be good to work out the pks on that one.

vacant_properties_end is the end result of the script, it's really the only table we need to report a diff on

Copy link
Contributor

@brandonfcohen1 brandonfcohen1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zigouras I'm having a hard time following what is happening in your code. Best I can tell script.py has all of the actual data loading commented out and it's replaced elsewhere.

Can you add to the readme and/or PR what is happening here?

@zigouras
Copy link
Contributor Author

zigouras commented May 16, 2024

@brandonfcohen1 you are probably looking at an earlier commit in this PR. Look at the latest commit. The backend setup doc describes the new backup and diff process.

@zigouras
Copy link
Contributor Author

What does the vacant_properties_end table hold, is it essentially a union of the data in all of the other tables? If so seems like it could be the most important table to include in the diff so it would be good to work out the pks on that one.

vacant_properties_end is the end result of the script, it's really the only table we need to report a diff on

OK then we have to figure out how to enforce a primary key on it.

Copy link
Contributor

@brandonfcohen1 brandonfcohen1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got this running the new script for the first time:

RuntimeError: data-diff command did not exit with success. Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/app-lp47FrbD/bin/data-diff", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/__main__.py", line 342, in main
    _data_diff(dbt_project_dir=project_dir_override, dbt_profiles_dir=profiles_dir_override, state=state, **kw)
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/__main__.py", line 611, in _data_diff
    _print_result(stats, json_output, diff_iter)
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/__main__.py", line 423, in _print_result
    rich.print(diff_iter.get_stats_string())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/diff_tables.py", line 139, in get_stats_string
    diff_stats = self._get_stats(is_dbt)
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/diff_tables.py", line 100, in _get_stats
    list(self)  # Consume the iterator into result_list, if we haven't already
    ^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/diff_tables.py", line 95, in __iter__
    for i in self.diff:
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/diff_tables.py", line 266, in _diff_tables_wrapper
    raise error
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/diff_tables.py", line 239, in _diff_tables_wrapper
    yield from self._diff_tables_root(table1, table2, info_tree)
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/joindiff_tables.py", line 164, in _diff_tables_root
    yield from self._bisect_and_diff_tables(table1, table2, info_tree)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.11/site-packages/data_diff/diff_tables.py", line 299, in _bisect_and_diff_tables
    raise NotImplementedError(f"Cannot use a column of type {kt} as a key")
NotImplementedError: Cannot use a column of type Text(_notes=[], collation=None) as a key

@zigouras
Copy link
Contributor Author

zigouras commented May 18, 2024

got this running the new script for the first time:

This happens the first time you run it because there are null opa_ids in the vacant_properties table. To rerun, drop the backup_ schema, delete from vacant_properties where opa_id is null, and run again.
You should also run delete from property_tax_delinquencies where opa_number = 0 and delete from li_violations where opa_account_num is null for the same reason.

@brandonfcohen1
Copy link
Contributor

can you include instructions on how to run

got this running the new script for the first time:

This happens the first time you run it because there are null opa_ids in the vacant_properties table. To rerun, drop the backup_ schema, delete from vacant_properties where opa_id is null, and run again. You should also run delete from property_tax_delinquencies where opa_number = 0 for the same reason.

got it, can you include in documentation?

Copy link
Contributor

@brandonfcohen1 brandonfcohen1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you include more details on how to set up and test email/slack locally

@zigouras
Copy link
Contributor Author

zigouras commented May 18, 2024

can you include more details on how to set up and test email/slack locally

Per the backend doc:

The CAGP_SLACK_API_TOKEN environmental variable must be set with the API key for the Slack app that can write messages to the channel as configured in the config.py report_to_slack_channel variable.

The report will also be emailed to any emails configured in the config.py report_to_email variable.

The unit tests test_send_report_to_slack and test_email_report are there. If you need help running python tests with pytest in VSCode there should be a lot of docs on the web.

I am still waiting on @nlebovits to get the API key to write to our clean-and-green-philly-back-end Slack channel. In the meantime you could create your own private Slack channel and create an app API key to test writing to it per these instructions:
https://www.datacamp.com/tutorial/how-to-send-slack-messages-with-python
That is what I did.
You will need a working smtp server for email as configured in config.py. I put localhost and installed postfix on my machine to send mail.

@zigouras
Copy link
Contributor Author

can you include instructions on how to run

got this running the new script for the first time:

This happens the first time you run it because there are null opa_ids in the vacant_properties table. To rerun, drop the backup_ schema, delete from vacant_properties where opa_id is null, and run again. You should also run delete from property_tax_delinquencies where opa_number = 0 for the same reason.

got it, can you include in documentation?

I added a section in the backend doc for this.

Nico Zigouras added 2 commits May 18, 2024 17:14
@zigouras
Copy link
Contributor Author

@brandonfcohen1 I changed the way the data-diff is done so it does not create the primary key constraints in the pg db or need any preliminary clean up of data with sql. This keeps your existing script logic and feature layer classes unchanged and is a more extensible, loosely coupled solution. I leveraged the -w flag in the data-diff program to apply a where clause to limit the records being compared.

Please look at my latest commit to this PR. You should be able to run the script.py now without any preparatory execution of sql to clean up the data. Thanks.

@Moylena
Copy link

Moylena commented May 28, 2024

@nlebovits is this something you can look at if Brandon can't?

@nlebovits nlebovits merged commit 2a8c71a into CodeForPhilly:main May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants