Skip to content

Conversation

@mgramigna
Copy link
Contributor

Summary

Introduces the concept of validating CSV schemas for the specific extractor (e.g. column names, number of columns, required columns, etc.).

CSV Validation is done at initialization time of the Extractor (before any data is extracted for anything).

NOTE: Schemas in this PR only exist for the ICARE data elements per the task definition of done.

New behavior

If any validation errors occur, they will be logged to the console and the extraction process will stop. Otherwise, logs will indicate that the validation has succeeded, and extraction will continue. Actual extraction behavior is unchanged

Code changes

  • Schemas and utilities for doing validation of specific CSV content based on a provided schema (see src/helpers/schemas/csv.js) for schemas. They are simple for now (just column names and optional vs. not). We can add more complex validation in the future but IMO that should be separate tasking.
  • Introduce BaseCSVExtractor class that all CSV extractors extend from
    • Implement the validate function and also construct the CSV module based on the file path
  • Validate all extractors in the initializeExtractors in BaseClient
    • Validation is asynchronous, so this is now an async function
  • Update all tests with necessary async keywords based on calls to extractor initialization.
  • Update the CSV data element spreadsheet to correctly indicate present fields and if they are required based on the behavior in the current version of the MEF.

Testing guidance

  • Running extraction with existing CSV should have unchanged behavior apart from logs
  • Change some of the data in any of the ICARE CSVs (Patient, Condition, TreatmentPlanChange, CancerDiseaseStatus, ClinicalTrialInformation)
    • Try omitting required data (e.g. mrn)
    • Try adding erroneous columns
    • Try invalid CSV (e.g. 9 column headers but only 8 column values for a given row)

@Dtphelan1 Dtphelan1 self-assigned this Mar 12, 2021
@jafeltra jafeltra self-assigned this Mar 12, 2021
Copy link
Contributor

@jafeltra jafeltra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good to me! I just had a couple questions:

  • [JA1] When I remove a column, I get a logged message that says ...found extra column(s): "". Is that expected? This happens both if the column is marked as required or not. I think we had said if we had missing non-required columns we wanted to flag the CSV as invalid, but I'm not sure and wanted to check. Either way, if we are missing columns, it might be helpful to update the error message we get in this case since it reference extra columns, but we're missing columns.
  • [JA2] When I add an extra column but don't add a value in the row (or vice versa), I just get an error that the extractor cannot be initialized, but no message from the validation. Is that expected? It looks like the error that gets caught when we log that message says that the column length and row length is off, so it might be helpful to log that information.
  • [JA3] This might not be possible with the approach you've taken, so I'm just asking to see your/others opinion on this. But if one CSV file fails validation, do you think we should continue with extraction for the other extractors that did have valid CSV files? Or does it make sense to stop all extraction? Maybe there's a case to make that we might be missing resources we try to reference, but we also don't enforce any sort of context right now with the CSV extractors.
  • [JA4] I'm not sure exactly what you updated in the Excel file, but it looks like some of the required columns don't match up with what is in the schema for some of the resources. Is that something you were trying to align as part of this?

@Dtphelan1
Copy link
Contributor

DP1 - Looks like this PR hasn't yet added tests for the CSV Validator. I think as you're going through and addressing the JA's above, creating tests that explicitly define expected behavior would be helpful.

@Dtphelan1
Copy link
Contributor

  • [JA3] This might not be possible with the approach you've taken, so I'm just asking to see your/others opinion on this. But if one CSV file fails validation, do you think we should continue with extraction for the other extractors that did have valid CSV files? Or does it make sense to stop all extraction? Maybe there's a case to make that we might be missing resources we try to reference, but we also don't enforce any sort of context right now with the CSV extractors.

My 2 Cents: I would want to continue the validation and aggregate all the errors. Since this is pre-extraction, it will be most helpful to let the user know about all errors found in validation, so they can go back to their POC and inform them of all the issues in validation that need addressing..

@jafeltra
Copy link
Contributor

My 2 Cents: I would want to continue the validation and aggregate all the errors. Since this is pre-extraction, it will be most helpful to let the user know about all errors found in validation, so they can go back to their POC and inform them of all the issues in validation that need addressing..

Ah that's a good idea! Do you think that in addition to seeing all the errors that are there, would we want to continue with extraction for the ones that did validate? Or just stop and be able to see the errors that were there?

@Dtphelan1
Copy link
Contributor

Ah that's a good idea! Do you think that in addition to seeing all the errors that are there, would we want to continue with extraction for the ones that did validate? Or just stop and be able to see the errors that were there?

I would think the client should shut down if the CSV's don't validate without errors. That's my thought anyway.

@mgramigna
Copy link
Contributor Author

Methinks all comments resolved

Copy link
Contributor

@jafeltra jafeltra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these changes look good to me! I tested out the various cases for missing and required/ missing and optional/extra and they all work as expected. And the schemas, extractors, and Excel file all line up!

@Dtphelan1
Copy link
Contributor

Looks like this could use a rebase.

Matthew Gramigna added 4 commits March 17, 2021 09:36
* Include schemas and tools for validating CSVs
* Add BaseCSVExtractor class that all CSV extractors extend
* Update excel sheet to indicate all fields used
* Update tests to use proper calls
@Dtphelan1
Copy link
Contributor

Looks great.

@mgramigna mgramigna merged commit 1bbbbe4 into master Mar 17, 2021
@mgramigna mgramigna deleted the csv-validation branch March 17, 2021 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants