CSV Validation #91

mgramigna · 2021-03-11T21:01:16Z

Summary

Introduces the concept of validating CSV schemas for the specific extractor (e.g. column names, number of columns, required columns, etc.).

CSV Validation is done at initialization time of the Extractor (before any data is extracted for anything).

NOTE: Schemas in this PR only exist for the ICARE data elements per the task definition of done.

New behavior

If any validation errors occur, they will be logged to the console and the extraction process will stop. Otherwise, logs will indicate that the validation has succeeded, and extraction will continue. Actual extraction behavior is unchanged

Code changes

Schemas and utilities for doing validation of specific CSV content based on a provided schema (see src/helpers/schemas/csv.js) for schemas. They are simple for now (just column names and optional vs. not). We can add more complex validation in the future but IMO that should be separate tasking.
Introduce BaseCSVExtractor class that all CSV extractors extend from
- Implement the validate function and also construct the CSV module based on the file path
Validate all extractors in the initializeExtractors in BaseClient
- Validation is asynchronous, so this is now an async function
Update all tests with necessary async keywords based on calls to extractor initialization.
Update the CSV data element spreadsheet to correctly indicate present fields and if they are required based on the behavior in the current version of the MEF.

Testing guidance

Running extraction with existing CSV should have unchanged behavior apart from logs
Change some of the data in any of the ICARE CSVs (Patient, Condition, TreatmentPlanChange, CancerDiseaseStatus, ClinicalTrialInformation)
- Try omitting required data (e.g. mrn)
- Try adding erroneous columns
- Try invalid CSV (e.g. 9 column headers but only 8 column values for a given row)

jafeltra

This looks really good to me! I just had a couple questions:

[JA1] When I remove a column, I get a logged message that says ...found extra column(s): "". Is that expected? This happens both if the column is marked as required or not. I think we had said if we had missing non-required columns we wanted to flag the CSV as invalid, but I'm not sure and wanted to check. Either way, if we are missing columns, it might be helpful to update the error message we get in this case since it reference extra columns, but we're missing columns.
[JA2] When I add an extra column but don't add a value in the row (or vice versa), I just get an error that the extractor cannot be initialized, but no message from the validation. Is that expected? It looks like the error that gets caught when we log that message says that the column length and row length is off, so it might be helpful to log that information.
[JA3] This might not be possible with the approach you've taken, so I'm just asking to see your/others opinion on this. But if one CSV file fails validation, do you think we should continue with extraction for the other extractors that did have valid CSV files? Or does it make sense to stop all extraction? Maybe there's a case to make that we might be missing resources we try to reference, but we also don't enforce any sort of context right now with the CSV extractors.
[JA4] I'm not sure exactly what you updated in the Excel file, but it looks like some of the required columns don't match up with what is in the schema for some of the resources. Is that something you were trying to align as part of this?

Dtphelan1 · 2021-03-15T13:06:59Z

DP1 - Looks like this PR hasn't yet added tests for the CSV Validator. I think as you're going through and addressing the JA's above, creating tests that explicitly define expected behavior would be helpful.

Dtphelan1 · 2021-03-15T13:08:50Z

[JA3] This might not be possible with the approach you've taken, so I'm just asking to see your/others opinion on this. But if one CSV file fails validation, do you think we should continue with extraction for the other extractors that did have valid CSV files? Or does it make sense to stop all extraction? Maybe there's a case to make that we might be missing resources we try to reference, but we also don't enforce any sort of context right now with the CSV extractors.

My 2 Cents: I would want to continue the validation and aggregate all the errors. Since this is pre-extraction, it will be most helpful to let the user know about all errors found in validation, so they can go back to their POC and inform them of all the issues in validation that need addressing..

jafeltra · 2021-03-15T14:02:26Z

My 2 Cents: I would want to continue the validation and aggregate all the errors. Since this is pre-extraction, it will be most helpful to let the user know about all errors found in validation, so they can go back to their POC and inform them of all the issues in validation that need addressing..

Ah that's a good idea! Do you think that in addition to seeing all the errors that are there, would we want to continue with extraction for the ones that did validate? Or just stop and be able to see the errors that were there?

Dtphelan1 · 2021-03-15T14:48:32Z

Ah that's a good idea! Do you think that in addition to seeing all the errors that are there, would we want to continue with extraction for the ones that did validate? Or just stop and be able to see the errors that were there?

I would think the client should shut down if the CSV's don't validate without errors. That's my thought anyway.

mgramigna · 2021-03-16T18:15:51Z

Methinks all comments resolved

jafeltra

All these changes look good to me! I tested out the various cases for missing and required/ missing and optional/extra and they all work as expected. And the schemas, extractors, and Excel file all line up!

Dtphelan1 · 2021-03-17T13:35:00Z

Looks like this could use a rebase.

* Include schemas and tools for validating CSVs * Add BaseCSVExtractor class that all CSV extractors extend * Update excel sheet to indicate all fields used * Update tests to use proper calls

Dtphelan1 · 2021-03-17T14:05:33Z

Looks great.

mgramigna requested review from Dtphelan1, dmendelowitz, jafeltra and julianxcarter March 11, 2021 21:01

Dtphelan1 self-assigned this Mar 12, 2021

jafeltra self-assigned this Mar 12, 2021

jafeltra reviewed Mar 12, 2021

View reviewed changes

jafeltra approved these changes Mar 16, 2021

View reviewed changes

Matthew Gramigna added 4 commits March 17, 2021 09:36

CSV Validation

a671a00

* Include schemas and tools for validating CSVs * Add BaseCSVExtractor class that all CSV extractors extend * Update excel sheet to indicate all fields used * Update tests to use proper calls

Update to switch to custom implementation of validation with tests

edb9bb2

Update CSV to include proper requirements for CDS

560eb5c

Revert baseClient test to use synchronous initialize call

bb59921

mgramigna force-pushed the csv-validation branch from 75eaf06 to bb59921 Compare March 17, 2021 13:36

Dtphelan1 approved these changes Mar 17, 2021

View reviewed changes

mgramigna merged commit 1bbbbe4 into master Mar 17, 2021

mgramigna deleted the csv-validation branch March 17, 2021 14:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CSV Validation #91

CSV Validation #91

Uh oh!

mgramigna commented Mar 11, 2021

Uh oh!

jafeltra left a comment

Uh oh!

Dtphelan1 commented Mar 15, 2021

Uh oh!

Dtphelan1 commented Mar 15, 2021

Uh oh!

jafeltra commented Mar 15, 2021

Uh oh!

Dtphelan1 commented Mar 15, 2021

Uh oh!

mgramigna commented Mar 16, 2021

Uh oh!

jafeltra left a comment

Uh oh!

Dtphelan1 commented Mar 17, 2021

Uh oh!

Dtphelan1 commented Mar 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

CSV Validation #91

CSV Validation #91

Uh oh!

Conversation

mgramigna commented Mar 11, 2021

Summary

New behavior

Code changes

Testing guidance

Uh oh!

jafeltra left a comment

Choose a reason for hiding this comment

Uh oh!

Dtphelan1 commented Mar 15, 2021

Uh oh!

Dtphelan1 commented Mar 15, 2021

Uh oh!

jafeltra commented Mar 15, 2021

Uh oh!

Dtphelan1 commented Mar 15, 2021

Uh oh!

mgramigna commented Mar 16, 2021

Uh oh!

jafeltra left a comment

Choose a reason for hiding this comment

Uh oh!

Dtphelan1 commented Mar 17, 2021

Uh oh!

Dtphelan1 commented Mar 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants