feat: Expose document store Python API in instructlab/instructlab rag submodule #2832
anastasds
left a comment
Suggested some simplification and had a few questions.
    ) -> Pipeline:
        pipeline = Pipeline()
        pipeline.add_component(instance=create_converter(), name="converter")
        pipeline.add_component(instance=DocumentCleaner(), name="document_cleaner")
Are we sure we want to apply Haystack's document cleaning?
I'm not sure, but if we assume the pipelines will be fed unprocessed user documentation (PDF), are we also assuming that those documents are all ready to use in the RAG ingestion pipeline?
BTW: I might be wrong, but I don't think Docling applies any cleanup to the original documents.
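To make the trade-off concrete, a cleaning step of this kind usually trims whitespace and drops empty lines before chunking. The snippet below is only a standard-library approximation of that idea, not Haystack's `DocumentCleaner` code; the function name `clean_text` and its parameters are hypothetical.

```python
import re


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespace: bool = True,
) -> str:
    """Hypothetical stand-in for a document-cleaning step (not Haystack code)."""
    lines = text.splitlines()
    if remove_extra_whitespace:
        # Collapse runs of spaces/tabs into one space and trim each line.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    if remove_empty_lines:
        # Drop lines that contain nothing but whitespace.
        lines = [line for line in lines if line.strip()]
    return "\n".join(lines)


raw = "Title  \n\n\n  Some   text\twith   gaps \n\nEnd"
cleaned = clean_text(raw)
print(cleaned)
```

If the converter output is already tidy, a step like this is close to a no-op; if the source PDFs are noisy, it changes what ends up embedded, which is the point under discussion.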
@anastasds Please note that I've added one comment here to track the request of upgrading the
66fa135 to e7bb65f
anastasds
left a comment
Pending the discussion regarding the new dependencies, looks good to me!
442aaaa to 3146c9a
b78ac38 to a3ae7e3
b4d94a9 to 6c06cdf
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
courtneypacheco
left a comment
Looks good! Had a few comments (mostly about test coverage)
e2e workflow succeeded on this PR: View run, congrats!
95d57fc to f2cd496
@courtneypacheco added more unit tests to increase coverage
@dmartinol this is looking great! can you squash the commits before merging?
5b30ce7 to 1461333
Signed-off-by: Daniele Martinoli <[email protected]>
1461333 to 0121cae
cdoern
left a comment
thanks for the hard work on this!
Resolves #2876
Depends on #2832
Dev doc: instructlab/dev-docs#161

**Checklist:**
- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [x] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [x] Documentation has been updated, if necessary.
- [x] Unit tests have been added, if necessary.
- [x] Functional tests have been added, if necessary.
- [x] E2E Workflow tests have been added, if necessary.

Approved-by: cdoern
Approved-by: nathan-weinberg

…2903)
**Issue resolved by this Pull Request:**
Resolves #2875
Addresses #2957
Depends on #2832

**Dev Docs related to this Pull Request:**
Link to Dev Doc or PR: instructlab/dev-docs#161

**Checklist:**
- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [x] Unit tests have been added, if necessary.
- [ ] Functional tests have been added, if necessary.
- [ ] E2E Workflow tests have been added, if necessary.

**Unit test code in the next commit**

Approved-by: nathan-weinberg
Approved-by: cdoern
Approved-by: alinaryan
References
This PR relates to the task "[Dev][RAG] Expose document store Python API in instructlab/instructlab rag submodule" in the RAG project.
See #2872
Proposed interfaces
We propose two interfaces to model the ingestion and retrieval interactions with a document store, in the context of RAG pipelines:
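As a rough illustration of what an ingestion/retrieval split can look like, here is a minimal sketch. The class and method names (`Ingestor`, `Retriever`, `ingest`, `retrieve`, `InMemoryStore`) are hypothetical and not taken from this PR, and the in-memory store with keyword-overlap ranking only stands in for a real vector store.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class Ingestor(ABC):
    """Hypothetical ingestion interface: writes documents into a store."""

    @abstractmethod
    def ingest(self, documents: Iterable[str]) -> int:
        """Store the documents and return how many were written."""


class Retriever(ABC):
    """Hypothetical retrieval interface: fetches context for a query."""

    @abstractmethod
    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        """Return up to top_k documents relevant to the query."""


class InMemoryStore(Ingestor, Retriever):
    """Toy implementation; keyword overlap replaces embedding similarity."""

    def __init__(self) -> None:
        self._docs: List[str] = []

    def ingest(self, documents: Iterable[str]) -> int:
        before = len(self._docs)
        # Skip blank documents so they never reach the store.
        self._docs.extend(d for d in documents if d.strip())
        return len(self._docs) - before

    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(d.lower().split())), d) for d in self._docs]
        # Keep only documents sharing at least one term, best match first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [doc for _, doc in ranked[:top_k]]


store = InMemoryStore()
count = store.ingest(["Milvus is a vector database", "Haystack builds pipelines", ""])
hits = store.retrieve("vector database", top_k=1)
```

Keeping the two roles separate means callers that only query the store never need the ingestion dependencies, and the backing store can change without touching either caller.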
Design notes
Instances are created by factory functions, each of which delegates the actual construction to an implementation-specific factory function identified by hardcoded conventions.
Factory functions:
Initial implementation uses a Milvus db, and there is no option to configure it differently.
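The delegation described above can be sketched as follows, assuming the convention is name-based (a public factory builds the name of the implementation-specific factory from the store type). Every name here (`StoreConfig`, `create_ingestor`, `create_milvus_ingestor`, `MilvusIngestorStub`, the default values) is hypothetical and only illustrates the pattern; it is not the code in this PR.

```python
from dataclasses import dataclass


@dataclass
class StoreConfig:
    """Hypothetical configuration; only 'milvus' is supported in this sketch."""
    store_type: str = "milvus"
    uri: str = "embeddings.db"        # placeholder value
    collection_name: str = "example"  # placeholder value


class MilvusIngestorStub:
    """Placeholder for a Milvus-backed ingestor; holds the config only."""

    def __init__(self, config: StoreConfig) -> None:
        self.config = config


def create_milvus_ingestor(config: StoreConfig) -> MilvusIngestorStub:
    # Implementation-specific factory, found through the naming convention.
    return MilvusIngestorStub(config)


# Explicit registry keyed by the convention-derived function name.
_FACTORIES = {"create_milvus_ingestor": create_milvus_ingestor}


def create_ingestor(config: StoreConfig):
    """Public factory: derive the delegate's name and call it."""
    factory_name = f"create_{config.store_type}_ingestor"
    factory = _FACTORIES.get(factory_name)
    if factory is None:
        raise ValueError(f"Unsupported document store type: {config.store_type}")
    return factory(config)


ingestor = create_ingestor(StoreConfig())
```

Adding another backend under this scheme would mean registering one more `create_<type>_ingestor` function, while callers keep using `create_ingestor` unchanged.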
The implementations are based on Haystack pipelines whose details are abstracted by the interface defined in the document_store module.
Test functions
Test functions in tests/test_document_store_factory.py can help to clarify how to consume these APIs.
Updated dependencies
Requirements file has been updated to reflect the required dependencies.
Minimize the PR
The changes include the implementation of the Haystack document stores for Milvus mainly to prove the effectiveness of the proposal. If you prefer, we can start with a minimal commit to only include the interfaces and the minimal dependencies (configuration and unit test).
@anastasds @jwm4
Checklist:
conventional commits.