Skip to content

Conversation

@dmartinol
Copy link
Contributor

@dmartinol dmartinol commented Dec 20, 2024

References

This PR relates to the task "[Dev][RAG] Expose document store Python API in instructlab/instructlab rag submodule" in the RAG project.
See #2872

Proposed interfaces

We propose two interfaces to model the ingestion and retrieval interactions with a document store, in the context of RAG pipelines:

class DocumentStoreRetriever(ABC):
    @abstractmethod
    def augmented_context(self, user_query: str) -> str:...

class DocumentStoreIngestor(ABC):
    @abstractmethod
    def ingest_documents(self, input_dir) -> tuple[bool, int]:...

Design notes

The actual instances are generated by factory functions which delegate the actual implementation to a factory function identified with some hardcoded conventions.

Factory functions:

def create_document_retriever(
    document_store_config: DocumentStoreConfig,
    retriever_config: RetrieverConfig,
) -> DocumentStoreRetriever:
...

def create_document_store_ingestor(
    document_store_config: DocumentStoreConfig,
    embedding_config: EmbeddingModelConfig,
) -> DocumentStoreIngestor:
...

Initial implementation uses a Milvus db, and there is no option to configure it differently.
The implementations are based on Haystack pipelines whose details are abstracted by the interface defined in the document_store module.

Test functions

Test functions in tests/test_document_store_factory.py can help to clarify how to consume these APIs.

Updated dependencies

Requirements file has been updated to reflect the required dependencies.

Minimize the PR

The changes include the implementation of the Haystack document stores for Milvus mainly to prove the effectiveness of the proposal. If you prefer, we can start with a minimal commit to only include the interfaces and the minimal dependencies (configuration and unit test).

@anastasds @jwm4

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@mergify mergify bot added testing Relates to testing dependencies Relates to dependencies ci-failure PR has at least one CI failure labels Dec 20, 2024
Copy link
Contributor

@anastasds anastasds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested some simplification and had a few questions.

) -> Pipeline:
pipeline = Pipeline()
pipeline.add_component(instance=create_converter(), name="converter")
pipeline.add_component(instance=DocumentCleaner(), name="document_cleaner")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we want to apply Haystack's document cleaning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but if we assume to feed the pipelines with un-processed user documentation (pdf), are we also assuming that they're all ready-to-use in the RAG ingestion pipeline?
BTW: I might be wrong, but I don't think Docling applies any cleanup to the original documents.

@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jan 7, 2025
@dmartinol
Copy link
Contributor Author

@anastasds Please note that the document_splitter might be replaced by the one available in the docling-haystack, whenever it will be available.
ATM we cannot integrate it because of a dependency conflict with the instructlab-sdg package:

    instructlab-sdg 0.6.2 depends on docling<=2.8.3 and >=2.4.2
    docling-haystack 0.1.1 depends on docling<3.0.0 and >=2.9.0

I've added one comment here to track the request of upgrading the instructlab-sdg dependencies.

Copy link
Contributor

@anastasds anastasds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending the discussion regarding the new dependencies, looks good to me!

@mergify mergify bot removed the ci-failure PR has at least one CI failure label Jan 7, 2025
@dmartinol
Copy link
Contributor Author

@anastasds

  • can we make the PR ready for review?
  • what about the issue with the test functions that require the download of the default embedding model?
    • make it a real unit test with mocked embedder? (not easy but probably doable)
    • make it an integration test instead?

@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jan 7, 2025
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Jan 8, 2025
@dmartinol dmartinol marked this pull request as ready for review January 8, 2025 22:10
@mergify mergify bot added the ci-failure PR has at least one CI failure label Jan 8, 2025
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Jan 9, 2025
@dmartinol dmartinol force-pushed the doc-store-api branch 2 times, most recently from b4d94a9 to 6c06cdf Compare January 10, 2025 10:13
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jan 14, 2025
@github-actions
Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@cdoern cdoern added the hold In-progress PR. Tag should be removed before merge. label Jan 14, 2025
Copy link
Contributor

@courtneypacheco courtneypacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Had a few comments (mostly about test coverage)

@github-actions
Copy link

e2e workflow succeeded on this PR: View run, congrats!

@dmartinol
Copy link
Contributor Author

@courtneypacheco added more unit tests to increase coverage

src/instructlab/rag/__init__.py                                    0      0   100%
src/instructlab/rag/document_store.py                              7      0   100%
src/instructlab/rag/document_store_factory.py                      9      0   100%
src/instructlab/rag/haystack/component_factory.py                 22      0   100%
src/instructlab/rag/haystack/components/document_splitter.py      52      7    87%
src/instructlab/rag/haystack/document_store_factory.py             9      0   100%
src/instructlab/rag/haystack/document_store_ingestor.py           42      3    93%
src/instructlab/rag/haystack/document_store_retriever.py          27      0   100%

@cdoern
Copy link
Contributor

cdoern commented Jan 15, 2025

@dmartinol this is looking great! can you squash the commits before merging?

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 15, 2025
@dmartinol
Copy link
Contributor Author

@dmartinol this is looking great! can you squash the commits before merging?

Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the hard work on this!

@mergify mergify bot added ci-failure PR has at least one CI failure one-approval PR has one approval from a maintainer labels Jan 15, 2025
@mergify mergify bot removed one-approval PR has one approval from a maintainer ci-failure PR has at least one CI failure labels Jan 15, 2025
@nathan-weinberg nathan-weinberg removed the hold In-progress PR. Tag should be removed before merge. label Jan 15, 2025
@nathan-weinberg nathan-weinberg removed the request for review from alinaryan January 15, 2025 16:45
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jan 15, 2025
@mergify mergify bot merged commit 45e1749 into instructlab:main Jan 15, 2025
27 checks passed
mergify bot added a commit that referenced this pull request Jan 22, 2025
Resolves #2876 
Depends on #2832 
Dev doc: instructlab/dev-docs#161

**Checklist:**

- [X] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
  [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [x] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [x] Documentation has been updated, if necessary.
- [x] Unit tests have been added, if necessary.
- [x] Functional tests have been added, if necessary.
- [x] E2E Workflow tests have been added, if necessary.



Approved-by: cdoern

Approved-by: nathan-weinberg
mergify bot added a commit that referenced this pull request Jan 22, 2025
…2903)

**Issue resolved by this Pull Request:**
Resolves #2875 
Addresses #2957
Depends on #2832

**Dev Docs related to this Pull Request:**
Link to Dev Doc or PR: instructlab/dev-docs#161

**Checklist:**

- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
  [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [x] Unit tests have been added, if necessary.
- [ ] Functional tests have been added, if necessary.
- [ ] E2E Workflow tests have been added, if necessary.

**Unit test code in the next commit**


Approved-by: nathan-weinberg

Approved-by: cdoern

Approved-by: alinaryan
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Relates to dependencies documentation Improvements or additions to documentation testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants