Conversation

davidsbatista left a comment:

@hande-k thank you for this contribution! I left some initial comments/suggestions for improvements.
Resolved review threads (comments collapsed) on:

- integrations/cognee/src/haystack_integrations/components/connectors/cognee/memory_store.py (multiple threads)
- integrations/cognee/src/haystack_integrations/components/connectors/cognee/writer.py (multiple threads)
- integrations/cognee/src/haystack_integrations/components/connectors/cognee/py.typed
- integrations/cognee/src/haystack_integrations/components/connectors/cognee/cognifier.py (multiple threads)
- integrations/cognee/src/haystack_integrations/components/retrievers/cognee/memory_retriever.py
- integrations/cognee/src/haystack_integrations/components/writers/cognee/memory_writer.py
- integrations/cognee/src/haystack_integrations/memory_stores/cognee/memory_store.py
Thanks for the review @davidsbatista & @sjrl! I've addressed all the comments. A couple of notes:

Let me know if anything needs further adjustment!
Referenced code:

```python
return future.result()
```

```python
def extract_text(item: Any) -> str:
```
Reviewer: Would it be possible to make the type for `item` more strict? Looking at the code, it looks like it could be one of three things: `str`, `dict`, and some Cognee-specific object.

Author: Cognee's search API returns `Any` internally. Results can be `str` (LLM completions), `dict` (structured outputs), or cognee model objects (`DataPoint` subclasses) depending on the search type. Since these internal types aren't part of cognee's API, I've kept `Any` but expanded the docstring to document the three expected categories. Would that work?

Reviewer: Yeah, expanding the docstring with the three supported types and a note that they are internal types would be great.
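A minimal sketch of how `extract_text` could cover the three documented result categories. The `"text"` dict key and the attribute names used as fallbacks are assumptions for illustration, not cognee's actual internals:

```python
from typing import Any


def extract_text(item: Any) -> str:
    """Best-effort text extraction from a cognee search result.

    Cognee's search API is typed as returning Any; results are one of three
    internal shapes: str (LLM completions), dict (structured outputs), or
    cognee model objects (DataPoint subclasses).
    """
    if isinstance(item, str):
        return item
    if isinstance(item, dict):
        # Assumption: prefer a "text" key, otherwise stringify the dict.
        text = item.get("text")
        return text if isinstance(text, str) else str(item)
    # Model objects: fall back to a text-like attribute, else repr.
    for attr in ("text", "name"):
        value = getattr(item, attr, None)
        if isinstance(value, str):
            return value
    return str(item)
```

This keeps the public signature as `Any` (matching cognee's API) while making the three handled categories explicit in code and docstring.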
Resolved review thread on integrations/cognee/src/haystack_integrations/components/retrievers/cognee/memory_retriever.py.
Referenced code:

```python
"""
:param search_type: Cognee search type for memory retrieval.
:param top_k: Default number of results for memory search.
:param dataset_name: Cognee dataset name for storing memories.
```
Reviewer: Can you help me understand Cognee a bit better? It looks like from your code that `dataset_name` is required, which makes sense. My question is: what level of scope is `dataset_name` meant to have? For example, is it normal to have a new `dataset_name` per user, or is the intention for a dataset to be scoped to multiple users (e.g. at an org level)?

To give more context on why I'm asking: in our existing Mem0 memory store, most memories are scoped to users, which is why our experimental protocol typically requires a `user_id` in all of its methods; the idea is that we are always scoping the request to a specific user. For example, our protocol for `add_memories` looks like:

```python
def add_memories(self, *, messages: list[ChatMessage], user_id: str | None = None, **kwargs: Any) -> None:
```

with `user_id` set at run time to scope the request to a specific user. This way we don't need to create a new MemoryStore for every user, which is useful when we want to easily reuse a Haystack Agent or Pipeline for many users. So my hope is that Cognee can also fit into this pattern.
Author: Cognee supports user-level scoping. Its `add()`, `search()`, and `cognify()` APIs all accept a `user` parameter for ACL-based access control. I've added `user_id` support to `CogneeMemoryStore` to match your existing pattern: a single store instance can serve multiple users via `user_id` at runtime. When `user_id` is provided, it's resolved to a cognee `User` and forwarded to all API calls.

`dataset_name` groups data logically (like a collection), while `user_id` controls who can access it. Search is scoped to the store's `dataset_name` for the given user, and shared datasets (where another user granted read permission) are automatically resolved. I've updated `examples/demo_memory_agent.py` with a demo of isolation + sharing.

Does the current logic make sense for the integration?

Reviewer: Yup, that logic makes a lot of sense! Thanks for the changes. I'd only say it would be good to expose the user resolution to the Retriever and Writer components as well, since those are how pipeline builders will typically interact with the Cognee Memory Store.
Reviewer: It would be great to also add some integration tests that require access to a Cognee account. We'd then mark those tests with a header like:

```python
@pytest.mark.skipif(
    not os.environ.get("COGNEE_API_KEY", None),
    reason="Export an env var called COGNEE_API_KEY containing the Cognee API key to run this test.",
)
@pytest.mark.integration
```

Author: Added test_integration.py gated on LLM_API_KEY. This integration uses the Cognee SDK locally as a library, so the only external dependency is the LLM provider (in the default config, the OpenAI API key). We can extend the integration in the future to use our Cloud as well, which would require a cognee account.
Reviewer: Oh okay, good to know. When using the Cognee SDK locally like this, where do the memories get stored? Is the cognee library running a local db in the background or storing them in memory?

Could we also add tests for CogneeCognifier and for the connectors/cognee/_utils.py file?
@hande-k thanks for your patience with us! This is going to be a great addition to Haystack. Since it's the first version of a new abstraction, we really appreciate you working through all the comments with us as we refine how it fits in.

Hi @hande-k, just checking in, do you have time to continue working on this? Happy to help or take over some of the remaining changes if that's useful.

hi @sjrl thanks for the new comments and offering your help to handle the requested changes, appreciate it!

@hande-k you're welcome and that sounds good to me!
Referenced code:

```python
user = run_sync(_get_cognee_user(user_id)) if user_id else None

added = 0
for msg in messages:
    text = msg.text
    if not text:
        continue
    run_sync(cognee.add(text, dataset_name=self.dataset_name, user=user))
    added += 1
```
Reviewer: What happens if the user is `None`?
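For context, a `run_sync` helper of the kind the earlier `return future.result()` fragment suggests might look like this. This is a common pattern for bridging an async SDK into a synchronous `run` method; it is a sketch under that assumption, not the PR's actual helper:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine


def run_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run an async coroutine from synchronous code.

    If no event loop is running, asyncio.run() suffices. If a loop is
    already running (e.g. inside a notebook), asyncio.run() would raise
    RuntimeError, so instead the coroutine is executed on a fresh loop in
    a worker thread and we block on the resulting future.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running in this thread: safe to run directly.
        return asyncio.run(coro)
    # A loop is already running: delegate to a worker thread.
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(asyncio.run, coro)
        return future.result()
```

The worker-thread branch is what the quoted `return future.result()` line would correspond to in this sketch.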
Referenced docstring:

```python
Search is restricted to the store's ``dataset_name``. If the user owns the
dataset it is resolved by name; otherwise the store checks whether the user
has been granted read access (e.g. via shared permissions) and searches by
dataset UUID.
When ``None``, cognee's default user is used.
```

Reviewer: Let's make sure to only use single backticks to wrap code sections, otherwise our doc build complains. Suggested change:

```python
Search is restricted to the store's `dataset_name`. If the user owns the
dataset it is resolved by name; otherwise the store checks whether the user
has been granted read access (e.g. via shared permissions) and searches by
dataset UUID.
When `None`, cognee's default user is used.
```
Referenced code:

```python
def __init__(
    self, search_type: CogneeSearchType = "GRAPH_COMPLETION", top_k: int = 10, dataset_name: str | None = None
):
```

Reviewer: Following the pattern for our DocumentStores and their respective retrievers, I'd expect the init method to take a CogneeMemoryStore as an init param. Check out our OpenSearchBM25Retriever as an example. The idea is for the retriever to call the method directly from the Store, e.g. this is how the bm25 retrieval is run in the linked component: `docs = doc_store._bm25_retrieval(**bm25_args)`. So it would be great if we could follow that pattern here as well.
Referenced code:

```python
self.dataset_name = dataset_name

@component.output_types(documents=list[Document])
def run(self, query: str, top_k: int | None = None) -> dict[str, list[Document]]:
```

Reviewer: Related to this comment (https://github.com/deepset-ai/haystack-core-integrations/pull/2979/changes#r3027766776), the run method should more or less accept all of the same arguments as the `CogneeMemoryStore.search_memories` function.
Closes https://github.com/deepset-ai/haystack-private/issues/240

Summary

- Adds `CogneeWriter`, `CogneeCognifier`, `CogneeRetriever`, and `CogneeMemoryStore`
- `CogneeWriter` ingests Haystack Documents into Cognee memory via `cognee.add()` + optional `cognee.cognify()`
- `CogneeRetriever` searches Cognee's memory and returns Haystack Documents
- `CogneeCognifier` wraps `cognee.cognify()` as a standalone pipeline step
- `CogneeMemoryStore` implements the `MemoryStore` protocol from haystack-experimental for use with Haystack's experimental Agent

Test plan

- `hatch run test:unit`
- `hatch run fmt-check`
- `hatch run test:types`
- Demo scripts (`demo_pipeline.py`, `demo_memory_agent.py`)