Queries with OR/Contains fall back to BSON scan and return cross-collection phantom objects in single-file mode #65

@LeoYang06

Package version

4.3.1

Affected package

BLite (client SDK)

.NET version

10.0

Description

Hello @mrdevrobot! While using BLite with multiple collections, I ran into a critical edge case: queries return duplicate, wrong-typed entities that belong to other collections.

When performing a query using || (OR) or .Contains() on an indexed field in the default embedded mode (single-file), the query engine bypasses the B-Tree index, falls back to a physical page scan, and mistakenly deserializes documents from other collections into "phantom objects".

Minimal reproduction

Here is a minimal reproducible example using BLite v4.3.1.
Note: Both PhotoPo and PhotoMetadataPo share the Id and SourceId properties.

[MRE](http://github.com/LeoYang06/BLiteTestCases/tree/master/QueriesWithORContainsReturnPhantomObjectsIssue)

Expected behavior

Where(x => idList.Contains(x.SourceId)).Count() should return 710.

Actual behavior

It returns 1420. Half of the results are PhotoMetadataPo documents silently deserialized as PhotoPo (with RelativePath defaulting to null/empty string).

Additional context

Root Cause Analysis (based on the source code):

Tracing the execution path in the source, this issue appears to be caused by a combination of AST fallback and shared physical pages:

  1. AST Fallback (Missing OrElse/Contains):
    In IndexOptimizer.OptimizeExpression, there is no handling for ExpressionType.OrElse or MethodCallExpression (for Contains). This forces the optimizer to return null, bypassing the isolated B-Tree index scan (Strategy 1 in FetchAsync).

  2. Shared Pages across Collections:
    In embedded mode (_collectionFiles == null), _storage.GetCollectionPageIds(_collectionName) yields all pages from the single DB file, meaning physical pages contain mixed slots from both Photos and PhotoMetadata.

  3. BSON Predicate matching foreign collections:
    The engine falls back to ScanAsync(BsonReaderPredicate predicate, ...). The compiled BSON predicate matches fields purely by name. Since PhotoMetadataPo also has a SourceId field, the predicate returns true for foreign documents.

  4. Silent Deserialization Bypass:
    The catch block in ScanAsync assumes that cross-collection deserialization will throw an exception (// foreign-collection document — skip silently). However, because PhotoMetadataPo shares the Id and SourceId fields with PhotoPo, the deserializer succeeds without throwing. The missing RelativePath property is simply assigned its default value. The bare catch is bypassed, and the phantom object is yielded.
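
Steps 3 and 4 can be illustrated outside BLite. The sketch below is only an analogy: it uses System.Text.Json instead of BLite's BSON mapper, and the class shapes are assumptions (only Id, SourceId and RelativePath come from the MRE; the string types and the Width property are invented for illustration). It shows the general point that a lenient deserializer does not throw when the payload is a subset-compatible document of another type, so a try/catch used as a "foreign document" filter never fires.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical stand-ins for the MRE entities (property types are assumed).
public sealed class PhotoPo
{
    public string Id { get; set; } = "";
    public string SourceId { get; set; } = "";
    public string? RelativePath { get; set; }
}

public sealed class PhotoMetadataPo
{
    public string Id { get; set; } = "";
    public string SourceId { get; set; } = "";
    public int Width { get; set; } // invented field, not from the MRE
}

public static class PhantomDemo
{
    // Mimics a scan that relies on the deserializer throwing for foreign documents.
    public static PhotoPo? ReadAsPhoto(PhotoMetadataPo foreign)
    {
        string payload = JsonSerializer.Serialize(foreign);
        try
        {
            // Unknown members (Width) are ignored and missing ones (RelativePath)
            // keep their defaults, so no exception is raised here.
            return JsonSerializer.Deserialize<PhotoPo>(payload);
        }
        catch (JsonException)
        {
            return null; // the "skip silently" path, which is never reached
        }
    }

    public static void Main()
    {
        var phantom = ReadAsPhoto(new PhotoMetadataPo { Id = "m1", SourceId = "s1", Width = 800 });
        Console.WriteLine(phantom is null
            ? "skipped"
            : $"phantom: SourceId={phantom.SourceId}, RelativePath={phantom.RelativePath ?? "<null>"}");
    }
}
```

In this sketch the foreign document comes back as a PhotoPo whose SourceId is populated and whose RelativePath is null, which matches the shape of the phantom objects described above.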

Proposed Suggestions:

  • Immediate fix via AST: I noticed that the execution path for IndexQueryPlan.PlanKind.IndexIn is already implemented in DocumentCollection.ScanAsync. Adding support for OrElse / Contains in IndexOptimizer to map to IndexIn would immediately fix this specific query, as the B-Tree index is inherently collection-isolated.
  • Safer Fallback Scans: To prevent cross-collection data leaks during ScanAsync or FindAllAsync fallbacks, perhaps the SlottedPageHeader/SlotEntry could embed a collection identifier, or the BSON payload could strictly tag the CollectionName for validation before deserialization. Relying on the deserializer to throw exceptions might be too permissive when types share a subset of properties.
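
To make the first suggestion more concrete, here is a rough sketch of how an optimizer could recognise both query shapes and reduce them to a list of candidate key values. It uses only System.Linq.Expressions and does not touch BLite's code: InListSketch, TryCollectInValues and the Photo class are hypothetical names, and how the resulting values would be handed to an IndexIn plan is left open. It handles only `x.Member == value` chains joined by `||` and the two common Contains shapes (instance `list.Contains(x.Member)` and static `Enumerable.Contains(source, x.Member)`); anything else returns null so the caller could keep its current fallback.

```csharp
#nullable enable
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public sealed class Photo { public string SourceId { get; set; } = ""; } // hypothetical entity

public static class InListSketch
{
    // Returns the candidate values for `memberName`, or null if the shape is unsupported.
    public static List<object?>? TryCollectInValues(Expression body, string memberName)
    {
        switch (body)
        {
            case BinaryExpression { NodeType: ExpressionType.OrElse } orElse:
                var left = TryCollectInValues(orElse.Left, memberName);
                var right = TryCollectInValues(orElse.Right, memberName);
                if (left is null || right is null) return null;
                left.AddRange(right);
                return left;

            case BinaryExpression { NodeType: ExpressionType.Equal } eq
                when IsMember(eq.Left, memberName):
                return new List<object?> { Evaluate(eq.Right) };

            case MethodCallExpression { Method.Name: "Contains" } call:
                // Instance form: Object + 1 argument. Static form: no Object + 2 arguments.
                var operands = call.Object is null
                    ? call.Arguments.ToList()
                    : call.Arguments.Prepend(call.Object).ToList();
                if (operands.Count != 2 || !IsMember(operands[1], memberName)) return null;
                return Evaluate(operands[0]) is IEnumerable seq and not string
                    ? seq.Cast<object?>().ToList()
                    : null;

            default:
                return null;
        }
    }

    // True for `x.<memberName>` where x is the lambda parameter.
    private static bool IsMember(Expression e, string memberName) =>
        e is MemberExpression { Expression: ParameterExpression } m && m.Member.Name == memberName;

    // Evaluates a parameter-free sub-expression (constant or captured variable).
    private static object? Evaluate(Expression e) =>
        Expression.Lambda(e).Compile().DynamicInvoke();

    public static void Main()
    {
        var idList = new List<string> { "a", "b" };
        Expression<Func<Photo, bool>> byContains = x => idList.Contains(x.SourceId);
        Expression<Func<Photo, bool>> byOr = x => x.SourceId == "a" || x.SourceId == "b";

        Console.WriteLine(string.Join(",", TryCollectInValues(byContains.Body, "SourceId")!));
        Console.WriteLine(string.Join(",", TryCollectInValues(byOr.Body, "SourceId")!));
    }
}
```

Both expressions reduce to the same two values, "a" and "b", which is the input an IN-style index lookup would need. A real implementation would also have to handle reversed operands (`value == x.Member`), duplicate values, mixed fields inside one OR chain, and right-hand sides that reference the lambda parameter.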
