Package version
4.3.1
Affected package
BLite (client SDK)
.NET version
10.0
Description
Hello @mrdevrobot! While using BLite with multiple collections, I hit a critical edge case: queries return duplicate, wrong-typed entities that actually belong to other collections.
When performing a query using || (OR) or .Contains() on an indexed field in the default embedded mode (single-file), the query engine bypasses the B-Tree index, falls back to a physical page scan, and mistakenly deserializes documents from other collections into "phantom objects".
Minimal reproduction
Here is a minimal reproducible example using BLite v4.3.1.
Note: Both PhotoPo and PhotoMetadataPo share the Id and SourceId properties.
[MRE](http://github.com/LeoYang06/BLiteTestCases/tree/master/QueriesWithORContainsReturnPhantomObjectsIssue)
Expected behavior
Where(x => idList.Contains(x.SourceId)).Count() should return 710.
Actual behavior
It returns 1420. Half of the results are PhotoMetadataPo documents silently deserialized as PhotoPo (with RelativePath defaulting to null/empty string).
Additional context
Root Cause Analysis (Based on source code):
Tracing the execution path suggests that the issue comes from a combination of AST fallback and shared physical pages, which plays out in four steps:
- AST Fallback (Missing OrElse/Contains): In IndexOptimizer.OptimizeExpression, there is no handling for ExpressionType.OrElse or MethodCallExpression (for Contains). This forces the optimizer to return null, bypassing the isolated B-Tree index scan (Strategy 1 in FetchAsync).
- Shared Pages across Collections: In embedded mode (_collectionFiles == null), _storage.GetCollectionPageIds(_collectionName) yields all pages from the single DB file, so physical pages contain mixed slots from both Photos and PhotoMetadata.
- BSON Predicate matching foreign collections: The engine falls back to ScanAsync(BsonReaderPredicate predicate, ...). The compiled BSON predicate matches fields purely by name. Since PhotoMetadataPo also has a SourceId field, the predicate returns true for foreign documents.
- Silent Deserialization Bypass: The catch block in ScanAsync assumes that cross-collection deserialization will throw an exception (// foreign-collection document — skip silently). However, because PhotoMetadataPo shares the Id and SourceId fields with PhotoPo, the deserializer succeeds without throwing, and the missing RelativePath property is simply assigned its default value. The bare catch is therefore never entered, and the phantom object is yielded.
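The last two steps can be reproduced in isolation. The following is a minimal sketch, not BLite code: `StoredDoc`, `PhantomScanSketch`, the `Width` field, and the string-typed ids are all hypothetical stand-ins chosen for illustration. It shows how a name-only predicate combined with a lenient mapper turns a foreign document into a `PhotoPo` without any exception being thrown, and how a collection check before the predicate avoids it.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the MRE entity; string-typed ids are an assumption.
public sealed class PhotoPo
{
    public string? Id { get; set; }
    public string? SourceId { get; set; }
    public string? RelativePath { get; set; }
}

// Hypothetical stored document: the owning collection plus raw name/value fields.
public sealed record StoredDoc(string Collection, Dictionary<string, object?> Fields);

public static class PhantomScanSketch
{
    // Name-only predicate: true for any document carrying a matching SourceId
    // field, regardless of which collection wrote it.
    public static bool MatchesSourceId(StoredDoc doc, ISet<string> ids) =>
        doc.Fields.TryGetValue("SourceId", out var v) && v is string s && ids.Contains(s);

    // Lenient mapping: absent fields become null instead of raising an error.
    public static PhotoPo ToPhotoPo(StoredDoc doc) => new PhotoPo
    {
        Id = Field(doc, "Id"),
        SourceId = Field(doc, "SourceId"),
        RelativePath = Field(doc, "RelativePath"),
    };

    private static string? Field(StoredDoc doc, string name) =>
        doc.Fields.TryGetValue(name, out var v) ? v as string : null;

    // Scans one shared "page". With checkCollection == false, foreign documents
    // reach the predicate and the mapper; with true, they are filtered out first.
    public static List<PhotoPo> Scan(IEnumerable<StoredDoc> page, ISet<string> ids, bool checkCollection) =>
        page.Where(d => !checkCollection || d.Collection == "Photos")
            .Where(d => MatchesSourceId(d, ids))
            .Select(ToPhotoPo)
            .ToList();

    // One page holding a slot from each collection, both with SourceId "s1".
    public static List<StoredDoc> SamplePage() => new List<StoredDoc>
    {
        new StoredDoc("Photos", new Dictionary<string, object?>
            { ["Id"] = "p1", ["SourceId"] = "s1", ["RelativePath"] = "a.jpg" }),
        new StoredDoc("PhotoMetadata", new Dictionary<string, object?>
            { ["Id"] = "m1", ["SourceId"] = "s1", ["Width"] = 1920 }),
    };
}
```

Calling `Scan(SamplePage(), new HashSet<string> { "s1" }, checkCollection: false)` yields two `PhotoPo` objects, the second one with `RelativePath == null`; with `checkCollection: true` only the real photo is returned. This mirrors the doubling from 710 to 1420 in the report.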
Proposed Suggestions:
- Immediate fix via AST: I noticed that the execution path for IndexQueryPlan.PlanKind.IndexIn is already implemented in DocumentCollection.ScanAsync. Adding support for OrElse / Contains in IndexOptimizer to map to IndexIn would immediately fix this specific query, as the B-Tree index is inherently collection-isolated.
- Safer Fallback Scans: To prevent cross-collection data leaks during ScanAsync or FindAllAsync fallbacks, perhaps the SlottedPageHeader/SlotEntry could embed a collection identifier, or the BSON payload could strictly tag the CollectionName for validation before deserialization. Relying on the deserializer to throw exceptions might be too permissive when types share a subset of properties.
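To make the first suggestion more concrete, here is a rough sketch of how an OR-chain of equality tests or a `Contains` call on a single property could be reduced to a flat value list, which is the shape an IN-style index lookup would need. `InListExtractor` and `TryExtract` are hypothetical names and are not part of BLite; how the resulting values would be handed to `IndexQueryPlan` is not shown, since that depends on the optimizer's internals. Only `System.Linq.Expressions` APIs are used. The sketch assumes that the non-property side of each comparison does not reference the lambda parameter.

```csharp
#nullable enable
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical helper, not part of BLite.
public static class InListExtractor
{
    // Returns the candidate values when the predicate body consists only of
    // `x.Prop == value` tests joined by || and/or `values.Contains(x.Prop)`;
    // returns null for any other shape so the caller can fall back.
    public static List<object?>? TryExtract<T>(Expression<Func<T, bool>> predicate, string propertyName)
    {
        var values = new List<object?>();
        return Collect(predicate.Body, predicate.Parameters[0], propertyName, values) ? values : null;
    }

    private static bool Collect(Expression node, ParameterExpression p, string prop, List<object?> values)
    {
        switch (node)
        {
            // a || b compiles to a BinaryExpression with NodeType OrElse.
            case BinaryExpression { NodeType: ExpressionType.OrElse } orNode:
                return Collect(orNode.Left, p, prop, values) && Collect(orNode.Right, p, prop, values);

            case BinaryExpression { NodeType: ExpressionType.Equal } eq:
                if (IsProperty(eq.Left, p, prop)) { values.Add(Evaluate(eq.Right)); return true; }
                if (IsProperty(eq.Right, p, prop)) { values.Add(Evaluate(eq.Left)); return true; }
                return false;

            case MethodCallExpression { Method.Name: "Contains" } call:
                // Instance form: list.Contains(x.Prop).
                // Static form: Enumerable.Contains(source, x.Prop).
                Expression? source =
                    call.Object is not null && call.Arguments.Count == 1 ? call.Object
                    : call.Method.DeclaringType == typeof(Enumerable) && call.Arguments.Count == 2 ? call.Arguments[0]
                    : null;
                if (source is null || !IsProperty(call.Arguments[call.Arguments.Count - 1], p, prop)) return false;
                if (Evaluate(source) is not IEnumerable seq) return false;
                foreach (var v in seq) values.Add(v);
                return true;

            default:
                return false;
        }
    }

    private static bool IsProperty(Expression e, ParameterExpression p, string prop) =>
        e is MemberExpression m && m.Expression == p && m.Member.Name == prop;

    // Evaluates a parameter-free sub-expression (a constant or captured variable).
    private static object? Evaluate(Expression e) =>
        Expression.Lambda(e).Compile().DynamicInvoke();
}

// Hypothetical entity used only to exercise the helper.
public sealed class PhotoSample
{
    public string SourceId { get; set; } = "";
}
```

For a `List<string> idList` containing "a" and "b", both `TryExtract<PhotoSample>(x => idList.Contains(x.SourceId), "SourceId")` and `TryExtract<PhotoSample>(x => x.SourceId == "a" || x.SourceId == "b", "SourceId")` return the two values, while `x => x.SourceId.StartsWith("a")` returns null. Other `Contains` overloads (for example span-based ones) are deliberately left unsupported here and also return null.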