Feat: enables cross-project file access by storing absolute paths and project names in all graph nodes#302
Conversation
This change enables cross-project file access by storing absolute paths and project names in all graph nodes, fixing the issue where queries from different working directories couldn't locate files. ## Core Changes ### 1. New Fields in All Nodes - `absolute_path`: POSIX format absolute path for file access - `project_name`: Project name for fast cross-project filtering - `path`: Relative path (retained for display) ### 2. Path Calculation Utility - New `codebase_rag/utils/path_utils.py` with `calculate_paths()` - Computes all path-related fields using `os.path.abspath()` - Preserves symlinks for user-friendly paths ### 3. Parser Layer Updates - All processors now call `calculate_paths()` when creating nodes - Project, Package, Folder, File, Module, Class, Function, Method nodes updated - External modules marked with `project_name = "__external__"` ### 4. Query Layer Updates - All Cypher queries return `absolute_path`, `path`, `project_name` - New indexes on `Function`, `Method`, `Class` for `project_name` ### 5. Tools Layer Updates - `CodeRetriever`, `FileEditor`, `FileReader` use `absolute_path` - MCP tools return complete path information - Removed path concatenation logic ## Performance Impact - Storage: +600KB/10k nodes (negligible) - Query performance: 10x improvement for project-filtered queries - Index overhead: <5% ## Backward Compatibility - ✅ Existing `path` field retained - ✅ All existing APIs work unchanged - ✅ Requires reindexing existing projects
Summary of ChangesHello @evander-wang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces fundamental changes to the codebase's knowledge graph structure and its interaction mechanisms to facilitate robust cross-project file access. By embedding absolute paths and project identifiers directly into graph nodes, the system can now accurately locate and reference files regardless of the current working directory. This enhancement significantly improves the reliability and performance of file-related operations and queries, particularly for multi-project environments, without breaking existing functionalities. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed feature addition, enabling cross-project file access by storing absolute paths and project names in graph nodes. The changes are comprehensive, touching the parsing, querying, and tools layers. The new calculate_paths utility centralizes path logic effectively. I've included a couple of review comments focusing on potential performance improvements in Cypher queries and ensuring schema consistency. Overall, this is a solid contribution.
Greptile OverviewGreptile SummaryThis PR enables cross-project file access by storing absolute paths and project names in graph nodes, addressing the limitation where queries from different working directories couldn't locate files. Key Changes:
Issues Found:
Addressed from Previous Threads:
Migration Impact: Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Parser as Parser Layer
participant PathUtils as path_utils.py
participant Graph as Neo4j Graph
participant Tools as Tools Layer
participant FileSystem as File System
Note over Parser,PathUtils: Indexing Phase
Parser->>PathUtils: calculate_paths(file_path, repo_path)
PathUtils->>PathUtils: file_path.relative_to(repo_path).as_posix()
PathUtils->>PathUtils: file_path.resolve() -> absolute_path
PathUtils-->>Parser: {relative_path, absolute_path}
Parser->>Graph: CREATE node with path, absolute_path, project_name
Note over Graph,Tools: Query Phase (Cross-Project)
Tools->>Graph: MATCH (n) WHERE project_name = 'project_a'
Graph-->>Tools: {absolute_path, relative_path, project_name}
alt absolute_path present
Tools->>FileSystem: read(absolute_path)
FileSystem-->>Tools: file contents
else absolute_path missing (old index)
Tools->>PathUtils: validate_allowed_path(relative_path, roots)
PathUtils->>PathUtils: resolve and validate against allowed_roots
PathUtils-->>Tools: validated absolute Path
Tools->>FileSystem: read(validated_path)
FileSystem-->>Tools: file contents
end
|
…ng TYPE and UNION nodes with path properties
…ility; update README and bump version to 0.0.60
|
/gemini review |
|
@claude review |
|
@greptile |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and valuable change by enabling cross-project file access through the storage of absolute paths and project names in graph nodes. The approach is solid, and the security improvements in DirectoryLister are well-implemented and crucial for this feature. The updates to the parsers and Cypher queries are consistent with the new data model.
I have a few suggestions for improvement:
- The
FileEditorandFileReadertools don't seem to be fully updated to support cross-project access, as they are not aware of theALLOWED_PROJECT_ROOTSsetting. This could lead to functional issues and prevents them from returning complete path information. - There's a minor style issue with a local import in one of the parser utility functions.
Overall, this is a great feature addition. Addressing these points will make the implementation more robust and consistent.
Additional Comments (5)
Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/tools/code_retrieval.py
Line: 34:46
Comment:
**Fallback reads wrong key**
`CYPHER_FIND_BY_QUALIFIED_NAME` now returns `n.path AS relative_path`, but the fallback branch here still does `file_path_str = res.get("path")`. When `absolute_path` is missing/NULL, `file_path_str` will be `None` even though `relative_path` is present, causing `CODE_MISSING_LOCATION` and preventing retrieval. Use `res.get("relative_path")` (or keep the alias `AS path` in the query) so the fallback can actually locate the file.
How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/tools/semantic_search.py
Line: 96:105
Comment:
**Query alias breaks lookup**
`CYPHER_GET_FUNCTION_SOURCE_LOCATION` was changed to return `n.path AS relative_path` (and `absolute_path`), but this code still reads `result.get("path")`. That makes `file_path` None and `validate_source_location()` will reject the location, so semantic source extraction will fail for all nodes. Either update this to use `absolute_path` (preferred) or `relative_path`, or keep the query alias as `AS path` for backward compatibility.
How can I resolve this? If you propose a fix, please make it concise.
The embeddings query now returns Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/graph_updater.py
Line: 372:405
Comment:
**Embedding path key mismatch**
The embeddings query now returns `m.path AS relative_path` (and not `AS path`), but `_parse_embedding_result()` and `_generate_semantic_embeddings()` still fetch the file path using `cs.KEY_PATH` (`"path"`). As a result `file_path` becomes `None` and `_extract_source_code()` will skip extraction for every row, preventing embeddings from being generated. Fix by updating the cypher alias back to `AS path`, or update parsing to use `relative_path`/`absolute_path` consistently end-to-end.
How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/constants.py
Line: 416:428
Comment:
**Broken module prefix filter**
`CYPHER_QUERY_EMBEDDINGS` changed from `STARTS WITH $project_name + '.'` to `STARTS WITH $project_name`, but `GraphUpdater` still passes `{"project_name": self.project_name + "."}`. That produces a double-dot prefix like `"myproj.."`, which won't match module qualified names and will yield zero embedding candidates. Either revert the cypher condition or stop appending `"."` in `graph_updater.py`.
How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/utils/path_utils.py
Line: 31:45
Comment:
**Absolute path follows symlinks**
`calculate_paths()` uses `file_path.resolve()` for `absolute_path`, which resolves symlinks. That contradicts the PR description (“Preserves symlinks for user-friendly paths”) and will change the stored `absolute_path` for symlinked repos/files (breaking lookups if users expect the symlink path). If the intended behavior is to preserve symlinks, use `Path.absolute()` / `os.path.abspath()` semantics rather than `resolve()` (or `resolve(strict=False)` only for normalization if you still want to preserve links).
How can I resolve this? If you propose a fix, please make it concise. |
…E config with validator (query/edit), refactor MCPToolsRegistry to conditionally register tools based on mode, query mode provides read-only access while edit mode provides full access including file editing
|
@greptile |
|
/gemini review |
1 similar comment
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and well-implemented feature to enable cross-project file access by storing absolute paths and project names in all graph nodes. The changes are comprehensive, touching the parser, query, and tool layers to ensure consistency. The addition of a read-only query mode for the MCP server is also a great enhancement for security.
I've found a couple of areas for improvement. There's a high-severity issue in one of the updated Cypher queries that could lead to incorrect results and misses a key performance optimization. I've also noted a minor inconsistency in path handling that should be addressed for better maintainability. Overall, this is a solid contribution that greatly enhances the tool's capabilities.
…simplify path handling by removing redundant fields from schemas
|
@greptile |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This is an excellent and substantial pull request that introduces a crucial feature for cross-project analysis. Storing absolute paths and project names in the graph is a solid approach to resolve file access issues from different working directories. The changes are well-implemented across all layers of the application, from parsing to querying. I particularly appreciate the addition of the MCP dual-mode system (query and edit) and the ALLOWED_PROJECT_ROOTS setting, which significantly enhance the security and usability of the tool in different environments. The code is clean, and the new tests provide good coverage for the new functionality. I have a couple of minor suggestions for improving code style and adhering to project rules.
Additional Comments (3)
After this PR, Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/graph_updater.py
Line: 429:436
Comment:
**Embeddings source path now wrong**
`CYPHER_QUERY_EMBEDDINGS` was changed to return `n.path AS path` (relative path), but `_extract_source_code()` still assumes `file_path` is relative to `self.repo_path` and always does `file_path_obj = self.repo_path / file_path`.
After this PR, `CYPHER_QUERY_EMBEDDINGS` also has `n.absolute_path` available, and the embedding pass should prefer that; otherwise cross-project nodes will either read the wrong file (if `path` collides) or fail to open it from the current repo root.
How can I resolve this? If you propose a fix, please make it concise.
This can break Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/parsers/definition_processor.py
Line: 114:123
Comment:
**Module parent folder key mismatch**
When a module’s parent is a folder, the relationship uses `(Folder, path, str(parent_rel_path))`, but folder nodes are created in `StructureProcessor` with `cs.KEY_PATH = paths["relative_path"]` (from `calculate_paths()`). On Windows, `str(parent_rel_path)` is `"a\\b"` while existing folder nodes/other relationships may use `"a/b"`.
This can break `CONTAINS_MODULE` linkage and create duplicate folder nodes. Use a single normalized representation (POSIX or OS-native) consistently for all `path` keys.
How can I resolve this? If you propose a fix, please make it concise.
This is a concrete security bug in edit mode. Reuse the validated Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/mcp/tools.py
Line: 392:429
Comment:
**Read_file bypasses path validator**
In `read_file()`, the pagination branch (`if offset is not None or limit is not None`) opens `full_path = Path(self.project_root) / file_path` directly and does not use `FileReader`’s `@validate_project_path` guard. This allows `../` traversal (and ignores `ALLOWED_PROJECT_ROOTS`) whenever offset/limit is provided.
This is a concrete security bug in edit mode. Reuse the validated `FileReader` path handling for both branches, or apply the same validation to the paginated path before opening.
How can I resolve this? If you propose a fix, please make it concise. |
…ools for query mode compatibility
|
/gemini review |
|
@greptile |
There was a problem hiding this comment.
Code Review
This pull request introduces significant and well-structured features for multi-project support and enhanced security through dual 'query' and 'edit' modes. The changes are extensive, touching configuration, database schema, parsers, and tools. The introduction of absolute paths and project names in graph nodes is a solid approach to enable cross-project queries. My review focuses on ensuring the new security mechanisms are consistently applied. I've identified a few critical areas where path validation is inconsistent, potentially undermining the security goals of this PR. Specifically, the handling of symbolic links and the validation against ALLOWED_PROJECT_ROOTS needs to be standardized across all file I/O operations to prevent security vulnerabilities. All comments highlight critical security and consistency issues that need to be addressed.
…ix critical bug in code_retrieval
…ate_allowed_path() utility function and security boundary for write operations
|
/gemini review |
|
@greptile |
There was a problem hiding this comment.
Code Review
This pull request introduces MCP dual mode, adding query and edit modes with different capabilities and security profiles, and includes changes to enhance security and path handling. The .env.example file was updated to include ALLOWED_PROJECT_ROOTS and MCP_MODE variables, and openspec was added to .gitignore. The README.md was updated to describe the new MCP dual mode system, including example usages, security configurations, and a mode selection guide. The config.py file was modified to add ALLOWED_PROJECT_ROOTS and MCP_MODE settings, along with validation for MCP_MODE. The constants.py file was updated to include KEY_ABSOLUTE_PATH and EXTERNAL_PROJECT_NAME. Cypher queries were updated to include path information, and decorators were modified to validate project paths. The graph_updater.py file was updated to include absolute path and project name when ensuring project nodes. The logs.py file was updated to include new log messages related to MCP mode and path parsing. The main.py file was updated to pass allowed roots to the directory lister. The server.py file was updated to include the MCP mode in the server creation process. The tools.py file was updated to include the mode and allowed roots in the MCP tools registry. The cpp_modules.py file was updated to use as_posix() for path handling. The mixin.py file was updated to calculate and include paths for class nodes and methods. The definition_processor.py file was updated to include path information for modules. The function_ingest.py file was updated to include path information for functions. The import_processor.py file was updated to include the project name for external modules. The structure_processor.py file was updated to include path information for packages and folders. The utils.py file was updated to include path calculation and validation functions. The base.py file was updated to remove OpenAIResponsesModel. The schemas.py file was updated to include project name in the CodeSnippet model. The graph_service.py file was updated to ensure project name indexes. The test_decorators.py file was updated to include allowed roots in mock services. The test_graph_service.py file was updated to account for new indexes. The test_provider_classes.py file was updated to reflect the removal of OpenAIResponsesModel. The code_retrieval.py file was updated to handle absolute and relative paths and to validate allowed paths. The directory_lister.py file was updated to validate allowed paths. The file_editor.py file was updated to include mode and to block write operations in query mode. The file_reader.py file was updated to include mode and allowed roots. The file_writer.py file was updated to include mode and to block write operations in query mode. The semantic_search.py file was updated to prioritize absolute paths. The types_defs.py file was updated to include path information and allowed roots. The path_utils.py file was updated to include path calculation and validation functions. The realtime_updater.py file was updated to use as_posix() for path handling. New tests were added for cross-project access. Review comments suggest removing the redundant project_name property from the Project node schema, considering the necessity of the path property on Class, Function, and Method nodes, parsing strings to Path objects in _parse_frozenset_of_strings, ensuring the return type of allowed_project_roots_set is frozenset[Path], renaming n.path to n.relative_path in Cypher queries, explicitly passing allowed_roots to FileEditor and FileWriter, avoiding redundant construction of allowed_roots in read_file, storing the absolute path in the CodeSnippet model, simplifying _get_safe_path in DirectoryLister, and initializing the allowed_roots attribute in FileEditor and FileWriter.
Additional Comments (4)
The relative-path fallback builds a path by combining Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/tools/code_retrieval.py
Line: 40:43
Comment:
**Fallback path join breaks**
The relative-path fallback builds a path by combining `self.project_root.as_posix()` (a `str`) with `file_path_str` using `/`, which will raise a `TypeError` at runtime when `absolute_path` is missing in the graph result (e.g., before reindex/migration). Build the fallback path using `Path` objects (e.g., `self.project_root / relative_path`) and only convert to POSIX string at the end. This bug will break snippet retrieval for any node without `absolute_path` populated.
How can I resolve this? If you propose a fix, please make it concise.
When Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/tools/semantic_search.py
Line: 90:93
Comment:
**Relative path fallback misresolves**
When `absolute_path` is absent, `get_function_source_code()` falls back to `relative_path` and forwards it into `validate_source_location()` without joining it to any deterministic base directory. That means reads become dependent on the current working directory (and will often fail or read the wrong file) until every graph is fully reindexed with `absolute_path`. Fix by either requiring `absolute_path` for this flow, or resolving relative paths against a known project root derived from `project_name`/allowed roots.
How can I resolve this? If you propose a fix, please make it concise.
The paginated Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/mcp/tools.py
Line: 395:398
Comment:
**Allowed roots bypass in paginated read**
The paginated `read_file` path validation checks only that `full_path` is under `self.project_root`, ignoring `ALLOWED_PROJECT_ROOTS` used elsewhere for cross-project access. In multi-root setups this will incorrectly reject valid files, and it also creates an inconsistent security boundary between paginated and non-paginated reads. Reuse the shared `validate_allowed_path()`/allowed-roots validation here as well.
How can I resolve this? If you propose a fix, please make it concise.
The user-facing message Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! Prompt To Fix With AIThis is a comment left during a code review.
Path: codebase_rag/tools/file_writer.py
Line: 25:28
Comment:
**Centralize repeated tool error**
The user-facing message `"Write operations are not allowed in query mode"` is a new hardcoded string in a tools module. This repo’s convention is to centralize user/tool error strings (e.g., `tool_errors.py` or `constants.py`) so the same message isn’t duplicated and can be reused consistently across tools.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise. |
…[str] to frozenset[Path] and simplify Path conversions
✅ Already Fixed
✅ Design Correct - No Changes Needed
This is a security-by-design choice to prevent accidental writes outside the project root.
This query is only used for single-project embedding generation (not exposed to users), so the shorter path alias is appropriate. Other queries use the more descriptive relative_path alias for external
Summary All critical issues have been addressed. The remaining items represent intentional design choices that improve security and code clarity. The codebase is now ready for merge. |
…by directly using self.file_reader.allowed_roots
|
@greptile |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces two significant features: cross-project file access by storing absolute paths in the graph, and a dual-mode (query/edit) system for the MCP server to enhance security. The changes are extensive and well-implemented across the codebase, including updates to parsers, Cypher queries, and tools. The documentation in the README has also been updated commendably to reflect these new features.
However, I've identified a critical security vulnerability where the FileEditor.replace_code_block method does not use the new ALLOWED_PROJECT_ROOTS validation, potentially allowing path traversal attacks in 'edit' mode. I have provided two critical comments with suggestions to address this issue.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and well-implemented feature to enable cross-project file access by storing absolute paths and project names in all graph nodes. The changes are comprehensive, consistently applied across the parsing, storage, querying, and tools layers. The introduction of a dual-mode (query/edit) MCP server is a great security enhancement. The code is generally of high quality, with path validation logic being thoughtfully centralized. I've identified one minor issue in a new utility function where an edge case could lead to a type violation, and I've provided a suggestion to make it more robust. Overall, this is an excellent contribution.
|
@vitali87
Please review the commits and merge if approved. |
|
@github-project-automation added this to @vitali87's graph code |
This change enables cross-project file access by storing absolute paths
and project names in all graph nodes, fixing the issue where queries
from different working directories couldn't locate files.
Core Changes
1. New Fields in All Nodes
absolute_path: POSIX format absolute path for file accessproject_name: Project name for fast cross-project filteringpath: Relative path (retained for display)2. Path Calculation Utility
codebase_rag/utils/path_utils.pywithcalculate_paths()os.path.abspath()3. Parser Layer Updates
calculate_paths()when creating nodesproject_name = "__external__"4. Query Layer Updates
absolute_path,path,project_nameFunction,Method,Classforproject_name5. Tools Layer Updates
CodeRetriever,FileEditor,FileReaderuseabsolute_pathPerformance Impact
Backward Compatibility
pathfield retained