fix: improve Cypher query generation accuracy #294

Open
hupeky wants to merge 1 commit into vitali87:main from hupeky:fix/improve-cypher-generation

Conversation

@hupeky hupeky commented Feb 2, 2026

Summary

  • Improve LLM prompt to teach correct schema patterns
  • Add example for DEFINES_METHOD relationship
  • Fix response cleaner to handle markdown formatting

Problem

The LLM sometimes generates incorrect Cypher queries because it doesn't understand:

  • When to use name vs qualified_name properties
  • The correct relationship types (DEFINES_METHOD, DEFINES)
  • Proper Cypher patterns for this specific graph schema

Changes

  1. cypher_queries.py: Added CYPHER_EXAMPLE_CLASS_METHODS example
  2. prompts.py: Added VALUE PATTERN RULES explaining name matching
  3. llm.py: Improved _clean_cypher_response() to handle markdown code blocks

Testing

Tested with codellama model - queries now generate correctly for class method lookups and other common patterns.

🤖 Generated with Claude Code

@gemini-code-assist

Summary of Changes

Hello @hupeky, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the accuracy of Cypher query generation by the Language Model (LLM). It achieves this by providing more explicit guidance within the LLM prompts regarding schema interpretation, particularly for property usage and relationship types. Additionally, it bolsters the system's ability to parse and extract valid Cypher queries from diverse LLM output formats, making the overall process more reliable.

Highlights

  • Improved LLM Prompting for Cypher Queries: New VALUE PATTERN RULES have been added to the LLM prompt to guide the model on correctly using name versus qualified_name properties and the DEFINES_METHOD and DEFINES relationships, addressing common misunderstandings in query generation.
  • Enhanced Cypher Query Examples: A new example, CYPHER_EXAMPLE_CLASS_METHODS, has been introduced to demonstrate how to query for methods of a class, further refining the LLM's understanding of the graph schema.
  • Robust Cypher Response Cleaning: The _clean_cypher_response function has been significantly improved to handle various markdown formats (e.g., triple backticks, bold headers) that LLMs might use when returning Cypher queries, ensuring accurate extraction of the query string.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively improves the accuracy of Cypher query generation by enhancing the prompts with clearer rules and examples, and by making the response cleaning logic more robust. The changes in prompts.py and cypher_queries.py are well-implemented and align with the PR's objectives. I have a few suggestions for the _clean_cypher_response function in codebase_rag/services/llm.py to align it with project standards and further improve its robustness.

Comment on lines +44 to +51
    else:
        # Remove markdown bold/headers (e.g., **Cypher Query:**)
        query = re.sub(r"\*\*[^*]+\*\*:?\s*", "", query)
        # Remove single backticks
        query = query.replace(cs.CYPHER_BACKTICK, "")
        # Remove "cypher" prefix if present
        if query.lower().startswith(cs.CYPHER_PREFIX):
            query = query[len(cs.CYPHER_PREFIX):].strip()

high

The current logic in the else block may not correctly handle all cases with leading whitespace, and the cleaning steps could be ordered for better robustness. For example, a response like **Cypher Query:** MATCH (n) would result in a query with a leading space: MATCH (n);, which could cause execution to fail. This refactoring handles whitespace more consistently and correctly identifies and removes the cypher prefix even if it has leading spaces.

Suggested change
-    else:
-        # Remove markdown bold/headers (e.g., **Cypher Query:**)
-        query = re.sub(r"\*\*[^*]+\*\*:?\s*", "", query)
-        # Remove single backticks
-        query = query.replace(cs.CYPHER_BACKTICK, "")
-        # Remove "cypher" prefix if present
-        if query.lower().startswith(cs.CYPHER_PREFIX):
-            query = query[len(cs.CYPHER_PREFIX):].strip()
+    else:
+        # Remove markdown bold/headers (e.g., **Cypher Query:**)
+        query = re.sub(r"\*\*[^*]+\*\*:?\s*", "", query)
+        # Remove "cypher" prefix if present
+        if query.lower().strip().startswith(cs.CYPHER_PREFIX):
+            query = query[query.lower().find(cs.CYPHER_PREFIX) + len(cs.CYPHER_PREFIX):]
+        # Remove single backticks and strip any remaining whitespace
+        query = query.replace(cs.CYPHER_BACKTICK, "").strip()
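
For readers who want to try the non-fenced branch in isolation, here is a minimal standalone sketch. It is not the PR's code: `clean_inline` is a hypothetical name, and the values of `CYPHER_PREFIX` and `CYPHER_BACKTICK` are assumptions standing in for the `cs.*` constants, whose definitions are not shown in this conversation. This variant removes backticks and trims before the prefix check, so a backtick-wrapped `cypher` tag is also caught, which is a different ordering from the suggestion above.

```python
import re

CYPHER_PREFIX = "cypher"  # assumed value of cs.CYPHER_PREFIX
CYPHER_BACKTICK = "`"     # assumed value of cs.CYPHER_BACKTICK


def clean_inline(query: str) -> str:
    # Drop bold markdown labels such as **Cypher Query:** and the whitespace after them
    query = re.sub(r"\*\*[^*]+\*\*:?\s*", "", query)
    # Drop single backticks, then trim so the prefix check sees the first real token
    query = query.replace(CYPHER_BACKTICK, "").strip()
    # Drop a leading "cypher" tag regardless of its casing
    if query.lower().startswith(CYPHER_PREFIX):
        query = query[len(CYPHER_PREFIX):].strip()
    return query


print(clean_inline("**Cypher Query:** `MATCH (n) RETURN n`"))  # prints "MATCH (n) RETURN n"
```

Trimming once before the prefix check and once after it keeps every path whitespace-free without needing `find()`.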

Comment on lines +29 to +35
"""Clean LLM response to extract pure Cypher query.

Handles markdown formatting that models sometimes output:
- Triple backticks (```cypher ... ```)
- Bold text (**Cypher Query:**)
- Headers and other markdown
"""

medium

According to the project's general rules, docstrings are not allowed. Please remove this docstring to adhere to the project's coding standards.

References
  1. Docstrings are not allowed in this project, as enforced by a pre-commit hook.

- Bold text (**Cypher Query:**)
- Headers and other markdown
"""
import re

medium

Per PEP 8, imports should be at the top of the file. Please remove this import from here and add import re to the top-level imports section of the file.

References
  1. PEP 8: Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. (link)

This PR addresses issues where the LLM generates incorrect Cypher queries
due to misunderstanding the graph schema.

Changes:
- Add CYPHER_EXAMPLE_CLASS_METHODS to demonstrate DEFINES_METHOD pattern
- Add VALUE PATTERN RULES to prompts explaining name vs qualified_name usage
- Improve _clean_cypher_response() to handle markdown formatting in LLM output

The prompt improvements teach the LLM to:
- Use `name` property for short class/function names (not qualified_name)
- Use correct relationships (DEFINES_METHOD, DEFINES)
- Follow proper Cypher patterns for this schema

The response cleaner now handles:
- Triple backtick code blocks (```cypher ... ```)
- Bold markdown headers (**Cypher Query:**)
- Mixed formatting in LLM responses

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@greptile-apps

greptile-apps bot commented Feb 2, 2026

Greptile Overview

Greptile Summary

This PR improves Cypher query generation accuracy by teaching the LLM to correctly use the graph schema properties and relationships.

Key Changes:

  • Added CYPHER_EXAMPLE_CLASS_METHODS example demonstrating the DEFINES_METHOD relationship pattern for querying class methods
  • Added VALUE PATTERN RULES section to prompts explaining when to use name vs qualified_name properties (critical for short name matching)
  • Enhanced _clean_cypher_response() function to handle markdown formatting in LLM outputs (triple backtick code blocks, bold headers)

Impact:
The prompt improvements address a core issue where LLMs would incorrectly use qualified_name for short class/function names (e.g., WHERE c.qualified_name = 'UserService' instead of the correct WHERE c.name = 'UserService'). The qualified_name property contains full paths like 'Project.folder.subfolder.ClassName', so matching against short names would always fail.
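
To make the failure mode concrete, here is a minimal Python sketch. The node record, the `match_exact` helper, and the `Class`/`Method` labels in the query string are illustrative assumptions, not the project's actual schema or the contents of CYPHER_EXAMPLE_CLASS_METHODS; only the `name`, `qualified_name`, and DEFINES_METHOD identifiers come from this PR.

```python
# Hypothetical node record; the property values are illustrative only.
NODES = [
    {
        "label": "Class",
        "name": "UserService",
        "qualified_name": "Project.folder.subfolder.UserService",
    },
]


def match_exact(nodes, prop, value):
    # Mimics a Cypher equality filter such as: WHERE c.<prop> = '<value>'
    return [node for node in nodes if node.get(prop) == value]


# Sketch of a class-methods lookup; the Class and Method labels are assumed.
CLASS_METHODS_QUERY = """
MATCH (c:Class)-[:DEFINES_METHOD]->(m:Method)
WHERE c.name = 'UserService'
RETURN m.name AS method
"""

by_qualified = match_exact(NODES, "qualified_name", "UserService")
by_name = match_exact(NODES, "name", "UserService")
print(len(by_qualified), len(by_name))  # prints "0 1"
```

The equality filter on `qualified_name` returns nothing because the stored value is the full dotted path, while the same filter on `name` finds the class.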

The enhanced response cleaner now correctly extracts Cypher queries from markdown-formatted LLM responses, improving robustness across different LLM providers and output formats.
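
As a rough illustration of the fenced-block case, the sketch below pulls a query out of a triple-backtick block and appends a terminating semicolon if one is missing. The function name `extract_fenced_query` and the regular expression are assumptions made for this example; the actual `_clean_cypher_response()` implementation is not shown in full in this conversation and may differ.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built at runtime so this snippet embeds cleanly

# Opening fence, optional "cypher" language tag, the query body, closing fence.
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:cypher)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def extract_fenced_query(response: str) -> Optional[str]:
    match = FENCED_BLOCK.search(response)
    if match is None:
        return None  # no fenced block; a caller would fall back to inline cleaning
    query = match.group(1).strip()
    # Terminate the statement if the model left the semicolon off
    if not query.endswith(";"):
        query += ";"
    return query


reply = "Here is the query:\n" + FENCE + "cypher\nMATCH (n) RETURN n\n" + FENCE
print(extract_fenced_query(reply))  # prints "MATCH (n) RETURN n;"
```

Returning `None` when no fence is found keeps the fenced and non-fenced paths separate, which mirrors the if/else structure discussed in the review comments.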

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended
  • The changes are well-targeted improvements to prompt engineering and response parsing. The logic is sound and addresses a real issue with name matching in Cypher queries. Existing tests cover the core functionality of _clean_cypher_response(). The only concerns are minor style issues (import placement and case handling consistency) that don't affect correctness.
  • codebase_rag/services/llm.py could benefit from moving the re import to module-level, but this is a minor style issue

Important Files Changed

| Filename | Overview |
|---|---|
| codebase_rag/services/llm.py | Enhanced Cypher response cleaning with markdown handling - import statement placement could be improved |
| codebase_rag/prompts.py | Added VALUE PATTERN RULES and class methods example to improve LLM query generation accuracy |
| codebase_rag/cypher_queries.py | Added CYPHER_EXAMPLE_CLASS_METHODS to demonstrate DEFINES_METHOD pattern usage |

Sequence Diagram

sequenceDiagram
    participant User
    participant CypherGenerator
    participant Agent
    participant LLM
    participant CleanFunction as _clean_cypher_response

    User->>CypherGenerator: "generate('What methods does UserService have?')"
    CypherGenerator->>Agent: "run(natural_language_query)"
    Agent->>LLM: "Send prompt with system prompt and examples"
    Note over LLM: Uses VALUE PATTERN RULES<br/>Match by name property<br/>Use DEFINES_METHOD relationship
    LLM-->>Agent: "Response with markdown formatting"
    Agent-->>CypherGenerator: "result.output"
    CypherGenerator->>CleanFunction: "Clean markdown formatting"
    Note over CleanFunction: Extract from code blocks<br/>Remove bold headers<br/>Add semicolon
    CleanFunction-->>CypherGenerator: "Clean Cypher query"
    CypherGenerator-->>User: "Valid Cypher query"

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 2 comments

- Bold text (**Cypher Query:**)
- Headers and other markdown
"""
import re

Move import re to top-level imports (after line 1). Module-level imports belong with stdlib imports at the file top, not inside functions.

Suggested change
-    import re
+    """Clean LLM response to extract pure Cypher query.
+
+    Handles markdown formatting that models sometimes output:
+    - Triple backticks (```cypher ... ```)
+    - Bold text (**Cypher Query:**)
+    - Headers and other markdown
+    """

Path: codebase_rag/services/llm.py, line 36

Comment on lines +50 to +51
        if query.lower().startswith(cs.CYPHER_PREFIX):
            query = query[len(cs.CYPHER_PREFIX):].strip()

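Case-handling inconsistency: the check lowercases the query but compares it against cs.CYPHER_PREFIX as written. If cs.CYPHER_PREFIX = "cypher" and the query is "CYPHER MATCH...", slicing off len("cypher") (6 characters) from "CYPHER MATCH..." works correctly, because the length does not depend on case. The check would only fail to match if the constant itself contained uppercase characters, so for safety and clarity, lowercase both sides of the comparison.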

Suggested change
-        if query.lower().startswith(cs.CYPHER_PREFIX):
-            query = query[len(cs.CYPHER_PREFIX):].strip()
+        if query.lower().startswith(cs.CYPHER_PREFIX.lower()):
+            query = query[len(cs.CYPHER_PREFIX):].strip()
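
A small sketch of the case-insensitive prefix strip described in this comment. The helper name `strip_prefix` and the sample prefix values are hypothetical, chosen only to show that the slice length is unaffected by casing while the comparison is not.

```python
def strip_prefix(query: str, prefix: str) -> str:
    # Compare both sides in lowercase so neither the query's nor the constant's casing matters
    if query.lower().startswith(prefix.lower()):
        # len() is the same for "cypher", "Cypher", and "CYPHER", so the slice is safe
        return query[len(prefix):].strip()
    return query


print(strip_prefix("CYPHER MATCH (n)", "cypher"))  # prints "MATCH (n)"
print(strip_prefix("cypher MATCH (n)", "Cypher"))  # prints "MATCH (n)"
```

Without the `prefix.lower()` call, the second example would be returned unchanged, because the lowercased query can never start with a mixed-case prefix.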

@hupeky hupeky force-pushed the fix/improve-cypher-generation branch from c24af15 to 232fee8 Compare February 2, 2026 08:52