A research project in 2024 evaluates and compares different AI-generated therapeutic responses, finally generate a better response than default GPT. The default gpt-3.5-turbo by Openai was used a lot for emotional support in our daily lives. However, it was not following some examined pyschologic principals. Moreover, it tends to practice CBT (Cognitive behavioral therapy) when respond to user. Those features might be harmful for people who are unfamiliar to psychological treatment.
Please refer to Thesis for detailed context and experiment result.
To have a valid responser llm, we need to first build a valid scorer llm. This idea was inspired by GANs. You can download the labeled data I used in this research to examine it.
PAIR: Prompt-Aware margIn Ranking for Counselor Reflection Scoring in Motivational Interviewing
Reflective Listening (SYS_REF)
✅ Good responses include:
- Acknowledging client's perspective and feelings
- Encouraging client sharing
- Validation and support of client's experience
- Active listening demonstration
❌ Poor responses include:
- Lack of emotional reflection
- Solution-focused rather than understanding-focused
- Not engaging with client's perspective
- Missed exploration opportunities
Non-Advice Assessment (SYS_ADV)
✅ Good responses avoid:
- Giving unsolicited advice or specific directions
- Telling clients what to do
- Being authoritative or judgmental
- Using imperative sentences
- Response Length Monitoring: Try to limit the response within 100 words to make it more simialr to human conversation. The default chatgpt response is way to long in one single turn.
- Assumption: The "lq5" (most low quality) response has no reflective listening, including unsolicited advice, use closed question. The "hq1" (most high quality) response is the opposite. (sample size=100)
- Prompt LLM: Send the user message and the paired "lq5" and "hq1" labeled response in the dataset, together with the prompt to gpt api, asking it to return true/false response following the criterias mentioned above.
- Chi2 Test: Build a contingency table based on the response returned. Keep tuning the prompt until it passed chi2 test.
- Validated Scorer: If the score give significant different scoring results for low quality response and high quality response, we consider it as a valid scorer.
- PenguinChat (response_penguin): Specialized therapeutic chatbot using Rogerian/humanistic principles
- Limited Response (response_limit): Standard responses with length constraints
- Default Response (response_default): Baseline helpful assistant responses
| Metric | PenguinChat | Limited Response | Default Response |
|---|---|---|---|
| Average Length | 82.91 chars | 93.19 chars | 241.38 chars |
| Reflective Listening | 100% ✅ | 65% | 84% |
| No Advice | 96% ✅ | 30% | 20% |
All comparisons showed significant differences (p < 0.05) using chi-squared tests, confirming that specialized therapeutic prompting produces measurably better responses according to counseling best practices.
The project uses a dataset (pair_data.csv) containing:
- prompt: Client messages/concerns
- hq1, hq2: High-quality reference responses
- mq1: Medium-quality reference responses
- lq1-lq5: Low-quality reference responses
get_pair(df, row_index, col): Extracts client-response pairs from datasetscorer(df, size, col, SYS_content): Automated response evaluation using GPT-3.5responser(df, size, system, temperature, col): Generates responses using specified system prompts
This framework can be used to:
- Evaluate therapeutic chatbot quality
- Develop training datasets for therapeutic AI systems
- Multi-turn conversation support
- Make responser actively response with Open Questions.
Humanistic Framework, Carl Rogers, Reflection, Non-directive therapy, Automated evaluation, Therapeutic AI