ZurichParis/penguinchat

PenguinChat: Mental Health Counseling Response Evaluation

A 2024 research project that evaluates and compares AI-generated therapeutic responses, with the goal of generating better responses than default GPT. OpenAI's default gpt-3.5-turbo has been widely used for emotional support in daily life. However, it does not follow some established psychological principles. Moreover, it tends to practice CBT (cognitive behavioral therapy) when responding to users. These tendencies might be harmful for people who are unfamiliar with psychological treatment.

Please refer to the thesis for detailed context and experimental results.

Overview

To have a valid responser LLM, we first need to build a valid scorer LLM. This idea was inspired by GANs. You can download the labeled data used in this research to examine it.

PAIR: Prompt-Aware margIn Ranking for Counselor Reflection Scoring in Motivational Interviewing

Key Steps

1. Define the evaluation criteria

Reflective Listening (SYS_REF)

Good responses include:

  • Acknowledging client's perspective and feelings
  • Encouraging client sharing
  • Validation and support of client's experience
  • Active listening demonstration

Poor responses include:

  • Lack of emotional reflection
  • Solution-focused rather than understanding-focused
  • Not engaging with client's perspective
  • Missed exploration opportunities

Non-Advice Assessment (SYS_ADV)

Good responses avoid:

  • Giving unsolicited advice or specific directions
  • Telling clients what to do
  • Being authoritative or judgmental
  • Using imperative sentences

Response Length Monitoring

Try to limit each response to 100 words so that it reads more like human conversation. The default ChatGPT response is far too long for a single turn.
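The 100-word limit can be checked mechanically. The following is a minimal sketch, assuming that whitespace-separated tokens are an acceptable proxy for words; the function names `word_count` and `within_limit` are hypothetical and not part of the project's code.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens as a rough proxy for words."""
    return len(text.split())


def within_limit(text: str, max_words: int = 100) -> bool:
    """Return True when the response stays within the word limit.

    The default of 100 comes from the guideline above; the helper itself
    is an illustrative assumption, not the project's implementation.
    """
    return word_count(text) <= max_words


short_reply = "It sounds like this week has been really heavy for you."
long_reply = "word " * 101  # 101 tokens, just over the limit

print(within_limit(short_reply))  # True
print(within_limit(long_reply))   # False
```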

2. Examine the effectiveness of the scorer LLM

  • Assumption: The "lq5" (lowest-quality) response contains no reflective listening, includes unsolicited advice, and uses closed questions. The "hq1" (highest-quality) response is the opposite. (Sample size = 100.)
  • Prompt LLM: Send the user message and its paired "lq5" and "hq1" labeled responses from the dataset, together with the prompt, to the GPT API, asking it to return a true/false judgment according to the criteria above.
  • Chi2 Test: Build a contingency table from the returned judgments. Keep tuning the prompt until the scorer passes the chi-squared test.
  • Validated Scorer: If the scorer gives significantly different results for low-quality and high-quality responses, we consider it a valid scorer.
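The validation step can be sketched as a 2×2 chi-squared test on the scorer's true/false judgments. This is a minimal, dependency-free illustration rather than the project's actual code: the function name `chi2_2x2` is hypothetical, the counts in the example are invented purely to show the call, and the test is the plain Pearson statistic without continuity correction (note that `scipy.stats.chi2_contingency` applies Yates' correction to 2×2 tables by default, so its numbers would differ slightly).

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-squared test for the table [[a, b], [c, d]].

    Rows: response type (e.g. hq1, lq5). Columns: scorer said True / False.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts for illustration only (100 hq1 and 100 lq5 responses):
# hq1 judged True 90 times, lq5 judged True 20 times.
stat, p_value = chi2_2x2(90, 10, 20, 80)
scorer_is_valid = p_value < 0.05
print(round(stat, 2), scorer_is_valid)
```

If the p-value stays above 0.05, the scorer is not separating the two quality levels and the prompt would be tuned again, as described in the list above.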

3. Get responses from three LLMs

  • PenguinChat (response_penguin): Specialized therapeutic chatbot using Rogerian/humanistic principles
  • Limited Response (response_limit): Standard responses with length constraints
  • Default Response (response_default): Baseline helpful assistant responses

4. Score responses and apply the chi-squared test

| Metric | PenguinChat | Limited Response | Default Response |
| --- | --- | --- | --- |
| Average Length | 82.91 chars | 93.19 chars | 241.38 chars |
| Reflective Listening | 100% ✅ | 65% | 84% |
| No Advice | 96% ✅ | 30% | 20% |

All comparisons showed significant differences (p < 0.05) using chi-squared tests, confirming that specialized therapeutic prompting produces measurably better responses according to counseling best practices.
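One way to test whether pass rates differ across the three response types is a chi-squared test on a 2×3 pass/fail table. The sketch below is an illustration under stated assumptions, not the analysis used in the thesis: the function name is hypothetical, and the group size of 100 per system is an assumption (the source gives a sample size of 100 only for the scorer validation step). The "No Advice" rates are converted to counts under that assumption.

```python
import math


def chi2_pass_fail_three_groups(passes, totals):
    """Pearson chi-squared test on pass/fail counts for three groups.

    Returns (statistic, p_value) with 2 degrees of freedom.
    """
    if len(passes) != 3 or len(totals) != 3:
        raise ValueError("expected exactly three groups")
    if any(t <= 0 for t in totals):
        raise ValueError("each group needs at least one response")
    fails = [t - p for p, t in zip(passes, totals)]
    n = sum(totals)
    total_pass, total_fail = sum(passes), sum(fails)
    if total_pass == 0 or total_fail == 0:
        return 0.0, 1.0  # no variation in outcomes, nothing to test
    stat = 0.0
    for p, f, t in zip(passes, fails, totals):
        expected_pass = t * total_pass / n
        expected_fail = t * total_fail / n
        stat += (p - expected_pass) ** 2 / expected_pass
        stat += (f - expected_fail) ** 2 / expected_fail
    # For 2 degrees of freedom, P(X > stat) = exp(-stat / 2).
    return stat, math.exp(-stat / 2)


# "No Advice" pass counts, ASSUMING 100 scored responses per system.
stat, p_value = chi2_pass_fail_three_groups([96, 30, 20], [100, 100, 100])
print(p_value < 0.05)  # True
```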

Data Structure

The project uses a dataset (pair_data.csv) containing:

  • prompt: Client messages/concerns
  • hq1, hq2: High-quality reference responses
  • mq1: Medium-quality reference responses
  • lq1-lq5: Low-quality reference responses
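The column layout above can be read as follows. This is a minimal sketch using only the standard library: the sample row is invented placeholder text, `load_rows` is a hypothetical helper, and `get_pair` here works on a list of dictionaries rather than the DataFrame the project's own function receives.

```python
import csv
import io

# Column names taken from the list above.
COLUMNS = ["prompt", "hq1", "hq2", "mq1", "lq1", "lq2", "lq3", "lq4", "lq5"]


def load_rows(handle):
    """Read the CSV into a list of dicts and check the expected columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def get_pair(rows, row_index, col):
    """Return (client prompt, reference response) for one row and column."""
    row = rows[row_index]
    return row["prompt"], row[col]


# Placeholder data standing in for pair_data.csv (not real dataset content).
sample = io.StringIO(
    "prompt,hq1,hq2,mq1,lq1,lq2,lq3,lq4,lq5\n"
    "p0,h1,h2,m1,l1,l2,l3,l4,l5\n"
)
rows = load_rows(sample)
print(get_pair(rows, 0, "hq1"))  # ('p0', 'h1')
```

With the real file, `load_rows(open("pair_data.csv", newline="", encoding="utf-8"))` would take the place of the in-memory sample, assuming the file is UTF-8 encoded.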

Core Functions

  • get_pair(df, row_index, col): Extracts client-response pairs from dataset
  • scorer(df, size, col, SYS_content): Automated response evaluation using GPT-3.5
  • responser(df, size, system, temperature, col): Generates responses using specified system prompts
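To show how a scorer of this shape might fit together, here is a rough sketch, not the project's implementation. It assumes rows are plain dictionaries, that the model is asked to answer with the word "True" or "False", and that the model call is passed in as a function (`ask_llm`, a hypothetical parameter that the real `scorer` signature does not have) so the example runs without an API key. The `fake_llm` stand-in and the system prompt text are invented for the demonstration.

```python
from typing import Callable, Dict, List


def scorer(
    rows: List[Dict[str, str]],
    size: int,
    col: str,
    sys_content: str,
    ask_llm: Callable[[str, str], str],
) -> List[bool]:
    """Score the first `size` responses in column `col` as True/False."""
    verdicts = []
    for row in rows[:size]:
        user_content = f"Client: {row['prompt']}\nCounselor: {row[col]}"
        reply = ask_llm(sys_content, user_content)
        # Treat any reply that starts with "true" (case-insensitive) as a pass.
        verdicts.append(reply.strip().lower().startswith("true"))
    return verdicts


def fake_llm(system: str, user: str) -> str:
    """Stand-in for a chat-completion call; flags replies containing 'you should'."""
    return "False" if "you should" in user.lower() else "True"


# Hypothetical system prompt and invented example row.
SYS_ADV = "Answer True if the counselor avoids unsolicited advice, else False."
demo_rows = [
    {
        "prompt": "I feel stuck at work.",
        "hq1": "It sounds like work has left you feeling stuck.",
        "lq5": "You should just find a new job.",
    }
]
print(scorer(demo_rows, 1, "hq1", SYS_ADV, fake_llm))  # [True]
print(scorer(demo_rows, 1, "lq5", SYS_ADV, fake_llm))  # [False]
```

Passing the model call in as a parameter keeps the scoring loop testable offline; a real run would replace `fake_llm` with a function that calls the chat API.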

Research Applications

This framework can be used to:

  • Evaluate therapeutic chatbot quality
  • Develop training datasets for therapeutic AI systems

Future Enhancements

  • Multi-turn conversation support
  • Make the responser actively respond with open questions

Keywords

Humanistic Framework, Carl Rogers, Reflection, Non-directive therapy, Automated evaluation, Therapeutic AI

About

To improve ChatGPT's ability to respond with reflection while refraining from unsolicited advice
