PenguinChat: Mental Health Counseling Response Evaluation

A 2024 research project that evaluates and compares AI-generated therapeutic responses, with the goal of generating better responses than default GPT. OpenAI's default gpt-3.5-turbo is widely used for emotional support in daily life. However, it does not follow some established psychological principles. Moreover, it tends to apply CBT (cognitive behavioral therapy) when responding to users. These behaviors might be harmful for people who are unfamiliar with psychological treatment.

Please refer to the Thesis for detailed context and experiment results.

Overview

To build a valid responser LLM, we first need to build a valid scorer LLM. This idea was inspired by GANs, where a discriminator judges the output of a generator. You can download the labeled data used in this research to examine it.

PAIR: Prompt-Aware margIn Ranking for Counselor Reflection Scoring in Motivational Interviewing

Key Steps

1. Define evaluation criteria

Reflective Listening (SYS_REF)

✅ Good responses include:

Acknowledging client's perspective and feelings
Encouraging client sharing
Validation and support of client's experience
Active listening demonstration

❌ Poor responses include:

Lack of emotional reflection
Solution-focused rather than understanding-focused
Not engaging with client's perspective
Missed exploration opportunities

Non-Advice Assessment (SYS_ADV)

✅ Good responses avoid:

Giving unsolicited advice or specific directions
Telling clients what to do
Being authoritative or judgmental
Using imperative sentences

Response Length Monitoring: Limit each response to 100 words to make it more similar to human conversation. The default ChatGPT response is far too long for a single turn.
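The length criterion can be checked mechanically. The snippet below is a minimal sketch, assuming words are counted by splitting on whitespace; the function name `within_word_limit` and the sample replies are hypothetical and not taken from the project notebook.

```python
def within_word_limit(response: str, max_words: int = 100) -> bool:
    """Return True if the response has at most `max_words` whitespace-separated words."""
    return len(response.split()) <= max_words


# Invented example replies, for illustration only.
short_reply = "It sounds like this week has been really heavy for you."
long_reply = "word " * 150

print(within_word_limit(short_reply))
print(within_word_limit(long_reply))
```

Splitting on whitespace is a rough approximation; a tokenizer-based count would give different numbers.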

2. Examine the effectiveness of the scorer LLM

Assumption: The "lq5" (lowest-quality) response has no reflective listening, includes unsolicited advice, and uses closed questions. The "hq1" (highest-quality) response is the opposite. (sample size = 100)
Prompt LLM: Send the user message and the paired "lq5" and "hq1" labeled responses from the dataset, together with the prompt, to the GPT API, asking it to return a true/false verdict according to the criteria above.
Chi2 Test: Build a contingency table from the returned verdicts. Keep tuning the prompt until it passes the chi2 test.
Validated Scorer: If the scorer gives significantly different results for low-quality and high-quality responses, we consider it a valid scorer.
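The validation step can be illustrated with a small worked example. This is a sketch under stated assumptions, not the notebook's code: it computes the Pearson chi-squared statistic for a 2x2 table by hand, without a continuity correction, and the counts are hypothetical rather than results from the thesis. The source does not say which implementation was used; note that `scipy.stats.chi2_contingency` applies Yates' correction to 2x2 tables by default, so its statistic would be somewhat smaller.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    table = [[a, b], [c, d]]: rows are response groups (hq1, lq5),
    columns are scorer verdicts (true, false).
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts for 100 hq1 and 100 lq5 responses.
counts = [[90, 10],   # hq1: judged true / false
          [15, 85]]   # lq5: judged true / false
stat, p = chi2_2x2(counts)
print(f"chi2 = {stat:.2f}, p = {p:.3g}")
print("valid scorer" if p < 0.05 else "keep tuning the prompt")
```

If the scorer cannot tell the two groups apart, both rows look alike, the statistic approaches 0, and the p-value approaches 1.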

3. Get responses from three LLMs

PenguinChat (response_penguin): Specialized therapeutic chatbot using Rogerian/humanistic principles
Limited Response (response_limit): Standard responses with length constraints
Default Response (response_default): Baseline helpful assistant responses

4. Score responses and apply the chi2 test

Metric	PenguinChat	Limited Response	Default Response
Average Length	82.91 chars	93.19 chars	241.38 chars
Reflective Listening	100% ✅	65%	84%
No Advice	96% ✅	30%	20%

All comparisons showed significant differences (p < 0.05) using chi-squared tests, confirming that specialized therapeutic prompting produces measurably better responses according to counseling best practices.

Data Structure

The project uses a dataset (pair_data.csv) containing:

prompt: Client messages/concerns
hq1, hq2: High-quality reference responses
mq1: Medium-quality reference responses
lq1-lq5: Low-quality reference responses
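The column layout above can be sketched as follows. This is an illustration only: the one-row CSV is an invented stand-in for pair_data.csv, `load_rows` and `get_pair_sketch` are hypothetical names, and the standard-library `csv` module is used in place of whatever DataFrame code the notebook contains.

```python
import csv
import io

# Invented stand-in for pair_data.csv; the texts are not from the real dataset.
SAMPLE_CSV = """prompt,hq1,hq2,mq1,lq1,lq2,lq3,lq4,lq5
I can't sleep lately.,That sounds exhausting.,You've been struggling to rest.,Sleep is hard.,Try tea.,Go to bed earlier.,Stop using your phone.,Just relax.,You should see a doctor.
"""

EXPECTED_COLUMNS = ["prompt", "hq1", "hq2", "mq1"] + [f"lq{i}" for i in range(1, 6)]


def load_rows(handle):
    """Read the CSV into a list of dicts and check that all nine columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def get_pair_sketch(rows, row_index, col):
    """Return (client message, reference response) for one row and one quality column."""
    row = rows[row_index]
    return row["prompt"], row[col]


rows = load_rows(io.StringIO(SAMPLE_CSV))
client, low = get_pair_sketch(rows, 0, "lq5")
print(client)
print(low)
```

To read the real file, `load_rows(open("pair_data.csv", newline="", encoding="utf-8"))` would be the equivalent call, assuming the file sits in the working directory and is UTF-8 encoded.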

Core Functions

get_pair(df, row_index, col): Extracts client-response pairs from dataset
scorer(df, size, col, SYS_content): Automated response evaluation using GPT-3.5
responser(df, size, system, temperature, col): Generates responses using specified system prompts
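To show how a scorer of this shape might fit together, here is a hedged sketch. It is not the project's `scorer`: the extra `ask_llm` parameter, the message layout, the verdict parsing, and the `fake_llm` stand-in are all assumptions made so the example runs offline. In the project the model call goes to GPT-3.5.

```python
def parse_verdict(raw: str) -> bool:
    """Map a model reply to a boolean; anything not starting with 'true' counts as False."""
    return raw.strip().lower().startswith("true")


def scorer_sketch(rows, size, col, sys_content, ask_llm):
    """Score the first `size` rows of column `col` against one criterion prompt.

    `rows` is a list of dicts with a "prompt" key and a `col` key.
    `ask_llm(system, user)` is any callable returning the model's text reply.
    """
    verdicts = []
    for row in rows[:size]:
        user_content = f"Client: {row['prompt']}\nCounselor: {row[col]}"
        verdicts.append(parse_verdict(ask_llm(sys_content, user_content)))
    return verdicts


def fake_llm(system, user):
    # Offline stand-in for the API call: flags replies containing "should" as advice.
    return "False" if "should" in user else "True"


# Invented example row.
demo_rows = [{"prompt": "I feel stuck.", "hq1": "I hear you.", "lq5": "You should rest."}]
print(scorer_sketch(demo_rows, 1, "hq1", "SYS_ADV placeholder", fake_llm))
print(scorer_sketch(demo_rows, 1, "lq5", "SYS_ADV placeholder", fake_llm))
```

Passing the model call in as a parameter keeps the scoring loop testable without network access; the resulting lists of booleans are what a contingency table would be built from.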

Research Applications

This framework can be used to:

Evaluate therapeutic chatbot quality
Develop training datasets for therapeutic AI systems

Future Enhancements

Multi-turn conversation support
Have the responser actively respond with open questions

Keywords

Humanistic Framework, Carl Rogers, Reflection, Non-directive therapy, Automated evaluation, Therapeutic AI
