AI Summarization and Classification for Chats (#2641) #2646

Open
mbichara wants to merge 39 commits into sepinf-inc:master from mbichara:add-aisummarizationtask

Conversation

@mbichara

@mbichara mbichara commented Oct 7, 2025

Contributor

This is a work in progress.

The AISummarizationTask sends WhatsApp chat contents to a remote service (AI middleware service) and stores returned textual summaries on each item’s extra attributes.

In the analysis interface, when the user clicks on an item that has "summaries" attributes, these are rendered in a "Summary" tab next to the Preview tab at the bottom right. The Summary tab should be hidden otherwise. However, I noticed this is still buggy when scrolling through chats and needs some work.

The idea is also to support summarization of other content types later. I started looking into adding support for UFED chats, and will check the recent changes made by @aberenguel to the UFED chats parser.

@mbichara mbichara force-pushed the add-aisummarizationtask branch from 1bb7f45 to 8c6e63d October 7, 2025 17:01
@mbichara mbichara force-pushed the add-aisummarizationtask branch from 995ce0e to 70c7dba December 10, 2025 19:48
@mbichara mbichara marked this pull request as ready for review December 12, 2025 19:01
@wladimirleite wladimirleite changed the title from "#2641: AISummarization for WhatsApp chats - first commit" to "AISummarization for WhatsApp chats (#2641)" Jan 17, 2026
@wladimirleite wladimirleite marked this pull request as draft January 20, 2026 15:30
@wladimirleite

Member

As I mentioned in #2641, this will be a great addition to the project.
I have a few comments/suggestions:

  1. Change the property name from "ai:summaries" to "ai:summary". Although there may be several summary values (for longer chats, I guess), most multivalued property names are singular.
  2. Localize the "AI-generated summaries. Check all information" message.
  3. Localize the "Summary" tab title.
  4. It seems that "ai:chunk_ids" is used for internal control only. If that is the case, make it a temporary attribute so it won't be added to the case and visible in the advanced properties metadata tab.
  5. Allow hit navigation (left/right arrow icons) in the Summary viewer.
  6. Allow "search in viewer" in the Summary viewer.
  7. Currently, the "Summary" tab is only visible if the selected item has a summary. While this makes sense, it can be inconvenient in practice. Even if the tab is pinned, selecting a chat without a summary hides the tab. When subsequently selecting a chat with a summary, the tab reappears but loses its pinned state (in the default layout). I propose adding a check when the case is opened: if there is any item with a summary, the tab should be visible. That would be more consistent with other viewers (e.g., the preview tab is always present, even for items with no preview available).
  8. When processing, skip chats if "Communication:isEmpty: true".
  9. Show "Summarized Chats" in the AI panel.
  10. Show chats in the AI panel grouped by question score (very high, high, medium, low, very low?). This is not trivial and will require some changes in the AI panel code, as the questions can be customized.
  11. Support "application/x-telegram-chat" and "application/x-threema-chat" content types. Maybe the parameter "enableWhatsAppSummarization" could be changed to "enableInternalChatSummarization".
  12. Normalize the "header" of HTML summaries. Not sure if this is feasible, but I observed that each part of the summary has a "header" with its period and the participants involved. However, the format varies: sometimes it uses "**" (I guess to highlight it); sometimes it shows "Participantes" (in Portuguese) and in others "Interlocutores"; sometimes a "|" is used instead of placing "period" and "participants" on separate lines; and sometimes there are no labels, just a textual description. The following image shows some samples (all summaries belong to the same chat).
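
Items 8 and 11 above both concern which chats are sent for summarization. A minimal sketch of such a selection check could look like the following. It is not taken from `AISummarizationTask.py`: the function name, the attribute dictionary, and the WhatsApp content type string are assumptions for illustration; only the Telegram and Threema content types and the "Communication:isEmpty" attribute name come from the list above.

```python
# Hypothetical sketch: decide whether a chat item should be summarized.
SUPPORTED_CHAT_TYPES = {
    "application/x-whatsapp-chat",  # assumed name, not stated in this thread
    "application/x-telegram-chat",
    "application/x-threema-chat",
}


def should_summarize(media_type, attributes):
    """Return True for supported chat types that are not flagged as empty."""
    if media_type not in SUPPORTED_CHAT_TYPES:
        return False
    # The flag may arrive as a bool or as the string "true"; handle both.
    is_empty = attributes.get("Communication:isEmpty", False)
    if isinstance(is_empty, str):
        is_empty = is_empty.strip().lower() == "true"
    return not is_empty


if __name__ == "__main__":
    print(should_summarize("application/x-telegram-chat", {}))
    print(should_summarize("application/x-telegram-chat",
                           {"Communication:isEmpty": "true"}))
```

Keeping the check in one small function would make it easy to extend the set of content types later without touching the rest of the task.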

@wladimirleite

Member

@mbichara, not sure if we can do anything about 12.
Please take a look at this item, as it would require some changes in "AISummarizationTask.py" and/or the server-side code.

I can try to deal with the others, if you agree with them.

@mbichara

mbichara commented Jan 20, 2026

Contributor Author

Hi @wladimirleite!

Thank you for the valuable comments and suggestions, and for helping with this.

I agree on all of them.

About 12, I believe it is fairly doable to generate the "header" of the answer in a standard way.
I can adapt the prompt on the server side to instruct the LLM to answer in a structured form, then check whether the header is correct and regenerate the answer if it is not.
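
The check-and-regenerate idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the canonical header layout ("Period:" and "Participants:" on separate lines followed by a blank line), the `generate` callable standing in for the LLM call, and the retry limit are all hypothetical and not the actual server-side code.

```python
import re

# Assumed canonical header: two labeled lines, a blank line, then the body.
HEADER_RE = re.compile(r"\APeriod: .+\nParticipants: .+\n\n")


def has_standard_header(answer):
    """Return True if the answer starts with the assumed header layout."""
    return HEADER_RE.match(answer) is not None


def summarize_with_header_check(generate, chunk_text, max_attempts=3):
    """Call generate(chunk_text) until the header validates or attempts run out.

    Returns (answer, header_ok). The last answer is returned even when invalid,
    so the caller can decide whether to keep or discard it.
    """
    answer = None
    for _ in range(max_attempts):
        answer = generate(chunk_text)
        if has_standard_header(answer):
            return answer, True
    return answer, False


if __name__ == "__main__":
    replies = iter([
        "**Interlocutores**: A, B | 2024",
        "Period: 2024-01-01 to 2024-01-02\nParticipants: A, B\n\nSummary text.",
    ])
    print(summarize_with_header_check(lambda _text: next(replies), "chat chunk"))
```

Bounding the number of attempts keeps a misbehaving model from stalling processing; what to do with an answer that never validates is left open here.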

Regarding 4, I also agree that keeping "ai:chunk_ids" visible in metadata tab is not ideal.
The reason for saving "ai:chunk_ids" is related to the next AI feature we are planning after this, that directly relies on summaries and their "ids", generated by this task.

Just to describe it briefly, it's a hierarchical RAG architecture that lets the user ask general questions about one or multiple chats at analysis time, in a ChatGPT-like panel in the interface, and get a textual response.
For questions about large or multiple chats, summaries are used, and I instruct the LLM to use chunk_ids to quote the relevant evidence parts (chunks) in its answer. The user can then click on the quotation link (chunk_id) in the answer to see the specific chat chunk directly.
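
To make the quotation mechanism described above concrete, here is a rough sketch of resolving chunk-id citations in an answer back to chat chunks. The citation syntax `[chunk:<id>]`, the function name, and the in-memory dictionary are assumptions for illustration only; the thread does not specify how citations are encoded.

```python
import re

# Assumed citation syntax: the LLM quotes evidence as [chunk:<id>].
CITATION_RE = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")


def resolve_citations(answer, chunks_by_id):
    """Return (chunk_id, chunk_text) pairs for each distinct known citation.

    Order follows first appearance in the answer; ids that are not present in
    chunks_by_id (e.g. hallucinated ones) are skipped.
    """
    seen = set()
    resolved = []
    for chunk_id in CITATION_RE.findall(answer):
        if chunk_id in chunks_by_id and chunk_id not in seen:
            seen.add(chunk_id)
            resolved.append((chunk_id, chunks_by_id[chunk_id]))
    return resolved


if __name__ == "__main__":
    chunks = {"c1": "first chat chunk", "c2": "second chat chunk"}
    text = "The deal is discussed in [chunk:c2] and confirmed in [chunk:c1]."
    print(resolve_citations(text, chunks))
```

Dropping unknown ids is a design choice in this sketch: it avoids rendering links that point nowhere if the model invents an id.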

I would like to later show you a POC of this next feature, as I think you could also provide some good insights.

Thank you again!

@wladimirleite

Member

Thanks @mbichara!
About 4, let's keep chunk_id's.
I think it is possible just to omit them on the metadata panel.

@wladimirleite wladimirleite changed the title from "AISummarization for WhatsApp chats (#2641)" to "AISummarization for Chats (#2641)" Jan 27, 2026
@wladimirleite

wladimirleite commented Jan 27, 2026

Member
  • Change the property name from "ai:summaries" to "ai:summary".
  • Localize the "AI-generated summaries. Check all information" message.
  • Localize the "Summary" tab title.
  • Rename "ai:chunk_ids" to "ai:chunkIds".
  • Hide "ai:chunkIds" in advanced properties metadata tab.
  • Allow hit navigation (left/right arrow icons) in the Summary viewer.
  • Allow "search in viewer" in the Summary viewer. (left as a future improvement, as it requires some refactoring)
  • Review "Summary" tab visibility.
  • Skip chats if "Communication:isEmpty" is "true".
  • Show "Summarized Chats" in the AI panel.
  • Show chats in the AI panel grouped by question score.
  • Support "application/x-telegram-chat" and "application/x-threema-chat" content types.
  • Rename "enableWhatsAppSummarization" to "enableInternalChatSummarization".
  • Normalize HTML summaries "header".

@wladimirleite

Member

I also suggest renaming "chunk_ids" to "chunkIds".
I am adding it to the task list.

@wladimirleite

Member

@mbichara, there are likely some details to fine-tune, but the current version should be functional.
Please try processing a case and let me know if you encounter any issues or have suggestions.

@mbichara

Contributor Author

@wladimirleite, very nice, I will process a case here and let you know

@mbichara

mbichara commented Feb 4, 2026

Contributor Author

Hi @wladimirleite, I processed a couple of cases and it seems fine, good work!
Now I am fixing some problems I found when parsing internal Telegram chats, and I will test Threema next.
I am also improving the backend logic.
I will let you know once I get it done. Thank you very much.

@lfcnassif

Member

@mbichara, is this still a draft? If not, could you convert it to ready for review?

Could anyone finish reviewing and testing and, if OK, approve it? I think this will be very helpful in many cases, such as CSAM analysis, bank fraud, drug and weapons dealing, extortion, currency counterfeiting, etc.

@lfcnassif lfcnassif changed the title from "AISummarization for Chats (#2641)" to "AI Summarization and Classification for Chats (#2641)" Mar 25, 2026
@mbichara

Contributor Author

@lfcnassif,

I made some changes and improvements, including on the server side, and I now believe this is ready for review.

@wladimirleite, thanks for collaborating on this. I’ll send you a config that points to the server so you can run it when you have some time. If any other colleague wants to test it, let me know.

Just a heads-up: I’m also adding/testing document summarization to the task, but due to infrastructure constraints, maybe we should start with chats first. What do you think @lfcnassif?

@mbichara mbichara marked this pull request as ready for review March 30, 2026 19:27
@lfcnassif

Member

> Just a heads-up: I’m also adding/testing document summarization to the task, but due to infrastructure constraints, maybe we should start with chats first. What do you think @lfcnassif?

I agree.

@wladimirleite

Member

> @wladimirleite, thanks for collaborating on this. I’ll send you a config that points to the server so you can run it when you have some time. If any other colleague wants to test it, let me know.

Sorry for the slow response!
I processed a large UFDR yesterday and analysed the results today.
Everything seems fine!

@mbichara, one last suggestion (sorry for not mentioning this before):
As the task took quite some time (about 50% of the processing time in this case, which has a lot of large chats), I think it would be important to have some basic performance data in the processing log:

  • Number of chats processed,
  • Total/average number of characters in processed chats,
  • Average time spent per chat.
    Maybe reported separately for chat analysis and summarization, if these tasks ran as separate steps.
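
The metrics listed above could be collected with a small accumulator along these lines. This is a sketch, not the task's actual code: the class name, method names, and log line format are hypothetical, and one instance per step (analysis, summarization) would give the separated figures.

```python
import time


class SummarizationStats:
    """Accumulates per-chat counters for a single processing step."""

    def __init__(self):
        self.chats = 0
        self.total_chars = 0
        self.total_seconds = 0.0

    def record(self, num_chars, seconds):
        """Register one processed chat with its size and elapsed time."""
        self.chats += 1
        self.total_chars += num_chars
        self.total_seconds += seconds

    def report(self):
        """Return a one-line summary suitable for the processing log."""
        if self.chats == 0:
            return "chats=0"
        avg_chars = self.total_chars / self.chats
        avg_seconds = self.total_seconds / self.chats
        return (f"chats={self.chats} total_chars={self.total_chars} "
                f"avg_chars={avg_chars:.1f} avg_seconds={avg_seconds:.2f}")


if __name__ == "__main__":
    stats = SummarizationStats()
    chat_text = "example chat text"
    start = time.perf_counter()
    # ... the remote summarization call would happen here ...
    stats.record(len(chat_text), time.perf_counter() - start)
    print(stats.report())
```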

By the way, I added two chat analysis questions, which probably increased the processing time. I am running it again with summarization only.

A minor detail: there were a lot of warnings in the log:

xxxx\iped-4.4.0-SNAPSHOT\python\lib\site-packages\urllib3\connectionpool.py:1095: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.61.xx.xx'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings
  warnings.warn(
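
This warning is emitted by urllib3 when an HTTPS request is made with certificate verification turned off. How the task configures its requests is not shown in this thread, so the following is only a standard-library sketch of the alternative the warning recommends: building a TLS context that verifies the server certificate, optionally against a private CA bundle. The function name and the `ca_file` path are hypothetical.

```python
import ssl


def build_tls_context(ca_file=None):
    """Return an SSL context that verifies server certificates and hostnames.

    ca_file is an optional path to a PEM bundle for an internal CA
    (hypothetical here); when it is None, the system trust store is used.
    """
    return ssl.create_default_context(cafile=ca_file)


if __name__ == "__main__":
    context = build_tls_context()
    print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

If the task uses the `requests` library, the equivalent knob is its `verify` parameter, which accepts a path to a CA bundle.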

> Just a heads-up: I’m also adding/testing document summarization to the task, but due to infrastructure constraints, maybe we should start with chats first. What do you think @lfcnassif?

I also agree that starting with chats only seems like a better idea.
After getting feedback from users on real cases, we can make adjustments and expand to other types of items.

@wladimirleite

Member

> By the way, I added two chat analysis questions, which probably increased the processing time. I am running it again with summarization only.

With summarization only, time spent on this task decreased from 14608 s to 4811 s.
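
For reference, the arithmetic behind these two figures is spelled out below. Attributing the whole difference to the two analysis questions assumes the two runs were otherwise identical, which the thread does not confirm.

```python
with_questions_s = 14608   # task time with summarization plus two questions
summary_only_s = 4811      # task time with summarization only

# Time that disappears when the questions are turned off.
questions_overhead_s = with_questions_s - summary_only_s
reduction = questions_overhead_s / with_questions_s
speedup = with_questions_s / summary_only_s

print(f"overhead: {questions_overhead_s} s, "
      f"reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```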

@wladimirleite

Member

After taking a closer look at the processed case and discussing it with @felipecampanini, who also used this PR to process a large case he is working on, here are some final comments and suggestions for future improvements:

  1. Although the summarization content is very accurate, for many chats it is not very concise, looking more like a detailed description. I suggest trying to make it a bit more concise (by default) and, maybe, having a configuration parameter to set the level of conciseness (it could be a number or low/medium/high).
  2. For the questions of the analysis feature, show the score of each chunk in the visualization. That would help a lot to find which parts are important in large chats.
  3. Allow "open" questions, i.e., questions that return a textual answer instead of a score.

@lfcnassif

Member

Hi @mbichara!

I know you are working on a very sensitive case, but could you push the last implemented fixes and commits here? I think this is a very important feature to include into version 4.4.0, hopefully before the middle of this year...

@lfcnassif

Member

@mbichara, do you plan to push more enhancements here?

@mbichara

Contributor Author

Hi @lfcnassif. Yes, I was working on another urgent demand over the past weeks.
I just returned to this. I am finishing some tests, mostly on the backend, but I also made some enhancements here.
I will push them now.

Copilot AI left a comment

Contributor


Pull request overview

Adds AI-driven chat summarization (and optional per-chunk “analysis score” attributes) to IPED, plus a dedicated “Summary” viewer tab that renders stored summaries and can navigate to the relevant message in the chat preview. It also extends the AI Filters panel to group chats by analysis score fields discovered in the index.

Changes:

  • Introduces AISummarizationTask.py + configuration to call a remote AI middleware service, store ai:summary, ai:chunkIds, and ai:analysis:* extra attributes, and register the task in TaskInstaller.xml.
  • Adds a new SummaryViewer (tab) that displays ai:summary chunks, renders basic markup, shows per-chunk analysis labels, and provides “go to first message” navigation via a new MessageNavigator API.
  • Extends AI filters to support wildcard expansion for ai:analysis:* fields and adds “Analyzed Chats” / “Summarized Chats” filter entries and localization keys.
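
The wildcard expansion mentioned in the last item can be illustrated with a short sketch: a pattern such as `ai:analysis:*` is matched against the field names found in the index, and each match gets a display label derived from the part after the prefix. This is an approximation for illustration, not the code in `AIFiltersLoader.java`; the function name and the sample field names are hypothetical.

```python
def expand_wildcard(pattern, indexed_fields):
    """Map each indexed field matching 'prefix*' to a display label.

    The label is the part after the prefix, with a trailing "score"
    removed and the first letter capitalized.
    """
    if not pattern.endswith("*"):
        # No wildcard: keep the field only if it actually exists in the index.
        return {pattern: ""} if pattern in indexed_fields else {}
    prefix = pattern[:-1]
    expanded = {}
    for field in sorted(indexed_fields):
        if not field.startswith(prefix) or field == prefix:
            continue
        label = field[len(prefix):].strip()
        if label.lower().endswith("score") and len(label) > 5:
            label = label[:-5]
        expanded[field] = label[:1].upper() + label[1:]
    return expanded


if __name__ == "__main__":
    fields = ["name", "ai:summary", "ai:analysis:drugsScore"]
    print(expand_wildcard("ai:analysis:*", fields))
```

Deriving the filter nodes from the indexed fields means customized questions show up in the panel without any change to the filter configuration.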

Reviewed changes

Copilot reviewed 25 out of 32 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| iped-viewers/iped-viewers-impl/src/main/java/iped/viewers/SummaryViewer.java | New viewer to render stored AI summaries, show analysis labels, and link into chat preview. |
| iped-viewers/iped-viewers-api/src/main/java/iped/viewers/api/MessageNavigator.java | New API hook for viewers to request navigation to a message id. |
| iped-utils/src/main/java/iped/utils/UiUtil.java | Enhances empty HTML helper to support optional message text and theme colors. |
| iped-engine/src/main/java/iped/engine/data/SimpleFilterNode.java | Adds suffix to support display labeling for wildcard-expanded AI filter nodes. |
| iped-app/src/main/java/iped/app/ui/ViewerController.java | Registers SummaryViewer, wires navigation callback, loads analysis thresholds, and hides Summary when index lacks summary field. |
| iped-app/src/main/java/iped/app/ui/ai/AIFiltersTreeCellRenderer.java | Displays node suffix in the AI filters tree. |
| iped-app/src/main/java/iped/app/ui/ai/AIFiltersLoader.java | Expands wildcard property definitions (e.g., ai:analysis:*) into concrete filter nodes based on indexed fields. |
| iped-app/resources/scripts/tasks/AISummarizationTask.py | New processing task that calls remote service, parses chat HTML, and stores summaries + analysis attributes. |
| iped-app/resources/localization/iped-viewer-messages.properties | Adds SummaryViewer strings (EN). |
| iped-app/resources/localization/iped-viewer-messages_pt_BR.properties | Adds SummaryViewer strings (pt_BR). |
| iped-app/resources/localization/iped-viewer-messages_it_IT.properties | Adds SummaryViewer strings (it_IT) placeholders. |
| iped-app/resources/localization/iped-viewer-messages_fr_FR.properties | Adds SummaryViewer strings (fr_FR) placeholders. |
| iped-app/resources/localization/iped-viewer-messages_es_AR.properties | Adds SummaryViewer strings (es_AR) placeholders. |
| iped-app/resources/localization/iped-viewer-messages_de_DE.properties | Adds SummaryViewer strings (de_DE) placeholders. |
| iped-app/resources/localization/iped-ai-filters.properties | Adds “Analyzed Chats” / “Summarized Chats” filter labels (EN). |
| iped-app/resources/localization/iped-ai-filters_pt_BR.properties | Adds “Analyzed Chats” / “Summarized Chats” filter labels (pt_BR). |
| iped-app/resources/localization/iped-ai-filters_it_IT.properties | Adds filter labels (it_IT) placeholders. |
| iped-app/resources/localization/iped-ai-filters_fr_FR.properties | Adds filter labels (fr_FR) placeholders. |
| iped-app/resources/localization/iped-ai-filters_es_AR.properties | Adds filter labels (es_AR) placeholders. |
| iped-app/resources/localization/iped-ai-filters_de_DE.properties | Adds filter labels (de_DE) placeholders. |
| iped-app/resources/config/IPEDConfig.txt | Adds enableAISummarization config flag documentation. |
| iped-app/resources/config/conf/TaskInstaller.xml | Registers AISummarizationTask.py in the processing pipeline. |
| iped-app/resources/config/conf/AISummarizationConfig.txt | New task configuration file (remote address, timeouts, parser selection, analysis questions). |
| iped-app/resources/config/conf/AIFiltersConfig.json | Adds “Analyzed Chats” wildcard filter and “Summarized Chats” filter entries. |
| iped-api/src/main/java/iped/properties/ExtraProperties.java | Adds SUMMARY and CHUNK_IDS constants for AI summarization attributes. |


Comment on lines +140 to +145
String html = SimpleHTMLEncoder.htmlEncode(text).replace("\n", "<br>");
Matcher matcher = BOLD_PATTERN.matcher(html);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb, "<strong>" + matcher.group(1) + "</strong>");
}
sb.append("color:");
sb.append(getHexRGB(c));
sb.append(";");
sb.append("\">").append(msg).append("</p>");
Comment on lines +639 to +648
# --- summaries ---
summary = entry.get("summary")
if isinstance(summary, str) and summary.strip():
    chunk_summaries.append(summary)

# --- chunk ids ---
chunk_id = entry.get("chunk_id")
if isinstance(chunk_id, str) and chunk_id.strip():
    chunk_ids.append(chunk_id)

Comment on lines +48 to +51
String suffix = node.getSuffix();
if (suffix != null) {
    text += " - " + suffix;
}
Comment on lines +525 to +531
DefaultSingleCDockable dock = dockPerViewer.get(viewer);
dockPerViewer.remove(viewer);
viewers.remove(i);
CControl cControl = dock.getControl();
if (cControl != null) {
    cControl.removeDockable(dock);
}
Comment on lines +64 to +66
String s = field.substring(prop.length()).trim();
if (s.toLowerCase().endsWith("score")) {
    s = Character.toUpperCase(s.charAt(0)) + s.substring(1, s.length() - 5);
Comment on lines +132 to +134
SummaryViewer.NoSummary=No summary available[TBT]
SummaryViewer.Title=AI-generated summary. Check all information. [TBT]
SummaryViewer.TabName=Summary[TBT]
Comment on lines +132 to +134
SummaryViewer.NoSummary=No summary available[TBT]
SummaryViewer.Title=AI-generated summary. Check all information. [TBT]
SummaryViewer.TabName=Summary[TBT]
Comment on lines +132 to +134
SummaryViewer.NoSummary=No summary available[TBT]
SummaryViewer.Title=AI-generated summary. Check all information. [TBT]
SummaryViewer.TabName=Summary[TBT]
Comment on lines +132 to +134
SummaryViewer.NoSummary=No summary available[TBT]
SummaryViewer.Title=AI-generated summary. Check all information. [TBT]
SummaryViewer.TabName=Summary[TBT]

Development

Successfully merging this pull request may close these issues.

AI summarization and AI questions based classification for chats

4 participants