implemented DBservices according to new db schema by parthdude07 · Pull Request #15 · genesis-kb/genesis-kb.github.io

parthdude07 · 2026-05-22T07:21:30Z

Fixes #14

Summary by cubic

Migrates the backend to a new platform‑agnostic DB schema and updates DB services to use normalized tables. Improves full‑text search and transcript versioning, and adds duration formatting.

New Features
- Added NEW_DB_SCRIPTS/models.py, migrate_schema.py, add_indexes.py, and Description.md (v2 schema, transactional migration, partial GIN/B‑Tree indexes, deployment steps).
- Reworked backend/src/services/supabaseService.js to the new schema (joins content_items, transcripts with is_current = true, summaries, speakers, taxonomies, content_sources; returns speakers/tags/status/duration; FTS ranks titles/descriptions, transcript text, and TL;DR; safer input sanitization and distinct counts).
- Replaced the legacy cache table with summaries; updated types.ts for JSON tags/categories and optional topics; dataProcessor now formats duration_seconds as mm:ss.
- Updated backend/supabase/migrations/001_full_text_search.sql to provide RPCs that search across normalized tables and return better snippets.
Migration
- Dry run: python NEW_DB_SCRIPTS/migrate_schema.py --dry-run
- Execute: python NEW_DB_SCRIPTS/migrate_schema.py
- Indexes: python NEW_DB_SCRIPTS/add_indexes.py

^{Written for commit 92034e7. Summary will update on new commits. Review in cubic}

Summary by CodeRabbit

New Features
- Support for multiple content sources (beyond YouTube)
- Improved full-text transcript search with better ranking, highlights, and richer results (speakers, conference/channel, tags/categories)
- Denormalized transcript metadata (aggregated speakers, summaries) and duration shown per talk
- Schema migration and indexing to improve search and current-transcript handling
Documentation
- Added a migration guide for upgrading to the new multi-source schema

coderabbitai · 2026-05-22T07:21:42Z

📝 Walkthrough

Walkthrough

Migrates the database from a YouTube-centric schema to a normalized, platform-agnostic v2 schema: adds ORM models, a transactional migration CLI with dry-run support, an index-creation script, updated PostgreSQL FTS RPCs, refactors backend queries to join/aggregate the new schema, and updates types and frontend duration formatting.

Changes

Database Schema Migration & Adaptation

Layer / File(s)	Summary
Data Model Definition `NEW_DB_SCRIPTS/models.py`	New SQLAlchemy ORM schema defines `Taxonomy`, `ContentSource`, `ContentItem`, `Speaker`, `ContentItemSpeaker`, `Transcript`, `Summary`, `ExternalPublication`, and `PipelineRun` with constraints, partial indexes, relationships, and `to_dict()` serializers.
Migration Script & Documentation `NEW_DB_SCRIPTS/Description.md`, `NEW_DB_SCRIPTS/migrate_schema.py`	CLI migration renames legacy tables, creates new schema, migrates channels/videos/transcripts/ingestion_runs/comments into v2 tables via SQL and Python streaming upserts, supports `--dry-run`, and drops legacy tables; documentation lists deployment steps and index targets.
Index Creation & Database Optimization `NEW_DB_SCRIPTS/add_indexes.py`	Script ensures `pg_trgm` extension and applies B-Tree, partial unique, DESC, and GIN FTS indexes using `to_tsvector('english', ...)` for content_items, transcripts (partial on current), summaries, and supporting tables.
RPC Functions & Full-Text Search `backend/supabase/migrations/001_full_text_search.sql`	`search_transcripts_fts` and companion count updated to compose tsvectors across content_items/transcripts/summaries, use `plainto_tsquery`, join/aggregate v2 tables, rank and headline results, and return v2-style columns (jsonb tags/categories, speakers[], conference, status, duration_seconds).
Backend Service Query Adapter `backend/src/services/supabaseService.js`	Service queries rewritten to JOIN and aggregate v2 tables for fetch/search endpoints, use `plainto_tsquery`/`ts_headline` with ranking and error handling, and migrate AI cache read/write to the `summaries` table via upsert on `(transcript_id, summary_type)`.
Type Definitions & Frontend Formatting `types.ts`, `backend/src/utils/dataProcessor.js`	TypeScript types loosen `tags`/`categories`/`topics` to `string[]

Sequence Diagram(s)

sequenceDiagram
  participant ComponentA
  participant ComponentB
  ComponentA->>ComponentB: observable interaction

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A migration hopped in with careful delight,
Models and scripts in the soft morning light.
Indexes hum and queries now sing,
Transcripts and speakers all fit in the spring.
A multi-source garden, tidy and bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: implementing database services according to a new database schema, which is the core objective of the PR.
Linked Issues check	✅ Passed	The PR fully addresses the coding requirement from issue `#14` by updating backend handlers and services (supabaseService.js, migrations SQL, models, types) to operate against the new normalized v2 schema.
Out of Scope Changes check	✅ Passed	All changes align with the schema migration objective: models, migration scripts, indexing, backend service rewrites, type updates, and data processor improvements are all necessary for the new schema implementation.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/src/services/supabaseService.js (1)
447-480: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't drop non-array tag payloads.

Line 448 treats anything except a JSON array as “no tags”. In this PR, the shared contracts now allow object-shaped tags, so those rows will contribute nothing to topics or tags here. Normalize the JSONB value first instead of silently defaulting to [].
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/services/supabaseService.js` around lines 447 - 480, The code
currently sets rowTags = Array.isArray(row.tags) ? row.tags : [] which drops
object-shaped JSONB tag payloads; instead parse/normalize row.tags into a
consistent array before using it (e.g., convert single-string, object-shaped, or
array payloads into an array of tag values) so Speaker loop, topic loop and tag
loop operate on valid entries; update the logic around rowTags creation (the
variable used in the speakerMap/topicMap/tagSet flows) to normalize JSONB inputs
first (using existing helpers like cleanLabel/normalizeLabel where appropriate)
rather than defaulting to [].

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/services/supabaseService.js`:
- Around line 290-303: The rank calculation omits summaryVector so matches found
only in the cached TL;DR (su.content) get zero score; update the ts_rank call in
the SQL fragment (the expression producing ts_rank(... ) AS rank) to include
summaryVector alongside titleDescVector and textVector (e.g., combine
titleDescVector || textVector || summaryVector, ${ftsQuery}) so that hits in
su.content affect ranking; ensure the same ${ftsQuery} and summaryVector symbol
names are used to match the WHERE clause that checks ${summaryVector} @@
${ftsQuery}.
- Around line 366-382: The current read-then-write using query around the
summaries table (select by transcript_id and summary_type then update or insert)
is not atomic; replace that logic in supabaseService.js (the block that
references query, transcriptId, type, content and existing.rows) with a single
upsert SQL using INSERT ... ON CONFLICT (transcript_id, summary_type) DO UPDATE
so the write is atomic; ensure the DO UPDATE sets content = EXCLUDED.content and
created_at = NOW() (or equivalent) and call query once with parameters
[transcriptId, type, content].

In `@backend/supabase/migrations/001_full_text_search.sql`:
- Around line 57-61: The ts_rank calculation currently only includes
to_tsvector('english', COALESCE(c.title, '') || ' ' || COALESCE(c.description,
'')) and the transcript vector (t.corrected_text / t.raw_text) so matches found
only in the summary field su.content will produce rank = 0; update the ts_rank
expression to also include to_tsvector('english', COALESCE(su.content, ''))
alongside the existing vectors so that ranking reflects the same fields used in
the WHERE tsquery predicate (ensure the same vector concatenation used in Lines
83-85 is included in the ts_rank call that uses tsquery_val).

In `@NEW_DB_SCRIPTS/add_indexes.py`:
- Around line 82-87: The script catches exceptions from add_indexes() but only
logs them and prints a traceback, leaving the process to exit successfully;
update the except block in add_indexes.py (the try around add_indexes()) to
terminate with a non-zero status—either re-raise the exception or call
sys.exit(1) after logging the error (and add an import for sys if needed) so
failures cause a failing process exit.

In `@NEW_DB_SCRIPTS/Description.md`:
- Around line 6-8: The deployment docs reference old paths `app/models.py`,
`scripts/migrate_schema.py`, and `scripts/add_indexes.py` but this PR moved
those files under NEW_DB_SCRIPTS; update the docs (including the block around
lines 15-28) to point to `NEW_DB_SCRIPTS/app/models.py`,
`NEW_DB_SCRIPTS/migrate_schema.py`, and `NEW_DB_SCRIPTS/add_indexes.py` (or the
exact filenames as committed) and adjust any example commands to use
NEW_DB_SCRIPTS/ so operators run the correct scripts.

In `@NEW_DB_SCRIPTS/migrate_schema.py`:
- Around line 18-19: The script imports Base from app.models but should use the
v2 metadata defined in NEW_DB_SCRIPTS/models.py to ensure DDL
(create_all/migration SQL) targets the new schema; update the import to pull the
metadata or Base-equivalent from NEW_DB_SCRIPTS.models (the v2 metadata object
used for create_all) and pass that metadata into engine.create_all / SQL
generation instead of app.models.Base, keeping get_session and _get_engine usage
unchanged.
- Around line 268-273: The except block that catches failures from
run_migration(dry_run=args.dry_run) currently only logs the error and prints a
traceback, leaving the process exit code as zero; update the handler to
terminate the process with a non-zero exit code (e.g., call sys.exit(1) or
re-raise the exception) after logging so CI/CD detects failures—modify the
except block that references logger.error and traceback.print_exc to import sys
if needed and exit with a non-zero status.
- Around line 141-143: The current backfill loads all rows via
conn.execute(text("SELECT * FROM old_transcripts")).fetchall(), which can OOM;
change this to stream or batch the query results instead—use a server-side
cursor/streaming execution (e.g., execution_options(stream_results=True) or the
DB driver's cursor) or iterate with fetchmany/LIMIT+OFFSET to process chunks,
replacing the fetchall() usage and adjusting the logger.info that reports
progress to count per-batch (refer to the transcripts variable, the conn.execute
call and the "old_transcripts" query) so rows are processed incrementally rather
than all at once.

---

Outside diff comments:
In `@backend/src/services/supabaseService.js`:
- Around line 447-480: The code currently sets rowTags = Array.isArray(row.tags)
? row.tags : [] which drops object-shaped JSONB tag payloads; instead
parse/normalize row.tags into a consistent array before using it (e.g., convert
single-string, object-shaped, or array payloads into an array of tag values) so
Speaker loop, topic loop and tag loop operate on valid entries; update the logic
around rowTags creation (the variable used in the speakerMap/topicMap/tagSet
flows) to normalize JSONB inputs first (using existing helpers like
cleanLabel/normalizeLabel where appropriate) rather than defaulting to [].

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f902fba-dcb8-4119-88c8-73a61d93b113

📥 Commits

Reviewing files that changed from the base of the PR and between 72fd4e2 and 83ac902.

⛔ Files ignored due to path filters (1)

package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (7)

NEW_DB_SCRIPTS/Description.md
NEW_DB_SCRIPTS/add_indexes.py
NEW_DB_SCRIPTS/migrate_schema.py
NEW_DB_SCRIPTS/models.py
backend/src/services/supabaseService.js
backend/supabase/migrations/001_full_text_search.sql
types.ts

cubic-dev-ai

8 issues found across 8 files

_{Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

NEW_DB_SCRIPTS/add_indexes.py (1)

20-23: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return a failing process status when engine creation fails.

On Line 20-Line 23, returning after logging makes index setup appear successful to automation even though nothing executed.

Proposed fix

 def add_indexes():
     engine = _get_engine()
     if not engine:
-        logger.error("Could not get database engine. Check DATABASE_URL.")
-        return
+        raise RuntimeError("Could not get database engine. Check DATABASE_URL.")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@NEW_DB_SCRIPTS/add_indexes.py` around lines 20 - 23, The current early-return
after calling _get_engine() masks failures; change the behavior so the process
exits with a non-zero status instead of returning: import sys (or use raise
SystemExit) in NEW_DB_SCRIPTS/add_indexes.py and replace the bare "return" after
logger.error("Could not get database engine. Check DATABASE_URL.") with
sys.exit(1) (or raise SystemExit("Could not get database engine")), ensuring the
failure is propagated to automation; keep the existing logger.error call and
reference the _get_engine() call and engine variable in your change.

NEW_DB_SCRIPTS/models.py (1)

409-411: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a unique constraint for summary upserts.

The summaries table is missing a unique/exclusion constraint on (transcript_id, summary_type). Upserts using ON CONFLICT (transcript_id, summary_type) will fail at runtime without it.

Proposed fix

     __table_args__ = (
+        UniqueConstraint(
+            "transcript_id",
+            "summary_type",
+            name="uq_summaries_transcript_type",
+        ),
         Index("idx_summaries_type", "summary_type"),
     )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@NEW_DB_SCRIPTS/models.py` around lines 409 - 411, The summaries model's
__table_args__ currently defines a non-unique Index("idx_summaries_type",
"summary_type") but lacks a unique constraint required for ON CONFLICT upserts
on (transcript_id, summary_type); update __table_args__ in the summaries model
to include a UniqueConstraint("transcript_id", "summary_type") (or an Index with
unique=True) alongside the existing Index so the database enforces uniqueness
for upserts.

NEW_DB_SCRIPTS/migrate_schema.py (1)

32-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when DB engine initialization fails.

On Line 32-Line 35, the script logs and returns when _get_engine() is unavailable, which can still exit with code 0. That can hide failed migrations in CI/CD.

Proposed fix

 def run_migration(dry_run=False):
     engine = _get_engine()
     if not engine:
-        logger.error("Could not get database engine. Check DATABASE_URL.")
-        return
+        raise RuntimeError("Could not get database engine. Check DATABASE_URL.")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@NEW_DB_SCRIPTS/migrate_schema.py` around lines 32 - 35, The script currently
logs and returns when _get_engine() fails, which allows a zero exit status;
update the failure path in migrate_schema.py where engine = _get_engine() to
terminate with a non-zero exit (e.g., call sys.exit(1) or raise
SystemExit/RuntimeError) right after logger.error("Could not get database
engine. Check DATABASE_URL.") so CI sees the migration failure; ensure you
import sys or raise an exception and keep the log message intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/utils/dataProcessor.js`:
- Line 189: The current ternary uses a falsy check on row.duration_seconds which
treats 0 as missing; update the condition in the duration assignment (the line
setting duration using row.duration_seconds) to check for null/undefined (e.g.
row.duration_seconds != null or Number.isFinite(row.duration_seconds)) instead
of a truthy check so that 0 is formatted as "0:00" using the existing
Math.floor/ padStart formatting logic.

---

Outside diff comments:
In `@NEW_DB_SCRIPTS/add_indexes.py`:
- Around line 20-23: The current early-return after calling _get_engine() masks
failures; change the behavior so the process exits with a non-zero status
instead of returning: import sys (or use raise SystemExit) in
NEW_DB_SCRIPTS/add_indexes.py and replace the bare "return" after
logger.error("Could not get database engine. Check DATABASE_URL.") with
sys.exit(1) (or raise SystemExit("Could not get database engine")), ensuring the
failure is propagated to automation; keep the existing logger.error call and
reference the _get_engine() call and engine variable in your change.

In `@NEW_DB_SCRIPTS/migrate_schema.py`:
- Around line 32-35: The script currently logs and returns when _get_engine()
fails, which allows a zero exit status; update the failure path in
migrate_schema.py where engine = _get_engine() to terminate with a non-zero exit
(e.g., call sys.exit(1) or raise SystemExit/RuntimeError) right after
logger.error("Could not get database engine. Check DATABASE_URL.") so CI sees
the migration failure; ensure you import sys or raise an exception and keep the
log message intact.

In `@NEW_DB_SCRIPTS/models.py`:
- Around line 409-411: The summaries model's __table_args__ currently defines a
non-unique Index("idx_summaries_type", "summary_type") but lacks a unique
constraint required for ON CONFLICT upserts on (transcript_id, summary_type);
update __table_args__ in the summaries model to include a
UniqueConstraint("transcript_id", "summary_type") (or an Index with unique=True)
alongside the existing Index so the database enforces uniqueness for upserts.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 5f63ca40-1646-4342-8aed-f6aa9f357b1e

📥 Commits

Reviewing files that changed from the base of the PR and between 83ac902 and 92034e7.

📒 Files selected for processing (7)

NEW_DB_SCRIPTS/Description.md
NEW_DB_SCRIPTS/add_indexes.py
NEW_DB_SCRIPTS/migrate_schema.py
NEW_DB_SCRIPTS/models.py
backend/src/services/supabaseService.js
backend/src/utils/dataProcessor.js
backend/supabase/migrations/001_full_text_search.sql

implemented DBservices according to new db schema

83ac902

parthdude07 mentioned this pull request May 22, 2026

Migration: Update the handlers in the backend according to the new schema . #14

Open

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 22, 2026

View reviewed changes

implemented AI reviewers suggestions

92034e7

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread backend/src/utils/dataProcessor.js

Conversation

parthdude07 commented May 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by cubic

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

parthdude07 commented May 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading