
longkeyy/knowledgesdk


KnowledgeSDK

KnowledgeSDK is a Go library for building and managing vector-based knowledge bases with semantic search capabilities. It provides a comprehensive set of tools for document management, chunking, embedding generation, and semantic search.

Features

  • Knowledge base management (create, read, update, delete)
  • Document handling with automatic content extraction
  • Text chunking for efficient storage and retrieval
  • Vector embedding generation
  • Semantic search with similarity scoring
  • PostgreSQL-based vector storage with efficient indexing
  • Support for various file formats via Apache Tika integration

Configuration

SDK Configuration

type Config struct {
    // Database configuration
    DBHost     string
    DBPort     int
    DBName     string
    DBUser     string
    DBPassword string

    // Vector embedding service configuration
    APIKey         string
    BaseURL        string // Compatible with different model services
    EmbeddingModel string // e.g. "text-embedding-ada-002"
}
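
As a rough illustration of how the database fields might be combined, the sketch below builds a PostgreSQL keyword/value connection string. The `dbConfig` type and `dsn` function are hypothetical stand-ins for this example; whether the SDK assembles its connection string this way, and the `sslmode=disable` setting, are assumptions. It also assumes the values contain no spaces or quotes.

```go
package main

import "fmt"

// dbConfig mirrors the database fields of Config for this example only.
// It is a hypothetical stand-in, not a type exported by KnowledgeSDK.
type dbConfig struct {
	Host     string
	Port     int
	Name     string
	User     string
	Password string
}

// dsn renders a PostgreSQL keyword/value connection string.
// sslmode=disable is an assumption suited to local development only.
func dsn(c dbConfig) string {
	return fmt.Sprintf(
		"host=%s port=%d user=%s password=%s dbname=%s sslmode=disable",
		c.Host, c.Port, c.User, c.Password, c.Name,
	)
}

func main() {
	fmt.Println(dsn(dbConfig{
		Host: "localhost", Port: 5432, Name: "kb",
		User: "postgres", Password: "secret",
	}))
}
```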

Chunk Configuration

type ChunkConfig struct {
    ChunkSize int // Maximum number of characters per chunk
    Overlap   int // Number of overlapping characters between adjacent chunks
}
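
To make the interaction between `ChunkSize` and `Overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. The `chunkText` function is a hypothetical helper written for this example; the SDK's own splitter may differ (for instance, it may respect sentence boundaries).

```go
package main

import "fmt"

// chunkText splits text into chunks of at most size characters (runes).
// Each chunk after the first starts with the last overlap characters of
// the previous chunk. Hypothetical helper, not part of KnowledgeSDK.
func chunkText(text string, size, overlap int) []string {
	if size <= 0 {
		return nil
	}
	if overlap < 0 || overlap >= size {
		overlap = 0 // guard: an overlap >= size would never advance
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the final chunk reached the end of the text
		}
	}
	return chunks
}

func main() {
	// With size 4 and overlap 1, each chunk advances by 3 characters.
	for i, c := range chunkText("abcdefghij", 4, 1) {
		fmt.Printf("%d: %q\n", i, c)
	}
}
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more text.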

Search Parameters

type SearchParams struct {
    Query               string  // Query text to search for
    TopK                int     // Number of results to return
    SimilarityThreshold float64 // Minimum similarity score (0-1)
    CreatorID           string  // Creator ID for filtering results (optional)
    KBID                string  // Knowledge base ID to limit search scope (optional)
}

Tika Configuration

type TikaConfig struct {
    URL string // Tika server URL, e.g., "http://localhost:9998"
}

// DefaultTikaConfig returns default Tika configuration with URL set to "http://localhost:9998"

API Reference

Initialization

NewKnowledgeSDK

Creates a new SDK instance with the provided configuration.

  • Parameters:
    • config Config: Configuration for database and embedding service
  • Returns:
    • *KnowledgeSDK: SDK instance
    • error: Error if initialization fails

Knowledge Base Management

CreateKnowledgeBase

Creates a new knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kb *KnowledgeBase: Knowledge base object with fields:
      • Name: Knowledge base name
      • Description: Knowledge base description
      • ModelID: Large model identifier (optional)
      • Temperature: Model temperature parameter, controls randomness (optional, default 0.7)
      • RigorousPrompt: Rigorous answer prompt template (optional)
      • EnableRigorousAnswer: Whether to enable rigorous answer mode (optional, default false)
      • ChunkSize: Document chunk size in characters (optional, default 1000)
      • Overlap: Overlap between adjacent chunks (optional, default 50)
      • TopK: Maximum number of related chunks to retrieve (optional, default 5)
      • SimilarityThreshold: Similarity threshold (optional, default 0.6)
      • SystemPromptTemplate: System prompt template (optional)
      • MaxReferenceLength: Maximum reference knowledge length (optional, default 3000)
      • CreatorID: ID of the knowledge base creator (optional)
  • Returns:
    • *KnowledgeBase: Created knowledge base
    • error: Error if creation fails

GetKnowledgeBase

Retrieves a knowledge base by ID.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
  • Returns:
    • *KnowledgeBase: Retrieved knowledge base
    • error: Error if retrieval fails

ListKnowledgeBases

Lists all knowledge bases.

  • Parameters:
    • ctx context.Context: Context for the operation
  • Returns:
    • []KnowledgeBase: List of knowledge bases
    • error: Error if listing fails

ListKnowledgeBasesByCreatorID

Lists all knowledge bases by creator ID.

  • Parameters:
    • ctx context.Context: Context for the operation
    • creatorID string: ID of the creator
  • Returns:
    • []KnowledgeBase: List of knowledge bases
    • error: Error if listing fails

ListKnowledgeBasesByIDs

Retrieves multiple knowledge bases by their IDs.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbIDs []string: List of knowledge base IDs
  • Returns:
    • []KnowledgeBase: List of knowledge bases
    • error: Error if retrieval fails

UpdateKnowledgeBase

Updates all properties of a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kb *KnowledgeBase: Knowledge base object with updated fields
  • Returns:
    • *KnowledgeBase: Updated knowledge base
    • error: Error if update fails

DeleteKnowledgeBase

Deletes a knowledge base and all its documents.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
  • Returns:
    • error: Error if deletion fails

ListKnowledgeBaseDocuments

Lists all documents in a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
  • Returns:
    • []Document: List of documents
    • error: Error if listing fails

ListKnowledgeBaseDocumentsPaginated

Lists documents in a knowledge base with pagination, sorting, and keyword filtering for document names.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • keyword string: Keyword for filtering document names (use empty string for no filtering)
    • page int: Page number (starting from 1)
    • pageSize int: Number of documents per page
    • orderBy string: Sorting criteria (e.g., "uploaded_at DESC")
    • creatorID string: ID of the creator (optional, for filtering)
  • Returns:
    • []Document: List of documents
    • int64: Total number of documents in the knowledge base matching the filter criteria
    • error: Error if listing fails
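
The relationship between `page`, `pageSize`, and the returned total can be sketched as follows. The `pageWindow` and `totalPages` helpers are hypothetical, and the fallback page size of 10 is an assumption for this example rather than documented SDK behavior.

```go
package main

import "fmt"

// pageWindow converts a 1-based page number into an offset and limit.
// The fallback values are assumptions made for this sketch.
func pageWindow(page, pageSize int) (offset, limit int) {
	if page < 1 {
		page = 1
	}
	if pageSize < 1 {
		pageSize = 10 // hypothetical default
	}
	return (page - 1) * pageSize, pageSize
}

// totalPages derives the page count from the total number of matches.
func totalPages(total int64, pageSize int) int64 {
	if pageSize < 1 || total <= 0 {
		return 0
	}
	return (total + int64(pageSize) - 1) / int64(pageSize)
}

func main() {
	offset, limit := pageWindow(3, 20)
	fmt.Println(offset, limit)      // prints 40 20
	fmt.Println(totalPages(45, 20)) // prints 3
}
```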

SearchKnowledgeBasesByName

Searches knowledge bases by name.

  • Parameters:
    • ctx context.Context: Context for the operation
    • name string: Name keyword to search for
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • error: Error if search fails

SearchKnowledgeBasesByDescription

Searches knowledge bases by description.

  • Parameters:
    • ctx context.Context: Context for the operation
    • description string: Description keyword to search for
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • error: Error if search fails

SearchKnowledgeBasesByKeyword

Searches knowledge bases by keyword (matches against both name and description).

  • Parameters:
    • ctx context.Context: Context for the operation
    • keyword string: Keyword to search for
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • error: Error if search fails

SearchKnowledgeBasesAdvanced

Performs an advanced search on knowledge bases using multiple criteria.

  • Parameters:
    • ctx context.Context: Context for the operation
    • params KnowledgeBaseSearchParams: Search parameters including:
      • Keyword: Keyword to search in name and description (optional)
      • Name: Name keyword (optional)
      • Description: Description keyword (optional)
      • ModelID: Model ID for exact matching (optional)
      • CreatorID: Creator ID for filtering (optional)
      • Page: Page number (starting from 1)
      • PageSize: Number of items per page
      • OrderBy: Sorting criteria (e.g., "created_at DESC")
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • int64: Total number of matching knowledge bases
    • error: Error if search fails

Document Management

AddDocument

Adds a text document to a knowledge base and immediately chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • name string: Document name
    • content string: Document content
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

AddDocumentWithMetadata

Adds a document with metadata to a knowledge base and chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • name string: Document name
    • content string: Document content
    • contentType string: Content MIME type
    • metadata map[string]string: Document metadata
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

GetDocument

Retrieves a document by ID.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • *Document: Retrieved document
    • error: Error if retrieval fails

GetDocumentWithChunks

Retrieves a document with its chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • *Document: Retrieved document with chunks
    • error: Error if retrieval fails

DeleteDocument

Deletes a document and its chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if deletion fails

UpdateDocumentContent

Updates a document's content and re-chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
    • newContent string: New document content
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • error: Error if update fails

GetDocumentMetadata

Retrieves a document's metadata.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • map[string]string: Document metadata
    • error: Error if retrieval fails

File Management

AddFile

Adds a file to a knowledge base, extracts its content, and chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • fileName string: Name of the file
    • fileData []byte: File data
    • tikaConfig TikaConfig: Apache Tika configuration
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

AddFileFromReader

Adds a file from an io.Reader to a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • fileName string: Name of the file
    • reader io.Reader: File data reader
    • tikaConfig TikaConfig: Apache Tika configuration
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

AddFileFromMultipart

Adds a file from an HTTP multipart upload to a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • file *multipart.FileHeader: Uploaded file
    • tikaConfig TikaConfig: Apache Tika configuration
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

ExtractFileContent

Extracts content and metadata from a file using Apache Tika.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • fileData []byte: File data
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails

ExtractFileContentFromReader

Extracts content and metadata from a file using io.Reader.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • reader io.Reader: File data reader
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails

ExtractFileContentFromMultipart

Extracts content and metadata from an HTTP multipart uploaded file.

  • Parameters:
    • ctx context.Context: Context for the operation
    • file *multipart.FileHeader: Uploaded file
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails

ExtractFileContentFromURL

Extracts content and metadata from a file at a given URL.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileURL string: URL of the file
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails

GetFileMetadata

Extracts metadata from a file without storing it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • fileData []byte: File data
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • map[string]string: File metadata
    • error: Error if extraction fails

GetFileMetadataFromReader

Extracts metadata from a file using io.Reader without storing it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • reader io.Reader: File data reader
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • map[string]string: File metadata
    • error: Error if extraction fails

GetFileMetadataFromMultipart

Extracts metadata from an HTTP multipart uploaded file without storing it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • file *multipart.FileHeader: Uploaded file
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • map[string]string: File metadata
    • error: Error if extraction fails

Search

Search

Performs vector similarity search.

  • Parameters:
    • ctx context.Context: Context for the operation
    • params SearchParams: Search parameters including:
      • Query: Search query text
      • TopK: Maximum number of results to return
      • SimilarityThreshold: Minimum similarity score (0-1)
      • CreatorID: Creator ID for filtering results (optional)
      • KBID: Knowledge base ID to limit search scope (optional)
  • Returns:
    • []SearchResult: Search results
    • error: Error if search fails
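
To illustrate how `TopK` and `SimilarityThreshold` shape a result set, the sketch below ranks in-memory vectors by cosine similarity. It assumes cosine similarity as the metric, which this README does not confirm, and the `scored`, `cosine`, and `topK` names are hypothetical. The SDK itself runs the search in PostgreSQL rather than in Go.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// scored pairs a chunk identifier with its similarity to the query.
type scored struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if the lengths differ or either vector is all zeros.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK keeps chunks scoring at least threshold, sorts them by descending
// score (ties broken by ID), and returns at most k of them.
func topK(query []float32, chunks map[string][]float32, k int, threshold float64) []scored {
	var out []scored
	for id, v := range chunks {
		if s := cosine(query, v); s >= threshold {
			out = append(out, scored{ID: id, Score: s})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	if k >= 0 && len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	chunks := map[string][]float32{
		"a": {1, 0}, // same direction as the query
		"b": {1, 1}, // 45 degrees away
		"c": {0, 1}, // orthogonal, filtered out by the threshold
	}
	for _, r := range topK([]float32{1, 0}, chunks, 5, 0.6) {
		fmt.Printf("%s %.3f\n", r.ID, r.Score)
	}
}
```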

FullTextSearch

Performs traditional full-text search.

  • Parameters:
    • ctx context.Context: Context for the operation
    • query string: Search query
    • limit int: Maximum number of results
    • creatorID string: Creator ID for filtering results (optional)
    • kbID string: Knowledge base ID to limit search scope (optional)
  • Returns:
    • []SearchResult: Search results
    • error: Error if search fails

HybridSearch

Performs hybrid search (vector + full-text).

  • Parameters:
    • ctx context.Context: Context for the operation
    • params SearchParams: Search parameters including:
      • Query: Search query text
      • TopK: Maximum number of results to return
      • SimilarityThreshold: Minimum similarity score (0-1)
      • CreatorID: Creator ID for filtering results (optional)
      • KBID: Knowledge base ID to limit search scope (optional)
  • Returns:
    • []SearchResult: Search results
    • error: Error if search fails
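
This README does not say how the vector and full-text results are merged. One common approach is reciprocal rank fusion (RRF), sketched below purely as an illustration; the `rrf` function and the constant 60 are assumptions for this example, not the SDK's documented behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// rrf merges several ranked ID lists with reciprocal rank fusion:
// each list contributes 1/(k + rank) to an ID's score, with rank
// starting at 1. Hypothetical helper, not part of KnowledgeSDK.
func rrf(lists [][]string, k float64) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for rank, id := range list {
			scores[id] += 1.0 / (k + float64(rank+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Sort by descending fused score, breaking ties by ID for stability.
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	vectorHits := []string{"a", "b", "c"}
	textHits := []string{"b", "d", "a"}
	fmt.Println(rrf([][]string{vectorHits, textHits}, 60)) // prints [b a d c]
}
```

Because RRF uses only ranks, it avoids having to put vector similarity scores and full-text relevance scores on a common scale.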

Embedding Generation

GenerateEmbedding

Generates a vector embedding for text.

  • Parameters:
    • ctx context.Context: Context for the operation
    • text string: Text to embed
  • Returns:
    • []float32: Vector embedding
    • error: Error if generation fails

BatchGenerateEmbeddings

Generates vector embeddings for multiple texts in batch.

  • Parameters:
    • ctx context.Context: Context for the operation
    • texts []string: Texts to embed
  • Returns:
    • [][]float32: Vector embeddings
    • error: Error if generation fails

GetEmbeddingStatus

Retrieves the status of embedding generation.

  • Parameters:
    • ctx context.Context: Context for the operation
  • Returns:
    • *ChunkStatus: Status information
    • error: Error if retrieval fails

UpdateChunkEmbedding

Updates a chunk's vector embedding.

  • Parameters:
    • ctx context.Context: Context for the operation
    • chunk *Chunk: Chunk to update
    • embedding []float32: Vector embedding
  • Returns:
    • error: Error if update fails

BatchUpdateChunkEmbeddings

Updates multiple chunks' vector embeddings in batch.

  • Parameters:
    • ctx context.Context: Context for the operation
    • chunks []Chunk: Chunks to update
    • embeddings [][]float32: Vector embeddings
  • Returns:
    • error: Error if update fails

GetPendingChunks

Retrieves chunks pending embedding generation.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of chunks
  • Returns:
    • []Chunk: Pending chunks
    • error: Error if retrieval fails

Document Status Management

UpdateDocumentStatus

Updates a document's status.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
    • status string: New status
  • Returns:
    • error: Error if update fails

MarkDocumentAsUploadSuccessful

Marks a document as successfully uploaded.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsUploadFailed

Marks a document as failed during upload.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsExtractSuccessful

Marks a document's content extraction as successful.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsExtractFailed

Marks a document as failed during content extraction.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsSplitSuccessful

Marks a document as successfully chunked.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsSplitFailed

Marks a document as failed during chunking.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsIndexSuccessful

Marks a document as successfully indexed.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsIndexFailed

Marks a document as failed during indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

IsDocumentReadyForExtract

Checks if a document is ready for content extraction.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if ready, false otherwise
    • error: Error if check fails

IsDocumentReadyForSplit

Checks if a document is ready for chunking.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if ready, false otherwise
    • error: Error if check fails

IsDocumentReadyForIndex

Checks if a document is ready for indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if ready, false otherwise
    • error: Error if check fails

GetDocumentsInStatus

Retrieves documents with a specific status.

  • Parameters:
    • ctx context.Context: Context for the operation
    • status string: Status to filter by
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents with the specified status
    • error: Error if retrieval fails

GetDocumentsForExtract

Retrieves documents waiting for content extraction.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents waiting for content extraction
    • error: Error if retrieval fails

GetDocumentsForSplit

Retrieves documents waiting for chunking.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents waiting for chunking
    • error: Error if retrieval fails

GetDocumentsForIndex

Retrieves documents waiting for indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents waiting for indexing
    • error: Error if retrieval fails

CheckDocumentIndexStatus

Checks if all chunks of a document are indexed.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if all chunks are indexed, false otherwise
    • error: Error if check fails

UpdateDocumentIndexStatus

Updates a document's index status based on its chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if update fails

Chunk Management

UpdateDocumentChunks

Updates multiple document chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • chunks []Chunk: Chunks to update
  • Returns:
    • error: Error if update fails

GetChunksNeedingIndex

Retrieves chunks that need indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of chunks
  • Returns:
    • []Chunk: Chunks needing indexing
    • error: Error if retrieval fails

MarkChunkAsIndexed

Marks a chunk as indexed.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
    • chunkIndex int: Chunk index
  • Returns:
    • error: Error if marking fails

CompareChunkContent

Compares the content of two chunks.

  • Parameters:
    • chunk1 *Chunk: First chunk
    • chunk2 *Chunk: Second chunk
  • Returns:
    • bool: True if content is identical, false otherwise

Utility Methods

GetDB

Retrieves the GORM database connection.

  • Returns:
    • *gorm.DB: Database connection

GetOpenAIClient

Retrieves the OpenAI client.

  • Returns:
    • *openai.Client: OpenAI client

GetEmbeddingModel

Retrieves the current embedding model name.

  • Returns:
    • string: Embedding model name

GetModelDimension

Retrieves the model vector dimension.

  • Returns:
    • int: Vector dimension

EmbeddingToPgVector

Converts a vector embedding to PostgreSQL vector format.

  • Parameters:
    • embedding []float32: Vector embedding
  • Returns:
    • string: PostgreSQL vector format
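
For reference, pgvector accepts vectors in a bracketed, comma-separated text form such as `[0.5,-1,2.25]`. The sketch below shows one way to produce that form; `toPgVector` is a hypothetical helper written for this example, and the SDK's own output may format numbers differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toPgVector renders an embedding in pgvector's text input form,
// e.g. [0.5,-1,2.25]. Hypothetical helper, not part of KnowledgeSDK.
func toPgVector(v []float32) string {
	var sb strings.Builder
	sb.WriteByte('[')
	for i, x := range v {
		if i > 0 {
			sb.WriteByte(',')
		}
		// bitSize 32 yields the shortest string that round-trips a float32.
		sb.WriteString(strconv.FormatFloat(float64(x), 'f', -1, 32))
	}
	sb.WriteByte(']')
	return sb.String()
}

func main() {
	fmt.Println(toPgVector([]float32{0.5, -1, 2.25})) // prints [0.5,-1,2.25]
}
```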

DefaultTikaConfig

Returns the default Tika configuration, with URL set to "http://localhost:9998".

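Testing

Test Environment Setup

  1. Create the test database
make setup-test-db
  2. Run all tests
make test
  3. Run the search tests
make test-search
  4. Run the knowledge base tests
make test-kb
  5. Clean up test data
make clean-test-db

Automated Test Script

Use the provided test script to run the complete test workflow:

# Run the full test suite (including setup and cleanup)
./scripts/test_search.sh --cleanup

# Set up the test environment only
./scripts/test_search.sh --setup-only

# Clean up test data only
./scripts/test_search.sh --cleanup-only

Test Data Isolation

To keep tests reliable, every test isolates its data:

  • Unique identifiers prevent collisions between test data
  • Data created by a test is cleaned up automatically after the test finishes
  • A dedicated test database is recommended
  • See the Testing Best Practices Guide for details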

Constants

Document Status Constants

  • DocStatusUploadFailed: Upload failed
  • DocStatusUploadSuccess: Upload successful, waiting for content extraction
  • DocStatusExtractFailed: Content extraction failed
  • DocStatusExtractSuccess: Content extraction successful, waiting for chunking
  • DocStatusSplitFailed: Chunking failed
  • DocStatusSplitSuccess: Chunking successful, waiting for indexing
  • DocStatusIndexFailed: Indexing failed
  • DocStatusIndexSuccess: Indexing successful
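
The statuses above describe a linear pipeline: upload, extract, split, index. The sketch below maps a status to the stage a document is waiting for. The string values and the `nextStage` function are hypothetical; the actual values of the `DocStatus*` constants are not shown in this README.

```go
package main

import "fmt"

// Hypothetical status values for this sketch; the real DocStatus*
// constants in KnowledgeSDK may use different strings.
const (
	statusUploadFailed   = "upload_failed"
	statusUploadSuccess  = "upload_success"
	statusExtractFailed  = "extract_failed"
	statusExtractSuccess = "extract_success"
	statusSplitFailed    = "split_failed"
	statusSplitSuccess   = "split_success"
	statusIndexFailed    = "index_failed"
	statusIndexSuccess   = "index_success"
)

// nextStage returns the processing stage a document in the given status
// is waiting for, or "" if it is finished, failed, or unknown.
func nextStage(status string) string {
	switch status {
	case statusUploadSuccess:
		return "extract"
	case statusExtractSuccess:
		return "split"
	case statusSplitSuccess:
		return "index"
	default:
		return ""
	}
}

func main() {
	for _, s := range []string{statusUploadSuccess, statusSplitSuccess, statusIndexFailed} {
		fmt.Printf("%s -> %q\n", s, nextStage(s))
	}
}
```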
