Soba Compressor / Embedder

Soba Compressor reads the *.pdf and *.txt files in a given directory and splits their text into sentences. It then chunks the sentences by sentence count and word count and writes the result as compressed_*.txt files in the soba-embedder/compressed/* directory.
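The chunking step described above can be sketched roughly as follows. This is only an illustration, not the code in custom_compressor.py: the function names, the `max_sentences` and `max_words` limits, and the regex-based sentence splitter (a stand-in for punkt) are all assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter (stand-in for punkt): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_sentences=3, max_words=40):
    """Group consecutive sentences, closing a chunk when either limit would be exceeded."""
    chunks, current, word_count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if it is full or this sentence would push it past max_words.
        if current and (len(current) >= max_sentences or word_count + n_words > max_words):
            chunks.append(" ".join(current))
            current, word_count = [], 0
        current.append(sentence)
        word_count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Soba reads files. It splits them into sentences. Then it groups them. Finally it writes chunks."
chunks = chunk_sentences(split_sentences(text), max_sentences=2, max_words=40)
for chunk in chunks:
    print(chunk)
```

A sentence longer than `max_words` still becomes its own chunk in this sketch rather than being split mid-sentence.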

The compressed files are then passed to the Soba Embedder, which embeds them into a vector database that can later be queried with a KNN (k-nearest-neighbour) search. By default, the embeddings are not converted to tensor format; you can change this in main.py by modifying the Model's parameters.
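To make the KNN lookup concrete, here is a minimal sketch of a brute-force nearest-neighbour search over embedding vectors using cosine similarity. It is an assumption about how such a search could look, not the project's own implementation; `cosine_similarity`, `knn_search`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_search(query, vectors, k=2):
    """Return the k (index, score) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
neighbours = knn_search([1.0, 0.0, 0.0], toy_vectors, k=2)
print(neighbours)
```

In practice the query would be embedded with the same model as the stored chunks, so that both live in the same vector space.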

For the underlying concept, refer to: Hugging Face - Advanced RAG

Installations

Check required/supported CUDA Version

# Running the nvidia-smi command should show details such as the following
nvidia-smi

# Refer to "CUDA Version: 12.4" when performing the next step
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61                 Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

Install relevant PyTorch Version (CUDA Support)

PyTorch CUDA Support - Download Link
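After installing, it can help to confirm that the PyTorch build actually sees the GPU. The following is a small sketch, assuming PyTorch may or may not be installed yet; `describe_torch_cuda` is a hypothetical helper name, while `torch.cuda.is_available()` and `torch.version.cuda` are standard PyTorch attributes.

```python
def describe_torch_cuda():
    """Report whether PyTorch is installed and whether it can use CUDA."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False, "cuda_available": False, "cuda_build": None}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # True only if a CUDA-enabled build finds a usable GPU and driver.
        "cuda_available": torch.cuda.is_available(),
        # CUDA version the wheel was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
    }


info = describe_torch_cuda()
print(info)
```

If `cuda_available` is False while `nvidia-smi` lists a GPU, the installed wheel is likely a CPU-only build or one built for a CUDA version the driver does not support.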

Install punkt

If required, download the punkt sentence tokenizer data using NLTK:

import nltk

if __name__ == "__main__":
    nltk.download('punkt')

How to use

Standalone Data Compression

from custom_compressor import GenerateCompressedFiles

# Reads all *.pdf and *.txt files from "your_file_directory" and
# writes them to "soba-inferer/compressed/*", where * represents
# the read file's prefix directory.
if __name__ == "__main__":
    GenerateCompressedFiles("your_file_directory")

Standalone Data Embedding

import os

from sentence_transformers import SentenceTransformer

from custom_embedder import GenerateAllEmbeddings

# Creates individual embedding files (*.pkl) using the 'all-mpnet-base-v2'
# model from SentenceTransformer after reading each "compressed_file" from
# "your_compressed_file_directory".

if __name__ == "__main__":
    base_dir = os.path.dirname(os.path.abspath(__file__))

    # Perform embedding; the empty last component keeps the trailing path separator
    all_embeddings = GenerateAllEmbeddings(
        os.path.join(base_dir, 'compressed', ''),                 # Input directory (all compressed files)
        os.path.join(base_dir, 'embedding', ''),                  # Output directory (embeddings.pkl)
        SentenceTransformer('all-mpnet-base-v2', device="cuda"),  # Load embedding model on GPU
    )

For more information on Sentence Transformers, refer to Sentence Transformer Pretrained Models.
