Set Similarity Join with Apache Spark

Project Overview

Implementation of Set Similarity Join operations using Apache Spark, focusing on different filtering techniques for large-scale text comparison. Developed and tested on Databricks Community Edition. The repository includes a step-by-step guide, a pipeline version, and tokenization notebooks (all fully commented), along with a complete PowerPoint presentation of the project.

🛠️ Core Features

  • Text Pre-Processing & Tokenization
  • Multiple Filtering Techniques:
    • Prefix Filtering
    • Length Filtering
    • Positional Filtering
  • Jaccard Similarity Calculation
  • Performance Optimizations
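The Jaccard similarity used throughout is |A ∩ B| / |A ∪ B| over token sets. A minimal illustration (the function name and example tokens are ours, not from the notebooks):

```python
def jaccard(tokens1, tokens2):
    """Jaccard similarity between two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(tokens1), set(tokens2)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# 3 shared tokens out of 5 distinct tokens overall.
print(jaccard(["set", "similarity", "join", "spark"],
              ["set", "similarity", "join", "flink"]))  # → 0.6
```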

📁 Repository Structure

Spark-Set-Similarity-Join/
├── notebooks/
│   ├── pipeline/
│   │   └── Set_Similarity_Join_with_Spark_Q-3_pipeline.ipynb
│   ├── step_by_step/
│   │   ├── Set_Similarity_Join_with_Spark_Q-3_step_by_step_def.ipynb
│   │   └── Set_Similarity_Join_with_Spark_Q-3_step_by_step_def_ok_100k.ipynb
│   └── tokenization/
│       └── Set_Similarity_Join_with_Spark_word-tokenization_step_by_step_def.ipynb
└── data/
    ├── [XXX]KIdDuplicates.json
    └── [XXX]KProfiles.json

🔧 Environment Setup

Databricks Configuration

  1. Access Databricks Community Edition
  2. Create cluster with:
    • Runtime: 12.2 LTS (Apache Spark 3.3.2)
    • Python: 3.9
    • Node Type: Standard_DS3_v2

Required Libraries

# dbutils.library.installPyPI was removed in Databricks Runtime 7.0;
# on Runtime 12.2 LTS, install notebook-scoped libraries with %pip:
%pip install nltk matplotlib

📚 Implementation Variants

1. Step-by-Step Implementation

  • File: notebooks/step_by_step/Set_Similarity_Join_with_Spark_Q-3_step_by_step_def.ipynb
  • Detailed filtering implementations
  • Comprehensive explanations
  • Perfect for learning purposes

2. Pipeline Version

  • File: notebooks/pipeline/Set_Similarity_Join_with_Spark_Q-3_pipeline.ipynb
  • Production-ready implementation
  • Optimized filter chains

3. Large-Scale Version (100k)

  • File: notebooks/step_by_step/Set_Similarity_Join_with_Spark_Q-3_step_by_step_def_ok_100k.ipynb
  • Memory-optimized for large datasets
  • Enhanced error handling

4. Advanced Tokenization

  • File: notebooks/tokenization/Set_Similarity_Join_with_Spark_word-tokenization_step_by_step_def.ipynb
  • Advanced text processing
  • NLTK integration
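As a rough illustration of the pre-processing these notebooks perform: lowercase the text, extract word tokens, and drop stopwords. The notebooks use NLTK for this; the sketch below uses only the standard library, and the stopword list and function name are our own:

```python
import re

# Tiny stopword list for illustration only; the notebooks rely on NLTK's corpus.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "is"}

def tokenize(text):
    """Lowercase, extract alphanumeric word tokens, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

print(tokenize("The Implementation of Set Similarity Join in Spark"))
# → ['implementation', 'set', 'similarity', 'join', 'spark']
```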

⚙️ Key Components

Filtering Techniques

import math

# Length Filtering: with Jaccard threshold t and len(tokens1) <= len(tokens2),
# a pair can only qualify if the shorter set has at least t * len(tokens2) tokens.
def length_filter(tokens1, tokens2, threshold):
    len1, len2 = len(tokens1), len(tokens2)
    return len1 <= len2 and len1 >= len2 * threshold

# Prefix Filtering: number of leading tokens (under a global token ordering)
# that must be compared so no qualifying pair is missed.
def prefix_length(tokens, threshold):
    return len(tokens) - math.ceil(len(tokens) * threshold) + 1

# Positional Filtering: at matching positions pos1/pos2 with `overlap` tokens
# matched so far, keep the pair only if a best-case match of the remaining
# suffixes could still reach the required overlap.
def positional_filter(tokens1, tokens2, pos1, pos2, overlap, threshold):
    required_overlap = math.ceil(threshold * (len(tokens1) + len(tokens2)) / (1 + threshold))
    upper_bound = overlap + min(len(tokens1) - pos1, len(tokens2) - pos2)
    return upper_bound >= required_overlap
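Prefix filtering only works when every record orders its tokens by the same global ranking, typically ascending document frequency, so that prefixes are dominated by rare tokens. A hedged sketch of how prefixes might be built (the helper name and ordering details are illustrative, not taken from the notebooks):

```python
import math
from collections import Counter

def prefix_tokens(records, threshold):
    """Sort each record's distinct tokens by ascending global frequency
    (ties broken alphabetically), then keep only the prefix that must be
    compared: len - ceil(len * t) + 1 tokens."""
    freq = Counter(tok for rec in records for tok in set(rec))
    prefixes = []
    for rec in records:
        ordered = sorted(set(rec), key=lambda t: (freq[t], t))
        k = len(ordered) - math.ceil(len(ordered) * threshold) + 1
        prefixes.append(ordered[:k])
    return prefixes

records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
# Two records can only reach Jaccard >= 0.6 if their prefixes share a token:
# here the first two prefixes share "a", so only that pair is a candidate.
print(prefix_tokens(records, 0.6))  # → [['d', 'a'], ['e', 'a'], ['x', 'y']]
```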

📊 Performance Optimization Tips

  • Apply the filters in sequence: Length → Prefix → Positional → Suffix
  • Use batch processing for large datasets
  • Leverage Spark caching strategically
  • Monitor memory usage via Databricks metrics
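The point of that ordering is to run the cheapest pruning first and exact verification last. A hedged end-to-end sketch on a single candidate pair (the function name is ours, and the notebooks implement this over Spark RDD/DataFrame operations rather than plain Python):

```python
def passes_filters(tokens1, tokens2, threshold):
    """Prune with the cheap length filter first, then verify the
    surviving pair with an exact Jaccard computation."""
    a, b = set(tokens1), set(tokens2)
    short, long_ = sorted((a, b), key=len)
    # Length filter: the shorter set needs at least t * |longer| tokens.
    if len(short) < threshold * len(long_):
        return False
    # Final verification: exact Jaccard similarity.
    return len(a & b) / len(a | b) >= threshold

print(passes_filters(["a", "b", "c", "d"], ["a", "b", "c", "e"], 0.5))  # → True
print(passes_filters(["a", "b", "c", "d"], ["a"], 0.5))                 # → False (pruned)
```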

🤝 Contributors

Lorenzo Sasso (MSc in Data Engineering UniMore)
