Implementation of Set Similarity Join operations using Apache Spark, focusing on different filtering techniques for large-scale text comparison. Developed and tested on Databricks Community Edition. The repository includes a step-by-step guide, a pipeline version, and tokenization notebooks (fully commented), plus a complete PowerPoint presentation of the project.
- Text Pre-Processing & Tokenization
- Multiple Filtering Techniques:
  - Prefix Filtering
  - Length Filtering
  - Positional Filtering
- Jaccard Similarity Calculation
- Performance Optimizations
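As a minimal illustration of the similarity measure used throughout, the Jaccard similarity of two token sets is the size of their intersection divided by the size of their union (the helper below is a sketch, not code from the notebooks):

```python
def jaccard_similarity(tokens1, tokens2):
    # Jaccard = |A ∩ B| / |A ∪ B|; 1.0 means identical token sets
    a, b = set(tokens1), set(tokens2)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

# 2 shared tokens out of 4 distinct ones -> 0.5
print(jaccard_similarity(["big", "data", "spark"], ["big", "data", "join"]))
```

Two records are reported as a match when this value meets or exceeds the chosen threshold; the filtering techniques above exist only to avoid computing it for every pair.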
```
Spark-Set-Similarity-Join/
├── notebooks/
│   ├── pipeline/
│   │   └── Set_Similarity_Join_with_Spark_Q-3_pipeline.ipynb
│   ├── step_by_step/
│   │   ├── Set_Similarity_Join_with_Spark_Q-3_step_by_step_def.ipynb
│   │   └── Set_Similarity_Join_with_Spark_Q-3_step_by_step_def_ok_100k.ipynb
│   └── tokenization/
│       └── Set_Similarity_Join_with_Spark_word-tokenization_step_by_step_def.ipynb
└── data/
    ├── [XXX]KIdDuplicates.json
    └── [XXX]KProfiles.json
```
- Access Databricks Community Edition
- Create cluster with:
- Runtime: 12.2 LTS (Apache Spark 3.3.2)
- Python: 3.9
- Node Type: Standard_DS3_v2
```python
dbutils.library.installPyPI("nltk")
dbutils.library.installPyPI("matplotlib")
```

- File: `notebooks/step_by_step/Set_Similarity_Join_with_Spark_Q-3_step_by_step_def.ipynb`
- Detailed filtering implementations
- Comprehensive explanations
- Well suited for learning purposes
- File: `notebooks/pipeline/Set_Similarity_Join_with_Spark_Q-3_pipeline.ipynb`
- Production-ready implementation
- Optimized filter chains
- File: `notebooks/step_by_step/Set_Similarity_Join_with_Spark_Q-3_step_by_step_def_ok_100k.ipynb`
- Memory-optimized for large datasets
- Enhanced error handling
- File: `notebooks/tokenization/Set_Similarity_Join_with_Spark_word-tokenization_step_by_step_def.ipynb`
- Advanced text processing
- NLTK integration
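As a rough sketch of the kind of word tokenization this notebook performs (the exact pre-processing steps may differ; with NLTK installed, `nltk.word_tokenize` can replace the regex split):

```python
import re

def tokenize(text):
    # Lowercase, replace punctuation with spaces, split on whitespace;
    # a simplified stand-in for the NLTK-based tokenization in the notebook
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()

print(tokenize("Set-Similarity Join, with Spark!"))
# -> ['set', 'similarity', 'join', 'with', 'spark']
```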
```python
import math

# Length Filtering: for Jaccard threshold t, a shorter set r can only
# match a longer set s if |r| >= t * |s| (assumes tokens1 is the shorter set)
def length_filter(tokens1, tokens2, threshold):
    len1, len2 = len(tokens1), len(tokens2)
    return len1 <= len2 and len1 >= len2 * threshold

# Prefix Filtering: number of tokens to keep in the prefix so that any
# pair meeting the threshold must share at least one prefix token
def prefix_length(tokens, threshold):
    return len(tokens) - math.ceil(len(tokens) * threshold) + 1

# Positional Filtering: required_overlap = ceil(t/(1+t) * (|r|+|s|)) is the
# minimum overlap a pair needs to reach Jaccard threshold t; check it
# against the overlap contributed by the first `pos` tokens of tokens1
def positional_filter(tokens1, tokens2, pos, threshold):
    required_overlap = math.ceil(threshold * (len(tokens1) + len(tokens2)) / (1 + threshold))
    return len(set(tokens1[:pos]).intersection(tokens2)) >= required_overlap
```

- Implement the filter chain in sequence: Length → Prefix → Positional → Suffix
- Use batch processing for large datasets
- Leverage Spark caching strategically
- Monitor memory usage via Databricks metrics
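The filter chain sequence above can be sketched in plain Python to show how candidates are pruned before the exact verification step (helper names and the simple lexicographic prefix ordering are illustrative; the notebooks run the equivalent logic as Spark transformations, and the Positional/Suffix stages are omitted here for brevity):

```python
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def similar_pairs(records, threshold):
    # Candidate generation sketch: Length -> Prefix filters, then
    # exact Jaccard verification on the surviving pairs only
    results = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            r, s = sorted((records[i], records[j]), key=len)
            # Length filter: |r| must be at least threshold * |s|
            if len(r) < threshold * len(s):
                continue
            # Prefix filter: canonically ordered prefixes must share a token
            pr = len(r) - math.ceil(len(r) * threshold) + 1
            ps = len(s) - math.ceil(len(s) * threshold) + 1
            if not set(sorted(r)[:pr]) & set(sorted(s)[:ps]):
                continue
            # Verification: compute exact Jaccard only for candidates
            if jaccard(r, s) >= threshold:
                results.append((r, s))
    return results
```

Ordering the filters from cheapest (a length comparison) to most expensive (set intersections) is what makes the chain effective: each stage only sees the pairs its predecessors could not eliminate.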
Lorenzo Sasso (MSc in Data Engineering, UniMore)