Scaling Up Nearest Neighbor Search : How Dataset Size and Dimensionality Affect KNN Variants
This project investigates the performance trade-offs between efficiency and accuracy in KNN search algorithms. It compares three variants: Naive KNN, LSH, and KD-Tree, analyzing their training and testing time, memory usage, and effectiveness in finding the K nearest neighbors for varying dataset sizes (N) and dimensionalities (D). Through visualization, the project showcases the approximate nature of LSH and KD-Tree and how they might miss some true nearest neighbors in exchange for faster execution. The findings provide valuable insights into choosing the suitable KNN variant.
-
Evaluate the performance (training/testing time, memory usage) and accuracy of the following KNN variants for nearest neighbor search:
Naive KNNLocality-Sensitive Hashing (LSH)KD-Tree
-
Analyze the impact of dataset size (
N) and dimensionality (D) on these factors. -
Visually demonstrate the approximate nature of LSH and KD-Tree by comparing the true
Knearest neighbors with those found by these methods in a 2D dataset.
-
Data Generation:
- Create a 2D dataset with varying
NandDvalues using a random number generator.
- Create a 2D dataset with varying
-
KNN Implementations:
- Naive KNN:
- Compute the distance between the query point and all data points.
- Maintain a priority queue (min-heap) to store the
Kclosest points encountered so far, updating it as needed. - Time complexity:
O(N * D)for distance calculations,O(K * log K)for priority queue operations (can be optimized with more efficient data structures).
- LSH:
- Define appropriate hash functions that map similar points to the same bucket with high probability.
- Generate multiple hash tables using different hash functions.
- Query the hash tables and retain points that fall into the same buckets as the query point.
- Perform a final distance calculation on the remaining points to identify the true
Knearest neighbors. - Time complexity: Amortized constant time for hash lookups, extra distance calculations for non-perfect hash functions.
- Memory complexity: Linear in
N(hash tables) and potentiallyD(hash function parameters).
- KD-Tree:
- Construct a KD-Tree, a recursive binary tree where each node represents a hyperplane splitting the space based on a chosen dimension.
- Employ a recursive search algorithm to traverse the KD-Tree, traversing only subspaces that could potentially contain the nearest neighbors.
- Time complexity:
O(log N)for searching in high dimensions, can be higher for skewed data distributions. - Memory complexity: Linear in
N(tree structure).
- Naive KNN:
-
Performance Evaluation:
- Measure training time (building data structures for LSH and KD-Tree) and testing time (finding
Knearest neighbors) for each KNN variant. - Record memory usage for each variant.
- Analyze the impact of
NandDon these metrics.
- Measure training time (building data structures for LSH and KD-Tree) and testing time (finding
-
Visualization:
- Generate a 2D plot of the dataset with the query point highlighted.
- Identify the true
Knearest neighbors using the naive KNN approach for reference. - Show the partitions created by LSH (buckets) and KD-Tree (hyperplane splits) in the 2D space.
- Mark the points identified by LSH and KD-Tree as their approximate
Knearest neighbors. - Visually compare true neighbors vs. approximate neighbors to illustrate the trade-off between efficiency and accuracy in approximate methods.
Contribution : Mrugank Patil & Omkar Rajeev Prabhu