SetSketches saved from different processes have jaccard estimation of 0 #74

@nvanva

Hi! I'm using CSetSketch from Python. When I create, fill, and save sketches for 2 sets in the same process, loading them from disk and computing the Jaccard estimate works well. But when the 2 sets are processed in different processes, the estimate is 0. Cardinality estimation works well in both cases.
Here is a minimal example that reproduces this:

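import os
import sketch

# Size parameter passed to both sketches
m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

# Set 1: multiples of 2 up to 1000; set 2: multiples of 5 up to 1000
step1, step2, maxval1, maxval2 = 2, 5, 1000, 1000
for i in range(step1, maxval1+1, step1):
    hll.addh(str(i))

for i in range(step2, maxval2+1, step2):
    hll2.addh(str(i))

# Save both sketches, tagging each file name with this process's PID
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')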

Run this code twice, in 2 different processes. Then run:

from pathlib import Path
import sketch

# Compare every saved sketch of set 1 with every saved sketch of set 2
for ss1 in Path('./').glob('tmp1_*'):
    for ss2 in Path('./').glob('tmp2_*'):
        hll, hll2 = sketch.setsketch.CSetSketch(str(ss1)), sketch.setsketch.CSetSketch(str(ss2))
        jaccard_est = sketch.setsketch.jaccard_index(hll, hll2)
        print(ss1, ss2, jaccard_est)

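In my case it prints the following (the exact Jaccard index of the two sets is 100/600 ≈ 0.167, so only the same-process pairs are estimated correctly):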
tmp1_736949 tmp2_736949 0.16761398315429688
tmp1_736949 tmp2_736999 0.0
tmp1_736999 tmp2_736949 0.0
tmp1_736999 tmp2_736999 0.166900634765625
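
One possible explanation, not verified against the library source: if addh relies on Python's built-in hash() for str arguments (an assumption), the sketches would be built from different hash values in each process, because CPython randomizes str hashes per process unless PYTHONHASHSEED is fixed. Cardinality estimates would be unaffected by this, but sketches from different processes would no longer be comparable. The snippet below does not use the sketch library at all; it only shows the per-process hash behaviour. The helper name str_hash_in_child is made up for this illustration.

```python
import os
import subprocess
import sys


def str_hash_in_child(value: str, seed: str) -> int:
    """Return hash(value) as computed by a fresh interpreter run with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Same seed in two separate processes: the str hash is reproducible.
same_a = str_hash_in_child("1000", "1")
same_b = str_hash_in_child("1000", "1")

# Different seed: the same string hashes to a different value.
diff = str_hash_in_child("1000", "2")

print("same seed, equal hashes:", same_a == same_b)
print("different seed, equal hashes:", same_a == diff)
```

If this is indeed the cause, running both processes with the same PYTHONHASHSEED value (for example PYTHONHASHSEED=0) before creating the sketches should make the cross-process estimates match the same-process ones. I have not confirmed that this is what happens inside addh.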
