Hi! I'm using CSetSketch from Python. When I create, fill, and save the sketches for two sets within the same process, loading them from disk and computing the Jaccard estimate works well. But when the two sets are processed in different processes, the estimate is 0. Cardinality estimation works well in both cases.
Here is a minimal example showing this:
import os
import sketch
m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)
step1, step2, maxval1, maxval2 = 2, 5, 1000, 1000
for i in range(step1, maxval1+1, step1):
    hll.addh(str(i))
for i in range(step2, maxval2+1, step2):
    hll2.addh(str(i))
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')
Run this code twice, in two different processes. Then run:
from pathlib import Path
import sketch
for ss1 in Path('./').glob('tmp1_*'):
    for ss2 in Path('./').glob('tmp2_*'):
        hll, hll2 = sketch.setsketch.CSetSketch(str(ss1)), sketch.setsketch.CSetSketch(str(ss2))
        jaccard_est = sketch.setsketch.jaccard_index(hll, hll2)
        print(ss1, ss2, jaccard_est)
It will print this in my case:
tmp1_736949 tmp2_736949 0.16761398315429688
tmp1_736949 tmp2_736999 0.0
tmp1_736999 tmp2_736949 0.0
tmp1_736999 tmp2_736999 0.166900634765625
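For reference, the exact Jaccard index of these two sets can be checked with plain Python sets, without the sketch library. The sketch below reuses the parameters from the example above; the names `set1`, `set2`, and `exact_jaccard` are only illustrative. The intersection has 100 elements and the union has 600, so the exact value is 100/600 ≈ 0.1667. That suggests the same-process estimates are right and the cross-process 0.0 values are the wrong ones.

```python
# Exact Jaccard index of the two sets used in the example above,
# computed with built-in sets only (no sketch library involved).
step1, step2, maxval1, maxval2 = 2, 5, 1000, 1000

set1 = {str(i) for i in range(step1, maxval1 + 1, step1)}  # multiples of 2
set2 = {str(i) for i in range(step2, maxval2 + 1, step2)}  # multiples of 5

# |intersection| / |union|
exact_jaccard = len(set1 & set2) / len(set1 | set2)
print(exact_jaccard)
```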