55 changes: 44 additions & 11 deletions MachineLearningMathBasic/2.9.8 杰卡德距离-薛冰冰.md
@@ -2,17 +2,16 @@


### 1. Definition
The Jaccard distance measures the dissimilarity between two sets. It is the complement of the Jaccard similarity coefficient, defined as 1 minus the Jaccard similarity coefficient.
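
Written out, the definition takes the following standard form. Here $J(A,B)$ is the Jaccard similarity coefficient, and $d_J$ is a symbol introduced only for this illustration to denote the distance:

$$
J(A,B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
$$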


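### 2. Properties
If both sets A and B are empty, the Jaccard similarity coefficient is defined as J(A,B)=1, so the Jaccard distance is 0.

0<=J(A,B)<=1, and therefore the Jaccard distance 1-J(A,B) also lies between 0 and 1.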
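
The properties above can be checked with a small sketch on plain Python sets. The function names `jaccard_similarity` and `jaccard_distance` are hypothetical helpers chosen for this illustration, not part of any library, and the empty-set branch simply encodes the convention stated above.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity coefficient of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention from the property above: two empty sets give J = 1
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 minus the similarity coefficient."""
    return 1.0 - jaccard_similarity(a, b)


print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 in total = 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 1 - 0.5 = 0.5
print(jaccard_similarity(set(), set()))          # 1.0 by convention
```

Handling the empty union explicitly avoids a division by zero and keeps the result inside the range [0, 1].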


### 3. Implementation in Python
```python
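import numpy as np
import scipy.spatial.distance as dist  # import elided in the diff view; assumed from the use of dist.pdist

# Two binary row vectors; pdist returns the pairwise distances between rows
matV = np.array([[1, 1, 1, 1], [1, 0, 0, 1]], dtype=bool)
print(dist.pdist(matV, 'jaccard'))
# Result: [0.5]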
```

- Y = pdist(X, 'euclidean')

Euclidean distance.
@@ -35,8 +33,43 @@
Manhattan distance.




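> Applications:
> Given two n-dimensional binary vectors A and B whose components are each either 0 or 1, the Jaccard similarity coefficient can be used to measure how similar they are.

> - Filtering highly similar news articles, or web page deduplication
> - Exam anti-cheating systems
> - Paper plagiarism detection systems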
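
For the binary-vector case described above, a minimal pure-Python sketch might look like the following. The helper name `binary_jaccard` is hypothetical, and treating two all-zero vectors as identical is an assumption carried over from the empty-set convention.

```python
def binary_jaccard(u, v):
    """Jaccard similarity of two equal-length 0/1 vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # Positions where both are 1 (intersection) and where at least one is 1 (union)
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    # Assumed convention: two all-zero vectors count as identical
    return 1.0 if either == 0 else both / either


A = [1, 1, 1, 1]
B = [1, 0, 0, 1]
print(binary_jaccard(A, B))      # similarity: 2 / 4 = 0.5
print(1 - binary_jaccard(A, B))  # distance: 0.5
```

Positions where both vectors are 0 are ignored in both the numerator and the denominator, which is what distinguishes the Jaccard coefficient from a simple matching ratio.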

Example: text duplicate checking
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def jaccard_similarity(s1, s2):
    def add_space(s):
        # Insert a space between characters so that each character becomes a token
        return ' '.join(s)

    s1, s2 = add_space(s1), add_space(s2)
    # Convert the strings to a term-frequency (TF) matrix, one row per string.
    # For the inputs below the vocabulary is ['么', '什', '你', '呢', '嘛', '在', '干']
    # and fit_transform(...).toarray() gives:
    # [[0 0 1 1 1 1 1]
    #  [1 1 1 1 0 1 1]]
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Intersection size: sum of the column-wise minimum counts
    numerator = np.sum(np.min(vectors, axis=0))
    # Union size: sum of the column-wise maximum counts
    denominator = np.sum(np.max(vectors, axis=0))
    # Compute the Jaccard coefficient
    return 1.0 * numerator / denominator


s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(jaccard_similarity(s1, s2))
# Result: 0.5714285714285714 (4/7)
```