Testing unknown strings throws Key Error #1

@sebader

First of all, thanks a lot @jgarciab for your work and for sharing it. Great tutorial!
I've implemented all your steps on an Azure Databricks cluster with only a few modifications and successfully trained the model on my own data set.
My issue now is that I cannot run the final step to test the similarity of two strings that were in neither the test set nor the training set:

str1 = "Hello World"
str2 = "Hallo Welt"
distances = find_distances(str1, str2)
clf.decision_function(np.array(distances,dtype=float))

throws:

Key Error at...
<command-3912256500001942> in cosineWords(a, b, dictTrain, tfidf_matrix_train)
     76     """
     77     ind_a = dictTrain[a.lower().rstrip()]
---> 78     ind_b = dictTrain[b.lower().rstrip()]
     79     score = cosine_similarity(tfidf_matrix_train[ind_a:ind_a+1], tfidf_matrix_train[ind_b:ind_b+1])
     80     return score

If str1 or str2 was not in the training set, it is obviously not a key in dictTrain, so the lookup in cosineWords fails.
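
To make the failure concrete, here is a minimal sketch of what I think is happening. It assumes dictTrain is a plain dict that maps each lower-cased, right-stripped training string to its row index in tfidf_matrix_train. The toy entries and the helper name `lookup_index` are made up for illustration and are not part of the tutorial:

```python
# Toy stand-in for dictTrain (hypothetical entries): normalized training
# string -> row index in the TF-IDF matrix built from the training set.
dictTrain = {"hello world": 0, "guten tag": 1}


def lookup_index(s, dictTrain):
    """Return the training-row index for s, or None if s was never seen.

    Uses the same normalization as cosineWords (lower() + rstrip()),
    but dict.get() instead of [] so an unseen string does not raise.
    """
    return dictTrain.get(s.lower().rstrip())


str1 = "Hello World"
str2 = "Hallo Welt"

print(lookup_index(str1, dictTrain))  # index of a string seen in training
print(lookup_index(str2, dictTrain))  # None: "hallo welt" is not a key

# The direct lookup used in cosineWords raises for the unseen string:
try:
    dictTrain[str2.lower().rstrip()]
except KeyError as err:
    print("KeyError:", err)
```

A guard like this only avoids the exception. It does not produce a cosine score for an unseen string, so the underlying question remains.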
Did I misunderstand something here or is this a bug?

Thanks!
