Add pandas __setitem__ support for DocumentTermDF#158
Conversation
suport MultiIndex as function parameter returns MultiIndex, where Representation was returned * missing: correct test Co-authored-by: Henri Froese <hf2000510@gmail.com>
*missing: test adopting for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>
*missing: unitTest Co-authored-by: Henri Froese <henri.froese@yahoo.com>
|
@jbesomi we now have also adapted the set_item method to handle the changes in PR #156 when the set item method receives a data frame, which has more than one column it will convert the columns into a pandas multi-index with the original column names on the lower level and the >>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([[3, 5], [8, 4]], columns=["term 1", "term 2"],)
>>> df1["count"] = df2
>>> df1
Test count
term 1 term 2
0 Text 1 3 5
1 Text 2 8 4when just a single column DataFrame is inserted, it will behave, as it used to: >>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([3, 5], columns=["count"])
>>> df1["here"] = df2
>>> df1
Test here
0 Text 1 3
1 Text 2 5 |
|
Magic. 🥇 I will review this once #156 is merged so that we can test a bit around with the output DataFrame(s). Does your solution handle correctly the insertion of sparse Pandas DF? For you to know, my only concern is that I don't fully understand why the Pandas API does not allow for such insertion. |
There is no reason, that the pandas API does not support it. This issue we opened at pandas about it. Now our usecase is much easier, with a single column level.
It works so far. As soon, as the other two are merged, I will move this PR to ready to review |
Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
henrifroese
left a comment
There was a problem hiding this comment.
We just went through everything again and added additional tests for sparseness. We believe this is ready to be reviewed/merged now
🐙
jbesomi
left a comment
There was a problem hiding this comment.
Hmm, less complex than expected 🧐
Looks great!
What if we add this for now as an extension that users can test? We can insert this into a separate file texthero.beta.helper or similar and explain in a blog article how to "activate that" (from texthero.beta import helper ...) and what it allows us to do.
| - we don't destroy any pandas functionality as currently calling | ||
| `__setitem__` with a Multiindexed value is just not possible, so | ||
| our changes to Pandas do not break any Pandas functionality for | ||
| the users. We're only _expanding_ the functinoality |
| 2 0 0 1 1 1 1 | ||
| ``` | ||
|
|
||
| That's a DataFrame. Great! Of course, users can |
There was a problem hiding this comment.
we can be a bit more succint here (That's a DataFrame. Great! Of course,)
|
I also discussed this issue with @henrifroese. we decided that we will not merge it, as discussed in the meeting. In our view when hiding this feature in a beta version, most users will probably ignore this. As this is performance-wise not optimized for huge DataFrames we decided that the tradeoff is not worth it and we will close this PR |
|
Ok. I'm sorry that we couldn't merge it, as it would have been a great feature for users, yet, it would have been probably worse to introduce something that would have made the user experience less pleasant. The fact that assigning a sparse DataFrame takes so long because of the large number of columns is what makes me think twice before making it part of the master branch. Let's hope that with Pandas v.2 we will be able to do something similar in a more efficient way 🙏🏻 |
As discussed, this PR makes small changes to
pd.DataFrame.__setitem__to allow users to do e.g.df["tfidf"] = hero.tfidf(df["text"]). See the extensive documentation we added in_helper.py. We also add tests for this 🕵️ 🕵️♂️ 🕵️♀️NOTE: only so many commits/lines as this builds on #156