Conversation
request my review when this is ready
add a short script as a PR comment showing how the user is going to use this
cleanlab_studio/internal/util.py
Outdated
| cleanset_df: pd.DataFrame, name_col: str, num_rows: int, asc=True | ||
| ) -> List[str]: | ||
| """ | ||
| Extracts the top specified number of rows based on a specified score column from a DataFrame. |
This will only return the IDs of the datapoints to drop for a given setting of num_rows during autofix.
| Parameters: | ||
| - cleanset_df (pd.DataFrame): The input DataFrame containing the cleanset. | ||
| - name_col (str): The name of the column indicating the category for which the top rows should be extracted. | ||
| - num_rows (int): The number of rows to be extracted. |
In autofix, we can get this by multiplying the cleanset's default fraction for each issue type by the number of datapoints.
Right. When we spoke originally, we wanted this call to be similar to the Studio web interface call, so I rewrote it this way. It was a floating-point percentage before.
The function _get_autofix_defaults does the multiplication by the number of datapoints.
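A minimal sketch of the two steps discussed in this thread: turning default fractions into row counts, then picking the top-ranked IDs for a given count. The function names, the `row_id` and `score` column names, and the truncation toward zero are assumptions for illustration, not the code in this PR.

```python
from typing import Dict, List

import pandas as pd


def fractions_to_row_counts(fractions: Dict[str, float], num_datapoints: int) -> Dict[str, int]:
    """Convert per-issue fractions into absolute row counts (hypothetical helper)."""
    # Truncation toward zero is an assumption; the PR may round differently.
    return {issue: int(fraction * num_datapoints) for issue, fraction in fractions.items()}


def top_ids_by_score(
    df: pd.DataFrame, id_col: str, score_col: str, num_rows: int, asc: bool = True
) -> List[str]:
    """Return the IDs of the first `num_rows` rows after sorting by `score_col`."""
    ranked = df.sort_values(score_col, ascending=asc)
    return ranked[id_col].head(num_rows).astype(str).tolist()


# Toy data: the column names here are placeholders, not real cleanset columns.
toy_df = pd.DataFrame(
    {"row_id": ["a", "b", "c", "d"], "score": [0.9, 0.1, 0.5, 0.3]}
)
counts = fractions_to_row_counts({"label_issue": 0.5, "outlier": 0.25}, num_datapoints=8)
lowest_two = top_ids_by_score(toy_df, "row_id", "score", num_rows=2)
```

Keeping the fraction-to-count conversion separate from the ranking step matches the split described above, where one function resolves the defaults and the other only returns IDs.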
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
cleanlab_studio/internal/util.py
Outdated
| } | ||
|
|
||
|
|
||
| def _get_autofix_defaults(cleanset_df: pd.DataFrame) -> dict: |
TODO: Studio team should move this function to backend of the app so it happens on server (eventually should be used in web app too)
cleanlab_studio/internal/util.py
Outdated
| return default_values | ||
|
|
||
|
|
||
| def _get_top_fraction_ids( |
TODO: Studio team should move this function to backend of the app so it happens on server (eventually should be used in web app too)
cleanlab_studio/studio/studio.py
Outdated
| cleanset_df = self.download_cleanlab_columns(cleanset_id) | ||
| if params is None: | ||
| params = _get_autofix_defaults(cleanset_df) | ||
| print("Using autofix parameters:", params) |
todo: replace this code once an analogous method exists in Studio backend
from anish: Would you want this to be: studio.autofix_dataset(cleanset_id)
cleanlab_studio/internal/util.py
Outdated
| "label_issue": 0.5, | ||
| "near_duplicate": 0.2, | ||
| "outlier": 0.5, | ||
| "confidence_threshold": 0.95, |
change to: "relabel_confidence_threshold"
cleanlab_studio/internal/util.py
Outdated
| return not check_none(x) | ||
|
|
||
|
|
||
| def _get_autofix_default_params() -> dict: # Studio team port to backend |
allow string options (rethink names):
"optimized_training_data"
"drop_all_issues"
"suggested_actions"
drop_all_issues strategy should just drop all issues from the DF (no re-label)
suggested_actions strategy should relabel all label issues, and drop outliers and extra copies of near duplicates. Nothing is done to ambiguous examples or other issues.
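A rough sketch of how the two strategies described above could behave on a DataFrame. Every column name here (`is_label_issue`, `is_outlier`, `is_near_duplicate`, `is_ambiguous`, `suggested_label`, `near_duplicate_cluster_id`, `label`) and the function name are assumptions for illustration. "optimized_training_data" is not specified in this thread, so it is left out.

```python
import pandas as pd

# Hypothetical boolean flag columns; the real cleanset columns may differ.
ISSUE_FLAGS = ["is_label_issue", "is_outlier", "is_near_duplicate", "is_ambiguous"]


def apply_autofix_strategy(df: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Apply a named strategy to a DataFrame that already holds issue columns."""
    if strategy == "drop_all_issues":
        # Drop every row flagged with any issue; no relabeling.
        has_issue = df[ISSUE_FLAGS].any(axis=1)
        return df.loc[~has_issue].reset_index(drop=True)

    if strategy == "suggested_actions":
        fixed = df.copy()
        # Relabel all label issues with the suggested label.
        relabel = fixed["is_label_issue"]
        fixed.loc[relabel, "label"] = fixed.loc[relabel, "suggested_label"]
        # Keep the first row of each near-duplicate cluster, drop the extra copies.
        extra_copy = fixed["is_near_duplicate"] & fixed.duplicated(
            "near_duplicate_cluster_id", keep="first"
        )
        # Outliers are dropped; ambiguous rows and other issues are left alone.
        keep = ~(fixed["is_outlier"] | extra_copy)
        return fixed.loc[keep].reset_index(drop=True)

    raise ValueError(f"Unknown autofix strategy: {strategy!r}")


toy = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d", "e", "f"],
        "label": ["x", "y", "x", "y", "x", "y"],
        "suggested_label": [None, "x", None, None, None, None],
        "is_label_issue": [False, True, False, False, False, False],
        "is_outlier": [False, False, True, False, False, False],
        "is_near_duplicate": [False, False, False, True, True, False],
        "near_duplicate_cluster_id": [None, None, None, 1, 1, None],
        "is_ambiguous": [True, False, False, False, False, False],
    }
)
dropped = apply_autofix_strategy(toy, "drop_all_issues")
suggested = apply_autofix_strategy(toy, "suggested_actions")
```

Raising on an unknown strategy name keeps typos in the string option from silently falling back to some default.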
| except (TimeoutError, CleansetError): | ||
| return False | ||
|
|
||
| def autofix_dataset( |
allow string options to be passed straight through into _get_autofix_default_params()
Should be added now, clarified in the docs:
cleanlab_studio/studio/studio.py
Outdated
| self, original_df: pd.DataFrame, cleanset_id: str, params: dict = None | ||
| ) -> pd.DataFrame: | ||
| """ | ||
| This method returns the auto-fixed dataset. |
Docstring should clarify that Dataset must be a DataFrame (text or tabular dataset only)
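One way the requested clarification could look, paired with an explicit type check. This is only a sketch: the helper name `check_autofix_input` and the error wording are assumptions, not part of this PR.

```python
import pandas as pd


def check_autofix_input(original_df):
    """Validate the dataset passed to autofix (hypothetical helper).

    The dataset must be a pandas DataFrame, which means text or tabular
    datasets only. Image datasets are not supported.
    """
    if not isinstance(original_df, pd.DataFrame):
        raise TypeError(
            "autofix expects a pandas DataFrame (text or tabular dataset), "
            f"got {type(original_df).__name__}"
        )
    return original_df
```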
cleanlab_studio/internal/util.py
Outdated
|
|
||
| def _get_autofix_default_params() -> dict: # Studio team port to backend | ||
| """returns default percentage-wise params of autofix""" | ||
| return { |
can choose more specific key names here
cleanlab_studio/studio/studio.py
Outdated
|
|
||
| Example: | ||
| { | ||
| 'drop_ambiguous': 9, |
cleanlab_studio/internal/util.py
Outdated
| def get_autofix_defaults_for_strategy(strategy): | ||
| return AUTOFIX_DEFAULTS[strategy] | ||
|
|
make everything that should be ported to backend a private method
cleanlab_studio/internal/util.py
Outdated
|
|
||
|
|
||
| # Studio team port to backend | ||
| def get_autofix_defaults_for_strategy(strategy): |
| def get_autofix_defaults_for_strategy(strategy): | |
| def _get_autofix_defaults_for_strategy(strategy): |
cleanlab_studio/studio/studio.py
Outdated
| dataset, cl_cols, id_col, label_col, keep_excluded | ||
| ) | ||
| return corrected_ds | ||
| return snowflake_corrected_ds |
Could this be from studio team?
in that case we need to merge their main branch first
@aditya1503 will have a look
yes need to rebase everything here against latest master branch
Skeleton code for improved Auto-Fix strategies
Link to Notion: https://www.notion.so/cleanlab/Improve-ML-accuracy-with-Studio-via-better-Autofix-99434fa92a164131b3860093d85e5350?pvs=4
Note: this is only for text/tabular datasets, not image.