improved autofix strategy by aditya1503 · Pull Request #148 · cleanlab/cleanlab-studio

aditya1503 · 2023-11-16T20:40:54Z

Skeleton code for improved Auto-Fix strategies

import os

import pandas as pd
from cleanlab_studio import Studio

API_KEY = os.environ['CLEANLAB_API_KEY']
studio = Studio(API_KEY)
df = pd.DataFrame(...)
dataset_id = studio.upload_dataset(df)
project_id = studio.create_project(dataset_id=dataset_id, ...)
cleanset_id = studio.get_latest_cleanset_id(project_id)


# Beginner user:
new_df = studio.autofix_dataset(df, cleanset_id)  # returns a deep copy of df with fixes applied


# Advanced user pattern:
hyperparam_dict = get_autofix_defaults(cleanset_id)  # integer values: the number of data points to fix/exclude for each issue type
# a user who wants to edit less data can manually lower the integers in hyperparam_dict
new_df = studio.autofix_dataset(df, cleanset_id, params=hyperparam_dict)

Link to Notion: https://www.notion.so/cleanlab/Improve-ML-accuracy-with-Studio-via-better-Autofix-99434fa92a164131b3860093d85e5350?pvs=4

Note: this is only for text/tabular datasets, not image.

jwmueller · 2023-11-16T22:38:17Z

request my review when this is ready

jwmueller · 2023-11-17T17:28:52Z

add a little script on how the user is going to use this thing as a PR comment

cleanlab_studio/studio/studio.py

jwmueller · 2023-11-20T19:04:15Z

cleanlab_studio/internal/util.py

+    cleanset_df: pd.DataFrame, name_col: str, num_rows: int, asc=True
+) -> List[str]:
+    """
+    Extracts the top specified number of rows based on a specified score column from a DataFrame.


This will only return the IDs of the datapoints to drop during autofix for a given setting of num_rows.
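As a rough illustration of the behavior described in this comment, the sketch below sorts rows by a score column and returns the IDs of the first `num_rows` of them. It uses plain dicts rather than a DataFrame so it has no dependencies. The helper name `top_ids_by_score` and the column names `cleanlab_row_ID` and `outlier_score` are assumptions made for the example, not the code in this PR.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def top_ids_by_score(
    rows: List[Row],
    score_col: str,
    num_rows: int,
    asc: bool = True,
    id_col: str = "cleanlab_row_ID",  # assumed name of the ID column
) -> List[str]:
    """Return the IDs of the first `num_rows` rows after sorting by `score_col`."""
    if num_rows <= 0:
        return []
    # asc=True puts the lowest scores first; asc=False puts the highest first.
    ordered = sorted(rows, key=lambda r: r[score_col], reverse=not asc)
    return [str(r[id_col]) for r in ordered[:num_rows]]


# Hypothetical cleanset rows with a made-up score column.
rows = [
    {"cleanlab_row_ID": "a", "outlier_score": 0.91},
    {"cleanlab_row_ID": "b", "outlier_score": 0.12},
    {"cleanlab_row_ID": "c", "outlier_score": 0.47},
]
ids_to_drop = top_ids_by_score(rows, "outlier_score", num_rows=2)
```

Only IDs are returned, so the caller decides what to do with the matching rows (drop them, relabel them, or inspect them first).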

jwmueller · 2023-11-20T19:04:50Z

cleanlab_studio/internal/util.py

+    Parameters:
+    - cleanset_df (pd.DataFrame): The input DataFrame containing the cleanset.
+    - name_col (str): The name of the column indicating the category for which the top rows should be extracted.
+    - num_rows (int): The number of rows to be extracted.


In autofix, we can get this by simply multiplying the cleanset's default fraction of issues by the number of datapoints.

Right. When we spoke originally, we wanted this call to mirror the Studio web interface call, so I rewrote it this way; it was a floating-point percentage before.
The function _get_autofix_defaults does the multiplication by the number of datapoints.
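The fraction-to-count conversion discussed here could look roughly like the sketch below. It assumes each fraction applies to the number of rows flagged with that issue type, and that results are rounded down. Both are assumptions, as are the helper name `fractions_to_counts` and the example counts.

```python
import math
from typing import Dict


def fractions_to_counts(
    fractions: Dict[str, float], issue_counts: Dict[str, int]
) -> Dict[str, int]:
    """Convert per-issue fractions into integer numbers of rows to fix/exclude."""
    counts: Dict[str, int] = {}
    for issue, frac in fractions.items():
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"fraction for {issue!r} must be in [0, 1], got {frac}")
        # Round down so we never touch more rows than the fraction allows.
        counts[issue] = math.floor(frac * issue_counts.get(issue, 0))
    return counts


# Fractions taken from the defaults quoted in this thread; counts are made up.
fractions = {"label_issue": 0.5, "near_duplicate": 0.2, "outlier": 0.5}
issue_counts = {"label_issue": 40, "near_duplicate": 10, "outlier": 9}
counts = fractions_to_counts(fractions, issue_counts)
```

With this split, users can work with either representation: fractions carry over between datasets of different sizes, while counts match what the Studio web interface shows.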

cleanlab_studio/internal/util.py

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

jwmueller · 2023-11-22T17:59:07Z

cleanlab_studio/internal/util.py

+    }
+
+
+def _get_autofix_defaults(cleanset_df: pd.DataFrame) -> dict:


TODO: Studio team should move this function to backend of the app so it happens on server (eventually should be used in web app too)

jwmueller · 2023-11-22T17:59:20Z

cleanlab_studio/internal/util.py

+    return default_values
+
+
+def _get_top_fraction_ids(


TODO: Studio team should move this function to backend of the app so it happens on server (eventually should be used in web app too)

jwmueller · 2023-11-22T18:00:32Z

cleanlab_studio/studio/studio.py

+        cleanset_df = self.download_cleanlab_columns(cleanset_id)
+        if params is None:
+            params = _get_autofix_defaults(cleanset_df)
+            print("Using autofix parameters:", params)


todo: replace this code once an analogous method exists in Studio backend

jwmueller · 2023-11-22T18:06:36Z

from anish: Would you want this to be:

studio.autofix_dataset(cleanset_id)
new_df = studio.apply_corrections(df, cleanset_id)

jwmueller · 2023-11-28T19:40:33Z

cleanlab_studio/internal/util.py

+        "label_issue": 0.5,
+        "near_duplicate": 0.2,
+        "outlier": 0.5,
+        "confidence_threshold": 0.95,


change to: "relabel_confidence_threshold"
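For context on what such a threshold might control, here is a minimal sketch in which a row flagged as a label issue takes its suggested label only when the confidence in that suggestion exceeds the threshold. The field names (`is_label_issue`, `suggested_label`, `suggested_label_confidence`) and the strict `>` comparison are assumptions for illustration; the PR's actual logic may differ.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def relabel_confident_rows(rows: List[Row], threshold: float) -> List[Row]:
    """Return copies of `rows`, relabeling flagged rows above the threshold."""
    fixed: List[Row] = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if row["is_label_issue"] and row["suggested_label_confidence"] > threshold:
            new_row["label"] = row["suggested_label"]
        fixed.append(new_row)
    return fixed


# Hypothetical rows; the field names are assumed, not taken from the PR.
rows = [
    {"label": "cat", "is_label_issue": True,
     "suggested_label": "dog", "suggested_label_confidence": 0.99},
    {"label": "cat", "is_label_issue": True,
     "suggested_label": "dog", "suggested_label_confidence": 0.80},
    {"label": "dog", "is_label_issue": False,
     "suggested_label": "dog", "suggested_label_confidence": 0.99},
]
fixed = relabel_confident_rows(rows, threshold=0.95)
```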

jwmueller · 2023-11-28T19:42:39Z

cleanlab_studio/internal/util.py

    return not check_none(x)
+
+
+def _get_autofix_default_params() -> dict:  # Studio team port to backend


allow string options (rethink names):

"optimized_training_data"

"drop_all_issues"

"suggested_actions"

The drop_all_issues strategy should just drop all issues from the DF (no relabeling).

The suggested_actions strategy should relabel all label issues, drop outliers, and drop extra copies of near duplicates. Nothing is done to ambiguous examples or other issues.
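The two strategies described above could be sketched as follows, again on plain dicts instead of a DataFrame. All field names, the `apply_strategy` helper, and the rule "keep the first copy in each near-duplicate cluster" are assumptions. The sketch also assumes drop_all_issues removes every flagged row, including all near-duplicate copies, which the comment does not spell out.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]
ISSUE_FLAGS = ("is_label_issue", "is_outlier", "is_near_duplicate", "is_ambiguous")


def apply_strategy(rows: List[Row], strategy: str) -> List[Row]:
    """Return a new list of rows with the named strategy applied."""
    if strategy == "drop_all_issues":
        # Drop every row that carries any issue flag; never relabel.
        return [dict(r) for r in rows if not any(r.get(f, False) for f in ISSUE_FLAGS)]
    if strategy == "suggested_actions":
        kept: List[Row] = []
        seen_clusters: Set[Any] = set()
        for r in rows:
            if r.get("is_outlier", False):
                continue  # outliers are dropped
            if r.get("is_near_duplicate", False):
                cluster = r["near_duplicate_cluster_id"]
                if cluster in seen_clusters:
                    continue  # extra copy of an already-kept near duplicate
                seen_clusters.add(cluster)
            new_row = dict(r)
            if r.get("is_label_issue", False):
                new_row["label"] = r["suggested_label"]  # relabel label issues
            kept.append(new_row)  # ambiguous rows pass through unchanged
        return kept
    raise ValueError(f"Unsupported strategy: {strategy!r}")


rows = [
    {"id": 1, "label": "cat", "is_label_issue": True, "suggested_label": "dog"},
    {"id": 2, "label": "dog", "is_outlier": True},
    {"id": 3, "label": "dog", "is_near_duplicate": True, "near_duplicate_cluster_id": 0},
    {"id": 4, "label": "dog", "is_near_duplicate": True, "near_duplicate_cluster_id": 0},
    {"id": 5, "label": "cat", "is_ambiguous": True},
    {"id": 6, "label": "cat"},
]
suggested = apply_strategy(rows, "suggested_actions")
dropped = apply_strategy(rows, "drop_all_issues")
```

The "optimized_training_data" option is left out because the thread does not describe its behavior.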

jwmueller · 2023-11-28T19:45:07Z

cleanlab_studio/studio/studio.py

        except (TimeoutError, CleansetError):
            return False
+
+    def autofix_dataset(


allow string options to be passed straight through into _get_autofix_default_params()

Should be added now, clarified in the docs:

cleanlab-studio/cleanlab_studio/studio/studio.py

Line 377 in afbe4a9

params (dict, optional): Default parameter dictionary containing confidence threshold for auto-relabelling, and

jwmueller · 2023-11-28T19:46:35Z

cleanlab_studio/studio/studio.py

+        self, original_df: pd.DataFrame, cleanset_id: str, params: dict = None
+    ) -> pd.DataFrame:
+        """
+        This method returns the auto-fixed dataset.


Docstring should clarify that Dataset must be a DataFrame (text or tabular dataset only)

jwmueller · 2023-11-28T19:49:41Z

cleanlab_studio/internal/util.py

+
+def _get_autofix_default_params() -> dict:  # Studio team port to backend
+    """returns default percentage-wise params of autofix"""
+    return {


can choose more specific key names here

sanjanag · 2023-12-04T17:31:35Z

cleanlab_studio/studio/studio.py

+
+                Example:
+                {
+                    'drop_ambiguous': 9,


change to fractions

jwmueller · 2023-12-07T00:01:27Z

cleanlab_studio/internal/util.py

+def get_autofix_defaults_for_strategy(strategy):
+    return AUTOFIX_DEFAULTS[strategy]
+


make everything that should be ported to backend a private method

jwmueller · 2023-12-07T00:01:34Z

cleanlab_studio/internal/util.py



+# Studio team port to backend
+def get_autofix_defaults_for_strategy(strategy):


Suggested change

- def get_autofix_defaults_for_strategy(strategy):
+ def _get_autofix_defaults_for_strategy(strategy):

cleanlab_studio/studio/studio.py

aditya1503 · 2023-12-11T17:19:35Z

cleanlab_studio/studio/studio.py

                dataset, cl_cols, id_col, label_col, keep_excluded
            )
-            return corrected_ds
+            return snowflake_corrected_ds


Could this be from the Studio team?
In that case we need to merge their main branch first.
@aditya1503 will have a look.

Yes, we need to rebase everything here against the latest master branch.

cleanlab_studio/studio/studio.py

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

make pull request

4615637

jwmueller marked this pull request as draft November 16, 2023 22:26

aditya1503 added 3 commits November 18, 2023 01:52

cleaned skeleton code

2a7cf91

cleanup

e7a3d07

add type hinting

72fc919

aditya1503 requested a review from jwmueller November 17, 2023 20:45

jwmueller reviewed Nov 18, 2023

cleanlab_studio/studio/studio.py (outdated)

jwmueller reviewed Nov 18, 2023

cleanlab_studio/studio/studio.py (outdated)

address PR comments

d67bbc3

aditya1503 requested a review from jwmueller November 20, 2023 18:57

jwmueller reviewed Nov 20, 2023

cleanlab_studio/internal/util.py

aditya1503 and others added 3 commits November 21, 2023 02:09

Update cleanlab_studio/internal/util.py

fc4bf7c

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

linting + doc change

9f00909

set ambiguous to 0

d2a3432

jwmueller reviewed Nov 22, 2023

things to port to backend

6bcec4c

jwmueller reviewed Nov 22, 2023

jwmueller reviewed Nov 28, 2023

sanjanag added 2 commits December 1, 2023 18:49

Updated code for different strategies

cc52ce2

Fixed apply method

62efa2d

sanjanag added 4 commits December 2, 2023 07:51

Added tests for updating label issue rows based on threshold

1d644a0

Fixed mypy issue

3ff2507

Added test for checking right rows are dropped for non near duplicate issues

7235b40

Added test for checking right rows are dropped for near duplicate issues

1b99d60

sanjanag reviewed Dec 4, 2023

sanjanag added 12 commits December 5, 2023 18:43

Added get defaults method

330aa44

Return cleanset with original indices

a19c88c

Merge branch 'main' into improve_autofix

69ccda6

Removed unimplemented test

19143a3

removed unnecessary merge change

e5b97f5

Fixed tests

20a532c

Fixed mypy error

3bbfc1c

Added newline

b892e87

Fixed formatting

b54a0a7

added tests for dropped indices

f870e04

Added docs for user facing method

eb106d1

Black formatting

a7acfa6

sanjanag marked this pull request as ready for review December 6, 2023 12:57

sanjanag requested a review from jwmueller December 6, 2023 13:19

jwmueller reviewed Dec 7, 2023

cleanlab_studio/studio/studio.py (outdated)

aditya1503 commented Dec 11, 2023

aditya1503 added 3 commits December 13, 2023 22:16

Merge remote-tracking branch 'origin/main' into improve_autofix

1f0344d

merge main

692efe4

add github change request

afbe4a9

aditya1503 requested a review from jwmueller December 13, 2023 17:13

jwmueller reviewed Dec 16, 2023

cleanlab_studio/studio/studio.py (outdated)

aditya1503 and others added 2 commits December 18, 2023 17:01

Update cleanlab_studio/studio/studio.py

7b96faa

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

linting

b31674c
