Just a small list of discussed features that would be nice to have:
- out of domain error: replace a value with a random value from another column
- field separation error: replace a value with the value from a neighboring column in the same row
- data type changes: float -> int, numeric -> string: e.g. 120 -> "120", or 1200 -> "1,200"
- change capitalization, e.g. abc -> ABC, abc -> Abc, ...
- apply standard numerical functions: log, sqrt, square (x²)
- replace by extreme value (e.g. max(x) + epsilon or min(x) - epsilon)
- create typos using machine translation (character-level sequence to sequence learning) (see https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)
- generate OCR errors: 1) create a picture of the value, 2) apply some noise to the picture, 3) run OCR, 4) use the resulting value (see https://github.com/tesseract-ocr/tesseract)
- generate speech recognition errors: 1) text to speech (e.g. http://mary.dfki.de/) 2) apply speech recognition (e.g. https://github.com/kaldi-asr/kaldi)
- implement the opposite of this spell checker built from word vectors, i.e. map correctly spelled words to likely misspellings: https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26
- add an interface for column-wise injection
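
A few of the simpler items above could be sketched roughly as follows. This is only an illustration using the standard library; all function names and parameters are made up here and are not part of any existing code.

```python
import random


def to_formatted_string(value, thousands_sep=True):
    """Data type change, numeric -> string: 120 -> "120", 1200 -> "1,200"."""
    # The "," format option inserts a thousands separator.
    return f"{value:,}" if thousands_sep else str(value)


def change_capitalization(text, mode="upper"):
    """Capitalization change: abc -> ABC ("upper") or abc -> Abc ("capitalize")."""
    if mode == "upper":
        return text.upper()
    if mode == "capitalize":
        return text.capitalize()
    raise ValueError(f"unknown mode: {mode}")


def extreme_value(column, epsilon=1.0, side="max"):
    """Extreme value: max(x) + epsilon or min(x) - epsilon."""
    if side == "max":
        return max(column) + epsilon
    return min(column) - epsilon


def field_separation_error(row, col_index, rng):
    """Replace row[col_index] with the value of a neighboring column, same row.

    Assumes the row has at least two columns; returns a corrupted copy.
    """
    neighbors = [i for i in (col_index - 1, col_index + 1) if 0 <= i < len(row)]
    corrupted = list(row)
    corrupted[col_index] = row[rng.choice(neighbors)]
    return corrupted


rng = random.Random(0)  # fixed seed, so the injected errors are reproducible
print(to_formatted_string(1200))
print(change_capitalization("abc", mode="capitalize"))
print(extreme_value([1, 5, 3], epsilon=0.5))
print(field_separation_error(["a", "b", "c"], 1, rng))
```

Each helper works on a single value or row, so they could later be wrapped by whatever column-wise interface gets added.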
in progress:
finished:
- code refactoring: apply sklearn style
- make sure that warnings do not lead to fewer errors
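
For reference, "sklearn style" here is taken to mean the fit/transform convention: parameters are set in `__init__`, state learned in `fit` gets a trailing underscore, and `transform` returns new data. A rough sketch of a column-wise injector in that style is below; the class name, its parameters, and the list-of-dicts row format are assumptions for illustration, not the actual refactored code.

```python
import random


class ExtremeValueInjector:
    """Hypothetical sklearn-style injector (fit/transform convention)."""

    def __init__(self, column, fraction=0.1, epsilon=1.0, random_state=None):
        # Only store parameters here, as sklearn estimators do.
        self.column = column
        self.fraction = fraction
        self.epsilon = epsilon
        self.random_state = random_state

    def fit(self, rows, y=None):
        # Learn the extreme value from the clean data.
        self.extreme_ = max(row[self.column] for row in rows) + self.epsilon
        return self

    def transform(self, rows):
        # Corrupt a fixed fraction of the rows in the chosen column only.
        rng = random.Random(self.random_state)
        n_errors = round(self.fraction * len(rows))
        targets = set(rng.sample(range(len(rows)), n_errors))
        # Return copies so the input rows stay untouched.
        return [
            {**row, self.column: self.extreme_} if i in targets else dict(row)
            for i, row in enumerate(rows)
        ]


rows = [{"x": float(i)} for i in range(10)]
injector = ExtremeValueInjector("x", fraction=0.3, random_state=0).fit(rows)
corrupted = injector.transform(rows)
print(sum(r["x"] == injector.extreme_ for r in corrupted), "rows corrupted")
```

Because the number of injected errors is computed up front from `fraction`, it is easy to check afterwards that the expected number of errors was actually produced.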