-
Notifications
You must be signed in to change notification settings - Fork 0
Pipeline that learns and recognize thematics
License
hbenbel/Thematisation
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
███████╗███████╗ ██████╗
██╔════╝██╔════╝██╔═══██╗
███████╗█████╗ ██║ ██║
╚════██║██╔══╝ ██║ ██║
███████║███████╗╚██████╔╝
╚══════╝╚══════╝ ╚═════╝
DESCRIPTION
Pipeline that learns and recognize thematics.
USAGE
./themes.py K [CONFIG_FILE]...
./jaccard.py K TARGET [THEME_FILE]...
There are 2 programs to run sequentially (i.e one after another).
The first one, themes.py, takes themes and text files for each theme
as input and produces theme files (.thm) as output.
The second one, jaccard.py, takes a file to classify and some theme files as
input and display a classification index for each theme.
In between usage of the first and second program, the user may modify
freely the produced theme files (thm), each of which contains the name of
the theme on the first line, then (by order of relevance) one ngram per
line with a relevance score written next to it (separated by a tab). The
score is not used by the second program and is merely shown to help the user
in editing the theme file. One can freely add a ngram to the file (all lowercase,
no punctuation, same number of spaces as other ngrams, no need to put a score),
or remove existing ngrams.
Since all of the ngrams are always written, one may want to remove the
last ones (with the lowest score). One good way to keep the first k-1 ngrams
is to use the following command:
head -n $k my_theme.thm | sponge my_theme.thm
where sponge is a utility available in the moreutils package. The following
shell function would also work fine as a replacement for sponge:
function sponge() {
local tmp=`mktemp`
cat > "$tmp"
cat "$tmp" > "$1"
}
The first program takes configuration files as input (if none is provided,
it reads from stdin). The configuration files should hold tokens on every
lines such that the first token of each line is a theme and the following
tokens refer to sample text files for this theme. Tokens are lexed using
shell-like rules and quoting, and the provided file paths support globbing.
Note that the paths are relative to the configuration file's location, not
the current working directory. Also note that this only applies to the
configuration files: the theme files do not support these features and their
only syntax is line separators and tabs. Themes' filenames are derived from
the themes themselves (lowercase, punctuation is replaced with underscores)
so be wary of conflicting theme names.
Run either program with no arguments to print a quick usage reminder.
EXAMPLES
sh$ cat resources/themes.conf
Dogs test?_chien.txt
Cars test?_voiture.txt
Birds test?_oiseau.txt
sh$ python src/themes.py 2 resources/themes.conf
sh$ for thm in *.thm; do head -n 43 $thm | sponge $thm; done
sh$ python src/jaccard.py 2 resources/test.txt dogs.thm cars.thm
Dogs ==> 0.07142857142857142
Cars ==> 0.0
sh$ cat resources/themes.conf
"Sports critics" sports/*
"Food reviews" food/* reviews/food_*
Sports\ critics reviews/sports_*
"Mistakes made" mistakes made/*
sh$ src/themes.py 2 resources/themes.conf
WARNING: This pattern did not match any file (Th: 'Mistakes made'): mistakes
WARNING: This pattern did not match any file (Th: 'Mistakes made'): made/*
CONTRIBUTORS
Sirine Kéfi
Thibaud Chominot
Hussem Ben Belgacem
About
Pipeline that learns and recognize thematics
Topics
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published