Features

Guillaume Loignon edited this page May 4, 2025 · 1 revision
| Feature | Description | Script |
| --- | --- | --- |
| `word_count` | Total number of tokens (excluding punctuation). | `fnt_counters.R` |
| `unique_word_count` | Number of distinct tokens. | `fnt_counters.R` |
| `content_word_count` | Number of tokens with UPOS in {NOUN, VERB, ADJ, ADV, PRON}. | `fnt_counters.R` |
| `unique_content_word_count` | Number of distinct content-word tokens. | `fnt_counters.R` |
| `sentence_count` | Number of sentences (distinct `sentence_id` values). | `fnt_counters.R` |
| `paragraph_count` | Number of paragraphs (distinct `paragraph_id` values). | `fnt_counters.R` |
| `char_count` | Sum of characters over all non-punctuation tokens. | `fnt_counters.R` |
| `char_count_content` | Sum of characters over content-word tokens. | `fnt_counters.R` |
| `avg_word_length` | `char_count / word_count`. | `fnt_counters.R` |
| `avg_sentence_length` | `word_count / sentence_count`. | `fnt_counters.R` |
| `avg_content_word_length` | `char_count_content / content_word_count`. | `fnt_counters.R` |
| `cnt_` | Count of each UPOS tag per document (one column per tag). | `fnt_counters.R` |
| `prop_` | Proportion of each UPOS tag per document (one column per tag). | `fnt_counters.R` |
| `present_count` | Number of verbs tagged Tense=Pres. | `fnt_counters.R` |
| `past_count` | Number of verbs tagged Tense=Past. | `fnt_counters.R` |
| `future_count` | Number of verbs tagged Tense=Fut. | `fnt_counters.R` |
| `conditional_count` | Number of verbs tagged Tense=Cond. | `fnt_counters.R` |
| `subjunctive_count` | Number of verbs tagged Mood=Sub. | `fnt_counters.R` |
| `indicative_count` | Number of verbs tagged Mood=Ind. | `fnt_counters.R` |
| `imperative_count` | Number of verbs tagged Mood=Imp. | `fnt_counters.R` |
| `infinitive_count` | Number of verbs tagged VerbForm=Inf. | `fnt_counters.R` |
| `past_participle_count` | Number of tokens with VerbForm=Part + Tense=Past. | `fnt_counters.R` |
| `present_participle_count` | Number of tokens with VerbForm=Part + Tense=Pres. | `fnt_counters.R` |
| `past_simple_count` | Number of tokens with Tense=Past + Mood=Ind + VerbForm=Fin. | `fnt_counters.R` |
| `past_compose_count` | Number of tokens with Tense=Past + Mood=Ind + VerbForm=Part. | `fnt_counters.R` |
| `token_sent_overlap` | Proportion of a document's tokens that also appear in the sentence following their own. | `fnt_cohesion.R` |
| `token_doc_overlap` | Mean, over tokens, of the share of a token's occurrences in the document that fall outside its own sentence. | `fnt_cohesion.R` |
| `content_sent_overlap` | As `token_sent_overlap`, restricted to content words (NOUN, VERB, ADJ, ADV). | `fnt_cohesion.R` |
| `content_doc_overlap` | As `token_doc_overlap`, restricted to content words. | `fnt_cohesion.R` |
| `cosine_sent` | Mean cosine similarity between adjacent sentences (all tokens). | `fnt_cohesion.R` |
| `cosine_content` | Mean cosine similarity between adjacent sentences (content tokens only). | `fnt_cohesion.R` |
| `n_complex_nominal` | Number of "complex" nominals (head nouns with dependents). | `fnt_extra_syntax.R` |
| `n_complex_verb` | Number of "complex" verbs (head verbs with dependents). | `fnt_extra_syntax.R` |
| `n_clause` | Estimated number of clauses (clause indicators + 1). | `fnt_extra_syntax.R` |
| `n_clause_per_sent` | `n_clause / sentence_count`. | `fnt_extra_syntax.R` |
| `avg_clause_length` | Average number of tokens per clause. | `fnt_extra_syntax.R` |
| `complex_nom_per_sent` | `n_complex_nominal / sentence_count`. | `fnt_extra_syntax.R` |
| `complex_verb_per_sent` | `n_complex_verb / sentence_count`. | `fnt_extra_syntax.R` |
| `verb_per_sent` | Number of verbs per sentence. | `fnt_extra_syntax.R` |
| `avg_dep_dist` | Mean dependency distance (token ↔ head). | `fnt_extra_syntax.R` |
| `avg_dep_count` | Mean number of dependents per head. | `fnt_extra_syntax.R` |
| `max_path` | Maximum dependency-tree path length per sentence. | `fnt_heights.R` |
| `avg_path` | Average path length per sentence. | `fnt_heights.R` |
| `count_path` | Total number of paths (vcount × `avg_path`). | `fnt_heights.R` |
| `sentence_length` | Number of nodes in the dependency graph (sentence length). | `fnt_heights.R` |
| `adj_max_path` | `log(max_path) / log(sentence_length)`. | `fnt_heights.R` |
| `prop_hf` | Proportion of tokens whose head occurs later in the sentence. | `fnt_heights.R` |
| `prop_hi` | Proportion of tokens whose head occurs earlier in the sentence. | `fnt_heights.R` |
| `avg_sent_height` | Mean `max_path` across all sentences in a document. | `fnt_heights.R` |
| `norm_sent_height` | Mean `adj_max_path` across all sentences in a document. | `fnt_heights.R` |
| `total_paths` | Sum of `count_path` across sentences. | `fnt_heights.R` |
| `dim1`…`dimₙ` | Embedding dimensions for each sentence (columns `dim1`, `dim2`, …). | `fnt_embeddings.R` |
| `<dimₖ>` (mean) | Mean of each embedding dimension per document. | `fnt_embeddings.R` |
| `_freq[_u]_imputed` | Lexical database frequency (raw or per million) with Good-Turing smoothing. | `fnt_lexical.R` |
| `N`, `N1` | Corpus- or database-level token count (`N`) and hapax count (`N1`). | `fnt_lexical.R` |
| `TTR` | Type–token ratio: unique types / total tokens. | `fnt_lexical.R` |
| `maas` | Maas's lexical diversity index: (log N − log V) / (log N)², with N tokens and V types. | `fnt_lexical.R` |
| `MATTR` | Moving-average TTR over sliding windows (default size 50). | `fnt_lexical.R` |
| `simpsons_D` | Simpson's diversity (D-measure) over tokens. | `fnt_lexical.R` |
| `TTR_content`, `maas_content`, … | Same as above, restricted to content words. | `fnt_lexical.R` |
| `maas_verb`, `simpsons_D_verb` | Maas and Simpson's D over verbs only. | `fnt_lexical.R` |
| `token_surprisal` | Surprisal in bits (−log₂ P) per token, backing off from trigram to bigram to unigram. | `fnt_pos_surprisal.R` |
| `pos_entropy` | Entropy per token (2-gram context). | `fnt_pos_surprisal.R` |
| `pos_entropy_reduction` | Difference in entropy between successive positions. | `fnt_pos_surprisal.R` |
| `mean_pos_surprisal` | Document-level mean of `pos_surprisal`. | `fnt_pos_surprisal.R` |
| `sd_pos_surprisal` | Document-level SD of `pos_surprisal`. | `fnt_pos_surprisal.R` |
| `mean_pos_entropy` | Document-level mean of `pos_entropy`. | `fnt_pos_surprisal.R` |
| `sd_pos_entropy` | Document-level SD of `pos_entropy`. | `fnt_pos_surprisal.R` |
| `mean_pos_entropy_reduction` | Document-level mean of `pos_entropy_reduction`. | `fnt_pos_surprisal.R` |
| `sd_pos_entropy_reduction` | Document-level SD of `pos_entropy_reduction`. | `fnt_pos_surprisal.R` |
| (same six at sentence level) | The same six summary statistics computed at the sentence level. | `fnt_pos_surprisal.R` |
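
To make the lexical diversity rows (`TTR`, `maas`, `MATTR`, `simpsons_D`) more concrete, here is a minimal base-R sketch. It is not the code from `fnt_lexical.R`. The helper name `lexical_diversity` is hypothetical, the Maas index uses the commonly cited form (log N − log V) / (log N)², and Simpson's D is taken as the probability that two tokens drawn without replacement share a type. Both choices are assumptions about what the package computes.

```r
# Hypothetical helper (not part of the package): diversity measures
# for a character vector of tokens.
lexical_diversity <- function(tokens, window = 50) {
  n <- length(tokens)            # N: total number of tokens
  freqs <- table(tokens)
  v <- length(freqs)             # V: number of distinct types

  ttr <- v / n

  # Assumed textbook form of the Maas index: (log N - log V) / (log N)^2
  maas <- (log(n) - log(v)) / log(n)^2

  # MATTR: mean TTR over every run of `window` consecutive tokens.
  # Falls back to plain TTR when the text is shorter than one window.
  if (n < window) {
    mattr <- ttr
  } else {
    starts <- seq_len(n - window + 1)
    mattr <- mean(vapply(starts, function(i) {
      length(unique(tokens[i:(i + window - 1)])) / window
    }, numeric(1)))
  }

  # Assumed form of Simpson's D: chance that two tokens drawn without
  # replacement belong to the same type.
  counts <- as.numeric(freqs)
  simpsons_d <- sum(counts * (counts - 1)) / (n * (n - 1))

  c(TTR = ttr, maas = maas, MATTR = mattr, simpsons_D = simpsons_d)
}

tokens <- c("le", "chat", "mange", "le", "poisson", "et", "le", "chien", "dort")
res <- lexical_diversity(tokens, window = 5)
print(res)
```

The `_content` and `_verb` variants in the table would correspond to calling the same kind of function on a token vector filtered by UPOS first.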
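
The dependency-based rows (`avg_dep_dist`, `avg_dep_count`, `prop_hf`, `prop_hi`, `max_path`, `adj_max_path`) can be illustrated on a toy parse. The sketch below is an assumption-laden illustration, not the code from `fnt_extra_syntax.R` or `fnt_heights.R`. The column names `token_id` and `head_token_id` are modelled on udpipe-style output, the helpers `dep_features` and `token_depth` are hypothetical, proportions are computed over non-root tokens, and path length is taken as the number of arcs from a token up to the root.

```r
# Toy parse of "Le chat mange le poisson"; 0 marks the root's head.
sent <- data.frame(
  token_id      = 1:5,
  token         = c("Le", "chat", "mange", "le", "poisson"),
  head_token_id = c(2, 3, 0, 5, 3)
)

# Hypothetical helper: distance and direction features for one sentence.
dep_features <- function(df) {
  dep <- df[df$head_token_id != 0, ]   # drop the root, which has no head
  c(
    avg_dep_dist  = mean(abs(dep$token_id - dep$head_token_id)),
    prop_hf       = mean(dep$head_token_id > dep$token_id),  # head comes later
    prop_hi       = mean(dep$head_token_id < dep$token_id),  # head comes earlier
    avg_dep_count = mean(as.numeric(table(dep$head_token_id)))
  )
}

# Hypothetical helper: number of arcs between a token and the root.
token_depth <- function(id, df) {
  depth <- 0
  while (df$head_token_id[df$token_id == id] != 0) {
    id <- df$head_token_id[df$token_id == id]
    depth <- depth + 1
  }
  depth
}

feats <- dep_features(sent)
max_path <- max(vapply(sent$token_id, token_depth, numeric(1), df = sent))
adj_max_path <- log(max_path) / log(nrow(sent))

print(feats)
print(c(max_path = max_path, adj_max_path = adj_max_path))
```

Document-level rows such as `avg_sent_height` and `norm_sent_height` would then be means of these per-sentence values.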
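
For the cohesion rows, `cosine_sent` and `cosine_content` are described as the mean cosine similarity between adjacent sentences. One plausible reading, sketched below, represents each sentence as a vector of raw term counts. The helper name `cosine_adjacent` and the choice of raw counts (rather than, say, TF-IDF weights or embeddings) are assumptions, not a description of `fnt_cohesion.R`.

```r
# Hypothetical helper: mean cosine similarity between consecutive sentences,
# each given as a character vector of tokens.
cosine_adjacent <- function(sentences) {
  vocab <- unique(unlist(sentences))

  # One row per sentence, one column per vocabulary item; cells are term counts.
  counts <- do.call(rbind, lapply(sentences, function(s) {
    as.numeric(table(factor(s, levels = vocab)))
  }))

  # Cosine similarity for each pair of neighbouring sentences (i, i + 1).
  sims <- vapply(seq_len(nrow(counts) - 1), function(i) {
    a <- counts[i, ]
    b <- counts[i + 1, ]
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  }, numeric(1))

  mean(sims)
}

sentences <- list(
  c("le", "chat", "dort"),
  c("le", "chat", "mange"),
  c("il", "pleut")
)
cosine_sim <- cosine_adjacent(sentences)
print(cosine_sim)
```

Restricting each sentence to content-word tokens before the call would give the `cosine_content` analogue.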
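
The surprisal and entropy rows rest on two standard quantities: surprisal is −log₂ P of the observed item given its context, and entropy is the expected surprisal over everything that could follow that context. The sketch below is a deliberate simplification. It estimates bigram probabilities from a single toy tag sequence and has no trigram level, back-off, or smoothing, so it does not reproduce `fnt_pos_surprisal.R`. All variable names are illustrative.

```r
# Toy UPOS sequence; a real model would be estimated on a reference corpus.
tags <- c("DET", "NOUN", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN")

prev <- head(tags, -1)   # context tag at each position
nxt  <- tail(tags, -1)   # tag actually observed after that context

# Conditional probabilities P(next | prev), one row per context tag.
cond <- unclass(prop.table(table(prev, nxt), margin = 1))

# Surprisal in bits of each observed transition.
surprisal <- -log2(cond[cbind(prev, nxt)])

# Entropy in bits of the next-tag distribution for each context,
# then looked up per position.
entropy_by_context <- apply(cond, 1, function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
})
entropy <- unname(entropy_by_context[prev])

# Change in entropy between successive positions.
entropy_reduction <- diff(entropy)

print(c(mean_surprisal = mean(surprisal), sd_surprisal = sd(surprisal)))
print(c(mean_entropy = mean(entropy), sd_entropy = sd(entropy)))
```

The `mean_` and `sd_` rows in the table correspond to this kind of summary, taken per document or per sentence.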
