Features
Guillaume Loignon edited this page May 4, 2025 · 1 revision
| Feature | Description | Script |
|---|---|---|
| word_count | Total number of tokens (excluding punctuation). | fnt_counters.R |
| unique_word_count | Number of distinct tokens. | fnt_counters.R |
| content_word_count | Number of tokens with UPOS in {NOUN, VERB, ADJ, ADV, PRON}. | fnt_counters.R |
| unique_content_word_count | Number of distinct content‑word tokens. | fnt_counters.R |
| sentence_count | Number of sentences (sentence_id distinct). | fnt_counters.R |
| paragraph_count | Number of paragraphs (paragraph_id distinct). | fnt_counters.R |
| char_count | Sum of characters over all non‑punctuation tokens. | fnt_counters.R |
| char_count_content | Sum of characters over content‑word tokens. | fnt_counters.R |
| avg_word_length | char_count / word_count. | fnt_counters.R |
| avg_sentence_length | word_count / sentence_count. | fnt_counters.R |
| avg_content_word_length | char_count_content / content_word_count. | fnt_counters.R |
| cnt_ | Count of each UPOS tag per document (one column per tag, named with the cnt_ prefix). | fnt_counters.R |
| prop_ | Proportion of each UPOS tag per document (one column per tag, named with the prop_ prefix). | fnt_counters.R |
| present_count | Number of verbs tagged Tense=Pres. | fnt_counters.R |
| past_count | Number of verbs tagged Tense=Past. | fnt_counters.R |
| future_count | Number of verbs tagged Tense=Fut. | fnt_counters.R |
| conditional_count | Number of verbs tagged Tense=Cond. | fnt_counters.R |
| subjunctive_count | Number of verbs tagged Mood=Sub. | fnt_counters.R |
| indicative_count | Number of verbs tagged Mood=Ind. | fnt_counters.R |
| imperative_count | Number of verbs tagged Mood=Imp. | fnt_counters.R |
| infinitive_count | Number of verbs tagged VerbForm=Inf. | fnt_counters.R |
| past_participle_count | Number of tokens with VerbForm=Part + Tense=Past. | fnt_counters.R |
| present_participle_count | Number of tokens with VerbForm=Part + Tense=Pres. | fnt_counters.R |
| past_simple_count | Number of tokens with Tense=Past + Mood=Ind + VerbForm=Fin. | fnt_counters.R |
| past_compose_count | Number of tokens with Tense=Past + Mood=Ind + VerbForm=Part. | fnt_counters.R |
| token_sent_overlap | Proportion of tokens in a document that also appear in the following sentence. | fnt_cohesion.R |
| token_doc_overlap | Mean, over tokens, of the share of each token’s occurrences in the document that fall outside its own sentence. | fnt_cohesion.R |
| content_sent_overlap | As token_sent_overlap, restricted to content words (NOUN, VERB, ADJ, ADV). | fnt_cohesion.R |
| content_doc_overlap | As token_doc_overlap, restricted to content words. | fnt_cohesion.R |
| cosine_sent | Mean cosine similarity between adjacent sentences (all tokens). | fnt_cohesion.R |
| cosine_content | Mean cosine similarity between adjacent sentences (content tokens only). | fnt_cohesion.R |
| n_complex_nominal | Number of “complex” nominals (head nouns with dependents). | fnt_extra_syntax.R |
| n_complex_verb | Number of “complex” verbs (head verbs with dependents). | fnt_extra_syntax.R |
| n_clause | Estimated number of clauses (clause indicators + 1). | fnt_extra_syntax.R |
| n_clause_per_sent | n_clause / sentence_count. | fnt_extra_syntax.R |
| avg_clause_length | Average number of tokens per clause. | fnt_extra_syntax.R |
| complex_nom_per_sent | n_complex_nominal / sentence_count. | fnt_extra_syntax.R |
| complex_verb_per_sent | n_complex_verb / sentence_count. | fnt_extra_syntax.R |
| verb_per_sent | Number of verbs per sentence. | fnt_extra_syntax.R |
| avg_dep_dist | Mean dependency distance (token ↔ head). | fnt_extra_syntax.R |
| avg_dep_count | Mean number of dependents per head. | fnt_extra_syntax.R |
| max_path | Maximum dependency‑tree path length per sentence. | fnt_heights.R |
| avg_path | Average path length per sentence. | fnt_heights.R |
| count_path | Total number of paths (vcount × avg_path). | fnt_heights.R |
| sentence_length | Number of nodes in the dependency graph (sentence length). | fnt_heights.R |
| adj_max_path | log(max_path) / log(sentence_length). | fnt_heights.R |
| prop_hf | Proportion of tokens whose head occurs later in the sentence. | fnt_heights.R |
| prop_hi | Proportion of tokens whose head occurs earlier in the sentence. | fnt_heights.R |
| avg_sent_height | Mean max_path across all sentences in a document. | fnt_heights.R |
| norm_sent_height | Mean adj_max_path across all sentences in a document. | fnt_heights.R |
| total_paths | Sum of count_path across sentences. | fnt_heights.R |
| dim1…dimₙ | Embedding dimensions for each sentence (columns dim1, dim2, …). | fnt_embeddings.R |
| <dimₖ> (mean) | Mean of each embedding dimension per document. | fnt_embeddings.R |
| _freq[_u]_imputed | Lexical DB freq (raw or per‑million) with Good‑Turing smoothing. | fnt_lexical.R |
| N, N1 | Corpus‑ or DB‑level token count (N) and hapax count (N1). | fnt_lexical.R |
| TTR | Type–token ratio: unique types / total tokens. | fnt_lexical.R |
| maas | Maas’s lexical diversity index: (log(N) − log(V)) / log(N)², with N tokens and V types. | fnt_lexical.R |
| MATTR | Moving‑average TTR over sliding windows (default size 50). | fnt_lexical.R |
| simpsons_D | Simpson’s diversity (D‑measure) over tokens. | fnt_lexical.R |
| TTR_content, maas_content, … | Same as above, restricted to content words. | fnt_lexical.R |
| maas_verb, simpsons_D_verb | Maas and Simpson’s D over verbs only. | fnt_lexical.R |
| token_surprisal | Surprisal in bits (−log₂ P) per token, backing off from trigram to bigram to unigram. | fnt_pos_surprisal.R |
| pos_entropy | Entropy per token (2‑gram context). | fnt_pos_surprisal.R |
| pos_entropy_reduction | Difference in entropy between successive positions. | fnt_pos_surprisal.R |
| mean_pos_surprisal | Document‑level mean of pos_surprisal. | fnt_pos_surprisal.R |
| sd_pos_surprisal | Document‑level SD of pos_surprisal. | fnt_pos_surprisal.R |
| mean_pos_entropy | Document‑level mean of pos_entropy. | fnt_pos_surprisal.R |
| sd_pos_entropy | Document‑level SD of pos_entropy. | fnt_pos_surprisal.R |
| mean_pos_entropy_reduction | Document‑level mean of pos_entropy_reduction. | fnt_pos_surprisal.R |
| sd_pos_entropy_reduction | Document‑level SD of pos_entropy_reduction. | fnt_pos_surprisal.R |
| (same six at sentence level) | The same six summary stats computed at the sentence level. | fnt_pos_surprisal.R |
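
The lexical diversity rows above (TTR, maas, MATTR) can be illustrated with a minimal Python sketch. It is independent of fnt_lexical.R and is not the package's implementation: the function names are hypothetical, the Maas index uses the standard published form (log N − log V) / (log N)², and the fallbacks for very short texts are assumptions.

```python
import math


def ttr(tokens):
    """Type-token ratio: unique types / total tokens."""
    return len(set(tokens)) / len(tokens)


def maas(tokens):
    """Maas index a^2 = (log N - log V) / (log N)^2 (standard form, assumed here)."""
    n = len(tokens)
    v = len(set(tokens))
    if n < 2:
        return float("nan")  # assumption: undefined when log(N) is 0
    return (math.log(n) - math.log(v)) / (math.log(n) ** 2)


def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens."""
    if len(tokens) <= window:
        return ttr(tokens)  # assumption: plain TTR when the text is shorter than the window
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


tokens = "le chat dort et le chien dort aussi".split()
print(ttr(tokens), maas(tokens), mattr(tokens, window=4))
```

Lower Maas values indicate higher diversity, whereas TTR and MATTR increase with diversity. MATTR is less sensitive to text length than plain TTR because every window has the same size.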