Features

Feature	Description	Script
word_count	Total number of tokens (excluding punctuation).	fnt_counters.R
unique_word_count	Number of distinct tokens.	fnt_counters.R
content_word_count	Number of tokens with UPOS in {NOUN, VERB, ADJ, ADV, PRON}.	fnt_counters.R
unique_content_word_count	Number of distinct content‑word tokens.	fnt_counters.R
sentence_count	Number of sentences (distinct sentence_id values).	fnt_counters.R
paragraph_count	Number of paragraphs (distinct paragraph_id values).	fnt_counters.R
char_count	Sum of characters over all non‑punctuation tokens.	fnt_counters.R
char_count_content	Sum of characters over content‑word tokens.	fnt_counters.R
avg_word_length	char_count / word_count.	fnt_counters.R
avg_sentence_length	word_count / sentence_count.	fnt_counters.R
avg_content_word_length	char_count_content / content_word_count.	fnt_counters.R
cnt_	Count of each UPOS tag per document (one column per tag).	fnt_counters.R
prop_	Proportion of each UPOS tag per document.	fnt_counters.R
present_count	Number of verbs tagged Tense=Pres.	fnt_counters.R
past_count	Number of verbs tagged Tense=Past.	fnt_counters.R
future_count	Number of verbs tagged Tense=Fut.	fnt_counters.R
conditional_count	Number of verbs tagged Tense=Cond.	fnt_counters.R
subjunctive_count	Number of verbs tagged Mood=Sub.	fnt_counters.R
indicative_count	Number of verbs tagged Mood=Ind.	fnt_counters.R
imperative_count	Number of verbs tagged Mood=Imp.	fnt_counters.R
infinitive_count	Number of verbs tagged VerbForm=Inf.	fnt_counters.R
past_participle_count	Number of tokens with VerbForm=Part + Tense=Past.	fnt_counters.R
present_participle_count	Number of tokens with VerbForm=Part + Tense=Pres.	fnt_counters.R
past_simple_count	Number of tokens with Tense=Past + Mood=Ind + VerbForm=Fin.	fnt_counters.R
past_compose_count	Number of tokens with Tense=Past + Mood=Ind + VerbForm=Part.	fnt_counters.R
token_sent_overlap	Proportion of tokens in a document that reappear in the sentence immediately following their own.	fnt_cohesion.R
token_doc_overlap	Mean, over tokens, of the share of a token’s occurrences in the whole document that fall outside its own sentence.	fnt_cohesion.R
content_sent_overlap	As token_sent_overlap, restricted to content words (NOUN, VERB, ADJ, ADV).	fnt_cohesion.R
content_doc_overlap	As token_doc_overlap, restricted to content words.	fnt_cohesion.R
cosine_sent	Mean cosine similarity between adjacent sentences (all tokens).	fnt_cohesion.R
cosine_content	Mean cosine similarity between adjacent sentences (content tokens only).	fnt_cohesion.R
n_complex_nominal	Number of “complex” nominals (head nouns with dependents).	fnt_extra_syntax.R
n_complex_verb	Number of “complex” verbs (head verbs with dependents).	fnt_extra_syntax.R
n_clause	Estimated number of clauses (clause indicators + 1).	fnt_extra_syntax.R
n_clause_per_sent	n_clause / sentence_count.	fnt_extra_syntax.R
avg_clause_length	Average number of tokens per clause.	fnt_extra_syntax.R
complex_nom_per_sent	n_complex_nominal / sentence_count.	fnt_extra_syntax.R
complex_verb_per_sent	n_complex_verb / sentence_count.	fnt_extra_syntax.R
verb_per_sent	Number of verbs per sentence.	fnt_extra_syntax.R
avg_dep_dist	Mean dependency distance (token ↔ head).	fnt_extra_syntax.R
avg_dep_count	Mean number of dependents per head.	fnt_extra_syntax.R
max_path	Maximum dependency‑tree path length per sentence.	fnt_heights.R
avg_path	Average path length per sentence.	fnt_heights.R
count_path	Total number of paths (vcount × avg_path).	fnt_heights.R
sentence_length	Number of nodes in the dependency graph (sentence length).	fnt_heights.R
adj_max_path	log(max_path) / log(sentence_length).	fnt_heights.R
prop_hf	Proportion of tokens whose head occurs later in the sentence.	fnt_heights.R
prop_hi	Proportion of tokens whose head occurs earlier in the sentence.	fnt_heights.R
avg_sent_height	Mean max_path across all sentences in a document.	fnt_heights.R
norm_sent_height	Mean adj_max_path across all sentences in a document.	fnt_heights.R
total_paths	Sum of count_path across sentences.	fnt_heights.R
dim1…dimₙ	Embedding dimensions for each sentence (columns dim1, dim2, …).	fnt_embeddings.R
<dimₖ> (mean)	Mean of each embedding dimension per document.	fnt_embeddings.R
_freq[_u]_imputed	Lexical DB freq (raw or per‑million) with Good‑Turing smoothing.	fnt_lexical.R
N, N1	Corpus‑ or DB‑level token count (N) and hapax count (N1).	fnt_lexical.R
TTR	Type–token ratio: unique types / total tokens.	fnt_lexical.R
maas	Maas’s lexical diversity index: (log(N) − log(V)) / (log(N))², where N is the token count and V the type count.	fnt_lexical.R
MATTR	Moving‑average TTR over sliding windows (default size 50).	fnt_lexical.R
simpsons_D	Simpson’s diversity (D‑measure) over tokens.	fnt_lexical.R
TTR_content, maas_content, …	Same as above, restricted to content words.	fnt_lexical.R
maas_verb, simpsons_D_verb	Maas and Simpson’s D over verbs only.	fnt_lexical.R
token_surprisal	Surprisal in bits (−log₂ P) per token, backing off from trigram to bigram to unigram.	fnt_pos_surprisal.R
pos_entropy	Entropy per token (2‑gram context).	fnt_pos_surprisal.R
pos_entropy_reduction	Difference in entropy between successive positions.	fnt_pos_surprisal.R
mean_pos_surprisal	Document‑level mean of pos_surprisal.	fnt_pos_surprisal.R
sd_pos_surprisal	Document‑level SD of pos_surprisal.	fnt_pos_surprisal.R
mean_pos_entropy	Document‑level mean of pos_entropy.	fnt_pos_surprisal.R
sd_pos_entropy	Document‑level SD of pos_entropy.	fnt_pos_surprisal.R
mean_pos_entropy_reduction	Document‑level mean of pos_entropy_reduction.	fnt_pos_surprisal.R
sd_pos_entropy_reduction	Document‑level SD of pos_entropy_reduction.	fnt_pos_surprisal.R
(same six at sentence level)	The same six summary stats computed at the sentence level.	fnt_pos_surprisal.R
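
The lexical diversity rows above (TTR, maas, MATTR, simpsons_D) are easier to check against a small worked example. The following is a minimal Python sketch, not the code in fnt_lexical.R. The function names are hypothetical. Maas is computed in its usual published form, (log N − log V) / (log N)². Simpson's D is assumed to be the sum of n(n − 1) over types divided by N(N − 1); the script may use a variant such as 1 − D. For texts shorter than the window, MATTR is assumed to fall back to plain TTR.

```python
import math
from collections import Counter


def ttr(tokens):
    """Type-token ratio: distinct types / total tokens (tokens must be non-empty)."""
    return len(set(tokens)) / len(tokens)


def maas(tokens):
    """Maas a^2 = (log N - log V) / (log N)^2. Lower values mean more diversity."""
    n, v = len(tokens), len(set(tokens))
    if n < 2:
        raise ValueError("Maas index needs at least two tokens")
    return (math.log(n) - math.log(v)) / math.log(n) ** 2


def mattr(tokens, window=50):
    """Mean TTR over every sliding window of `window` consecutive tokens."""
    n = len(tokens)
    if n <= window:
        # Assumption: short texts fall back to plain TTR.
        return ttr(tokens)
    counts = Counter(tokens[:window])
    type_total = len(counts)  # number of types in the first window
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        type_total += len(counts)
    n_windows = n - window + 1
    return type_total / (window * n_windows)


def simpsons_d(tokens):
    """Simpson's D: probability that two tokens drawn without replacement match."""
    n = len(tokens)
    if n < 2:
        raise ValueError("Simpson's D needs at least two tokens")
    counts = Counter(tokens)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    sample = ["a", "b", "a", "c"]
    print(ttr(sample), maas(sample), mattr(sample, window=3), simpsons_d(sample))
```

The restricted variants (TTR_content, maas_verb, and so on) would follow by filtering the token list on UPOS before calling the same functions.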
