ILSA: Integrated Lexico-Syntactic Analyzer
Produces classic readability features and more advanced psycholinguistic features, including POS surprisal and dependency-tree based syntactic features.
This is a complete re-write of the pipeline described Loignon (2021). Please cite the 2021 paper if you use ALSI/ILSA (see bibliography at the end of this page).
Included French lexical frequency databases:
-
Manulex (Lété et al, 2004)
-
ÉQOL (Stanké et al, 2019)
-
flelex (François et al., 2014)
-
Quebec's Ministry of education vocabulary list. We also include the yet unpublished frequencies, scraped from the Franqus (USITO) website: https://franqus.ca/liste_orthographique/outil_de_recherche/
-
LexConn (Roze, Danlos & Muller, 2012) — a lexicon of French discourse connectives used for multi-word expression matching.
Please cite the relevant papers if you use the lexical databases included in the ALSI/ILSA tool.
ALSI extracts many types of features (see the full feature list):
- Surface counts — word, sentence, and character counts; POS-tag counts and proportions; verb tense and mood distributions.
- Lexical frequency — word-level frequency and grade-level lookup against four French databases (ÉQOL, Franqus, Manulex, FLELex), with Good-Turing imputation for out-of-vocabulary items.
- Lexical diversity — TTR, Maas, MATTR, and Simpson's D, computed on full vocabulary, content words, and verbs separately.
- Dependency / syntactic complexity — dependency depth, branching factor, head distance, Gibson DLT integration cost, clausal density, and head-final/head-initial ratios.
- Lexical cohesion — token and lemma overlap across sentence windows, argument overlap, and cosine similarity between adjacent sentences.
- Semantic embeddings and coherence — sentence and document embeddings; thematic dispersion, sequential similarity, topic drift, novelty, and conceptual convexity.
- LLM surprisal — token-level surprisal and entropy from masked (MLM) or autoregressive (AR) language models.
- Word burstiness — Weibull β and negative-binomial adaptation scores measuring how clustered each word's occurrences are across documents.
- Multi-word expression (MWE) matching — density features for any user-supplied MWE lexicon, broken down by relation group and category. Demonstrated with LEXCONN (Roze, Danlos & Muller, 2012), a French discourse-connective lexicon.
- Ollama LLM querying — general-purpose row-by-row querying of a locally-run LLM (via Ollama) for annotation, classification, paraphrase, or any templated task.
build_corpus() reads UTF-8 by default, but also supports Latin-1 (encoding = "latin1") and Windows-1252 (encoding = "windows-1252"), which are common in older French corpora. Use encoding = "auto" to let the function detect each file's encoding automatically — if a directory contains files with different encodings, they will all be read correctly and you will get a warning listing what was found. See demos/demo_corpus_read.R for examples.
ALSI uses a Universal Dependency based model, with a custom model of the French language by default. Our French model was trained on the French-GSD treebank, slightly modified so that AUX tags refer only to actual auxiliary verb, as proposed by Duran et al. (2021). It will therefore produce what we consider to be a more sensible tagging and an appropriate use of the AUX tag, e.g.:
- ALSI/ISLA custom model: "Le (DET) chat (NOUN) est (VERB) gris (ADJ). Il (PRON) est (AUX) parti (VERB)." The copula "est" is tagged as VERB. The auxiliary "est" in the second sentence is also correctly tagged as AUX.
- UDPipe model: "Le (DET) chat (NOUN) est (AUX) gris (ADJ). Il (PRON) est (AUX) parti (VERB)." Both "est" are tagged as AUX, which is confusing for languages that have actual auxiliary verbs.
Duran, M., Pagano, A., Rassi, A., & Pardo, T. (2021). On auxiliary verb in Universal Dependencies: Untangling the issue and proposing a systematized annotation strategy. In N. Mazziotta & S. Mille (Eds.), Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021) (pp. 10–21). Association for Computational Linguistics. https://aclanthology.org/2021.depling-1.2/
François, T., Gala, N., Watrin, P., & Fairon, C. (2014, May). FLELex: a graded lexical resource for French foreign learners. In International conference on Language Resources and Evaluation (LREC 2014).
Loignon, G. (2021). ILSA: an automated language complexity analysis tool for French. Mesure et évaluation en éducation, 44, 61-88. https://doi.org/10.7202/1095682ar
Roze, C., Danlos, L., & Muller, P. (2012). LEXCONN: A French lexicon of discourse connectives. Discours, 10. https://doi.org/10.4000/discours.8645
Lété, B., Sprenger-Charolles, L., & Colé, P. (2004). MANULEX: A grade-level lexical database from French elementary school readers. Behavior Research Methods, Instruments, & Computers, 36(1), 156-166.
Stanké, B., Mené, M. L., Rezzonico, S., Moreau, A., Dumais, C., Robidoux, J., ... & Royle, P. (2019). ÉQOL: Une nouvelle base de données québécoise du lexique scolaire du primaire comportant une échelle d’acquisition de l’orthographe lexicale. Corpus, (19).