MODELBLOCKS RESOURCES

This file provides descriptions and access details for each external resource not included in ModelBlocks. ModelBlocks needs to know where to find these resources, so each one has an associated config/user-*.txt pointer file, which you will need to edit so that it contains the absolute path of that resource on your system.
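As a minimal sketch of the editing step described above, the following Python helper writes the absolute path of a resource into its pointer file. It assumes that a pointer file holds nothing but the path on a single line and that pointer files follow the `user-<key>-directory.txt` naming seen in the entries below; the function names `write_pointer_file` and `read_pointer_file` are hypothetical and are not part of ModelBlocks.

```python
from pathlib import Path


def write_pointer_file(config_dir, resource_key, resource_path):
    """Write the absolute path of a resource into its pointer file.

    Assumption: the pointer file contains only the path, on one line.
    """
    # Expand "~" and convert to an absolute path, as the pointer file requires.
    target = Path(resource_path).expanduser().resolve()
    pointer = Path(config_dir) / f"user-{resource_key}-directory.txt"
    # Create the config directory if it does not exist yet.
    pointer.parent.mkdir(parents=True, exist_ok=True)
    pointer.write_text(str(target) + "\n")
    return pointer


def read_pointer_file(pointer):
    """Return the path stored in a pointer file, stripped of whitespace."""
    return Path(Path(pointer).read_text().strip())
```

For example, `write_pointer_file("config", "treebank", "~/corpora/treebank")` would create `config/user-treebank-directory.txt`; the corpus location here is a placeholder for wherever the resource lives on your system. Editing the file by hand works just as well.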

The Alice in Wonderland corpus

NAME: The Alice in Wonderland corpus
MB POINTER FILE: config/user-alice-directory.txt
AVAILABILITY: Unreleased
DESCRIPTION: fMRI data from 28 subjects listening to the first chapter of Alice in Wonderland.
Collected by Brennan et al (2016).

The BMMM Unsupervised PoS tagger (Christodoulopoulos et al 2011)

NAME: The BMMM Unsupervised PoS tagger (Christodoulopoulos et al 2011)
MB POINTER FILE: config/user-bmmm-directory.txt
AVAILABILITY: Free
URL: https://github.com/christos-c/bmmm
DESCRIPTION: A Bayesian multinomial mixture model (BMMM) for unsupervised
part of speech tagging (Christodoulopoulos et al 2011).

The Berkeley Parser (jarfile)

NAME: The Berkeley Parser (jarfile)
MB POINTER FILE: config/user-berkeleyparserjar-directory.txt
AVAILABILITY: Free
URL: http://nlp.cs.berkeley.edu/software.shtml
DESCRIPTION: The directory containing the jarfile for the Berkeley Parser.

The British National Corpus (BNC)

NAME: The British National Corpus (BNC)
MB POINTER FILE: config/user-bnc-directory.txt
AVAILABILITY: Free
URL: http://www.natcorp.ox.ac.uk/
DESCRIPTION: The British National Corpus (BNC) is a 100 million word collection
of samples of written and spoken language from a wide range of sources, designed
to represent a wide cross-section of British English, both spoken and written,
from the late twentieth century.

CCL unsupervised parser (Seginer, 2007)

NAME: CCL unsupervised parser (Seginer, 2007)
MB POINTER FILE: config/user-ccl-directory.txt
AVAILABILITY: Free
URL: http://www.seggu.net/ccl/
DESCRIPTION: The CCL unsupervised constituency parser (Seginer, 2007).

The CHILDES Corpus

NAME: The CHILDES Corpus
MB POINTER FILE: config/user-childes-directory.txt
AVAILABILITY: Free
URL: http://childes.talkbank.org/
DESCRIPTION: CHILDES is the child language component of the TalkBank system.
TalkBank is a system for sharing and studying conversational interactions.

The Dependency Model with Valence (DMV)

NAME: The Dependency Model with Valence (DMV)
MB POINTER FILE: config/user-dmv-directory.txt
AVAILABILITY: Free
URL: https://code.google.com/archive/p/pr-toolkit/
DESCRIPTION: An implementation (Gillenwater et al 2010) of the Dependency
Model with Valence parser (Klein & Manning 2004) for unsupervised
dependency parsing.

The Dundee eye-tracking corpus

NAME: The Dundee eye-tracking corpus
MB POINTER FILE: config/user-dundee-directory.txt
AVAILABILITY: Unreleased
DESCRIPTION: A corpus of eye-tracking measures from 10 subjects who read
newspaper articles (Kennedy et al, 2003).

Echo-state network (ESN) directory

NAME: Echo-state network (ESN) directory
MB POINTER FILE: config/user-esn-directory.txt
DESCRIPTION: A directory in which to store the output of the ESN.

English Gigaword

NAME: English Gigaword
MB POINTER FILE: config/user-gigaword4-directory.txt
AVAILABILITY: Paid
URL: https://catalog.ldc.upenn.edu/ldc2003t05
DESCRIPTION: A comprehensive archive of newswire text data in English that
has been acquired over several years by the Linguistic Data Consortium.

Extended Penn Tokenizer

NAME: Extended Penn Tokenizer
MB POINTER FILE: config/user-tokenizer-directory.txt
AVAILABILITY: Free
URL: https://github.com/vansky/extended_penn_tokenizer
DESCRIPTION: Extended version of Robert McIntyre's (1995) Penn tokenizer.

GENIA Tagger

NAME: GENIA Tagger
MB POINTER FILE: config/user-geniatagger-directory.txt
AVAILABILITY: Free
URL: http://www.nactem.ac.uk/GENIA/tagger/
DESCRIPTION: Part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text.

KenLM Language Model Toolkit

NAME: KenLM Language Model Toolkit
MB POINTER FILE: config/user-kenlm-directory.txt
AVAILABILITY: Free
URL: https://kheafield.com/code/kenlm/
DESCRIPTION: KenLM estimates, filters, and queries language models. Estimation
is fast and scalable due to streaming algorithms.

KenLM Language Model Toolkit (model binaries directory)

NAME: KenLM Language Model Toolkit (model binaries directory)
MB POINTER FILE: config/user-kenlm-model-directory.txt
AVAILABILITY: Free
URL: https://kheafield.com/code/kenlm/
DESCRIPTION: KenLM estimates, filters, and queries language models. Estimation
is fast and scalable due to streaming algorithms.
This resource is just a directory in which to store compiled binaries.
You can specify a binaries directory using the pointer file above.

The MIT Sentence Passages corpus

NAME: The MIT Sentence Passages corpus
MB POINTER FILE: config/user-passages-directory.txt
AVAILABILITY: Unreleased
DESCRIPTION: A corpus of fMRI BOLD responses from subjects listening to audio
presentations of short passages (3-4 sentences each) in isolation.

The Natural Stories Corpus

NAME: The Natural Stories Corpus
MB POINTER FILE: config/user-naturalstories-directory.txt
AVAILABILITY: Unreleased
DESCRIPTION: A corpus of naturalistic stories meant to contain varied,
low-frequency syntactic constructions. There are a variety of annotations
and psycholinguistic measures available for the stories.

OntoNotes

NAME: OntoNotes
MB POINTER FILE: config/user-ontonotes-directory.txt
AVAILABILITY: Paid
URL: https://catalog.ldc.upenn.edu/ldc2013t19
DESCRIPTION: Syntactic and semantic annotations of a large corpus comprising
various genres of text.

The Penn Treebank (PTB)

NAME: The Penn Treebank (PTB)
MB POINTER FILE: config/user-treebank-directory.txt
AVAILABILITY: Paid
URL: https://catalog.ldc.upenn.edu/ldc99t42
DESCRIPTION: One million words of 1989 Wall Street Journal material annotated in Treebank II style.
A small sample of ATIS-3 material annotated in Treebank II style.
Switchboard tagged, dysfluency-annotated, and parsed text.
A fully tagged version of the Brown Corpus.
Brown parsed text.

R-Hacks

NAME: R-Hacks
MB POINTER FILE: config/user-rhacks-directory.txt
AVAILABILITY: Free
URL: https://github.com/aufrank/R-hacks
DESCRIPTION: Useful bits of code for programming and analysis in R.

The Roark Parser (Roark 2001, 2004)

NAME: The Roark Parser
MB POINTER FILE: config/user-roark-directory.txt
AVAILABILITY: Free
URL: https://github.com/roarkbr/incremental-top-down-parser
DESCRIPTION: A standard parser from Roark (2001, 2004) that computes psycholinguistic complexity measures.

SRILM Language Model Toolkit

NAME: SRILM Language Model Toolkit
MB POINTER FILE: config/user-srilm-directory.txt
AVAILABILITY: Free for non-commercial use
URL: http://www.speech.sri.com/projects/srilm/download.html
DESCRIPTION: SRILM is a toolkit for building and applying statistical language models (LMs),
primarily for use in speech recognition, statistical tagging and segmentation,
and machine translation.

The UCL corpus (Frank et al, 2013)

NAME: The UCL corpus (Frank et al, 2013)
MB POINTER FILE: config/user-ucl-directory.txt
AVAILABILITY: Free
URL: http://www.stefanfrank.info/readingdata/Data.zip
DESCRIPTION: Eye-tracking and self-paced-reading data
from subjects reading isolated sentences from a corpus
of novels written by amateur authors.

UPPARSE (Unsupervised parser, Ponvert et al, 2011)

NAME: UPPARSE (Unsupervised parser, Ponvert et al, 2011)
MB POINTER FILE: config/user-upparse-directory.txt
AVAILABILITY: Free
URL: https://github.com/eponvert/upparse
DESCRIPTION: Efficient implementations of hidden Markov
models (HMMs) and probabilistic right linear grammars (PRLGs) for
unsupervised partial parsing (also known as: unsupervised chunking,
unsupervised NP identification, unsupervised phrasal segmentation).

WordNet

NAME: WordNet
MB POINTER FILE: config/user-wordnet-directory.txt
AVAILABILITY: Free
URL: https://wordnet.princeton.edu/wordnet/download/
DESCRIPTION: WordNet is a large lexical database of English.

xlsx2csv

NAME: xlsx2csv
MB POINTER FILE: config/user-xlsx2csv-directory.txt
AVAILABILITY: Free
URL: https://github.com/dilshod/xlsx2csv
DESCRIPTION: XLSX to CSV converter.
