A Repo on Ingestion & Analysis of the Special Index of Plants and Chemicals mined via Public Resource media.org
In 2021, The Florilegium: A Special Index to Plants was generated and made publicly available by Public Resource.
BACKGROUND
In ancient times, a florilegium was a chapbook, a commonplace book you carried with you. When you read a manuscript, you would jot down quotes and observations in it. The name means "to gather flowers." You can learn more about the name of this collection from Wikipedia.
This collection consists of a scan of journal articles, looking for plant names from two lists:
The first is a list of 1,828 names from EssOilDB, generated by the National Institute of Plant Genome Research (NIPGR), New Delhi, India.
The other is a list of 8,361 plants from the University of Trans-Disciplinary Health Sciences and Technology (TDU).
In addition, the data contains search results for 3,443 chemical names that were found in our earlier work (EssOilDB).
THE DATA
This data is preliminary. It consists of 6,150,600 instances of a plant name found in a journal article for the TDU list and 1,468,218 for the NIPGR list. In addition, the data contains 30,683,797 hits for the 3,443 chemical names.
The corpus of 107,233,728 journal articles was first put through text extraction, followed by the generation of n-gram tables (single terms, bigrams, trigrams, and so on, up to 5 terms in length) from the text of each document. These n-gram tables were then searched for terms of interest (e.g. plants and chemicals), and the output was split into 16 slices.
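As an illustration of the pipeline described above, here is a minimal sketch of building an n-gram table for one document and looking up terms of interest in it. The whitespace tokenizer, the lowercasing, the function names (`ngram_table`, `find_terms`) and the sample sentence are all assumptions made for this example; the actual extraction code used to build the index is not described here.

```python
from collections import Counter

def ngram_table(text, max_n=5):
    """Count every n-gram of 1..max_n whitespace-separated tokens in text."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    table = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            table[" ".join(tokens[i:i + n])] += 1
    return table

def find_terms(table, terms_of_interest):
    """Return {term: count} for each term of interest present in the table."""
    return {t: table[t.lower()] for t in terms_of_interest if t.lower() in table}

# Hypothetical sample text and term list, for illustration only.
sample = "Extracts of Ocimum tenuiflorum and Azadirachta indica contain linalool"
table = ngram_table(sample)
hits = find_terms(
    table,
    ["Ocimum tenuiflorum", "Azadirachta indica", "linalool", "Mentha piperita"],
)
print(hits)
```

Because the table stores every n-gram up to 5 terms, a multi-word plant name is matched with a single dictionary lookup rather than a scan of the full text.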
The current data upload consists of all 16 slices using the TDU list of plants.
The data is presented in a human-readable report and in TSV format for loading into spreadsheets or databases.
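For loading the TSV files programmatically, a sketch along the following lines may be enough. The column layout of the files is not described in this README, so the reader below makes no assumption about it and simply yields each row as a list of strings; the two sample rows are hypothetical.

```python
import csv
import io

def read_tsv(handle):
    """Yield each row of a tab-separated file as a list of strings."""
    yield from csv.reader(handle, delimiter="\t")

# Hypothetical two-row sample; the real column layout is not specified here.
sample = io.StringIO("Ocimum tenuiflorum\t3\nAzadirachta indica\t1\n")
rows = list(read_tsv(sample))
print(rows)
```

When reading a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `read_tsv`.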
The ingestion will be done by NIPGR Florilegium interns.
The files are coded first with the letter being searched (a-z), then with the slice being searched (0-f). So, a human-readable file (print master report) might have the name pmr_b4_2021-11-22.txt, indicating that the file was generated on November 22, 2021 and consists of plant names starting with the letter b on slice 4.
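The naming scheme above can be decoded mechanically. The following is a sketch that assumes every file follows the same `prefix_letter+slice_date.extension` pattern as the example name; only the `pmr` prefix is confirmed by this README, so the pattern accepts any lowercase prefix, and the function name `parse_name` is made up for this example.

```python
import re
from datetime import date

# Assumed pattern: <prefix>_<letter a-z><slice 0-f>_<YYYY-MM-DD>.<extension>
FILENAME_RE = re.compile(
    r"^(?P<prefix>[a-z]+)_(?P<letter>[a-z])(?P<slice>[0-9a-f])_"
    r"(?P<date>\d{4}-\d{2}-\d{2})\.(?P<ext>\w+)$"
)

def parse_name(filename):
    """Split a data file name into its prefix, letter, slice, date and extension."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    return {
        "prefix": m["prefix"],
        "letter": m["letter"],
        "slice": int(m["slice"], 16),  # slices 0-f are read as one hex digit
        "date": date.fromisoformat(m["date"]),
        "ext": m["ext"],
    }

info = parse_name("pmr_b4_2021-11-22.txt")
print(info)
```

Reading the slice as a hexadecimal digit gives the 16 slices a natural numbering from 0 to 15.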
GOALS
General goal: Mine the corpus to end up with a related table connecting:
Plant names
1. Mentioned in the corpus?
   - In how many papers?
   - How many times per paper?
2. Mentioned alone (with no other plants) in the same document?
3. Commonly mentioned together?
Chemical Constituents
1. Mentioned in the corpus?
   - In how many papers?
   - How many times per paper?
2. What plants do they co-occur with?
Activities
1. Biological
2. Uses
   - Agriculture, Medicine, Industrial, etc.
Identifiers
1. Article IDs
2. Keys to the data will be the DOIs and MD5 hashes
3. Need to add Wikidata IDs wherever possible
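The plant-name questions in the goals above (how many papers, mentioned alone, commonly mentioned together) can be sketched as a small aggregation over hit records. The `(doc_id, plant, count)` record shape, the sample records and the function name `summarize` are assumptions for illustration; in the real data the document key would be a DOI or MD5 hash as noted under Identifiers.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical hit records: (document id, plant name, occurrences in that document).
hits = [
    ("doc1", "Ocimum tenuiflorum", 3),
    ("doc1", "Azadirachta indica", 1),
    ("doc2", "Ocimum tenuiflorum", 2),
    ("doc3", "Azadirachta indica", 4),
    ("doc3", "Ocimum tenuiflorum", 1),
]

def summarize(hits):
    """Return (papers per plant, papers where a plant appears alone, pair counts)."""
    plants_by_doc = defaultdict(set)
    papers = Counter()
    for doc_id, plant, _count in hits:
        if plant not in plants_by_doc[doc_id]:
            plants_by_doc[doc_id].add(plant)
            papers[plant] += 1
    alone = Counter()
    pairs = Counter()
    for plants in plants_by_doc.values():
        if len(plants) == 1:
            alone[next(iter(plants))] += 1
        # Sort so each unordered pair is always counted under the same key.
        for a, b in combinations(sorted(plants), 2):
            pairs[(a, b)] += 1
    return papers, alone, pairs

papers, alone, pairs = summarize(hits)
print(papers, alone, pairs)
```

The same aggregation would apply to chemical names, and a plant-to-chemical co-occurrence table could be built by pairing the two kinds of hit on the shared document key.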
PEOPLE INVOLVED
- Project Owner: Gita Yadav. Responsibilities: Project Vision and Criteria
- Project Advisor: Peter Murray-Rust. Responsibilities: Strategy
- Program Manager: Manny Faria Arruda. Responsibilities: Planning, Coordination, and Tracking
- Intern A. Responsibilities:
  - Timely, complete and accurate logging of activities, methods, trial and error results, etc.
- Intern B. Responsibilities:
  - Timely, complete and accurate logging of activities, methods, trial and error results, etc.
CHALLENGES
- The current data does not have TF (term frequency) or IDF (inverse document frequency) metrics, but these will be added later to support the identification of weak/strong correlations.
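For reference, one common textbook formulation of TF-IDF is sketched below. This is an assumption about what such a metric could look like, not the variant the project has committed to; the function name and the sample numbers are made up.

```python
import math

def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    """TF-IDF using relative term frequency and a natural-log IDF (one common variant)."""
    tf = term_count / doc_length                  # share of the document taken by the term
    idf = math.log(total_docs / docs_with_term)   # rarer terms get a larger weight
    return tf * idf

# Hypothetical numbers: a term seen 3 times in a 100-term document,
# present in 10 of 1,000 documents.
score = tf_idf(3, 100, 10, 1000)
print(score)
```

With this formulation, a term that occurs in every document scores zero, which is what lets the metric separate distinctive plant or chemical mentions from ubiquitous ones.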
TO LEARN MORE
To learn more, watch <a href="https://archive.org/details/multicasting?and%5B%5D=subject%3A%22TDM%20Today%22">The TDM Today Show</a>!
You may also be interested in The General Index
And the Special Index to Species.
This data only represents the searches on the n-gram tables. If you wish to search the complete texts of all 57 million papers used in the corpus, please search OpenAlex directly.