A shortage of skilled labor is one of the evolving problems in manufacturing. AR technology can speed up workforce training in addition to decreasing work-related injuries, cutting operational costs, and increasing productivity using digital workflows.
Project is created with:
- Pillow version: 7.2.0
- matplotlib version: 3.3.1
- numpy version: 1.18.5
- openpyxl version: 3.0.7
- pandas version: 1.1.1
- scikit-learn version: 0.23.2
- tqdm version: 4.48.2
- boto3 version: 1.18.16
- gensim version: 3.8.3
- nltk version: 3.6.2
- spacy version: 2.3.5
- Clone the repo.
- (Recommended) Create and activate a virtualenv under the `env/` directory. Git is already configured to ignore it.
- Install the very minimal requirements using `pip install -r requirements.txt`.
- Run Jupyter in whatever way works for you. The simplest would be to run `pip install jupyter && jupyter notebook`.
- Start work.
This project is a system for finding any relevant information within enterprise documentation.
- Readable and scanned PDF manuals of products
- For hackathon purposes, Emerson and Ashcroft product documents are used, which include:
  - Measurement instrumentation
  - Valves, actuators, and regulators
  - Other sensors and transmitters
- More than 1,000 documents
This model will be evaluated with the Mean Average Precision at 5 (MAP@5) metric.
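To make the metric concrete, here is a minimal sketch of MAP@5 in plain Python. It assumes one common definition (precision accumulated at each relevant hit within the top 5, divided by the smaller of 5 and the number of relevant items); the hackathon's official scorer may differ in details. The function names `average_precision_at_k` and `mean_average_precision_at_k` are hypothetical and not taken from this project.

```python
from typing import Hashable, List, Sequence, Set


def average_precision_at_k(
    relevant: Set[Hashable], ranked: Sequence[Hashable], k: int = 5
) -> float:
    """AP@k for one query: mean of precision values at each relevant hit."""
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()  # guard against the same item being counted twice
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            score += hits / rank  # precision at this rank
    # Assumed normalization: min(number of relevant items, k)
    return score / min(len(relevant), k)


def mean_average_precision_at_k(
    all_relevant: List[Set[Hashable]],
    all_ranked: List[Sequence[Hashable]],
    k: int = 5,
) -> float:
    """MAP@k: average of AP@k over all queries."""
    scores = [
        average_precision_at_k(rel, ranked, k)
        for rel, ranked in zip(all_relevant, all_ranked)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

With a single relevant document per query, AP@5 reduces to 1 divided by the rank of that document if it appears in the top 5, and 0 otherwise.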
- Text Extraction
  - GCP Vision
- Text Preprocessing
  - Removing hyphens and extra spaces
  - Removing irrelevant text
  - Joining paragraphs
  - Other
- Dataset Creation
  - Split text into sentences
  - Split sentences into words
  - Identify page and document names for each text
  - Other
- Modelling
  - Text Preprocessing:
    - Stop words
    - Duplicates
    - Very short words
  - Word2Vec
  - FastText
  - Normalization
  - Other
- Vector Creation
  - Extend dataset with vectors
  - Grouping data
- Searching for similar texts and computing the metric score on validation data.
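The modelling and search steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the stop-word list, the length cutoff, the reading of "duplicates" as repeated words within a sentence, and the tiny hand-written `EMBEDDINGS` table are all hypothetical. In the real pipeline the word vectors would presumably come from a trained Word2Vec or FastText model, and the rows from the extracted PDF text.

```python
import math
import re
from typing import Dict, List, Optional, Tuple

Vector = List[float]

# Hypothetical stop-word list and cutoff, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for"}
MIN_WORD_LEN = 3


def preprocess(sentence: str) -> List[str]:
    """Lowercase, tokenize, drop stop words, very short words, and repeats."""
    kept: List[str] = []
    for tok in re.findall(r"[a-z0-9]+", sentence.lower()):
        if tok in STOP_WORDS or len(tok) < MIN_WORD_LEN or tok in kept:
            continue
        kept.append(tok)
    return kept


def normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def sentence_vector(tokens: List[str], embeddings: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of known words, then normalize; None if no word is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def build_index(
    rows: List[Tuple[str, int, str]], embeddings: Dict[str, Vector]
) -> List[Tuple[str, int, Vector]]:
    """Extend each (document, page, text) row with its sentence vector."""
    index = []
    for doc, page, text in rows:
        vec = sentence_vector(preprocess(text), embeddings)
        if vec is not None:
            index.append((doc, page, vec))
    return index


def search(
    query: str,
    index: List[Tuple[str, int, Vector]],
    embeddings: Dict[str, Vector],
    top_k: int = 5,
) -> List[Tuple[str, int, float]]:
    """Rank indexed texts by cosine similarity to the query (unit vectors, so a dot product)."""
    q = sentence_vector(preprocess(query), embeddings)
    if q is None:
        return []
    scored = [
        (doc, page, sum(a * b for a, b in zip(q, vec))) for doc, page, vec in index
    ]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:top_k]


# Toy data: stand-ins for trained word vectors and extracted manual text.
EMBEDDINGS: Dict[str, Vector] = {
    "valve": [1.0, 0.0],
    "actuator": [0.9, 0.1],
    "pressure": [0.0, 1.0],
    "gauge": [0.1, 0.9],
}
ROWS = [
    ("valve_manual.pdf", 3, "The valve actuator is installed."),
    ("gauge_manual.pdf", 7, "Pressure gauge calibration"),
]
INDEX = build_index(ROWS, EMBEDDINGS)
```

Calling `search("pressure gauge", INDEX, EMBEDDINGS)` ranks the gauge manual page first. The top 5 results per query are what a MAP@5 score would then be computed over.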
👤 Nazarii Drushchak
- Email: naz2001r@gmail.com
- Kaggle: naz2001r