EC601 - VQA - For Visually Impaired

Introduction

A natural application of artificial intelligence is to help blind people overcome their daily visual challenges and live healthy, independent lives through AI-based assistive technologies. In this regard, one of the most promising tasks is Visual Question Answering (VQA).

We conducted a literature review of VQA. It covers the datasets and architectures used in earlier work as well as the newest state-of-the-art models, and it reviews in detail the methods used for each type of problem being tackled. See EC-601 HW1.pdf for more details.

Objectives:

Research and understand the evolution of VQA architectures
Train and validate different VQA architectures to compare their performance and examine their limitations
Understand the domain shift that needs to be considered when designing VQA solutions to empower visually impaired people
Evaluate and propose techniques to address the differences found within the VQA datasets.

Data sets:

VQA v2: The VQA dataset was constructed using images from the MS COCO dataset, with crowdsourced questions and answers. Visual content in MS COCO originates from web-based image search and is typically high quality. Moreover, the VQA authors instructed the crowdsourced workers to write interesting, diverse, and well-posed questions.

VizWiz: VizWiz was originally a phone application aimed at helping blind people with their daily visual problems. It allowed visually impaired users to take a picture and verbally ask a question they would like answered about it. After the application shut down, the data it had gathered was collected and used to build the dataset. Each image comes with one question, and the answers were then crowdsourced to complete the dataset. Two main takeaways:

Images are often of poor quality due to poor lighting, focus, and framing of the content of interest
Questions are on average more conversational and are sometimes incomplete due to audio recording imperfections, such as a question being clipped at the end or background audio being recorded

VizWiz vs VQA v-2

Reformatting Data:

VizWiz is the second dataset used in our project. It arises from a natural setting, reflecting a use case where a person asks questions about the surrounding world.

The VizWiz dataset can be downloaded from the following webpage: https://vizwiz.org/tasks-and-datasets/vqa/. Four files should be downloaded and unzipped:

- Images: training, validation, and test. 
- Annotations

Note: the annotation file contains the train, validation, and test splits.

In VQA v2, questions and annotations are stored separately. In VizWiz, questions and answers are stored together. We wrote a script that restructures the data from the VizWiz format to the VQA v2 format, or vice versa, to fit the needs of the model being used.

data-converter.py performs this conversion.
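The conversion from the VizWiz layout to the VQA v2 layout can be illustrated with a minimal sketch. This is not the code in data-converter.py. The field names (`image`, `question`, `answers`, `answer_confidence`, `answer_type`) follow our understanding of the public VizWiz annotation files, the output keys mirror the VQA v2 question and annotation files, and the id scheme (list index used as both `image_id` and `question_id`) and the `image_name` key are assumptions made for illustration.

```python
from collections import Counter


def vizwiz_to_vqa_v2(vizwiz_entries):
    """Split VizWiz-style entries into VQA v2-style question and annotation dicts."""
    questions, annotations = [], []
    for idx, entry in enumerate(vizwiz_entries):
        # Assumption: VizWiz has one question per image, so the list index
        # is reused as both the image id and the question id.
        image_id = question_id = idx
        questions.append({
            "image_id": image_id,
            "question_id": question_id,
            "question": entry["question"],
            "image_name": entry["image"],  # hypothetical extra key to locate the file
        })
        # Test splits carry no answers, so default to an empty list.
        answers = [
            {
                "answer_id": i + 1,
                "answer": a["answer"],
                "answer_confidence": a.get("answer_confidence"),
            }
            for i, a in enumerate(entry.get("answers", []))
        ]
        # Use the most frequent human answer as the single reference answer.
        most_common = Counter(a["answer"] for a in answers).most_common(1)
        annotations.append({
            "image_id": image_id,
            "question_id": question_id,
            "answers": answers,
            "multiple_choice_answer": most_common[0][0] if most_common else "",
            "answer_type": entry.get("answer_type", ""),
        })
    return {"questions": questions}, {"annotations": annotations}
```

The two returned dictionaries can then be written to separate JSON files with `json.dump`, one for questions and one for annotations.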

Architectures & Results

We were able to successfully train three models, one of which is a state-of-the-art model. We report the training and validation results for each.

Model 1: Visual Question Answering (https://arxiv.org/pdf/1505.00468.pdf).

The model consists of the following:

A two layer LSTM to encode the questions.
The last hidden layer of VGGNet to encode the image followed by feature normalization.
Fusion via element-wise multiplication.
Fully connected layer followed by a softmax layer to obtain a distribution over answers.
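
The four components listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the implementation we trained. The class name `FusionVQA`, all layer sizes, and the choice to build the question vector from the hidden and cell states of both LSTM layers are assumptions. The image input is assumed to be a precomputed VGGNet feature vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionVQA(nn.Module):
    """Hypothetical sketch: LSTM question encoder fused with image features."""

    def __init__(self, vocab_size, num_answers, img_feat_dim=4096,
                 embed_dim=300, hidden_dim=512, fused_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        # Question vector = hidden and cell states of both layers (4 * hidden_dim).
        self.q_proj = nn.Linear(4 * hidden_dim, fused_dim)
        self.img_proj = nn.Linear(img_feat_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, img_feat, question_tokens):
        # L2-normalize the precomputed image features, then project them.
        img = torch.tanh(self.img_proj(F.normalize(img_feat, p=2, dim=1)))
        _, (h, c) = self.lstm(self.embed(question_tokens))
        # h and c have shape (num_layers, batch, hidden_dim).
        q_state = torch.cat([h, c], dim=0).transpose(0, 1)
        q_state = q_state.reshape(question_tokens.size(0), -1)
        q = torch.tanh(self.q_proj(q_state))
        fused = img * q  # element-wise multiplication fusion
        return self.classifier(fused)  # logits; apply softmax for a distribution
```

Applying `F.softmax(logits, dim=1)` to the output gives the distribution over answers. During training the logits would normally be passed directly to a cross-entropy loss.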

We followed a PyTorch implementation of the model from this GitHub repository: https://github.com/tbmoon/basic_vqa.

The steps for training and validation are described in that repository.

The VQA v2 dataset is used for training and validation: https://visualqa.org/download.html.

After training and validating on VQA v2, we obtained the following results:

For VizWiz, the output of data-converter.py provides question and annotation files that follow the same structure as VQA v2. These files should be added to a folder containing the training, validation, and test images for VizWiz, so that the folder contains three items: images, questions, and annotations. After this step, the same training and validation process used for VQA v2 is followed; only the path to the new dataset folder needs to be changed.

After training and validating on VizWiz, we obtained the following results:

To check how well a model trained on VQA v2 generalizes, we trained on VQA v2 and validated on VizWiz:

Note: The accuracy of the model was calculated based on:

Accuracy = min( # of humans who provided the answer / 3, 1)

Meaning an answer is considered 100% accurate if at least 3 workers provided that exact answer. When comparing the answers, all responses are made lowercase, numbers converted to digits, and punctuation & articles removed.
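
The accuracy rule and the answer normalization described above can be sketched as follows. This is a simplified illustration, not the official evaluation script. The number-word table only covers zero through ten, and stripping all punctuation is an assumption made to keep the example short.

```python
import string

# Simplified number-word table (assumption: only zero through ten are mapped).
_NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}
_ARTICLES = {"a", "an", "the"}
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_answer(answer):
    """Lowercase, drop punctuation and articles, and convert number words to digits."""
    words = answer.lower().translate(_STRIP_PUNCT).split()
    words = [_NUMBER_WORDS.get(w, w) for w in words]
    return " ".join(w for w in words if w not in _ARTICLES)


def vqa_accuracy(predicted, human_answers):
    """Accuracy = min(number of humans who gave the predicted answer / 3, 1)."""
    pred = normalize_answer(predicted)
    matches = sum(normalize_answer(h) == pred for h in human_answers)
    return min(matches / 3, 1.0)
```

For example, a prediction that matches two of the ten human answers scores 2/3, while one that matches three or more scores 1.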

Model 2: A Strong Baseline For Visual Question Answering (https://arxiv.org/abs/1704.03162).

The model consists of the following:

A convolutional neural network based on ResNet to embed the image.
The input question is tokenized, embedded, and fed to a multi-layer LSTM.
The image features and the final state of the LSTM are used to compute multiple attention distributions over the image features.
The image feature glimpses and the final LSTM state are fed into two fully connected layers to produce probabilities over answer classes.
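
The attention step in the list above can be sketched in PyTorch as follows. This is a minimal illustration of computing several attention distributions ("glimpses") over a spatial feature map, not the code from the repository we used. The class name `GlimpseAttention`, the layer sizes, and the additive combination of image and question projections are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlimpseAttention(nn.Module):
    """Hypothetical sketch of multi-glimpse attention over image features."""

    def __init__(self, img_dim=2048, q_dim=1024, mid_dim=512, glimpses=2):
        super().__init__()
        self.img_conv = nn.Conv2d(img_dim, mid_dim, kernel_size=1, bias=False)
        self.q_lin = nn.Linear(q_dim, mid_dim)
        self.att_conv = nn.Conv2d(mid_dim, glimpses, kernel_size=1)

    def forward(self, img, q):
        # img: (batch, channels, height, width); q: (batch, q_dim)
        b, c, _, _ = img.shape
        # Broadcast the projected question state over every spatial location.
        x = F.relu(self.img_conv(img) + self.q_lin(q)[:, :, None, None])
        # One attention distribution per glimpse over all spatial locations.
        att = F.softmax(self.att_conv(x).flatten(2), dim=-1)   # (b, glimpses, h*w)
        feats = img.flatten(2)                                  # (b, c, h*w)
        # Attention-weighted sum of image features for each glimpse.
        weighted = torch.einsum("bgs,bcs->bgc", att, feats)     # (b, glimpses, c)
        return weighted.reshape(b, -1), att
```

The returned glimpse vector would be concatenated with the LSTM state and passed to the fully connected layers that produce the answer probabilities.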

A PyTorch implementation of this model is available in this GitHub repository: https://github.com/DenisDsh/VizWiz-VQA-PyTorch

After training and validating on the VizWiz dataset, we obtained the following results:

Because this implementation targets the VizWiz dataset, we converted the VQA v2 images and annotations (questions and answers) to fit the model before training. Since the VQA v2 dataset is much larger than VizWiz and our GPU time was limited, we trained for only a few epochs. The results on the VQA v2 dataset are shown below:

Model 3: VinVL: Revisiting Visual Representations in Vision-Language Models (https://arxiv.org/abs/2101.00529)

The model consists of the following:

Image region feature and object tag extraction using ResNeXt-152 C4
A pretrained BERT model
A linear classifier at the end to produce probabilities over answer classes.

The official GitHub page for VinVL can be found here: https://github.com/pzzhang/VinVL

We fine-tuned the basic VinVL model for the VQA task on the VQA v2 dataset using the same parameters as the paper and got the following results.

We fine-tuned the basic VinVL model for the VQA task on the VizWiz dataset and got the following results.

Data Sets Exploratory Analysis

Because models trained on VQA v2 performed poorly (low accuracy) when validated on the VizWiz dataset, we decided to investigate the differences between the two datasets, in the hope that this would help us identify the steps needed to achieve better accuracy on VizWiz.

The datasets include images, questions and answers.

Images

Comparing the images in both datasets, we observe a size difference: VQA v2 is much larger than VizWiz (about 80,000 training images for VQA v2 compared to about 20,000 for VizWiz). In addition, the quality of the images in VQA v2 is much better than in VizWiz. Many images in VizWiz contain the object of interest only partially.

Example for images:

We see that images from VizWiz are blurry and sometimes don't contain the object of interest, which may lead to poor performance across all models when evaluated on the VizWiz dataset. We also observe that the image region features and object tags extracted by VinVL are often mislabeled, introducing false information into the prediction process.

object of interest (the bottle) partially out of frame	wrong object labels (e.g. bottle, fish)

Questions:

We then compared the questions in the two datasets. The first thing we noticed is the difference in the number of questions. VQA v2 has several questions per image (roughly five on average, each with 10 answers), while VizWiz has 1 question per image. VQA v2 has around a million visual questions while VizWiz has around 31,000 visual questions.

Second, we visualized the distribution of the number of words per question. Most questions are between 5 and 10 words long; however, VizWiz contains questions of around 40-50 words. We attribute these outliers to the fact that questions in VizWiz are asked in a conversational setting.
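
The word-count distribution can be computed with a few lines of Python. This is a minimal sketch that assumes the questions are available as a list of strings and that splitting on whitespace is an acceptable word count. The function name is hypothetical.

```python
from collections import Counter


def question_length_histogram(questions):
    """Map each question length (in whitespace-separated words) to its count."""
    return Counter(len(q.split()) for q in questions)


hist = question_length_histogram(["What is this?", "What color is the shirt?"])
for length, count in sorted(hist.items()):
    print(length, count)
```

Plotting the sorted histogram for each dataset side by side makes long conversational outliers easy to spot.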

We then created a sunburst plot of the distribution of the first words of all the questions in VQA v2 and VizWiz. We observed that VizWiz shows more diversity in the starting word than VQA v2.
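
The data behind such a plot is the share of questions starting with each word. A minimal sketch, assuming the questions are a list of strings (the function name is hypothetical, and only the first level of the sunburst is computed):

```python
from collections import Counter


def first_word_distribution(questions):
    """Return the fraction of non-empty questions starting with each word."""
    counts = Counter(q.split()[0].lower() for q in questions if q.strip())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.most_common()}
```

A dataset with more diverse starting words yields a flatter distribution with more distinct keys.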

VQA v2	VizWiz

Finally, we created a word cloud to show the most prominent words used in the questions:

We concluded that there is a big difference between the common initial words, number of words and type of questions used in the datasets.

Answers:

We then analyzed the differences between the answers in both datasets. In VizWiz, 67% of answers have one word and 20% have two words. In VQA v2, by contrast, 89.3% of answers have one word and 6.9% have two words. We also compared the frequency of the unigrams found in the answers in both datasets by plotting the top 30 words. Most notably, the top answers in VizWiz are "unanswerable" and "unsuitable", in the positions occupied by "yes" and "no" respectively in VQA v2.
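
Both statistics can be reproduced with the standard library. This is a minimal sketch assuming the answers are a flat list of strings. The function names are hypothetical.

```python
from collections import Counter


def answer_length_shares(answers):
    """Percentage of non-empty answers having each word count."""
    counts = Counter(len(a.split()) for a in answers if a.strip())
    total = sum(counts.values())
    return {n: 100 * c / total for n, c in sorted(counts.items())}


def top_unigrams(answers, k=30):
    """The k most frequent lowercase words across all answers."""
    words = (w for a in answers for w in a.lower().split())
    return Counter(words).most_common(k)
```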

We created a word cloud to show the most prominent words used in the answers:

An additional feature found in both datasets is the level of confidence. Each annotator gives a number between 0 and 1 for each answer to indicate how confident they are in it. A distribution plot shows that in VizWiz there are as many answers with a confidence between 0.7 and 0.9 as answers with a confidence of 1. This suggests that overall the confidence in the answers provided was lower in VizWiz than in VQA v2, which can be attributed to the quality of the images.
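
The comparison between the two confidence bands can be sketched as follows. This assumes the per-answer confidences have already been extracted as numbers between 0 and 1, as described above. The function name and the band boundaries passed as defaults are assumptions for illustration.

```python
def confidence_shares(confidences, low=0.7, high=0.9):
    """Share of answers in the [low, high] band versus full confidence (1)."""
    total = len(confidences)
    if total == 0:
        return {"mid_band": 0.0, "full": 0.0}
    mid_band = sum(1 for c in confidences if low <= c <= high)
    full = sum(1 for c in confidences if c == 1)
    return {"mid_band": mid_band / total, "full": full / total}
```

Comparing the two shares per dataset gives a quick numeric counterpart to the distribution plot.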

Poster:

A PDF of the poster can be found in poster.pdf.

References:

https://github.com/tbmoon/basic_vqa

https://vizwiz.org/tasks-and-datasets/vqa/

https://github.com/DenisDsh/VizWiz-VQA-PyTorch

https://github.com/pzzhang/VinVL

https://kth.diva-portal.org/smash/get/diva2:1299462/FULLTEXT01.pdf
