VHASR: A Multimodal Speech Recognition System With Vision Hotwords
This repository provides VHASR models trained on Flickr8k, ADE20K, COCO, and OpenImages.
Our paper is available at https://arxiv.org/abs/2410.00822.
Our code is available at https://github.com/193746/VHASR/tree/main; please refer to the repository for specific details about training and testing.
If you are interested in our work, you can train your own model on large-scale data and run inference with the command below. Note that the CLIP config files should be placed in '{model_file}/clip_config', as in the four pretrained models we provide.
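For reference, a sketch of the expected model folder layout (the exact file names are an assumption; mirror the four pretrained models we provide):

```
{model_file}/
├── ...             # model weights and config files
└── clip_config/    # place the CLIP config files here
```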
```bash
cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/infer.py \
    --model_name "{path_to_model_folder}" \
    --speech_path "{path_to_speech}" \
    --image_path "{path_to_image}" \
    --merge_method 3
```
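If you need to run inference over many speech/image pairs, a minimal batch wrapper such as the following may help. This is a sketch, not part of the released code: the model directory and data paths are placeholders, and only the `src/infer.py` flags (`--model_name`, `--speech_path`, `--image_path`, `--merge_method`) come from the command above.

```python
import os
import subprocess
from pathlib import Path

# Placeholder paths; replace with your own model folder and data.
MODEL_DIR = "{path_to_model_folder}"
PAIRS = [
    ("data/sample1.wav", "data/sample1.jpg"),
    ("data/sample2.wav", "data/sample2.jpg"),
]

# Pin the GPU the same way the command above does.
env = {**os.environ, "CUDA_VISIBLE_DEVICES": "1"}

for speech, image in PAIRS:
    if not (Path(speech).exists() and Path(image).exists()):
        print(f"skipping missing pair: {speech}, {image}")
        continue
    # Invoke src/infer.py once per (speech, image) pair.
    subprocess.run(
        ["python", "src/infer.py",
         "--model_name", MODEL_DIR,
         "--speech_path", speech,
         "--image_path", image,
         "--merge_method", "3"],
        check=True,  # stop on the first failed run
        env=env,
    )
```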
If our work is helpful to you, please cite:

```bibtex
@misc{hu2024vhasrmultimodalspeechrecognition,
      title={VHASR: A Multimodal Speech Recognition System With Vision Hotwords},
      author={Jiliang Hu and Zuchao Li and Ping Wang and Haojun Ai and Lefei Zhang and Hai Zhao},
      year={2024},
      eprint={2410.00822},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.00822},
}
```