Abstract

To improve generation in the vision-grounded language model ViMac, a core-based visual semantic representation is proposed. With this representation, ViMac can use the Compounds generation method to output more accurate compounds rather than single words. The Compounds method can describe unseen visual feature values by creating new compounds, and it overcomes the subjective variability introduced during the learning phase. In the first experiment, three generation methods are compared by generation error rate: the Gaussian-model-based method yields 82%, the KNN method 69%, and the Compounds method 54%, a reduction of at least 15 percentage points. In a second experiment comparing the execution time of the nonparametric generation methods, the KNN method takes 35.2 s and the Compounds method takes 15.7 s, less than half the time required by KNN. The experimental results indicate that the Compounds generation method substantially reduces the generation error rate compared with both the KNN and Gaussian-model-based methods, and reduces the computational cost compared with the KNN method.
