dc.description.abstract |
This paper focuses on developing a model that performs better in our native language at
generating captions from images, so that it can be applied to any website or
app and used easily, even by blind people. In addition, attention is drawn to
the use of ResNet-152, a deep neural network 152 layers deep, as an
encoder for the Bengali captioning problem. As there has not been any research on
adopting this approach with a Bangladeshi dataset, we attempt to create a Bangla
Captions dataset. Our proposed model is a transfer learning-based approach that achieves
state-of-the-art performance on our dataset. To extract accurate features, we employed the
CNN architectures ResNet-50, ResNet-101, and ResNet-152, together with a caption model
built from a Bi-LSTM. By applying this hybrid model to our dataset, we achieved
good results. Experimental results demonstrate that the models outperform the
results of previous research and that the accuracy is acceptable, with a BLEU-1 score
of 88.18 when the encoder is ResNet-152. |
en_US |