Abstract:
Recent developments in several areas, including computer vision and language produc-
tion, have been made possible by the combination of Natural Language Processing
(NLP) and Machine Learning (ML) methods. This study introduces a novel method for generating automatic image descriptions in Bengali that exploits the interplay between NLP and ML algorithms. By automatically creating evocative captions for images, the proposed method seeks to close the comprehension gap between visual content and language. Convolutional neural networks (CNNs) extract salient features from images, and pre-trained language models produce coherent, contextually appropriate Bengali descriptions. To this end, we built our own dataset of 3,000 images depicting a variety of human activities and natural scenes. We applied a hybrid DenseNet201-LSTM model to this dataset, achieving a best accuracy of 91.00% on the training set and 72.00% on the validation set. The results show that the proposed automatic image description system generates accurate, contextually appropriate Bengali captions and improves significantly on existing approaches. Beyond enriching the linguistic resources available for Bengali, this work improves content accessibility and offers a useful resource for people who are blind or visually impaired. The seamless integration of NLP and ML approaches demonstrates the promise of interdisciplinary research in developing intelligent systems that bridge the gap between languages and images.