Dataset: https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset
Image captioning is the task of generating a textual description of an image. It is harder than image classification because the model must produce a coherent sentence rather than a single label. We use a well-established approach that combines Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) networks, with an attention layer in the decoder to help it generate coherent and meaningful captions. We also use multiprocessing during the image retrieval and preprocessing stages, so that multiple images are fetched and preprocessed at the same time across CPU cores. This substantially reduces training time.
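The parallel preprocessing step could look roughly like the sketch below. It is a minimal illustration using only the Python standard library, not the project's actual code. The names `preprocess_image` and `preprocess_all`, the `flickr30k_images` directory, and the `chunksize` value are all assumptions made for the example. The per-image function is a placeholder that only reads the file, where a real pipeline would decode, resize, and normalize the image.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def preprocess_image(path):
    """Hypothetical per-image step, standing in for decode/resize/normalize.

    To keep the sketch dependency-free, it only reads the file and returns
    its name and size in bytes.
    """
    data = Path(path).read_bytes()
    return Path(path).name, len(data)


def preprocess_all(paths, workers=None, chunksize=16):
    """Apply preprocess_image to every path and return results in input order."""
    paths = list(paths)
    if workers is None:
        workers = os.cpu_count() or 1
    if workers <= 1 or len(paths) <= 1:
        # Serial fallback: useful for debugging and for tiny inputs.
        return [preprocess_image(p) for p in paths]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order; chunksize batches tasks per worker
        # to reduce inter-process communication overhead.
        return list(pool.map(preprocess_image, paths, chunksize=chunksize))


if __name__ == "__main__":
    # Assumed location of the extracted dataset; adjust to the real path.
    image_dir = Path("flickr30k_images")
    results = preprocess_all(sorted(image_dir.glob("*.jpg")))
    print(f"preprocessed {len(results)} images")
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by re-importing the main module, unguarded top-level code would run again in every worker. Passing `workers=1` runs the same code serially, which makes it easy to compare timings against the parallel version.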