Research
Areas of Interest:
- Machine Learning and Deep Learning Applications
- Natural Language Processing
- Large Language Models
- Computer Vision
- Vision Language Models
Accepted for Publication
- M. A. Mahtab, & J. Maisha. "Automated Financial Report Detection, Classification and Structure Recognition Using YOLO and SLANet". In 2024 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), Sydney, Australia. [Preprint]
Abstract
Financial report/statement analysis is an essential phase in ensuring regulatory compliance and acts as a fundamental safeguard against financial misconduct. By closely reviewing financial statements, organizations can discover irregularities, identify potential fraud tendencies, and maintain financial transparency. However, before such analysis can proceed, the statements must be extracted from annual reports and presented in a spreadsheet format that allows for swift examination. Therefore, we built a scalable, automated pipeline around three state-of-the-art machine learning models: YOLO, PP-OCR, and SLANet, along with three financial table datasets. We introduced a novel classification layer that significantly cuts processing time by removing non-statement tables present in the annual report. Our best model configuration achieved an mAP of 99.5% and a precision of 98.4% on financial table detection, and 100% accuracy on the classification of such tables, with an average per-page prediction time of 662.6 ms from PDF to Excel format. Additionally, SLANet demonstrated a TEDS value of 0.876 on statements of financial position, with 17.6 times faster inference than the current benchmark. Our automated ML-based pipeline serves as a precursor to financial report analysis, eliminating manual labor and reducing processing time.
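As a rough illustration of the detection, classification, and structure-recognition stages described above, here is a minimal sketch of a PDF-to-Excel flow. The checkpoint paths and the "class 0 = financial statement" convention are placeholders, and the PP-Structure call (which uses SLANet for table structure recognition) assumes PaddleOCR's 2.x interface; this is an assumed sketch, not the exact published implementation.

```python
import io

import numpy as np
import pandas as pd
from paddleocr import PPStructure
from pdf2image import convert_from_path
from ultralytics import YOLO

detector = YOLO("table_detector.pt")          # placeholder: fine-tuned table-detection weights
classifier = YOLO("statement_classifier.pt")  # placeholder: statement vs. non-statement classifier
table_engine = PPStructure(layout=False, show_log=False)  # PP-Structure's table module is SLANet-based

def pdf_to_excel(pdf_path: str, out_path: str) -> None:
    sheets = []
    for page in convert_from_path(pdf_path):                    # render each PDF page as an image
        for box in detector(page)[0].boxes:                     # detect candidate table regions
            x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].tolist())
            crop = page.crop((x1, y1, x2, y2))
            if classifier(crop)[0].probs.top1 != 0:             # placeholder: class 0 = financial statement
                continue                                        # drop non-statement tables early
            result = table_engine(np.array(crop)[:, :, ::-1].copy())  # structure recognition on the BGR crop
            html = next(r["res"]["html"] for r in result if r["type"] == "table")
            sheets.append(pd.read_html(io.StringIO(html))[0])   # parse the recognized HTML into a DataFrame
    with pd.ExcelWriter(out_path) as writer:
        for i, df in enumerate(sheets):
            df.to_excel(writer, sheet_name=f"table_{i}", index=False)
```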
- J. Maisha, M. A. Mahtab, F. H. Swarnali, T. Tabassum, & M. T. R. Shawon. "Safeguarding Music Artists Through Machine Learning Based Singer Classification". In 2024 IEEE 17th International Scientific Conference on Informatics (Informatics), Poprad, Slovakia. [Preprint]
Abstract
Despite its rich cultural heritage, Bengali music remains largely unexplored, and identifying and classifying singers has become vital for safeguarding intellectual property and supporting the music industry. Precise identification of Bengali singers is crucial for preserving singers’ rights and ensuring fair compensation. This paper demonstrates the effectiveness of Convolutional Recurrent Neural Network (CRNN) and Recurrent Neural Network (RNN) based architectures for singer classification, specifically in Bengali songs, leveraging Mel-spectrogram features. Our study shows the superiority of CRNN models over RNN and traditional ML models; among the ML models, our ensemble technique outperformed the rest. Our CRNN model achieved an F1-score of 0.97, significantly outperforming traditional ML methods (F1-scores for Logistic Regression, Random Forest, SVM, KNN, and Ensemble of 0.76, 0.74, 0.72, 0.68, and 0.83, respectively). Our study highlights the significance of CRNNs for singer classification in Bengali music and offers a way to protect singers’ rights by reducing copyright infringement. Additionally, our work has the potential to assist in the development of music recommendation systems.
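To show the general shape of a Mel-spectrogram + CRNN classifier, here is a small PyTorch sketch. The sample rate, layer sizes, and number of singers are illustrative placeholders, not the paper's exact configuration.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_spectrogram(path: str, sr: int = 22050, n_mels: int = 128) -> torch.Tensor:
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)                   # log-scaled Mel-spectrogram
    return torch.tensor(mel_db, dtype=torch.float32).unsqueeze(0)   # (1, n_mels, time)

class CRNN(nn.Module):
    def __init__(self, n_singers: int = 10, n_mels: int = 128):
        super().__init__()
        self.conv = nn.Sequential(                                  # convolutional front end
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), 128, batch_first=True)  # recurrent back end over time
        self.fc = nn.Linear(128, n_singers)

    def forward(self, x):                                           # x: (batch, 1, n_mels, time)
        f = self.conv(x)                                            # (batch, 64, n_mels/4, time/4)
        f = f.permute(0, 3, 1, 2).flatten(2)                        # (batch, time/4, features)
        _, h = self.gru(f)
        return self.fc(h[-1])                                       # singer logits

# Example (hypothetical audio file): logits = CRNN()(mel_spectrogram("song.wav").unsqueeze(0))
```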
- M. A. Mahtab, J. Maisha, M. M. Rahman, & S. K. S. Joy. “An Empirical Study on Utilizing Large Language Models for Bengali Image Caption Generation”. In 2024 27th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh. [Preprint]
Abstract
An exemplary caption not only describes what is happening in a particular image but also denotes intricate traditional objects in the image by the local terms through which native speakers recognize them. A caption that fails to accomplish the latter does not convey its full utility. To ensure caption locality, we explore the potential of Large Language Models (LLMs) in Bengali image captioning, as they have lately shown promising results in English caption generation. As a first for the Bengali language, we utilized CLIP (Contrastive Language-Image Pre-training) encodings as a prefix to the captions by employing a mapping network, followed by fine-tuning BanglaGPT, a Bengali pre-trained large language model, to generate the image captions. Furthermore, we explored vision transformer-based encoders (ViT, Swin) with BanglaGPT as the decoder. The best BanglaGPT-based model outperformed the current benchmark results, with BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and CIDEr scores of 70.2, 63.9, 58.8, 54.3, 39.2, and 95.9 on the BanglaLekha dataset and 82.4, 76.8, 71.9, 67.4, 36.6, and 76.9 on the BNature dataset.
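The prefix idea can be sketched as follows: a CLIP image embedding is mapped to a short sequence of prefix embeddings that is prepended to the language model's input. The prefix length is illustrative, and "gpt2" stands in for the BanglaGPT checkpoint, whose exact identifier is not given here; this is an assumed sketch of the general approach, not the paper's implementation.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModelForCausalLM, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in; the paper fine-tunes BanglaGPT here

prefix_len = 10                                     # illustrative prefix length
mapper = nn.Sequential(                             # mapping network: CLIP embedding -> prefix embeddings
    nn.Linear(clip.config.projection_dim, lm.config.hidden_size * prefix_len),
    nn.Tanh(),
)

def caption_prefix(image: Image.Image) -> torch.Tensor:
    pixels = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        features = clip.get_image_features(pixel_values=pixels)          # (1, clip_dim)
    return mapper(features).view(1, prefix_len, lm.config.hidden_size)   # (1, prefix_len, lm_dim)

# During training, the prefix is prepended to the caption's token embeddings and fed to the
# language model via inputs_embeds; at inference, the caption is decoded token by token.
```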
- F. H. Swarnali, J. Maisha, M. A. Mahtab, M. S. I. Iftikar, & F. M. Shah. “Bengali Multi-class Text Classification via Enhanced Contrastive Learning Techniques”. In 2024 27th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh. [Preprint]
Abstract
Bengali, one of South Asia's most widely spoken languages, poses substantial difficulties in tasks such as sentiment analysis and other forms of text classification due to its intricate grammatical structure. Accurate classification is not only vital for protecting mental health through precise sentiment analysis but also has broader ramifications in sectors where discriminating between fine-grained meanings is critical. Improving classification methods to address these subtle distinctions is a timely necessity for advancing natural language processing in Bengali. Our study aims to advance Bengali text classification by implementing Token-level Adversarial Contrastive Training (TACT) and Label-aware Contrastive (LCL) loss, leveraging contrastive learning methods. The two new losses distinguished fine-grained text better than our earlier results with Contrastive Adversarial Training (CAT) and Supervised Contrastive Loss (SCL). For binary classification, TACT reached an F1-score of 98%, outperforming CAT and LCL and setting a new benchmark on the Rokomari Book Review (RBR) dataset. For multi-class classification, TACT achieved an F1-score of 91%, matching the current benchmark on the Bengali Hate Speech (BHS-M) dataset. Furthermore, our custom Bengali multi-class text classification dataset, Daraz Product Review (DPR), contributes to the field.
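For context, the losses above belong to the supervised contrastive family, where representations of same-label examples are pulled together and others pushed apart. The sketch below is the generic supervised contrastive loss in PyTorch, included only to illustrate that family; it is not the paper's exact TACT or LCL formulation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (batch, dim) sentence representations; labels: (batch,) class ids."""
    z = F.normalize(embeddings, dim=1)                       # unit-normalize representations
    sim = z @ z.T / temperature                              # pairwise cosine similarities
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float("-inf"))          # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    pos_count = positives.sum(dim=1).clamp(min=1)            # avoid division by zero
    loss = -(log_prob * positives).sum(dim=1) / pos_count    # mean log-prob over same-label pairs
    return loss.mean()
```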
Ongoing Research
- Sub-dialect Detection and Machine Translation for the Bengali Language.
- Explainable Contrastive Learning Approach for Spam Review Classification Utilizing LLMs.
- YOLO-LLM Based Bengali Image Captioning System.
- Plagiarism Detection for Old Bengali Songs.
- Bengali Social Media Post Text Summarization.