Advanced Cross-Validation Framework for Mental Health AI: BERT and Neural Networks Achieve High Accuracy on Mental Chat16K
Irfan Ali
Irfan Ali, Department of Data Science & Artificial Intelligence, Indian Institute of Science Education and Research, Tirupati (Andhra Pradesh), India.
Manuscript received on 28 November 2025 | Revised Manuscript received on 04 December 2025 | Manuscript Accepted on 15 December 2025 | Manuscript published on 30 December 2025 | PP: 10-17 | Volume-6 Issue-1, December 2025 | Retrieval Number: 100.1/ijainn.A111206011225 | DOI: 10.54105/ijainn.A1112.06011225
© The Authors. Published by Lattice Science Publication (LSP). This is an open-access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Conversational AI is becoming an essential tool for supporting mental health, yet robust evaluation frameworks for large-scale therapeutic dialogue datasets remain scarce. This study presents a comprehensive analysis of the MentalChat16K dataset, which contains 16,084 mental health conversation pairs (6,338 real clinical interviews and 9,746 synthetic dialogues), using modern deep learning architectures. We develop and evaluate BERT-based text classification models and feature-engineered neural networks for mental health conversation analysis. Our BERT classifier achieves 86.7% accuracy and an 86.1% F1-score for sentiment-based mental health state classification. A feature-based neural network achieves 86.7% accuracy and an 83.5% F1-score for therapeutic response type prediction. In addition, five-fold cross-validation with a Random Forest classifier on engineered features yields 99.99% ± 0.02% accuracy. We show that this very high performance reflects effective feature engineering on a simpler classification task, distinct from the primary BERT and neural network models. We further perform statistical significance testing using McNemar's test and bootstrap confidence intervals, confirming that model performance differences are statistically significant (p < 0.05). Performance on real versus synthetic data is comparable (100.0% vs 99.95%), suggesting robustness across data sources. The dataset consists of 39.4% real clinical interviews and 60.6% GPT-3.5-generated conversations; a demographic analysis highlights the absence of explicit demographic labels and the resulting limitations. Our methodology includes domain-optimised BERT architectures, thorough hyperparameter documentation, and a stratified cross-validation framework. GPU-accelerated experiments provide practical insights for deploying such models in workplace mental health systems. Overall, this study establishes performance benchmarks for conversational mental health AI with promising accuracy levels for research and development, while emphasising the need for independent clinical validation before any real-world use. This work contributes to the growing field of AI-powered mental health support technologies.
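The abstract names three evaluation components: stratified five-fold cross-validation with a Random Forest on engineered features, McNemar's test between classifiers, and bootstrap confidence intervals. The sketch below illustrates how such a protocol could be assembled with scikit-learn, statsmodels, and NumPy; the data, feature dimensions, and model settings are hypothetical placeholders, not the authors' actual pipeline or the MentalChat16K features.

```python
# Minimal sketch of the evaluation protocol described in the abstract.
# All data here is synthetic stand-in material; plug in real engineered
# conversation features and model predictions to reproduce the workflow.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))        # placeholder engineered features
y = rng.integers(0, 2, size=1000)      # placeholder binary labels

# Stratified 5-fold CV preserves class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# McNemar's test compares two classifiers on the same test items via the
# 2x2 table of their agreements/disagreements with the true labels.
truth = rng.integers(0, 2, size=200)
pred_a = rng.integers(0, 2, size=200)  # hypothetical model A predictions
pred_b = rng.integers(0, 2, size=200)  # hypothetical model B predictions
correct_a, correct_b = pred_a == truth, pred_b == truth
table = [
    [np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
]
print(f"McNemar p-value: {mcnemar(table, exact=True).pvalue:.4f}")

# Percentile bootstrap 95% CI for one model's accuracy: resample the
# per-item correctness indicators with replacement and take quantiles.
boot = [np.mean(rng.choice(correct_a, size=correct_a.size, replace=True))
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for accuracy: [{lo:.3f}, {hi:.3f}]")
```

With random placeholder predictions the McNemar p-value will be large; in the study's setting, a p-value below 0.05 on real model outputs is what supports the claim that the performance difference between classifiers is statistically significant.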
Keywords: Mental Health, Conversational AI, BERT, Neural Networks, Therapeutic Communication, Sentiment Analysis, Deep Learning, MentalChat16K.
Scope of the Article: Artificial Intelligence and Neural Networks
