Facial Movement Recognition Using CNN-BiLSTM in Vowel for Bahasa Indonesia
DOI: https://doi.org/10.12962/jaree.v8i1.372

Abstract
Speaking is a multimodal phenomenon with both verbal and non-verbal cues. One non-verbal cue is the speaker's facial movement, which can be used to identify the letter being spoken. Previous research has shown that lip movement can be mapped to vowels in Bahasa Indonesia, but recognition from whole-face movement had yet to be covered. This research aimed to build a CNN-BiLSTM model that learns spoken vowels from the subject's facial movements. The model yielded a 98.66% validation accuracy, with over 94% accuracy for each of the five vowels. It is also capable of recognizing whether the subject is currently silent or speaking a vowel, with 98.07% accuracy.

References
G. Vigliocco, P. Perniss, and D. Vinson, "Language as a multimodal phenomenon: implications for language learning, processing, and evolution," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 369, no. 1651, p. 20130292, Sep. 2014, doi: 10.1098/rstb.2013.0292.
R. Sultana and R. Palit, "A survey on Bengali speech-to-text recognition techniques," 2014 9th International Forum on Strategic Technology (IFOST), Cox's Bazar, Bangladesh, 2014, pp. 26–29.
I. Papadimitriou, A. Vafeiadis, A. Lalas, K. Votis, and D. Tzovaras, "Audio-Based Event Detection at Different SNR Settings Using Two-Dimensional Spectrogram Magnitude Representations," Electronics, 2020.
S. Prom-on and M. Onsri, "Effects of Facial Movements to Expressive Speech Productions: A Computational Study," 2019 IEEE 2nd International Conference on Knowledge Innovation and Invention (ICKII), Seoul, Korea (South), 2019, pp. 481-484.
Z. Lu and L. Czap, "Modelling the tongue movement of Chinese Shaanxi Xi'an dialect speech," 2018 19th International Carpathian Control Conference (ICCC), Szilvasvarad, Hungary, 2018, pp. 98-103, doi: 10.1109/CarpathianCC.2018.8399609.
K. Kumatani and R. Stiefelhagen, "State Synchronous Modeling on Phone Boundary for Audio Visual Speech Recognition and Application to Multi-View Face Images," 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing - ICASSP '07, Honolulu, HI, USA, 2007, pp. IV-417–IV-420.
N. K. Mudaliar, K. Hegde, A. Ramesh, and V. Patil, "Visual Speech Recognition: A Deep Learning Approach," 2020 5th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 2020, pp. 1218-1221, doi: 10.1109/ICCES48766.2020.9137926.
S. Isobe et al., "GAMVA: A Japanese Audio-Visual Multi-Angle Speech Corpus," 2021 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 2021, pp. 134–139.
T. Tasaka and N. Hamada, "Speaker dependent visual word recognition by using sequential mouth shape codes," 2012 International Symposium on Intelligent Signal Processing and Communications Systems, Tamsui, Taiwan, 2012, pp. 96-101, doi: 10.1109/ISPACS.2012.6473460.
M. S. Kahfi, K. N. Ramadhani, and A. Arifianto, "Lip Motion Recognition for Indonesian Vowel Phonemes Using 3D Convolutional Neural Networks," 2020 3rd International Conference on Computer and Informatics Engineering (IC2IE), Yogyakarta, Indonesia, 2020, pp. 157–161, doi: 10.1109/IC2IE50715.2020.9274562.
Google Inc., "MediaPipe," 2022. [Online]. Available: https://github.com/google/mediapipe.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
S. Kanai, Y. Fujiwara, Y. Yamanaka, and S. Adachi, "Sigsoftmax: Reanalysis of the Softmax Bottleneck," arXiv [stat.ML], 2018.
J. D. O'Connor and J. L. M. Trim, "Vowel, Consonant, and Syllable—A Phonological Definition," Word, vol. 9, no. 2, pp. 103–122, Aug. 1953.
Kementerian Pendidikan dan Kebudayaan Indonesia, "Huruf Vokal - EYD V," 2023. [Online]. Available: https://ejaan.kemdikbud.go.id/eyd/penggunaan-huruf/huruf-vokal/.
Kementerian Pendidikan dan Kebudayaan Indonesia, "Kata Pengantar - EYD V," 2023. [Online]. Available: https://ejaan.kemdikbud.go.id/eyd/.
A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier Nonlinearities Improve Neural Network Acoustic Models," in Proc. ICML, vol. 30, 2013.
B. Xu, N. Wang, T. Chen, and M. Li, "Empirical Evaluation of Rectified Activations in Convolutional Network," arXiv [cs.LG], May 2015.
D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations, Dec. 2014.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014.
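The abstract describes a CNN-BiLSTM classifier over facial-movement sequences but gives no architectural detail. As an illustration only, a minimal PyTorch sketch of such a model might look as follows; the layer sizes, the 32×32 single-channel frame input, and the six-way output (five vowels plus silence) are assumptions, not the authors' configuration. The Leaky ReLU activation and dropout regularization follow the works cited in the reference list above.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """Hypothetical sketch of a CNN-BiLSTM vowel classifier.

    Input: a clip of face frames, shape (batch, time, channels, height,
    width). A small CNN encodes each frame independently; a bidirectional
    LSTM models the facial movement across frames; a linear head predicts
    one of six classes (a, i, u, e, o, silence). All sizes are assumed.
    """
    def __init__(self, num_classes: int = 6, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.LeakyReLU(0.01),          # Leaky ReLU activation
            nn.MaxPool2d(2),             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.LeakyReLU(0.01),
            nn.AdaptiveAvgPool2d(4),     # -> (32, 4, 4) per frame
            nn.Flatten(),                # -> 512 features per frame
        )
        self.bilstm = nn.LSTM(512, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Sequential(
            nn.Dropout(0.5),             # dropout regularization
            nn.Linear(2 * hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Encode every frame with the shared CNN, then restore the
        # (batch, time, features) layout for the recurrent layer.
        feats = self.cnn(x.reshape(b * t, c, h, w)).reshape(b, t, -1)
        seq, _ = self.bilstm(feats)      # (b, t, 2 * hidden)
        return self.head(seq[:, -1])     # logits from the last time step

model = CNNBiLSTM()
logits = model(torch.randn(2, 10, 1, 32, 32))  # 2 clips, 10 frames each
print(logits.shape)  # torch.Size([2, 6])
```

In practice the frame input could be raw mouth-region crops or, as the citation of MediaPipe above suggests, features derived from detected face landmarks; the sequence length would cover one spoken vowel.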
Copyright
Submission of a manuscript implies that the submitted work has not been published before (except as part of a thesis or report, or abstract); that it is not under consideration for publication elsewhere; that its publication has been approved by all co-authors. If and when the manuscript is accepted for publication, the author(s) still hold the copyright and retain publishing rights without restrictions. Authors or others are allowed to multiply article as long as not for commercial purposes. For the new invention, authors are suggested to manage its patent before published. The license type is CC-BY-NC 4.0.
Disclaimer
No responsibility is assumed by publisher and co-publishers, nor by the editors for any injury and/or damage to persons or property as a result of any actual or alleged libelous statements, infringement of intellectual property or privacy rights, or products liability, whether resulting from negligence or otherwise, or from any use or operation of any ideas, instructions, procedures, products or methods contained in the material therein.


