Facial Movement Recognition Using CNN-BiLSTM in Vowel for Bahasa Indonesia

Muhammad Daffa Abiyyu Rahman, Alif Aditya Wicaksono, Eko Mulyanto Yuniarno, Supeno Mardi Susiki Nugroho

Abstract


Speaking is a multimodal phenomenon that carries both verbal and non-verbal cues. One of the non-verbal cues in speaking is the subject's facial movement, which can be used to identify the letter being spoken. Previous research has shown that lip movement can be mapped to vowels in Bahasa Indonesia, but recognition based on the movement of the whole face has yet to be covered. This research aimed to establish a CNN-BiLSTM model that learns spoken vowels from the subject's facial movements. The CNN-BiLSTM model yielded a validation accuracy of 98.66%, with over 94% accuracy for each of the five vowels. The model is also capable of recognizing whether the subject is silent or speaking a vowel, with 98.07% accuracy.
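To make the architecture named in the abstract concrete, the sketch below shows one plausible way to wire a CNN-BiLSTM classifier of this kind in Keras: a per-frame CNN feature extractor applied across a clip of face images, followed by a bidirectional LSTM and a softmax over six classes (the five vowels plus silence). The frame count, crop size, layer widths, and training settings here are illustrative assumptions, not the paper's configuration, which is only described in the full text.

```python
# A minimal CNN-BiLSTM sketch, assuming the input is a sequence of 30
# grayscale 64x64 face crops. All concrete numbers are illustrative
# assumptions, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, CHANNELS = 30, 64, 64, 1
NUM_CLASSES = 6  # five Indonesian vowels (a, e, i, o, u) plus a "silent" class

# Per-frame CNN feature extractor; TimeDistributed applies it to every frame.
frame_encoder = models.Sequential([
    layers.Input(shape=(HEIGHT, WIDTH, CHANNELS)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
])

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
    layers.TimeDistributed(frame_encoder),
    # The bidirectional LSTM reads the per-frame features both forward
    # and backward across the clip.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

A model of this shape would be trained on fixed-length face-crop clips, each labeled with one of the six classes; the paper's exact input representation, layer configuration, and hyperparameters are given in the full text.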



DOI: https://doi.org/10.12962/jaree.v8i1.372



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.