Facial Movement Recognition Using CNN-BiLSTM in Vowel for Bahasa Indonesia

Authors

  • Muhammad Daffa Abiyyu Rahman, Institut Teknologi Sepuluh Nopember
  • Alif Aditya Wicaksono, Institut Teknologi Sepuluh Nopember
  • Eko Mulyanto Yuniarno, Institut Teknologi Sepuluh Nopember
  • Supeno Mardi Susiki Nugroho, Institut Teknologi Sepuluh Nopember

DOI:

https://doi.org/10.12962/jaree.v8i1.372

Abstract

Speaking is a multimodal phenomenon with both verbal and non-verbal cues. One non-verbal cue is the speaker's facial movement, which can be used to identify the letter being spoken. Previous research has shown that lip movement can be translated into vowels for Bahasa Indonesia, but recognition from whole-face movement had yet to be covered. This research aimed to establish a CNN-BiLSTM model that learns spoken vowels by reading the subject's facial movements. The CNN-BiLSTM model yielded 98.66% validation accuracy, with over 94% accuracy for each of the five vowels. The model can also recognize whether the subject is silent or speaking a vowel with 98.07% accuracy.
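The page does not include code, so the following is a minimal sketch, assuming TensorFlow/Keras, of the general CNN-BiLSTM pattern the abstract describes: a per-frame CNN feature extractor applied across a sequence of face frames, a bidirectional LSTM over the resulting features, and a softmax over six classes (five vowels plus silence). The input shape, sequence length, and layer sizes are illustrative assumptions, not the authors' configuration.

```python
# Illustrative CNN-BiLSTM sketch; all shapes and sizes are assumptions,
# not the configuration reported in the paper.
from tensorflow.keras import layers, models

NUM_CLASSES = 6      # five Indonesian vowels (a, i, u, e, o) plus "silent"
SEQ_LEN = 30         # assumed number of video frames per sample
H, W, C = 64, 64, 1  # assumed size of the cropped face region per frame

# Per-frame CNN feature extractor, applied to every frame below.
frame_encoder = models.Sequential([
    layers.Input(shape=(H, W, C)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, H, W, C)),
    # Run the CNN on each frame independently to get one feature vector per frame.
    layers.TimeDistributed(frame_encoder),
    # BiLSTM reads the frame features forward and backward in time.
    layers.Bidirectional(layers.LSTM(128)),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The bidirectional layer is what distinguishes a BiLSTM from a plain LSTM: each prediction can use facial-movement context both before and after a given frame, which suits classifying a complete utterance rather than streaming input.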

Author Biographies

Muhammad Daffa Abiyyu Rahman, Institut Teknologi Sepuluh Nopember

Electrical Engineering

Alif Aditya Wicaksono, Institut Teknologi Sepuluh Nopember

Computer Engineering

Eko Mulyanto Yuniarno, Institut Teknologi Sepuluh Nopember

Electrical Engineering, Supervisor

Supeno Mardi Susiki Nugroho, Institut Teknologi Sepuluh Nopember

Electrical Engineering, Supervisor


Published

2024-01-31

Issue

Section

Articles (old JAREE)