Abstract
I. INTRODUCTION
II. ARCHITECTURE OF THE PROPOSED METHOD
A. Pitch/Mel-Spectrogram Feature
B. Contents Feature
C. Convolution Block
D. CBAM (Convolutional Block Attention Module)
E. Transformer Encoder
III. RESULTS
IV. CONCLUSION
ACKNOWLEDGMENT
REFERENCES