Anomaly recognition in visual and audio data has become increasingly important in computer vision because it helps protect human lives and property. In this work, we develop a semi-supervised multimodal framework for anomaly recognition that combines audio and visual data to improve recognition performance. The proposed framework employs a hybrid network consisting of a convolutional neural network (CNN), a bidirectional long short-term memory (BiLSTM) network, a multi-head attention module, and a fully connected layer for anomalous pattern recognition. We also create a novel real-time visual-audio anomaly recognition dataset and evaluate the framework on it, achieving promising results.
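To make the hybrid architecture concrete, the following is a minimal PyTorch sketch of one way the four named components (CNN, BiLSTM, multi-head attention, fully connected layer) could be chained. It is an illustration under stated assumptions, not the implementation used in this work: the class name `HybridAnomalyNet`, all layer sizes, the use of precomputed per-frame audio features, and the fusion of the two modalities by simple concatenation are hypothetical choices, and the semi-supervised training procedure is not shown.

```python
import torch
import torch.nn as nn


class HybridAnomalyNet(nn.Module):
    """Illustrative CNN -> BiLSTM -> multi-head attention -> FC pipeline.

    All sizes and the concatenation-based audio-visual fusion are
    assumptions made for this sketch, not details taken from the paper.
    """

    def __init__(self, in_channels=3, audio_dim=40, hidden=64,
                 heads=4, num_classes=2):
        super().__init__()
        # Small per-frame CNN that reduces each frame to a 16-d vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # BiLSTM over the fused (visual + audio) feature sequence.
        self.bilstm = nn.LSTM(16 + audio_dim, hidden,
                              batch_first=True, bidirectional=True)
        # Self-attention over the BiLSTM outputs (2 * hidden features).
        self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                          batch_first=True)
        # Fully connected layer producing per-class scores.
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, audio):
        # frames: (B, T, C, H, W); audio: (B, T, audio_dim)
        b, t = frames.shape[:2]
        # Run the CNN on every frame, then restore the time axis.
        visual = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        # Assumed fusion: concatenate the modalities at each time step.
        fused = torch.cat([visual, audio], dim=-1)
        seq, _ = self.bilstm(fused)
        attended, _ = self.attn(seq, seq, seq)
        # Average over time and classify the whole clip.
        return self.fc(attended.mean(dim=1))


model = HybridAnomalyNet()
frames = torch.randn(2, 5, 3, 32, 32)  # 2 clips, 5 frames each
audio = torch.randn(2, 5, 40)          # 40 audio features per frame
logits = model(frames, audio)          # shape (2, 2): class scores per clip
```

In this sketch the attention module is applied after the recurrent layer so that it can reweight time steps using context from both directions; other orderings and fusion strategies are equally plausible readings of the description above.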