DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization
Neural Networks, ISSN: 0893-6080, Vol: 133, Page: 229-239
2021
- 18 Citations
- 18 Captures
Metrics Details
- Citations: 18
  - Citation Indexes: 18
  - CrossRef: 9
- Captures: 18
  - Readers: 18
Article Description
Videos are widely used as a medium through which people perceive physical changes in the world. However, the sound we receive is often a mixture from multiple sound-producing objects, and we cannot distinguish or localize the individual sounds as separate entities in a video. To address this problem, this paper establishes a model named the Deep Multi-Modal Attention Network (DMMAN), which models unconstrained video datasets to perform sound source separation and event localization. Built on a multi-modal separator module and a multi-modal matching classifier module, the model tackles the sound separation and modal synchronization problems through a two-stage fusion of sound and visual features. To link the two modules, regression and classification losses are combined to build the DMMAN loss function. The spectrum masks and attention synchronization scores estimated by the DMMAN generalize readily to sound source and event localization tasks. Quantitative experiments show that the DMMAN not only separates sound sources with high quality, as measured by the Signal-to-Distortion Ratio and Signal-to-Interference Ratio metrics, but is also suitable for mixed sound scenes that are never heard jointly. Moreover, the DMMAN achieves better classification accuracy than the comparison baselines on event localization tasks.
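The two-stage fusion described in the abstract can be pictured with a minimal sketch: a separator that regresses a spectrogram mask from fused audio-visual features, and a matching classifier that fuses the separated source with the visual feature a second time, trained with a combined regression and classification loss. The module names (TwoStageFusionSeparator, dmman_style_loss), feature dimensions, and layer choices below are illustrative assumptions, not the authors' DMMAN implementation.

```python
# Minimal, illustrative sketch of a two-stage audio-visual fusion separator:
# stage 1 regresses a spectrogram mask from fused features, stage 2 scores
# audio-visual correspondence for event classification. All names, dimensions,
# and layers are assumptions for illustration, not the published DMMAN code.
import torch
import torch.nn as nn


class TwoStageFusionSeparator(nn.Module):
    def __init__(self, visual_dim=512, hidden=256, n_classes=10):
        super().__init__()
        # Stage 1: encode the mixture spectrogram, gate it with the visual
        # feature, and predict a soft separation mask (separator role).
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.visual_proj = nn.Linear(visual_dim, 32)   # project visual feature to audio channels
        self.mask_head = nn.Conv2d(32, 1, 1)           # per-bin mask logits

        # Stage 2: fuse the separated-source embedding with the visual feature
        # again and classify the event (matching-classifier role).
        self.audio_pool = nn.AdaptiveAvgPool2d(1)
        self.match_head = nn.Sequential(
            nn.Linear(32 + visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, mix_spec, visual_feat):
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the mixture
        # visual_feat: (B, visual_dim) pooled visual feature of one candidate object
        a = self.audio_enc(mix_spec)                          # (B, 32, F, T)
        v = self.visual_proj(visual_feat)[:, :, None, None]   # (B, 32, 1, 1)
        fused = a * torch.sigmoid(v)                          # first-stage fusion (feature gating)

        mask = torch.sigmoid(self.mask_head(fused))           # (B, 1, F, T) soft mask
        sep_spec = mask * mix_spec                            # estimated source spectrogram

        src_emb = self.audio_pool(self.audio_enc(sep_spec)).flatten(1)   # (B, 32)
        logits = self.match_head(torch.cat([src_emb, visual_feat], 1))   # second-stage fusion
        return mask, sep_spec, logits


def dmman_style_loss(mask, target_mask, logits, labels, w=1.0):
    # Joint objective combining a regression loss on the mask with a
    # classification loss on the matching output, mirroring how the abstract
    # links the two modules through regression and classification losses.
    reg = nn.functional.l1_loss(mask, target_mask)
    cls = nn.functional.cross_entropy(logits, labels)
    return reg + w * cls


if __name__ == "__main__":
    model = TwoStageFusionSeparator()
    mix = torch.rand(2, 1, 256, 64)
    vis = torch.rand(2, 512)
    mask, sep, logits = model(mix, vis)
    loss = dmman_style_loss(mask, torch.rand_like(mask), logits, torch.tensor([1, 3]))
    loss.backward()
    print(mask.shape, sep.shape, logits.shape, float(loss))
```

In this sketch the attention synchronization score of the abstract is stood in for by the matching classifier's logits; the actual DMMAN couples the two stages with attention mechanisms that this toy example does not attempt to reproduce.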
Bibliographic Details
- http://www.sciencedirect.com/science/article/pii/S0893608020303580
- http://dx.doi.org/10.1016/j.neunet.2020.10.003
- http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=85096708056&origin=inward
- http://www.ncbi.nlm.nih.gov/pubmed/33232859
- https://linkinghub.elsevier.com/retrieve/pii/S0893608020303580
- https://dx.doi.org/10.1016/j.neunet.2020.10.003
Elsevier BV