An improved feedforward sequential memory neural network structure

At ICASSP, a top speech conference, a poster paper from Alibaba's speech interaction intelligence team introduced an enhanced feedforward sequential memory network, the Deep Feedforward Sequential Memory Network (DFSMN). The researchers further combined DFSMN with low frame rate (LFR) technology to build an LFR-DFSMN acoustic model for speech recognition. On both English and Chinese large-vocabulary tasks, this model not only delivers significant accuracy gains over the widely used BLSTM-based systems, but also beats BLSTM on training speed, parameter count, decoding efficiency, and model latency.

**Research Background**

Deep neural networks have become the dominant choice in large-vocabulary continuous speech recognition systems. Because speech signals exhibit strong long-term correlations, recurrent neural networks (RNNs) such as LSTM and its variants are commonly used. However, RNNs trained with the BPTT algorithm often suffer from slow training and vanishing gradients. To address this, our previous work proposed a non-recurrent structure, the Feedforward Sequential Memory Network (FSMN), which can model long-term dependencies effectively while offering better training efficiency and performance. Building on this, we developed DFSMN. By introducing skip connections between the memory modules of adjacent layers, we ensure that gradients from high layers propagate well to lower layers, preventing gradient vanishing in deep networks. With real-world applications in mind, we also combined DFSMN with LFR to speed up training and inference, and optimized the DFSMN structure to allow flexible latency control, making it suitable for real-time speech recognition systems.

**Performance Evaluation**

We evaluated DFSMN on several large-vocabulary speech recognition tasks, including the 2,000-hour English FSH task and a 20,000-hour Chinese dataset.
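To make the non-recurrent memory idea concrete, here is a minimal sketch of an FSMN-style memory block: instead of recurrence, each frame adds a learned weighted sum of nearby hidden vectors, like the taps of an FIR filter. The function name, tap counts, and scalar-per-dimension weighting are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def fsmn_memory(h, left_order, right_order):
    """Sketch of an FSMN memory block: for each frame t, add a learned
    weighted sum of past and future hidden vectors, like a high-order
    FIR filter. Weights here are random placeholders for illustration."""
    T, D = h.shape
    # One coefficient vector per tap (vectorized-FSMN style), randomly
    # initialized here in place of learned parameters.
    a = np.random.randn(left_order + right_order + 1, D) * 0.1
    m = np.zeros_like(h)
    for t in range(T):
        for i, tau in enumerate(range(-left_order, right_order + 1)):
            s = t + tau
            if 0 <= s < T:  # taps outside the utterance are dropped
                m[t] += a[i] * h[s]
    return h + m  # memory output combined with the hidden activation

# Example: 20 frames of an 8-dim hidden sequence, 5 past + 2 future taps.
np.random.seed(0)
h = np.random.randn(20, 8)
out = fsmn_memory(h, left_order=5, right_order=2)
print(out.shape)  # (20, 8)
```

Because the block is a pure feedforward sum, it trains with ordinary backpropagation rather than BPTT, which is where the training-speed advantage over RNNs comes from.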
On the 2,000-hour English task, DFSMN achieved a 1.5% absolute improvement over BLSTM while using fewer parameters. On the Chinese dataset, LFR-DFSMN showed more than a 20% relative improvement over LFR-LCBLSTM. LFR-DFSMN also proved highly flexible in latency control: with only 5 frames of delay it still outperformed LFR-LCBLSTM running with 40 frames of delay.

**FSMN Overview**

The original FSMN architecture, shown in Figure 1(a), is essentially a feedforward neural network with memory blocks added to its hidden layers to model contextual information. Inspired by digital signal processing, the memory block mimics a high-order FIR filter, allowing FSMN to model long-term correlations more efficiently than RNNs. It can be extended to unidirectional or bidirectional versions, depending on whether future context is used. A compact version, cFSMN, was later introduced to reduce the parameter count via low-rank matrix factorization.

**Introduction to DFSMN**

As shown in Figure 3, DFSMN builds on cFSMN by adding skip connections between memory modules, enabling better gradient flow during training and avoiding vanishing gradients in deep networks. We also incorporated dilation-like stride factors into the memory module, extending its temporal reach without significantly increasing the filter order. Because the number of look-ahead frames can be tuned directly, DFSMN allows efficient latency control, making it well suited to real-time applications.

**LFR-DFSMN Acoustic Model**

LFR reduces the frame rate by grouping several consecutive frames into one input, decreasing the computational load and improving decoding efficiency. Combining it with DFSMN yields the LFR-DFSMN model, which uses 10 DFSMN layers and 2 DNN layers; input and output are processed at one-third of the original frame rate, leading to faster training and decoding.

**Experimental Results**

On the 2,000-hour English task, a 12-layer DFSMN outperformed shallower models, showing a clear performance gain with increased depth.
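The two DFSMN additions described above, the skip connection between memory modules and the dilation-like stride on the taps, can be sketched as follows. This is a toy illustration under assumed shapes and random weights; the real model learns the coefficients and uses the cFSMN low-rank projections around each memory module.

```python
import numpy as np

def dfsmn_layer(p_prev, left_order=10, right_order=2, stride=2):
    """Sketch of one DFSMN memory layer over the previous layer's memory
    output p_prev (T x D). Shows two ideas from the text: strided taps
    that widen temporal context without more coefficients, and a skip
    connection (p_prev + m) that keeps gradients flowing in deep stacks."""
    T, D = p_prev.shape
    a = np.random.randn(left_order + right_order + 1, D) * 0.05
    m = np.zeros_like(p_prev)
    for t in range(T):
        for i, tau in enumerate(range(-left_order, right_order + 1)):
            s = t + tau * stride  # strided (dilation-like) taps reach further
            if 0 <= s < T:
                m[t] += a[i] * p_prev[s]
    return p_prev + m             # skip connection between memory modules

# Stack 10 memory layers, matching the 10 DFSMN layers mentioned below.
np.random.seed(1)
p = np.random.randn(50, 16)
for _ in range(10):
    p = dfsmn_layer(p)
print(p.shape)  # (50, 16)
```

Note how the right-side order (`right_order`) directly sets the look-ahead in frames, which is what makes the model's latency tunable: shrinking it trades context for delay without changing the architecture.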
Compared to BLSTM, DFSMN used fewer parameters and achieved a 1.5% absolute performance gain. On the Chinese task, LFR-DFSMN improved over LFR-LCBLSTM by more than 20% relative, while cutting training time by a factor of three and reducing decoding latency to nearly one-third. Finally, LFR-DFSMN maintained high accuracy even with a minimal delay of just 5 frames, outperforming LFR-LCBLSTM with a 40-frame delay. This makes it a promising solution for real-time speech recognition systems.
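The one-third frame rate used throughout the LFR-DFSMN pipeline amounts to a simple stack-and-skip transform on the acoustic features. The sketch below shows one plausible version, assuming a stack of 3 frames advanced by 3 (the padding strategy and feature dimensions are illustrative assumptions):

```python
import numpy as np

def lfr_stack(frames, m=3, n=3):
    """Sketch of low frame rate (LFR) input processing: stack m consecutive
    frames into one super-frame and advance by n frames, so the network
    runs at 1/n of the original frame rate (n=3 gives the one-third rate
    described above)."""
    T, D = frames.shape
    out = []
    for t in range(0, T, n):
        chunk = frames[t:t + m]
        if len(chunk) < m:  # pad the tail by repeating the last frame
            pad = np.repeat(chunk[-1:], m - len(chunk), axis=0)
            chunk = np.vstack([chunk, pad])
        out.append(chunk.reshape(-1))  # concatenate into one m*D vector
    return np.stack(out)

feats = np.random.randn(100, 40)  # 100 frames of 40-dim features
stacked = lfr_stack(feats)
print(stacked.shape)  # (34, 120): one-third as many frames, 3x wider
```

Since every downstream layer then processes a third as many time steps, the speedups in training and decoding follow directly from this reduction.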
