MVA: Mean subtraction, Variance normalization and ARMA filtering

OUTLINE

This site contains the source of a program that performs mean subtraction, variance normalization, and ARMA filtering on a per-utterance basis. Specifically, the program takes as input a feature file in HTK format and outputs a post-processed feature file, also in HTK format. The program should be invoked as

MVA arma_order input_file output_file
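The processing itself is compact enough to sketch. The following Python fragment (a sketch for illustration, not the distributed C source) applies per-utterance mean subtraction, variance normalization, and an ARMA filter of order m to a (frames x dims) feature matrix; each filtered frame averages the previous m filtered frames with the current and next m normalized frames, and the boundary handling here (first and last m frames left unfiltered) is an assumption of this sketch.

```python
import numpy as np

def mva(x, m=2):
    """Per-utterance MVA: mean subtraction, variance normalization,
    and ARMA filtering of a (num_frames, num_dims) feature matrix."""
    x = np.asarray(x, dtype=float)
    # Mean subtraction and variance normalization over the utterance.
    y = (x - x.mean(axis=0)) / x.std(axis=0)
    if m == 0:
        return y
    out = y.copy()
    # ARMA filter of order m: each output frame is the average of the
    # previous m *filtered* frames and the current plus next m
    # *normalized* frames.  The first and last m frames are left
    # unfiltered in this sketch (an assumption about edge handling).
    for t in range(m, len(y) - m):
        out[t] = (out[t - m:t].sum(axis=0)
                  + y[t:t + m + 1].sum(axis=0)) / (2 * m + 1)
    return out
```

Here `arma_order` corresponds to the parameter m; m=2 is the value used in the experiments below.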

Note that this program is written specifically for little-endian (e.g. Intel/Linux) machines and assumes that the byte order in the HTK feature file is big-endian (the HTK default), since the byte swapping is hard-coded in the program. You should modify the source for different data file formats or different byte orders/architectures.
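For reference, the header the byte swapping has to deal with is HTK's 12-byte big-endian preamble. A minimal Python sketch (independent of the C source) that reads such a file portably by declaring the byte order explicitly:

```python
import struct

def read_htk(path):
    """Read a big-endian HTK feature file into a list of frames.

    Header: nSamples, sampPeriod as int32, then sampSize, parmKind as
    int16, followed by nSamples frames of sampSize/4 big-endian floats.
    """
    with open(path, "rb") as f:
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihh", f.read(12))
        dims = samp_size // 4  # 4-byte float32 per coefficient
        frames = [struct.unpack(">%df" % dims, f.read(samp_size))
                  for _ in range(n_samples)]
    return frames, samp_period, parm_kind
```

Because the byte order is stated explicitly (">"), this works on both little- and big-endian hosts; that portability is exactly what the C source has to implement by hand with byte swapping.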


PUBLISHED BASIC RESULTS

The basic results with MVA post-processing were reported at the 2002 ICSLP conference.

It is important to note that some of the experiments reported in the papers used a trainer/recognizer different from the baseline one in the Aurora 2.0 distribution; in particular, we used GMTK in some of the experiments. It is also important to note that we used an HTK feature-extraction configuration different from the one in the Aurora 2.0 distribution to generate the raw features.


HTK FRONTEND/BACKEND CONFIGURATIONS

Here we compare the performance of MVA on Aurora 2.0 using a number of different HTK configuration files with the standard HTK-only system distributed with Aurora 2.0. The feature space is 39-dimensional; each word model has 16 emitting states, each with a 3-component Gaussian mixture output density. The silence model has 3 emitting states, each with a 6-component Gaussian mixture, and the short-pause model has one emitting state, which is tied to the middle state of the silence model. At the bottom we also show results using a GMTK-based backend, which has the same 39-dimensional feature vector and the same number of states as the HTK-based system, but 16 components per Gaussian mixture, and its short-pause state is not tied to the middle state of the silence model. With this stronger backend, the results improve further over the HTK-based ones.

Specifically, we first compare the results with the following different HTK configuration files:

Config 1 generates MFCC_0_E and uses MFCC_E_D_A during training and decoding.
Config 2 generates MFCC_0_E but uses MFCC_0_D_A during training and decoding.
Config 3 generates MFCC_0_D_A directly and uses MFCC_0_D_A during training and decoding.
Config 4 also generates MFCC_0_D_A_Z directly and uses MFCC_0_D_A during training and decoding.
(Config 4 differs from Config 3 in NUMCHANS and LOFREQ/HIFREQ, and in that the raw features are mean-subtracted.)
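To make concrete what such a configuration file controls, here is an illustrative HCopy configuration in the spirit of Config 3. The parameter names are standard HTK configuration variables, but the values shown are generic placeholders, not those of the distributed files:

```
# Illustrative HCopy configuration (hypothetical values, not the
# distributed files): generate MFCC_0_D_A directly from the waveform.
TARGETKIND   = MFCC_0_D_A
TARGETRATE   = 100000.0      # 10 ms frame shift, in 100 ns units
WINDOWSIZE   = 250000.0      # 25 ms analysis window
NUMCHANS     = 23            # filterbank channels
NUMCEPS      = 12            # cepstral coefficients (plus c_0)
LOFREQ       = 64            # filterbank low cut-off (Hz)
HIFREQ       = 4000          # filterbank high cut-off (Hz)
USEHAMMING   = T
PREEMCOEF    = 0.97
CEPLIFTER    = 22
```

The differences among the four configurations above amount to changes in TARGETKIND (logE vs. c_0, with or without _D_A and _Z) and in NUMCHANS and LOFREQ/HIFREQ.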

Download links: all configurations, plp_config.


AURORA 2.0 RESULTS WITH HTK

Results on Aurora 2.0 are summarized in the following tables. (Rows are grouped by configuration and training condition, so one can compare results with and without MVA by matching the corresponding rows across the two tables.)

without MVA

config  trSet |     test set a      |     test set b      |     test set c      |         avg
              | clean 0-20dB  -5dB  | clean 0-20dB  -5dB  | clean 0-20dB  -5dB  | clean 0-20dB  -5dB
   1    multi | 98.54  87.29  23.46 | 98.54  85.52  23.86 | 98.59  83.12  22.08 | 98.55  85.75  23.34
   1    clean | 98.94  61.13   8.40 | 98.94  55.57   7.92 | 99.00  66.68  11.40 | 98.95  60.02   8.81
   2    multi | 98.54  86.10  22.78 | 98.54  86.05  26.78 | 98.50  83.88  21.81 | 98.53  85.64  24.19
   2    clean | 99.10  57.40   7.41 | 99.10  54.68   7.43 | 99.16  65.23  10.58 | 99.11  57.88   8.05
   3    multi | 98.53  86.06  22.06 | 98.53  85.92  24.33 | 98.56  83.05  19.39 | 98.54  85.40  22.43
   3    clean | 98.98  52.51   5.51 | 98.98  48.86   3.84 | 99.00  60.65  10.23 | 98.98  52.68   5.79
   4    multi | 98.49  87.74  24.02 | 98.49  88.97  26.85 | 98.30  88.22  24.84 | 98.45  88.33  25.32
   4    clean | 99.12  63.04  10.76 | 99.12  68.06  12.62 | 99.21  63.53  10.83 | 99.14  65.15  11.52


MVA (with m=2)

config  trSet |     test set a      |     test set b      |     test set c      |         avg
              | clean 0-20dB  -5dB  | clean 0-20dB  -5dB  | clean 0-20dB  -5dB  | clean 0-20dB  -5dB
   1    multi | 97.89  90.48  33.70 | 97.89  88.81  27.92 | 98.15  88.04  27.48 | 97.94  89.32  30.14
   1    clean | 98.95  75.79  16.05 | 98.95  76.01  15.13 | 99.01  72.49  14.44 | 98.96  75.22  15.36
   2    multi | 98.05  91.00  40.22 | 98.05  91.03  38.20 | 98.10  90.87  39.98 | 98.06  90.99  39.36
   2    clean | 99.06  78.29  18.66 | 99.06  79.22  18.85 | 99.01  79.20  19.44 | 99.05  78.84  18.89
   3    multi | 98.51  92.11  42.58 | 98.51  91.90  40.76 | 98.63  91.81  43.31 | 98.53  91.97  42.00
   3    clean | 99.17  83.16  24.78 | 99.17  84.35  24.67 | 99.13  82.89  23.42 | 99.16  83.58  24.46
   4    multi | 98.69  91.95  42.11 | 98.69  92.02  40.75 | 98.66  91.64  42.10 | 98.68  91.92  41.56
   4    clean | 99.03  83.46  22.33 | 99.03  84.53  21.48 | 99.04  83.06  20.49 | 99.03  83.81  21.62

The MVA post-processing is applied after the raw features are extracted. Therefore, with configurations 1 and 2, MVA is applied only to the static features, and the delta and acceleration features are calculated from the MVA post-processed static features. With configurations 3 and 4, on the other hand, MVA is applied directly to all the static and dynamic features.
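The two orderings can be made concrete with a short sketch. This is illustrative only: `mvn` below does just mean subtraction and variance normalization (ARMA filtering omitted for brevity), and `deltas` assumes the standard HTK regression formula with edge replication; neither is taken from the distributed sources.

```python
import numpy as np

def mvn(x):
    """Per-utterance mean subtraction and variance normalization
    (the ARMA filtering stage of MVA is omitted in this sketch)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def deltas(c, theta=2):
    """HTK-style regression deltas with edge frames replicated."""
    pad = np.concatenate([c[:1].repeat(theta, 0), c,
                          c[-1:].repeat(theta, 0)])
    num = sum(t * (pad[theta + t:len(c) + theta + t]
                   - pad[theta - t:len(c) + theta - t])
              for t in range(1, theta + 1))
    return num / (2 * sum(t * t for t in range(1, theta + 1)))

static = np.random.default_rng(0).normal(size=(100, 13))  # dummy statics

# Configurations 1-2: post-process the statics, then differentiate.
s = mvn(static)
feats_12 = np.hstack([s, deltas(s), deltas(deltas(s))])

# Configurations 3-4: differentiate first, then post-process everything.
full = np.hstack([static, deltas(static), deltas(deltas(static))])
feats_34 = mvn(full)
```

In the first ordering the dynamic features inherit whatever statistics the differentiation produces; in the second, every one of the 39 dimensions is explicitly normalized, which is consistent with the ranking reversal discussed below the tables.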

Without MVA post-processing, the ranking of the configurations is 4 > 1 > 2 > 3; with MVA post-processing, it is 3 > 4 > 2 > 1. The relative improvement from MVA therefore depends on the configuration. Specifically, using c_0 instead of logE degrades the baseline: comparing configurations 1 and 2 without MVA, performance drops very slightly in the multi-condition training cases and more significantly in the clean training cases. With MVA, however, c_0 helps: comparing configurations 1 and 2 with MVA, the improvement from using c_0 (configuration 2) over logE (configuration 1) is significant. The case is similar when the dynamic features are extracted as part of the raw features and MVA is applied directly to all features, both static and dynamic, as a comparison of configurations 2 and 3 shows. The differences in NUMCHANS and LOFREQ/HIFREQ have no significant impact on performance, as a comparison of configurations 3 and 4 shows.


AURORA 2.0 EXPERIMENTS WITH GMTK

In the GMTK system, the whole-word HMMs have 16 Gaussian components per state and the parameters are trained for 40 epochs. Note, as mentioned earlier, that this is a stronger backend with more model parameters than the HTK-based system.

Aurora 2.0 results with MVA (m=2) using GMTK as the back end

config  trSet |     test set a      |     test set b      |     test set c      |         avg
              | clean 0-20dB  -5dB  | clean 0-20dB  -5dB  | clean 0-20dB  -5dB  | clean 0-20dB  -5dB
   3    multi | 99.29  93.78  48.36 | 99.29  93.26  44.46 | 99.27  93.34  48.28 | 99.29  93.48  46.78
   3    clean | 99.66  85.04  28.04 | 99.66  86.04  28.01 | 99.64  84.62  26.05 | 99.66  85.36  27.63
   4    multi | 99.25  93.78  48.66 | 99.25  93.30  44.58 | 99.22  93.21  48.55 | 99.24  93.47  47.01
   4    clean | 99.69  85.15  27.78 | 99.69  86.08  27.59 | 99.67  84.33  25.59 | 99.69  85.36  27.27


The training script and multi-train header used with GMTK.
complete results for configuration 3: multi_testa, multi_testb, multi_testc and clean_testa, clean_testb, clean_testc.
complete results for configuration 4: multi_testa, multi_testb, multi_testc and clean_testa, clean_testb, clean_testc.


AURORA 3.0 EXPERIMENTS WITH HTK

Results on Aurora 3.0 are summarized in the following tables for Danish, Finnish, German and Spanish. Note that the front end is HCopy with the corresponding configurations, so the raw features are not the same as in the Aurora 3.0 distribution, which uses its own front end. The back end here is the same as the Aurora 3.0 baseline system, with 3 Gaussian components per state and 16 emitting states per word.

Danish

           Well-Matched | Medium-Mismatched | Highly Mismatched
            raw    MVA  |    raw     MVA    |    raw     MVA
config 1  86.10  90.87  |  67.66    79.66   |  39.85    67.32
config 2  87.50  92.31  |  68.36    82.49   |  39.69    77.16
config 3  84.45  92.21  |  63.28    80.51   |  32.25    76.88

Finnish

           Well-Matched | Medium-Mismatched | Highly Mismatched
            raw    MVA  |    raw     MVA    |    raw     MVA
config 1  91.71  94.75  |  79.75    87.96   |  36.36    63.82
config 2  91.08  95.13  |  78.59    88.78   |  57.31    88.20
config 3  90.01  94.60  |  72.02    87.55   |  49.01    89.47

German

           Well-Matched | Medium-Mismatched | Highly Mismatched
            raw    MVA  |    raw     MVA    |    raw     MVA
config 1  91.46  94.65  |  81.48    86.97   |  74.05    88.39
config 2  92.13  94.95  |  81.48    88.58   |  74.47    90.84
config 3  91.16  95.27  |  79.94    88.29   |  71.18    90.33

Spanish

           Well-Matched | Medium-Mismatched | Highly Mismatched
            raw    MVA  |    raw     MVA    |    raw     MVA
config 1  93.10  95.12  |  83.05    91.57   |  52.84    82.65
config 2  92.69  95.80  |  86.02    92.23   |  42.95    87.58
config 3  90.85  95.84  |  83.25    92.45   |  42.86    87.58



COMPARISON TO A NOISE-ROBUST SYSTEM

Here the results using MVA on Aurora 2.0 and 3.0 are compared to one of the best results* from the special session on noise robustness at the 2002 ICSLP conference. Note that the entries in these tables are word error rates.

Results of Aurora 2.0 (0-20dB) Compared to the Best Known

              Test Set A  |  Test Set B  |  Test Set C
              best   MVA  |  best   MVA  |  best   MVA
Multi Train   7.88  7.89  |  8.04  8.10  |  9.43  8.19
Clean Train  12.56 16.84  | 13.00 15.65  | 14.45 17.11


Results of Aurora 3.0 Compared to the Best Known

        Finnish    |   Spanish    |    German    |    Danish
      best    MVA  |  best   MVA  |  best   MVA  |  best   MVA
WM    3.91   5.40  |  3.36  4.16  |  4.89  4.73  |  6.63  7.79
MM   19.08  12.45  |  6.08  7.55  |  9.16 11.71  | 18.51 19.49
HM   13.39  10.53  |  8.45 12.42  |  8.75  9.67  | 20.41 23.12

* D. Macho et al., "Evaluation of a Noise-Robust DSR Front-end on Aurora Databases," in Proc. ICSLP, 2002.


USEFUL LINKS FOR (NOISE-ROBUST) ASR

AUTHORS

Original draft on 02/12/03; latest revision on 05/05/04.