Fri Jul 1, 2005

This directory contains the 2nd released Linux binaries for the new
version of GMTK. Specifically, you will find in this directory:

New versions:

   gmtkJT          - new version of score (prob(evidence))
   gmtkEMtrainNew  - new EM trainer
   gmtkViterbiNew  - new Viterbi inference engine
   gmtkTriangulate - core GMTK triangulation engine
   gmtkDTindex     - create a decision tree index file
   gmtkParmConvert - convert parameters
   gmtkTFmerge     - merge two or more triangulation files
   gmtkTime        - compute work done in given amount of time

There is currently no documentation for the new versions. The new
version is much faster and more powerful, however, so I am making
these files available for Linux for now, until actual source and
documentation are available. The versions here also support language
model files and a number of other features. The names of the new
programs will eventually change, as will the front-end Viterbi program
interface. The old versions (prior to 2004) are no longer being
maintained.

Again, there is *no* documentation at the moment (this is a pre-alpha
version), but we are happy to try to answer questions as best we can.

The new version now supports short utterances (template length or
template + 1 length) and disconnected networks.

Because of the speedups, my group and a few others have found the new
versions indispensable for getting their work done. Using very simple
triangulations (see below), actual real-time inference speedups
(time_old_version/time_new_version) have so far ranged from about 4.5
at minimum to about 700 at maximum (almost 3 orders of magnitude),
using a combination of system and algorithmic speedups. If you try the
new version and you are not getting a speedup, or if things are slow,
it is very probably because you are using a poor triangulation. Memory
reductions are similar (factors from about 3.5 up to about 100).

The new GMTK is not slow, but the new GMTK with a poor triangulation
can be slow.
It is important, therefore, to try to find a good triangulation!

In the new version, all triangulation is done offline via
gmtkTriangulate. All the new inference programs require a ".trifile".
Getting a good triangulation at the moment takes some art (a better
way is in the works). For now, here is a quick and dirty triangulation
that usually works pretty well when you have much determinism in your
graph.

   gmtkTriangulate -strF xxx -rePart T -findBest T -triangulation completed

0) Note that '-rePart T' implies '-reTri T'.

1) If you can do it, it's often very good to keep '-findBest' turned
on. If you find that it is taking a long time, you might also try
'-force L' or '-force R', whichever completes faster. This can have a
big speedup impact, depending on the graph of course. Including the
appropriate '-force' option should make the delay tolerable. Note,
however, that with '-findBest T' it runs an exponential algorithm, so
it might run for weeks (this is not an infinite loop). If you are
seeing this on any of your graphs, try both

   -findBest F -force L
   -findBest F -force R

and use whichever resulting .trifile runs faster. Also, use
'-verbose 30' to have it print out what's going on. Also, if you see
memory growing slowly, this is not a memory leak; rather, it is
memoizing previous cases. If you want to turn off memoizing, use the
-noBoundaryMemoize option.

2) In general, it's good to keep backup triangulation files around.
Triangulation files sometimes take a long time to generate, and they
are easy to delete, so gmtkTriangulate's default is to keep 10 backup
trifiles. This has saved me much time in the past, at the cost of a
bit of directory messiness.

3) It's probably good to use the default trifile name (i.e.,
.str.trifile), since then the gmtkJT, gmtkViterbiNew, etc. command
lines are shorter. I.e., if you don't give an '-outputTri' option,
gmtkTriangulate will name the file foo.str.trifile (which is the
default name for the inference programs).
4) Currently, the 'completed' triangulation seems in general to work
well when there is much determinism in the graphs. We know better ones
exist (for esoteric reasons to do with the E partition), but they can
be difficult to find.

When you do not have any determinism, use the following:

   gmtkTriangulate -strF xxx -rePart T -findBest T -anyTime 60s

The program will run for 60 seconds and output the best triangulation
it found in that amount of time. If you want to see what it is doing,
run it with the '-verbose 20' option.

Other points:

1) The new versions (inference and triangulation) have a '-verbose'
option that takes a number between 1 and 100. Higher numbers print out
many more messages (inference with -verbose 100 prints out every
step!!). Default is 10. Good ones for debugging your output include:

   -verb 50, prints partition messages
   -verb 60, prints clique insertions
   -verb 70, prints variable iterations w/o parent values
   -verb 80, prints variable iterations w parent values

2) In the new version, you need to make sure that each decision tree
maps *all* possible parent values to a valid child value. The aurora
tutorial on the web has a bug where some sets of parent values are
mapped to values outside the range [0:card-1] of the child. This was
never a problem in the old version, as that case just happened to
occur with zero probability (so it was pruned away before it
occurred), but in the new version you'll get an error for some
triangulations.

Also, note that decision tree leaf node formulas are now surrounded by
curly braces, such as {}. In other words, your DT leaf formulas must
look like:

   -1 { p0 }

rather than the old version:

   -1 ( p0 )

The good news is that GMTK now supports arbitrary integer formulas in
DT leaf nodes, so you can do things like:

   -1 { min(p0,3+p1>>2) + 5 }

Many different integer operations are defined. Documentation
forthcoming.
3) Related to 2 above, the new version can benefit significantly (in
both speed and memory) from using RV cardinalities that are as small
as possible. For example, if you have a binary transition variable,
make sure the card is 2. While it is not incorrect to make it larger
than 2, keeping it at 2 can help, and in some cases the benefit is
significant.

4) The new version has changed the name 'GC_IN_FILE' to 'MC_IN_FILE'
('mixture component' rather than 'Gaussian component') and has changed
'MG_IN_FILE' to 'MX_IN_FILE' ('mixture' rather than 'mixture
Gaussian').

5) In gmtkEMtrainNew, you'll see a few new beam pruning options,
specifically:

   -cbeam is analogous to the beam pruning that existed in the old
   version; it prunes entries at the clique level.

   -sbeam is new, and it prunes at the separator level. Removing a
   separator entry potentially corresponds to removing many clique
   entries.

   '-ckbeam k', where k is an integer, is also new. It prunes each
   clique so that only the best k clique entries remain after pruning.

   '-crbeam p' is also new. It prunes the number of clique entries
   down to a fraction p (0 < p <= 1) of the entries that were
   originally in the clique.

   '-cmbeam f' is also new (based on an idea from Andrew McCallum).
   It prunes the clique down so that the top (1-f) fraction of the
   clique mass is retained (0 <= f < 1). Other options to use here:

      -cmmin k (keep at least k entries in the clique after pruning)

      -cmfurther b (after cmbeam pruning, keep an additional number
      of clique entries that are at least a beam b below the maximum
      entry that exists below the cmbeam constraint; -cmfurther has a
      domain similar to -cbeam (i.e., a log prob), but you might use
      much smaller values such as 1-10)

You can try -cbeam, -ckbeam, -sbeam, etc. individually or use them all
together. Early studies show that -cbeam and -sbeam have similar
effects, although this depends highly on the distributions. -ckbeam
can be very effective for high "entropy" cliques.
-crbeam is also very effective.

-ebeam is also new, and it prunes clique entries during EM training.
If you have very high dimensional Gaussians and/or you wish to prune
away "outliers" during EM training, then -ebeam will do the trick.

6) If you see the following, please ignore it for now; it is harmless
(and will be fixed soon):

   WARNING: Can't close pipe 'cpp foo.str.trifile'

where 'foo' is the name of your structure file.

Feedback and any bugs you discover would be much appreciated.
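
Illustration of the clique pruning options (not GMTK code): since there
is no documentation yet, the following small Python sketch may help in
picturing what the clique-level pruning options described in point 5
do to a list of clique entries. It is only a toy model built from the
descriptions above. The function names are hypothetical, entries are
represented as plain log scores, and the reading of -cbeam as "keep
entries within a log-prob beam of the best entry" is an assumption
based on the remark that its domain is a log prob. The actual GMTK
implementation may differ in details such as tie handling and
rounding.

```python
import math

# Toy model only: each clique entry is represented by its log score.
# All function names below are hypothetical and are not part of GMTK.

def beam_prune(log_scores, beam):
    """-cbeam-like (assumed semantics): keep entries within `beam`
    log units of the best entry."""
    best = max(log_scores)
    return [s for s in log_scores if s >= best - beam]

def k_prune(log_scores, k):
    """-ckbeam-like: keep only the k best entries."""
    return sorted(log_scores, reverse=True)[:k]

def ratio_prune(log_scores, p):
    """-crbeam-like: keep a fraction p (0 < p <= 1) of the entries.
    Rounding up and keeping at least one entry are assumptions."""
    keep = max(1, math.ceil(p * len(log_scores)))
    return sorted(log_scores, reverse=True)[:keep]

def mass_prune(log_scores, f, min_entries=1):
    """-cmbeam-like: keep the fewest top entries that together hold at
    least (1 - f) of the total mass, but never fewer than
    `min_entries` (the role played by -cmmin)."""
    ordered = sorted(log_scores, reverse=True)
    # Scale by the best entry so exp() cannot overflow.
    probs = [math.exp(s - ordered[0]) for s in ordered]
    target = (1.0 - f) * sum(probs)
    kept, running = [], 0.0
    for score, prob in zip(ordered, probs):
        if running >= target and len(kept) >= min_entries:
            break
        kept.append(score)
        running += prob
    return kept

if __name__ == "__main__":
    # Four entries with probabilities 0.5, 0.25, 0.125, 0.125.
    scores = [math.log(p) for p in (0.5, 0.25, 0.125, 0.125)]
    print("beam 1.0  :", len(beam_prune(scores, 1.0)), "entries kept")
    print("best k=3  :", len(k_prune(scores, 3)), "entries kept")
    print("ratio 0.5 :", len(ratio_prune(scores, 0.5)), "entries kept")
    print("mass f=0.2:", len(mass_prune(scores, 0.2)), "entries kept")
```

The sketch shows why the options behave differently on different
distributions: a log-prob beam keeps a variable number of entries
depending on how peaked the clique is, whereas the k-best and fraction
variants bound the number of surviving entries directly, which is
consistent with the note above that -ckbeam can be very effective for
high "entropy" cliques.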