20180329 Work Report

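Nearly another half year has passed, and in one month I will have been in this job for a full year. Here I again summarize my growth over the past year and set out my expectations for the future.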

Skills I currently have (the number in parentheses is my self-assessed level in that area):

1 Customize grammar FSTs for command-and-control applications in a very short time using Thrax. Fix grammar FSTs at a low level, when needed, using OpenFst. (95%)

The usual workflow is: ① Get the documentation for the application, use Python to extract the required command content, and create the grm files used by Thrax. ② Make sure the symbol table and dictionary cover all the words in the commands, then create G.fst. ③ Rarely, the FST needs further modification to work with the legacy system; in that case OpenFst is used to modify the symbol table mapping or to substitute the FST into an existing FST.
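Step ① can be sketched roughly as follows. This is only a minimal illustration, not the script used at work: the function names `escape_thrax` and `build_grm`, the rule name `COMMANDS`, and the sample commands are all hypothetical, and the escaping assumes that backslash, double quote, and square brackets are the characters that need a backslash inside a double-quoted Thrax string.

```python
def escape_thrax(text):
    """Backslash-escape characters assumed to be special inside a Thrax string."""
    out = []
    for ch in text:
        if ch in '\\"[]':
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


def build_grm(commands, rule_name="COMMANDS"):
    """Return the text of a .grm file whose exported rule is a union of commands."""
    # Normalize case and whitespace, drop empty entries, and remove duplicates.
    unique = sorted({" ".join(c.lower().split()) for c in commands if c.strip()})
    if not unique:
        raise ValueError("no commands to compile")
    alternatives = "\n  | ".join('"%s"' % escape_thrax(c) for c in unique)
    return "export %s =\n    %s;\n" % (rule_name, alternatives)


if __name__ == "__main__":
    # Hypothetical commands, as they might be extracted from an application document.
    print(build_grm(["Turn on  the light", "turn off the light", "turn on the light"]))
```

The generated text would then be saved as a grm file and compiled with the Thrax tools. Note that plain double-quoted strings are parsed as bytes; a word-level grammar that must match an existing symbol table would need Thrax's symbol-table parse mode instead.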

2 Design and implement an information extraction (IE) system using concatenated FSTs. (65%)

A prototype was created in Python, and a Java version of the system was then created with several modifications for the project. The framework is shown below; details are not described due to policy.

[Figure: IE framework]

 

3 Understand the whole structure of the current Spear speech engine and the Kaldi interface

3.1 Front-end processing (80%):

Spear: structure understood and used in my work; I learned the mathematical formulas before, but they are not clear to me right now.

Kaldi: I understand the interface but have not implemented anything so far.

3.2 Structure and wrapper around the decoder in C++/C (75%):

Spear: I understand the whole structure, integrated the DNN VAD, and modified the interface to expose some parameters to the Android developers. In conclusion, I am able to understand the structure and modify it, but I have never designed a whole structure by myself, and I do not clearly understand the benefits of the current design, especially with regard to security.

Kaldi: I have seen some code in src/***bin. These are not exactly wrappers, but they help me understand how to use the functions in Kaldi.

3.3 Decoder design and implementation:

(90%+) Spear: Similar to the fast decoder design in Kaldi; no lattice is used. I understand the implementation of Viterbi search and pruning in the decoder and have fixed some errors in it.
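To illustrate what "Viterbi with pruning and no lattice" means in general, here is a minimal token-passing sketch. It is not the Spear or Kaldi implementation: the function `viterbi_beam`, the toy state graph, and the scores are all made up for illustration. Each state keeps only its single best token, and after every frame tokens scoring more than `beam` below the best one are dropped.

```python
def viterbi_beam(frames, trans, start, finals, beam=10.0):
    """Best-path Viterbi search with beam pruning (no lattice is kept).

    frames: list of dicts, one per frame, mapping state -> acoustic log-score
    trans:  dict mapping state -> list of (next_state, transition log-score)
    Returns (score, path) for the best path ending in a final state, or None.
    """
    # One token per active state: state -> (accumulated score, state sequence).
    tokens = {start: (0.0, (start,))}
    for frame in frames:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, logp in trans.get(state, []):
                if nxt not in frame:
                    continue
                cand = score + logp + frame[nxt]
                # Viterbi recombination: keep only the best token per state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + (nxt,))
        if not new_tokens:
            return None
        # Beam pruning: drop tokens too far below the current best score.
        best = max(tok[0] for tok in new_tokens.values())
        tokens = {s: tok for s, tok in new_tokens.items() if tok[0] >= best - beam}
    survivors = [tok for s, tok in tokens.items() if s in finals]
    return max(survivors) if survivors else None


if __name__ == "__main__":
    trans = {"start": [("a", -1.0), ("b", -1.0)],
             "a": [("a", -1.0), ("b", -1.0)],
             "b": [("b", -1.0)]}
    frames = [{"a": -1.0, "b": -5.0}, {"a": -1.0, "b": -5.0}, {"a": -5.0, "b": -1.0}]
    print(viterbi_beam(frames, trans, "start", {"b"}, beam=3.0))
```

Storing the whole path in each token keeps the sketch short; a real decoder would typically store backpointers and trace back at the end.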

(75%) Kaldi: I have never dug into the code of any decoder. Based on the structure I have seen so far, the decoders that do not use lattices should be fairly easy for me to use, while the decoders with lattices will take me some time to learn to manipulate fully.

3.4 WFST compilation / HCLG recipe:

(80%) Spear: I fully understand the workflow of this recipe; the specific tricks are known to me but not fully clear. I need to implement it once before I can claim to fully understand the whole recipe. I helped another employee track down a bug in the recipe that makes some minimization steps hang; several bugs were fixed along the way, but the hang was not resolved in the end.

(60%) Kaldi: I have read the HCLG documentation many times and have tried the recipe up to the LG step. The idea of the whole recipe is fully understood, but the exact code implementation is not clear, especially the design and creation of H and C.
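For reference, the overall composition that the Kaldi graph-construction documentation describes can be written as a single expression. Here asl means "add self-loops", rds means "remove disambiguation symbols", and H' is H without its self-loops. The Spear recipe may differ in its details; this is only the Kaldi form as I understand it from the documentation.

```latex
\mathrm{HCLG} = \mathrm{asl}\Bigl(\min\bigl(\mathrm{rds}\bigl(\det\bigl(H' \circ \min\bigl(\det\bigl(C \circ \min\bigl(\det(L \circ G)\bigr)\bigr)\bigr)\bigr)\bigr)\bigr)\Bigr)
```

The "LG step" mentioned above corresponds to the innermost term, $\min(\det(L \circ G))$.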

3.5 VAD module (85% in integration, 0% in training):

With the trained VAD module, I added this component to the initialization and the wrapper for the decoder. A simple diagram is shown below. I experimented with the parameters to get the best VAD performance (some analysis is shown below); an example of the result is shown at the end.

In conclusion, the prototype is built and ready to be tested in practice. It is already implemented in some demo applications, but I have not received any feedback yet.

[Figure: Framework for VAD integration]

 

 

[Figure: Performance analysis on parameters]

 

[Figure: Visualization of the VAD implementation result]
Row 1 is the audio in the time domain; row 2 is the audio in the frequency domain.
Row 3 is the original per-frame VAD evaluation result (0 means non-speech, 1 means speech).
Row 4 is the per-frame smoothing result (0 means non-speech, 1 means speech).
Row 5 is the final per-frame decision (0 means not sent to the decoder, 1 means sent to the decoder).
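The three per-frame stages in the visualization (raw result, smoothed result, final decision) can be sketched as below. The actual smoothing and decision rules of the integrated module are not described here, so this sketch assumes a simple majority vote over a sliding window for smoothing and a fixed padding of frames around speech for the final decision. The function names and the `window` and `pad` parameters are hypothetical.

```python
def smooth(raw, window=5):
    """Majority vote over a centered window of raw 0/1 frame decisions."""
    half = window // 2
    out = []
    for i in range(len(raw)):
        seg = raw[max(0, i - half): i + half + 1]
        # A frame counts as speech if more than half of its window is speech.
        out.append(1 if 2 * sum(seg) > len(seg) else 0)
    return out


def decide(smoothed, pad=2):
    """Mark frames to send to the decoder: speech frames plus `pad` frames around them."""
    send = [0] * len(smoothed)
    for i, is_speech in enumerate(smoothed):
        if is_speech:
            for j in range(max(0, i - pad), min(len(smoothed), i + pad + 1)):
                send[j] = 1
    return send


if __name__ == "__main__":
    raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
    smoothed = smooth(raw, window=3)
    print(smoothed)
    print(decide(smoothed, pad=2))
```

In this toy input the isolated speech frame near the start is removed by smoothing, the one-frame gap inside the speech region is filled, and the padding keeps a short margin on both sides so that word onsets and endings are not clipped before decoding.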

4 Build system using CMake (40%):

I have experience understanding existing CMake files and modifying them to correct linking errors in the build. I do not know how to write one from scratch.

5 Using SWIG to import C/C++ code into Java (50%):

I have experience adding functions and generating new JNI files with existing SWIG config files. I do not know how to create one from scratch.

 

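Progress and shortcomings:

1 [Progress] My command of the speech recognition engine code has improved greatly compared with half a year ago. I can work as a core member, modifying and debugging the engine for specific projects, and I have started to work with the Kaldi engine code. [Shortcoming] My understanding of Kaldi is not deep enough; given my grounding in the company's existing engine, I expect to improve a lot within the next half year.

2 [Progress] I have explored, to a greater or lesser degree, essentially every module of the speech recognition system (language model (ARPA or grammar), WFST, decoder, wrapper, NLP). [Shortcoming] The exception is acoustic model training.

3 [Progress] I completed concrete projects, studied the existing code, and modified and improved it. [Shortcoming] I do not know whether the specific techniques implemented are the most popular and most efficient algorithms.

4 Coding ability (scored from 1 to 10): Python 9, C++ 7, shell 7, Java 3

Plans:

1 Learn ChatScript, the best NLP framework I currently know of that can be combined with speech recognition.

2 Systematically improve my coding ability (by taking courses).

3 Start a project with Kaldi at its core as soon as possible.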

 
