<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5869613913763631939</id><updated>2012-02-15T22:49:25.138-08:00</updated><title type='text'>"Doubting" Augustinus</title><subtitle type='html'>Everything is impossible is possible.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>10</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-5311617994360095954</id><published>2008-10-13T06:18:00.001-07:00</published><updated>2008-10-13T06:39:47.292-07:00</updated><title type='text'>Bigram based dynamic programming algorithm for Chinese word segmentation [en]</title><content type='html'>Scan all characters left-to-right, to calculate the score of current position "i" as a word boundary by accumulating the previous best choice.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Scoring function:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    For all j = i -1 to 0&lt;br /&gt;        pre_best = argmax{ score(0, j) + logprob(j, i) }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The logprob(j, i) means the log probability that C[j+1…i] form a word, C stands for a character sequence, and this probability should consider bigram to match the requirement.&lt;br /&gt;&lt;br /&gt;Initialization:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    score[0,0] = 0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;"Sub-optimal" characteristic of dynamic programming:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    score(0, i) =&lt;br /&gt;        score(0, pre_best) + logprob(pre_best, i)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;After all calculations, just go back-tracking to pick the best choices up to form the segmented string.&lt;br /&gt;&lt;br /&gt;Here is an example that based on SIGHAN word segmentation bakeoff corpus' word counts only. It is not the standard form of language model,&lt;br /&gt;but generally it is easy to modify to use smoothed (in Good-Turing or Chen-Goodman modified KN discounting or other scheme) log probabilities.&lt;br /&gt;To simulate the behavior "Katz's back-off" of real language model, this example includes a *fake* factor to boost bigram probabilities.&lt;br /&gt;&lt;br /&gt;question: 混合型演算法 (C1, C2, C3, C4, C5, C6)&lt;br /&gt;&lt;br /&gt;混    10&lt;br /&gt;合    20&lt;br /&gt;型    20&lt;br /&gt;演    15&lt;br /&gt;算    60&lt;br /&gt;法    80&lt;br /&gt;&lt;br /&gt;混合    35&lt;br /&gt;演算    80&lt;br /&gt;算法    9&lt;br /&gt;混合型    6&lt;br /&gt;演算法    160&lt;br /&gt;&lt;br /&gt;演算:法    1&lt;br /&gt;混合型:演算法    5&lt;br /&gt;&lt;br /&gt;Notations:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;    &lt;li&gt;a(x, y) stands for accumulative scores of previous segmentations.&lt;/li&gt;&lt;br /&gt;    &lt;li&gt;d(x, y) stands for known word counts of current positions.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;i=1:&lt;br /&gt;    a(0, 0) + d(0, 1) = 0 + 10 = 10&lt;br /&gt;&lt;br /&gt;    argmax{...}=C1&lt;br /&gt;    a(0, 1) = 10&lt;br /&gt;&lt;br /&gt;i=2:&lt;br /&gt;    a(0, 1) + d(1, 2) = 10 + 20 = 30&lt;br /&gt;    a(0, 0) + d(0, 2) = 0 + 35 = 35&lt;br /&gt;&lt;br /&gt;    argmax{...}=C12&lt;br /&gt;    a(0, 2) = 35&lt;br /&gt;&lt;br /&gt;i=3:&lt;br /&gt;    a(0, 2) + d(2, 3) = 35 + 20 = 55&lt;br /&gt;    a(0, 0) + d(0, 3) = 0 + 6 = 6&lt;br /&gt;&lt;br /&gt;    argmax{...}=C3&lt;br /&gt;    a(0, 3) = 55&lt;br /&gt;&lt;br /&gt;i=4:&lt;br /&gt;    a(0, 3) + d(3, 4) = 55 + 15 = 70&lt;br /&gt;&lt;br /&gt;    argmax{...}=C4&lt;br /&gt;    a(0, 4) = 70&lt;br /&gt;&lt;br /&gt;i=5:&lt;br /&gt;    a(0, 4) + d(4, 5) = 70 + 60 = 130&lt;br /&gt;    a(0, 3) + d(3, 5) = 70 + 80 = 150&lt;br /&gt;&lt;br /&gt;    argmax{...}=C45&lt;br /&gt;    a(0, 5) = 150&lt;br /&gt;&lt;br /&gt;i=6:&lt;br /&gt;    a(0, 5) + d(5, 6) = 150 + 80 + 1 * (factor of bigram)&lt;br /&gt;                      = 230 + (fob)&lt;br /&gt;    a(0, 4) + d(4, 6) = 70 + 9 = 79&lt;br /&gt;    a(0, 3) + d(3, 6) = 55 + 160 + 5 * (factor of bigram)&lt;br /&gt;                      = 220 + 5(fob)&lt;br /&gt;&lt;br /&gt;    [let fob &gt; 2]&lt;br /&gt;    argmax{...}=C456&lt;br /&gt;    a(0, 6) = 220 + 5(fob)&lt;br /&gt;&lt;br /&gt;backtrack: C456 -&gt; C3 -&gt; C12&lt;br /&gt;&lt;br /&gt;answer: 混合 型 演算法&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-5311617994360095954?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/5311617994360095954/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=5311617994360095954' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/5311617994360095954'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/5311617994360095954'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bigram-based-dynamic-programming.html' title='Bigram based dynamic programming algorithm for Chinese word segmentation [en]'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-8674424953143433947</id><published>2008-10-13T06:04:00.001-07:00</published><updated>2008-10-13T06:04:27.937-07:00</updated><title type='text'>Vi-Vi-Vi-Viterbi</title><content type='html'>大師的姓可不能念錯。&lt;br /&gt;&lt;br /&gt;Prof. Andrew Viterbi，猶太裔，出生於義大利。藉由這些線索，我猜發音規則應以拉丁語而非以英語為準。但，數年來，都只是猜測。&lt;br /&gt;&lt;br /&gt;（英語發音慣例是，名詞的重音在第一個音節，動詞則在第二個音節，"record" 就是絕佳的例子；然而，拉丁語系的語言則通常一律把重音擺在第二音節之後，所以有不少該語系的人分不出 "record" 念法的差別。）&lt;br /&gt;&lt;br /&gt;今天窮極無聊，終於在某影片中找到肯定的答案了：&lt;br /&gt;&lt;br /&gt;&lt;a href="http://engineering.usc.edu/about/viterbi/viterbi_video.htm"&gt;Andrew &amp; Erna Viterbi: The Journey and the Legacy&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;當然，把 Viterbi algorithm 融會貫通，才是更重要的事：所有和 Markov chain 及 Shannon noisy channel model 相關的 machine learning 演算法都會用到：n-gram language model, Hidden Markov Model, Maximum Entropy, Conditional Random Field, etc.&lt;br /&gt;&lt;br /&gt;附帶一提，義大利有個城市叫 Viterbo，也可作為姓氏，同樣應該將重音擺在第二音節。&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-8674424953143433947?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/8674424953143433947/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=8674424953143433947' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/8674424953143433947'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/8674424953143433947'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/vi-vi-vi-viterbi.html' title='Vi-Vi-Vi-Viterbi'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-1314840667222890416</id><published>2008-10-13T06:02:00.000-07:00</published><updated>2008-10-13T06:04:06.814-07:00</updated><title type='text'>Viterbi algorithm</title><content type='html'>還沒有時間詳細寫下輸入法相關的範例，就先列出我覺得最好的參考資料湊合著用。&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/viterbi_algorithm/s1_pg1.html"&gt;里茲 (Leeds) 大學的 Viterbi Algorithm 教學頁&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;看這一頁之前，最好先看看前頭講 &lt;a href="http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/forward_algorithm/s1_pg1.html"&gt;Forward Algorithm&lt;/a&gt; 的部分。若是對 Hidden Markov Model (HMM) 不熟，請先看看 &lt;a href="http://yuhfan.deyuan.idv.tw/weblog/?p=5"&gt;HMM 不負責講座&lt;/a&gt;系列。&lt;br /&gt;&lt;br /&gt;如果沒耐心看完文字敘述，可以先跳到 &lt;a href="http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/forward_algorithm/s3_pg1.html"&gt;Forward Algorithm example&lt;/a&gt; 和 &lt;a href="http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/viterbi_algorithm/s3_pg1.html"&gt;Viterbi Algorithm example&lt;/a&gt;，透過動畫步驟演示來比較箇中差異。&lt;br /&gt;&lt;br /&gt;最容易懂也和輸入法最相關的分析，則在 &lt;a href="http://www.cs.colorado.edu/%7Emartin/slp.html"&gt;Speech and Language Processing&lt;/a&gt; 一書中第五章的  Weighted Automata 與第七章的  The Viterbi Algorithms Revisited。&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Viterbi_algorithm"&gt;Wikipedia 上的 Viterbi Algorithm&lt;/a&gt; 寫得太亂，而且竟然偷懶把 Forward Algorithm 合併至該頁，恐怕只會讓人愈看愈模糊，故不推薦。&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-1314840667222890416?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/1314840667222890416/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=1314840667222890416' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/1314840667222890416'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/1314840667222890416'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/viterbi-algorithm.html' title='Viterbi algorithm'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-5468548449134217767</id><published>2008-10-13T06:01:00.001-07:00</published><updated>2008-10-13T06:01:31.769-07:00</updated><title type='text'>Bi-gram 中文應用範例：Language model 的授權問題（續）</title><content type='html'>從中文維基百科的語料產出 language model，還有點「時間管理」問題，再等等......&lt;br /&gt;&lt;br /&gt;既然要等，就連這個一起：&lt;a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html"&gt;All Our N-gram are Belong to You&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.&lt;br /&gt;Watch for an announcement at the &lt;a target="_blank" href="http://www.ldc.upenn.edu/"&gt;LDC&lt;/a&gt;, who will be distributing it soon, and then order your set of 6 DVDs.&lt;/blockquote&gt;Well, resources at the LDC are usually &lt;span style="font-style: italic;"&gt;NOT FREE&lt;/span&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-5468548449134217767?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/5468548449134217767/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=5468548449134217767' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/5468548449134217767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/5468548449134217767'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bi-gram-language-model_13.html' title='Bi-gram 中文應用範例：Language model 的授權問題（續）'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-8545420633500909857</id><published>2008-10-13T06:00:00.003-07:00</published><updated>2008-10-13T06:10:52.297-07:00</updated><title type='text'>Bi-gram 中文應用範例： Maximum Likelihood Estimation 演算法流程圖</title><content type='html'>這流程圖就是&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram.html"&gt;虛擬碼&lt;/a&gt;的視覺化呈現，或許有助於理解。&lt;br /&gt;圖中 "Permute candidate" 那段，若應用於音節或鍵碼序列轉換為字詞，就是把所有同音詞或同碼詞的排列產生出來；如果是斷詞，則不必做任何事。&lt;br /&gt;以菱形表示的條件判斷式的數量，與緊接著要進行什麼處理，都會隨著 n-gram 的 n 等於多少而有所不同，也就是目前作法最缺乏彈性的地方。&lt;br /&gt;&lt;br /&gt;&lt;a href="http://beta.zooomr.com/photos/12260@Z01/121173/" title="Zooomr :: Photo Sharing"&gt;&lt;img src="http://static.zooomr.com/images/14b6d4b13d9e606a2e0f9f11e337789f078f1d98.jpg" alt="Bi-gram MLE Flow Chart" style="border: 1px solid rgb(0, 0, 0);" border="0" height="355" width="400" /&gt;&lt;/a&gt;&lt;span style="float: left;"&gt;Bi-gram MLE Flow Chart&lt;/span&gt;&lt;br /&gt;&lt;span style="float: right;"&gt;&lt;br /&gt;Hosted on &lt;strong&gt;Zooom&lt;span style="color: rgb(158, 174, 21);"&gt;r&lt;/span&gt;&lt;/strong&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-8545420633500909857?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/8545420633500909857/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=8545420633500909857' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/8545420633500909857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/8545420633500909857'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bi-gram-maximum-likelihood-estimation.html' title='Bi-gram 中文應用範例： Maximum Likelihood Estimation 演算法流程圖'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-7024014772943312114</id><published>2008-10-13T06:00:00.001-07:00</published><updated>2008-10-13T06:07:52.729-07:00</updated><title type='text'>Bi-gram 中文應用範例：虛擬碼與程式碼更新</title><content type='html'>首先，又是虛擬碼的問題：前篇&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram_13.html"&gt;勘誤&lt;/a&gt;提到的那行，變數應為 temp_score 而非 best_score。&lt;br /&gt;&lt;br /&gt;再來，順著虛擬碼的脈絡仔細看，不難發現 IF left &gt; 0 AND left != prefix 這段，沒有相應的 ELSE 區塊。這是最初偷懶沒寫，因為使用 SRILM 產生出來的 language model 時有個小小把戲：用 &amp;lt;s&amp;gt; 代表句子開頭，&amp;lt;/s&amp;gt; 代表句子結尾。於是，如果斷詞或音轉字之前，先在測試句前後標上這兩個符號，就讓頭尾的字詞也有了 bi-gram。例如，本來測試句是「下雨天留客」，在句首只能有「下」這個 uni-gram；但要是把測試句改成「&amp;lt;s&amp;gt;下雨天留客&amp;lt;/s&amp;gt;」，便能將「&amp;lt;s&amp;gt; 下」這樣的 bi-gram 列入考慮了。等等，「&amp;lt;s&amp;gt;」就不用管它嗎？當然，因為這是人工附加的，無論該符號會獲得什麼樣的分數，反正最後都要拿掉。&lt;br /&gt;&lt;br /&gt;但是，偷懶是不好的，而且程式這種東西，搞不好一轉頭就忘了偷懶的事實與理由。所以，無論是虛擬碼、實際釋出的 C# 與 C++ 程式碼，我都把對應的部分補回去了。&lt;br /&gt;&lt;br /&gt;講到 if-else，原來的段落裡還有一個地方稍微特殊了些－－ ELSE IF Length(right_gram) == 1 －－算是基於「漢字」特性而寫的特例。翻成自然語言是「抑或右邊是個不認得的漢字」之類的意思。然而，對通用的 language model 來講，應該只是「認不認識」的區別而已。所以，這段 ELSE IF 就直接一般化成 ELSE 了。&lt;br /&gt;&lt;br /&gt;最後，為了讓虛擬碼更簡潔些，在不同情況下重覆出現的動態規劃分數更新步驟，則已抽離為 Scoring 常式了。這名字其實不太對，抽象層次上，scoring function 還應包括不同條件下分數加總的行為，而不只是這個被抽離出來的分數更新步驟。不過算了。XD&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-7024014772943312114?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/7024014772943312114/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=7024014772943312114' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/7024014772943312114'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/7024014772943312114'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bi-gram_5463.html' title='Bi-gram 中文應用範例：虛擬碼與程式碼更新'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-2002369132987101155</id><published>2008-10-13T05:59:00.001-07:00</published><updated>2008-10-13T06:13:12.268-07:00</updated><title type='text'>Bi-gram 中文應用範例：虛擬碼勘誤</title><content type='html'>嗚啊，真是非常抱歉，〈&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram.html"&gt;Bi-gram 中文應用範例：虛擬碼&lt;/a&gt;〉裡頭有一行錯了，而且是嚴重的錯誤：&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;strike&gt;best_score&lt;/strike&gt; temp_score += &lt;span style="color: rgb(255, 0, 0); font-weight: bold;"&gt;-&lt;/span&gt;LogProb(Unknown) &lt;span style="color: rgb(255, 0, 0); font-weight: bold;"&gt;+&lt;/span&gt; BackOff(right_gram);&lt;/blockquote&gt;&lt;br /&gt;應該要加上括號再取負值：&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;strike&gt;best_score&lt;/strike&gt; temp_score += &lt;span style="color: rgb(255, 0, 0); font-weight: bold;"&gt;-(&lt;/span&gt;LogProb(Unknown) + BackOff(right_gram)&lt;span style="color: rgb(255, 0, 0); font-weight: bold;"&gt;)&lt;/span&gt;;&lt;/blockquote&gt;&lt;br /&gt;說起來這有點蠢，加了負號讓自己都容易疏忽；特別是在選擇 best_score 時，用「大於」總是比「小於」要直觀。事實上，&lt;a href="http://svn.openfoundry.org/openvanilla/trunk/Experiments/WordSegmenter/"&gt;C++ 版的 WordSegmenter&lt;/a&gt; 裡就沒加負號。改寫成 C# 版時，基於平常寫動態規劃程式的習慣，為避免某些初值在進行比較時不合理，才故意把負的 log probability 都轉成正值。然而，我後來程式根本沒那樣寫，仍然沿用了在 log probability 為負值時所需的額外條件判斷來檢查初值： if(&lt;span style="color: rgb(255, 0, 0); font-weight: bold;"&gt;best_score == 0&lt;/span&gt; || ... ) ，如此一來，加負號的意義便喪失了。雖然之前釋出的實際程式並未犯下與虛擬碼相同的錯誤，不過  C# 版中多此一舉的狀況，還是改一改好了。&lt;br /&gt;&lt;br /&gt;真是腦筋不清楚不行啊！&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;更新：上文中畫上刪除線的 best_score 應為 temp_score，但敘述中提到的 best_score 仍是正確的。詳見〈&lt;/span&gt;&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram_5463.html"&gt;Bi-gram 中文應用範例：虛擬碼與程式碼更新&lt;/a&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;〉&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-2002369132987101155?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/2002369132987101155/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=2002369132987101155' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/2002369132987101155'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/2002369132987101155'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bi-gram_13.html' title='Bi-gram 中文應用範例：虛擬碼勘誤'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-5205874322781948013</id><published>2008-10-13T05:58:00.001-07:00</published><updated>2008-10-13T06:13:55.634-07:00</updated><title type='text'>Bi-gram 中文應用範例：Language model 的授權問題</title><content type='html'>在〈&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram-c.html"&gt;Bi-gram 中文應用範例：以 C# 實作&lt;/a&gt;〉中所用到的 language models，礙於語料本身的授權問題，我要晚個一兩天才能放出來。&lt;br /&gt;&lt;br /&gt;根據 &lt;a href="http://sighan.cs.uchicago.edu/bakeoff2005/data/instructions.php.htm"&gt;SIGHAN 2005 Second International Chinese Word Segmentation Bakeoff: Detailed Instructions&lt;/a&gt; 的聲明，我們無法直接使用那些語料做其他用途：&lt;br /&gt;&lt;blockquote&gt;&lt;h2&gt;Licensing&lt;/h2&gt;  &lt;p&gt; The corpora have been made available by the providers for the purposes of this competition only. By downloading the training and testing corpora, you agree that you will &lt;strong style="color: rgb(0, 0, 0); font-style: italic;"&gt;not use these corpora for any other purpose than as material for this competition&lt;/strong&gt;. Petitions to use the data for any other purpose MUST be directed to the original providers of the data. Neither SIGHAN nor the ACL will assume any liability for a participant's misuse of the data. &lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;br /&gt;因此，我可能會想辦法改用 Wikipedia 來產生 language models。&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-5205874322781948013?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/5205874322781948013/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=5205874322781948013' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/5205874322781948013'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/5205874322781948013'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bi-gram-language-model.html' title='Bi-gram 中文應用範例：Language model 的授權問題'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-2171784141281010869</id><published>2008-10-13T05:55:00.000-07:00</published><updated>2008-10-13T06:17:26.138-07:00</updated><title type='text'>Bi-gram 中文應用範例：虛擬碼</title><content type='html'>這是〈&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram-c.html"&gt;Bi-gram 中文應用範例：以 C# 實作&lt;/a&gt;〉的音節轉詞彙程式的虛擬碼：&lt;br /&gt;&lt;br /&gt;&lt;div style="width: 400px; text-align: left;"&gt;&lt;a href="http://beta.zooomr.com/photos/12260@Z01/116927/" title="Zooomr :: Photo Sharing"&gt;&lt;img src="http://static.zooomr.com/images/620507d604e0eb3028996b99e78ce86b0beef7db.jpg" alt="Bi-gram Based Chinese Syllable-to-Word Conversion Algorithm Pseudo Codes" style="border: 1px solid rgb(0, 0, 0);" border="0" height="300" width="400" /&gt;&lt;/a&gt;Bi-gram Based Chinese Syllable-to-Word Conversion Algorithm Pseudo Codes &lt;span style="float: right;"&gt;Hosted on &lt;strong&gt;Zooom&lt;span style="color: rgb(158, 174, 21);"&gt;r&lt;/span&gt;&lt;/strong&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;我在這裡排不出好看的效果，只好用圖片了。不過這樣倒是剛好可以讓人有個粗略印象，由程式排版中得知，那些巢狀迴圈有多麼討厭。&lt;br /&gt;&lt;br /&gt;把虛擬碼中的 Homophones() 拆掉，留下 Substring()，再把相關的 FOR EACH 拿掉，就變成斷詞了。&lt;br /&gt;&lt;br /&gt;而在 WordSegmenter 範例中，我動了一點手腳：temp_score += scores[prefix]; 被換成了 temp_score += LogProb(right_gram) + scores[prefix]; ，多加的 LogProb(right_gram) 會使得某些高頻字更容易被斷成單字詞。這麼做的原因是，在語料涵蓋率不夠高的情況下，我利用中文「自由詞素」的特性，來確保某些不常見的單字詞序列不會被左右常見的高頻詞搶走；反應在測試上，偶爾就會看到「是的」被斷成「是　的」。當然，這只是種基於實驗結果所耍的小把戲，不算是標準的 &lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram-maximum-likelihood-estimation.html"&gt;Maximum Likelihood Estimation&lt;/a&gt;。&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;更新：原圖中虛擬碼有一處重大錯誤，已更正，並請參閱〈&lt;/span&gt;&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram_13.html"&gt;Bi-gram 中文應用範例：虛擬碼勘誤&lt;/a&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;〉，謝謝。&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;又更新：請參考〈&lt;/span&gt;&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram_5463.html"&gt;Bi-gram 中文應用範例：程式碼與虛擬碼更新&lt;/a&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;〉。&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;再次更新：〈&lt;/span&gt;&lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bi-gram-maximum-likelihood-estimation.html"&gt;Bi-gram 中文應用範例：Maximum Likelihoood Estimation 演算法流程圖&lt;/a&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;〉或許有助於理解。&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-2171784141281010869?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/2171784141281010869/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=2171784141281010869' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/2171784141281010869'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/2171784141281010869'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bi-gram.html' title='Bi-gram 中文應用範例：虛擬碼'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5869613913763631939.post-6826599546896266148</id><published>2008-10-13T05:53:00.000-07:00</published><updated>2008-10-13T06:28:04.297-07:00</updated><title type='text'>Bi-gram 中文應用範例：以 C# 實作</title><content type='html'>&lt;strike&gt;能用 VS.NET 2003 或 2005 編譯的程式碼與方案檔，放在 &lt;a href="http://svn.openfoundry.org/srilmwin32/trunk/BigramApplications/"&gt;OpenFoundry: SRILMWin32 的 SVN&lt;/a&gt; 裡。如果不喜歡 C# 而想要看看 C++ 版，目前僅有斷詞程式可供參考，放在 &lt;a href="http://svn.openfoundry.org/openvanilla/trunk/Experiments/WordSegmenter/"&gt;OpenVanilla 的實驗區：WordSegmenter&lt;/a&gt;。&lt;/strike&gt;&lt;br /&gt;&lt;strike&gt;上述程式碼位置，請改以 &lt;a href="http://taipedia.selfip.info/mediawiki/index.php/ELUTE"&gt;http://taipedia.selfip.info/mediawiki/index.php/ELUTE&lt;/a&gt; 記載的為準。&lt;/strike&gt;基於我工作上潛在的 NDA，此專案已註銷。&lt;br /&gt;&lt;br /&gt;這些程式用到了幾項外部資源：&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;用 &lt;a href="http://www.speech.sri.com/projects/srilm/"&gt;SRILM&lt;/a&gt; 從 &lt;a href="http://sighan.cs.uchicago.edu/bakeoff2005/"&gt;SIGHAN 2005 Chinese Word Segmentation Bakeoff&lt;/a&gt; 的&lt;a href="http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.tar.bz2"&gt;語料&lt;/a&gt;中提煉出的 language models。&lt;/li&gt;&lt;li&gt;從 &lt;a href="http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/includes/ZhConversion.php"&gt;MediaWiki: includes/ZhConversion.php&lt;/a&gt; 裡擷取出來的&lt;a href="http://svn.openfoundry.org/srilmwin32/trunk/Data/"&gt;簡繁詞彙對照表&lt;/a&gt;。&lt;/li&gt;&lt;li&gt;&lt;a href="http://rt.openfoundry.org/Foundry/Project/?Queue=436"&gt;新酷音輸入法詞庫&lt;/a&gt;專案的 &lt;a href="http://svn.openfoundry.org/libchewingdata/utf-8/tsi.src"&gt;tsi.src&lt;/a&gt;。&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;首先是中文斷詞。演算法在 &lt;a href="http://doubting-augustinus.blogspot.com/2008/10/bigram-based-dynamic-programming.html"&gt;Bigram based dynamic programming algorithm for Chinese word segmentation [en]&lt;/a&gt; 這篇裡提過了。只要切換不同的 language model，就能支援簡體、繁體甚至混合的斷詞。&lt;br /&gt;&lt;br /&gt;再來是簡繁轉換。其實，只要斷詞做完，簡繁轉換「幾乎」只是到對照表裡查詢再代換之即可，幾乎，是的。由於簡繁詞彙之間的對應其實是一對多關係，轉換結果並不唯一，因此這個程式還可以寫得更複雜一點——更像輸入法。&lt;br /&gt;&lt;br /&gt;而智慧型輸入法，就可視為音節或鍵碼序列轉換成詞彙的過程。目前本篇所提到的範例，與大多數輸入法產品類似的是，僅在斷詞過程中將同音或同碼詞於迴圈中展開，也就是仍然只考慮詞彙相鄰關係，而未考慮音節或鍵碼序列的 bi-grams，與音節或鍵碼序列轉換為某詞彙的機率或排名。這方面的說明，請參考我在 &lt;a href="http://groups.google.com/group/chewing-devel"&gt;Google Groups: Chewing IM Development&lt;/a&gt; 上針對〈&lt;a href="http://groups.google.com/group/chewing-devel/browse_thread/thread/99411a9fb0afa383"&gt;音找字 vs. 音找詞&lt;/a&gt;〉所作的&lt;a href="http://groups.google.com/group/chewing-devel/msg/955d06bc6046cb12"&gt;回應&lt;/a&gt;。&lt;br /&gt;&lt;br /&gt;除了上述問題應進一步加強之外，本範例在演算法實作上仍有改善空間：現在直接以巢狀迴圈實作的方式，若想要從 bi-gram 推廣到 n-gram，顯然相當麻煩。因此，如何更有彈性地使用資料結構並實作動態規劃演算法，將是下一個課題。&lt;br /&gt;&lt;br /&gt;另外，若想要有個從詞彙反轉回音節或鍵碼序列的程式，把簡繁轉換程式裡的對照表換成音節或鍵碼序列與詞彙的對照表即可。現成的表格整理在這裡：〈&lt;a href="http://openvanilla.org/oldwiki/zh/index.php?title=ChineseInformationProcessing#.E5.90.84.E7.A8.AE.E8.BC.B8.E5.85.A5.E6.B3.95.E8.A1.A8.E6.A0.BC"&gt;OpenVanilla: ChineseInformationProcessing: 輸入法表格&lt;/a&gt;〉。&lt;br /&gt;&lt;br /&gt;範例程式核心演算法的虛擬碼，以及各個 language model 檔案，明天我整理好會放出來。&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5869613913763631939-6826599546896266148?l=doubting-augustinus.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://doubting-augustinus.blogspot.com/feeds/6826599546896266148/comments/default' title='張貼意見'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5869613913763631939&amp;postID=6826599546896266148' title='0 個意見'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/6826599546896266148'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5869613913763631939/posts/default/6826599546896266148'/><link rel='alternate' type='text/html' href='http://doubting-augustinus.blogspot.com/2008/10/bi-gram-c.html' title='Bi-gram 中文應用範例：以 C# 實作'/><author><name>barabbas</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
