Scoring function:
For all j = i -1 to 0
pre_best = argmax{ score(0, j) + logprob(j, i) }
The logprob(j, i) means the log probability that C[j+1…i] form a word, C stands for a character sequence, and this probability should consider bigram to match the requirement.
Initialization:
score[0,0] = 0
"Sub-optimal" characteristic of dynamic programming:
score(0, i) =
score(0, pre_best) + logprob(pre_best, i)
After all calculations, just go back-tracking to pick the best choices up to form the segmented string.
Here is an example that based on SIGHAN word segmentation bakeoff corpus' word counts only. It is not the standard form of language model,
but generally it is easy to modify to use smoothed (in Good-Turing or Chen-Goodman modified KN discounting or other scheme) log probabilities.
To simulate the behavior "Katz's back-off" of real language model, this example includes a *fake* factor to boost bigram probabilities.
question: 混合型演算法 (C1, C2, C3, C4, C5, C6)
混 10
合 20
型 20
演 15
算 60
法 80
混合 35
演算 80
算法 9
混合型 6
演算法 160
演算:法 1
混合型:演算法 5
Notations:
- a(x, y) stands for accumulative scores of previous segmentations.
- d(x, y) stands for known word counts of current positions.
i=1:
a(0, 0) + d(0, 1) = 0 + 10 = 10
argmax{...}=C1
a(0, 1) = 10
i=2:
a(0, 1) + d(1, 2) = 10 + 20 = 30
a(0, 0) + d(0, 2) = 0 + 35 = 35
argmax{...}=C12
a(0, 2) = 35
i=3:
a(0, 2) + d(2, 3) = 35 + 20 = 55
a(0, 0) + d(0, 3) = 0 + 6 = 6
argmax{...}=C3
a(0, 3) = 55
i=4:
a(0, 3) + d(3, 4) = 55 + 15 = 70
argmax{...}=C4
a(0, 4) = 70
i=5:
a(0, 4) + d(4, 5) = 70 + 60 = 130
a(0, 3) + d(3, 5) = 70 + 80 = 150
argmax{...}=C45
a(0, 5) = 150
i=6:
a(0, 5) + d(5, 6) = 150 + 80 + 1 * (factor of bigram)
= 230 + (fob)
a(0, 4) + d(4, 6) = 70 + 9 = 79
a(0, 3) + d(3, 6) = 55 + 160 + 5 * (factor of bigram)
= 220 + 5(fob)
[let fob > 2]
argmax{...}=C456
a(0, 6) = 220 + 5(fob)
backtrack: C456 -> C3 -> C12
answer: 混合 型 演算法
