Okapi BM25 (當中 BM 係 best matching 嘅簡稱)係一種用嚟做資訊提取 嘅函數 。呢套演算法 會攞用家問嘅嘢(
Q
{\displaystyle Q}
)做 input
,然後同每份文件(
D
{\displaystyle D}
)計個分數(
score
{\displaystyle {\text{score}}}
)反映件文件對用家條問題嚟講幾有啦更[ 1] [ 2]
Okapi BM25 條式係噉嘅:
score
(
D
,
Q
)
=
∑
i
=
1
n
IDF
(
q
i
)
⋅
f
(
q
i
,
D
)
⋅
(
k
1
+
1
)
f
(
q
i
,
D
)
+
k
1
⋅
(
1
−
b
+
b
⋅
|
D
|
avgdl
)
{\displaystyle {\text{score}}(D,Q)=\sum _{i=1}^{n}{\text{IDF}}(q_{i})\cdot {\frac {f(q_{i},D)\cdot (k_{1}+1)}{f(q_{i},D)+k_{1}\cdot \left(1-b+b\cdot {\frac {|D|}{\text{avgdl}}}\right)}}}
,當中[ 註 1]
q
1
,
q
2
,
.
.
.
q
i
{\displaystyle q_{1},q_{2},...q_{i}}
係
Q
{\displaystyle Q}
入面嘅每隻關鍵字 ;
f
(
q
i
,
D
)
{\displaystyle f(q_{i},D)}
係
q
i
{\displaystyle q_{i}}
喺
D
{\displaystyle D}
入面出現得有幾密(相對於
D
{\displaystyle D}
嘅長度);
|
D
|
{\displaystyle |D|}
係
D
{\displaystyle D}
嘅長度(以字數 計);
avgdl
{\displaystyle {\text{avgdl}}}
係摷咗嗰啲文件嘅平均 長度;
而
k
1
{\displaystyle k_{1}}
同
b
{\displaystyle b}
係參數 ,好多時冇做最佳化 嘅話就設做
k
1
∈
[
1.2
,
2.0
]
{\displaystyle k_{1}\in [1.2,2.0]}
,
b
=
0.75
{\displaystyle b=0.75}
[ 3] 。
IDF
(
q
i
)
{\displaystyle {\text{IDF}}(q_{i})}
呢個分計法如下-
IDF
(
q
i
)
=
ln
(
N
−
n
(
q
i
)
+
0.5
n
(
q
i
)
+
0.5
+
1
)
{\displaystyle {\text{IDF}}(q_{i})=\ln \left({\frac {N-n(q_{i})+0.5}{n(q_{i})+0.5}}+1\right)}
當中
N
{\displaystyle N}
係摷咗嘅文件嘅數量,
當中
n
(
q
i
)
{\displaystyle n(q_{i})}
摷咗嘅文件當中有幾多份係有
q
i
{\displaystyle q_{i}}
喺裏面嘅,
如果
q
i
{\displaystyle q_{i}}
係一隻常用字(例如英文入面嘅 in 或者 of 呀噉),噉佢嘅
IDF
(
q
i
)
{\displaystyle {\text{IDF}}(q_{i})}
分數理應會低(
N
−
n
(
q
i
)
{\displaystyle N-n(q_{i})}
數值細);所以
IDF
(
q
i
)
{\displaystyle {\text{IDF}}(q_{i})}
呢嚿嘢嘅存在係為咗阻止啲常用字干擾搜尋結果。
計完之後,就會每份文件得出個分數
score
{\displaystyle {\text{score}}}
表示份文件對條問題嚟講幾有啦更,分數愈高表示愈有啦更,然後個搜尋器 就可以按分數將啲摷到嘅文件列出嚟,分數最高嘅行先。Okapi BM25 源於 1980 年代,到咗廿一世紀初經已廣泛噉俾搜尋器採用。
↑
∑
{\displaystyle \sum }
係加總 。
↑ Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management . 36 (6): 779-808.
↑ Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 2". Information Processing & Management . 36 (6): 809-840.
↑ Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval , Cambridge University Press, 2009, p. 233.