Keyword search
Keyword search, also called "BM25 (Best Match 25)" or "sparse vector" search, returns the objects with the highest BM25F scores.
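To make the scoring idea concrete, here is a minimal sketch of the classic BM25 formula in plain Python. It is not Weaviate's implementation and it ignores the per-property weighting that BM25F adds. The function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    # Naive tokenization for illustration: lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    query_terms = query.lower().split()

    # Document frequency: how many documents contain each query term.
    df = {term: sum(1 for tokens in tokenized if term in tokens)
          for term in query_terms}

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Devil's food & angel food are types of this dessert",
    "A nearer food merchant",
    "This type of retail store sells more shampoo & makeup than any other",
]
scores = bm25_scores("food", docs)
```

Documents that contain the query term receive a positive score, and documents without it score zero, which is why only matching objects appear in keyword search results.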
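The Query Agent automatically converts plain-English questions into optimized Weaviate queries, with no need to build queries by hand.
Basic BM25 search
To use BM25 keyword search, define a search string.
If a snippet doesn't work or you have any feedback, please open a GitHub issue.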
# Assumes `client` is an already connected Weaviate client
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    limit=3
)

for o in response.objects:
    print(o.properties)
Example response
The response looks like this:
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        },
        {
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        }
      ]
    }
  }
}
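Search operators
Added in v1.31. Search operators define the minimum number of query tokens that an object must contain in order to be returned. The two options are "and" and "or"; "or" is the default.
or
With the or operator, the search returns objects that contain at least minimumOrTokensMatch of the tokens in the search string. In the Python client, this threshold is the minimum_match argument of BM25Operator.or_(), as in the example.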
from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.or_(minimum_match=1),  # Each result must include at least one query token
    limit=3,
)

for o in response.objects:
    print(o.properties)
and
With the and operator, the search returns only objects that contain all of the tokens in the search string.
from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.and_(),  # Each result must include all tokens (e.g. "australian", "mammal", "cute")
    limit=3,
)

for o in response.objects:
    print(o.properties)
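The difference between the two operators can be sketched in plain Python as a check on token sets. This is only an illustration of the matching rule described above, not how Weaviate evaluates it internally; the helper `matches` and the sample token sets are hypothetical.

```python
def matches(query_tokens, object_tokens, operator="or", minimum_match=1):
    """Return True if an object's tokens satisfy the search operator."""
    unique_query = set(query_tokens)
    present = sum(1 for token in unique_query if token in object_tokens)
    if operator == "and":
        # Every query token must appear in the object.
        return present == len(unique_query)
    # "or": at least `minimum_match` query tokens must appear.
    return present >= minimum_match


query = ["australian", "mammal", "cute"]
# Made-up token sets for two objects.
koala = {"the", "koala", "is", "a", "cute", "australian", "mammal"}
emu = {"the", "emu", "is", "an", "australian", "bird"}

koala_and = matches(query, koala, operator="and")
emu_and = matches(query, emu, operator="and")
emu_or_1 = matches(query, emu, operator="or", minimum_match=1)
emu_or_2 = matches(query, emu, operator="or", minimum_match=2)
```

Raising minimum_match moves the or operator closer to the strictness of and, while keeping partial matches possible.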
Retrieve BM25F scores
You can retrieve the BM25F score of each returned object.
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    return_metadata=MetadataQuery(score=True),  # Request the score for each object
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)
Example response
The response looks like this:
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.0140665"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        },
        {
          "_additional": {
            "score": "2.8725255"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "2.7672548"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        }
      ]
    }
  }
}
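Search on selected properties only
A keyword search can be restricted to a subset of object properties. In this example, the BM25 search uses only the question property to produce the BM25F score.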
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    query_properties=["question"],  # Search only this property
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)
Example response
The response looks like this:
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.7079012"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.4311616"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "2.8312314"
          },
          "answer": "honey",
          "question": "The primary source of this food is the Apis mellifera"
        }
      ]
    }
  }
}
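Use weights to boost properties
You can weight how much each property affects the overall BM25F score. This example boosts the question property by a factor of 2, while the answer property keeps its default weight.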
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question^2", "answer"],  # "^2" doubles the weight of "question"
    limit=3
)

for o in response.objects:
    print(o.properties)
Example response
The response looks like this:
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "4.0038033"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.8706005"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "3.2457707"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        }
      ]
    }
  }
}
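As a rough illustration of the "property^weight" notation, the sketch below parses the boost suffix and combines made-up per-property scores as a weighted sum. This is a deliberate simplification: BM25F applies property weights inside the scoring formula rather than summing separate per-property scores, and the helpers `parse_boost` and `combined_score` are hypothetical, not part of the Weaviate client.

```python
def parse_boost(spec):
    """Split "question^2" into ("question", 2.0); the weight defaults to 1.0."""
    name, sep, weight = spec.partition("^")
    return name, (float(weight) if sep else 1.0)


def combined_score(per_property_scores, query_properties):
    """Weighted sum of per-property scores (a simplification of BM25F)."""
    total = 0.0
    for spec in query_properties:
        name, weight = parse_boost(spec)
        total += weight * per_property_scores.get(name, 0.0)
    return total


# Made-up scores for a single object, for illustration only.
object_scores = {"question": 1.5, "answer": 0.5}
boosted = combined_score(object_scores, ["question^2", "answer"])
unboosted = combined_score(object_scores, ["question", "answer"])
```

With the boost, a match in question contributes twice as much as the same match would without it, which can change the order of the results.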
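Set tokenization
The BM25 query string is tokenized before it is used to search for objects through the inverted index.
You must specify the tokenization method for each property in the collection definition.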
from weaviate.classes.config import Configure, Property, DataType, Tokenization

client.collections.create(
    "Article",
    vector_config=Configure.Vectors.text2vec_cohere(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            vectorize_property_name=True,  # Use "title" as part of the value to vectorize
            tokenization=Tokenization.LOWERCASE,  # Use "lowercase" tokenization
            description="The title of the article.",  # Optional description
        ),
        Property(
            name="body",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # Don't vectorize this property
            tokenization=Tokenization.WHITESPACE,  # Use "whitespace" tokenization
        ),
    ],
)
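For fuzzy matching and typo tolerance, use trigram tokenization. For details, see the Fuzzy matching section below.
limit & offset
Use limit to set a fixed maximum number of objects to return.
Optionally, use offset to paginate the results.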
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    limit=3,   # Return at most 3 objects
    offset=1   # Skip the first result
)

for o in response.objects:
    print(o.properties)
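The way limit and offset select a window of the ranked results can be sketched with a list slice. This only illustrates the semantics and is not a call into Weaviate; the helper `paginate` and the placeholder result names are hypothetical.

```python
def paginate(ranked, limit, offset=0):
    """Return at most `limit` items, skipping the first `offset` items."""
    return ranked[offset:offset + limit]


# Placeholder result IDs, already sorted by descending score.
ranked = ["r1", "r2", "r3", "r4", "r5"]

first_page = paginate(ranked, limit=3)
shifted = paginate(ranked, limit=3, offset=1)
last_page = paginate(ranked, limit=3, offset=3)
```

To walk through all results page by page, increase offset by limit on each request until fewer than limit objects come back.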
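Limit result groups
To limit results to groups with similar distances from the query, use the autocut filter to set the number of groups to return. In the Python client, this is the auto_limit argument used in the example.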
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    auto_limit=1  # Return only the first group of results
)

for o in response.objects:
    print(o.properties)
Example response
The response looks like this:
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "2.6768136"
          },
          "answer": "OSHA (Occupational Safety and Health Administration)",
          "question": "The government admin. was created in 1971 to ensure occupational health & safety standards"
        }
      ]
    }
  }
}
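The idea of cutting results at a jump in the scores can be sketched as follows. This is a simplified heuristic written for illustration and is not Weaviate's actual jump-detection algorithm; the function `autocut`, the `min_jump` threshold, and the sample scores are all assumptions.

```python
def autocut(scores, groups, min_jump=0.2):
    """Keep results up to the `groups`-th jump in a descending score list.

    A "jump" here is a drop between neighbours larger than `min_jump`
    times the top score. This threshold is an invented simplification.
    """
    if not scores:
        return []
    threshold = min_jump * scores[0]
    jumps = 0
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > threshold:
            jumps += 1
            if jumps == groups:
                # Cut just before the result that follows the jump.
                return scores[:i]
    return scores


# Made-up scores: one clear leader, a middle cluster, and a tail.
scores = [2.68, 1.2, 1.15, 0.4]
one_group = autocut(scores, groups=1)
two_groups = autocut(scores, groups=2)
```

With groups=1, only the results before the first large drop survive, which matches the single-object response shown above.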
Group results
Added in v1.25. Define criteria to group search results.
from weaviate.classes.query import GroupBy

jeopardy = client.collections.use("JeopardyQuestion")

# Grouping parameters
group_by = GroupBy(
    prop="round",  # group by this property
    objects_per_group=3,  # maximum objects per group
    number_of_groups=2,  # maximum number of groups
)

# Query
response = jeopardy.query.bm25(
    query="California",
    group_by=group_by
)

for grp_name, grp_content in response.groups.items():
    print(grp_name, grp_content.objects)
Example response
The response looks like this:
'Jeopardy!'
'Double Jeopardy!'
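What the three grouping parameters do can be sketched in plain Python over a ranked list of objects. This is a minimal illustration of the limits, assuming the input is already sorted by descending score; the helper `group_results` and the placeholder objects are hypothetical, and Weaviate's own grouping may differ in its details.

```python
def group_results(objects, prop, objects_per_group, number_of_groups):
    """Group ranked objects by `prop`, capping group count and group size."""
    groups = {}
    for obj in objects:
        key = obj[prop]
        if key not in groups:
            if len(groups) == number_of_groups:
                continue  # Group limit reached: ignore objects from new groups.
            groups[key] = []
        if len(groups[key]) < objects_per_group:
            groups[key].append(obj)
    return groups


# Placeholder objects, assumed to be sorted by descending score.
results = [
    {"answer": "a", "round": "Jeopardy!"},
    {"answer": "b", "round": "Double Jeopardy!"},
    {"answer": "c", "round": "Jeopardy!"},
    {"answer": "d", "round": "Final Jeopardy!"},
    {"answer": "e", "round": "Jeopardy!"},
    {"answer": "f", "round": "Jeopardy!"},
]
grouped = group_results(results, "round", objects_per_group=3, number_of_groups=2)
```

In this sketch the third round never forms a group because number_of_groups is 2, and the fourth "Jeopardy!" object is dropped because objects_per_group is 3.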
Filter results
For more specific results, use a filter to narrow your search.
from weaviate.classes.query import Filter

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    filters=Filter.by_property("round").equal("Double Jeopardy!"),
    return_properties=["answer", "question", "round"],  # return these properties
    limit=3
)

for o in response.objects:
    print(o.properties)
Example response
The response looks like this:
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.0140665"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other",
          "round": "Double Jeopardy!"
        },
        {
          "_additional": {
            "score": "1.9633813"
          },
          "answer": "honey",
          "question": "The primary source of this food is the Apis mellifera",
          "round": "Double Jeopardy!"
        },
        {
          "_additional": {
            "score": "1.6719631"
          },
          "answer": "pseudopods",
          "question": "Amoebas use temporary extensions called these to move or to surround & engulf food",
          "round": "Double Jeopardy!"
        }
      ]
    }
  }
}
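Tokenization
Weaviate converts filter terms into tokens. The default tokenization is word. The word tokenizer keeps only alphanumeric characters, lowercases them, and splits on whitespace and other non-alphanumeric characters. For example, it converts the string "Test_domain_weaviate" into "test", "domain", and "weaviate".
For details and additional tokenization methods, see Tokenization.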
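The behavior of the tokenization methods named on this page can be approximated in a few lines of Python. These are rough approximations for building intuition, not Weaviate's tokenizer code; the function `tokenize` is hypothetical, and edge cases such as special characters may be handled differently by Weaviate.

```python
import re


def tokenize(text, method="word"):
    """Approximate the word, lowercase, whitespace and field tokenizations."""
    if method == "word":
        # Keep runs of alphanumeric characters, lowercased.
        return re.findall(r"[^\W_]+", text.lower())
    if method == "lowercase":
        # Lowercase, then split on whitespace only.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace and preserve case.
        return text.split()
    if method == "field":
        # Treat the whole trimmed value as a single token.
        return [text.strip()]
    raise ValueError(f"Unknown tokenization method: {method}")


word_tokens = tokenize("Test_domain_weaviate", "word")
lowercase_tokens = tokenize("Test_domain_weaviate", "lowercase")
whitespace_tokens = tokenize("Hello World", "whitespace")
field_tokens = tokenize("  Hello World  ", "field")
```

The comparison shows why the choice matters for keyword search: with word tokenization a query for "domain" matches "Test_domain_weaviate", while with lowercase tokenization it does not.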
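Fuzzy matching
You can enable fuzzy matching and typo tolerance in BM25 searches by using trigram tokenization. This technique breaks text into overlapping three-character sequences, so BM25 can find matches even when the text contains spelling errors or variations.
This makes it possible to match strings that are similar but not identical, as long as they share enough trigrams.
For example, "Morgn" and "Morgan" share the trigrams "mor" and "org".
When creating the collection, set the tokenization method to trigram: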
from weaviate.classes.config import Configure, Property, DataType, Tokenization

client.collections.create(
    "Article",
    vector_config=Configure.Vectors.text2vec_cohere(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.TRIGRAM,  # Use "trigram" tokenization
        ),
    ],
)
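- Use trigram tokenization selectively, on fields that need fuzzy matching. Filtering behavior changes significantly, because text filters then operate on the trigram-tokenized text rather than on whole words.
- Use word or field tokenization for fields that need exact matches.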
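To see why a misspelled query can still match, the sketch below splits two strings into overlapping three-character sequences and compares them. It is a minimal illustration of the trigram idea, assuming lowercase input without spaces or punctuation; the helpers `trigrams` and `overlap` are hypothetical and Weaviate's tokenizer may treat other characters differently.

```python
def trigrams(text):
    """Return the set of overlapping three-character sequences in `text`."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def overlap(a, b):
    """Jaccard overlap of two strings' trigram sets, between 0 and 1."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


typo = trigrams("Morgn")      # mor, org, rgn
correct = trigrams("Morgan")  # mor, org, rga, gan
shared = typo & correct
similarity = overlap("Morgn", "Morgan")
```

Because the misspelled and correct strings share two trigrams, a BM25 search over trigram tokens can still find the object, whereas word tokenization would treat them as two unrelated tokens.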
Further resources
Questions and feedback
If you have any questions or feedback, let us know in the user forum.
