字符串相似性判断
最后更新:1/12/2026
AI 助手
会员专享功能
只有会员才能使用 AI 生成答案功能。 立即升级
Other Answers (1)
Generated by DeepSeek V3
字符串相似性判断
字符串相似性判断是计算机科学中一个重要问题,在搜索引擎、拼写检查、数据清洗等领域有广泛应用。以下是几种常见的字符串相似性度量方法和实现方式:
1. 编辑距离 (Levenshtein Distance)
编辑距离衡量将一个字符串转换成另一个字符串所需的最少单字符编辑操作次数(插入、删除、替换)。
def levenshtein_distance(s1, s2):
if len(s1) < len(s2):
return levenshtein_distance(s2, s1)
if len(s2) == 0:
return len(s1)
previous_row = range(len(s2) + 1)
for i, c1 in enumerate(s1):
current_row = [i + 1]
for j, c2 in enumerate(s2):
insertions = previous_row[j + 1] + 1
deletions = current_row[j] + 1
substitutions = previous_row[j] + (c1 != c2)
current_row.append(min(insertions, deletions, substitutions))
previous_row = current_row
return previous_row[-1]
# 示例
print(levenshtein_distance("kitten", "sitting")) # 输出: 3
2. Jaccard 相似度
计算两个集合的交集与并集的比值,常用于短文本比较。
def jaccard_similarity(s1, s2):
set1 = set(s1.split())
set2 = set(s2.split())
intersection = set1.intersection(set2)
union = set1.union(set2)
return len(intersection) / len(union)
# 示例
print(jaccard_similarity("apple banana", "banana orange")) # 输出: 0.333...
3. 余弦相似度
将字符串表示为词频向量,然后计算向量夹角的余弦值。
from collections import Counter
import math
def cosine_similarity(s1, s2):
vec1 = Counter(s1.split())
vec2 = Counter(s2.split())
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x]**2 for x in vec1.keys()])
sum2 = sum([vec2[x]**2 for x in vec2.keys()])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
if not denominator:
return 0.0
return float(numerator) / denominator
# 示例
print(cosine_similarity("apple banana", "banana orange")) # 输出: 0.5
4. Jaro-Winkler 距离
特别适用于人名匹配,给予前缀更多权重。
def jaro_winkler(s1, s2, p=0.1):
def jaro_similarity(s1, s2):
len_s1 = len(s1)
len_s2 = len(s2)
if len_s1 == 0 or len_s2 == 0:
return 0.0
max_dist = max(len_s1, len_s2) // 2 - 1
match_s1 = [False] * len_s1
match_s2 = [False] * len_s2
matches = 0
for i in range(len_s1):
start = max(0, i - max_dist)
end = min(i + max_dist + 1, len_s2)
for j in range(start, end):
if not match_s2[j] and s1[i] == s2[j]:
match_s1[i] = True
match_s2[j] = True
matches += 1
break
if matches == 0:
return 0.0
transpositions = 0
k = 0
for i in range(len_s1):
if match_s1[i]:
while not match_s2[k]:
k += 1
if s1[i] != s2[k]:
transpositions += 1
k += 1
return (matches/len_s1 + matches/len_s2 + (matches-transpositions/2)/matches) / 3
jaro = jaro_similarity(s1, s2)
# 计算共同前缀长度(最多4个字符)
prefix = 0
for i in range(min(len(s1), len(s2), 4)):
if s1[i] == s2[i]:
prefix += 1
else:
break
return jaro + prefix * p * (1 - jaro)
# 示例
print(jaro_winkler("MARTHA", "MARHTA")) # 输出: 0.9611...
5. N-gram 相似度
将字符串分解为连续的n个字符的子串,然后比较这些子串的相似程度。
def ngram_similarity(s1, s2, n=2):
def get_ngrams(s, n):
return [s[i:i+n] for i in range(len(s)-n+1)]
ngrams1 = set(get_ngrams(s1, n))
ngrams2 = set(get_ngrams(s2, n))
intersection = ngrams1 & ngrams2
union = ngrams1 | ngrams2
return len(intersection) / len(union) if union else 0.0
# 示例
print(ngram_similarity("night", "nacht")) # 输出: 0.25 (使用2-gram)
选择合适的方法
选择哪种相似度算法取决于具体应用场景:
- 编辑距离:适用于拼写检查、DNA序列比对
- Jaccard/Cosine:适用于文档相似性比较
- Jaro-Winkler:适用于人名、短字符串匹配
- N-gram:适用于模糊搜索、推荐系统
在实际应用中,可能需要组合多种方法或调整参数以获得最佳效果。