字符串相似性判断

最后更新:1/12/2026

AI 助手

会员专享功能

只有会员才能使用 AI 生成答案功能。 立即升级

Other Answers (1)

Generated by DeepSeek V3

字符串相似性判断

字符串相似性判断是计算机科学中一个重要问题,在搜索引擎、拼写检查、数据清洗等领域有广泛应用。以下是几种常见的字符串相似性度量方法和实现方式:

1. 编辑距离 (Levenshtein Distance)

编辑距离衡量将一个字符串转换成另一个字符串所需的最少单字符编辑操作次数(插入、删除、替换)。

python
def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    
    if len(s2) == 0:
        return len(s1)
    
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]

# 示例
print(levenshtein_distance("kitten", "sitting"))  # 输出: 3

2. Jaccard 相似度

计算两个集合的交集与并集的比值,常用于短文本比较。

python
def jaccard_similarity(s1, s2):
    set1 = set(s1.split())
    set2 = set(s2.split())
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)

# 示例
print(jaccard_similarity("apple banana", "banana orange"))  # 输出: 0.333...

3. 余弦相似度

将字符串表示为词频向量,然后计算向量夹角的余弦值。

python
from collections import Counter
import math

def cosine_similarity(s1, s2):
    vec1 = Counter(s1.split())
    vec2 = Counter(s2.split())
    
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    
    if not denominator:
        return 0.0
    return float(numerator) / denominator

# 示例
print(cosine_similarity("apple banana", "banana orange"))  # 输出: 0.5

4. Jaro-Winkler 距离

特别适用于人名匹配,给予前缀更多权重。

python
def jaro_winkler(s1, s2, p=0.1):
    def jaro_similarity(s1, s2):
        len_s1 = len(s1)
        len_s2 = len(s2)
        
        if len_s1 == 0 or len_s2 == 0:
            return 0.0
        
        max_dist = max(len_s1, len_s2) // 2 - 1
        match_s1 = [False] * len_s1
        match_s2 = [False] * len_s2
        matches = 0
        
        for i in range(len_s1):
            start = max(0, i - max_dist)
            end = min(i + max_dist + 1, len_s2)
            for j in range(start, end):
                if not match_s2[j] and s1[i] == s2[j]:
                    match_s1[i] = True
                    match_s2[j] = True
                    matches += 1
                    break
        
        if matches == 0:
            return 0.0
        
        transpositions = 0
        k = 0
        for i in range(len_s1):
            if match_s1[i]:
                while not match_s2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    transpositions += 1
                k += 1
        
        return (matches/len_s1 + matches/len_s2 + (matches-transpositions/2)/matches) / 3
    
    jaro = jaro_similarity(s1, s2)
    
    # 计算共同前缀长度(最多4个字符)
    prefix = 0
    for i in range(min(len(s1), len(s2), 4)):
        if s1[i] == s2[i]:
            prefix += 1
        else:
            break
    
    return jaro + prefix * p * (1 - jaro)

# 示例
print(jaro_winkler("MARTHA", "MARHTA"))  # 输出: 0.9611...

5. N-gram 相似度

将字符串分解为连续的n个字符的子串,然后比较这些子串的相似程度。

python
def ngram_similarity(s1, s2, n=2):
    def get_ngrams(s, n):
        return [s[i:i+n] for i in range(len(s)-n+1)]
    
    ngrams1 = set(get_ngrams(s1, n))
    ngrams2 = set(get_ngrams(s2, n))
    
    intersection = ngrams1 & ngrams2
    union = ngrams1 | ngrams2
    
    return len(intersection) / len(union) if union else 0.0

# 示例
print(ngram_similarity("night", "nacht"))  # 输出: 0.25 (使用2-gram)

选择合适的方法

选择哪种相似度算法取决于具体应用场景:

  • 编辑距离:适用于拼写检查、DNA序列比对
  • Jaccard/Cosine:适用于文档相似性比较
  • Jaro-Winkler:适用于人名、短字符串匹配
  • N-gram:适用于模糊搜索、推荐系统

在实际应用中,可能需要组合多种方法或调整参数以获得最佳效果。