How to use fuzzy string matching in Postgresql
It's a fact - people make typos or simply use alternate spellings on a frequent basis.
Whatever the cause, from a practical point of view, different variants of similar strings can pose challenges for software developers. Your application needs to be capable of handling these inevitable edge-cases.
Take names, for example. I go by Peter in some places, Pete in others. Amongst other variants, my name can be represented by:
"Pete Gleeson"
"Peter J Gleeson"
"Mr P Gleeson"
"Gleeson, Peter"
And that's not to mention alternative spellings of my surname, such as "Gleason". All these different variations for just one string - matching them against each other programmatically might not seem obvious.
Luckily, there are solutions out there.
The generic name for these solutions is 'fuzzy string matching'. The 'fuzzy' refers to the fact that these solutions do not look for a perfect, position-by-position match when comparing two strings. Instead, they allow some degree of mismatch (or 'fuzziness').
There are solutions available in many different programming languages. Today, we'll explore some options available in Postgresql (or 'Postgres') - a widely used open source relational database with some seriously useful add-on features.
Setting up
First, make sure you have Postgresql installed.
Then, create a new database in its own directory (you can call it anything you like, here, I called it 'fuzz-demo'). From the command line:
$ mkdir fuzz-demo && cd fuzz-demo
$ initdb .
$ pg_ctl -D . start
$ createdb fuzz-demo
For this demo, I used a table with details about artists in the Museum of Modern Art. The commands below assume the data is saved as artists.csv.
Next, you can start psql (a terminal-based front end for Postgresql):
$ psql fuzz-demo
Now, create a table called artists:
CREATE TABLE artists (
artist_id INT,
name VARCHAR,
nationality VARCHAR,
gender VARCHAR,
birth_year INT,
death_year INT);
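Finally, you can use Postgresql's COPY command to copy the contents of artists.csv into the table. COPY reads the file on the database server and does not expand ~, so give the full path to wherever you saved the file (the path below is a placeholder):
COPY artists FROM '/full/path/to/artists.csv' DELIMITER ',' CSV HEADER;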
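If the server process is not able to read the file, psql's \copy meta-command is an alternative that reads the file on the client side instead. A minimal sketch, assuming psql was started from the directory that contains artists.csv:

```sql
-- psql meta-command: must be written on a single line; no trailing semicolon is needed.
\copy artists FROM 'artists.csv' WITH (FORMAT csv, HEADER)
```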
If everything has worked so far, you should be able to start querying the artists table.
SELECT * FROM artists LIMIT 10;
Wildcard filters
Say you remember the first name of an artist called Barbara, but cannot quite remember her surname. It began with 'Hep', but you're not sure how it ended.
Here, you can use a filter and SQL's wildcard operator %. This symbol stands in for any number of unspecified characters.
SELECT
*
FROM artists
WHERE name LIKE 'Barbara%'
AND name LIKE '%Hep%';
The first part of the filter finds artists whose name begins with 'Barbara', and ends in any combination of characters.
The second part of the filter finds artists whose name can begin and end with any combination of characters, but must contain the letters 'Hep' in that order.
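Two related tools are worth knowing at this point, shown here as a minimal sketch: ILIKE, a Postgres-specific case-insensitive version of LIKE, and the _ wildcard, which stands in for exactly one character. The literal 'Barbara Hepworth' is only an illustrative value, not something read from the table.

```sql
-- Case-insensitive search: matches 'Barbara Hep...' however it is capitalised.
SELECT * FROM artists WHERE name ILIKE 'barbara%hep%';

-- _ matches exactly one character, % matches any number of characters (including none).
SELECT
    'Barbara Hepworth' LIKE 'Barbara Hep_orth' AS one_char_wildcard,      -- true
    'Barbara Hepworth' LIKE 'barbara%'         AS case_sensitive_like,    -- false
    'Barbara Hepworth' ILIKE 'barbara%'        AS case_insensitive_ilike; -- true
```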
But what if you are unsure of the spelling of either name? Filters and wildcards will only get you so far.
Using trigrams
Luckily, Postgres has a helpful extension with the catchy name pg_trgm. You can enable it from psql using the command below:
CREATE EXTENSION pg_trgm;
This extension brings with it some helpful functions for fuzzy string matching. The underlying principle is the use of trigrams (which sound like something out of Harry Potter).
Trigrams are formed by breaking a string into groups of three consecutive characters. Postgres treats each word as if it had two spaces added at the start and one at the end, so the string "hello" would be represented by the following set of trigrams:
"  h", " he", "hel", "ell", "llo", "lo "
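You can inspect the trigrams Postgres generates for any string with the SHOW_TRGM function that ships with pg_trgm. A quick check, assuming the extension has been enabled as described earlier:

```sql
-- Returns the trigrams of the input as a text array (six entries for 'hello').
SELECT SHOW_TRGM('hello');
```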
By comparing how much the sets of trigrams for two strings overlap, it is possible to estimate how similar the strings are on a scale between 0 and 1. This allows for fuzzy matching, by setting a similarity threshold above which strings are considered to match.
SELECT
*
FROM artists
WHERE SIMILARITY(name,'Claud Monay') > 0.4 ;
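To see where a score comes from, it can help to work through a small case by hand. The sketch below assumes pg_trgm's default behaviour, where the score is the number of shared trigrams divided by the number of distinct trigrams across both strings. The words 'hello' and 'hallo' are illustrative only.

```sql
-- 'hello' and 'hallo' each produce six trigrams and share three of them
-- ("  h", "llo" and "lo "), so the score is 3 / (6 + 6 - 3) = 3 / 9, roughly 0.33.
SELECT SIMILARITY('hello', 'hallo') AS score;

-- SHOW_TRGM lists the trigrams behind each side of the comparison.
SELECT SHOW_TRGM('hello') AS hello_trigrams,
       SHOW_TRGM('hallo') AS hallo_trigrams;
```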
Perhaps you want to see the top five matches?
SELECT
*
FROM artists
ORDER BY SIMILARITY(name,'Lee Casner') DESC
LIMIT 5;
The default threshold is 0.3. You can use the % operator in this case as shorthand for fuzzy matching names against a potential match:
SELECT
*
FROM artists
WHERE name % 'Andrey Deran';
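The threshold used by the % operator is a session setting, so it can be changed. A minimal sketch, assuming a Postgres version (9.6 or later) that exposes it as pg_trgm.similarity_threshold; the value 0.5 is an arbitrary choice for illustration:

```sql
-- Make % stricter for the rest of this session.
SET pg_trgm.similarity_threshold = 0.5;

-- Confirm the value now in effect.
SHOW pg_trgm.similarity_threshold;

-- The same query now only returns names whose score clears the new threshold.
SELECT * FROM artists WHERE name % 'Andrey Deran';

-- Go back to the default of 0.3.
RESET pg_trgm.similarity_threshold;
```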
Perhaps you only have an idea of one part of the name. The % operator lets you compare against elements of an array, so you can match against any part of the name. The next query uses Postgres' STRING_TO_ARRAY function to split the artists' full names into arrays of separate names.
SELECT
*
FROM artists
WHERE 'Cadinsky' % ANY(STRING_TO_ARRAY(name,' '));
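Without an index, trigram queries like the ones in this section have to score every row. pg_trgm also provides index operator classes that can speed up %, LIKE and ILIKE searches on a column. A minimal sketch, assuming the artists table from earlier; the index name is hypothetical:

```sql
-- GIN index over the trigrams of the name column.
CREATE INDEX artists_name_trgm_idx ON artists USING GIN (name gin_trgm_ops);

-- Queries of this shape can then use the index instead of scanning every row.
SELECT * FROM artists WHERE name % 'Andrey Deran';
```

Note that this helps when the indexed column itself is compared, as in name % '...'. The array-based query above compares against pieces of the name computed at query time, so it would not benefit from this index.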
Phonetic algorithms
Another approach to fuzzy string matching comes from a group of algorithms called phonetic algorithms.
These are algorithms which use sets of rules to represent a string using a short code. The code contains the key information about how the string should sound if read aloud. By comparing these shortened codes, it is possible to fuzzy match strings which are spelled differently but sound alike.
Postgres comes with an extension that lets you make use of some of these algorithms. You can enable it with the following command:
CREATE EXTENSION fuzzystrmatch;
One example is an algorithm called Soundex. Its origins go back over 100 years - it was first patented in 1918 and was used in the 20th century for analysing US census data.
Soundex works by converting strings into four-character codes (a letter followed by three digits) which describe how they sound. For example, the Soundex representations of 'flower' and 'flour' are both F460.
The query below finds the record which sounds like the name 'Damian Hurst'.
SELECT
*
FROM artists
WHERE nationality IN ('American', 'British')
AND SOUNDEX(name) = SOUNDEX('Damian Hurst');
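To look at the codes directly, you can call SOUNDEX on literal strings. fuzzystrmatch also includes a DIFFERENCE function, which converts both inputs to Soundex codes and returns how many of the four positions match (0 to 4). A short sketch using the 'flower' and 'flour' pair mentioned earlier:

```sql
SELECT
    SOUNDEX('flower')             AS flower_code,         -- F460
    SOUNDEX('flour')              AS flour_code,          -- F460
    DIFFERENCE('flower', 'flour') AS matching_positions;  -- 4, since the codes are identical
```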
Another algorithm is one called metaphone. This works on a similar basis to Soundex, in that it converts strings into a code representation using a set of rules.
The metaphone algorithm will return codes of different lengths (unlike Soundex, which always returns four characters). You can pass an argument to the METAPHONE function indicating the maximum length code you want it to return.
SELECT
artist_id,
name,
METAPHONE(name,10)
FROM artists
WHERE nationality = 'American'
LIMIT 5;
Because both metaphone and Soundex return strings as outputs, you can use them in other fuzzy string matching functions. This combined approach can yield powerful results. The example below finds the five closest matches for the name Si Tomlee.
SELECT
*
FROM artists
WHERE nationality = 'American'
ORDER BY SIMILARITY(
METAPHONE(name,10),
METAPHONE('Si Tomlee',10)
) DESC
LIMIT 5;
Here, a trigram-only approach would not have helped much, as there is little overlap between 'Cy Twombly' and 'Si Tomlee'. In fact, these only have a SIMILARITY score of 0.05, even though they sound similar when read aloud.
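You can check this for yourself by scoring the raw strings and their metaphone codes side by side. A minimal sketch, assuming both pg_trgm and fuzzystrmatch are enabled:

```sql
SELECT
    -- Trigram similarity of the names as written: one shared trigram out of 20, i.e. 0.05.
    SIMILARITY('Cy Twombly', 'Si Tomlee') AS raw_similarity,
    -- Phonetic codes (at most 10 characters each) for both names.
    METAPHONE('Cy Twombly', 10) AS twombly_code,
    METAPHONE('Si Tomlee', 10)  AS tomlee_code,
    -- Trigram similarity of the phonetic codes rather than the spellings.
    SIMILARITY(
        METAPHONE('Cy Twombly', 10),
        METAPHONE('Si Tomlee', 10)
    ) AS phonetic_similarity;
```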
Due to their historical origins, neither of these algorithms works well with names or words of non-English language origin. However, there are more internationally-focused versions.
One example is the double metaphone algorithm. This uses a more sophisticated set of rules for producing metaphones. It can provide alternative encodings for English and non-English origin strings.
As an example, see the query below. It compares the double metaphone outputs for different spellings of Spanish artist Joan Miró:
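A minimal sketch of such a comparison, assuming the DMETAPHONE and DMETAPHONE_ALT functions from fuzzystrmatch, which return the primary and alternate double metaphone codes. The second spelling, 'Juan Mero', is a hypothetical variant chosen here for illustration:

```sql
SELECT
    spelling,
    DMETAPHONE(spelling)     AS primary_code,    -- main encoding
    DMETAPHONE_ALT(spelling) AS alternate_code   -- alternative encoding, which may equal the primary one
FROM (VALUES ('Joan Miró'), ('Juan Mero')) AS variants(spelling);
```

If either code for one spelling matches either code for the other, the two spellings can be treated as a phonetic match.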