How to find and sort similar texts

There is a simple way to sort a set of texts by their similarity to a given text: rank them by the Euclidean distance between the word frequencies of the texts being compared. That description should be enough to make the algorithm clear; a simple implementation can be found here.
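As a minimal sketch of the idea (not the implementation linked above), the Python below makes two assumptions that the text does not spell out: a word is a lowercase run of letters or apostrophes, and a frequency is relative, that is, a word's count divided by the total number of words in the text. The function names `word_frequencies`, `euclidean_distance` and `sort_by_similarity` are illustrative.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Map each word to its share of all words in the text."""
    # Assumption: a word is a run of letters or apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two word-frequency maps."""
    # A word that is missing from one text has frequency 0 there.
    vocabulary = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocabulary)
    )


def sort_by_similarity(model_text, candidates):
    """Return (distance, name) pairs, smallest distance (most similar) first.

    `candidates` maps a name to the full text stored under that name.
    """
    model = word_frequencies(model_text)
    scored = [
        (euclidean_distance(model, word_frequencies(text)), name)
        for name, text in candidates.items()
    ]
    return sorted(scored)
```

Relative frequencies are used here so that a long book and a short one remain comparable; with raw counts the distance would mostly reflect the difference in length.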

Surprisingly, this simple method gives good results. For example, if you are looking for the next book to read, you can use the text of one or several books you have already read as the model for the search. For this repository of 10 books, the model book FAIRY TALES By The Brothers Grimm gives the following results:

0.0320757  Repo\THEADVENTURESOFTOMSAWYER.txt
0.0363329  Repo\ATALEOFTWOCITIES-ASTORYOFTHEFRENCHREVOLUTION.txt
0.0388528  Repo\ALICEТSADVENTURESINWONDERLAND.txt
0.0440605  Repo\MOBY-DICKor, THEWHALE.txt
0.046679  Repo\THEADVENTURESOFSHERLOCKHOLMES.txt
0.0472574  Repo\TheIliadofHomer.txt
0.0511793  Repo\TheRomanceofLust.txt
0.053746  Repo\PRIDEANDPREJUDICE.txt
0.0543531  Repo\BEOWULF-ANANGLO-SAXONEPICPOEM.txt
0.0557194  Repo\Frankenstein; or, theModernPrometheus.txt

As the results show (a smaller distance means a more similar text), the books closest in spirit to fairy tales rank as the most similar, and the horror novel ranks as the least similar.
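A listing like the one above could be produced along the following lines. This is only a sketch under stated assumptions: the repository is a folder of UTF-8 `.txt` files, several model books are combined by simple concatenation, and the model files themselves are skipped when ranking. The names `frequencies`, `distance` and `rank_repository` are hypothetical and are not taken from the original program.

```python
import math
import re
from collections import Counter
from pathlib import Path


def frequencies(text):
    """Relative word frequencies (assumed tokenization: runs of letters)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w: c / len(words) for w, c in Counter(words).items()}


def distance(a, b):
    """Euclidean distance between two frequency maps."""
    return math.sqrt(
        sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b))
    )


def rank_repository(model_paths, repo_dir):
    """Rank every .txt file in repo_dir by distance to the combined model texts."""
    model_paths = [Path(p).resolve() for p in model_paths]
    # Assumption: several read books are concatenated into one model text.
    model_text = "\n".join(
        p.read_text(encoding="utf-8", errors="ignore") for p in model_paths
    )
    model = frequencies(model_text)
    results = []
    for path in sorted(Path(repo_dir).glob("*.txt")):
        if path.resolve() in model_paths:
            continue  # assumption: a model book is not ranked against itself
        text = path.read_text(encoding="utf-8", errors="ignore")
        results.append((distance(model, frequencies(text)), path.name))
    return sorted(results)  # smallest distance, i.e. most similar, first
```

Printing each returned pair as the distance followed by the file name gives output of the same shape as the listing. Note that concatenation lets a longer model book weigh more than a shorter one; averaging the per-book frequency maps would be an alternative.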

Commercially, such a program can be used to pick the most suitable advertisement for a given web page: compare the text of the page the user is reading with the text of the pages that the available advertisements lead to.

Another use is finding a resume in a database by example. Take the resume of a candidate who fits the position but does not want to join, or of an employee who is leaving the company, and search for similar ones. Finding a replacement for an employee is not a rare business case. You can also sort a database of resumes by similarity to the job description.

P.S. By the way, Habr's list of similar articles often shows things that are not very similar. Maybe Habr should apply this method too?
