Word hunt
I will continue the series of articles “How to entertain yourself using the Wikipedia frequency dictionary and the Python interpreter, if nothing else is at hand and is not expected in the near future.”
I will try to recreate that wonderful evening when my Wikipedia parser worked, I received a coveted dictionary, opened Python in an interactive mode and began to enter various queries in order to get words with all sorts of unusual properties. The one, two years ago, the session with the shell, unfortunately, did not survive, so I will do it all over again.
So. We put the wiki_freq.txt file in some folder, open the Python shell in this folder. Go.
To get started, load the dictionary:
And try something very simple, for example, to get the longest words. We will list the beginning of the list of all dictionary words sorted by length in reverse order (if the length of two words is the same, they are sorted by frequency):
Hyphens strongly spoil the picture, although the "spyware-Japanese-German-Trotskyist-wrecking" is certainly wonderful. Because the dictionary was compiled two years ago, now it is possible to check whether this or that unusual word was a trick of hooligans, this particular adjective was not, it is still in the article about Kaganovich Lazar Moiseevich.
But I’ll probably remove the hyphens. I will not scoff at the dictionary, I’ll better beat it and create a new one without hyphens:
start the search for long words again.
The absolute record holder is the word Thaumatauacatangiangakauauauotamateaturipukakapikimaungahoronukupokanuenuakitanathu, the name of a small hill in New Zealand. For some reason, locals prefer to shorten the name to Taumat. The second word from Wikipedia has disappeared - probably a mistake or tricks of hooligans. When I introduced the third word - llanvairpullgvingillogogerihuirndrobullantisiliogogogo, Wikipedia inquired - Perhaps you meant "llanvairpullgvingillogogerihuirndrobullantisilogogog"?Yes! It was llanvairpullgvingillogogerihuirndrobullantisilogogogokh I had in mind! Apparently the word has changed, more correct transliteration has appeared. This is the name of a village in Wales. I also really liked the name of the lake in North America - Cargoggagoggmanchauggagoggchaubunagungamaugg. Most of the list is made up of the names of all kinds of substances, Soviet furious abbreviations like the Upper Volga power grid, or just cases of dropping the keys of white space.
Let's look for long words-palindromes. Add a filter to the last query:
He quoted Google, followed the links, learned about the Holomolokh Lake, as well as the planned language Sunilinus, a language where every word is a palindrome. I also learned that the longest word-palindrome is saippuakauppias, a soap seller in Finnish. It turned out to be interesting with the word “onsone,” the author of the article about glycerin was too lazy to switch the keyboard layout and wrote the formula of glycerin aldehyde in Cyrillic - CH2 is ANNOUNCED (as it is still there, the difference is not visible to the naked eye) and my parser took the two as a separator.
Let's look for anagram words.
To begin with, we’ll compose an auxiliary dictionary of anagrams, set an occurrence threshold of 20 to ensure that any rubbish is cut off (with long words and palindromes I could not do this, because these words are extremely rare):
And look for the 10 largest sets of anagrams:
And also sets of anagrams with words of maximum length:
We are looking for words in which all letters are different, again we are changing the same long-suffering line. I'm already starting to regret that I did not make a function out of it:
We make a function, come in handy and enter the opposite query - words in which each letter occurs at least twice:
Here the rubbish is mainly, but there is also a beautiful thing - multimillion, unintelligent, inappropriate, special processor, anti-sectarian, part apparatus, etc.
And exactly twice:
wind turbines, Voskanyanovskaya, pseudo-confession ... There’s
still another sea, and time is later ... If the inhabitants of the habr like the article, I’ll write the second part. If you have ideas for interesting queries to the dictionary, write in comments or in private messages, and I will translate them into code and publish the results.
PS The frequency dictionary is still here - yadi.sk/d/cRHJvinF7Vt8w
I will try to recreate that wonderful evening when my Wikipedia parser worked, I received a coveted dictionary, opened Python in an interactive mode and began to enter various queries in order to get words with all sorts of unusual properties. The one, two years ago, the session with the shell, unfortunately, did not survive, so I will do it all over again.
So. We put the wiki_freq.txt file in some folder, open the Python shell in this folder. Go.
To get started, load the dictionary:
>>> import codecs
>>> d={}
>>> for l in codecs.open("wiki_freq.txt","r","utf-8-sig"):
w,f=l.split()
d[w]=int(f)
And try something very simple, for example, to get the longest words. We will list the beginning of the list of all dictionary words sorted by length in reverse order (if the length of two words is the same, they are sorted by frequency):
>>> print ', '.join(sorted(d.keys(),key=lambda w:(len(w),d[w]),reverse=True)[:100])
the words
Pori-Ikaalinen-kuru-vilpula-lyankipohya-Padasjoki-Heinola-mantyuharyu-Savitaipale-Lappeenranta-Antrea-rout, barn-Zemitany-Ciekurkalns-Yugla-Baltezers-Ropazhi-Krievupe-Vangazhi-Inchukalns-eglyupe-Silciems-Sigulda petrozavodsk- Shala-Medvezhyegorsk-Povenets-kondopoga-Kizhi-Valaam-Leningrad-Petrozavodsk, Thaumatauakatangiakakauauotamateaturipukakapikimaungoronokupanuenuakitanatuu, armor-piercing-fragmentation-sand-wave-blue-armored-sandy-armored-sand-white-armored lava-pyatikhatki, baturino-rakino-dyshkin-zaddenka-stepalkovo-kacheli-cheshikhino-poehnova-kozlina, Mikhailovsky-Sevsk-Dmitrov-Krom-naryshkino-eagle-novosil-borki-kostorny, Russian-Georgian-German-Dutch-Italian- Finnish, ultra-high-tech-indestructible-super-cosmo-cyber-costume, Komsomol-Moscow-Leningrad-Markovtsev-Komsomol, Gelendzhik-Anapa-Tunnel-Ubinsk-Novodmitrievsk-Gelendzhik, Tula-Gorbachev-Belev-Kozelsk-Sukhinichi-Kozelsk-Trenet-Tula, Copenhagen-Hamburg-Bremen-Dortmurt Zurich, methionyl-glutamyl-histidyl-phenylalanil-prolyl-glycyl-proline, fly-mura-tour-tara-para-park-spider-spider-rogue-raft-slot-elephant, tsivi-tsivi-tsivi-tsivi-wiss- wyss-wyss-twyss-wyss-wyss-cirr, threonyl-lysyl-prolyl-arginyl-prolyl-glycyl-prolyl-diacetate, october-tram-nevsky-nezhinsky-cosm Pushkin-onavtov, white-cornflower-yellow-red-yellow-cornflower-white, rehberger-etelavuori-koyvusaari-holopainen-kallio-joutsen, prinzmarmabükpetralbertgansiosifbernhardtgilgoglulgoglulgoglulgoglulloggulgulgullgulgulgulgulgulgulglugloggulgul chanson burn kuzminki-Mishukovo-Kozelsk-Inyutin-Ermolino-Lapshinka, methylphenylthiomethyl-dimethylaminomethyl-gidroksibromindol, taumatauakatangiangakoauauotamateapokanuenuakitanatahu, Washington-London-Copenhagen-Stockholm-Helsinki-Moscow, the next-forward-good-senator-like-it-there, Villa de nuestra senora de la candelaria-de-medellin, krasnoyarsk-kirensk-yakutsk-nagaevo-nizhnekolymsk-uelen, pajanka-mari-oshaevo-akhmanovo-voya-paigishevo-kazakovo, george alexander-michael-friedrich-vr franz-karl, Almetyevsk-Bugulma-Leninogors on-Aznakaevsky, Samara, Volgograd, Rostov, Astrakhan, Elista, Budennovsk, gidrazinokarbonilmetilbromfenildigidrobenzdiazepin, sverhleningradsko Petersburg professor-liberal, Spaso-Eleazar velikopustynsky-Trehsvyatitelsky, Mirador Miramar Sóller center-plazario-otey Airport, arginyl-alpha-aspartyl-lysyl-valyl-tyrosyl-arginine, Adam-Karl-Wilhelm-Stanislav-Eugene-Pavel-Ludwig, Kharkov-Lozova-Aleksandrovsk-Perekop-Sevastopol, fly-mura-fura-handicap-bark-corne- koan-clan-clone-elephant, Bryansk-Starodub-Trubchev-Novgorod-Seversky, F minor-A-flat-major-B-flat-major-minor, station-Marx-greenhouse-Baharova-lane, Rostov- na-donu-vladivostok-san-francisco-new-york, fly-muse-loose-vine-pose-time-port-grade-groan-elephant, dubossars-quarantine-tomcat-dorotsky-grigoriopol, spyware-japanese-german- Trotskyist co-pest, bicarbonate-sodium chloride-calcium-magnesia, Ukrainian-Crimean Tatar-Russian-Belarusian, Wakamaya-Silva-Maradona-Maria-Nina-Sanchez-Rakht, Polish-Italian-French-German-Canadian, network-transformer rectifier-inverter-traction, cyber-equipment-sandwiches-with-ham-and-cheese, Italy-Slovenia-Hungary-Slovakia-Ukraine-Russia, musical-choreographic-cinematic, heavy-pop-rock-world-jazz-avant-garde-metal-blues, alexander-maximilian-ignatius-victor-adolf, karna-rajapuram-gatva- Cambodiah-nirjitastava, decamethylene dimethylmethoxycarbonylmethylammonium, sevzapvnipienergoprom-sevzapenergomontazhproekt, palatinate-zweibrucken-birkenfeld-gelnhausen, football-cowboy-super-heroic-soldier, bicarbonne-calcium chloride-hydrogen go, el-santuario de nuestra-senora-de-la-soledad, lions-ternopil-vinnitsa-dnipropetrovsk-donbass, bogdanovich-egorshino-alapaevsk-goroblagodatskaya, Kazan-chistopol-almetyevsk-bugulma-orenburg, geo-bi philo-phantasy-mantas-parabolic, visit-unbeliever-with-explanatory-pamphlet, Comunidad de Villa-i-Tierra-Antigua-de-Cuellar, Barnaul-Ekibastuz-Kokchetav-Kustanay-Chelyabinsk, Schleswig-Holstein-Sonderburg-Augustenburg, Gargantilla del Losoya-i-Pinilla-de-buytragidrogylnutrobylnopyrnyl glubopyrnos carbon dioxide-chloride-bicarbonate-sodium, bicarbonate-calcium-sulphate-sodium, Schleswig-Holstein-Sondenburg-Augustenburg, Schleswig-Holstein-Sonderburg-Augustenberg, Petropavlovsk-Kokchetav-Akmolaysor-Kamenskoroz-Kraskwa-Forszów-Kamenskornskorzwańskorzwańskorzwańskorzwańskorzwański equalizer irish seam Sko-Austrian-Polish-Czech, Ermilovo-Primorsk-Glebych-in-Soviet-Vyborg, Hungary-Beli-Manastir-Osiek-Djakovo-Slavonski, Edison and Svan-United-Electric-Light-Company, Skurato-Orel- Kursk-Kharkov-Lozova-Ilovaisk, sulfate-hydrocarbonate-magnesia-calcium, religious enlightenment-charity,
Hyphens strongly spoil the picture, although the "spyware-Japanese-German-Trotskyist-wrecking" is certainly wonderful. Because the dictionary was compiled two years ago, now it is possible to check whether this or that unusual word was a trick of hooligans, this particular adjective was not, it is still in the article about Kaganovich Lazar Moiseevich.
But I’ll probably remove the hyphens. I will not scoff at the dictionary, I’ll better beat it and create a new one without hyphens:
>>> for l in codecs.open("wiki_freq.txt","r","utf-8-sig"):
w,f=l.split()
if '-' in w:continue
d[w]=int(f)
start the search for long words again.
the words
taumatauakatangiangakoauauotamateaturipukakapikimaungahoronukupokanuenuakitanatahu, printsmarmabyukpetralbertgansiosifbernhardtvilgelmsberg, llanvayrpullgvingillgogerihuirndrobullllantisiliogogogo, taumatauakatangiangakoauauotamateapokanuenuakitanatahu, gidrazinokarbonilmetilbromfenildigidrobenzdiazepin, dekametilendimetilmentoksikarbonilmetilammoniya, glyukopiranozidmetilbuteniltrigidroksiflavanol, metoksihlordietilaminometilbutilaminoakridin, italbenzokladbischemotosilikonfarmometalhimiya, chargoggagoggmanch augogoggchaubunagungamaug, enterohematohepatohematopulmoenteric, danlilogomicum juylupigacchunzhekhan, spray a weakly concentrated extract, polytetrafluoroethylene acetoxypropylbutane, catheter chlorofluoromethoxyphenylphosphorodiphenylphosphorodiphenylacetate lesnayazatenennayaglubokayachernayadolina, ultravysokotemperaturnoobrabotannoe, nicotinamide, prevysokomnogorassmotritelstvuyuschy, superarhiekstraultramegagrandiozno, skonstruirovanteplovozostroitelnym, uridindifosfatglyukuroniltransferazu, lizofosfatidilholinatsiltransferazoy, uridindifosfatglyukoroniltransferazy, superkalifradzhilistikekspialidoshes, lizofosfatidilholinatsiltransferaza, tetrametiltetraazobitsiklooktandion, nicotinamide, superkalifradzhilistikekspialidoshes , please, tetrahydroxyglucopyranosyl xanthene, bisethylenedithiotetraselenium fulvalen, adenosine phosphatisopentyltransferase, the union of the main supply chain, ishpara, and pairedium hydrogen sulfide galogenoserebryanomfotograficheskom, nikotinamidadenindinukleotidfosfa, poliamidbenzimidazoltereftalamida, rentgenoelektrokardiograficheskogo, bromdigidrohlorfenilbenzodiazepin, undzibtsigtauzendzibenhundertziben, lecithin-cholesterol acyltransferase, lecithin-cholesterol acyltransferase, alyuminievokobaltovomolibdenovyh, predsedatelstvovaliarheanaktidy, kominefteenergomontazhavtomatikoy, pesnitruschobnadezhdyrazbityhserdets, induktotermogalvanogryazelechenie, lecithin-cholesterol acyltransferase, methylenetetrahydrofolate , Upper Volga power grid, the formation of a mono-bromine derivative, a petroleum energy and energy installation, automation, a tractor and agricultural machinery, hexacosiohex-hexa-hexaphobia, methylene tetrahydrofolate reductase, pancreatocholango-roentgenography, com-oil-and-gas installation elektrogastroenterograficheskogo, proizvodstveelektrooborudovaniya, adeninfosforiboziltransferaznoy, tsiklodekstringlyukanotransferazy, dioksifenilalanindekarboksilazy, psihiatriideinstitutsionalizatsiya, dimetilalkilbenzilammoniyhlorid, natsionalisticheskiorientirovanoy, samoprovozglashonoyturkestanskoy, diatsilglitseriltrimetilgomoserin, entsefalomielopoliradikulonevrit, gidrolektrosvetoterapevticheekie, mahachkalinskogogosudarstvennogo, gidroksimetilhinoksilindioksid, dihlordifeniltrihlormetilmetan, poperechnomorschinistoysk ultrapurity, a pathic of autopsy, imagery, pornographic video products, encephaloduroarteria from the autotractor, electrical equipment, machachulalongkornrajavidiyalaya, begging lured, new-village, agro-industrial, hexanitrohexaazole-gastroeurope
The absolute record holder is the word Thaumatauacatangiangakauauauotamateaturipukakapikimaungahoronukupokanuenuakitanathu, the name of a small hill in New Zealand. For some reason, locals prefer to shorten the name to Taumat. The second word from Wikipedia has disappeared - probably a mistake or tricks of hooligans. When I introduced the third word - llanvairpullgvingillogogerihuirndrobullantisiliogogogo, Wikipedia inquired - Perhaps you meant "llanvairpullgvingillogogerihuirndrobullantisilogogog"?Yes! It was llanvairpullgvingillogogerihuirndrobullantisilogogogokh I had in mind! Apparently the word has changed, more correct transliteration has appeared. This is the name of a village in Wales. I also really liked the name of the lake in North America - Cargoggagoggmanchauggagoggchaubunagungamaugg. Most of the list is made up of the names of all kinds of substances, Soviet furious abbreviations like the Upper Volga power grid, or just cases of dropping the keys of white space.
Let's look for long words-palindromes. Add a filter to the last query:
>>> print ', '.join(sorted([w for w in d.keys() if w==w[-1::-1]],
key=lambda w:(len(w),d[w]),reverse=True)[:100])
the words
aqueducts, hydrogens, monk, onsnone, her, creeper, sologolos, monaghan, holomolokh, razinizar, sunilinus, xxxxxxxxx, noossoon, utility, aniline, tartrate, akasaka, ogopogo, anisina, anikin, kaanaak, imamator, apocapot, roth, hhhhhhh, glenelg, tenenet, ateleta, iiiiiii, nononen, yonina, aveleva, anapana, markram, raviwar, arthur, aaaaaaa, motet, clincher, arakara, nemtman, malalyam, avavava, maktam, agkuk, kalyk, monkak, kalyub, monkuk, duunuud, tatatat, mokikom, oohhhhhhhhhh, sea, kinonik, atarata, nananan, ppppvppp, to the masses , tommot, nyssin, renner, rapper, saccas, moss, kelleck, semmes, mott, kashak, rsser, kukkuk, nekken, serres, noppon, killik, xxxxxx, aveeva, millim, tennet, callak, hattah, mall, kattak, mermer , tippit, karak, nabban, pyrrip, tibbit, savvas, pteetp, remmer, nannan, roulur, yonic, sossos,
He quoted Google, followed the links, learned about the Holomolokh Lake, as well as the planned language Sunilinus, a language where every word is a palindrome. I also learned that the longest word-palindrome is saippuakauppias, a soap seller in Finnish. It turned out to be interesting with the word “onsone,” the author of the article about glycerin was too lazy to switch the keyboard layout and wrote the formula of glycerin aldehyde in Cyrillic - CH2 is ANNOUNCED (as it is still there, the difference is not visible to the naked eye) and my parser took the two as a separator.
Let's look for anagram words.
To begin with, we’ll compose an auxiliary dictionary of anagrams, set an occurrence threshold of 20 to ensure that any rubbish is cut off (with long words and palindromes I could not do this, because these words are extremely rare):
>>> from collections import defaultdict
>>> anagrams = defaultdict(list)
>>> for k in d:
if d[k]>20: anagrams[''.join(sorted(k))].append(k)
And look for the 10 largest sets of anagrams:
>>> for l in sorted(anagrams.values(), key=len, reverse=True)[:10]:
print ', '.join(l)
anagrams
Doran, Ardon, Dorn, Rodan, Nodar, Andro, People, Drona, Radon, Daron, Ronda, Nardo, Odran, Norad
Irka, Rika, Arches, Crayfish, Caviar, Acre, Iraq, Kira, Kari, Ikar, Cairo, Arik
vrane, arven, equal, avren, faithful, narva, avner, narev, nerve, varna, revan, varen
ryan, inar, nari, iran, rain, henri, arin, rina, nair, rani, arnie merka
, marek, marche, ermak, gloom, chambers, cream, rivers, edge, karma, frame
anchor, nokra, karon, koran, krona, mink, akron, ranko, wound, carnot, kraon satir
, sitar, grow, tyras, syrta, sarti, trias, taris, thrasi, ritas, isstra
krua, hand, arch, raku, kaur, urka, acre, kara, ukra, urak, kura
sector, corset, rockets, surroundings, bonfire, bonfire, row, cortes, stoker, construction sites, cross
noir, naru, urn, rune, nura, wound, uranium, naur, ruan, arun, arnu
Irka, Rika, Arches, Crayfish, Caviar, Acre, Iraq, Kira, Kari, Ikar, Cairo, Arik
vrane, arven, equal, avren, faithful, narva, avner, narev, nerve, varna, revan, varen
ryan, inar, nari, iran, rain, henri, arin, rina, nair, rani, arnie merka
, marek, marche, ermak, gloom, chambers, cream, rivers, edge, karma, frame
anchor, nokra, karon, koran, krona, mink, akron, ranko, wound, carnot, kraon satir
, sitar, grow, tyras, syrta, sarti, trias, taris, thrasi, ritas, isstra
krua, hand, arch, raku, kaur, urka, acre, kara, ukra, urak, kura
sector, corset, rockets, surroundings, bonfire, bonfire, row, cortes, stoker, construction sites, cross
noir, naru, urn, rune, nura, wound, uranium, naur, ruan, arun, arnu
And also sets of anagrams with words of maximum length:
>>> for k in sorted([k for k in anagrams if len(anagrams[k])>1],
key=len, reverse=True)[:10]:
print ', '.join(anagrams[k])
anagrams
short, short
astrological, realistic
, characterized
psychopathology, pathopsychology,
ontological, neolithic
oligarchic, archaeological
topological, typological
isothermal, isometric
cynological, oncological
Ostrograd, country
astrological, realistic
, characterized
psychopathology, pathopsychology,
ontological, neolithic
oligarchic, archaeological
topological, typological
isothermal, isometric
cynological, oncological
Ostrograd, country
We are looking for words in which all letters are different, again we are changing the same long-suffering line. I'm already starting to regret that I did not make a function out of it:
>>> print ', '.join(sorted([k for k in d if len(set(k))==len(k)],key=lambda w:(len(w),d[w]),reverse=True)[:100])
the words
difficult to heal, quadrangle, quadripole, paper spinning, paper spinning, arsenic, subepidermal, paper spinning, subepidermal, bolshepudginskaya, white emigrant, prosperous, direct, extrasophageal, providing, providing, quadrilateral, quadrangle, quadrangle, quadrangle, quadrangle congratulatory, four-pulse, highly specialized, multi-brand, pickups, four yo-pole, slovenliness, pan-French, gumpoldskirchen, Bolshenarymsky, congratulatory, paper-spinning, pan-French, incendiary, hooked up, South Alberta, extremophilic, fasciculatory, highly specialized, visco-elastic, sub-vertical, ritmendblyuzovyh, recline obscheitalyanskuyu, extremophilic, Germanophile, dvuhprestolnym, viscoelastic, filmprodutsenty, minrybhozyaystva, chetyrohspalnuyu, normalized, non-degenerate, extemporaneous, Hairless, stepokruzhayuschy, subvertical, Vanderbilt, shombulygchinskaya, predahtubinskoy, multi-brand, bystrouvyadayuschie, transfusion, highly specialized, vildenburgskaya, skulptirovany, a thousand, strolling, viscoplastic, nostalgic, lightly stuck, representing hailing, emphasized, representing, monastic, crushing, oldenburg, exposed, crushing, pickup, controllability, coal mining, pickup, pulitzer, braunschweig, crushing, sane, oldenburg, oldenburg, hydrometeorological recline obscheitalyanskuyu, extremophilic, Germanophile, dvuhprestolnym, viscoelastic, filmprodutsenty, minrybhozyaystva, chetyrohspalnuyu, normalized, non-degenerate, extemporaneous, Hairless, stepokruzhayuschy, subvertical, Vanderbilt, shombulygchinskaya, predahtubinskoy, multi-brand, bystrouvyadayuschie, transfusion, highly specialized, vildenburgskaya, skulptirovany, chiliagon, walking, viscoplastic, nostalgic, lightly stuck, representing, representing, emphasizes repents, representing, monastic, crushing, oldenburg, exposed, crushing, pickup, handling, coal, pickup, pulitzer, braunschweig, crushing, sane, oldenburg, oldenburg, hydrometeorological recline obscheitalyanskuyu, extremophilic, Germanophile, dvuhprestolnym, viscoelastic, filmprodutsenty, minrybhozyaystva, chetyrohspalnuyu, normalized, non-degenerate, extemporaneous, Hairless, stepokruzhayuschy, subvertical, Vanderbilt, shombulygchinskaya, predahtubinskoy, multi-brand, bystrouvyadayuschie, transfusion, highly specialized, vildenburgskaya, skulptirovany, chiliagon, walking, viscoplastic, nostalgic, lightly stuck, representing, representing, emphasizes repents, representing, monastic, crushing, oldenburg, exposed, crushing, pickup, handling, coal mining, pickup, pulitzer, braunschweig, crushing, sane, oldenburg, oldenburg, hydrometeorological extremophilic, Germanophile, dvuhprestolnym, viscoelastic, filmprodutsenty, minrybhozyaystva, chetyrohspalnuyu, normalized, non-degenerate, extemporaneous, Hairless, stepokruzhayuschy, subvertical, Vanderbilt, shombulygchinskaya, predahtubinskoy, multi-brand, bystrouvyadayuschie, transfusion, highly specialized, vildenburgskaya, skulptirovany, chiliagon strolling, viscoplastic, nostalgic, lightly stuck, representing, representing, emphasized, representing, monastic blowing, crushing, oldenburg, exposed, crushing, pickup, handling, coal mining, pickup, pulitzer, braunschweig, crushing, sane, oldenburg, oldenburg, hydrometeor extremophilic, Germanophile, dvuhprestolnym, viscoelastic, filmprodutsenty, minrybhozyaystva, chetyrohspalnuyu, normalized, non-degenerate, extemporaneous, Hairless, stepokruzhayuschy, subvertical, Vanderbilt, shombulygchinskaya, predahtubinskoy, multi-brand, bystrouvyadayuschie, transfusion, highly specialized, vildenburgskaya, skulptirovany, chiliagon strolling, viscoplastic, nostalgic, lightly stuck, representing, representing, emphasized, representing, monastic blowing, crushing, oldenburg, exposed, crushing, pickup, handling, coal mining, pickup, pulitzer, braunschweig, crushing, sane, oldenburg, oldenburg, hydrometeor filmprodutsenty, minrybhozyaystva, chetyrohspalnuyu, normalized, non-degenerate, extemporaneous, Hairless, stepokruzhayuschy, subvertical, Vanderbilt, shombulygchinskaya, predahtubinskoy, multi-brand, bystrouvyadayuschie, transfusion, highly specialized, vildenburgskaya, skulptirovany, chiliagon strolling, viscoplastic, nostalgic, legkozastryavshih representing representing, emphasized, representing, monastic, crushing, oldenburg, exposed, crushing em, pickup, controllability, coal mining, pickup, pulitzer, braunschweig, devastating, sane, oldenburg, oldenburg, hydrometeorological filmprodutsenty, minrybhozyaystva, chetyrohspalnuyu, normalized, non-degenerate, extemporaneous, Hairless, stepokruzhayuschy, subvertical, Vanderbilt, shombulygchinskaya, predahtubinskoy, multi-brand, bystrouvyadayuschie, transfusion, highly specialized, vildenburgskaya, skulptirovany, chiliagon strolling, viscoplastic, nostalgic, legkozastryavshih representing representing, emphasized, representing, monastic, crushing, oldenburg, exposed, crushing em, pickup, controllability, coal mining, pickup, pulitzer, braunschweig, devastating, sane, oldenburg, oldenburg, hydrometeorological subvertical, Vanderbilt, Shombulygchinskaya, Predakhtubinskaya, multi-brand, fast-fading, blood transfusions, highly specialized, Wildenburg, sculpted, thousand-square, strolling, viscoplastic, nostalgic, nostalgic, open-mouthed, submissive, submissive, presenting saponiformes, controllability, coal mining, pickup, pulitzer, braunschweig, crushing, sane, olde nburg, oldenburg, hydrometeorological service, exposed subvertical, Vanderbilt, Shombulygchinskaya, Predakhtubinskaya, multi-brand, fast-fading, blood transfusions, highly specialized, Wildenburg, sculpted, thousand-square, strolling, viscoplastic, nostalgic, nostalgic, open-mouthed, submissive, submissive, presenting saponiformes, controllability, coal mining, pickup, pulitzer, braunschweig, crushing, sane, olde nburg, oldenburg, hydrometeorological service, exposed
We make a function, come in handy and enter the opposite query - words in which each letter occurs at least twice:
>>> def output(lst): print ', '.join(sorted(lst,key=lambda w:(len(w),d[w]),reverse=True)[:100])
>>> output([k for k in d if min(Counter(k).values())>=2])
the words
dezoksitsitidindezoksitsitidin, izoalloksazinizoalloksazin, transformatoromtransforma, torlaukskhobnatorlaukskhobn, gipsokartonagipsokarton, kolokolokolokolokolokol, skaramangiyskaramangy, baratashvilibaratashvili, makklendonommakklendon, kapuchinatorkapuchinator, paragadionparagadion, kulevrinykulevrinery, ekstraktorekstraktor, kvadrotsiklkvadrotsikl, ferdinandferdinande, kinoteatrakinoteatr, liyaldvrildvayrildy, motsarteumemotsarteum, ittokkortoormiitom, gornostaygornostay, adalbertadalbert, obschezhitiyaobschezhitiya, plesetskayaple setskaya, gusenbachagussenbach, cupcake, canton, old-Kramatorsky, compliance, metallothioneins, unintelligent, interservisinvest, respectively, inappropriate, spankingpunking, multi-million, inappropriate, podolsk-podolsk, boredo-cumulative, small Nanlunananluna, radiator, Odessa, Colombia, Colombia, Sagittarius, Sagittarius, society, society, Moscow gorge, stronghold, tank, intelligent, respectively, respectively, respectively, respectively, thioantimonites, iooroffintesson tanton, ti accordingly, anti-trinitarian, neurosensory, wind aggregates, grossengottern, Voskanyanovskaya, and cetone acetate, charanchocharango, sadberisadbury, respectively, pseudo-confession, geranyl geranyl, state farms, Maltsevmalets, Moscow, respectively, dddddeeeeeeee, predetermined,
Here the rubbish is mainly, but there is also a beautiful thing - multimillion, unintelligent, inappropriate, special processor, anti-sectarian, part apparatus, etc.
And exactly twice:
>>> output([k for k in d if set(Counter(k).values())==set([2])])
the words
kolumbiyakolumbiya, streltsystreltsy, krepostkrepost, tankovyhtankovyh, wind turbines, grossengottern, voskanyanovskaya, sadberisadberi, psevdoispovedi, geranylgeranyl, maltsevmaltsev, predopredililo, ababvgvgddezhzhe, tszyuantszyuan, kriptoportik, ultveultve, steklosteklo, eksumaeksuma, pushkinpushkin, vestanstevna, kiberakibera, poruchiporuchi, zholudizholudi, printsuprintsu, Tingatinga, millionth, junjun, tennis player, polypyrrole, monsoon, nduconduko, rock cagorga, chuangzhuang, maistismata, barberetta, netriderin, pedropedro, syngasinga, glimaglima, mohabhab, mail, jungjung, pinkapinka, riverville, barbarella, erkatakert, papatoetoe, fungufangu, ferrefatta, snorrasona, ngalongalo, missionary, hitonhiton, titicaca, makemake, bulgyanuyantpu, bulgyanuyantpu, bulbulyan, tyulagyan jingjing, papinian, huihui, tiltil,
wind turbines, Voskanyanovskaya, pseudo-confession ... There’s
still another sea, and time is later ... If the inhabitants of the habr like the article, I’ll write the second part. If you have ideas for interesting queries to the dictionary, write in comments or in private messages, and I will translate them into code and publish the results.
PS The frequency dictionary is still here - yadi.sk/d/cRHJvinF7Vt8w