Venyamean August 19, 2008 at 15:38

Programmatically breaking a word into syllables

Recently I needed to implement word hyphenation in PHP. Persistently pestering the search engines got me nowhere: no ready-made script turned up. Never mind a script, even finding a description of the algorithm was difficult. So, armed with a notebook and a pencil, I went to the philology faculty of our BSU in Ufa to ask my philologist friends how it really works. Then I armed myself with Notepad++ and wrote a simple little script that copes with the task after a fashion. What came of it is below the cut.

So, for starters, let's figure out why we need this at all. I needed a word-splitting algorithm for hyphenation, and, as we recall from school, there are three different ways to approach the question:
1) “Graphic”, in which hyphenation is arranged so as not to disturb the visual perception of a word or phrase as a graphic whole. This cannot really be programmed: how do you explain the concept of a “graphic whole” to a machine?
2) “Morphemic”, in which the meaningful parts of a word are not broken when it is hyphenated. Its plus is a hyphenation logic that is clear to both the machine and the user; its minus is the need to compile a dictionary of morphemes, which is no small job.
3) “Phonetic”, that is, “syllabic”, in which hyphens are placed so as not to hinder the reading of a word (as we remember from the same notorious school curriculum, the syllable is the unit of reading and writing in Russian). This is the method I chose to implement. Its plus is that the initial code, the “core”, is easy to write; its minuses are the need to maintain a system of rules for post-processing the primary result, and a logic of splitting into syllables (never mind hyphenating) that is not obvious to the user. The reason is as follows.

A syllable in literary Russian is built on the principle of ascending sonority. Sounds are usually ranked by degree of sonority like this: vowels - 4, sonorant consonants - 3, voiced noisy consonants - 2, voiceless consonants - 1.
Hence:
друзья (friends) - 23434 - дру-зья;
капуста (cabbage) - 1414114 - ка-пу-ста;
чёрная (black) - 143344 - чё-рна-я;
ястреб (hawk) - 3411343 - я-стреб;
хоккей (hockey) - 141143 - хо-ккей;
каменная (stone, adj.) - 143433434 - ка-ме-нна-я;
Дарья (Daria) - 24334 - Да-рья;
коньки (skates) - 14314 - ко-ньки;
семья (family) - 14334 - се-мья.
Yes, this is not how syllable division is usually taught in school (we, personally, were not taught it at all...), but grab a philologist who happens to be running past by the sleeve and make him talk, and he will confirm the above.
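To make the ranking concrete, here is a minimal sketch, in Python rather than the PHP used for the script itself, that turns a word into its sonority digits. The names `SONORITY` and `sonority_code` are made up for this illustration. It classifies letters rather than sounds and simply skips ь and ъ, so it will not reproduce the codes of the examples above that depend on pronunciation (for instance, those where я is counted as two sounds).

```python
# Sonority ranks by letter. Assumption: letters stand in for sounds.
SONORITY = {}
for letters, rank in (
    ("аеёиоуыэюя", 4),   # vowels
    ("лмнрй", 3),        # sonorant consonants
    ("бвгджз", 2),       # voiced noisy consonants
    ("кпстфхцчшщ", 1),   # voiceless consonants
):
    for letter in letters:
        SONORITY[letter] = rank


def sonority_code(word):
    """Return the digit string for a word; ь, ъ and unknown characters are skipped."""
    return "".join(str(SONORITY[ch]) for ch in word.lower() if ch in SONORITY)


print(sonority_code("капуста"))  # 1414114
print(sonority_code("хоккей"))   # 141143
print(sonority_code("коньки"))   # 14314
```

A syllable boundary is then a candidate wherever the digits stop rising, which is the rule the script below applies.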

So now it is clear how we are going to divide the words. Now let's look at the code; to make the algorithm easier to follow along the way, I have commented it:

<?php
// For starters, I decided that Unicode is good (I won't go into details, that is not the topic here),
// so the word and all operations on its parts will use Unicode character references.
// Note: the script assumes that this file and its string literals are saved in windows-1251.
// The following function helps with that:
function win2uni($s)
{
    // conversion win1251 -> iso8859-5
    // (convert_cyr_string() was deprecated in PHP 7.4 and removed in PHP 8.0)
    $s = convert_cyr_string($s, 'w', 'i');
    // conversion iso8859-5 -> unicode:
    for ($result = '', $i = 0; $i < strlen($s); $i++) {
        $charcode = ord($s[$i]);
        $result .= ($charcode > 175) ? "&#" . (1040 + ($charcode - 176)) . ";" : $s[$i];
    }
    return $result;
}
// Now, having dealt with the encodings, we divide the letters into groups, as described above.
// Strictly speaking, we should process sounds rather than letters, but I decided to simplify the task.
// We will not handle softening (ь) the way the syllable-division rules require; we simply agree that
// the hyphen can never (!) be placed before "ь" or "ъ":
// while processing we just ignore them and, if need be, move the hyphen.
// Here we go:
$group_4 = array(win2uni("а"), win2uni("е"), win2uni("ё"), win2uni("и"), win2uni("о"), win2uni("у"), win2uni("ы"), win2uni("э"), win2uni("ю"), win2uni("я"));
$group_3 = array(win2uni("л"), win2uni("м"), win2uni("н"), win2uni("р"), win2uni("й"));
$group_2 = array(win2uni("б"), win2uni("в"), win2uni("г"), win2uni("д"), win2uni("ж"), win2uni("з"));
$group_1 = array(win2uni("к"), win2uni("п"), win2uni("с"), win2uni("ф"), win2uni("т"), win2uni("ш"), win2uni("щ"), win2uni("х"), win2uni("ц"), win2uni("ч"));
// now describe the variables used by the script:
$word = "кот"; // the word that we split into syllables
$split = array(); // for each character of the word, the group it belongs to
$word_split = array(); // the word broken into characters
$start = 0; // start of the loop
$end = strlen($word); // end of the loop (byte length: one byte per letter in windows-1251)
// so let's start processing:

// walk through the original word:

while ($start < $end)
{
    $word_split[$start] = win2uni(substr($word, $start, 1)); // pick the character
    // $word_split[$start] is already converted, so it is compared with the groups directly
    $is_group1 = in_array($word_split[$start], $group_1); // true if the character belongs to the first group
    $is_group2 = in_array($word_split[$start], $group_2); // likewise
    $is_group3 = in_array($word_split[$start], $group_3); // likewise
    $is_group4 = in_array($word_split[$start], $group_4); // likewise
    // (the flags are not really needed; they helped me while debugging, and then I was too lazy to remove them...)

    // now check the flags:
    if ($is_group1) // the character raised the first flag!
    {
        $split[$start] = 1; // record in the array that the character belongs to the first group
    }
    elseif ($is_group2) // likewise
    {
        $split[$start] = 2;
    }
    elseif ($is_group3) // likewise
    {
        $split[$start] = 3;
    }
    elseif ($is_group4) // likewise
    {
        $split[$start] = 4;
    }
    else // the character is in none of the groups (a soft sign, for example)
    {
        $split[$start] = $word_split[$start]; // write it as is, and then we'll figure it out
    }

    $start++;
}
// there, the word is taken apart. Next: test output of $split, to see what happened

foreach ($split as $s)
{
    echo $s;
}

echo "<br>\n";

// and test output of $word_split: the output of $split has to be compared with something =)
foreach ($word_split as $w)
{
    echo $w;
}
echo "<br>\n";

// and now we actually break the word into syllables:
// (I was too lazy to collect the result in a variable first, so it is printed right in the loop)
$count = 0; // we have a new counter =) I fired the old one =) =)

while ($count < count($split)) // "<" rather than "<=", so we never read past the last character
{
    $a = $split[$count]; // group of the current character
    $b = isset($split[$count + 1]) ? $split[$count + 1] : null; // group of the next character, null at the end of the word
    // compare the group of the current character with the group of the next one
    if (!is_numeric($a) or !is_numeric($b)) // a "soft sign" or the end of the word
    {
        echo $word_split[$count]; // no hyphen
    }
    elseif ($a - $b == 0 and $b == 4) // the difference is 0 and both are vowels
    {
        echo $word_split[$count];
        echo "-"; // put a hyphen between them
    }
    elseif ($a - $b <= 0) // sonority does not fall
    {
        echo $word_split[$count]; // no hyphen
    }
    else // sonority falls
    {
        echo $word_split[$count];
        echo "-"; // insert the hyphen
    }
    $count++;
}

echo "<br>\n";
// that's all =)
?>
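The arithmetic inside win2uni can be checked independently. The sketch below is in Python rather than PHP and mirrors the function under the same assumption (input bytes are windows-1251). It relies on the fact that in ISO-8859-5 the letter А sits at byte 176 while its Unicode code point is 1040, so the Russian letters А to я and ё map with a constant offset. The Python function is only an illustration, not part of the original script.

```python
import html


def win2uni(raw):
    """Turn windows-1251 bytes into HTML numeric character references."""
    # windows-1251 -> ISO-8859-5, the same step convert_cyr_string($s, 'w', 'i') performs
    iso = raw.decode("cp1251").encode("iso8859_5")
    # bytes above 175 are shifted to code points starting at 1040 (U+0410, А)
    return "".join(
        "&#%d;" % (1040 + byte - 176) if byte > 175 else chr(byte)
        for byte in iso
    )


encoded = win2uni("кот".encode("cp1251"))
print(encoded)                 # &#1082;&#1086;&#1090;
print(html.unescape(encoded))  # кот
```

Note that the shortcut does not cover capital Ё, which ISO-8859-5 stores at byte 161, below the 175 threshold.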

The program seems to be ready. But let's test it on the words from the theory section above.
Take, for example, the word “ястреб” (hawk): it should be divided as “я-стреб”, but the program breaks it into “я-стре-б”, because sonority falls between “е” and “б”. A bug? You can post-process the output with regular expressions and close the hole that way. But how the program will hyphenate the word “ландскнехт” (Landsknecht) is anyone's guess. Why? Simply because the word is not Russian and does not obey the Russian rules of syllable division. In short, the code will have to be topped up with a whole system of rules, whose development I leave to the readers' conscience. It is good exercise for the brain, and you will get better at philology too ))
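As one possible starting point for such rules, here is a hedged sketch in Python rather than PHP. `naive_split` reproduces the rule used by the script (break after a letter when sonority falls, or between two vowels), and `merge_vowelless` is a hypothetical post-processing step, not the author's, that glues any fragment without a vowel onto a neighbouring syllable. It uses a plain loop instead of the regular expressions mentioned above, and all names here are invented for the illustration.

```python
VOWELS = "аеёиоуыэюя"
RANKS = {}
for letters, rank in ((VOWELS, 4), ("лмнрй", 3), ("бвгджз", 2), ("кпстфхцчшщ", 1)):
    for letter in letters:
        RANKS[letter] = rank


def naive_split(word):
    """Break after a letter when sonority falls, or between two vowels."""
    parts, current = [], ""
    for i, ch in enumerate(word):
        current += ch
        a = RANKS.get(ch.lower())
        b = RANKS.get(word[i + 1].lower()) if i + 1 < len(word) else None
        if a is None or b is None:
            continue  # ь, ъ or the end of the word: never break here
        if a > b or (a == b == 4):
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def merge_vowelless(parts):
    """Attach fragments that contain no vowel to a neighbouring syllable."""
    merged, prefix = [], ""
    for part in parts:
        if any(ch.lower() in VOWELS for ch in part):
            merged.append(prefix + part)
            prefix = ""
        elif merged:
            merged[-1] += part   # trailing consonants join the previous syllable
        else:
            prefix += part       # leading consonants wait for the first syllable
    if prefix:
        merged.append(prefix)
    return merged


print("-".join(naive_split("ястреб")))                   # я-стре-б
print("-".join(merge_vowelless(naive_split("ястреб"))))  # я-стреб
```

This only removes vowelless fragments. Borrowed words such as ландскнехт would still need rules of their own.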

That, in fact, is all I wanted to write about today. Thanks for your attention.

PS My first post. I did it! =)
