
Parsing MS Office files
Recently I was given a task: extract some information from MS Office files (.xls, .doc) for further processing. In practice, this meant pulling out the text contained in the document.
For .xls, I quickly found the PhpExcelReader project. There is not much to add beyond reading its code and searching for examples, so here are just a few lines to get started:
// Assumes the PhpExcelReader library file has already been included.
$reader = new Spreadsheet_Excel_Reader();
$reader->setUTFEncoder('iconv');
$reader->setOutputEncoding('UTF-8');
$reader->read($this->filename);

$text = "";
// Only the first sheet is processed here.
if ($reader->sheets && count($reader->sheets))
{
    $sheet = $reader->sheets[0];
    if (isset($sheet['cells']))
    {
        // Join the cells of each row with spaces, one row per line.
        foreach ($sheet['cells'] as $row)
        {
            $text .= implode(' ', $row) . "\n";
        }
    }
}
echo $text;
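The snippet above reads only the first sheet. If a workbook has several sheets, the same loop can be wrapped in a small function. This is only a sketch: `sheetsToText` is a hypothetical helper name, and it assumes an array shaped like `$reader->sheets`, where each sheet may carry a `cells` entry mapping row numbers to arrays of cell values.

```php
<?php
/**
 * Hypothetical helper (not part of PhpExcelReader): flattens an array
 * shaped like $reader->sheets into plain text, one line per row.
 */
function sheetsToText(array $sheets): string
{
    $text = '';
    foreach ($sheets as $sheet) {
        // Sheets without any cells have nothing to extract.
        if (!isset($sheet['cells']) || !is_array($sheet['cells'])) {
            continue;
        }
        foreach ($sheet['cells'] as $row) {
            // Cells of one row are separated by single spaces.
            $text .= implode(' ', $row) . "\n";
        }
    }
    return $text;
}
```

With the reader from the snippet above, this would be called as `echo sheetsToText($reader->sheets);`.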
With .doc it turned out to be a bit harder at first: I could not find a free PHP parser that does not rely on COM (I did not find a paid one either, although I was mainly looking for a free one; if you know of such a project, please mention it in the comments).
I was getting desperate when I decided to look at a .doc file with the console utility less. less complained that "catdoc is not installed", so I held my breath, typed sudo apt-get install catdoc, and voila: I had a console viewer for Word documents. After that, all that remains is to write:
/**
* @note The catdoc program must be installed and available in $PATH!
*/
echo shell_exec('catdoc ' . escapeshellarg($this->filename));
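shell_exec gives no easy way to tell "catdoc is not installed" apart from "the document is empty". A slightly more defensive variant could use exec and check the exit status. This is a minimal sketch under stated assumptions: `buildCatdocCommand` and `docToText` are hypothetical helper names, and the `-d` option (destination charset) is taken from the catdoc man page, so check `man catdoc` on your system before relying on it.

```php
<?php
/**
 * Hypothetical helper: builds the catdoc command line.
 * Assumption: catdoc accepts -d <charset> to set the output charset.
 */
function buildCatdocCommand(string $filename, string $charset = 'utf-8'): string
{
    // Both arguments are escaped so that spaces and quotes are safe.
    return 'catdoc -d ' . escapeshellarg($charset) . ' ' . escapeshellarg($filename);
}

/**
 * Hypothetical helper: returns the document text,
 * or null if the command exited with a non-zero status.
 */
function docToText(string $filename): ?string
{
    $output = [];
    $exitCode = 0;
    // exec() collects the output lines and the exit status of the command.
    exec(buildCatdocCommand($filename), $output, $exitCode);
    if ($exitCode !== 0) {
        // For example, catdoc is missing from $PATH.
        return null;
    }
    // Note: exec() strips trailing whitespace from every line.
    return implode("\n", $output);
}
```

Inside the class from the snippet above, the call would look like `$text = docToText($this->filename);`, followed by a check for null before further processing.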