Parsing MS Office files

    Recently I was given a task: extract information from MS Office files (.xls, .doc) for further processing. In practice, this meant pulling out the text contained in each document.



    For .xls, I quickly found the PhpExcelReader project, and there is little more to add: read its code and search the web. All I can offer is a few lines of code to help:

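    // Configure the reader to return UTF-8 strings, converting via iconv.
    $reader = new Spreadsheet_Excel_Reader();
    $reader->setUTFEncoder('iconv');
    $reader->setOutputEncoding('UTF-8');
    $reader->read($this->filename);

    $text = "";

    if ($reader->sheets && count($reader->sheets))
    {
      // Only the first sheet of the workbook is read here.
      $sheet = $reader->sheets[0];
      
      if (isset($sheet['cells']))
      {
        // Each row is an array of cell values; join them into one line of text.
        foreach ($sheet['cells'] as $row)
        {
          $text .= implode(' ', $row) . "\n";
        }
      }
    }
    echo $text;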
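
    The snippet above reads only the first sheet. If the text of every sheet is needed, the same loop can be wrapped in a small function. This is a minimal sketch: the function name sheetsToText is hypothetical, and it assumes only the array layout used above (each sheet is an array that may contain a 'cells' key holding rows of cell values).

```php
<?php
/**
 * Collects the text of all sheets into one string.
 * Hypothetical helper; assumes the same 'cells' layout as $reader->sheets above.
 */
function sheetsToText(array $sheets): string
{
    $text = '';

    foreach ($sheets as $sheet) {
        // Sheets with no cells have nothing to contribute; skip them.
        if (!isset($sheet['cells'])) {
            continue;
        }

        // One line of output per row, cells separated by a space.
        foreach ($sheet['cells'] as $row) {
            $text .= implode(' ', $row) . "\n";
        }
    }

    return $text;
}
```

    With the reader from the example above, it would be called as sheetsToText($reader->sheets).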



    With .doc files things were harder at first: I could not find a free PHP parser that did not rely on COM. (I could not find a paid one either, although I was really looking for a free one. By the way, if anyone knows of such a project, please mention it in the comments.)

    I was about to give up when I decided to look at a .doc file with the console utility less. less complained that "catdoc is not installed". I held my breath, typed sudo apt-get install catdoc, and voilà: I had a console viewer for Word documents. After that, all that remained was to write:
    /**
    * @note catdoc program should be installed and reside within $PATH!
    */
    echo shell_exec('catdoc ' . escapeshellarg($this->filename));
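
    The one-liner above prints whatever catdoc returns. A slightly more defensive variant might look like the sketch below. The function names buildCatdocCommand and extractDocText are hypothetical. The sketch assumes a Unix-like system with catdoc in $PATH and uses catdoc's -d option to request UTF-8 output.

```php
<?php
/**
 * Builds the catdoc command line for a file.
 * Hypothetical helper, not part of any library.
 */
function buildCatdocCommand(string $filename, string $charset = 'utf-8'): string
{
    // -d sets the output charset; escapeshellarg() guards against shell injection.
    return 'catdoc -d ' . escapeshellarg($charset) . ' ' . escapeshellarg($filename);
}

/**
 * Returns the text of a .doc file, or null if the file is unreadable
 * or catdoc produced no output (for example, because it is not installed).
 */
function extractDocText(string $filename): ?string
{
    if (!is_readable($filename)) {
        return null;
    }

    // shell_exec() returns null or false when there is no output or on failure.
    $output = shell_exec(buildCatdocCommand($filename));

    return is_string($output) ? $output : null;
}
```

    Returning null instead of echoing directly lets the caller decide what to do when extraction fails.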

