Automating the conversion of Word files to other formats

Some government agencies generate their reports as doc files. In some places this is done by hand, in others automatically. Imagine that you have been told to process a ton of such documents, perhaps to extract some specific information or simply to check the contents. We need only the plain, unformatted text, without charts or pictures. Such data is, for example, easier to feed into a neural network for further analysis.

Here are the options an ordinary person has:

  • Go through all the files by hand, one by one. After the tenth document it will occur to you that you are doing something wrong.
  • Search the Internet for a special library (extension) for working with doc files in a programming language you know, then spend another hour figuring out how to use it. You will also have to deal with the fact that doc and docx files are handled slightly differently.
  • Try to automatically re-save all the documents in a different format that is more convenient to work with.

The last option is the one we will discuss.

A vbs script comes to the rescue. It can be called from the command line, and that can be done from any programming language.

Create a file named converter.vbs:

Const wdFormatText = 2
Set objWord = CreateObject("Word.Application")
Set objDoc = objWord.Documents.Open(Wscript.Arguments.Item(0), True)
objDoc.SaveAs WScript.Arguments.Item(1), wdFormatText
objWord.Quit

The first line sets the format to convert to: 2 for txt, 17 for pdf.
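For callers that need more than these two targets, the numeric codes can be kept in a small lookup table. The sketch below is in Python and lists a few values from Word's WdSaveFormat enumeration; the dictionary name, the helper `save_format_for`, and the idea of keying by file extension are assumptions of this example, not part of the script.

```python
# A few values of Word's WdSaveFormat enumeration, keyed by target extension.
# Keying by extension is this sketch's own convention, not something Word defines.
WD_SAVE_FORMAT = {
    "doc": 0,    # wdFormatDocument
    "txt": 2,    # wdFormatText
    "rtf": 6,    # wdFormatRTF
    "html": 8,   # wdFormatHTML
    "docx": 16,  # wdFormatDocumentDefault (docx in Word 2007 and later)
    "pdf": 17,   # wdFormatPDF
}


def save_format_for(extension):
    """Return the numeric format code for an extension such as '.pdf' or 'TXT'."""
    key = extension.lower().lstrip(".")
    if key not in WD_SAVE_FORMAT:
        raise ValueError("no format code known for: " + extension)
    return WD_SAVE_FORMAT[key]
```

The value returned would replace the 2 in the script's first line.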
The full list of formats is the WdSaveFormat enumeration in the Word documentation. The second line starts Word itself. After it, you can add the following line:

objWord.Visible = TRUE

This makes the Word window visible while it works. That is useful if an error occurs at some point and Word does not close itself: without this line the process can only be killed through the Task Manager, whereas with it we can simply click the close button.

At the command prompt, the script will run as follows:

converter.vbs full_path_to_file\file_name.docx full_path_to_save\file_name_without_extension

WScript.Arguments.Item(0) is full_path_to_file\file_name.docx
WScript.Arguments.Item(1) is full_path_to_save\file_name_without_extension
Accordingly, the third line of the script opens the file, and the next line saves it in the specified format. At the end we close Word.
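The same command line can be assembled from a program instead of being typed by hand. Here is a minimal Python sketch that runs the script through cscript.exe, the console host of Windows Script Host, and converts both paths to absolute ones, since the script expects full paths. The function names and the 120-second timeout are assumptions made for this illustration.

```python
import os
import subprocess


def build_command(script, source, target):
    """Build the argument list for running the script through cscript.exe."""
    return [
        "cscript", "//NoLogo",     # console script host, banner suppressed
        os.path.abspath(script),
        os.path.abspath(source),   # becomes WScript.Arguments.Item(0)
        os.path.abspath(target),   # becomes WScript.Arguments.Item(1)
    ]


def run_converter(script, source, target, timeout=120):
    """Run one conversion; return True if the script host exited with code 0.

    The timeout value is arbitrary. If it expires, subprocess.run stops
    cscript and raises TimeoutExpired, but a hung Word process may still
    have to be closed separately.
    """
    result = subprocess.run(build_command(script, source, target), timeout=timeout)
    return result.returncode == 0
```

Passing the arguments as a list avoids having to quote paths that contain spaces.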

There is one more little trick you will need. Sometimes, because of differences between Word versions or for other reasons, Word may complain that the file is damaged. When opening such a file by hand, we see the warning "the table is damaged, continue opening the file?". All you have to do is click "Yes", but the script stops at this point.

VBScript's error handling is very clumsy compared with try/catch. You can get around this problem by adding just two lines. The complete, stable script looks like this:

Const wdFormatText = 2
Set objWord = CreateObject("Word.Application")
objWord.Visible = TRUE
On Error Resume Next
Set objDoc = objWord.Documents.Open(Wscript.Arguments.Item(0), True)
Set objDoc = objWord.Documents.Open(Wscript.Arguments.Item(0), True)
objDoc.SaveAs WScript.Arguments.Item(1), wdFormatText
objWord.Quit

As you can see, the call that opens the file is duplicated. If the file is fine, it is simply opened twice; if the first attempt fails with an error, the script carries on and the second call continues opening the file.
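Because On Error Resume Next suppresses errors silently, a failed conversion leaves no trace except a missing output file. One way to catch that afterwards is to compare the two folders, sketched here in Python. The function name is hypothetical, and the sketch assumes that Word appends the .txt extension to the name it was given without one.

```python
import os


def missing_outputs(folder_from, folder_to, extension=".txt"):
    """List source files for which no converted file exists in folder_to."""
    missing = []
    for name in sorted(os.listdir(folder_from)):
        base, ext = os.path.splitext(name)
        if ext.lower() not in (".doc", ".docx"):
            continue  # not a Word document
        # Assumption: the output is named <base><extension>, e.g. report.txt
        expected = os.path.join(folder_to, base + extension)
        if not os.path.isfile(expected):
            missing.append(name)
    return missing
```

Any file names it returns can be converted again or inspected by hand.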

And just in case, here is an example of how such a function might look in Python:

import os

folder_from = os.path.join(os.getcwd(), 'words')  # folder holding the ton of Word files
folder_to = os.path.join(os.getcwd(), 'txts')     # folder where the results are saved

def convert(file_name):
    # Full path to the source document
    str1 = os.path.join(folder_from, file_name)
    # Full path to the output file, without the extension
    str2 = os.path.join(folder_to, os.path.splitext(file_name)[0])
    # Run the script
    os.system('converter.vbs "' + str1 + '" "' + str2 + '"')

Next, simply apply this function to all files that need to be converted.
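Applying the function to every file could look roughly like the following sketch. The helper names `word_files` and `convert_all` are made up for this example, and skipping names that start with "~$" assumes that some documents may be open in Word, which leaves such lock files next to them.

```python
import os


def word_files(folder):
    """Return the names of the .doc and .docx files in a folder, sorted."""
    names = []
    for name in sorted(os.listdir(folder)):
        if name.startswith("~$"):
            continue  # lock file Word leaves next to an open document
        if os.path.splitext(name)[1].lower() in (".doc", ".docx"):
            names.append(name)
    return names


def convert_all(folder, convert):
    """Call convert(file_name) for every Word file in folder; return the count."""
    names = word_files(folder)
    for name in names:
        convert(name)
    return len(names)
```

With the function from the article, the call would be convert_all(folder_from, convert).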

Summary


  1. This solution works for all Word formats.
  2. You spent no more than 10 minutes reading this article.
  3. It can be implemented from any programming language you know.
