Internetional August 31, 2008 at 14:09

Problems of extracting information from electronic digital sources

I work at the Faculty of International Relations of St. Petersburg State University. We mainly teach people with a humanities mindset, and when our students try to use electronic sources of information found via the Internet in their research, they sometimes simply cannot get the information they need out of them.

For such students, I wrote an article that can be used as a guide to "extracting information." But since I have a humanities mindset myself, I would like to discuss it with visitors to Habrahabr. Perhaps someone here will point out something that I have undeservedly ignored.

Caution: the text is rather long :)

Although almost any information can be presented in electronic digital form today, textual, audio, and visual sources are the most relevant for a researcher of international relations. All of them are machine-readable. This means that the information they contain cannot be perceived by a person directly; extracting it requires technical devices: computer hardware and software capable of processing information objects created with various information processing technologies.

Technologies for processing information, converting it to digital form, and recording and storing it on electronic media have been and are being created in different parts of the world by research teams, academic groups, commercial developers, public organizations, and individuals. As a result, there is a considerable number of standards for electronic documents, and working with them requires a wide variety of computer programs. Some standards are internationally recognized; they are recommended for use by reputable international organizations: the International Organization for Standardization (ISO), the International Telecommunication Union (ITU), and the International Electrotechnical Commission (IEC). For example, PDF (Portable Document Format), the standard for the electronic presentation of text and graphics developed by Adobe, is an official electronic document standard approved by ISO and designated ISO 32000 in that organization's classification of standards. This standard is widely used throughout the world. Electronic documents based on it can be found in large quantities both on the World Wide Web and in file-sharing networks, and it is not difficult to find computer programs that display them.

Sometimes, even when officially recognized international standards exist, alternative technologies become the de facto standards. Suppose a researcher who is not well versed in modern information technologies, but who has a clear understanding of the system of international standards, plans to develop his topic through electronic correspondence with participants in and eyewitnesses of the events that interest him. He might decide to use software for processing, sending, and receiving e-mail based on the X.400 standard, which was developed and adopted for these purposes by the International Telecommunication Union. But when searching for such software he will encounter significant difficulties, and if he does find it, it will turn out that his potential respondents have no similar programs. The reason is that mail based on this standard is used mainly in governmental, intergovernmental, and banking correspondence. Individuals all over the world use the SMTP and POP3 protocols for these purposes; these are less reliable and secure and are not approved as international standards by the authorized organizations, yet they function as such at the global level. Accordingly, the researcher needs an e-mail client that works specifically with these protocols for sending and receiving information.
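
For readers who want to see what "working with SMTP" looks like in practice, here is a minimal sketch in Python using only the standard library. The addresses, the host name, and the port are placeholders invented for illustration, and the choice of STARTTLS with a login is an assumption about the mail provider, not something prescribed by the text above.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text e-mail message (nothing is sent here)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_message(msg: EmailMessage, host: str, user: str, password: str,
                 port: int = 587) -> None:
    """Hand the message to an SMTP server.

    Assumes the (hypothetical) server accepts STARTTLS and a login;
    real providers document their own host names and ports.
    """
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```

Receiving mail over POP3 would use the analogous `poplib` module from the same standard library; an ordinary e-mail client hides both steps behind its settings dialog.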

In addition to such global standards for creating and working with electronic documents, there are regional standards. In Europe, for example, such standards are developed and promoted to the global level by the European Committee for Standardization, the European Committee for Electrotechnical Standardization, and the European institute for standardization in the field of network infrastructure; the last of these develops and actively implements the so-called Information Society Standards System (ISSS). Different countries may have their own standards for creating and processing electronic information sources, approved by the relevant institutions at the national level. Finally, original standards for encoding and decoding information can be used in commercial software products, in corporate information systems, or simply within certain social circles. For example, a researcher cannot ignore files created with the Microsoft Office software package, although many of the technologies these programs use to represent documents are not only not recognized as international standards by the competent international organizations but are also patented, so that working with them legally requires licensed software, which can cost a lot of money. Nor can he ignore TAR file archives or OGG sound files: the technologies behind them, again without being official standards for storing electronic information, are actively used by proponents of the open-source ideology around the world. Data valuable for the study of international relations can be presented in electronic form by people from various countries and social groups who have access to a wide variety of (sometimes very specific) information technologies and software products, so a researcher today must have at least a general idea of how to process data presented in different electronic formats.
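
As a small illustration of how software tells these formats apart, the sketch below guesses a file's format from its first bytes. The signature table is deliberately short and illustrative, the function names are invented for this example, and a match only identifies the container (a ZIP signature, for instance, does not prove the file is a docx).

```python
from typing import Optional

# Well-known leading byte signatures of a few formats mentioned in the text.
# The table is illustrative, not exhaustive.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"OggS", "Ogg container (e.g. OGG audio)"),
    (b"PK\x03\x04", "ZIP container (e.g. docx, odt)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "legacy Microsoft Office file (doc, xls, ppt)"),
]


def guess_format(data: bytes) -> Optional[str]:
    """Guess a format from the first bytes of a file; None if unknown."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    # POSIX tar archives carry the marker "ustar" at byte offset 257.
    if data[257:262] == b"ustar":
        return "TAR archive"
    return None


def guess_file_format(path: str) -> Optional[str]:
    """Read the first 512 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return guess_format(handle.read(512))
```

A researcher would rarely write this by hand, but it shows why a file's extension is only a hint: the programs that open a document rely on what is inside it.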

To extract information from any digital source, you need a device that can convert information from machine-readable form into a form suitable for human perception. Today many devices can do this, from a desktop computer to a mobile phone. We will not dwell on sophisticated information conversion technologies; computer science experts have written many manuals on these matters in recent years. Instead, let us briefly describe those modern solutions that do not require knowledge of the technical subtleties of working with electronic documents. They are needed above all by an international relations researcher who receives information from different countries and processes it on different devices, some of which may be at home and others in various organizations, at Internet access points, and so on: in short, wherever a researcher working with time-sensitive information needs to receive and process it. It must also be borne in mind that working with electronic sources in the social sciences cannot replace field research. Therefore, some of these devices may be located abroad, in the countries whose processes are being studied.

Information reaches the international relations researcher in different languages. If electronic digital sources contain information in foreign languages in audio form, the researcher needs to know these languages in order to extract it. If a particular researcher does not speak a language needed to extract information valuable for the topic from an electronic digital source, it makes sense to entrust the study of such a topic to a research team that includes people who speak different languages. If a source in a foreign language contains information in the form of printed text, electronic translation technology can be used to extract it. This technology is implemented both in special handheld devices called "electronic translators" and in software for multifunctional electronic devices. In publicly accessible form, it is implemented on large servers of the global information network, which can be reached from any computer connected to the Internet and equipped with a web browser. A well-developed, publicly available electronic translation system created by Google can be found at < translate.google.com >. It works with over a hundred languages. Another electronic translation system that a Russian-speaking researcher needs to know about is available at < translate.ru >. It supports fewer languages and translation directions but provides significantly better translation into Russian, because it is being developed by Russian specialists and has been under development longer than Google's system. There are other machine translation systems as well.

Electronic translators should not be used to extract information from sources such as legal acts or political statements. The subtleties of language that matter in such documents are lost in machine translation. Moreover, "Russian-language" texts obtained through machine translation should not be quoted in academic works: the sentences often come out incoherent, to say nothing of style. The current level of artificial intelligence allows a researcher using machine translation to understand only the general meaning of the information in the translated source, and because many words can be translated in several ways, even the general meaning of individual sentences may not be entirely clear. To clarify the meaning of individual words, you can use electronic dictionaries: programs that work with databases in which the words of one language are linked to the set of their meanings in other languages. Unlike electronic translators, these programs do not choose the best translation of a word for the user; they leave the choice to him. Public electronic dictionaries can be found through search engines on the World Wide Web with a simple query such as "Chinese-Russian dictionary online", where "Chinese" can be replaced with any other language. If the meaning of a word cannot be found this way (there is no public dictionary, or the dictionaries found do not contain the word), there are two options. The first is to enter the unknown foreign word as a search query and, using the additional options offered by search engines, limit the results to Russian-language documents; if a foreign word appears in a Russian text, its meaning is likely to be explained there. The second is to find an electronic dictionary that translates the words of the unfamiliar language into English. There are many more such dictionaries on the World Wide Web, owing to its more developed English-language segment and the greater prevalence of English among its users. It is difficult to imagine a qualified researcher of international relations who does not speak English.

Machine translation can be used not only to translate ready-made printed texts but also in electronic correspondence with someone who does not speak Russian but does speak one of the languages that a given electronic translation system supports. To do this, it is enough to keep a browser tab open on the page of the translation system and, before sending each message, enter the text into the system, obtain the translation, and send the translated text. You can do the same with messages coming from the respondent.

It is often necessary to convert the information received not only from one language to another but also from one format to another. For example, a document found by the researcher was prepared in the Microsoft Word 2007 text editor and has the extension "docx", and the researcher does not have this program (or it is missing on the computer he is using at the moment). What is to be done? Run to the nearest software store and buy several copies of the Microsoft Office package, each of which costs a lot of money? (One copy is not enough if you work on different computers.) And what if you are not allowed to install software on someone else's computer? The problem is solved by online systems for converting electronic documents. The most developed of them is available at < zamzar.com >. On this site, the user fills out a simple form: selects the local file to be converted (the file name may contain only Latin letters and numbers), chooses the target format (for example, a "docx" file can be "translated" into twenty different formats, including "doc", "rtf", "pdf", "txt", "odt", and "html"), provides an e-mail address, and waits a little. The user then receives an e-mail with a link to the file in the desired format. With this and similar converters, it is easy to extract information from sources for which the device in use at a given moment has no viewing or processing tools.
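
For the curious, a "docx" file is itself a ZIP archive whose main text lives in an XML part named `word/document.xml`, so its plain text can often be pulled out without any office suite. The following Python sketch shows the idea; the function name is invented for this example, and it is a simplification that ignores headers, footnotes, and formatting and is in no way a substitute for a real converter.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by the main body of a docx document.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source) -> str:
    """Return the plain text of a docx file.

    `source` may be a file path or a binary file object. Each
    paragraph becomes one line of the result.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (a hypothetical file name) returns the text as a string that can then be read, searched, or pasted into a translation system.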

There are many websites that, like Zamzar.com, can help you process various types of information in electronic digital form without always having at hand a computer with all the software needed for files of various types. Thanks to them, the researcher can work with his data from anywhere in the world, at any computer connected to the Internet.

If you need to process e-mail and the computer has no e-mail client, the web interface offered by all major providers of free e-mail services will be useful. The largest of them are available at the following addresses: < mail.ru >, < gmail.com >, < mail.yandex.ru >, < mail.yahoo.com >. The advantage of the web interface over an e-mail client is not only that you do not need an installed and configured client, but also that all messages ever received and sent are stored on the servers of the e-mail providers (many of which offer unlimited space for this) and can be accessed from any computer, anywhere, at any time.

Instant messaging also has web interfaces. This type of communication makes it very convenient to receive up-to-the-minute information about events from their eyewitnesses or participants, if the researcher maintains contact with them. For the ICQ system, the web interface is available at < go.icq.com >, for Google Talk at < gmail.>, and for Yahoo! Messenger at < webmessenger.yahoo.com >. In addition, there are solutions that let you work with several messaging systems on one web page at the same time. The most convenient of them are available at < meebo.com >, < koolim.com >, < flick.im >. When you correspond through any of these sites, the message archive is stored on a remote server and can be accessed from anywhere at any time, using only a computer connected to the Internet.

There are also several ready-made solutions for working with texts. The most complete of them are available at < www.thinkfree.com >, < www.zoho.com >, < docs.google.com >. All of them try to imitate traditional office suites (first of all, Microsoft Office) and offer almost all the functions found in such suites. But no program is installed on the local computer: work with documents is carried out directly in the browser, and the files are stored on the server. That is, no special software is needed to extract information from such files. Virtually any computer connected to the Internet on which a modern browser can run is suitable.

Tools for processing spreadsheets, viewing and creating presentations are also present on all of these sites.

To extract information from audio recordings and sound files on computers that lack special software, you can use audio and video hosting systems, which convert files into Flash objects that can be listened to or viewed in any modern browser. The largest audio hosting service in the CIS countries is available at < vkontakte.ru/audio.php >, and video hosting is available at < vkontakte.ru/video.php >. To use this site you need to register on it. But the registration requirement makes it possible, when storing a file on this server, not to expose it to an unlimited circle of people, which may be prohibited by the owners of exclusive rights to the audio or video recording that the researcher uses as a source. On audio and video hosting sites that do not require registration, an uploaded file is available to everyone who has access to the site: the user who uploaded it was not authenticated, so granting subsequent access to him alone is much harder because he cannot be reliably identified.

To store web pages found on the World Wide Web, it is extremely convenient to use so-called "social bookmarking services", which keep copies of bookmarked web pages on their servers. One of the most convenient tools is available at < diigo.com >. First, it lets you always have the addresses of the sites you need at hand. Second, it lets you highlight text and make notes directly on web pages without saving them to the local computer or launching any additional programs, and when you view these pages later, even on other computers, you again see what was highlighted and what notes were made (this is very useful when working with text sources). And most importantly, if the content of a page on the World Wide Web changes, you can always return to the document that was on this page at the moment it was "bookmarked".

If a link to a document is found on the World Wide Web but the document itself cannot be reached through this link, it is likely that it was once there but has since been deleted. How can information be extracted from an electronic digital source that no longer exists? In this case the Wayback Machine can be useful, a tool created by the Internet Archive, a public organization that deals with the practical aspects of preserving the digital heritage. The Wayback Machine lets you see how a given web page looked at a particular moment. Unfortunately, not all websites are stored in the Internet Archive's "digital archive", but there is still a considerable chance of finding significant documents that were once posted on the World Wide Web. In the Wayback Machine you need to enter the website address,

Some search engines provide similar capabilities. If the search engine found a document, provided a link to it, and the link did not lead to the desired document, sometimes you can view a copy of the found web page stored on the search engine server at the moment this page was indexed.
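
Returning to the Wayback Machine: archived copies can also be reached directly by address. The Python sketch below builds such addresses; it assumes the address patterns that the Internet Archive uses publicly at the time of writing (`web.archive.org/web/<date>/<page>` and the "availability" lookup), and the function names and the example page are invented for illustration. Nothing is downloaded here; the resulting address is simply opened in a browser.

```python
from datetime import date
from urllib.parse import urlencode


def wayback_url(page_url: str, when: date) -> str:
    """Address of the archived copy closest to the given date.

    The Wayback Machine redirects to the nearest snapshot it holds,
    if it holds any.
    """
    return f"https://web.archive.org/web/{when:%Y%m%d}/{page_url}"


def availability_query(page_url: str, when: date) -> str:
    """Address of a lookup that reports whether a snapshot exists."""
    params = urlencode({"url": page_url, "timestamp": f"{when:%Y%m%d}"})
    return "https://archive.org/wayback/available?" + params
```

For example, `wayback_url("example.com/page", date(2008, 8, 31))` yields an address asking for the copy of that page nearest to August 31, 2008.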

For searching and processing news, it is convenient to collect all the news in one place. This is possible thanks to the RSS feeds that almost every site with regularly updated content has today: news sites, blogs, forums, and so on. By subscribing to a large number of RSS feeds that reflect documents appearing on different sites on a topic of interest, the researcher can quickly look through new items, read the most important ones, and miss nothing, without having to visit all these sites and look for new materials there. This is what feed reader programs provide; many of them also run on remote servers and are accessible from any computer connected to the Internet: < google.com/reader >, < lenta.yandex.ru >, etc.
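
What a feed reader does with RSS can be sketched in a few lines of Python. The example assumes feeds in the common RSS 2.0 layout (`item` elements with `title`, `link`, and `pubDate`); other feed formats such as Atom use different element names, and the function names here are invented for illustration. The feed texts are passed in as strings, so how they are downloaded is left open.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss(xml_text: str) -> list:
    """Extract title, link, and publication time from an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the same format as e-mail headers.
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return entries


def merge_feeds(feeds: list) -> list:
    """Combine several feeds into one list, newest entries first."""
    merged = [entry for xml_text in feeds for entry in parse_rss(xml_text)]
    # Entries without a date sort to the end of the list.
    merged.sort(
        key=lambda e: e["published"].timestamp() if e["published"] else float("-inf"),
        reverse=True,
    )
    return merged
```

This is the whole trick behind "all the news in one place": each site publishes a small machine-readable list of its latest items, and the reader merges those lists by date.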

There are also dozens of websites that integrate all or many of the types of services described above. For example, users of the < desktoptwo.com > website can, without running any program on the local computer except a browser, work directly on the site with an HTML editor, the OpenOffice.org office suite, an RSS feed reader, an MP3 player, and an instant messaging system. At < g.ho.st > you can likewise run on the remote server an RSS reader, a notepad, a clock, an e-mail client, and an instant messaging program; in addition, without leaving this site, you can launch Zoho office applications, watch videos from the YouTube video hosting service, and view images from the Flickr photo hosting service. And on the < ulteo.com > server you can run a full-fledged distribution of the Linux operating system with a wide range of applications and work with them directly in the browser window.

The last example requires a special applet to be installed on the computer. All the other services work simply in the browser. Some need support for AJAX technology and some for Flash technology; most modern browsers support both. If there is a chance that the researcher will end up at a computer with Internet access that has no such browser, he can use the portable version of the free Mozilla Firefox browser, which can be carried everywhere on a small storage device (for example, a memory card) and run without installation on almost any modern computer. The distribution of this browser can be found, for example, at < portableapps.com >. This program is also convenient because it can store account information for different servers (including passwords), which makes logging in to various resources from different computers faster and easier, while confidential information about the user's web sessions is not saved on the computers from which he goes online.

If the portable version of the browser fails to start on a particular computer (for example, because the computer runs an operating system that the browser does not support), you can use an operating system distribution that does not require installation on the computer's hard drive and can be booted from the CD drive. Several operating systems offer such distributions. The most widespread of them that includes a modern browser is available for download at < knoppix.com >.

Many sources of information on international relations contain confidential information, access to which is restricted. Such sources can be electronic. You can consult them only after obtaining clearance, and information from them cannot be cited in academic works because of its secrecy. But the computers of organizations involved in international relations may also hold electronic databases of documents that are not secret yet are inaccessible through public information networks. They can be reached only from computers connected to the local networks of these organizations. In effect, the only way to obtain information from such sources (in the absence of direct access to these organizations' computer networks) is to ask the leaders or employees who are authorized to copy information from such databases to find and provide particular documents. Of course, no one is obliged to fulfill such a request from a researcher. Help can be given only out of goodwill.

In addition, the electronic digital sources that a researcher of international relations needs, although located on the World Wide Web, are often not open to all comers. Access may be limited to people who have an account on the site where the sources are posted, or even to particular visitors only: those to whom the person who posted the information has granted access. With the development of a new type of information system on the World Wide Web, namely sites whose content is created by an unlimited circle of users who can themselves grant particular visitors access to particular entries, we encounter such sources more and more often. Today they are the most convenient way to receive information about events from their participants or eyewitnesses. Finding such restricted-access sources on the World Wide Web is not easy. Search engines cannot index them, so a simple keyword search may lead the researcher to miss these important sources.

Today, it is important for a researcher of social processes to have accounts on various "social networks", blogging sites, thematic forums, and so on, in order to gain access both to information that these sites open only to authorized users and to the sites' own internal search mechanisms. The search systems on these sites let you look for communities by interest and for users by place of residence, political views, or other information that they provide about themselves. That is why such sites are an unusually valuable resource for researching social processes, systems, and relationships, including international ones.

However, it is not enough simply to register on a number of such sites. As already mentioned, their users often have the right to restrict access to the information they post. Therefore, merely watching these sites in order to track relevant information is not the best method of extracting information from such sources. To obtain the full picture, it is necessary to monitor the appearance of new information on these sites or even to practice participant observation, that is, to take part in the processes of social communication carried out through these sites: active participation in community discussions and correspondence with people who can provide valuable information about the subject of research. Without personal communication with people who post valuable information on such sites,

The examples of electronic document databases that are accessible only through the local networks of certain organizations, and of information that is accessible through the World Wide Web only to a limited circle of people, show that working with electronic digital sources of information cannot be regarded as an alternative to finding information through social contacts. Without resorting to social interaction, it is impossible to obtain comprehensive information about social processes, systems, and relationships.

(First published in my Academic FAQ)
