How many scientific articles are on the Internet?

    Professor Lee Giles of the College of Information Sciences and Technology at Pennsylvania State University has spent much of his career developing search engines for scientific articles, so that the academic community has easy access to the literature.

    The professor recently published a first-of-its-kind study that estimates the number of scientific articles available on the Internet. The paper, "The Number of Scholarly Documents on the Public Web", was published in the May issue of the journal PLoS ONE and was cited in Nature.

    The study considers only English-language documents and accounts for the overlap between the two largest specialized search engines: Google Scholar and Microsoft Academic Search. Scientific documents are defined as journal publications and conference papers, dissertations and master's theses, books, technical reports and working papers (preliminary versions of scientific articles).

    The statistical estimate shows that at least 114 million scientific documents in English are accessible on the Internet, of which about 100 million can be found through Google Scholar. At least 27 million documents (24%) are freely available.



    The authors adapted the capture/recapture method, which is usually used in ecology to estimate the size of animal populations. In ecology, a certain number of animals are caught, tagged and released into the wild. A second capture is then carried out in the same area. Scientists measure the proportion of tagged animals in the second sample and use a simple formula to make an approximate estimate of the total population size.
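
    The "simple formula" in the ecological version is commonly given as the Lincoln-Petersen estimator: the total population is roughly the product of the two sample sizes divided by the number of tagged animals found in the second sample. The sketch below only illustrates that textbook formula; the function name and the sample numbers are hypothetical, and the article does not say that the authors used exactly this form. Applied to search engines, the two samples would correspond to the documents found by each engine, and the "recaptured" count to the documents found by both.

```python
def lincoln_petersen(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Estimate total population size from a capture/recapture experiment.

    first_sample  -- animals caught, tagged and released in the first pass
    second_sample -- animals caught in the second pass
    recaptured    -- animals in the second pass that already carry a tag
    """
    if recaptured <= 0:
        raise ValueError("at least one tagged animal must be recaptured")
    if recaptured > min(first_sample, second_sample):
        raise ValueError("recaptured count cannot exceed either sample size")
    # Assumes the tagged share of the second sample equals the tagged
    # share of the whole population: recaptured / second_sample = first_sample / N
    return first_sample * second_sample / recaptured


# Illustrative numbers only, not taken from the study.
estimate = lincoln_petersen(first_sample=200, second_sample=150, recaptured=30)
print(estimate)  # 1000.0
```

    The smaller the overlap between the two samples, the larger the estimated total, which is why the overlap between the two search engines matters for the result.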

    Giles's research has practical value for him as a developer. Back in 1997, he and his colleagues released CiteSeer, an open search engine for scientific documents, mainly from the field of computer science. The search engine analyzed citations and references in documents to build an index that supports ranking. It is believed to be the first automatic citation indexing system and the forerunner of tools such as Google Scholar and Microsoft Academic Search.

    In 2008, a new version, CiteSeerX, was released, expanding the coverage to physics, economics, medicine and other scientific fields. Giles is trying to assess what infrastructure is needed to index the documents in each field.



    Giles emphasizes that 24% of all documents are freely available on the Web as direct links through Google Scholar (in computer science, the share of freely available documents is 50%). The professor also notes that freely available documents are cited more often and carry more weight.

