Barcode database: free download without registration (and other such stuff)

    Good afternoon. A huge directory of barcodes with product names, categories, and brands is finally publicly available.

    We have been working on it for about 8 years, and it now contains about 3 million barcodes in the EAN (EAN-13, EAN-8) and UPC (UPC-A, UPC-E) standards.

    What is there?


    It is a table of barcodes and the corresponding product names. Every entry has a category, and many have a brand.

    The range of products is very wide. There is no heavy equipment, but probably every consumer segment is present (pharmaceuticals, perfumes, cosmetics, foodstuffs, toys, adult products, books, stationery, hardware, tools, and so on).

    The original online version of the directory is hosted on the Universe-HTT server.
    The open version is posted on GitHub. Please note that the repository sources hold the database split into fragments. The full file is in the release.

    Why is it needed?


    Those who have searched (mostly unsuccessfully) on the Internet or anywhere else for a barcode directory already know why it is needed. For the rest, I'll list the useful properties of such an extensive data set:

    • First of all, this is a list of products with "hard" identifiers. That is, you can take an arbitrary product, for example one lying on your bedside table, and use the barcode printed on the package to match it with the same product sitting somewhere in a warehouse in Rio de Janeiro.
    • As a consequence of the previous point, electronic document exchange between enterprises becomes easier, because the problem of matching up most (but not all, of course) goods disappears.
    • You can quickly open a new store without keying the goods into the accounting system by hand, pulling them from such a directory by barcode instead (a very idealized example, but still).

    The uses above and their possible variations are quite commonplace. There are much more interesting uses for this directory:

    • Analysis of the trademark dictionary
    • Training of neural networks for classifying goods and normalizing their names
    • Development of "intelligent" systems for comparing price offers from different sources
    • Comparative analysis of sales and other operations across unrelated enterprises
    • ... Continue the list as far as your imagination allows

    Presentation format


    The database is represented by a text file in UTF-8 encoding with fields separated by a tab character.

    The structure of the record is as follows:

    • ID: Internal Product Identifier
    • UPCEAN: Barcode
    • Name: Product Name
    • CategoryID: Internal category identifier
    • CategoryName: The name of the category. Since the category directory is hierarchical, this name is compound: it runs from the top level down to the terminal level to which the product belongs. The level separator is a slash ('/')
    • BrandID: Internal Brand Identifier
    • BrandName: Brand Name
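
    As an illustration of the record structure above, here is a minimal sketch of reading such a file in Python. The field order is taken from the list above; the sample line is made up, and whether the real file starts with a header row is an assumption, so the sketch simply skips a row that matches the field names.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Field order as listed in the article.
FIELDS = ["ID", "UPCEAN", "Name", "CategoryID",
          "CategoryName", "BrandID", "BrandName"]


@dataclass
class Record:
    id: str
    upcean: str
    name: str
    category_id: str
    category_path: List[str]  # CategoryName split on '/'
    brand_id: str
    brand_name: str


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per well-formed tab-separated line."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip malformed lines
        if row == FIELDS:
            continue  # skip a header row, if the file has one (assumption)
        rid, code, name, cat_id, cat_name, brand_id, brand_name = row
        path = [part.strip() for part in cat_name.split("/") if part.strip()]
        yield Record(rid, code, name, cat_id, path, brand_id, brand_name)


# Made-up sample line, not taken from the real database.
sample = ("1\t4600000000015\tSample chocolate bar 100 g\t10\t"
          "Food/Confectionery/Chocolate\t7\tSampleBrand\n")
records = list(parse_records(io.StringIO(sample)))
print(records[0].category_path)
```

    For the real file, the same function can be fed an open file object, for example open(path, encoding="utf-8", newline=""), where path is wherever you saved the download.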

    Internal identifiers are hardly of interest to anyone. We export them only for our own purposes (so that we can pinpoint the exact record if someone outside comes to us with a question).

    Records in the freely distributed file are sorted alphabetically by product name.

    Features


    If you study the data carefully, you will notice that, unlike most similar directories available on the Internet (both paid and free), a great deal of work has gone into the product names.

    A few words about how we do this.

    First of all, the directory (administered in the OpenPapyrus system) is processed automatically using a technology that I once described on Habr.

    I would like to say that this technology does everything for us. But alas. A large amount of work has to be done in semi-automatic and manual modes.

    Many names have to be "deciphered": in the original source they can contain inconceivable abbreviations and completely ignore our product naming conventions :)

    All barcodes in the public release are guaranteed to pass validation against one of 4 standards (EAN-13, EAN-8, UPC-A, and UPC-E) and include a check digit. Possible defects and problems are described below.
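
    For readers who want to re-check codes themselves, here is a minimal sketch of the standard GS1 mod-10 check digit calculation that EAN-13, EAN-8, and UPC-A share. It is not our production validator, and the function names are made up for this example. UPC-E is left out, because it has to be expanded to UPC-A before the check digit can be verified.

```python
def gs1_check_digit(payload: str) -> int:
    """Check digit for a digit string that does not yet include one.

    Weights alternate 3, 1, 3, 1, ... starting from the rightmost digit.
    """
    total = 0
    for i, ch in enumerate(reversed(payload)):
        weight = 3 if i % 2 == 0 else 1
        total += int(ch) * weight
    return (10 - total % 10) % 10


def has_valid_check_digit(code: str) -> bool:
    """True if code is an 8-, 12- or 13-digit string with a correct check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13):  # EAN-8, UPC-A, EAN-13
        return False
    return gs1_check_digit(code[:-1]) == int(code[-1])


print(has_valid_check_digit("4006381333931"))  # EAN-13
print(has_valid_check_digit("036000291452"))   # UPC-A
print(has_valid_check_digit("73513537"))       # EAN-8
```

    All three sample codes are commonly used textbook examples and pass the check; changing any single digit makes the function return False.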

    Completeness and relevance


    The standard answer to the typical question "does the directory contain all barcodes?" is: no, and it cannot.

    If you measure completeness by the probability that a barcode that happens to catch your eye is missing from the directory, that probability is 10-15 percent (my own very rough estimate, and, as you understand, a biased one). In any case, there is no other openly available directory of comparable size.

    The geographical coverage (the countries in which the goods are sold) is significant: Russia, Ukraine, Belarus, the USA, Great Britain, the European Union, South Africa, Brazil, Malaysia and many others.

    The languages of the entries are mainly Russian and English. We usually ignore sources in other languages, since we cannot make sense of them (as an exception, there are some items in Spanish, Czech, and other languages).

    We update the directory on the Universe-HTT server every few months (once enough data has accumulated in the staging buffer). The last upload was in June of this year. The newest items are most likely missing from it. However, surprising as it may seem, new barcodes do not appear that often. Many products are sold in retail under the same codes for years.

    We also plan to update the open version of the directory from time to time.

    Sources


    Where do we get all this data? Mostly from the Internet. We collect various price lists and open reports, including those from government agencies (for example, some states in the US publish procurement data).

    Weeds


    The directory contains a number of defects. There are not many of them, but they need to be mentioned.

    Defective Codes


    First of all, there are barcodes that are mistakenly interpreted as UPC-A when in reality they are EAN-13 codes without a check digit. The reason is that the original source (we no longer know which one) contained an EAN-13 code without its check digit, but the last remaining digit happened to satisfy the UPC-A check digit rule, so our modest algorithm classified the code as UPC-A. This could have been corrected, but we noticed it too late and never got around to a mass fix.
    Such cases are vanishingly rare, but, as they say, alas.
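
    The ambiguity is easy to reproduce. A 12-digit string can be read either as a complete UPC-A code or as an EAN-13 code whose check digit was dropped, and the check digit rule alone cannot tell the two apart. Below is a minimal sketch with made-up function names; the sample value is a well-known UPC-A textbook example, used here only as an illustration.

```python
def gs1_check_digit(payload: str) -> int:
    """GS1 mod-10 check digit for a digit string without a check digit."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def interpretations(code12: str) -> dict:
    """Show both readings of a 12-digit string."""
    if len(code12) != 12 or not (code12.isascii() and code12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    return {
        # Reading 1: a complete UPC-A code, last digit is the check digit.
        "valid_upc_a": gs1_check_digit(code12[:-1]) == int(code12[-1]),
        # Reading 2: the first 12 digits of an EAN-13; restore the 13th.
        "as_ean13": code12 + str(gs1_check_digit(code12)),
    }


result = interpretations("036000291452")
print(result)
```

    When "valid_upc_a" is True, nothing in the digits themselves says which reading is the right one; that can only be decided from the source of the data.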

    Gross mismatch


    Next, some products are mixed up. That is, in some (extremely rare) cases a barcode is paired with a name that has nothing to do with it.

    Private codes


    Some barcodes may be private. EAN-13 codes that start with '2' are discarded up front, but sometimes something goes wrong and private codes slip through: either ones starting with '2', or ones that start with another digit but are nonetheless private, that is, not registered with any of the organizations responsible for this (GS1, for example).
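
    If you want to repeat the first part of this filter on your side, a minimal sketch might look like the following. It covers only the rule mentioned above (13-digit codes starting with '2'); the function names are made up, and private codes that start with other digits cannot be detected this way.

```python
from typing import Iterable, List


def looks_private_ean13(code: str) -> bool:
    """True for a 13-digit code whose first digit is '2'."""
    return (
        len(code) == 13
        and code.isascii()
        and code.isdigit()
        and code[0] == "2"
    )


def drop_private(codes: Iterable[str]) -> List[str]:
    """Keep every code that does not match the rule above."""
    return [code for code in codes if not looks_private_ean13(code)]


# Sample values for illustration only.
codes = ["2000000000015", "4006381333931", "036000291452"]
kept = drop_private(codes)
print(kept)
```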

    Classification


    However hard we tried to build a good classification for the directory, we did not get far. A third of the items belong to the default group, that is, they are not classified at all. The rest may well be miscategorized.

    Not all products are associated with brands, although we worked very hard on this issue.

    How to help?


    If you would like to help expand the directory, we will be grateful for any data on barcodes you know of. I strongly doubt there will be any volunteers, but just in case: I am easy to find from the information in my profile.

    Anyone who is able to implement automatic classification of directory items and shares their ideas and results will earn the title of an incredibly kind person. For our part, we undertake to keep the public informed about the progress of our own research in this area.

    Self-interest


    If you liked the directory, give it a star on GitHub. If you liked it very much, also give a star to the OpenPapyrus project, because all administration and management of the directory is done with its help.

    Terms of Use


    There are none. Use it however you like. If you link to us, thank you; if not, we will survive.

    Bitter regrets


    Not wanting to pass off necessity as virtue, I will admit that we had hoped to somehow monetize this directory. However, over the past years we have not achieved any noticeable success in that. So we decided: better that it belong to everyone than go down the drain. Such, roughly, are our motives for releasing it.

    Thank you for your attention.
