Macromedia: analysis and interpretation of multimedia information. M-Lang


This article addresses general problems in the use and development of macromedia technologies. Building on well-known principles and methods of information analysis and processing, the author aims to define the basic concepts and rules needed to develop a generative grammar and a language for describing the analysis of multimedia information.
Two approaches to analyzing multimedia information are considered: content analysis and content interpretation. The article also sets out the basic rules of M-Lang, a language for describing algorithms that analyze graphic information, and gives examples of its constructions and specifications.

Introduction

Observations of how the Internet has developed in recent years suggest that new technologies for storing, displaying, processing and searching multimedia and graphic information now outnumber those for its simplest, textual form. The reasons are clear: growing connection bandwidth, growing computing power and disk space, and strong competition among resource owners in a commercial environment whose potential is far from exhausted.
In addition, thanks to a simple idea, good advertising, novelty and thoughtful execution, resources such as Facebook, Twitter, YouTube, and their Russian counterparts VKontakte and Odnoklassniki have become very popular. Together, these two conditions set the stage for a new informational entity: "macromedia" technology.
In general, macromedia is a set of technologies for the transfer, storage, processing, display, analysis and interpretation of multimedia content. It is characterized by streaming display, that is, content is displayed on the user's side while it is still being transmitted; by the predominantly multimedia (video/audio) nature of the information; and by processes of interpretation, analysis and comparison of this information. The tasks of transferring, storing, processing and displaying are well studied in themselves, while the analysis of multimedia information remains an open problem.

1. Content analysis method.

In the most general case, analyzing multimedia information is similar to analyzing text. Text analysis, however, is a much simpler process, if only because the data does not have to be converted into another representation, at a different level of abstraction, before it can be analyzed. The task is also simplified by the fact that text is orders of magnitude smaller than graphical data.

Example 1.
Two text files were uploaded to the server:

A.txt	B.txt
Grass in the yard, firewood on the grass.	Birch firewood, inexpensive. Shipment at own expense from the yard. Tel 123-45-67

The simplest relevance analysis will be as follows:
The system splits each file into an array of tokens, removes stop words and punctuation marks, and searches for the tokens of each file in the other, taking morphology into account. As a result, the link between the files through the words "firewood" and "yard" is found and recorded in the search index.
Of course, modern technologies for analyzing textual information are many times more complicated and operate with many parameters, but their essence remains the same: interpret the files as structures and compare their contents.
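The simplest analysis described above can be sketched in Python as follows. This is only an illustration: the stop-word list, the crude suffix stripping that stands in for real morphology, and the function names are all assumptions made for the example, not part of any real search system.

```python
import re

# Hypothetical stop-word list, just large enough for the two sample files.
STOP_WORDS = {"in", "the", "on", "at", "from", "own"}


def stem(word):
    """Very crude stand-in for morphological analysis: strip a plural ending."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Split text into lowercase word tokens, dropping punctuation,
    digits and stop words, then reduce each token to its stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(w) for w in words if w not in STOP_WORDS}


def shared_tokens(text_a, text_b):
    """Tokens that occur in both texts (the search is symmetric)."""
    return tokens(text_a) & tokens(text_b)


a_txt = "Grass in the yard, firewood on the grass."
b_txt = ("Birch firewood, inexpensive. Shipment at own expense "
         "from the yard. Tel 123-45-67")

print(sorted(shared_tokens(a_txt, b_txt)))  # ['firewood', 'yard']
```

A real system would replace the set intersection with an index lookup and the suffix stripping with a proper stemmer or lemmatizer, but the structure (tokenize, normalize, compare) stays the same.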
Analyzing multimedia information is a harder task.
Example 2.
Two images were uploaded to the server:

A.bmp	B.bmp

Comparing the file contents directly yields nothing: from the machine's point of view these are different images, although in reality they are not.
At this point, two approaches to the problem emerge.
The content analysis method is a relatively simple approach that makes it possible to determine and measure the similarity of data, but nothing more. Its essence is the same as in Example 1: break the data into components and compare them directly. For Example 2, it looks like this:

Thus, compared without regard to transformations (here, the horizontal reflection of the third element), the images are 75% identical; with the transformation taken into account, they are 100% identical. We write this as "{A, B} (75/100)" and keep to this notation from here on (let us call it M-Lang). In the second case we can conclude that both images belong to a certain set, for example "smile". This leads to the second approach.
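As an illustration of the content analysis method, the comparison can be sketched in Python. The sketch rests on several assumptions: each image has already been split into three parts with area shares of 50%, 25% and 25%, each part is a small grid of pixel values, and horizontal reflection is the only transformation considered. The tile values and function names are invented for the example.

```python
def mirror(tile):
    """Horizontal reflection of a tile: reverse every row."""
    return tuple(row[::-1] for row in tile)


def similarity(parts_a, parts_b):
    """Compare two images part by part.

    Each image is a list of (share_percent, tile) pairs in the same order.
    Returns (plain, transformed): the percentage that matches directly and
    the percentage that matches when reflected parts also count.
    """
    plain = transformed = 0
    for (share, tile_a), (_, tile_b) in zip(parts_a, parts_b):
        if tile_a == tile_b:
            plain += share
            transformed += share
        elif mirror(tile_a) == tile_b:
            transformed += share
    return plain, transformed


# Hypothetical tiles; only the third part differs, by a reflection.
head = ((1, 1), (1, 1))
eyes = ((1, 0, 1),)
mouth_a = ((1, 1, 0),)
mouth_b = ((0, 1, 1),)

A = [(50, head), (25, eyes), (25, mouth_a)]
B = [(50, head), (25, eyes), (25, mouth_b)]

plain, transformed = similarity(A, B)
print(f"{{A, B}} ({plain}/{transformed})")  # {A, B} (75/100)
```

Exact equality of tiles is the simplest possible test; a practical system would use a tolerance or a perceptual distance instead.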

2. Content-interpretation.

The essence of the approach is to interpret the constituent parts of multimedia data as concepts in a formal language and to build relationships between these concepts. The main advantage of this approach is that the system can learn. Note that composition and decomposition can be applied an unlimited number of times, and the number of recursions and/or iterations in the analysis process is likewise unbounded.
Consider the simplest case.
To illustrate, we expand Example 2 with the concept of “tags” and construct a common table of all the information we have.
Example 3

Next, using the concepts of weight, probability and statistics, together with predicate and Boolean logic, we describe how the system operates in terms of M-Lang. M-Lang is a language developed by the author for describing algorithms, rules and specifications for the recognition, analysis and interpretation of graphic information.
Basic rules and constructions of M-Lang:

{O1, ..., On} - a joint set of objects.
{O1, ..., On} (K) - the objects are jointly identical by K percent.
O1 [(N) "word"] - object O1 has the tag "word", and the weight of the object in the tag is N.
O1 = {P1, P2, P3} - object O1 consists of objects P1, P2, P3.
{O1 | ... | On} - the objects belong to the set.
Operators and transformations of Boolean logic and predicate logic are also allowed in the language.
The rules of fuzzy logic can be used as well.

To clarify, an object is an entity that is distinct and unambiguous at the current level of abstraction, for example an image, a part of an image, or a statement that describes something. An object consists of other objects and is itself part of another object. An image can be decomposed as far as you like, but it is rational to do so only until ambiguous sets of objects stop appearing.
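To make the constructions above more concrete, here is one possible way to hold an object in memory and print it back in the notation. This is a sketch only: the class name MObject, its fields and the helper joint are hypothetical and are not part of the language as the author describes it.

```python
from dataclasses import dataclass, field


@dataclass
class MObject:
    """An object in the sense used above: it may consist of parts
    and may carry weighted tags."""
    name: str
    parts: dict = field(default_factory=dict)  # part name -> share in percent
    tags: dict = field(default_factory=dict)   # tag -> weight

    def composition(self):
        """Render 'O1 = {P1 (..%); P2 (..%)}'."""
        inner = "; ".join(f"{p} ({s}%)" for p, s in self.parts.items())
        return f"{self.name} = {{{inner}}}"

    def tagged(self):
        """Render 'O1 [(N) "word"]' for every tag of the object."""
        tags = " ".join(f'[({w}) "{t}"]' for t, w in self.tags.items())
        return f"{self.name} {tags}"


def joint(objects, plain, transformed):
    """Render '{O1, ..., On} (K1/K2)' for a joint set of objects."""
    names = ", ".join(o.name for o in objects)
    return f"{{{names}}} ({plain}/{transformed})"


A = MObject("A", {"a1": 50, "a2": 25, "a3": 25}, {"smile": 100})
B = MObject("B", {"b1": 50, "b2": 25, "b3": 25}, {"sadness": 100})

print(A.composition())    # A = {a1 (50%); a2 (25%); a3 (25%)}
print(A.tagged())         # A [(100) "smile"]
print(joint([A, B], 75, 100))  # {A, B} (75/100)
```

Because parts are stored by name, a part can itself be described by another MObject, which mirrors the unlimited decomposition mentioned above.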
Consider an example that describes the image recognition process, based on the figure from Example 3.

Formula	Interpretation
{A, B, C} (75/75)	Objects A, B and C are jointly 75% identical without regard to transformation and 75% identical with regard to transformation.
{A, B} (75/100)	Objects A and B are jointly 75% identical without regard to transformation and 100% identical with regard to transformation.
{A ["smile"], C ["smile"]} (75/75)	Object A has an explicit "smile" tag, object C has an explicit "smile" tag, and objects A and C are 75% identical without regard to transformation and 75% identical with regard to transformation.
A = {a1 (50%); a2 (25%); a3 (25%)} or {a1 (50%); a2 (25%); a3 (25%)} = A	Object A consists of parts a1, a2 and a3 in the proportion 2:1:1; equivalently, parts a1, a2 and a3, in the proportion 2:1:1, make up object A.
{a1 (50%); a2 (25%); a3 (25%)} ["smile"] -> T0: ({a1} [(50) "smile"] | {a2} [(25) "smile"] | {a3} [(25) "smile"])	Parts a1, a2 and a3, combined in the proportion 2:1:1, have an explicit "smile" tag; hence statement T0: part a1 has weight 50 in the "smile" tag, a2 has weight 25 and a3 has weight 25.
{b1 (50%); b2 (25%); b3 (25%)} ["sadness"] -> T1: ({b1} [(50) "sadness"] | {b2} [(25) "sadness"] | {b3} [(25) "sadness"])	Parts b1, b2 and b3, combined in the proportion 2:1:1, have an explicit "sadness" tag; hence statement T1: part b1 has weight 50 in the "sadness" tag, b2 has weight 25 and b3 has weight 25.
{c1 (50%); c2 (25%); c3 (25%)} ["smile"] -> T2: ({c1} [(50) "smile"] | {c2} [(25) "smile"] | {c3} [(25) "smile"])	Parts c1, c2 and c3, combined in the proportion 2:1:1, have an explicit "smile" tag; hence statement T2: part c1 has weight 50 in the "smile" tag, c2 has weight 25 and c3 has weight 25.
T3: {a1, b1, c1} (100/100) T4: {a2, b2, c2} (100/100)	Statement T3: parts a1, b1 and c1 are 100% identical without regard to transformation and 100% identical with regard to transformation. Statement T4: parts a2, b2 and c2 are 100% identical without regard to transformation and 100% identical with regard to transformation.
T0 v T2 v T3 v T4 -> T5: ({a1 | b1 | c1} [(100) "smile"]) | ({a2 | b2 | c2} [(50) "smile"]) | ({a3 | b3 | c3} [(25) "smile"])	From the union of statements T0, T2, T3 and T4 follows statement T5: a1, b1 and c1 have weight 100 in the "smile" tag, a2, b2 and c2 have weight 50 in the "smile" tag, and a3, b3 and c3 have weight 25 in the "smile" tag.
T5 v T3 v T4 -> b1 [(100) "smile"] v b2 [(50) "smile"] -> B [(100) "sadness"] [(150) "smile"]	From the combination of statements T5, T3 and T4 it follows that parts b1 and b2 have weights 100 and 50 in the "smile" tag, respectively; therefore object B has an explicit "sadness" tag and an implicit "smile" tag with weight 150.

Thus, based on the comparison and interpretation of its content, object B was assigned the tag "smile".
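The arithmetic behind this conclusion can be checked with a short Python sketch. The input data (parts, shares, explicit tags and the identity statements T3 and T4) comes from the table; the summation rule, in which a part inherits the total weight that all parts identical to it carry in a tag, is one reading of the example and is an assumption, as are the function names.

```python
from collections import defaultdict

# Explicitly tagged objects: name -> (tag, {part: share in percent}).
objects = {
    "A": ("smile",   {"a1": 50, "a2": 25, "a3": 25}),
    "B": ("sadness", {"b1": 50, "b2": 25, "b3": 25}),
    "C": ("smile",   {"c1": 50, "c2": 25, "c3": 25}),
}

# Statements T3 and T4: groups of parts that are 100% identical.
identical = [{"a1", "b1", "c1"}, {"a2", "b2", "c2"}]


def part_weights(objects):
    """Statements T0..T2: each part gets its share as weight in its object's tag."""
    weights = defaultdict(lambda: defaultdict(int))
    for tag, parts in objects.values():
        for part, share in parts.items():
            weights[part][tag] += share
    return weights


def group_weights(weights, identical):
    """Statement T5 (assumed rule): sum the tag weights over each group."""
    result = []
    for group in identical:
        total = defaultdict(int)
        for part in group:
            for tag, weight in weights[part].items():
                total[tag] += weight
        result.append((group, dict(total)))
    return result


def object_tags(name, objects, identical):
    """Explicit tag of an object plus the implicit tags inherited by its parts."""
    explicit_tag, parts = objects[name]
    tags = defaultdict(int)
    tags[explicit_tag] = sum(parts.values())
    for group, total in group_weights(part_weights(objects), identical):
        for part in parts:
            if part in group:
                for tag, weight in total.items():
                    if tag != explicit_tag:
                        tags[tag] += weight
    return dict(tags)


print(object_tags("B", objects, identical))  # {'sadness': 100, 'smile': 150}
```

Under this rule b1 inherits 50 + 50 from a1 and c1, and b2 inherits 25 + 25 from a2 and c2, which gives the implicit weight of 150 stated in the last row of the table.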
This example is trivial, and the M-Lang grammar shown here is simplified. Neither takes into account, for example, the position of parts in space, color, quality, format, the codecs and compression mechanisms of the input files or, for video files, duration. Analyzing audio streams requires a separate language specification. Nevertheless, the author is confident that further development will turn M-Lang into a powerful tool for modeling and for creating rules for the analysis and interpretation of multimedia information.
The main advantages of M-Lang:

The grammar of the language is simple.
It uses a wide range of universal concepts, such as probabilities, weights and fuzzy logic.
The language is easy to translate into both formal and algorithmic languages.
The language can be used both to develop an entire algorithm or a set of general rules and to verify existing systems, including those developed without an M-Lang description.
It is easy to understand.

Conclusion

Due to the lack of standards and of reasonably up-to-date open-source technologies, there is an urgent need for a model for developing and describing algorithms that analyze and interpret multimedia information. M-Lang can be developed into such a language: it uses elements of algorithmic and functional programming languages, methods of probability theory and mathematical statistics, Boolean algebra and predicate logic to describe rules and algorithms. The main advantage of the language is that it translates easily into both algorithmic and natural languages.
