# Using semantic annotation to identify requirements

Good afternoon, %userName%.

In my previous post on requirements management for IT projects, I touched on identifying requirements by means of concepts and on reusing requirements already implemented in one project in another. In this post I would like to develop that idea.

What follows is a little math, some theory, and a lot of text.

##### Requirements management

Requirements management is one of the key processes throughout the software development life cycle. It covers not only collecting the customer's immediate wishes but also presenting them in a form accessible to all participants in the development process.

Modern methodologies and programming paradigms, such as object-oriented programming, make it possible to create independent, self-contained modules that can be used in several projects. Reuse is made possible by observing the basic principles of object-oriented programming: encapsulation, inheritance, and polymorphism.

Many business processes at enterprises in the same field of activity proceed in a similar way. The differences between them are minor and stem from how each enterprise's processes developed historically. This similarity gives rise to boxed versions of information systems that implement the most common process flows. To adapt such a system to the business processes of a particular enterprise, the vendor customizes the software product.

When an information system is customized for several enterprises in the same subject area, modules developed for one enterprise can be used to customize the system for another. Adapting an existing module takes much less time than developing it from scratch. As the number of completed customizations grows, the need for new ones decreases, because existing ones can be reused or adapted.

Reusing the developed modules requires not only compliance with the principles of object-oriented programming but also a technology that identifies modules suitable for reuse without involving an expert, or with minimal expert participation.

In this case the expert is an analyst or project manager. Since an analyst or project manager cannot take part in every project in the organization and be aware of every customization made, a mechanism is needed that identifies the customizations being made and makes them searchable for reuse. Semantic annotation is such a mechanism.

##### More work on this topic

Working with requirements involves collecting and then processing them. This calls for a mechanism that can uniquely identify requirements and search among the existing ones.

For the most part, requirements are described textually, that is, in natural language: constraints and required capabilities are written as text using the terms of the subject area.

When a new requirement is added to a project, the existing requirements must be searched to avoid duplication. Here, whether two requirements are identical is determined by the semantic correspondence of the texts that express them, so a mechanism for determining text similarity is needed.

The most common method for determining text similarity is the shingles algorithm. It identifies near-duplicate texts and can be used to cluster documents by similarity and to detect plagiarism.

This algorithm and its modifications (supershingles and megashingles) do not give a representative result here: requirement descriptions use a limited set of lexical constructions, so the similarity scores are not accurate.
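To make the limitation concrete, here is a minimal sketch of word-level shingling with Jaccard similarity. The function names `shingles` and `jaccard`, the shingle length of three words, and the sample requirement texts are assumptions made for illustration; practical variants usually hash and sample the shingles, which is omitted here.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical requirement texts: r2 contradicts r1, r3 paraphrases r1.
r1 = "the system must allow the manager to approve the invoice"
r2 = "the system must allow the manager to reject the invoice"
r3 = "managers can approve invoices in the system"

print(f"r1 vs r2: {jaccard(shingles(r1), shingles(r2)):.2f}")
print(f"r1 vs r3: {jaccard(shingles(r1), shingles(r3)):.2f}")
```

In this example the two requirements with opposite meaning share 5 of 11 distinct shingles (about 0.45), while the paraphrase `r3` shares none with `r1`, so surface overlap ranks them in the wrong order.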

##### Mathematical apparatus of semantic annotation

Full-text analysis methods cannot identify requirement texts unambiguously because the set of lexical constructions used is limited. To solve this problem, we propose semantic annotation, which describes a requirement expressed as a long text with a short set of concepts.

We define the basic concepts:

- A domain is a set of projects in one subject area.
- A project is a set of requirements that implement a given functionality, together with the activities aimed at achieving a result and creating a unique product or service.
- A concept is an attribute that identifies a requirement from a certain point of view or subject area.
- A category, or linguistic variable, is a set of concepts related to one subject area or point of view. A concept in this case is a term.

We represent a requirement using the following model:

$$\langle C, R \rangle$$

where

- $C$ is the condition or capability that the requirement should represent,
- $R$ is the implementation of this requirement in the system.

A requirement in natural language can also be identified by a set of concepts:

$$\{C_i\}$$

where $C_i$ is a concept describing the requirement.

Each requirement must have concepts that characterize it from the following points of view:

- object,
- subject,
- event,
- action.

Since requirements are an integral part of a project, and the project in turn belongs to a domain, each requirement within the domain also receives a set of categories defined for that domain.

Thus, a requirement can be represented as the following model:

$$\langle C_O, C_S, C_E, C_A, \{C_D\} \rangle$$

where

- $C_O$ is the concept describing the object of the requirement,
- $C_S$ is the concept describing the subject of the requirement,
- $C_E$ is the concept describing the event of the requirement,
- $C_A$ is the concept describing the action,
- $\{C_D\}$ is the set of concepts from the categories received from the domain.

As a measure of the difference between two requirements we take the semantic distance: a real number in the range from 0 to 1, where 1 means the requirements are identical and 0 means they are completely unrelated. The input data for the calculation are the concepts with which the requirements are annotated.

We introduce some additional concepts.

An alphabet is an arbitrary nonempty finite set whose elements are called letters or symbols.

A word, or string, over the alphabet $V$ is an arbitrary tuple from the set $V^k$ (the $k$-th Cartesian power of the alphabet $V$) for some $k = 0, 1, 2, \ldots$

In our case, the alphabet is the set of all concepts available in the system, and the concepts are its symbols. The set of concepts describing a requirement is a word whose length is determined by the number of categories in the given domain. The position of each symbol in the word is determined by the category to which the concept belongs, so we obtain a finite set of words that can be composed from the symbols of the alphabet.
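The encoding described above can be sketched as follows. The category order in `CATEGORIES`, the `Requirement` class, and the sample concepts are hypothetical; the first four categories follow the points of view named in the text, and `module` stands in for a category received from the domain.

```python
from dataclasses import dataclass

# Hypothetical, fixed category order for one domain.
CATEGORIES = ("object", "subject", "event", "action", "module")


@dataclass
class Requirement:
    text: str                 # natural-language description
    concepts: dict[str, str]  # category name -> concept (term)

    def as_word(self) -> tuple[str, ...]:
        """Encode the annotation as a fixed-length word over the concept alphabet.

        The position of each concept is given by its category, so two words
        from the same domain can be compared position by position.
        """
        return tuple(self.concepts[category] for category in CATEGORIES)


req = Requirement(
    text="The manager approves an invoice when it is received.",
    concepts={
        "object": "invoice",
        "subject": "manager",
        "event": "receipt",
        "action": "approve",
        "module": "billing",
    },
)
print(req.as_word())
```

Because the order comes from `CATEGORIES` rather than from the order in which the concepts were entered, every requirement in the domain yields a word of the same length.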

The semantic distance can be based on one of the following measures:

- The Levenshtein distance is the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another.
- The Damerau-Levenshtein distance extends the Levenshtein distance by also counting transpositions of characters. Using it to find the semantic distance is unjustified, since each symbol occupies a strictly defined position in the string according to the category of its concept.
- The Hamming distance is the number of positions at which two strings of equal length differ.

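In this article the semantic distance is based on the Hamming distance. In general form it is calculated as a sum over all positions $i$:

$$\sum_{i} H(a_{i1}, a_{i2})$$

where

$a_{i1}$ is the $i$-th character of the first string,

$a_{i2}$ is the $i$-th character of the second string.

$H$ is equal to one if the symbols $a_{i1}$ and $a_{i2}$ coincide and to zero in all other cases. With $H$ defined this way, the sum counts the positions in which the strings coincide rather than those in which they differ, which agrees with the convention that a larger value means more similar requirements.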

To calculate the semantic distance between two requirements, we use the Hamming distance in the following form:

$$L = \frac{1}{N} \sum_{i=1}^{N} H(C_{i1}, C_{i2})$$

where

$L$ is the semantic distance,

$C_{i1}$ and $C_{i2}$ are the $i$-th concepts of the first and second requirement,

$N$ is the number of concepts in a requirement (the length of the requirement).
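As a minimal sketch, and assuming the distance is the share of coinciding positions (the number of matches divided by N), the calculation might look like this. The function names and the sample annotations are illustrative and not taken from the original.

```python
from typing import Sequence


def match(a: str, b: str) -> int:
    """Indicator H: 1 if the two concepts coincide, 0 otherwise."""
    return 1 if a == b else 0


def semantic_distance(first: Sequence[str], second: Sequence[str]) -> float:
    """Share of category positions in which two annotated requirements coincide."""
    if len(first) != len(second):
        raise ValueError("both requirements must be annotated over the same categories")
    if not first:
        raise ValueError("a requirement needs at least one concept")
    return sum(match(a, b) for a, b in zip(first, second)) / len(first)


# Hypothetical annotations in the order (object, subject, event, action).
r1 = ("invoice", "manager", "receipt", "approve")
r2 = ("invoice", "accountant", "receipt", "approve")

print(semantic_distance(r1, r2))  # 3 of 4 positions coincide
```

The length check reflects the fact that positions are tied to categories: words of different length come from different category sets and cannot be compared position by position.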

Categories within a domain can have different priorities, that is, different weights. A coincidence in a category with a greater weight should have a greater influence on the semantic distance. To reflect the significance of the categories within the domain when calculating the semantic distance, each category is assigned a weight. The system determines the weights from expert feedback:

- The expert is shown a list of requirements that, according to the calculated semantic distance, are similar to the one being entered (or to one selected from the existing requirements).
- The expert marks the requirements that, from the expert's point of view, really are similar.
- Each category in which the marked requirements coincide has its weight increased by one.

Initially, every category within a domain has a weight of one.
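The feedback loop above can be sketched as follows. This is one possible reading: it assumes that a category's weight grows by one for every requirement the expert confirmed as similar and that coincides with the new requirement in that category. The function name `update_weights` and the sample data are hypothetical.

```python
def update_weights(
    weights: dict[str, int],
    new_requirement: dict[str, str],
    confirmed_similar: list[dict[str, str]],
) -> dict[str, int]:
    """Return new category weights after one round of expert feedback.

    Requirements are given as mappings of category name -> concept.
    The input weights are left unchanged.
    """
    updated = dict(weights)
    for other in confirmed_similar:
        for category, concept in new_requirement.items():
            # A coincidence in this category raises its weight by one.
            if other.get(category) == concept:
                updated[category] += 1
    return updated


weights = {"object": 1, "subject": 1, "event": 1, "action": 1}  # initial weights
new_req = {"object": "invoice", "subject": "manager", "event": "receipt", "action": "approve"}
confirmed = [
    {"object": "invoice", "subject": "accountant", "event": "receipt", "action": "approve"},
    {"object": "invoice", "subject": "manager", "event": "payment", "action": "reject"},
]

print(update_weights(weights, new_req, confirmed))
```

In this example the `object` category coincides in both confirmed requirements and ends up with the largest weight, which matches the intent that categories the expert implicitly relies on gain influence.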

We represent a category as the following model:

$$\langle T, W \rangle$$

where

$T$ is the name of the category,

$W$ is the weight of the category within the domain.

Then the semantic distance taking into account the weights of the categories will be calculated by the following formula:

where

$W_i$ is the weight of the $i$-th category within the domain,

$\max W$ is the weight of the category with the maximum weight within the domain.
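One way to combine these quantities is sketched below. It is an assumption, not the author's confirmed formula: each coincidence is scaled by the ratio of its category weight to the maximum weight, and the sum is divided by the number of categories. The function name and the sample data are illustrative.

```python
from typing import Sequence


def weighted_semantic_distance(
    first: Sequence[str],
    second: Sequence[str],
    weights: Sequence[float],
) -> float:
    """Weighted share of coinciding positions (assumed form, see the lead-in)."""
    if not (len(first) == len(second) == len(weights)):
        raise ValueError("concepts and weights must cover the same categories")
    if not weights:
        raise ValueError("at least one category is required")
    max_weight = max(weights)
    total = sum(
        (w / max_weight) * (1 if a == b else 0)
        for a, b, w in zip(first, second, weights)
    )
    return total / len(weights)


# Hypothetical annotations in the order (object, subject, event, action).
r1 = ("invoice", "manager", "receipt", "approve")
r2 = ("invoice", "accountant", "receipt", "approve")

print(weighted_semantic_distance(r1, r2, (1, 1, 1, 1)))  # equal weights
print(weighted_semantic_distance(r1, r2, (3, 2, 2, 2)))  # object weighted higher
```

With equal weights the result reduces to the unweighted value. Under this reading, identical requirements score below 1 whenever the weights differ; dividing by the sum of the weights instead of by N times the maximum weight would keep identical requirements at exactly 1, at the cost of not using the maximum weight at all.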

The Hamming method is sufficient for strings in which each character is independent of the others. Concepts, however, are natural-language terms rather than bare binary values, so semantic relations such as synonymy, antonymy, and meronymy can be established between them.

To calculate the semantic distance with these relations taken into account, we introduce the following concept model:

$$C = \langle V, \{S\}, \{M\} \rangle$$

where

$C$ is a concept,

$V$ is the value of the linguistic variable that describes the concept,

$\{S\}$ is the set of concepts that are synonyms of this concept; the semantic distance between synonyms is 1,

$\{M\}$ is the set of meronyms of this concept. The semantic distance to a meronym is determined by an expert on the basis of a dictionary of meronyms: the less related the terms, the smaller the semantic distance between them. It equals one if the terms are synonyms and tends to zero as the relatedness of the terms decreases.

Thus, the semantic distance can also take semantic relations into account: for each concept, the calculation uses the set of concepts that are semantically related to it.

In this case, not only the original concepts are compared but also all concepts connected to them by semantic relations. If the original concepts do not match, the related concepts are compared in the following sequence:

1. All synonymous concepts are compared. If none of them coincide, go to step 2.
2. All meronymic concepts are compared in descending order of semantic distance. The distance between the original concept and a meronymic one lies in the interval [0, 1].
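The comparison sequence above might be sketched as follows. The `Concept` class, the function `concept_distance`, and the sample terms and scores are assumptions made for illustration; in particular, the meronym scores stand in for values an expert would take from a dictionary of meronyms.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    value: str                                                # the term itself
    synonyms: set[str] = field(default_factory=set)           # synonymous terms
    meronyms: dict[str, float] = field(default_factory=dict)  # term -> expert score in [0, 1]


def concept_distance(a: Concept, b: Concept) -> float:
    """Compare two concepts: exact match, then synonyms, then meronyms."""
    if a.value == b.value:
        return 1.0
    # Step 1: synonyms count as a full match.
    if b.value in a.synonyms or a.value in b.synonyms:
        return 1.0
    # Step 2: meronyms; taking the maximum is equivalent to scanning
    # candidates in descending order of distance and stopping at the first hit.
    scores = [
        score
        for score in (a.meronyms.get(b.value), b.meronyms.get(a.value))
        if score is not None
    ]
    return max(scores, default=0.0)


invoice = Concept("invoice", synonyms={"bill"}, meronyms={"invoice line": 0.6})
bill = Concept("bill")
line = Concept("invoice line")
manager = Concept("manager")

print(concept_distance(invoice, bill))     # synonym
print(concept_distance(invoice, line))     # meronym with expert score
print(concept_distance(invoice, manager))  # unrelated
```

Replacing the binary indicator with a function like this one lets positions that hold related but not identical concepts contribute partially to the overall semantic distance.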

##### Conclusion

Semantic distance and semantic annotation make it possible to:

- Identify similar requirements at the input stage and prevent them from being entered again.
- Find similar requirements among those already implemented and reuse the code that implements them, along with test scenarios, use cases, and other project artifacts.
- Perform cluster analysis of requirements for grouping and subsequent analysis.
- Predict requirement parameters.

Predicting requirement parameters is a priority direction; it will be useful with agile development methodologies such as Scrum, for example to estimate the complexity of requirements.

P.S. Please excuse the academic style: this is a first attempt at writing for publication in a VAK-listed journal.