Public discussion of the GOST draft on compression of digitized audio data

    Dear Habr users!

    Continuing the recently established tradition of publishing draft standards developed by our company within the technical standardization committee TK-234 "Alarm systems and anti-criminal protection", we present for your attention the draft standard "Security television systems. Compression of digitized audio data. General technical requirements and methods for evaluating algorithms".

    We will be extremely grateful for constructive criticism of the draft; all valuable comments and suggestions will be taken into account in the next edition of the standard. The text of the standard is under the cut.

    For a better understanding of the structure of this standard and its general approach, we recommend that you familiarize yourself with the already adopted standard for the compression of digitized video data, developed by us back in 2011.

    NATIONAL STANDARD OF THE RUSSIAN FEDERATION

    Security television systems. Compression of digitized audio data.


    Classification. General technical requirements and methods for evaluating algorithms


    Introduction
    The active use, in security television systems (STS), of methods for compressing digitized audio data borrowed from multimedia television applications has made it impossible to carry out investigative actions and operational functions using the majority of existing STS.
    An important distinguishing feature of digitized audio data compression methods for STS is the need to ensure high sound quality in the restored audio data. This standard makes it possible to systematize existing and future methods for compressing digitized audio data intended for use in anti-criminal protection systems.
    As the criterion for classifying compression algorithms for digitized audio data, this standard establishes the values of quality metrics characterizing the degree of deviation between the original digitized audio data and the corresponding restored data.
    This standard should be used in conjunction with GOST R 51558-2008 "Means and systems for television security. Classification. General technical requirements. Test methods."

    1 Scope
    This standard applies to digital security television systems (hereinafter referred to as STS) and establishes general technical requirements for, and methods of evaluating, compression algorithms for digitized audio data in STS.
    This standard applies to compression (decompression) algorithms regardless of their hardware or software implementation.
    This standard specifies the classification of compression (decompression) algorithms for digitized audio data.
    This standard establishes a methodology for comparing different compression and decompression algorithms for digitized audio data.
    This standard is used in conjunction with GOST R IEC 60065, GOST R 51558, GOST 13699, GOST 15971 and GOST R 52633.5-2011.

    2 Normative references
    In this standard, normative references to the following standards are used:
    GOST R 51558-2008 Security television equipment and systems. General technical requirements and test methods
    GOST R IEC 60065-2009 Audio, video and similar electronic equipment. Safety requirements
    GOST 13699-91 Recording and reproduction of information. Terms and definitions
    GOST 15971-90 Information processing systems. Terms and Definitions
    GOST R 52633.5-2011 Information security. Information security technique. Automatic training of neural network converters biometrics-access code

    3 Terms and definitions
    This standard uses terms in accordance with GOST 15971-90, GOST 13699, GOST R 51558, GOST R 52633.5-2011 and GOST R IEC 60065-2009, as well as the following terms with their corresponding definitions:
    1. audio data, audio signal, single-channel audio signal (monophonic audio): an analog signal carrying information about the change of sound amplitude over time.
    2. multi-channel audio signal (multi-channel audio): an audio signal formed by combining a certain number of audio signals (channels) that carry information about the same sound; intended for better sound reproduction taking spatial orientation into account.
    3. stereophonic audio signal (stereo audio signal), two-channel audio signal: a multi-channel audio signal consisting of two single-channel audio signals.
    4. digitized audio data: data obtained by analog-to-digital conversion of audio data and represented as a sequence of bytes in some format (WAV or similar).
    5. analog-to-digital converter, ADC: a device that converts an input analog audio signal into digitized audio data.
    6. sampling frequency (sample rate): the frequency at which a time-continuous signal is sampled during its analog-to-digital conversion into digitized audio data.
    7. bit resolution (resolution of ADC): the number of bits with which each signal sample is encoded during analog-to-digital conversion.
    8. frame: a fragment of an audio signal containing a given number of values (the frame length).
    9. digitized audio data format: a representation of digitized audio data that enables its processing by digital computing means.
    10. compression of digitized audio data (audio compression): processing of digitized audio data intended to reduce its volume.
    11. compressed audio data: data obtained by compressing digitized audio data.
    12. lossy compression of digitized audio data (lossy audio compression): compression of digitized audio data in which information is lost, so that the restored (decompressed) digitized audio data differ from the original digitized audio data.
    13. lossless compression of digitized audio data (lossless audio compression): compression of digitized audio data in which no information is lost, so that the restored (decompressed) digitized audio data do not differ from the original digitized audio data.
    14. decompression of compressed audio data (audio decompression): restoration of digitized data from compressed audio data.
    15. restored audio data (decoded audio data): data obtained from compressed audio data after decompression.
    16. audio encoder: software, hardware, or a combination of both, by means of which digitized audio data is compressed.
    17. audio decoder: software, hardware, or a combination of both, by means of which compressed audio data is decompressed.
    18. audio codec: a software, hardware, or hardware-software module that can perform both compression and decompression of audio data.
    19. compression ratio: the factor by which the volume of digitized audio data is reduced as a result of compression.
    20. bit rate: the amount of compressed audio data, expressed in bits, produced over a certain time interval, divided by the duration of that interval in seconds.
    21. quality of restored audio data (decoded audio data quality): an objective assessment of the correspondence of the restored audio data to the original digitized audio data based on calculated quality metrics.
    22. quality metric: an analytically determined parameter characterizing the degree of deviation of the restored audio data from the original digitized audio data.
    23. method of evaluating a compression algorithm: an analytical method for determining the quality metric values used to check that audio data compression algorithms meet the requirements.
    24. compression algorithm: an exact set of instructions and rules describing the sequence of actions by which the original audio data is converted into compressed audio data; implemented using an audio encoder.
    25. decompression algorithm: an exact set of instructions and rules describing the sequence of actions by which compressed audio data is converted into restored audio data; implemented using an audio decoder.
    26. time-frequency metric: a quality metric based on comparing the spectrograms of the original digitized and restored audio data.
    27. amplitude-time metric (time-amplitude metric): a quality metric based on comparing the waveforms of the original digitized and restored audio data.
    28. resampling of an audio signal: changing the sampling frequency of an audio signal.
    29. psychoacoustic model: a model used in lossy compression of audio data that exploits the characteristics of human auditory perception.
    30. psychoacoustic masking: the concealment, under certain conditions, of one sound by another due to the peculiarities of human auditory perception.
    31. masking threshold: the threshold level of a signal that a person cannot distinguish because of the effect of psychoacoustic masking.
    32. noise: a set of aperiodic sounds of varying intensity and frequency that carry no useful information.
    33. signal spectrum (frequency spectrum): the result of decomposing a signal into simple sinusoidal functions (harmonics).
    34. discrete Fourier transform, DFT: a transform that maps N samples of a discrete signal to N samples of the discrete spectrum of the signal.
    35. fast Fourier transform algorithm (FFT): a fast algorithm for computing the discrete Fourier transform.
    36. spectrogram: a characteristic of the power density of a signal in the time-frequency plane.
    37. window (window function): a weighting function used to control the effects caused by side lobes in spectral estimates (spectral leakage). It is convenient to regard a given finite data record or finite correlation sequence as a portion of the corresponding infinite sequence seen through the applied window.
    38. Hann windowing (short-time Fourier transform with Hann window): a DFT with the Hann window as the weighting function.
    39. artificial neural network (ANN): a mathematical model, as well as its software or hardware implementation, built loosely in the image of the nerve cell networks of a living organism and used to approximate continuous functions. An artificial neural network consists of an input layer of neurons and an output layer of neurons; between these layers lie one or more intermediate (hidden) layers of neurons.
    40. distorted frame: a frame for which the maximum ratio of noise to the masking threshold exceeds 1.5 dB.
    41. peak signal-to-noise ratio: the ratio of the maximum possible signal value to the noise power.
    42. differentiation (from Latin differentia, difference): the separation of a particular element from the general population according to some criterion.

    4 General technical requirements
    The requirements for compression of digitized audio data are aimed at assessing the quality of the restored audio data, which is determined by the quality of each individual sound fragment of the restored audio data. The size of a sound fragment is specified in seconds or as the number of digitized values within the fragment.
    The quality of a sound fragment of the restored audio data is determined by quality metrics characterizing the degree of distortion of the restored audio data after compression compared with the original digitized audio data. The procedure for calculating the metrics is given in section 6 of this document.
    According to the quality metrics of the restored audio data, compression algorithms for digitized audio data are assigned to one of three classes (see section 5 of this document).
    The class of a particular digitized audio data compression algorithm is determined by the quality metric values calculated for it and by Table 1 in section 5.

    5 Classification of compression algorithms
    5.1 The following quality metrics are used to assess the quality of the restored audio data and to classify compression algorithms: the peak signal-to-noise ratio (PSNR); the waveform difference coefficient; and a metric based on an objective assessment of audio quality as perceived by a person (perceptual evaluation of audio quality, PEAQ).
    5.2 The classification of compression algorithms for digitized audio data is based on the values of quality metrics that reflect those changes in the digitized audio data, introduced by compression and decompression, which can critically affect the usability of the restored audio data for establishing the presence of sound signals and for differentiating sounds and speech.
    5.3 Depending on the quality metric values calculated during evaluation, compression algorithms for digitized audio data are assigned to one of the following classes (see Table 1):
    • Class I - full-featured compression algorithms that make the quality of the restored audio data indistinguishable from the quality of the original audio data;
    • Class II - compression algorithms that provide restored audio data of quality sufficient to establish the presence of sound signals and to differentiate sounds and speech, not inferior to the original audio data in this respect, but distinguishable from the quality of the original audio data;
    • Class III - compression algorithms that provide restored audio data of quality sufficient to establish the presence of sound signals, not inferior to the original audio data in this respect, but with distortions that interfere with differentiating sounds and understanding speech.


    Table 1 - Classification of compression algorithms
    5.4 The quality metric values are determined for each sound fragment (five seconds long) of the digitized audio data, and the resulting score is taken as the lowest value for the PSNR and PEAQ metrics and the highest value for the waveform difference coefficient.
    To calculate the PSNR metric and the waveform difference coefficient, the original and restored digital audio data must be represented with a sampling frequency of 44100 Hz, 16 bits per discrete sample value, and one audio channel. A five-second sound fragment in this case contains 220,500 digitized values.
    To calculate the PEAQ metric, the original and restored digital audio data must be represented with a sampling frequency of 48000 Hz, 16 bits per discrete sample value, and one or two audio channels. A five-second sound fragment in this case contains 240,000 digitized values per channel.
    Signals with a different sampling frequency must first be resampled.
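    For illustration, below is a minimal sketch of this preparation step, assuming the source material is available as a WAV file; the function name and file handling are illustrative and not part of the standard.

```python
# A minimal sketch of preparing a signal for metric calculation, assuming the
# source is a WAV file; names and file handling are illustrative only.
import numpy as np
from math import gcd
from scipy.io import wavfile
from scipy.signal import resample_poly

def prepare_signal(path, target_rate):
    """Load a WAV file, keep one channel, and resample it to target_rate
    (44100 Hz for PSNR and the waveform difference coefficient,
    48000 Hz for PEAQ)."""
    rate, data = wavfile.read(path)
    if data.ndim > 1:
        data = data[:, 0]                # one audio channel, as required for PSNR
    data = data.astype(np.float64)
    if rate != target_rate:
        g = gcd(target_rate, rate)
        data = resample_poly(data, target_rate // g, rate // g)
    return data

# A five-second fragment at 44100 Hz then contains 44100 * 5 = 220500 values.
```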

    6 Compression algorithm evaluation methods


    6.1 General description of the evaluation methods
    The general scheme of operation of an STS, from the point of view of the use of compression and decompression algorithms, is presented in Figure 1.

    Figure 1 - General scheme of operation of an STS

    Analog audio data undergoes analog-to-digital conversion, which produces digitized audio data with a certain sampling frequency and number of bits per discrete digitized value. On a computer, the digitized audio data must be stored in one of the digitized audio data storage formats.
    The digitized audio data is then compressed, producing compressed audio data.
    The compressed audio data is used for archival storage or for transmission over a network, after which it is decompressed. Decompression of the compressed audio data produces restored audio data, which is played back to the operator and fed to software modules for audio data analysis.
    In accordance with this general scheme of STS operation, compression algorithms for digitized audio data are classified by evaluating the quality metrics of the restored audio data against the original digitized audio data. Depending on the features of the technical implementation of a particular STS, there are two evaluation methods:
    - based on the separation of digitized audio data;
    - based on the separation of audio data.
    Before the quality metrics are evaluated, both audio signals (original and restored) must be converted to signals with sampling frequencies of 44100 Hz and 48000 Hz. For both frequencies, the number of bits per discrete digitized value must be 16.

    6.1.1 Method for evaluating the algorithm based on the separation of digitized audio data
    To use this method, the technical implementation of the STS must allow the digitized audio data to be obtained before it is processed by the compression and decompression algorithms.
    The general scheme of the evaluation method based on the separation of digitized audio data is presented in Figure 2.

    Figure 2 - General scheme of the evaluation method based on the separation of digitized audio data
    The evaluation is performed by the following sequence of actions:
    - a sequence of audio data is fed to the input of the STS under test;
    - using the capabilities of the STS, the digitized and restored audio data is saved to storage devices;
    - the quality metric values are calculated and the compression algorithm is classified according to Table 1.

    6.1.2 Method for evaluating the algorithm based on the separation of audio data
    The evaluation method based on the separation of audio data should be used only if the technical implementation of the STS does not allow the method based on the separation of digitized audio data to be applied. This method requires an additional STS as part of the test bench, which is used to store the digitized audio data.
    The general scheme of the evaluation method based on the separation of audio data is presented in Figure 3.

    Figure 3 - General scheme of the evaluation method based on the separation of audio data
    The evaluation is performed by the following sequence of actions:
    - the audio data sequence is fed to the input of the STS under test and is duplicated to the test-bench STS through an audio splitter;
    - using the capabilities of the STS under test, the restored audio data is saved to storage devices;
    - using the capabilities of the test-bench STS, the digitized audio data is saved to storage devices;
    - the quality metric values are calculated and the compression algorithm is classified according to Table 1.

    6.2. PEAQ calculation algorithm
    This metric is intended to assess the quality of the processed signal relative to the original, taking into account human auditory characteristics (a psychoacoustic model). This metric for evaluating audio signal quality is recommended by ITU-R BS.1387-1.
    Requirements for the input audio signals:
    • both audio signals (original and restored) must have a sampling frequency of 48 kHz for the PEAQ calculation; signals with a different sampling frequency must first be resampled;
    • both audio signals must have the same length (consist of the same number of digitized values).

    Designations:
    - $F_s$ - the sampling frequency of the signals;
    - $N$ - the number of digitized signal values defining the length of a sound fragment (the frame size);
    - $x[k]$, $k = 0, \dots, N-1$ - the digitized data of a frame;
    - the frame advance is $N/2$ values, so that consecutive frames overlap by 50%;
    - the frame rate taking the frame advance into account;
    - the number of frequency filter bands.

    The metric is calculated in five stages.
    I. Signal preprocessing
    Application of the window transform. The initial digitized data is divided into frames. The digitized data of each frame undergoes a scaled Hann window transform according to formula (2). The Hann window function has the form:

    $$h_w[k] = \frac{1}{2}\left(1 - \cos\frac{2\pi k}{N-1}\right), \quad k = 0, \dots, N-1 \qquad (1)$$

    The scaled version of the Hann window function:

    $$h_s[k] = \sqrt{\frac{8}{3}}\, h_w[k] \qquad (2)$$

    The transition to the frequency domain is carried out by applying the discrete Fourier transform (DFT) to the windowed frame data:

    $$X[f] = \sum_{k=0}^{N-1} h_s[k]\, x[k]\, e^{-j 2\pi k f / N}, \quad f = 0, \dots, N-1 \qquad (3)$$
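    As an illustration, a sketch of this preprocessing step is given below, assuming 2048-sample frames with a 1024-sample advance (the 1024-value step is mentioned later in this section; 50% overlap then implies 2048-sample frames) and the scaled Hann window of formula (2).

```python
# A sketch of stage I preprocessing: framing, scaled Hann window, DFT.
# The frame size (2048) and advance (1024) are assumptions consistent with
# the 50% overlap and the 1024-value step described in the text.
import numpy as np

FRAME = 2048
STEP = 1024

def preprocess(x):
    """Split the signal into 50%-overlapping frames, apply the scaled Hann
    window (formula (2)) and return the DFT of each frame (formula (3))."""
    n_frames = 1 + (len(x) - FRAME) // STEP
    k = np.arange(FRAME)
    window = np.sqrt(8.0 / 3.0) * 0.5 * (1.0 - np.cos(2.0 * np.pi * k / (FRAME - 1)))
    frames = np.stack([x[i * STEP : i * STEP + FRAME] for i in range(n_frames)])
    return np.fft.fft(frames * window, axis=1)   # one spectrum per frame
```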

    Model of the outer and middle ear
    The frequency response of the outer and middle ear is calculated by the following formula:
    (4)
    From formula (4), the vector of weighting coefficients is calculated as follows:
    (5)

    Using these weights (5), the weighted DFT energy is calculated:
    (6)
    Decomposition into critical hearing bands
    The following formulas are used for conversion to the Bark scale (7) and for the inverse transformation (8):

    $$z = 7\,\operatorname{arcsinh}\!\left(\frac{f}{650}\right) \qquad (7)$$

    where z is measured in Barks.

    $$f = 650\,\sinh\!\left(\frac{z}{7}\right) \qquad (8)$$
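    For illustration, a sketch of this conversion, assuming the 7 · arcsinh(f / 650) mapping of formulas (7) and (8):

```python
# A sketch of the Bark-scale conversion, formulas (7) and (8), assuming the
# 7 * arcsinh(f / 650) mapping used by ITU-R BS.1387.
import math

def hz_to_bark(f_hz: float) -> float:
    """Formula (7): frequency in Hz to the Bark scale."""
    return 7.0 * math.asinh(f_hz / 650.0)

def bark_to_hz(z_bark: float) -> float:
    """Formula (8): the inverse transformation."""
    return 650.0 * math.sinh(z_bark / 7.0)

# Example: the 8.1 kHz bandwidth threshold used later in this section
# corresponds to hz_to_bark(8100) ≈ 22.5 Bark.
```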
    Frequency bands
    The frequency bands are defined by specifying the lower, center and upper frequency of each band. These values on the Bark scale are defined as follows:
    (9)
    The inverse transformation is performed according to the following formulas:
    (10)
    for i = 1, 2, ..., 109 (the number of frequency bands).
    Energy of a frequency band
    For the i-th frequency band, the energy contribution of the k-th DFT bin is calculated by the following formula:
    (11)
    Then the energy of the i-th frequency band is:
    (12)
    The final formula for the energy of the i-th frequency band is:
    (13)
    Internal noise of the ear
    To compensate for the internal noise of the ear, an additional term is introduced into the energy of each frequency band:
    (14)
    where the internal noise is modeled as follows:
    (15)
    The resulting energies will hereinafter be called the pitch patterns.
    Energy spreading within one frame
    The energy-spreading characteristic on the Bark scale is calculated as follows:
    (16)
    where
    (17)
    The function S(i, l, E) has the following form:
    (18)
    where
    (19)
    Below are the formulas for calculating the remaining terms:
    (20)
    and
    (21)
    The resulting energies are called the unsmeared excitation patterns.
    Energy filtering
    Let n be the frame index (frames are indexed starting from n = 0). The energy of the n-th frame given by formula (16) is filtered in accordance with the following formula:
    (22)
    where the filter coefficient is determined by the time constant of the energy decay; the initial condition for the filtering is zero.
    The final values are called the excitation patterns.
    Time constants
    The time constant for filtering the i-th band is calculated as follows:
    (23)
    and can be calculated as follows:
    (24)

    II. Pattern processing

    Figure 4 below summarizes the preliminary calculations described in the previous stage.

    Figure 4: Signal preprocessing scheme
    The indices R and T denote the original (reference) and restored (test) audio signals, respectively. The index k denotes the frequency band index (there are 109 bands in total), and the index n denotes the frame number. For the recurrence formulas at this and the next stage (stage III), zero initial conditions are always used.
    Processing of the excitation patterns
    The input data for this stage of the calculations are the excitation patterns calculated by formula (22) for the original and test audio signals, respectively.
    Correction of the excitation patterns
    First, filtering is performed for both audio signals according to the formula:
    (25)
    The time constant is calculated according to formulas (23) and (24), but with modified parameter values. The initial condition for the filtering is chosen equal to 0.
    Next, the correction coefficient is calculated:
    (26)
    The excitation patterns are corrected as follows:
    (27)
    Adaptation of the excitation patterns
    Using the same time constants and initial conditions as for the correction of the excitation patterns, the output signals calculated by formula (27) are smoothed in accordance with the following formulas:
    (28)
    Based on the ratio of the values calculated in (28), a pair of auxiliary signals is calculated:
    (29)
    If the numerator and denominator in formula (29) are both zero, the corresponding values are defined separately; the boundary case k = 0 is also handled separately.
    To form the pattern-correction factors, the auxiliary signals are filtered using the same time constants and initial condition as in (25):
    (30)
    where
    (31)
    (32)
    As the final result of this processing stage, the spectrally adapted patterns are obtained on the basis of formula (30):
    (33)

    Processing of the modulation patterns
    The input data for this stage of the calculations are the unsmeared excitation patterns calculated by formula (16) for the original and test audio signals, respectively. The purpose of this section is to compute the modulation measures of the spectral envelopes.
    First, the average loudness is calculated:
    (34)
    Next, the following differences are calculated:
    (35)
    The time constants and initial conditions are the same as in the previous section.
    The modulation measures of the spectral envelopes are calculated as follows:
    (36)
    Calculation of loudness
    The loudness patterns are calculated in accordance with the following formulas:
    (37)
    where
    (38)
    and
    (39)
    The parameter c = 1.07664.
    The total loudness of each signal is calculated as follows:
    (40)

    III. Calculation of the output values of the psychoacoustic model
    The output characteristics of stage I are used to calculate the output characteristics of stage II in accordance with the scheme below (see Figure 5).

    Figure 5. Pattern processing scheme
    In turn, the values of stage II are used to calculate the output values of the variables of the psychoacoustic model (see Table 2 and Figure 6).

    Figure 6. Scheme for calculating the values of the output variables of the psychoacoustic model
    In total, the values of 11 variables of the psychoacoustic model are calculated. They are listed in Table 2.

    Table 2. Output variables of the psychoacoustic model
    For two-channel audio signals, the variable values are calculated for each channel separately and then averaged. The values of all variables (except the ADBB and MFPDB variables) for each signal channel are calculated independently of the second channel.
    General description of the variable calculation process
    All values of the output variables of the model are obtained by averaging, over all frames, the time and frequency functions obtained in the previous step (the result is a scalar value).
    The values to be averaged must lie within boundaries defined by the following condition: the beginning (or end) of the data to be averaged is the first position from the beginning (or end) of the sequence of audio signal amplitude values at which the sum of five consecutive absolute amplitude values exceeds 200 in any of the audio channels. Frames lying outside these boundaries are ignored during averaging. The threshold value of 200 applies when the amplitudes of the input audio signals are normalized to the range from -32768 to +32767; otherwise, the threshold value is calculated as follows:
    (41)
    where the scaling is determined by the maximum amplitude value of the audio signal.
    The frame index n starts from zero at the first frame satisfying the threshold condition, and the number of frames N is counted up to the last frame satisfying this condition.
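    A sketch of this boundary rule for a single channel follows; for multi-channel signals the same check would be applied to each channel and the outermost positions taken. The function name is illustrative.

```python
# A sketch of the averaging-boundary rule: find the first and last positions
# where the sum of five consecutive absolute amplitudes exceeds the threshold.
import numpy as np

def averaging_bounds(x, threshold=200.0):
    """Return (start, end) sample indices between which frames are averaged,
    or None if the threshold is never exceeded; x is a 1-D array of amplitudes
    normalized to the range [-32768, 32767]."""
    sums = np.convolve(np.abs(x), np.ones(5), mode="valid")  # 5-sample sums
    hits = np.nonzero(sums > threshold)[0]
    if hits.size == 0:
        return None
    return hits[0], hits[-1] + 4    # the last window spans 5 samples
```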
    Windowed modulation difference 1 (WinModDiff1B)
    The instantaneous modulation difference is calculated by the following formula:
    (42)
    The value of the instantaneous modulation difference is averaged over all frequency bands in accordance with the following formula:
    (43)
    The final value of the output variable is obtained by averaging formula (43) with a sliding window of L = 4 frames (85 ms, since each step equals 1024 digitized values):
    (44)
    In this case, so-called delayed averaging is applied: the first 0.5 seconds of the signal do not participate in the calculations. The number of frames to skip is:
    (45)
    In formula (45), the floor operation denotes discarding the fractional part.
    Thus, in formula (44), the frame index runs only over frames that come after the 0.5-second delay.
    Average modulation difference 1 (AvgModDiff1B)
    The value of this output variable of the psychoacoustic model is calculated using the following formula:
    (46)
    where
    (47)
    Delayed averaging is also used when calculating this value.
    Average modulation difference 2 (AvgModDiff2B)
    First, the value of the instantaneous modulation difference is calculated by the formula:
    (48)
    Then, the modulation difference value averaged over the frequency bands is calculated:
    (49)
    The final value of the psychoacoustic model variable is calculated as follows:
    (50)
    where
    (51)
    Delayed averaging is also used when calculating this value.
    Noise loudness (RmsNoiseLoudB)
    The instantaneous noise loudness values are found by the following formula:
    (52)
    where
    (53)
    where:
    (54)
    (55)
    (56)
    Next, if the instantaneous noise loudness is less than 0, it is set to 0:
    (57)
    The value of the final output variable of the psychoacoustic model is found by averaging the instantaneous noise loudness:
    (58)
    Delayed averaging is used when calculating this value. In addition to delayed averaging, a loudness threshold is applied: the averaging starts from the first instantaneous noise loudness value at which the loudness threshold is exceeded, but no later than 0.5 seconds from the beginning of the signal (in accordance with delayed averaging).
    Loudness threshold condition
    The values of the instantaneous noise loudness at the beginning of both signals (original and test) are ignored until 50 ms have passed after the total loudness value exceeds the threshold value of 0.1 in both channels of one of the signals.

    The threshold-exceeding condition can be represented as:
    (59)
    The number of frames that are skipped after the threshold is exceeded is calculated by the following formula:
    (60)

    Bandwidths of the original and restored audio signals (BandwidthRefB and BandwidthTestB)
    The operations for calculating the bandwidths of the original and restored audio signals are described in terms of operations on the DFT output values expressed in decibels (dB). First, the following operations are performed for each frame:
    • For the restored signal: the largest component above 21.6 kHz is located. This value is called the threshold level.
    • For the original signal: performing a downward search starting at 21.6 kHz, the first value exceeding the threshold level by 10 dB is found. The frequency corresponding to this value is called the bandwidth of the original signal.
    • For the restored signal: performing a downward search starting from the bandwidth of the original signal, the first value exceeding the threshold level by 5 dB is found. The frequency corresponding to this value is the bandwidth of the restored signal.
    If the frequency found for the original signal does not exceed 8.1 kHz, the bandwidth for this frame is ignored.
    The bandwidths of all frames are expressed in terms of fundamental DFT frequencies (DFT bins).
    The fundamental DFT frequency of the n-th frame is recorded separately for the original signal and for the restored signal. The final values of the psychoacoustic model variables, the bandwidths of the original and restored signals, are calculated by the following formulas, respectively:
    (61)
    (62)
    where the summation is performed only over those frames in which the fundamental DFT frequency exceeds 8.1 kHz.
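    A hedged sketch of the per-frame search follows; the DFT bin conventions (half-spectrum of a 2048-point DFT at 48 kHz) and the exact boundary handling are assumptions rather than prescriptions of the standard.

```python
# A sketch of the per-frame bandwidth search. ref_db and test_db are the dB
# magnitudes of one frame's half-spectrum (1024 bins of a 2048-point DFT at
# 48 kHz, bin width 48000 / 2048 = 23.4375 Hz); these conventions are assumed.
import numpy as np

BIN_HZ = 48000.0 / 2048.0
K_216 = int(21600.0 / BIN_HZ)          # first bin at or above 21.6 kHz

def frame_bandwidths(ref_db, test_db):
    """Return (bw_ref, bw_test) in DFT bins for one frame, or None if the
    reference bandwidth does not exceed 8.1 kHz."""
    threshold = test_db[K_216:].max()  # largest test component above 21.6 kHz
    bw_ref = 0
    for k in range(K_216 - 1, -1, -1):             # downward search
        if ref_db[k] >= threshold + 10.0:
            bw_ref = k + 1
            break
    if bw_ref * BIN_HZ <= 8100.0:                  # frame is ignored
        return None
    bw_test = 0
    for k in range(bw_ref - 1, -1, -1):            # search below bw_ref
        if test_db[k] >= threshold + 5.0:
            bw_test = k + 1
            break
    return bw_ref, bw_test
```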
    Ratio of the noise level to the masking threshold (TotalNMRB)
    The masking threshold is calculated by the following formula:
    (63)
    where
    (64)
    The noise level is calculated as:
    (65)
    where k is the index of the fundamental DFT frequency.
    The ratio of the noise to the masking threshold in the k-th band is expressed by the following formula:
    (66)
    The final ratio of the noise to the masking threshold (in dB) is calculated as follows:
    (67)
    Relative number of distorted frames (RelDistFramesB)
    The maximum ratio of noise to the masking threshold in a frame is calculated as follows:
    (68)
    A frame is considered distorted if its maximum ratio of noise to the masking threshold exceeds 1.5 dB.
    The final value of the output variable of the psychoacoustic model is the ratio of the number of distorted frames to the total number of frames.
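    A minimal sketch of this rule, assuming the maximum noise-to-mask ratio of each frame has already been computed:

```python
# A sketch of RelDistFrames: the fraction of frames whose maximum
# noise-to-mask ratio exceeds 1.5 dB.
import numpy as np

def rel_dist_frames(max_nmr_db):
    """max_nmr_db: one maximum noise-to-mask value (in dB) per frame."""
    max_nmr_db = np.asarray(max_nmr_db, dtype=np.float64)
    return np.count_nonzero(max_nmr_db > 1.5) / max_nmr_db.size
```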

    Maximum probability of distortion detection (MFPDB)
    First, the asymmetric excitation is calculated:
    (69)
    where
    (70)
    Next, the step size for distortion detection is calculated:
    (71)
    where
    (72)
    The probability of detection is calculated as follows:
    (73)
    where b is calculated as:
    (74)
    The number of steps above the detection probability threshold is calculated:
    (75)
    Characteristics (73) and (75) are calculated for each channel of the signal. For each frequency and time instant, the total probability of detection and the total number of steps above the threshold are taken as the larger value over all channels:
    (76)
    where indices 1 and 2 denote the channel number.
    For single-channel signals, the above values are calculated as follows:
    (77)
    The following computational procedure is then performed:
    (78)
    where the initial condition is zero.
    The maximum probability of distortion detection is calculated by the recurrence formula:
    (79)
    The final value of the output variable of the psychoacoustic model is calculated as follows:
    (80)
    Average distorted block (ADBB)
    First, the sum of the total number of steps above the detection threshold is calculated:
    (81)
    The summation is carried out over all values for which the detection probability threshold is exceeded.
    The final characteristic has the form:
    (82)
    Harmonic structure of the error (EHSB)
    This calculation uses the DFT output values of the original and restored signals.
    The following characteristic is calculated:
    (83)
    A vector of length M is formed from the values of D[k]:
    (84)
    The normalized autocorrelation is calculated by the formula:
    (85)
    where
    Let C[l] = C[l, 0]. Next, the following is calculated:
    (86)
    If the two signals are equal, the normalized autocorrelation in (85) is set equal to one in order to avoid division by zero.
    A window function of the following form is introduced:
    (87)
    The window transformation (87) is applied to the normalized autocorrelation:
    (88)
    where
    (89)
    The power spectrum is calculated by the formula:
    (90)
    The search for the maximum peak of the power spectrum begins at k = 1 and continues to the end of the search range. The maximum peak value found is then used to calculate the final value of the output variable of the psychoacoustic model by the following formula:
    (91)
    (91)
    When calculating this value, low-energy frames are excluded. To identify low-energy frames, a threshold value is introduced:
    (92)
    defined for amplitudes stored as 16-bit integers.
    The frame energy is estimated using the following formula:
    (93)
    When calculating the harmonic structure of the error, a frame is ignored if:
    (94)

    IV. Normalization of the values of the output variables of the psychoacoustic model
    The values of the output variables of the psychoacoustic model obtained in the previous step are normalized in accordance with the following formula:
    (95)
    where the argument is the value of the i-th output variable of the psychoacoustic model, and the normalization constants are given in Table 3 below.

    Table 3. Constants for normalizing the values of the output variables of the psychoacoustic model

    V. Estimating the quality of the restored signal using an artificial neural network
    The final metric value is calculated as follows:
    (96)
    where bmin = −3.98 and bmax = 0.22, and the function sig(x) is the asymmetric sigmoid:
    (97)
    The intermediate value in (96) is calculated as follows:
    (98)
    where the i-th input is the normalized value of the i-th output variable, I is the number of output variables (equal to 11), J is the number of neurons in the hidden layer (equal to 3), and the weights and biases of the neural network are given in Tables 4-6 below.

    Table 4. Weights of the neural network

    Table 5. Biases of the neural network

    Table 6. Weights and biases of the neural network
    The value of this metric (PEAQ) is a real number in the interval [-3.98; 0.22].
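    A hedged sketch of this final stage follows; the placement of the sigmoids and bias terms reflects the usual reading of ITU-R BS.1387, and the weight arrays stand in for the constants of Tables 4-6.

```python
# A sketch of stage V: mapping the 11 normalized output variables of the
# psychoacoustic model to the PEAQ score through one hidden layer of 3 neurons.
# wx (11x3), bx (3,), wy (3,), by (scalar) stand in for Tables 4-6.
import numpy as np

B_MIN, B_MAX = -3.98, 0.22

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peaq_score(mov_normalized, wx, bx, wy, by):
    """mov_normalized: array of the 11 values produced by stage IV."""
    hidden = sigmoid(bx + mov_normalized @ wx)                   # formula (98)
    return B_MIN + (B_MAX - B_MIN) * sigmoid(by + hidden @ wy)   # formula (96)
```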


    6.3 PSNR calculation algorithm
    The peak signal-to-noise ratio between the original audio signal and the restored one is calculated by the formulas:

    $$\mathrm{PSNR} = 10\,\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}} \qquad (99)$$

    $$\mathrm{PSNR} = 20\,\log_{10}\frac{\mathrm{MAX}}{\sqrt{\mathrm{MSE}}} \qquad (100)$$

    where, in turn:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2 \qquad (101)$$

    $x_i$ and $y_i$ are the i-th digitized values of the original and restored audio signals, respectively, i = 1, 2, ..., n, and MAX is the maximum value among the digitized values of the original audio signal.
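    A minimal sketch of this computation, assuming equal-length single-channel signals at 44100 Hz:

```python
# A sketch of the PSNR metric of section 6.3 for two equal-length 1-D signals.
import numpy as np

def psnr(original, restored):
    original = np.asarray(original, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((original - restored) ** 2)     # formula (101)
    if mse == 0.0:
        return float("inf")                       # identical signals
    peak = np.max(original)                       # maximum of the original signal
    return 10.0 * np.log10(peak ** 2 / mse)       # formula (99)
```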


    6.4 Algorithm for calculating the waveform difference coefficient metric
    Let x be the original single-channel audio signal (or one channel of the original multi-channel audio signal) and, similarly, y the restored single-channel audio signal (or one channel of the restored multi-channel audio signal). Both signals consist of the same number of values N.
    The arrays of amplitude values of the signals x and y are represented as relative changes of the amplitude values:
    (102)
    The value K of the waveform difference coefficient metric is calculated as the root-mean-square deviation of the arrays of amplitude values:
    (103)
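    Since the exact form of transform (102) is not reproduced in this copy, the sketch below assumes a first difference normalized by the peak amplitude; only formula (103), the root-mean-square deviation, is taken literally.

```python
# A hedged sketch of the waveform difference coefficient (section 6.4). The
# "relative change" transform is assumed to be a peak-normalized first
# difference; the final RMS deviation corresponds to formula (103).
import numpy as np

def waveform_difference(original, restored):
    x = np.asarray(original, dtype=np.float64)
    y = np.asarray(restored, dtype=np.float64)
    dx = np.diff(x) / np.max(np.abs(x))      # assumed form of transform (102)
    dy = np.diff(y) / np.max(np.abs(y))
    return np.sqrt(np.mean((dx - dy) ** 2))  # RMS deviation, formula (103)
```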


    7 Methods for comparing compression algorithms for digitized audio data
    7.1 Two or more compression algorithms are comparable with each other if they belong to the same class in accordance with Table 1.
    7.2 Of two or more comparable compression algorithms, the best is the one that provides the best values for at least two of the three metrics shown in Table 1. The best value is the larger value for the PSNR and PEAQ metrics and the smaller value for the waveform difference coefficient metric.
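    For illustration, a sketch of this two-out-of-three rule for a pair of comparable algorithms; the dictionary keys are illustrative:

```python
# A sketch of the comparison rule of section 7.2. Each argument is a dict with
# keys "psnr", "peaq" (higher is better) and "wdc", the waveform difference
# coefficient (lower is better).
def better_algorithm(a, b):
    wins_a = (a["psnr"] > b["psnr"]) + (a["peaq"] > b["peaq"]) + (a["wdc"] < b["wdc"])
    wins_b = (b["psnr"] > a["psnr"]) + (b["peaq"] > a["peaq"]) + (b["wdc"] < a["wdc"])
    if wins_a >= 2:
        return "a"
    if wins_b >= 2:
        return "b"
    return "tie"   # neither wins at least two of the three metrics
```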

    References:
    1. P. Kabal, An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality
    2. PQevalAudio
