Theory of sound. What you need to know about sound in order to work with it. The Yandex.Music experience

    People perceive sound, like color, differently. For example, what seems too loud or low-quality to one person may be perfectly normal to another.

    When working on Yandex.Music, it is important for us to keep in mind the many subtleties that sound carries. What is volume, how does it change, and what does it depend on? How do audio filters work? What kinds of noise are there? How does sound change, and how do people perceive it?



    We learned a lot about all of this while working on the project, and today I will try to give an informal description of some basic concepts you need to know when dealing with digital audio processing. There is no serious mathematics here such as fast Fourier transforms - those formulas are easy to find online. I will describe the essence and meaning of the things you actually have to deal with.

    You could consider the reason for this post to be that we added the ability to listen to tracks in high quality (320 kbps) to the Yandex.Music apps. Or you could not. So, let's begin.

    Digitization, or There and Back Again


    First of all, let's figure out what a digital signal is, how it is obtained from an analog one, and where the analog signal itself comes from. In the simplest terms, the latter is the voltage fluctuations produced by the vibrations of a microphone's membrane.


    Fig. 1. A sound waveform

    This is how an audio signal looks; I think everyone has seen such pictures at least once in their life. To understand how an analog signal is converted into a digital one, draw the waveform on graph paper. For each vertical line, find its intersection with the waveform and the nearest integer value on the vertical scale: the set of such values is the simplest recording of a digital signal.

    Fig. 2. An interactive example of wave addition and digitization of a signal.



    Source: www.desmos.com/calculator/aojmanpjrl

    We will use this interactive example to see how waves of different frequencies are superimposed and how digitization works. In the left menu you can show or hide the graphs, adjust the input and sampling parameters, or simply drag the control points.
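
    For those who prefer code to graph paper, here is a minimal sketch of the same idea (plain JavaScript; all names and numbers are illustrative, not taken from the Desmos example): two sine waves are added together, the sum is sampled at fixed time steps, and each sample is rounded to the nearest integer level.

    // Digitization "on graph paper": sum two sine waves, sample the sum at a
    // fixed rate and round each sample to the nearest integer level.
    var sampleRate = 8;  // samples per second (deliberately tiny for readability)
    var amplitude = 4;   // number of "graph paper" cells above and below zero
    
    // The "analog" signal: a 1 Hz wave plus a quieter 3 Hz wave
    var analog = function(t) {
    	return amplitude * (Math.sin(2 * Math.PI * 1 * t) + 0.5 * Math.sin(2 * Math.PI * 3 * t));
    };
    
    var samples = [];
    for (var n = 0; n < sampleRate; n++) {
    	var t = n / sampleRate;               // the n-th vertical line on the paper
    	samples.push(Math.round(analog(t)));  // the nearest integer level
    }
    console.log(samples); // the simplest possible "digital recording" of one second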

    At the hardware level this, of course, looks much more complicated, and depending on the hardware the signal can be encoded in completely different ways. The most common method is pulse-code modulation, in which what is recorded at each moment of time is not the exact signal level but the difference between the current and the previous value. This reduces the number of bits per sample by about 25%. This encoding is used by the most common audio formats (WAV, MP3, WMA, OGG, FLAC, APE), which rely on the PCM WAV container.

    In reality, to create a stereo effect, audio is most often recorded with more than one channel. Depending on the storage format, the channels can be stored independently. Signal levels can also be recorded as the difference between the level of the main channel and the level of the current one.

    The inverse conversion from digital to analog is performed by digital-to-analog converters, which can have various designs and operating principles. I will omit the description of those principles in this article.

    Sampling


    As you know, a digital signal is a set of signal level values recorded at specified time intervals. The process of converting a continuous analog signal into a digital one is called discretization (in time and in level). A digital signal has two main characteristics: the sampling rate and the sampling depth (bit depth).


    Fig. 3. Discretization of the signal.
    Source: https://en.wikipedia.org/wiki/Sampling_(signal_processing)


    The sampling rate specifies the time interval at which the signal level values are recorded. There is the Kotelnikov theorem (in Western literature it is known as the Nyquist-Shannon theorem, and the name Kotelnikov-Shannon is also used), which states: to be able to accurately restore an analog signal from a discrete one, the sampling rate must be at least twice the maximum frequency in the analog signal. If we take the approximate range of frequencies perceived by humans, 20 Hz to 20 kHz, the optimal sampling rate (the Nyquist rate) works out to around 40 kHz. For standard audio CDs it is 44.1 kHz.


    Fig. 4. Quantization of the signal.
    Source: https://ru.wikipedia.org/wiki/Квантование_(обработка_сигналов)


    Sampling depth (bit depth) describes the number of bits used to record the signal level. It limits the accuracy with which the signal level can be recorded and its minimum value. Note that this characteristic is not about volume: it reflects the precision of the recording. The standard sampling depth on an audio CD is 16 bits. At the same time, unless special studio equipment is used, most people stop noticing a difference somewhere around 10-12 bits. A larger sampling depth, however, helps avoid noise during further processing of the sound.
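
    As a rough illustration (a sketch of my own, with arbitrary values), here is how a sample in the range from -1 to 1 could be quantized at a given bit depth, and how much rounding error that introduces:

    // Quantize a value from the range [-1, 1] to an integer grid
    // with the given bit depth (e.g. 16 for an audio CD).
    var quantize = function(value, bits) {
    	var levels = Math.pow(2, bits - 1) - 1; // e.g. 32767 for 16 bits
    	return Math.round(value * levels) / levels;
    };
    
    var x = 0.333333;
    console.log(x - quantize(x, 16)); // error on the order of 1/32767
    console.log(x - quantize(x, 10)); // a noticeably larger error at 10 bits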

    Noises


    In digital audio, there are three main sources of noise.

    Jitter


    Jitter consists of random deviations of the signal, usually caused by instability of the master oscillator's frequency or by different propagation speeds of different frequency components of the same signal. This problem arises at the digitization stage. In terms of our "graph paper" picture, it corresponds to slightly uneven spacing between the vertical lines.

    Quantization noise


    It is directly related to the sampling depth. Because the real values of the signal are rounded with finite precision when it is digitized, weak noise appears, caused by the lost precision. This noise can arise not only at the digitization stage but also during digital processing (for example, if the signal level is first lowered and then raised again).
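
    The second case is easy to see in a sketch (all numbers here are illustrative): if a 16-bit signal is strongly attenuated, rounded as it would be on storage, and then amplified back, the rounding error is amplified along with it and turns into audible noise.

    // Lower a 16-bit sample far down, round it (as storage would),
    // then raise it back: the rounding error grows together with the signal.
    var quantize16 = function(value) {
    	return Math.round(value * 32767) / 32767;
    };
    
    var original = 0.437;
    var lowered = quantize16(original / 64);  // drop the level by roughly 36 dB
    var restored = quantize16(lowered * 64);  // bring it back up
    console.log(original - restored);         // the residual error is quantization noise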

    Aliasing


    When digitizing, it is possible for frequency components to appear in the digital signal that were not present in the original. This error is called aliasing. The effect is directly related to the sampling rate, or rather to the Nyquist frequency. The easiest way to understand how it happens is to look at this picture:


    Fig. 5. Aliasing. Source: en.wikipedia.org/wiki/Aliasing

    The green curve is a frequency component whose frequency is above the Nyquist frequency. When such a component is digitized, there is not enough data to describe it correctly. As a result, playback produces an entirely different signal: the yellow curve.
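
    The same effect is easy to reproduce numerically (a sketch with made-up numbers): at a sampling rate of 1000 Hz the Nyquist frequency is 500 Hz, and a 900 Hz cosine produces exactly the same samples as a 100 Hz one, so that is what will be heard on playback.

    // At 1000 Hz sampling, a 900 Hz cosine is indistinguishable from a 100 Hz one.
    var sampleRate = 1000;
    for (var n = 0; n < 5; n++) {
    	var t = n / sampleRate;
    	var above = Math.cos(2 * Math.PI * 900 * t); // above the Nyquist frequency
    	var alias = Math.cos(2 * Math.PI * 100 * t); // its alias below Nyquist
    	console.log(above.toFixed(6), alias.toFixed(6)); // identical values
    }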

    Signal level


    To begin with, it should be understood that for a digital signal one can only talk about the relative signal level. The absolute level depends primarily on the playback equipment and is directly proportional to the relative one. Relative signal levels are customarily expressed in decibels, with the signal of the maximum possible amplitude at a given sampling depth taken as the reference point. This level is written as 0 dBFS (dB for decibel, FS for Full Scale). Lower signal levels are written as -1 dBFS, -2 dBFS, and so on. Obviously, there are simply no higher levels, since we started from the highest possible one.

    At first it can be hard to see how decibels relate to the actual signal level. In fact it is simple: every ~6 dB (more precisely, 20·log10(2) ≈ 6.02 dB) corresponds to a twofold change in the signal level. So when we talk about a signal at -12 dBFS, we know its level is four times less than the maximum; at -18 dBFS, eight times less, and so on. If you look at the definition of the decibel, it gives 10·log10(a / a0), so where does the 20 come from? The point is that the decibel is ten times the logarithm of the ratio of two quantities of energy. Amplitude is not an energy quantity, so it must be converted to one: the power carried by waves of different amplitudes is proportional to the square of the amplitude. Therefore, for amplitude (with all other conditions held constant), the formula becomes 10·log10(a^2 / a0^2) = 20·log10(a / a0).

    N.B. Note that the logarithm here is base 10 (decimal), whereas in most libraries the function named log is the natural logarithm.

    The signal level on this scale does not change with the sampling depth: a signal at -6 dBFS remains a signal at -6 dBFS. One characteristic does change, though: the dynamic range, i.e. the difference between the minimum and maximum values of a signal. It is calculated as n · 20·log10(2), where n is the sampling depth (for rough estimates you can use the simpler formula n · 6). For 16 bits this is ~96.33 dB, for 24 bits ~144.49 dB. This means that the largest level difference that can be described with 24-bit depth (144.49 dB) is 48.16 dB greater than the largest level difference at 16 bits (96.33 dB). On top of that, quantization noise at 24 bits is 48 dB quieter.
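
    To make these numbers concrete, here is a small sketch (function names are mine) that converts an amplitude ratio to decibels with the base-10 logarithm, as described above, and computes the dynamic range for a given sampling depth:

    // 20 * log10(ratio): amplitude ratio -> decibels.
    // Math.log is the natural logarithm, hence the division by Math.log(10).
    var ratioToDb = function(ratio) {
    	return 20 * Math.log(ratio) / Math.log(10);
    };
    // Dynamic range of an n-bit signal: n * 20 * log10(2).
    var dynamicRange = function(bits) {
    	return bits * 20 * Math.log(2) / Math.log(10);
    };
    
    console.log(ratioToDb(0.5));   // ~ -6.02 dB: half the amplitude
    console.log(ratioToDb(0.25));  // ~ -12.04 dB: a quarter of the amplitude
    console.log(dynamicRange(16)); // ~ 96.33 dB
    console.log(dynamicRange(24)); // ~ 144.49 dB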

    Perception


    When we talk about how people perceive sound, we first need to understand how that perception happens. Obviously, we hear with our ears. Sound waves interact with the eardrum and displace it; the vibrations are transmitted to the inner ear, where they are picked up by receptors. How far the eardrum is displaced depends on a characteristic called sound pressure. Perceived loudness, however, depends on sound pressure not linearly but logarithmically. That is why the relative SPL (sound pressure level) scale is used when talking about volume changes, with its values again expressed in decibels. It is also worth noting that perceived loudness depends not only on the sound pressure level but also on the frequency of the sound:


    Fig. 6. The dependence of perceived loudness on the frequency and amplitude of sound.
    Source: ru.wikipedia.org/wiki/Громкость_звука


    Volume


    The simplest example of sound processing is changing its volume, which is just multiplying the signal level by a fixed value. However, even something as simple as volume control has a pitfall. As noted earlier, perceived loudness depends on the logarithm of sound pressure, so a linear volume scale is not very practical. With a linear scale two problems appear at once: when the slider is above the middle of the scale, you have to move it quite far to get a noticeable change in volume, while near the bottom of the scale a shift of less than a hair's breadth can cut the volume in half (I think everyone has run into this). To solve this problem a logarithmic volume scale is used: on it, moving the slider by a fixed distance changes the volume by the same factor anywhere along the scale. Professional recording and processing equipment, as a rule, uses precisely the logarithmic volume scale.

    Maths


    Here I will probably return to the mathematics a little, because implementing a logarithmic scale is not as simple and obvious as it may seem, and finding the formula on the Internet is not as easy as one would like. At the same time I will show how easy it is to convert volume values to dBFS and back; this will be useful for the explanations that follow.

    // The minimum volume value: at this level the sound is muted
    var EPSILON = 0.001;
    // Coefficient for converting to dBFS and back
    var DBFS_COEF = 20 / Math.log(10);
    // Computes the volume from the position on the scale
    var volumeToExponent = function(value) {
    	var volume = Math.pow(EPSILON, 1 - value);
    	return volume > EPSILON ? volume : 0;
    };
    // Computes the position on the scale from the volume value
    var volumeFromExponent = function(volume) {
    	return 1 - Math.log(Math.max(volume, EPSILON)) / Math.log(EPSILON);
    };
    // Converts a volume value to dBFS
    var volumeToDBFS = function(volume) {
    	return Math.log(volume) * DBFS_COEF;
    };
    // Converts a dBFS value to volume
    var volumeFromDBFS = function(dbfs) {
    	return Math.exp(dbfs / DBFS_COEF);
    };
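
    A quick check of how this scale behaves (the values assume EPSILON = 0.001, as above): the middle of the slider corresponds to a gain of about 0.03, roughly -30 dBFS, rather than 0.5, which is exactly the point of a logarithmic scale.

    console.log(volumeToExponent(1));    // 1 - full volume (0 dBFS)
    console.log(volumeToExponent(0.5));  // ~0.0316 - about -30 dBFS
    console.log(volumeToExponent(0));    // 0 - muted
    console.log(volumeToDBFS(0.5));      // ~ -6.02 dBFS - half the amplitude
    console.log(volumeFromDBFS(-6.02));  // ~0.5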
    


    Digital processing


    Now back to the fact that we are dealing with a digital signal rather than an analog one. A digital signal has two features that must be considered when working with volume:
    • the precision with which the signal level is recorded is limited (quite severely so: 16 bits is half of what a standard single-precision floating-point number uses);
    • the signal has an upper level limit beyond which it cannot go.


    From the limited precision of the signal level, two things follow:
    • quantization noise grows as the volume is increased. For small changes this is usually not critical, since the initial noise level is far below what is noticeable, and the signal can safely be raised 4-8 times (for example, with an equalizer limited to ±12 dB);
    • you should not first lower the signal level a lot and then raise it back strongly: this can introduce new quantization noise that was not there originally.


    From the fact that the signal has an upper level limit it follows that the volume cannot safely be raised above unity: peaks that rise above the limit will be "cut off" and data will be lost.


    Fig. 7. Clipping.
    Source: https://en.wikipedia.org/wiki/Clipping_(audio)
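
    In code, this "cutting off" is just a hard limit on every sample. A minimal sketch (my own, not how any particular player implements it):

    // Multiply a signal by a gain above 1 and clamp the result to [-1, 1]:
    // everything that does not fit into the range is simply cut off (clipped).
    var amplifyWithClipping = function(samples, gain) {
    	return samples.map(function(s) {
    		return Math.max(-1, Math.min(1, s * gain));
    	});
    };
    
    var signal = [0.2, 0.6, 0.9, -0.7, -0.95];
    console.log(amplifyWithClipping(signal, 1.5));
    // [0.3, 0.9, 1, -1, -1] - the last three peaks have lost information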


    In practice, all this means that the standard sampling parameters of an Audio CD (16 bits, 44.1 kHz) do not allow for high-quality sound processing, because they have very little redundancy. For such purposes it is better to use more redundant formats. Keep in mind, however, that the total file size is proportional to the sampling parameters, so serving such files for online playback is not a good idea.

    Volume measurement


    In order to compare the volume of two different signals, you first need to measure it somehow. There are at least three metrics for measuring signal loudness — the maximum peak value, the average value of the signal level, and the ReplayGain metric.

    The maximum peak value is a fairly weak metric for estimating volume. It does not take into account the general volume level in any way - for example, if you record a thunderstorm, then most of the time the recording will rustle quietly and only thunder will sound a couple of times. The maximum peak value of the signal level for such a recording will be quite high, but most of the recording will have a very low signal level. However, this metric is still useful - it allows you to calculate the maximum gain that can be applied to the record, at which there will be no data loss due to "clipping" peaks.

    The average value of the signal level is a more useful metric and easy to calculate, but still has significant drawbacks related to how we perceive sound. The screeching of a circular saw and the roar of a waterfall recorded with the same average signal level will be perceived in completely different ways.
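
    A sketch of the two simpler metrics (the function names, the toy data and the use of RMS as the "average level" are my own choices, not a quote from any standard):

    // Maximum peak value of a signal (useful for computing a safe gain).
    var peak = function(samples) {
    	return samples.reduce(function(max, s) { return Math.max(max, Math.abs(s)); }, 0);
    };
    // One common way to define the "average level": the root mean square (RMS).
    var rms = function(samples) {
    	var sumOfSquares = samples.reduce(function(sum, s) { return sum + s * s; }, 0);
    	return Math.sqrt(sumOfSquares / samples.length);
    };
    
    // A mostly quiet recording with a single loud peak (the "thunderstorm" case):
    var quietPart = [0.01, -0.02, 0.015, -0.01, 0.02, -0.015];
    var recording = quietPart.concat([0.9]).concat(quietPart);
    console.log(peak(recording)); // 0.9 - set entirely by the single loudest moment
    console.log(rms(recording));  // ~0.25 - and it keeps falling as the quiet part grows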

    ReplayGain conveys the perceived loudness of a recording most accurately and takes into account the physiological and psychological characteristics of sound perception. Many recording studios use it when producing recordings, and it is supported by most popular media players. (The Russian Wikipedia article on it contains many inaccuracies and does not really describe the essence of the technology correctly.)

    Volume normalization


    Since we can measure the loudness of different recordings, we can also normalize it. The idea of normalization is to bring different sounds to the same level of perceived loudness. Several approaches are used for this. As a rule, the aim is to make the volume as high as possible, but that is not always achievable because of the maximum signal level limit, so usually some value slightly below the maximum is chosen (for example, -14 dBFS) and all signals are brought to it.
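
    A minimal sketch of that idea (the -14 dBFS target and RMS as the loudness measure are deliberate simplifications; ReplayGain itself is considerably more involved):

    // Gain needed to bring a track to a target average (RMS) level in dBFS,
    // limited so that the loudest peak never goes above full scale.
    var normalizeGain = function(samples, targetDbfs) {
    	var peak = samples.reduce(function(m, s) { return Math.max(m, Math.abs(s)); }, 0);
    	var sumSq = samples.reduce(function(sum, s) { return sum + s * s; }, 0);
    	var rmsDbfs = 20 * Math.log(Math.sqrt(sumSq / samples.length)) / Math.log(10);
    	var desiredGain = Math.pow(10, (targetDbfs - rmsDbfs) / 20);
    	return Math.min(desiredGain, 1 / peak); // never let the peaks clip
    };
    
    console.log(normalizeGain([0.05, -0.04, 0.06, -0.05, 0.3], -14)); // ~1.4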

    Sometimes volume normalization is performed within a single recording: different parts of the recording are amplified by different amounts so that their perceived loudness is the same. This approach is very often used in desktop video players, since the soundtracks of many films contain sections with very different loudness. Problems arise when watching such films late at night without headphones: at a volume where the main characters' whispers can be heard normally, the gunshots can wake the neighbours, while at a volume where the shots don't hurt your ears, the whispers become indistinguishable. With intra-track normalization, the player automatically raises the volume in quiet passages and lowers it in loud ones.

    Intra-track normalization is also sometimes performed to increase the overall loudness of a track; this is called normalization with compression. With this approach the average signal level is maximized by amplifying the entire signal by a given amount, while the sections that would have been clipped by exceeding the maximum level are amplified by a smaller amount to avoid that. This way of increasing loudness noticeably reduces the sound quality of the track, but many recording studios do not shy away from using it nonetheless.

    Filtering


    I will not try to describe all existing audio filters; I will limit myself to the standard ones available in the Web Audio API. The simplest and most common of them is the biquad filter (BiquadFilterNode), an active second-order filter with an infinite impulse response that can reproduce quite a large number of effects. The filter works with two buffers of two samples each: one holds the last two samples of the input signal, the other the last two samples of the output signal. The resulting value is obtained by summing five terms: the current sample and the samples from both buffers, each multiplied by a pre-calculated coefficient. The coefficients of this filter are not set directly but are computed from the frequency, quality factor (Q) and gain parameters.
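
    In code, the core of such a filter is just the following recurrence (a generic sketch of a second-order IIR filter; the coefficients b0, b1, b2, a1, a2 are the ones computed from frequency, Q and gain, and I am not deriving them here):

    // One step of a biquad (second-order IIR) filter:
    // y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    var biquad = function(input, c) {
    	var x1 = 0, x2 = 0; // the last two input samples
    	var y1 = 0, y2 = 0; // the last two output samples
    	return input.map(function(x0) {
    		var y0 = c.b0 * x0 + c.b1 * x1 + c.b2 * x2 - c.a1 * y1 - c.a2 * y2;
    		x2 = x1; x1 = x0;
    		y2 = y1; y1 = y0;
    		return y0;
    	});
    };
    
    // With these hand-picked (not computed) coefficients it acts as a simple smoothing filter:
    console.log(biquad([1, 0, 0, 0, 0], { b0: 0.25, b1: 0.5, b2: 0.25, a1: 0, a2: 0 }));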

    All graphs below cover the frequency range from 20 Hz to 20,000 Hz. The horizontal axis shows frequency on a logarithmic scale; the vertical axis shows either the magnitude (yellow graph) from 0 to 2 or the phase shift (green graph) from -π to π. The frequency setting of all filters (632 Hz) is marked with a red line on the graphs.

    Lowpass



    Fig. 8. Lowpass filter.

    Passes only frequencies below the set frequency. The filter is set by frequency and quality factor.
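
    In the Web Audio API, this filter (and every filter described below) comes down to setting the node's type and parameters. A minimal sketch, assuming the page has an <audio> element with id="player" (that id is my assumption):

    // Route an <audio> element through a lowpass BiquadFilterNode (Web Audio API).
    var audioCtx = new AudioContext();
    var source = audioCtx.createMediaElementSource(document.getElementById('player'));
    
    var filter = audioCtx.createBiquadFilter();
    filter.type = 'lowpass';      // or 'highpass', 'bandpass', 'notch', 'lowshelf', ...
    filter.frequency.value = 632; // Hz - the red line on the graphs above
    filter.Q.value = 1;           // quality factor (shelf/peaking types also use filter.gain)
    
    source.connect(filter);
    filter.connect(audioCtx.destination);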

    Highpass



    Fig. 9. Highpass filter.

    It acts similarly to lowpass, except that it passes frequencies higher than the specified one, and not lower.

    Bandpass



    Fig. 10. The bandpass filter.

    This filter is more selective - it passes only a certain frequency band.

    Notch



    Fig. 11. The notch filter.

    It is the opposite of bandpass: it passes all frequencies outside a given band. Note, however, the difference between the attenuation curves and the phase characteristics of these two filters.

    Lowshelf



    Fig. 12. The lowshelf filter.

    A "smarter" version of highpass: it boosts or attenuates frequencies below the specified one and passes higher frequencies unchanged. The filter is set by frequency and gain.

    Highshelf



    Fig. 13. The highshelf filter.

    A "smarter" version of lowpass: it boosts or attenuates frequencies above the specified one and passes lower frequencies unchanged.

    Peaking



    Fig. 14. The peaking filter.

    This is a more “smart” version of notch - it enhances or attenuates frequencies in a given range and allows other frequencies to pass unchanged. The filter is set by frequency, gain and quality factor.

    Allpass filter



    Fig. 15. The allpass filter.

    Allpass differs from all the others: it does not change the amplitude characteristics of the signal at all, and instead shifts the phase of the signal around the specified frequency. The filter is set by frequency and quality factor.

    WaveShaperNode Filter


    The waveshaper is used to create complex sound-distortion effects; in particular, it can implement "distortion", "overdrive" and "fuzz". This filter applies a special shaping function to the input signal. The principles of constructing such functions are rather involved and deserve a separate article, so I will omit them here.
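
    Just to show the mechanics (the shaping function below is a crude soft-clipping curve I made up, not a recipe for a good overdrive):

    // A WaveShaperNode maps every input sample through a curve given as a Float32Array.
    var audioCtx = new AudioContext(); // or reuse an existing context
    var shaper = audioCtx.createWaveShaper();
    var curve = new Float32Array(1024);
    for (var i = 0; i < curve.length; i++) {
    	var x = (i / (curve.length - 1)) * 2 - 1; // input value in [-1, 1]
    	curve[i] = Math.tanh(3 * x);              // gentle saturation ("soft clipping")
    }
    shaper.curve = curve;
    // source.connect(shaper); shaper.connect(audioCtx.destination);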

    ConvolverNode Filter


    A filter that performs a linear convolution of the input signal with an audio buffer containing a certain impulse response. An impulse response is the response of some system to a single impulse; in simple terms, it can be called a "photograph" of sound. Where a real photograph contains information about light waves, about how they are reflected, absorbed and interact, an impulse response contains similar information about sound waves. Convolving an audio stream with such a "photograph" effectively imposes on the input signal the acoustics of the environment in which the impulse response was captured.

    This filter requires decomposing the signal into frequency components, which is done with the fast Fourier transform (unfortunately, the Russian-language Wikipedia article on it is rather unhelpful, apparently written for people who already know what an FFT is and could write an equally unhelpful article themselves). As I said in the introduction, I will not go into the mathematics of the FFT here, but it would be wrong not to mention this cornerstone algorithm of digital signal processing.

    This filter is used to implement the reverb effect. There are many libraries of ready-made impulse-response buffers for it that implement various effects (1, 2); such libraries are easy to find by searching for [impulse response mp3].
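
    A minimal sketch of wiring it up (the file name hall.mp3 is just a placeholder for whichever impulse response you download):

    // Load an impulse response and convolve the input signal with it (reverb).
    var audioCtx = new AudioContext();
    var convolver = audioCtx.createConvolver();
    
    fetch('hall.mp3') // placeholder impulse-response file
    	.then(function(response) { return response.arrayBuffer(); })
    	.then(function(data) { return audioCtx.decodeAudioData(data); })
    	.then(function(impulseResponse) {
    		convolver.buffer = impulseResponse;
    		// source.connect(convolver); convolver.connect(audioCtx.destination);
    	});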

    Materials




    Many thanks to my colleagues who helped to collect materials for this article and gave useful advice.

    Special thanks to Taras Audiophile Kovrizhenko for describing the algorithms for normalizing and maximizing volume and Sergey forgotten Konstantinov for a large number of explanations and tips on this article.

    UPD. Corrected the section on filtering and added links for the different types of filters. Thanks to Denis deniskreshikhin Kreshikhin and Nikita merlin-vrn Kipriyanov for pointing these out.
