Life of Oracle I / O: Tracing Logical and Physical I / O with SystemTap
- Transfer
We invite all specialists to participate in the section devoted to commercial databases on PG Day'17 Russia! Alexander Khalukhin from Deutsche Bank is preparing an intensive workshop on performance diagnostics for Oracle Database . See the full list of performances in the program on the site !
And we continue to publish translations of articles from database experts from CERN. Today's material is part two of a series of articles on tracing and debugging Oracle DB performance using SystemTap.
This article focuses on tracking logical and physical readings in Oracle using SystemTap. Here you will find several examples that illustrate the basic mechanisms used by Oracle to perform physical and logical I / O, and learn how to create SystemTap scripts to diagnose and troubleshoot Oracle I / O.
Introduction
Performing input / output from and onto the media, i.e. physical input / output is one of the most important tasks that underlie all database engines. When accessing data stored in the database cache in RAM, we can talk about logical input / output. Physical I / O is performed by the database using the OS kernel system calls. The type of system calls used depends on the database, OS, and storage type. Logical I / O is performed by database processes using Oracle kernel functions. Oracle provides physical I / O call time measurements using the wait event interface , and logical I / O time is generally considered CPU time. I / O tracing performed by Oracle processes and complemented by the details provided by the Oracle wait event interface is a powerful method for examining the internal deviceOracle I / O. Using a few real-life examples, we will find out what mechanisms Oracle uses to access the storage, how I / O operations correlate with the wait event interface, and also find out the time rate for the most common I / O wait events.
Creating a tracing tool
Investigation of the operations that Oracle and OS perform when performing an I / O operation requires special tools. SystemtapIs a dynamic tracing tool designed specifically for Linux that allows you to track the functions of both OS and Oracle (Linux kernel functions, system calls and Oracle database kernel functions). The following is a description of the systemtap script that can be used to track Oracle's logical and physical I / O: it consists of probes that can join function calls of interest and read function parameters, memory, and processor registers .
Systemtap requires relatively recent kernel versions to be able to profile userspace: utrace or uprobe_events functionality must be available. For example, RHEL / OL 6.5 and higher, but not RHEL 5, is suitable. In addition, for kernel trace calls, you must install the kernel packagedebuginfo.
Подробнее об этом и о том, как создать тестовую среду, можно прочитать в этой статье. Скрипт, который мы используем в данном посте для трассировки логического и физического ввода-вывода Oracle, а также интерфейса событий ожидания, выглядит следующим образом: trace_oracle_logicalio_wait_events_physicalio_12102.stp (для Oracle 12.1.0.2). Также доступен вариант для версии 11.2.0.4: trace_oracle_logicalio_wait_events_physicalio_11204.stp. Прежде, чем перейти к примерам исследования ввода-вывода Oracle, я бы хотел рассмотреть три основные части скрипта: отслеживание физического ввода-вывода, логического ввода-вывода и событий ожидания.
Отслеживание физического ввода-вывода
Ultimately, Oracle makes system calls to perform I / O. The type of calls depends on the database settings, OS type, and storage type. In this article, you can find examples for Oracle on Linux using ASM storage on block devices. Please note that direct NFS, local storage and ASM with asmlib were not included in the article and may be considered in future studies. System calls can also be tracked using the strace (1) utility on Linux. However, in this article you will see how probes SystemTap can be created to collect information about system calls of interest: pread, pwrite, io_submit and io_getevents. The following SystemTap probes can be useful for the underlying layer of block I / O: ioblock.request, ioblock_trace.request and ioblock.end (note that these are not system calls, but trace points for function calls inside the Linux kernel).
Примечание, добавленное в августе 2015 года: probes Systemtap для kernel.trace('block_rq_issue') и kernel.trace('block_rq_complete') также могут использоваться для исследования интерфейса блочного ввода-вывода с дополнительным преимуществом: они не требуют kernel debuginfo. Примеры таких probes можно найти на странице с материалами для загрузки или на GitHub.
Отслеживание логического ввода-вывода
Речь идет об истории и деталях операций логического ввода-вывода, когда Oracle читает блок данных из памяти (буферный кэш). Операции логического ввода-вывода включают в себя подмножество операций физического ввода-вывода: если Oracle не находит данные в памяти (буферном кэше), it performs a physical read. You can find some examples of this mechanism in action later in this article.
The Oracle toolkit provides extensive information on logical readings in several V $ views, such as V $ SYSSTAT , V $ SESSTAT, V $ SYSMETRIC. Logical read data is also available from sql * plus with the “set autotrace on” parameter and tkprof reports on event traces 10046 . Logical readings come in two main forms: consistent and current. Oracle will use consistent reads when reading data with a given system change number SCN, current reads are used when performing DML operations on a block. Consistent reads can be traced in Oracle using events 10200 and 10201. Более подробную информацию можно найти в презентации Джулиана Дайка (Julian Dyke) о логическом вводе-выводе.
В этой статье мы покажем, как отслеживать логический ввод-вывод, подключая probes SystemTap к набору функций ядра Oracle. Роль функций kcbgtcr и kcbgcur для трассировки consistent и current reads уже обсуждалась Александром Анохиным в его превосходных статьях о Dtrace LIO. Совсем недавно я исследовал эту тему и счел полезным также оттрассировать функции kcbzib, kcbzgb, kcbzvb.
Резюме того, что я выяснил об этих функциях, приведено ниже в таблице. Примечание: выражения %rdi, %rsi, %r8 указывают значения соответствующих регистров процессора. Using SystemTap, a register value can be read using the register function, for example: register ('rdi'). CPU registers are used to extract function call arguments . Systemtap does not provide function call arguments when there are no debug symbols for the executable. In particular, calling conventions for x86-64, as described in www.x86-64.org/documentation/abi.pdf , indicate that the integer parameters for calling the function are available in the following registers (in order):% rdi,% rsi ,% rdx,% rcx,% r8 and% r9.
Примечание: на момент написания статьи только uprobes для ядер версии 3.10 и выше поддерживают возвращаемые probes для функций userspace. Поэтому данная функциональность недоступна в RHEL или OL 6.x, но доступна в ядрах RHEL 7.x.
Отслеживание событий ожидания
Интерфейс событий ожидания, вероятно, является одной из лучших функций в настройке производительности Oracle, поскольку он предоставляет количественный метод выявления проблем производительности и узких мест. Кэри Миллсап (Cary Millsap) опубликовала вдохновляющую работу на эту тему.
For our purposes, it is sufficient to indicate that the wait events are instrumentation points at which the selected actions of the Oracle kernel are measured. In particular, Oracle will mark the initial and final calls of most physical I / O operations, thereby allowing a good understanding of how much time is spent in the process of physical I / O (some limitations of this mechanism are discussed later in this post). There are restrictions on using the wait event interface to study I / O latency , in particular, in the examples section you can find interesting tasks related to measuring the duration of asynchronous I / O events .
Waiting event tracing is typically activated in Oracle using event 10046 or by calling DBMS_MONITOR.SESSION_TRACE_ENABLE. In this article, we get the details of the wait event by attaching probes SystemTap to the corresponding Oracle kernel functions, namely kskthbwt and kskthewt, to collect information at the beginning and at the end of each waiting event, respectively. Enabling the trace of wait events using event 10046 at level 8 and other methods is a way to retrieve out time information about the wait event (as well as other details about SQL processing), and not a way to include wait information. This, in turn, is done using the TIMED_STATISTICS database parameter, and in modern Oracle databases, it must be set to TRUE. A key piece of information provided to us by the kskthbwt and kskthewt functions of the Oracle kernel is a pointer to the X $ KSUSE service table(or rather, the “underlying” memory structure in SGA), which is the “underlying” table for V $ SESSION and, therefore, allows you to compare wait events with a lot of useful information about the session executing SQL. Dereferencing the pointer in register r13 to the base value X $ KSUSE and calculating the offset for the fields of interest requires additional work, described in detail in the previous article (see also the script: ksuse_find_offsets.sql ).
Running the script
Below you will find a few trace examples for common Oracle I / O access patterns. Traces were obtained with the command stap -v trace_oracle_logicalio_wait_events_physicalio_12102.stp -x _pid_ , where _pid_ is the identifier of the process of the monitored session. The result is passed to sed -f eventsname.sed, where the event number is translated into the name associated with it. The eventsname.sed file is created by running the eventsname.sql script. It is important to restart the script to generate event names before testing, as the numbers of wait events can change without warning, for example, after migration to newer versions or after installing the patch. The system used to collect traces in the following examples is Oracle 12.1.0.2, running on OEL 7.0 with UEK3 under Virtualbox.
Example 1: single-block random reads
This example belongs to the category of single block random reads. This way of accessing data is very important for many OLTP workloads.because it is used to access the table through an index key. In Figure 1 below, you can see a fragment of the trace file for a query of type select col_list from table tab where indexed_col = val .
Fig. 1: Oracle Single-block random read and sequential read wait event from db file
Key points :
Example 2: sequential I / O to the buffer cache with synchronous I / O
This example deals with multi-block reads to the cache. This is the access method that is used, for example, when Oracle runs a full table scan . There are several ways in which Oracle can access the repository to perform a full table scan, in this example you see a case where Oracle uses synchronous I / O (i.e. the pread call) to perform OS read operations, examples 3 and 4 cover sequential reads with asynchronous I / O. In Figure 2 below, you can see a snippet for tracking the workload of a query such as select col_list form table tab , Oracle performs a full scan of the table tab.
Fig. 2: Oracle Sequential I / O when reading is done using pread OS calls
Key trace points in Figure 2:
Example 3: Sequential I / O to the buffer cache with calls to the OS asynchronous I / O functions.
In Figure 3, you see a fragment of the SQL trace select col_list from table tab , where Oracle runs a full scan of the table tab. This is the same query that was used in Example 2, the difference is that the trace in Figure 3 was done at a later stage in the query when the Oracle process switched from using synchronous I / O via pread calls to using the asynchronous I / O interface . Asynchronous I / O is implemented in Linux by io_submit and io_getevents callsto send I / O requests and get results. Despite the fact that in this example, Oracle uses an asynchronous interface to perform read operations, the end result is very similar to the result presented in example 2 (i.e., using synchronous I / O). The same Oracle wait event, “db file scattered read” , reads are also performed in the buffer cache, and I / O results are collected using blocking calls, which actually makes the process synchronous. A more “aggressive use” of the interface for asynchronous I / O occurs in Oracle when performing direct reads, which are discussed in Example 4 below.
Fig. 3: Oracle performing sequential I / O reads to the buffer cache using asynchronous I / O calls
Key points:
Example 4: full scans using direct reads and asynchronous I / O
In this section you will find two examples of Oracle I / O in the case of direct reads in a block storage configured using ASM . It uses the same query that executes a full table scan as in examples 2 and 3. In this example, unlike examples 2 and 3, Oracle uses serial direct read to access the data. Serial direct read is a feature introduced in 11gR2 to optimize full segment scan operations and bypass the buffer cache layer., similar to what happens in parallel queries. Oracle chooses when to use serial direct reads instead of cached reads based on several parameters. In a first approximation, direct reads are preferable when the table is “large” compared to the size of the buffer cache, but the actual mechanism is more complex and could potentially change depending on the version of Oracle. You can learn more about this topic in an article by Tanel Poder .
To run these tests, I had to force the use of serial direct reads before starting the request, this was done using alter session set "_serial_direct_read" = always; .
Note that the request alter session set "_serial_direct_read" = never;can be used to disable serial direct reads and force Oracle to use cached reads (i.e., the type of I / O you saw in examples 2 and 3 above). The default value for the _serial_direct_read parameter is auto.
For research in this example, we can use a slightly different SystemTap probe than in the previous ones. I removed the trace of logical I / O and block I / O to reduce clutter, and also added a new probe to the userspace io_getevents_0_4 function from libaio (libaio is a Linux library that implements asynchronous I / O). This is because serial direct reads use non-blocking calls to collect I / O results, and these calls are best tracked in the libaio interface for the reasons describedFrits Hoogland in this article . The SystemTap script shown in Fig. 4a and 4b, it looks like this: trace_oracle_wait_events_asyncio_libaio_12102.stp (version 11.2.0.4 is available at trace_oracle_wait_events_asyncio_libaio_11204.stp ).
Fig. 4a: Tracing the wait event and Oracle I / O calls during a full table scan with serial direct read and asynchronous I / O. This is what happens in the early stages of full scan.
Key points in Figure 4a:
Using serial direct reads in Oracle is more complex than the simple example in Figure 2. 4a. Fritz Hoogland went into more detail on the direct reads internals at Oracle in this presentation . In particular, Oracle uses an adaptive mechanism for direct reads in order to achieve high throughput of the storage subsystem when possible. This is achieved by increasing the number of outstanding I / O operations when invoking the asynchronous I / O interface. Internally, this approach uses I / O slots. Note that from the comment on the Oracle 10353 diagnostic eventwe have additional indirect information on this topic: “Slots are an input-output unit, and this factor controls the number of outstanding input-output operations.”
Another discovery of Fritz is that there are I / O operations performed for direct reads, the time of which is not monitored by the wait event interface . This is because Oracle can make io_getevents calls with timeout = 0, which collects I / O data without blocking. You can see these mechanisms in action in the trace in fig. 4b. It was performed for the same full scan as in Figure 4a, but at a later time, when the adaptive serial direct reads mechanism configured Oracle I / O requests to be “more aggressive”in using storage resources.
Fig. 4b. Trace the wait event and Oracle I / O calls during the full table scan run using serial direct read and asynchronous I / O. This is an example of what happens when the adaptive mechanism increases the number of I / O operations to optimize throughput.
Key points of Figure 4b:
Example 5: asynchronous I / O for random reads
Oracle optimizes single-block reads in situations where it can prefetch and / or group I / O operations. This is an optimization for random I / O using the asynchronous I / O interface, because Oracle can group multiple I / O requests in one io_submit call, rather than send them one by one, as in example 1, where the pread OS call was used. The example in fig. 5 appeared after the launch of Kevin Closson's SLOB toolkit in database 12c. Take a look at step 3 of the execution plan, where Oracle uses the TABLE ACCESS BY INDEX ROWID BATCHED operation to access table data.
Fig. Figure 5: SLOB request execution plan illustrating batch access to the table.
In Figure 6 below, you can see a fragment of the Oracle I / O trace while executing the SLOB workload. As you can see, the asynchronous I / O interface is used to perform random I / O. In the case shown in fig. 6, 126 requests are transmitted in one io_submit operation, and then the results are collected using the io_getevents function, which waits for all requests to complete. The wait event in this case is “db file parallel read”, and the estimated wait time is the batch completion time of 126 requests. Therefore , db file parallel read latency does not provide an accurate measurement of I / O latency for single block random reads .
Fig. 6: Oracle performing I / O for random reads with asynchronous I / O in the case of batched reads. The trace was edited to replace a long list of similar trace lines (usually 126 lines that differ only in block #) with "..."
Example 6: DML trace
In Figure 7, you see an Oracle logical I / O trace fragment and wait events when a row is inserted into a table, followed by a commit. In this operation, physical I / O operations were not performed (the table was previously cached). You can see that most of the block reads were in current read mode, as expected for DML. Moreover, both blocks belong to the table (you can identify them in Figure 7 as belonging to tbs # = 11) inserted into rollback segments (you can identify them in Figure 7 as belonging to tbs # = 2). Note that for simplicity, the table in this example has no indexes.
Fig. 7: Oracle logical I / O performing an insert and commit operation
Example 7: tracing a log writer
In Figure 8, you can see the log writer trace during the commit operation. This is the same commit operation as in Example 6 (see also Figure 7 above). The wait event is “log file parallel write”, and its duration is detected by Oracle from start to completion of asynchronous I / O. Therefore, the measured time of the “log file parallel write” wait event is not an accurate indicator of block write delays, since its duration depends on the number of operations performed. In the case shown in Figure 8, you can see that the log writer writes data to two different block devices, which is consistent with using 1 redo-log from a group located on the ASM group with a redundancy level of normal: log writer sends 2 write requests I / O at the same time - to record the main and mirror extents of the log file - and waits for both of them to complete.
Fig. Figure 8: Log writer tracing during a commit operation.
Conclusion
Dynamic tracing and, in particular, SystemTap provide an opportunity to explore the physical and logical operations of Oracle I / O, and also open the way for in-depth study of Oracle I / Oat various levels of the Oracle engine down to the Linux kernel level. This article describes some methods, tools, and examples of how these studies can be performed.
Examples from Oracle's real workloads reveal some of the most common cases of logical and physical Oracle I / O and the relationship between the wait event interface and OS calls made by Oracle processes. The Oracle wait event interface can provide useful data for measuring I / O latency and for troubleshooting I / O in general . This is true for random reads measured using the db file sequential read wait event. However, the information provided by wait events associated withAsynchronous I / O requires additional attention : such events usually do not provide an accurate measurement of the delay in I / O operations. In addition, this article provides examples where some of the asynchronous I / O operations for Oracle direct reads are not handled by the wait event interface.
Acknowledgments : I would like to say thanks to Tanel Poder and Fritz Hoogland for the original ideas that inspired this work. Special thanks to Fritz Hoagland, who generously shared with me his deep expertise on this issue.
Downloading scripts : the scripts discussed in this article can be taken from the web page with the materials for downloading or atGitHub .
And we continue to publish translations of articles from database experts from CERN. Today's material is part two of a series of articles on tracing and debugging Oracle DB performance using SystemTap.
This article focuses on tracking logical and physical readings in Oracle using SystemTap. Here you will find several examples that illustrate the basic mechanisms used by Oracle to perform physical and logical I / O, and learn how to create SystemTap scripts to diagnose and troubleshoot Oracle I / O.
Introduction
Performing input / output from and onto the media, i.e. physical input / output is one of the most important tasks that underlie all database engines. When accessing data stored in the database cache in RAM, we can talk about logical input / output. Physical I / O is performed by the database using the OS kernel system calls. The type of system calls used depends on the database, OS, and storage type. Logical I / O is performed by database processes using Oracle kernel functions. Oracle provides physical I / O call time measurements using the wait event interface , and logical I / O time is generally considered CPU time. I / O tracing performed by Oracle processes and complemented by the details provided by the Oracle wait event interface is a powerful method for examining the internal deviceOracle I / O. Using a few real-life examples, we will find out what mechanisms Oracle uses to access the storage, how I / O operations correlate with the wait event interface, and also find out the time rate for the most common I / O wait events.
Creating a tracing tool
Investigation of the operations that Oracle and OS perform when performing an I / O operation requires special tools. SystemtapIs a dynamic tracing tool designed specifically for Linux that allows you to track the functions of both OS and Oracle (Linux kernel functions, system calls and Oracle database kernel functions). The following is a description of the systemtap script that can be used to track Oracle's logical and physical I / O: it consists of probes that can join function calls of interest and read function parameters, memory, and processor registers .
Systemtap requires relatively recent kernel versions to be able to profile userspace: utrace or uprobe_events functionality must be available. For example, RHEL / OL 6.5 and higher, but not RHEL 5, is suitable. In addition, for kernel trace calls, you must install the kernel packagedebuginfo.
Подробнее об этом и о том, как создать тестовую среду, можно прочитать в этой статье. Скрипт, который мы используем в данном посте для трассировки логического и физического ввода-вывода Oracle, а также интерфейса событий ожидания, выглядит следующим образом: trace_oracle_logicalio_wait_events_physicalio_12102.stp (для Oracle 12.1.0.2). Также доступен вариант для версии 11.2.0.4: trace_oracle_logicalio_wait_events_physicalio_11204.stp. Прежде, чем перейти к примерам исследования ввода-вывода Oracle, я бы хотел рассмотреть три основные части скрипта: отслеживание физического ввода-вывода, логического ввода-вывода и событий ожидания.
Отслеживание физического ввода-вывода
Ultimately, Oracle makes system calls to perform I / O. The type of calls depends on the database settings, OS type, and storage type. In this article, you can find examples for Oracle on Linux using ASM storage on block devices. Please note that direct NFS, local storage and ASM with asmlib were not included in the article and may be considered in future studies. System calls can also be tracked using the strace (1) utility on Linux. However, in this article you will see how probes SystemTap can be created to collect information about system calls of interest: pread, pwrite, io_submit and io_getevents. The following SystemTap probes can be useful for the underlying layer of block I / O: ioblock.request, ioblock_trace.request and ioblock.end (note that these are not system calls, but trace points for function calls inside the Linux kernel).
Name probe | Function | Favorite Options |
syscall.pread / pwrite | Synchronous I / O: read / write from / to file descriptor | fd (file descriptor number), offset, number. Return value: read / written byte |
syscall.io_submit | Asynchronous I / O: sending blocks for processing | nr (number of input-output operations). For each operation: file descriptor, offset, bytes, operation code (0 for reading, 1 for writing) Return value: number of I / O operations performed |
syscall.io_getevents | Asynchronous I / O: reading events from the execution queue | min_nr (minimum number of input / output operations to read), timeout for waiting for an event. Return value: number of collected I / O operations. For each operation: file descriptor, offset, bytes. |
ioblock.request, ioblock_trace.request | I / O requests sent to the kernel level of the block device interface | devname, sector, size, rw, block_io structure address |
ioblock.end | Return from the kernel level of the block device interface | devname, sector, rw, block_io structure address |
Примечание, добавленное в августе 2015 года: probes Systemtap для kernel.trace('block_rq_issue') и kernel.trace('block_rq_complete') также могут использоваться для исследования интерфейса блочного ввода-вывода с дополнительным преимуществом: они не требуют kernel debuginfo. Примеры таких probes можно найти на странице с материалами для загрузки или на GitHub.
Отслеживание логического ввода-вывода
Речь идет об истории и деталях операций логического ввода-вывода, когда Oracle читает блок данных из памяти (буферный кэш). Операции логического ввода-вывода включают в себя подмножество операций физического ввода-вывода: если Oracle не находит данные в памяти (буферном кэше), it performs a physical read. You can find some examples of this mechanism in action later in this article.
The Oracle toolkit provides extensive information on logical readings in several V $ views, such as V $ SYSSTAT , V $ SESSTAT, V $ SYSMETRIC. Logical read data is also available from sql * plus with the “set autotrace on” parameter and tkprof reports on event traces 10046 . Logical readings come in two main forms: consistent and current. Oracle will use consistent reads when reading data with a given system change number SCN, current reads are used when performing DML operations on a block. Consistent reads can be traced in Oracle using events 10200 and 10201. Более подробную информацию можно найти в презентации Джулиана Дайка (Julian Dyke) о логическом вводе-выводе.
В этой статье мы покажем, как отслеживать логический ввод-вывод, подключая probes SystemTap к набору функций ядра Oracle. Роль функций kcbgtcr и kcbgcur для трассировки consistent и current reads уже обсуждалась Александром Анохиным в его превосходных статьях о Dtrace LIO. Совсем недавно я исследовал эту тему и счел полезным также оттрассировать функции kcbzib, kcbzgb, kcbzvb.
Резюме того, что я выяснил об этих функциях, приведено ниже в таблице. Примечание: выражения %rdi, %rsi, %r8 указывают значения соответствующих регистров процессора. Using SystemTap, a register value can be read using the register function, for example: register ('rdi'). CPU registers are used to extract function call arguments . Systemtap does not provide function call arguments when there are no debug symbols for the executable. In particular, calling conventions for x86-64, as described in www.x86-64.org/documentation/abi.pdf , indicate that the integer parameters for calling the function are available in the following registers (in order):% rdi,% rsi ,% rdx,% rcx,% r8 and% r9.
Function name | Task in progress | Favorite Options |
kcbgtcr | Kernel Cache Buffers Get Consistent Read Used for consistent reads. | tbs # = user_int32 (% rdi) rel file n # = user_int32 (% rdi + 4) >> 22 & 0x003FFFFF block # = user_int32 (% rdi + 4) & 0x003FFFFF data_object_id # = user_int32 (% rdi + 8) object_id # = user_int32 (% rdi + 12) Note: for bigfile tablespaces: block # = user_int32 (% rdi + 4) |
kcbgcur | Kernel Cache Buffers Current Used for current reads | tbs # = user_int32 (% rdi) rel file n # = user_int32 (% rdi + 4) >> 22 & 0x003FFFFF block # = user_int32 (% rdi + 4) & 0x003FFFFF data_object_id # = user_int32 (% rdi + 8) object_id # = user_int32 (% rdi + 12) Note: for bigfile tablespaces: block # = user_int32 (% rdi + 4) |
kcbzib | kcbZIB stands for: Z (kcbz.o - module for auxiliary functions of physical input-output), IB: Input Buffer Oracle will perform physical read (s) to the buffer cache | |
kcbzgb | The GB suffix in kcbZGB stands for: Get (space for) Buffer Oracle allocates space in the buffer cache for a given block (usually before I / O). | tbs n # =% rsi, rel file n # =% rdx >> 22 & 0x003FFFFF block # =% rdx & 0x003FFFFF data_object_id # =% r8 object_id # =% r9 Note: for bigfile tablespaces: block # =% rdx |
kcbzvb | Called after Oracle reads the given block. This is part of the call chain to check the blocks controlled by the db_block_checking parameter. Estimated decryption: Kernel cache buffer verify block. Thanks @FritsHoogland for this clarification. This function is used both for reads in the buffer cache and for direct reads. | tbs n # =% rsi, rel file n # =% rdx >> 22 & 0x003FFFFF, block # =% rdx & 0x003FFFFF data_object_id = user_int32 (% rdi + 24) Note: for bigfile tablespaces: block # =% rdx |
Примечание: на момент написания статьи только uprobes для ядер версии 3.10 и выше поддерживают возвращаемые probes для функций userspace. Поэтому данная функциональность недоступна в RHEL или OL 6.x, но доступна в ядрах RHEL 7.x.
Отслеживание событий ожидания
Интерфейс событий ожидания, вероятно, является одной из лучших функций в настройке производительности Oracle, поскольку он предоставляет количественный метод выявления проблем производительности и узких мест. Кэри Миллсап (Cary Millsap) опубликовала вдохновляющую работу на эту тему.
For our purposes, it is sufficient to indicate that the wait events are instrumentation points at which the selected actions of the Oracle kernel are measured. In particular, Oracle will mark the initial and final calls of most physical I / O operations, thereby allowing a good understanding of how much time is spent in the process of physical I / O (some limitations of this mechanism are discussed later in this post). There are restrictions on using the wait event interface to study I / O latency , in particular, in the examples section you can find interesting tasks related to measuring the duration of asynchronous I / O events .
Waiting event tracing is typically activated in Oracle using event 10046 or by calling DBMS_MONITOR.SESSION_TRACE_ENABLE. In this article, we get the details of the wait event by attaching probes SystemTap to the corresponding Oracle kernel functions, namely kskthbwt and kskthewt, to collect information at the beginning and at the end of each waiting event, respectively. Enabling the trace of wait events using event 10046 at level 8 and other methods is a way to retrieve out time information about the wait event (as well as other details about SQL processing), and not a way to include wait information. This, in turn, is done using the TIMED_STATISTICS database parameter, and in modern Oracle databases, it must be set to TRUE. A key piece of information provided to us by the kskthbwt and kskthewt functions of the Oracle kernel is a pointer to the X $ KSUSE service table(or rather, the “underlying” memory structure in SGA), which is the “underlying” table for V $ SESSION and, therefore, allows you to compare wait events with a lot of useful information about the session executing SQL. Dereferencing the pointer in register r13 to the base value X $ KSUSE and calculating the offset for the fields of interest requires additional work, described in detail in the previous article (see also the script: ksuse_find_offsets.sql ).
Function name | Task in progress | Favorite Options |
kskthbwt | Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is a call to its parent function and marks the start of a wait event. | Useful information for profiling a function: register r13 -> points to a segmented array of X $ KSUSE (V $ SESSION) in SGA register rsi -> timestamp of the start of wait (in microseconds) register rdx -> number of wait events |
kskthewt | Kernel service Kompile thread end wait. This function is called at the end of the Oracle wait event. kslwtectx - call the parent function, marking the end of the wait event. The suffix "ewt" most likely means "end wait". | Useful information for profiling a function: register r13 -> indicates a segmented array X $ KSUSE (V $ SESSION) in SGA register rdi -> timestamp of the start of waiting (in microseconds) register rsi -> the number of wait events |
Running the script
Below you will find a few trace examples for common Oracle I / O access patterns. Traces were obtained with the command stap -v trace_oracle_logicalio_wait_events_physicalio_12102.stp -x _pid_ , where _pid_ is the identifier of the process of the monitored session. The result is passed to sed -f eventsname.sed, where the event number is translated into the name associated with it. The eventsname.sed file is created by running the eventsname.sql script. It is important to restart the script to generate event names before testing, as the numbers of wait events can change without warning, for example, after migration to newer versions or after installing the patch. The system used to collect traces in the following examples is Oracle 12.1.0.2, running on OEL 7.0 with UEK3 under Virtualbox.
Example 1: single-block random reads
This example belongs to the category of single block random reads. This way of accessing data is very important for many OLTP workloads.because it is used to access the table through an index key. In Figure 1 below, you can see a fragment of the trace file for a query of type select col_list from table tab where indexed_col = val .
Fig. 1: Oracle Single-block random read and sequential read wait event from db file
Key points :
- Oracle logically reads one of the blocks needed to retrieve data for SQL in this example:
- the logical read call contains additional information, which shows that Oracle wants to read block # = 2505675 from tbs # = 7, owned by obj # = 32174 (which is the table segment in this example);
- the block was not found in the cache, therefore, a new block is allocated in the buffer cache and physical reading is initiated;
- the database marks the start of a wait event for 'db file sequential read';
- a physical read request for a block becomes a pread call to the OS:
- an IO request arrives at the kernel through a system call; any function that occurs after a system call is a kernel function (for example, ioblock.request) until the system call ends and thus ends up in userland (in this case, the Oracle executable).
- This call to the read function is passed to the block I / O interface to read 8192 bytes (1 database block) from the block device (/ dev / sdl in the example in Figure 1);
- the block interface and, then, the OS call return the read data to the Oracle kernel;
- The Oracle wait interface marks the completion of a wait and calculates the elapsed time:
- waiting event - “db file sequential read”, parameters: p1 - file number, p2 - block number, p3 - number of blocks read (1 block);
- from the timestamp values in the trace, you can see that the duration of waiting for the sequential read db file depends on the time taken to complete the pread call with little overhead (note that the overhead could be more significant if we had a high load CPU on the database server, for example, due to workload from other Oracle sessions).
- We can conclude that in this example, the measured latency for the sequential read db file corresponds to the delay associated with physical reading .
Example 2: sequential I / O to the buffer cache with synchronous I / O
This example deals with multi-block reads to the cache. This is the access method that is used, for example, when Oracle runs a full table scan . There are several ways in which Oracle can access the repository to perform a full table scan, in this example you see a case where Oracle uses synchronous I / O (i.e. the pread call) to perform OS read operations, examples 3 and 4 cover sequential reads with asynchronous I / O. In Figure 2 below, you can see a snippet for tracking the workload of a query such as select col_list form table tab , Oracle performs a full scan of the table tab.
Fig. 2: Oracle Sequential I / O when reading is done using pread OS calls
Key trace points in Figure 2:
- Oracle provides sequential read for tbs # = 7 block # = 2621443 and finds it in the cache.
- Oracle performs a consistent read for the next block, does not find it in the cache, so it prepares to read the list of blocks from the repository in one physical read . This list consists of 124 blocks (it was shortened in this example for simplicity).
- The database marks the start of the db file scattered read wait event .
- Physical read requests become a single pread call to the OS to receive 1015808 bytes (124 blocks).
- This request moves down the I / O chain to the block I / O interfaces, which divide it into 2 reads from the “underlying” block device, since the maximum size of I / O in the system used for tests is 524288 bytes.
- According to Fritz Hoogland's research, the submitted I / O request is placed on the scattered / gather list of the I / O scheduler, where it is broken down to the I / O sizes declared by the block device.
- Input-output is performed by the block input-output interface.
- Return call pread.
- The wait event interface marks the end time of a wait
- The wait event is db file scattered read, and the parameters: p1 is the file number, p2 is the block number, p3 is the number of blocks.
- From the timestamp values in the trace, it can be seen that the length of the scattered read wait for the db file is mainly due to the time taken to complete the pread call with little overhead.
- From the trace of the calls to the kcbzvb function of the Oracle kernel, we get additional confirmation of the blocks that have been read.
- It can be concluded that in this example, the scattered read db file latency corresponds to an I / O delay .
Example 3: Sequential I / O to the buffer cache with calls to the OS asynchronous I / O functions.
In Figure 3, you see a fragment of the SQL trace select col_list from table tab , where Oracle runs a full scan of the table tab. This is the same query that was used in Example 2, the difference is that the trace in Figure 3 was done at a later stage in the query when the Oracle process switched from using synchronous I / O via pread calls to using the asynchronous I / O interface . Asynchronous I / O is implemented in Linux by io_submit and io_getevents callsto send I / O requests and get results. Despite the fact that in this example, Oracle uses an asynchronous interface to perform read operations, the end result is very similar to the result presented in example 2 (i.e., using synchronous I / O). The same Oracle wait event, “db file scattered read” , reads are also performed in the buffer cache, and I / O results are collected using blocking calls, which actually makes the process synchronous. A more “aggressive use” of the interface for asynchronous I / O occurs in Oracle when performing direct reads, which are discussed in Example 4 below.
Fig. 3: Oracle performing sequential I / O reads to the buffer cache using asynchronous I / O calls
Key points:
- Oracle performs consistent read for tbs # = 7 block # = 2522760, does not find it in the cache, so it prepares to read the list of blocks in the buffer cache. This list consists of 128 blocks and is shortened in this example.
- The database marks the start of a wait event for “db file scattered read” .
- Physical read requests are transferred to the OS as an io_submit call with two requests with a total size of 1 MB (i.e. 128 blocks): the first request for reading (operation code 0) is 65,536 bytes, the second is for reading from 983040 bytes (i.e. 1 MB - 64 KB).
- These requests are passed to the block I / O interface , which divides the largest request into 2 reads from the underlying block device (/ dev / sdi in this example). A reading of 64 KB is transferred to / dev / sdr without resizing.
- Oracle calls io_getevents to retrieve the results of just submitted asynchronous I / O requests, and waits. The timeout.tv_sec field is set to 600, which indicates that this is a blocking wait (this means that io_getevents waits 600 seconds (tv_sec) and 0 nanoseconds (tv_nsec)) to complete min_nr = 2 IO operations.
- The block device interface returns data up the call chain.
- The io_getevents call returns two reads.
- The wait event interface marks the end of a wait. The wait event in this case is the scattered read of the db file.
- Parameters of the wait event: p1 - file number, p2 - block number, p3 - number of blocks.
- The duration of a wait event is basically the time spent on calls to io_submit and io_getevents plus a little overhead.
- Note that while Oracle used the asynchronous interface, it also expected all I / O operations to complete, so I / O operations are actually synchronous .
- From calls to kcbzvb we get additional confirmation of the blocks read.
- Conclusion: the scattered read db file timeout in this example is not an accurate indication of I / O latency , since two reads on two different devices were requested in parallel , and the total time was reflected as if it were a single read.
Example 4: full scans using direct reads and asynchronous I / O
In this section you will find two examples of Oracle I / O in the case of direct reads in a block storage configured using ASM . It uses the same query that executes a full table scan as in examples 2 and 3. In this example, unlike examples 2 and 3, Oracle uses serial direct read to access the data. Serial direct read is a feature introduced in 11gR2 to optimize full segment scan operations and bypass the buffer cache layer., similar to what happens in parallel queries. Oracle chooses when to use serial direct reads instead of cached reads based on several parameters. In a first approximation, direct reads are preferable when the table is “large” compared to the size of the buffer cache, but the actual mechanism is more complex and could potentially change depending on the version of Oracle. You can learn more about this topic in an article by Tanel Poder .
To run these tests, I had to force the use of serial direct reads before starting the request, this was done using alter session set "_serial_direct_read" = always; .
Note that the request alter session set "_serial_direct_read" = never;can be used to disable serial direct reads and force Oracle to use cached reads (i.e., the type of I / O you saw in examples 2 and 3 above). The default value for the _serial_direct_read parameter is auto.
For research in this example, we can use a slightly different SystemTap probe than in the previous ones. I removed the trace of logical I / O and block I / O to reduce clutter, and also added a new probe to the userspace io_getevents_0_4 function from libaio (libaio is a Linux library that implements asynchronous I / O). This is because serial direct reads use non-blocking calls to collect I / O results, and these calls are best tracked in the libaio interface for the reasons describedFrits Hoogland in this article . The SystemTap script shown in Fig. 4a and 4b, it looks like this: trace_oracle_wait_events_asyncio_libaio_12102.stp (version 11.2.0.4 is available at trace_oracle_wait_events_asyncio_libaio_11204.stp ).
Fig. 4a: Tracing the wait event and Oracle I / O calls during a full table scan with serial direct read and asynchronous I / O. This is what happens in the early stages of full scan.
Key points in Figure 4a:
- Oracle requests I / O in continuous chunks of storage space of 1 MB. This is done to reduce the time spent searching and, potentially, using readahead whenever possible.
- The value of 1 MB for the read request comes from the parameter db_file_multiblock_read_count = 128, multiplied by the block size - 8 KB.
- Oracle sends one 1 MB I / O operation with an io_submit call , and then calls io_getevents_0_4 in the libiao library, using non-blocking mode to try to get the results. This happens twice, without any success. Then, Oracle sends another operation 1 MB in size and 4 times tries to collect the results in non-blocking mode without any success (I / O operations are not finished yet).
- The timing of the wait event begins immediately before the io_getevents calls that are used to retrieve the I / O results, this time in blocking mode (with a timeout of 600 seconds).
- In this case, the wait event is “direct path read” .
- Oracle collects the results of two I / O operations in two consecutive io_getevents calls .
- After collecting I / O data, Oracle marks the end of an event wait.
- As a result, 2 reads were performed, 1 MB each.
- The wait event interface does not report the correct number of I / O operations performed :
- The direct path read wait in this example reports a single read of 128 blocks (1 MB), and 2 MB were read.
- The elapsed time for the direct path read wait event does not measure I / O latency .
- Oracle only detects blocking calls to io_getevents that start after I / O requests.
- Oracle counts 2 reads together.
Using serial direct reads in Oracle is more complex than the simple example in Figure 2. 4a. Fritz Hoogland went into more detail on the direct reads internals at Oracle in this presentation . In particular, Oracle uses an adaptive mechanism for direct reads in order to achieve high throughput of the storage subsystem when possible. This is achieved by increasing the number of outstanding I / O operations when invoking the asynchronous I / O interface. Internally, this approach uses I / O slots. Note that from the comment on the Oracle 10353 diagnostic eventwe have additional indirect information on this topic: “Slots are an input-output unit, and this factor controls the number of outstanding input-output operations.”
Another discovery of Fritz is that there are I / O operations performed for direct reads, the time of which is not monitored by the wait event interface . This is because Oracle can make io_getevents calls with timeout = 0, which collects I / O data without blocking. You can see these mechanisms in action in the trace in fig. 4b. It was performed for the same full scan as in Figure 4a, but at a later time, when the adaptive serial direct reads mechanism configured Oracle I / O requests to be “more aggressive”in using storage resources.
Fig. 4b. Trace the wait event and Oracle I / O calls during the full table scan run using serial direct read and asynchronous I / O. This is an example of what happens when the adaptive mechanism increases the number of I / O operations to optimize throughput.
Key points of Figure 4b:
- This is the same full scan that is shown in fig. 4a. The difference is that the trace in Figure 4b was done at a later stage of the full scan operation (the table has 1 million blocks), when adaptive serial direct reads increased the number of I / O slots (failed asynchronous I / O operations).
- The figure shows that several io_submit operations are performed before the start of the timeout of the wait event.
- Oracle is trying to keep several failed I / O operations in the queue: up to nine unsuccessful asynchronous I / O operations can be seen in Figure 4b, while in the simpler case in Figure 4a (i.e., at the beginning of full scan) only 2 operations.
- Several attempts to collect I / O using io_getevents operations are performed by the Oracle process before the start of the wait event countdown:
- The results of two readings, 1 MB each, are collected by one call to io_getevents before the start of the countdown of the wait event, so these readings will not be taken into account in the wait event interface .
- It’s important to note that the Oracle process sets a timeout for this particular io_getevents call (and also for io_getevents_0_4 calls in libaio that did not return any I / O data) to 0, which means that Oracle does not wait for input results - withdrawals will be available, he will collect them only if they are available.
- Between the moment Oracle starts counting the “ direct path read ” wait event and the wait event ends, 3 io_getevents operations were performed (with a timeout set to 600, thereby blocking pending reads), a total of 9 reads were performed, 1 MB each.
- The wait event interface does not report all completed I / O operations :
- In the parameters of the wait event, we could find only a read trace of 1 MB (128 blocks), although in fact 9 MB was read.
- This is in addition to reading 2 MB outside of instrumentation wait events.
- The elapsed time for “db file direct read” does not accurately reflect the I / O latency , but rather the wait time for collecting I / O data, as shown in Figure 4b.
- The wait event interface does not report all completed I / O operations :
Example 5: asynchronous I / O for random reads
Oracle optimizes single-block reads in situations where it can prefetch and / or group I / O operations. This is an optimization for random I / O using the asynchronous I / O interface, because Oracle can group multiple I / O requests in one io_submit call, rather than send them one by one, as in example 1, where the pread OS call was used. The example in fig. 5 appeared after the launch of Kevin Closson's SLOB toolkit in database 12c. Take a look at step 3 of the execution plan, where Oracle uses the TABLE ACCESS BY INDEX ROWID BATCHED operation to access table data.
Fig. Figure 5: SLOB request execution plan illustrating batch access to the table.
In Figure 6 below, you can see a fragment of the Oracle I / O trace while executing the SLOB workload. As you can see, the asynchronous I / O interface is used to perform random I / O. In the case shown in fig. 6, 126 requests are transmitted in one io_submit operation, and then the results are collected using the io_getevents function, which waits for all requests to complete. The wait event in this case is “db file parallel read”, and the estimated wait time is the batch completion time of 126 requests. Therefore , db file parallel read latency does not provide an accurate measurement of I / O latency for single block random reads .
Fig. 6: Oracle performing I / O for random reads with asynchronous I / O in the case of batched reads. The trace was edited to replace a long list of similar trace lines (usually 126 lines that differ only in block #) with "..."
Example 6: DML trace
In Figure 7, you see an Oracle logical I / O trace fragment and wait events when a row is inserted into a table, followed by a commit. In this operation, physical I / O operations were not performed (the table was previously cached). You can see that most of the block reads were in current read mode, as expected for DML. Moreover, both blocks belong to the table (you can identify them in Figure 7 as belonging to tbs # = 11) inserted into rollback segments (you can identify them in Figure 7 as belonging to tbs # = 2). Note that for simplicity, the table in this example has no indexes.
Fig. 7: Oracle logical I / O performing an insert and commit operation
Example 7: tracing a log writer
In Figure 8, you can see the log writer trace during the commit operation. This is the same commit operation as in Example 6 (see also Figure 7 above). The wait event is “log file parallel write”, and its duration is detected by Oracle from start to completion of asynchronous I / O. Therefore, the measured time of the “log file parallel write” wait event is not an accurate indicator of block write delays, since its duration depends on the number of operations performed. In the case shown in Figure 8, you can see that the log writer writes data to two different block devices, which is consistent with using 1 redo-log from a group located on the ASM group with a redundancy level of normal: log writer sends 2 write requests I / O at the same time - to record the main and mirror extents of the log file - and waits for both of them to complete.
Fig. Figure 8: Log writer tracing during a commit operation.
Conclusion
Dynamic tracing and, in particular, SystemTap provide an opportunity to explore the physical and logical operations of Oracle I / O, and also open the way for in-depth study of Oracle I / Oat various levels of the Oracle engine down to the Linux kernel level. This article describes some methods, tools, and examples of how these studies can be performed.
Examples from Oracle's real workloads reveal some of the most common cases of logical and physical Oracle I / O and the relationship between the wait event interface and OS calls made by Oracle processes. The Oracle wait event interface can provide useful data for measuring I / O latency and for troubleshooting I / O in general . This is true for random reads measured using the db file sequential read wait event. However, the information provided by wait events associated withAsynchronous I / O requires additional attention : such events usually do not provide an accurate measurement of the delay in I / O operations. In addition, this article provides examples where some of the asynchronous I / O operations for Oracle direct reads are not handled by the wait event interface.
Acknowledgments : I would like to say thanks to Tanel Poder and Fritz Hoogland for the original ideas that inspired this work. Special thanks to Fritz Hoagland, who generously shared with me his deep expertise on this issue.
Downloading scripts : the scripts discussed in this article can be taken from the web page with the materials for downloading or atGitHub .