Exploring the Windows 10 source tree: from telemetry to open source


    No matter how closed Microsoft software is, there is plenty of information about its internal structure. For example, the functions a library exports by name give an idea of its interfaces. Debugging symbols are also publicly available and are widely used for diagnosing errors in the OS. Still, all we have on hand are compiled binary modules. So the question arises: what were they before compilation? Let's try to figure out how to get more information about the source code without doing anything illegal.

    The idea, of course, is not new. Both Mark Russinovich and Alex Ionescu did the same thing in their time. I was only interested in getting up-to-date data, and in extending and clarifying work already done by others. For the experiment we need the debugging symbol packages, which are publicly available. I took the packages for the latest release version of Windows 10 (64-bit) and decided to examine both the release (free build) and the debug (checked build) packages.

    Debugging symbols are a set of files with the pdb extension (program database) that store various information for extending the debugging capabilities of the OS's binary modules, including the names of globals, functions, and data structures, sometimes together with their contents.

    In addition to the symbols, you can take the checked (debug) build of Windows 10, which is available under certain conditions. Such a build is rich in assertions, which reveal not only undocumented variable names that are missing from the symbol files, but also the file and the line number where the assertion fired.

    [Image: example of an assertion message from the checked build]

    The example shows not only the file name and its extension, but also the directory structure leading up to it, which is very useful even without the root.

    We run the strings utility from Sysinternals over the symbol files and get about 13 GB of raw data. Feeding it every file from the checked-build distribution indiscriminately, however, is a so-so idea: there would be too much unnecessary data. We restrict ourselves to a set of extensions: exe (executable files), sys (drivers), dll (libraries), ocx (ActiveX components), cpl (Control Panel components), and efi (EFI applications, the bootloader in particular). The raw data from the distribution came to 5.3 GB.
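
    The article used the Sysinternals strings utility for this step. Purely as an illustration of the idea, here is a minimal Python sketch that keeps only files with the listed extensions and pulls runs of printable ASCII out of them. The function names and the minimum run length are assumptions, and unlike the real utility this sketch ignores UTF-16 strings.

```python
import re
from pathlib import Path

# Extensions named in the text: executables, drivers, libraries, ActiveX
# components, Control Panel components and EFI applications.
BINARY_EXTENSIONS = {".exe", ".sys", ".dll", ".ocx", ".cpl", ".efi"}

# Minimum length of a printable run worth keeping (an assumption, not from the article).
MIN_LEN = 6
ASCII_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)


def is_interesting(path):
    """True if the file has one of the extensions we restrict ourselves to."""
    return path.suffix.lower() in BINARY_EXTENSIONS


def extract_strings(data):
    """Return every run of at least MIN_LEN printable ASCII characters."""
    return [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]


def dump_strings(root):
    """Yield strings from all interesting files below root (hypothetical helper)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and is_interesting(path):
            # Reads the whole file into memory; fine for a sketch.
            yield from extract_strings(path.read_bytes())
```

    Writing the yielded strings to one text file per data set would give raw dumps comparable to the ones described above.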

    To my surprise, few programs can even open files around ten gigabytes in size, and only a handful support searching inside such files. In this experiment, 010 Editor was used to view the raw and intermediate data manually. The quick-and-dirty data filtering was done with Python scripts.

    Filtering data from symbol files


    Symbol files contain, among other things, linker information. That is, a symbol file holds a list of the object files that were linked into the corresponding binary, and the linker records the full path to each object file.



    • Hook-filter No. 1: we search for lines matching the mask ":\\".


    We get the absolute paths, sort them, and remove duplicates. Incidentally, there was not much garbage, and it was removed by hand.
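
    A minimal sketch of hook-filter No. 1, assuming the raw strings arrive one per line. The function name is hypothetical, and the test for a drive colon followed by a backslash is one reading of the mask above.

```python
def absolute_paths(lines):
    """Keep lines that look like absolute Windows paths, sorted and deduplicated."""
    found = set()
    for line in lines:
        line = line.strip()
        # A drive letter's colon followed by a backslash, as in "d:\th".
        if ":\\" in line:
            found.add(line)
    return sorted(found)
```

    Whatever garbage survives this filter still has to be weeded out by hand, as described above.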

    Examining the resulting data made the approximate structure of the source tree clear. The root is "d:\th", which apparently stands for "threshold", matching the Threshold code name of the November version of Windows 10. However, there were few files under the root "d:\th". This is because the linker takes already-compiled files as input, and the object files are built in the folders "d:\th.obj.amd64fre" for release builds and "d:\th.obj.amd64chk" for checked builds.

    • Hook-filter No. 2: we assume that the source files are laid out the same way as the object files built from them, and we "map" the object files back to source files. Caution: this step can distort the structure of some folders, because the actual build parameters are not reliably known.


    For example:
    d:\th.obj.amd64fre\shell\osshell\games\freecell\objfre\amd64\freecellgame.obj
    was formerly
    d:\th\shell\osshell\games\freecell\freecellgame.c??

    As for the file extension: an object file can be produced from many different types of source file ("c", "cpp", "cxx", "asm", and so on). At this stage it is not clear which type of source file was used, so we leave the extension as "c??".
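
    The mapping in hook-filter No. 2 can be sketched roughly as follows. This is only an illustration built from the single example above: the rule of dropping a trailing "objfre\amd64" pair is inferred from that example, "objchk" is assumed by analogy for checked builds, and the function name is hypothetical.

```python
from pathlib import PureWindowsPath

# Object-file roots named in the text.
OBJ_ROOTS = {"th.obj.amd64fre", "th.obj.amd64chk"}
# Build output folders; "objfre" comes from the example, "objchk" is an assumption.
BUILD_DIRS = {"objfre", "objchk"}


def object_to_source(obj_path):
    """Map an object-file path to its presumed source path, or return None."""
    parts = PureWindowsPath(obj_path.lower()).parts
    # parts[0] is the drive root ("d:\"), parts[1] the build root folder.
    if len(parts) < 3 or parts[1] not in OBJ_ROOTS or not parts[-1].endswith(".obj"):
        return None
    dirs = list(parts[2:-1])
    # Drop the trailing "<objfre|objchk>\amd64" output folders, if present.
    if len(dirs) >= 2 and dirs[-2] in BUILD_DIRS and dirs[-1] == "amd64":
        dirs = dirs[:-2]
    # The real source type is unknown at this stage, hence the "c??" placeholder.
    name = parts[-1][: -len(".obj")] + ".c??"
    return "\\".join([parts[0] + "th", *dirs, name])
```

    Paths outside the two object roots, such as the vendor driver paths, simply yield None here and would need separate handling.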

    Besides the d:\th folder, there are many other roots, for example "d:\th.public.chk" and "d:\th.public.fre". We will skip these folders because they contain the public part of the SDK, which is of little interest to us. Also worth noting are the various driver project paths, which were apparently built somewhere on developers' workstations:

    c:\users\joseph-liu\desktop\sources\rtl819xp_src\common\objfre_win7_amd64\amd64\eeprom.obj
    C:\ALLPROJECTS\SW_MODEM\pcm\amd64\pcm.lib
    C:\Palau\palau_10.4.292.0\sw\host\drivers\becndis\inbox\WS10\sandbox\Debug\x64\eth_tx.obj
    C:\Users\avarde\Desktop\inbox\working\Contents\Sources\wl\sys\amd64\bcmwl63a\bcmwl63a\x64\Windows8Debug\nicpci.obj

    In other words, drivers for standards-compliant devices, such as USB XHCI, are part of the OS source tree, while all the vendor-specific drivers are built somewhere else.

    • Hook-filter No. 3: discard binary files, since we are only interested in source files. We drop "pdb", "lib", "exp", and so on. "res" files are mapped back to "rc", the source of the resource file.
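
    A minimal sketch of hook-filter No. 3. Only the three binary extensions spelled out in the text are listed, so the set would need extending in practice; the names used here are hypothetical.

```python
# Binary artifacts to discard; the text names these three "and so on".
DISCARDED_EXTENSIONS = {"pdb", "lib", "exp"}
# Compiled artifacts that can be mapped back to their source type.
ROLLBACK = {"res": "rc"}


def clean_entry(path):
    """Return the path to keep, possibly with a rolled-back extension, or None."""
    stem, dot, ext = path.rpartition(".")
    if not dot or "\\" in ext:
        return path  # the file name itself has no extension
    ext = ext.lower()
    if ext in DISCARDED_EXTENSIONS:
        return None
    if ext in ROLLBACK:
        return stem + "." + ROLLBACK[ext]
    return path
```

    Applying it over the whole list is then a matter of keeping every result that is not None.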




    The output is looking nicer! However, at this stage it is almost impossible to squeeze out any more data. We move on to the next set of raw data.

    Filtering data from executable files


    Since there were few absolute paths in the raw data, we filter the strings by extension:
    • "c" - source files in C,
    • "cpp" - source files in C++,
    • "cxx" - source files in C or C++,
    • "h" - header files in C,
    • "hpp" - header files in C++,
    • "hxx" - header files in C or C++,
    • "asm" - MASM source files,
    • "inc" - MASM include files,
    • "def" - module-definition files for libraries.

    After filtering the data, it becomes clear that although the resulting paths have no root, their directory structure is evidently built relative to it. That is, every path only needs the root "d:\th" prepended.
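
    A rough sketch of this step, assuming each raw string is a bare path with no surrounding text. The function name is hypothetical, and lowercasing the paths is an assumption made so that duplicates collapse.

```python
# Extensions from the list above.
SOURCE_EXTENSIONS = {"c", "cpp", "cxx", "h", "hpp", "hxx", "asm", "inc", "def"}
ROOT = "d:\\th"


def rooted_source_path(line):
    """Return the string as a path under ROOT if it names a source file, else None."""
    candidate = line.strip().lower()
    if not candidate or " " in candidate:
        return None  # skip messages that merely mention a file name
    _, dot, ext = candidate.rpartition(".")
    if not dot or ext not in SOURCE_EXTENSIONS:
        return None
    if ":\\" in candidate:
        return candidate  # already an absolute path
    return ROOT + "\\" + candidate.lstrip("\\")
```

    Real raw data is messier than this, so a second pass by eye is still needed.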

    At this point, there are several problems with the data derived from the symbols. The first problem: we cannot be sure that we correctly mapped the object file paths back to source file paths.

    • Hook-filter No. 4: check whether the paths reconstructed from the object files match the source paths found in the binaries.


    And they do! That is, for most directories we can say that their structure was restored correctly. Some doubtful directories still remain, of course, but I think this margin of error is quite acceptable. Along the way, we can safely replace the "c??" extension with the extension of the matching source path.
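
    A minimal sketch of hook-filter No. 4, assuming both lists have already been normalized to the same case and root. The function name and the returned match count are illustrative choices, not the article's.

```python
# Source types that actually compile to an object file (headers do not).
COMPILED_EXTENSIONS = {"c", "cpp", "cxx", "asm"}


def confirm_extensions(guessed, known):
    """Replace 'c??' placeholders wherever a known source path matches.

    guessed: paths reconstructed from object files, ending in '.c??'
    known:   source paths found in the binaries
    Returns the updated list and the number of confirmed paths.
    """
    known_by_stem = {}
    for path in known:
        stem, dot, ext = path.rpartition(".")
        if dot and ext in COMPILED_EXTENSIONS:
            known_by_stem.setdefault(stem, path)

    resolved, confirmed = [], 0
    for path in guessed:
        stem = path[: -len(".c??")] if path.endswith(".c??") else None
        if stem in known_by_stem:
            resolved.append(known_by_stem[stem])
            confirmed += 1
        else:
            resolved.append(path)
    return resolved, confirmed
```

    The share of confirmed paths per directory gives a rough measure of how far the reconstructed structure can be trusted.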

    The second problem is header files. They are an important part of the source tree, but no object file is produced from a header, which means headers cannot be recovered from information about object files. We have to make do with little, namely the headers that we found in the raw data from the binaries.

    Third problem: we still don’t know most source file extensions.

    • Hook-filter No. 5: we assume that source files within the same folder are of the same type.


    That is, if a file with the "cpp" extension is already present in a folder, then most likely all its neighbors have the same extension.
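
    The folder heuristic of hook-filter No. 5 could be sketched like this. Taking the most common known extension in each folder is an assumption about how to break disagreements; the article does not say how mixed folders were handled.

```python
from collections import Counter, defaultdict

# Extensions that may stand in for the "c??" placeholder in this sketch.
KNOWN_EXTENSIONS = {"c", "cpp", "cxx"}


def infer_by_folder(paths):
    """Give '.c??' files the most common known extension of their folder."""
    by_folder = defaultdict(Counter)
    for path in paths:
        folder, _, name = path.rpartition("\\")
        ext = name.rpartition(".")[2]
        if ext in KNOWN_EXTENSIONS:
            by_folder[folder][ext] += 1

    result = []
    for path in paths:
        folder = path.rpartition("\\")[0]
        if path.endswith(".c??") and by_folder[folder]:
            best = by_folder[folder].most_common(1)[0][0]
            result.append(path[: -len("c??")] + best)
        else:
            result.append(path)  # no neighbor to learn from, keep as is
    return result
```

    Folders with no file of a known type keep their "c??" placeholders.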

    And what about the assembler sources? As a finishing touch, you can turn to the Windows Research Kernel, kernel source code from the Windows XP era, and rename some of the assembler sources by hand.

    Examining the data obtained


    Telemetry


    For a while I studied how telemetry works in Windows 10. Unfortunately, a quick analysis did not reveal anything noteworthy: I found no keyloggers, no leaks of sensitive data, nothing to latch onto. So the first keyword I searched for in the source file list was "telemetry". The result exceeded my expectations: 424 matches. The most interesting ones are listed below.

    Telemetry in source files
    d:\th\admin\enterprisemgmt\enterprisecsps\v2\certificatecore\certificatestoretelemetry.cpp
    d:\th\base\appcompat\appraiser\heads\telemetry\telemetryappraiser.cpp
    d:\th\base\appmodel\search\common\telemetry\telemetry.cpp
    d:\th\base\diagnosis\feedback\siuf\libs\telemetry\siufdatacustom.c??
    d:\th\base\diagnosis\pdui\de\wizard\wizardtelemetryprovider.c??
    d:\th\base\enterpriseclientsync\settingsync\azure\lib\azuresettingsyncprovidertelemetry.cpp
    d:\th\base\fs\exfat\telemetry.c
    d:\th\base\fs\fastfat\telemetry.c
    d:\th\base\fs\udfs\telemetry.c
    d:\th\base\power\energy\platformtelemetry.c??
    d:\th\base\power\energy\sleepstudytelemetry.c??
    d:\th\base\stor\vds\diskpart\diskparttelemetry.c??
    d:\th\base\stor\vds\diskraid\diskraidtelemetry.cpp
    d:\th\base\win32\winnls\els\advancedservices\spelling\platformspecific\current\spellingtelemetry.c??
    d:\th\drivers\input\hid\hidcore\hidclass\telemetry.h
    d:\th\drivers\mobilepc\location\product\core\crowdsource\locationoriontelemetry.cpp
    d:\th\drivers\mobilepc\sensors\common\helpers\sensorstelemetry.cpp
    d:\th\drivers\wdm\bluetooth\user\bthtelemetry\bthtelemetry.c??
    d:\th\drivers\wdm\bluetooth\user\bthtelemetry\fingerprintcollector.c??
    d:\th\drivers\wdm\bluetooth\user\bthtelemetry\localradiocollector.c??
    d:\th\drivers\wdm\usb\telemetry\registry.c??
    d:\th\drivers\wdm\usb\telemetry\telemetry.c??
    d:\th\ds\dns\server\server\dnsexe\dnstelemetry.c??
    d:\th\ds\ext\live\identity\lib\tracing\lite\microsoftaccounttelemetry.c??
    d:\th\ds\security\base\lsa\server\cfiles\telemetry.c
    d:\th\ds\security\protocols\msv_sspi\dll\ntlmtelemetry.c??
    d:\th\ds\security\protocols\ssl\telemetry\telemetry.c??
    d:\th\ds\security\protocols\sspcommon\ssptelemetry.c??
    d:\th\enduser\windowsupdate\client\installagent\common\commontelemetry.cpp
    d:\th\enduser\winstore\licensemanager\lib\telemetry.cpp
    d:\th\minio\ndis\sys\mp\ndistelemetry.c??
    d:\th\minio\security\base\lsa\security\driver\telemetry.cxx
    d:\th\minkernel\fs\cdfs\telemetry.c
    d:\th\minkernel\fs\ntfs\mp\telemetry.c??
    d:\th\minkernel\fs\refs\mp\telemetry.c??
    d:\th\net\netio\iphlpsvc\service\teredo_telemetry.c
    d:\th\net\peernetng\torino\telemetry\notelemetry\peerdistnotelemetry.c??
    d:\th\net\rras\ip\nathlp\dhcp\telemetryutils.c??
    d:\th\net\winrt\networking\src\sockets\socketstelemetry.h
    d:\th\shell\cortana\cortanaui\src\telemetrymanager.cpp
    d:\th\shell\explorer\traynotificationareatelemetry.h
    d:\th\shell\explorerframe\dll\ribbontelemetry.c??
    d:\th\shell\fileexplorer\product\fileexplorertelemetry.c??
    d:\th\shell\osshell\control\scrnsave\default\screensavertelemetryc.c??
    d:\th\windows\moderncore\inputv2\inputprocessors\devices\keyboard\lib\keyboardprocessortelemetry.c??
    d:\th\windows\published\main\touchtelemetry.h
    d:\th\xbox\onecore\connectedstorage\service\lib\connectedstoragetelemetryevents.cpp
    d:\th\xbox\shellui\common\xbox.shell.data\telemetryutil.c??

    These are perhaps not worth commenting on, since nothing is known for certain anyway. However, the data can serve as a good starting point for a more detailed study.

    Kernel patch protection


    The next find is everyone's favorite, PatchGuard. True, the OS source tree contains only one file, of an unclear and most likely binary type.
    d:\th\minkernel\ntos\ke\patchgd.wmp
    Looking for matches in the unfiltered data, I found that Kernel Patch Protection is actually a separate project.
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen00.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen01.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen02.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen03.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen04.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen05.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen06.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen07.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen08.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp\xcptgen09.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgd.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda2.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda3.c??
    d:\bnb_kpg\minkernel\oem\src\kernel\patchgd\mp_noltcg\patchgda4.c??

    Dubious files


    Unable to think of anything else of interest, I started searching for anything and everything, and was not disappointed!

    d:\th\windows\core\ntgdi\fondrv\otfd\atmdrvr\umlib\backdoor.c??
    A backdoor in the font driver?

    d:\th\inetcore\edgehtml\src\site\webaudio\opensource\wtf\wtfvector.h
    WTF here is just the Web Template Framework, an unfortunate abbreviation. Wait a moment...

    Open source?


    d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\libjpeg\jaricom.c??
    d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\libpng\png.c??
    d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\libtiff\tif_compress.c??
    d:\th\printscan\print\drivers\renderfilters\msxpsfilters\util\opensource\zlib\deflate.c??
    I think this find is a good place to wrap up.

    An archive containing a text file with the list of sources is available here. Share your findings in the comments!
