How to resurrect a Jaguar in a thousand hours?

    People sometimes ask me how I write emulators. I will try to answer using the example of one failed console.

    Emulation is an almost endless task: inaccuracies always remain. If you ask me how much time I spent on 3DO emulation, I can only shrug, but I know one thing for sure: 3DO emulation is in very good shape. So the time came to find a new victim, and it turned out to be the Atari Jaguar. I spent roughly 1000 hours developing the core that emulates this console in the Phoenix project, and it will probably take as many again to raise compatibility from the current 95% to 99%. The remaining 1% will take more, perhaps several thousand hours, but those are separate, boring stories about debugging subtle glitches.



    Why the Jaguar? First, existing emulators emulated it poorly. Second, it is well “documented”: there are specifications, circuit diagrams and even the HDL source code of its chips!

    Stage one: gathering intelligence on the enemy (300 hours)


    The image below shows a simplified plan of action for collecting the primary information necessary to create an emulator of your favorite game console.

    [Figure: simplified plan for collecting primary information]

    As you can see, this stage is not limited to the word “google”. It is a full-fledged stage of development, the most important one, and a research project in its own right. As for googling, I recommend starting with patents covering the console's subsystems and datasheets matching the chip markings, and simply reading resources devoted to the technical side of the console. Then, based on what you find, concentrate on collecting data according to the plan.

    Someone will rightly point out that there is another source of knowledge: other people's projects. If you want to join such a project, that is the right idea. If not, I do not recommend relying on it, at least until your own first release, so that you do not repeat other people's mistakes. For the Jaguar there is the Virtual Jaguar project. I remember wanting to improve it, but I dropped the idea as soon as I saw the code itself.

    Recovering the schematic is a task you should definitely do first. It yields useful information about the operating modes of the components and buses, and about the purpose of at least some pins of the chips for which you could not find documentation. If you can, run the circuit board through a CT scanner, but older consoles usually have two-layer boards, and a simple multimeter in continuity mode is enough to draw up the schematic.

    Searching for software for the emulated platform covers both the software you will eventually emulate (dumps: BIOS, ROM, ISO) and the development tools for the platform. The latter are especially important for writing tests that reveal the internal structure of the components; the results of those tests can later also be used to check the emulator against real hardware. Quite often the original SDKs are not compatible with modern operating systems, so you may have to adapt them to modern compilers or even use emulators of old PCs to write tests. If development tools are really scarce, it makes sense to write your own assembler or even a simple compiler. Believe me, preparing development tools pays off many times over and saves a lot of time.

    As for dumps, if nobody has made them before, the easiest way to get a ROM is to desolder the chip and read its contents in a programmer. A harder but cleaner way is to connect the console to a computer through a debug interface or an expansion port and write a special program to exchange data with the PC. On the 3DO, for example, I connected to the JTAG of the ARM processor and dumped the ROM. I also connected to the console's factory debug port and ran my own code on it, for which I had to build a small device with a microcontroller and an FPGA.

    Identifying the components is the most important task of this stage. If you managed to find sufficiently detailed documentation for all components, everything is fine and you can start writing the emulator. But what if some of the main components are real black boxes? There are several options. You can go for brute force (if skills, time and money allow): decap the chip, photograph the whole die under a microscope and reconstruct the circuit. I have never done this myself, but it is a sure-fire solution, provided you have the patience, or the talent to automate the processing of the resulting information.

    Depending on the topology, in-circuit and out-of-circuit testing of black boxes can be used separately or combined. In both cases a logic analyzer and an oscilloscope will come in handy. In-circuit testing is preferable and faster: you write special code that tests the device of interest inside the console. Suppose you are testing a video processor. Based on data from the SDK or on other considerations, you try various register writes, analyze the result on the screen, and record the interrupts and delays that occur.

    Sometimes it is hard to determine the behavior of a component from in-circuit tests, for example when it does not talk to the CPU directly. In that case out-of-circuit testing is reasonable: you make a special printed circuit board that connects to a PC, move the chip under test onto it, drive signals into the inputs and read the outputs. After that everything depends on your powers of deduction. The one catch is that you need to know exactly which pins of the chip are inputs and which are outputs. This can partly be determined from how the components are interconnected, and in especially hard cases by cutting traces (I do not recommend lifting pins, they break). Once you have solved the puzzle, it is worth putting an FPGA emulation of the chip in its place instead of the chip itself, to make sure your conclusions are correct.

    What about the Jaguar? Everything is great there: as I said, there are HDL sources for its chips, specifications and circuit diagrams. Of course I would also have liked a working console, but with all of the above it is safe to say you can do without one entirely. So I took the documentation straight to the print shop:

    1. Tom & Jerry Technical Reference Manual (these are not the heroes of the famous cartoon but the main chips of the console; you can immediately see the common 3DO style, by the way: the 3DO also has two main chips with interesting names, Madam & Clio, and if I remember correctly, some developers moved between Atari and the 3DO team);
    2. Motorola M68000 Family Programmer's Reference Manual: documentation for the console's CPU;
    3. The console's schematic: it is very important for understanding what interacts with what and how; without it a physical console would still have been necessary.

    Of course, this is not all of the documentation, only the main part. I had to look for datasheets for all sorts of small parts such as the EEPROM or the crystal. With no physical console or cartridges at hand, I had to find photographs of them first and then search by the markings on the components.

    The HDL code, however, turned out to be a very tough nut to crack. It is written in a little-known Toshiba HDL, which compares to modern Verilog or VHDL the way assembler compares to C++. It is scattered over a hundred files, and keeping all the components in your head is impossible (at least for me). I needed to turn this code into, ahem, a “book”. Well then, let's write a translator! The code is posted on GitHub (for whoever needs it), but since it was meant for one-time use, I decided to follow the example of the Virtual Jaguar authors and focus only on getting an acceptable result in the shortest possible time. In other words, this code of mine is very hard to understand.

    Toshiba’s HDL turned out to be a very peculiar language with a lax syntax. The description of nets, for example, is loose: the width of each net seems to become known only after all modules are linked, and through decomposition. So I had to write dozens of heuristics to determine the widths of the nets. Even then there were isolated cases where the width could not be established from context; there I assumed the net was single-bit (which proved justified). Here is a simple example:

    /*2 input nand gate */
    DEF ND2 (z:OUT; a,b:IN);
    

    This element can be used in very different ways:

    // single-bit nets can be written like this:
    label		:= ND2 (z, a, b);
    label		:= ND2 (z[2], a[0..1]); // a[0..1] is split across the inputs
    label		:= ND2 (z[2], a[0-1]); // equivalent to the previous line
    label		:= ND2 (z[1], a{5}, b[2]); // {} selects bits from a net that is wider than one bit
    // with replication it gets even more fun:
    label[0-4]	:= ND2 (z[0-4], a{9-13}, b); // here b is replicated
    label[0-4]	:= ND2 (z[0-4], a{9-13}, b[9-13]); // one to one
    label[0..4]	:= ND2 (z[0..4], a{9..13}, b[9..13]); // equivalent to the previous line
    label[0..4]	:= ND2 (z[0..4], a[0..4], b[0-7]); // here b is replicated into five 8-bit nets
    label[0-4]	:= ND2 (z[0-4], a[0-4], b[0..7]); // same as the previous line
    

    This whole menagerie is made worse by the fact that a net such as z, a or b does not have to be declared anywhere: it counts as declared the first time it is used. Bundles of nets may also have holes, for example z[0..5] and z[10..12], and it is perfectly normal that z[6..9] does not exist. That certainly does not matter for compilation, where everything is flattened and optimized down to individual signals for an FPGA, but our task is different: to preserve all available structural information, so decomposition is of no interest to us. The fact that this is a hardware description language rather than ordinary program code adds fuel to the fire: all of its blocks execute in parallel and their order in the source is completely irrelevant, so the translator has to analyze the connections in a separate pass.

    Once the lexer and parser were finished and all the nets lined up, it was time to turn the code into a kind of book, in other words to translate it into a more convenient representation. To do this I had to extract the comments and bind them to blocks of code, and inline all the modules into each other starting from the top. A full inline was not always possible because of concatenation and decomposition of nets at the block connection level, so in such cases I had to include a connection map, for example:
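    The two points above (implicit declaration on first use, and bundles with holes) can be illustrated with a minimal sketch. It is not the author's translator: the NetRef type and both function names are hypothetical, and real statements carry far more structure than a name and a bit range.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One reference to a bit range of a net, e.g. z[10..12]. Hypothetical type.
struct NetRef {
    std::string name;
    int lo;
    int hi;
};

// Pass 1: nets are declared implicitly on first use, so simply record every
// bit index that any statement mentions. Statement order does not matter.
using NetTable = std::map<std::string, std::set<int>>;

NetTable collectNets(const std::vector<NetRef>& refs) {
    NetTable table;
    for (const NetRef& r : refs)
        for (int bit = r.lo; bit <= r.hi; ++bit)
            table[r.name].insert(bit);
    return table;
}

// Pass 2 helper: a bundle may have holes, so "is this bit present?" is a set
// lookup rather than a range check against a declared width.
bool hasBit(const NetTable& table, const std::string& name, int bit) {
    auto it = table.find(name);
    return it != table.end() && it->second.count(bit) != 0;
}
```

    Only after the first pass has seen every statement can the second pass resolve connections, which is why the order of blocks in the source is irrelevant here.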

    R1count := R1COUNT(
    	count[0..5] = reghalf/*OUT*/ @ sysr1[0..4]/*OUT*/, /// counter 
    	clk = clk[1]/*IN*/, /// system clock 
    	cnten = sromold/*IN*/, /// counter enable 
    	cntld = mmult/*IN*/, /// counter load 
    	mr1[0..4] = preinstr'16'{5..9}/*IN*/, /// value to load 
    )
    

    When all these mechanisms were ready, the readability of the resulting “book” had to be improved, for example by replacing ND2 (z, a, b) with z <= ~(a & b). About a hundred primitives were handled this way, non-functional blocks were removed, and finally I could start on the emulation itself!

    What do we have? A system of five processors: the OB (object processor) prepares data and feeds it to the video DAC, the DSP is responsible for sound, the GPU builds graphic scenes, the Blitter fills polygons, and the m68k manages this whole household. All of these processors share a common memory space and can therefore write to each other's I/O registers with virtually no restrictions. In other words, this mess will be hard to optimize.

    Stage two: building the virtual platform (500 hours)
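    A rough sketch of how such a primitive rewrite might look, assuming a simple template table. Only the ND2 form comes from the text; the NR2 and IV entries, the $n placeholder syntax and the function name are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Rewrite a primitive instance such as ND2(z, a, b) into an expression.
// Templates use $0 for the output and $1, $2 for the inputs.
std::string rewritePrimitive(const std::string& prim,
                             const std::vector<std::string>& args) {
    static const std::map<std::string, std::string> templates = {
        {"ND2", "$0 <= ~($1 & $2)"},
        {"NR2", "$0 <= ~($1 | $2)"},  // assumed name for a 2-input NOR
        {"IV",  "$0 <= ~$1"},         // assumed name for an inverter
    };
    auto it = templates.find(prim);
    if (it == templates.end()) return "";  // unknown primitive: caller keeps it
    const std::string& t = it->second;
    std::string out;
    for (std::size_t i = 0; i < t.size(); ++i) {
        bool placeholder = t[i] == '$' && i + 1 < t.size() &&
                           t[i + 1] >= '0' && t[i + 1] <= '9';
        if (placeholder) {
            std::size_t idx = static_cast<std::size_t>(t[i + 1] - '0');
            if (idx >= args.size()) return "";  // arity mismatch
            out += args[idx];
            ++i;  // skip the digit
        } else {
            out += t[i];
        }
    }
    return out;
}
```

    A table keeps each of the hundred or so primitives to a single line, at the price of not handling replicated or multi-bit instances, which the real translator has to deal with.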


    The Phoenix project was designed from the start as a multi-platform emulator, and this is exactly the case where planning far ahead paid off. Each platform inherits from a base platform class and is built by creating its elements, which inherit from a base device class, and composing them into a tree. In other words, the OOP features of C++ are used. Most emulators are written in a C style, which in my opinion greatly complicates their development. The base classes give me uniform access to the emulator's front end, including the debugger.

    In the platform class I usually implement the console's memory map and connect all of its components. The device classes emulate the functionality of the devices and describe the stored variables and their format for the debugger. If the device is a processor, they also implement disassembly and debugging functions. It is better to do this right away, because even when the emulated instructions are implemented incorrectly, a debugger greatly simplifies finding the errors.

    Dump loaders also deserve a mention. Their job is to check that a dump is valid, convert it to the format the emulator uses, and compute checksums that do not depend on the on-disk storage format. Checksums are needed for two things: uniquely associating saved games with a dump, and applying workarounds for problematic games. For instance, some homebrew titles for the Jaguar were never tested on real hardware and run on certain emulators only because they were tested exclusively on those emulators. So be careful when writing your own emulator: you will have to make exceptions for such games, for example by turning off write or read alignment.

    So, sketching out the platform is a matter of one evening, and then the main work begins ...
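    To make the idea of a device tree concrete, here is a minimal sketch. It is not Phoenix's actual class layout, which the article does not show: the class names, the run() method and the cycle-counting device are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Base class for every emulated component. Hypothetical, not Phoenix's API.
class Device {
public:
    explicit Device(std::string name) : name_(std::move(name)) {}
    virtual ~Device() = default;

    // Advance the device by the given number of clock cycles.
    virtual void run(std::uint64_t cycles) = 0;

    // Children form the tree that a front end or debugger can walk.
    Device& add(std::unique_ptr<Device> child) {
        children_.push_back(std::move(child));
        return *children_.back();
    }
    const std::string& name() const { return name_; }
    const std::vector<std::unique_ptr<Device>>& children() const {
        return children_;
    }

private:
    std::string name_;
    std::vector<std::unique_ptr<Device>> children_;
};

// A trivial device that only counts the cycles it was asked to run.
class CountingDevice : public Device {
public:
    using Device::Device;
    void run(std::uint64_t cycles) override { elapsed += cycles; }
    std::uint64_t elapsed = 0;
};

// The platform is itself a device whose job is to wire up and drive the rest.
class Platform : public Device {
public:
    using Device::Device;
    void run(std::uint64_t cycles) override {
        for (const auto& child : children()) child->run(cycles);
    }
};
```

    Because the front end only sees Device, a debugger view written once can walk the tree of any platform built this way.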
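    A small sketch of a checksum that does not depend on the storage format, plus a per-game workaround table. The article does not say which checksum or which dump formats Phoenix handles, so the FNV-1a hash, the 16-bit byte swap and the Quirks structure are assumptions chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Undo 16-bit byte swapping, one way a cartridge dump may be stored on disk.
// Detecting whether a file needs this is assumed to happen elsewhere.
std::vector<std::uint8_t> unswap16(std::vector<std::uint8_t> data) {
    for (std::size_t i = 0; i + 1 < data.size(); i += 2)
        std::swap(data[i], data[i + 1]);
    return data;
}

// 32-bit FNV-1a hash over the normalised image.
std::uint32_t fnv1a(const std::vector<std::uint8_t>& data) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (std::uint8_t b : data) {
        h ^= b;
        h *= 16777619u;             // FNV prime
    }
    return h;
}

// Per-game workarounds keyed by checksum. The single flag is hypothetical.
struct Quirks {
    bool relaxedAlignment = false;
};

Quirks quirksFor(const std::map<std::uint32_t, Quirks>& db,
                 std::uint32_t checksum) {
    auto it = db.find(checksum);
    return it != db.end() ? it->second : Quirks{};
}
```

    Hashing the normalised image rather than the file means the same game gets the same saves and the same workarounds however it was dumped.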

    CPU M68000


    Our CPU is the M68000, very popular in its day. It has very good documentation and quite a few ready-made emulators, some even under licenses that do not require opening the project's source. But for me each new hand-written processor is like a new unique piece for a collector, so we will not take someone else's, we will write our own, with blackjack and ... Besides, to succeed you need to “understand” the processor, and there is no better way than emulating it yourself.

    How are processor emulators written? In the simplest variant, which is the one to stick to, they are written as ordinary interpreters. You read the instruction at the address in the PC (the program counter register), determine its type and perform on the registers the actions it would perform in the real processor. Interrupt handling can be implemented by calling delegates passed in from the external device classes. Do not rush into clever optimizations, because they make the code less clear. It is better to optimize at the very end of development, even if users throw tomatoes at you over poor performance.

    Once the instruction interpreter and the disassembler exist, it makes sense to test the resulting processor, and here you can take advantage of existing experience! There are many ready-made tests, and you should definitely run them to find errors in the instruction implementations. If you have the hardware, you should definitely write your own tests as well. I did so too, although on a CD-i, which has a similar processor with most instructions in common. Tests also matter later on: say you have made an optimization and need to make sure nothing broke.
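    The fetch, decode and execute loop described above can be sketched on a toy machine. This is deliberately not the M68000: the three opcodes, their encoding and the interrupt handling (a plain jump to a fixed vector) are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Invented opcodes for a toy 8-bit machine.
enum : std::uint8_t { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

struct ToyCpu {
    std::uint16_t pc = 0;
    std::uint8_t reg[2] = {0, 0};
    bool halted = false;

    // Delegate supplied by external devices; returns true when an interrupt
    // is pending. Interrupt entry is reduced to a jump to a fixed vector.
    std::function<bool()> irqPending;
    std::uint16_t irqVector = 0;

    void step(const std::vector<std::uint8_t>& mem) {
        if (halted) return;
        if (irqPending && irqPending()) pc = irqVector;
        std::uint8_t op = mem.at(pc++);   // fetch
        switch (op) {                     // decode
        case OP_LOADI: {                  // reg[r] = immediate
            std::uint8_t r = mem.at(pc++) & 1;
            reg[r] = mem.at(pc++);
            break;
        }
        case OP_ADD:                      // reg[0] += reg[1]
            reg[0] = static_cast<std::uint8_t>(reg[0] + reg[1]);
            break;
        case OP_HALT:
        default:                          // unknown opcodes halt the toy CPU
            halted = true;
            break;
        }
    }
};
```

    A plain switch like this is slow compared with a recompiler, but every instruction is one readable case, which is what matters while the emulator is still being debugged.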

    Blitter


    This tricky thing draws polygons on the Jaguar, or more precisely the scanlines of polygons. It can also fill quadrangles and rotate them in the screen plane, do shading and work with a Z-buffer. This is the part I started with. Unfortunately it was impossible to make sensible tests for it without hardware, but having the HDL code really helped. In effect I had already made a translator of this code into something resembling C, so while analyzing the various blocks I could, when needed, write tests comparing my code against the original HDL. For example, here is code from my HDL translation:

    lowen <= width[4] | width[5];
    ya[0] <= lowen ? 0 : {ytm[2],ytm[1],ytm[0],0}[width[2..3]];
    ya[1] <= lowen ? 0 : {ytm[3],ytm[2],ytm[1],ytm[0]}[width[2..3]];
    ya[2] <= width[5] ? 0 : {ytm[4],ytm[3],ytm[2],ytm[1],ytm[0],0,0,0}[width[2..4]];
    ya[3] <= width[5] ? 0 : {ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],ytm[0],0,0}[width[2..4]];
    ya[4] <= width[5] ? 0 : {ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],ytm[0],0}[width[2..4]];
    ya[5] <= width[5] ? 0 : {ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],ytm[0]}[width[2..4]];
    ya[6] <= {ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],ytm[0],0,0,0,0,0,0,0}[width[2..5]];
    ya[7] <= {ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],ytm[0],0,0,0,0,0,0}[width[2..5]];
    ya[8] <= {ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],ytm[0],0,0,0,0,0}[width[2..5]];
    ya[9] <= {ytm[11],ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],ytm[0],0,0,0,0}[width[2..5]];
    ya[10] <= {ytm[12],ytm[11],ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],ytm[1],0,0,0,0}[width[2..5]];
    ya[11] <= {ytm[13],ytm[12],ytm[11],ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],ytm[2],0,0,0,0}[width[2..5]];
    ya[12] <= {ytm[14],ytm[13],ytm[12],ytm[11],ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],ytm[3],0,0,0,0}[width[2..5]];
    ya[13] <= {0,ytm[14],ytm[13],ytm[12],ytm[11],ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],ytm[4],0,0,0,0}[width[2..5]];
    ya[14] <= {0,0,ytm[14],ytm[13],ytm[12],ytm[11],ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],ytm[5],0,0,0,0}[width[2..5]];
    ya[15] <= {0,0,0,ytm[14],ytm[13],ytm[12],ytm[11],ytm[10],ytm[9],ytm[8],ytm[7],ytm[6],0,0,0,0}[width[2..5]];
    /// bits 16-19 use MX8G enabled for shifts 4-11, the low four and
    ///high four must be swapped 
    mid8en\ <= ~(width[4] ^ width[5]);
    ya[16] <= mid8en\ ? 0 : {ytm[10],ytm[9],ytm[8],ytm[7],ytm[14],ytm[13],ytm[12],ytm[11]}[width[2..4]];
    ya[17] <= mid8en\ ? 0 : {ytm[11],ytm[10],ytm[9],ytm[8],0,ytm[14],ytm[13],ytm[12]}[width[2..4]];
    ya[18] <= mid8en\ ? 0 : {ytm[12],ytm[11],ytm[10],ytm[9],0,0,ytm[14],ytm[13]}[width[2..4]];
    ya[19] <= mid8en\ ? 0 : {ytm[13],ytm[12],ytm[11],ytm[10],0,0,0,ytm[14]}[width[2..4]];
    tm4en\ <= ~(width\[4] & width[5]);
    ya[20] <= tm4en\ ? 0 : {ytm[14],ytm[13],ytm[12],ytm[11]}[width[2..3]];
    ya[21] <= tm4en\ ? 0 : {0,ytm[14],ytm[13],ytm[12]}[width[2..3]];
    ya[22] <= tm4en\ ? 0 : {0,0,ytm[14],ytm[13]}[width[2..3]];
    ya[23] <= ytm[14] & width[2] & width[3] & width\[4] & width[5];
    

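    And this is the working equivalent in C++:

    // width >> 2 selects how far ytm is shifted; a value of 2 means no shift
    int ya = 0;
    if (width < 48)
    {
        int shift = width >> 2;
        if (shift < 2) ya = ytm >> (2 - shift);
        else if (shift > 2) ya = ytm << (shift - 2);
        else ya = ytm;
        ya &= 0xffffff; // ya is 24 bits wide
    }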
    

    Comparative tests made it possible to build all the components of Tom & Jerry with some hope that they would work on the first run. That hope did not come true, of course, but the number of possible errors was nevertheless drastically reduced.

    People often ask me whether I can do for the Jaguar what I did for the 3DO: a hardware renderer with arbitrary resolution. The simple answer is no. If you decompile the GPU code, it is probably possible for each game individually. The reason is that the Blitter draws lines, while the polygons are assembled from those lines by a GPU program that is unique to each game. And alas, the GPU is a full-fledged RISC processor with full access to the console's memory space, which cannot be implemented inside a shader.
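    As an illustration of such a comparative test, the sketch below checks the two lowest bits of ya from the translated HDL against the shift-based C++ form, exhaustively over all inputs. It rests on two assumptions about the notation: that {a,b,c,d}[sel] picks the first listed element when sel is 0, and that width[2..3] is a 2-bit number with width[2] as its low bit. The function names are hypothetical.

```cpp
#include <cassert>

// Reference: the shift-based form from the article, with explicit parameters.
int yaShift(int ytm, int width) {
    int ya = 0;
    if (width < 48) {
        int s = width >> 2;
        if (s < 2) ya = ytm >> (2 - s);
        else if (s > 2) ya = ytm << (s - 2);
        else ya = ytm;
        ya &= 0xffffff;
    }
    return ya;
}

// Bit-level model of ya[0] and ya[1], transcribed from the translated HDL.
int yaLowBitsHdl(int ytm, int width) {
    auto bit = [](int v, int n) { return (v >> n) & 1; };
    int lowen = bit(width, 4) | bit(width, 5);
    int sel = (width >> 2) & 3;  // width[2..3]
    const int mux0[4] = {bit(ytm, 2), bit(ytm, 1), bit(ytm, 0), 0};
    const int mux1[4] = {bit(ytm, 3), bit(ytm, 2), bit(ytm, 1), bit(ytm, 0)};
    int ya0 = lowen ? 0 : mux0[sel];
    int ya1 = lowen ? 0 : mux1[sel];
    return ya0 | (ya1 << 1);
}

// Exhaustive comparison over a 15-bit ytm and a 6-bit width.
// Returns the number of mismatching input pairs.
int countMismatches() {
    int bad = 0;
    for (int width = 0; width < 64; ++width)
        for (int ytm = 0; ytm < (1 << 15); ++ytm)
            if ((yaShift(ytm, width) & 3) != yaLowBitsHdl(ytm, width)) ++bad;
    return bad;
}
```

    With only about two million input combinations, brute force is cheaper than reasoning about corner cases, and the same harness extends bit by bit to the remaining outputs.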

    Two merry friends: GPU & DSP


    They really are merry: they have so many hardware bugs! They are also very similar and differ only in a few instructions, so where one processor is done, the other follows. Things went more cheerfully here, because I had already cut my teeth on the Blitter and some of the HDL blocks were the same, so there was less work.

    The main source of the Jaguar's nastiest hardware bugs is the scoreboard, which provides a kind of superscalar execution. If you look at it, you may understand why I did not try to figure it out and postponed the problems related to it to the debugging stage. This code seems to have been hard even for the console's authors. It is rather difficult to see what happens in there, although its purpose is perfectly clear, and you can reproduce its bugs only by replicating it one to one, which would of course lead to a fatal drop in emulation speed. For example, a LOAD instruction can overwrite the contents of its destination register after the following instruction, which writes to the same register, has executed. Although the main wonders are described in the documentation, I managed to find some that are not. In general, a no-rules race between instructions is normal for the Jaguar.

    When dealing with these processors it is very important to reproduce the calculation of all ALU flags accurately, more accurately than the documentation describes. I managed this, which had a positive effect on compatibility, but at the cost of a significant drop in performance. The point is that the ALU and the shifter in these processors always produce flags. So for instructions whose flag state is undefined (according to the manuals), the subsystems still have to compute a result that is used for nothing except forming the flags. Two processors at 26+ MHz, one instruction per clock cycle, with this kind of overhead: that is quite a hit to performance, but compatibility comes first!
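    The LOAD hazard mentioned above can be imitated without modelling the scoreboard itself. The sketch below is a toy: a register file in which a load lands one instruction late. The one-instruction delay and every name in it are assumptions; the real timing depends on details the article does not give.

```cpp
#include <cassert>
#include <cstdint>

// Toy register file with a delayed load write-back. Hypothetical model.
struct RegFile {
    std::uint32_t r[32] = {};
    bool loadPending = false;
    int loadReg = 0;
    std::uint32_t loadValue = 0;

    // LOAD: the value arrives late, so only remember it for now.
    void issueLoad(int reg, std::uint32_t value) {
        retirePending();
        loadPending = true;
        loadReg = reg;
        loadValue = value;
    }

    // Any ordinary register write: the write happens first, then the
    // delayed load lands and may clobber it.
    void write(int reg, std::uint32_t value) {
        r[reg] = value;
        retirePending();
    }

    void retirePending() {
        if (loadPending) {
            r[loadReg] = loadValue;
            loadPending = false;
        }
    }
};
```

    In this model a write that immediately follows a load to the same register is lost, which matches the symptom described, at the cost of one extra check per register write.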

    Object Processor (OB)


    This creature is very unusual. It not only draws scanlines for the DAC, it can also scale sprites and write them into those lines. It can also start the GPU, which in turn often starts the Blitter, which can rewrite the GPU program, which in turn must send an interrupt to the object processor, which must carry on drawing lines and start the GPU ... In short, keeping the system in a coherent state while maintaining decent emulator performance in this situation is quite a task.

    This block was probably the most complex state machine of all. I recommend sketching such things, among others: it helps a lot. At the very least, hundreds of lines of dense code become much clearer.


    In general, I used up a whole stack of paper on sketches of the recovered algorithms and a couple of notebooks on notes, plans and lists of hypotheses. As they say, a dull pencil is better than a sharp memory (a computer's included).

    Stage three: bug hunting (200 hours)


    So, all the blocks are ready and assembled, but according to Murphy’s law they refuse to work. However well you prepare yourself mentally, it is still unpleasant to stare at a screen that is black or mottled with artifacts. But if you were not too lazy to write a convenient debugger, you will soon see that the bulk of the errors show up in the very first instructions. If you did not write one, you can of course debug through the IDE's debugger and printf, but believe me, writing a debugger is the much easier route.


    What do I use besides disassemblers and plain console output? For example, a source map that shows which subsystem wrote to a given memory area. With the Jaguar's five processors in particular, it helped a lot in tracking down who the bully was (quite often it was the Blitter).

    Besides simple breakpoints on execution, read or write, it is useful to implement pause-on-demand from code. For example, I want to check whether the GPU or DSP runs outside its internal memory, and if it does, I stop emulation and hand control to the debugger. This is how, for example, I identified a serious hardware bug in the Jaguar: when a write to the flags register switches register banks and the next instruction reads a register, on real hardware the value comes from the old bank. This bug had to be implemented, otherwise Wolfenstein 3D and a number of other games did not work. Yes, in emulation you have to implement the bugs, not just the features!
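    A source map of this kind can be as simple as a shadow array next to the memory. The sketch below is an assumption about how one might build it, not a description of Phoenix: the class name, the byte granularity and the Writer enumeration are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bus masters of the console as listed in the article.
enum class Writer : std::uint8_t { None, M68K, GPU, DSP, Blitter, OB };

// Memory with a shadow array recording who wrote each byte last.
class TrackedMemory {
public:
    explicit TrackedMemory(std::size_t size)
        : data_(size, 0), lastWriter_(size, Writer::None) {}

    void write8(std::size_t addr, std::uint8_t value, Writer who) {
        data_.at(addr) = value;
        lastWriter_.at(addr) = who;
    }
    std::uint8_t read8(std::size_t addr) const { return data_.at(addr); }
    Writer lastWriter(std::size_t addr) const { return lastWriter_.at(addr); }

private:
    std::vector<std::uint8_t> data_;
    std::vector<Writer> lastWriter_;
};
```

    One extra byte per byte of emulated memory is cheap for a console of this size, and the map can be switched off in release builds.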
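    A pause-on-demand hook can be sketched as a small guard object called from the processor core. The PcGuard type, the callback signature and the window bounds are assumptions; the article does not show how Phoenix does this.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hook the emulator core calls to hand control to the debugger.
using BreakHandler = std::function<void(const std::string& reason)>;

// Checks a program counter against a local-RAM window and requests a pause
// when it leaves. The window bounds are supplied by the caller.
struct PcGuard {
    std::uint32_t base;
    std::uint32_t size;
    BreakHandler onBreak;

    // Returns true if execution may continue.
    bool check(std::uint32_t pc) const {
        // Unsigned subtraction wraps, so pc < base also fails this test.
        if (pc - base < size) return true;
        if (onBreak) onBreak("PC left local RAM");
        return false;
    }
};
```

    Calling check() before each instruction fetch turns a silent runaway into an immediate stop at the first offending instruction, with the reason available to the debugger.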


    Fixed a pile of errors and everything seems to work? No, this is not the end, it is only the start of the hunt! Next you need to test every available game in every available way. The main thing here is not only to find bugs but to make them reproducible, so that the cause is easy to find. The members of my forum help me with this, for which I am very grateful, because testing even a hundred games alone is extremely hard, although playing old games is a pleasure!

    In conclusion


    I’ll talk a little about the basic techniques for improving the performance of emulators, which I use myself.

    1. Preprocessing. If there is data that gets reused (mostly textures), it makes sense to cache its most fully processed form for reuse.
    2. Interlock detection. If you can determine that a processor is waiting for a certain event, you do not need to emulate its instructions until the event occurs.
    3. Runtime quantization. From the cache's point of view it is better to run each of several pieces of code many times in a row than to run them strictly in turn many times. In other words, where possible, do not emulate all devices tick by tick; run each device for the largest possible batch of clock cycles.
    4. Recompilation. Recompilation can be dynamic or static. Static is the fastest but is possible only when the code of the emulated device has little variety. For example, the DSP code in the 3DO SDK consists of a few hundred instruments, so static recompilation works fine there. CPU code is usually too varied, and dynamic recompilation is needed.
    5. Parallel emulation of subsystems. This is not always possible without hurting compatibility, since older multiprocessor systems often did not use the most reliable synchronization techniques, and asynchrony between emulation threads can lead to freezes and other emulation errors whose fixes can completely cancel out the gain from multithreading.

    What, in the end, do you need to write an emulator successfully? Nothing special, I think: accuracy and perseverance, and with them any skill can be developed. There are no special or magical methods.
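    Item 3, runtime quantization, can be illustrated with a minimal scheduler sketch. The Runnable interface, the runFrame function and the fixed quantum are assumptions; a real emulator would also shorten a slice whenever a device needs to synchronise early.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal device interface for the sketch (hypothetical).
struct Runnable {
    virtual ~Runnable() = default;
    virtual void run(std::uint64_t cycles) = 0;
};

// Test double that records how it was driven.
struct CycleCounter : Runnable {
    std::uint64_t total = 0;
    std::uint64_t calls = 0;
    void run(std::uint64_t cycles) override {
        total += cycles;
        ++calls;
    }
};

// Runs every device in slices of at most `quantum` cycles instead of one
// tick at a time. Returns the number of slices executed.
std::uint64_t runFrame(const std::vector<Runnable*>& devices,
                       std::uint64_t totalCycles, std::uint64_t quantum) {
    assert(quantum > 0);
    std::uint64_t slices = 0;
    for (std::uint64_t done = 0; done < totalCycles; ++slices) {
        std::uint64_t n = std::min(quantum, totalCycles - done);
        for (Runnable* d : devices) d->run(n);
        done += n;
    }
    return slices;
}
```

    A larger quantum is friendlier to the host cache but lets the devices drift further apart between synchronisation points, so the value is a trade-off between speed and accuracy.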


    What console emulation would you like to see as part of the Phoenix project?

    • Philips CD-i: 1.6% (4 votes)
    • 3DO M2: 4% (10 votes)
    • Sega Saturn: 8.9% (22 votes)
    • Nintendo 64: 20% (49 votes)
    • Amiga CD32: 2% (5 votes)
    • Neo-Geo: 5.3% (13 votes)
    • PlayStation: 51.8% (127 votes)
    • The console I need is poorly emulated and is not listed here: 6.1% (15 votes)
