Saturday, August 20, 2016

Virtual Machines and CPU emulation - performance.

My work on the MOS 6502 emulator continues.

After adding code to measure emulation speed, I realized my emulator was slow. Really slow. It barely exceeded the speed of a 1 MHz MOS 6502 with only the CPU emulated, and dropped to ~30 % when the character I/O and raster graphics devices were enabled. This is not acceptable on today's fast quad-core GHz processors with tons of RAM.

Something was very wrong.

After a few frustrating hours and some minor optimizations that didn't yield satisfying results, I finally found the major bottleneck: a piece of code implementing a debugging facility I had almost forgotten about. It is the op-code execute history, which can be displayed in the Debug Console and is clearly not needed during normal emulator use. Not only was it implemented without any consideration for speed, but it also unnecessarily disassembled instructions on every op-code execution to keep the execute history in symbolic form.
During each 6502 op-code execute cycle, the previously executed instruction and its argument were disassembled to symbolic form, combined with the CPU register status, and added to the history queue as a string.
Totally unnecessary and expensive time-wise.
First of all, disassembling to symbolic form is only needed when displaying the history to the user, so it is sufficient to keep the history data in a raw format. Huge performance boost. Second, the history is only needed during debugging, so I disabled the feature by default and added an option in the Debug Console to enable it when necessary.
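
To illustrate the idea of keeping the history raw and formatting it only on display, here is a minimal sketch. It is not the emulator's actual code: the names HistoryEntry, ExecHistory and formatEntry are hypothetical, and the formatter only prints hex values where a real disassembler would produce mnemonics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <string>

// Hypothetical raw history record (not the emulator's actual type):
// only the fetched bytes and register values are stored, no text.
struct HistoryEntry {
    std::uint16_t pc;          // address of the executed instruction
    std::uint8_t  opcode;      // op-code byte
    std::uint8_t  operand[2];  // up to two argument bytes
    std::uint8_t  a, x, y, sp, flags;  // CPU registers after execution
};

class ExecHistory {
public:
    explicit ExecHistory(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // History is off by default, so a normal run pays only for one branch.
    void setEnabled(bool on) { enabled_ = on; }

    // Called once per executed instruction: a cheap copy, no formatting.
    void record(const HistoryEntry& e) {
        if (!enabled_) return;
        if (entries_.size() == capacity_) entries_.pop_front();  // drop oldest
        entries_.push_back(e);
    }

    std::size_t size() const { return entries_.size(); }
    const HistoryEntry& at(std::size_t i) const { return entries_.at(i); }

private:
    std::size_t capacity_;
    bool enabled_ = false;
    std::deque<HistoryEntry> entries_;
};

// Text is produced only when the history is shown to the user. A real
// disassembler would go here; this stand-in just prints hex values.
std::string formatEntry(const HistoryEntry& e) {
    char buf[64];
    std::snprintf(buf, sizeof buf,
                  "$%04X: %02X %02X %02X A=%02X X=%02X Y=%02X SP=%02X P=%02X",
                  static_cast<unsigned>(e.pc), static_cast<unsigned>(e.opcode),
                  static_cast<unsigned>(e.operand[0]),
                  static_cast<unsigned>(e.operand[1]),
                  static_cast<unsigned>(e.a), static_cast<unsigned>(e.x),
                  static_cast<unsigned>(e.y), static_cast<unsigned>(e.sp),
                  static_cast<unsigned>(e.flags));
    return std::string(buf);
}
```

The per-instruction cost is then a small struct copy, and the string work happens only for the entries the user actually asks to see.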

With the code fixed, I am now getting decent performance. Time for some stats.

The program measures emulation performance by counting the number of executed clock ticks (cycles), dividing it by the number of microseconds over which those cycles were executed, and multiplying the result by 100. The result, displayed in the Debug Console, is a percentage of the speed of a 1 MHz CPU: 1,000,000 cycles (clock ticks) per second, or one cycle per microsecond, counts as 100 %.
Performance is measured over the whole execution run and calculated at the end of the run. The captured speed is summed with the previous result and divided by 2 to produce the average emulation speed for a single session.
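
The calculation described above can be sketched roughly as follows. The function and struct names are hypothetical, and the handling of the very first run (no previous result to average with) is my assumption for the sake of the example.

```cpp
#include <cassert>
#include <cstdint>

// Speed as a percentage of a 1 MHz CPU. At 1 MHz one cycle takes one
// microsecond, so cycles / microseconds == 1.0 corresponds to 100 %.
double speedPercent(std::uint64_t cycles, std::uint64_t elapsedMicros) {
    assert(elapsedMicros > 0);  // guard against division by zero
    return static_cast<double>(cycles) /
           static_cast<double>(elapsedMicros) * 100.0;
}

// Hypothetical holder for the per-session statistics.
struct SpeedStats {
    bool   hasPrevious = false;
    double average = 0.0;

    // Combine the new result with the previous one: (previous + current) / 2.
    // The first run has no previous result, so it is taken as-is (assumption).
    void addRun(double percent) {
        average = hasPrevious ? (average + percent) / 2.0 : percent;
        hasPrevious = true;
    }
};
```

Note that repeatedly halving like this weights recent runs more heavily than older ones, so it is a running average rather than a plain arithmetic mean over all runs.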

Emulating pure 6502 machine code with all peripheral emulation disabled (memory mapped devices, character I/O etc.) and with the time-critical debugging facility, the op-code execute history, also disabled yields performance of about 646 % on PC1 and 468 % on PC2.
(* see the annotations at the bottom for the PC configurations used in the tests)
Enabling the op-code execute history causes performance to drop by roughly a third. With all peripherals disabled and the op-code history enabled, we are down to 411 % on PC1 and 312 % on PC2.

Enabling emulated memory mapped devices and adding them to the pool may reduce the emulation speed as well. However, with the currently implemented peripherals (character I/O, raster graphics device) enabled and actively used, and the op-code execute history enabled, performance is still well above 300 % on both PC1 and PC2.
In the same case but with the op-code execute history disabled, performance exceeds 400 % on both PC configurations.

Currently the main emulation loop is not synchronized to an accurate clock tick or raster synchronization signal; it just runs as fast as it can (no real-time simulation). Emulation speed may therefore vary with the PC and with the current load on the system.

If this code is to be used as a template to implement an emulator of a real-world machine, like the C-64 or Apple I, it may need some work to improve performance even further, but I think it leaves a pretty good margin for expansion as it is.
On a fast PC, emulation speed above 600 % with basically nothing but the CPU emulated and the op-code execute history disabled is IMO decent, as long as we don't want to emulate a MOS 6502 machine with a clock much faster than 1 MHz.

Debug Console before running 6502 functional test. Note all I/O and execute history are disabled.

6502 functional test program reached its final loop and has been interrupted.
Back in Debug Console at the top the emulation speed stats are shown.

Annotations to 'Performance considerations':

PC1 stats:
Type:           Desktop
CPU:            2.49 GHz (64-bit Quad-core Q8300)
RAM:            4,060 MB
OS:             Windows 10 Pro (no SP) [6.2.9200]

PC2 stats:
Type:           Laptop
CPU:            2.3 GHz (64-bit Quad-core i5-6300HQ)
RAM:            15.9 GB
OS:             Windows 10 Home

Thank you for visiting my blog.


Tuesday, August 9, 2016

Virtual Machines and CPU emulation - update.

The prior architecture of my emulator has not been easy to expand.
I decided to improve it a bit by adding a new abstraction layer.
To demonstrate it, I also added a basic raster graphics device to the system.

In microprocessor-based systems, communication with peripheral devices is
in the majority of cases done via registers, which in turn are located at
specific memory addresses.
The programming API responsible for modeling this functionality is implemented
in the Memory and MemMapDev classes. The Memory class implements access to
specific memory locations and maintains the memory image.
The MemMapDev class implements specific device address spaces and the handling
methods that are triggered when addresses of the device are accessed by the
emulated CPU.
Programmers can expand the functionality of this emulator by adding the necessary
code emulating specific devices to the MemMapDev and Memory class implementation and header files.
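
As a rough illustration of memory-mapped device dispatch, here is a minimal sketch. It does not reproduce the actual Memory or MemMapDev interfaces; MappedDevice and SimpleMemory are hypothetical names, and the real classes are described in the project's Programmers Reference.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical device descriptor: an inclusive address range plus the
// handlers triggered when an address inside that range is accessed.
struct MappedDevice {
    std::uint16_t begin;
    std::uint16_t end;
    std::function<std::uint8_t(std::uint16_t)> onRead;
    std::function<void(std::uint16_t, std::uint8_t)> onWrite;
};

class SimpleMemory {
public:
    void addDevice(MappedDevice dev) {
        assert(dev.begin <= dev.end);
        devices_.push_back(std::move(dev));
    }

    // Reads and writes go to a device handler when the address belongs
    // to a registered device, and to the plain RAM image otherwise.
    std::uint8_t read(std::uint16_t addr) const {
        if (const MappedDevice* d = find(addr)) return d->onRead(addr);
        return ram_[addr];
    }

    void write(std::uint16_t addr, std::uint8_t value) {
        if (const MappedDevice* d = find(addr)) {
            d->onWrite(addr, value);
            return;
        }
        ram_[addr] = value;
    }

private:
    const MappedDevice* find(std::uint16_t addr) const {
        for (const MappedDevice& d : devices_)
            if (addr >= d.begin && addr <= d.end) return &d;
        return nullptr;
    }

    std::array<std::uint8_t, 65536> ram_{};  // 64 KB memory image
    std::vector<MappedDevice> devices_;
};
```

The linear search over devices runs on every memory access, so with many devices a per-page lookup table would likely be a better fit; for two or three devices the simple loop keeps the sketch short.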

In the current version, two basic devices are implemented:

- character I/O and 
- raster (pixel based) graphics display. 

The character I/O device uses 2 memory locations, one for non-blocking I/O
and one for blocking I/O. Writing to a location causes character output, while
reading from a location waits for character input (blocking mode) or reads the
character from the keyboard buffer if available (non-blocking mode).
The graphics display can be accessed by writing to multiple memory locations.

If we assume that GRDEVBASE is the base address of the Graphics Device, the following registers are available:

Offset   Register               Description
 0       GRAPHDEVREG_X_LO       Least significant part of pixel's X (column)
                                coordinate or begin of line coord. (0-255)
 1       GRAPHDEVREG_X_HI       Most significant part of pixel's X (column)
                                coordinate or begin of line coord. (0-1)                                      
 2       GRAPHDEVREG_Y          Pixel's Y (row) coordinate (0-199)
 3       GRAPHDEVREG_PXCOL_R    Pixel's RGB color component - Red (0-255)
 4       GRAPHDEVREG_PXCOL_G    Pixel's RGB color component - Green (0-255)
 5       GRAPHDEVREG_PXCOL_B    Pixel's RGB color component - Blue (0-255)
 6       GRAPHDEVREG_BGCOL_R    Background RGB color component - Red (0-255)
 7       GRAPHDEVREG_BGCOL_G    Background RGB color component - Green (0-255)
 8       GRAPHDEVREG_BGCOL_B    Background RGB color component - Blue (0-255)
 9       GRAPHDEVREG_CMD        Command code
10       GRAPHDEVREG_X2_LO      Least significant part of end of line's X
11       GRAPHDEVREG_X2_HI      Most significant part of end of line's X
12       GRAPHDEVREG_Y2         End of line's Y (row) coordinate (0-199)

Writing values to the above memory locations when the Graphics Device is enabled
sets the corresponding parameters of the device, while writing to the
command register executes the corresponding command (performs an action) per the codes listed below:

Command code                    Command description
GRAPHDEVCMD_CLRSCR = 0          Clear screen
GRAPHDEVCMD_SETPXL = 1          Set the pixel location to pixel color
GRAPHDEVCMD_CLRPXL = 2          Clear the pixel location (set to bg color)
GRAPHDEVCMD_SETBGC = 3          Set the background color
GRAPHDEVCMD_SETFGC = 4          Set the foreground (pixel) color
GRAPHDEVCMD_DRAWLN = 5          Draw line
GRAPHDEVCMD_ERASLN = 6          Erase line

Reading from registers has no effect (returns 0).
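
To make the register protocol above concrete, here is a small sketch of the write sequences a client would issue to set a pixel or draw a line. The register offsets and command codes are the ones from the tables. The Poke callback, the helper names and any base address passed in are hypothetical, and the order (coordinates first, command last) is my reading of the description.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Register offsets and command codes as listed in the tables above
// (only the ones used in this sketch).
enum GraphDevReg : std::uint16_t {
    GRAPHDEVREG_X_LO = 0, GRAPHDEVREG_X_HI = 1, GRAPHDEVREG_Y = 2,
    GRAPHDEVREG_CMD = 9,
    GRAPHDEVREG_X2_LO = 10, GRAPHDEVREG_X2_HI = 11, GRAPHDEVREG_Y2 = 12
};
enum GraphDevCmd : std::uint8_t {
    GRAPHDEVCMD_SETPXL = 1, GRAPHDEVCMD_DRAWLN = 5
};

// Hypothetical callback that performs one memory write in the VM.
using Poke = std::function<void(std::uint16_t addr, std::uint8_t value)>;

void writeReg(const Poke& poke, std::uint16_t base, GraphDevReg reg,
              std::uint8_t value) {
    poke(static_cast<std::uint16_t>(base + reg), value);
}

// Setting one pixel takes four writes: X low, X high, Y, then the command.
void setPixel(const Poke& poke, std::uint16_t base,
              std::uint16_t x, std::uint8_t y) {
    assert(x <= 0x1FF && y <= 199);  // X_HI is documented as 0-1, Y as 0-199
    writeReg(poke, base, GRAPHDEVREG_X_LO, static_cast<std::uint8_t>(x & 0xFF));
    writeReg(poke, base, GRAPHDEVREG_X_HI, static_cast<std::uint8_t>(x >> 8));
    writeReg(poke, base, GRAPHDEVREG_Y, y);
    writeReg(poke, base, GRAPHDEVREG_CMD, GRAPHDEVCMD_SETPXL);
}

// Drawing a line takes seven writes: both end points, then the command.
void drawLine(const Poke& poke, std::uint16_t base,
              std::uint16_t x1, std::uint8_t y1,
              std::uint16_t x2, std::uint8_t y2) {
    assert(x1 <= 0x1FF && x2 <= 0x1FF && y1 <= 199 && y2 <= 199);
    writeReg(poke, base, GRAPHDEVREG_X_LO, static_cast<std::uint8_t>(x1 & 0xFF));
    writeReg(poke, base, GRAPHDEVREG_X_HI, static_cast<std::uint8_t>(x1 >> 8));
    writeReg(poke, base, GRAPHDEVREG_Y, y1);
    writeReg(poke, base, GRAPHDEVREG_X2_LO, static_cast<std::uint8_t>(x2 & 0xFF));
    writeReg(poke, base, GRAPHDEVREG_X2_HI, static_cast<std::uint8_t>(x2 >> 8));
    writeReg(poke, base, GRAPHDEVREG_Y2, y2);
    writeReg(poke, base, GRAPHDEVREG_CMD, GRAPHDEVCMD_DRAWLN);
}
```

Counting the writes per operation also shows where the cost of this interface comes from: every single pixel needs several memory accesses before anything is drawn.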

The above method of interfacing with the graphics device requires no dedicated
graphics memory space in the VM's RAM. It is also simple to implement.
The downside is slow performance (multiple memory writes to select/unselect
a pixel or set a color).
I plan to add a graphics frame buffer in the VM's RAM address space in the future.

A simple demo program written in EhBasic that shows how to drive the graphics
screen is included. Please check the newest update in my GitHub repository.

I added a User Manual and Programmers Reference document to the project that describes in more detail the architecture of the emulator and how a programmer can expand upon it.

After you build the code, open a DOS prompt in the program's directory and type:

vm65 ehbas_grdemo.snap

Re-focus on the DOS window and at the Debug Console prompt enter:

x ffc3

then press ENTER and then type:


and press ENTER.