
Software systems and computational methods
Reference:

Creating a common notation of the x86 processor software interface for automated disassembler construction

Gusenko Mikhail YUr'evich

ORCID: 0009-0007-0524-5604

PhD in Technical Science

Associate Professor; Department of Applied and Business Informatics; Moscow Technological University

119454, Russia, Moscow, Vernadsky str., 78, office 418

mikegus@yandex.ru

DOI: 10.7256/2454-0714.2024.2.70951

EDN: EJJSYT

Received: 04-06-2024

Published: 03-07-2024


Abstract: The subject of the study is the reverse engineering of programs for x86 processors, whose software interface is developed by Intel and AMD, with the aim of recovering their source code in a low- or high-level language. The object of the study is the technical specifications in the documentation published by these companies. The article examines how frequently the processor documentation is updated and justifies the need for technological approaches to automated disassembler construction that keep pace with the regular and frequent updates of the processor software interface. It presents a method for processing the documentation to obtain a generalized, formalized and uniform specification of processor commands suitable for automated translation into disassembler program code. There are two main results: first, an analysis of the various ways commands are described in the Intel and AMD documentation and a reduction of these descriptions to a uniform representation; second, a comprehensive syntactic analysis of the notations used to describe the machine code and the assembly-language form of each command. Together with some additional details of the command descriptions, such as the processor operating modes in which a command is permitted, this made it possible to create a generalized command description that can be translated into disassembler code. The study also identified a number of errors both in the documentation texts and in existing industrial disassemblers, which, as analysis of their implementation shows, were built by manual coding. The identification of such errors in existing reverse engineering tools is a side result of the author's research.


Keywords:

disassembler, software reengineering, assembler, microprocessor, command specification, machine code syntax, assembler syntax, x86 Intel documentation, x86 AMD documentation, CPU mode

This article is automatically translated.

1. Introduction

When reverse engineering (reengineering) executable programs for a von Neumann machine, one must separate the binary code of commands from data and, at the same time, translate the machine code of commands into an equivalent symbolic representation in assembly language, thereby disassembling the program.

Although programs that translate machine code into symbolic form for x86 processors have a history as long as that of the processor family itself, in the author's opinion the task still lacks an elegant technological solution. The available source texts of disassemblers [1,2] show that their developers hand-code the machine code recognition procedures for individual commands. As a result, these disassemblers are laborious both to write and to test. The processor software interface has been expanding continuously, growing roughly linearly in the number of commands for more than 20 years, as reflected in the fairly frequent updates to the manufacturers' documentation, so such disassemblers become obsolete quite quickly. Existing approaches to building disassemblers do not allow these programs to be updated in step with the processor software interface; therefore, new approaches to creating them are needed.

2. The tendency to expand the software interface of processors

For many years, x86 processors have remained the most common devices for creating personal computer systems. According to [3], the share of computer systems based on processors of this architecture currently accounts for about 70% of the total number of processors used. Such popularity of these devices among consumers and, accordingly, among computer equipment manufacturers is explained both by historical reasons (software compatibility with previously developed and sold computers is ensured) and by the constant desire of processor manufacturers to improve the computing potential of their products. The result of these aspirations are architectural, technological and interface updates of processors. Architectural updates are expressed, for example, in the placement of several computing cores on one processor chip; technological - in the transition to a more dense arrangement of transistors on the processor chip; interface - in the expansion of the set of commands executed by the processor.

x86 is a complex instruction set computing (CISC) architecture, which traditionally means supporting commands specialized for each type of calculation that perform arithmetic and logical operations on data held in a small register memory.

The leading x86 processor manufacturers, Intel and AMD, have developed and implemented a number of computing technologies over the past two decades, including extensions of register memory and many new commands. There is no reason to expect that such improvements will stop, since these companies are unlikely to give up their market positions, and the CISC architecture offers no alternative to adding new specialized commands to the existing set. The only notable competitor to Intel and AMD in the production of x86 processors is the Chinese company Zhaoxin [4], which established production with the help of the Taiwanese company Via, but it is more a manufacturer of chips than a developer that influences the software interface of processors.

Assuming that the trend in hardware improvement will continue, it is logical to expect changes in the software, specifically that it will gradually use the newly appeared processor capabilities. This means that solving the problem of researching programs, for example, for undocumented capabilities, will also require the development of new or modernization of existing analysis tools.

Traditionally, analysis of program behavior involves reverse engineering, of which disassembly is an integral part. The expansion of the set of commands and registers supported by processors indicates the need to update the disassembly tools. The obvious way in this direction is to improve the technology of creating disassemblers while meeting the following requirements:

1. Coverage of the entire set of processor commands;

2. The ability to fine-tune the disassembler to work with commands of a certain set or supported under a certain processor operating mode;

3. Minimization of manual coding of disassembler procedures.

A natural source of information about commands and register memory is the documentation of the processor manufacturers. The history of Intel's documentation updates [5] shows that it is updated on average every 106 days (Fig. 1). Comparing this graph with the data in Table 1 reveals a correlation between the release dates of updated documentation and the new computing technologies announced by Intel. Implementing a computing technology means implementing a certain set of commands in the processor, and therefore requires additions to the documentation. However, as a study of the documentation versions shows, the descriptions of new commands are not published all at once. Documentation versions are released regularly, and the descriptions of the commands of the processor programming interface are gradually refined or republished in them.

AMD does not simply follow in Intel's footsteps: it develops its own data processing technologies and implements original instruction sets in its processors. AMD updates its documentation less frequently than Intel, on average every 176 days (Fig. 1). As with Intel, information about new commands is published gradually.

Fig. 1. Dynamics of the release of documentation updates for Intel and AMD x86 processors, with linear interpolation of the average update period

The combined dynamics of updating Intel and AMD documentation in Figure 1 is shown through linear interpolation (bold line). It can be seen that in the period from 2002 to 2023, it is maintained almost unchanged - on average, the frequency of updates in this case is 80 days. Thus, if we consider the official documentation from Intel and AMD as the main source of data on the software interface of processors and release updates to disassembly tools that support instruction sets from both manufacturers, then this will obviously have to be done with a period of 80 days.

Note that the release of documentation updates is directly related to the publication of descriptions of new commands. For example, the September 2023 version of the Intel documentation contains descriptions of 4343 commands [6], which is 283 more than the June 2021 version. Five more versions of the documentation were released between them. On average, this is 56 new commands per release. This is a very dynamic trend of expansion of the processor software interface.

The sets of commands from Intel and AMD overlap only partially, each manufacturer implements its own original subsets of commands, which differ both in the structure of the machine code and in functional purpose.

The documentation from Intel and AMD has such a volume (4,595 commands) that its automated processing for compiling machine code specifications of commands seems to be the most promising approach to automating the construction of a disassembler.

Besides its volume, the quality of the documentation text should also be noted: it regularly contains typos and inaccuracies, and Intel may also change the syntax and the way commands are described. Without an automated approach, it is practically impossible to analyze such features of the text and to identify and eliminate ambiguities and errors in the command descriptions. The same criticism applies to the AMD documentation.

Table 1. Designations of the main data processing technologies in x86 processors and the dates of their announcement by Intel

| Technology | Date of announcement |
|---|---|
| SSE | 26.02.1999 |
| SSE2 | 20.11.2000 |
| SSE3 | 02.02.2004 |
| SSSE3 | 26.07.2006 |
| SSE4.1 | 27.09.2006 |
| SSE4.2 | 27.09.2006 |
| AESNI | 01.03.2008 |
| PCLMULQDQ | 01.03.2008 |
| AVX | 01.03.2008 |
| AVX2 | 04.06.2013 |
| AVX-512 | 01.07.2013 |
| RDRAND | 29.04.2012 |
| AMX | 28.06.2020 |

A uniform and verified list of commands, together with their detailed notations, makes it possible to generate the text of the disassembler program automatically. The underlying approach, in which a formal description of a language is fed to a processing program that automatically generates a recognizer for texts in that language, is well proven in systems such as YACC, Bison and Lex. In these systems, part of the program for parsing texts in the language is built from a grammatical description of the language as a regular (automaton) or context-free grammar.

This idea can be applied to the regular (automaton) grammar of x86 machine code: command specifications are fed to a special translator, which outputs the code of the disassembler program in some high-level programming language.
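As a rough illustration of this idea, the Python sketch below turns a small list of command specifications into a working decoder. The three specification entries reuse notation examples that appear later in the article; the function names `build_decoder` and `disassemble`, the `db` fallback for unrecognized bytes and the restriction to one-byte opcodes with an optional `ib` immediate are assumptions made for brevity. Instead of emitting the source text of a disassembler, the sketch builds a lookup table at run time, which is a simplification of the generator approach and not the author's implementation.

```python
# Hypothetical specification entries: (machine code notation, assembly syntax).
SPECS = [
    ("04 ib", "ADD AL, imm8"),
    ("14 ib", "ADC AL, imm8"),
    ("37", "AAA"),
]


def build_decoder(specs):
    """Turn the specification list into a lookup table keyed by opcode byte."""
    table = {}
    for opcode_notation, syntax in specs:
        tokens = opcode_notation.split()
        opcode = int(tokens[0], 16)        # first token: opcode byte in hex
        has_imm8 = "ib" in tokens[1:]      # 'ib' marks a one-byte immediate
        table[opcode] = (syntax, has_imm8)
    return table


def disassemble(code, table):
    """Decode a byte string; bytes that match no command are emitted as data."""
    out, pos = [], 0
    while pos < len(code):
        entry = table.get(code[pos])
        if entry is None or (entry[1] and pos + 1 >= len(code)):
            # Unknown opcode or truncated immediate: treat the byte as data.
            out.append(f"db 0x{code[pos]:02X}")
            pos += 1
            continue
        syntax, has_imm8 = entry
        pos += 1
        if has_imm8:
            syntax = syntax.replace("imm8", f"0x{code[pos]:02X}")
            pos += 1
        out.append(syntax)
    return out


DECODER = build_decoder(SPECS)
listing = disassemble(bytes([0x04, 0x05, 0x37, 0x14, 0xFF]), DECODER)
```

Adding a command to such a decoder means adding one line to `SPECS`, with no change to the decoding procedure, which is the property the generator approach aims for.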

To explain this idea in detail, it is necessary to first consider the structure of the documentation of processors with the x86 architecture from Intel and AMD.

3. The structure of Intel and AMD documentation

The author has worked with descriptions of the processor software interface in Intel's proprietary documentation since 1995. The content and volume of the documentation have, of course, changed, but the structure of the description of an individual processor command has remained virtually unchanged. In the version used for this article [6], the description of a command has the following typical form (Fig. 2), with these characteristic elements:

1) the title, which gives the mnemonic of the command (sometimes the mnemonic of a command family) and a brief description of its semantics (Fig. 2(1)): 'ADD' is the mnemonic and 'Add' is the brief description of the semantics;

2) a table describing the command mnemonics (Fig. 2(2)), which includes the syntactic notation of the machine code in the "Opcode" column, the syntactic notation of the assembly code of the command (its mnemonic form) in the "Instruction" column, and other information. The columns of such tables are described in more detail in Section 4. Figure 2(2) shows only one of the table variants, of which there are six in the Intel documentation (Fig. 5, 6, 7, 8, 9, 10). Note that in some tables (Table 6) the structure of the machine code is not separated from the syntax of the command, so this separation has to be made;

3) the command operand encoding table (Fig. 2(3)), which always follows the command description table and presents the encoding options for the explicit operands of the command and the way they are used (read/write);

4) a detailed text description of the command's operation (Fig. 2(4));

5) a description of the operational semantics of the command in an Algol-like language (Fig. 2(5)). The documentation defines the elements of this language, but its actual use is somewhat arbitrary: the authors of individual command descriptions often deviate from the original notation, so the descriptions appeal to the reader's intuition more than they lend themselves to formal translation;

6) a description of the effect on the flags of the [R|E]FLAGS register of the processor executing the command (Fig. 2(6)); here and below, grammar is described in a metalanguage based on the extended Backus-Naur form (EBNF) [7];

7) various additional aspects of the processor's behavior when it detects an erroneous computational context for executing the command (Fig. 2(7)).

Of all this information, only the contents of the command mnemonics description table (Fig. 2(2)) and the command operand encoding table (Fig. 2(3)) are essential for constructing the input specifications of the disassembler. Because parts of the documentation were created and updated at different times, commands are described by tables that differ in structure. In this version of the documentation there are six types of tables (Fig. 5, 6, 7, 8, 9, 10). For some tables, the figures show how the footnote marks (different characters are used) relate to the text of the corresponding notes. Most of these footnotes state restrictions on the machine code that affect the disassembled representation of the command. These restrictions must be taken into account, since they distinguish correct machine code from incorrect code that, on an attempt to execute it, causes the processor to raise the #UD exception (Invalid Opcode/Undefined Opcode). Accordingly, a disassembler that separates the binary code of commands from data in a program for a von Neumann machine must distinguish correct code from incorrect code, and these details of the command descriptions can be used for that purpose.

Intel's command descriptions are grouped in the following sections [6]:

· A group of chapters describing general architecture commands under the common heading "Instruction Set Reference". At the time of writing, 4289 syntactic command notations (see below) of this type had been collected from Intel and AMD combined. Of these, 2,383 commands are unique to Intel, 252 are unique to AMD, and 1,654 are common to both. These counts exclude duplicates; for example, 'CMPS m8, m8' and 'CMPSB' are counted as one command because they have the same machine code.

· Commands of the Safer Mode Extensions (SMX) exist only in Intel processors and are in fact represented by the single GETSEC command, whose input register values select the particular function (leaf function) to execute.

· Commands unique to Intel® Xeon Phi™ processors exist only in Intel processors; there are 33 of them.

· Virtual Machine Extensions (VMX) support commands exist only in Intel processors; there are 17 of them.

· Intel® Software Guard Extensions (Intel® SGX) support commands exist only in Intel processors; these are three commands that, like GETSEC, execute leaf functions selected by input values.

In the AMD documentation [8], the description of the command is shown in Figure 3 and includes:

1) a header that, like Intel's, includes the mnemonics of the command operation and a brief description of its semantics;

2) description of operational semantics in verbal form;

3) syntactic forms of individual commands;

4) description of the effect on flags from the [R|E]FLAGS register of the processor;

5) various additional aspects of the processor's behavior in case it recognizes the erroneous computational context of command execution.

Comparing the two ways of describing commands, Intel's description is preferable for building a disassembler. First, it is more detailed and includes a full description of the machine code syntax and of the processor operating modes in which particular commands are allowed. Second, it describes the encoding of individual command operands in a systematic form that is convenient for automated analysis, unlike AMD's verbal description. Third, Intel's documentation describes more commands, in particular the Intel AVX-512 extension, GETSEC, Intel Xeon Phi, VMX and SGX commands, which AMD does not support.

However, the description of the effect on the [R|E]FLAGS register in the AMD documentation is more convenient than Intel's for automated processing when building a diagram of information links between the commands of a real program.

In addition, AMD describes commands that are not available in Intel documentation (details below). Therefore, the specifications of the set of command descriptions created according to Intel documentation can be supplemented with AMD command descriptions and obtain a generalized set of commands for the x86 architecture.

Both Intel and AMD give the semantic form of a command both in prefix notation, for example ADD DEST, SRC, where ADD denotes the operation and DEST and SRC denote the destination (DESTination) and source (SouRCe) operands, and in infix notation of the form DEST ← DEST + SRC (Fig. 2(5)).

For both manufacturers, a single semantic form such as ADD DEST, SRC corresponds, as a rule, to several syntactic forms: ADD AL, imm8; ADD AX, imm16; ...; ADD r64, r/m64. The number of syntactic forms can serve as a measure of the size of the processor's software interface.

For example, in the group of chapters describing the general commands of the Intel x86 architecture [6], the number of semantic forms is 764, and syntactic forms are 3975.

The syntactic forms of commands are encoded by various variants of machine code. And since for many aspects of the article it will be necessary to refer to specific examples of commands, a notation of the form will be used for this: O(pcode)='machine code'; I(nstruction)='command syntax'. For example, for the command in Fig. 2, this notation will look like: O='04 ib'; I='ADD AL, imm8'. For the command in Fig. 3, this notation will look like: O='14 ib'; I='ADC AL, imm8', etc.

In machine code notation, numbers are represented in hexadecimal form, i.e. in the notation O='37'; I='AAA', the number '37' is a number in hexadecimal representation. In some cases, when machine code bits are mentioned, they will be written with the suffix 'b' (binary), for example, '101b' is the three bits '1', '0' and '1'.
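The notation just introduced is regular enough to be read mechanically. The following Python sketch parses an O='...'; I='...' pair into opcode tokens, a mnemonic and an operand list. The helper names `parse_notation`, `opcode_bytes` and `parse_bits` are hypothetical, and the rule that hexadecimal bytes are written in upper case while markers such as 'ib' or 'cd' are lower case is an assumption drawn from the examples in this article.

```python
import re

# Matches the article's notation: O='machine code'; I='command syntax'
NOTATION = re.compile(r"O='(?P<opcode>[^']*)'\s*;\s*I='(?P<syntax>[^']*)'")


def parse_notation(text):
    """Split a notation string into (opcode tokens, mnemonic, operands)."""
    match = NOTATION.search(text)
    if match is None:
        raise ValueError(f"not an O/I notation: {text!r}")
    opcode_tokens = match.group("opcode").split()
    mnemonic, _, operand_text = match.group("syntax").strip().partition(" ")
    operands = [op.strip() for op in operand_text.split(",") if op.strip()]
    return opcode_tokens, mnemonic, operands


def opcode_bytes(tokens):
    """Keep only literal bytes, assumed to be two upper-case hex digits."""
    return [int(t, 16) for t in tokens if re.fullmatch(r"[0-9A-F]{2}", t)]


def parse_bits(text):
    """Convert a bit string with the 'b' suffix, e.g. '101b', to an integer."""
    if not re.fullmatch(r"[01]+b", text):
        raise ValueError(f"not a binary literal: {text!r}")
    return int(text[:-1], 2)


add = parse_notation("O='04 ib'; I='ADD AL, imm8'")
aaa = parse_notation("O='37'; I='AAA'")
```

Here `add` holds the tokens `['04', 'ib']`, the mnemonic `'ADD'` and the operands `['AL', 'imm8']`, while `opcode_bytes(aaa[0])` yields the single byte 0x37.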

Fig. 2. Typical view of a command description in the Intel documentation

Fig. 3. Typical view of a legacy command description in the AMD documentation

Fig. 4. Typical view of the description of VEX or XOP commands in the AMD documentation

Fig. 5. An example of a command description table of form 1. The format is used to describe commands of early versions of x86 processors (legacy commands)

Fig. 6. An example of a command description table of form 2. The format is used for various commands

Fig. 7. An example of a command description table of form 3. The format is used for commands of the mathematical coprocessor (Numeric Data Processor, NDP)

Fig. 8. An example of a command description table of form 4. The format is used for various commands

Fig. 9. An example of a command description table of form 5. The format is used for commands of the GETSEC family

Fig. 10. An example of a command description table of form 6. The format is used for VMX commands

Fig. 11. An example of a table describing (a) VEX and (b) XOP commands in the AMD documentation

By the structure of the machine code and the notation used to describe it, Intel commands can be divided into the following variants:

· EVEX commands. The machine code notation has the prefix 'EVEX': O='EVEX.512.66.0F38.WIG DF /r'; I='VAESDECLAST zmm1, zmm2, zmm3/m512';

· VEX commands. The machine code notation has the prefix 'VEX': O='VEX.256.66.0F38.WIG DF /r'; I='VAESDECLAST ymm1, ymm2, ymm3/m256';

· REX commands. The machine code notation contains the prefix 'REX': O='F2 REX 0F 38 F0 /r'; I='CRC32 r32, r/m8';

· Legacy commands. The machine code notation contains none of the above prefixes: O='0F 38 F0 /r'; I='MOVBE r16, m16'.
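The four variants listed above can be told apart from the machine code notation alone. The Python sketch below does so; the function name `classify_encoding` is hypothetical, and the rule that 'REX' may appear at any token position while 'VEX' and 'EVEX' always lead the notation is an assumption based on the examples in this article.

```python
def classify_encoding(opcode_notation):
    """Classify a machine code notation as 'EVEX', 'VEX', 'REX' or 'legacy'."""
    tokens = opcode_notation.split()
    first = tokens[0] if tokens else ""
    if first.startswith("EVEX"):
        return "EVEX"
    if first.startswith("VEX"):
        return "VEX"
    # REX may follow a mandatory prefix byte, as in 'F2 REX 0F 38 F0 /r'.
    if any(token.startswith("REX") for token in tokens):
        return "REX"
    return "legacy"


EXAMPLES = {
    "EVEX.512.66.0F38.WIG DF /r": "EVEX",
    "VEX.256.66.0F38.WIG DF /r": "VEX",
    "F2 REX 0F 38 F0 /r": "REX",
    "0F 38 F0 /r": "legacy",
}
results = {notation: classify_encoding(notation) for notation in EXAMPLES}
```

With the four notations from the list, `results` equals `EXAMPLES`, so each specification can be routed to the decoder for its encoding variant.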

Another feature of Intel's command descriptions is that they indicate restrictions and exceptions in the structure of the machine code. Such indications are usually given in a footnote attached to an element of the mnemonic table (Table 5); the footnote mark may refer either to an entire column of the table or to individual cell values. Such restrictions are fairly easy to take into account when postprocessing a disassembled command (Section 9).

For legacy commands, the AMD documentation uses a representation similar to Intel's, but for other commands, for example VEX commands, AMD describes the machine code in tabular form (Fig. 11, a), which complicates comparison with the Intel equivalent. From AMD's tabular machine code notation (Fig. 11), a representation similar to Intel's can be generated. It will only approximate the Intel form, however, since Intel does not follow strict rules for writing machine code notation. Thus, O='VEX.128.0F38.W0 F5 /r'; I='BZHI reg32, reg/mem32, reg32' in the AMD variant corresponds to O='VEX.LZ.0F38.W0 F5 /r'; I='BZHI r32a, r/m32, r32b' in the Intel variant.
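To compare the two variants of the BZHI syntax automatically, the operand names have to be brought to a common spelling. The Python sketch below shows one way this might be done. The rewrite rules (reg/memN to r/mN, regN to rN, memN to mN, and dropping Intel's 'a'/'b' suffixes on register operands) are assumptions inferred from this single example and not a complete mapping between the two documentation styles; the machine code notations (VEX.128 versus VEX.LZ) are not reconciled here.

```python
import re


def normalize_operands(syntax):
    """Rewrite AMD- and Intel-style operand names to one hypothetical spelling."""
    s = re.sub(r"\breg/mem(\d+)\b", r"r/m\1", syntax)  # reg/mem32 -> r/m32
    s = re.sub(r"\breg(\d+)\b", r"r\1", s)             # reg32     -> r32
    s = re.sub(r"\bmem(\d+)\b", r"m\1", s)             # mem128    -> m128
    s = re.sub(r"\b(r\d+)[ab]\b", r"\1", s)            # r32a/r32b -> r32
    return s


amd = normalize_operands("BZHI reg32, reg/mem32, reg32")
intel = normalize_operands("BZHI r32a, r/m32, r32b")
same_command = amd == intel
```

Both inputs normalize to 'BZHI r32, r/m32, r32', so `same_command` is true and the two descriptions can be recognized as one command when the Intel and AMD sets are merged.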

In addition, AMD's documentation contains a description of commands that are not found in Intel. These include:

· 3DNow! technology support commands. Their machine code is similar in structure to legacy code. For example, O='0F 0F /r 0C'; I='PI2FW mmx1, mmx2/mem64'. The difference is that the opcode bytes are not placed in adjacent positions;

· XOP commands that include the XOP prefix and have a structure similar to VEX commands (Fig. 11, b), but still differ in details (there is no Intel equivalent here at all, so the commands are presented in the author's interpretation). For example, O='XOP.128.MAP9.W0 81 /r'; I='VFRCZPD xmm1, xmm2/mem128';

· some VEX commands not described by Intel. For example, O='VEX.128.66.0F3A.W0 78 /r /is4'; I='VFNMADDPS xmm1, xmm2, xmm3/mem128, xmm4'. This category also includes commands with five operands, which Intel does not have at all. For example, O='VEX.128.66.0F3A.W0 49 /r is5'; I='VPERMIL2PD xmm1, xmm2, xmm3/mem128, xmm4, m2z';

· some legacy format commands not described by Intel. For example, O='0F 77'; I='EMMS'.

4. Some elements of the command description tables

Command syntax tables (mnemonic tables) in the Intel documentation (Fig. 5, 6, 7, 8, 9, 10) have the following column headers:

· "Opcode" is the header of the column containing the machine code notation of the command. Most often it is a separate column (Fig. 5, 6, 7), but it can be combined with the "Instruction" column, which describes the command syntax (Fig. 8).

· "Instruction" is the header of the column containing the syntactic notation of the command. Most often it is a separate column (Fig. 5, 6, 7), but it can be combined with the "Opcode" column, which describes the machine code (Fig. 8).

· "Op/En" is the header of the column containing a reference to the rows of the "Instruction Operand Encoding" table that follows the command syntax description table (Fig. 2(3)).

· "64-bit Mode" or "64/32bit Mode Support" is the header of the column that indicates whether the command is supported in the 64-bit and/or 32-bit processor modes. The values found in such columns are described in Section 6.

· "Compatibility/Legacy Mode" is the header of the column that indicates whether the command is supported in compatibility mode and/or belongs to the IA-32 legacy command family. The values found in such columns are described in Section 6.

· "CPUID Feature Flag" is the header of the column that names the flag, among the values returned by the processor command O='0F A2'; I='CPUID', that indicates support for a particular data processing technology in the processor. The values found in such columns are described in Section 7.

· The "Description" column contains a brief verbal description of the semantics of the calculations performed by the command. Examples of such a description are presented in Fig. 5, 6, 7, 8, 9, 10.

5. Encoding of operands in the description of commands

The encoding of the operands of each command is described in a separate table that follows the table describing commands of this type (Fig. 12). In the main table, the "Op/En" column holds letter designations that identify the operand encoding method for each individual command. The values in the "Op/En" column (Fig. 12(1)) serve only to link each command to a row of the operand encoding table that follows (Fig. 12(2)). In that table, according to Section 3.1.1.4, "Operand Encoding Column in the Instruction Summary Table" [6], the designation "r" in parentheses indicates that the operand is read, and "w" in parentheses indicates that the operand is written when the processor executes the command.
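A cell of the operand encoding table can be split into the place where the operand is encoded and its read/write access. The Python sketch below assumes cells of the form 'ModRM:r/m (r, w)', 'ModRM:reg (w)' or a bare 'imm8'; the function name `parse_operand_cell` and the returned dictionary layout are hypothetical.

```python
import re

# Encoding place, optionally followed by access letters in parentheses.
CELL = re.compile(r"^(?P<where>.*?)\s*(?:\((?P<access>[rw,\s]+)\))?$")


def parse_operand_cell(cell):
    """Return the encoding place and read/write flags of one table cell."""
    match = CELL.match(cell.strip())
    access = match.group("access") or ""
    return {
        "where": match.group("where"),
        "read": "r" in access,
        "write": "w" in access,
    }


destination = parse_operand_cell("ModRM:r/m (r, w)")
source = parse_operand_cell("imm8")
```

For the first cell the result records the place 'ModRM:r/m' with both access flags set; for a bare 'imm8' both flags stay false because the cell carries no access annotation.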

Examples of diverse descriptions of encoding operands in the Intel documentation are presented in column 1 (Table 2), and the full set of encoding type designations that are used to build the specification of commands in the author's generalized notation is presented in column 2 (Table 2). Thus, the variety of text descriptions from Intel can be reduced to eight types of decoding in the disassembler.

Fig. 12. The main tables (1) describing commands with the semantic mnemonics ADDSD and ADC, and the "Instruction Operand Encoding" tables (2) describing operand encoding in the Intel documentation. The links between the values of the "Op/En" column of the main tables (1) and the tables (2) are shown

Table 2. Representation of operand encoding options and equivalent types of operand decoders

| The operand in the Intel description (1) | The type of operand decoder in the disassembler (2) | Example of a command (3) |
|---|---|---|
| AL/AX/EAX/RAX, ST(0) | Value | O='REX.W + 15 id'; I='ADC RAX, imm32' |
| | | O='DC C0+i'; I='FADD ST(i), ST(0)' |
| ModRM:r/m; ModRM:rm, BaseReg(R): VSIB, ST(i) | ModRM | O='NP 0F AE /2'; I='LDMXCSR m32' |
| | | O='NP 0F 1A /r'; I='BNDLDX bnd, mib' |
| | | O='D8 C0+i'; I='FADD ST(0), ST(i)' |
| ModRM:reg | ModReg | O='VEX.L0.F2.0F.W0 93 /r'; I='KMOVD r32, k1' |
| VEX.vvvv, VEX.1vvv, EVEX.vvvv | Vvvv | O='VEX.L1.66.0F.W1 4A /r'; I='KADDD k1, k2, k3' |
| | | O='EVEX.128.66.0F3A.W0 21 /r ib'; I='VINSERTPS xmm1, xmm2, xmm3/m32, imm8' |
| imm/imm8/16/32, offset, Segment+Absolute Address | Imm | O='REX.W + 15 id'; I='ADC RAX, imm32' |
| | | O='0F 84 cd'; I='JZ rel32' |
| | | O='EA cp'; I='JMP ptr16:32' |
| opcode + rd | Opcode | O='40+ rd'; I='INC r32' |
| Moffs | Moffs | O='A1'; I='MOV EAX,moffs32' |
| NA, implicit | None | O='NP 0F 38 CB /r'; I='SHA256RNDS2 xmm1, xmm2/m128, <XMM0>' |
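The reduction in Table 2 amounts to a lookup from Intel's wording to one of eight decoder types. A minimal Python sketch of such a lookup follows. The decoder type names and operand descriptions are taken from Table 2; the names `DECODER_TYPES` and `decoder_type`, the whitespace-insensitive matching and the expansion of "imm/imm8/16/32" into four separate keys are assumptions.

```python
# Intel operand descriptions (column 1 of Table 2) grouped by decoder type
# (column 2). Splitting 'imm/imm8/16/32' into separate keys is an assumption.
DECODER_TYPES = {
    "Value": ["AL/AX/EAX/RAX", "ST(0)"],
    "ModRM": ["ModRM:r/m", "ModRM:rm", "BaseReg(R): VSIB", "ST(i)"],
    "ModReg": ["ModRM:reg"],
    "Vvvv": ["VEX.vvvv", "VEX.1vvv", "EVEX.vvvv"],
    "Imm": ["imm", "imm8", "imm16", "imm32", "offset",
            "Segment+Absolute Address"],
    "Opcode": ["opcode + rd"],
    "Moffs": ["Moffs"],
    "None": ["NA", "implicit"],
}


def _key(text):
    """Ignore case and whitespace so that spelling variants match."""
    return "".join(text.split()).lower()


LOOKUP = {
    _key(description): decoder
    for decoder, descriptions in DECODER_TYPES.items()
    for description in descriptions
}


def decoder_type(description):
    """Map an operand description from the documentation to a decoder type."""
    try:
        return LOOKUP[_key(description)]
    except KeyError:
        raise ValueError(f"unknown operand description: {description!r}") from None
```

An unknown description raises an error instead of falling back to a default, which would help surface new or misspelled wording when a later documentation version is processed.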

Note that Intel is not always uniform in how the documentation describes operand encoding. Instead of the typical representation (Fig. 12(2)), the operand encoding table may take a different form (Fig. 13), in which operands 2-9 are implicit and therefore do not appear in the disassembled representation of the command. Since there are only six such command descriptions in the current version of the documentation, they can be reduced to the typical form by manual editing (Fig. 14).

Fig. 13. Atypical description of the command operands

Fig. 14. Edited version of the atypical description of the command and its operands shown in Fig. 13

6. Compatibility of commands with processor operation modes

The "64/32-bit Mode" column indicates whether the command is supported in (a) 64-bit mode or (b) compatibility mode and the other IA-32 modes, which apply in conjunction with the feature flag returned by the command O='0F A2'; I='CPUID' and associated with the specific command. The values in the column (Table 3) are given according to clause 3.1.1.5, "64/32-bit Mode Column in the Instruction Summary Table" [6].

Table 3. Values of the "64/32-bit Mode" column in Fig. 2(2)

| Designation | Description |
| V or Valid | The command is supported (Valid) in this mode |
| I, Invalid, or Inv. | The command is not supported (Invalid) in this mode |
| NE or N.E. | The command syntax is not encodable (Not Encodable) in 64-bit mode, although it may be part of a valid command sequence in other modes |
| N.S. | The command syntax requires an address override prefix in 64-bit mode and is not supported (Not Supported). Using the address override prefix in 64-bit mode may lead to model-specific behavior. |

The possible values of the Compatibility/Legacy Mode column are shown in Table 4.

Table 4. Values of the "Compatibility/Legacy Mode" column in Fig. 2(2)

| Designation | Description |
| V | The command is supported (Valid) in this mode |
| I | The command is not supported (Invalid) in this mode |
| NE or N.E. | The Intel 64 command syntax is not encodable (Not Encodable): the machine code is not applicable as an individual command in compatibility mode or in IA-32 mode. The machine code may represent a valid sequence of legacy IA-32 commands. |
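The designations in Tables 3 and 4 can be normalized before they are used to decide whether a command may be reported for a given processor mode. The sketch below is illustrative only: the function names are hypothetical, the two arguments correspond to the "WidthMode" and "CompatLegMode" fields of the JSON specification, and treating "N.E." and "N.S." as "do not decode" is an assumption, not a rule stated in the documentation.

```python
# Column values from Tables 3 and 4, grouped by meaning.
MEANING = {
    "V": "valid", "Valid": "valid",
    "I": "invalid", "Invalid": "invalid", "Inv.": "invalid",
    "NE": "not_encodable", "N.E.": "not_encodable",
    "N.S.": "not_supported",
}


def classify(value: str) -> str:
    """Normalize one cell of a mode-support column."""
    try:
        return MEANING[value.strip()]
    except KeyError:
        raise ValueError(f"unknown mode designation: {value!r}") from None


def allowed_in_mode(width_mode: str, compat_leg_mode: str, is_64bit: bool) -> bool:
    """Return True if the command may be decoded in the selected mode.

    Assumption: only an explicit "valid" designation enables the command.
    """
    column = width_mode if is_64bit else compat_leg_mode
    return classify(column) == "valid"
```

For the command O='REX + 80 /2 ib'; I='ADC r/m8, imm8', whose columns hold "Valid" and "N.E.", this check accepts the command in 64-bit mode and rejects it in the other modes.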

7. Notation of processor computing options

According to clause 3.1.1.6, "CPUID Support Column in the Instruction Summary Table" [6], the values in the "CPUID Feature Flag" column (Fig. 15) indicate that the processor supports the corresponding technology. The value in the column matches the name of the flag that is set in the data returned by the processor when it executes the command O='0F A2'; I='CPUID'. In other words, if the processor supports a given computing technology, the corresponding flag (bit) is set.

The values of this column, for example 'ADX' in Fig. 15, can be mapped to the set of values denoting the data processing technologies implemented in processors (Table 1). Accordingly, by filtering commands on the technologies listed in the "CPUID Feature Flag" column, the disassembler can check whether the code of the analyzed program relies on a given data processing technology.

Figure 15. Example of values in the "CPUID Feature Flag" column
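Filtering by the "CPUID Feature Flag" column can be sketched as a set operation over the specifications of the decoded commands. This is an illustrative sketch: the function names are hypothetical, the "Cpu" field name follows the JSON example in Table 5, and splitting the field on whitespace to allow several flags per command is an assumption.

```python
def required_features(specs):
    """Collect every CPUID feature flag named by the given command specs."""
    features = set()
    for spec in specs:
        features.update(spec.get("Cpu", "").split())
    return features


def unsupported(specs, available):
    """Return the specs that need a feature flag missing from `available`."""
    available = set(available)
    return [s for s in specs if not set(s.get("Cpu", "").split()) <= available]


# Hypothetical records for two commands found in an analyzed program.
decoded = [
    {"Mnemonic": "ADC r/m8, imm8", "Cpu": ""},
    {"Mnemonic": "ADCX r32, r/m32", "Cpu": "ADX"},
]
print(required_features(decoded))
print([s["Mnemonic"] for s in unsupported(decoded, available=set())])
```

A command with an empty "Cpu" field is treated as part of the base command set and therefore never reported as unsupported.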

8. Some typical errors identified in Intel and AMD documentation

A detailed analysis showed that no version of the documentation from either manufacturer can be regarded as a reference. Both Intel and AMD have syntactic and semantic errors in their command descriptions. Some of them can be identified and corrected by cross-comparing the descriptions of the same commands from both manufacturers.

The following are some typical errors that have to be corrected in the documentation text when compiling the input specification for building the disassembler.

8.1 Erroneous encoding of the operand in Intel documentation

In the description of the command O='VEX.128.66.0F.WIG 12 /r'; I='VMOVLPD xmm2, xmm1, m64', the encoding of the operands is described incorrectly: the encoding ModRM:r/m is specified for both Operand 1 and Operand 3 in the "Instruction Operand Encoding" table.

8.2 Lack of an operand encoding method in Intel documentation

In the description of the command O='EVEX.LLIG.66.0F3A.W0 0A /r ib'; I='VRNDSCALESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8', the encoding of Operand 4 is not specified at all in the "Instruction Operand Encoding" table, whereas the command syntax specifies the operand imm8. This also applies to the commands VSHUFF64, VSHUFF64X2, VSHUFF32, VSHUFF32X4.
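Errors of the kinds described in 8.1 and 8.2 lend themselves to mechanical detection. The sketch below shows one possible pair of checks. The function names are hypothetical, and the encodings passed in the usage example are illustrative: only the duplicated ModRM:r/m and the missing fourth encoding come from the text above, the remaining values are assumed.

```python
import re


def split_operands(mnemonic):
    """Return the operand list of a string like 'NAME op1, op2, ...'."""
    parts = mnemonic.strip().split(None, 1)
    if len(parts) < 2:
        return []
    return [op.strip() for op in parts[1].split(",") if op.strip()]


def base_encoding(encoding):
    """Strip a trailing access annotation such as ' (r, w)'."""
    return re.sub(r"\s+\([^)]*\)\s*$", "", encoding.strip())


def check_spec(mnemonic, encodings):
    """Return a list of human-readable problems found in one description."""
    problems = []
    operands = split_operands(mnemonic)
    bases = [base_encoding(e) for e in encodings]
    # 8.2: every operand in the command syntax needs an encoding.
    for index, operand in enumerate(operands):
        enc = bases[index] if index < len(bases) else "NA"
        if enc == "NA":
            problems.append(f"operand {index + 1} ({operand}) has no encoding")
    # 8.1: the same explicit encoding must not serve two operands.
    seen = {}
    for index, enc in enumerate(bases):
        if enc not in ("NA", "implicit"):
            seen.setdefault(enc, []).append(index + 1)
    for enc, positions in seen.items():
        if len(positions) > 1:
            problems.append(f"{enc} is used for operands {positions}")
    return problems


print(check_spec("VMOVLPD xmm2, xmm1, m64",
                 ["ModRM:r/m", "VEX.vvvv", "ModRM:r/m", "NA"]))
print(check_spec("VRNDSCALESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8",
                 ["ModRM:reg", "EVEX.vvvv", "ModRM:r/m", "NA"]))
```

Checks like these only flag suspicious entries. Deciding which encoding is actually correct still requires consulting the second manufacturer's description or the command semantics.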

8.3 Incorrect notation of machine command code in AMD documentation

For some commands, the AMD documentation does not specify the machine code correctly. For example, for the command I='VBLENDVPD xmm1, xmm2, xmm3/ mem128, xmm4' AMD has the code in the notation O='VEX.128.66.0F3A.WIG 4B /r', whereas according to Intel documentation the correct code is: O='VEX.128.66.0F3A.W0 4B /r /is4'. The operand '/is4' is missing.

8.4 Incorrect command mnemonics notation in AMD documentation

For some commands, the mnemonics are specified incorrectly in the AMD documentation. For example, for the command I='VPINSRB xmm, reg/mem8, xmm, imm8'; O='VEX.128.66.0F3A.WIG 20 /r ib' the mnemonic is incorrect. For Intel, this command is represented in the correct notation I='VPINSRB xmm1, xmm2, r32/m8, imm8'. The operand 'r32/m8' must be the third operand in order.

In the command I='VMOVMSKB reg64, xmm1'; O='VEX.128.66.0F.WIG D7 /r' the mnemonic of the opcode is incorrect. Intel has the correct option I='VPMOVMSKB reg64,xmm1'.

For a command with O='0F 38 CD /r', AMD's mnemonics are represented as I='SHA256MSG1 xmm1, xmm2/m128' whereas Intel's correct version is I='SHA256MSG2 xmm1, xmm2/m128'.
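The cross-comparison of Intel and AMD descriptions mentioned at the beginning of this section can be sketched as a join on the machine code notation. The function names below are hypothetical. The sketch matches opcodes textually after whitespace normalization, so it catches mnemonic discrepancies such as those in 8.4 but would not pair descriptions whose opcode notations differ, as in 8.3.

```python
def normalize_opcode(opcode):
    """Collapse runs of whitespace so that equal notations compare equal."""
    return " ".join(opcode.split())


def mnemonic_name(instruction):
    """Return the mnemonic without its operands."""
    return instruction.split()[0].upper()


def mnemonic_conflicts(intel, amd):
    """Compare (opcode, instruction) pairs from the two sources.

    Returns (opcode, intel_mnemonic, amd_mnemonic) for every opcode
    that both sources describe under different mnemonics.
    """
    intel_names = {normalize_opcode(o): mnemonic_name(i) for o, i in intel}
    conflicts = []
    for opcode, instruction in amd:
        key = normalize_opcode(opcode)
        name = mnemonic_name(instruction)
        if key in intel_names and intel_names[key] != name:
            conflicts.append((key, intel_names[key], name))
    return conflicts


intel = [
    ("0F 38 CD /r", "SHA256MSG2 xmm1, xmm2/m128"),
    ("VEX.128.66.0F.WIG D7 /r", "VPMOVMSKB reg64,xmm1"),
]
amd = [
    ("0F 38 CD /r", "SHA256MSG1 xmm1, xmm2/m128"),
    ("VEX.128.66.0F.WIG D7 /r", "VMOVMSKB reg64, xmm1"),
]
for conflict in mnemonic_conflicts(intel, amd):
    print(conflict)
```

A conflict only shows that the two sources disagree. Which of them is in error has to be settled separately, as was done for the examples above.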

9. Generalized description of commands and its use for automated processing

The technology for automated generation of disassembler code is illustrated in Fig. 16. A textual description of the command syntax is created directly from the text of the Intel and AMD documentation. It is fed to a special translator, which analyzes the description and, if it is correct, generates disassembler code in an imperative high-level programming language. The generated files are then compiled by the appropriate compiler into the executable module of the disassembler.

JSON or XML proved convenient as the declarative language for this description. Describing the command syntax then amounts to compiling an array of data records in either of these formats.

Fig. 16. Illustration of automated generation of disassembler code

When the generalized description of the command syntax is created manually and/or with the text-processing automation tools of modern text editors, the errors, typos and other defects that abound in the Intel and AMD documentation can be removed. The text of the source specifications for the translator can be generated by the same means (Fig. 16).

The set and sequence of steps for building a generalized specification can be as follows:

· exporting text from a PDF representation, for example, to a Microsoft Word DOC file;

· bringing the text to a single stylistic representation;

· conversion of syntax tables with separation of machine code notation representation and command mnemonics;

· checking the syntactic representation of the text and encoding all restrictions applicable to commands in the documentation;

· comparison of descriptions of the same commands in Intel and AMD notations;

· generation of a generalized description of the command syntax (generalized specification).
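As an illustration of the table conversion step, the sketch below splits a command written in the O='...'; I='...' notation used in this article into its machine code notation, mnemonic and operands. The function name and the returned field names are hypothetical, and the sketch handles only the order in which O precedes I.

```python
import re

NOTATION = re.compile(r"O='(?P<opcode>[^']*)'\s*;\s*I='(?P<instr>[^']*)'")


def parse_notation(text):
    """Split "O='...'; I='...'" into opcode notation, mnemonic and operands."""
    match = NOTATION.search(text)
    if match is None:
        raise ValueError(f"not in O='...'; I='...' form: {text!r}")
    opcode = " ".join(match["opcode"].split())
    name, _, rest = match["instr"].strip().partition(" ")
    operands = [part.strip() for part in rest.split(",") if part.strip()]
    return {"Opcode": opcode, "Mnemonic": name, "Operands": operands}


print(parse_notation("O='REX.W + 15 id'; I='ADC RAX, imm32'"))
print(parse_notation("O='0F A2'; I='CPUID'"))
```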

An example of the JSON specification of the command O='REX+80 /2 ib'; I='ADC r/m8, imm8' is presented in Table 5. Fragments of this specification are consistent with the corresponding parts of the command descriptions in the Intel documentation.

Table 5. Example of a fragment of a JSON file with a description of the command O='REX + 80 /2 ib'; I='ADC r/m8, imm8'

| A fragment of the JSON specification | Explanations |
| { | The preamble of the command descriptions, which records, in particular, the versions of the Intel and AMD documentation on which the file is based |
| "_comment": "This file contains descriptions of INTEL and AMD x86 instructions", | |
| "_comment": "It is based on: Intel release 325462-081US September 2023; AMD release 40332-4.07-June 2023", | |
| "Instructions": [ | The beginning of the array of command syntax descriptions |
| ... | |
| { | The description of the command itself |
| "OrderNum": "12", | The serial number of the specification. It is convenient for translator messages about detected errors |
| "Opcode": "REX + 80 /2 ib", | The syntax of the command's machine code, taken, for example, from the "Opcode" column (Fig. 2) |
| "Mnemonic": "ADC r/m8, imm8", | The syntax of the command in assembly language, taken, for example, from the "Instruction" column (Fig. 2) |
| "Operands": [ | Description of the operands, including their encoding in the documentation (Fig. 2(2)) and the types of operand decoders (Table 2) |
| { "Encoding": "ModRM:r/m (r, w)", "Decoder": "ModRM" }, | |
| { "Encoding": "imm8/16/32", "Decoder": "Imm" }, | |
| { "Encoding": "NA", "Decoder": "None" }, | |
| { "Encoding": "NA", "Decoder": "None" }, | |
| { "Encoding": "NA", "Decoder": "None" } | |
| ], | |
| "WidthMode": "Valid", | Supported processor operating modes, taken, for example, from a column of the type "64-bit Mode" (Fig. 5 and Section 6) or of the type "64/32-bit Mode Support" (Fig. 8 and Section 6) |
| "Cpu": "", | The value from the "CPUID Feature Flag" column (Fig. 6 and Section 7) |
| "CompatLegMode": "N.E.", | The value from the "Compat/Leg Mode" column (Fig. 5 and Section 6) |
| "Description": "Add with carry imm8 to r/m8.", | The value from the "Description" column (Fig. 2) |
| "Notes": "In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.", | Notes that state, for some commands, the limitations of their implementation (Fig. 5). Kept as a comment, this line makes it possible to code the postprocessing procedure in the disassembler correctly by hand. |
| "Features": "", | Some features of the command, for example, its applicability only in 32-bit addressing mode, as for the command O='A5'; I='MOVSD' compared to O='A5'; I='MOVSW' for 16-bit addressing mode. |
| "OEM": "INTEL" | An indication of the source of the command description |
| }, | The epilogues of the command description, the array of commands and the entire file |
| ... | |
| ] | |
| } | |
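A specification file of this shape can be read with any standard JSON parser. The sketch below loads a shortened copy of the fragment from Table 5 and checks that each record carries the fields the translator would rely on. The function names and the choice of required fields are assumptions made for illustration, not part of the author's translator.

```python
import json

# Fields assumed to be mandatory in every command record (illustrative choice).
REQUIRED = ("OrderNum", "Opcode", "Mnemonic", "Operands",
            "WidthMode", "CompatLegMode", "OEM")

SPEC_TEXT = """
{
  "Instructions": [
    {
      "OrderNum": "12",
      "Opcode": "REX + 80 /2 ib",
      "Mnemonic": "ADC r/m8, imm8",
      "Operands": [
        { "Encoding": "ModRM:r/m (r, w)", "Decoder": "ModRM" },
        { "Encoding": "imm8/16/32", "Decoder": "Imm" },
        { "Encoding": "NA", "Decoder": "None" }
      ],
      "WidthMode": "Valid",
      "Cpu": "",
      "CompatLegMode": "N.E.",
      "OEM": "INTEL"
    }
  ]
}
"""


def load_specs(text):
    """Parse the specification text and verify the required fields."""
    specs = json.loads(text)["Instructions"]
    for spec in specs:
        missing = [key for key in REQUIRED if key not in spec]
        if missing:
            # OrderNum lets the message point at the offending record.
            raise ValueError(f"spec {spec.get('OrderNum', '?')}: missing {missing}")
    return specs


def active_decoders(spec):
    """Return the decoder types of the operands that are actually present."""
    return [op["Decoder"] for op in spec["Operands"] if op["Decoder"] != "None"]


specs = load_specs(SPEC_TEXT)
print(active_decoders(specs[0]))
```

Note that a strict JSON parser keeps only the last value of a repeated key, so the two "_comment" entries of the preamble would collapse into one if they were read programmatically.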

In the author's prototype translator, the JSON file of command specifications is parsed and the automaton table is generated; its cells are structured as shown in Fig. 18.

To understand the structure shown in Fig. 18, consider the generalized structure of a command, which in general is a sequence consisting of (Fig. 17, a): an optional sequence { Prefix-i }, where i ∈ [0..3]; a sequence { Opcode-j }, where j ∈ [0..2]; and a sequence of further bytes encoding the operands. This unified structure is obtained by generalizing the command structures of the EVEX, VEX/XOP, REX and Legacy variants. If the structure is simplified further by dropping the division into bytes, the result is the variant shown in Fig. 17, b.

Together, Prefix and Opcode define a unique byte combination that identifies the command unambiguously. This is fully consistent with the definition of an automaton grammar, in which every production rule has the form α → T β, where α is a nonterminal, T is a terminal, and β is a sequence of nonterminals and terminals. In effect, Prefix and Opcode together make it possible to compute the value of T, which in turn makes it possible to: 1) address a specific cell of the disassembler automaton (Fig. 18); 2) select in that cell the elements for decoding the mnemonic and operands and, if necessary, perform postprocessing.

Here the value of the terminal T plays a dual role: it serves as an index (router) into the automaton table and determines the contextual conditions for selecting decoders within the automaton cell. In a real disassembler the automaton table is organized hierarchically, and at each level of the hierarchy its cells are addressed by the corresponding byte Opcode-j, but this is not essential for a general scheme that conveys the idea of how the disassembler is built.

Each operand decoder is a Boolean function that recognizes only a separate type of operand and calculates the value of the operand based on the optional part of the machine code following the Opcode. Postprocessing is used only in some cases to handle exceptions to the general rules of machine code structure.
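The interplay of the automaton table and the Boolean operand decoders can be shown on a deliberately tiny model. The sketch below is not the author's generated table: it is a flat, single-level table for two one-byte opcodes in 32-bit mode, with no prefixes, no ModRM and no postprocessing, and all names in it are hypothetical.

```python
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


class State:
    """Decoding state shared by the decoders of one command."""
    def __init__(self, code):
        self.code = code
        self.pos = 0
        self.opcode = 0
        self.operands = []


def dec_value_eax(st):
    """'Value' decoder: the operand is fixed by the command itself."""
    st.operands.append("EAX")
    return True


def dec_opcode_reg(st):
    """'Opcode' decoder: the register number is in the low 3 bits of the opcode."""
    st.operands.append(REG32[st.opcode & 7])
    return True


def dec_imm32(st):
    """'Imm' decoder: reads a 4-byte little-endian immediate after the opcode."""
    if st.pos + 4 > len(st.code):
        return False
    value = int.from_bytes(st.code[st.pos:st.pos + 4], "little")
    st.pos += 4
    st.operands.append(f"0x{value:X}")
    return True


# Automaton table: opcode byte -> cell (mnemonic, operand decoders).
TABLE = {0x15: ("ADC", [dec_value_eax, dec_imm32])}            # O='15 id'
TABLE.update({0x40 + r: ("INC", [dec_opcode_reg]) for r in range(8)})  # O='40+ rd'


def disassemble_one(code):
    """Decode one command; return (text, length) or None if not recognized."""
    if not code:
        return None
    st = State(code)
    st.opcode = code[0]
    st.pos = 1
    cell = TABLE.get(st.opcode)
    if cell is None:
        return None
    name, decoders = cell
    # Every decoder must recognize its operand, otherwise the command is rejected.
    if not all(decoder(st) for decoder in decoders):
        return None
    text = name + (" " + ", ".join(st.operands) if st.operands else "")
    return text, st.pos


print(disassemble_one(bytes([0x15, 0x78, 0x56, 0x34, 0x12])))
print(disassemble_one(bytes([0x43])))
```

Because each decoder reports success or failure, a cell can reject a byte sequence that matches the opcode but lacks the bytes its operands require.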

Fig. 17. Generalized command structure: a) divided into individual bytes; b) divided only into structural elements

Fig. 18. Generalized disassembler device

In the author's prototype disassembler, the automaton table (given its hierarchical structure, the automaton transition table) and each of its cells are generated automatically.

Alternative disassembler projects [1,2] are implemented differently: through voluminous (owing to the large number of x86 commands) and very time-consuming manual programming, which also makes updating their code laborious. As a result, such disassemblers may lag behind documentation updates and cannot keep pace with the update rate discussed above.

Fig. 19. Modules that make up the disassembler

Figure 19 shows the structure of the program elements that make up the prototype disassembler, built in the C language using the proposed technology. It includes:

· the input module, which is written manually and remains unchanged when the documentation version is updated (that is, when the processor command set is extended);

· the transition table of the disassembler automaton, the most complex and voluminous element of the disassembler, which is generated automatically by the translator (Fig. 16) from the input specifications obtained from the documentation;

· the operand decoders, which are created semi-automatically: the frame is generated automatically, and the body is written manually. Once written, a decoder no longer needs to be updated. For example, the decoder for the operand 'r32', once written for the command O='13 /r'; I='ADC r32, r/m32', is reused to disassemble the same operand in all commands with that designation, such as O='03 /r'; I='ADD r32, r/m32', O='3B /r'; I='CMP r32, r/m32', ..., O='33 /r'; I='XOR r32, r/m32';

· the output module, which is written manually and does not change.
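The generation of the automaton table from the specifications can be pictured as straightforward text emission. The sketch below produces a C array initializer from records shaped like the JSON example in Table 5. The C structure name `cell`, the array name `cells` and the `decode_...` function names are hypothetical and do not reproduce the layout of the author's generated code.

```python
def c_string(text):
    """Quote a Python string as a C string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def generate_c_table(specs):
    """Emit a C array initializer with one entry per command specification."""
    lines = ["static const struct cell cells[] = {"]
    for spec in specs:
        name = spec["Mnemonic"].split()[0]
        decoders = [op["Decoder"] for op in spec["Operands"]
                    if op["Decoder"] != "None"]
        # Each decoder type maps to one manually written C function.
        refs = ", ".join(f"decode_{d}" for d in decoders) or "0"
        lines.append(
            f"    {{ {c_string(spec['Opcode'])}, {c_string(name)}, {{ {refs} }} }},"
        )
    lines.append("};")
    return "\n".join(lines)


spec = {
    "Opcode": "REX + 80 /2 ib",
    "Mnemonic": "ADC r/m8, imm8",
    "Operands": [
        {"Encoding": "ModRM:r/m (r, w)", "Decoder": "ModRM"},
        {"Encoding": "imm8/16/32", "Decoder": "Imm"},
        {"Encoding": "NA", "Decoder": "None"},
    ],
}
print(generate_c_table([spec]))
```

Since the table refers to decoders only by name, adding commands that reuse existing operand types changes the generated table but not the manually written decoder bodies.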

In an ideal situation, when the new version of the documentation does not have a new command format and new options for representing operands, all disassembler modules are generated automatically. In reality, for example, when implementing the disassembler for the Intel documentation version from September 2023, compared with the documentation version from June 2021, it was necessary to slightly change the input module to work with commands like O='EVEX.128.NP.MAP5.W0 58 /r'; I='VADDPH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst' and O='EVEX.128.66.MAP6.W0 13 /r'; I='VCVTPH2PSX xmm1{k1}{z}, xmm2/m64/m16bcst', which have new elements 'MAP5' and 'MAP6' in their machine code.

10. Conclusion

Although the share of x86 processors in the global personal computer market has decreased by about thirteen percent over the six years from 2016 to 2022, x86 is still the most widely used microprocessor architecture [3]. Its main designers, Intel and AMD, are continuously improving their processors, including expanding their software interface. These improvements are aimed at improving performance when executing applications.

Modern software development relies largely on high-level languages, which makes it necessary to improve code generation in compilers so that they use the new commands. At the same time, the relative ease of creating applications in high-level languages makes it easy to implement undocumented features in them, which creates a risk of unpredictable application behavior on certain sets of input data. This risk makes it necessary to check at least the critical program code for the correctness of its computations. When the source code is unavailable, this can be done with software reengineering methods, one step of which is disassembling the executable code.

Naturally, disassemblers should be updated taking into account updates to the processor software interface. The volume of documentation for processors, in particular, the x86 architecture, is such that it is possible to process its text and collect command descriptions into a complete and uniform data array only by automated methods. This article is devoted to the study of the possibility of creating such specifications.

The purpose of building such specifications is the subsequent automated generation of the disassembler code to maintain its current version in relation to the documentation versions.

References
1. The source code version of the Windows 2000 operating system (2022). Retrieved from https://github.com/pustladi/Windows-2000
2. Intel X86 Encoder Decoder (Intel XED) project (2022). Retrieved from https://github.com/intelxed/xed
3. Distribution of Intel and AMD x86 computer central processing units (CPUs) worldwide from 2012 to 2022, by quarter (2022). Retrieved from https://www.statista.com/statistics/735904/worldwide-x86-intel-amd-market-share
4. China has learned how to legally clone Intel processors. An old rival helped (2023). Retrieved from https://www.cnews.ru/news/top/2023-06-09_kitajtsy_nauchilis_legalno
5. Intel® 64 and IA-32 Architectures. Software Developer’s Manual. Documentation Changes. September 2023. Document Number: 252046-073 (2023) Retrieved from https://www.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-documentation-changes.html
6. Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4. Order Number: 325462-081US September 2023 (2023). Retrieved from https://www.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html
7. The extended Backus-Naur form (2024). Retrieved from https://ru.wikipedia.org/wiki/Extended Form_Backusa-Naura
8. AMD64 Technology. AMD64 Architecture Programmer’s Manual. Volumes 1–5. Publication No. 40332 Revision 4.07 Date June 2023 (2023). Retrieved from https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The article is devoted to the development of a generalized notation of the software interface of x86 processors for automated disassembler construction. The main purpose of the research is to create a system that will automate the process of creating disassemblers, which will significantly simplify and speed up work with the program code of various x86 processors. The author uses the method of analysis and synthesis of existing approaches to the development of disassemblers based on the documentation of processors from Intel and AMD. The article provides a detailed overview of the documentation structure, describes typical errors and inaccuracies, and suggests ways to correct them automatically. The relevance of the study is due to the constant development of x86 processors and the increase in the number of commands supported by these processors. Existing approaches to the development of disassemblers cannot always adapt quickly to changes in the software interface of processors, which makes the solution proposed by the author extremely in demand. The novelty of the work lies in the development of a generalized notation that allows you to automate the process of creating disassemblers. For the first time, an approach is proposed in which automated translators are created based on formal specifications of processor commands that generate the disassembler source code. The article is written in a scientific style, characterized by logical consistency and clarity of presentation of the material. The structure of the article includes an introduction, an overview of current solutions, a description of the methodology, an analysis of processor documentation, a proposal of generalized notation and conclusions. The content of the article corresponds to the stated topic, each part is logically linked to the previous one, which makes the work holistic and consistent. 
The article concludes that it is possible and necessary to automate the process of creating disassemblers for x86 processors. The author shows that the proposed notation and methodology can significantly reduce the time and effort required to update and develop disassemblers in accordance with changes in processor documentation. The article will be of interest to a wide range of specialists in the field of software development dealing with reverse engineering and executable code analysis. The work is also of interest to compiler developers and information security specialists who analyze software for vulnerabilities. For further development of the work, the author is recommended to focus on several key areas. First, a more detailed testing of the proposed methodology should be carried out on various versions of processors from Intel and AMD in order to confirm the versatility and reliability of the generalized notation created. Secondly, it is worth developing and implementing tools for automated updating of disassembler specifications based on new versions of processor documentation. This will ensure synchronization with the latest changes in the processor software interface and increase the relevance and efficiency of disassembler updates. It is also necessary to consider the possibility of integrating the proposed approach with existing software development and analysis systems. This will significantly expand the scope of the proposed methodology and make it accessible to a wide range of users. It is also important to pay attention to the creation of detailed documentation and training materials that will help other developers quickly master and effectively use the proposed tools and techniques. Finally, a promising direction is to conduct a comparative analysis of the effectiveness of the proposed approach with other modern methods of developing disassemblers. 
This will allow an objective assessment of the advantages and disadvantages of the developed methodology, as well as identify opportunities for further improvement and adaptation to new challenges and market requirements. It is recommended to accept the article for publication, as it represents a significant contribution to the field of automation of software analysis tools development, has high relevance and novelty, and is also interesting for a wide scientific and professional audience.