Swift compiler device. Part 2

The second part of my story about the Swift compiler. We will begin to study the frontend, or rather those parts that are responsible for the initial analysis and analysis of the source code.

The first part of the article can be found here .

Frontend

The frontend's task is to generate an intermediate representation from the source code and transfer it to LLVM. This process can be divided into 8 steps. The result of almost every one of them can be output by passing a special parameter to the compiler.

Below I will demonstrate the implementation of a compiler for a primitive "programming language" that contains only "scope" and number. Its only function is to output a number (if it exists) to the console. For example, as a result of executing this “code” the number 678 will be displayed:

{{{678}}}

This compiler is only needed to make it easier for you to understand what is happening at different stages. On the implementation of this, but simple language, you can look at the example of Kaledoscope .

Each scope consists of an opening bracket, a content, and a closing bracket. This can be written as:

scope ::= open_brace x close_brace
open_brace ::= "{"
close_brace ::= "}"

The content may be the same scope, number, or nothing, indicated here :

scope ::= open_brace x close_brace
x ::=  scope | number | <empty>
open_brace ::= "{"
close_brace ::= "}"

Symbol | means "or." A number consists of one or more digits:

scope ::= open_brace x close_brace
x ::=  scope | number | <empty>
open_brace ::= "{"
close_brace ::= "}"
number ::= digit | number digit
digit :: = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Such a record is called the Backus-Naur form (BNF), and the totality of all the rules is called the grammar of the language or syntax.

There is also an extended BNF (RBNF). Additional special characters have been added to it, for example, parentheses for grouping.

Braces denote repetitions. Such a record means that A is either empty or equal to any number of B:

A ::= { B }

Brackets are used for conditional entry. Such a record means that A is either empty or equal to B:

A ::= [B]

Using the RBNF grammar compiler grammar can be converted to this form:

scope ::= open_brace [scope | number] close_brace
open_brace ::= "{"
close_brace ::= "}"
number ::= digit {digit}
digit :: = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

In addition to the parenthesis compiler, I will show how to use the Swift compiler to get the results of each of the steps using the simplest example:

let x = 16

The code is so simple to make it easier to understand. The grammar of a real programming language is much more complicated than the one given above. Look at the Swift syntax on the official site .

Lexer

From the point of view of the compiler, the source code file is a stream of random characters, and maybe some kind of garbage. Therefore, the first step is to convert these random characters into words that can be used in a programming language.

This is done by a lexical analyzer (lexer / tokenizer). The words he searches for are called lexemes or tokens (we will not go into unnecessary details and will take these terms as synonyms). Therefore, this process is also called tokenization.

Lexer analyzes according to the syntax of the language. Scanning character by character, it matches the source code and the right side of the record, and then generates the corresponding token from the left side.

Lexer implementation example

The grammar allowed three tokens: an opening bracket, a closing bracket, and a number. They are presented as a listing. Like the list of possible errors:

enum Token: CustomStringConvertible {
    case openBraceToken
    case closeBraceToken
    case number(Int)
    var description: String {
        switch self {
        case .openBraceToken:
            return "openBraceToken"
        case .closeBraceToken:
            return "closeBraceToken"
        case .number(let value):
            return "\(value)"
        }
    }
}
enum LexerError: Error {
    case unexpectedCharacted
    case invalidNumber
}

The lexer itself stores the input string and the index of the current character:

class Lexer {
    private let string: String
    private var index: String.Index
    private var currentCharacter: Character? {
        return index != string.endIndex ? string[index] : nil
    }
    init(string: String) {
        self.string = string
        index = string.startIndex
    }

The analysis is started by calling the startAnalyze () method . In it, a method is called in a loop to get the next token, which is added to the array. At the end - interception of recognition errors:

func startAnalyzing() -> [Token] {
    var result: [Token] = []
    do {
        while let token = try getNextToken() {
            result.append(token)
        }
    } catch LexerError.unexpectedCharacted {
        print("Unexpected character at index \(index.encodedOffset)")
        result = []
    } catch LexerError.invalidNumber {
        print("Invalid number at index \(index.encodedOffset)")
        result = []
    } catch {
        print("Unexpected error")
        result = []
    }
    return result
}

Getting a token consists of checking the current character and moving the index forward one or more characters:

private func getNextToken() throws -> Token? {
    guard let character = currentCharacter else {
        return nil
    }
    switch character {
    case "{":
        return getOpenBrace()
    case "}":
        return getCloseBrace()
    default:
        break
    }
    if let scalar = character.unicodeScalars.first,
        CharacterSet.decimalDigits.contains(scalar) {
        return try getNumber()
    }
    throw LexerError.unexpectedCharacted
}
private func getOpenBrace() -> Token {
    moveIndex()
    return Token.openBraceToken
}
private func getCloseBrace() -> Token {
    moveIndex()
    return Token.closeBraceToken
}

If a digit is found, the getNumber () method will be called to parse it . It simply collects all the numbers in one line and converts it to an Int:

private func getNumber() throws -> Token {
    var numberString = ""
    while let character = currentCharacter,
        let scalar = character.unicodeScalars.first,
        CharacterSet.decimalDigits.contains(scalar) {
            numberString.append(character)
            moveIndex()
    }
    guard let number = Int(numberString) else {
        throw LexerError.invalidNumber
    }
    return Token.number(number)
}

To start the lexer, you need to pass a line with the source code to it and call the startAnalyze () method :

let lexer = Lexer(string: "{{5678}}")
let tokens = lexer.startAnalyzing()

The result is as expected:

[openBraceToken, openBraceToken, 5678, closeBraceToken, closeBraceToken]

In Swift, the lexer is part of the parser, and you cannot get a list of tokens corresponding to the source code.

Source:

Parser

The parser performs parsing. A sequence of tokens is passed to the input, and the result is an AST.

AST is a graph in which the vertices are operators, and the leaves are their operands. For example, the expression 1 + (2 * 3) consists of two operators: addition and multiplication. The first operand of addition is 1, and the second is the result of product 2 and 3. The grouping brackets in AST are not used, since they are not necessary. The graph itself determines the nesting of operations:

Each node can be represented, for example, by a structure or enumeration containing child elements.

During the formation of the tree, the parser checks the grammar of the language: is the correct sequence composed of "words"? For example, the string "{{}" will be successfully parsed by the lexer, however it is incorrect because the second closing bracket is missing.

The Swift parser is a recursive descending parser. This means that he first finds the external entity, and then recursively analyzes its contents. The ascending parser first finds the most nested entity, and then rises up.

Parser implementation example

There are a large number of types and implementations of parsers. Below are the simplest implementation of the descending and ascending parser to illustrate the difference.

A common feature of both parsers is the enumeration, which represents one of the errors and the AST node:

indirect enum ASTNode: CustomStringConvertible {
    case brace(childNode: ASTNode?)
    case number(value: Int)
    var description: String {
        switch self {
        case let .brace(childNode?):
            return "brace(\(childNode.description))"
        case .brace(nil):
            return "brace(nil)"
        case let .number(value):
            return "\(value)"
        }
    }
}
enum ParserError: Error {
    case expectedOpenBrace
    case expectedCloseBrace
    case expectedNumber
    case expectedEndOfArray
}

Downstream parser

The descending parser stores an array of tokens and the index of the current token:

class TopDownParser {
    private let tokens: [Token]
    private var index: Int
    private var currentToken: Token? {
        return index != tokens.endIndex ? tokens[index] : nil
    }
    init(tokens: [Token]) {
        self.tokens = tokens
        index = tokens.startIndex
    }

The only public method is parsing. At first parsing of pair of brackets is caused. Since the root entity can only be a pair of brackets (the string "{} {}" is invalid), this is sufficient. Further analysis will be performed recursively. It only remains to check that the array of tokens is empty, and to produce interception of exceptions:

func startParsing() -> ASTNode? {
    var rootNode: ASTNode?
    do {
        rootNode = try parseBraces()
        guard currentToken == nil else {
            rootNode = nil
            throw ParserError.expectedEndOfArray
        }
    } catch ParserError.expectedCloseBrace {
        print("Expected close brace at index \(index)")
    } catch ParserError.expectedOpenBrace {
        print("Expected open brace at index \(index)")
    } catch ParserError.expectedEndOfArray {
        print("Expected end of tokens array at index \(index)")
    } catch {
        print("Unexpected error")
    }
    return rootNode
}

When parsing brackets, the opening bracket is scanned first, then recursively - the contents inside the pair (if any), and at the end - the closing bracket.

An important point: as soon as the parser sees the opening bracket, he believes that he has successfully found the next pair and can proceed to the content analysis. That is why it goes from external entities to internal ones.

private func parseBraces() throws -> ASTNode? {
    try consumeOpenBrace()
    print("Pair found")
    let node: ASTNode?
    if let currentToken = self.currentToken,
        case .openBraceToken = currentToken {
        node = .brace(childNode: try parseBraces())
    } else if let currentToken = self.currentToken,
        case let .number(value) = currentToken {
        node = .brace(childNode: .number(value: value))
        try consumeNumber()
    } else {
        node = .brace(childNode: nil)
    }
    try consumeCloseBrace()
    return node
}

A scan consists of checking a token and moving a current index:

private func consumeOpenBrace() throws {
    if let currentToken = self.currentToken,
        case .openBraceToken = currentToken {
        print("Open brace found")
        moveIndex()
    } else {
        throw ParserError.expectedOpenBrace
    }
}

Upstream parser

Implementing an ascending parser is a bit more complicated. He needs to keep his state, token stack and transition table between states.

The state is presented as a listing. It is necessary in order to take into account previous tokens and respond correctly to the emergence of a new one. Since this example is very simple, it turned out that in the first state the parser parses the opening brackets, and in the second - the closing:

class BottomUpParser {
    private enum State {
        case state1
        case state2
    }

The parser is initialized in the same way as the first. There is no transition table as a separate entity. For the sake of simplicity, its role will be played by switch:

private let tokens: [Token]
private var index: Int
private var state: State = .state1
private var stack: [Token] = []
private var rootNode: ASTNode?
private var currentToken: Token? {
    return index != tokens.endIndex ? tokens[index] : nil
}
init(tokens: [Token]) {
    self.tokens = tokens
    index = tokens.startIndex
}

The analysis is also launched using the startParsing () method . In it, each token is processed in turn, and at the end a check is called that the stack is empty and the parsing is completed successfully:

func startParsing() -> ASTNode? {
    do {
        guard !tokens.isEmpty else {
            throw ParserError.expectedOpenBrace
        }
        while index != tokens.endIndex {
            try parseNextToken()
            moveIndex()
        }
        guard stack.isEmpty else {
            rootNode = nil
            throw ParserError.expectedCloseBrace
        }
    } catch ParserError.expectedCloseBrace {
        rootNode = nil
        print("Expected close brace at index \(index)")
    } catch ParserError.expectedOpenBrace {
        rootNode = nil
        print("Expected open brace at index \(index)")
    } catch ParserError.expectedEndOfArray {
        rootNode = nil
        print("Expected end of tokens array at index \(index)")
    } catch {
        rootNode = nil
        print("Unexpected error")
    }
    return rootNode
}

It processes the next token in the switch, taking into account the current state. If the opening bracket arrived and the state is 1, then it is added to the stack. If the closing one - the parser checks if there is an opening bracket for it in the stack, then removes it from the stack and goes to state 2. The transition to the second state is also performed when the number is found.

The parser continues to remove one item from the stack for each closing bracket. If at this moment comes the opening bracket or number - this is an error in the input array.

An important point: the ascending parser thinks that it found a pair only after, plunging down the hierarchy, it sees the closing bracket, provided that the opening pair is in the stack. Only after that he will look for the essence that contains it, and so on up to the root. That is why it is called ascending.

private func parseNextToken() throws {
    guard let currentToken = currentToken else {
        return
    }
    switch (state, currentToken) {
    case (.state1, .openBraceToken):
        print("Open brace found")
        stack.append(.openBraceToken)
    case (.state1, .number(let value)):
        if stack.isEmpty {
            throw ParserError.expectedOpenBrace
        } else {
            consumeNumber(value: value)
            state = .state2
        }
    case (.state1, .closeBraceToken):
        if stack.isEmpty {
            throw ParserError.expectedOpenBrace
        } else {
            consumeCloseBrace()
            state = .state2
        }
    case (.state2, .closeBraceToken):
        if stack.isEmpty {
            throw ParserError.expectedEndOfArray
        } else {
            consumeCloseBrace()
        }
    case (.state2, .number), (.state2, .openBraceToken):
        if stack.isEmpty {
            throw ParserError.expectedEndOfArray
        } else {
            throw ParserError.expectedCloseBrace
        }
    }
}
private func consumeCloseBrace() {
    print("Close brace found")
    _ = stack.popLast()
    print("Pair found")
    if rootNode == nil {
        rootNode = .brace(childNode: nil)
    } else {
        rootNode = .brace(childNode: rootNode)
    }
}
private func consumeNumber(value: Int) {
    rootNode = .number(value: value)
}

To start both parsers, you need to pass an array of tokens received from the lexer, and call the startParsing () method :

let parserTD = TopDownParser(tokens: tokens)
let ast1 = parserTD.startParsing()
let parserBU = BottomUpParser(tokens: tokens)
let ast2 = parserBU.startParsing()

The result of both parsers was correct:

brace(brace(5678))

Using the Swift parser

To run the Swift parser, you need to pass the -dump-parse flag to the compiler :

swiftc -dump-parse main.swift

Real AST has a more complex structure than the parentheses parser. But since it is small, it is easy to sort out and find the integer literal 16 and the constant x :

(source_file
  (top_level_code_decl
    (brace_stmt
      (pattern_binding_decl
        (pattern_named 'x')
        (integer_literal_expr type='<null>' value=16))
))
  (var_decl "x" type='<null type>' let storage_kind=stored))

This form of AST is an untyped tree. Therefore, the constant x does not specify the type type = ''. If you specify it explicitly - let x: Int = 16 the tree will change, but the type will not be registered anyway:

(source_file
  (top_level_code_decl
    (brace_stmt
      (pattern_binding_decl
        (pattern_typed
          (pattern_named 'x')
          (type_ident
            (component id='Int' bind=none)))
        (integer_literal_expr type='<null>' value=16))
))
  (var_decl "x" type='<null type>' let storage_kind=stored))

Source:

Sema

The tree received from the parser is grammatically correct, but there may still be errors in it. For example, during parsing it is impossible (impractical) to determine that a variable is declared before use. This is a semantic analyzer. It goes through the whole tree and assigns types to expressions, checks protocol support, synthesizes default initializers for structures, and more.

Using the Swift Semantic Analyzer

To launch the semantic analyzer, use the -dump-ast flag :

swiftc -dump-ast main.swift

The result of the command:

(source_file
  (top_level_code_decl
    (brace_stmt
      (pattern_binding_decl
        (pattern_named type='Int' 'x')
        (call_expr implicit type='Int' location=main.swift:1:9 range=[main.swift:1:9 - line:1:9] nothrow arg_labels=_builtinIntegerLiteral:
          (constructor_ref_call_expr implicit type='(_MaxBuiltinIntegerType) -> Int' location=main.swift:1:9 range=[main.swift:1:9 - line:1:9] nothrow
            (declref_expr implicit type='(Int.Type) -> (_MaxBuiltinIntegerType) -> Int' location=main.swift:1:9 range=[main.swift:1:9 - line:1:9] decl=Swift.(file).Int.init(_builtinIntegerLiteral:) function_ref=single)
            (type_expr implicit type='Int.Type' location=main.swift:1:9 range=[main.swift:1:9 - line:1:9] typerepr='Int'))
          (tuple_expr implicit type='(_builtinIntegerLiteral: Int2048)' location=main.swift:1:9 range=[main.swift:1:9 - line:1:9] names=_builtinIntegerLiteral
            (integer_literal_expr type='Int2048' location=main.swift:1:9 range=[main.swift:1:9 - line:1:9] value=16))))
))
  (var_decl "x" type='Int' interface type='Int' access=internal let storage_kind=stored))

The constant has type = 'Int' , as well as an access level. Initializing the constant is a little more complicated. Added a call to the _builtinIntegerLiteral constructor . If we present this tree in the form of Swift code, we get:

internal let x: Int = Int(_builtinIntegerLiteral: 16)

The following example contains an error:

let x: Bool = 16

If you pass it to the parser, it will not detect any deviations:

(source_file
  (top_level_code_decl
    (brace_stmt
      (pattern_binding_decl
        (pattern_typed
          (pattern_named 'x')
          (type_ident
            (component id='Bool' bind=none)))
        (integer_literal_expr type='<null>' value=16))
))
  (var_decl "x" type='<null type>' let storage_kind=stored))

But the semantic analyzer will indicate what is wrong with the code transmitted to it:

error: cannot convert value of type 'Int' to specified type 'Bool'
let x: Bool = 16

Obviously, the error was trying to assign an Int value to a constant of type Bool. Swift does not allow this. Thanks to the semantic analyzer.

Source:

Sema

Clang importer

At this stage, the Clang modules are imported and the C and Objective-C API are mapped to the corresponding calls from Swift.

Source:

Clang importer

Теперь у нас есть полностью разобранный исходный код, который прошёл первоначальную проверку. Но прежде чем переходить к генерации LLVM IR, нужно выполнить Swift специфичные оптимизации.

Полную версию исходного кода можно найти в репозитории.

Теги:

Swift compiler device. Part 2

Frontend

Lexer

Lexer implementation example

Parser

Parser implementation example

Downstream parser

Upstream parser

Using the Swift parser

Sema

Using the Swift Semantic Analyzer

Clang importer

Also popular now: