Lexical Analysis


Lexical Analysis breaks up the source file stream into a set of tokens.

Design Goals

  • Analyses the text passed into the method.
    • Must not assume that the text is a whole source file.
    • Must not assume that the text is just part of a source file.
    • Must be able to handle large inputs (e.g. concatenated files).
  • Minimises expensive statements, e.g. loops and method calls.
  • Compares integer values or single characters, rather than strings, where possible.
  • Must only scan through the text once: no multiple scans or rescans.
  • Must contain optimisations where possible to maximise efficiency.

Design Restrictions

Text passed in:
  • Must contain at least one character.
  • Must contain fewer than 2,147,483,647 characters.
  • Must be C# code (the analyser does not check this).
  • Must be Unicode text with 16-bit (UTF-16) characters.
  • Error detection and reporting are left to the rest of the compiler.
  • The only error this analyser reports is a character literal split across a new line.

Lexicon Boundary Point Struct

A token is represented by the LexiconBoundaryPoint data structure (LBP for short). An LBP records the starting character position, the ending character position, the type, and the subtype of the token. An LBP's type and subtype are stored as integer values, each corresponding to an index into a string array. This permits the caching of types and fast comparison techniques: because they are plain integers rather than strings, the processor can compare them cheaply. There is a maximum of 2,147,483,647 possible types and subtypes; negative integer values are ignored.
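The description above suggests a struct along the following lines. This is a hypothetical sketch: the field names and the constructor are assumptions, not the published definition.

```csharp
// Sketch of the LexiconBoundaryPoint struct described above.
// Field names are assumed; the real source may differ.
public struct LexiconBoundaryPoint
{
    public int Start;    // index of the token's first character
    public int End;      // index of the token's last character
    public int Type;     // index into the LBPTypes string array
    public int SubType;  // index into the subtype string array

    public LexiconBoundaryPoint(int start, int end, int type, int subType)
    {
        Start = start;
        End = end;
        Type = type;
        SubType = subType;
    }
}
```

Keeping all four members as integers is what allows LBPs to be compared with single machine instructions, in line with the design goals above.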

Warning: this data structure may be reused by other analysers and other parts of the compiler.

Lexicon Class

The Lexicon class holds all the fields and methods required to break the source file stream into a set of tokens. All Lexicon Boundary Points are stored in a List collection called LBP, which the rest of the compiler uses to assist the compilation process.


Analyse(string text)

This method breaks the source text into tokens which we will refer to as lexicon boundary points (or LBPs). All tokens are stored in the LBP List collection.

It starts by creating a series of integer values that represent the different indexes in the LBPTypes field. These values can be cached in processor registers, and as an added bonus they make the code more readable.
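The idea can be sketched as follows. The contents of LBPTypes and the helper names here are assumptions for illustration; only the technique (look the index up once, then compare plain integers in the hot path) comes from the text.

```csharp
using System;

static class TypeTableSketch
{
    // Assumed contents of LBPTypes; the real table may differ.
    static readonly string[] LBPTypes =
        { "None", "Directive", "Punctuator", "Literal", "Operator", "Identifier" };

    // Looked up once, outside the scan loop, so the loop never
    // has to compare strings.
    static readonly int Literal = Array.IndexOf(LBPTypes, "Literal");

    // A single integer comparison instead of a string comparison.
    public static bool IsLiteral(int tokenType) => tokenType == Literal;
}
```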

This method also creates a temporary LBP called boundary and a temporary string called token.

Then it begins to scan through the source text using a single for loop. The loop contains a large number of branching statements, which are structured to take advantage of the processor's branch prediction.

boundary is used like a state machine to work out what type of token an LBP is. Setting its start position and type lets the scan skip some branching statements, which helps to improve efficiency and performance. The subtype is used to help determine the end character position of a token.
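A minimal, self-contained sketch of this state-machine scan is shown below. It handles only double-quoted string literals; the real Analyse method has many more branches, and the names here (Scan, the state constants) are assumptions, not the actual code.

```csharp
using System;
using System.Collections.Generic;

static class ScanSketch
{
    // Indexes into a hypothetical type table; plain ints so the
    // loop compares integers, never strings.
    const int None = 0, StringLiteral = 1;

    // Single pass over the text, recording (start, end) boundaries
    // of double-quoted string literals.
    public static List<(int Start, int End)> Scan(string text)
    {
        var points = new List<(int Start, int End)>();
        int state = None, start = 0;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            if (state == StringLiteral)
            {
                // While inside a literal, every other detection
                // branch is skipped, as the text describes.
                if (c == '"') { points.Add((start, i)); state = None; }
                continue;
            }

            if (c == '"') { state = StringLiteral; start = i; }
            // ... further branches: directives, punctuators, operators ...
        }
        return points;
    }
}
```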

The branching statements are organised into different types of detection:
  1. Pre-Processor Directives.
  2. Punctuators.
  3. Literals.
  4. Operators.
  5. Identifiers and Keywords (or reserved identifiers).

The detections run in the order described above. Analyse scans through the source text only once and then returns to its caller. If you intend to use this class for auto-completion or syntax highlighting, pass in the minimum amount of text possible; a single line is ideal and gives the best performance. You are likely to pass the same line of code in on multiple occasions, so minimise this as much as possible, for example by only passing in code that has not previously been analysed.

In the compiler paradigm, the ideal amount is one or more compilation units (source files). To get the best performance out of this analyser, we recommend passing in more than 1,000 but fewer than 20,000 lines of code per analysis. On a dual-core system (Intel Core Duo) it takes around 50-100 ms to analyse roughly 1,500 lines of code, while roughly 75,000 lines takes around 1,000 ms. C# source files that share a namespace are often stored in a single folder, so we recommend concatenating all the files in a folder before passing them into the analyser. The only requirement is that you append two null characters (U+0000) to the end of each file, so that the compiler can split the result into different assemblies. Warning: make sure that the total number of characters is less than 2,147,483,647.
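The concatenation step could be done along these lines. This helper is an assumption, not part of the published API; only the two-null-character separator requirement comes from the text.

```csharp
using System;
using System.IO;
using System.Text;

static class SourceConcat
{
    // Concatenate every .cs file in a folder, appending two null
    // characters (U+0000) after each file so the compiler can split
    // the result into separate assemblies.
    // The caller must keep the total below 2,147,483,647 characters.
    public static string Concatenate(string folder)
    {
        var sb = new StringBuilder();
        foreach (string path in Directory.GetFiles(folder, "*.cs"))
        {
            sb.Append(File.ReadAllText(path));
            sb.Append('\0').Append('\0'); // end-of-file marker for the analyser
        }
        return sb.ToString();
    }
}
```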

By release 0.2, the analyser will support SMP (symmetric multiprocessing), which will help to reduce compilation time.

Our tests on newer systems (Intel Core i5) ran this analyser over roughly 75,000 lines of code in less than 300 ms.

WhatTypeIs(int LBPT)

WhatTypeIs(string LBPT)

WhereIs(LexiconBoundaryPoint rLBP)

WhereIs(int CharacterPosition)


List<LexiconBoundaryPoint> LBP



Last edited May 17, 2013 at 9:19 AM by danieljsamson, version 18
