****************************************
Lexical Structure
****************************************


Input Format
========================================


An alba source file is interpreted as a sequence of unicode code points encoded
in UTF-8.

A UTF-8 byte order mark

.. code-block:: none

    0xEF 0xBB 0xBF

and a *shebang*

.. code-block:: bash

    #! /usr/bin/env alba

at the beginning of a file are ignored by the compiler.


Wildcard
========================================

The symbol ``_`` is a *wildcard*.


Punctuation Symbols
========================================


The following symbols are punctuation symbols. The unicode equivalents are in
parentheses.

.. code-block::

    : ( ) [ ] { } \ (λ) , ; -> (→) := ! ?


Identifiers
========================================

There are two ways to construct an indentifier.

- An nonempty sequence of letters, digits and underscores starting with a letter.

- An underscore followed by a nonempty sequence of letters digits and
  underscores.

Examples:

.. code-block::

    -- legal identifiers

    Natural
    BigType
    is_empty
    isEmpty
    _all
    _convert
    __

    -- illegal identifiers

    _                           -- single underscore is a wildcard
    reverse-append              -- hyphen not allowed


Keywords
========================================

The following keywords are reserved and cannot be used as identifiers. Some
keywords have unicode equivalents which are written within parentheses.

.. code-block::

    all (∀) and And (∧) Any
    case class
    do
    else
    ghost
    in (∈) inspect if
    let Level
    module mutual
    not Not (¬)
    once or Or (∨)
    Prop
    record ref
    section some (∃)
    then
    use
    where

The recommended way to use a keyword as an identifer is to prefix it with an
underscore (e.g. ``_all``, ``_case``, ...).


Operators
========================================

We have the following ascii and unicode operator characters. Ascii equivalents
are in parentheses.

.. code-block::

    + - * / | \ & < > = ~ ^ : _     -- ascii operator characters

    ≤ (<=) ≥ (>=)                   -- unicode operator characters


An operator string is any nonempty sequence of operator characters which does
not consist only of a backslash ``'\'``, a colon ``:`` or an underline ``_``.
Reason: The backslash and the colon are a punctuation symbols. The underline is
a wildcard.

An operator string and its ascii equivalent are the same.

The following are keyword operators:

.. code-block::

    and
    not Not (¬)
    or


An operator symbol is one of:

- A nonempty operator string ending with an optional question mark.

- An operator keyword.


Literals
========================================


Numbers
----------------------------------------

.. code-block::

    -- integral numbers

    100
    123_4_75
    0xaf_Bc_012
    0b1110_10

    -- floating point numbers
    1.0
    76_53.123
    1e+10
    1.5e-15

- Underscores can be used to group digits.

- Hexidecimal digits can be used in uppercase or lower case.

- Floating point numbers cannot end in a ``.``.

- The exponent in a floating point number is optional, the sign in the exponent
  is mandatory.


Characters and Strings
----------------------------------------

Literal characters and strings can use escape sequences. An escape sequence
has form ``\cccc``. The following are legal escapes:

+---------------+-------------------------------+
| Escapes                                       |
+---------------+-------------------------------+
+ ``\xab``      + 7 bit character code          |
+---------------+-------------------------------+
+ ``\u{7FFFF}`` + Unicode code point            |
+---------------+-------------------------------+
+ ``\"``        + double quote                  |
+---------------+-------------------------------+
+ ``\'``        + quote                         |
+---------------+-------------------------------+
+ ``\n``        + newline                       |
+---------------+-------------------------------+
+ ``\r``        + carriage return               |
+---------------+-------------------------------+
+ ``\0``        + Null                          |
+---------------+-------------------------------+
+ ``\\``        + Backslash                     |
+---------------+-------------------------------+

If ``\c`` does not start a legal escape, it is treated as the character ``c``.


Character
    ``'c'``: ``c`` is either a unicode character or an escape.


Byte
    ``b'a'``: ``a`` is an ascii character or an escape representing an ascii
    character.


String
    ``"cccc"``: ``cccc`` is any sequence of unicode characters or escapes.


Raw string
    ``r###"cccc"###``: ``cccc`` is any sequence of unicode characters. Escapes
    are not processed.

    The number of ``#`` must be the same at the beginning and the end. It can be
    zero. The string cannot contain ``"###`` because this sequence would be
    interpreted as the end of the string. the number of ``#``\ s can be chosen
    such that the end of string can be recognized uniquely.


Byte string
    ``b"hello"``: Only ascii characters or escapes which represent ascii
    characters are allowed.


Raw byte string
    ``br###"hello"###``: Like a raw string with ascii only characters. Escapes
    are not processed. ``#``\ s can be used to recognize uniquely the end of the
    string.


Comments
========================================

.. code-block::

    -- spans to the end of line

    reverse {: short comment :} list

    {: Multiline comment

        spans several lines  {: can be nested :}
    :} all {A: Any}: A


Whitespace
========================================


A whitespace is one of:

- A sequence of blanks

- A newline ``U+000A`` or carriage return character ``U+000D``

- A comment

Note that tabs are not valid lexical tokens. Reason: Since the language is
indentation sensitive and tabs are interpreted differently by different editors,
the layout would  not be well defined independently from the tabsize.


Allowed UTF-8 Characters
==================================================

In strings and characters all UTF-8 code points are alllowed. In the remainder
only the following are valid UTF-8 code points within source code.

+--------+------------+------------------+
| Symbol | Code Point | Ascii Equivalent |
+--------+------------+------------------+
| α      | U+03B1     |                  |
+--------+------------+------------------+
| β      | U+03B2     |                  |
+--------+------------+------------------+
| γ      | U+03B3     |                  |
+--------+------------+------------------+
| λ      | U+03BB     | ``\``            |
+--------+------------+------------------+
| ≤      | U+2264     | ``<=``           |
+--------+------------+------------------+
| ≥      | U+2265     | ``>=``           |
+--------+------------+------------------+
| ∀      | U+2200     | ``all``          |
+--------+------------+------------------+
| →      | U+2192     | ``->``           |
+--------+------------+------------------+
| ⊢      | U+22A2     | ``|-``           |
+--------+------------+------------------+
| ⊨      | U+22A8     | ``|=``           |
+--------+------------+------------------+
| ∧      | U+2227     | /\\              |
+--------+------------+------------------+
| ∨      | U+2228     | \\/              |
+--------+------------+------------------+
| ¬      | U+00ac     | ``Not``          |
+--------+------------+------------------+
| ℕ      | U+2115     | ``Natural``      |
+--------+------------+------------------+
| ℤ      | U+2124     | ``Integer``      |
+--------+------------+------------------+


UTF-8 Encoding
========================================

A unicode code point in UTF-8 encoding consists of one or more bytes. The first
byte encodes how many continuation bytes are following to encode the code point.
**Rule**: In order to encode a unicode code point in UTF-8 the byte sequence
must have minimal length.

.. code-block:: none

    First byte in a unicode byte sequence

    0xxxxxxx                1 byte sequence i.e. ascii character
    110xxxxx                2 byte sequence
    1110xxxx                3 byte sequence
    11110xxx                4 byte sequence


    Continuation byte

    10xxxxxx                6 bits of information


The following encodings are possible:

1 Byte Sequence
    All Ascii characters up to ``U+7F`` i.e. 7 bits.

2 Byte Sequence
    Unicode code points up to ``U+07FF`` i.e. up to 11 bits.

3 Byte Sequence
    Unicode code points up to ``U+FFFF`` i.e. up to 16 bits.

4 Byte Sequence
    Unicode code points up to ``U+10FFFF`` i.e. up to 21 bits.

According to the unicode standard the values between ``U+D800`` and ``U+DFFF``
are not valid code points (they are used to encode surrogate pairs in UTF-16).