3.7. Lexical Structure

3.7.1. Input Format

An alba source file is interpreted as a sequence of unicode code points encoded in UTF-8.

A UTF-8 byte order mark

0xEF 0xBB 0xBF

and a shebang

#! /usr/bin/env alba

at the beginning of a file are ignored by the compiler.

3.7.2. Wildcard

The symbol _ is a wildcard.

3.7.3. Punctuation Symbols

The following symbols are punctuation symbols. The unicode equivalents are in parentheses.

: ( ) [ ] { } \ (λ) , ; -> (→) := ! ?

3.7.4. Identifiers

There are two ways to construct an indentifier.

An nonempty sequence of letters, digits and underscores starting with a letter.
An underscore followed by a nonempty sequence of letters digits and underscores.

Examples:

-- legal identifiers

Natural
BigType
is_empty
isEmpty
_all
_convert
__

-- illegal identifiers

_                           -- single underscore is a wildcard
reverse-append              -- hyphen not allowed

3.7.5. Keywords

The following keywords are reserved and cannot be used as identifiers. Some keywords have unicode equivalents which are written within parentheses.

all (∀) and And (∧) Any
case class
do
else
ghost
in (∈) inspect if
let Level
module mutual
not Not (¬)
once or Or (∨)
Prop
record ref
section some (∃)
then
use
where

The recommended way to use a keyword as an identifer is to prefix it with an underscore (e.g. _all, _case, …).

3.7.6. Operators

We have the following ascii and unicode operator characters. Ascii equivalents are in parentheses.

+ - * / | \ & < > = ~ ^ : _     -- ascii operator characters

≤ (<=) ≥ (>=)                   -- unicode operator characters

An operator string is any nonempty sequence of operator characters which does not consist only of a backslash '\', a colon : or an underline _. Reason: The backslash and the colon are a punctuation symbols. The underline is a wildcard.

An operator string and its ascii equivalent are the same.

The following are keyword operators:

and
not Not (¬)
or

An operator symbol is one of:

A nonempty operator string ending with an optional question mark.
An operator keyword.

3.7.7. Literals

3.7.7.1. Numbers

-- integral numbers

100
123_4_75
0xaf_Bc_012
0b1110_10

-- floating point numbers
1.0
76_53.123
1e+10
1.5e-15

Underscores can be used to group digits.
Hexidecimal digits can be used in uppercase or lower case.
Floating point numbers cannot end in a ..
The exponent in a floating point number is optional, the sign in the exponent is mandatory.

3.7.7.2. Characters and Strings

Literal characters and strings can use escape sequences. An escape sequence has form \cccc. The following are legal escapes:

Escapes
`\xab`	7 bit character code

`\u{7FFFF}`	Unicode code point

`\"`	double quote

`\'`	quote

`\n`	newline

`\r`	carriage return

`\0`	Null

`\\`	Backslash

If \c does not start a legal escape, it is treated as the character c.

Character

'c': c is either a unicode character or an escape.

Byte

b'a': a is an ascii character or an escape representing an ascii character.

String

"cccc": cccc is any sequence of unicode characters or escapes.

Raw string

r###"cccc"###: cccc is any sequence of unicode characters. Escapes are not processed.

The number of # must be the same at the beginning and the end. It can be zero. The string cannot contain "### because this sequence would be interpreted as the end of the string. the number of #s can be chosen such that the end of string can be recognized uniquely.

Byte string

b"hello": Only ascii characters or escapes which represent ascii characters are allowed.

Raw byte string

br###"hello"###: Like a raw string with ascii only characters. Escapes are not processed. #s can be used to recognize uniquely the end of the string.

3.7.8. Comments

-- spans to the end of line

reverse {: short comment :} list

{: Multiline comment

    spans several lines  {: can be nested :}
:} all {A: Any}: A

3.7.9. Whitespace

A whitespace is one of:

A sequence of blanks
A newline U+000A or carriage return character U+000D
A comment

Note that tabs are not valid lexical tokens. Reason: Since the language is indentation sensitive and tabs are interpreted differently by different editors, the layout would not be well defined independently from the tabsize.

3.7.10. Allowed UTF-8 Characters

In strings and characters all UTF-8 code points are alllowed. In the remainder only the following are valid UTF-8 code points within source code.

Symbol	Code Point	Ascii Equivalent
α	U+03B1
β	U+03B2
γ	U+03B3
λ	U+03BB	`\`
≤	U+2264	`<=`
≥	U+2265	`>=`
∀	U+2200	`all`
→	U+2192	`->`
⊢	U+22A2	`\|-`
⊨	U+22A8	`\|=`
∧	U+2227	/\
∨	U+2228	\/
¬	U+00ac	`Not`
ℕ	U+2115	`Natural`
ℤ	U+2124	`Integer`

3.7.11. UTF-8 Encoding

A unicode code point in UTF-8 encoding consists of one or more bytes. The first byte encodes how many continuation bytes are following to encode the code point. Rule: In order to encode a unicode code point in UTF-8 the byte sequence must have minimal length.

First byte in a unicode byte sequence

0xxxxxxx                1 byte sequence i.e. ascii character
110xxxxx                2 byte sequence
1110xxxx                3 byte sequence
11110xxx                4 byte sequence


Continuation byte

10xxxxxx                6 bits of information

The following encodings are possible:

1 Byte Sequence: All Ascii characters up to U+7F i.e. 7 bits.
2 Byte Sequence: Unicode code points up to U+07FF i.e. up to 11 bits.
3 Byte Sequence: Unicode code points up to U+FFFF i.e. up to 16 bits.
4 Byte Sequence: Unicode code points up to U+10FFFF i.e. up to 21 bits.

According to the unicode standard the values between U+D800 and U+DFFF are not valid code points (they are used to encode surrogate pairs in UTF-16).