3.7. Lexical Structure
3.7.1. Input Format
An alba source file is interpreted as a sequence of unicode code points encoded in UTF-8.
A UTF-8 byte order mark
0xEF 0xBB 0xBF
and a shebang
#! /usr/bin/env alba
at the beginning of a file are ignored by the compiler.
3.7.2. Wildcard
The symbol _ is a wildcard.
3.7.3. Punctuation Symbols
The following symbols are punctuation symbols. The unicode equivalents are in parentheses.
: ( ) [ ] { } \ (λ) , ; -> (→) := ! ?
3.7.4. Identifiers
There are two ways to construct an indentifier.
An nonempty sequence of letters, digits and underscores starting with a letter.
An underscore followed by a nonempty sequence of letters digits and underscores.
Examples:
-- legal identifiers
Natural
BigType
is_empty
isEmpty
_all
_convert
__
-- illegal identifiers
_ -- single underscore is a wildcard
reverse-append -- hyphen not allowed
3.7.5. Keywords
The following keywords are reserved and cannot be used as identifiers. Some keywords have unicode equivalents which are written within parentheses.
all (∀) and And (∧) Any
case class
do
else
ghost
in (∈) inspect if
let Level
module mutual
not Not (¬)
once or Or (∨)
Prop
record ref
section some (∃)
then
use
where
The recommended way to use a keyword as an identifer is to prefix it with an
underscore (e.g. _all, _case, …).
3.7.6. Operators
We have the following ascii and unicode operator characters. Ascii equivalents are in parentheses.
+ - * / | \ & < > = ~ ^ : _ -- ascii operator characters
≤ (<=) ≥ (>=) -- unicode operator characters
An operator string is any nonempty sequence of operator characters which does
not consist only of a backslash '\', a colon : or an underline _.
Reason: The backslash and the colon are a punctuation symbols. The underline is
a wildcard.
An operator string and its ascii equivalent are the same.
The following are keyword operators:
and
not Not (¬)
or
An operator symbol is one of:
A nonempty operator string ending with an optional question mark.
An operator keyword.
3.7.7. Literals
3.7.7.1. Numbers
-- integral numbers
100
123_4_75
0xaf_Bc_012
0b1110_10
-- floating point numbers
1.0
76_53.123
1e+10
1.5e-15
Underscores can be used to group digits.
Hexidecimal digits can be used in uppercase or lower case.
Floating point numbers cannot end in a
..The exponent in a floating point number is optional, the sign in the exponent is mandatory.
3.7.7.2. Characters and Strings
Literal characters and strings can use escape sequences. An escape sequence
has form \cccc. The following are legal escapes:
Escapes |
|
|
7 bit character code |
|
Unicode code point |
|
double quote |
|
quote |
|
newline |
|
carriage return |
|
Null |
|
Backslash |
If \c does not start a legal escape, it is treated as the character c.
- Character
'c':cis either a unicode character or an escape.- Byte
b'a':ais an ascii character or an escape representing an ascii character.- String
"cccc":ccccis any sequence of unicode characters or escapes.- Raw string
r###"cccc"###:ccccis any sequence of unicode characters. Escapes are not processed.The number of
#must be the same at the beginning and the end. It can be zero. The string cannot contain"###because this sequence would be interpreted as the end of the string. the number of#s can be chosen such that the end of string can be recognized uniquely.- Byte string
b"hello": Only ascii characters or escapes which represent ascii characters are allowed.- Raw byte string
br###"hello"###: Like a raw string with ascii only characters. Escapes are not processed.#s can be used to recognize uniquely the end of the string.
3.7.9. Whitespace
A whitespace is one of:
A sequence of blanks
A newline
U+000Aor carriage return characterU+000DA comment
Note that tabs are not valid lexical tokens. Reason: Since the language is indentation sensitive and tabs are interpreted differently by different editors, the layout would not be well defined independently from the tabsize.
3.7.10. Allowed UTF-8 Characters
In strings and characters all UTF-8 code points are alllowed. In the remainder only the following are valid UTF-8 code points within source code.
Symbol |
Code Point |
Ascii Equivalent |
α |
U+03B1 |
|
β |
U+03B2 |
|
γ |
U+03B3 |
|
λ |
U+03BB |
|
≤ |
U+2264 |
|
≥ |
U+2265 |
|
∀ |
U+2200 |
|
→ |
U+2192 |
|
⊢ |
U+22A2 |
|
⊨ |
U+22A8 |
|
∧ |
U+2227 |
/\ |
∨ |
U+2228 |
\/ |
¬ |
U+00ac |
|
ℕ |
U+2115 |
|
ℤ |
U+2124 |
|
3.7.11. UTF-8 Encoding
A unicode code point in UTF-8 encoding consists of one or more bytes. The first byte encodes how many continuation bytes are following to encode the code point. Rule: In order to encode a unicode code point in UTF-8 the byte sequence must have minimal length.
First byte in a unicode byte sequence
0xxxxxxx 1 byte sequence i.e. ascii character
110xxxxx 2 byte sequence
1110xxxx 3 byte sequence
11110xxx 4 byte sequence
Continuation byte
10xxxxxx 6 bits of information
The following encodings are possible:
- 1 Byte Sequence
All Ascii characters up to
U+7Fi.e. 7 bits.- 2 Byte Sequence
Unicode code points up to
U+07FFi.e. up to 11 bits.- 3 Byte Sequence
Unicode code points up to
U+FFFFi.e. up to 16 bits.- 4 Byte Sequence
Unicode code points up to
U+10FFFFi.e. up to 21 bits.
According to the unicode standard the values between U+D800 and U+DFFF
are not valid code points (they are used to encode surrogate pairs in UTF-16).
3.7.8. Comments