[Gmsh] file format
Rui Maciel
rui.maciel at gmail.com
Mon Mar 15 15:23:27 CET 2010
Christophe Geuzaine wrote:
> Hi - indeed, the "-4" is not documented yet (it only appears in nightly
> builds and in svn). We are testing ghost cells generation, which will be
> a new feature in Gmsh 2.5. (Gmsh 2.5 will bump the .msh file version to
> 2.2, with better support for partitioned meshes.)
Hi,
That is good news. It's great to know that Gmsh keeps moving forward.
Regarding the .msh file format, and as it's currently under development, is it
possible to alter the format in order for it to be easier to parse?
My main gripe is with the tokens adopted for the <elm-type> and <number-of-tags>
entries in the $Elements field. Up until now an integer has been used as the
terminal token for, in effect, every entry in that field, which has the nasty
consequence of making it a bit complicated to use lexical analyzer/parser
generators such as Flex and Bison to write parsers for the .msh file format.
For example, consider the following (incomplete) production:
<Elements field>        ::= <Open elements field> <number of elements> *<element>
                            <Close elements field>
<Open elements field>   ::= "$Elements" <new line>
<Close elements field>  ::= "$EndElements" <new line>
<number of elements>    ::= <positive integer> <new line>
<element>               ::= <elm-number> <element definition> <new line>
<elm-number>            ::= <positive integer>
<element definition>    ::= <2-node line> <tags> <node> <node> |
                            <3-node triangle> <tags> <node> <node> <node> |
                            <4-node quadrangle> <tags> <node> <node> <node> <node> |
                            etc etc etc...
<2-node line>           ::= "1"
<3-node triangle>       ::= "2"
<node>                  ::= <positive integer>
<tags>                  ::= <zero tags> |
                            <one tag> <physical entity tag> |
                            <two tags> <physical entity tag> <geometrical entity tag> |
                            etc etc etc...
<zero tags>             ::= "0"
<one tag>               ::= "1"
<two tags>              ::= "2"
etc etc etc...
From that production, if the terminal tokens used to describe the <elm-type>
and <number-of-tags> happen to also be matched by some other pattern used to
describe other non-terminal fields then it isn't possible to detect the right
token type of a given string or even if a certain element definition is
valid, without being forced to keep track of the language's context. For
example, let's say we need to parse the following elements field:
$Elements
2
1 3 2 99 2 1 2 3 4
2 3 2 99 2 2 5 6 3
$EndElements
Without tracking the token's context, if the lexical analyzer stumbles on a
string consisting of "1", the lexical analyzer won't be able to say if
it's supposed to be a <positive integer>, a <2-node line> or a <one tag>
terminal. As a consequence, the lexical analyzer won't be able to
unambiguously pass the correct token interpretation to the parser, which means
all hell breaks loose.
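
To make the ambiguity concrete, here is a small Python sketch (not Flex code) that lists every terminal of the grammar above that a given string could stand for. The function name `candidate_terminals` and the two lookup tables are only illustrative, and they cover just the terminals spelled out in the production.

```python
import re

INTEGER = re.compile(r"\d+")

# Terminals taken from the (incomplete) production above.
ELEMENT_TYPES = {"1": "<2-node line>", "2": "<3-node triangle>"}
TAG_COUNTS = {"0": "<zero tags>", "1": "<one tag>", "2": "<two tags>"}


def candidate_terminals(lexeme):
    """Return every terminal the lexeme could match when context is ignored."""
    candidates = []
    if INTEGER.fullmatch(lexeme) and int(lexeme) > 0:
        candidates.append("<positive integer>")
    if lexeme in ELEMENT_TYPES:
        candidates.append(ELEMENT_TYPES[lexeme])
    if lexeme in TAG_COUNTS:
        candidates.append(TAG_COUNTS[lexeme])
    return candidates


if __name__ == "__main__":
    # Walk over the first element line of the example field.
    for lexeme in "1 3 2 99 2 1 2 3 4".split():
        print(lexeme, candidate_terminals(lexeme))
```

Any string with more than one candidate is one the lexical analyzer cannot classify on its own.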
Granted, it's also possible to rely on a production similar to the following:
<Elements field>        ::= <Open elements field> <number of elements> *<element>
                            <Close elements field>
<Open elements field>   ::= "$Elements" <new line>
<Close elements field>  ::= "$EndElements" <new line>
<number of elements>    ::= <positive integer> <new line>
<element>               ::= *<positive integer> <new line>
The downside of this sort of production is that it doesn't avoid the need to
analyze those tokens. It only sweeps the problem under the rug, as it relies on
subsequent steps to recover the context and, from that, check whether the
element line was valid, interpret the meaning of each token and extract all the
information needed to define that particular element.
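
As a rough illustration of that extra step, the Python sketch below takes one element line that the generic production has already accepted as a run of integers and recovers its meaning by position. The function name `interpret_element` is made up for this example, and the node-count table only covers the line, triangle and quadrangle types (1, 2 and 3).

```python
# Number of nodes for a few element types of the current .msh format:
# 1 = 2-node line, 2 = 3-node triangle, 3 = 4-node quadrangle.
NODES_PER_TYPE = {1: 2, 2: 3, 3: 4}


def interpret_element(line):
    """Interpret a list of integers as: number, type, tag count, tags, nodes."""
    values = [int(token) for token in line.split()]
    if len(values) < 3:
        raise ValueError("element line is too short")

    number, elm_type, n_tags = values[:3]
    if elm_type not in NODES_PER_TYPE:
        raise ValueError(f"unsupported element type: {elm_type}")

    # The meaning of every remaining integer depends on the two values above.
    tags = values[3:3 + n_tags]
    nodes = values[3 + n_tags:]
    if len(tags) != n_tags or len(nodes) != NODES_PER_TYPE[elm_type]:
        raise ValueError("wrong number of tags or nodes")

    return {"number": number, "type": elm_type, "tags": tags, "nodes": nodes}
```

All of the validation that the grammar could not express ends up in hand-written code like this.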
One way to avoid these inconveniences is to eliminate the ambiguity inherent
to the adopted set of terminal tokens, by adopting different, unambiguous
tokens to describe both the element types and the number of tags. For example,
let's say that, instead of integers, a set of string literals is used to
describe element types. Let's say that the string "quad4" is used to describe a
4-node quadrilateral and that "tags2" is used to start a tags field holding two
tags. The previous elements field would look like:
$Elements
2
1 quad4 tags2 99 2 1 2 3 4
2 quad4 tags2 99 2 2 5 6 3
$EndElements
By doing this we removed all ambiguities which were present in the grammar,
which makes this format much easier to parse. Also, the file format becomes a
bit easier to read by humble humans.
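
A quick way to see the difference is a context-free tokenizer for the proposed line format, sketched here in Python. The token names (`ELEMENT_TYPE`, `TAG_COUNT`, `INTEGER`) and the keyword spellings other than "quad4" and "tags2" are assumptions made for this example.

```python
import re

# One pattern per token type; no lexeme matches more than one pattern.
TOKEN_RULES = [
    ("ELEMENT_TYPE", re.compile(r"(?:line|triangle|quad)\d+")),
    ("TAG_COUNT", re.compile(r"tags\d+")),
    ("INTEGER", re.compile(r"\d+")),
]


def tokenize(line):
    """Classify each whitespace-separated lexeme without any context."""
    tokens = []
    for lexeme in line.split():
        for name, pattern in TOKEN_RULES:
            if pattern.fullmatch(lexeme):
                tokens.append((name, lexeme))
                break
        else:
            raise ValueError(f"unrecognised lexeme: {lexeme!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize("1 quad4 tags2 99 2 1 2 3 4"))
```

Each lexeme now maps to exactly one token type, so the lexical analyzer no longer needs to know where it is in the line.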
But it's also possible to further improve this format. Let's say we place the
tags definition after the element's node list. The elements field shown before
would look like:
$Elements
2
1 quad4 1 2 3 4 tags2 99 2
2 quad4 2 5 6 3 tags2 99 2
$EndElements
This change makes this format much easier to parse. After the lexical
analyzer stumbles on a "quad4" string the parser will unequivocally expect 4
nodal references, which any parser can easily handle. Moreover, the
<number-of-tags> token is no longer needed to unambiguously parse the element
tags, as it's possible to interpret what tags were defined by checking what
integer tokens were present after the last nodal reference. So, for example,
the elements field can assume the following format:
$Elements
2
1 quad4 1 2 3 4 99 2
2 quad4 2 5 6 3 99 2
$EndElements
...which is matched by the following production:
<Elements field>        ::= <Open elements field> <number of elements> *<element>
                            <Close elements field>
<Open elements field>   ::= "$Elements" <new line>
<Close elements field>  ::= "$EndElements" <new line>
<number of elements>    ::= <positive integer> <new line>
<element>               ::= <elm-number> <element definition> <tags>
<elm-number>            ::= <positive integer>
<element definition>    ::= "line2" <node> <node> |
                            "triangle3" <node> <node> <node> |
                            "quad4" <node> <node> <node> <node> |
                            etc etc etc...
<tags>                  ::= <new line> |
                            <physical entity tag> <new line> |
                            <physical entity tag> <geometrical entity tag> <new line> |
                            etc etc etc...
...which, in essence, means that if a "quad4" string is matched then the
parser is put in a state in which it expects to get a set of 4 positive
integers. After parsing those 4 integers, the parser is placed in a state
where it accepts the following scenarios:
a) a <positive integer> is passed, which means this element has a <physical
entity tag>
b) a <new line> is passed, which means this element definition has finished.
In this state, if the next token is an integer then the parser extracts that
element's physical entity and is placed in a state where it accepts the
following scenarios:
a) a <positive integer> is passed, which means this element has a <geometrical
entity tag>
b) a <new line> is passed. Element completely defined.
The same applies to the remaining tags.
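
The state transitions described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a proposal for Gmsh's reader: the function name `parse_element` and the `NODE_COUNT` table are invented for the example, and only the three keywords that appear in the production are supported.

```python
# Number of nodal references that follows each element keyword.
NODE_COUNT = {"line2": 2, "triangle3": 3, "quad4": 4}


def parse_element(line):
    """Parse '<elm-number> <keyword> <nodes...> [<tags...>]' from one line."""
    tokens = iter(line.split())
    try:
        number = int(next(tokens))
        keyword = next(tokens)
    except StopIteration:
        raise ValueError("truncated element line") from None
    if keyword not in NODE_COUNT:
        raise ValueError(f"unknown element keyword: {keyword!r}")

    nodes, tags = [], []
    for token in tokens:
        value = int(token)
        if len(nodes) < NODE_COUNT[keyword]:
            # State: the keyword told us how many nodes to expect.
            nodes.append(value)
        else:
            # State: node list complete, so every further integer is a tag
            # (physical entity first, then geometrical entity, and so on).
            tags.append(value)

    # Reaching the end of the line plays the role of the <new line> token.
    if len(nodes) != NODE_COUNT[keyword]:
        raise ValueError("too few nodes for " + keyword)
    return {"number": number, "type": keyword, "nodes": nodes, "tags": tags}
```

For instance, `parse_element("1 quad4 1 2 3 4 99 2")` yields the nodes `[1, 2, 3, 4]` and the tags `[99, 2]`, while a line with no integers after the node list simply yields an empty tag list.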
I understand that it could be a bit painful to tweak Gmsh's import routine to
accept this format. Yet, the added simplicity would be a godsend to those who
rely on lexical analyzer/parser generators such as Flex and Bison to write
parsers for this particular format. Such generators are a great tool to have
when the parser has to be updated routinely to support new versions of the
format.
Best regards,
Rui Maciel
P.S.: sorry for the terribly long post.