[Gmsh] file format

Mon Mar 15 15:23:27 CET 2010

Christophe Geuzaine wrote:
> Hi - indeed, the "-4" is not documented yet (it only appears in nightly 
> builds and in svn). We are testing ghost cells generation, which will be 
> a new feature in Gmsh 2.5. (Gmsh 2.5 will bump the .msh file version to 
> 2.2, with better support for partitioned meshes.)

Hi,

That is good news.  It's great to know that Gmsh keeps moving forward.

Regarding the .msh file format, and as it's currently under development, is it 
possible to alter the format in order for it to be easier to parse? 

My main gripe is with the tokens adopted for the <elm-type> and 
<number-of-tags> entries of the $Elements field.  Up until now a plain integer 
has been used for every terminal token in that field, which has the nasty 
consequence of making it a bit complicated to use lexical analyzer/parser 
generators such as Flex & Bison to write parsers for the .msh file format.  
For example, consider the following (incomplete) production:

<Elements field> ::= <Open elements field> <number of elements> *<element> 
<Close elements field>

<Open elements field> ::= "$Elements" <new line>

<Close elements field> ::= "$EndElements" <new line>

<number of elements> ::= <positive integer> <new line>

<element> ::= <elm-number> <element definition> <new line>

<elm-number> ::= <positive integer>

<element definition> ::= <2-node line>       <tags> <node> <node> |
                         <3-node triangle>   <tags> <node> <node> <node> |
                         <4-node quadrangle> <tags> <node> <node> <node> <node> |
                         etc etc etc...

<2-node line> ::= "1"

<3-node triangle> ::= "2"

<node> ::= <positive integer>

<tags> ::= <zero tags> |
           <one tag> <physical entity tag> |
           <two tags> <physical entity tag> <geometrical entity tag> |
           etc etc etc...

<zero tags> ::= "0"

<one tag> ::= "1"

<two tags> ::= "2"

etc etc etc...

From that production, if the terminal tokens used to describe the <elm-type> 
and <number-of-tags> happen to also be matched by some other pattern used to 
describe other non-terminal fields then it isn't possible to detect the right 
token type of a given string or even if a certain element definition is 
valid, without being forced to keep track of the language's context.  For 
example, let's say we need to parse the following elements field:

$Elements
2
1 3 2 99 2 1 2 3 4
2 3 2 99 2 2 5 6 3
$EndElements

Without tracking each token's context, when the lexical analyzer stumbles on a 
string consisting of "1" it can't tell whether it's supposed to be a 
<positive integer>, a <2-node line> or a <one tag> terminal.  As a 
consequence, it can't unambiguously pass the correct token type to the parser, 
which means all hell breaks loose.
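
To make the ambiguity concrete, here is a minimal Python sketch (not Flex 
code) that lists every terminal of the production above that a single lexeme 
could stand for.  The name candidate_terminals and the lookup table are only 
illustrative assumptions, and only the terminals spelled out in the production 
are included.

```python
import re

# Literal terminals spelled out in the (incomplete) production above.
LITERAL_TERMINALS = [
    ("<2-node line>", "1"),
    ("<3-node triangle>", "2"),
    ("<zero tags>", "0"),
    ("<one tag>", "1"),
    ("<two tags>", "2"),
]


def candidate_terminals(lexeme):
    """Return every terminal a context-free lexer could assign to lexeme."""
    candidates = [name for name, literal in LITERAL_TERMINALS if lexeme == literal]
    # <positive integer>: one or more digits, not starting with zero.
    if re.fullmatch(r"[1-9][0-9]*", lexeme):
        candidates.append("<positive integer>")
    return candidates


print(candidate_terminals("1"))
print(candidate_terminals("99"))
```

The lexeme "1" comes back with three candidates (<2-node line>, <one tag> and 
<positive integer>), while "99" comes back with only <positive integer>.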

Granted, it's also possible to rely on a production similar to the following:

<Elements field> ::= <Open elements field> <number of elements> *<element> 
<Close elements field>

<Open elements field> ::= "$Elements" <new line>
<Close elements field> ::= "$EndElements" <new line>

<number of elements> ::= <positive integer> <new line>

<element> ::= *<positive integer> <new line>

The downside of this sort of production is that it doesn't avoid the need to 
analyze those tokens.  It only sweeps the problem under the rug, as it relies 
on subsequent steps to extract the context and, from that, check whether the 
element line was valid, interpret the meaning of each token and extract all 
the information needed to define that particular element.
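
As a rough illustration of those subsequent steps, the following Python sketch 
takes one element line of the current format, already reduced to a flat list 
of integers, and recovers its meaning by position.  The function name 
interpret_element and the returned dictionary layout are assumptions made for 
this example, and only the three element types mentioned in this post are 
handled.

```python
# Number of nodes for each <elm-type> mentioned in this post:
# 1 = 2-node line, 2 = 3-node triangle, 3 = 4-node quadrangle.
NODES_PER_TYPE = {1: 2, 2: 3, 3: 4}


def interpret_element(line):
    """Interpret one element line of the current all-integer format."""
    values = [int(token) for token in line.split()]
    if len(values) < 3:
        raise ValueError("element line is too short")
    elm_number, elm_type, number_of_tags = values[:3]
    if elm_type not in NODES_PER_TYPE:
        raise ValueError(f"unsupported elm-type {elm_type}")
    # The meaning of every remaining integer depends on its position.
    tags = values[3:3 + number_of_tags]
    nodes = values[3 + number_of_tags:]
    if len(tags) != number_of_tags or len(nodes) != NODES_PER_TYPE[elm_type]:
        raise ValueError("wrong number of tags or nodes")
    return {"number": elm_number, "type": elm_type, "tags": tags, "nodes": nodes}


print(interpret_element("1 3 2 99 2 1 2 3 4"))
```

All the validation ends up in hand-written code, outside of whatever grammar 
was handed to the parser generator.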

One way to avoid these inconveniences is to eliminate the ambiguity inherent 
to the adopted set of terminal tokens, by adopting different, unambiguous 
tokens to describe both the element types and the number of tags.  For 
example, let's say that, instead of integers, a set of string literals is used 
to describe element types.  Let's say that the string "quad4" is used to 
describe a 4-node quadrangle and that "tags2" is used to start a tags field 
holding two tags.  The previous elements field would look like:

$Elements
2
1 quad4 tags2 99 2 1 2 3 4
2 quad4 tags2 99 2 2 5 6 3
$EndElements

By doing this we removed all ambiguities which were present in the grammar, 
which makes this format much easier to parse.  Also, the file format becomes a 
bit easier to read by humble humans.
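
A minimal Python sketch of a lexer for such a format could look like the 
following.  The token class names (ELEMENT_TYPE, TAG_COUNT, INTEGER) are made 
up for this example, and the general "tags" followed by a count pattern is an 
assumption extrapolated from "tags2".

```python
import re

# Each lexeme now matches exactly one pattern, so no context is needed.
TOKEN_RE = re.compile(
    r"(?P<ELEMENT_TYPE>line2|triangle3|quad4)"
    r"|(?P<TAG_COUNT>tags[0-9]+)"
    r"|(?P<INTEGER>[0-9]+)"
)


def tokenize(line):
    """Split one element line into (token class, lexeme) pairs."""
    tokens = []
    for lexeme in line.split():
        match = TOKEN_RE.fullmatch(lexeme)
        if match is None:
            raise ValueError(f"unrecognized lexeme {lexeme!r}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, lexeme))
    return tokens


print(tokenize("1 quad4 tags2 99 2 1 2 3 4"))
```

Here "quad4" can only ever be an ELEMENT_TYPE and "tags2" can only ever be a 
TAG_COUNT, whatever position they appear in.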

But it's also possible to further improve this format.  Let's say we place the 
tags definition after the element's node list. The elements field shown before 
would look like:

$Elements
2
1 quad4 1 2 3 4 tags2 99 2 
2 quad4 2 5 6 3 tags2 99 2
$EndElements

This change makes this format much easier to parse.  After the lexical 
analyzer stumbles on a "quad4" string the parser will unequivocally expect 4 
nodal references, which any parser can easily handle.  Moreover, the 
<number-of-tags> token is no longer needed to unambiguously parse the element 
tags, as it's possible to interpret what tags were defined by checking what 
integer tokens are present after the last nodal reference.  So, for example, 
the elements field can assume the following format:

$Elements
2
1 quad4 1 2 3 4 99 2 
2 quad4 2 5 6 3 99 2
$EndElements

...which is matched by the following production:

<Elements field> ::= <Open elements field> <number of elements> *<element> 
<Close elements field>

<Open elements field> ::= "$Elements" <new line>
<Close elements field> ::= "$EndElements" <new line>

<number of elements> ::= <positive integer> <new line>

<element> ::= <elm-number> <element definition> <tags>

<elm-number> ::= <positive integer>

<element definition> ::= "line2"     <node> <node> |
                         "triangle3" <node> <node> <node> |
                         "quad4"     <node> <node> <node> <node> |
                         etc etc etc...

<tags> ::= <new line> |
           <physical entity tag> <new line> |
           <physical entity tag> <geometrical entity tag> <new line> | etc etc 
etc...

...which, in essence, means that if a "quad4" string is matched then the 
parser is put in a state in which it expects to get a set of 4 positive 
integers.  After parsing those 4 integers, the parser is placed in a state 
where it accepts the following scenarios:

a) a <positive integer> is passed, which means this element has a <physical 
entity tag>
b) a <new line> is passed, which means this element definition has finished.

In this state, if the next token is an integer then the parser extracts that 
element's physical entity tag and is placed in a state where it accepts the 
following scenarios:

a) a <positive integer> is passed, which means this element has a <geometrical 
entity tag>
b) a <new line> is passed. Element completely defined.

The same applies to the remaining tags.
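
The states described above can be sketched in a few lines of Python.  This is 
only an illustration of the idea, not a proposal for Gmsh's actual import 
routine: the name parse_element, the state names and the returned dictionary 
are assumptions made for this example.

```python
# Node count implied by each element keyword of the proposed format.
NODE_COUNTS = {"line2": 2, "triangle3": 3, "quad4": 4}


def parse_element(line):
    """Parse one element line such as '1 quad4 1 2 3 4 99 2'."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected <elm-number> and an element keyword")
    elm_number = int(tokens[0])
    keyword = tokens[1]
    if keyword not in NODE_COUNTS:
        raise ValueError(f"unknown element keyword {keyword!r}")
    nodes, tags = [], []
    state = "NODES"  # the keyword tells us how many nodes to expect
    for token in tokens[2:]:
        value = int(token)  # raises ValueError on a non-integer token
        if state == "NODES":
            nodes.append(value)
            if len(nodes) == NODE_COUNTS[keyword]:
                state = "TAGS"  # from now on every integer is the next tag
        else:
            tags.append(value)
    if state == "NODES":
        raise ValueError("line ended before all nodes were read")
    # tags[0] is the physical entity tag and tags[1] the geometrical one.
    return {"number": elm_number, "type": keyword, "nodes": nodes, "tags": tags}


print(parse_element("1 quad4 1 2 3 4 99 2"))
```

The end of the line plays the role of the <new line> token: whenever it is 
reached in the TAGS state the element is complete, with however many tags were 
read up to that point.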

I understand that it could be a bit painful to tweak Gmsh's import routine to 
accept this format.  Yet, the added simplicity would be a godsend to those who 
rely on lexical analyzer/parser generators such as Flex & Bison to write 
parsers for this particular format.  Such generators are great tools to have 
when there is a need to routinely update the parser to support new versions of 
the format.

Best regards,
Rui Maciel

P.S.: sorry for the terribly long post.