Adventures in Parsing: PHP’s implicit semicolon (‘;’) before every close tag

Simply put, every time you enter a PHP closing tag, the interpreter automatically adds a semicolon to end the current statement before switching to HTML mode. In general, this is fairly innocuous, and it might even seem like the interpreter is doing you a favor: you can omit the semicolon just before the end of a PHP block, which is convenient for quick calls to “echo” or, potentially, for converting the short echo tag (<?=) into a long open tag with an explicit echo, e.g.

<?php echo $foo?>

A conversion like this could conceivably be automated with code not unlike the following:

<?php
$contents = str_replace('<?=', '<?php echo ', $contents);

So, at the end of the day, this feature seems kindly thought out and shouldn’t cause any problems (more on that later). Still, it can be used to create interesting programming conundrums, such as writing an entire script without a single explicit semicolon. A trivial example:

<?php
for ($a = 0 ?><?php $a < 10 ?><?php $a++)
    print $a ?><?php 
if ($a >= 10): ?> hooray <?php endif ?>

It’s interesting to note that you can even do this inside of a control structure like “for(;;)”, but it makes sense when we break down the way that PHP’s parser works in a little more detail. The token for ?>, or the close tag, is T_CLOSE_TAG; for <?php (since I prefer long open tags) it is T_OPEN_TAG*, and anything outside of PHP tags is T_INLINE_HTML. Thus, the token stream for “<?php ?> anything you want goes here <?php” is “T_OPEN_TAG T_CLOSE_TAG T_INLINE_HTML T_OPEN_TAG”**. This is distinct in an important way from the earlier example, where we only used “?><?php” repeatedly with nothing in between, so each switch produced simply “T_CLOSE_TAG T_OPEN_TAG” with no T_INLINE_HTML.

*NOTE: Technically, the T_OPEN_TAG token also includes the first whitespace character immediately following “<?php”, and there must be one.

**NOTE: This only applies if there is any code following the last instance of T_OPEN_TAG; if not, and the document ends there (or after zero or more whitespace characters), then the latter instance will not be considered T_OPEN_TAG, but will instead be a second instance of T_INLINE_HTML. As such, if we had ended the sample above with another “<?php”, it would have been included in the output rather than starting the parser again. This behavior is version-dependent: since PHP 7.4, a “<?php” at the very end of a file is interpreted as an opening tag.
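
If you want to inspect these token streams yourself, PHP's built-in tokenizer can list them. The following is only a minimal sketch, assuming the tokenizer extension is enabled (it is by default); token_names() is a hypothetical helper name, not something PHP provides. Whitespace shows up as its own T_WHITESPACE token, and the result for a trailing open tag can differ between PHP versions, so no expected output is shown.

```php
<?php
// Hypothetical helper: return the name of every token in $source.
// Single-character tokens such as ';' come back from token_get_all()
// as plain strings, so those are passed through unchanged.
function token_names(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

// A stream with inline text between the tags, then one without any.
echo implode(' ', token_names('<?php ?> anything you want goes here <?php echo 1;')), "\n";
echo implode(' ', token_names('<?php echo 1 ?><?php echo 2 ?>')), "\n";
```

Note that token_get_all() in this default mode only runs the lexer, which is why T_CLOSE_TAG is still visible here as a token of its own.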

The example works because of the way PHP treats these tokens when parsing. The two important rules in play here are:

  1. T_OPEN_TAG is a sentinel that indicates when the PHP parser should become active, but in and of itself it has no meaning; as such, it is ignored during parsing.
  2. T_CLOSE_TAG has an implicit semicolon added before it during parsing. Since it is otherwise just a sentinel that the parser ignores, it becomes syntactically equivalent to a semicolon.

The second rule becomes especially evident in the error message produced when a close tag appears in the middle of an expression. Consider the following syntactically incorrect code fragment:

<?php $foo = 1 + ?> anything here <?php 5;

The error message reads:

Parse error: syntax error, unexpected ';'

Notice that there is no mention of an unexpected T_CLOSE_TAG.
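
One way to reproduce that message without executing the fragment is to ask the tokenizer to run the parser as well. This is a sketch under a few assumptions: it needs PHP 7.0 or later (for the TOKEN_PARSE flag and the ParseError class), parse_error_for() is a hypothetical helper name, and the exact wording of the message differs between PHP versions.

```php
<?php
// Hypothetical helper: return the parser's complaint about $source,
// or null when the source parses cleanly. TOKEN_PARSE (PHP 7.0+) makes
// token_get_all() run the parser, which throws ParseError on bad input.
function parse_error_for(string $source): ?string
{
    try {
        token_get_all($source, TOKEN_PARSE);
        return null;
    } catch (ParseError $e) {
        return $e->getMessage();
    }
}

// The broken fragment from above, followed by a valid semicolon-free block.
var_dump(parse_error_for('<?php $foo = 1 + ?> anything here <?php 5;'));
var_dump(parse_error_for('<?php echo $foo ?>'));
```

The first call should report the unexpected semicolon, while the second should return null because the close tag legitimately ends the echo statement.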

There is at least one case where this behaves in a way you may not expect: control structures used without braces for their bodies. Consider the following:

<?php
$foo = 10;
 
if ($foo > 20) ?> some output here <?php
else ?> more output <?php
//Parse error: syntax error, unexpected T_ELSE

This code will generate a parse error as it stands. If we removed the else clause, it would merely behave in an unexpected manner, showing our output even though the condition should have clearly failed (and did).
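
The version without the else clause can be tried in isolation. The sketch below wraps it in a function and uses output buffering purely so the result can be inspected; render_without_braces() is a hypothetical name introduced for illustration.

```php
<?php
// Hypothetical helper that captures what the brace-less conditional prints.
function render_without_braces(int $foo): string
{
    ob_start();
    // The close tag ends the if statement with an empty body, so the
    // inline text that follows is an ordinary, unconditional statement.
    if ($foo > 20) ?> some output here <?php
    return ob_get_clean();
}

// The text is returned even though 10 > 20 is false.
var_dump(render_without_braces(10));
```

In effect, the conditional line reads as “if ($foo > 20);” followed by an unconditional echo of the inline text.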

When the T_CLOSE_TAG is encountered, the current statement is concluded as if by a semicolon, and that (empty) statement happens to be the entire body of the conditional. The T_CLOSE_TAG token is otherwise ignored, and we come upon T_INLINE_HTML, which is fine, followed by T_OPEN_TAG, which is still fine. Then we come across a T_ELSE, which does not make sense because we are no longer in the middle of parsing a conditional statement; hence the parse error.

Of course, I would consider that more of an argument for always using braces with your control structures than anything else, but that’s just one man’s opinion.
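
For comparison, here is a sketch of the same conditional with braces. The function wrapper and output buffering are only there to make the result easy to check, and render_with_braces() is a hypothetical name.

```php
<?php
// Hypothetical helper that captures the output of the braced version.
function render_with_braces(int $foo): string
{
    ob_start();
    // Each close tag still acts as a semicolon, but it now falls inside
    // a braced block, so the inline text stays attached to its branch.
    if ($foo > 20) { ?> some output here <?php
    } else { ?> more output <?php
    }
    return ob_get_clean();
}

var_dump(render_with_braces(10)); // takes the else branch
var_dump(render_with_braces(30)); // takes the if branch
```

The alternative syntax used in the semicolon-free example earlier (“if (...): ... endif”) should avoid the problem in the same way, since the body is delimited explicitly rather than being a single statement.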
