Adventures in Parsing: PHP’s implicit semicolon (‘;’) before every close tag

Simply put, every time you enter a PHP closing tag, the interpreter automatically adds a semicolon to end the current statement before switching to HTML mode.  In general, this is fairly innocuous, and it might even seem like the interpreter is doing you a favor, since now you can omit semicolons just before the end of a PHP block, which might be convenient for quick calls to “echo” or, potentially, for converting the short-open print shortcut (<?=) into long-open explicit echos, e.g.

<?php echo $foo?>

This could conceivably have been produced automatically by code not unlike the following:

<?php
$contents = str_replace('<?=', '<?php echo ', $contents);

So, at the end of the day, this feature seems like it was kindly thought out and shouldn’t cause any problems (more on that later). Still, it can be used to create interesting programming conundrums, such as writing an entire script without a single explicit semicolon. A trivial example:

<?php
for ($a = 0 ?><?php $a < 10 ?><?php $a++)
    print $a ?><?php 
if ($a >= 10): ?> hooray <?php endif ?>

It’s interesting to note that you can even do this inside of a control structure like “for(;;)”, but it makes sense when we break down the way that PHP’s parser works in a little more detail. The token for ?>, or the close tag, is T_CLOSE_TAG; for <?php (since I prefer long open tags) it is T_OPEN_TAG*, and anything outside of PHP tags is T_INLINE_HTML. Thus, the token stream for “<?php ?> anything you want goes here <?php” is “T_OPEN_TAG T_CLOSE_TAG T_INLINE_HTML T_OPEN_TAG”**. This is distinct in an important way from the earlier example, however, where we only used “?><?php” repeatedly, so the token stream there was simply “T_CLOSE_TAG T_OPEN_TAG”.

*NOTE: Technically, it also includes the first character of whitespace immediately following the open tag, and there must be one.

**NOTE: This only applies if there is any code following the last instance of T_OPEN_TAG; if not, and the document ends there (or after zero or more whitespace characters) then the latter instance will not be considered T_OPEN_TAG, but instead it will be a second instance of T_INLINE_HTML. As such, if we had ended the sample above with another T_OPEN_TAG, it would have been included in the output rather than starting the parser again.
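To see the token stream for yourself, here is a minimal sketch using the tokenizer extension's token_get_all() and token_name(). The helper name tokenNames() is my own invention for illustration, and I have added "echo 1;" after the last open tag so that it is followed by code, as the second note requires.

```php
<?php
// Hypothetical helper (not from the original post): list token names for a source string.
function tokenNames(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        // Multi-character tokens arrive as arrays [id, text, line];
        // single-character tokens such as ";" arrive as plain strings.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

$names = tokenNames('<?php ?> anything you want goes here <?php echo 1;');

// Show only the first four tokens, which are the ones discussed above.
echo implode(' ', array_slice($names, 0, 4)), "\n";
```

Note that the tokenizer reports T_CLOSE_TAG as its own token; the conversion to a semicolon only happens when the parser is consuming the stream.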

The reason the example given works is because of the way PHP treats these tokens when parsing. The two important rules in play here are:

  1. T_OPEN_TAG is a sentinel that indicates when the PHP parser should be active or not, but in and of itself it has no meaning; as such, it is ignored during parsing
  2. T_CLOSE_TAG has an implicit semicolon added before it during parsing; beyond that it is just a sentinel and is otherwise ignored, so it becomes syntactically equivalent to a semicolon.

The second rule becomes especially evident when evaluating the error message returned for using it in the middle of an expression. Consider the following syntactically incorrect code fragment:

<?php $foo = 1 + ?> anything here <?php 5;

The error message reads:

Parse error: syntax error, unexpected ';'

— notice there is no mention of an unexpected T_CLOSE_TAG.
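If you would rather inspect the message than crash the script, the following sketch feeds the same fragment to eval() and catches the resulting ParseError. It assumes PHP 7 or later, where parse errors inside eval() are thrown as exceptions; the exact wording of the message varies between PHP versions, so the only thing I would rely on is that it names the semicolon rather than the close tag.

```php
<?php
$message = null;

try {
    // The same broken fragment as above. eval() starts in PHP mode,
    // so the leading open tag is left off.
    eval('$foo = 1 + ?> anything here <?php 5;');
} catch (ParseError $e) {
    $message = $e->getMessage();
}

echo $message, "\n";
```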

There is at least one case where this behaves in a way you may not expect: control structures used without braces for their bodies. Consider the following:

<?php
$foo = 10;
 
if ($foo > 20) ?> some output here <?php
else ?> more output <?php
//Parse error: syntax error, unexpected T_ELSE

This code will generate a parse error as it stands. If we removed the else clause, it would merely behave in an unexpected manner, showing our output even though the condition should have clearly failed (and did).

When the T_CLOSE_TAG is encountered, the current statement is concluded as with a semicolon, and that statement happens to be the entire body of the conditional. The T_CLOSE_TAG token is otherwise ignored, and we come upon T_INLINE_HTML, which is fine, followed by T_OPEN_TAG, which is still fine, but then we come across a T_ELSE, which does not make sense because we are not currently in the middle of parsing a conditional statement, and thus the parse error.

Of course, I would consider that more of an argument for always using braces with your control structures than anything else, but that’s just one man’s opinion.
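For comparison, here is a sketch of the same conditional with braces. The output buffering calls are only there so the result can be captured and inspected; they are my addition and not part of the original example.

```php
<?php
$foo = 10;

ob_start();
// With braces, the implicit semicolon at each close tag ends an (empty)
// statement inside the block instead of ending the whole if-statement.
if ($foo > 20) { ?> some output here <?php
} else { ?> more output <?php
}
$output = ob_get_clean();

var_dump(trim($output));
```

Because the condition fails, only the else branch's inline HTML ends up in the buffer.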

January 29th, 2013 by Clark

The importance of ZVals and Circular References

Just a quick post for now. Do you know how PHP’s symbol table works? To put it in a nutshell, symbols are stored in one place and values (also called ZVals) are stored in another. Normally, this abstraction will mean nothing to you, but take the following sample code:

$foo = &$bar;
$bar = &$foo;

Pretty basic circular reference, and one that might be pretty difficult to assign in a few other languages. Now what? Well, let’s take a look at another reference construct for a moment.

$a = 'foo';
$b = 'bar';
$x = &$a;
$y = &$x;
$z = &$y;
 
var_dump($x, $y, $z);
/*
string(3) "foo"
string(3) "foo"
string(3) "foo"
*/

Pretty much what we expected. Now, let’s throw a wrench into the mix and reassign $y by reference to &$b, and then examine the results:

$y = &$b;
 
var_dump($x, $y, $z);
/*
string(3) "foo"
string(3) "bar"
string(3) "foo"
*/

Only the value of $y changed! That is because PHP, when assigning a reference to a reference, always points at the same ZVal, instead of creating a reference chain; this is one significant way in which PHP References are NOT pointers – they’re never more than one layer deep. Let’s go back to our original example and assign a value to one of those variables:

$foo = 3;
 
var_dump($foo, $bar);
/*
int(3)
int(3)
*/

Works like a charm! This is because both references pointed at the same location in the ZVal table. But what if we start over again, and reassign $foo by reference to something else?

$foo = &$bar;
$bar = &$foo;
$baz = 'baz';
 
$foo = &$baz;
 
var_dump($foo, $bar);
/*
string(3) "baz"
NULL
*/

If you’ve been following along, this should make perfect sense. $foo is created, and pointed at a ZVal location identified by $bar; when $bar is created, it points at the same place $foo was pointed. That location, of course, is null. When $foo is reassigned, the only thing that changes is to which ZVal $foo points; if we had assigned a different value to $foo first, then $bar would still retain that value.
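The last point is easy to check. This short sketch assigns a value through $foo before re-pointing it, using the same variable names as above:

```php
<?php
$foo = &$bar;   // both names now share one (null) value
$bar = &$foo;
$foo = 3;       // written through the shared value, so $bar sees it too
$baz = 'baz';

$foo = &$baz;   // only $foo is re-pointed; $bar keeps the old value

var_dump($foo, $bar);
/*
string(3) "baz"
int(3)
*/
```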

While we’re on the topic of ZVals, I’ll mention just one more thing. PHP uses a lazy-copying (or, copy-on-write) mechanism, thanks to the ZVal table. Consider the following code:

$foo = str_repeat('x',100000);
$mem1 = memory_get_usage();
$bar1 = $bar2 = $bar3 = $bar4 = $bar5 = $bar6 = $foo;
$mem2 = memory_get_usage();
$bar1 .= "...";
$mem3 = memory_get_usage();

I leave the calls to memory_get_usage() in place so that their effects will be more obvious. If we dump those three values, we get 426040, 426408 and 526536, respectively. In the second phase, as you can see, we only increased memory usage by 368 bytes (and that includes the memory required to store the measurement itself). During the third phase, when a variable was altered, memory usage increased by 100128 bytes. PHP uses about 24 bytes of memory to make an entry into the symbol table, and 80 more to create a null entry in the ZVal table.

So, the next time you think about passing a parameter you don’t intend to modify to a function by reference in order to save memory, or returning one for the same reason, don’t worry about it so much; it’s only 24 bytes.
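A rough way to convince yourself is to measure from inside a function that receives the string by value and only reads it. This is a sketch under my own assumptions: the helper readOnly() is hypothetical, and the exact byte counts depend on your PHP version and build, so the script only checks that the growth is far smaller than the string itself.

```php
<?php
// Hypothetical helper: reads its by-value argument without modifying it and
// reports how much memory usage grew relative to the caller's baseline.
function readOnly(string $data, int $baseline): int
{
    $length = strlen($data);               // read-only access; nothing is written to $data
    return memory_get_usage() - $baseline;
}

$foo = str_repeat('x', 100000);
$baseline = memory_get_usage();
$growth = readOnly($foo, $baseline);

// If passing by value had copied the string, growth would be at least its length.
var_dump($growth < strlen($foo));
```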

April 27th, 2011 by Clark

PHP Quirks – String manipulation by offset

Just a quick update for a mild PHP Quirk/annoyance I have noticed recently while doing some manipulation of strings by character offset.

Say you have a string, such as 'abcde'. Now, suppose you want to check the value of the third character (at index 2). You might have done something like this:

$str = 'abcde';
if ($str{2} == 'c') {
  // do something...
}

And, of course, that’s all fine, well and dandy; it does what you expect and you can move on with your life. In fact, if you’re into micro-optimizations, that construct provides a great way to check a string for minimum length, and is, on average, 44% faster than using strlen(). However, you can use this same construct to change the character at a given offset in whatever string you’re working with. It works roughly as expected, but with a few gotchas:

/**
 * Gotcha #0 - Adding multiple characters to a single offset; shouldn't really be a gotcha
 */
$str = 'abc123';
 
$str{1} = 'a'; // aac123
$str{4} = '123'; // aac113
 
/**
 * Gotcha #1 - Adding characters past the end of the string
 */
$str{7} = 'c'; // aac113 c
echo ord($str{6}); // prints '32', the space character
 
/**
 * Gotcha #2 - Adding characters to an empty string
 */
$str = '';
$str{0} = 'a'; // array( 0 => 'a' )

In the case of Gotcha #1, we see that, rather than leave the “uninitialized” area between where we’ve defined characters as a null character, it has been silently converted to a space. Arguably, this is so that an isset($str[6]); check would not return false, but this is important to know if you expected the values of those spaces to remain at zero.

In the case of Gotcha #2, we see PHP’s weak typing in place; since an empty string has no offsets to begin with, attempting to add a character results in silent conversion to an array.
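Here is a small sketch of the minimum-length check and the space padding, written with square brackets rather than curly braces, since the curly-brace offset syntax was deprecated in PHP 7.4 and removed in PHP 8.0. It deliberately leaves out the empty-string case, because that behavior differs between PHP versions.

```php
<?php
$str = 'abcde';

// isset() on an offset is true only if the string is at least that long,
// which makes it a cheap minimum-length check.
var_dump(isset($str[4])); // bool(true)
var_dump(isset($str[5])); // bool(false)

// Writing past the end pads the gap with spaces, not null bytes.
$str[7] = 'c';
var_dump($str);           // string(8) "abcde  c"
var_dump(ord($str[5]));   // int(32)
```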

April 27th, 2011 by Clark

Let’s talk about your password model

First off, let me just say that I am by no means an expert cryptographer; there are all sorts of wonderful, terrible things about hashes and block ciphers that I just don’t understand (I’d like to believe that it’s because I’ve not been exposed to them, whoever’s fault that is, and that if given a chance I would get it), but that’s also why I’m writing this – to give the opinion of someone who recognizes his own weakness, and how that translates to another’s strength. Furthermore, this explanation gives a very simplistic view of web security that only examines one aspect of a secure system. For loads more information about securing your web application, take a look at “Dos and Don’ts of Client Authentication on the Web” [PDF] written by some very smart folks at M.I.T.

So, let’s start with a beginner’s introduction. In the beginning, there were users, and users wanted to be able to log in because otherwise being a user was rather pointless indeed. Thus, the password is born, and forevermore it becomes the goal of clever crackers and security experts alike. The first problem someone encounters with passwords is how to store them, and that depends very much on a few key factors: Audience, Exposure, and Uniqueness. If you are running a “homegrown” application (shout out to MecTracker) for use only inside the company, containing (in general) zero sensitive data, and you intend to pick users’ passwords for them (preventing the loss of a life password, itself a bad-yet-unavoidable practice), then why not just store them in plain text? It certainly makes it easy to retrieve a password for someone without having to reset it (useful for someone away from their work machine with a saved password who needs to log in).

Conversely, if you’re a bank, and you’re storing any of this in plain text, you will be razed to the ground by angry tech-savvy customers and auditors alike, hopefully BEFORE you get grandma and grandpa Jones to type in the password they use for everything else, too. Hopefully, if you’re a bank, you’re using some crazy method I’m not about to describe here.

Then, there’s the middle ground. I, for example, am not a bank (who would’ve guessed? Can someone please notify my ex-girlfriend?), so my needs are much more middle-of-the-road, which is why I’ve settled for hashing. When I started using PHP, I generally stuck to simple MD5 hashes; it was 10 years ago, and breaking MD5 seemed reasonably difficult. Then I was told not to use MD5 because, at 128 bits, it was too weak, and I should be using SHA-1, which was 160 bits. Then came the recommendation for SHA-256 (guess how many bits that one is!), and then whirlpool, and so on. If you’re using a proper password strategy then you’ve been salting all along (I’ll admit I wasn’t in the old days, but you’ve got to be a beginner sometime), but if you haven’t, allow me to give you a word on salt.

“Salting” a password hash is the practice of taking a piece of input data, adding in an extra piece of information (called “salt”; see where this is going?), and hashing that, instead of just hashing the raw input. In fact, with sites that act like a search engine for MD5 and SHA-1 hashes, not salting your input is, for general purpose storage, only one degree of separation away from just storing the data in plain text. Furthermore, good salt will be ever-changing (in this practice, the salt is also known as a ‘nonce’), and can safely be stored without obfuscation, as having included it means that a table not accounting for the nonce is useless, and a table that accounts for the nonce is only good against one of the passwords in your database. Now you’ve just made an attack much more expensive, but that may not be as useful in reality as we’d like to believe.
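To make the idea concrete, here is a minimal sketch of salting with a fast hash. The helper saltedHash() is hypothetical, random_bytes() assumes PHP 7 or later, and the whole thing is only meant to illustrate what a per-user salt does; it is not a recommended way to store passwords.

```php
<?php
// Hypothetical helper illustrating salting only; not a recommended storage scheme.
function saltedHash(string $password, string $salt): string
{
    return hash('sha256', $salt . $password);
}

// A fresh random salt per user, stored in the clear next to the hash.
$salt1 = bin2hex(random_bytes(16));
$salt2 = bin2hex(random_bytes(16));

$hash1 = saltedHash('hunter2', $salt1);
$hash2 = saltedHash('hunter2', $salt2);

// Same password, different salts: a table built for one salt is useless for the other.
var_dump($hash1 === $hash2);

// Verification recomputes the hash with the stored salt.
var_dump(hash_equals($hash1, saltedHash('hunter2', $salt1)));
```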

MD5 and SHA-1 hashes can be calculated very, very quickly. In fact, it’s generally more expensive to include some data about the current time (for use in salting/as a nonce) than it is to calculate the actual hash. Here is some experimental code to prove my point:

define('ITERATIONS',5);
 
$tt = $th = 0;
for ($j = 0; $j < ITERATIONS; ++$j) {
	$start = microtime(true);
	for ($i = 0; microtime(true) - $start < 1; ++$i) {
		$k = md5($i);
	}
	$tt += (microtime(true) - $start);
	$th += $i;
}
 
var_dump($tt / ITERATIONS, $th / ITERATIONS);

Simply hashing the value of the counter averaged 320,000 hashes per second on my work machine, which is not very powerful, and is certainly not running this in a very optimized way. By changing what is being hashed to the current time to the microsecond, the number of hashes per second is reduced to an average of about 150,000 – in short, the hash is NOT the expensive part of what’s going on here. So, let’s say that, given a more optimized environment but a more expensive dictionary list to be hashed, the average is 200,000 hashes per second, and the dictionary is about 50,000,000 common passwords. Simple math tells you that generating a hash list for this will take about 250 seconds, or less than 5 minutes. If it takes under 5 minutes to generate a table, and only a few seconds from there to query it, then even a database of 150,000 users can be fully cracked in a little over 14 months (150,000 tables at 250 seconds each is roughly 434 days).

So how can this be combated? Well, strong password guidelines are a good start, but if you’re relying on users to implement password security for you, you’re probably doing it very, very wrong. I’d like to challenge one of the assumptions you’ve probably made that I’ve had to challenge recently, and that is the value of speed; speed is bad. Think about it: using a hash method that can generate a table of fifty million values in under 5 minutes sounds great from a performance perspective, but who are you really helping? Is your user going to notice that your hash method took under 1ms to calculate, or is this performance more likely to benefit someone trying to crack your passwords? Who would be more hurt if your passwords took closer to 12ms to generate and verify, your users or your would-be attacker?

If you haven’t heard of it yet, may I introduce you to Blowfish hashing (the Blowfish-based scheme commonly known as bcrypt). Blowfish is designed to scale with Moore’s Law by allowing you, the programmer, to decide how long it takes to generate a hash. This is done by allowing you to specify a number which will be interpreted as a log-base-2 of how many iterations the hashing sequence should take; this metadata is then stored as part of the salt, prepended to the hash, and can be verified by the same function that created it since hashes are of fixed length and will be truncated or padded accordingly. By using a log-base-2 scale, every increment of that number (n) literally doubles the time required to calculate the hash, as it will have to undertake 2^n iterations to generate the hash. From what I can gather, a number like 7 or 8 is a fair industry standard at this time, and on my work machine limits the hashes-per-second to around 86.6 and 43.3, respectively.

Now, performance is a factor in real world applications, so let’s pick a number like 2^7, which as I said allows about 87 hashes per second. At that rate, a single dictionary table (useful for only one user, since we are salting these passwords) takes about six and a half days to generate. For that same database of 150,000 users, it would take over 2,733 years to crack. Of course, computational power will get less expensive as time goes on, and the same number of operations can and will get faster, but with the blowfish algorithm you need only increment the log to double the computational cost, keeping the cracking of your database safely outside the realm of technical feasibility.
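The arithmetic behind those figures can be written out as a short script. The variable names are mine, and the year count assumes 365-day years.

```php
<?php
$dictionarySize  = 50000000; // common passwords in the dictionary
$hashesPerSecond = 87;       // roughly what a cost of 7 allowed on my work machine
$users           = 150000;   // one salted table is needed per user

$secondsPerTable = $dictionarySize / $hashesPerSecond;
$daysPerTable    = $secondsPerTable / 86400;
$yearsForAll     = $daysPerTable * $users / 365;

printf("%.2f days per table\n", $daysPerTable);            // 6.65 days per table
printf("%d years for all users\n", (int) floor($yearsForAll)); // 2733 years for all users
```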

So how does one use the blowfish algorithm in PHP? The crypt() function is your friend! However, the manual is not entirely clear on the implementation details of blowfish, as it does not cover one key part in any great detail, and that is the log base (which caused me to tear my hair out a little bit, since, as a Windows user, I was unable to check the man pages for crypt(3)). When you generate the salt, you will need to prepend it with an instruction string that tells it what kind of hash to generate, and what parameters to use. Furthermore, the salt is not sixteen characters, but sixteen BYTES, and the characters in your salt will be read as a BASE64 encoded string, which means that using characters not allowed in a base64 string will cause the function to revert back to whatever the default is on your system, probably STD_DES or MD5.

All of that information might have seemed a bit hazy, so I’ll include the timing example I used before modified to suit crypt/blowfish. Note also that I am storing the microtime result on every iteration of the for-loop, as in order to give you worst-case scenarios on the cracker’s timetable, I had to give best-case timings on the hashing, and that means as few calls to microtime as possible.

define('ITERATIONS',5);
 
$tt = $th = 0;
for ($j = 0; $j < ITERATIONS; ++$j) {
	$start = microtime(true);
	for ($i = 0; ($z = microtime(true)) - $start < 1; ++$i) {
		$k = crypt($i, '$2a$07$' . (string)$z);
	}
	$tt += ($z - $start);
	$th += $i;
}
 
var_dump($tt / ITERATIONS, $th / ITERATIONS);

Of paramount importance is the literal string prepended to the stored value. The first four characters, $2a$, simply instruct crypt to use the blowfish algorithm. The next three, 07$, pass the number 7 as our log-base-2 argument, meaning the computation will run for 2^7 iterations. After that, we concatenate our salt (values shorter than 22 characters will be padded in a predictable fashion, and longer than 22 will be truncated) to the argument string and send it off on its merry, 12ms way.
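On newer PHP versions (5.5 and later) there is also password_hash(), which wraps the same blowfish-based scheme and generates a suitable salt for you. The sketch below is a swapped-in alternative to the hand-built crypt() call above, not what I used for the timings; the sample password is made up, and note that password_hash() emits a $2y$ prefix rather than $2a$.

```php
<?php
// Cost 7 matches the log-base-2 value used above; the salt is generated internally.
$hash = password_hash('correct horse', PASSWORD_BCRYPT, ['cost' => 7]);

var_dump(substr($hash, 0, 7)); // the prefix: algorithm marker plus cost
var_dump(strlen($hash));       // prefix, 22-character salt and hash together

// Verification reads the algorithm, cost and salt back out of the stored string.
var_dump(password_verify('correct horse', $hash));
var_dump(password_verify('wrong guess', $hash));
```

Because the cost is stored in the hash itself, raising it later only affects newly generated hashes; old ones still verify.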

Do I think I’ve defeated all the clever crackers out there? Certainly not. However, I’m definitely in a better boat for having stood on the shoulders of giants and listened to people smarter than I am about security. In fact, don’t listen to me, check out these links for more info:

(Victor) Xi Wang talks about salt, nonces and rainbow tables

Matasano Security, LLC, talks about blowfish and why you shouldn’t design your own password protection scheme.

Linked earlier, explains blowfish encryption – very math/pseudocode heavy.

Also linked earlier, the PHP Manual Entry for Crypt()

Happy Hashing!

February 9th, 2010 by Clark