Dereleased

Apache Taught You The Wrong Way To Think About Web Applications

This is not about hating on Apache; for all its faults, Apache is still a useful and powerful tool for its purpose. Nevertheless, people have been getting the concept of a web application wrong from the ground up because of the way Apache works out of the box. Let me be clear: there are many ways to learn to do things the wrong way; I single out Apache here only for its ubiquity. One can still find Apache installed and ready to go, with little to no configuration, on most servers or OS packages, from Linux to OS X.

Apache’s default mode of operation guided a generation of self-taught web developers to believe that what a web server does is take a request, parse a path out of that request, and serve up a file that’s found somewhere on the filesystem that generally corresponds to the path requested in the browser bar. For example, if my web root is /var/www, then asking for www.example.com/foo/bar.ext will cause Apache to serve up /var/www/foo/bar.ext. This, in a way, makes a web server, and the applications that reside on it, seem like a fancy version of a file explorer; some files will be passed through extra processing layers (like PHP scripts for example), but the default mode is to ask for a file that lives on the filesystem, and then get it. It seems simple, because it is.

And yet, this is rarely how a web application behaves. We don’t typically ask for a file; we send in a path (or route) that we interpret in the application layer, and then render content cobbled together from a range of sources. Instead of each file being an entry point that can be executed like a stand-alone program, a modern web application is generally a large number of small modules, ideally each doing just one thing and doing it well, that come together to eventually form the web page that will be served to the requestor.

The problem arises because Apache teaches the neophyte web developer that building this kind of application requires fighting the very nature of the web server. This causes developers to choose from a range of bad options, such as navigating the archaic minefield of .htaccess files, forcing everything to go through their index.php file (or whatever your DirectoryIndex file is), or actually coding in a bunch of static routes. Now, performing all of your routing from a single place is starting to get at the heart of the right idea, but too often the pattern seems to be that, if I access a resource /foo/bar, that means I want to access some file that corresponds to foo (probably containing a class), and call a function bar in it. I’ve done this in the past, and I see it done around me consistently as well. While this is an improvement over directly requesting the endpoint, it means that the developer is still thinking of their application intrinsically as a product of the filesystem. Worse still, this kind of thinking causes developers to then make a series of additional bad (or at least non-optimal) decisions, such as:

Ensuring that a class exists which matches a pattern specified by the URL.
This requires an attempt to load the class, and hopefully a check to make sure that the class is valid for use in this context. A particularly naïve implementation may allow someone to load something that was never intended to handle a request, and cause it to perform some action that was never intended by the developer to be directly invoked!
Ensuring that a function exists which matches a pattern specified by the URL.
Again, once the class is loaded, a check has to be performed to determine that it is capable of calling the specified function, and additional checks may be required to determine that this function is a valid function for handling requests.
Perhaps worst of all, developers may tend to keep these files in a web-accessible directory.
This kind of thing has led to all kinds of boilerplate code being inserted at the top of files to make sure they are not invoked “incorrectly”, e.g. called directly by their path on the server. Files that all begin with checks to see that some constant is defined, or that $_SERVER['PHP_SELF'] (or an equivalent, e.g. $_SERVER['REQUEST_URI']) does not point to the file itself, are examples of this kind of mechanism, which approaches the problem exactly in reverse.
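
To make the risk described above concrete, here is a minimal sketch of the URL-to-class-and-method pattern. The class names (Foo, Mailer) and the naiveDispatch() helper are hypothetical, invented purely for illustration; the point is only that class_exists() and method_exists() will happily accept a class that was never meant to handle a request.

```php
<?php
// Hypothetical classes: Foo is meant to handle requests, Mailer is not.
class Foo
{
    public function bar(): string { return 'foo/bar page'; }
}

class Mailer
{
    public function flushQueue(): string { return 'queue flushed'; }
}

// Naive dispatcher (the anti-pattern): the URL alone decides what gets called.
function naiveDispatch(string $path): string
{
    $parts  = explode('/', trim($path, '/'));
    $class  = ucfirst($parts[0]);
    $method = $parts[1] ?? 'index';

    if (!class_exists($class) || !method_exists($class, $method)) {
        return '404';
    }
    return (new $class())->$method();
}

$intended   = naiveDispatch('/foo/bar');           // the request we had in mind
$unintended = naiveDispatch('/mailer/flushQueue'); // nothing prevents this one
```

Nothing in the dispatcher distinguishes the two requests, which is why this design tends to accumulate extra checks and naming conventions over time.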

The consequences of this kind of design are less secure, less maintainable, less testable web applications. Forgot to declare a method private, or prepend its name with an underscore, or whatever your standard is for methods of a dispatching class which are not intended to be called? Now your application misbehaves. Forgot your boilerplate in one file? Who knows exactly what it will do. And even if you didn’t forget, the behavior of a file accessed “incorrectly” is often to die with a cryptic error message; not exactly ideal behavior for our web application.

In order to break free from these kinds of less-than-ideal design pitfalls, we have to think about a web application differently. The model of filesystem-first design should be thrown away, and instead, we should think of our applications using a different model: a closed source, compiled application. Just as you wouldn’t give out access to the individual source files of this kind of application, you should not give that access to users of your web application. That means, primarily, that your web application files, other than your application entry point, should NOT live below the document root! Anything that should not be accessed directly should not be accessible directly, and the application itself should not be responsible for enforcing this. This brings up a major point about how Apache teaches you the wrong thing, as developers may be tempted to leave their structure unchanged and use mechanisms like .htaccess files to restrict access to the directory. The problem with doing this is that the webserver is as much a part of your application as any code you write: relying on Apache directives, even if you store them in a *.conf file loaded from outside the webroot, is still relying on the application to protect itself.

Of course, this still hasn’t addressed how we get away from the problem of thinking of the different files (or classes) in our application as individual applications themselves. Now, I understand the allure of this kind of system. There are two primary benefits I think are drivers of this pattern: the ease with which new routes can be handled (just add another function to the handling class!), and getting to avoid solving the fairly difficult problem of dispatching to many different routes accurately and efficiently (with this technique, all you need are class_exists() and method_exists()). Addressing the latter first, the problem is largely solved by standing on the shoulders of giants. In this case, I recommend standing on the shoulders of Nikita Popov (or NikiC), who has researched the problem extensively and developed FastRoute to solve it.

Using FastRoute will absolutely solve the problem of how to do this efficiently and accurately, but the former problem of the ease with which new routes can be added still remains; this requires a centralized dispatcher to be updated with instructions for each individual route, and they must be enumerated. I’m here to say that this is not a problem, but a feature! There are tremendous benefits to knowing in advance every route your application can take. For one, nothing can be routed if not explicitly defined. Additionally, testing becomes possible without using reflection. But greatest of all, at least from the point of view of this post, is that you have stopped thinking of your application as a series of files which do things, and started thinking of it as a single, cohesive application that does things.
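
As a rough illustration of what “explicitly enumerated” means, here is a minimal sketch of a route table in plain PHP. This is not FastRoute’s API; the $routes array and the dispatch() function are hypothetical stand-ins that only match static paths, whereas a real router also handles placeholders and method-not-allowed cases.

```php
<?php
// Hypothetical route table: every reachable route is listed in one place.
$routes = [
    'GET /'        => function (): string { return 'home page'; },
    'GET /foo/bar' => function (): string { return 'foo bar page'; },
];

// Hypothetical dispatcher: anything not in the table is simply not routable.
function dispatch(array $routes, string $method, string $path): array
{
    $key = $method . ' ' . $path;
    if (!isset($routes[$key])) {
        return [404, 'Not Found'];
    }
    return [200, $routes[$key]()];
}

$found   = dispatch($routes, 'GET', '/foo/bar');
$missing = dispatch($routes, 'GET', '/foo/secret');
```

Because the table is just data, a test can walk over it and exercise every route without reflection, which is the benefit described above.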

Now, with all of this said, I would be remiss if I did not point out that Apache can be convinced to behave this way as a part of your application. I don’t necessarily want to get into a holy war of recommending any one server over another, but if you haven’t considered any others, it would be well worth your while to look at the other options on the market. At the very least, taking a fresh look at the alternative approaches in use today should serve to help you understand why you wish to stay with Apache. Or, if you’re feeling adventurous, some of the talented minds behind PHP itself have written a web application server entirely in PHP; such a thing erases the imaginary line that may keep you from thinking of your web server as being just as much a part of your application as the code you personally write.

In the end, recognizing your web server as a piece of your application, not just as a glorified file explorer, will improve the quality of your application as a whole.

August 16th, 2017 by Dereleased

Adventures in Parsing: PHP’s implicit semicolon (‘;’) before every close tag

Simply put, every time you enter a PHP closing tag, the interpreter automatically adds a semicolon to end the current statement before switching to HTML mode. In general, this is fairly innocuous, and it might even seem like the interpreter is doing you a favor, since now you can omit semicolons just before the end of a PHP block, which might be convenient for quick calls to “echo” or, potentially, for converting the short-open print shortcut (<?=) into long-open explicit echos, e.g.

<?php echo $foo?>

Since this could conceivably have been converted automatically with code not unlike the following:

<?php
$contents = str_replace('<?=', '<?php echo ', $contents);

So, at the end of the day, this feature seems like it was kindly thought out and shouldn’t cause any problems (more on that later). Still, it can be used to create interesting programming conundrums, for example, not using any explicit semicolons anywhere in your entire script. A trivial example:

<?php
for ($a = 0 ?><?php $a < 10 ?><?php $a++)
    print $a ?><?php 
if ($a >= 10): ?> hooray <?php endif ?>

It’s interesting to note that you can even do this inside of a control structure like “for(;;)”, but it makes sense when we break down the way that PHP’s parser works in a little more detail. The token for ?>, or the close tag, is T_CLOSE_TAG; for <?php (since I prefer long open tags) it is T_OPEN_TAG*, and anything outside of PHP tags is T_INLINE_HTML. Thus, the token stream for “<?php ?> anything you want goes here <?php” is “T_OPEN_TAG T_CLOSE_TAG T_INLINE_HTML T_OPEN_TAG”**. This is distinct in an important way from the earlier example, however, where we only used “?><?php” repeatedly, so the token stream at each of those points was simply “T_CLOSE_TAG T_OPEN_TAG”.

*NOTE: Technically, it also includes the first character of whitespace immediately following the open tag, and there must be one.

**NOTE: This only applies if there is any code following the last instance of T_OPEN_TAG; if not, and the document ends there (or after zero or more whitespace characters) then the latter instance will not be considered T_OPEN_TAG, but instead it will be a second instance of T_INLINE_HTML. As such, if we had ended the sample above with another T_OPEN_TAG, it would have been included in the output rather than starting the parser again.
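
The token streams described above can be inspected with PHP’s tokenizer extension. The sketch below assumes that extension is available (it usually is by default); the tokenNames() helper is a hypothetical name introduced here. Some code follows the final open tag in each sample so that it is tokenized as an open tag, per the note above.

```php
<?php
// Hypothetical helper: list the token names the tokenizer produces for a source string.
function tokenNames(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        // Most tokens are [id, text, line] arrays; single characters such as ';' are plain strings.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

// Text between a close tag and the next open tag shows up as T_INLINE_HTML.
$withHtml = tokenNames('<?php ?> anything you want goes here <?php echo 1;');

// A close tag immediately followed by an open tag leaves no inline HTML at all.
$backToBack = tokenNames('<?php $a = 0 ?><?php $a++;');

$firstFour = array_slice($withHtml, 0, 4);
```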

The reason the example given works is because of the way PHP treats these tokens when parsing. The two important rules in play here are:

T_OPEN_TAG is a sentinel that indicates when the PHP parser should be active or not, but in and of itself it has no meaning; as such, it is ignored during parsing
T_CLOSE_TAG has an implicit semicolon added before it during parsing; beyond that it is ignored by the parser, since it too is just a sentinel, and as such it becomes syntactically equivalent to a semicolon.

The second rule becomes especially evident when evaluating the error message returned for using it in the middle of an expression. Consider the following syntactically incorrect code fragment:

<?php $foo = 1 + ?> anything here <?php 5;

The error message reads:

Parse error: syntax error, unexpected ';'

— notice there is no mention of an unexpected T_CLOSE_TAG.

There is at least one case where this can do something in a way you may not expect: control structures used without braces for their bodies. Consider the following:

<?php
$foo = 10;
 
if ($foo > 20) ?> some output here <?php
else ?> more output <?php
//Parse error: syntax error, unexpected T_ELSE

This code will generate a parse error as it stands. If we removed the else clause, it would merely behave in an unexpected manner, showing our output even though the condition should have clearly failed (and did).

When the T_CLOSE_TAG is encountered, the current statement is concluded as with a semicolon, and that statement happens to be the entire body of the conditional. The T_CLOSE_TAG token is otherwise ignored, and we come upon T_INLINE_HTML, which is fine, followed by T_OPEN_TAG, which is still fine, but then we come across a T_ELSE, which does not make sense because we are not currently in the middle of parsing a conditional statement, and thus the parse error.

Of course, I would consider that more of an argument for always using braces with your control structures than anything else, but that’s just one man’s opinion.
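
For comparison, here is a minimal sketch of the same conditional written with braces, so the implicit semicolon before each close tag ends only an empty statement inside the block rather than the body of the if. Output buffering is used here only so the result can be captured in a variable; the variable name $rendered is an assumption of this example.

```php
<?php
$foo = 10;

ob_start(); // capture the output so it can be inspected afterwards
if ($foo > 20) { ?>some output here<?php } else { ?>more output<?php }
$rendered = ob_get_clean();
```

With the braces in place the inline HTML belongs to the branch it appears in, and only the else branch is emitted for $foo = 10.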

January 29th, 2013 by Dereleased

The importance of ZVals and Circular References

Just a quick post for now. Do you know how PHP’s symbol table works? To put it in a nutshell, symbols are stored in one place and values (also called ZVals) are stored in another. Normally, this abstraction will mean nothing to you, but take the following sample code:

$foo = &$bar;
$bar = &$foo;

Pretty basic circular reference, and one that might be pretty difficult to assign in a few other languages. Now what? Well, let’s take a look at another reference construct for a moment.

$a = 'foo';
$b = 'bar';
$x = &$a;
$y = &$x;
$z = &$y;
 
var_dump($x, $y, $z);
/*
string(3) "foo"
string(3) "foo"
string(3) "foo"
*/

Pretty much what we expected. Now, let’s throw a wrench into the mix and reassign $y by reference to &$b, and then examine the results:

$y = &$b;
 
var_dump($x, $y, $z);
/*
string(3) "foo"
string(3) "bar"
string(3) "foo"
*/

Only the value of $y changed! That is because PHP, when assigning a reference to a reference, always points at the same ZVal, instead of creating a reference chain; this is one significant way in which PHP References are NOT pointers – they’re never more than one layer deep. Let’s go back to our original example and assign a value to one of those variables:

$foo = 3;
 
var_dump($foo, $bar);
/*
int(3)
int(3)
*/

Works like a charm! This is because both references pointed at the same location in the ZVal table. But what if we start over again, and reassign $foo by reference to something else?

$foo = &$bar;
$bar = &$foo;
$baz = 'baz';
 
$foo = &$baz;
 
var_dump($foo, $bar);
/*
string(3) "baz"
NULL
*/

If you’ve been following along, this should make perfect sense. $foo is created, and pointed at a ZVal location identified by $bar; when $bar is created, it points at the same place $foo was pointed. That location, of course, is null. When $foo is reassigned, the only thing that changes is to which ZVal $foo points; if we had assigned a different value to $foo first, then $bar would still retain that value.

While we’re on the topic of ZVals, I’ll mention just one more thing. PHP uses a lazy-copying (or, copy-on-write) mechanism, thanks to the ZVal table. Consider the following code:

$foo = str_repeat('x',100000);
$mem1 = memory_get_usage();
$bar1 = $bar2 = $bar3 = $bar4 = $bar5 = $bar6 = $foo;
$mem2 = memory_get_usage();
$bar1 .= "...";
$mem3 = memory_get_usage();

I leave the calls to memory_get_usage() in place so that their effects will be more obvious. If we dump those three values, we get 426040, 426408 and 526536, respectively. In the second phase, as you can see, we only increased memory usage by 368 bytes (and that includes the memory required to store the measurement itself). During the third phase, when a variable was altered, memory usage increased by 100128 bytes. PHP uses about 24 bytes of memory to make an entry into the symbol table, and 80 more to create a null entry in the ZVal table.

So, the next time you think about passing a parameter you don’t intend to modify to a function by reference in order to save memory, or returning one for the same reason, don’t worry about it so much; it’s only 24 bytes.
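
The same effect can be observed across a function call. The sketch below is illustrative only: the readOnly() and modifies() helpers are hypothetical, and the exact byte counts depend on the PHP version and build, so only the relative difference matters.

```php
<?php
// Hypothetical helpers: each reports how much memory_get_usage() grew during the call.
function readOnly(string $payload, int $before): int
{
    $length = strlen($payload);   // reads the value, never writes to it
    return memory_get_usage() - $before;
}

function modifies(string $payload, int $before): int
{
    $payload .= '...';            // first write: this is where the copy is made
    return memory_get_usage() - $before;
}

$big = str_repeat('x', 100000);

$readDelta  = readOnly($big, memory_get_usage());
$writeDelta = modifies($big, memory_get_usage());
```

Passing $big by value costs next to nothing until the callee writes to it, and the caller’s copy is left untouched either way.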

Update!

In my travels, I have learned much since writing this. There are actually a lot of reasons to not use references in most cases, because they actually will increase your memory usage. Because PHP uses copy-on-write, it must track which symbols point by reference, and which do not, as well as tracking which values are pointed to by reference. Essentially, any number of symbols of one type (i.e. by-reference or by-value) can point at the same value without requiring a copy until write, but once you add a symbol of the other type, an immediate copy is made.

$a = "123456";
$b = $a; // this initializes the symbol $b, but doesn't create a second value
$c = &$b; // this immediately splits off a new value for $b and $c to reference

References actually cost us memory in this case, because of the split required, and the same is true in reverse.

$a = "123456";
$b = &$a;
$c = $b;

Since $c was assigned by value, not by reference, a new value had to be created for that symbol to reference.

In summary, you should probably be avoiding references unless they are actually the solution to a programming problem you are having. They are not useful for saving memory, and will end up costing you more in the long run.

April 27th, 2011 by Dereleased

PHP Quirks – String manipulation by offset

Just a quick update for a mild PHP Quirk/annoyance I have noticed recently while doing some manipulation of strings by character offset.

Say you have a string, such as ‘abcde’. Now, suppose you want to check the value of the third character (at index 2). You might have done something like this:

$str = 'abcde';
// Square brackets are used here; the equivalent $str{2} form was removed in PHP 8.0.
if ($str[2] == 'c') {
  // do something...
}

And, of course, that’s all fine, well and dandy; it does what you expect and you can move on with your life. In fact, if you’re into micro-optimizations, that construct provides a great way to check a string for minimum length, and is, on average, 44% faster than using strlen(). However, you can use this same construct to change the value of the character at a given offset in whatever string you’re working with. It works roughly as expected, but with a few gotchas:

/**
 * Gotcha #0 - Adding multiple characters to a single offset; shouldn't really be a gotcha
 * (Square brackets are used throughout; the $str{n} form was removed in PHP 8.0.)
 */
$str = 'abc123';

$str[1] = 'a'; // aac123
$str[4] = '123'; // aac113 (only the first character is used)

/**
 * Gotcha #1 - Adding characters past the end of the string
 */
$str[7] = 'c'; // aac113 c
echo ord($str[6]); // prints '32', the space character

/**
 * Gotcha #2 - Adding characters to an empty string
 */
$str = '';
$str[0] = 'a'; // before PHP 7.1: array( 0 => 'a' ); from PHP 7.1 on: 'a'
In gotcha #1, we see that, rather than leave the “uninitialized” area between where we’ve defined characters as null characters, PHP has silently filled it with spaces. Arguably, this is so that an isset($str[6]); check would not return false, but this is important to know if you expected the values of those bytes to remain at zero.

In gotcha #2, we see PHP’s weak typing in action; since an empty string has no offsets to begin with, attempts to add characters result in silent conversion to an array. (This is the behavior of PHP versions before 7.1; from 7.1 on, an empty string is treated like any other string here.)
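
As a small sketch of the two behaviors discussed above, using the square-bracket syntax that current PHP versions accept: isset() on an offset doubles as a minimum-length check, and writing past the end of a non-empty string pads the gap with spaces. The variable names are assumptions of this example.

```php
<?php
$str = 'abcde';

// Offset 4 exists only if the string has at least five characters.
$hasAtLeastFive = isset($str[4]);
$hasAtLeastSix  = isset($str[5]);

// Writing past the end of the string pads the gap with spaces, not null bytes.
$padded = 'abc';
$padded[5] = 'z';
$gapByte = ord($padded[3]);
```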

April 27th, 2011 by Dereleased
