Thursday, May 1, 2014

Yet another reverse proxy

The general idea of a reverse proxy is quite simple. Sometimes, one wants to designate a portion of a website to return a copy of another website. Not redirection; instead, whenever a request comes for www.foo.com/bar/a.html, the server-side code at foo performs an HTTP request to www.bar.com/a.html and returns the page to the client. The client doesn't even know that the response originated at bar.com. In this scenario, the directory www.foo.com/bar acts as a reverse proxy for www.bar.com.
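The URL translation described above can be sketched in a few lines. This is a minimal illustration in Python, not the PHP of the actual script; the names PROXY_PREFIX, TARGET_BASE and map_to_target are made up for the example, and the real script may handle edge cases differently.

```python
from urllib.parse import urlsplit

# Hypothetical configuration: the proxied location on foo.com and its target.
PROXY_PREFIX = "/bar"
TARGET_BASE = "http://www.bar.com"


def map_to_target(request_url: str) -> str:
    """Translate an incoming proxy URL into the URL to fetch from the target."""
    parts = urlsplit(request_url)
    inside = parts.path == PROXY_PREFIX or parts.path.startswith(PROXY_PREFIX + "/")
    if not inside:
        raise ValueError("URL is outside the proxied location")
    # Strip the proxy prefix; whatever remains is the path on the target.
    remainder = parts.path[len(PROXY_PREFIX):] or "/"
    target = TARGET_BASE + remainder
    if parts.query:
        target += "?" + parts.query
    return target


print(map_to_target("http://www.foo.com/bar/a.html"))  # http://www.bar.com/a.html
```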

The business cases for this kind of functionality are several. I, specifically, want to move a set of Web services from one host to another while keeping the old URLs working. Since those are services, meant to be invoked by programs and not people, redirection won't work. Another possible scenario involves exposing an HTTP service from behind a firewall without exposing the whole host.


There are built-in means for creating reverse proxies both in Apache and in Internet Information Services (IIS). The problem is, leveraging those requires administrative rights, and with certain Web hosting packages, one might not have them.

On the other hand, there's nothing magical about forwarding an HTTP request. A piece of server-side code (PHP in my case) is perfectly capable of issuing an outgoing HTTP request, passing the incoming request headers along, then sending the response headers and data back. So I went ahead and wrote one.
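To make the forwarding step concrete, here is a rough sketch in Python (the real script is PHP, and its exact header handling is not shown in this post). It assumes that connection-level ("hop-by-hop") headers are dropped and that Host is replaced with the target's host name; forwardable_headers and build_outgoing_request are hypothetical helper names. The request object is only built here, not sent.

```python
import urllib.request

# Headers that describe a single connection and are normally not forwarded.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailers", "transfer-encoding", "upgrade",
}


def forwardable_headers(incoming: dict, target_host: str) -> dict:
    """Copy the incoming headers, minus hop-by-hop ones, and retarget Host."""
    outgoing = {
        name: value
        for name, value in incoming.items()
        if name.lower() not in HOP_BY_HOP and name.lower() != "host"
    }
    outgoing["Host"] = target_host
    return outgoing


def build_outgoing_request(method, target_url, headers, body=None):
    """Build (but do not send) the request the proxy would issue to the target."""
    return urllib.request.Request(
        target_url, data=body, headers=headers, method=method
    )
```

Passing the result to urllib.request.urlopen would perform the actual call; the response headers would then go through a similar filter on the way back to the client.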

The reverse proxy script is done. It works (for me), and I don't mind sharing.

Setting up


If you're interested in this script, here are the installation steps.
  1. Download this archive from Dropbox
  2. Unzip into a folder on a Web server
  3. Open proxy_config.php, describe the location(s) under which the proxy is sitting and which targets they should invoke
  4. Open .htaccess and change the value of RewriteBase to reflect the URI location where the proxy is sitting (multiple lines might or might not work, test it)
  5. Make log.txt writable to the world
That's the basic setup, and it assumes that the server is Apache, that it has mod_rewrite, that mod_rewrite is enabled, and that overriding rewrite rules at the folder level is allowed. It's a reasonable assumption in this day and age; many CMSs out there depend on rewriting functionality.

If rewrite is not available, there's still a way to run the proxy. I won't go into that here, but the idea is either establishing symlinks to proxy.php all over the proxy folder, or placing renamed copies of rproxy.php all over the folder. It's not pretty, but it'll work. Static content needs to be duplicated outright (or a handler needs to be established).

Features


In addition to the most basic HTTP functionality, the proxy supports:
  • Passing headers back and forth as much as reasonably possible - so caching instructions, content type, user agent and such won't be lost
  • Arbitrary HTTP methods (i.e. REST)
  • POST/PUT/PATCH data in arbitrary format - not just forms
  • Cookies and sessions, unless the target uses path-specific cookies nontrivially
  • HTTPS - if you designate the target as protocol-independent (with no leading http://)
  • Proxy folders that are accessible via several URIs
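As an illustration of the protocol-independent target idea, here is one way it could work, sketched in Python. The assumption (mine, not confirmed by the script) is that a target written without a scheme inherits the scheme of the incoming request; resolve_target is a hypothetical name.

```python
def resolve_target(target: str, incoming_is_https: bool) -> str:
    """Give a scheme-less target the same scheme as the incoming request."""
    if target.startswith(("http://", "https://")):
        # An explicit scheme in the configuration always wins.
        return target
    scheme = "https" if incoming_is_https else "http"
    # Also accept the "//host/path" spelling of a protocol-relative target.
    return scheme + "://" + target.lstrip("/")


print(resolve_target("www.bar.com", True))   # https://www.bar.com
print(resolve_target("www.bar.com", False))  # http://www.bar.com
```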
On the other hand, there are many ways a site can be proxy-unfriendly. The following scenarios won't work under my proxy script:
  • Absolute URLs in HTML
  • Redirection to absolute URLs within the same site
  • Domain- and path-specific cookies might break, depending on the way the target works
Some of those shortcomings may be fixed in future versions.
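
To show why redirection to absolute URLs is a problem, and what a fix might look like, here is a hypothetical Python sketch. The script does not do this today; rewrite_location and its parameters are invented for the example. The idea is that a Location header pointing at the target site gets mapped back under the proxy folder, so the client never leaves the proxy.

```python
def rewrite_location(location: str, target_base: str, proxy_base: str) -> str:
    """Map an absolute redirect on the target back under the proxy folder."""
    if location == target_base or location.startswith(target_base + "/"):
        return proxy_base + location[len(target_base):]
    # Redirects to unrelated sites are passed through unchanged.
    return location


print(rewrite_location(
    "http://www.bar.com/login",
    "http://www.bar.com",
    "http://www.foo.com/bar",
))  # http://www.foo.com/bar/login
```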

The script probably doesn't scale well with content size. Rather than passing the bytes to the client as soon as they arrive from the target, it stores the whole response in memory.
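
For comparison, the streaming alternative would look roughly like the Python sketch below. This is not how the script currently works; relay and the 8 KB chunk size are arbitrary choices for the example. The point is that memory use stays bounded by the chunk size, whatever the size of the response.

```python
import io


def relay(source, sink, chunk_size: int = 8192) -> int:
    """Copy from source to sink in fixed-size chunks; return bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # end of the response body
        sink.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory streams standing in for the target and the client.
upstream = io.BytesIO(b"x" * 20000)
client = io.BytesIO()
print(relay(upstream, client))  # 20000
```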

The preferred environment is PHP 5 under Apache. In theory, the script should work under other environments too, but I had little chance to test it under those.