Mod_Rewrite and Regular Expressions

Friday 12 Jan, 2007 - 19:50pm

The majority of the mod_rewrite examples you will find on the internet relate to .htaccess. I didn't want to use a .htaccess file; instead I wanted to write the rules in either vhost.conf or httpd.conf. Using .htaccess carries a performance hit on your server, which is one of the reasons I avoided it. Also, the recommendation is not to use .htaccess if you don't need to.

Although the rewrite rules in .htaccess are almost identical, they didn't quite work for me in httpd.conf. I needed a series of rewrite rules to handle dynamic page content, namely to make the URLs search engine friendly, such as turning page.php?id=2 into page-2.html.

Mod_Rewrite can be complex, particularly in its use of regular expressions (regex). A regular expression is a string used to describe a pattern. Regular expressions can be used for server side validation of user submitted input or for rewrite rules in Apache. If you haven't used regular expressions before, their first sight might instil the same wonder as a wall filled with strange, alien hieroglyphics. Where do you even start to translate your dynamic URL into this odd language? Fortunately there is a key to the characters to help decipher a regex string or to translate your URL into a regex pattern.

Metacharacters used in Regular Expressions
Character Meaning
^ Start matching from this point
$ End matching at this point
. Any character, equivalent to a wildcard (note: by default the dot will not match the character that denotes a new line, i.e. \n). Be careful when using the wildcard, particularly in validation, as you may not want to match every character
[ ] Denotes a character class. Will match any one of the characters included between the square brackets, as in [xyz], which will match any one of x or y or z, not all three together. Note the dot is not a wildcard if used between square brackets; it's simply treated as a dot.
| Or
? Optional (match zero or one time)
+ Match one or more times
* Match zero or more times
{ } Curly braces specify an exact number of times to match, as in {3}, or a range, as in {2,4}
( ) Used for grouping
\ Use before a metacharacter to escape it, i.e. remove its special meaning, as in \$ \. \+
- Range for matching inside a character class, as in [0-9] numeric characters or [a-z] lowercase characters
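
The difference between a wildcard dot, an escaped dot and a dot inside a character class is easier to see when tested. The following is a minimal sketch using PHP's preg_match; the helper name dotDemoMatches and the test strings are made up for illustration.

```php
<?php
// Hypothetical helper for this sketch: true when $subject matches $pattern.
function dotDemoMatches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// Unescaped, the dot is a wildcard and matches any character except a newline.
var_dump(dotDemoMatches('/^a.c$/', 'abc'));    // bool(true)
var_dump(dotDemoMatches('/^a.c$/', "a\nc"));   // bool(false)

// Escaped with a backslash, only a literal dot matches.
var_dump(dotDemoMatches('/^a\.c$/', 'a.c'));   // bool(true)
var_dump(dotDemoMatches('/^a\.c$/', 'abc'));   // bool(false)

// Inside square brackets the dot is literal as well.
var_dump(dotDemoMatches('/^a[.]c$/', 'abc'));  // bool(false)
```

The patterns are written in single quotes so that PHP passes the backslash through to the regex engine untouched.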

When some of these characters are used in combination with each other, their meaning may change.

Combined Metacharacters in Regular Expressions (characters combined)
Character Meaning
[^ ] Negated character class: matches any one character that is not listed, as in [^xyz], which matches any character other than x, y or z
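
The caret therefore has two jobs, depending on where it sits. As a rough illustration (the test strings are arbitrary), the sketch below shows it as an anchor, as a negation at the start of a class, and as an ordinary character elsewhere in a class.

```php
<?php
// Outside a class, ^ anchors the match to the start of the string.
echo preg_match('/^[xyz]$/', 'x'), "\n";   // 1: x is in the class

// As the first character inside a class, ^ negates the class.
echo preg_match('/^[^xyz]$/', 'x'), "\n";  // 0: x is excluded
echo preg_match('/^[^xyz]$/', 'a'), "\n";  // 1: a is not x, y or z

// Anywhere else inside a class, ^ is just a literal caret.
echo preg_match('/^[x^z]$/', '^'), "\n";   // 1
```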

Some common patterns (characters combined)
Character Meaning
[0-9] Numeric, will match any one numeric character
[a-z] Lower case alphabetic
[A-Z] Upper case alphabetic
[a-zA-Z] Alphabetic (upper and lower case)
[^0-9] Not numeric
[0-9a-fA-F] Matches a single hexadecimal character
"[^"]*" Matches text between double quotes
([^/]+) Matches any folder name (one or more characters that are not a forward slash)
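
To see the folder pattern at work, here is a small sketch that pulls the first two segments out of a path. The path is a made-up example, and # is used as the pattern delimiter purely so the forward slashes need no escaping. It also shows a common slip with ranges.

```php
<?php
// Hypothetical path, used only for illustration.
$path = 'archive/2006/index.html';

// Each ([^/]+) captures one path segment: one or more characters that are
// not a forward slash. "#" is the delimiter so "/" needs no escaping.
preg_match('#^([^/]+)/([^/]+)/#', $path, $segments);
echo $segments[1], "\n"; // archive
echo $segments[2], "\n"; // 2006

// Pitfall: the range A-z (lower case z) also covers the symbols that sit
// between "Z" and "a" in ASCII, so it accepts an underscore.
echo preg_match('/^[A-z]$/', '_'), "\n";     // 1
echo preg_match('/^[a-zA-Z]$/', '_'), "\n";  // 0
```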

Shorthand characters can also be used in pattern matching. You might be familiar with some of these from your PHP scripts. The majority of these will not be used in Mod_Rewrite; I only include them for completeness and so as to refer back to them later.

Regular Expressions (shorthand characters)
Character Meaning
\d Matches a single numeric character
\s Matches a single whitespace character (space, tab or line break)
\w Matches a single word character (letter, digit or underscore)
\t Matches a tab character (ASCII 0x09)
\r Matches carriage return (ASCII 0x0D)
\n Matches line feed (ASCII 0x0A)
\A Only ever matches at the start of a string
\Z Only ever matches at the end of a string (or just before a final line break)
\b Matches at a word boundary
\B Matches at any position that is not a word boundary

The ^ and the $ are known as anchors. Anchors match a position before, after or between characters.
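
Because anchors match positions rather than characters, their effect shows up in how many matches are found. The sketch below counts matches in an arbitrary test string with preg_match_all; nothing here is specific to mod_rewrite.

```php
<?php
$text = 'cat concat cats';

echo preg_match_all('/cat/', $text), "\n";      // 3: every occurrence
echo preg_match_all('/\bcat\b/', $text), "\n";  // 1: only the whole word "cat"
echo preg_match_all('/\Bcat/', $text), "\n";    // 1: only inside "concat"
echo preg_match('/^cat/', $text), "\n";         // 1: the string starts with "cat"
echo preg_match('/cat$/', $text), "\n";         // 0: the string ends with "cats"
```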

When you start to look at examples of using regex, the terminology, the metacharacters and their meanings become a lot easier to understand. Let's look at some simple examples first before applying what we know to Mod_Rewrite.

In testing our examples we will use PHP's function preg_match. Here we will define two variables $pattern (the pattern to test) and $match (the string we apply the pattern matching to). We pass both variables to the PHP function. The function will return 0 if there is no match and 1 if there is a match.

This will match one alphabetic character. It will fail if there is more than one character, such as "sa". It will fail if the string contains a non-alphabetic character. It will fail if the letter is in upper case.

$pattern = "/^[a-z]$/";
$match = "s";
echo preg_match ($pattern, $match);

This will match a single uppercase or lowercase letter. Any other characters will fail.

$pattern = "/^[a-zA-Z]$/";
$match = "S";
echo preg_match ($pattern, $match);

Using the curly braces we can define how many characters in the match. In this example any three letters will match, but the match will fail if it is only two letters or more than three.

$pattern = "/^[a-zA-Z]{3}$/";
$match = "SAP";
echo preg_match ($pattern, $match);
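
Curly braces also accept a range. As a small sketch, assuming we want between two and four letters (the candidate strings are arbitrary), {2,4} sets a minimum and a maximum:

```php
<?php
// {2,4} means "at least two and at most four" of the preceding class.
$pattern = '/^[a-zA-Z]{2,4}$/';

foreach (['S', 'SA', 'SAP', 'SAPS', 'SAPPY'] as $candidate) {
    // Prints 0 for S and SAPPY, 1 for the three strings in between.
    echo $candidate, ': ', preg_match($pattern, $candidate), "\n";
}
```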

By using "+" instead of the curly braces we say the match can occur one to infinite times, i.e. this will match any word or string comprised only of letters.

$pattern = "/^[a-zA-Z]+$/";
$match = "Navision";
echo preg_match ($pattern, $match);

Remember the caret ^ negates a class. In the following example only characters which are not letters will match. This includes symbols like the comma etc.

$pattern = "/^[^a-zA-Z]+$/";
$match = "1234,£";
echo preg_match ($pattern, $match);

The following will match any number of alphanumeric characters.

$pattern = "/^[0-9a-zA-Z]+$/";
$match = "Lotus123";
echo preg_match ($pattern, $match);

Our pattern might have different elements we want to match. Let's add extra classes. In this example Lotus and Lotus123 will match. We've made the 123 optional (note how we have grouped it with ( ) brackets).

$pattern = "/^[a-zA-Z]+([0-9]+)?$/";
$match = "Lotus123";
echo preg_match ($pattern, $match);
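
Grouping does more than make part of a pattern optional: whatever a group matches is also captured. A minimal sketch, using preg_match's optional third argument to collect the captures (the variable names are made up for illustration):

```php
<?php
$pattern = '/^[a-zA-Z]+([0-9]+)?$/';

preg_match($pattern, 'Lotus123', $withDigits);
preg_match($pattern, 'Lotus', $withoutDigits);

// Index 0 holds the whole match, index 1 the first ( ) group.
print_r($withDigits);     // [0] => Lotus123, [1] => 123

// The optional group did not take part, so only the whole match is stored.
print_r($withoutDigits);  // [0] => Lotus
```

The same numbering is what mod_rewrite exposes as $1, $2 and so on.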

In the following example we introduce another shorthand, \s, to denote a space. As we have added ? to it, i.e. \s?, we are saying it is optional. This pattern would match Lotus, Lotus123 and Lotus 123.

$pattern = "/^[a-zA-Z]+\s?([0-9]+)?$/";
$match = "Lotus123";
echo preg_match ($pattern, $match);

Of course, if we wanted to match the word Lotus and only that word, we would use the following. This will match Lotus, Lotus123, Lotus 123, lotus, lotus123 and lotus 123. Because any three digits are allowed, it would also match Lotus 345.

$pattern = "/^(Lotus|lotus)\s?([0-9]{3})?$/";
$match = "Lotus123";
echo preg_match ($pattern, $match);

In these examples the [0-9] could equally have been written as \d, the shorthand for a digit.

$pattern = "/^(Lotus)\s?(\d{3})?$/";
$match = "Lotus123";
echo preg_match ($pattern, $match);

These are the basics, but working through them should enable us to read and understand regular expressions. We can now understand the quantifiers (*, ?, +), the anchors (^, $, \b, \B) and the other metacharacters used in regex pattern matching.

Let's now apply what we have learned to examples using Apache's mod_rewrite. At first we will just examine the pattern matching; we will then apply it to the rewrite syntax. Don't practice on your live server as, if you're unfamiliar with mod_rewrite and regex, your rules might render unexpected results.

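The basic syntax for Apache mod_rewrite in httpd.conf is

RewriteEngine on
RewriteRule ^/PatternToMatch$ WhatToDo

Note the leading slash: in httpd.conf or a virtual host the pattern is matched against the full URL path, such as /ella.html, whereas in .htaccess the directory prefix, including the leading slash, is stripped before matching. For example, the rule below will match the web page ella.html and rewrite it to mark.html

RewriteEngine on
RewriteRule ^/ella\.html$ /mark.html

i.e. the URL will say ella.html but the content served up will be mark.html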
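
One detail worth checking before a pattern goes into a rule is the dot in the file name. Left unescaped it is a wildcard, so the pattern accepts more than intended. The sketch below tests this in PHP rather than in Apache; the second test string is made up to show the effect.

```php
<?php
$loose  = '/^ella.html$/';   // unescaped dot: any character is accepted there
$strict = '/^ella\.html$/';  // escaped dot: only a literal dot is accepted

echo preg_match($loose, 'ella.html'), "\n";   // 1
echo preg_match($loose, 'ellaXhtml'), "\n";   // 1, probably not what we want
echo preg_match($strict, 'ella.html'), "\n";  // 1
echo preg_match($strict, 'ellaXhtml'), "\n";  // 0
```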

The first part (^PatternToMatch$) is what a user will type in as a URL or click on to follow a link. For search engine indexing it is better if this link is a static page rather than a dynamic one. Mod_rewrite is a cloaking device. Our "bird of prey" PHP pages can disappear and appear as static HTML.

When the user clicks the link for a static HTML page, mod_rewrite will apply the matching rules we've given it and display the content from our dynamic PHP page.

Let's suppose we have a blog. The actual URL of a post might be blog.php?id=122. We might prefer a user to link to it as follows: blog.php/2006/12/02/here-it-is.html

So in the URL we are looking for a specific match. Let's build it up.

// "#" is used as the delimiter so the slashes in the URL do not need escaping
$pattern = "#^blog\.php/(\d{4})/(\d{2})/(\d{2})/([-0-9a-zA-Z]+)\.html$#";
$match = "blog.php/2006/12/02/here-it-is.html";
echo preg_match ($pattern, $match);

Every rule is evaluated on each request, so the more rules and the more complex the patterns you define, the greater the load on the server.

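Anything between ( ) brackets in our pattern can be used as a variable in the substitution. In our example we have four, which will be known as $1, $2, $3 and $4. We can pass these to the substitution, as they are the variables our PHP script needs in order to run without giving a 404 error. The call to our PHP script will look like this

blog.php?date=$1-$2-$3&name=$4

Put together as a rule in httpd.conf, this becomes

RewriteRule ^/blog\.php/([0-9]{4})/([0-9]{2})/([0-9]{2})/([-0-9a-zA-Z]+)\.html$ /blog.php?date=$1-$2-$3&name=$4 [L]

If the rules live in a .htaccess file instead, for example in an /archive/ directory, set the base URL for the rewritten paths with RewriteBase. The directive is only valid in per-directory context (.htaccess or a Directory block), so it is not used at server level in httpd.conf.

RewriteBase /archive/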
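
Before touching the server, the substitution can be previewed in PHP. This is only an approximation of what mod_rewrite does: preg_replace happens to use the same $1 to $4 notation for captured groups. The pattern below is written with # as the delimiter and without a leading slash, both of which are choices made for this sketch.

```php
<?php
$pattern     = '#^blog\.php/(\d{4})/(\d{2})/(\d{2})/([-0-9a-zA-Z]+)\.html$#';
$replacement = 'blog.php?date=$1-$2-$3&name=$4';

// The friendly URL a visitor would request.
$friendly = 'blog.php/2006/12/02/here-it-is.html';

// Single quotes keep PHP from treating $1 to $4 as its own variables.
$rewritten = preg_replace($pattern, $replacement, $friendly);
echo $rewritten, "\n"; // blog.php?date=2006-12-02&name=here-it-is
```

If the pattern does not match, preg_replace returns the input unchanged, which makes a failed rule easy to spot.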

Command Flags (mod_rewrite)
Flag Meaning
[R] Redirect. Write as [R=301], for example, to change the status code (the default is 302)
[F] Forces the URL to be forbidden (returns a 403 message)
[G] Results in a 410 (Gone) message
[L] The last rule. Use this at the end of every rewrite rule that is not chained to the next one.
[N] Rerun the rules again from the start
[C] Chains the rule with the next one
[NC] No case. Makes the rule case insensitive
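
When testing patterns in PHP, the nearest equivalent of [NC] is the i modifier placed after the closing delimiter. This is an analogy for testing purposes, not the flag itself:

```php
<?php
$caseSensitive   = '/^lotus$/';
$caseInsensitive = '/^lotus$/i';  // "i" plays the role that [NC] plays in a rule

echo preg_match($caseSensitive, 'Lotus'), "\n";    // 0
echo preg_match($caseInsensitive, 'Lotus'), "\n";  // 1
echo preg_match($caseInsensitive, 'LOTUS'), "\n";  // 1
```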
 

When you change a URL to directory level, remember that your URLs for CSS, JavaScript, images etc. need to use absolute rather than relative paths or they won't be found.

Posted in: Business
Tags: Regex | Regular Expressions | Apache | Mod Rewrite


© Eriginal Ltd 2011, all rights reserved