Working with regular expressions: HREF URL Extractor

To extract the value (URL) of an HREF attribute from a string, use the preg_match or preg_match_all function. preg_match stops after the first match, so this example uses preg_match_all to find all the matches.
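As a rough illustration of the difference between the two functions, the sketch below runs both against the same string. The sample markup and the simplified pattern (which only handles double-quoted values) are assumptions made for this illustration, not part of the original example.

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = '<a href="one.html">One</a> <a href="two.html">Two</a>';

// Simplified pattern: captures a double-quoted href value.
$pattern = '/href="([^"]+)"/i';

// preg_match stops after the first match; $first[1] is the captured URL.
$firstCount = preg_match($pattern, $html, $first);

// preg_match_all scans the whole string; $all[1] lists every captured URL.
$allCount = preg_match_all($pattern, $html, $all);

echo $first[1], "\n";              // one.html
echo implode(', ', $all[1]), "\n"; // one.html, two.html
```

Both functions return the number of matches found, so preg_match returns at most 1.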

preg_match_all - Perform a global regular expression match

preg_match_all(pattern (string), target (string), matches (array), optional flags)

Searches target for all matches to the regular expression given in pattern and puts them in matches in the order specified by the flags. It returns the number of full matches found, which may be zero.
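To make "the order specified by the flags" more concrete, here is a small sketch comparing the two ordering flags, PREG_PATTERN_ORDER (the default) and PREG_SET_ORDER. The sample string and the simplified pattern are assumptions made for this illustration.

```php
<?php
// Hypothetical sample markup and a simplified pattern, for illustration only.
$html    = '<a href="a.html">A</a><a href="b.html">B</a>';
$pattern = '/href="([^"]+)"/i';

// PREG_PATTERN_ORDER (the default): results are grouped by pattern part.
// $byPattern[0] holds the full matches, $byPattern[1] the first captured group.
preg_match_all($pattern, $html, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: results are grouped by match.
// $bySet[0] is the first match (full text, then its captured group), and so on.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);

print_r($byPattern[1]); // every captured URL, in document order
print_r($bySet[0]);     // everything belonging to the first match
```

With the default ordering, the list of URLs is simply index 1 of the matches array, which is why the example on this page reads its result from $links[1].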

The regular expression: <\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]
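The \" sequences in the expression above are PHP string escapes for a literal double quote; the regex engine itself only sees ". As a quick check of what the pattern accepts, the sketch below tries it against a few hypothetical tags (double-quoted, single-quoted, unquoted, and upper-case with extra spaces). The sample tags are assumptions made for this illustration.

```php
<?php
// The pattern from this page as a single double-quoted PHP string,
// with delimiters and the case-insensitive modifier added.
$regex = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/i";

// Hypothetical sample tags, for illustration only.
$samples = [
    '<a href="http://example.com/">double quotes</a>',
    "<a href='page.html'>single quotes</a>",
    '<a class="nav" href=about.html>no quotes</a>',
    '<A HREF = "UP.HTML">upper case, spaces around =</A>',
];

$found = [];
foreach ($samples as $sample) {
    // $m[1] is the text captured by the parentheses: the URL itself.
    if (preg_match($regex, $sample, $m)) {
        $found[] = $m[1];
    }
}

print_r($found);
```

Because the captured group stops at the first space, quote, or >, an unquoted or quoted URL that contains a space would be cut short at that space.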

PHP Code:

<?php
$href_regex  = "<";          // 1 start of the tag
$href_regex .= "\s*";        // 2 zero or more whitespace
$href_regex .= "a";          // 3 the a of the tag itself
$href_regex .= "\s+";        // 4 one or more whitespace
$href_regex .= "[^>]*";      // 5 zero or more of any character that is _not_ the end of the tag
$href_regex .= "href";       // 6 the href bit of the tag
$href_regex .= "\s*";        // 7 zero or more whitespace
$href_regex .= "=";          // 8 the = of the tag
$href_regex .= "\s*";        // 9 zero or more whitespace
$href_regex .= "[\"']?";     // 10 none or one of " or '
$href_regex .= "(";          // 11 opening parenthesis, start of the bit we want to capture
$href_regex .= "[^\"' >]+";  // 12 one or more of any character _except_ our closing characters
$href_regex .= ")";          // 13 closing parenthesis, end of the bit we want to capture
$href_regex .= "[\"' >]";    // 14 closing characters of the bit we want to capture

$regex  = "/";         // regex start delimiter
$regex .= $href_regex; // the pattern built above
$regex .= "/";         // regex end delimiter
$regex .= "i";         // Pattern Modifier - makes the regex case insensitive
$regex .= "s";         // Pattern Modifier - makes a dot metacharacter in the pattern
                       // match all characters, including newlines
                       // (this pattern contains no dot, so it has no effect here)
$regex .= "U";         // Pattern Modifier - makes the quantifiers ungreedy

$html = "......";

// If preg_match were used here, it would find only the first match for the regular expression.
if (preg_match_all($regex, $html, $links)) {
    print("<p>Links:</p><pre>");
    // Escape the dump so URLs containing & or < display correctly in HTML.
    print(htmlspecialchars(print_r($links[1], true)));
    print("</pre>");
} else {
    print("No links.");
}
?>

Note: $links[1] now holds every href value captured from $html; $links[0] holds the full text of each match.
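For reuse, the same steps could be wrapped in a function. The sketch below is one possible way to do that; the function name extract_hrefs and the sample markup are hypothetical and not part of the original example.

```php
<?php
/**
 * Hypothetical helper (name chosen for this illustration): returns every
 * href value found in $html, or an empty array when there are none.
 */
function extract_hrefs(string $html): array
{
    // The pattern from this page, with the same i, s and U modifiers.
    $regex = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU";

    // preg_match_all returns 0 when nothing matches and false on error;
    // both cases are treated as "no links" here.
    if (!preg_match_all($regex, $html, $links)) {
        return [];
    }

    return $links[1]; // the captured URLs only
}

// Hypothetical sample markup, for illustration only.
$html  = '<p><a href="index.html">Home</a> | <a href=\'news.html\'>News</a></p>';
$hrefs = extract_hrefs($html);

foreach ($hrefs as $href) {
    // Escape each URL before printing it into an HTML page.
    echo htmlspecialchars($href), "\n";
}
```

Returning only $links[1] keeps the caller from having to know how preg_match_all lays out its result array.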

Further Reading:
See PHP: preg_match_all
