Parsing HTML with regex

pmjv@lemmy.sdf.org · 8 months ago

Parsing HTML with regex

2xsaiko@discuss.tchncs.de · 8 months ago

None of these examples are for parsing English sentences. They parse completely different formal languages. That it’s text is irrelevant, regex usually operates on text.

You cannot write a regex to give you for example “the subject of an English sentence”, just as you can’t write a regex to give you “the contents of a complete div tag”, because neither of those are regular languages (HTML is context-free, not sure about English, my guess is it would be considered recursively enumerable).

You can’t even write a regex to just consume <div> repeated exactly n times followed by </div> repeated exactly n times, because that is already a context-free language instead of a regular language, in fact it is the classic example for a minimal context-free language that Wikipedia also uses.