As you’ve heard, we’ve been working on a whole lot of new projects. And with new projects comes new code. I can’t say a lot more about these projects right now, but we’ve been using Python and the Django framework to develop them. So here’s the first of the spinoff products that we’re putting out under a BSD license for everyone to benefit from.
Over to Dan MacKinlay, one of our Python gurus, to tell you all about the HTML Sanitiser and why it matters.
“So the idea with the Python HTML Sanitizer is that we are consuming data from a wide variety of client websites, and we need to get their HTML data in a form that’s useful to us. This means –
1) standards-compliant XHTML
2) … bereft of formatting quirks which break our site …
3) … and free from exploits for cross-site scripting and other browser-bugs that can compromise user security.
Normally, you can sidestep the HTML sanitization process by writing your own content, or using a special markup language (say, Markdown) – but when you are consuming HTML from clients’ websites this is not an option. They simply aren’t written in Markdown.
Stripping ALL HTML tags out would be another common option. That’s not reasonable for us, however, since we are supposed to be extracting rich information from our clients’ sites, and some of it is really useful and semantic – links, citations and definitions. These are things we don’t want to filter out, or punish clients for using.
Rather, we’d probably like to reward them by keeping that markup and indexing on it.
By the same token, many clients use old markup (think HTML3), invalid or badly-formed markup, or merely types of markup which are inconvenient for us to display (br – or even td tags – instead of p). Moreover, when a site is old enough to have such ancient markup in it, it’s reasonable to think that other types of maintenance may have lapsed too, such as security maintenance.
We can’t blithely assume that every client site is free from malicious JavaScript or whatever – that’s a one-way ticket to weakest-link security hell. Already we’ve noticed that two partner sites have been hacked in the course of the project so far (these days we’d assume that a fair proportion of traffic to most dynamic websites is malicious).
Solution – the HTML Sanitiser.
This is a flexible, adaptable HTML sanitising module (both parsing and cleaning) that can be tweaked to let through rich markup from good client sites, and salvage what it can from bad client sites. This is the approach chosen by things like PHP5’s HTML Purifier and Ruby’s HTML::Sanitizer, but since our scraping code is in Python, we’ve had to build our own, leveraging the power of the awesome BeautifulSoup HTML parser.
Since a lot of people need to solve similar problems to this, and many eyes make for more secure code, we’ve open-sourced it.
Go and download it, make changes and update the codebase.”
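To give a flavour of the whitelist approach Dan describes, here is a minimal sketch. It is not the HTML Sanitiser's actual code or API: it uses Python's standard-library html.parser instead of BeautifulSoup so that it runs with no dependencies, and every name in it (ALLOWED_TAGS, SAFE_URL_PREFIXES, WhitelistSanitizer, sanitize) is made up for illustration. It keeps a handful of semantic tags, drops every attribute that is not explicitly allowed, throws away script and style content entirely, and closes any tags left open. It does not attempt to repair nesting or convert br or td into p.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> set of attributes allowed on that tag.
ALLOWED_TAGS = {
    "p": set(),
    "a": {"href"},
    "cite": set(),
    "dfn": set(),
    "em": set(),
    "strong": set(),
}
# Only link targets starting with one of these are kept (an assumption).
SAFE_URL_PREFIXES = ("http://", "https://", "/")
# Tags whose text content is discarded along with the tag itself.
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # pieces of sanitised output
        self.open_tags = []    # whitelisted tags currently open
        self.skip_depth = 0    # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # e.g. javascript: URLs
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray or non-whitelisted closing tag
        # Close anything still open inside this tag, then the tag itself.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    return parser.result()


dirty = ('<p onclick="evil()">Hi <a href="javascript:alert(1)">x</a> '
         '<cite>Book<script>alert(1)</script></p>')
print(sanitize(dirty))
```

The point of a whitelist rather than a blacklist is that anything the sanitiser has never heard of is dropped by default, so new browser quirks fail safe. A real sanitiser needs a much more careful treatment of URLs, encodings and malformed input than this sketch provides.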