Python way to keep out XSS?

I'm in a situation where I need to allow users to submit full HTML (e.g. using FCKeditor or TinyMCE) while at the same time maintaining nearly perfect protection against those pesky XSS attacks. Crazy, I know, but it just has to be done. Not my choice.

I want protection against all known attacks and most future 0-day attacks.

See http://ha.ckers.org/xss.html

If I were using PHP, I'd use HTML Purifier.

See http://htmlpurifier.org/

But I want to use Python (Django) for this app.

Can anyone point me towards a reliable Python library that (a) not only filters HTML tags and attributes but also checks the values and plain text between tags – because some brain-dead browsers (read: Internet Explorer) will execute scripts in seemingly benign locations; (b) uses a well-audited whitelist in doing so; (c) doesn't crash on seriously malformed HTML; and (d) produces valid (X)HTML as output?

I've been doing some heavy searching, but came up with nothing except a few home-brewed solutions based on BeautifulSoup. Unfortunately, all of these only look at tags and attributes, and are hence vulnerable to more sophisticated tricks targeted at specific browsers. Even more unfortunately, a large percentage of internet users regularly use those brain-dead browsers.

There's also this: https://launchpad.net/python-html-sanitizer which sounds promising, but it doesn't seem to have gone through any serious testing in the field.

C'mon… if PHP can do it, Python should be able to do it better…

Thanks in advance for any suggestions.

6 Replies

Why don't you use HTML Purifier through a "filter application"? You could make a simple PHP script that receives HTML code as a command-line parameter, runs HTML Purifier on it, and then prints the clean HTML. Then you could use os.system() or similar from Python to invoke the script and clean the code.
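A minimal sketch of that idea, with a couple of swaps that are my assumptions rather than part of the suggestion: it uses subprocess.run() instead of os.system() so the output can be captured, and it sends the HTML on stdin instead of as a command-line parameter to sidestep shell quoting and argument-length limits. The script name purify.php is hypothetical; it stands for any PHP script that reads HTML on stdin, runs HTML Purifier, and prints the result.

```python
import subprocess

# Hypothetical command: assumes a PHP script named purify.php that reads HTML
# on stdin, runs HTML Purifier on it, and writes the cleaned HTML to stdout.
DEFAULT_FILTER_CMD = ["php", "purify.php"]


def purify_html(dirty_html, command=DEFAULT_FILTER_CMD, timeout=10):
    """Pipe untrusted HTML through an external filter and return its output."""
    result = subprocess.run(
        command,
        input=dirty_html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the filter exits non-zero
    )
    return result.stdout.decode("utf-8")
```

Starting a PHP process for every submission is probably slow under load, so this seems best suited to low-traffic forms.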

@turl:

Why don't you use HTML Purifier through a "filter application"?

Sure, that's a possibility. Maybe even run PHP as a daemon (as an Apache module, or using FastCGI) and communicate with it over standard HTTP for better concurrency.

Still, I'd prefer a pure Python solution if at all possible. I don't want to lug PHP around.
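For reference, the HTTP variant might look roughly like this from the Python side. The URL is a placeholder I made up; it assumes a local PHP script that accepts raw HTML in a POST body and responds with the purified HTML.

```python
import urllib.request

# Hypothetical endpoint: assumes a local PHP script behind Apache or FastCGI
# that accepts raw HTML in a POST body and responds with the purified HTML.
PURIFIER_URL = "http://127.0.0.1:8080/purify.php"


def purify_over_http(dirty_html, url=PURIFIER_URL, timeout=5):
    """POST untrusted HTML to the purifier service and return the response body."""
    request = urllib.request.Request(
        url,
        data=dirty_html.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )
    # urlopen raises HTTPError/URLError on failure, so callers can refuse to
    # store the submission instead of falling back to the unfiltered HTML.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```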

http://genshi.edgewall.org/ ?

I used this once on a project I never finished, but it seemed like the smartest way to go at the time.

@mwalling:

http://genshi.edgewall.org/ ?

Didn't know that Genshi had a HTML Sanitizer feature… But then, it seems rather poorly documented. Not sure if I can trust this one. Maybe I'll dig into the source code a little bit 8)

@hybinet:

Didn't know that Genshi had a HTML Sanitizer feature… But then, it seems rather poorly documented. Not sure if I can trust this one. Maybe I'll dig into the source code a little bit 8)

I actually thought the Genshi documentation was quite fine, although perhaps earlier exposure to TAL and Kid already had me thinking in the tag/attribute markup mode. I still like the approach for templating.

But I don't think Genshi has anything like a sanitizer, unless you count the fact that its template parsing is strict XML. But I suspect you're not looking to parse the user-supplied HTML as a Genshi template, nor would that be likely to complain about well-formed XSS attacks.

– David

The Genshi documentation is just fine. I was talking about the nonexistent documentation of the HTML Sanitizer feature mentioned above. Anyway, what it seems to do is filter the tags and attributes. It doesn't look inside those tags at all.

I want to detect tricks like the following, which unfortunately works in IE6.

<img src="jav
ascript:doNastyThings();">
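
To make the requirement concrete, here is a rough sketch of the kind of value-level check I mean, using only the standard library. It is not a sanitizer by itself and not taken from any existing library: the function name and the scheme allowlist are made up for illustration, and it only covers URL-valued attributes such as src and href.

```python
import html
import re

# Assumed allowlist for illustration; a real deployment would tune this.
SAFE_SCHEMES = {"http", "https", "mailto"}

# ASCII control characters and whitespace that some browsers ignore in URLs.
_IGNORED_CHARS = re.compile(r"[\x00-\x20\x7f]+")


def is_safe_url(raw_value):
    """Return True if an attribute value is a relative URL or uses a safe scheme."""
    # Decode character references first, so "&#x0A;" and "&#106;" are seen as
    # the characters the browser would see.
    value = html.unescape(raw_value)
    # Drop control characters and whitespace that could split a scheme name.
    value = _IGNORED_CHARS.sub("", value).lower()
    # A colon before any "/", "?" or "#" marks a scheme; otherwise it is relative.
    match = re.match(r"([^/?#:]*):", value)
    if match is None:
        return True
    return match.group(1) in SAFE_SCHEMES
```

With this, is_safe_url("jav\nascript:doNastyThings();") returns False because the newline is removed before the scheme is compared, while ordinary http links and relative paths pass. Other vectors (style attributes, event handlers, malformed markup) would still need their own handling.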
