How do I prevent XSS attacks when I need to render HTML from a WYSIWYG editor?

Non-technical background info: I work for a school and we are building a new website using Django. The teachers at the school aren't technical enough to use a markup language such as Markdown. We eventually decided to use a WYSIWYG editor, which introduces security risks. We aren't too worried about the teachers themselves, but rather about malicious students who might get hold of a teacher's credentials.

Technical background info: We are running Django 1.3 and have not chosen a specific editor yet. We are leaning towards a JavaScript one such as TinyMCE, but can be persuaded to use anything that is both secure and easy to use. Because the WYSIWYG editor will output HTML to be rendered into the document, we cannot simply escape it.

What is the best way to prevent malicious code while still making it easy for non-technical teachers to write posts?

3 Answers

#1

You need to parse the HTML on the server and remove any tags and attributes that don't meet a strict whitelist.
You should parse it (or at least re-render it) as strict XML to prevent attackers from exploiting differences between fuzzy parsers.

The whitelist must not include <script>, <style>, <link>, or <meta>, and must not include event handler attributes or style="".
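
As a rough illustration of the rule above, here is a minimal sketch that checks a whitelist configuration for the forbidden entries before it is handed to a sanitizer. The function name `find_allowlist_problems` and the data shapes (a list of tag names, a dict of tag to attribute names) are assumptions made for this example, not part of any library.

```python
# Tags that should never appear in the whitelist, per the rule above.
FORBIDDEN_TAGS = {"script", "style", "link", "meta"}


def find_allowlist_problems(allowed_tags, allowed_attrs):
    """Return a list of human-readable problems found in a whitelist.

    allowed_tags  -- iterable of tag names, e.g. ["p", "a"]
    allowed_attrs -- dict mapping tag name to a list of attribute names
    (Both shapes are hypothetical, chosen for this sketch.)
    """
    problems = []
    for tag in allowed_tags:
        if tag.lower() in FORBIDDEN_TAGS:
            problems.append("tag <%s> must not be allowed" % tag)
    for tag, attrs in allowed_attrs.items():
        for attr in attrs:
            name = attr.lower()
            if name.startswith("on"):
                # Event handlers (onclick, onerror, ...) all start with "on".
                problems.append("event handler %s on <%s>" % (attr, tag))
            elif name == "style":
                problems.append("style attribute on <%s>" % tag)
    return problems


if __name__ == "__main__":
    for problem in find_allowlist_problems(
        ["p", "a", "script"],
        {"a": ["href", "onclick"], "span": ["style"]},
    ):
        print(problem)
```

Rejecting every attribute that starts with "on" is deliberately conservative; it only validates the configuration and does not sanitize any HTML by itself.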

You must also parse URLs in href="" and src="" and make sure that they are either relative paths, http://, or https://.
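
The URL check described above could be sketched roughly as follows, using only the standard library. The helper name `is_safe_url` is hypothetical, and treating protocol-relative URLs (`//host/path`) as unsafe is an extra assumption of this sketch rather than something the answer requires.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url):
    """Return True for relative paths and http(s) URLs, False otherwise."""
    candidate = url.strip()
    # Browsers tolerate control characters (tabs, newlines) inside URLs,
    # which can hide a scheme such as "java\tscript:", so reject them.
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in candidate):
        return False
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # Malformed URLs (for example a broken IPv6 host) are rejected.
        return False
    if parts.scheme:
        return parts.scheme.lower() in ALLOWED_SCHEMES
    # No scheme: accept only plain relative paths, not "//host/path".
    return not parts.netloc


if __name__ == "__main__":
    for url in ["/media/photo.png", "https://example.com/", "javascript:alert(1)"]:
        print(url, is_safe_url(url))
```

A check like this would run on every `href` and `src` value after the HTML has been parsed, never on the raw markup.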

#2

This is late, but you can try Bleach. Under the hood it uses html5lib, and you'll also get tag balancing.

Here is a complete snippet:

settings.py

BLEACH_VALID_TAGS = ['p', 'b', 'i', 'strike', 'ul', 'li', 'ol', 'br',
                     'span', 'blockquote', 'hr', 'a', 'img']
BLEACH_VALID_ATTRS = {
    'span': ['style', ],
    'p': ['align', ],
    'a': ['href', 'rel'],
    'img': ['src', 'alt', 'style'],
}
BLEACH_VALID_STYLES = ['color', 'cursor', 'float', 'margin']

app/forms.py

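import bleach
from django import forms
from django.conf import settings

# MyWYSIWYGEditor and MyModel are placeholders for your own widget and model.
class MyModelForm(forms.ModelForm):
    myfield = forms.CharField(widget=MyWYSIWYGEditor)

    class Meta:
        model = MyModel

    def clean_myfield(self):
        # Sanitize the submitted HTML on the server before it is saved.
        myfield = self.cleaned_data.get('myfield', '')
        # Note: the styles argument is accepted by older Bleach releases;
        # newer ones replace it with css_sanitizer.
        cleaned_text = bleach.clean(
            myfield,
            tags=settings.BLEACH_VALID_TAGS,
            attributes=settings.BLEACH_VALID_ATTRS,
            styles=settings.BLEACH_VALID_STYLES,
        )
        return cleaned_text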

Read the Bleach docs to adapt it to your needs.

#3

@SLaks is right that you need to do the sanitization on the server since students who steal a teacher's credentials could use those credentials to POST directly to your server.

The question "Python HTML sanitizer / scrubber / filter" discusses existing HTML sanitizers available for Python.

I would suggest starting with an empty whitelist, then using the WYSIWYG editor to create a snippet of HTML with each button so that you know the varieties of HTML it produces, and then whitelisting only the tags and attributes needed to support that HTML. Hopefully it doesn't use the CSS style attribute, because that can also be an XSS vector.
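
One way to take that inventory is sketched below: paste the HTML the editor produces into a small script that lists every tag and the attributes seen on it. The class `TagInventory` and the function `inventory` are names made up for this example; only the standard library's `html.parser` is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagInventory(HTMLParser):
    """Collect every tag name and the attribute names used on it."""

    def __init__(self):
        super().__init__()
        self.seen = defaultdict(set)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.seen[tag].update(name for name, _value in attrs)


def inventory(html):
    """Return {tag: sorted attribute names} for a snippet of editor output."""
    parser = TagInventory()
    parser.feed(html)
    parser.close()
    return {tag: sorted(attrs) for tag, attrs in parser.seen.items()}


if __name__ == "__main__":
    sample = '<p>Hello <b>world</b> <a href="/x" rel="nofollow">link</a><br/></p>'
    for tag, attrs in sorted(inventory(sample).items()):
        print(tag, attrs)
```

The result is only a starting point for the whitelist: anything that shows up here still needs to be reviewed before it is allowed.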
