I have added a small Sitecore module to GitHub that I developed and have used in a couple of projects where we wanted to maintain tight control over the HTML output from rich-text fields.

The problem

One of the great benefits of Sitecore is that it gives you full control over the HTML output of your site: it doesn’t inject unwanted scripts or styles, and it doesn’t force you to conform to a certain HTML structure. However, with great power comes great responsibility.

I am sure you have worked on a site that has been lovingly crafted by your UX team, beautifully imagined by designers, and expertly rendered by front-end developers, only for the site to then be gifted to content editors. At this point, using Sitecore’s standard rich-text editor, they add their own unique touch by way of tables for layouts, overriding fonts, and center-aligning text to their heart’s delight. The result is not quite the dream your team first had, or indeed, sold to the client.

Often, this is not the fault of the editors; content gets copied and pasted from Word (or other applications) and customized HTML comes along for the ride. And all those lovely styling buttons are right there in the editor, so why not use them?

Workarounds

Yes, there are means of tackling this in Sitecore. You can limit the functionality of the rich-text editor by removing most of the options, but Sitecore’s editor can still result in some undesirable HTML even with very few options available to the user. And yes, there are buttons there to assist in the copying of text from Word, if your editors remember to use them.

You can even go down the road of validating the HTML created and show the user an error if their HTML isn’t up to scratch. But if you’re going to that trouble, why not present a solution rather than a problem and silently tidy up the content for the user immediately? That’s what this small module, Sweep, sets out to do.

Introducing Sitecore.Sweep

Despite the dramatic introduction, Sweep is actually a very straightforward and simple module. When an item is saved through the UI (i.e. in Experience Editor or Content Editor), Sweep passes the fields being saved through a pipeline that a) determines whether they need to be cleaned and b) cleans them if required.

Whether a field needs to be cleaned is determined by the module’s configuration. You can configure Sweep to clean all rich-text fields if you desire, which is not something I have done myself, or you can take a more granular approach and apply it only to certain templates and fields.
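
To make that decision a little more concrete, here is a rough sketch of the "should this field be cleaned?" check. It is written in Python purely for illustration and is not the module's actual code or configuration format; the names SweepConfig, should_clean, and the template and field names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class SweepConfig:
    """Hypothetical settings; the real module has its own configuration."""

    # When True, every rich-text field is cleaned.
    clean_all_rich_text: bool = False
    # Granular mode: template name -> names of the fields to clean.
    fields_by_template: Dict[str, FrozenSet[str]] = field(default_factory=dict)


def should_clean(config: SweepConfig, template: str,
                 field_name: str, field_type: str) -> bool:
    """Decide whether a field being saved should go through the cleaners."""
    if field_type != "Rich Text":
        return False  # only rich-text fields are ever considered
    if config.clean_all_rich_text:
        return True
    allowed = config.fields_by_template.get(template, frozenset())
    return field_name in allowed


# Granular setup (assumed names): only the "Body" field of "Article" items.
granular = SweepConfig(fields_by_template={"Article": frozenset({"Body"})})
clean_everything = SweepConfig(clean_all_rich_text=True)
```

With the granular setup, should_clean(granular, "Article", "Body", "Rich Text") is true, while the same call for a "Summary" field or a "Landing Page" template is false.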

The cleaning is performed by an extensible pipeline which can do as little or as much as you like. Included in the module are options for:

  • Removing inline styling (e.g. <p style="margin-top:430px">A lovely paragraph</p> -> <p>A lovely paragraph</p>)
  • Removing unwanted classes (supports both whitelisting + blacklisting class names)
  • Removing empty elements (e.g. <p></p>)
  • Fixing shoddy headers (e.g. <p><strong>My title!</strong></p> -> <h2>My title!</h2>)
  • Ensuring text is wrapped in a paragraph tag if no root element is found
  • Fixing nested paragraphs (e.g. <p><p>My text</p></p> -> <p>My text</p>); yes, this can happen!
  • Removing non-breaking spaces, because they shouldn’t be used for spacing text
  • Removing inner-elements from headers (e.g. <h1><strong>My strong header!</strong></h1> -> <h1>My strong header!</h1>)
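
A few of the options above can be sketched as a chain of small steps, each taking HTML in and handing cleaned HTML to the next. This is an illustrative Python sketch only, not how the module is implemented: the function names are hypothetical, and it uses regular expressions for brevity, which is a simplification that will not cope with every kind of markup (an attribute value containing ">", for example).

```python
import re
from typing import Callable, Sequence

Step = Callable[[str], str]

_TAG = re.compile(r"<[^>]+>")
_STYLE_ATTR = re.compile(r"""\s+style\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)
_EMPTY = re.compile(r"<(p|span|div|strong|em)\b[^>]*>\s*</\1>")


def remove_inline_styles(html: str) -> str:
    """Drop style="..." attributes, looking only inside tags."""
    return _TAG.sub(lambda m: _STYLE_ATTR.sub("", m.group(0)), html)


def unwrap_nested_paragraphs(html: str) -> str:
    """Turn <p><p>text</p></p> into <p>text</p>."""
    return re.sub(r"<p>\s*(<p\b[^>]*>.*?</p>)\s*</p>", r"\1", html,
                  flags=re.DOTALL)


def remove_empty_elements(html: str) -> str:
    """Remove empty elements, repeating until nothing more changes."""
    previous = None
    while previous != html:
        previous, html = html, _EMPTY.sub("", html)
    return html


def strip_header_children(html: str) -> str:
    """Keep only the text inside h1-h6 elements."""
    def clean(match: "re.Match[str]") -> str:
        tag, attrs, inner = match.group(1), match.group(2), match.group(3)
        return f"<{tag}{attrs}>{_TAG.sub('', inner)}</{tag}>"
    return re.sub(r"<(h[1-6])\b([^>]*)>(.*?)</\1>", clean, html,
                  flags=re.DOTALL)


DEFAULT_STEPS: Sequence[Step] = (
    remove_inline_styles,
    unwrap_nested_paragraphs,
    remove_empty_elements,
    strip_header_children,
)


def sweep(html: str, steps: Sequence[Step] = DEFAULT_STEPS) -> str:
    """Run the HTML through each cleaning step in order."""
    for step in steps:
        html = step(html)
    return html
```

The order matters in this sketch: styles are stripped first so that the nested-paragraph and empty-element checks see plain tags.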

These are just a collection of examples that I have used in various projects. It is by no means suggested that a site should need to use all of these options, or that all of these make for Perfect HTML™. Every situation is different, and these are just some of the provided options that can be used. It is also super-easy to add your own by extending a provided processor base class.
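
As a rough picture of what "adding your own" looks like, here is a minimal Python sketch of a processor base class and two processors built on it. The real module is a Sitecore (.NET) module with its own base class, so every name here (SweepProcessor, process, RemoveFontTags, run_pipeline) is a hypothetical stand-in rather than the module's API.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable


class SweepProcessor(ABC):
    """Hypothetical stand-in for a cleaning-processor base class."""

    @abstractmethod
    def process(self, html: str) -> str:
        """Return a cleaned copy of the given HTML."""


class RemoveNonBreakingSpaces(SweepProcessor):
    """Replace non-breaking spaces with ordinary spaces."""

    def process(self, html: str) -> str:
        return html.replace("&nbsp;", " ").replace("\u00a0", " ")


class RemoveFontTags(SweepProcessor):
    """A custom rule: drop <font> tags but keep the text inside them."""

    _FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)

    def process(self, html: str) -> str:
        return self._FONT_TAG.sub("", html)


def run_pipeline(html: str, processors: Iterable[SweepProcessor]) -> str:
    """Pass the HTML through each processor in turn."""
    for processor in processors:
        html = processor.process(html)
    return html
```

A new rule is then just another small subclass added to the list passed to run_pipeline, for example run_pipeline(html, [RemoveFontTags(), RemoveNonBreakingSpaces()]).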

This is definitely not a module for all sites, or even most sites, but I have certainly made use of it so I have added it to GitHub in case others can use it too.

Want to try it?

The module is hosted on GitHub. It has also been submitted to the Sitecore Marketplace and will hopefully be published shortly.

Feedback

I would be very keen to hear from anyone who has tried the module and wants to provide feedback. If you’ve read this blog post and have any questions or comments, please comment below or message me on Sitecore Slack for a chat about it.

I’d especially like to hear from you if you think it’s either fundamentally flawed or can be achieved in a much better manner!