Bulk editing HTML with WordPress and QueryPath

QueryPath’s jQuery-like API provides powerful tools for targeting and updating HTML elements. WordPress’ content filter modifies post/page content as it’s generated without altering what’s in the database. Combine them and you have a super flexible workflow for bulk editing HTML.

Cleaning up generated HTML is a common task

A regular task in my workday is taking large amounts of HTML exported from desktop publishing apps like Word and InDesign and removing the extra tags and attributes which litter the auto-generated files. Over the years the HTML export capabilities of desktop publishing apps have improved, but they still fall short of hand-written HTML as soon as you encounter anything beyond headings, paragraphs and lists.

Add to this the inevitability that clients or editors will want to update something in the Word/InDesign files after you’ve started the conversion process, and you’ll quickly discover your workflow needs to be as programmable and repeatable as possible. Without being able to easily repeat your workflow you’ll be stuck in a horrible loop: first remembering all the changes you’ve made, then reapplying them over and over.

Our bulk modifying workflow should be automatic, repeatable, and flexible

From this position we can list a few goals a good bulk HTML editing workflow should meet:

changes are programmable to avoid manual editing of HTML
changes are applied to multiple files automatically to avoid manually initiating a bulk process
changes are not applied to the source HTML exports to avoid overwriting your progress should you have to re-export the HTML

Regular expressions are the wrong tool for manipulating HTML

I wanted to include some info about paths I took which did not meet these workflow goals.

The find and replace functions within my code editor were the first tools I tried for bulk modifying HTML. Queries could be run on multiple files or directories, and simple changes like removing unwanted attributes were possible. But the limits of find and replace tools are reached very quickly, even with advanced queries which use regular expressions.

Most code editors allow regular expression based find and replace queries, which match patterns in strings rather than matching strings exactly. This flexibility is critical if you want to edit HTML tags without editing the tag’s contents, or remove attributes with varying values. Dreamweaver does a nice job of putting a user interface on some of these advanced capabilities, as explained nicely by Jonathan Snook.

The limits of advanced regular expression queries are hit when you want to match elements based on their position in the DOM. For example, you want to select only the headings which are within a <div> with the class of example. How poorly regular expressions suit this task is memorably demonstrated in this very popular Stack Overflow post.
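To make the "position in the DOM" point concrete, here is a minimal sketch of that exact query done with a parser rather than a pattern. It uses PHP's built-in DOMDocument and DOMXPath classes instead of QueryPath so it runs without any dependencies. The function name and the sample markup are made up for illustration.

```php
<?php
/**
 * Return the text of every heading that sits inside a <div class="example">.
 * Hypothetical helper for illustration; uses PHP's built-in DOM extension.
 */
function example_headings_in_div( $html ) {
    // Nothing to parse, nothing to find.
    if ( '' === trim( (string) $html ) ) {
        return array();
    }

    $doc = new DOMDocument();

    // Exported HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors( true );
    $doc->loadHTML( $html );
    libxml_clear_errors();

    $xpath = new DOMXPath( $doc );

    // Match h1-h6 elements that are descendants of a div whose class
    // list contains the word "example".
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " example ")]'
           . '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]';

    $found = array();
    foreach ( $xpath->query( $query ) as $heading ) {
        $found[] = $heading->textContent;
    }

    return $found;
}
```

Given '<div class="example"><h2>Inside</h2></div><h2>Outside</h2>', the function returns array( 'Inside' ). The second heading is skipped because of where it sits in the tree, which is the condition a regular expression struggles to express.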

QueryPath is purpose-built for manipulating HTML

When you read about why regular expressions are the wrong tool for the job, QueryPath is often suggested as one of the right tools for the job. I selected QueryPath because it is written in PHP and the job I was working on at the time used WordPress, plus its API is similar to jQuery’s, which is familiar to me. It also means our changes are written in PHP, which meets our first workflow goal of avoiding manual editing of HTML by programming our changes.

To put QueryPath to work in WordPress you need to include it in your site-specific plugin, and then hook into WordPress’ the_content filter, which makes the HTML content of posts and pages available to manipulate as the page is being rendered. This is an important point. In WordPress, filters are applied as pages are being created and do not edit underlying code or content within the database, which nicely satisfies the final two goals of our workflow.

Include QueryPath via Composer and become familiar with the docs

Here is some example code to demonstrate how I put QueryPath to work in WordPress. The first step is to include the QueryPath library in your site-specific plugin. As the QueryPath homepage mentions, Composer is the easiest way. Using the provided composer.json file and running composer install pulled the latest version of QueryPath into my plugin within a new vendor directory.

With Composer in place we can make use of its built-in autoload features to include QueryPath with the following line at the top of your main plugin file (after the required plugin front matter, of course): require plugin_dir_path( __FILE__ ) . 'vendor/autoload.php';

With QueryPath included you can select HTML nodes, move around the DOM, and modify content and attributes easily. Check out the QueryPath docs, especially the reference for the QueryPath class, which is the primary class of the library and has all the helpful jQuery-like methods.
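As a rough sketch of what such a filter could look like, the snippet below strips inline style attributes from paragraphs and spans. The function name and the choice of what to clean are hypothetical, and it assumes QueryPath 3.x installed by Composer in a vendor directory next to the plugin file. Check the htmlqp(), find(), removeAttr(), top() and innerHTML() calls against the docs for the version you install.

```php
<?php
// Load Composer's autoloader. Inside a plugin this is equivalent to
// require plugin_dir_path( __FILE__ ) . 'vendor/autoload.php';
require __DIR__ . '/vendor/autoload.php';

/**
 * Hypothetical filter: remove inline styles left behind by an HTML export.
 */
function example_clean_exported_html( $content ) {
    // Empty content makes QueryPath throw errors, so hand it straight back.
    // Assumption: content without any tags has nothing to clean either.
    if ( '' === trim( (string) $content ) || false === strpos( $content, '<' ) ) {
        return $content;
    }

    // Parse the (possibly messy) HTML fragment.
    $qp = htmlqp( $content );

    // Select, then modify: drop the style attribute from every p and span.
    $qp->find( 'p, span' )->removeAttr( 'style' );

    // The parser wraps fragments in <html><body>, so return only
    // what is inside <body> as a string.
    return $qp->top( 'body' )->innerHTML();
}

// Register the filter only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'the_content', 'example_clean_exported_html' );
}
```

Because the work happens in the_content, the stored post is left untouched and the cleanup is re-run on every render, so re-importing updated exports does not lose any of the programmed changes.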

The structure of my QueryPath filters is straightforward, repeated in all my functions, and demonstrated in the code above. First, the content string passed to the filter is checked to make sure it’s not empty, because QueryPath throws errors if the post/page has no content. Next, you chain QueryPath methods to select, traverse, and edit the HTML. Finally, you return the modified string of HTML to WordPress.

Import HTML files to populate WordPress, wget to export it

The above workflow is powerful and flexible but it doesn’t come without downsides. Getting large numbers of static HTML files into WordPress takes planning and a plugin called HTML Import 2.

Getting static HTML files back out again was a job for wget.

Use Grunt for a similar workflow without needing WordPress

If using WordPress adds too much overhead for your needs, you can use a similar approach with other technologies. One option I explored briefly was a custom Grunt task which could iterate over HTML files within a directory, use a library called Cheerio to select, traverse, and modify the HTML, and output the modified HTML files to another directory.
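The same "read from one directory, write to another" idea can also be sketched without WordPress in plain PHP, to stay in the language used elsewhere in this post. This is not the Grunt/Cheerio task described above, just an illustration of the shape of such a script. It uses PHP's built-in DOM classes, and the function names and directory layout are made up.

```php
<?php
/**
 * Hypothetical clean-up step: remove every inline style attribute.
 * Note: loadHTML() assumes ISO-8859-1 unless the file declares a charset.
 */
function example_strip_style_attributes( $html ) {
    $doc = new DOMDocument();
    libxml_use_internal_errors( true ); // tolerate invalid exported markup
    $doc->loadHTML( $html );
    libxml_clear_errors();

    $xpath = new DOMXPath( $doc );
    foreach ( $xpath->query( '//*[@style]' ) as $node ) {
        $node->removeAttribute( 'style' );
    }

    return $doc->saveHTML();
}

/**
 * Apply the clean-up to each .html file in $source_dir and write the
 * result to $output_dir, leaving the source exports untouched.
 * Returns the number of files written.
 */
function example_process_directory( $source_dir, $output_dir ) {
    if ( ! is_dir( $output_dir ) ) {
        mkdir( $output_dir, 0777, true );
    }

    $files = glob( rtrim( $source_dir, '/' ) . '/*.html' );
    $count = 0;

    foreach ( $files ? $files : array() as $file ) {
        $html = file_get_contents( $file );
        if ( false === $html || '' === trim( $html ) ) {
            continue; // skip unreadable or empty files
        }
        $target = rtrim( $output_dir, '/' ) . '/' . basename( $file );
        file_put_contents( $target, example_strip_style_attributes( $html ) );
        $count++;
    }

    return $count;
}
```

Calling example_process_directory( 'exports', 'cleaned' ) after each re-export repeats every programmed change in one step, which covers the same three workflow goals as the WordPress filter.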