By Meredith Kinder and Sheila Loring, Senior Members, Carolina Chapter
Your chapter content is valuable to present and prospective members. Help them find it with ease by converting your newsletter articles to HTML format.
The Carolina Chapter STC is a mid-sized chapter with members from many different professional specialties dispersed across central and eastern North Carolina. We have been offering our quarterly newsletter online, in PDF format, for several years. This year we recognized that some of the feature articles in past issues contain valuable content that is still relevant and worth providing to our members in a medium more flexible than PDF. By providing that content in HTML on dedicated pages on our chapter Web site, we made it more accessible to members who might not want to dig through years’ worth of PDFs.
While most chapters put issues of their newsletter in PDF format on their Web site, few offer their best articles in an easy-to-find manner. Some chapters offer their newsletters in HTML but do not have a search engine for their site content, and do not pull out the feature articles as a separately navigable resource. Other chapters have switched to an HTML newsletter, but do not offer its issues on their Web site. Since 2001 we have been publishing large and colorful newsletters with relevant articles that can stand on their own. Here is a brief sample list of some of the feature articles:
- “Web Design for Small Companies: Pretend that You Have a Programmer,” by Kim Flint
- “Mentoring as a Two-Way Street,” by Andy Smith
- “Wield the Power of the Written Word,” by Michael Uhl
- “Improving Technical Reviews,” by Alexia Idoura
There are also articles on structured authoring, human factors, medical writing, and quality metrics, as well as balancing parenting with work, learning XML, and foretelling professional trends. All of these are as useful today as they were when they were written—some last month, some two years ago.
By finding a catchy title or scanning the list of articles for the name of a particular author, users can more easily find the content of some timeless articles. In addition, if the articles are offered in HTML, their content can be included in the search engine on the chapter’s Web site. So when a user types in “mentoring” or “editing,” the results can include links to the article contents. This is content that makes the chapter Web site more useful and is also another way of acknowledging the talent within your chapter membership.
Converting New Articles to HTML
Our chapter’s process of generating and publishing the newsletter now includes saving some of the articles to HTML for posting on the chapter Web site. Collaboration between the newsletter team and the webmaster has made the conversion process easy.
The articles are written and edited in Word. The production editor copies and pastes the articles into an Adobe InDesign template, refines the layout and formatting, and converts the files to PDF. While the print newsletter is being developed, the webmaster converts the original Word file to HTML. The conversion and cleanup involves the following steps:
1. Convert the Word files to HTML using a shareware program called Word Cleaner, which also strips Microsoft Office styles from the HTML files. Microsoft Word does let you save documents as filtered HTML; however, some Office styles still remain in the files. For example, most paragraphs are marked up as <p class=MsoNormal> elements. Empty paragraphs look like this in HTML:
<p class=MsoNormal>
<span style='font-family:"Times New Roman"'> </span></p>
Word also embeds an inline stylesheet in the HTML file, even if you save the document as filtered HTML. Depending on the variations in formatting, the stylesheet can range from ten to over forty lines. Here’s a short excerpt:
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Arial;
panose-1:2 7 4 9 2 2 5 2 4 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:"Trebuchet MS";
panose-1:2 11 6 3 2 2 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:Arial;}
..</style>
This extraneous code is deleted so that all Web pages are formatted consistently and file sizes are kept to a minimum.
2. Scan the HTML files for structural and formatting issues that must be corrected manually. Copy the title of the article into the <title> and <h1> elements, wrap subheadings in <h2> and <h3> elements, and mark up lists with <ul> (bulleted) or <ol> (numbered) elements and <li> items.
3. Type in the header and footer “include” commands. We store header and footer navigation, CSS and RSS feed links, and other frequently used content in separate files. The commands look like this:
<!--#include virtual="/includes/header1.shtml" -->
<!--#include virtual="/includes/header2.shtml" -->
<!--#include virtual="navigation.shtml" -->
<!--#include virtual="/includes/footer.shtml" -->
When a Web page with the .shtml extension is displayed, the server embeds code from the included files in the Web page. This method enforces consistency in navigation and formatting throughout the Web site.
4. If the images weren’t embedded in the source file, insert them in the HTML. Add alternate text and the image width and height to the <img> elements.
5. Add a link to the article on the newsletter archives Web page and upload all files.
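The cleanup of Word markup described above can be approximated with a few search-and-replace passes. The following Python sketch is only an illustration under stated assumptions: it is not how Word Cleaner works, the function name strip_word_markup and the sample input are hypothetical, and regular expressions handle only the simple patterns quoted in this article, not arbitrary HTML.

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove a few common leftovers from Word's filtered HTML (illustrative only)."""
    # Drop the embedded stylesheet, including everything between the tags.
    html = re.sub(r"<style\b.*?</style>", "", html, flags=re.DOTALL | re.IGNORECASE)
    # Drop empty paragraphs that contain only a span of whitespace or &nbsp;.
    html = re.sub(
        r"<p\b[^>]*>\s*<span\b[^>]*>(?:\s|&nbsp;)*</span>\s*</p>",
        "",
        html,
        flags=re.IGNORECASE,
    )
    # Remove Office class attributes such as class=MsoNormal.
    html = re.sub(r"""\s+class=["']?Mso\w+["']?""", "", html, flags=re.IGNORECASE)
    return html


# Made-up sample input modeled on the excerpts shown in this article.
sample = (
    "<style>\n<!--\np.MsoNormal {margin:0in;}\n-->\n</style>\n"
    "<p class=MsoNormal>First paragraph.</p>\n"
    "<p class=MsoNormal>\n"
    "<span style='font-family:\"Times New Roman\"'> </span></p>\n"
    "<p class=MsoNormal>Second paragraph.</p>\n"
)

cleaned = strip_word_markup(sample)
print(cleaned)
```

A pass like this still leaves the structural work (titles, headings, lists) to be checked by hand, as in the manual steps above.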
That’s our single-sourcing workflow on a shoestring budget, using minimal volunteer resources. The STC Single Sourcing SIG should be proud of us: We now publish a single set of content to PDF and to HTML on the Web.
Converting Archived PDFs to HTML
We had a backlog of over three years of legacy content that took a bit of effort to transform into HTML, but one volunteer tackled it in a matter of about three working days spread over two weeks. Here’s the general process:
1. Open the PDF file in Adobe Reader, select the text, and paste it into a text editor (in this case, Notepad). This makes sure that there are no special characters or formatting—just text.
2. Do this for several articles at a time, say for an entire issue. Most issues had three or four good articles.
3. Remove the forced line returns. Copy the text into a Word document, search for “^p^p” (two consecutive paragraph marks), and replace each occurrence with a unique and noticeable string (such as “&-&-&”). This marks the real paragraph breaks in the article.
4. Remove single paragraph marks by searching for “^p” and replacing them with a blank space.
5. Restore the article paragraph breaks by replacing each “&-&-&” placeholder with “</p><p>”. This gives you pretty much all the text enclosed in <p> elements.
6. Copy the text back into a text editor and save it as a Web page.
7. Clean up the HTML (see steps 2 and 3 in the section Converting New Articles to HTML).
8. Either export the graphics from Acrobat (if you have Acrobat 6 or higher) or screen capture them from the original PDF. Crop the graphics, if necessary, and save them as JPEGs.
9. Send the files to the webmaster.
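For readers who prefer a script to Word's search-and-replace, steps 3 through 5 can be sketched in Python. This is a minimal sketch under stated assumptions, not the process the volunteer actually used: the function name reflow_to_html and the sample text are hypothetical, a blank line in the pasted text is assumed to play the role of “^p^p”, and wrapping the result in an outer <p> element is an addition to the steps as written.

```python
PLACEHOLDER = "&-&-&"  # the marker string suggested in step 3


def reflow_to_html(text: str) -> str:
    """Turn hard-wrapped text pasted from a PDF into <p> elements."""
    # Normalize Windows line endings and trim leading/trailing whitespace.
    text = text.replace("\r\n", "\n").strip()
    # Step 3: protect real paragraph breaks (two consecutive line returns).
    text = text.replace("\n\n", PLACEHOLDER)
    # Step 4: replace the remaining forced line returns with a space.
    text = text.replace("\n", " ")
    # Step 5: turn each placeholder into a paragraph boundary.
    text = text.replace(PLACEHOLDER, "</p><p>")
    # Assumption: wrap the whole text so the first and last paragraphs are closed.
    return "<p>" + text + "</p>"


# Made-up sample text standing in for an article pasted from a PDF.
pasted = "Mentoring works best\nwhen both people learn.\n\nStart with a\nclear goal.\n"

result = reflow_to_html(pasted)
print(result)
```

The output still needs the manual cleanup in step 7, since headings and lists come through as ordinary paragraphs.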
The process usually took about an hour for six or eight articles. While the process sounds labor-intensive, it was effective for the number of articles we had to handle. If you have a Web site management tool like Dreamweaver or HomeSite, or a development environment tool such as Visual Studio, you can edit many HTML files at once, making the task that much easier. The tools that let you do quick search-and-replace are very helpful.
Final Thoughts
As technical communicators, we should practice what we preach. The value of content should not be underestimated, especially when the content can help your members and be offered as a valuable asset to all those in the profession. Imagine the wealth of content that could be available if all the chapters of STC made their best newsletter articles more readily available on Web sites.
In the future, this content can be stored in a database and the HTML pages created dynamically as needed. Many businesses are doing this with their technical content. As the amount of content grows, a content management system becomes essential. For now, we’ll start with a few manually created HTML pages. If nothing else, the chapter gives volunteers a place to practice some of the ideas of single sourcing, with a manageable amount of material and a friendly, deadline-free environment. Once you see what the requirements are for posting on the Web and for publishing in PDF, then you can improve your process to handle both steps.
We encourage other chapters to adopt this process of making content available to members. If you want to know details of how we output to HTML, handled our legacy content, or provided a search engine for our site content, feel free to contact us. If you have experience doing this for your chapter or some good ideas about how to further improve the process, let us know.
For more information, visit our chapter Web site. To download our PDF newsletter and read the articles in HTML format, select Newsletter from the chapter information menu.
Editor’s note: This article was originally published, in slightly different form, in the second quarter 2006 issue of Carolina Communiqué.
Meredith Blackwelder Kinder is the managing editor for Carolina Communiqué. She’s past president of the chapter and currently a documentation specialist for SAS Institute Inc. She can be reached at meredith.kinder@sas.com.
Sheila Loring is the production editor for Carolina Communiqué. She’s also the communications manager for the chapter and currently a consultant for Scriptorium Publishing. She can be reached at loring@scriptorium.com.
The authors would like to thank Bill Albing for his help in creating the single-sourcing process and his contributions to the article.