Indexing WebWorks Reverb - Client-Side Search for Source Documents, Baggage Files and External URLs
Reverb can now index baggage files for use in Reverb’s Client-side Search. In this context, an indexable baggage file is any PDF or HTML file that is linked from a source document and included in the generated output, so that its content can contribute to search results. For more detailed information on baggage files, see “Understanding Baggage Files”.
Note: To determine which baggage files are indexed, ePublisher examines each file’s extension. A file is indexed if its extension matches one of the following:
.pdf
.html
.htm
.shtml
.shtm
.xhtml
.xhtm
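The extension check described in the note can be pictured with a short Python sketch. It is an illustration only, not ePublisher’s actual implementation: the function name is_indexable_baggage is hypothetical, and ignoring letter case when comparing extensions is an assumption that the note does not confirm.

```python
from pathlib import PurePath

# Extensions taken from the list in the note above.
INDEXABLE_EXTENSIONS = {".pdf", ".html", ".htm", ".shtml", ".shtm", ".xhtml", ".xhtm"}


def is_indexable_baggage(path: str) -> bool:
    """Return True if the file's extension is one of the indexable types.

    Assumption: the comparison ignores case (the documentation does not say).
    """
    return PurePath(path).suffix.lower() in INDEXABLE_EXTENSIONS


if __name__ == "__main__":
    for name in ("guide.pdf", "notes.HTML", "archive.zip"):
        print(name, is_indexable_baggage(name))
```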
Baggage files are indexed in the same way as source documents, provided that “Client-side Search” is ON (see “Client-side Search”). Indexable baggage files are indexed when the Index baggage files setting is Enabled. External URLs are indexed when the Index external links setting is Enabled.
Using Tidy for Indexing HTML Pages
To index HTML baggage files, Reverb uses Tidy (a tool for cleaning up HTML files) to create an XHTML copy of each file, which gives ePublisher valid XML that it can read. In some cases Tidy fails to convert a file, and you have to teach Tidy how to handle it.
One thing you might need to teach Tidy about is new tags. You need to do this if the log contains a message like the following:
line 33 column 3 - Error: <not_recognized_tag> is not recognized!
When Tidy reports this message, it was not able to generate an XHTML copy, so ePublisher does not index that baggage file. Fortunately, you can fix this.
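If a log contains many such messages, collecting the affected tag names by hand can be tedious. The following Python sketch extracts them from log text that matches the message format shown above. It assumes you have saved the log output as plain text; the function name unrecognized_tags is hypothetical and not part of ePublisher or Tidy.

```python
import re

# Matches messages of the form shown above, for example:
# "line 33 column 3 - Error: <not_recognized_tag> is not recognized!"
UNRECOGNIZED_TAG = re.compile(
    r"line \d+ column \d+ - Error: <([^>\s]+)> is not recognized!"
)


def unrecognized_tags(log_text: str) -> list[str]:
    """Return the distinct rejected tag names, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in UNRECOGNIZED_TAG.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


if __name__ == "__main__":
    sample = (
        "line 33 column 3 - Error: <not_recognized_tag> is not recognized!\n"
        "line 40 column 1 - Error: <not_recognized_tag> is not recognized!\n"
    )
    print(unrecognized_tags(sample))
```

Each name returned is a candidate for the config.txt change described in the procedure that follows.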
To Teach Tidy About New Tags
1. Go to the Tidy directory under the ePublisher installation directory on your local computer: ...\WebWorks\ePublisher\<VERSION>\Helpers\tidy\
2. Create a Format override of this helper: in the Formats sub-folder of your project, where Format overrides live, create a new folder called Helpers, and copy the entire tidy folder (from step 1) into it.
3. In the newly created tidy folder, open your config.txt file.
4. Depending on the kind of tag you want to add, uncomment line 8, line 10, or both in the config.txt file.
5. Replace the placeholder after the colon with your new tag name (for example: not_recognized_tag).
6. Save and close the file.
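Steps 1 and 2 amount to copying one folder into your project. The Python sketch below shows one way to script that copy. It is a minimal illustration under stated assumptions: the function name create_tidy_override is hypothetical, the folders used in the demonstration are temporary stand-ins, and you must supply the real installation and project paths yourself.

```python
import shutil
import tempfile
from pathlib import Path


def create_tidy_override(install_tidy_dir: Path, project_dir: Path) -> Path:
    """Copy the installed tidy helper to <project>/Formats/Helpers/tidy.

    Returns the path of the copied config.txt so it can be opened for editing.
    """
    target = project_dir / "Formats" / "Helpers" / "tidy"
    target.parent.mkdir(parents=True, exist_ok=True)
    # copytree raises FileExistsError if an override already exists,
    # so an earlier customization is never silently overwritten.
    shutil.copytree(install_tidy_dir, target)
    return target / "config.txt"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in folders for demonstration only; real paths depend on your setup.
        install = Path(tmp) / "install" / "Helpers" / "tidy"
        install.mkdir(parents=True)
        (install / "config.txt").write_text("# placeholder config\n")
        project = Path(tmp) / "MyProject"
        project.mkdir()
        print(create_tidy_override(install, project))
```

The edits to config.txt in steps 3 through 6 are still made by hand, because which lines to uncomment depends on the kind of tag you are adding.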
To learn more about customizing Tidy, go to https://www.w3.org/People/Raggett/tidy/.
Assigning Relevance Weight to Your Source Document Styles
Search results are displayed in the Search tab when a user types a word to search for and clicks Go. The results are sorted by relevancy ranking, which, in the case of source documents, is calculated from the Search relevance weight option defined in your Paragraph and Marker Styles. By default, WebWorks Reverb assigns a relevance weight of 1 to all styles.
To Modify the Relevancy Ranking in Source Documents for Search Results
1. Open your project with ePublisher Designer.
2. Scan the document to pull all styles into the Style Designer.
3. Open the Style Designer (F10 or View > Style Designer).
4. Select the style you want to assign a weight to (either in Paragraph Styles or Marker Styles).
5. Open the Options window.
6. Change the Value of the Search relevance weight option to a decimal number of your choice. To have search ignore the style, set the value to 0; content in that style is then not shown in your results.
Assigning Relevance Weight to Your HTML and PDF Baggage Files
The search results are sorted by relevancy ranking, which, in the case of HTML baggage files, is calculated from the scoring preference defined for the HTML tags in the search_settings.xml file. By default, WebWorks Reverb assigns relevancy rankings based on where in a topic a particular item is found.
To Modify the Relevancy Ranking in Baggage Files for Search Results
1. Open your project with ePublisher Designer.
2. If you want to override the relevancy ranking for all WebWorks Reverb targets, create the Formats\WebWorks Reverb\Transforms folder in your projectname folder, where projectname is the name of your ePublisher project.
3. If you want to override the relevancy ranking for one WebWorks Help target, create the Targets\WebWorks Help 5.0\Transforms folder in your projectname folder, where projectname is the name of your ePublisher project.
4. Create a customization of your search_settings.xml file.
5. You’ll see the following block of code:
<Settings version="1.0" xmlns="urn:WebWorks-Settings-Schema">
  <ScoringPrefs default-weight="0.05" pdf-weight="0.05">
    <meta name="keywords" weight="1.0"/>
    <meta name="description" weight="1.0"/>
    <meta name="summary" weight="1.0"/>
    <title weight="1.0"/>
    <div class="myclass" weight="0.05"/>
    <div weight="0.05"/>
    <h1 weight="0.1"/>
    <h2 weight="0.1"/>
    <caption weight="0.1"/>
    <h3 weight="0.1"/>
    <th weight="0.1"/>
    <h4 weight="0.1"/>
    <h5 weight="0.1"/>
    <h6 weight="0.1"/>
    <h7 weight="0.1"/>
    <p weight="0.05"/>
  </ScoringPrefs>
</Settings>
6. Modify the weight attributes for any tags, such as h1 and h2, you want to change. You can also specify additional tags with or without class attributes to further refine weights for your HTML baggage files. You may use decimal values to modify the weight attribute value.
Note: To set a default weight for tags that are not defined in this file, update the default-weight attribute value.
Note: You can change the default weight for all of the text in a PDF file by changing the pdf-weight attribute value.
7. Save and close the search_settings.xml file.
8. Regenerate your project to review the changes.
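The weight change in step 6 is normally made by hand in a text editor, but it can also be scripted. The Python sketch below changes the weight of a tag entry in settings XML that has the structure shown in step 5. It is an illustration only: the function name set_tag_weight is hypothetical, and the choice to leave entries that carry a class attribute untouched is a design decision of this sketch, not documented ePublisher behavior.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <Settings> element in search_settings.xml.
NS = "urn:WebWorks-Settings-Schema"


def set_tag_weight(settings_xml: str, tag: str, weight: str) -> str:
    """Return the settings XML with the weight of each class-less <tag> entry changed."""
    ET.register_namespace("", NS)  # keep the default namespace unprefixed on output
    root = ET.fromstring(settings_xml)
    prefs = root.find(f"{{{NS}}}ScoringPrefs")
    if prefs is None:
        raise ValueError("ScoringPrefs element not found")
    for entry in prefs.findall(f"{{{NS}}}{tag}"):
        # Entries with a class attribute (such as div class="myclass") are skipped.
        if "class" not in entry.attrib:
            entry.set("weight", weight)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<Settings version="1.0" xmlns="urn:WebWorks-Settings-Schema">'
        '<ScoringPrefs default-weight="0.05" pdf-weight="0.05">'
        '<h1 weight="0.1"/><p weight="0.05"/>'
        "</ScoringPrefs></Settings>"
    )
    print(set_tag_weight(sample, "h1", "0.5"))
```

Note that ElementTree discards XML comments when it parses a document, so any comments in your customized file would be lost if you write the result back. Keep a copy of the original file before scripting changes.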