Indexing WebWorks Reverb - Client-Side Search for Source Documents, Baggage Files and External URLs
With WebWorks Reverb 2.0, files can be indexed so that they appear as search results in the user’s help set. In this context, an indexable baggage file is any PDF or HTML file that is linked from a source document and is included in the generated output so that it can produce useful search results. For more detailed information on baggage files, see “Understanding Baggage Files”.
Note: To determine which baggage files are indexed, ePublisher examines the file extension. If the extension matches one of the following, the file is indexed.
.pdf
.html
.htm
.shtml
.shtm
.xhtml
.xhtm
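The extension check described in the note can be pictured with a short Python sketch. This is only an illustration, not ePublisher's implementation: the function name is hypothetical, and the case-insensitive comparison is an assumption that the note does not confirm.

```python
from pathlib import PurePath

# Extensions listed in the note above.
INDEXABLE_EXTENSIONS = {".pdf", ".html", ".htm", ".shtml", ".shtm", ".xhtml", ".xhtm"}


def is_indexable_baggage_file(path: str) -> bool:
    """Return True if the file extension is one of the indexable ones.

    Hypothetical helper. Lowercasing the extension is an assumption;
    the documentation does not say whether matching is case-sensitive.
    """
    return PurePath(path).suffix.lower() in INDEXABLE_EXTENSIONS
```

For example, `is_indexable_baggage_file("guide.pdf")` returns `True`, while a `.docx` file or a file with no extension returns `False`.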
Baggage files are indexed in the same way that source documents are. Indexable baggage files are indexed as long as the Index baggage files Target Setting is Enabled. External URLs are downloaded and indexed as long as the Index external links Target Setting is Enabled.
Using Tidy for Indexing HTML Pages
To index an HTML baggage file, Reverb uses Tidy (a tool for cleaning up HTML files) to create an XHTML copy of the file, which gives ePublisher a valid XML file that it can read. As useful as Tidy is, there may be times when it does not recognize a tag or generates incorrect output. Tidy is configurable, so you can adjust it to convert the HTML properly.
When Tidy does not recognize a tag in an HTML file, an error like the following is produced:
line 33 column 3 - Error: <not_recognized_tag> is not recognized!
This error means that Tidy wasn’t able to generate an XHTML copy of the HTML file, and therefore ePublisher won’t be able to index it as a baggage file. With the right adjustments, this can be fixed.
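When many HTML baggage files are involved, it can help to collect every unrecognized tag from the generation log before editing the Tidy configuration. The following Python sketch assumes the messages follow exactly the format of the sample error above. The function name is hypothetical and is not part of ePublisher or Tidy.

```python
import re

# Pattern modeled on the sample message shown above (an assumption about
# the general message format, based on that single example).
UNRECOGNIZED_TAG = re.compile(
    r"line (\d+) column (\d+) - Error: <([^>]+)> is not recognized!"
)


def unrecognized_tags(log_text: str) -> list[str]:
    """Return the distinct tag names reported as unrecognized, sorted."""
    return sorted({match.group(3) for match in UNRECOGNIZED_TAG.finditer(log_text)})
```

Each name returned is a candidate for the configuration change described in the next procedure.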
Configuring Tidy To Recognize New Tags
1. Go to the Tidy directory under the installation directory on your local computer: ...\WebWorks\ePublisher\<VERSION>\Helpers\tidy\
2. Create a Format override of this helper: in the Formats sub-folder of your project, where Format overrides live, create a new folder called Helpers and copy the entire tidy folder (from step 1) into this new folder.
3. In the newly created tidy folder, open your config.txt file.
4. Depending on the kind of tag you want to add, uncomment line 8, line 10, or both in the config.txt file.
5. On each line you uncommented, replace the placeholder after the colon with your new tag name (for example: not_recognized_tag).
6. Save and close the file.
To learn more about customizing Tidy, go to https://www.w3.org/People/Raggett/tidy/.
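The edit in steps 4 and 5 can also be scripted. The Python sketch below rests on several assumptions: that the commented-out lines in config.txt correspond to Tidy's standard new-inline-tags and new-blocklevel-tags options, that they are commented with # or //, and that a placeholder follows the colon. The shipped file may differ, so check it before relying on this. The function name is hypothetical.

```python
import re


def enable_new_tag(config_text: str, option: str, tag: str) -> str:
    """Uncomment the line for `option` and set its value to `tag`.

    Assumes the option is commented out with '#' or '//' and is followed
    by a colon and a placeholder; the real config.txt may look different.
    """
    pattern = re.compile(
        r"^[ \t]*(?:#|//)[ \t]*" + re.escape(option) + r"[ \t]*:.*$",
        re.MULTILINE,
    )
    replacement = f"{option}: {tag}"
    # A function is used as the replacement so the tag is inserted literally.
    new_text, count = pattern.subn(lambda _match: replacement, config_text, count=1)
    if count == 0:
        # No commented-out line was found, so append the option instead.
        new_text = config_text.rstrip("\n") + "\n" + replacement + "\n"
    return new_text
```

Calling `enable_new_tag(text, "new-inline-tags", "not_recognized_tag")` leaves every other line of the configuration untouched.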
Assigning Relevance Weight to Your Source Document Styles
Search results are displayed in the Search tab when a user types a word to search for. The search results are sorted by a relevancy ranking, which, in the case of source documents, is calculated based on the Search relevance weight option defined in your Paragraph and Marker Styles. By default, WebWorks Reverb 2.0 assigns a relevance weight of 1 to all styles.
To Modify the Relevancy Ranking in Source Documents for Search Results
1. Open your project with ePublisher Designer.
2. Scan the document to pull all styles into the Style Designer.
3. Open the Style Designer (F10 or View > Style Designer).
4. Select the style you want to assign a weight to (either in Paragraph Styles or Marker Styles).
5. Open the Options window.
6. Change the Value of the Search relevance weight option to a decimal number of your choosing. You can also have the style ignored (a weight of 0), which means that the style is not shown in your search results.
Assigning Relevance Weight to Your HTML and PDF Baggage Files
The search results are sorted by relevancy ranking, which, in the case of HTML baggage files, is calculated based on the scoring preference defined for the HTML tags in the search_settings.xml file. By default, WebWorks Reverb 2.0 assigns relevancy rankings based on where in a topic a particular item is found.
To Modify the Relevancy Ranking in Baggage Files for Search Results
1. Open your project with ePublisher Designer.
2. If you want to override the relevancy ranking for all WebWorks Reverb 2.0 targets, create the Formats\WebWorks Reverb 2.0\Transforms folder in your projectname folder, where projectname is the name of your ePublisher project.
3. If you want to override the relevancy ranking for one WebWorks Reverb 2.0 target, create the Targets\WebWorks Reverb 2.0\Transforms folder in your projectname folder, where projectname is the name of your ePublisher project.
4. Create a customization of your search_settings.xml file.
5. You’ll see the following block of code:
<Settings version="1.0" xmlns="urn:WebWorks-Settings-Schema">
  <ScoringPrefs default-weight="0.05" pdf-weight="0.05">
    <meta name="keywords" weight="1.0"/>
    <meta name="description" weight="1.0"/>
    <meta name="summary" weight="1.0"/>
    <title weight="1.0"/>
    <div class="myclass" weight="0.05"/>
    <div weight="0.05"/>
    <h1 weight="0.1"/>
    <h2 weight="0.1"/>
    <caption weight="0.1"/>
    <h3 weight="0.1"/>
    <th weight="0.1"/>
    <h4 weight="0.1"/>
    <h5 weight="0.1"/>
    <h6 weight="0.1"/>
    <h7 weight="0.1"/>
    <p weight="0.05"/>
  </ScoringPrefs>
</Settings>
6. Modify the weight attributes for any tags, such as h1 and h2, you want to change. You can also specify additional tags with or without class attributes to further refine weights for your HTML baggage files. You may use decimal values to modify the weight attribute value.
Note: To set a default weight for tags that are not defined in this file, update the default-weight attribute value.
Note: You can change the default weight for all of the text in a PDF file by changing the pdf-weight attribute value.
7. Save and close the search_settings.xml file.
8. Regenerate your project to review the changes.
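Before regenerating, you can sanity-check a customized search_settings.xml by reading it back and looking up the weight a given tag would receive. The Python sketch below uses a trimmed copy of the block from step 5. The function names are hypothetical, and the lookup order (tag with matching name or class, then the bare tag, then default-weight) is an assumption about how the settings might be resolved, not documented ePublisher behavior.

```python
import xml.etree.ElementTree as ET

NS = "{urn:WebWorks-Settings-Schema}"

# Trimmed copy of the block shown in step 5.
SAMPLE = """<Settings version="1.0" xmlns="urn:WebWorks-Settings-Schema">
  <ScoringPrefs default-weight="0.05" pdf-weight="0.05">
    <meta name="keywords" weight="1.0"/>
    <title weight="1.0"/>
    <div class="myclass" weight="0.05"/>
    <h1 weight="0.1"/>
    <p weight="0.05"/>
  </ScoringPrefs>
</Settings>"""


def load_scoring_prefs(xml_text):
    """Return (rules, default_weight, pdf_weight) from a settings string."""
    prefs = ET.fromstring(xml_text).find(f"{NS}ScoringPrefs")
    rules = {}
    for rule in prefs:
        tag = rule.tag.split("}", 1)[-1]  # strip the namespace prefix
        qualifier = rule.get("name") or rule.get("class")
        rules[(tag, qualifier)] = float(rule.get("weight"))
    return rules, float(prefs.get("default-weight")), float(prefs.get("pdf-weight"))


def weight_for(rules, default_weight, tag, qualifier=None):
    """Look up a weight. The precedence used here is an assumption."""
    for key in ((tag, qualifier), (tag, None)):
        if key in rules:
            return rules[key]
    return default_weight
```

With the sample above, `weight_for(rules, default_weight, "h1")` gives 0.1, and a tag that is not listed, such as span, falls back to the default-weight value of 0.05.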