Monday 15 February 2016

Introducing the Subject Index

Please start with the Subject Index if you are looking for something.  It is in the panel on the right of the page.

Humans still do a much better job of indexing books and websites than computers. So I have indexed all of the posts on this blog manually and the result is the subject index. Compared with search features that come with Blogger, an index made by a human is like a magic map of the blog that lets you find things instantly.

If you have a great blog where old posts are getting buried in archives, read on.

Making a back-of-book index

At first, I tried to create the index directly as a Word document. This was painful: it is very hard to be consistent, and scanning posts for items that should be indexed is a lot of work. So I wrote a program to help me index the posts. The type of index I created is called a back-of-book index, and it still has to be compiled by a person.

First, the program scans all of the posts and makes a list of the words they contain. This list of words is called a Concordance. Many of the words in the Concordance, such as [the] or [a], are of no interest to anyone. More than 95% of the words in the Concordance tell us nothing about the subject of a post; they just help make sentences. The trick is to display only the 'subject' words and hide the rest. So after each post is scanned, the program displays the new words it has found so that I can flag the subject words.
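For readers who want to try the idea outside Access, here is a minimal Python sketch of the scanning step. It is not my Access program: the function names, the tiny stop-word list and the sample post text are all made up for illustration, and a real list of non-subject words would be built up by hand over many posts.

```python
import re
from collections import Counter

# Hypothetical starter list of non-subject words (assumption, not the real list).
STOP_WORDS = {"the", "a", "and", "of", "in", "is", "on", "it"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def build_concordance(posts):
    """Count every word across all posts (dict of title -> text)."""
    counts = Counter()
    for text in posts.values():
        counts.update(tokenize(text))
    return counts


def new_words(text, known_words):
    """Words in this post that have not been reviewed yet, in order of first appearance."""
    seen = set()
    result = []
    for word in tokenize(text):
        if word not in known_words and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def subject_words(text, flagged):
    """Words in this post that were previously flagged as subject words."""
    return sorted(set(tokenize(text)) & flagged)


if __name__ == "__main__":
    posts = {"Popeye Mullet": "The popeye mullet is a fish in the estuary."}
    print(build_concordance(posts).most_common(3))
    print(new_words(posts["Popeye Mullet"], STOP_WORDS))
```

The human step stays the same: the program only proposes new words, and a person decides which ones are flagged as subject words.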

To create an index, I view each post alongside the list of subject words found in it. I cannot make an index directly from the subject words because there are several problems. For example, scientific names get broken up into two separate words when they should be treated as one term. Most of the words in the list are also marginal to the main subject. Putting too many words in an index creates information pollution.

This is the process for indexing a page using a program I wrote in MS Access, which is a very easy place to write small programs. It is good to have a switchboard that opens the forms you need in workflow order.


Next, I have to paste in text from the blog (see below). The post that I am indexing is the one on Popeye Mullet.   One day I will automate this step.


Then I scan for new words and flag the words that have subject relevance.


Next, I scan the new post for subject words.  The list is not bad but there are many words that are peripheral to the subject of the post.


I type the items I want in the index into the box below. A category can be added in front of any item by using a backslash. I have also decided that common names will be followed by scientific names and that scientific names will not be followed by common names. Making cross-references like this is a key skill of human indexers.
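As a rough illustration of the backslash convention, here is a small Python sketch that splits typed entries into a category and an item. The function names and the sample entries are assumptions made for this example; the Access form may handle the text differently.

```python
def parse_entry(line):
    """Split 'Category\\Item' into (category, item); category is None if there is no backslash."""
    line = line.strip()
    if "\\" in line:
        category, item = line.split("\\", 1)
        return category.strip(), item.strip()
    return None, line


def parse_entries(text):
    """Parse one entry per non-empty line of the text box."""
    return [parse_entry(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    typed = "Fish\\Popeye Mullet\nFiddler crab"
    for category, item in parse_entries(typed):
        # Entries without a category are easy to spot and fix at this stage.
        print(category or "(no category)", "->", item)
```

Returning None for a missing category makes uncategorised entries easy to list before the index is generated.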


Next, I generate the index for all posts as a table. There I can edit entries to make the formatting more consistent and to add information that points out the differences between similar posts. The edits are saved, so I only have to make them once.


Finally, I preview the index in a browser. Here I can see errors, such as the fiddler crab entry being displayed without a category.



It is quite easy to generate HTML from a database table. Here is the actual code that generates the HTML for the index from the database table.


The code above makes the HTML text, which is then placed in a text box so I can just copy and paste it into Blogger to create the site index. Clicking on the label beside the text box selects the text, then I press Ctrl-C to copy it (including all the lines that do not fit in the box).
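If you do not use Access, the same idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code from my program: the row layout (category, item, link), the function name and the example.com links are all hypothetical.

```python
from html import escape
from itertools import groupby


def index_to_html(rows):
    """Turn (category, item, url) rows into an HTML fragment grouped by category."""
    # Sort so that groupby sees each category as one run; uncategorised rows come first.
    ordered = sorted(rows, key=lambda r: (r[0] or "", r[1].lower()))
    parts = []
    for category, group in groupby(ordered, key=lambda r: r[0] or ""):
        if category:
            parts.append(f"<b>{escape(category)}</b>")
        parts.append("<ul>")
        for _, item, url in group:
            parts.append(f'<li><a href="{escape(url)}">{escape(item)}</a></li>')
        parts.append("</ul>")
    return "\n".join(parts)


if __name__ == "__main__":
    rows = [
        ("Fish", "Popeye Mullet", "https://example.com/popeye-mullet"),
        (None, "Fiddler crab", "https://example.com/fiddler-crab"),
    ]
    # The printed text can be pasted into an HTML gadget or page.
    print(index_to_html(rows))
```

Escaping the item text and links keeps characters such as & from breaking the generated page.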


I wrote my own indexing program because I could not find any on the net other than old programs that no longer work or very expensive professional programs.  The entire program took about 8 hours to develop.  In its current form the program is basic but it works.  Let me know if you are interested.

