Consider your content


For a description of our methodology, John K. Sterling has provided an extensive bibliography.
categorize is a project that automatically classifies HTML content into your taxonomy. You define your taxonomy, and we categorize selected pages into it. There will soon be a tool to automatically generate the taxonomy from training data. The project page is at SourceForge.

I am not big on project dependencies, but I have decided that the following projects are worth depending on. Before using categorize, you need to build the following:
  • xerces
  • ekhtml

    We rely on these two projects to handle our parsing. Xerces is a common XML parser that we selected because it is so widely used. Ekhtml is one of the fastest HTML parsers we know of. We store our taxonomy in an XML document, and we allow consumers to pass in raw HTML, so we use Xerces to read the taxonomy and el kabong (ekhtml) to parse the HTML.
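To make the two parsing roles concrete, here is a minimal sketch in Python using only the standard library. It is not categorize's code and does not use Xerces or ekhtml. The taxonomy layout (`<category>` elements holding `<keyword>` children), the keyword-counting score, and every function name are assumptions made up for illustration; the real schema and classification method are not described on this page.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical taxonomy document; the real schema is an assumption here.
TAXONOMY_XML = """
<taxonomy>
  <category name="sports">
    <keyword>football</keyword>
    <keyword>league</keyword>
  </category>
  <category name="cooking">
    <keyword>recipe</keyword>
    <keyword>oven</keyword>
  </category>
</taxonomy>
"""


class TextExtractor(HTMLParser):
    """Collects text from raw HTML, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def load_taxonomy(xml_text):
    """Return {category name: [lowercased keywords]} from the XML text."""
    root = ET.fromstring(xml_text)
    return {
        cat.get("name"): [kw.text.strip().lower() for kw in cat.findall("keyword")]
        for cat in root.findall("category")
    }


def extract_words(html):
    """Parse raw HTML and return its visible text as lowercase words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def categorize_page(html, taxonomy):
    """Pick the category whose keywords occur most often, or None if none occur."""
    words = extract_words(html)
    scores = {
        name: sum(words.count(kw) for kw in keywords)
        for name, keywords in taxonomy.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    taxonomy = load_taxonomy(TAXONOMY_XML)
    page = "<html><body><h1>Best oven recipe</h1></body></html>"
    print(categorize_page(page, taxonomy))
```

The sketch keeps the same split as the text above: one parser reads the taxonomy, another reads the page, and the classifier only ever sees plain words.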

    For build instructions tune in here