Instructions (note: only tested on Mac OS X):
  • Download xerces
  • Download ekhtml

    We rely on these two projects to handle our parsing. Xerces is a common XML parser that we selected because it is so widely used. Ekhtml is, as far as we know, among the fastest HTML parsers available. We store our taxonomy in an XML document, and we allow consumers to pass in raw HTML, so we use El Kabong (ekhtml) to parse those files.
    Once you have downloaded and built both, run:

    ./autogen.sh --with-xerces=/path/to/xerces --with-ekhtml=/path/to/ekhtml
    then type:
    make
    make check

    To categorize a page with the default taxonomy, run:
    ./src/categorize /path/to/document.html
    If you then want to modify the taxonomy, you can, for now, edit ./seed/schema.xml by hand to add categories and keys. I've found that editing the file manually works well. Just make sure you give a high score to words that uniquely identify a category.
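
    To illustrate why uniquely identifying words should score highly, here is a minimal Python sketch of weighted keyword scoring. It is not the project's implementation, and it does not reflect the actual layout of ./seed/schema.xml: the category names, keywords, scores, and the functions score_text and categorize are all hypothetical, chosen only to show the idea.

```python
import re
from collections import Counter

# Hypothetical taxonomy: category -> {keyword: score}.
# These names and numbers are made up for illustration; they are
# not taken from ./seed/schema.xml.
TAXONOMY = {
    "cooking": {"recipe": 5, "oven": 4, "time": 1},
    "sports": {"goal": 4, "league": 5, "time": 1},
}


def score_text(text, taxonomy):
    """Return {category: total score} for the keywords found in text."""
    # Lowercase the text and count each alphabetic word.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        category: sum(score * counts[word] for word, score in keys.items())
        for category, keys in taxonomy.items()
    }


def categorize(text, taxonomy):
    """Return the category with the highest total score."""
    scores = score_text(text, taxonomy)
    return max(scores, key=scores.get)


sample = "Preheat the oven, then follow the recipe. Total time: one hour."
print(score_text(sample, TAXONOMY))  # {'cooking': 10, 'sports': 1}
print(categorize(sample, TAXONOMY))  # cooking
```

    In this sketch the shared word "time" gets a low score in both categories, so it barely affects the result, while "recipe" and "oven" appear only under one category and decide the outcome.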