Abstract
Vilistextum is an HTML to ASCII converter written specifically to produce ASCII text suitable for reading.
Some features:
- can collapse multiple empty lines
- can set the width of the output text
- can remove empty ALT attributes
- can set a default string for IMG tags without an ALT attribute
- can convert characters and entities between 128 and 159 from the Windows-1252 charset to meaningful strings in ISO 8859-1. E.g. 0x93 is converted to '"'. Quite a lot of broken documents on the web use Windows-1252.
- output can be optimized for ebook reading
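The Windows-1252 conversion listed above can be illustrated with a minimal Python sketch. It is not vilistextum's actual code: only the 0x93 entry comes from this document, and the other table entries, the table name CP1252_TO_ASCII and the function name fix_cp1252 are assumptions made for illustration.

```python
# Hypothetical, partial table. Only 0x93 -> '"' is stated in this document;
# the remaining replacements are assumptions and may differ from vilistextum.
CP1252_TO_ASCII = {
    0x85: "...",  # horizontal ellipsis
    0x91: "'",    # left single quotation mark
    0x92: "'",    # right single quotation mark
    0x93: '"',    # left double quotation mark
    0x94: '"',    # right double quotation mark
    0x96: "-",    # en dash
    0x97: "--",   # em dash
}


def fix_cp1252(data: bytes) -> str:
    """Read bytes as ISO 8859-1, replacing mapped bytes in 0x80-0x9F."""
    out = []
    for byte in data:
        if 0x80 <= byte <= 0x9F and byte in CP1252_TO_ASCII:
            out.append(CP1252_TO_ASCII[byte])
        else:
            # ISO 8859-1 code points are identical to the byte values.
            out.append(chr(byte))
    return "".join(out)
```

Bytes in the 128 to 159 range that are missing from the table pass through unchanged in this sketch. How vilistextum treats them is not described here.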
INSTALL:
make or gmake
It should compile on any platform with a decent gcc.
DOWNLOAD:
vilistextum-v2.3.1.tar.gz
vilistextum-v2.3.1.tar.bz2
vilistextum_v2.3.0.tar.gz
vilistextum_v2.22.tar.gz
USAGE:
vilistextum [OPTIONS] [inputfile|-] [outputfile|-]
- inputfile|-, outputfile|-
- Use '-' in place of inputfile to read from standard input, and '-' in place of outputfile to write to standard output.
- --version
- Reports version number and release date.
- -h,--help
- Prints a list of the command line options.
- -c, --convert-tags
- Some of the tags will be converted to special characters.
E.g.: "<B>Bold</B> isn't <I>italic</I> isn't <U>underlined</U> isn't <EM>emphasized</EM> but is like <STRONG>strong</STRONG>."
will be output as "*Bold* isn't /italic/ isn't _underlined_ isn't /emphasized/ but is like *strong*."
- -p, --palm
- Outputs text more suitable for reading on a PDA.
Palm text readers do their own word wrapping, so the width is set to infinity and the program doesn't right-justify or center the text.
- -w, --width number
- The width of the output text.
Default: 72.
- -m, --nomicrosoft
- The Windows-1252 characters from € to Ÿ (128 to 159), and the entities that name them, will not be converted.
- -i, --defimage string
- IMG tags without an ALT attribute are output as [string].
Default: Image.
- -r, --remove-empty-alt
- If an IMG tag has an empty ALT attribute (e.g. <IMG src="..." alt="">), don't output '[]'.
- -s, --shrinklines
- If there are more than two consecutive newlines, output only two, so there is at most one completely empty line in a row.
- -l, --links
- Numbers the links in the document and prints the corresponding addresses at the end of the file. Similar to 'lynx -dump'. Note: Relative URIs are not resolved and won't be printed.
- -e, --errorlevel number
- Increase level of verbosity for error messages.
0: No error messages
1: Show unrecognized entities
2: Show unknown tags
>2: Mostly debugging information
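The behaviour of the -c and -s options described above can be sketched in Python. This is an illustration under stated assumptions, not the program's implementation: the marker characters are taken from the -c example, while the names TAG_MARKERS, convert_tags and shrink_lines and the regex-based approach are hypothetical.

```python
import re

# Markers as shown in the -c example; which other tags vilistextum
# converts, if any, is not covered by this sketch.
TAG_MARKERS = {"b": "*", "strong": "*", "i": "/", "em": "/", "u": "_"}


def convert_tags(html):
    """Replace opening and closing B, STRONG, I, EM and U tags with markers."""
    def marker(match):
        return TAG_MARKERS[match.group(1).lower()]

    # Matches only bare tags such as <B> or </em>; tags with attributes
    # and unrelated tags such as <BR> or <IMG> are left untouched.
    pattern = r"</?(b|strong|i|em|u)\s*>"
    return re.sub(pattern, marker, html, flags=re.IGNORECASE)


def shrink_lines(text):
    """Collapse runs of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)
```

For example, convert_tags("<B>Bold</B> and <I>italic</I>") returns "*Bold* and /italic/", and shrink_lines("a\n\n\n\nb") returns "a\n\nb", leaving a single empty line between the two pieces of text.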
BUGS and similar features:
The handling of OL is broken. The program treats it as UL, and more than 6 levels of nested lists confuse it.
Text is never justified.
Bug reports or comments:
You can send your comments or bug reports to this address. If you've discovered a bug, please include a link to the HTML file that caused it, or attach a copy.
Patric Müller
Last modified: Thu May 3 23:06:07 CEST 2001