'CH HTMLConvert' is a software component that enables you to parse HTML strings and extract the plain text content.
'CH HTMLConvert' is freeware. You may freely use and distribute it with any commercial software. The source code of this component is also available. See our website for details: www.ch-software.de/htmlconvert
There are two versions of this component:
chHtmlConvert32.dll: a Win32 dynamic link library written in C++
chHtmlConvert.dll: a .NET assembly written in C#
You can use the Win32 DLL in any programming environment that supports calling Win32 API functions (C/C++, VB, Delphi). Some example projects are included in this distribution.
The Win32 DLL provides two functions:
BOOL HTMLToText(
    LPCTSTR szHTML,       // address of HTML string to convert
    LPTSTR  szText,       // address of buffer receiving output text
    BOOL    bPreferASCII,
    LPTSTR  szCharset,    // address of buffer receiving the charset
    int     nCharsetLen,  // size of charset buffer
    UINT*   pCodePage     // address of integer receiving the codepage
);
szHTML
Points to the character string to be converted.
szText
Points to the buffer that receives the output text. The function takes no size argument for this buffer: it assumes the buffer can hold the length of szHTML + 1 characters, so the caller must allocate at least that much.
bPreferASCII
Specifies whether some frequently used Unicode characters are translated to similar ASCII characters. See the table below:
Unicode character   Code    ASCII character   Code
‑                   8209    -                 45
–                   8211    -                 45
—                   8212    -                 45
’                   8217    '                 39
“                   8220    "                 34
”                   8221    "                 34
•                   8226    *                 42
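The substitutions listed in the table above can be written as a small lookup. This is only an illustrative sketch of the mapping, not the DLL's own code; the function name to_preferred_ascii is hypothetical, and leaving unlisted characters unchanged is an assumption.

```c
/* Illustration of the bPreferASCII substitution table; NOT the DLL's code.
   to_preferred_ascii is a hypothetical name. It returns the ASCII code for
   the listed Unicode code points and (by assumption) passes every other
   code point through unchanged. */
unsigned int to_preferred_ascii(unsigned int cp)
{
    switch (cp) {
    case 8209: /* non-breaking hyphen */
    case 8211: /* en dash */
    case 8212: /* em dash */
        return 45; /* '-' */
    case 8217: /* right single quotation mark */
        return 39; /* '\'' */
    case 8220: /* left double quotation mark */
    case 8221: /* right double quotation mark */
        return 34; /* '"' */
    case 8226: /* bullet */
        return 42; /* '*' */
    default:
        return cp;
    }
}
```

For example, to_preferred_ascii(8212) yields 45, the code of the ASCII hyphen.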
szCharset
Points to the buffer that receives the charset that has been specified in the HTML input. This parameter can be NULL.
nCharsetLen
Specifies the size in characters of the buffer pointed to by the szCharset parameter.
pCodePage
Points to an integer variable that receives the codepage number of the HTML input. For example, if the HTML input specifies the charset "Windows-1252" the codepage number will be 1252. This parameter can be NULL.
If the function succeeds, the return value is non-zero.
If the return value is zero, there are problems with the code page that the HTML input requires:
ANSI version:
The return value is zero if the required codepage does not match the system's ANSI codepage. The output text may contain unreadable characters.
Unicode version:
The return value is zero if the required codepage is not installed on the system. In that case the input text cannot be converted to the correct Unicode representation, and the output text is unlikely to contain any readable characters.
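The parameter and return-value rules above can be combined into a short calling sketch. It is written in portable C under several assumptions: the ANSI version is used, the Windows types are replaced by their plain C equivalents (BOOL as int, LPCTSTR as const char*, LPTSTR as char*, UINT as unsigned int), and the calling convention is left out. The names HtmlToTextFn, convert_html and stub_html_to_text are hypothetical. The stub only stands in for the DLL so that the sketch runs anywhere; it copies its input and does not strip any HTML. In a real program the function pointer would refer to the DLL's HTMLToText export, declared as in the distribution's header.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed portable equivalent of the ANSI prototype shown above. */
typedef int (*HtmlToTextFn)(const char *szHTML, char *szText, int bPreferASCII,
                            char *szCharset, int nCharsetLen,
                            unsigned int *pCodePage);

/* Hypothetical helper: allocates the output buffer as documented
   (length of szHTML + 1 characters) and calls the converter.
   Returns the output text (caller frees) or NULL if allocation fails.
   *ok receives the converter's return value. */
char *convert_html(HtmlToTextFn fn, const char *html, int prefer_ascii,
                   unsigned int *code_page, int *ok)
{
    char charset[64] = "";
    char *text = malloc(strlen(html) + 1);

    *ok = 0;
    if (text == NULL)
        return NULL;
    text[0] = '\0';

    *ok = fn(html, text, prefer_ascii, charset, (int)sizeof charset, code_page);
    if (!*ok) {
        /* Zero signals a code page problem; the text may be unreadable. */
        fprintf(stderr, "warning: code page problem (charset \"%s\")\n", charset);
    }
    return text;
}

/* Stand-in so the sketch runs without the DLL. It copies the input
   unchanged and reports Windows-1252; it is NOT a real converter. */
int stub_html_to_text(const char *szHTML, char *szText, int bPreferASCII,
                      char *szCharset, int nCharsetLen, unsigned int *pCodePage)
{
    (void)bPreferASCII;
    strcpy(szText, szHTML); /* fits: the caller allocated strlen + 1 */
    if (szCharset != NULL && nCharsetLen > 0) {
        strncpy(szCharset, "Windows-1252", (size_t)nCharsetLen - 1);
        szCharset[nCharsetLen - 1] = '\0';
    }
    if (pCodePage != NULL)
        *pCodePage = 1252;
    return 1;
}
```

A call such as convert_html(stub_html_to_text, html, 1, &cp, &ok) returns a newly allocated buffer that the caller releases with free(). Because szCharset and pCodePage may be NULL, a caller that does not need the charset information can pass NULL for both.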
THIS SOFTWARE AND THE ACCOMPANYING FILES ARE DISTRIBUTED 'AS IS' AND WITHOUT WARRANTIES AS TO PERFORMANCE OF MERCHANTABILITY OR ANY OTHER WARRANTIES WHETHER EXPRESSED OR IMPLIED. NO WARRANTY OF FITNESS FOR A PARTICULAR PURPOSE IS OFFERED. TEST THE PROGRAM THOROUGHLY WITH NON-CRITICAL DATA BEFORE RELYING ON IT. THE USER MUST ASSUME THE ENTIRE RISK OF USING THE PROGRAM.