CH HTMLConvert 1.0

'CH HTMLConvert' is a software component that enables you to parse HTML strings and extract the plain text content.

'CH HTMLConvert' is freeware. You may freely use and distribute it with any commercial software. The source code of this component is also available. See our website for details: www.ch-software.de/htmlconvert

There are two versions of this component:

You can use the Win32 DLL within every programming environment that supports calling Win32 API functions (C/C++, VB, Delphi). There are some example projects included in this distribution.

The Win32 DLL provides two functions:


BOOL HTMLToText(
  LPCTSTR szHTML,     // address of HTML string to convert 
  LPTSTR szText,      // address of buffer receiving output text
  BOOL bPreferASCII,  // translate common Unicode characters to ASCII
  LPTSTR szCharset,   // address of buffer receiving the charset
  int nCharsetLen,    // size of charset buffer
  UINT* pCodePage     // address of integer receiving the codepage
);

Parameters

szHTML

Points to the character string to be converted.

szText

Points to the buffer that receives the output text. The function has no size parameter for this buffer and assumes that it can hold the length of szHTML plus one character, so the caller must allocate at least that much.

bPreferASCII

Specifies whether some frequently used Unicode characters are translated to similar ASCII characters. See the table below:

Unicode character   ASCII character   ASCII code
(decimal)                             (decimal)

 8209               -                 45
 8211               -                 45
 8212               -                 45
 8217               '                 39
 8220               "                 34
 8221               "                 34
 8226               *                 42
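
The table above can also be read as a small lookup function. The following is only a sketch of that mapping. The name prefer_ascii is hypothetical and not part of the DLL, and the character names in the comments are the standard Unicode names for these code points, added here for orientation.

```c
/* Hypothetical helper, not exported by the DLL.
   Returns the ASCII substitute for a Unicode code point listed in the
   table above, or 0 if the code point is not in the table. */
char prefer_ascii(unsigned int codepoint)
{
    switch (codepoint) {
    case 8209:              /* non-breaking hyphen */
    case 8211:              /* en dash */
    case 8212:              /* em dash */
        return '-';         /* ASCII 45 */
    case 8217:              /* right single quotation mark */
        return '\'';        /* ASCII 39 */
    case 8220:              /* left double quotation mark */
    case 8221:              /* right double quotation mark */
        return '"';         /* ASCII 34 */
    case 8226:              /* bullet */
        return '*';         /* ASCII 42 */
    default:
        return 0;           /* not covered by the table */
    }
}
```

A caller that post-processes text on its own could apply such a function to each character and keep the original character whenever 0 is returned.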

szCharset

Points to the buffer that receives the charset that has been specified in the HTML input. This parameter can be NULL.

nCharsetLen

Specifies the size in characters of the buffer pointed to by the szCharset parameter.

pCodePage

Points to an integer variable that receives the codepage number of the HTML input. For example, if the HTML input specifies the charset "Windows-1252", the codepage number will be 1252. This parameter can be NULL.
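
To illustrate the relationship between the charset name and the codepage number in the example above, here is a minimal sketch that handles only names of the form "Windows-NNNN". The function codepage_from_windows_charset is hypothetical. It is not part of the DLL and does not describe how the component resolves charsets internally, which is not documented here.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper, not exported by the DLL.
   Returns NNNN for a charset name of the form "windows-NNNN"
   (compared case-insensitively), or 0 for any other input. */
unsigned int codepage_from_windows_charset(const char *charset)
{
    static const char prefix[] = "windows-";
    const size_t prefix_len = sizeof prefix - 1;
    size_t i;
    char *end;
    unsigned long cp;

    if (charset == NULL)
        return 0;

    /* Compare the prefix; a shorter input fails at its terminator. */
    for (i = 0; i < prefix_len; ++i) {
        if (tolower((unsigned char)charset[i]) != prefix[i])
            return 0;
    }

    /* Require at least one digit and nothing after the number. */
    if (!isdigit((unsigned char)charset[prefix_len]))
        return 0;
    cp = strtoul(charset + prefix_len, &end, 10);
    if (*end != '\0' || cp > 65535UL)
        return 0;

    return (unsigned int)cp;
}
```

Other charset names (for example "utf-8" or "iso-8859-1") have no such numeric shortcut and would need a lookup table, which this sketch deliberately leaves out.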


Return Values

If the function succeeds, the return value is non-zero.

If the return value is zero, there are problems with the code page that the HTML input requires:

ANSI version:

The return value is zero if the required codepage does not match the system's ANSI codepage. The output text may contain unreadable characters.

Unicode version:

The return value is zero if the required codepage is not installed on the system. In that case, the input text cannot be converted to the correct Unicode representation, and the output text is unlikely to contain any readable characters.
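
The buffer rule for szText and the meaning of the return value can be combined in a small wrapper. This is a sketch under several assumptions: the typedef HtmlToTextFn mirrors the ANSI prototype using plain C types, the calling convention (not stated in this document) is left out, and convert_html and stub_html_to_text are hypothetical names. The stub is only a crude stand-in so the wrapper can be tried without the DLL. It does not reflect how the component parses HTML.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed shape of the ANSI entry point in plain C types:
   LPCTSTR -> const char *, LPTSTR -> char *, BOOL -> int, UINT -> unsigned int.
   The real declaration may also carry a calling-convention keyword. */
typedef int (*HtmlToTextFn)(const char *szHTML, char *szText, int bPreferASCII,
                            char *szCharset, int nCharsetLen,
                            unsigned int *pCodePage);

/* Hypothetical wrapper: allocates strlen(html) + 1 characters for the
   output, calls the converter, and reports the codepage status.
   Returns a buffer the caller must free(), or NULL on allocation failure. */
char *convert_html(HtmlToTextFn html_to_text, const char *html, int *codepage_ok)
{
    char *text;
    int ok;

    if (html_to_text == NULL || html == NULL)
        return NULL;

    text = calloc(strlen(html) + 1, sizeof *text);
    if (text == NULL)
        return NULL;

    /* szCharset and pCodePage are optional, so NULL is passed for both. */
    ok = html_to_text(html, text, 1, NULL, 0, NULL);

    /* Zero means a codepage problem; the text may then be unreadable. */
    if (codepage_ok != NULL)
        *codepage_ok = (ok != 0);
    return text;
}

/* Stand-in for the DLL function: drops everything between '<' and '>'. */
int stub_html_to_text(const char *szHTML, char *szText, int bPreferASCII,
                      char *szCharset, int nCharsetLen, unsigned int *pCodePage)
{
    int in_tag = 0;
    (void)bPreferASCII; (void)szCharset; (void)nCharsetLen; (void)pCodePage;

    for (; *szHTML != '\0'; ++szHTML) {
        if (*szHTML == '<')
            in_tag = 1;
        else if (*szHTML == '>')
            in_tag = 0;
        else if (!in_tag)
            *szText++ = *szHTML;
    }
    *szText = '\0';
    return 1;
}
```

With the real DLL, the function pointer would be obtained from the loaded library instead of the stub. Even when the status is zero the wrapper still returns the text, since the document notes that the output may be partly readable in the ANSI case.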

 


THIS SOFTWARE AND THE ACCOMPANYING FILES ARE DISTRIBUTED 'AS IS' AND WITHOUT WARRANTIES AS TO PERFORMANCE OF MERCHANTABILITY OR ANY OTHER WARRANTIES WHETHER EXPRESSED OR IMPLIED. NO WARRANTY OF FITNESS FOR A PARTICULAR PURPOSE IS OFFERED. TEST THE PROGRAM THOROUGHLY WITH NON-CRITICAL DATA BEFORE RELYING ON IT. THE USER MUST ASSUME THE ENTIRE RISK OF USING THE PROGRAM.