How to extract information from an MHT file
I have a set of files saved as MHT. I need to programmatically process them and extract the information. How do I go about this?
An MHT file is a web archive of an HTML page. It is actually a plain-text file that you can open in Notepad. First, identify a unique string that marks the point where you want to cut. Then use string manipulation to cut out the portion of data you need. It's fairly heavy work!
Why do you say it is heavy work? Can't I use the XMLDom document object and load the data into it? All of the web pages are W3C XHTML compliant!
When you choose to save the file as an MHT, the file you save is actually a "Web Archive". This means that all supporting material is embedded inside one file. As a result, the saved MHT file as a whole is not XML compliant, so you cannot load it into the XMLDom object "as-is". Furthermore, Internet Explorer alters the HTML so that it points to resources within the file, i.e. references to images and similar resources need to point to a location inside the archive.
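To illustrate the point above, here is a minimal Python sketch (Python rather than XMLDom, purely for illustration). The MHT text is a made-up, heavily shortened sample and `is_well_formed_xml` is a hypothetical helper name. It only shows that the MIME headers and delimiter lines wrapped around the markup stop an XML parser, even when the embedded page itself is well formed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical, heavily shortened MHT content; real files are much longer.
MHT_TEXT = (
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/related; boundary="----=_NextPart_000"\r\n'
    "\r\n"
    "------=_NextPart_000\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body><p>Hello</p></body></html>\r\n"
    "------=_NextPart_000--\r\n"
)

# The same markup on its own, without the MIME wrapper.
XHTML_ONLY = "<html><body><p>Hello</p></body></html>"


def is_well_formed_xml(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


print(is_well_formed_xml(MHT_TEXT))    # False: MIME headers are not XML
print(is_well_formed_xml(XHTML_ONLY))  # True: the markup alone parses
```

So the HTML section has to be separated from the rest of the archive before any XML tooling can be applied to it.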
For my information, how do the images and text get stored in a single file? I thought you could not mix the two! Text is ASCII, and aren't images binary?
Lukas,
You are correct that raw binary data cannot simply be mixed into an ASCII file. What actually happens is that the binary images and other resources are encoded using Base64. The Web Archive format (MHT) is saved using the MIME model, the same one used for email. For example, when you send an email with attachments, the binary attachment data is encoded as a Base64 string and appended to the text portion of the email.
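Because the format follows the MIME model, a MIME-aware parser can take the archive apart and undo the Base64 encoding for you. Below is a minimal sketch using Python's standard `email` package. The sample archive, its URLs, the fake image bytes and the `extract_parts` helper are all assumptions made up for illustration, and the quoted-printable encoding on the HTML part is likewise an assumption about how such files are typically saved.

```python
import base64
import email

# Made-up stand-in for real image data.
IMAGE_BYTES = b"\x89PNG fake image bytes"
ENCODED = base64.b64encode(IMAGE_BYTES).decode("ascii")

# Hypothetical, tiny MHT-like archive built in memory.
MHT_BYTES = (
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/related; boundary="----=_NextPart_000"\r\n'
    "\r\n"
    "------=_NextPart_000\r\n"
    'Content-Type: text/html; charset="utf-8"\r\n'
    "Content-Transfer-Encoding: quoted-printable\r\n"
    "Content-Location: http://example.com/page.html\r\n"
    "\r\n"
    '<html><body><img src=3D"logo.png" /></body></html>\r\n'
    "------=_NextPart_000\r\n"
    "Content-Type: image/png\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "Content-Location: http://example.com/logo.png\r\n"
    "\r\n"
    + ENCODED + "\r\n"
    "------=_NextPart_000--\r\n"
).encode("ascii")


def extract_parts(mht_bytes):
    """Return (content_type, location, decoded_bytes) for each leaf part."""
    message = email.message_from_bytes(mht_bytes)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the container, keep only the leaf sections
        # decode=True reverses base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        parts.append(
            (part.get_content_type(), part.get("Content-Location"), payload)
        )
    return parts


for content_type, location, payload in extract_parts(MHT_BYTES):
    print(content_type, location, len(payload))
```

The decoded image bytes could then be written to disk, and the decoded HTML handed to whatever parser suits the page.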
Base64 output consists only of printable ASCII characters, which is why it is considered "web safe". If you open up the MHT file you will notice these "sections" and the encoded strings. The data is organized as follows. Notice the "----=_NextPart" line. This delimits the sections. The top part is the HTML; the next part is the binary image.
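For anyone who prefers plain string handling, the layout described above can be cut apart roughly like this. This is only a sketch under assumptions: the sample text and the `split_mht_sections` helper are invented for illustration, and it assumes the boundary is declared once in the file header and never appears inside a section's content.

```python
import base64
import re

# Hypothetical, shortened sample; "R0lGODlh" is Base64 for the bytes b"GIF89a".
SAMPLE = (
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/related; boundary="----=_NextPart_000"\r\n'
    "\r\n"
    "------=_NextPart_000\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hi</body></html>\r\n"
    "------=_NextPart_000\r\n"
    "Content-Type: image/gif\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "\r\n"
    "R0lGODlh\r\n"
    "------=_NextPart_000--\r\n"
)


def split_mht_sections(mht_text):
    """Split MHT text into (headers, body) pairs using its MIME boundary."""
    text = mht_text.replace("\r\n", "\n")  # normalize line endings
    match = re.search(r'boundary="?([^";\n]+)"?', text, re.IGNORECASE)
    if match is None:
        raise ValueError("no MIME boundary declared in the file header")
    # Each delimiter line is "--" followed by the declared boundary.
    delimiter = "--" + match.group(1)
    sections = []
    # The first chunk is the file header, so it is skipped.
    for chunk in text.split(delimiter)[1:]:
        if chunk.startswith("--"):
            break  # "<delimiter>--" marks the end of the archive
        headers, _, body = chunk.strip("\n").partition("\n\n")
        sections.append((headers, body))
    return sections


sections = split_mht_sections(SAMPLE)
html_headers, html_body = sections[0]
image_headers, image_body = sections[1]
print(html_body)
print(base64.b64decode(image_body))
```

Reading the boundary from the header, rather than hard-coding it, matters because the text after "NextPart" is not guaranteed to be the same from one file to the next.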
Thank you Stella! This is great! Does this mean that I can safely "chop" the MHT file into sections by this "----=_NextPart" delimiter and then process the sections individually?
Thank you Stella for your advice! I see that I will still have to use string manipulation! Your information helped me shorten the development time! Thank you again!