How to extract information from an MHT file

Asked By Lukas 240 points N/A Posted on - 05/15/2011

I have a set of files saved as MHT. I need to programmatically process them and extract the information. How do I go about this?


Answer Accepted: Yes
Question Category: .NET Programming

Answered By Stella 0 points N/A #97233

An MHT file is a web archive of an HTML page. It is actually a plain-text (ASCII) file that you can open in Notepad. First, identify a unique string that marks the point where you want to cut. Then use string manipulation to cut out the portion of data you need. It's fairly heavy work!
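As a rough illustration of the chop-point idea, here is a minimal sketch. The thread names no language for the solution, so Python is an assumption here, and the helper `extract_between`, the marker strings, and the sample text are all made up for the example.

```python
from typing import Optional


def extract_between(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between the first start_marker and the next end_marker."""
    start = text.find(start_marker)
    if start == -1:
        return None  # chop point not found
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None  # no closing marker after the chop point
    return text[start:end]


# Hypothetical fragment of HTML taken from inside an MHT file.
sample = '<html><body><span id="price">42.50</span></body></html>'
price = extract_between(sample, '<span id="price">', '</span>')
print(price)  # 42.50
```

One caveat: the HTML section of an MHT file is often stored in quoted-printable encoding, where `=` is written as `=3D` and long lines are wrapped. In that case the markers will not match until the section has been decoded.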

Posted on - 05/15/2011

Answered By Lukas 240 points N/A #97235

Why do you say it is heavy work? Can't I use the XMLDom document object and load the data into it? All of the web pages are W3C XHTML compliant!

Posted on - 05/15/2011

Answered By Stella 0 points N/A #97237

When you save a page as an MHT, the file you get is a "Web Archive". This means that all supporting material is embedded inside one file. As a result, the saved MHT file is not XML compliant, so you cannot load it into the XMLDom object as-is. Furthermore, Internet Explorer alters the HTML so that references to images and other resources point to locations inside the file.
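To see why the archive cannot be loaded as-is, here is a small sketch using Python's standard XML parser as a stand-in for XMLDom (an assumption, since the thread is about .NET). The sample content is made up and heavily shortened. The point is only that MIME headers come before any markup, so an XML parser rejects the file immediately.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened MHT content: headers first, markup later.
raw_mht = (
    "From: <Saved by a browser>\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body><p>Hello</p></body></html>\n"
    "--BOUNDARY--\n"
)

try:
    ET.fromstring(raw_mht)
    loaded = True
except ET.ParseError as err:
    loaded = False
    print("Not well-formed XML:", err)

# The HTML section on its own parses fine when it is well-formed.
html_only = "<html><body><p>Hello</p></body></html>"
paragraph = ET.fromstring(html_only).find("body/p").text
print(paragraph)  # Hello
```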

Posted on - 05/15/2011

Answered By Lukas 240 points N/A #97239

For my information, how do the images and text get stored in a single file? I thought you could not mix the two! Text is ASCII, but aren't images binary?

Posted on - 05/15/2011

Answered By Stella 0 points N/A #97241

Lukas,

You are correct that raw binary data cannot simply be mixed into an ASCII text file. What actually happens is that the binary images and other resources are encoded as Base64 text. The Web Archive format (MHT) follows the MIME multipart model used by email. For example, when you send an email with an attachment, the attachment's binary data is encoded as a Base64 string and appended to the text portion of the message.
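A short sketch of the encoding step, in Python (my choice of language, not something the thread specifies). It uses the 8-byte signature that starts every PNG file as sample binary data and shows that the Base64 form is plain ASCII that decodes back to the same bytes.

```python
import base64

# The 8-byte signature at the start of every PNG file: binary, not valid text.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Encode the bytes as Base64, giving a plain ASCII string.
encoded = base64.b64encode(png_signature).decode("ascii")
print(encoded)  # iVBORw0KGgo=

# Decoding restores the original bytes exactly.
decoded = base64.b64decode(encoded)
print(decoded == png_signature)  # True
```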

Base64 output consists only of printable ASCII characters, so it is safe to embed in a text file. If you open the MHT file in a text editor, you will see these sections and the encoded strings. Each section starts at a delimiter line: a run of hyphens followed by "=_NextPart". The top section is the HTML; the next section is the Base64-encoded image.
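Because the layout is MIME, a MIME parser can do the splitting. The sketch below uses Python's standard `email` package on a made-up, minimal archive. The boundary string, the `example.com` URLs, and the tiny image payload are hypothetical, and a real file saved by a browser will carry more headers and much longer sections.

```python
from email import message_from_string

# Made-up miniature archive: one HTML section and one Base64 image section.
SAMPLE_MHT = """\
MIME-Version: 1.0
Content-Type: multipart/related; type="text/html"; boundary="----=_NextPart_SAMPLE"

------=_NextPart_SAMPLE
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://example.com/page.html

<html><body><img src=3D"http://example.com/logo.png"></body></html>

------=_NextPart_SAMPLE
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Location: http://example.com/logo.png

iVBORw0KGgo=

------=_NextPart_SAMPLE--
"""

message = message_from_string(SAMPLE_MHT)

parts = []
for part in message.walk():
    if part.is_multipart():
        continue  # skip the outer container, keep only the leaf sections
    # decode=True undoes quoted-printable or Base64 and returns bytes
    payload = part.get_payload(decode=True)
    parts.append((part.get_content_type(), part["Content-Location"], payload))
    print(part.get_content_type(), part["Content-Location"], len(payload), "bytes")
```

Each tuple in `parts` holds the section's content type, the location it was saved from, and the decoded bytes, so the HTML and the images can be processed separately.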

Posted on - 05/15/2011

Answered By Lukas 240 points N/A #97243

Thank you Stella! This is great! Does this mean that I can safely "chop" the MHT file into sections at this "=_NextPart" delimiter and then process the sections individually?

Posted on - 05/15/2011

Best Answer

Answered By Stella 0 points N/A #97245

Yes, you can split on the delimiter, but the exact string differs from browser to browser, so do not hard-code it. Split the file at the delimiter, then work on the HTML section, keeping in mind that the browser has altered it.

Expect some trial and error before the XMLDom object will read the HTML.
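One way to avoid hard-coding the delimiter is to read it from the file itself: in a MIME file the delimiter is declared in the `boundary` parameter of the top-level `Content-Type` header. The sketch below is in Python with its standard XML parser standing in for XMLDom (both are assumptions). The helper `extract_html` and the sample archive are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# Made-up sample; the boundary is deliberately arbitrary.
SAMPLE_MHT = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="any-browser-specific-string"

--any-browser-specific-string
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

<html><body><p class=3D"title">Quarterly report</p></body></html>

--any-browser-specific-string--
"""


def extract_html(mht_text: str) -> str:
    """Return the decoded text of the first text/html section."""
    message = message_from_string(mht_text)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            charset = part.get_content_charset() or "utf-8"
            return part.get_payload(decode=True).decode(charset, errors="replace")
    raise ValueError("no text/html section found")


boundary = message_from_string(SAMPLE_MHT).get_boundary()
print("Boundary declared in header:", boundary)

html = extract_html(SAMPLE_MHT)
try:
    root = ET.fromstring(html)  # succeeds only for well-formed markup
    title = root.find(".//p[@class='title']").text
except ET.ParseError:
    title = None  # fall back to string searching or a lenient HTML parser
print(title)
```

Real XHTML pages usually declare a namespace on the root element, in which case element lookups need namespace-qualified names, so this is likely where the trial and error comes in.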

Posted on - 05/15/2011

Answered By Lukas 240 points N/A #97246

Thank you Stella for your advice! I see that I will still have to use string manipulation! Your information helped me shorten the development time! Thank you again!

Posted on - 05/15/2011

Answered By Stella 0 points N/A #97247

Glad to be of help! Have a nice day!

Posted on - 05/15/2011
