2-cent Tip: Reading MHT files in Linux

Ben Okopnik [ben at linuxgazette.net]

Fri, 13 Feb 2009 20:08:48 -0500

Just ran across Yet Another Proprietary Format from Micr0s0ft: .mht files. Seems that Internet Explorer saves emails and HTML as an ugly mess that somewhat resembles an email; according to Wikipedia, there's no single standard, and the state of the state is best described as 'sauve qui peut' (which translates, at least in Redmond, as "all your ass are belong to me!") Bleh.

Searching the Web shows that there are a lot of people - the just-converted-to-Linux newbies, particularly - who have loads of these things and don't know what to do with them. Some people recommend Opera (I suppose a couple of hours of Kiri Te Kanawa is good for relieving all kinds of stress...); some have had luck with various conversion utilities. I looked at it, and it looked something like a mangled email header, soooo...

I didn't go searching for more than just the one file that I had, but here's what worked fine for opening it:

# Convert line-ends to Unix format
flip -ub file.mht
# Prepend a standard 'From ' mail header to the file
sed -i '1i\'"$(echo From $USER $(date))" file.mht
# You should now be able to open it with your favorite MUA
mutt -f file.mht

It worked fine for me.
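
For systems without 'flip', the same two steps can be sketched with only standard tools. This is a minimal sketch rather than a tested recipe: the function name mht_to_mbox is hypothetical, it writes to a separate output file instead of editing in place, and it falls back to "nobody" when $USER is unset (an assumption, not part of the commands above).

```shell
# mht_to_mbox is a hypothetical helper name, not part of any package.
# Usage: mht_to_mbox input.mht output.mbox
mht_to_mbox() {
    in=$1
    out=$2
    {
        # mbox-style 'From ' separator line, same shape as the sed command builds
        printf 'From %s %s\n' "${USER:-nobody}" "$(date)"
        # Strip carriage returns to get Unix line endings
        tr -d '\r' < "$in"
    } > "$out"
}
```

After running it, "mutt -f output.mbox" should open the result the same way as the edited original.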

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *


Mulyadi Santosa [mulyadi.santosa at gmail.com]

Sat, 14 Feb 2009 22:02:20 +0700

On Sat, Feb 14, 2009 at 8:08 AM, Ben Okopnik <ben@linuxgazette.net> wrote:

> Just ran across Yet Another Proprietary Format from Micr0s0ft: .mht
> files. Seems that Internet Explorer saves emails and HTML as an ugly
> mess that somewhat resembles an email; according to Wikipedia, there's
> no single standard, and the state of the state is best described as
> 'sauve qui peut' (which translates, at least in Redmond, as "all your
> ass are belong to me!") Bleh.
>
> Searching the Web shows that there are a lot of people - the
> just-converted-to-Linux newbies, particularly - who have loads of these
> things and don't know what to do with them. Some people recommend Opera
> (I suppose a couple of hours of Kiri Te Kanawa is good for relieving
> all kind of stress...); some have had luck with various conversion
> utilities. I looked at it, and it looked something like a mangled email
> header, soooo...
>
> I didn't go searching for more than just the one file that I had, but
> here's what worked fine for opening it:
>
> ```
> # Convert line-ends to Unix format
> flip -ub file.mht
> # Prepend a standard 'From ' mail header to the file
> sed -i '1i\'"$(echo From $USER $(date))" file.mht
> # You should now be able to open it with your favorite MUA
> mutt -f file.mht
> ```
>
> It worked fine for me.

Perhaps, as an alternative to "flip", dos2unix could be used here too?

regards,

Mulyadi.


Thomas Adam [thomas.adam22 at gmail.com]

Sat, 14 Feb 2009 15:42:27 +0000

2009/2/14 Mulyadi Santosa <mulyadi.santosa@gmail.com>:

> perhaps as the alternative of "flip", dos2unix could be used too here?

Yes, it could, but as I have said before in similar posts, "col" is perhaps the most portable way, across almost all UNIX variants.

-- Thomas Adam


Ben Okopnik [ben at linuxgazette.net]

Sat, 14 Feb 2009 11:23:09 -0500

On Sat, Feb 14, 2009 at 03:42:27PM +0000, Thomas Adam wrote:

> 2009/2/14 Mulyadi Santosa <mulyadi.santosa@gmail.com>:
> > perhaps as the alternative of "flip", dos2unix could be used too here?
> 
> Yes they could, but as I have said before in similar posts, "col" is
> perhaps the most portable way, across almost all UNIX variants.

I've got to say, I've never been a fan of 'col' - the man page has always confused the living hell out of me. What is a "half-reverse line feed", anyway?

If you have any favorite recipes you like to use with it, please share.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *


Jimmy O'Regan [joregan at gmail.com]

Sun, 15 Feb 2009 15:10:55 +0000

2009/2/14 Ben Okopnik <ben@linuxgazette.net>:

> On Sat, Feb 14, 2009 at 03:42:27PM +0000, Thomas Adam wrote:
>> 2009/2/14 Mulyadi Santosa <mulyadi.santosa@gmail.com>:
>> > perhaps as the alternative of "flip", dos2unix could be used too here?
>>
>> Yes they could, but as I have said before in similar posts, "col" is
>> perhaps the most portable way, across almost all UNIX variants.
>
> I've got to say, I've never been a fan of 'col' - the man page has
> always confused the living hell out of me. What is a "half-reverse line
> feed", anyway?
>
> If you have any favorite recipes you like to use with it, please share.

I just got this (the culprit shall remain nameless):

 for i in `wc -l * | grep ' * 0 ' | sed 's/ 0 /@/g'  | cut -f2 -d'@'`;
do rm $i; done

(but I think 'find . -size 0 -exec rm {} \;' is much easier to remember)


Ben Okopnik [ben at linuxgazette.net]

Sun, 15 Feb 2009 10:52:46 -0500

On Sun, Feb 15, 2009 at 03:10:55PM +0000, Jimmy O'Regan wrote:

> 2009/2/14 Ben Okopnik <ben@linuxgazette.net>:
> > On Sat, Feb 14, 2009 at 03:42:27PM +0000, Thomas Adam wrote:
> >> 2009/2/14 Mulyadi Santosa <mulyadi.santosa@gmail.com>:
> >> > perhaps as the alternative of "flip", dos2unix could be used too here?
> >>
> >> Yes they could, but as I have said before in similar posts, "col" is
> >> perhaps the most portable way, across almost all UNIX variants.
> >
> > I've got to say, I've never been a fan of 'col' - the man page has
> > always confused the living hell out of me. What is a "half-reverse line
> > feed", anyway?
> >
> > If you have any favorite recipes you like to use with it, please share.
> 
> I just got this (the culprit shall remain nameless):
> 
>  for i in `wc -l * | grep ' * 0 ' | sed 's/ 0 /@/g'  | cut -f2 -d'@'`;
> do rm $i; done

(Whoops. I meant "If you have any recipes using 'col'" - I guess I didn't make it clear enough.)

Ye ghods, what a convoluted mess. 'find', as you point out, is the tool that would come to my mind first; it would do recursive removals, which the other one won't. If you just want to remove all the zero-length files in the current dir, this is simple enough:

ls -l|awk '$5~/^0$/{system("rm "$8)}'

If you wanted to do it with shell tools, there's always

for n in *; do [ -f "$n" -a ! -s "$n" ] && rm "$n"; done

> (but I think 'find . -size 0 -exec rm {} \;' is much easier to remember)

Yep.
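
One caveat with anything that parses 'ls -l' output: the column that holds the filename depends on the date format 'ls' uses, and names containing spaces get split across fields. A minimal sketch of the shell-loop approach wrapped in a function, assuming only POSIX sh; the name remove_empty_files is made up for illustration:

```shell
# remove_empty_files is a hypothetical helper name.
# Removes zero-length regular files in the given directory (default: .),
# without descending into subdirectories. Dotfiles are not matched by '*'.
remove_empty_files() {
    dir=${1:-.}
    for f in "$dir"/*; do
        # -f: regular file; ! -s: size is zero
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            rm -- "$f"
        fi
    done
}
```

Because every expansion is quoted, names with spaces are handled; for recursive removal, 'find' remains the simpler tool.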

I've been doing the LPI certification stuff lately, and one of the on-line study guides had some horribly convoluted tool kit stuff; a really ugly "Identify which line will not sort a file by the third word on a line" question. One of the examples went something like this:

cut -d ' ' -f 2 file.txt|paste - file.txt|sort|cut -f 2-

I guess this is part of the price I pay for knowing how to use "sort" properly ("sort -k2", anyone?) - and even if I didn't, I would have just used Perl or something. When stuff like this shows up... oh, the suffering. Took me a while to get it.
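
For comparison, here is a sketch of both approaches side by side: the direct "sort -k" form and the cut/paste/sort/cut pipeline from the study guide. The function names are hypothetical, the field separator is assumed to be a single space, and LC_ALL=C is set only to keep the ordering predictable.

```shell
# Both helpers are hypothetical names; field numbers are 1-based.

# Direct approach: let sort pick out the key field itself.
sort_by_field() {
    LC_ALL=C sort -t ' ' -k "$1,$1" "$2"
}

# The long way round: prepend the key field to each line (tab-separated),
# sort on the whole decorated line, then strip the key off again.
sort_by_field_dsu() {
    cut -d ' ' -f "$1" "$2" | paste - "$2" | LC_ALL=C sort | cut -f 2-
}
```

On input without duplicate keys, "sort_by_field 2 file.txt" and "sort_by_field_dsu 2 file.txt" should print the lines in the same order.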

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *


Jimmy O'Regan [joregan at gmail.com]

Sun, 15 Feb 2009 17:50:38 +0000

2009/2/15 Ben Okopnik <ben@linuxgazette.net>:

> On Sun, Feb 15, 2009 at 03:10:55PM +0000, Jimmy O'Regan wrote:
>> 2009/2/14 Ben Okopnik <ben@linuxgazette.net>:
>> > On Sat, Feb 14, 2009 at 03:42:27PM +0000, Thomas Adam wrote:
>> >> 2009/2/14 Mulyadi Santosa <mulyadi.santosa@gmail.com>:
>> >> > perhaps as the alternative of "flip", dos2unix could be used too here?
>> >>
>> >> Yes they could, but as I have said before in similar posts, "col" is
>> >> perhaps the most portable way, across almost all UNIX variants.
>> >
>> > I've got to say, I've never been a fan of 'col' - the man page has
>> > always confused the living hell out of me. What is a "half-reverse line
>> > feed", anyway?
>> >
>> > If you have any favorite recipes you like to use with it, please share.
>>
>> I just got this (the culprit shall remain nameless):
>>
>>  for i in `wc -l * | grep ' * 0 ' | sed 's/ 0 /@/g'  | cut -f2 -d'@'`;
>> do rm $i; done
>
> (Whoops. I meant "If you have any recipes using 'col'" - I guess I
> didn't make it clear enough.)
>

Misread that; I was just struck by the awfulness of that line -- he also uses col in similarly evil ways.

That was to work around the output of a script to grab Bible text for a parallel corpus; it turned out to be unnecessary: ftp://ftp.funet.fi/pub/doc/bible/texts/danish/dkbibel.txt.gz

and a little perl to split it by book and chapter:

#!/usr/bin/perl
 
use warnings;
use strict;
 
use open IN => ':encoding(iso-8859-1)';
use open OUT => ':encoding(utf8)';
 
my $reading=0;
my $sent;
my $file;
my $part;
my ($a, $b, $htoi);
 
while (<>)
{
	if (/^horn\@proinf.dk$/) {$reading=1; next;}
	next if $reading==0;
	next if (/^$/);
	if(/^\*([\d]+)\/[\W]/)
	{
		if ($sent)
		{
			print OUTF "$sent\n";
			$sent=""; # reset state on changing books
		}
		next;
	}
	if(/^\*([\d]*)\/([0-9]+).*$/)
	{
		$file=sprintf("book%02d.chapter%03d.txt",int($1),int($2));
		open (OUTF, ">$file");
		next;
	}
	if(/^\*([\d]*)\/([a-f])([0-9]+).*$/)
	{
		$htoi=(hex($2)*10)+int($3);
		$file=sprintf("book%02d.chapter%03d.txt",int($1),$htoi);
		open (OUTF, ">$file");
		next;
	}
	if (/[\W]?([\d]+)[\W]?(.*)$/)
	{
		print OUTF "$sent\n" if ($sent); #Last sentence
		$part=$2;
		chomp $part;
		$sent="$1" . $part;
		next;
	}
	if(/^[\W]+(.*)$/)
	{
		$part=$1;
		chomp $part;
		$sent.=" " . $part;
		next;
	}
}

and we don't have to go fighting with someone's website.


Ben Okopnik [ben at linuxgazette.net]

Mon, 16 Feb 2009 23:03:46 -0500

On Sun, Feb 15, 2009 at 05:50:38PM +0000, Jimmy O'Regan wrote:

> 
> That was to work around the output of a script to grab Bible text for
> a parallel corpus; it turned out to be unnecessary:
> ftp://ftp.funet.fi/pub/doc/bible/texts/danish/dkbibel.txt.gz
> 
> and a little perl to split it by book and chapter:

[snip]

I don't think that does what it's supposed to. I just tried it out, and 'book01.chapter001.txt' ends with a line numbered '30', while 'book01.chapter002.txt' starts with '31' followed by '1'. Now, I am not a Bible scholar by any means, but that numbering scheme just doesn't seem right.

How about this instead:

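#!/usr/bin/perl
use warnings;
use strict;
use open IN => ':encoding(iso-8859-1)';
use open OUT => ':encoding(utf8)';

my ($book, $chapter, $fn);
while (<>){
	# A marker line: '*', a two-digit book number, then an optional chapter code
	if (m{^\*(\d\d)[/ ]([0-9a-f]\d)?}){
		close Fn if defined $fn;	# nothing to close before the first marker
		$book = $1;
		$chapter = defined $2 ? $2 : 1;
		# A letter-plus-digit code maps to 100 and up: 'a0' -> 100, 'b3' -> 113
		$chapter =~ s/([a-f])(\d)/100 + (ord($1) - 97) * 10 + $2/e;
		$fn = sprintf "book%02d.chapter%03d.txt", $book, $chapter;
		open Fn, '>', $fn or die "$fn: $!\n";
		next;
	}
	# Skip everything before the first marker line
	next unless defined $fn;
	print Fn;
}
close Fn if defined $fn;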

You might want to verify with a bit of spot-testing - that's a very strange numbering scheme they used - but I think I've got it right.
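
The letter-to-number step in that substitution is easy to misread, so here is the same arithmetic as a small shell sketch. The function name decode_chapter is made up for illustration, and the letter coding is only what the Perl substitution implies; it has not been checked against the source file.

```shell
# decode_chapter is a hypothetical helper name.
# Turns a two-character chapter code into a chapter number:
#   ''   -> 1    (no chapter given)
#   '07' -> 7    (plain two-digit number)
#   'a0' -> 100, 'b3' -> 113   (letter a-f stands for 10, 11, ... tens)
decode_chapter() {
    code=$1
    case $code in
        '')
            echo 1 ;;
        [a-f][0-9])
            letter=${code%?}
            digit=${code#?}
            # printf with a leading quote yields the character's code (a = 97)
            ascii=$(printf '%d' "'$letter")
            echo $((100 + (ascii - 97) * 10 + digit)) ;;
        [0-9][0-9])
            # Drop one leading zero so '07' prints as 7
            echo "${code#0}" ;;
        *)
            echo "unrecognised chapter code: $code" >&2
            return 1 ;;
    esac
}
```

Spot-checking a few codes this way against the generated filenames is one quick test of whether the numbering came out as intended.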

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *


Lew Pitcher [lew.pitcher at digitalfreehold.ca]

Tue, 17 Feb 2009 12:39:29 -0500

Well, I've just received four copies (one after another) of this email. All four have

  Date: Mon, 16 Feb 2009 23:03:46 -0500

as their datestamp.

It looks like either

a) Ben is trying to be /very/ emphatic about his perl script ( ;-) ), or
b) the LJ email server has hiccuped

So, can we expect more copies? <grin>

-- 
Lew Pitcher
 
Master Codewright & JOAT-in-training | Registered Linux User #112576
https://pitcher.digitalfreehold.ca/   | GPG public key available by request
----------      Slackware - Because I know what I'm doing.          ------


Rick Moen [rick at linuxmafia.com]

Tue, 17 Feb 2009 09:58:06 -0800

Quoting Lew Pitcher (lew.pitcher@digitalfreehold.ca):

> It looks like either
> a) Ben is trying to be /very/ emphatic about his perl script ( ;-) ), or
> b) the LJ email server has hiccuped

Temporary mail glitch. I know how to break the cycle, and have done so.


Jimmy O'Regan [joregan at gmail.com]

Tue, 17 Feb 2009 18:40:40 +0000

2009/2/17 Ben Okopnik <ben@linuxgazette.net>:

> On Sun, Feb 15, 2009 at 05:50:38PM +0000, Jimmy O'Regan wrote:
>>
>> That was to work around the output of a script to grab Bible text for
>> a parallel corpus; it turned out to be unnecessary:
>> ftp://ftp.funet.fi/pub/doc/bible/texts/danish/dkbibel.txt.gz
>>
>> and a little perl to split it by book and chapter:
>
> [snip]
>
> I don't think that does what it's supposed to. I just tried it out, and
> 'book01.chapter001.txt' ends with a line numbered '30', while
> 'book01.chapter002.txt' starts with '31' followed by '1'.  Now, I am not
> a Bible scholar by any means, but that numbering scheme just doesn't
> seem right. 
>
> How about this instead:
>
> ```
> #!/usr/bin/perl
> use warnings;
> use strict;
> use open IN => ':encoding(iso-8859-1)';
> use open OUT => ':encoding(utf8)';
>
> my ($book, $chapter, $fn);
> while (<>){
>        if (m{^\*(\d\d)[/ ]([0-9a-f]\d)?}){
>                close Fn if defined \*Fn;
>                $book = $1;
>                $chapter = defined $2 ? $2 : 1;
>                $chapter =~ s/([a-f])(\d)/100 + (ord($1) - 97) * 10 + $2/e;
>                $fn = sprintf "book%02d.chapter%03d.txt", $book, $chapter;
>                open Fn, ">$fn" or die "$fn: $!\n";
>                next;
>        }
>        next unless defined $fn;
>        print Fn;
> }
> close Fn;
> ```
>
> You might want to verify with a bit of spot-testing - that's a very
> strange numbering scheme they used - but I think I've got it right.

Weeell... as it happens, the source text has quite a number of other glitches: the book of Daniel starts in the middle of chapter 1, verse 2, for example. The combination of the errors in my script and the freeness of some Bible translations has led to a new research topic for the guy I wrote it for: he's now studying ways to measure when to drop candidate sentences from a bilingual corpus. 'The Spirit of God moved over *the face of* the waters' is an example: extra junk thrown in to turn a better phrase is not productive in any kind of machine translation, and yet there's been surprisingly little work done in automating the process of removing it.


Thomas Adam [thomas.adam22 at gmail.com]

Tue, 17 Feb 2009 18:46:17 +0000

2009/2/17 Ben Okopnik <ben@linuxgazette.net>:

> How about this instead:

Works better here.

>                $chapter = defined $2 ? $2 : 1;

I'm fortunate to not be affected by portability in perl, but I do much prefer that check above to Perl 5.10's "//" operator. I can't not parse that as some weird regexp operator. ;)

-- Thomas Adam


Ben Okopnik [ben at linuxgazette.net]

Sat, 14 Feb 2009 11:17:52 -0500

On Sat, Feb 14, 2009 at 10:02:20PM +0700, Mulyadi Santosa wrote:

> On Sat, Feb 14, 2009 at 8:08 AM, Ben Okopnik <ben@linuxgazette.net> wrote:
> > Just ran across Yet Another Proprietary Format from Micr0s0ft: .mht
> > files. Seems that Internet Explorer saves emails and HTML as an ugly
> > mess that somewhat resembles an email; according to Wikipedia, there's
> > no single standard, and the state of the state is best described as
> > 'sauve qui peut' (which translates, at least in Redmond, as "all your
> > ass are belong to me!") Bleh.
> >
> > Searching the Web shows that there are a lot of people - the
> > just-converted-to-Linux newbies, particularly - who have loads of these
> > things and don't know what to do with them. Some people recommend Opera
> > (I suppose a couple of hours of Kiri Te Kanawa is good for relieving
> > all kind of stress...); some have had luck with various conversion
> > utilities. I looked at it, and it looked something like a mangled email
> > header, soooo...
> >
> > I didn't go searching for more than just the one file that I had, but
> > here's what worked fine for opening it:
> >
> > ```
> > # Convert line-ends to Unix format
> > flip -ub file.mht
> > # Prepend a standard 'From ' mail header to the file
> > sed -i '1i\'"$(echo From $USER $(date))" file.mht
> > # You should now be able to open it with your favorite MUA
> > mutt -f file.mht
> > ```
> >
> > It worked fine for me.
> 
> perhaps as the alternative of "flip", dos2unix could be used too here?

Sure; so could "sed -i 's/\r//' file" and "tr -d '\015' < file", etc.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
