...making Linux just a little more fun!
Ben Okopnik [ben at linuxgazette.net]
On Wed, Apr 20, 2011 at 06:48:52PM +1000, Amit Saha wrote:
> # from https://stackoverflow.com/questions/54139[...]cting-extension-from-filename-in-python/
> file_type = os.path.splitext(filename)[1]
I generally try to avoid importing modules if there's a built-in method that works just as well. In this case, you're importing "os" anyway, but this also works in scripts where you don't.
file_type = filename.split('.')[-1]
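The two approaches do not always agree. Here is a minimal sketch comparing them; the helper names and sample filenames are made up for illustration and are not from Amit's script. Note that splitext() keeps the leading dot and only looks at the last path component, while split('.') does neither.

```python
import os.path

def ext_by_split(filename):
    # Built-in string method: everything after the last dot, no leading dot.
    return filename.split('.')[-1]

def ext_by_splitext(filename):
    # os.path.splitext: includes the leading dot, '' if there is no extension.
    return os.path.splitext(filename)[1]

if __name__ == "__main__":
    # Hypothetical sample names, chosen to show where the two differ.
    for name in ("report.txt", "archive.tar.gz", "hello_world",
                 "dir.d/notes", ".bashrc"):
        print(repr(name), repr(ext_by_split(name)), repr(ext_by_splitext(name)))
```

For a name with no dot at all, split() hands back the whole filename, so a lookup keyed on its result needs to allow for that.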
> # Uses execlp so that the system PATH is used for finding the program file
> # location
> os.execlp(program,'',filename)
The main problem with this type of script is that you have to be a Python programmer to add a filetype/program pair. I'd suggest breaking out the dictionary as an easily-parseable text file, or adding a simple interface ("Unsupported filetype. What program would you like to use?") that updates the list.
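As a rough sketch of the "easily-parseable text file" idea: one "extension program [args]" pair per line, with "#" starting a comment. This file layout and the function name parse_table are assumptions made up for illustration, not an existing format.

```python
def parse_table(lines):
    """Build an {extension: command} dict from 'ext command' lines.

    Hypothetical format: blank lines and '#' comments are ignored,
    and lines without a command are skipped.
    """
    table = {}
    for raw in lines:
        line = raw.split('#', 1)[0].strip()   # drop comments and padding
        if not line:
            continue
        parts = line.split(None, 1)           # extension, then the rest
        if len(parts) == 2:
            table[parts[0]] = parts[1]
    return table

if __name__ == "__main__":
    sample = ["# viewers", "pdf xpdf", "txt less -N", "", "bad"]
    print(parse_table(sample))
```

Anyone who can edit a text file can then add a pair, with no Python knowledge needed.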
> Is there any Linux command line tool which can easily do this?
Midnight Commander is quite good at this; in fact, I've used that functionality of 'mc' in some of my shell scripts for just this purpose. They use a mix of regex matching, MIME-type matching, and literal extension recognition to work with various files. E.g., the ".gz" extension in "foo.1.gz" doesn't tell you anything about the type of file that it is (a man page).
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Wed, Apr 20, 2011 at 10:10:40AM +0100, Francis Daly wrote:
> On Wed, Apr 20, 2011 at 06:48:52PM +1000, Amit Saha wrote:
> > Hi there,
>
> > Is there any Linux command line tool which can easily do this?
>
> "see", an alias of "run-mailcap".
If I recall correctly, that's just the viewer. There's also "edit", from the same kit.
(The syntax of .mailcap was created by half-insane crackmonkeys working their way up to Shakespeare, but a lot of the stuff is already defined for you.)
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Wed, Apr 20, 2011 at 05:09:03PM +0100, Francis Daly wrote:
> On Wed, Apr 20, 2011 at 09:59:08AM -0400, Ben Okopnik wrote:
> > On Wed, Apr 20, 2011 at 10:10:40AM +0100, Francis Daly wrote:
> > > On Wed, Apr 20, 2011 at 06:48:52PM +1000, Amit Saha wrote:
> > >
> > > > Is there any Linux command line tool which can easily do this?
> > >
> > > "see", an alias of "run-mailcap".
> >
> > If I recall correctly, that's just the viewer. There's also "edit", from
> > the same kit.
>
> Yes -- "see", "edit", "compose", and "print" are the usual aliases.
>
> I've added a "hear" to my version, which pipes text to "festival --tts",
> for when reading is too much like work
Nice! You might want to submit that to the maintainers; seems like a natural addition to the toolkit.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Wed, Apr 20, 2011 at 09:52:25AM -0400, Benjamin Okopnik wrote:
> On Wed, Apr 20, 2011 at 06:48:52PM +1000, Amit Saha wrote:
> >
> > # Uses execlp so that the system PATH is used for finding the program file
> > # location
> > os.execlp(program,'',filename)
>
> The main problem with this type of script is that you have to be a
> Python programmer to add a filetype/program pair. I'd suggest breaking
> out the dictionary as an easily-parseable text file, or adding a simple
> interface ("Unsupported filetype. What program would you like to use?")
> that updates the list.
Taking a short break from motorcycle repair and boat work and client work and instructional manual development (yep, somewhat busy these days), and keeping in mind the perils of trying to figure out the file type just from the extension (consider "less.1.gz", "john.smith.txt", "foo.txt.tar.gz", "hello_world", and "chunk.1/chunk.2/chunk.3", etc. as some 'interesting' edge cases), I just cranked out the following for my own amusement:
#!/usr/bin/python3
# Created by Ben Okopnik on Sat Apr 23 21:52:59 EDT 2011

import os
import sys
import json     # Advantage over pickle: JSON can be edited manually!

if len(sys.argv) != 2:
    sys.exit('Syntax: '+sys.argv[0].split('/')[-1]+' <file.ext>')

funclookup = {}
conffile = os.getenv('HOME')+'/.lookup.json'
updatedb = False
ext = sys.argv[1].split('.')[-1]    # Way, way too simplistic...

if os.access(conffile, os.R_OK):
    with open(conffile, 'r') as pfile:
        funclookup = json.load(pfile)

while ext not in funclookup:
    not_done = True
    while not_done:
        answer = input('Program to use for "'+ext+'" (or "q" to quit)? ')
        if answer == 'q':
            sys.exit()
        for path in os.defpath.split(os.pathsep):
            # Only look up the first space-delimited string; this allows args
            if os.access(path+'/'+answer.split(' ')[0], os.X_OK):
                funclookup[ext] = path+'/'+answer
                not_done = False
                updatedb = True
                break
else:
    retcode = os.system(funclookup[ext]+' '+sys.argv[1])
    if retcode:
        print('Just so you know, your program exited with a code of', retcode)

if updatedb:
    with open(conffile, 'w') as pfile:
        json.dump(funclookup, pfile)
This starts out without a pre-set lookup table, but allows you to build one. You can even use program names with arguments; it all works fine.
In Perl, I'd have gone with XML::Simple. By default, the XMLin/XMLout methods write to an intuitively-named config file (if my script is called "foo", the config file will be "foo.xml".) Nifty module.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Sun, Apr 24, 2011 at 03:19:42PM +1000, Amit Saha wrote:
> On 04/24/2011 12:33 PM, Ben Okopnik wrote:
> >
> >else:
> >    retcode = os.system(funclookup[ext]+' '+sys.argv[1])
> >    if retcode:
> >        print('Just so you know, your program exited with a code of', retcode)
> >
> >if updatedb:
> >    with open(conffile, 'w') as pfile:
> >        json.dump(funclookup, pfile)
> >'''
> >
> >This starts out without a pre-set lookup table, but allows you to build
> >one. You can even use program names with arguments; it all works fine.
>
> Cool. Good way to avoid pickling and unpickling.
Yeah, I wanted a human-editable format in case of mistakes. Also, if the file gets damaged in some way, it's fixable by hand - you don't have to lose everything and start over again. Pickling is awesome for networking and such, but it's not an all-in-one answer to serialization - particularly when you have to talk to programs written in other languages. Or to humans. I wish Python had something as simple as XML::Simple for these purposes, but even 'lxml' and 'xml.sax' are nowhere near it.
Writer:
#!/usr/bin/perl -w
use XML::Simple;

my $team = {apache => {name => 'Aldo Ray', motto => 'Ah wants me some Natzi scalps'}};
open my $xml, '>', 'loadteam.xml' or die "loadteam.xml: $!\n";
print $xml XMLout($team);
close $xml;
Reader ('loadteam'):
#!/usr/bin/perl -w
use XML::Simple;

my $team = XMLin();
# Do stuff with data
> Any particular
> reason you are using os.system() instead of os.execlp()?
Yep. 'execlp' replaces the current process, which means that the script is finished at that point. In other words, there will never be a return value - i.e., no feedback about your subprocess failing.
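To make the difference concrete, here is a minimal sketch using the standard subprocess module rather than os.system(); it is a swapped-in alternative, not what the script above does. The helper name run_and_report is made up, and a child Python that exits with code 3 stands in for a failing viewer. One detail worth knowing: on Unix, os.system() returns the wait status rather than the plain exit code, so an exit code of 1 shows up as 256, whereas subprocess.run() reports the exit code directly.

```python
import subprocess
import sys

def run_and_report(argv):
    """Run argv as a child process, wait for it, and return its exit code."""
    completed = subprocess.run(argv)
    return completed.returncode

if __name__ == "__main__":
    # Stand-in for a viewer that fails: a child interpreter exiting with 3.
    code = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
    if code:
        print("Just so you know, your program exited with a code of", code)
```

Because the parent is still alive after the child finishes, it can report the failure; after an exec call there is no parent left to do so.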
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Mon, Apr 25, 2011 at 12:00:09PM +1000, Henry Grebler wrote:
> Hi,
>
> Perhaps this comment is out of line. I've watched this thread with
> a raised eyebrow, starting to construct a response and then changing
> my mind. It's none of my business. Don't make waves.
Ah, that famous Aussie shyness. C'mon, Henry, this is a Linux forum. If you can't express your opinion here, where can you express it?
(USENET truism: the best way to get an answer is not to ask a question but to strongly state a (preferably somewhat wrong) opinion, and to watch the resulting fireworks.)
> But I kept seeing issues, disturbing issues. I think my concerns
> deserve consideration.
Oh, good - you got over that reticence!
> My first concern is that Linux does not have the concept of an
> official filetype.
Well, certainly not one attached to a file extension. Extensions are used widely in Linux - let's admit it, they are useful - but they're neither universal nor consistent. That's a kiss of death for software that tries to do any kind of parsing; thus, my multiple warnings and my "for my own amusement" statement.
> On the other hand, Linux/Unix does have a long-standing
> well-established tradition of magic numbers.
Indeed. My usual solution to determining a filetype is
/usr/bin/file -Lbz
(follow symlinks, try to look inside compressed files, don't output the filename.) However, this is still less than useful unless your viewer can also look inside compressed files... thus, my usual fallback to Midnight Commander's "mcview", which handles all of those issues nicely - especially once you've spent some time defining all the weird filetypes. That's my premier solution for pretty much anything of this sort.
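For anyone who hasn't met magic numbers before, this toy sketch shows the principle that 'file' applies on a much larger scale: look at the first few bytes of the content and ignore the name entirely. The table below is a tiny hand-picked sample, and the function names are made up for illustration; it is in no way a substitute for the real magic database.

```python
# A few well-known leading byte signatures, paired with a description.
MAGIC = [
    (b"\x1f\x8b", "gzip compressed data"),
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"#!", "script with an interpreter line"),
]

def sniff(data):
    """Return a description based on the leading bytes, or 'unknown'."""
    for prefix, description in MAGIC:
        if data.startswith(prefix):
            return description
    return "unknown"

def sniff_file(path):
    """Read just enough of the file to compare against the table."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))

if __name__ == "__main__":
    print(sniff(b"%PDF-1.4 ..."))
    print(sniff(b"plain old text"))
```

Renaming a gzip file to "notes.txt" changes nothing here, which is exactly the point.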
> My second concern is that it seems that these solutions have a
> distinctly Microsoft perspective. For me, this is always a warning
> sign. Microsoft software is subject to various attacks from malware.
Oh-oh, bad logical extension. "MS software is vulnerable to attacks, therefore Linux software that attempts to follow the same protocols is also vulnerable." Sorry, no. Different OS, different privilege separation scheme (as in "none" vs. "rather sophisticated"), etc.
But you do have a point: it's not a protocol worth emulating in any serious manner. It may be a useful tool in a restricted scenario - which is, I believe, how Amit was looking at it - but that's it. It's not something that's likely to make it into the standard distro, nor is it likely to win the "Linux software of the year" award.
...but then, neither are any of the "useful for this one task" scripts that we tend to publish here rather often - and yet, we do pub them. Why? Because people learn from this kind of thing. It's nice if they have advice like yours to balance it and give it a sense of broader perspective, but like many other things, it takes little steps of this kind to learn programming - and thinking as a programmer.
As I often tell my students, "In the process of teaching this course, I'm going to tell you some lies. But I promise they'll be useful ones."
The truth is often far too complicated to be reached in one swell foop, and must be sneaked up on. Sometimes, by years of effort.
> There are many reasons for this, but I submit that a naive faith in
> the inherent "honesty" of filetypes certainly assists the creators of
> malware.
OK... go ahead and sneak something into a text file that will exploit the weaknesses of Vim. 'display'? 'xpdf'? Ghostscript? I strongly doubt it. See, that's just not how Linux works.
> I guess I could generalise and argue that naive faith in anything
> leads to susceptibility to exploitation.
That is true enough, certainly; e.g., installing software from a random tarball found on the Net is theoretically not the wisest move. In practice, though? I've been doing it for many years now, and haven't had a single problem. One of the major strengths of Linux is that, in most situations, even foolish actions by the user fail to do any significant damage. Of course, if someone is out to attack your specific system, some of those foolish actions have opened a door to that attacker... but really, in a home desktop machine, how often is that actually the case? By contrast, installing Windows automatically sends a message containing your home address to Big Bubba who just got out of prison, and applies the lube.
(Oh, wait - that's only Windows 7. The rest of them skip the lube.)
> I guess I'm less concerned by a person constructing such software for
> personal use, to make the author's life easier or more convenient. The
> danger lies when the software is passed on to the next person and then
> the person after. Unless each person understands what the software does
> and how it does it; and more importantly understands the limitations
> of how the software operates, it seems to me that, before long, there
> will be problems of varying seriousness down the track.
And this is a very reasonable statement indeed. There's a sharp distinction to be drawn between software developed for your own convenience and software developed for general usage. E.g., I've just rigged up a convenience function for my own use called "totals" that looks like this:
#!/usr/bin/perl -w
while (<>){
    last if /^\.$/;
    $sum += $1 if /([\d.]+)\D*$/
}
print "The total is $sum\n";
This processes an arbitrary number of text lines, and sums up the last number on each line (whether it's followed by more text or not, and regardless of whether it contains a period; input is terminated by a period on a line by itself.) This is useful to me because I tend to keep my expense lists in text files, and might prefix the total with a date, or a description which may contain numbers. Also, I might want to feed this thing via STDIN (i.e., start it and paste the lines to be processed.) This script serves my preferences, and does exactly what I want. Whether it would serve anyone else's is questionable - but if it did, and I wanted to publish it for that purpose, it would be at least 50 times longer. Input validation, comments, syntax error messages, etc. - and let's not forget the man page to go with it - make that 100 times longer. An entirely different kind of program.
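For readers following along in Python, here is a rough port of the same idea using the same regular expression. The function name total is made up, and there is one deliberate difference from the Perl: a match that is not a valid number (say, "1.2.3" or a trailing "12.50.") is skipped here, where Perl would quietly use its leading numeric part.

```python
import re

# Same pattern as the Perl: a run of digits/periods followed only by non-digits.
LAST_NUMBER = re.compile(r'([\d.]+)\D*$')

def total(lines):
    """Sum the last number on each line; stop at a line holding only '.'."""
    running = 0.0
    for raw in lines:
        line = raw.rstrip('\n')
        if line == '.':
            break
        match = LAST_NUMBER.search(line)
        if match:
            try:
                running += float(match.group(1))
            except ValueError:
                # Matched the pattern but is not a number; skip it.
                pass
    return running
```

Passing sys.stdin as the argument gives the paste-and-terminate behaviour described above, e.g. print("The total is", total(sys.stdin)).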
I can't imagine that Amit was saying that this program was ready for public consumption. I certainly wasn't even close to thinking that about mine.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Mon, Apr 25, 2011 at 02:01:16PM +1000, Amit Saha wrote:
> That said, I think all programs which use the MIME type of a file by
> consulting '/etc/mime.types' is also using the file extension. Isn't
> it?
The "magic" system doesn't rely on MIME types; different animal altogether. When I do need to work with MIME, though, I find the whole 'run-mailcap' kit to be a bit past its sell-by date, and prefer 'gst-typefind' from the GStreamer kit:
ben@Jotunheim:/tmp$ for n in txt pdf jpg doc blah
> do
> ln -s stuff.py stuff.$n
> gst-typefind-0.10 stuff.$n
> rm stuff.$n
> done
stuff.txt - text/x-python
stuff.pdf - text/x-python
stuff.jpg - text/x-python
stuff.doc - text/x-python
stuff.blah - text/x-python
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Ben Okopnik [ben at linuxgazette.net]
On Mon, Apr 25, 2011 at 02:34:04PM +1000, Amit Saha wrote:
> On Mon, Apr 25, 2011 at 2:27 PM, Ben Okopnik <ben@linuxgazette.net> wrote:
> >
> > The "magic" system doesn't rely on MIME types; different animal
> > altogether. When I do need to work with MIME, though, I find the whole
> > 'run-mailcap' kit to be a bit past its sell-by date, and prefer
> > 'gst-typefind' from the GStreamer kit:
>
> Yes, I am aware of the magic system. Just wanted to clarify my
> understanding that a lot of other programs out there using the
> /etc/mime.types or similar is actually doing it using the file
> extension.
More or less. If you read the 'run-mailcap' manpage, it discusses a series of fallback procedures for determining the MIME type, with file extension being the last one in the chain. In reality, it seems to classify everything unknown as "application/octet-stream", and happily trusts the extension for anything beyond that.
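Python's standard mimetypes module is a handy way to see extension-only lookup in action: it maps names to types from a built-in table plus files such as /etc/mime.types, and never opens the file at all. A minimal sketch (the wrapper name guess_from_name is made up, and the "stuff.*" names do not need to exist on disk):

```python
import mimetypes

def guess_from_name(name):
    """Return the MIME type suggested purely by the filename, or None."""
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type

if __name__ == "__main__":
    # None of these files exist; only the extension is consulted.
    for name in ("stuff.txt", "stuff.pdf", "stuff.jpg"):
        print(name, "-", guess_from_name(name))
```

Contrast this with a content-based detector, which would report the same type for every one of those names if they all pointed at the same Python script.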
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *