[Python-Dev] My summary of the scandir (PEP 471)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <HTML> <HEAD> <TITLE> [Python-Dev] My summary of the scandir (PEP 471) </TITLE> <LINK REL="Index" HREF="index.html" > <LINK REL="made" HREF="mailto:python-dev%40python.org?Subject=Re%3A%20%5BPython-Dev%5D%20My%20summary%20of%20the%20scandir%20%28PEP%20471%29&In-Reply-To=%3CCAL9jXCErmjcg9SuWChZLjFhxOVcEpHxBBfDjRmWFts59dG4a8Q%40mail.gmail.com%3E"> <META NAME="robots" CONTENT="index,nofollow"> <style type="text/css"> pre { white-space: pre-wrap; /* css-2.1, curent FF, Opera, Safari */ } </style> <META http-equiv="Content-Type" content="text/html; charset=us-ascii"> <LINK REL="Previous" HREF="135309.html"> <LINK REL="Next" HREF="135313.html"> </HEAD> <BODY BGCOLOR="#ffffff"> <H1>[Python-Dev] My summary of the scandir (PEP 471)</H1> <B>Ben Hoyt</B> <A HREF="mailto:python-dev%40python.org?Subject=Re%3A%20%5BPython-Dev%5D%20My%20summary%20of%20the%20scandir%20%28PEP%20471%29&In-Reply-To=%3CCAL9jXCErmjcg9SuWChZLjFhxOVcEpHxBBfDjRmWFts59dG4a8Q%40mail.gmail.com%3E" TITLE="[Python-Dev] My summary of the scandir (PEP 471)">benhoyt at gmail.com </A><BR> <I>Wed Jul 2 14:41:28 CEST 2014</I> <P><UL> <LI>Previous message: <A HREF="135309.html">[Python-Dev] My summary of the scandir (PEP 471) </A></li> <LI>Next message: <A HREF="135313.html">[Python-Dev] My summary of the scandir (PEP 471) </A></li> <LI> <B>Messages sorted by:</B> <a href="date.html#135312">[ date ]</a> <a href="thread.html#135312">[ thread ]</a> <a href="subject.html#135312">[ subject ]</a> <a href="author.html#135312">[ author ]</a> </LI> </UL> <HR> <!--beginarticle--> <PRE>Thanks for the effort in your response, Paul. I'm all for KISS, but let's just slow down a bit here. ><i> I think that thin wrapper is needed - even </I>><i> if the various bells and whistles are useful, they can be built on top </I>><i> of a low-level version (whereas the converse is not the case). </I> Yes, but API design is important. 
For example, urllib2 takes a kind of "thin wrapper" approach, but millions of
people use the 3rd-party "requests" library because it's just so much nicer to
use. There are low-level functions in the "os" module, but there are also a lot
of higher-level functions (os.walk) and functions that smooth over
cross-platform issues (os.stat).

Detailed comments below.

><i> The return value is an object whose attributes correspond to the data </I>
><i> the OS returns about a directory entry: </I>
><i> </I>
><i> * name - the object's name </I>
><i> * full_name - the object's full name (including path) </I>
><i> * is_dir - whether the object is a directory </I>
><i> * is_file - whether the object is a plain file </I>
><i> * is_symlink - whether the object is a symbolic link </I>
><i> </I>
><i> On Windows, the following attributes are also available </I>
><i> </I>
><i> * st_size - the size, in bytes, of the object (only meaningful for files) </I>
><i> * st_atime - time of last access </I>
><i> * st_mtime - time of last write </I>
><i> * st_ctime - time of creation </I>
><i> * st_file_attributes - Windows file attribute bits (see the </I>
><i> FILE_ATTRIBUTE_* constants in the stat module) </I>

Again, this seems like a nice simple idea, but I think it's actually a
worst-of-both-worlds solution -- it has a few problems:

1) It's a nasty API to actually write code with. If you try to use it, it gives
off a "made only for low-level library authors" rather than "designed for
developers" smell.
For example, here's a get_tree_size() function I use, written in both versions
(original is the PEP 471 version with the addition of .full_name):

def get_tree_size_original(path):
    """Return total size of all files in directory tree at path."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir():
            total += get_tree_size_original(entry.full_name)
        else:
            total += entry.lstat().st_size
    return total

def get_tree_size_new(path):
    """Return total size of all files in directory tree at path."""
    total = 0
    for entry in os.scandir(path):
        if hasattr(entry, 'is_dir') and hasattr(entry, 'st_size'):
            is_dir = entry.is_dir
            size = entry.st_size
        else:
            st = os.lstat(entry.full_name)
            is_dir = stat.S_ISDIR(st.st_mode)
            size = st.st_size
        if is_dir:
            total += get_tree_size_new(entry.full_name)
        else:
            total += size
    return total

I know which version I'd rather write and maintain! It seems to me that people
new to Python could easily write the top version, but the bottom is longer,
more complicated, and harder to get right. It would also be very easy to write
code in a way that works on Windows but bombs hard on POSIX.

2) It seems like your assumption is that is_dir/is_file/is_symlink are always
available on POSIX via readdir. This isn't actually the case (this was
discussed in the original threads) -- if readdir() returns dirent.d_type as
DT_UNKNOWN, then you actually have to call os.stat() anyway to get it. So, as
the above definition of get_tree_size_new() shows, you have to use
getattr/hasattr on everything: is_dir/is_file/is_symlink as well as the st_*
attributes.

3) It's not much different in concept from the PEP 471 version, except that
PEP 471 has a built-in .lstat() method, making the user's life much easier.
This is the sense in which it's the worst of both worlds -- it's a far less
nice API to use, but it still has the same issues with race conditions as the
original does.
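
To make points 2 and 3 concrete, here is a minimal sketch of how a method-based
entry can hide the stat fallback from the caller. It is only an emulation built
on os.listdir() and os.lstat(), not the proposed implementation: the names
Entry and scandir_sketch are hypothetical, and because pure Python has no
access to d_type, is_dir() always takes the lstat() path that a real
implementation would need only for DT_UNKNOWN.

```python
import os
import stat


class Entry:
    """Hypothetical stand-in for a PEP 471 style directory entry."""

    def __init__(self, directory, name):
        self.name = name
        self.full_name = os.path.join(directory, name)
        self._lstat = None  # filled in lazily, then reused

    def lstat(self):
        # Cache the result so repeated calls cost at most one system call.
        if self._lstat is None:
            self._lstat = os.lstat(self.full_name)
        return self._lstat

    def is_dir(self):
        # Assumption: a real implementation would answer from d_type when
        # the OS provides it and fall back to lstat() only for DT_UNKNOWN.
        return stat.S_ISDIR(self.lstat().st_mode)


def scandir_sketch(path):
    """Yield an Entry for each name in path (pure-Python emulation)."""
    for name in os.listdir(path):
        yield Entry(path, name)


def get_tree_size(path):
    """Return total size of all files in directory tree at path."""
    total = 0
    for entry in scandir_sketch(path):
        if entry.is_dir():
            total += get_tree_size(entry.full_name)
        else:
            total += entry.lstat().st_size
    return total
```

The calling code is the same shape as get_tree_size_original() and needs no
hasattr checks, because the decision about whether a stat call is required
lives inside the entry object rather than in every caller.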

So thinking about this again:

First, based on the +1's to Paul's new solution, I don't think people are too
concerned about the race condition issue (attributes being different between
the original readdir and the os.stat calls). I think this is probably fair --
if folks care, they can handle it in an application-specific way. So that means
Paul's new solution and the original PEP 471 approach are both okay on that
score.

Second, comparing PEP 471 to Nick's solution: error handling is much more
straightforward and simple to document with the original PEP 471 approach
(just try/except around the function calls) than with Nick's get_lstat=True
approach of doing the stat() if needed inside the iterator. To catch errors
with that approach, you'd either have to do a "while True" loop and try/except
around next(it) manually (which is very yucky code), or we'd have to add an
onerror callback, which is somewhat less nice to use and harder to document
(signature of the callback, exception object, etc).

So given all of the above, I'm fairly strongly in favour of the approach in the
original PEP 471 due to its easy-to-use API and straightforward try/except
approach to error handling. (My second option would be Nick's get_lstat=True
with the onerror callback. My third option would be Paul's attribute-only
solution, as it's just very hard to use.)

Thoughts?

-Ben
</PRE> <!--endarticle--> <HR> <P><UL> <!--threads--> <LI>Previous message: <A HREF="135309.html">[Python-Dev] My summary of the scandir (PEP 471) </A></li> <LI>Next message: <A HREF="135313.html">[Python-Dev] My summary of the scandir (PEP 471) </A></li> <LI> <B>Messages sorted by:</B> <a href="date.html#135312">[ date ]</a> <a href="thread.html#135312">[ thread ]</a> <a href="subject.html#135312">[ subject ]</a> <a href="author.html#135312">[ author ]</a> </LI> </UL> <hr> <a href="https://mail.python.org/mailman/listinfo/python-dev">More information about the Python-Dev mailing list</a><br> </body></html>