<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <HTML> <HEAD> <TITLE> [Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info) </TITLE> <LINK REL="Index" HREF="index.html" > <LINK REL="made" HREF="mailto:python-dev%40python.org?Subject=Re%3A%20%5BPython-Dev%5D%20pathlib%20and%20issue%2011406%20%28a%20directory%20iterator%0A%09returning%20stat-like%20info%29&In-Reply-To=%3CCAL9jXCGBN6RMJLyf-t7QDEv48xGkAP7tYA-Av%3D7fCnwo-85sMQ%40mail.gmail.com%3E"> <META NAME="robots" CONTENT="index,nofollow"> <style type="text/css"> pre { white-space: pre-wrap; /* css-2.1, curent FF, Opera, Safari */ } </style> <META http-equiv="Content-Type" content="text/html; charset=us-ascii"> <LINK REL="Previous" HREF="130581.html"> <LINK REL="Next" HREF="130575.html"> </HEAD> <BODY BGCOLOR="#ffffff"> <H1>[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info)</H1> <B>Ben Hoyt</B> <A HREF="mailto:python-dev%40python.org?Subject=Re%3A%20%5BPython-Dev%5D%20pathlib%20and%20issue%2011406%20%28a%20directory%20iterator%0A%09returning%20stat-like%20info%29&In-Reply-To=%3CCAL9jXCGBN6RMJLyf-t7QDEv48xGkAP7tYA-Av%3D7fCnwo-85sMQ%40mail.gmail.com%3E" TITLE="[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info)">benhoyt at gmail.com </A><BR> <I>Sun Nov 24 23:20:08 CET 2013</I> <P><UL> <LI>Previous message: <A HREF="130581.html">[Python-Dev] [RELEASED] Python 3.4.0b1 </A></li> <LI>Next message: <A HREF="130575.html">[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info) </A></li> <LI> <B>Messages sorted by:</B> <a href="date.html#130572">[ date ]</a> <a href="thread.html#130572">[ thread ]</a> <a href="subject.html#130572">[ subject ]</a> <a href="author.html#130572">[ author ]</a> </LI> </UL> <HR> <!--beginarticle--> <PRE>Hi folks, I decided to start another thread for my thoughts on the interaction between pathlib (Antoine's new PEP 428), issue 11406 (proposal for a directory 
iterator returning stat-like info), and my own scandir library, which implements something along the lines of issue 11406.

My scandir library (<A HREF="https://github.com/benhoyt/scandir">https://github.com/benhoyt/scandir</A>) is something I've been working on for a while. It provides a scandir() function which uses the OS's directory iterator functions (readdir on POSIX, FindFirstFile/FindNextFile on Windows) to expose as much stat-like information as possible. This way functions like os.walk() can use that info (particularly "is_dir()") instead of making tons of extra calls to os.stat().

This provides a huge speed boost for os.walk() in many cases: I've seen 3-4x on Linux, and up to 20x on Windows. (It depends on various things, not least of which is Windows' weird stat caching. If I run my scandir benchmark "fresh", my scandir-based os.walk() runs 8-9 times as fast as the built-in one. But if I run it after an un-hibernate, suddenly it runs 18-20 times as fast as the built-in one. Either way, huge gains, especially on Windows.)

scandir.scandir() yields DirEntry objects, each of which has .isdir(), .isfile(), .islink(), and .lstat() methods. Look familiar? When I was reading PEP 428 and saw .is_file(), .is_dir(), and .stat(), I thought: surely I can merge this with pathlib and Path objects.

The first thing I can do to scandir is rename my isdir()-style methods to match PEP 428's, so that DirEntry quacks like a Path object where it can.

However, I'm wondering if I can change scandir to return actual Path objects. Or better, because Path already helpfully provides iterdir(), which yields Path objects, and Path objects have .is_dir() etc, can scandir()-like behaviour simply work out-of-the-box?

This mainly depends on how Path is going to cache stat information. If it caches it, then this will just work. It sounds like Guido's opinion was that both cached and uncached use cases are important, but that it should be very clear which one you're getting. I personally like the .stat() and .restat() idea.

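To make the .stat() / .restat() split a bit more concrete, here's a rough sketch of how I imagine it could behave. Note that CachedStatPath and demo() are made-up names purely for illustration; they're not part of pathlib or scandir, and the real thing may well end up looking different:

```python
import os
import tempfile


class CachedStatPath:
    """Hypothetical illustration only; not pathlib's or scandir's API."""

    def __init__(self, path):
        self.path = path
        self._stat = None  # filled in on the first stat() call

    def stat(self):
        # Return the cached result, asking the OS at most once.
        if self._stat is None:
            self._stat = os.stat(self.path)
        return self._stat

    def restat(self):
        # Explicitly refresh the cache from the filesystem.
        self._stat = os.stat(self.path)
        return self._stat


def demo():
    # Show that stat() keeps returning the old size until restat() is called.
    with tempfile.TemporaryDirectory() as tmp:
        name = os.path.join(tmp, "example.txt")
        with open(name, "w") as f:
            f.write("abc")
        p = CachedStatPath(name)
        first = p.stat().st_size    # first call reads from the OS and caches
        with open(name, "a") as f:
            f.write("defg")
        cached = p.stat().st_size   # cached value, file growth not seen
        fresh = p.restat().st_size  # re-read from the OS
        return first, cached, fresh
```

The point is just that the caller always knows which one they're getting: stat() may be stale, restat() never is. A directory iterator could fill in the cached value for free while it's iterating.
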
The other related thing is that DirEntry only provides .lstat(), because it's providing stat-like info without following links.

Note in this context that it's not just "network filesystems" on which stat() is slow (<A HREF="https://mail.python.org/pipermail/python-dev/2013-May/125805.html">https://mail.python.org/pipermail/python-dev/2013-May/125805.html</A>). It's quite slow on Windows under various conditions too.

See also Nick Coghlan's post about a DirEntry-style object on the issue 11406 thread: <A HREF="https://mail.python.org/pipermail/python-dev/2013-May/126148.html">https://mail.python.org/pipermail/python-dev/2013-May/126148.html</A>

Thoughts and suggestions for how to merge scandir with pathlib's approach? It's important to me that pathlib's API doesn't cut itself off from a more efficient implementation of the ideas from issue 11406 and scandir...

Thanks,
Ben.
</PRE> <!--endarticle--> <HR> <P><UL> <!--threads--> <LI>Previous message: <A HREF="130581.html">[Python-Dev] [RELEASED] Python 3.4.0b1 </A></li> <LI>Next message: <A HREF="130575.html">[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info) </A></li> <LI> <B>Messages sorted by:</B> <a href="date.html#130572">[ date ]</a> <a href="thread.html#130572">[ thread ]</a> <a href="subject.html#130572">[ subject ]</a> <a href="author.html#130572">[ author ]</a> </LI> </UL> <hr> <a href="https://mail.python.org/mailman/listinfo/python-dev">More information about the Python-Dev mailing list</a><br> </body></html>