~sschwarzer/ftputil


ftputil.listdir slows down

Details
Message ID
<CAHBVgkFectxHJH-fMeSFJE=BdDhPf7+n3G11hS3p3Jn3PoWcYA@mail.gmail.com>
Hi,

Here is what I am trying to do -

1) Ftp into a remote server
2) get the list of parent directories
3) iterate over all the parent directories
4) get the subdirectory content.

I have about 800 parent directories. Step (4) slows down after about
the 400th parent directory.

Code -


def fetch_non_empty_dir(self, rmt_dir):

    self._ftp_obj.stat_cache.resize(20000)
    rmt_ctnt = self._ftp_obj.listdir(rmt_dir)
    rmt_dir_list = []
    count = 0
    for item in rmt_ctnt:
        file_list = self._ftp_obj.listdir(rmt_dir+item+"/")
        if (len(file_list) == 0):
            continue
        else:
            rmt_dir_list.append(item)
    return rmt_dir_list


stat_cache.resize() did not help solve my issue. Should I clean the
local cache maintained for every listdir()? How do I clean the local
cache?

Appreciate your help.


thanks
Rajesh
Details
Message ID
<20ede097-1055-0c6e-ac0b-14485c181524@sschwarzer.net>
In-Reply-To
<CAHBVgkFectxHJH-fMeSFJE=BdDhPf7+n3G11hS3p3Jn3PoWcYA@mail.gmail.com>
Hi Rajesh,

On 2022-01-11 03:05, Rajesh Kay wrote:
> Here is what I am trying to do -
> 
> 1) Ftp into a remote server
> 2) get the list of parent directories
> 3) iterate over all the parent directories
> 4) get the subdirectory content.

How many file system items are usually in each directory?
What's a typical number and what is (roughly) the maximum
number?

> I have about 800 parent directories. Step (4) slows down after about
> the 400th parent directory.

When you say "slows down", it would be interesting to know
_how_ it slows down. When the execution slows down, do _all_
directories after that take a long time or is there a
repeating pattern where a bunch of directories is processed
fast and then a directory is processed slowly and so on?

> Code -

Code is helpful to understand what you're doing. Thanks for
providing it. :-) Is this the actual code or is it shortened
somehow? I ask because I see that the `count` variable in
the loop body is unused, so maybe it's a leftover from
removing some of the original code that shows the slow-down
problem. Anyway, the important thing is that all operations
involving ftputil are present, including what they operate on.

> def fetch_non_empty_dir(self, rmt_dir):
>      self._ftp_obj.stat_cache.resize(20000)
>      rmt_ctnt = self._ftp_obj.listdir(rmt_dir)
>      rmt_dir_list = []
>      count = 0
>      for item in rmt_ctnt:
>          file_list = self._ftp_obj.listdir(rmt_dir+item+"/")

To see the slow-down pattern (see above), you could
implement some logging here. Replace the above line with

   start_time = time.time()  # Need to import `time`
   file_list = self._ftp_obj.listdir(rmt_dir+item+"/")
   end_time = time.time()
   print(f"Got {len(file_list)} items in {end_time-start_time} seconds",
         flush=True)

and run the changed code. You can also change the `for`
statement to use `enumerate` and include the number of the
inspected directory in the logged output.
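
A minimal sketch of that `enumerate` variant, assuming `host` is any object with a `listdir` method (for example an `ftputil.FTPHost`). The function name `log_listdir_times`, the use of `posixpath.join` to build the remote path, and `time.perf_counter` in place of `time.time` are illustrative choices, not part of the original code:

```python
import posixpath
import time


def log_listdir_times(host, rmt_dir):
    """List each entry below `rmt_dir` and log how long each listing takes.

    `host` is assumed to offer a `listdir(path)` method, as an
    `ftputil.FTPHost` does. Returns a list of tuples
    (index, name, item_count, seconds) for later inspection.
    """
    timings = []
    for index, item in enumerate(host.listdir(rmt_dir), start=1):
        # perf_counter is a monotonic clock meant for measuring intervals.
        start_time = time.perf_counter()
        file_list = host.listdir(posixpath.join(rmt_dir, item))
        elapsed = time.perf_counter() - start_time
        print(
            f"{index:4d} {item}: {len(file_list)} items in {elapsed:.3f} seconds",
            flush=True,
        )
        timings.append((index, item, len(file_list), elapsed))
    return timings
```

Because the directory number is part of every line, the log should show whether all listings after some point are slow or whether slow listings recur at intervals.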

Please try to attach the output as a zip file, although I
don't know how Sourcehut (the mailing list host) handles
attachments.

>          if (len(file_list) == 0):
>              continue
>          else:
>              rmt_dir_list.append(item)
>      return rmt_dir_list
> 
> stat_cache.resize() did not help solve my issue.

Depending on the number of items in your directories, they
might not fit in 20000 cache entries. That said, your usage
pattern looks like the cache size should be mostly
irrelevant. As far as I can tell, you never (re)use the
cached entries (unless there's some code deleted that _does_
use the cache).

> Should I clean the
> local cache maintained for every listdir()? How do I clean the local
> cache?

I don't think it should change anything, but you can clear
the cache by calling `self._ftp_obj.stat_cache.clear()`.
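
As a rough sketch of where that call could go in the loop, assuming `host` is an `ftputil.FTPHost` (or any object offering `listdir` and `stat_cache.clear()`). The function name and the `clear_every` parameter are hypothetical, and clearing the cache is shown only as an experiment, since it is not expected to affect the slow-down:

```python
import posixpath


def fetch_non_empty_dirs(host, rmt_dir, clear_every=100):
    """Return the names below `rmt_dir` whose listing is not empty.

    `host` is assumed to provide `listdir(path)` and `stat_cache.clear()`.
    `clear_every` is a hypothetical knob: the stat cache is cleared after
    every `clear_every` processed directories.
    """
    non_empty = []
    for index, item in enumerate(host.listdir(rmt_dir), start=1):
        # An empty list is falsy, so this keeps only non-empty directories.
        if host.listdir(posixpath.join(rmt_dir, item)):
            non_empty.append(item)
        if index % clear_every == 0:
            host.stat_cache.clear()
    return non_empty
```

If the timings look the same with and without the periodic `clear()`, the stat cache can be ruled out as the cause.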

> Appreciate your help.

I hope I can help. :-)

Stefan