Discussion:
[S3tools-general] Memory leak in s3cmd?
Christian Bjørnbak
2017-03-01 06:50:35 UTC
Hi,

I am trying to upload a directory containing 60 GB of JPEGs, each between
3 and 6 KB in size, to a Ceph storage cluster.

First I tried using sync:

s3cmd sync -P /path-to-src/directory s3://bucket

It takes 24+ hours and at some point the process is killed. I tried a
couple of times and noticed that while it is running it uses all of the
source server's memory and swap.

I'm syncing from a 16 GB RAM / 16 GB swap server.

I thought maybe sync keeps the files in memory to compare or something and
changed to put:

s3cmd put -P --recursive /path-to-src/directory s3://bucket

But I still experience the same thing: s3cmd uses all the memory.

Is there a memory leak in s3cmd, so that it does not remove a file from
memory after it has been uploaded?


Med venlig hilsen / Kind regards,

Christian Bjørnbak

Chefudvikler / Lead Developer
TouristOnline A/S
Islands Brygge 43
2300 København S
Denmark
TLF: +45 32888230
Dir. TLF: +45 32888235
Florent Viard
2017-03-01 09:29:11 UTC
Hi Christian,

It is not a leak but a current limitation of s3cmd.
To perform the put / sync, s3cmd first loads the complete lists of source
and destination files into in-memory dicts, and only then merges them into
new dicts holding the operations that will have to be done: "transfer",
"copy", "delete".

So, for now, it is expected that the "prepare" phase can take a long time
and use a lot of memory.
As a quick estimate: 60 GB of files of at most 6 KB each means at least
10,000,000 files.
For the local list alone, I think it is safe to guess that each "entry"
will consume at least around (80 [avg path size] * 2 + 16 [hash] +
10 [a few more bytes]) = 186 bytes, which comes to roughly 1.8 to 2 GB of
RAM just for that.
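
The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope figures from this message, assuming decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes); the function and parameter names are made up for illustration and are not part of s3cmd, and the per-entry sizes are guesses, not measurements.

```python
def estimate_local_list_bytes(total_bytes, avg_file_bytes,
                              avg_path_len=80, hash_bytes=16, extra_bytes=10):
    """Rough lower bound on the RAM needed for the local file list.

    All per-entry sizes are the guesses quoted in this thread,
    not values measured from s3cmd itself.
    """
    file_count = total_bytes // avg_file_bytes
    # The path is counted twice, as in the estimate above.
    per_entry = avg_path_len * 2 + hash_bytes + extra_bytes
    return file_count, per_entry, file_count * per_entry


files, per_entry, total = estimate_local_list_bytes(60 * 10**9, 6 * 10**3)
print(f"{files:,} files x {per_entry} bytes = {total / 10**9:.2f} GB")
# prints: 10,000,000 files x 186 bytes = 1.86 GB
```

Since many of the files are smaller than 6 KB, the real file count, and therefore the memory use, is likely higher than this lower bound.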

FYI, you can use the "-v" and "--progress" flags to have more details about
what is going on.

To fix your situation, I would advise partitioning the task: say there are
10 big subfolders at the root of the dataset, then run s3cmd on each
subfolder instead of on the parent.
Example:
s3cmd sync root/a s3://bucket/mydest/a
s3cmd sync root/b s3://bucket/mydest/b
...
s3cmd sync root/g s3://bucket/mydest/g

instead of:
s3cmd sync root s3://bucket/mydest

The added value of such a partition is that, provided that you have enough
RAM, you could run multiple syncs in parallel to speed things up.
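
The partitioning could be scripted rather than typed by hand. Below is a minimal sketch, assuming one `s3cmd sync` per top-level subfolder in the same form as the commands above; `build_sync_commands`, `run_all` and `max_parallel` are hypothetical names introduced here, not part of s3cmd. It defaults to a dry run that only prints the commands.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_sync_commands(root, dest_prefix):
    """Build one `s3cmd sync` command per top-level subfolder of root."""
    commands = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # Same form as in the thread: s3cmd sync root/a s3://bucket/mydest/a
            commands.append(["s3cmd", "sync", path, f"{dest_prefix}/{name}"])
    return commands


def run_all(commands, max_parallel=2, dry_run=True):
    """Print the commands (dry run) or run up to max_parallel at once."""
    if dry_run:
        for argv in commands:
            print(" ".join(argv))
        return []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Each worker thread blocks on one s3cmd process at a time.
        return list(pool.map(lambda argv: subprocess.run(argv).returncode,
                             commands))
```

Calling `run_all(build_sync_commands("root", "s3://bucket/mydest"))` prints the planned commands; passing `dry_run=False` actually runs them and returns their exit codes. Files sitting directly in the root folder are not covered by this sketch and would need a separate run, and each parallel s3cmd process builds its own file list, so `max_parallel` should be chosen with the available RAM in mind.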

Regards,


--
Florent
<http://www.seagate.com>