Discussion:
[S3tools-general] Out of memory: Kill process s3cmd - v1.5.0-beta1
WagnerOne
2014-03-07 00:18:47 UTC
Hi,

I was recently charged with moving a lot of data (TBs) into s3 and discovered the great tool that is s3cmd. It's working well and I like the familiar rsync-like interactions.

I'm attempting to use s3cmd to copy a directory with tons of small files amounting to about 700GB to s3.

During my tests with ~1GB transfers, things went well. When I got to this larger test set, s3cmd worked on the local data for upwards of 40 minutes (gathering MD5 data, I assume) before the kernel killed the process due to excessive RAM consumption.

I was using an EC2 t1.micro with a NAS NFS-mounted to it, to transfer data from said NAS to S3. The t1.micro had only 500MB of RAM, so I bumped it to an m3.medium, which has 4 GB of RAM.

When I retried this failed copy on the m3.medium, s3cmd ran about 3x longer before being terminated as above.

I was hoping for a painless, big single sync job, but it's looking like I might have to write a wrapper to iterate over the big directories I need to copy to get them to a more manageable size for s3cmd.

I'm guessing I've hit a limitation of the implementation as it stands currently, but I wondered if anyone has suggestions for working within s3cmd itself.

Thanks and thanks for a great tool!

Mike

# s3cmd --version
s3cmd version 1.5.0-beta1

# time s3cmd sync --verbose --progress content s3://somewhere
INFO: Compiling list of local files... Killed

real 214m53.181s
user 8m34.448s
sys 4m5.803s

# tail /var/log/messages
xxxx Out of memory: Kill process 1680 (s3cmd) score 948 or sacrifice child
xxxx Killed process 1680 (s3cmd) total-vm:3942604kB, anon-rss:3755584kB, file-rss:0kB
--
***@wagnerone.com
"Linux supports the notion of a command line for the same reason that only children read books with only pictures in them."
Matt Domsch
2014-03-07 01:55:05 UTC
Thanks for the kudos. Unfortunately, memory consumption is based on the
number of objects in the trees being synchronized. On a 32-bit system, it
tends to hit a Python MemoryError when syncing trees of roughly 1M files.
You are hitting the kernel OOM killer well before that, though. You have
several options available:
1) run on a 64-bit VM with 8+ GB RAM (64-bit Python is a huge memory hog
compared to 32-bit Python; you need roughly 2x the RAM on 64-bit Python to
hold the same number of objects as on 32-bit Python).
2) split your sync into multiple subtrees (as you have surmised); see the
sketch below.
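
For example, here's a rough, untested sketch of option 2, assuming each
immediate subdirectory of content/ is itself small enough for a single run
(the paths and bucket are just the placeholders from your command):

  for dir in content/*/; do
      s3cmd sync --verbose "$dir" "s3://somewhere/content/$(basename "$dir")/" || break
  done

Any files sitting directly in content/ would still need a separate pass,
and it's worth confirming the resulting key layout on a small test first,
since the trailing slashes follow rsync-like semantics.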


There are no significant efforts under way to find a better way to handle
this in s3cmd itself, given how Python operates. One option would be to
add an SQLite on-disk or in-memory database for transient use in storing
and comparing the local and remote file lists, but that's a fairly heavy
undertaking and not one that anyone has chosen to develop.

Thanks,
Matt
WagnerOne
2014-03-10 22:55:07 UTC
Thanks for the responses, folks. I appreciate your feedback!

Mike
--
***@wagnerone.com
"An inglorious peace is better than a dishonorable war."- Mark Twain
WagnerOne
2014-03-10 23:07:52 UTC
I've identified the subdir in the content to be transferred that has the huge file count, which I'll need to transfer piecemeal.

Will --exclude allow me to sync everything but said directory, so I can then work within that subdir separately, or will I hit the same memory-related problems?

In other words, if I --exclude something, is it excluded entirely during the source discovery stage?

Thanks,
Mike
--
***@wagnerone.com
"Always consider the possibility your assumptions are wrong."-Wheel of Time
Matt Domsch
2014-03-11 00:33:07 UTC
Yes, --exclude will remove the whole directory (and its child files and
subdirs) from the run, at least from the local os.walk(); it can't do the
same when getting the object list from S3.
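
For example (hypothetical names, with "bigdir" standing in for your large
subdirectory; it's worth checking the patterns with --dry-run first):

  # pass 1: everything except the big subdirectory
  s3cmd sync --verbose --exclude 'bigdir/*' content s3://somewhere

  # pass 2: work through the big subdirectory one child directory at a time
  for dir in content/bigdir/*/; do
      s3cmd sync --verbose "$dir" "s3://somewhere/content/bigdir/$(basename "$dir")/"
  done

The first pass skips bigdir during the local walk, so its files never make
it into the in-memory list; the second keeps each run down to one child
directory at a time.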
WagnerOne
2014-03-11 13:25:33 UTC
Thank you, Matt.

Mike
--
***@wagnerone.com
"It's nice to be important. It's more important to be nice."-Rob Zombie
Matt Rogers
2014-03-07 07:42:30 UTC
Maybe try syncing using wildcards on the filenames? e.g.

s3cmd sync /your/folder/a* s3://your-bucket
s3cmd sync /your/folder/b* s3://your-bucket
s3cmd sync /your/folder/c* s3://your-bucket
s3cmd sync /your/folder/d* s3://your-bucket
s3cmd sync /your/folder/e* s3://your-bucket

etc.

Note: I may have the source and destination arguments in those commands the wrong way around; I'm operating from memory.
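
For what it's worth, I believe the order shown is the usual upload form
(local source first, S3 URI last), and reversing the two syncs in the
other direction:

  s3cmd sync /your/folder/ s3://your-bucket/
  s3cmd sync s3://your-bucket/ /your/folder/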

M.

Matt Rogers

UK Cell/SMS: +447958002382
Email: ***@feedspark.com
Skype: mattrogers1975
