Ashley Moran wrote:
> I think the total data size is about 1.5GB, but the individual files
> are smaller, the largest being a few hundred MB. The most rows in a
> file is ~15,000,000 I think. The server I run it on has 2GB RAM (an
> Athlon 3500+ running FreeBSD/amd64, so the hardware is not really an
> issue)... it can get all the way through without swapping (just!)
The problem isn't the raw size of the dataset. It's the number of objects you create and the amount of garbage that has to be cleaned up. If you're clever about how you write, you can help Ruby by not creating so much garbage. That can give a huge benefit.
>
> The processing is pretty trivial, and mainly involves incrementing
> some ID columns so we can merge datasets together, adding a text
> column to the start of every row, and eliminating a few duplicates.
Eliminating the dupes is the only scary thing I've seen here. What's the absolute smallest piece of data that you need to look at in order to distinguish a dupe? (If it's the whole line, then the answer is 16 bytes- the length of the MD5 hash ;-)) That's the critical working set. If you can't get the Ruby version fast enough, it's cheap and easy to sort through 15,000,000 of them in C. Then one pass through the sorted set finds your dupes. I've never found a consistently-fastest performer among Ruby's several different ways of storing sorted sets.
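A minimal Ruby sketch of the sort-then-scan idea, done in Ruby rather than C. The helper name `unique_lines` and the sample rows are hypothetical, and it assumes the whole line is what identifies a duplicate, so each line is reduced to its 16-byte MD5 digest before sorting:

```ruby
require 'digest/md5'

# Hypothetical helper: returns `lines` with duplicates removed,
# keeping the first occurrence of each line.
def unique_lines(lines)
  # Working set: one 16-byte binary digest plus the original index per line.
  keyed = lines.each_with_index.map { |line, i| [Digest::MD5.digest(line), i] }

  # Arrays compare element-wise, so this sorts by digest, then by index.
  keyed.sort!

  # One pass over the sorted set: equal digests are now adjacent, and the
  # later index of each adjacent equal pair is a duplicate.
  dupes = {}
  keyed.each_cons(2) do |(d1, _), (d2, i2)|
    dupes[i2] = true if d1 == d2
  end

  lines.reject.with_index { |_, i| dupes[i] }
end

rows = ["a,1\n", "b,2\n", "a,1\n", "c,3\n", "b,2\n"]
deduped = unique_lines(rows)
```

Only the digests and indexes are sorted, not the lines themselves, which is what keeps the working set small. Whether this is fast enough at 15,000,000 rows is something to measure rather than assume.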
Make sure that your inner loop doesn't allocate any new variables, especially arrays- declare them outside your inner loop and re-use them with Array#clear.
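As a rough illustration of re-using one array with Array#clear, here is a sketch that writes rows out in batches. The method name `write_in_batches` and its parameters are made up for the example, and how much this saves depends on the Ruby version and its garbage collector, so treat it as something to benchmark:

```ruby
# Hypothetical example: append rows to `out` in batches, re-using a single
# buffer array instead of allocating a new one for every batch.
def write_in_batches(rows, out, batch_size)
  batch = []                       # allocated once, outside the loop
  rows.each do |row|
    batch << row
    if batch.size == batch_size
      out << batch.join("\n") << "\n"
      batch.clear                  # empty the same Array object for re-use
    end
  end
  out << batch.join("\n") << "\n" unless batch.empty?
  out
end

result = write_in_batches(%w[a b c], String.new, 2)
```

The alternative, `batch = []` inside the loop, leaves one discarded array per batch for the garbage collector to clean up.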
> doesn't improve things, there's always the option of going dual-core
> and forking to do independent files.
Obviously I haven't seen your code or your data, but if the Ruby app is memory-bus-bound, then this approach may make your problem worse, not better.
Good luck. I recently got a Ruby program that aggregates several LDAP directory-pulls with about a million entries down from a few hours to a few seconds, without having to drop into C. It can be done, and it's kind of fun too.

Chad Perrin <perrin@apotheon.com>
Thu, Jul 27, 2006 at 10:29 AM
Reply-To: ruby-talk@ruby-lang.org
To: ruby-talk ML <ruby-talk@ruby-lang.org>

On Thu, Jul 27, 2006 at 12:19:24PM +0900, Keith Gaughan wrote:
> On Thu, Jul 27, 2006 at 11:38:34AM +0900, Chad Perrin wrote:
>
> > > ...most of which is down to Windows itself, not Excel. Excel's
> > > contribution to that lag isn't, I believe, all that great. So in this
> > > regard, your complaint is more to do with GDI and so on than with Excel
> > > itself.
> >
> > Excel doesn't run so well on Linux, so I'll just leave that lying where
> > it is.
>
> In fairness, if you're judging its performance based on running it in
> Wine, that's not really a fair comparison. :-)
I'm judging it based on running it on Windows. My point is that divorcing it from the only environment in which it runs (natively) is less than strictly sporting of you, when trying to discuss its performance characteristics (or lack thereof).
-- CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ] "The measure on a man's real character is what he would do if he knew he would never be found out." - Thomas McCauley

William James <w_a_x_man@yahoo.com>
Thu, Jul 27, 2006 at 10:50 AM
Reply-To: ruby-talk@ruby-lang.org
To: ruby-talk ML <ruby-talk@ruby-lang.org>

Kroeger, Simon (ext) wrote:
> Hi Peter!
>
> > Whenever the question of performance comes up with scripting languages
> > such as Ruby, Perl or Python there will be people whose response can be
> > summarised as "Write it in C". I am one such person. Some people take
> > offence at this and label us trolls or heretics of the true programming
> > language (take your pick).
>
> The last (and only) time I called someone a troll for saying
> 'Write it in C' it was in response to a rails related question.
> Further the OP asked for configuration items and such, but maybe
> that's a whole other story. (and of course you can write
> C Extensions for rails... yeah, yadda, yadda :) )
>
> ..snip 52 lines Perl, some hundred lines C ...
>
> > [Latin]$ time ./Latin1.pl 5 > x5
> >
> > real 473m45.370s
> > user 248m59.752s
> > sys 2m54.598s
> >
> > [Latin]$ time ./Latin4.pl 5 > x5
> >
> > real 12m51.178s
> > user 12m14.066s
> > sys 0m7.101s
> >
> > [Latin]$ time ./c_version.sh 5
> >
> > real 0m5.519s
> > user 0m4.585s
> > sys 0m0.691s
>
> Just to show the beauty of ruby:
> -----------------------------------------------------------
> require 'rubygems'
> require 'permutation'
> require 'set'
>
> $size = (ARGV.shift || 5).to_i
>
> $perms = Permutation.new($size).map{|p| p.value}
> $out = $perms.map{|p| p.map{|v| v+1}.join}
> $filter = $perms.map do |p|
>   s = SortedSet.new
>   $perms.each_with_index do |o, i|
>     o.each_with_index {|v, j| s.add(i) if p[j] == v}
>   end && s.to_a
> end
>
> $latins = []
> def search lines, possibs
>   return $latins << lines if lines.size == $size
>   possibs.each do |p|
>     search lines + [p], (possibs - $filter[p]).subtract(lines.last.to_i..p)
>   end
> end
>
> search [], SortedSet[*(0...$perms.size)]
>
> $latins.each do |latin|
>   $perms.each do |perm|
>     perm.each{|p| puts $out[latin[p]]}
>     puts
>   end
> end
> -----------------------------------------------------------
> (does someone have a nicer/even faster version?)
Here's a much slower version that has no 'require'.
Wd = ARGV.pop.to_i
$board = []

# Generate all possible valid rows.
Rows = (0...Wd**Wd).map{|n| n.to_s(Wd).rjust(Wd,'0')}.
  reject{|s| s=~/(.).*\1/}.map{|s| s.split(//).map{|n| n.to_i + 1} }

def check( ary, n )
  ary[0,n+1].transpose.all?{|x| x.size == x.uniq.size }
end

def add_a_row( row_num )
  if Wd == row_num
    puts $board.map{|row| row.join}.join(':')
  else
    Rows.size.times { |i|
      $board[row_num] = Rows[i]
      if check( $board, row_num )
        add_a_row( row_num + 1 )
      end
    }
  end
end

add_a_row 0
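
For comparison, here is a sketch of the same row-by-row backtracking that assumes a Ruby with the built-in Array#permutation, so it still needs no 'require'. The method name `latin_squares` is made up for the example, and it has not been timed against either version above:

```ruby
# Sketch: enumerate the Latin squares of order n by backtracking over rows.
# Returns each square as a string of rows joined with ':'.
def latin_squares(n)
  rows = (1..n).to_a.permutation.to_a   # every valid row, in lexicographic order
  results = []

  build = lambda do |board|
    if board.size == n
      results << board.map(&:join).join(':')
    else
      rows.each do |row|
        # A row fits if it differs from every earlier row in every column.
        next unless board.all? { |prev| prev.zip(row).none? { |a, b| a == b } }
        build.call(board + [row])
      end
    end
  end

  build.call([])
  results
end

squares = latin_squares(3)
```

This checks only the candidate row against the rows already placed, instead of transposing the whole board on every attempt.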

Keith Gaughan <kmgaughan@eircom.net>
Thu, Jul 27, 2006 at 10:56 AM
Reply-To: ruby-talk@ruby-lang.org
To: ruby-talk ML <ruby-talk@ruby-lang.org>

On Thu, Jul 27, 2006 at 01:59:37PM +0900, Chad Perrin wrote:
> On Thu, Jul 27, 2006 at 12:19:24PM +0900, Keith Gaughan wrote:
> > On Thu, Jul 27, 2006 at 11:38:34AM +0900, Chad Perrin wrote:
> >
> > > > ...most of which is down to Windows itself, not Excel. Excel's
> > > > contribution to that lag isn't, I believe, all that great. So in this
> > > > regard, your complaint is more to do with GDI and so on than with Excel
> > > > itself.
> > >
> > > Excel doesn't run so well on Linux, so I'll just leave that lying where
> > > it is.
> >
> > In fairness, if you're judging its performance based on running it in
> > Wine, that's not really a fair comparison. :-)
>
> I'm judging it based on running it on Windows. My point is that
> divorcing it from the only environment in which it runs (natively) is
> less than strictly sporting of you, when trying to discuss its
> performance characteristics (or lack thereof).
Wait... I did no such thing. All I said was that what interface sluggishness you get from Excel can't be blamed on Excel. They're performance characteristics that *can* be divorced from Excel (because they're Windows' own performance characteristics, not Excel's). Argue those points, and you're arguing about the wrong software.
But Wine is an emulator, and while it does a good job approaching the speed of Windows, it doesn't hit it, nor can it hit it. You're not comparing like with like. Now that's far from sporting.
Your argument is disingenuous. Consider Cygwin running on Windows compared to FreeBSD running on the same machine. I can make this comparison because the machine I'm currently using dual-boots such a setup. I run many of the same applications under Cygwin as I do under FreeBSD on the same box. Those same applications running under Cygwin are noticeably slower than the native equivalents under FreeBSD. Do I blame the software I'm running under Cygwin for being slow? No, because I'm well aware that it zips along in its native environment. Do I blame Cygwin? No, because it does an awful lot of work to trick the software running under it into thinking it's running on a *nix. Do I blame Windows? No, because I use some of that software--gcc being an example--natively under Windows and it performs just as well as when it's run natively under FreeBSD. Bringing Wine in is a red herring. Software cannot be blamed for the environment it's executed in.
K. -- Keith Gaughan - kmgaughan@eircom.net - http://talideon.com/ Fear is the greatest salesman. -- Robert Klein

Chad Perrin <perrin@apotheon.com>
Thu, Jul 27, 2006 at 11:12 AM
Reply-To: ruby-talk@ruby-lang.org
To: ruby-talk ML <ruby-talk@ruby-lang.org>

On Thu, Jul 27, 2006 at 02:26:37PM +0900, Keith Gaughan wrote:
> On Thu, Jul 27, 2006 at 01:59:37PM +0900, Chad Perrin wrote:
> >
> > I'm judging it based on running it on Windows. My point is that
> > divorcing it from the only environment in which it runs (natively) is
> > less than strictly sporting of you, when trying to discuss its
> > performance characteristics (or lack thereof).
>
> Wait... I did no such thing. All I said was that what interface
> sluggishness you get from Excel can't be blamed on Excel. They're
> performance characteristics that *can* be divorced from Excel (because
> they're Windows' own performance characteristics, not Excel's). Argue
> those points, and you're arguing about the wrong software.
Design decisions that involve interfacing with interface software that sucks are related to the software under discussion -- and not all of the interface is entirely delegated to Windows, either. No software can be evaluated for its performance characteristics separate from its environment except insofar as it runs without that environment.
>
> But Wine is an emulator, and while it does a good job approaching the
> speed of Windows, it doesn't hit it, nor can it hit it. You're not
> comparing like with like. Now that's far from sporting.
Actually, no, it's not an emulator. It's a set of libraries (or a single library -- I'm a little sketchy on the details) that provides the same API that Windows software finds in a Windows environment. An emulator actually creates a faux/copy version of the environment it's emulating. Wine stands to Windows as Linux stands to Unix, while an actual emulator stands to Windows as Cygwin stands to Unix: one is a differing implementation and the other is an emulator.
. . . and, in fact, there are things that run faster via Wine on Linux than natively on Windows.
[ snip ]
> under FreeBSD. Bringing Wine in is a red herring. Software cannot be
> blamed for the environment it's executed in.
I didn't bring it up. You did. I made a comment about Excel not working in Linux as a bit of a joke, attempting to make the point that saying Excel performance can be evaluated separately from its dependence on Windows doesn't strike me as useful.

Peter Hickman <peter@semantico.com>
Thu, Jul 27, 2006 at 1:25 PM
Reply-To: ruby-talk@ruby-lang.org
To: ruby-talk ML <ruby-talk@ruby-lang.org>

On my machine it took around 33 seconds, but I think I can improve it a little. Besides, I have to test the results first.

Ashley Moran <work@ashleymoran.me.uk>
Thu, Jul 27, 2006 at 2:43 PM
Reply-To: ruby-talk@ruby-lang.org
To: ruby-talk ML <ruby-talk@ruby-lang.org>

On Thursday 27 July 2006 05:23, Francis Cianfrocca wrote:
> Eliminating the dupes is the only scary thing I've seen here. What's the
> absolute smallest piece of data that you need to look at in order to
> distinguish a dupe? (If it's the whole line, then the answer is 16
> bytes- the length of the MD5 hash ;-)) That's the critical working set.
> If you can't get the Ruby version fast enough, it's cheap and easy to
> sort through 15,000,000 of them in C. Then one pass through the sorted
> set finds your dupes. I've never found a consistently-fastest performer
> among Ruby's several different ways of storing sorted sets.
>
> Make sure that your inner loop doesn't allocate any new variables,
> especially arrays- declare them outside your inner loop and re-use them
> with Array#clear.
Nice MD5 trick! I'll remember that. Fortunately the files that need duplicate elimination are really small, so I won't need to resort to that. But I'll remember it for future reference.
>
> Obviously I haven't seen your code or your data, but if the Ruby app is
> memory-bus-bound, then this approach may make your problem worse, not
> better.
Hadn't thought of that, good point...
> Good luck. I recently got a Ruby program that aggregates several LDAP
> directory-pulls with about a million entries down from a few hours to a
> few seconds, without having to drop into C. It can be done, and it's
> kind of fun too.
Next time I get a morning free I might apply some of the tweaks that have been suggested. Might be interesting to see how much I can improve the performance.
Cheers Ashley
-- "If you do it the stupid way, you will have to do it again" - Gregory Chudnovsky

Ashley Moran <work@ashleymoran.me.uk>
Thu, Jul 27, 2006 at 2:48 PM
Reply-To: ruby-talk@ruby-lang.org
To: ruby-talk ML <ruby-talk@ruby-lang.org>

On Wednesday 26 July 2006 23:13, Chad Perrin wrote:
> > I recently rewrote an 830 line Java/
> > Hibernate web service client as 67 lines of Ruby, in about an hour.
> > With that kind of productivity, performance can go to hell!
>
> With a 92% cut in code weight, I can certainly sympathize with that
> sentiment. Wow.
Even the last remaining member of the Anti-Ruby Defence League in my office admitted (reluctantly) that it was impressive!