get it to run pifast :)
right... getting it now (yeah... 1% ;) ) So it's RAID but... er... RAIC... (if I'm the first person to think of that I'd like royalties pls)
Is the software that runs it self-written, or is there a de facto set of software developers who create it?
And in either case IS IT a network.... CAT5 cable and stuff? Or does it slot together like interconnectable motherboards? Or is it a kinda "blade" that slots like a daughterboard into a big motherboard?
I'm well intrigued :) sounds very cool indeed.
Trying to take me on at my own game I see Rys, getting people involved with a project and getting them posting, à la the HEXUS living document.
Rest assured, I shall dig deep and rise to the challenge.
:D
Seriously though mate, must be enough trouble to do, let alone make posts telling everyone about it. Nice work. :)
The software can be self-written if you know what you're doing, and in many cases existing software can be adapted to a clustered computing environment without much bother, too. There are companies and software teams out there that specialise in creating applications just for cluster setups, but that doesn't have to be the case.
Quote:
Originally Posted by Zak33
Then you can take things like video encoding, something that lends itself well to distribution in a clustered system and that's something that most people can understand. Video encoding can go lots faster on a cluster, distributing the workload to all the systems in the cluster to get your work done faster.
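Purely as an illustration of that "distribute the workload" idea (this is a sketch, not anything from Rys's actual setup), splitting a video encode across a cluster boils down to dividing the frame range as evenly as possible among the nodes, with each node encoding its own chunk independently:

```python
def split_frames(total_frames, num_nodes):
    """Divide frames 0..total_frames-1 as evenly as possible across
    num_nodes, returning (start, end) pairs with end exclusive."""
    base, extra = divmod(total_frames, num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        # The first `extra` nodes take one frame more than the rest,
        # so chunk sizes never differ by more than one frame.
        size = base + (1 if node < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

# Example: a 100,000-frame film over 8 compute nodes.
for node, (start, end) in enumerate(split_frames(100_000, 8)):
    print(f"node {node}: frames {start}..{end - 1}")
```

Each node then only needs its own slice of the source material, which is why encoding scales so well on a cluster.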
As far as connecting them goes, my cluster is currently only connected via regular network cable (CAT5) into a 100Mbit Ethernet switch. And the cluster will run just fine over that. It's hardware most people have already.
I also have a Myrinet interconnect for the compute nodes, which has much (orders of magnitude!) lower latency than any Ethernet variant and higher bandwidth (Myrinet 1000, which I have, runs at 1.28Gbit/sec) than even gigabit Ethernet. So once the Myrinet is configured, the compute nodes will be able to talk to each other faster and at a higher rate.
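To see why latency matters as much as raw bandwidth here, a back-of-the-envelope model helps: transfer time ≈ fixed latency + message size / bandwidth. The figures below are illustrative assumptions, not measurements from this cluster, but they show that for the small messages cluster jobs exchange constantly, the interconnect's latency dominates completely:

```python
def transfer_time_us(msg_bytes, latency_us, bandwidth_gbit):
    """Rough model: time = fixed per-message latency + serialisation time."""
    bytes_per_us = bandwidth_gbit * 1e9 / 8 / 1e6  # bytes per microsecond
    return latency_us + msg_bytes / bytes_per_us

# Assumed illustrative figures: ~100us latency on 100Mbit Ethernet,
# ~5us on a Myrinet-class low-latency interconnect.
for size in (64, 1_000_000):
    eth = transfer_time_us(size, latency_us=100, bandwidth_gbit=0.1)
    myri = transfer_time_us(size, latency_us=5, bandwidth_gbit=1.28)
    print(f"{size} bytes: Ethernet ~{eth:.0f}us, Myrinet ~{myri:.0f}us")
```

For a 64-byte message the bandwidth term is negligible and the latency gap is nearly the whole story, which is exactly why message-passing jobs love a low-latency interconnect.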
RAIC is a great term to use for home clustering! Cobble together a bunch of older machines, install an operating system on them that supports clustering, and run the applications that can take advantage of it. It doesn't have to be an expensive cluster (like mine, which when first deployed in 1990 cost the wrong side of £50,000), it can just be a bunch of EPIAs or something.
The thing to remember is that in terms of hardware, there's many ways to create a cluster. Blade servers with a common backplane and chassis, separate machines connected via CAT5 cable and Ethernet, or Myrinet (or Infiniband and loads of other interconnect standards), or what have you.
Just connect machines really :devilish:
It's then the software on top that's the difficult part, which I'll cover in due course with my own cluster.
Rys
I'm just disappointed at the lack of the word "beowulf". I guess I spend far too much time on slashdot.
Keep up the good work Rys :)
So are they all connected by Ethernet now, and will they be using Myrinet when you have set it up? Or are just some of the connections Myrinet, and the rest Ethernet?
Quote:
As far as connecting them goes, my cluster is currently only connected via regular network cable (CAT5) into a 100Mbit Ethernet switch .... I also have a Myrinet interconnect for the compute nodes too, which is much (orders of magnitude!) lower latency than any Ethernet variant, allowing the nodes to talk to each other faster, and at higher bandwidth
You were right the first time :) So...
Quote:
Originally Posted by Mblaster
They're all connected via Ethernet atm, which will be a permanent fixture since the front-end has no Myrinet and has to issue jobs and the like over Ethernet. However the compute nodes have Myrinet, which when configured will be the interface they use to talk to each other when jobs are running. The Myrinet is idle just now though, unconfigured so far.
So just now, if I send out a job, it goes out over Ethernet, is computed using Ethernet as the transport, and comes back to the front-end using Ethernet.
In the future it'll be out over Ethernet, compute over Myrinet, back over Ethernet when the job is done.
Rys
The final four chodenodes are built
The final four chodenodes, configured in Rocks as a separate rack cabinet since they're in two piles of four, got built tonight. There was an issue with the front-end dropping the connection that the chodenodes pulled their install images from, but I got there in the end.
I'll boot the full cluster with all nodes attached at lunchtime tomorrow (today, since it's 3.26am), so expect another update soon after with an obligatory Ganglia screenshot to celebrate bringing all CPUs online.
What I haven't done so far is detail what software is being used, how it was set up on the front-end, and how you bootstrap the compute nodes over the network using the front-end, so I'll do a series of posts sometime this week detailing that. It'll be much what you'll find in the Rocks base install guide, with insight to match my particular cluster setup.
So that's milestone number one reached: the front-end machine and all the compute nodes are set up and can talk to each other correctly over Ethernet.
Milestone two will be to have them all running simultaneously, with milestone three being to successfully run jobs over the configured Myrinet.
Posted by Rys at January 25, 2005 03:24 AM
Nice work Rys! I'm good friends with The Tim and have experience with Win2k clusters but very little with Linux - I'll be watching this thread closely :D
Milestone two reached; all 18 CPUs connected
I switched the front-end on during lunch to have a quick look at the Myrinet configuration (the kernel isn't loading the Myrinet module yet) before turning on the two banks of nodes. Everything came up just fine and the end result brought a smile to my face. All eighteen processors (sixteen for the compute nodes and two for the front-end) can be seen, and the graphs at the top of that Ganglia display show you just how I booted it.
If you look at the memory graph, which I've overlaid with a few labels, you can see me bring up the front-end, which registers itself and says hello to Ganglia. Then I bring up the first 'cabinet' of compute nodes a short while later, once I was done playing with the Myrinet config, and finally the second cabinet of nodes after the first four are showing in Ganglia.
That information is shown properly in the load graph too; I just used the memory size graph since it had fewer metrics to look at. So in that graph, CPU count starts at 2, rises to 10, then to 18 (red metric). You can see the same trend in the node count metric (green). Notice how the process count spikes as each node is switched on, as they talk to the front-end to say hello and join the cluster.
Myrinet configuration is next, sometime this week or at the weekend.
Posted by Rys at January 25, 2005 02:09 PM
99% of the way to milestone three, Myrinet is up (just)
Tim and I successfully brought up the Myrinet interconnect on superchode tonight. Rocks gets you most of the way there on a default 3.3.0 install, but not quite. Instead of quickly fixing the default GM (Myricom's message passing layer for Myrinet) install supplied with Rocks, I decided to self-upgrade to 2.0.17 instead, which is the current release.
I did the initial testing and bootstrap of the GM mapper - software which controls the Myrinet routing amongst other things - on chodenode-0-0 (the first one in cabinet 0), which went fine, so Tim and I deployed it out onto a cabinet each, using NFS stores for common data. I'll document that process in due course. chodenode-1-0 has a slight cabling issue (masses of CRC errors logged by the hardware and no route to the mapper running on 0-1) which seems to be fixed now, but I'll be investigating getting some spare cables, just in case.
It all looks good now anyway. In terms of milestones, we didn't quite get round to pushing a job out over the cluster when the Myrinet was up, due to the problems with 1-0 and the time it took (it's now nearly 2am and Alex will kill me when I get to bed), so milestone three isn't officially met, but I'm sure we'll get there tomorrow when I bring it back up either at lunchtime if I'm not too busy (I probably will be) or at night.
In terms of performance, GM's benchmarking tools showed very low latency (data being moved in periods measured in the low numbers of microseconds) across seven hosts earlier, and in excess of 2Gbit/sec of bandwidth both reading and writing, per node. Aggregate inter-node bandwidth with all nodes should therefore be in the region of 16Gbit/sec.
Posted by Rys at January 26, 2005 01:45 AM
Rebuild success; milestone three reached
Tim and I spent the day rebuilding superchode to add some new Rolls to the cluster (Rolls are bundles of files and configuration data used to add functionality to a Rocks cluster). It's now in much better shape to be used for useful work, with clustered Java, C, C++ and Fortran implementations available for coding with, and their requisite libs for precompiled stuff.
Sun's Grid Engine is the default scheduler now too. You submit jobs to the SGE queue and it sends them out to the required nodes for processing. There's a bunch of queue tools for monitoring, and there's a web queue monitor if we're feeling lazy and can't be bothered SSH'ing in to the front-end machine.
The rebuild was made painful by an automount issue on the front-end (automount is the daemon that manages NFS shares for the home directories of users on the cluster) and the requirement for more swap space than we'd originally allocated, meaning that the first attempted rebuild fell over during the configuration phase.
However it's all up and running again, Myrinet is up and jobs can be scheduled over GM on the compute nodes. In short, the cluster is fully operational and better equipped to do some real work.
Milestone three was reached with a successful run of Linpack over Myrinet and two compute nodes, just before I shut it down for the night.
Milestone four will be the completion of useful work using the cluster. More on that after I rope Tom into things.
Posted by Rys at February 7, 2005 01:16 AM
pity ur selling it
is there like a simple way to get that sort of power shared? Like a rendering farm
what do you mean by "shared", and what's your understanding of the words "rendering farm"?
Quote:
Originally Posted by Matt1eD
shared - like my computer connected to loads of others over some sort of network switch to do more work for me.
Quote:
Originally Posted by directhex
rendering farm - a computer cluster thing that like has everything split up (like images of a film) between the processors of the farm's other computers. No doubt am wrong but a big version of what a dual processor mobo does
well, that's what confused me - a cluster is designed for precisely that!
you connect to a "front end" or "master" node, and submit a "job" (usually a script detailing work to be done) using a scheduler - Sun GridEngine is a common scheduler, as is PBSPro or Torque. The scheduler then sends your job script to X number of machines in the cluster (as specified when you ran the submission), and lets the job run either to completion, or until a time limit has been reached.
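if you squint, the scheduler's core job is just bookkeeping of free CPU slots. purely as a sketch (this isn't GridEngine or PBS code, and all the names are made up), the idea looks something like this in Python:

```python
class ToyScheduler:
    """A drastically simplified picture of what SGE/PBS/Torque do:
    track free CPU slots per node and place jobs onto free slots."""

    def __init__(self, slots_per_node):
        # e.g. {"node0": 2, "node1": 2} for two dual-CPU compute nodes
        self.free = dict(slots_per_node)

    def submit(self, job_name, cpus_wanted):
        """Greedily grab free slots for the job; return the allocation,
        or None if the job has to wait in the queue for now."""
        alloc, needed = {}, cpus_wanted
        for node, free in self.free.items():
            if needed == 0:
                break
            take = min(free, needed)
            if take:
                alloc[node] = take
                needed -= take
        if needed:  # not enough free slots right now -> job queues
            return None
        for node, take in alloc.items():
            self.free[node] -= take
        return alloc

sched = ToyScheduler({"node0": 2, "node1": 2})
print(sched.submit("render", 3))  # spans both nodes
print(sched.submit("encode", 2))  # only one slot left, so it queues
```

a real scheduler also handles priorities, time limits and releasing slots when jobs finish, but the slot accounting above is the heart of it.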
What you need though are programs designed to be run on a cluster - more specifically, they need to use MPI (Message Passing Interface) to send messages to each other. In this case, SuperChode has a dedicated Myrinet (low latency) network interconnect to do message passing, as well as regular Ethernet for file transfer et al. Using a Myrinet-capable application (or compiling one on the master node) means you can split the job up at will.
Alternatively, you could submit sixteen 1-CPU jobs with their own submission scripts and no need to interoperate. The scheduler would take care of that.
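real cluster code would use an MPI library (over GM on the Myrinet, in SuperChode's case), but the scatter/compute/gather pattern MPI enables can be sketched with Python's standard library alone. this is the pattern in spirit only - each worker process below stands in for a compute node, summing its share of the data:

```python
from multiprocessing import Pool

def worker(chunk):
    # Stand-in for the real per-node computation; in an MPI program
    # each rank would receive its chunk via a scatter instead.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, num_workers=4):
    """Scatter `data` across workers, compute locally, gather partial
    results and reduce them - the classic MPI job shape, in miniature."""
    size = (len(data) + num_workers - 1) // num_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(num_workers) as pool:
        partials = pool.map(worker, chunks)  # "compute" phase, in parallel
    return sum(partials)                     # "gather/reduce" phase

if __name__ == "__main__":
    # Same answer as the serial sum, just computed in parallel pieces.
    print(parallel_sum_of_squares(list(range(1000))))
```

the sixteen-independent-jobs alternative is the degenerate case of this: no messages at all, just separate chunks and separate results.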