How to Run Nodes in a Cluster

The only time that error occurs is if you do not specify the data workers in your DataWorkerMultiaddrs field.

This is only useful for data workers that will be running on your localhost (the same machine as the control process), since remote machines will not be able to look up the control process PID on the control process server.
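For illustration, a hedged sketch of what the local-only case looks like in config.yml (the exact default shape is an assumption, not taken from the source code):

```yaml
engine:
  # Local-only sketch: with no entries listed here, the control process
  # spawns the data workers itself on the same machine and can hand each
  # worker its own PID to monitor.
  dataWorkerMultiaddrs: []
```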

Looking further into the code, the parent process PID is used to monitor the parent process: the parent process ID input is passed to the data worker, saved on the DataWorker object for later use, and then, on Start, the monitorParent function is called.

Notably, monitorParent will just log that the worker is running in detached mode if parentProcessId is - (the default when it is not passed in).

It's the only place the PID is used at all.
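The monitoring idea itself is just liveness-checking a PID. A minimal shell sketch of the same concept (not the actual Go implementation from the node):

```shell
# Sketch of PID liveness checking, the same idea monitorParent relies on.
# `kill -0` sends no signal; it only reports whether the process exists.
PARENT_PID=$$   # use our own shell's PID here so the check succeeds
if kill -0 "$PARENT_PID" 2>/dev/null; then
  echo "parent alive"
else
  echo "parent gone, worker would shut down"
fi
# prints "parent alive"
```

On a remote slave there is no parent on the same machine to check, which is why the PID mechanism only makes sense for local workers.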

So to me, this points to your config not working on your remote (non-master) machines.

I'd check:

  1. the config not being properly formatted on your clustered machines
  2. your indexing being off when starting the node command
  3. the config not being duplicated to the non-master machines (or not findable / not in the right location)
  4. network connectivity (can your devices actually communicate?)
  5. listening ports (are your data workers actually listening?)
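For points 4 and 5, a quick reachability probe can be sketched with bash's /dev/tcp (the host and port below are placeholders; point them at a real worker's address):

```shell
# Probe whether a data worker's port accepts TCP connections.
# HOST and PORT are placeholders; substitute a real worker address.
HOST=127.0.0.1
PORT=40000
if timeout 2 bash -c "exec 3<>/dev/tcp/$HOST/$PORT" 2>/dev/null; then
  echo "worker port reachable"
else
  echo "worker port unreachable"
fi
```

Run it from the control machine against each worker entry in dataWorkerMultiaddrs; on the worker machine itself, `ss -tln` will show whether anything is listening on those ports.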

Do we need servers with the same performance for a cluster? If one server is much faster than the others, I'm wondering how quil deals with it; since we only have one "time_taken", will this fast server have to wait for the others?

Yes. You'd want similar hardware across your entire cluster, as anything else just introduces inefficiency/wasted time. You could start a new cluster with more powerful hardware that isn't dragged down by your older hardware, and you wouldn't want to add slower hardware into your existing clusters, as you would be shooting yourself in the foot.


Thank you for explaining the code :)

I was able to get this working thanks to @blacks1

  1. In the ceremonyclient.service file I had '/root' specified as the working directory, so the node binary was looking for (and generated default) config files in the /root/.config directory. I updated the working directory to '/root/ceremonyclient/node'.
  2. The config file had '- ' prefixes on the dataWorkerMultiaddrs entries, which is not valid YAML in that context. I removed those prefixes.

@Tyga, maybe you should add a note in the main tutorial to use either the square brackets or the '- ' prefixes for the dataWorkerMultiaddrs: entries, but not both.
It can be confusing for first-timers.
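For reference, both YAML notations below are equivalent; the problem arises only when they are mixed (the addresses are examples):

```yaml
# Flow style (square brackets, no '- ' prefixes):
dataWorkerMultiaddrs: [/ip4/192.168.0.200/tcp/40000, /ip4/192.168.0.200/tcp/40001]

# Block style ('- ' prefixes, no square brackets):
dataWorkerMultiaddrs:
  - /ip4/192.168.0.200/tcp/40000
  - /ip4/192.168.0.200/tcp/40001
```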


How can I leave breathing room on my master node? I tried deleting one worker from the multiaddrs, but it crashed; then I launched with --core 2 and ran fewer cores, but it crashed as well.

Now I'm thinking about cpulimit, but I'm not sure whether the control process will take resources. Any suggestions?

I removed the '- ' prefixes because you were using square brackets '[' and ']' to enclose the dataWorkerMultiaddrs entries (just like the default config file suggests).
The tutorial uses an alternative notation without square brackets, which is likely also valid.


OK. I just removed the last workers from the master's config, changed the script to use fewer cores, and it worked. Going to test whether it yields more rewards!
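What worked here can be sketched as simply listing fewer worker entries for the master than it has cores (the layout below is an illustrative example, not a recommendation):

```yaml
engine:
  dataWorkerMultiaddrs:
    # Example: the master has 8 cores, but only 6 workers are listed,
    # leaving roughly 2 cores of headroom for the control process.
    - /ip4/192.168.0.200/tcp/40000
    - /ip4/192.168.0.200/tcp/40001
    - /ip4/192.168.0.200/tcp/40002
    - /ip4/192.168.0.200/tcp/40003
    - /ip4/192.168.0.200/tcp/40004
    - /ip4/192.168.0.200/tcp/40005
```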

Does someone know if one could use Tailscale magic DNS, such as machine-a.taila7r35.ts.net in the config.yml and in the firewall rules?

This would allow for easy management when switching a machine, since AFAIU you can keep the same DNS name, which is derived from the name you assign to the machine in Tailscale.

dataWorkerMultiaddrs: [
   # Machine A
   /ip4/machine-a.taila7r35.ts.net/tcp/40000,
   /ip4/machine-a.taila7r35.ts.net/tcp/40001,
   /ip4/machine-a.taila7r35.ts.net/tcp/40002,
   ...

and
sudo ufw allow from machine-a.taila7r35.ts.net to any port 40003:40006 proto tcp

I made this thingy


/dns/<dns-address>/tcp/40000
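Applied to the MagicDNS example above, that would look something like the following (assuming the /dns multiaddr protocol resolves the Tailscale name as expected):

```yaml
dataWorkerMultiaddrs:
  # Machine A, addressed by its Tailscale MagicDNS name instead of an IP
  - /dns/machine-a.taila7r35.ts.net/tcp/40000
  - /dns/machine-a.taila7r35.ts.net/tcp/40001
  - /dns/machine-a.taila7r35.ts.net/tcp/40002
```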



I have looked around at other firewall solutions, but I can't find anything that supports DNS names in its rules.
But I haven't dug too much, TBH.

Pity, because it means that when switching machines one would have to re-edit at least the firewall rules.

Although, I see that Tailscale allows you to change the IP address of each machine. I wonder whether you could delete an old machine, add a new one, and then change its IP to that of the old machine…

Would it be enough for the slave nodes to have only this in the config file and nothing else?

engine:
  dataWorkerMultiaddrs: 
    # Machine A data workers
    - /ip4/192.168.0.200/tcp/40000
    - /ip4/192.168.0.200/tcp/40001
    - /ip4/192.168.0.200/tcp/40002
    # Machine B data workers
    - /ip4/192.168.0.201/tcp/40003
    - /ip4/192.168.0.201/tcp/40004
    - /ip4/192.168.0.201/tcp/40005
    - /ip4/192.168.0.201/tcp/40006

UPDATE:
Just tested, and I confirm that it works. This is very good for security, as it avoids keeping any sensitive data on the slave nodes.


Yes. As you found out, currently the only reason for the config on the slave node(s) is to get the addresses (and ports) of the DataWorker processes to create.

How to solve the issue where a problem with any single node in a cluster causes the entire cluster to stop working?

It should be noted that in 2.1 you need to copy your keys and config files to the slaves (the store is not needed).

On the slave:

$HOME/ceremonyclient/node/
$HOME/ceremonyclient/node/.config/
$HOME/ceremonyclient/node/.config/keys.yml
$HOME/ceremonyclient/node/.config/config.yml
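A local sketch of what must land on each slave — only the two YAML files, never the store (temp dirs stand in for the real paths here so the commands run anywhere; in practice you would scp or rsync to the slave host):

```shell
SRC=$(mktemp -d)    # stand-in for the master's .config directory
DEST=$(mktemp -d)   # stand-in for the slave's .config directory

# Master side: keys, config, and a store directory.
touch "$SRC/keys.yml" "$SRC/config.yml"
mkdir -p "$SRC/store"

# Copy only what the slave needs; the store stays behind on the master.
cp "$SRC/keys.yml" "$SRC/config.yml" "$DEST/"

ls "$DEST"   # config.yml  keys.yml -- and no store
```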

This is solved in 2.0: the cluster will just continue without that worker, even though it initially complains about it.