RE: Properly Decentralising Steem & Cutting Costs by Witnesses running their own Servers & APIs
It is not that simple. You would need to use Jussi to route incoming API requests to deal with sharding a full node. This isn’t without issues. I’m one of the few who run Jussi outside of Steemit Inc.
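Roughly, what Jussi does is route each incoming JSON-RPC call to an upstream node based on the namespace in the method name. Here is a minimal Python sketch of that idea; the upstream URLs and the exact namespace split are made up for illustration, not Jussi's real configuration.

```python
# Sketch of namespace-based JSON-RPC routing (the idea behind Jussi).
# Upstream URLs and the namespace split below are illustrative only.

UPSTREAMS = {
    "account_history_api": "http://10.0.0.11:8090",  # machine running account history
    "tags_api":            "http://10.0.0.12:8090",  # machine running tags/follow
    "default":             "http://10.0.0.13:8090",  # everything else
}

def route(request: dict) -> str:
    """Pick an upstream node for a JSON-RPC request based on its method prefix."""
    method = request.get("method", "")
    namespace = method.split(".", 1)[0] if "." in method else "default"
    return UPSTREAMS.get(namespace, UPSTREAMS["default"])

if __name__ == "__main__":
    req = {"jsonrpc": "2.0",
           "method": "account_history_api.get_account_history",
           "params": {"account": "anyx", "start": -1, "limit": 10}, "id": 1}
    print(route(req))  # -> http://10.0.0.11:8090
```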
You can’t just put home hardware in data centers, and running a full node from your home is a disaster. Not only is it a poor experience for users, you are likely to have many replays (each of which will take days or weeks).
ECC ram is critical to a service that is meant to be online for months at a time.
Colocation costs for 3 machines will likely be near the cost of renting an appropriately sized larger server.
Scaling horizontally with steemd is critical, as more machines are definitely better than bigger ones as long as each meets the minimum requirements. The problem is that you now have three machines to manage, each with a replay time of days/weeks for every outage and patch that comes up. Because you are sharding one node into three, unless you have redundancy you have roughly tripled the risk of downtime. If any one node goes down, users of the node will likely see complete failure, as dapps generally need more than one part of the API. With the time to come back online (replay time) being days/weeks, you could in theory have cascading downtime between nodes that keeps the service offline even longer.
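To put rough numbers on that risk: if each shard is independently up, say, 99% of the time and a dapp needs all three, the combined availability is about 0.99³ ≈ 97%, roughly triple the downtime of a single machine, and that is before counting the days/weeks replay tail. A back-of-the-envelope sketch (the 99% figure is just an assumption):

```python
# Back-of-the-envelope availability of a full node sharded across N machines,
# where a dapp needs every shard to be up. Per-machine uptime is an assumption.

def combined_availability(per_machine_uptime: float, machines: int) -> float:
    """All shards must be up simultaneously; failures assumed independent."""
    return per_machine_uptime ** machines

single = 0.99  # assumed uptime of one machine
sharded = combined_availability(single, 3)

print(f"single machine downtime : {(1 - single) * 100:.1f}%")
print(f"3-shard downtime        : {(1 - sharded) * 100:.1f}%")
# ~1.0% vs ~3.0% -- roughly triple the downtime, before counting replay time.
```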
In the end, it might not even be cheaper, or perhaps only a little bit. If you bought all the hardware upfront like @anyx did, you certainly will have savings, but colocation costs x 3 will likely get you near the rental cost of a proper node.
Hi @markymark. Thanks for your detailed comments. I have responded to each below and would appreciate your further feedback.
Sure there is: you have 3 machines in place of the typical one. Unless you have redundancy by running 2 of each, there will be more downtime. This has nothing to do with the CPU; in fact, many public full nodes outside of Steemit already run on AMD CPUs.
This isn't the time-consuming part of a replay for a full node.
Because it takes days on an Intel Xeon Gold as well, if you don't have a lot of RAM or a super fast array of NVMe drives. A full node replay ranges from about 12 hours to 7 days depending on configuration.
While Internet speed is important (full nodes will easily push 5-20 TB/month of traffic), power is equally important, and a UPS isn't enough to cover outages in storm situations; depending on where you are, 2-5 days without power during the winter isn't uncommon.
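For scale, 5-20 TB/month works out to only a few tens of Mbit/s averaged out, which is why the power side is usually the harder problem at home. A quick conversion (assuming a 30-day month and decimal TB):

```python
# Convert monthly traffic into an average sustained rate
# (assumes 30-day months and 1 TB = 10^12 bytes).
SECONDS_PER_MONTH = 30 * 24 * 3600

def avg_mbps(tb_per_month: float) -> float:
    bits = tb_per_month * 1e12 * 8
    return bits / SECONDS_PER_MONTH / 1e6

for tb in (5, 20):
    print(f"{tb:>2} TB/month ~= {avg_mbps(tb):.0f} Mbit/s sustained")
# ~15 and ~62 Mbit/s on average -- peaks will of course be much higher.
```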
A full node is a public service, a consensus witness node is supposed to be private, hidden, and secure. It is not in the best interest of the network to have them combined on a server that is being accessed by the public.
Thanks again. A few clarifications on replay times.
What do you mean by "a lot" of RAM for replay purposes? Is 128 GB not a lot?
NVMe drives rated at up to 3,400 MB/s are super cheap these days. See https://www.newegg.com/Product/Product.aspx?Item=N82E16820147691
Is this what you mean?
No, 128GB is not enough. The shared memory file (which is nothing but RAM) is around 260GB right now.
I am actually testing using various crazy configurations to reduce the amount of RAM that is needed & even managed to get Intel's blessing for the same. ( https://steemit.com/steemdev/@bobinson/accelerating-steem-with-intel-optane )
Optane is better than NVMe in many cases, and if the direct memory access works as expected, full nodes, witness nodes, etc. can be a lot cheaper.
Yes, I had read your Optane post. It will be very interesting to see the replay improvement from NVMe to Optane.
So replay is slightly faster on Optane in comparison to SSD when running in seed and witness modes. But the real challenge is the full nodes, where memory usage is very high and the history files need an additional ~150 GB of disk space. For Steemit too, full nodes seem to be where the majority of the infrastructure expenses come from. There are very few like @anyx, @themarkymark and a handful of others running full nodes apart from Steemit Inc.
Given that replay times are primarily single-core CPU bound, I'm interested in whether the much higher single-core performance of HEDT CPUs with only 64 GB or 128 GB RAM, combined with Optane and/or NVMe, can provide much faster replay times at a much lower price for both witness and full RPC nodes.
If Optane can act as extra RAM (via cache), then the RAM limitations of 64 GB (for LGA 1151 motherboards) and 128 GB (for LGA 2066 motherboards) are overcome, and the dramatically faster single-core performance will dominate.
@bobinson did a replay in a little over 5 hours on a Xeon Gold 6142 with "block_log, block_log.index and shared_memory.bin all on Optane". This CPU is clocked at 2.6 GHz base / 3.7 GHz boost with a single-core PassMark of only 1909 (based on the similar Xeon Gold 6126). That is faster than @anyx's Xeon Gold 6130 (at 1636) but much slower than HEDT CPUs.
By comparison, the i9-9900K scores 2909 single-core on PassMark and the i7-7740X scores 2622. Even the lowly i5-8400 scores 2335.
With the same all-on-Optane setup, these HEDT CPUs with much faster single-core performance could bring replay times under 3 hours.
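As a sanity check, here is the naive scaling I have in mind: take @bobinson's ~5 hour replay on the Xeon Gold 6142 and scale it by the single-core PassMark ratios above. This assumes replay time scales linearly with single-core score and that storage is never the bottleneck, which is optimistic, so treat it only as a rough estimate.

```python
# Naive estimate: scale a known replay time by single-core PassMark ratio.
# Assumes replay is purely single-core bound and storage never bottlenecks.

baseline_hours = 5.0   # @bobinson's all-on-Optane replay on the Xeon Gold 6142
baseline_score = 1909  # single-core PassMark (per the similar Xeon Gold 6126)

hedt_cpus = {"i9-9900K": 2909, "i7-7740X": 2622, "i5-8400": 2335}

for cpu, score in hedt_cpus.items():
    estimate = baseline_hours * baseline_score / score
    print(f"{cpu:<9} -> ~{estimate:.1f} h estimated replay")
# i9-9900K ~3.3 h, i7-7740X ~3.6 h, i5-8400 ~4.1 h under these assumptions.
```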
@themarkymark & @anyx: If replay times are less than 3 hours then multiple cheap HEDT machines (even without ECC) will provide much greater overall uptime & performance than a single Xeon Gold for both witness & full RPC nodes.
I am doing a test with 1.4 TB of memory, which is system memory + Optane in IMDT. Hopefully this will give more details.
Not for a full node it isn't, but if you are breaking modules across machines, it isn't so bad. But unless 100% is in RAM, you will notice a huge increase in replay times.
Most NVMe drives will do nowhere near 3400 MB/s in reality, more like 1000-2400 MB/s: far more than a regular SSD, but nowhere near what they claim, and FAR FAR from real memory speeds.
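The gap that hurts replay is not sequential throughput but access latency, since replay does a huge number of small random accesses into shared memory. Ballpark figures below are my own assumptions, order-of-magnitude only, not benchmarks:

```python
# Order-of-magnitude latency comparison (ballpark assumptions, not benchmarks).
# Per-access latency matters far more for replay than the sequential MB/s on
# the spec sheet, because shared memory access is dominated by random reads.

latencies_ns = {
    "DRAM":                100,      # ~100 ns
    "Optane (approx)":     10_000,   # ~10 us
    "NVMe flash (approx)": 100_000,  # ~100 us
}

dram = latencies_ns["DRAM"]
for name, ns in latencies_ns.items():
    print(f"{name:<20} ~{ns:>8,} ns  ({ns / dram:,.0f}x DRAM)")
```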
Also, 3 machines can create redundancy that a single mega server doesn't have. Heavily used APIs can be on 2 of the 3 machines while lightly used ones sit on only one. If you go to 4 x $2000 machines you get heaps of redundancy for half the annual price of renting a mega server.
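Concretely, something like the layout below is what I mean: every heavily used API lives on at least two of the three machines, the lightly used ones on one. The plugin names are the usual steemd ones, but which machine gets what is just an illustration, not a tested configuration.

```python
# Illustrative split of API plugins across 3 machines, with heavy APIs duplicated.
# Plugin names are the usual steemd ones; the assignment below is made up.

placement = {
    "machine-1": {"condenser_api", "database_api", "account_history"},
    "machine-2": {"condenser_api", "database_api", "tags", "follow"},
    "machine-3": {"account_history", "tags", "follow", "market_history"},
}

heavily_used = {"condenser_api", "database_api", "account_history", "tags", "follow"}

for api in sorted(heavily_used):
    copies = sum(api in plugins for plugins in placement.values())
    status = "OK" if copies >= 2 else "SINGLE POINT OF FAILURE"
    print(f"{api:<16} on {copies} machine(s)  {status}")
```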
Not really, unless you are talking about 3 machines each running ALL plugins. If so, 128GB will be a problem. If they are each running only some of the plugins, then you will have problems: when one goes down, they basically all go down, because your account history may be up but your tags or follows are down, and that breaks whatever app you are using.
Jussi would handle the routing, but it does not have load balancing.
So splitting plugins across three machines increases your risk of downtime. Much like RAID 0: if any disk goes down, you are down.
Re security, can it not be achieved by running the witness node and API node in separate Docker containers or VMs, sharing only the consensus data? You can even use the dual Ethernet on most HEDT motherboards to provide 2 completely separate internet connections and IP addresses (one hidden, one public). Only in the event of one connection going down would they share internet. Redundant internet is a small cost. This is the sort of setup you can only do on your own machines, not in a data center.
Just knowing the IP of a witness node is a security risk. Even if you had multiple IPs they would most likely be on the same subnet and easy to track down.
You can't share consensus data between VMs; they can't both be writing to the blockchain file.
Two diverse internet providers would provide completely different IPs.
Surely only the witness node would be writing to the blockchain file? Isn't that the whole point of dPOS consensus?
The APIs should just be reading from it? Or am I misunderstanding something.
If so then Docker can allow this. https://www.digitalocean.com/community/tutorials/how-to-share-data-between-docker-containers
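To be concrete about what I meant, something like the sketch below, using the Docker Python SDK (docker-py): the witness container gets the blockchain volume read-write, the API container mounts it read-only. The image name, volume name and paths are hypothetical, and whether steemd can actually read a live blockchain directory this way is exactly the question.

```python
# Sketch of the container layout I had in mind, using the docker SDK (docker-py).
# Image, volume name and paths are hypothetical; this shows the volume wiring
# only, not a claim that steemd supports sharing a live blockchain this way.
import docker

client = docker.from_env()
data = client.volumes.create(name="steem-blockchain")

# Witness node: private, writes to the blockchain directory.
client.containers.run(
    "steemd:latest", name="witness", detach=True,
    volumes={data.name: {"bind": "/steem/blockchain", "mode": "rw"}},
)

# API node: public, mounts the same directory read-only.
client.containers.run(
    "steemd:latest", name="api", detach=True,
    volumes={data.name: {"bind": "/steem/blockchain", "mode": "ro"}},
    ports={"8090/tcp": 8090},
)
```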
@anyx your thoughts on these issues would be appreciated.
An RPC node runs the witness plugin, and the blockchain file grows on all nodes, even seed nodes. They cannot share the same file.
While docker images can easily share data, that does not mean the underlying applications can.
I thought you were talking about sharing a witness and full node on the same hardware; now you are talking about two ISPs?
HEDT motherboards often have two Ethernet ports. Connections from two separate ISPs can plug into the two ports, and each VM or Docker container can use a different ISP as its main connection with the other as backup. Only in the event of an outage on one would the witness & API nodes be using the same ISP & IP address range.
Power outages are extremely rare in central Tel Aviv but I know that they are much more common in parts of the US.
Maybe so, but that is only one of many issues, and I can assure you power outages at home are far more common than in a data center, which has virtually zero. Even a 10 second power outage will cost you 18 hours to 14 days of downtime.