Steemit Retro: August & HF21/22

in #steemit, 5 years ago

Hello Steemians, it’s been a long couple of weeks, which is precisely why it was so important to hold an engineering retrospective while the events were still fresh in our minds.

Retro Recap

For those who aren’t already aware, we perform monthly retrospectives during which we systematically reflect on how we function as a team with the goal of continuously improving our processes. We want Steemians to have as much insight into what we are doing as possible, so today we’d like to share with you a summary of what we discussed in our most recent retrospective which covered the past month. If you would like to see last month’s retrospective, go here.

All retros use the same format in the same sequence, starting with “what went well,” so if you just want to read about what we think we did wrong, you can feel free to skip to that section ;)

What went well?

  • We continued to make good progress on SMTs, remaining ahead of schedule
  • Most of the backend work in Hivemind for Communities was completed
  • Preparation for the front end development for Communities began
  • HF 21 occurred (certainly more about this later)
  • We released video interviews with some of our engineers, which were well received
  • Testing for HF21 was much better than HF20 (or any other previous hardfork) in that it unearthed a number of bugs that would have made hardforking even more difficult
  • Despite the difficulties associated with the hardfork, the community seemed less anxious about the temporary interruption of services. We believe this was because the changes were so heavily directed by the community, and because communications were so much more extensive leading up to the hardfork
  • The economic changes already appear to be having a positive impact on Steem
  • The proposal system seems to be inspiring users to come up with new ways to add value to Steem
  • Whether due to the changes included in the hardfork, or the intent behind those changes, it would appear that a non-trivial number of inactive users, including influential users, have become active once again
  • We feel that our relationship with the Witnesses has become more collaborative and improved generally. A consequence of this is that we are better able to work together to come up with solutions, form a consensus, and implement necessary changes. This enabled us all to respond to the delegation bug extremely rapidly by releasing HF22
  • Tests performed on our seed node (or “exchange node”) proved useful
  • In-memory MIRA replays actually work on our account history config (as opposed to a full node) and are surprisingly fast
  • Communications on Twitter and Steemit during the outages were better than they have been in the past

What could have gone better?

  • Communications can always be better, especially during a crisis
  • CI issues for steemd caused longer build times
  • SPS API calls could be easier to work with. It would have been great to have a separate service that could handle the data on release day. Another option might be to handle a lot of this in client libraries
  • Calculations that we believed were safe from overflow turned out not to be. This led to a chain halt and to problems with certain operations on chain
  • For the purposes of improved debugging, newer code could have been wrapped in FC_CAPTURE_AND_RETHROW
  • The growth of the chain has made reindexing take a very long time
  • While in-memory MIRA replays were surprisingly fast, migrating state to disk took much longer than expected, effectively neutralizing the unexpected benefit that could accrue from in-memory MIRA replays
  • The challenges that have arisen out of hardforks have placed an abnormal, and unacceptable, burden on engineers. This is not only unfair to the engineers, but also leads to fear and anxiety about future hardforks. While Steem’s facility with respect to system upgrades is a feature we believe should be exploited, we must dedicate more effort to ensuring that this can be done in a way that sufficiently considers the psychological well-being of not just engineers, but also community members, stakeholders, users, exchanges and Witnesses.
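The overflow and FC_CAPTURE_AND_RETHROW items above can be illustrated with a minimal sketch. This is Python rather than steemd's C++, and it is not taken from the steemd codebase: Python integers never overflow, so the signed 64-bit range is emulated explicitly, and the names `checked_mul`, `capture_and_rethrow`, and `scale_amount` are hypothetical. The decorator only loosely mirrors the idea of re-raising an error with the call's arguments attached so that a failure is easier to debug.

```python
import functools

# Signed 64-bit range, emulated because Python integers are unbounded.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


class OverflowDetected(ArithmeticError):
    """Raised when a result would not fit in a signed 64-bit integer."""


def checked_mul(a: int, b: int) -> int:
    """Multiply two integers, refusing results outside the int64 range."""
    result = a * b  # exact here; fixed-width code could wrap or misbehave
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowDetected(f"{a} * {b} does not fit in int64")
    return result


def capture_and_rethrow(func):
    """Hypothetical helper: re-raise any error with the call's arguments attached."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            raise RuntimeError(
                f"{func.__name__} failed with args={args!r} kwargs={kwargs!r}"
            ) from exc
    return wrapper


@capture_and_rethrow
def scale_amount(amount: int, numerator: int, denominator: int) -> int:
    """Compute amount * numerator // denominator for non-negative inputs.

    The final result may fit in 64 bits even when the intermediate
    product does not, which is why the multiplication is checked.
    """
    return checked_mul(amount, numerator) // denominator
```

With these definitions, `scale_amount(1000, 3, 4)` returns 750, while `scale_amount(2**62, 4, 8)` raises an error that names the function and its arguments, even though the final quotient would have fit comfortably. That is the kind of "looks safe" calculation where only the intermediate value overflows.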

Escalations

  • Tests should be instrumented to exercise integers with higher values that could possibly trigger overflow situations
  • Saving only the last 5 days of state files is insufficient in the lead-up to hardforks
  • We should consider setting up a system to archive historical state files for a very long time
  • @vandeberg and @gerbino need more fast local storage so that they can debug live nodes locally
  • Platform independent state files, which were already part of the SMT spec, would have dramatically reduced downtime
  • In-memory MIRA replays should be further optimized
  • We need to profile reindexes and consider optimizing the business logic
  • MIRA itself could benefit from further optimizations
  • We should explore how we can optimize reindexes or engineer future releases so that reindexes are not needed
  • We need better testnet infrastructure. Tinman should be copying values that are as close to 1:1 with the mainnet as possible. Delegations should also be copied to the testnet
  • We must review SMT vesting calculations via tests and code inspection to ensure there is no overflow
  • We should separate production deployment code from the steemd repo to prevent requiring a rebuild for config/deployment changes
  • We should investigate whether a debug build for a seed node can keep up with the live chain well enough to be usable
  • We should consider on-call rotations for coverage to relieve the burden on other team members
  • The blockchain team should take some time off as soon as they can, and consider planning on taking time off immediately prior to hardforks to be sufficiently rested in the event of a worst case scenario
  • We should explore ways to expose more of our engineers to steemd code, including those who do not work on the back end. One way to do this might be regular “brown-bags” led by @vandeberg
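The escalations about exercising integers with higher values and reviewing calculations for overflow can be sketched as a small boundary-value generator. This is an illustrative Python sketch under stated assumptions, not our actual test tooling: `boundary_values` and `overflowing_pairs` are hypothetical names, and the signed 64-bit limit is assumed as the range of interest.

```python
import math

INT64_MAX = 2**63 - 1  # assumed limit of the fixed-width type under test


def boundary_values(limit=INT64_MAX):
    """Return values clustered around the points where overflow tends to appear."""
    root = math.isqrt(limit)  # largest value whose square still fits
    candidates = {
        0, 1,
        root - 1, root, root + 1,  # squaring flips from safe to unsafe here
        limit // 2,                # doubling is borderline here
        limit - 1, limit,
    }
    return sorted(candidates)


def overflowing_pairs(values, limit=INT64_MAX):
    """List every (a, b) pair from values whose exact product exceeds the limit."""
    return [(a, b) for a in values for b in values if a * b > limit]
```

A test suite could feed each pair from `overflowing_pairs(boundary_values())` into the calculation under review and assert that it either produces the mathematically correct result or fails with a clear error, rather than silently wrapping.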

This was by far our longest and most extensive retro yet, and for good reason. Few months have included such exciting developments and such difficult circumstances. We remain extremely excited about how Communities and SMTs are progressing, and believe that the preparations for HF21 were better than ever. That is part of what makes the downtime as a result of HF21 so disappointing. That being said, we do feel that we’ve come out of this experience with priceless information that can help ensure that the SMT hardfork proceeds more smoothly.

Stay Tuned


This post is only intended to summarize the results of our recent retrospective. We will continue to think very deeply about HF21/HF22: what went wrong, and what we can do better next time. We look forward to communicating more about this soon, so be sure to follow @steemitblog for more information.

Thank you for keeping calm and Steeming On.

The Steemit Team


This is really well done.

I'd like to see a communications plan revamped in escalations. 24hrs between tweets when down seems less than ideal.

Also, a lot of the ecosystem is dead even if nodes are up if they aren't Steemit nodes. This place is too centralized on you guys and that needs to change too. It's great that you've been reliable for a year and people trust the service, but it's bad that things aren't working if you're not up and running.

Thanks for your hard work on this. This was rough, but still could have been much worse, and the down time is the price we pay for being able to upgrade the chain.

I'm not a widely active Twitter user myself, but the downtime did make me realise how large of a following Steem/Steemit has on there. As well as more activity during downtime, it would be great to see more communication in general.

Also, completely agree on the centralisation issue. Everything shouldn't crumble because Steemit Inc's stuff goes down. The more true decentralisation, the better.

Decentralization is good. What pushes Steem/blockchain into such an incredible technological leap forward is our ability to create platforms that can/will improve how humans communicate with each other. It's still very early days, though, and experimentation with these new systems is compulsory.

Can we have a noobs guide to setting up a mira node?

I'd love to see that as well!

me too :-D

There is already kind of one: steem in a box by @someguy123 supports MIRA, and it's as easy as setting a parameter to true and you'll be running MIRA.

  • Most of the backend work in Hivemind for Communities was completed

that's pretty cool

  • Preparation for the front end development for Communities began

that's huge!

Thanks for continuing to put yourself out there and sharing the retrospective.

As I care about the future of Steem, I am pleased with what I read here. I'm being really judgy when I can't talk (with my poor English graces), but it could use a little wordsmithing. For example, the use of the word 'should' needs to be more detailed: why is it just a 'should'? Please assume I'm not very technical. I assume the 'shoulds' mean it's desirable and a lower priority but you will do it, or do you mean you'll just consider it more? If so, when? I know you're a pro team and have a concept of these things, but if you don't share... I feel a bit bad, as I can see such improvements coming through, but you need a bit more push from us users and lovers of Steem. Pls continue to ignore the crappy complaints and keep taking the reasonable ones on board, and you have my vote.
I really appreciate the commitment to continual improvement. We as a community also need to improve and help you. Pls help us to help you.
If you give me a UAT test, or even if you want me to create one, I'm happy to if it helps; just ask. Even better, post one along with your announcement of the HF change window and its heightened chance of problems.

Off the top of my head, and without enough techo background, my only other feedback/thoughts are:

If you have a HF, the backout plan should include another fast HF.
Your change window should include UAT testers throughout the community, and it's OK to suggest something like 'during the first week of a HF, can the community pls report problems, as we have a heightened risk of outage and/or a speedy HF fix'. It's our blockchain as well; let us be part of its future success and give you immediate and helpful feedback.

You then simply need one person collating all the UAT feedback and engaging with the community during the one-week change window, when the risk of outages and problems is heightened. I think one week of people being more attentive, reporting issues, and expecting an outage is reasonable for a global blockchain in the rare case of a HF. Let the entire ecosystem be your UAT testers, as we are all beneficiaries. It's also a great way to keep up two-way contact in a way that makes us feel more useful.

Cheers and keep Steeming on!

Any info on deposits being changed?
I went to transfer some STEEM from my Binance account and it's been rejected. Is there a new method that I am unaware of after the update?


Excellent retrospective. Love the transparency, specifically under escalations. It is essential that they get prioritized to avoid similar issues in upcoming HFs.

You forgot something. Communication and support for exchange nodes. I assume it is non-existent as it has been in previous forks and chain interruptions leaving exchanges without the ability to transfer STEEM in or out of the exchanges.


We are in constant communication with all active exchanges whenever updates are required, and we are here to support them with anything that they may need. In general, most of them have been very quick to respond. We also took time earlier this year to update our exchange node setup guide and associated deployment scripts to prevent common issues. Further, we provide real-time support even for exchanges that are in opposite time zones from us.

#newsteem on

That is the first time I've heard of this. Thank you very much for informing us. In the past I've seen exchanges take anywhere from 2 weeks to 6 months to get their node back in operation, which prevents their wallets from sending and receiving STEEM. One exchange has never recovered since February of 2018. Witnesses can get their nodes back in operation in a matter of hours. Exchanges should be able to as well, provided that coordination is maintained. In fact, it should be easier for exchanges to keep their nodes running, as they don't need to keep track of social media info (only STEEM transfers). They should have a stripped-down version of the STEEM node software that is very easy to replay.
Clock is ticking. How many days has it been since HF21/22, and exchanges still don't have nodes operational? I appreciate all you do and your thoughtful response. I just want to know if we are going to be waiting additional hours, days, weeks, months, or years for exchanges to get back online. I was reading a post a couple of days ago by someone who had just bought a bunch of STEEM. They were very excited to get in at this price and power up, but then discovered that they cannot move it from the exchange, so they are in a waiting game. Eventually that person is going to get pissed off. You know, the STEEM community can take advantage of a hard fork. People read about the news and want to join, but then realize that they can't transfer their newly purchased STEEM to their account and get discouraged. It is a real shame that this has happened regularly whenever there has been a disruption of the STEEM blockchain, and especially during a pre-planned hard fork.

@justinw do you know if Binance or Bittrex have planned or disclosed ETAs for when they will make the upgrades necessary on their end? Keep us posted if you are provided any details. Thank you!

So are we any better off in the long run?

👍
~Smartsteem Curation Team