How can the DRP tests of the SE sites be improved?



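See "Short planned outage on Wednesday, May 3, 2017 UTC (8 PM US/Eastern) (like a fire drill, for computers)", which concerns a test of the existing "Disaster Recovery Plan" (= DRP) for the whole family of SE sites.

If you were in charge of this work, what would you suggest to improve these DRP tests in production?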

1 answer:

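Note: It is probably not worth digging into exactly how StackExchange manages its disaster recovery scenarios. I suspect they already follow many of the best practices below and are simply testing the scenario to validate their configuration.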

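Depending on the environment you operate in:

  • A Disaster Recovery Plan may form part of a larger Business Continuity Plan, which also considers operational risks to your people, organisation, premises, information, partners, and management systems.

  • A Disaster Recovery Plan may break down into a number of IT Service Continuity Plans for individual services. The Disaster Recovery Plan may bring the people and process together with the technical aspects of those services.

With these definitions in mind, you might consider the following ways of improving the whole organisation's resilience to failure: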

  • Service Recovery:

    • Enable individual services to be Active-Active across two geographically dispersed data centres. This does assume that applications are capable of replicating state between data centres, for example using BASE Semantics for data.
    • Create Self-Healing services. This means expecting failure and building with Resilience Engineering in mind, for example by using a tool such as Chaos Monkey to simulate failures.
  • Disaster Recovery Plan:

    • Again, enable Active-Active mechanisms across data centres. The difference from service recovery plans is that capacity needs to be carefully considered: if you have two DCs in an Active-Active pattern and one fails, the single remaining DC must be scaled to support 100% of the traffic.
    • War Games and Rehearsals are really important for disaster recovery plans because they test the people and the process. In the most mature DevOps environments much of this can be automated, as evidenced by Chaos Gorilla.
  • Business Continuity Plan:

    • Since this is a DevOps site, I won't go into the long process of building Business Continuity plans. However, the rule of not keeping all of your eggs in one basket applies. Have a plan for when your office is flooded:
      • Allow your staff to work remotely from home one day a week. This tests your BCP strategies.
      • If possible, have two geographically and politically separate locations for your workforce.
      • Define and test a clear process for communicating business continuity events, and practise it through fire drills.
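
The capacity point under the Disaster Recovery Plan can be made concrete with a little arithmetic. The following is a minimal sketch in Python, assuming identical data centres and traffic that redistributes evenly across the survivors. The function names and the load figures are illustrative assumptions, not anything taken from StackExchange's setup.

```python
def max_safe_utilisation(total_sites: int, failed_sites: int = 1) -> float:
    """Highest per-site utilisation in normal operation that still lets
    the surviving sites carry 100% of the traffic after a failure.

    Assumes every site has the same capacity (an illustrative simplification).
    """
    if total_sites <= failed_sites:
        raise ValueError("at least one site must survive")
    return (total_sites - failed_sites) / total_sites


def survives_failure(peak_load: float, site_capacity: float,
                     total_sites: int, failed_sites: int = 1) -> bool:
    """True if the surviving sites can absorb the whole peak load."""
    surviving = total_sites - failed_sites
    return surviving >= 1 and peak_load <= surviving * site_capacity


# Two Active-Active DCs: each may run at no more than 50% in normal operation.
print(max_safe_utilisation(2))  # 0.5

# Hypothetical numbers: 1200 req/s peak, 1000 req/s per DC.
# Two DCs cannot lose one; three DCs can.
print(survives_failure(peak_load=1200, site_capacity=1000, total_sites=2))  # False
print(survives_failure(peak_load=1200, site_capacity=1000, total_sites=3))  # True
```

A check like this could run as part of a DRP rehearsal, using measured peak load rather than the made-up figures above, so that capacity drift is caught before a real failover.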