The Story of Apollo - Amazon’s Deployment Engine - All Things Distributed



The Story of Apollo - Amazon's Deployment Engine - All Things Distributed

Automated deployments are the backbone of a strong DevOps environment. Without efficient, reliable, and repeatable software updates, engineers need to redirect their focus from developing new features to managing and debugging their deployments. Amazon first faced this challenge many years ago.

When making the move to a service-oriented architecture, Amazon refactored its software into small independent services and restructured its organization into small autonomous teams. Each team took on full ownership of the development and operation of a single service, and they worked directly with their customers to improve it. With this clear focus and control, the teams were able to quickly produce new features, but their deployment process soon became a bottleneck. Manual deployment steps slowed down releases and introduced bugs caused by human error. Many teams started to fully automate their deployments to fix this, but that was not as simple as it first appeared.

Deploying software to a single host is easy. You can SSH into a machine, run a script, get the result, and you're done. The Amazon production environment, however, is more complex than that. Amazon web applications and web services run across large fleets of hosts spanning multiple data centers. The applications cannot afford any downtime, planned or otherwise. An automated deployment system needs to carefully sequence a software update across a fleet while it is actively receiving traffic. The system also requires the built-in logic to correctly respond to the many potential failure cases.

It didn't make sense for each of the small service teams to duplicate this work, so Amazon created a shared internal deployment service called Apollo. Apollo's job was to reliably deploy a specified set of software across a target fleet of hosts. Developers could define their software setup process for a single host, and Apollo would coordinate that update across an entire fleet of hosts. This made it easy for developers to "push-button" deploy their application to a development host for debugging, to a staging environment for tests, and finally to production to release an update to customers. The added efficiency and reliability of automated deployments removed the bottleneck and enabled the teams to rapidly deliver new features for their services.
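
A minimal sketch of the kind of rolling, health-checked update such a system performs is shown below; the interfaces and method names are illustrative assumptions, not Apollo's actual API.

```java
import java.util.List;

/**
 * Illustrative sketch (not Apollo's real API): roll an update across a fleet
 * one host at a time, verifying health before moving on and stopping early
 * on failure so a bad build never reaches the whole fleet.
 */
public class RollingDeployer {

    interface Host {
        void removeFromLoadBalancer();
        void addToLoadBalancer();
        void install(String buildId) throws Exception;
        boolean isHealthy();
    }

    public void deploy(String buildId, List<Host> fleet) {
        for (Host host : fleet) {
            host.removeFromLoadBalancer();      // drain traffic first
            try {
                host.install(buildId);          // run the per-host setup the team defined
                if (!host.isHealthy()) {
                    throw new IllegalStateException("health check failed on " + host);
                }
                host.addToLoadBalancer();       // only healthy hosts rejoin the fleet
            } catch (Exception e) {
                // stop the rollout; leave the rest of the fleet on the old version
                throw new RuntimeException("rollout of " + buildId + " halted", e);
            }
        }
    }
}
```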


Read full article from The Story of Apollo - Amazon's Deployment Engine - All Things Distributed


MySQL :: MySQL Connector/J 6.0 Developer Guide :: 4.3.1.3 Changes in the Connector/J API



MySQL :: MySQL Connector/J 6.0 Developer Guide :: 4.3.1.3 Changes in the Connector/J API

This section describes the changes to the Connector/J API going from version 5.1 to 6.0. You might need to adjust your API calls accordingly (a short usage sketch follows the list):

  • The name of the class that implements java.sql.Driver in MySQL Connector/J has changed from com.mysql.jdbc.Driver to com.mysql.cj.jdbc.Driver. The old class name has been deprecated.

  • The names of these commonly-used interfaces have also been changed:

    • ExceptionInterceptor: from com.mysql.jdbc.ExceptionInterceptor to com.mysql.cj.api.exceptions.ExceptionInterceptor

    • StatementInterceptor: from com.mysql.jdbc.StatementInterceptorV2 to com.mysql.cj.api.jdbc.interceptors.StatementInterceptorV2

    • ConnectionLifecycleInterceptor: from com.mysql.jdbc.ConnectionLifecycleInterceptor to com.mysql.cj.api.jdbc.interceptors.ConnectionLifecycleInterceptor

    • AuthenticationPlugin: from com.mysql.jdbc.AuthenticationPlugin to com.mysql.cj.api.authentication.AuthenticationPlugin

    • BalanceStrategy: from com.mysql.jdbc.BalanceStrategy to com.mysql.cj.api.jdbc.ha.BalanceStrategy.
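
As a short usage sketch of the driver rename in the first bullet (the JDBC URL and credentials are placeholders, and with JDBC 4.0+ drivers the explicit Class.forName call is optional), code that previously loaded com.mysql.jdbc.Driver would now reference the new class:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class ConnectorJ6Example {
    public static void main(String[] args) throws Exception {
        // New driver class in Connector/J 6.0; com.mysql.jdbc.Driver is deprecated.
        Class.forName("com.mysql.cj.jdbc.Driver");

        // Placeholder URL and credentials for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password")) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```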


Read full article from MySQL :: MySQL Connector/J 6.0 Developer Guide :: 4.3.1.3 Changes in the Connector/J API


(8) High Availability: What is a canary request? - Quora



(8) High Availability: What is a canary request? - Quora

It comes from the expression "canary in the coal mine". Miners used canaries to test for mine gases; if the canary had problems, the miners would know to get out.

Read full article from (8) High Availability: What is a canary request? - Quora


Designing a Search Engine: Design Patterns for Crawlers | Alejandro Moreno López | Pulse | LinkedIn



Designing a Search Engine: Design Patterns for Crawlers | Alejandro Moreno López | Pulse | LinkedIn

One of the things I've been really passionate about for the last few years is crawling technology and how search engines work. In fact, it was probably in my third year at university, while specialising in Artificial Intelligence, that I first became interested in the field.

It was then that I wrote a small engine in Python that crawled my hard disk and indexed all the information in a database. The idea was that once the user searched for something, all the data was already there, ready to be displayed much faster than the technologies of the time could manage.

Time passed, I passed the exam for that class, MacOS shipped something similar which is simply awesome (try searching for a file on your computer nowadays; beat that), and my interests moved on to the internet world… but I always kept an eye on crawling techniques, my original passion. That's when I started CruiseHunter, a set of algorithms that crawl the web indexing the best offers and prices for… yes, cruises.

In the beginning CruiseHunter was written in Ruby, a language I found quite pleasant for dealing with xml/html files and all the problems you can run into when crawling information from a site. Some time later, the project is still alive, more than ever I'd say, but it is now going through a major rewrite, using Symfony, Drupal and proper software design principles.

My time at Capgemini has fundamentally changed the way I see things, and I can say now that the software I write is much more maintainable... I'd say it is even beautiful.


Read full article from Designing a Search Engine: Design Patterns for Crawlers | Alejandro Moreno López | Pulse | LinkedIn


Donald Trump Stands A Real Chance Of Being The Biggest Loser In Modern Elections | Huffington Post



Donald Trump Stands A Real Chance Of Being The Biggest Loser In Modern Elections | Huffington Post

The magic number is 37.4 percent, and some polls show him below that. (Ryan Grim, 10/27/2016.) Donald Trump's greatest fear is probably public humiliation, and there's a decent chance he's headed for a historic one. The most unpopular presidential candidate in modern history is George McGovern, whose 1972 race against Richard Nixon ended with him raking in a puny 37.4 percent of the vote. There's at least one surface similarity to today's politics: though it didn't happen in anywhere near as public a fashion as the GOP flight from Trump, the Democratic establishment abandoned McGovern, the darling of lefty activists. The political fallout was immense. McGovern haunts Democrats to this day.

Read full article from Donald Trump Stands A Real Chance Of Being The Biggest Loser In Modern Elections | Huffington Post


On Designing and Deploying Internet-Scale Services



On Designing and Deploying Internet-Scale Services

The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we've seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations-friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.

Introduction

This paper summarizes a set of best practices for designing and developing operations-friendly services. This is a rapidly evolving subject area and, consequently, any list of best practices will likely grow and morph over time. Our aim is to help others

  • deliver operations-friendly services quickly and
  • avoid the early morning phone calls and meetings with unhappy customers that non-operations-friendly services tend to yield.

The work draws on our experiences over the last 20 years in high-scale data-centric software systems and internet-scale services, most recently from leading the Exchange Hosted Services team (at the time, a mid-sized service of roughly 700 servers and just over 2.2M users). We also incorporate the experiences of the Windows Live Search, Windows Live Mail, Exchange Hosted Services, Live Communications Server, Windows Live Address Book Clearing House (ABCH), MSN Spaces, Xbox Live, Rackable Systems Engineering Team, and the Messenger Operations teams in addition to that of the overall Microsoft Global Foundation Services Operations team. Several of these contributing services have grown to more than a quarter billion users. The paper also draws heavily on the work done at Berkeley on Recovery Oriented Computing [2, 3] and at Stanford on Crash-Only Software [4, 5].

Bill Hoffman [6] contributed many best practices to this paper, but also a set of three simple tenets worth considering up front:

  • Expect failures. A component may crash or be stopped at any time. Dependent components might fail or be stopped at any time. There will be network failures. Disks will run out of space. Handle all failures gracefully.
  • Keep things simple. Complexity breeds problems. Simple things are easier to get right. Avoid unnecessary dependencies. Installation should be simple. Failures on one server should have no impact on the rest of the data center.
  • Automate everything. People make mistakes. People need sleep. People forget things. Automated processes are testable, fixable, and therefore ultimately much more reliable. Automate wherever possible.

These three tenets form a common thread throughout much of the discussion that follows.

Recommendations

This section is organized into ten sub-sections, each covering a different aspect of what is required to design and deploy an operations-friendly service. These sub-sections include overall service design; designing for automation and provisioning; dependency management; release cycle and testing; hardware selection and standardization; operations and capacity planning; auditing, monitoring and alerting; graceful degradation and admission control; customer and press communications plan; and customer self provisioning and self help.

Overall Application Design

We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.

Throughout the sections that follow, a consensus emerges that firm separation of development, test, and operations isn't the most effective approach in the services world. The trend we've seen when looking across many services is that low-cost administration correlates highly with how closely the development, test, and operations teams work together.

In addition to the best practices on service design discussed here, the subsequent section, "Designing for Automation Management and Provisioning," also has substantial influence on service design. Effective automatic management and provisioning are generally achieved only with a constrained service model. This is a repeating theme throughout: simplicity is the key to efficient operations. Rational constraints on hardware selection, service design, and deployment models are a big driver of reduced administrative costs and greater service reliability.

Some of the operations-friendly basics that have the biggest impact on overall service design are:

  • Design for failure. This is a core concept when developing large services that comprise many cooperating components. Those components will fail and they will fail frequently. Nor do the components always fail independently of one another. Once the service has scaled beyond 10,000 servers and 50,000 disks, failures will occur multiple times a day. If a hardware failure requires any immediate administrative action, the service simply won't scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford [4, 5] has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren't frequently used, they won't work when needed [7].
  • Redundancy and fault recovery. The mainframe model was to buy one very large, very expensive server. Mainframes have redundant power supplies, hot-swappable CPUs, and exotic bus architectures that provide respectable I/O throughput in a single, tightly-coupled system. The obvious problem with these systems is their expense. And even with all the costly engineering, they still aren't sufficiently reliable. In order to get the fifth 9 of reliability, redundancy is required. Even getting four 9's on a single-system deployment is difficult. This concept is fairly well understood industry-wide, yet it's still common to see services built upon fragile, non-redundant data tiers. Designing a service such that any system can crash (or be brought down for service) at any time while still meeting the service level agreement (SLA) requires careful engineering. The acid test for full compliance with this design principle is the following: is the operations team willing and able to bring down any server in the service at any time without draining the work load first? If they are, then there is synchronous redundancy (no data loss), failure detection, and automatic take-over. As a design approach, we recommend one commonly used approach to find and correct potential service security issues: security threat modeling. In security threat modeling [8], we consider each possible security threat and, for each, implement adequate mitigation. The same approach can be applied to designing for fault resiliency and recovery. Document all conceivable component failures modes and combinations thereof. For each failure, ensure that the service can continue to operate without unacceptable loss in service quality, or determine that this failure risk is acceptable for this particular service (e.g., loss of an entire data center in a non-geo-redundant service). Very unusual combinations of failures may be determined sufficiently unlikely that ensuring the system can operate through them is uneconomical. Be cautious when making this judgment. We've been surprised at how frequently "unusual" combinations of events take place when running thousands of servers that produce millions of opportunities for component failures each day. Rare combinations can become commonplace.
  • Commodity hardware slice. All components of the service should target a commodity hardware slice. For example, storage-light servers will be dual socket, 2- to 4-core systems in the $1,000 to $2,500 range with a boot disk. Storage-heavy servers are similar servers with 16 to 24 disks. The key observations are:
    • large clusters of commodity servers are much less expensive than the small number of large servers they replace,
    • server performance continues to increase much faster than I/O performance, making a small server a more balanced system for a given amount of disk,
    • power consumption scales linearly with servers but cubically with clock frequency, making higher performance servers more expensive to operate, and
    • a small server affects a smaller proportion of the overall service workload when failing over.
  • Single-version software. Two factors that make some services less expensive to develop and faster to evolve than most packaged products are
    • the software needs to only target a single internal deployment and
    • previous versions don't have to be supported for a decade as is the case for enterprise-targeted products.
    Single-version software is relatively easy to achieve with a consumer service, especially one provided without charge. But it's equally important when selling subscription-based services to non-consumers. Enterprises are used to having significant influence over their software providers and to having complete control over when they deploy new versions (typically slowly). This drives up the cost of their operations and the cost of supporting them since so many versions of the software need to be supported. The most economic services don't give customers control over the version they run, and only host one version. Holding this single-version software line requires
    • care in not producing substantial user experience changes release-to-release and
    • a willingness to allow customers that need this level of control to either host internally or switch to an application service provider willing to provide this people-intensive multi-version support.
  • Multi-tenancy. Multi-tenancy is the hosting of all companies or end users of a service in the same service without physical isolation, whereas single tenancy is the segregation of groups of users in an isolated cluster. The argument for multi-tenancy is nearly identical to the argument for single version support and is based upon providing fundamentally lower cost of service built upon automation and large-scale.

In review, the basic design tenets and considerations we have laid out above are:

  • design for failure,
  • implement redundancy and fault recovery,
  • depend upon a commodity hardware slice,
  • support single-version software, and
  • enable multi-tenancy.

We are constraining the service design and operations model to maximize our ability to automate and to reduce the overall costs of the service. We draw a clear distinction between these goals and those of application service providers or IT outsourcers. Those businesses tend to be more people-intensive and more willing to run complex, customer-specific configurations.

More specific best practices for designing operations-friendly services are:

  • Quick service health check. This is the services version of a build verification test. It's a sniff test that can be run quickly on a developer's system to ensure that the service isn't broken in any substantive way. Not all edge cases are tested, but if the quick health check passes, the code can be checked in.
  • Develop in the full environment. Developers should be unit testing their components, but should also be testing the full service with their component changes. Achieving this goal efficiently requires single-server deployment (section 2.4), and the preceding best practice, a quick service health check.
  • Zero trust of underlying components. Assume that underlying components will fail and ensure that components will be able to recover and continue to provide service. The recovery technique is service-specific, but common techniques are to
    • continue to operate on cached data in read-only mode or
    • continue to provide service to all but a tiny fraction of the user base during the short time while the service is accessing the redundant copy of the failed component.
  • Do not build the same functionality in multiple components. Foreseeing future interactions is hard, and fixes have to be made in multiple parts of the system if code redundancy creeps in. Services grow and evolve quickly. Without care, the code base can deteriorate rapidly.
  • One pod or cluster should not affect another pod or cluster. Most services are formed of pods or sub-clusters of systems that work together to provide the service, where each pod is able to operate relatively independently. Each pod should be as close to 100% independent as possible, without inter-pod correlated failures. Global services, even with redundancy, are a central point of failure; sometimes they cannot be avoided, but try to have everything that a cluster needs inside the cluster.
  • Allow (rare) emergency human intervention. The common scenario for this is the movement of user data due to a catastrophic event or other emergency. Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction. These events will happen and operator error under these circumstances is a common source of catastrophic data loss. An operations engineer working under pressure at 2 a.m. will make mistakes. Design the system to first not require operations intervention under most circumstances, but work with operations to come up with recovery plans if they need to intervene. Rather than documenting these as multi-step, error-prone procedures, write them as scripts and test them in production to ensure they work. What isn't tested in production won't work, so periodically the operations team should conduct a "fire drill" using these tools. If the service-availability risk of a drill is excessively high, then insufficient investment has been made in the design, development, and testing of the tools.
  • Keep things simple and robust. Complicated algorithms and component interactions multiply the difficulty of debugging, deploying, etc. Simple and nearly stupid is almost always better in a high-scale service; the number of interacting failure modes is already daunting before complex optimizations are delivered. Our general rule is that optimizations that bring an order of magnitude improvement are worth considering, but percentage or even small factor gains aren't worth it.
  • Enforce admission control at all levels. Any good system is designed with admission control at the front door. This follows the long-understood principle that it's better to not let more work into an overloaded system than to continue accepting work and beginning to thrash. Some form of throttling or admission control is common at the entry to the service, but there should also be admission control at all major component boundaries. Workload characteristic changes will eventually lead to sub-component overload even though the overall service is operating within acceptable load levels. See the note below in section 2.8 on the "big red switch" as one way of gracefully degrading under excess load. The general rule is to attempt to gracefully degrade rather than hard failing and to block entry to the service before giving uniform poor service to all users. (A minimal throttling sketch follows this list.)
  • Partition the service. Partitions should be infinitely-adjustable and fine-grained, and not be bounded by any real world entity (person, collection...). If the partition is by company, then a big company will exceed the size of a single partition. If the partition is by name prefix, then eventually all the P's, for example, won't fit on a single server. We recommend using a look-up table at the mid-tier that maps fine-grained entities, typically users, to the system where their data is managed. Those fine-grained partitions can then be moved freely between servers.
  • Understand the network design. Test early to understand what load is driven between servers in a rack, across racks, and across data centers. Application developers must understand the network design and it must be reviewed early with networking specialists on the operations team.
  • Analyze throughput and latency. Analysis of the throughput and latency of core service user interactions should be performed to understand impact. Do so with other operations running such as regular database maintenance, operations configuration (new users added, users migrated), service debugging, etc. This will help catch issues driven by periodic management tasks. For each service, a metric should emerge for capacity planning such as user requests per second per system, concurrent on-line users per system, or some related metric that maps relevant work load to resource requirements.
  • Treat operations utilities as part of the service. Operations utilities produced by development, test, program management, and operations should be code-reviewed by development, checked into the main source tree, and tracked on the same schedule and with the same testing. Frequently these utilities are mission critical and yet nearly untested.
  • Understand access patterns. When planning new features, always consider what load they are going to put on the backend store. Often the service model and service developers become so abstracted away from the store that they lose sight of the load they are putting on the underlying database. A best practice is to build it into the specification with a section such as, "What impacts will this feature have on the rest of the infrastructure?" Then measure and validate the feature for load when it goes live.
  • Version everything. Expect to run in a mixed-version environment. The goal is to run single version software but multiple versions will be live during rollout and production testing. Versions n and n+1 of all components need to coexist peacefully.
  • Keep the unit/functional tests from the previous release. These tests are a great way of verifying that version n-1 functionality doesn't get broken. We recommend going one step further and constantly running service verification tests in production (more detail below).
  • Avoid single points of failure. Single points of failure will bring down the service or portions of the service when they fail. Prefer stateless implementations. Don't affinitize requests or clients to specific servers. Instead, load balance over a group of servers able to handle the load. Static hashing or any static work allocation to servers will suffer from data and/or query skew problems over time. Scaling out is easy when machines in a class are interchangeable. Databases are often single points of failure and database scaling remains one of the hardest problems in designing internet-scale services. Good designs use fine-grained partitioning and don't support cross-partition operations to allow efficient scaling across many database servers. All database state is stored redundantly on at least one fully redundant hot standby server, and failover is tested frequently in production.
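
To make the admission-control point above concrete, here is a minimal throttling sketch (the class names and limits are illustrative, not from the paper): a counting semaphore guards a component boundary and rejects work once the component is saturated, so the caller can degrade gracefully instead of letting the component thrash.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/**
 * Minimal admission-control sketch: cap the number of requests a component
 * works on at once and reject the rest immediately, rather than queueing
 * unbounded work and thrashing. Limits here are illustrative.
 */
public class AdmissionController {
    private final Semaphore permits;

    public AdmissionController(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the task if capacity is available; otherwise fails fast. */
    public <T> T submit(Supplier<T> task) {
        if (!permits.tryAcquire()) {
            // Fast rejection at the component boundary; the caller can degrade
            // gracefully (serve cached data, shed the request, retry later).
            throw new IllegalStateException("overloaded: request rejected");
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}
```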


Read full article from On Designing and Deploying Internet-Scale Services


A Conversation with Bruce Lindsay - ACM Queue



A Conversation with Bruce Lindsay - ACM Queue

A Conversation with Bruce Lindsay

Designing for failure may be the key to success.

Photography by Tom Upton

If you were looking for an expert in designing database management systems, you couldn't find many more qualified than IBM Fellow Bruce Lindsay. He has been involved in the architecture of RDBMS (relational database management systems) practically since before there were such systems. In 1978, fresh out of graduate school at the University of California at Berkeley with a Ph.D. in computer science, he joined IBM's San Jose Research Laboratory, where researchers were then working on what would become the foundation for IBM's SQL and DB2 database products. Lindsay has had a guiding hand in the evolution of RDBMS ever since.

In the late 1980s he helped define the DRDA (Distributed Relational Database Architecture) protocol and later was the principal architect of Starburst, an extensible database system that eventually became the query optimizer and interpreter for IBM's DB2 on Unix, Windows, and Linux. Lindsay developed the concept of database extenders, which treat multimedia data—images, voice, and audio—as objects that are extensions of standard relational database and can be queried using standard SQL (Structured Query Language). Today he is still at work deep in the data management lab at IBM's Almaden Research Center, helping to create the next generation in database management products.

Our interviewer this month is Steve Bourne, of Unix "Bourne Shell" fame. He has spent 20 years in senior engineering management positions at Cisco Systems, Sun Microsystems, Digital Equipment, and Silicon Graphics, and is now chief technology officer at the venture capital partnership El Dorado Ventures in Menlo Park, California. Earlier in his career he spent nine years at Bell Laboratories as a member of the Seventh Edition Unix team. While there, he designed the Unix Command Language ("Bourne Shell"), which is used for scripting in the Unix programming environment, and he wrote the ADB debugger tool. Bourne graduated with a degree in mathematics from King's College, London, and has a Ph.D. in mathematics from Trinity College in Cambridge, England.


Read full article from A Conversation with Bruce Lindsay - ACM Queue


Eventually Consistent - All Things Distributed



Eventually Consistent - All Things Distributed

Client side consistency

At the client side there are four components:

  • A storage system. For the moment we'll treat it as a black box, but if you want you should assume that under the covers it is something big and distributed and built to guarantee durability and availability.
  • Process A. A process that writes to and reads from the storage system.
  • Process B & C. Two processes independent of process A that also write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.

At the client side consistency has to do with how and when an observer (in this case processes A, B or C) sees updates made to a data object in the storage systems. In the following examples Process A has made an update to a data object.

  • Strong consistency. After the update completes any subsequent access (by A, B or C) will return the updated value.
  • Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. Often this condition is the passing of time. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
  • Eventual consistency. The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value. The most popular system that implements eventual consistency is DNS, the domain name system. Updates to a name are distributed according to a configured pattern and, in combination with time-controlled caches, eventually all clients will see the update.

There are a number of variations on the eventual consistency model that are important to consider:

  • Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
  • Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model (a small client-side sketch of this and of monotonic reads follows the list).
  • Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of certain failure scenarios a new session needs to be created, and the guarantees do not overlap the sessions.
  • Monotonic read consistency. If a process has seen a particular value for the object any subsequent accesses will never return any previous values.
  • Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.
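
As a small client-side sketch of the read-your-writes and monotonic-read models above (the storage interface and version scheme are hypothetical, not from the article), the client can track the highest version it has written or seen and refuse to accept anything older, retrying a replica until it catches up:

```java
/**
 * Client-side sketch of read-your-writes / monotonic reads via version
 * tracking. The VersionedStore interface below is hypothetical.
 */
public class VersionedClient {

    interface VersionedStore {
        long write(String key, String value);   // returns the new version
        Versioned read(String key);             // may return stale data
    }

    record Versioned(String value, long version) {}

    private final VersionedStore store;
    private long lastSeenVersion = 0;           // per-session watermark

    public VersionedClient(VersionedStore store) {
        this.store = store;
    }

    public void put(String key, String value) {
        lastSeenVersion = Math.max(lastSeenVersion, store.write(key, value));
    }

    public String get(String key) throws InterruptedException {
        while (true) {
            Versioned v = store.read(key);
            if (v.version() >= lastSeenVersion) {   // never go backwards
                lastSeenVersion = v.version();
                return v.value();
            }
            Thread.sleep(10);                       // replica is behind; retry
        }
    }
}
```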


Read full article from Eventually Consistent - All Things Distributed


Trie树和其它数据结构的比较 | 四火的唠叨



Trie树和其它数据结构的比较 | 四火的唠叨

A trie, also called a prefix tree or dictionary tree, is an ordered tree. Starting from the root, which represents the empty string, walking down to a node determines a corresponding string; in other words, all descendants of a node share a common prefix. Every trie can be viewed as a simplified deterministic finite automaton (DFA): for any given state of the automaton (①) and any character of its alphabet (②), the transition function (③) determines the next state, where:

  • ① each node in the trie corresponds to a state of the automaton;
  • ② given a character from the automaton's alphabet, the figure shows the branches formed by the different characters;
  • ③ moving from the current node down to a node on the next level is the result of applying the state transition function.

A very common application is search suggestion: type a prefix of your query, such as "乌鲁", into the search box and it suggests "乌鲁木齐" (Ürümqi); the autocomplete feature of input methods works on the same principle.
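
As a minimal sketch of the prefix-search idea behind such suggestions (illustrative code, not from the article), a trie can return every stored word that starts with the typed prefix:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal trie sketch: insert words and list those starting with a prefix. */
public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.isWord = true;
    }

    /** Returns all stored words that start with the given prefix. */
    public List<String> suggest(String prefix) {
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return List.of();   // no word has this prefix
        }
        List<String> out = new ArrayList<>();
        collect(cur, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```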

Compared with a binary search tree

A binary search tree, also called a binary sort tree, satisfies the following:

  • for any node, if its left subtree is non-empty, all values in the left subtree are smaller than the node's value;
  • for any node, if its right subtree is non-empty, all values in the right subtree are larger than the node's value;
  • the left and right subtrees are themselves binary search trees;
  • all node values are distinct.

The strength of a binary search tree lies in the time complexity of lookup and insertion, usually O(log n), and many set implementations are built on it. Insertion essentially adds a new leaf to the tree and avoids moving nodes; the complexity of search, insertion, and deletion equals the height of the tree, i.e. O(log n). In the worst case every node has only one child and the tree degenerates into a linear list, giving O(n).

In the worst case a trie is faster to search than a binary search tree: if the length of the search string is m, lookup is only O(m), which in the usual case (the number of tree nodes is far larger than the length of the search string) is much smaller than O(n).

Our trie examples always use strings, and in fact a trie places strict requirements on its keys: if the keys are floating-point numbers, the trie can become extremely long with nearly unreadable nodes, and it is then a poor fit for the data, whereas a binary search tree has no such problem.

Compared with a hash table

Consider hash-key collisions. We usually say a hash table is O(1), but strictly speaking that is the complexity of a near-perfect hash table; we also have to account for the hash function itself, which must traverse the search string and is therefore O(m). When different keys map to the "same slot" (with closed hashing, that "same slot" can be replaced by an ordinary linked list), the lookup cost depends on the number of entries in that slot, so in the worst case a hash table can also degenerate into a singly linked list (on hash collision issues, see the article "Hash Collision DoS问题").

A trie makes it easy to enumerate keys in alphabetical order (a single pre-order traversal of the tree does it), which differs from most hash tables (a hash table is generally unordered with respect to its keys).

In the ideal case a hash table hits its target in O(1); if the table is very large and has to live on disk, a hash lookup ideally needs only one disk access, whereas the number of disk accesses for a trie equals the depth of the node.

A trie often needs more space than a hash table: if each node stores a single character, a string cannot be stored as one contiguous block. Node compression in a trie alleviates this noticeably, as discussed below.

Compared with a suffix tree

[Figure: suffix tree of "banana"]

A suffix tree stores, in compressed form, every possible suffix of a text. For the word "banana" the possible suffixes are a, na, ana, nana, anana, and banana; the figure above places all of them in the tree, with "$" marking the end of a suffix, and the branches are compressed as far as possible (branch compression is also described below). For a text of length n, the defining points of its suffix tree include:

  • the tree has n leaves, numbered 1 to n;
  • apart from the root, every internal node has at least two children;
  • each edge is labeled with a non-empty substring of the text;
  • no two edges leaving the same node have labels that begin with the same character;
  • the string obtained by concatenating the edge labels on the path from the root to leaf i is a suffix of the original text.

Building a suffix tree takes time linear in the length of the text. Compared with a trie, a suffix tree trades space for time: for full-text search it indexes every possible suffix substring, avoiding a deep traversal of the whole trie. In algorithm problems, questions about "prefix substrings" are often solved with a trie, while questions involving only "substrings" usually call for a suffix tree. Another important use is in text compression: a suffix tree can find highly repetitive text so that the repeats can be extracted into a dictionary mapping table.

Improvements to the trie

1. Bitwise trie: in principle much like an ordinary trie, except that where an ordinary trie stores characters as its smallest unit, a bitwise trie stores bits. Bit access is done directly by a single CPU instruction, so for binary data it is theoretically faster than an ordinary trie.

2. Node compression.

  • ① Branch compression: for a stable trie that mostly serves lookups and reads, some branches can simply be compressed. For example, the rightmost branch "inn" in the earlier figure can be collapsed into a single node "inn" instead of existing as a regular subtree. The radix tree uses this idea to keep a trie from growing too deep.
  • ② Node mapping table: this approach is used when the trie's nodes are essentially fixed. Each node state in the trie is represented through a multi-dimensional array of numbers (for example a Triple Array Trie); when many states repeat, storing the trie itself takes less space, at the cost of an extra mapping table.


Read full article from Trie树和其它数据结构的比较 | 四火的唠叨


Typeahead Archives - Useful Stuff



Typeahead Archives - Useful Stuff

Facebook is the biggest photo sharing service in the world and grows by several million images every week. The pre-2009 infrastructure used three NFS tiers, and even with some optimization that solution could not easily scale beyond a few billion images.

So in 2009 Facebook developed Haystack, an HTTP-based photo server. It is composed of 5 layers: HTTP server, Photo Store, Haystack Object Store, Filesystem and Storage.

Storage is made up of storage blades using a RAID-6 configuration, which provides adequate redundancy and excellent read performance. The poor write performance is partially mitigated by the RAID controller's NVRAM write-back cache. The filesystem used is XFS, and it manages only storage-blade-local files; no NFS is used.


Read full article from Typeahead Archives - Useful Stuff


为什么不要把ZooKeeper用于服务发现



为什么不要把ZooKeeper用于服务发现

ZooKeeper is an open-source, highly available coordination service for distributed applications under the Apache Foundation, and many companies use it for service discovery. In a cloud environment, however, resilience to machine and network failures is a central concern: applications deployed in the cloud must anticipate hardware failures, network latency, and network partitions, and be built to recover from them. Peter Kelley, a software engineer at the personalized-education startup Knewton, argues that using ZooKeeper for service discovery is fundamentally the wrong approach, for the following reasons:

In ZooKeeper, when client nodes on one side of a network partition cannot reach a quorum, they lose contact with ZooKeeper and can no longer use its service-discovery mechanism; ZooKeeper therefore handles network partitions poorly for service discovery. For a coordination service that is acceptable, but for service discovery, information that may contain errors is better than no information at all. Client-side caching and other techniques can compensate, as companies such as Pinterest and Airbnb have done, but they do not solve the underlying problem: if the quorum is completely unavailable, or if the cluster partitions and a client happens to be connected to healthy nodes that are not part of the quorum, the client's state is still lost.

More importantly, these workarounds essentially try to use caching to raise the availability of a consistent system, that is, to build an AP system on top of a CP one, which is simply the wrong approach. A service-discovery system should be designed for availability from the start.

CAP theory aside, ZooKeeper is also hard to set up and maintain, so much so that Knewton ran into trouble several times through incorrect use. Things that look simple turn out to be easy to get wrong in practice, such as re-creating watchers on the client and handling sessions and exceptions. ZooKeeper itself also has real issues, such as ZOOKEEPER-1159 and ZOOKEEPER-1576.
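
To make the client-side caching workaround above concrete, here is a minimal, hypothetical sketch (the Registry interface is an assumption, not ZooKeeper's or Knewton's actual client code): the discovery client keeps serving the last successfully fetched endpoint list when the registry is unreachable, trading freshness for availability.

```java
import java.util.List;

/**
 * Sketch of a discovery client that prefers possibly-stale data over no data.
 * The Registry interface is hypothetical; a real implementation might wrap
 * a ZooKeeper client, a Eureka client, etc.
 */
public class CachingDiscoveryClient {

    interface Registry {
        List<String> lookup(String serviceName) throws Exception; // may fail
    }

    private final Registry registry;
    private volatile List<String> lastKnown = List.of();

    public CachingDiscoveryClient(Registry registry) {
        this.registry = registry;
    }

    public List<String> endpoints(String serviceName) {
        try {
            List<String> fresh = registry.lookup(serviceName);
            lastKnown = fresh;                 // refresh the cache on success
            return fresh;
        } catch (Exception unreachable) {
            return lastKnown;                  // registry down: serve stale data
        }
    }
}
```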


Read full article from 为什么不要把ZooKeeper用于服务发现


Linux按照CPU、内存、磁盘IO、网络性能监测 - chape的个人页面 - 开源中国社区



Linux按照CPU、内存、磁盘IO、网络性能监测 - chape的个人页面 - 开源中国社区

      System tuning is complex, tedious, long-term work. Before tuning you need to monitor, collect, test, and evaluate; after tuning you again need to test, collect, evaluate, and monitor. It is a long, continuous process: tuning and testing once does not settle things forever, nor does the advice in a book necessarily fit the system you are running right now. Different systems, different hardware, and different applications have different tuning priorities, methods, and parameters. Performance monitoring is a crucial link in the tuning process: without monitoring you do not know where the bottleneck is, so what would you tune? Finding the performance bottleneck is the goal of performance monitoring and the key to system tuning. A system consists of several subsystems, and changing one subsystem can affect another or even destabilize or crash the whole system, which is why tuning, monitoring, and testing are usually tied together in a long-running cycle. The subsystems usually monitored are:
•    CPU
•    Memory
•    IO
•    Network
     These subsystems depend on each other; understanding their characteristics, monitoring their performance parameters, and spotting likely bottlenecks early all help a great deal with system tuning.
Application types
     Different systems serve different purposes. To find the bottleneck you need to know what application the system runs and what its characteristics are; a web server's demands on the system are certainly different from a file server's, so distinguishing the application type matters. Applications can usually be divided into two types:
•    IO-bound: IO-bound applications typically process large amounts of data, need plenty of memory and storage, and perform frequent IO to read and write data, while demanding relatively little CPU; most of the time the CPU is waiting on the disk. Examples are database servers and file servers.

•    CPU-bound: CPU-bound applications need a lot of CPU, for example highly concurrent web/mail servers, image/video processing, and scientific computing.

Monitoring tools
Simple tools are enough to monitor Linux performance. These are the tools VPSee uses most often:
Tool    Description
top    process activity and general system status
vmstat    system state, hardware and system information
iostat    CPU load and disk status
sar    all-round tool for system status
mpstat    multiprocessor status
netstat    network status
iptraf    real-time network monitoring
tcpdump    packet capture for detailed analysis
tcptrace    packet analysis tool
netperf    network bandwidth tool
dstat    all-round tool combining vmstat, iostat, ifstat, netstat and more
This series covers CPU, memory, disk IO, and network in turn.

Linux performance monitoring: CPU

      What occupies the CPU depends mainly on what kind of work is running on it. Copying a file, for example, usually uses little CPU, because most of the work is done by DMA (Direct Memory Access) and the CPU is only interrupted at the end to be told the copy is complete; scientific computing, by contrast, uses a lot of CPU, because most of the computation must happen on the CPU itself while memory, disk, and the other subsystems only provide temporary data storage. Monitoring and understanding CPU performance requires some basic operating-system knowledge: interrupts, process scheduling, context switches, the run queue, and so on. VPSee uses an analogy to introduce these concepts and how they relate. The CPU is an innocent, hard-working employee who always has work to do (processes and threads) and keeps a to-do list (the run queue); the boss (the process scheduler) decides what he should do; he has to communicate with the boss to learn what the boss wants and adjust his work accordingly (context switches); and when part of the work is done he has to report back promptly (interrupts). So besides doing his own work, the employee (the CPU) spends a great deal of time and energy on communication and reporting.
      The CPU is also a hardware resource and, like any other hardware device, needs driver and management software before it can be used. We can regard the kernel's process scheduler as the CPU's management program: it manages and allocates CPU resources, arranges for processes to preempt the CPU sensibly, and decides which process gets the CPU and which must wait. The kernel scheduler schedules two kinds of resources, processes (or threads) and interrupts, and gives them different priorities: hardware interrupts have the highest priority, then kernel (system) processes, and finally user processes. Every CPU maintains a run queue that holds runnable threads. A thread is either sleeping (blocked, waiting on IO) or runnable; if the CPU is already heavily loaded and new requests keep arriving, the scheduler may temporarily be unable to keep up and threads have to sit in the run queue. All of the above says nothing about performance, so what does it have to do with performance monitoring? A great deal. If you were the boss, how would you check the employee's efficiency (performance)? We generally judge whether the employee is slacking off from the following information:
•    how many tasks the employee accepted, completed, and reported to the boss (interrupts);
•    how much the employee communicates and negotiates with the boss about each piece of work (context switches);
•    whether the employee's work list is always full (the run queue);
•    how efficiently the employee works and whether he is slacking off (CPU utilization).
     Now replace the employee with the CPU: we can monitor CPU performance by watching these key parameters: interrupts, context switches, the run queue, and CPU utilization.


Read full article from Linux按照CPU、内存、磁盘IO、网络性能监测 - chape的个人页面 - 开源中国社区


pprof的原理



pprof的原理

When tuning performance you need to find out which operations are taking the time, and here the pprof tool is invaluable: it can tell you how many times each function was called and help you find the bottleneck. So how does pprof work?

Start with the simplest version; the key words are "sampling" and "statistics". The operating system has a timer interrupt: each process runs for some CPU time slices, and when the interrupt arrives execution switches to another process. Suppose we sample the program at every timer interrupt. Concretely, assign every instruction in the program's code segment a slot; at each interrupt, read the PC register to see which instruction is executing and increment that instruction's slot. At the end we can aggregate the counts and obtain the frequency with which each instruction in the program was executed.

That is the basic idea, but it has problems. First, instruction-level statistics are not very meaningful: learning that mov was executed 10,000 times and jmp 400 times is useless. We need at least function-level sampling, i.e. how many times each function was called. And function-level information alone is still not enough: knowing that memset accounts for 23% of all calls does not help much, since it is a library function; we need to know who is calling it in order to find the problem.

Besides, having the operating system do the counting at every interrupt is clearly not flexible enough.

Still, the above already explains the core principle of pprof: sampling plus statistics. The problems can be fixed. For the second one, the operating system can expose the timer as an abstraction by sending the program a SIGALRM signal, and the user-level code can then handle the timer and take the sample, which is much more flexible.

For the first problem, to gather function-level statistics and recover the call relationships, what we should really sample is the stack frames.
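
pprof itself is driven by timer signals and samples native stack frames; purely to illustrate the sample-and-count idea (and not pprof's real implementation), the Java sketch below periodically captures a target thread's stack and tallies the caller-to-callee pairs seen at the top of the stack.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustration of sampling + statistics (not pprof's real mechanism):
 * periodically capture a target thread's stack and count "caller -> callee"
 * pairs observed at the top of the stack.
 */
public class SamplingProfiler {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService sampler =
            Executors.newSingleThreadScheduledExecutor();

    public void profile(Thread target, long periodMillis) {
        sampler.scheduleAtFixedRate(() -> {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length >= 2) {
                String edge = stack[1].getMethodName() + " -> " + stack[0].getMethodName();
                counts.merge(edge, 1L, Long::sum);   // tally caller -> callee
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    public void stopAndReport() {
        sampler.shutdownNow();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```
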

Suppose that at some moment during execution we take a sample and get the stack-frame sequence A, B, C. The information we obtain includes: the function running at that moment is C, and it was called from B. After sampling many times, aggregating the samples yields the call information. For example, sampling the following piece of code:


Read full article from pprof的原理


The C10K problem



The C10K problem

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a 1000Mbit/sec Ethernet card for $1200 or so. Let's see - at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.
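
For clarity, the per-client figures above are just the machine's resources divided evenly across the 20,000 clients:

\[
\frac{1000\ \text{MHz}}{20{,}000} = 50\ \text{kHz}, \qquad
\frac{2\ \text{GB}}{20{,}000} \approx 100\ \text{KB}, \qquad
\frac{1000\ \text{Mbit/s}}{20{,}000} = 50\ \text{kbit/s}
\]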

In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a Gigabit Ethernet pipe. As of 2001, that same speed is now being offered by several ISPs, who expect it to become increasingly popular with large business customers.

And the thin client model of computing appears to be coming back in style -- this time with the server out on the Internet, serving thousands of clients.

With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, as that's my personal area of interest, but Windows is also covered a bit.


Read full article from The C10K problem


Yaws



Yaws

Yaws is a high-performance HTTP 1.1 webserver particularly well suited for dynamic-content web applications. Two separate modes of operation are supported:

  • Standalone mode where Yaws runs as a regular webserver daemon. This is the default mode.

  • Embedded mode where Yaws runs as an embedded webserver in another Erlang application.

Yaws is entirely written in Erlang, and furthermore it is a multithreaded webserver where one Erlang lightweight process is used to handle each client.

The main advantages of Yaws compared to other Web technologies are performance and elegance. The performance comes from the underlying Erlang system and its ability to handle concurrent processes in an efficient way. Its elegance comes from Erlang as well. Web applications don't have to be written in ugly ad hoc languages.


Read full article from Yaws


可持续的软件时尚设计 【5】可持续性_邓侃_新浪博客



可持续的软件时尚设计 【5】可持续性_邓侃_新浪博客

The style of a design depends on the master. When the master passes away, his style is hard to carry on.

Learning design depends on the student's own insight. The experience of outstanding students is hard to spread to their classmates, and hard to pass on to later generations of students.

To change this situation, could we try the following?

1. Build a Wikipedia-like platform for accumulating and exchanging design knowledge.

2. Collect large numbers of design images, videos, and audio recordings and store them on the platform.

3. Analyze and compare them one by one, deconstructing them into style elements and patterns. Elements include material, texture, color, light and shadow, and so on; patterns include layout, combination, interaction, and so on.

4. Store the deconstructed elements and patterns on the platform as well, in the form of images, videos, audio, construction drawings, and even programs. Link the elements and patterns to the main entries so they are convenient to use.

5. The entries, elements, and patterns can be added to and modified not only by the original author but also by others. Past versions are not thrown away; they stay on the platform, and an old version can be restored when necessary.

6. Every reader can vote on the different interpretations of a design. Based on these poll statistics, different fashions, styles, and forms can be distinguished.

7. Provide search that supports not only keyword search but also associative search, for example "people who viewed this image also often viewed those images," and so on.

Such a platform would combine the functions of a library, a case repository like those used to train MBA students in management, a tool library, opinion polling, and more. With everyone's collective participation and the accumulation of generation after generation, its content would keep being enriched, corrected, and developed.

Such a platform would not only help preserve the fruits of past design work, but would also make it easier for students to master design methods by comparing similar cases, and for designers to use ready-made elements and patterns to improve design

Read full article from 可持续的软件时尚设计 【5】可持续性_邓侃_新浪博客


解剖Twitter 【2】三段论_邓侃_新浪博客



解剖Twitter 【2】三段论_邓侃_新浪博客

The traditional approach to website architecture is a three-tier design. "Traditional" does not mean "outdated": large-site architecture emphasizes practicality, and while fashionable new designs are attractive, the technology may be immature and the risk high, so many large sites take the safe, traditional route.

When Twitter first went live in May 2006, they used Ruby-On-Rails to simplify development, and the design philosophy of Ruby-On-Rails is exactly this three-tier split.

1. The front tier, the presentation tier, uses the Apache Web Server; its main task is to parse the HTTP protocol and dispatch the different kinds of requests from different users to the logic tier.

2. The middle tier, the logic tier, uses the Mongrel Rails Server and leverages Rails's ready-made modules to reduce the development workload.

3. The back tier, the data tier, uses the MySQL database.

Start with the back tier, the data layer.

Twitter's service boils down to two core concepts: 1. users, 2. tweets (short messages). The relationship between users is that of following and being followed. A user reads only the tweets written by the people he follows, and the tweets he writes are read only by the people who follow him. With these two cores in mind, it is not hard to see how Twitter's other features are implemented [7].

Around these two cores we can design the data schema, i.e. how the data stored in the data tier is organized. Three tables will do [8]:

1. User table: user ID, name, login and password, status (online or not).

2. Message table: message ID, author ID, body (fixed length, 140 characters), timestamp.

3. User relationship table, recording who follows whom: user ID, the IDs of the users he follows (following), and the IDs of the users who follow him (followed by).

Next, the middle tier, the logic layer.

When a user posts a tweet, the following five steps are executed:

1. Record the tweet in the message table.

2. Fetch from the user relationship table the IDs of the users who follow him.

3. Some of those followers are currently online and others are offline; online status can be looked up in the user table. Filter out the IDs of the offline users.

4. Push the IDs of the followers who are currently online, one by one, into a queue.

5. Pop those IDs from the queue one by one and update those users' home pages, i.e. add the newly posted tweet.

All five steps are the responsibility of the logic tier. The first three are easy, just simple database operations. The last two require an auxiliary tool, a queue; the point of the queue is that it decouples producing a task from executing it.

Queues can be implemented in many ways; Apache Mina [9], for example, can be used as one. But the Twitter team implemented a queue of their own, Kestrel [10, 11]. Nobody seems to have done a detailed comparison of the pros and cons of Mina versus Kestrel.

Both Kestrel and Mina look rather complicated. One might ask why the queue is not implemented with a simple data structure such as a dynamic linked list or even a static array. If the logic tier runs on only one server, then a slightly adapted linked list or array can indeed serve as the queue (a minimal single-machine sketch follows below). The point of "heavyweight" queues like Kestrel and Mina is that they support distributed queues spanning multiple machines, which later installments in this series will cover in detail.
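
Here is that minimal single-machine sketch of steps 4 and 5 (an illustration only, not Twitter's Kestrel code): a producer pushes the IDs of online followers onto an in-process blocking queue while a worker drains it and prepends the new tweet to each follower's home timeline.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Single-machine sketch of tweet fan-out through a queue (steps 4 and 5).
 * Kestrel plays this role across many machines; here a BlockingQueue stands in.
 */
public class FanOut {
    private final BlockingQueue<long[]> queue = new LinkedBlockingQueue<>(); // [followerId, messageId]

    /** Step 4: enqueue one task per online follower. */
    public void enqueue(long messageId, List<Long> onlineFollowerIds) {
        for (long followerId : onlineFollowerIds) {
            queue.add(new long[] {followerId, messageId});
        }
    }

    /** Step 5: a worker thread drains the queue and updates home timelines. */
    public void startWorker(TimelineStore timelines) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    long[] task = queue.take();          // blocks until a task arrives
                    timelines.prepend(task[0], task[1]); // add the new tweet to the follower's timeline
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();      // shut down cleanly
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Hypothetical timeline storage interface. */
    public interface TimelineStore {
        void prepend(long followerId, long messageId);
    }
}
```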

Finally, the front tier, the presentation layer.

The presentation tier has two main duties: 1. an HTTP processor, which unpacks incoming user requests and packages the results to be sent back; 2. a dispatcher, which hands incoming requests to machines in the logic tier. If the logic tier has only one machine the dispatcher is pointless, but if it consists of many machines, then which kind of request goes to which machine matters a great deal: the machines in the logic tier may each specialize in particular functions, and among machines with the same function the work must be spread out to balance the load.

Twitter is accessed not only by browsers but also by mobile phones, by desktop clients like QQ, and by all kinds of website plugins that connect other sites to Twitter.com [12]. So the protocol between Twitter's visitors and the Twitter site is not necessarily HTTP; other protocols are used as well.

The three-tier Twitter architecture mainly targets HTTP clients. For clients speaking other protocols, the architecture is not clearly divided into three tiers; instead the presentation and logic tiers are merged into one, and in Twitter's literature this combination is usually called the "API".

To sum up, a simple architecture that delivers Twitter's basic functionality is shown in Figure 1. You may wonder how such a famous site can have such a simple architecture. Yes and no: when Twitter went live in May 2006, its architecture was not far from Figure 1, the main difference being some simple caching, and even today the outline of Figure 1 is still clearly visible in Twitter's architecture.

Read full article from 解剖Twitter 【2】三段论_邓侃_新浪博客


Dead money (poker) - Wikipedia



Dead money (poker) - Wikipedia

In poker, dead money is the amount of money in the pot other than the equal amounts bet by active remaining players in that pot. Examples of dead money include money contributed to the pot by players who have folded, a dead blind posted by a player returning to a game after missing blinds, or an odd chip left in the pot from a previous deal. For example, eight players each ante $1, one player opens for $2, and gets two callers, making the pot total $14. Three players are now in the pot having contributed $3 each, for $9 "live" money; the remaining $5 (representing the antes of the players who folded) is dead money. The amount of dead money in a pot affects the pot odds of plays or rules of thumb that are based on the number of players.


Read full article from Dead money (poker) - Wikipedia


confluentinc/schema-registry: Schema registry for Kafka



confluentinc/schema-registry: Schema registry for Kafka

Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas. It stores a versioned history of all schemas, provides multiple compatibility settings and allows evolution of schemas according to the configured compatibility setting. It provides serializers that plug into Kafka clients that handle schema storage and retrieval for Kafka messages that are sent in the Avro format.
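
As a small, hedged illustration of the serializer integration described above (a sketch assuming the Confluent kafka-avro-serializer dependency is on the classpath; the broker address, topic name, and schema are placeholders), a producer can send Avro records whose schemas are registered automatically:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Serializer from the Confluent schema-registry project: it registers the
        // schema with the registry and embeds the schema id in each message.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // Placeholder Avro schema for illustration only.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "alice", user));
        }
    }
}
```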


Read full article from confluentinc/schema-registry: Schema registry for Kafka


《程序员必读之软件架构》读书笔记 I | 江南白衣



《程序员必读之软件架构》读书笔记 I | 江南白衣

A book by Simon Brown, the author of the codingthearchitecture.com site. The coding architect has always been my career role model.
Back when I felt that RUP's mechanical architecture-document template based on the 4+1 views was not enough to express a system, Simon Brown's template offered a very good transitional example.



Should architects code?

Some companies think architects are too valuable to spend time on day-to-day coding.
Some believe the key trait of a good architect is abstract thinking, which can also be read as not spending time down in the details.
And large projects usually mean looking after a bigger "big picture," so you may simply have no time to write code.

All of the above is true.



You need not give up coding, but don't spend most of your time coding either

You should not exclude yourself from coding just because "I am the architect."
But you must also keep enough time to play the role of technical architect.

1. Take part in writing code

To avoid becoming a PowerPoint architect, it is best to take part in implementation and delivery: see the architecture through to delivery, understand the problems the design runs into during implementation, and evolve the architecture rather than handing the box diagrams to the implementation team and never looking back.
At the same time, stay close to the team, keep your influence on it, help the team understand the architecture correctly, and share your own software-development experience.

Also, as a member of the development team, you do not need to be the best coder on it.

2. Build prototypes, frameworks, and foundations

If you cannot take part in day-to-day coding, at least try to build quick prototypes during design to validate your ideas.
Also write frameworks and foundational code for the team; this is where coding and design skills are honed and shown the most.

3. Do code reviews

If you have no time to code at all, at least take part in code reviews to understand what is going on.

4. Experiment and keep up to date

If you have no time at all to code during working hours, you often have more room outside work to keep your coding skills alive, from contributing to open-source projects to continually trying out the newest languages and frameworks.


Read full article from 《程序员必读之软件架构》读书笔记 I | 江南白衣


Seek (B+ tree) vs. Transfer(LSM tree) - Mac Track - 博客频道 - CSDN.NET



Seek (B+ tree) vs. Transfer(LSM tree) - Mac Track - 博客频道 - CSDN.NET

B+ trees

B+ trees have some specific features that allow for efficient insertion, lookup, and deletion of records that are identified by keys. They represent dynamic, multilevel indexes with lower and upper bounds as far as the number of keys in each segment (also called page) is concerned. Using these segments, they achieve a much higher fanout compared to binary trees, resulting in a much lower number of IO operations to find a specific key.
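
As a rough worked example of that fanout effect (the numbers are illustrative, not from the article), the number of page reads needed to find a key equals the height of the tree, roughly \(\log_f N\) for fanout \(f\) over \(N\) keys:

\[
f = 100,\; N = 10^8 \;\Rightarrow\; \lceil \log_{100} 10^8 \rceil = 4
\qquad\text{versus}\qquad
\lceil \log_2 10^8 \rceil = 27 \text{ for a binary tree.}
\]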

In addition, they also enable you to do range scans very efficiently, since the leaf nodes in the tree are linked and represent an in-order list of all keys, avoiding more costly tree traversals. That is one of the reasons why they are used for indexes in relational database systems.

In a B+ tree index, you get locality at the page level (where "page" is synonymous with "block" in other systems): for example, the leaf pages look something like:

Read full article from Seek (B+ tree) vs. Transfer(LSM tree) - Mac Track - 博客频道 - CSDN.NET


撇开代码不说,谈谈我对架构的6个冷思考



撇开代码不说,谈谈我对架构的6个冷思考

A computer is a complex machine. Unlike ordinary machines (small appliances, cars), its "working behavior" can be "redefined and adapted to scenarios" while it is in use, to solve people's needs and problems in different scenarios. To the machine's end user, the result of that definition is an "application."

To ordinary people outside computing, no matter how many job titles we invent ("programmer," "software engineer," "architect," "chief technology officer"), the subconscious impression remains "the person who builds websites" or "the person who fixes computers."

Many "architects" start out as "software engineers" and become architects without quite noticing. In my own case, when I was still an intern and got "promoted" to department architect, leading some full-time employees, the concept of "architect" was a complete blank to me; I did not even know whether this was good news or bad news, let alone what an architect actually does.

So I have always defined architecture in the simplest way: architecture is a comprehensive ability to solve problems with computers, and it has nothing to do with titles. Below, drawing on my own work experience, I will talk about how I have come to understand architecture over the years.

1. Architecture grows out of summarizing practice

Architectural ability is not innate; it is strongly tied to concrete experience, and rich experience is the foundation on which architectural ability forms.

We often stress how important "systematic thinking" is to architecture design, hoping that methodology can raise the professional level of programmers who are doing, or about to do, architecture work. But dogmatic, spoon-fed training cannot teach architectural ability. The value of theory is that it helps those who apply it avoid some detours; it cannot solve the real problem in front of you.

In a company, architecture is a field tightly bound to practice, and everything aims at solving real problems. Because the problems are varied, so are the solutions. The mines you have stepped on and the pits you have filled must be summarized and abstracted before they rise to the level of architecture and keep you from repeating the same mistakes.

2. Architecture is a process of modeling

A complex problem is usually decomposed along capability domains, with the aim of finding a mapping onto existing capabilities. That mapping is the solution. It cannot do without human "knowledge work," which breaks down into three parts:

  1. abstracting and modeling the known problem

  2. abstracting and modeling the known capabilities

  3. designing the solution and the tools

The first two parts both involve "modeling." Modeling is an abstraction of objective things, and the more complex the thing, the more likely the result turns into "blind men feeling an elephant."

Yet in IT, "blind men feeling an elephant" is not really a pejorative, because the phenomenon is extremely common. The reason is that information systems built to solve real problems are oriented toward "typical" application scenarios rather than "arbitrary" ones.

A scenario is a perspective from which an objective thing is understood; an information system cannot, and need not, model a complete objective thing comprehensively (360° with no blind spots).

A concrete example: take a person. A banking system may care about the person's financial indicators, such as "income," "spending," and "account balance," while a hospital's intensive-care unit cares about the person's vital signs, such as "blood pressure" and "heart rate."

As the example shows, a scenario-based application aimed at a concrete problem always models its subject in a "partial," scenario-specific way.

At bottom, modeling is a capacity for abstraction; concretely, it is the rational distillation and summary of what people understand about objective things, and admittedly the intuitive part is very hard to capture and describe. It fits what the "Tian Dao" chapter of Zhuangzi says: "what the mind follows cannot be conveyed in words."

To describe the "ability to model" in mathematical language: it is finding as small a set of "feature vectors" as possible to describe the space, and the ability to find that set is the ability to model.

3. The core of architecture work is design

A computer without software is "unusable," because it cannot help us solve any problem. A computer is inherently "rigid"; it cannot "softly" adapt itself to the problem that needs solving.

The core work of architecture is "design": designing how the computer should work so that it behaves as expected.

In architecture design, the result of modeling is a model, which is structured and sharp-edged, because that is the most efficient way for a computer to compute: it tells us strictly whether two numbers are equal or not, and everything derived from that. Because of this strict matching, for a long time both the crafting of solutions and the delivery and operation of the resulting systems have revolved around simulating and implementing the real scenario exactly. Business functionality is rarely designed and implemented to "succeed with some probability"; everything must be "absolutely correct," because the vast majority of computer systems cannot understand natural language and can only carry out repetitive work "step by step" according to structured information designed by humans.

Artificial intelligence and machine learning are still far from human ability when it comes to automated modeling; if they ever get there, software will no longer need humans for "architecture design." Simply generating code from the (likewise structured) results of architecture design is something today's computers can already manage.

Any computed result that does not match the real scenario is, to the user, a "defect," and when a system produces such anomalies it is usually the "programmer" who has to take responsibility. Think back to the era before computers: when the shop assistant got the accounts wrong, he paid for it himself, and nobody blamed the abacus. Why? Because the abacus is simple enough that it needs no monitoring system and no logs; even operating rules like "三下五除二" had their cost of use eliminated by socialized learning. In the end there is only one cause for every error: the person at the keyboard.

Yes, computer systems are born unreliable; unlike the abacus, they cannot run without depending on any natural resources. A power outage causes a failure; a cut fiber causes a failure; a full disk causes a failure... A whole series of uncertainties makes the architecture of "distributed systems" far more complex than that of "mainframe systems": things that once needed no attention now have to be solved at higher software layers.

So a large share of today's architecture work keeps growing in proportion as distributed systems grow in scale. Perhaps that is also why a group of the smartest people in the world have ended up solving computer problems.

4. Architecture requires a series of non-technical choices

Since architecture is a solution, there are naturally many areas of free choice and many constraining preconditions. These peripheral factors are often tied to the values and non-IT capabilities of the individuals, teams, and companies behind the system, a point that is easy to overlook.

The relationship with people and teams

Architecture is usually tied to the ability of the individual or the team, because the first part of the work is design and the second part is landing the code framework. There may be no perfect solution that satisfies every party; much of the architecture process is the result of compromise, sometimes with requirements, sometimes with operations, sometimes with individual heroism. Moreover, the vast majority of the choices are made by people, which couples the outcome tightly to the level of the people and the team.

As early as 1895, the French psychologist Le Bon wrote in his classic "The Crowd" that a collective decision made by a group of elites may well be the most foolish one. Sometimes a technical team should not over-emphasize democracy; sometimes putting strong people together produces an effect of "1 + 1 < 1." Only a sound organization that mixes strong and weaker members can incubate an excellent tool, grow it into an excellent product, and also cultivate a pipeline of team members.

The bigger the team, the less likely an excellent architecture design will be executed strictly. First, the people who draw up the plan and the people who implement it are usually disconnected, and many finely designed details are ignored in the hands of the implementers. Second, the training cost of unifying a team's understanding and design philosophy is one that most employers are unwilling to pay. Third, the description of the plan is itself "imprecise" and documents easily go stale; at every stage of software delivery, every participant has a chance to interpret it from their own background and proudly add their own "masterpiece."

The relationship with company values

The most direct expression of a company's values is its investment portfolio.

In large enterprises, software purchases are often constrained by the "procurement department" and by administrative interventions from company-level leaders who do not understand IT, with reasons that sound "quite sensible": why buy again what has already been bought; we must "protect the investment." At that level, programmers and architects alike can only shrug in resignation.

Software consists of code and data. It is not a simple fixed asset that can be handled by "depreciation." It reflects its users' understanding of the objective world and must change as that understanding changes; a software version is therefore a snapshot of understanding at a point in time.

Rapid development of the industry and of the company inevitably drives rapid development of the company's information systems. For the company, the value lies in having feelers that sense the industry, the product, the users, and internal operations, so that every entity in society, whether a person or a product, can find its reflection in the system.

For the business owner, IT is a long-term investment. Outdated software that no longer fits the times becomes a stumbling block that drags down the company's productivity.

5. Code is where architecture design lands

Every high-level programming language today, such as Java/C/C++, and every higher-level DSL, is a "one-way language" between humans and computers. These one-way languages are not natural languages; they are mostly written by programmers and then handed to the computer for execution, and for a long time information systems have interacted with people in this way. (Of course, we could also wait for assistants like "Siri" to grow up and become architects, at which point most architects might need a new line of work.)

Code is the core of realizing an architecture; through code the real world can be "virtualized":

  • virtualization of concepts

  • virtualization of capabilities

  • virtualization of entities

  • virtualization of memory

  • virtualization of collaboration

A few examples help make this concrete:

  • virtualization of a concept: a class definition for a business concept

  • virtualization of a capability: a method that processes several inputs and returns a result

  • virtualization of an entity: an instance of a class, i.e. concrete data

  • virtualization of memory: a row in a relational database

  • virtualization of collaboration: a remote method call

Yes, code is the computer's conductor; code is the language through which human intelligence is "given" to the computer.

Whether the code is up to standard and well written has a great impact on how the design is realized.

6. In the end, architecture is a comprehensive ability to solve problems with computers

The "system architecture diagrams" we so often see are really "computer-domain solutions" designed for a target problem; they are design blueprints.

You could say that "architecture work" must not only deliver the "blueprints" but also "lay the foundation and raise the beams."

  • At the macro level: designing a solution for a specific problem

  • At the micro level: shaping, for the coding that follows, a code framework consistent with the core of the solution

Doing "architecture work" well also takes many non-technical "soft skills," for example:

  • positioning team members correctly by what they are genuinely good at

  • problem analysis that digs down to the essence

  • empathy from multiple perspectives, in tune with human nature

  • letting go of "chores" you could handle but that hurt your focus, and saying "no" when reasonable

  • a degree of investment thinking, looking at inputs and outputs from a higher and longer-term perspective

Other divergent thoughts

Before internet companies appeared, were there already "internet companies"? How do they differ from today's internet companies?

There were, for example "the power grid," "telecom operators," "joint-stock commercial banks," and "express logistics companies." The two most basic elements of human society are "entities" and "connections," and every industry built around connections can be considered a form of "inter-networking"; it is just that the value of information systems inside a company is determined by the "relations of production."

Can machine learning help with architecture design?

For a long time machine learning has stayed at the level of parameter tuning and has lacked the ability to model things in general. The relationship between "virtualization of concepts" and "virtualization of entities" was discussed above: entity virtualization is data, and the data itself is already an instance of a class,

Why do internet companies talk so much about "big data" and being "data-driven (DT)"?

As noted above, data is the virtualization of objective entities. Objective entities do not appear out of nothing; they are products of the natural world. The essence of being data-driven is being driven by objective things, or, stepping back, still "business-driven." Of course, joining up data from multiple scenarios to model a subject through a full 360° is where the real value of "big data" lies.

Note, though, that the sword is always double-edged: when the virtual world of computer systems holds a 360° picture of you, covering food, clothing, housing, and travel, life becomes convenient because your needs can be predicted, but how much of your privacy is left?

Is simply "taking and using" open-source software feasible?

Grabbing open-source software as-is often leads to "endless trouble." To a large extent, open-source code is a reflection of the overall ability of an individual or a team, and it is closely tied to the environment the code needs to run in. Open-source code is also picky about people and environments: before a team has the matching ability to use it correctly, it is often a "born wild horse," and only after being tamed does it become your own trusted steed; the taming process is precisely the process of adapting it to your own team and the surrounding environment.

Is reinventing the wheel really a waste?

A healthy IT team should build a system for judging whether the "existing wheels" actually deliver benefits, for example by monitoring the real usage rate and failure rate of code in production and retiring "low-benefit" code in good time. Do not simply dismiss or block "reinventing the wheel": it is a process of aligning with the abilities of the people inside the company and with the larger environment outside, and it is the "investment gene" that keeps a company's metabolism going.

What does structured data really mean?

Being "structured" is really about the downstream consumers of the data, who can map it onto their built-in concepts (data models) and process it. Structure is a kind of "meta-information."

A concrete example: a bitmap image has structure of its own. A bitmap records the exact RGB value of every pixel; to an image viewer, that data structure can be recognized and rendered into a picture the human eye can identify, while the viewer itself is only responsible for reproducing the color of each pixel. But if the image is a realistic portrait of a "user," the viewer cannot work out which natural person the portrait shows, nor can it pass the image as an API parameter, together with the user's other parameters, into internal business logic.


Read full article from 撇开代码不说,谈谈我对架构的6个冷思考


九月 | 2016 | PHP源码阅读,PHP设计模式-胖胖的空间



九月 | 2016 | PHP源码阅读,PHP设计模式-胖胖的空间

In the very beginning, a startup is at its hardest: going from 0 to 1, from nothing to something, you are doing things you have never done before; that is why we call it starting up.

For early-stage technology, don't aim for big and complete, and don't chase the cutting edge; build to the requirements first and survive. What we need to decide is which parts can use cloud services, which can use ready-made open-source solutions or technologies, and which we must build ourselves. We can afford to be rough around the edges; what matters is shipping quickly and keeping the product alive.

With only a handful of people early on, use the technology the team members know best, and make sure someone can command the whole stack. Even though we are using what we know best, the technology selection and development process should follow a few basic ideas:

1. Principles and standards

  • Mind decoupling, layering, separating static from dynamic content, and separating light from heavy work;
  • Development standards: code and code-branch management conventions and a release process;
  • During development, abstract common operations into components, i.e. the familiar single responsibility: wrap cache access, database access, and so on into components as you go;

2. Preserve the ability to scale horizontally

  • Keep the business servers stateless and manage sessions with memcache or the like;
  • Size the database design for a certain period of capacity and do the necessary sharding of databases and tables, e.g. plan capacity for 1 to 2 years;
  • Cache hot data so that the bulk of requests hit the cache rather than the database;

3. Business isolation

  • Isolate critical business from non-critical business;
  • Isolate the main business systems from side-channel reporting, log reporting, and other peripheral systems; for HTTP services, guarantee isolation at least at the domain level;
  • Isolate business across different clients; for example the PC site and the H5 pages may share one codebase but use different domains and entry points with the same backend machines;

4. Use open-source wheels well

  • While meeting current business needs, evaluate the industry's open-source wheels and, as long as you can handle them, prefer existing, mature open-source components proven in many companies, such as nginx, redis, elk, and so on.

5. Necessary security measures

  • Security cannot be avoided in internet applications; filter common XSS, CSRF, and SQL-injection problems at the framework or base-component level;
  • Put static content that can go on a CDN onto a CDN, both for nearby access and faster loading and to reduce pressure on the backend;
  • Keep the ability to switch quickly to a cloud anti-DDoS service;
  • Implement some rules at the business level and, together with the web container, a degree of protection against CC attacks;

6. Back up, back up, back up

  • Outages, machine rooms in different cities catching fire at the same time, cut fiber, corrupted data: all kinds of unlikely things can happen, and that is when backups show their value. Back up not only the business databases but also the code, the deployment scripts, and so on;
  • When every misfortune hits at once and everything we have is gone, we should be able to restore the application quickly to the last known-good backup; in other words, have a disaster-recovery plan, ideally one that has been rehearsed in advance;

7. Monitor for possible anomalies

  • Use third-party monitoring services to watch site availability, service availability, and so on;
  • Monitor business data and key checkpoints; for example, a finance product must confirm that every user's money in and out balances, and there should be at least one monitor for that;

8. Gradual (canary) releases

  • Early on, roll out gradually machine by machine, which a simple script can handle; later, roll out gradually by user and so on, to improve business continuity and keep the business available;

From 0 to 1, neither the technology nor the business is mature and everyone is crossing the river by feeling for the stones, so we need to try, fail, and get feedback quickly.

On the technical side, while holding to the principles above, iterate quickly to deliver product requirements, and hand error-statistics work directly to third parties. On the business side, if it is a website, hand traffic analysis to third parties as well, such as 百度统计 (Baidu Analytics) or Google Analytics; for the specific business, a script can run every morning, produce a report, and mail it to a designated mailing list, with the relevant people added to the list so they receive the report.

The above are the principles to watch and the things that must be built at the very start. Beyond them, there is other important groundwork that needs to be built and improved continuously, including but not limited to:

  1. The ability to degrade service: under normal or abnormal traffic spikes, the business can be degraded within limits; provide manual degradation first and automatic degradation later;
  2. Replaceable third-party services: money can solve problems, but often not really, because what you buy may turn out to be a pit you still have to fill yourself. When using third-party services, keep several interchangeable providers; for SMS, for example, integrate two providers and spread traffic across both normally, or split by business, and when one fails, switch entirely to the healthy one;
  3. A log center: logs are an indispensable tool for locating problems. Once the backend has several machines you can no longer grep them one by one; logs need a central store, and a simple ELK deployment can probably solve most of the problem;


Read full article from 九月 | 2016 | PHP源码阅读,PHP设计模式-胖胖的空间


down - Panic



down - Panic

Why work in technology? Thinking back to university, in all my daydreams about future jobs, "programmer" never once appeared. Only near graduation, finding that I had failed miserably, did I realize I no longer had much right to choose. In the summer of 2013 I stood on a train for twelve hours with my luggage, set foot in this city for the first time, and had my first meal with Bin at a small restaurant outside the Fifth Ring Road; apart from starting over, I was unwilling to think about anything.

Since writing my first hello world in java, the blank sheet of paper has filled up with code. Working in technology is a thing full of achievement and full of frustration at the same time.

Now that I have grown up, it is hard to do anything purely out of love anymore. In middle school I would cycle to every basketball court in the county town just to play one game, and lie exhausted in bed afterwards still replaying that last shot; I would spend a whole day searching every record store to buy a Jay Chou album a little earlier. Now I only care to watch the last three minutes of a basketball game, and I delete a song before I have listened to it ten times. As a kid I always imagined life at university, no longer bound by school and parents; at university I imagined life after starting work, buying every pair of basketball shoes I liked; and once I started working I wanted to go back to the life I had before. The future always arrives, but the past cannot be returned to.

Negative emotions are like a cold: the more attention you pay to them, the more the pain is magnified.


Read full article from down - Panic


Web Service Efficiency at Instagram with Python



Web Service Efficiency at Instagram with Python

Instagram currently features the world's largest deployment of the Django web framework, which is written entirely in Python. We initially chose to use Python because of its reputation for simplicity and practicality, which aligns well with our philosophy of "do the simple thing first." But simplicity can come with a tradeoff: efficiency. Instagram has doubled in size over the last two years and recently crossed 500 million users, so there is a strong need to maximize web service efficiency so that our platform can continue to scale smoothly. In the past year we've made our efficiency program a priority, and over the last six months we've been able to maintain our user growth without adding new capacity to our Django tiers. In this post, we'll share some of the tools we built and how we use them to optimize our daily deployment flow.

Why Efficiency?

Instagram, like all software, is limited by physical constraints like servers and datacenter power. With these constraints in mind, there are two main goals we want to achieve with our efficiency program:

  1. Instagram should be able to serve traffic normally with continuous code rollouts in the case of lost capacity in one data center region, due to natural disaster, regional network issues, etc.
  2. Instagram should be able to freely roll out new products and features without being blocked by capacity.

To meet these goals, we realized we needed to persistently monitor our system and battle regression.


Read full article from Web Service Efficiency at Instagram with Python


pinterest/secor: Secor is a service implementing Kafka log persistence



pinterest/secor: Secor is a service implementing Kafka log persistence

Secor is a service persisting Kafka logs to Amazon S3, Google Cloud Storage and Openstack Swift.

Key features

  • strong consistency: as long as Kafka is not dropping messages (e.g., due to aggressive cleanup policy) before Secor is able to read them, it is guaranteed that each message will be saved in exactly one S3 file. This property is not compromised by the notorious temporal inconsistency of S3 caused by the eventual consistency model,
  • fault tolerance: any component of Secor is allowed to crash at any given point without compromising data integrity,
  • load distribution: Secor may be distributed across multiple machines,
  • horizontal scalability: scaling the system out to handle more load is as easy as starting extra Secor processes. Reducing the resource footprint can be achieved by killing any of the running Secor processes. Neither ramping up nor down has any impact on data consistency,
  • output partitioning: Secor parses incoming messages and puts them under partitioned S3 paths to enable direct import into systems like Hive. Day, hour, and minute level partitions are supported by Secor,
  • configurable upload policies: commit points controlling when data is persisted in S3 are configured through size-based and time-based policies (e.g., upload data when local buffer reaches size of 100MB and at least once per hour),
  • monitoring: metrics tracking various performance properties are exposed through Ostrich and optionally exported to OpenTSDB / statsD,
  • customizability: an external log message parser may be loaded by updating the configuration,
  • event transformation: external message-level transformation can be done by using a customized class.
  • Qubole interface: Secor connects to Qubole to add finalized output partitions to Hive tables.


Read full article from pinterest/secor: Secor is a service implementing Kafka log persistence

