
      Understanding Database Sharding


      Introduction

      Any application or website that sees significant growth will eventually need to scale in order to accommodate increases in traffic. For data-driven applications and websites, it’s critical that scaling is done in a way that ensures the security and integrity of their data. It can be difficult to predict how popular a website or application will become or how long it will maintain that popularity, which is why some organizations choose a database architecture that allows them to scale their databases dynamically.

      In this conceptual article, we will discuss one such database architecture: sharded databases. Sharding has been receiving lots of attention in recent years, but many don’t have a clear understanding of what it is or the scenarios in which it might make sense to shard a database. We will go over what sharding is, some of its main benefits and drawbacks, and also a few common sharding methods.

      What is Sharding?

      Sharding is a database architecture pattern related to horizontal partitioning — the practice of separating one table’s rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but entirely different rows. Likewise, the data held in each partition is unique and independent of the data held in the other partitions.

      It can be helpful to think of horizontal partitioning in terms of how it relates to vertical partitioning. In a vertically-partitioned table, entire columns are separated out and put into new, distinct tables. The data held within one vertical partition is independent from the data in all the others, and each holds both distinct rows and columns. The following diagram illustrates how a table could be partitioned both horizontally and vertically:

      Example tables showing horizontal and vertical partitioning

      Sharding involves breaking up one’s data into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards, which can hold multiple logical shards. Despite this, the data held within all the shards collectively represent an entire logical dataset.

      Database shards exemplify a shared-nothing architecture. This means that the shards are autonomous; they don’t share any of the same data or computing resources. In some cases, though, it may make sense to replicate certain tables into each shard to serve as reference tables. For example, let’s say there’s a database for an application that depends on fixed conversion rates for weight measurements. Replicating a table containing the necessary conversion rate data into each shard helps to ensure that all of the data required for queries is held in every shard.

      Oftentimes, sharding is implemented at the application level, meaning that the application includes code that defines which shard to transmit reads and writes to. However, some database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.
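
      To make the application-level approach more concrete, here is a minimal Python sketch of how an application might route writes to a shard. The shard names, connection strings, and routing rule are illustrative assumptions, not the API of any particular database or library:

      # A sketch of application-level sharding: the application itself decides
      # which database node receives each read or write. Connection handling is
      # stubbed out; only the routing decision is shown.

      SHARD_DSNS = {
          "shard_1": "postgresql://db-node-1/app",
          "shard_2": "postgresql://db-node-2/app",
      }

      def pick_shard(customer_id: int) -> str:
          # The routing rule lives in application code, not in the database.
          return "shard_1" if customer_id % 2 == 0 else "shard_2"

      def write_order(customer_id: int, order_total: float) -> None:
          shard = pick_shard(customer_id)
          dsn = SHARD_DSNS[shard]
          # A real application would open a connection to `dsn` and run an INSERT
          # here; this sketch just reports where the row would go.
          print(f"order for customer {customer_id} ({order_total}) -> {shard} at {dsn}")

      write_order(42, 19.99)   # routed to shard_1
      write_order(7, 5.00)     # routed to shard_2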

      Given this general overview of sharding, let’s go over some of the positives and negatives associated with this database architecture.

      Benefits of Sharding

      The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more machines to an existing stack in order to spread out the load and allow for more traffic and faster processing. This is often contrasted with vertical scaling, otherwise known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more RAM or CPU.

      It’s relatively simple to have a relational database running on a single machine and scale it up as necessary by upgrading its computing resources. Ultimately, though, any non-distributed database will be limited in terms of storage and compute power, so having the freedom to scale horizontally makes your setup far more flexible.

      Another reason why some might choose a sharded database architecture is to speed up query response times. When you submit a query on a database that hasn’t been sharded, it may have to search every row in the table you’re querying before it can find the result set you’re looking for. For an application with a large, monolithic database, queries can become prohibitively slow. By sharding one table into multiple, though, queries have to go over fewer rows and their result sets are returned much more quickly.

      Sharding can also help to make an application more reliable by mitigating the impact of outages. If your application or website relies on an unsharded database, an outage has the potential to make the entire application unavailable. With a sharded database, though, an outage is likely to affect only a single shard. Even though this might make some parts of the application or website unavailable to some users, the overall impact would still be less than if the entire database crashed.

      Drawbacks of Sharding

      While sharding a database can make scaling easier and improve performance, it can also impose certain limitations. Here, we’ll discuss some of these and why they might be reasons to avoid sharding altogether.

      The first difficulty that people encounter with sharding is the sheer complexity of properly implementing a sharded database architecture. If done incorrectly, there’s a significant risk that the sharding process can lead to lost data or corrupted tables. Even when done correctly, though, sharding is likely to have a major impact on your team’s workflows. Rather than accessing and managing one’s data from a single entry point, users must manage data across multiple shard locations, which could potentially be disruptive to some teams.

      One problem that users sometimes encounter after having sharded a database is that the shards eventually become unbalanced. By way of example, let’s say you have a database with two separate shards, one for customers whose last names begin with letters A through M and another for those whose names begin with the letters N through Z. However, your application serves an inordinate number of people whose last names start with the letter G. Accordingly, the A-M shard gradually accrues more data than the N-Z one, causing the application to slow down and stall out for a significant portion of your users. The A-M shard has become what is known as a database hotspot. In this case, any benefits of sharding the database are canceled out by the slowdowns and crashes. The database would likely need to be repaired and resharded to allow for a more even data distribution.

      Another major drawback is that once a database has been sharded, it can be very difficult to return it to its unsharded architecture. Any backups of the database made before it was sharded won’t include data written since the partitioning. Consequently, rebuilding the original unsharded architecture would require merging the new partitioned data with the old backups or, alternatively, transforming the partitioned DB back into a single DB, both of which would be costly and time consuming endeavors.

      A final disadvantage to consider is that sharding isn’t natively supported by every database engine. For instance, PostgreSQL does not include automatic sharding as a feature, although it is possible to manually shard a PostgreSQL database. There are a number of Postgres forks that do include automatic sharding, but these often trail behind the latest PostgreSQL release and lack certain other features. Some specialized database technologies — like MySQL Cluster or certain database-as-a-service products like MongoDB Atlas — do include auto-sharding as a feature, but vanilla versions of these database management systems do not. Because of this, sharding often requires a “roll your own” approach. This means that documentation for sharding or tips for troubleshooting problems are often difficult to find.

      These are, of course, only some general issues to consider before sharding. There may be many more potential drawbacks to sharding a database depending on its use case.

      Now that we’ve covered a few of sharding’s drawbacks and benefits, we will go over a few different architectures for sharded databases.

      Sharding Architectures

      Once you’ve decided to shard your database, the next thing you need to figure out is how you’ll go about doing so. When running queries or distributing incoming data to sharded tables or databases, it’s crucial that it goes to the correct shard. Otherwise, it could result in lost data or painfully slow queries. In this section, we’ll go over a few common sharding architectures, each of which uses a slightly different process to distribute data across shards.

      Key Based Sharding

      Key based sharding, also known as hash based sharding, involves using a value taken from newly written data — such as a customer’s ID number, a client application’s IP address, a ZIP code, etc. — and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. Altogether, the process looks like this:

      Key based sharding example diagram

      To ensure that entries are placed in the correct shards and in a consistent manner, the values entered into the hash function should all come from the same column. This column is known as a shard key. In simple terms, shard keys are similar to primary keys in that both are columns which are used to establish a unique identifier for individual rows. Broadly speaking, a shard key should be static, meaning it shouldn’t contain values that might change over time. Otherwise, it would increase the amount of work that goes into update operations, and could slow down performance.
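
      As a rough illustration, the following Python sketch hashes a shard key into a shard ID. The choice of MD5 and a four-shard cluster are assumptions made for the example, not a recommendation:

      # Key based (hash based) sharding sketch: a deterministic hash of the shard
      # key selects the shard, so the same key always lands on the same shard.

      import hashlib

      NUM_SHARDS = 4

      def shard_id(shard_key: str) -> int:
          digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
          return int(digest, 16) % NUM_SHARDS

      print(shard_id("customer-1001"))  # always returns the same shard for this key
      print(shard_id("customer-1002"))

      # Note: if NUM_SHARDS changes (for example, when a server is added), most
      # keys map to a different shard ID -- the rebalancing problem described next.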

      While key based sharding is a fairly common sharding architecture, it can make things tricky when trying to dynamically add or remove additional servers to a database. As you add servers, each one will need a corresponding hash value and many of your existing entries, if not all of them, will need to be remapped to their new, correct hash value and then migrated to the appropriate server. As you begin rebalancing the data, neither the new nor the old hashing functions will be valid. Consequently, your server won’t be able to write any new data during the migration and your application could be subject to downtime.

      The main appeal of this strategy is that it can be used to evenly distribute data so as to prevent hotspots. Also, because it distributes data algorithmically, there’s no need to maintain a map of where all the data is located, as is necessary with other strategies like range or directory based sharding.

      Range Based Sharding

      Range based sharding involves sharding data based on ranges of a given value. To illustrate, let’s say you have a database that stores information about all the products within a retailer’s catalog. You could create a few different shards and divvy up each product’s information based on the price range it falls into, like this:

      Range based sharding example diagram

      The main benefit of range based sharding is that it’s relatively simple to implement. Every shard holds a different set of data, but they all share an identical schema with one another and with the original database. The application code just reads which range the data falls into and writes it to the corresponding shard.
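
      As a simple sketch of this logic, the price ranges and shard names below are made up for the example; the structure mirrors the diagram above:

      # Range based sharding sketch: the shard is chosen by the range that the
      # product's price falls into.

      PRICE_RANGES = [
          (0, 25, "shard_low"),
          (25, 100, "shard_mid"),
          (100, float("inf"), "shard_high"),
      ]

      def shard_for_price(price: float) -> str:
          for low, high, shard in PRICE_RANGES:
              if low <= price < high:
                  return shard
          raise ValueError(f"no shard configured for price {price}")

      print(shard_for_price(9.99))    # shard_low
      print(shard_for_price(249.00))  # shard_high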

      On the other hand, range based sharding doesn’t protect data from being unevenly distributed, leading to the aforementioned database hotspots. Looking at the example diagram, even if each shard holds an equal amount of data the odds are that specific products will receive more attention than others. Their respective shards will, in turn, receive a disproportionate number of reads.

      Directory Based Sharding

      To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory based sharding:

      Directory based sharding example diagram

      Here, the Delivery Zone column is defined as a shard key. Data from the shard key is written to the lookup table along with whatever shard each respective row should be written to. This is similar to range based sharding, but instead of determining which range the shard key’s data falls into, each key is tied to its own specific shard. Directory based sharding is a good choice over range based sharding in cases where the shard key has a low cardinality and it doesn’t make sense for a shard to store a range of keys. Note that it’s also distinct from key based sharding in that it doesn’t process the shard key through a hash function; it just checks the key against a lookup table to see where the data needs to be written.
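
      The sketch below uses an in-memory dictionary to stand in for the lookup table; in practice the table would itself live in a highly available datastore. The zone names and shard assignments are assumptions for the example:

      # Directory based sharding sketch: every shard key value maps directly to a
      # shard via a lookup table, with no hashing and no ranges.

      LOOKUP_TABLE = {
          "north": "shard_a",
          "south": "shard_b",
          "east":  "shard_c",
          "west":  "shard_c",   # several key values can share one shard
      }

      def shard_for_zone(delivery_zone: str) -> str:
          try:
              return LOOKUP_TABLE[delivery_zone]
          except KeyError:
              raise ValueError(f"no shard registered for zone {delivery_zone!r}")

      print(shard_for_zone("south"))  # shard_b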

      The main appeal of directory based sharding is its flexibility. Range based sharding architectures limit you to specifying ranges of values, while key based ones limit you to using a fixed hash function which, as mentioned previously, can be exceedingly difficult to change later on. Directory based sharding, on the other hand, allows you to use whatever system or algorithm you want to assign data entries to shards, and it’s relatively easy to dynamically add shards using this approach.

      While directory based sharding is the most flexible of the sharding methods discussed here, the need to connect to the lookup table before every query or write can have a detrimental impact on an application’s performance. Furthermore, the lookup table can become a single point of failure: if it becomes corrupted or otherwise fails, it can impact one’s ability to write new data or access their existing data.

      Should I Shard?

      Whether or not one should implement a sharded database architecture is almost always a matter of debate. Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding adds.

      Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:

      • The amount of application data grows to exceed the storage capacity of a single database node.
      • The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
      • The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.

      Before sharding, you should exhaust all other options for optimizing your database. Some optimizations you might want to consider include:

      • Setting up a remote database. If you’re working with a monolithic application in which all of its components reside on the same server, you can improve your database’s performance by moving it over to its own machine. This doesn’t add as much complexity as sharding since the database’s tables remain intact. However, it still allows you to vertically scale your database apart from the rest of your infrastructure.
      • Implementing caching. If your application’s read performance is what’s causing you trouble, caching is one strategy that can help to improve it. Caching involves temporarily storing data that has already been requested in memory, allowing you to access it much more quickly later on.
      • Creating one or more read replicas. Another strategy that can help to improve read performance, this involves copying the data from one database server (the primary server) over to one or more secondary servers. Following this, every new write goes to the primary before being copied over to the secondaries, while reads are made exclusively to the secondary servers. Distributing reads and writes like this keeps any one machine from taking on too much of the load, helping to prevent slowdowns and crashes. Note that creating read replicas involves more computing resources and thus costs more money, which could be a significant constraint for some.
      • Upgrading to a larger server. In most cases, scaling up one’s database server to a machine with more resources requires less effort than sharding. As with creating read replicas, an upgraded server with more resources will likely cost more money. Accordingly, you should only go through with resizing if it truly ends up being your best option.

      Bear in mind that if your application or website grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you.

      Conclusion

      Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application. Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others.

      By reading this conceptual article, you should have a clearer understanding of the pros and cons of sharding. Moving forward, you can use this insight to make a more informed decision about whether or not a sharded database architecture is right for your application.




      HTTP/1.1 vs HTTP/2: What’s the Difference?


      The author selected the Society of Women Engineers to receive a donation as part of the Write for DOnations program.

      Introduction

      The Hypertext Transfer Protocol, or HTTP, is an application protocol that has been the de facto standard for communication on the World Wide Web since its invention in 1989. From the release of HTTP/1.1 in 1997 until recently, there have been few revisions to the protocol. But in 2015, a reimagined version called HTTP/2 came into use, which offered several methods to decrease latency, especially when dealing with mobile platforms and server-intensive graphics and videos. HTTP/2 has since become increasingly popular, with some estimates suggesting that around a third of all websites in the world support it. In this changing landscape, web developers can benefit from understanding the technical differences between HTTP/1.1 and HTTP/2, allowing them to make informed and efficient decisions about evolving best practices.

      After reading this article, you will understand the main differences between HTTP/1.1 and HTTP/2, concentrating on the technical changes HTTP/2 has adopted to achieve a more efficient Web protocol.

      Background

      To contextualize the specific changes that HTTP/2 made to HTTP/1.1, let’s first take a high-level look at the historical development and basic workings of each.

      HTTP/1.1

      Developed by Timothy Berners-Lee in 1989 as a communication standard for the World Wide Web, HTTP is a top-level application protocol that exchanges information between a client computer and a local or remote web server. In this process, a client sends a text-based request to a server by calling a method like GET or POST. In response, the server sends a resource like an HTML page back to the client.

      For example, let’s say you are visiting a website at the domain www.example.com. When you navigate to this URL, the web browser on your computer sends an HTTP request in the form of a text-based message, similar to the one shown here:

      GET /index.html HTTP/1.1
      Host: www.example.com
      

      This request uses the GET method, which asks for data from the host server listed after Host:. In response to this request, the example.com web server returns an HTML page to the requesting client, in addition to any images, stylesheets, or other resources called for in the HTML. Note that not all of the resources are returned to the client in the first call for data. The requests and responses will go back and forth between the server and client until the web browser has received all the resources necessary to render the contents of the HTML page on your screen.
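
      If you want to reproduce this exchange yourself, the following sketch uses Python’s standard http.client module to send the same kind of GET request; the exact status line and headers you get back will depend on the server:

      # Send an HTTP/1.1 GET request, similar to the text-based message above,
      # and inspect the server's response.

      import http.client

      conn = http.client.HTTPConnection("www.example.com", 80)
      conn.request("GET", "/index.html", headers={"Host": "www.example.com"})

      response = conn.getresponse()
      print(response.status, response.reason)       # for example: 200 OK
      print(response.getheader("Content-Type"))     # for example: text/html
      body = response.read()                        # the HTML document itself
      conn.close()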

      You can think of this exchange of requests and responses as a single application layer of the internet protocol stack, sitting on top of the transfer layer (usually using the Transmission Control Protocol, or TCP) and networking layers (using the Internet Protocol, or IP):

      Internet Protocol Stack

      There is much to discuss about the lower levels of this stack, but in order to gain a high-level understanding of HTTP/2, you only need to know this abstracted layer model and where HTTP figures into it.

      With this basic overview of HTTP/1.1 out of the way, we can now move on to recounting the early development of HTTP/2.

      HTTP/2

      HTTP/2 began as the SPDY protocol, developed primarily at Google with the intention of reducing web page load latency by using techniques such as compression, multiplexing, and prioritization. This protocol served as a template for HTTP/2 when the Hypertext Transfer Protocol working group httpbis of the IETF (Internet Engineering Task Force) put the standard together, culminating in the publication of HTTP/2 in May 2015. From the beginning, many browsers supported this standardization effort, including Chrome, Opera, Internet Explorer, and Safari. Due in part to this browser support, there has been a significant adoption rate of the protocol since 2015, with especially high rates among new sites.

      From a technical point of view, one of the most significant features that distinguishes HTTP/1.1 and HTTP/2 is the binary framing layer, which can be thought of as a part of the application layer in the internet protocol stack. As opposed to HTTP/1.1, which keeps all requests and responses in plain text format, HTTP/2 uses the binary framing layer to encapsulate all messages in binary format, while still maintaining HTTP semantics, such as verbs, methods, and headers. An application level API would still create messages in the conventional HTTP formats, but the underlying layer would then convert these messages into binary. This ensures that web applications created before HTTP/2 can continue functioning as normal when interacting with the new protocol.

      The conversion of messages into binary allows HTTP/2 to try new approaches to data delivery not available in HTTP/1.1, a contrast that is at the root of the practical differences between the two protocols. The next section will take a look at the delivery model of HTTP/1.1, followed by what new models are made possible by HTTP/2.

      Delivery Models

      As mentioned in the previous section, HTTP/1.1 and HTTP/2 share semantics, ensuring that the requests and responses traveling between the server and client in both protocols reach their destinations as traditionally formatted messages with headers and bodies, using familiar methods like GET and POST. But while HTTP/1.1 transfers these in plain-text messages, HTTP/2 encodes these into binary, allowing for significantly different delivery model possibilities. In this section, we will first briefly examine how HTTP/1.1 tries to optimize efficiency with its delivery model and the problems that come up from this, followed by the advantages of the binary framing layer of HTTP/2 and a description of how it prioritizes requests.

      HTTP/1.1 — Pipelining and Head-of-Line Blocking

      The first response that a client receives on an HTTP GET request is often not the fully rendered page. Instead, it contains links to additional resources needed by the requested page. The client discovers that the full rendering of the page requires these additional resources from the server only after it downloads the page. Because of this, the client will have to make additional requests to retrieve these resources. In HTTP/1.0, the client had to break and remake the TCP connection with every new request, a costly affair in terms of both time and resources.

      HTTP/1.1 takes care of this problem by introducing persistent connections and pipelining. With persistent connections, HTTP/1.1 assumes that a TCP connection should be kept open unless directly told to close. Pipelining builds on this by allowing the client to send multiple requests along the same connection without waiting for a response to each, greatly improving the performance of HTTP/1.1 over HTTP/1.0.

      Unfortunately, there is a natural bottleneck to this optimization strategy. Since multiple data packets cannot pass each other when traveling to the same destination, there are situations in which a request at the head of the queue that cannot retrieve its required resource will block all the requests behind it. This is known as head-of-line (HOL) blocking, and is a significant problem with optimizing connection efficiency in HTTP/1.1. Adding separate, parallel TCP connections could alleviate this issue, but there are limits to the number of concurrent TCP connections possible between a client and server, and each new connection requires significant resources.

      These problems were at the forefront of the minds of HTTP/2 developers, who proposed to use the aforementioned binary framing layer to fix these issues, a topic you will learn more about in the next section.

      HTTP/2 — Advantages of the Binary Framing Layer

      In HTTP/2, the binary framing layer encodes requests/responses and cuts them up into smaller packets of information, greatly increasing the flexibility of data transfer.

      Let’s take a closer look at how this works. As opposed to HTTP/1.1, which must make use of multiple TCP connections to lessen the effect of HOL blocking, HTTP/2 establishes a single connection object between the two machines. Within this connection there are multiple streams of data. Each stream consists of multiple messages in the familiar request/response format. Finally, each of these messages splits into smaller units called frames:

      Streams, Messages, and Frames

      At the most granular level, the communication channel consists of a bunch of binary-encoded frames, each tagged to a particular stream. The identifying tags allow the connection to interleave these frames during transfer and reassemble them at the other end. The interleaved requests and responses can run in parallel without blocking the messages behind them, a process called multiplexing. Multiplexing resolves the head-of-line blocking issue in HTTP/1.1 by ensuring that no message has to wait for another to finish. This also means that servers and clients can send concurrent requests and responses, allowing for greater control and more efficient connection management.

      Since multiplexing allows the client to construct multiple streams in parallel, these streams only need to make use of a single TCP connection. Having a single persistent connection per origin improves upon HTTP/1.1 by reducing the memory and processing footprint throughout the network. This results in better network and bandwidth utilization and thus decreases the overall operational cost.

      A single TCP connection also improves the performance of the HTTPS protocol, since the client and server can reuse the same secured session for multiple requests/responses. In HTTPS, during the TLS or SSL handshake, both parties agree on the use of a single key throughout the session. If the connection breaks, a new session starts, requiring a newly generated key for further communication. Thus, maintaining a single connection can greatly reduce the resources required for HTTPS performance. Note that, though HTTP/2 specifications do not make it mandatory to use the TLS layer, many major browsers only support HTTP/2 with HTTPS.

      Although the multiplexing inherent in the binary framing layer solves certain issues of HTTP/1.1, multiple streams awaiting the same resource can still cause performance issues. The design of HTTP/2 takes this into account, however, by using stream prioritization, a topic we will discuss in the next section.

      HTTP/2 — Stream Prioritization

      Stream prioritization not only solves the possible issue of requests competing for the same resource, but also allows developers to customize the relative weight of requests to better optimize application performance. In this section, we will break down the process of this prioritization in order to provide better insight into how you can leverage this feature of HTTP/2.

      As you know now, the binary framing layer organizes messages into parallel streams of data. When a client sends concurrent requests to a server, it can prioritize the responses it is requesting by assigning a weight between 1 and 256 to each stream. The higher number indicates higher priority. In addition to this, the client also states each stream’s dependency on another stream by specifying the ID of the stream on which it depends. If the parent identifier is omitted, the stream is considered to be dependent on the root stream. This is illustrated in the following figure:

      Stream Prioritization

      In the illustration, the channel contains six streams, each with a unique ID and associated with a specific weight. Stream 1 does not have a parent ID associated with it and is by default associated with the root node. All other streams have some parent ID marked. The resource allocation for each stream is based on the weight it holds and the dependencies it requires. Streams 5 and 6, for example, which in the figure have been assigned the same weight and the same parent stream, will have the same prioritization for resource allocation.

      The server uses this information to create a dependency tree, which allows the server to determine the order in which the requests will retrieve their data. Based on the streams in the preceding figure, the dependency tree will be as follows:

      Dependency Tree

      In this dependency tree, stream 1 is dependent on the root stream and there is no other stream derived from the root, so all the available resources will be allocated to stream 1 ahead of the other streams. Since the tree indicates that stream 2 depends on the completion of stream 1, stream 2 will not proceed until the stream 1 task is completed. Now, let us look at streams 3 and 4. Both of these streams depend on stream 2. As in the case of stream 1, stream 2 will get all the available resources ahead of streams 3 and 4. After stream 2 completes its task, streams 3 and 4 will get the resources; these are split in the ratio of 2:4 as indicated by their weights, giving stream 4 the larger share of the resources. Finally, when stream 3 finishes, streams 5 and 6 will get the available resources in equal parts. This can happen before stream 4 has finished its task, even though stream 4 receives a larger share of resources; streams at a lower level are allowed to start as soon as the streams they depend on at an upper level have finished.
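
      A small Python sketch can illustrate the weight arithmetic; it is not part of any HTTP/2 library, and the capacity figure of 100 units is an arbitrary assumption:

      # Divide available capacity among sibling streams in proportion to their
      # weights, as a server might when scheduling dependent streams.

      def allocate(capacity: float, weights: dict) -> dict:
          total = sum(weights.values())
          return {stream: capacity * weight / total for stream, weight in weights.items()}

      # Streams 3 and 4 share their parent's resources in a 2:4 ratio.
      print(allocate(100, {"stream_3": 2, "stream_4": 4}))
      # {'stream_3': 33.3..., 'stream_4': 66.6...}

      # Streams 5 and 6 carry equal weights (the exact value is assumed here),
      # so they split the available resources evenly.
      print(allocate(100, {"stream_5": 16, "stream_6": 16}))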

      As an application developer, you can set the weights in your requests based on your needs. For example, you may assign a lower priority for loading an image with high resolution after providing a thumbnail image on the web page. By providing this facility of weight assignment, HTTP/2 enables developers to gain better control over web page rendering. The protocol also allows the client to change dependencies and reallocate weights at runtime in response to user interaction. It is important to note, however, that a server may change assigned priorities on its own if a certain stream is blocked from accessing a specific resource.

      Buffer Overflow

      In any TCP connection between two machines, both the client and the server have a certain amount of buffer space available to hold incoming requests that have not yet been processed. These buffers offer flexibility to account for numerous or particularly large requests, in addition to uneven speeds of downstream and upstream connections.

      There are situations, however, in which a buffer is not enough. For example, the server may be pushing a large amount of data at a pace that the client application is not able to cope with due to a limited buffer size or a lower bandwidth. Likewise, when a client uploads a huge image or a video to a server, the server buffer may overflow, causing some additional packets to be lost.

      In order to avoid buffer overflow, a flow control mechanism must prevent the sender from overwhelming the receiver with data. This section will provide an overview of how HTTP/1.1 and HTTP/2 use different versions of this mechanism to deal with flow control according to their different delivery models.

      HTTP/1.1

      In HTTP/1.1, flow control relies on the underlying TCP connection. When this connection initiates, both client and server establish their buffer sizes using their system default settings. If the receiver’s buffer is partially filled with data, it will tell the sender its receive window, i.e., the amount of available space that remains in its buffer. This receive window is advertised in a signal known as an ACK packet, which is the data packet that the receiver sends to acknowledge that it received the opening signal. If this advertised receive window size is zero, the sender will send no more data until the client clears its internal buffer and then requests to resume data transmission. It is important to note here that using receive windows based on the underlying TCP connection can only implement flow control on either end of the connection.

      Because HTTP/1.1 relies on the transport layer to avoid buffer overflow, each new TCP connection requires a separate flow control mechanism. HTTP/2, however, multiplexes streams within a single TCP connection, and will have to implement flow control in a different manner.

      HTTP/2

      HTTP/2 multiplexes streams of data within a single TCP connection. As a result, receive windows on the level of the TCP connection are not sufficient to regulate the delivery of individual streams. HTTP/2 solves this problem by allowing the client and server to implement their own flow controls, rather than relying on the transport layer. The application layer communicates the available buffer space, allowing the client and server to set the receive window on the level of the multiplexed streams. This fine-scale flow control can be modified or maintained after the initial connection via a WINDOW_UPDATE frame.
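
      Conceptually, each stream keeps its own window of bytes that the sender may still transmit. The sketch below models this idea in Python and is not a real HTTP/2 stack, though the 65,535-byte initial window matches the protocol’s default:

      # Per-stream flow control sketch: the sender decrements the window as it
      # sends frames and must pause once the window is exhausted, until the
      # receiver grants more space (as a WINDOW_UPDATE frame would).

      class StreamFlowControl:
          def __init__(self, initial_window: int = 65_535):
              self.window = initial_window          # bytes the sender may still send

          def send(self, frame_size: int) -> bool:
              if frame_size > self.window:
                  return False                      # sender must wait for an update
              self.window -= frame_size
              return True

          def window_update(self, increment: int) -> None:
              self.window += increment              # receiver freed buffer space

      stream = StreamFlowControl()
      print(stream.send(16_384), stream.window)     # True 49151
      stream.window_update(16_384)
      print(stream.window)                          # 65535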

      Since this method controls data flow on the level of the application layer, the flow control mechanism does not have to wait for a signal to reach its ultimate destination before adjusting the receive window. Intermediary nodes can use the flow control settings information to determine their own resource allocations and modify accordingly. In this way, each intermediary server can implement its own custom resource strategy, allowing for greater connection efficiency.

      This flexibility in flow control can be advantageous when creating appropriate resource strategies. For example, the client may fetch the first scan of an image, display it to the user, and allow the user to preview it while fetching more critical resources. Once the client fetches these critical resources, the browser will resume the retrieval of the remaining part of the image. Deferring the implementation of flow control to the client and server can thus improve the perceived performance of web applications.

      In terms of flow control and the stream prioritization mentioned in an earlier section, HTTP/2 provides a more detailed level of control that opens up the possibility of greater optimization. The next section will explain another method unique to the protocol that can enhance a connection in a similar way: predicting resource requests with server push.

      Predicting Resource Requests

      In a typical web application, the client will send a GET request and receive a page in HTML, usually the index page of the site. While examining the index page contents, the client may discover that it needs to fetch additional resources, such as CSS and JavaScript files, in order to fully render the page. The client determines that it needs these additional resources only after receiving the response from its initial GET request, and thus must make additional requests to fetch these resources and complete putting the page together. These additional requests ultimately increase the connection load time.

      There are solutions to this problem, however: since the server knows in advance that the client will require additional files, the server can save the client time by sending these resources to the client before it asks for them. HTTP/1.1 and HTTP/2 have different strategies of accomplishing this, each of which will be described in the next section.

      HTTP/1.1 — Resource Inlining

      In HTTP/1.1, if the developer knows in advance which additional resources the client machine will need to render the page, they can use a technique called resource inlining to include the required resource directly within the HTML document that the server sends in response to the initial GET request. For example, if a client needs a specific CSS file to render a page, inlining that CSS file will provide the client with the needed resource before it asks for it, reducing the total number of requests that the client must send.

      But there are a few problems with resource inlining. Including the resource in the HTML document is a viable solution for smaller, text-based resources, but larger files in non-text formats can greatly increase the size of the HTML document, which can ultimately decrease the connection speed and nullify the original advantage gained from using this technique. Also, since the inlined resources are no longer separate from the HTML document, there is no mechanism for the client to decline resources that it already has, or to place a resource in its cache. If multiple pages require the resource, each new HTML document will have the same resource inlined in its code, leading to larger HTML documents and longer load times than if the resource were simply cached in the beginning.

      A major drawback of resource inlining, then, is that the client cannot separate the resource and the document. A finer level of control is needed to optimize the connection, a need that HTTP/2 seeks to meet with server push.

      HTTP/2 — Server Push

      Since HTTP/2 enables multiple concurrent responses to a client’s initial GET request, a server can send a resource to a client along with the requested HTML page, providing the resource before the client asks for it. This process is called server push. In this way, an HTTP/2 connection can accomplish the same goal of resource inlining while maintaining the separation between the pushed resource and the document. This means that the client can decide to cache or decline the pushed resource separate from the main HTML document, fixing the major drawback of resource inlining.

      In HTTP/2, this process begins when the server sends a PUSH_PROMISE frame to inform the client that it is going to push a resource. This frame includes only the header of the message, and allows the client to know ahead of time which resource the server will push. If it already has the resource cached, the client can decline the push by sending a RST_STREAM frame in response. The PUSH_PROMISE frame also saves the client from sending a duplicate request to the server, since it knows which resources the server is going to push.
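
      The client-side decision can be summarized in a short Python sketch; the resource paths and in-memory cache are invented for the illustration and this is not how a real HTTP/2 client library is implemented:

      # On receiving a PUSH_PROMISE for a resource, the client either accepts the
      # pushed stream or declines it with RST_STREAM if it already has the resource.

      def handle_push_promise(promised_path: str, cache: set) -> str:
          if promised_path in cache:
              return "RST_STREAM"        # already cached: decline the push
          cache.add(promised_path)       # otherwise accept and store the resource
          return "ACCEPT"

      cache = {"/styles/main.css"}
      print(handle_push_promise("/styles/main.css", cache))  # RST_STREAM
      print(handle_push_promise("/scripts/app.js", cache))   # ACCEPT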

      It is important to note here that the emphasis of server push is client control. If a client needed to adjust the priority of server push, or even disable it, it could at any time send a SETTINGS frame to modify this HTTP/2 feature.

      Although this feature has a lot of potential, server push is not always the answer to optimizing your web application. For example, some web browsers cannot always cancel pushed requests, even if the client already has the resource cached. If the client mistakenly allows the server to send a duplicate resource, the server push can use up the connection unnecessarily. In the end, server push should be used at the discretion of the developer. For more on how to strategically use server push and optimize web applications, check out the PRPL pattern developed by Google. To learn more about the possible issues with server push, see Jake Archibald’s blog post HTTP/2 push is tougher than I thought.

      Compression

      A common method of optimizing web applications is to use compression algorithms to reduce the size of HTTP messages that travel between the client and the server. HTTP/1.1 and HTTP/2 both use this strategy, but there are implementation problems in the former that prohibit compressing the entire message. The following section will discuss why this is the case, and how HTTP/2 can provide a solution.

      HTTP/1.1

      Programs like gzip have long been used to compress the data sent in HTTP messages, especially to decrease the size of CSS and JavaScript files. The header component of a message, however, is always sent as plain text. Although each header is quite small, the burden of this uncompressed data weighs heavier and heavier on the connection as more requests are made, particularly penalizing complicated, API-heavy web applications that require many different resources and thus many different resource requests. Additionally, the use of cookies can sometimes make headers much larger, increasing the need for some kind of compression.

      In order to solve this bottleneck, HTTP/2 uses HPACK compression to shrink the size of headers, a topic discussed further in the next section.

      HTTP/2

      One of the themes that has come up again and again in HTTP/2 is its ability to use the binary framing layer to exhibit greater control over finer detail. The same is true when it comes to header compression. HTTP/2 can split headers from their data, resulting in a header frame and a data frame. The HTTP/2-specific compression program HPACK can then compress this header frame. This algorithm can encode the header metadata using Huffman coding, thereby greatly decreasing its size. Additionally, HPACK can keep track of previously conveyed metadata fields and further compress them according to a dynamically altered index shared between the client and the server. For example, take the following two requests:

      Request #1

      method:     GET
      scheme:     https
      host:       example.com
      path:       /academy
      accept:     image/jpeg
      user-agent: Mozilla/5.0 ...
      

      Request #2

      method:     GET
      scheme:     https
      host:       example.com
      path:       /academy/images
      accept:     image/jpeg
      user-agent: Mozilla/5.0 ...
      

      The various fields in these requests, such as method, scheme, host, accept, and user-agent, have the same values; only the path field uses a different value. As a result, when sending Request #2, the client can use HPACK to send only the indexed values needed to reconstruct these common fields and newly encode the path field. The resulting header frames will be as follows:

      Header Frame for Request #1

      method:     GET
      scheme:     https
      host:       example.com
      path:       /academy
      accept:     image/jpeg
      user-agent: Mozilla/5.0 ...
      

      Header Frame for Request #2

      path:       /academy/images
      

      Using HPACK and other compression methods, HTTP/2 provides one more feature that can reduce client-server latency.
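
      To make the indexing idea behind this example more concrete, here is a deliberately simplified Python sketch. It omits Huffman coding and HPACK’s static table entirely, so it illustrates the shared dynamic index rather than the real algorithm:

      # Headers already seen by both sides are referenced by an index into a
      # shared table; only new or changed fields are sent literally.

      class HeaderTable:
          def __init__(self):
              self.table = []   # dynamically grown list of (name, value) pairs

          def encode(self, headers: dict) -> list:
              encoded = []
              for name, value in headers.items():
                  entry = (name, value)
                  if entry in self.table:
                      encoded.append(("indexed", self.table.index(entry)))
                  else:
                      self.table.append(entry)
                      encoded.append(("literal", name, value))
              return encoded

      table = HeaderTable()
      request_1 = {"method": "GET", "host": "example.com", "path": "/academy"}
      request_2 = {"method": "GET", "host": "example.com", "path": "/academy/images"}

      print(table.encode(request_1))  # every field is sent literally the first time
      print(table.encode(request_2))  # only the changed path field is sent literally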

      Conclusion

      As you can see from this point-by-point analysis, HTTP/2 differs from HTTP/1.1 in many ways, with some features providing greater levels of control that can be used to better optimize web application performance and other features simply improving upon the previous protocol. Now that you have gained a high-level perspective on the variations between the two protocols, you can consider how such factors as multiplexing, stream prioritization, flow control, server push, and compression in HTTP/2 will affect the changing landscape of web development.

      If you would like to see a performance comparison between HTTP/1.1 and HTTP/2, check out this Google demo that compares the protocols for different latencies. Note that when you run the test on your computer, page load times may vary depending on several factors such as bandwidth, client and server resources available at the time of testing, and so on. If you’d like to study the results of more exhaustive testing, take a look at the article HTTP/2 – A Real-World Performance Test and Analysis. Finally, if you would like to explore how to build a modern web application, you could follow our How To Build a Modern Web Application to Manage Customer Information with Django and React on Ubuntu 18.04 tutorial, or set up your own HTTP/2 server with our How To Set Up Nginx with HTTP/2 Support on Ubuntu 16.04 tutorial.




      How To Implement Continuous Testing of Ansible Roles Using Molecule and Travis CI on Ubuntu 18.04


      The author selected the Mozilla Foundation to receive a donation as part of the Write for DOnations program.

      Introduction

      Ansible is an agentless configuration management tool that uses YAML templates to define a list of tasks to be performed on hosts. In Ansible, roles are a collection of variables, tasks, files, templates and modules that are used together to perform a singular, complex function.

      Molecule is a tool for performing automated testing of Ansible roles, specifically designed to support the development of consistently well-written and maintained roles. Molecule’s unit tests allow developers to test roles simultaneously against multiple environments and under different parameters. It’s important that developers continuously run tests against code that often changes; this workflow ensures that roles continue to work as you update code libraries. Running Molecule using a continuous integration tool, like Travis CI, allows for tests to run continuously, ensuring that contributions to your code do not introduce breaking changes.

      In this tutorial, you will use a pre-made base role that installs and configures an Apache web server and a firewall on Ubuntu and CentOS servers. Then, you will initialize a Molecule scenario in that role to create tests and ensure that the role performs as intended in your target environments. After configuring Molecule, you will use Travis CI to continuously test your newly created role. Every time a change is made to your code, Travis CI will run molecule test to make sure that the role still performs correctly.

      Prerequisites

      Before you begin this tutorial, you will need:

      Step 1 — Forking the Base Role Repository

      You will be using a pre-made role called ansible-apache that installs Apache and configures a firewall on Debian- and Red Hat-based distributions. You will fork and use this role as a base and then build Molecule tests on top of it. Forking allows you to create a copy of a repository so you can make changes to it without tampering with the original project.

      Start by creating a fork of the ansible-apache role. Go to the ansible-apache repository and click on the Fork button.

      Once you have forked the repository, GitHub will lead you to your fork’s page. This will be a copy of the base repository, but on your own account.

      Click on the green Clone or Download button and you’ll see a box with Clone with HTTPS.

      Copy the URL shown for your repository. You’ll use this in the next step. The URL will be similar to this:

      https://github.com/username/ansible-apache.git
      

      You will replace username with your GitHub username.

      With your fork set up, you will clone it on your server and begin preparing your role in the next section.

      Step 2 — Preparing Your Role

      Having followed Step 1 of the prerequisite How To Test Ansible Roles with Molecule on Ubuntu 18.04, you will have Molecule and Ansible installed in a virtual environment. You will use this virtual environment for developing your new role.

      First, activate the virtual environment you created while following the prerequisites by running:

      • source my_env/bin/activate

      Run the following command to clone the repository using the URL you just copied in Step 1:

      • git clone https://github.com/username/ansible-apache.git

      Your output will look similar to the following:

      Output

      Cloning into 'ansible-apache'...
      remote: Enumerating objects: 16, done.
      remote: Total 16 (delta 0), reused 0 (delta 0), pack-reused 16
      Unpacking objects: 100% (16/16), done.

      Move into the newly created directory:

      • cd ansible-apache

      The base role you've downloaded performs the following tasks:

      • Includes variables: The role starts by including all the required variables according to the distribution of the host. Ansible uses variables to handle the disparities between different systems. Since you are using Ubuntu 18.04 and CentOS 7 as hosts, the role will recognize that the OS families are Debian and Red Hat respectively and include variables from vars/Debian.yml and vars/RedHat.yml.

      • Includes distribution-relevant tasks: These tasks include tasks/install-Debian.yml and tasks/install-RedHat.yml. Depending on the specified distribution, it installs the relevant packages. For Ubuntu, these packages are apache2 and ufw. For CentOS, these packages are httpd and firewalld.

      • Ensures latest index.html is present: This task copies over a template templates/index.html.j2 that Apache will use as the web server's home page.

      • Starts relevant services and enables them on boot: Starts and enables the required services installed as part of the first task. For CentOS, these services are httpd and firewalld, and for Ubuntu, they are apache2 and ufw.

      • Configures firewall to allow traffic: This includes either tasks/configure-Debian-firewall.yml or tasks/configure-RedHat-firewall.yml. Ansible configures either Firewalld or UFW as the firewall and whitelists the http service.

      Now that you have an understanding of how this role works, you will configure Molecule to test it. You will write test cases for these tasks that cover the changes they make.

      Step 3 — Writing Your Tests

      To check that your base role performs its tasks as intended, you will start a Molecule scenario, specify your target environments, and create three custom test files.

      Begin by initializing a Molecule scenario for this role using the following command:

      • molecule init scenario -r ansible-apache

      You will see the following output:

      Output

      --> Initializing new scenario default...
      Initialized scenario in /home/sammy/ansible-apache/molecule/default successfully.

      You will add CentOS and Ubuntu as your target environments by including them as platforms in your Molecule configuration file. To do this, edit the molecule.yml file using a text editor:

      • nano molecule/default/molecule.yml

      Add the following highlighted content to the Molecule configuration:

      ~/ansible-apache/molecule/default/molecule.yml

      ---
      dependency:
        name: galaxy
      driver:
        name: docker
      lint:
        name: yamllint
      platforms:
        - name: centos7
          image: milcom/centos7-systemd
          privileged: true
        - name: ubuntu18
          image: solita/ubuntu-systemd
          command: /sbin/init
          privileged: true
          volumes:
            - /lib/modules:/lib/modules:ro
      provisioner:
        name: ansible
        lint:
          name: ansible-lint
      scenario:
        name: default
      verifier:
        name: testinfra
        lint:
          name: flake8
      

      Here, you're specifying two target platforms that are launched in privileged mode since you're working with systemd services:

      • centos7 is the first platform and uses the milcom/centos7-systemd image.
      • ubuntu18 is the second platform and uses the solita/ubuntu-systemd image. In addition to using privileged mode and mounting the required kernel modules, you're running /sbin/init on launch to make sure iptables is up and running.

      Save and exit the file.

      For more information on running privileged containers, visit the official Molecule documentation.

      Instead of using the default Molecule test file, you will be creating three custom test files, one for each target platform, and one file for writing tests that are common between all platforms. Start by deleting the scenario's default test file test_default.py with the following command:

      • rm molecule/default/tests/test_default.py

      You can now move on to creating the three custom test files, test_common.py, test_Debian.py, and test_RedHat.py for each of your target platforms.

      The first test file, test_common.py, will contain the common tests that each of the hosts will perform. Create and edit the common test file, test_common.py:

      • nano molecule/default/tests/test_common.py

      Add the following code to the file:

      ~/ansible-apache/molecule/default/tests/test_common.py

      import os
      import pytest
      
      import testinfra.utils.ansible_runner
      
      testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
          os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('all')
      
      
      @pytest.mark.parametrize('file, content', [
        ("/var/www/html/index.html", "Managed by Ansible")
      ])
      def test_files(host, file, content):
          file = host.file(file)
      
          assert file.exists
          assert file.contains(content)
      

      In your test_common.py file, you have imported the required libraries. You have also written a test called test_files(), which holds the only common task between distributions that your role performs: copying your template as the web server's home page.

      The next test file, test_Debian.py, holds tests specific to Debian distributions. This test file will specifically target your Ubuntu platform.

      Create and edit the Ubuntu test file by running the following command:

      • nano molecule/default/tests/test_Debian.py

      You can now import the required libraries and define the ubuntu18 platform as the target host. Add the following code to the start of this file:

      ~/ansible-apache/molecule/default/tests/test_Debian.py

      import os
      import pytest
      
      import testinfra.utils.ansible_runner
      
      testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
          os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('ubuntu18')
      

      Then, in the same file, you'll add the test_pkg() test.

      Add the following code to the file, which defines the test_pkg() test:

      ~/ansible-apache/molecule/default/tests/test_Debian.py

      ...
      @pytest.mark.parametrize('pkg', [
          'apache2',
          'ufw'
      ])
      def test_pkg(host, pkg):
          package = host.package(pkg)
      
          assert package.is_installed
      

      This test will check if apache2 and ufw packages are installed on the host.

      Note: When adding multiple tests to a Molecule test file, make sure there are two blank lines between each test or you'll get a syntax error from Molecule.

      To define the next test, test_svc(), add the following code under the test_pkg() test in your file:

      ~/ansible-apache/molecule/default/tests/test_Debian.py

      ...
      @pytest.mark.parametrize('svc', [
          'apache2',
          'ufw'
      ])
      def test_svc(host, svc):
          service = host.service(svc)
      
          assert service.is_running
          assert service.is_enabled
      

      test_svc() will check if the apache2 and ufw services are running and enabled.

      Finally, you will add your last test, test_ufw_rules(), to the test_Debian.py file.

      Add this code under the test_svc() test in your file to define test_ufw_rules():

      ~/ansible-apache/molecule/default/tests/test_Debian.py

      ...
      @pytest.mark.parametrize('rule', [
          '-A ufw-user-input -p tcp -m tcp --dport 80 -j ACCEPT'
      ])
      def test_ufw_rules(host, rule):
          cmd = host.run('iptables -t filter -S')
      
          assert rule in cmd.stdout
      

      test_ufw_rules() will check that your firewall configuration permits traffic on the port used by the Apache service.

      With each of these tests added, your test_Debian.py file will look like this:

      ~/ansible-apache/molecule/default/tests/test_Debian.py

      import os
      import pytest
      
      import testinfra.utils.ansible_runner
      
      testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
          os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('ubuntu18')
      
      
      @pytest.mark.parametrize('pkg', [
          'apache2',
          'ufw'
      ])
      def test_pkg(host, pkg):
          package = host.package(pkg)
      
          assert package.is_installed
      
      
      @pytest.mark.parametrize('svc', [
          'apache2',
          'ufw'
      ])
      def test_svc(host, svc):
          service = host.service(svc)
      
          assert service.is_running
          assert service.is_enabled
      
      
      @pytest.mark.parametrize('rule', [
          '-A ufw-user-input -p tcp -m tcp --dport 80 -j ACCEPT'
      ])
      def test_ufw_rules(host, rule):
          cmd = host.run('iptables -t filter -S')
      
          assert rule in cmd.stdout
      

      The test_Debian.py file now includes the three tests: test_pkg(), test_svc(), and test_ufw_rules().

      Save and exit test_Debian.py.
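
      As an optional aside, and not something this tutorial's role requires, Testinfra also exposes a socket check that you could use to confirm Apache is actually listening on port 80. A minimal sketch, assuming the same ubuntu18 host group defined at the top of the file, might look like this:

      ...
      def test_apache_listening(host):
          # Hypothetical extra test: confirm something is listening on port 80.
          # 'tcp://80' matches a listener on any local address; use
          # 'tcp://0.0.0.0:80' to pin the check to a specific bind address.
          socket = host.socket('tcp://80')

          assert socket.is_listening

      If you decide to add a test like this, remember the two-blank-line rule mentioned earlier.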

      Next, you'll create the test_RedHat.py test file, which will contain tests specific to Red Hat distributions, targeting your CentOS platform.

      Create and edit the CentOS test file, test_RedHat.py, by running the following command:

      • nano molecule/default/tests/test_RedHat.py

      Similarly to the Ubuntu test file, you will now write three tests to include in your test_RedHat.py file. Before adding the test code, you can import the required libraries and define the centos7 platform as the target host, by adding the following code to the beginning of your file:

      ~/ansible-apache/molecule/default/tests/test_RedHat.py

      import os
      import pytest
      
      import testinfra.utils.ansible_runner
      
      testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
          os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('centos7')
      

      Then, add the test_pkg() test, which will check if the httpd and firewalld packages are installed on the host.

      Following the code for your library imports, add the test_pkg() test to your file. (Again, remember to include two blank lines before each new test.)

      ~/ansible-apache/molecule/default/tests/test_RedHat.py

      ...
      @pytest.mark.parametrize('pkg', [
          'httpd',
          'firewalld'
      ])
      def test_pkg(host, pkg):
          package = host.package(pkg)
      
          assert package.is_installed
      

      Now, you can add the test_svc() test to ensure that the httpd and firewalld services are running and enabled.

      Add the test_svc() code to your file following the test_pkg() test:

      ~/ansible-apache/molecule/default/tests/test_RedHat.py

      ...
      @pytest.mark.parametrize('svc', [
          'httpd',
          'firewalld'
      ])
      def test_svc(host, svc):
          service = host.service(svc)
      
          assert service.is_running
          assert service.is_enabled
      

      The final test in the test_RedHat.py file will be test_firewalld(), which will check whether Firewalld has the http service whitelisted.

      Add the test_firewalld() test to your file after the test_svc() code:

      ~/ansible-apache/molecule/default/tests/test_RedHat.py

      ...
      @pytest.mark.parametrize('file, content', [
          ("/etc/firewalld/zones/public.xml", "<service name="http"/>")
      ])
      def test_firewalld(host, file, content):
          file = host.file(file)
      
          assert file.exists
          assert file.contains(content)
      

      After importing the libraries and adding the three tests, your test_RedHat.py file will look like this:

      ~/ansible-apache/molecule/default/tests/test_RedHat.py

      import os
      import pytest
      
      import testinfra.utils.ansible_runner
      
      testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
          os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('centos7')
      
      
      @pytest.mark.parametrize('pkg', [
          'httpd',
          'firewalld'
      ])
      def test_pkg(host, pkg):
          package = host.package(pkg)
      
          assert package.is_installed
      
      
      @pytest.mark.parametrize('svc', [
          'httpd',
          'firewalld'
      ])
      def test_svc(host, svc):
          service = host.service(svc)
      
          assert service.is_running
          assert service.is_enabled
      
      
      @pytest.mark.parametrize('file, content', [
          ("/etc/firewalld/zones/public.xml", "<service name="http"/>")
      ])
      def test_firewalld(host, file, content):
          file = host.file(file)
      
          assert file.exists
          assert file.contains(content)
      

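      As an optional aside (again, not part of the tests above), you could also probe Firewalld at runtime rather than reading the zone file, for example by asking firewall-cmd which services are allowed. Whether this works depends on Firewalld being fully operational inside the container, so treat the following as a sketch rather than a drop-in addition:

      ...
      def test_firewalld_http_service(host):
          # Hypothetical runtime check: list the services allowed in the default zone
          cmd = host.run('firewall-cmd --list-services')

          assert cmd.rc == 0
          assert 'http' in cmd.stdout
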
      Now that you've completed writing tests in all three files, test_common.py, test_Debian.py, and test_RedHat.py, your role is ready for testing. In the next step, you will use Molecule to run these tests against your newly configured role.

      Step 4 — Testing Against Your Role

      You will now execute your newly created tests against the base role ansible-apache using Molecule. To run your tests, use the following command:

      • molecule test

      You'll see the following output once Molecule has finished running all the tests:

      Output

      ...
      --> Scenario: 'default'
      --> Action: 'verify'
      --> Executing Testinfra tests found in /home/sammy/ansible-apache/molecule/default/tests/...
      ============================= test session starts ==============================
      platform linux -- Python 3.6.7, pytest-4.1.1, py-1.7.0, pluggy-0.8.1
      rootdir: /home/sammy/ansible-apache/molecule/default, inifile:
      plugins: testinfra-1.16.0
      collected 12 items

      tests/test_common.py ..                                                  [ 16%]
      tests/test_RedHat.py .....                                               [ 58%]
      tests/test_Debian.py .....                                               [100%]

      ========================== 12 passed in 80.70 seconds ==========================
      Verifier completed successfully.

      You'll see Verifier completed successfully in your output; this means that the verifier executed all of your tests and they all passed.
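
      As a practical note while iterating on these test files: molecule test runs the full sequence of creating, converging, verifying, and destroying the instances every time. If you have already run molecule converge, you can re-run only the Testinfra tests with molecule verify, which is usually much faster.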

      Now that you've successfully completed the development of your role, you can commit your changes to Git and set up Travis CI for continuous testing.

      Step 5 — Using Git to Share Your Updated Role

      In this tutorial, so far, you have cloned a role called ansible-apache and added tests to it to make sure it works against Ubuntu and CentOS hosts. To share your updated role with the public, you must commit these changes and push them to your fork.

      Run the following command to add the files you've modified to the staging area so that you can commit them:

      • git add .

      This command will add all of the files that you have modified in the current directory to the staging area.
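
      If you'd like to double-check what will be included in the commit, you can optionally run git status to list the files currently in the staging area.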

      You also need to set your name and email address in the git config in order to commit successfully. You can do so using the following commands:

      • git config user.email "sammy@digitalocean.com"
      • git config user.name "John Doe"

      Commit the changed files to your repository:

      • git commit -m "Configured Molecule"

      You'll see the following output:

      Output

      [master b2d5a5c] Configured Molecule
       8 files changed, 155 insertions(+), 1 deletion(-)
       create mode 100644 molecule/default/Dockerfile.j2
       create mode 100644 molecule/default/INSTALL.rst
       create mode 100644 molecule/default/molecule.yml
       create mode 100644 molecule/default/playbook.yml
       create mode 100644 molecule/default/tests/test_Debian.py
       create mode 100644 molecule/default/tests/test_RedHat.py
       create mode 100644 molecule/default/tests/test_common.py

      This signifies that you have committed your changes successfully. Now, push these changes to your fork with the following command:

      • git push -u origin master

      You will see a prompt for your GitHub credentials. After entering these credentials, your code will be pushed to your repository and you'll see this output:

      Output

      Counting objects: 13, done.
      Compressing objects: 100% (12/12), done.
      Writing objects: 100% (13/13), 2.32 KiB | 2.32 MiB/s, done.
      Total 13 (delta 3), reused 0 (delta 0)
      remote: Resolving deltas: 100% (3/3), completed with 2 local objects.
      To https://github.com/username/ansible-apache.git
         009d5d6..e4e6959  master -> master
      Branch 'master' set up to track remote branch 'master' from 'origin'.

      If you go to your fork's repository at github.com/username/ansible-apache, you'll see a new commit called Configured Molecule reflecting the changes you made in the files.

      Now, you can integrate Travis CI with your new repository so that any changes made to your role will automatically trigger Molecule tests. This will ensure that your role always works with Ubuntu and CentOS hosts.

      Step 6 — Integrating Travis CI

      In this step, you're going to integrate Travis CI into your workflow. Once enabled, any changes you push to your fork will trigger a Travis CI build. The purpose of this is to ensure Travis CI always runs molecule test whenever contributors make changes. If a change breaks the tests, Travis will report the build as failing.

      Proceed to Travis CI to enable your repository. Navigate to your profile page where you can click the Activate button for GitHub.

      You can find further guidance on activating repositories in the Travis CI documentation.

      For Travis CI to work, you must create a configuration file containing instructions for it. To create the Travis configuration file, return to your server and open a new file named .travis.yml in the root of your role:

      • nano .travis.yml

      To duplicate the environment you've created in this tutorial, you will specify parameters in the Travis configuration file. Add the following content to your file:

      ~/ansible-apache/.travis.yml

      ---
      language: python
      python:
        - "2.7"
        - "3.6"
      services:
        - docker
      install:
        - pip install molecule docker
      script:
        - molecule --version
        - ansible --version
        - molecule test
      

      The parameters you've specified in this file are:

      • language: When you specify Python as the language, the CI environment uses separate virtualenv instances for each Python version you specify under the python key.
      • python: Here, you're specifying that Travis will use both Python 2.7 and Python 3.6 to run your tests.
      • services: You need Docker to run tests in Molecule. You're specifying that Travis should ensure Docker is present in your CI environment.
      • install: Here, you're specifying preliminary installation steps that Travis CI will carry out in your virtualenv.
        • pip install molecule docker installs Molecule along with the Python library for the Docker remote API; this also makes sure Ansible is available, since Molecule depends on it.
      • script: This is to specify the steps that Travis CI needs to carry out. In your file, you're specifying three steps:
        • molecule --version prints the Molecule version if Molecule has been successfully installed.
        • ansible --version prints the Ansible version if Ansible has been successfully installed.
        • molecule test finally runs your Molecule tests.

      You specify molecule --version and ansible --version so that, if the build fails because of a Molecule or Ansible installation or versioning problem, the build log shows exactly which versions were in use.

      Once you've added the content to the Travis CI configuration file, save and exit .travis.yml.

      Now, every time you push any changes to your repository, Travis CI will automatically run a build based on the above configuration file. If any of the commands in the script block fail, Travis CI will report the build as failed.

      To make it easier to see the build status, you can add a badge indicating the build status to the README of your role. Open the README.md file using a text editor:

      • nano README.md

      Add the following line to the README.md to display the build status:

      ~/ansible-apache/README.md

      [![Build Status](https://travis-ci.org/username/ansible-apache.svg?branch=master)](https://travis-ci.org/username/ansible-apache)
      

      Replace username with your GitHub username. Commit and push the changes to your repository as you did earlier.

      First, run the following command to add .travis.yml and README.md to the staging area:

      • git add .travis.yml README.md

      Now commit the changes to your repository by executing:

      • git commit -m "Configured Travis"

      Finally, push these changes to your fork with the following command:

      • git push -u origin master

      If you navigate over to your GitHub repository, you will see that it initially reports build: unknown.

      [Image: README badge showing build: unknown]

      Within a few minutes, Travis will initiate a build that you can monitor at the Travis CI website. Once the build succeeds, GitHub will report the status on your repository as well, using the badge you've placed in your README file:

      [Image: README badge showing build: passing]

      You can access the complete details of the builds by going to the Travis CI website:

      [Image: Travis CI build details page]

      Now that you've successfully set up Travis CI for your new role, you can continuously test and integrate changes to your Ansible roles.

      Conclusion

      In this tutorial, you forked a role that installs and configures an Apache web server from GitHub and added integrations for Molecule by writing tests and configuring these tests to work on Docker containers running Ubuntu and CentOS. By pushing your newly created role to GitHub, you have made it available to other users. Whenever contributors make changes to your role, Travis CI will automatically run Molecule to test it.

      Once you're comfortable with the creation of roles and testing them with Molecule, you can integrate this with Ansible Galaxy so that roles are automatically pushed once the build is successful.


