One place for hosting & domains

      Schema

      How To Use Schema Validation in MongoDB


      The author selected the Open Internet/Free Speech Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      One important aspect of relational databases — which store databases in tables made up of rows and columns — is that they operate on fixed, rigid schemas with fields of known data types. Document-oriented databases like MongoDB are more flexible in this regard, as they allow you to reshape your documents’ structure as needed.

      However, there are likely to be situations in which you might need your data documents to follow a particular structure or fulfill certain requirements. Many document databases allow you to define rules that dictate how parts of your documents’ data should be structured while still offering some freedom to change this structure if needed.

      MongoDB has a feature called schema validation that allows you to apply constraints on your documents’ structure. Schema validation is built around JSON Schema, an open standard for JSON document structure description and validation. In this tutorial, you’ll write and apply validation rules to control the structure of documents in an example MongoDB collection.

      Prerequisites

      To follow this tutorial, you will need:

      Note: The linked tutorials on how to configure your server, install MongoDB, and secure the MongoDB installation refer to Ubuntu 20.04. This tutorial concentrates on MongoDB itself, not the underlying operating system. It will generally work with any MongoDB installation regardless of the operating system as long as authentication has been enabled.

      Step 1 — Inserting Documents Without Applying Schema Validation

      In order to highlight MongoDB’s schema validation features and why they can be useful, this step outlines how to open the MongoDB shell to connect to your locally-installed MongoDB instance and create a sample collection within it. Then, by inserting a number of example documents into this collection, this step will show how MongoDB doesn’t enforce any schema validation by default. In later steps, you’ll begin creating and enforcing such rules yourself.

      To create the sample collection used in this guide, connect to the MongoDB shell as your administrative user. This tutorial follows the conventions of the prerequisite MongoDB security tutorial and assumes the name of this administrative user is AdminSammy and its authentication database is admin. Be sure to change these details in the following command to reflect your own setup, if different:

      • mongo -u AdminSammy -p --authenticationDatabase admin

      Enter the password set during installation to gain access to the shell. After providing the password, you’ll see the > prompt sign.

      To illustrate the schema validation features, this guide’s examples use an sample database containing documents that represent the highest mountains in the world. The sample document for Mount Everest will take this form:

      The Everest document

      {
          "name": "Everest",
          "height": 8848,
          "location": ["Nepal", "China"],
          "ascents": {
              "first": {
                  "year": 1953,
              },
              "first_winter": {
                  "year": 1980,
              },
              "total": 5656,
          }
      }
      

      This document contains the following information:

      • name: the peak’s name.
      • height: the peak’s elevation, in meters.
      • location: the countries in which the mountain is located. This field stores values as an array to allow for mountains located in more than one country.
      • ascents: this field’s value is another document. When one document is stored within another document like this, it’s known as an embedded or nested document. Each ascents document describes successful ascents of the given mountain. Specifically, each ascents document contains a total field that lists the total number of successful ascents of each given peak. Additionally, each of these nested documents contain two fields whose values are also nested documents:
        • first: this field’s value is a nested document that contains one field, year, which describes the year of the first overall successful ascent.
        • first_winter: this field’s value is a nested document that also contains a year field, the value of which represents the year of the first successful winter ascent of the given mountain.

      Run the following insertOne() method to simultaneously create a collection named peaks in your MongoDB installation and insert the previous example document representing Mount Everest into it:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953
      • },
      • "first_winter": {
      • "year": 1980
      • },
      • "total": 5656
      • }
      • }
      • )

      The output will contain a success message and an object identifier assigned to the newly inserted object:

      Output

      { "acknowledged" : true, "insertedId" : ObjectId("618ffa70bfa69c93a8980443") }

      Although you inserted this document by running the provided insertOne() method, you had complete freedom in designing this document’s structure. In some cases, you might want to have some degree of flexibility in how documents within the database are structured. However, you might also want to make sure some aspects of the documents’ structure remain consistent to allow for easier data analysis or processing.

      To illustrate why this can be important, consider a few other example documents that might be entered into this database.

      The following document is almost identical to the previous one representing Mount Everest, but it doesn’t contain a name field:

      The Mountain with no name at all

      {
          "height": 8611,
          "location": ["Pakistan", "China"],
          "ascents": {
              "first": {
                  "year": 1954
              },
              "first_winter": {
                  "year": 1921
              },
              "total": 306
          }
      }
      

      For a database containing a list of the highest mountains in the world, adding a document representing a mountain but not including its name would likely be a serious error.

      In this next example document, the mountain’s name is present but its height is represented as a string instead of a number. Additionally, the location is not an array but a single value, and there is no information on the total number of ascent attempts:

      Mountain with a string value for its height

      {
          "name": "Manaslu",
          "height": "8163m",
          "location": "Nepal"
      }
      

      Interpreting a document with as many omissions as this example could prove difficult. For instance, you would not be able to successfully sort the collection by peak heights if the height attribute values are stored as different data types between documents.

      Now run the following insertMany() method to test whether these documents can be inserted into the database without causing any errors:

      • db.peaks.insertMany([
      • {
      • "height": 8611,
      • "location": ["Pakistan", "China"],
      • "ascents": {
      • "first": {
      • "year": 1954
      • },
      • "first_winter": {
      • "year": 1921
      • },
      • "total": 306
      • }
      • },
      • {
      • "name": "Manaslu",
      • "height": "8163m",
      • "location": "Nepal"
      • }
      • ])

      As it turns out, MongoDB will not return any errors and both documents will be inserted successfully:

      Output

      { "acknowledged" : true, "insertedIds" : [ ObjectId("618ffd0bbfa69c93a8980444"), ObjectId("618ffd0bbfa69c93a8980445") ] }

      As this output indicates, both of these documents are valid JSON, which is enough to insert them into the collection. However, this isn’t enough to keep the database logically consistent and meaningful. In the next steps, you’ll build schema validation rules to make sure the data documents in the peaks collection follow a few essential requirements.

      Step 2 — Validating String Fields

      In MongoDB, schema validation works on individual collections by assigning a JSON Schema document to the collection. JSON Schema is an open standard that allows you to define and validate the structure of JSON documents. You do this by creating a schema definition that lists a set of requirements that documents in the given collection must follow to be considered valid.

      Any given collection can only use a single JSON Schema, but you can assign a schema when you create the collection or any time afterwards. If you decide to change your original validation rules later on, you will have to replace the original JSON Schema document with one that aligns with your new requirements.

      To assign a JSON Schema validator document to the peaks collection you created in the previous step, you could run the following command:

      • db.runCommand({
      • "collMod": "collection_name",
      • "validator": {
      • $jsonSchema: {JSON_Schema_document}
      • }
      • })

      The runCommand method executes the collMod command, which modifies the specified collection by applying the validator attribute to it. The validator attribute is responsible for schema validation and, in this example syntax, it accepts the $jsonSchema operator. This operator defines a JSON Schema document which will be used as the schema validator for the given collection.

      Warning: In order to execute the collMod command, your MongoDB user must be granted the appropriate privileges. Assuming you followed the prerequisite tutorial on How To Secure MongoDB on Ubuntu 20.04 and are connected to your MongoDB instance as the administrative user you created in that guide, you will need to grant it an additional role to follow along with the examples in this guide.

      First, switch to your user’s authentication database. This is admin in the following example, but connect to your own user’s authentication database if different:

      Output

      switched to db admin

      Then run a grantRolesToUser() method and grant your user the dbAdmin role over the database where you created the peaks collection. The following example assumes the peaks collection is in the test database:

      • db.grantRolesToUser(
      • "AdminSammy",
      • [ { role : "dbAdmin", db : "test" } ]
      • )

      Alternatively, you can grant your user the dbAdminAnyDatabase role. As this role’s name implies, it will grant your user dbAdmin privileges over every database on your MongoDB instance:

      • db.grantRolesToUser(
      • "AdminSammy",
      • [ "dbAdminAnyDatabase" ]
      • )

      After granting your user the appropriate role, navigate back to the database where your peaks collection is stored:

      Output

      switched to db test

      Be aware that you can also assign a JSON Schema validator when you create a collection. To do so, you could use the following syntax:

      • db.createCollection(
      • "collection_name", {
      • "validator": {
      • $jsonSchema: {JSON_Schema_document}
      • }
      • })

      Unlike the previous example, this syntax doesn’t include the collMod command, since the collection doesn’t yet exist and thus can’t be modified. As with the previous example, though, collection_name is the name of the collection to which you want to assign the validator document and the validator option assigns a specified JSON Schema document as the collection’s validator.

      Applying a JSON Schema validator from the start like this means every document you add to the collection must satisfy the requirements set by the validator. When you add validation rules to an existing collection, though, the new rules won’t affect existing documents until you try to modify them.

      The JSON schema document you pass to the validator attribute should outline every validation rule you want to apply to the collection. The following example JSON Schema will make sure that the name field is present in every document in the collection, and that the name field’s value is always a string:

      Your first JSON Schema document validating the name field

      {
          "bsonType": "object",
          "description": "Document describing a mountain peak",
          "required": ["name"],
          "properties": {
              "name": {
                  "bsonType": "string",
                  "description": "Name must be a string and is required"
              }
          },
      }
      

      This schema document outlines certain requirements that certain parts of documents entered into the collection must follow. The root part of the JSON Schema document (the fields before properties, which in this case are bsonType, description, and required) describes the database document itself.

      The bsonType property describes the data type that the validation engine will expect to find. For the database document itself, the expected type is object. This means that you can only add objects — in other words, complete, valid JSON documents surrounded by curly braces ({ and }) — to this collection. If you were to try to insert some other kind of data type (like a standalone string, integer, or an array), it would cause an error.

      In MongoDB, every document is an object. However, JSON Schema is a standard used to describe and validate all kinds of valid JSON documents, and a plain array or a string is valid JSON, too. When working with MongoDB schema validation, you’ll find that you must always set the root document’s bsonType value as object in the JSON Schema validator.

      Next, the description property provides a short description of the documents found in this collection. This field isn’t required, but in addition to being used to validate documents, JSON Schemas can also be used to annotate the document’s structure. This can help other users understand what the purpose of the documents are, so including a description field can be a good practice.

      The next property in the validation document is the required field. The required field can only accept an array containing a list of document fields that must be present in every document in the collection. In this example, ["name"] means that the documents only have to contain the name field to be considered valid.

      Following that is a properties object that describes the rules used to validate document fields. For each field that you want to define rules for, include an embedded JSON Schema document named after the field. Be aware that you can define schema rules for fields that aren’t listed in the required array. This can be useful in cases where your data has fields that aren’t required, but you’d still like for them to follow certain rules when they are present.

      These embedded schema documents will follow a similar syntax as the main document. In this example, the bsonType property will require every document’s name field to be a string. This embedded document also contains a brief description field.

      To apply this JSON Schema to the peaks collection you created in the previous step, run the following runCommand() method:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • }
      • },
      • }
      • }
      • })

      MongoDB will respond with a success message indicating that the collection was successfully modified:

      Output

      { "ok" : 1 }

      Following that, MongoDB will no longer allow you to insert documents into the peaks collection if they don’t have a name field. To test this, try inserting the document you inserted in the previous step that fully describes a mountain, aside from missing the name field:

      • db.peaks.insertOne(
      • {
      • "height": 8611,
      • "location": ["Pakistan", "China"],
      • "ascents": {
      • "first": {
      • "year": 1954
      • },
      • "first_winter": {
      • "year": 1921
      • },
      • "total": 306
      • }
      • }
      • )

      This time, the operation will trigger an error message indicating a failed document validation:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      MongoDB won’t insert any documents that fail to pass the validation rules specified in the JSON Schema.

      Note: Starting with MongoDB 5.0, when validation fails the error messages point towards the failed constraint. In MongoDB 4.4 and earlier, the database provides no further details on the failure reason.

      You can also test whether MongoDB will enforce the data type requirement you included in the JSON Schema by running the following insertOne() method. This is similar to the last operation, but this time it includes a name field. However, this field’s value is a number instead of a string:

      • db.peaks.insertOne(
      • {
      • "name": 123,
      • "height": 8611,
      • "location": ["Pakistan", "China"],
      • "ascents": {
      • "first": {
      • "year": 1954
      • },
      • "first_winter": {
      • "year": 1921
      • },
      • "total": 306
      • }
      • }
      • )

      Once again, the validation will fail. Even though the name field is present, it doesn’t meet the constraint that requires it to be a string:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      Try once more, but with the name field present in the document and followed by a string value. This time, name is the only field in the document:

      • db.peaks.insertOne(
      • {
      • "name": "K2"
      • }
      • )

      The operation will succeed, and the document will receive the object identifier as usual:

      Output

      { "acknowledged" : true, "insertedId" : ObjectId("61900965bfa69c93a8980447") }

      The schema validation rules pertain only to the name field. At this point, as long as the name field fulfills the validation requirements, the document will be inserted without error. The rest of the document can take any shape.

      With that, you’ve created your first JSON Schema document and applied the first schema validation rule to the name field, requiring it to be present and a string. However, there are different validation options for different data types. Next, you’ll validate number values stored in each document’s height field.

      Step 3 — Validating Number Fields

      Recall from Step 1 when you inserted the following document into the peaks collection:

      Mountain with a string value for its height

      {
          "name": "Manaslu",
          "height": "8163m",
          "location": "Nepal"
      }
      

      Even though this document’s height value is a string instead of a number, the insertMany() method you used to insert this document was successful. This was possible because you haven’t yet added any validation rules for the height field.

      MongoDB will accept any value for this field — even values that don’t make any sense for this field, like negative values — as long as the inserted document is written in valid JSON syntax. To work around this, you can extend the schema validation document from the previous step to include additional rules regarding the height field.

      Start by ensuring that the height field is always present in newly-inserted documents and that it’s always expressed as a number. Modify the schema validation with the following command:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number and is required"
      • }
      • },
      • }
      • }
      • })

      In this command’s schema document, the height field is included in the required array. Likewise, there’s a height document within the properties object that will require any new height values to be a number. Again, the description field is auxiliary, and any description you include should only be to help other users understand the intention behind the JSON Schema.

      MongoDB will respond with a short success message to let you know that the collection was successfully modified:

      Output

      { "ok" : 1 }

      Now you can test the new rule. Try inserting a document with the minimal document structure required to pass the validation document. The following method will insert a document containing the only two mandatory fields, name and height:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300
      • }
      • )

      The insertion will succeed:

      Output

      { acknowledged: true, insertedId: ObjectId("61e0c8c376b24e08f998e371") }

      Next, try inserting a document with a missing height field:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak"
      • }
      • )

      Then try another that includes the height field, but this field contains a string value:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": "8300m"
      • }
      • )

      Both times, the operations will trigger an error message and fail:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      However, if you try inserting a mountain peak with a negative height, the mountain will save properly:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": -100
      • }
      • )

      To prevent this, you could add a few more properties to the schema validation document. Replace the current schema validation settings by running the following operation:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number between 100 and 10000 and is required",
      • "minimum": 100,
      • "maximum": 10000
      • }
      • },
      • }
      • }
      • })

      The new minimum and maximum attributes set constraints on values included in height fields, ensuring they can’t be lower than 100 or higher than 10000. This range makes sense in this case, as this collection is used to store information about mountain peak heights, but you could choose any values you like for these attributes.

      Now, if you try inserting a peak with a negative height value again, the operation will fail:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": -100
      • }
      • )

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . .

      As this output shows, your document schema now validates string values held in each document’s name field as well as numeric values held in the height fields. Continue reading to learn how to validate array values stored in each document’s location field.

      Step 4 — Validating Array Fields

      Now that each peak’s name and height values are being verified by schema validation constraints, we can turn our attention to the location field to guarantee its data consistency.

      Specifying the location for mountains is more tricky than one might expect, since peaks span more than one country, and this is the case for many of the famous eight-thousanders. Because of this, it would make sense store each peak’s location data as an array containing one or more country names instead of being just a string value. As with the height values, making sure each location field’s data type is consistent across every document can help with summarizing data when using aggregation pipelines.

      First, consider some examples of location values that users might enter, and weigh which ones would be valid or invalid:

      • ["Nepal", "China"]: this is a two-element array, and would be a valid value for a mountain spanning two countries.
      • ["Nepal"]: this example is a single-element array, it would also be a valid value for a mountain located in a single country.
      • "Nepal": this example is a plain string. It would be invalid because although it lists a single country, the location field should always contain an array
      • []: an empty array, this example would not be a valid value. After all, every mountain must exist in at least one country.
      • ["Nepal", "Nepal"]: this two-element array would also be invalid, as it contains the same value appearing twice.
      • ["Nepal", 15]: lastly, this two-element array would be invalid, as one of its values is a number instead of a string and this is not a correct location name.

      To ensure that MongoDB will correctly interpret each of these examples as valid or invalid, run the following operation to create some new validation rules for the peaks collection:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height", "location"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number between 100 and 10000 and is required",
      • "minimum": 100,
      • "maximum": 10000
      • },
      • "location": {
      • "bsonType": "array",
      • "description": "Location must be an array of strings",
      • "minItems": 1,
      • "uniqueItems": true,
      • "items": {
      • "bsonType": "string"
      • }
      • }
      • },
      • }
      • }
      • })

      In this $jsonSchema object, the location field is included within the required array as well as the properties object. There, it’s defined with a bsonType of array to ensure that the location value is always an array rather than a single string or a number.

      The minItems property validates that the array must contain at least one element, and the uniqueItems property is set to true to ensure that elements within each location array will be unique. This will prevent values like ["Nepal", "Nepal"] from being accepted. Lastly, the items subdocument defines the validation schema for each individual array item. Here, the only expectation is that every item within a location array must be a string.

      Note: The available schema document properties are different for each bsonType and, depending on the field type, you will be able to validate different aspects of the field value. For example, with number values you could define minimum and maximum allowable values to create a range of acceptable values. In the previous example, by setting the location field’s bsonType to array, you can validate features particular to arrays.

      You can find details on all possible validation choices in the JSON Schema documentation.

      After executing the command, MongoDB will respond with a short success message that the collection was successfully modified with the new schema document:

      Output

      { "ok" : 1 }

      Now try inserting documents matching the examples prepared earlier to test how the new rule behaves. Once again, let’s use the minimal document structure, with only the name, height, and location fields present.

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal", "China"]
      • }
      • )

      The document will be inserted successfully as it fulfills all the defined validation expectations. Similarly, the following document will insert without error:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal"]
      • }
      • )

      However, if you were to run any of the following insertOne() methods, they would trigger a validation error and fail:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": "Nepal"
      • }
      • )
      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": []
      • }
      • )
      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal", "Nepal"]
      • }
      • )
      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal", 15]
      • }
      • )

      As per the validation rules you defined previously, the location values provided in these operations are considered invalid.

      After following this step, three primary fields describing a mountain top are already being validated through MongoDB’s schema validation feature. In the next step, you’ll learn how to validate nested documents using the ascents field as an example.

      Step 5 — Validating Embedded Documents

      At this point, your peaks collection has three fields — name, height and location — that are being kept in check by schema validation. This step focuses on defining validation rules for the ascents field, which describes successful attempts at summiting each peak.

      In the example document from Step 1 that represents Mount Everest, the ascents field was structured as follows:

      The Everest document

      {
          "name": "Everest",
          "height": 8848,
          "location": ["Nepal", "China"],
          "ascents": {
              "first": {
                  "year": 1953,
              },
              "first_winter": {
                  "year": 1980,
              },
              "total": 5656,
          }
      }
      

      The ascents subdocument contains a total field whose value represents the total number of ascent attempts for the given mountain. It also contains information on the first winter ascent of the mountain as well as the first ascent overall. These, however, might not be essential to the mountain description. After all, some mountains might not have been ascended in winter yet, or the ascent dates are disputed or not known. For now, just assume the information that you will always want to have in each document is the total number of ascent attempts.

      You can change the schema validation document so that the ascents field must always be present and its value must always be a subdocument. This subdocument, in turn, must always contain a total attribute holding a number greater than or equal to zero. The first and first_winter fields aren’t required for the purposes of this guide, so the validation form won’t consider them and they can take flexible forms.

      Once again, replace the schema validation document for the peaks collection by running the following runCommand() method:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height", "location", "ascents"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number between 100 and 10000 and is required",
      • "minimum": 100,
      • "maximum": 10000
      • },
      • "location": {
      • "bsonType": "array",
      • "description": "Location must be an array of strings",
      • "minItems": 1,
      • "uniqueItems": true,
      • "items": {
      • "bsonType": "string"
      • }
      • },
      • "ascents": {
      • "bsonType": "object",
      • "description": "Ascent attempts information",
      • "required": ["total"],
      • "properties": {
      • "total": {
      • "bsonType": "number",
      • "description": "Total number of ascents must be 0 or higher",
      • "minimum": 0
      • }
      • }
      • }
      • },
      • }
      • }
      • })

      Anytime the document contains subdocuments under any of its fields, the JSON Schema for that field follows the exact same syntax as the main document schema. Just like how the same documents can be nested within one another, the validation schema nests them within one another as well. This makes it straightforward to define complex validation schemas for document structures containing multiple subdocuments in a hierarchical structure.

      In this JSON Schema document, the ascents field is included within the required array, making it mandatory. It also appears in the properties object where it’s defined with a bsonType of object, just like the root document itself.

      Notice that the definition for ascents validation follows a similar principle as the root document. It has the required field, denoting properties the subdocument must contain. It also defines a properties list, following the same structure. Since the ascents field is a subdocument, it’s values will be validated just like those of a larger document would be.

      Within ascents, there’s a required array whose only value is total, meaning that every ascents subdocument will be required to contain a total field. Following that, the total value is described thoroughly within the properties object, which specifies that this must always be a number with a minimum value of zero.

      Again, because neither the first nor the first_winter fields are mandatory for the purposes of this guide, they aren’t included in these validation rules.

      With this schema validation document applied, try inserting the sample Mount Everest document from the first step to verify it allows you to insert documents you’ve already established as valid:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953,
      • },
      • "first_winter": {
      • "year": 1980,
      • },
      • "total": 5656,
      • }
      • }
      • )

      The document saves successfully, and MongoDB returns the new object identifier:

      Output

      { "acknowledged" : true, "insertedId" : ObjectId("619100f51292cb2faee531f8") }

      To make sure the last pieces of validation work properly, try inserting a document that doesn’t include the ascents field:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"]
      • }
      • )

      This time, the operation will trigger an error message pointing out a failed document validation:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      Now try inserting a document whose ascents subdocument is missing the total field:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953,
      • },
      • "first_winter": {
      • "year": 1980,
      • }
      • }
      • }
      • )

      This will again trigger an error.

      As a final test, try entering a document that contains an ascents field with a total value, but this value is negative:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953,
      • },
      • "first_winter": {
      • "year": 1980,
      • },
      • "total": -100
      • }
      • }
      • )

      Because of the negative total value, this document will also fail the validation test.

      Conclusion

      By following this tutorial, you became familiar with JSON Schema documents and how to use them to validate document structures before saving them into a collection. You then used JSON Schema documents to verify field types and apply value constraints to numbers and arrays. You’ve also learned how to validate subdocuments in a nested document structure.

      MongoDB’s schema validation feature should not be considered a replacement for data validation at the application level, but it can further safeguard against violating data constraints that are essential to keeping your data meaningful. Using schema validation can be a helpful tool for structuring one’s data while retaining the flexibility of a schemaless approach to data storage. With schema validation, you are in total control of those parts of the document structure you want to validate and those you’d like to leave open-ended.

      The tutorial described only a subset of MongoDB’s schema validation features. You can apply more constraints to different MongoDB data types, and it’s even possible to change the strictness of validation behavior and use JSON Schema to filter and validate existing documents. We encourage you to study the official official MongoDB documentation to learn more about schema validation and how it can help you work with data stored in the database.



      Source link

      How To Design a Document Schema in MongoDB


      The author selected the Open Internet/Free Speech Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      If you have a lot of experience working with relational databases, it can be difficult to move past the principles of the relational model, such as thinking in terms of tables and relationships. Document-oriented databases like MongoDB make it possible to break free from rigidity and limitations of the relational model. However, the flexibility and freedom that comes with being able to store self-descriptive documents in the database can lead to other pitfalls and difficulties.

      This conceptual article outlines five common guidelines related to schema design in a document-oriented database and highlights various considerations one should make when modeling relationships between data. It will also walk through several strategies one can employ to model such relationships, including embedding documents within arrays and using child and parent references, as well as when these strategies would be most appropriate to use.

      Guideline 1 — Storing Together What Needs to be Accessed Together

      In a typical relational database, data is kept in tables, and each table is constructed with a fixed list of columns representing various attributes that make up an entity, object, or event. For example, in a table representing students at a a university, you might find columns holding each student’s first name, last name, date of birth, and a unique identification number.

      Typically, each table represents a single subject. If you wanted to store information about a student’s current studies, scholarships, or prior education, it could make sense to keep that data in a separate table from the one holding their personal information. You could then connect these tables to signify that there is a relationship between the data in each one, indicating that the information they contain has a meaningful connection.

      For instance, a table describing each student’s scholarship status could refer to students by their student ID number, but it would not store the student’s name or address directly, avoiding data duplication. In such a case, to retrieve information about any student with all information on the student’s social media accounts, prior education, and scholarships, a query would need to access more than one table at a time and then compile the results from different tables into one.

      This method of describing relationships through references is known as a normalized data model. Storing data this way — using multiple separate, concise objects related to each other — is also possible in document-oriented databases. However, the flexibility of the document model and the freedom it gives to store embedded documents and arrays within a single document means that you can model data differently than you might in a relational database.

      The underlying concept for modeling data in a document-oriented database is to “store together what will be accessed together.”“ Digging further into the student example, say that most students at this school have more than one email address. Because of this, the university wants the ability to store multiple email addresses with each student’s contact information.

      In a case like this, an example document could have a structure like the following:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "emails": [
              {
                  "email": "sammy@digitalocean.com",
                  "type": "work"
              },
              {
                  "email": "sammy@example.com",
                  "type": "home"
              }
          ]
      }
      

      Notice that this example document contains an embedded list of email addresses.

      Representing more than a single subject inside a single document characterizes a denormalized data model. It allows applications to retrieve and manipulate all the relevant data for a given object (here, a student) in one go without a need to access multiple separate objects and collections. Doing so also guarantees the atomicity of operations on such a document without having to use multi-document transactions to guarantee integrity.

      Storing together what needs to be accessed together using embedded documents is often the optimal way to represent data in a document-oriented database. In the following guidelines, you’ll learn how different relationships between objects, such as one-to-one or one-to-many relationships, can be best modeled in a document-oriented database.

      Guideline 2 — Modeling One-to-One Relationships with Embedded Documents

      A one-to-one relationship represents an association between two distinct objects where one object is connected with exactly one of another kind.

      Continuing with the student example from the previous section, each student has only one valid student ID card at any given point in time. One card never belongs to multiple students, and no student can have multiple identification cards. If you were to store all this data in a relational database, it would likely make sense to model the relationship between students and their ID cards by storing the student records and the ID card records in separate tables that are tied together through references.

      One common method for representing such relationships in a document database is by using embedded documents. As an example, the following document describes a student named Sammy and their student ID card:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "id_card": {
              "number": "123-1234-123",
              "issued_on": ISODate("2020-01-23"),
              "expires_on": ISODate("2020-01-23")
          }
      }
      

      Notice that instead of a single value, this example document’s id_card field holds an embedded document representing the student’s identification card, described by an ID number, the card’s date of issue, and the card’s expiration date. The identity card essentially becomes a part of the document describing the student Sammy, even though it’s a separate object in real life. Usually, structuring the document schema like this so that you can retrieve all related information through a single query is a sound choice.

      Things become less straightforward if you encounter relationships connecting one object of a kind with many objects of another type, such as a student’s email addresses, the courses they attend, or the messages they post on the student council’s message board. In the next few guidelines, you’ll use these data examples to learn different approaches for working with one-to-many and many-to-many relationships.

      Guideline 3 — Modeling One-to-Few Relationships with Embedded Documents

      When an object of one type is related to multiple objects of another type, it can be described as a one-to-many relationship. A student can have multiple email addresses, a car can have numerous parts, or a shopping order can consist of multiple items. Each of these examples represents a one-to-many relationship.

      While the most common way to represent a one-to-one relationship in a document database is through an embedded document, there are several ways to model one-to-many relationships in a document schema. When considering your options for how to best model these, though, there are three properties of the given relationship you should consider:

      • Cardinality: Cardinality is the measure of the number of individual elements in a given set. For example, if a class has 30 students, you could say that class has a cardinality of 30. In a one-to-many relationship, the cardinality can be different in each case. A student could have one email address or multiple. They could be registered for just a few classes or they could have a completely full schedule. In a one-to-many relationship, the size of “many” will affect how you might model the data.
      • Independent access: Some related data will rarely, if ever, be accessed separately from the main object. For example, it might be uncommon to retrieve a single student’s email address without other student details. On the other hand, a university’s courses might need to be accessed and updated individually, regardless of the student or students that are registered to attend them. Whether or not you will ever access a related document alone will also affect how you might model the data.
      • Whether the relationship between data is strictly a one-to-many relationship: Consider the courses an example student attends at a university. From the student’s perspective, they can participate in multiple courses. On the surface, this may seem like a one-to-many relationship. However, university courses are rarely attended by a single student; more often, multiple students will attend the same class. In cases like this, the relationship in question is not really a one-to-many relationship, but a many-to-many relationship, and thus you’d take a different approach to model this relationship than you would a one-to-many relationship.

      Imagine you’re deciding how to store student email addresses. Each student can have multiple email addresses, such as one for work, one for personal use, and one provided by the university. A document representing a single email address might take a form like this:

      {
          "email": "sammy@digitalocean.com",
          "type": "work"
      }
      

      In terms of cardinality, there will be only a few email addresses for each student, since it’s unlikely that a student will have dozens — let alone hundreds — of email addresses. Thus, this relationship can be characterized as a one-to-few relationship, which is a compelling reason to embed email addresses directly into the student document and store them together. You don’t run any risk that the list of email addresses will grow indefinitely, which would make the document big and inefficient to use.

      Note: Be aware that there are certain pitfalls associated with storing data in arrays. For instance, a single MongoDB document cannot exceed 16MB in size. While it is possible and common to embed multiple documents using array fields, if the list of objects grows uncontrollably the document could quickly reach this size limit. Additionally, storing a large amount of data inside embedded arrays have a big impact on query performance.

      Embedding multiple documents in an array field will likely be suitable in many situations, but know that it also may not always be the best solution.

      Regarding independent access, email addresses will likely not be accessed separately from the student. As such, there is no clear incentive to store them as separate documents in a separate collection. This is another compelling reason to embed them inside the student’s document.

      The last thing to consider is whether this relationship is really a one-to-many relationship instead of a many-to-many relationship. Because an email address belongs to a single person, it’s reasonable to describe this relationship as a one-to-many relationship (or, perhaps more accurately, a one-to-few relationship) instead of a many-to-many relationship.

      These three assumptions suggest that embedding students’ various email addresses within the same documents that describe students themselves would be a good choice for storing this kind of data. A sample student’s document with email addresses embedded might take this shape:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "emails": [
              {
                  "email": "sammy@digitalocean.com",
                  "type": "work"
              },
              {
                  "email": "sammy@example.com",
                  "type": "home"
              }
          ]
      }
      

      Using this structure, every time you retrieve a student’s document you will also retrieve the embedded email addresses in the same read operation.

      If you model a relationship of the one-to-few variety, where the related documents do not need to be accessed independently, embedding documents directly like this is usually desirable, as this can reduce the complexity of the schema.

      As mentioned previously, though, embedding documents like this isn’t always the optimal solution. The next section provides more details on why this might be the case in some scenarios, and outlines how to use child references as an alternative way to represent relationships in a document database.

      Guideline 4 — Modeling One-to-Many and Many-to-Many Relationships with Child References

      The nature of the relationship between students and their email addresses informed how that relationship could best be modeled in a document database. There are some differences between this and the relationship between students and the courses they attend, so the way you model the relationships between students and their courses will be different as well.

      A document describing a single course that a student attends could follow a structure like this:

      {
          "name": "Physics 101",
          "department": "Department of Physics",
          "points": 7
      }
      

      Say that you decided from the outset to use embedded documents to store information about each students’ courses, as in this example:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "emails": [
              {
                  "email": "sammy@digitalocean.com",
                  "type": "work"
              },
              {
                  "email": "sammy@example.com",
                  "type": "home"
              }
          ],
          "courses": [
              {
                  "name": "Physics 101",
                  "department": "Department of Physics",
                  "points": 7
              },
              {
                  "name": "Introduction to Cloud Computing",
                  "department": "Department of Computer Science",
                  "points": 4
              }
          ]
      }
      

      This would be a perfectly valid MongoDB document and could well serve the purpose, but consider the three relationship properties you learned about in the previous guideline.

      The first one is cardinality. A student will likely only maintain a few email addresses, but they can attend multiple courses during their study. After several years of attendance, there could be dozens of courses the student took part in. Plus, they’d attend these courses along with many other students who are likewise attending their own set of courses over their years of attendance.

      If you decided to embed each course like the previous example, the student’s document would quickly get unwieldy. With a higher cardinality, the embedded document approach becomes less compelling.

      The second consideration is independent access. Unlike email addresses, it’s sound to assume there would be cases in which information about university courses would need to be retrieved on their own. For instance, say someone needs information about available courses to prepare a marketing brochure. Additionally, courses will likely need to be updated over time: the professor teaching the course might change, its schedule may fluctuate, or its prerequisites might need to be updated.

      If you were to store the courses as documents embedded within student documents, retrieving the list of all the courses offered by the university would become troublesome. Also, each time a course needs an update, you would need to go through all student records and update the course information everywhere. Both are good reasons to store courses separately and not embed them fully.

      The third thing to consider is whether the relationship between student and a university course is actually one-to-many or instead many-to-many. In this case, it’s the latter, as more than one student can attend each course. This relationship’s cardinality and independent access aspects suggest against embedding each course document, primarily for practical reasons like ease of access and update. Considering the many-to-many nature of the relationship between courses and students, it might make sense to store course documents in a separate collection with unique identifiers of their own.

      The documents representing classes in this separate collection might have a structure like these examples:

      {
          "_id": ObjectId("61741c9cbc9ec583c836170a"),
          "name": "Physics 101",
          "department": "Department of Physics",
          "points": 7
      },
      {
          "_id": ObjectId("61741c9cbc9ec583c836170b"),
          "name": "Introduction to Cloud Computing",
          "department": "Department of Computer Science",
          "points": 4
      }
      

      If you decide to store course information like this, you’ll need to find a way to connect students with these courses so that you will know which students attend which courses. In cases like this where the number of related objects isn’t excessively large, especially with many-to-many relationships, one common way of doing this is to use child references.

      With child references, a student’s document will reference the object identifiers of the courses that the student attends in an embedded array, as in this example:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "emails": [
              {
                  "email": "sammy@digitalocean.com",
                  "type": "work"
              },
              {
                  "email": "sammy@example.com",
                  "type": "home"
              }
          ],
          "courses": [
              ObjectId("61741c9cbc9ec583c836170a"),
              ObjectId("61741c9cbc9ec583c836170b")
          ]
      }
      

      Notice that this example document still has a courses field which also is an array, but instead of embedding full course documents like in the earlier example, only the identifiers referencing the course documents in the separate collection are embedded. Now, when retrieving a student document, courses will not be immediately available and will need to be queried separately. On the other hand, it’s immediately known which courses to retrieve. Also, in case any course’s details need to be updated, only the course document itself needs to be altered. All references between students and their courses will remain valid.

      Note: There is no firm rule for when the cardinality of a relation is too great to embed child references in this manner. You might choose a different approach at either a lower or higher cardinality if it’s what best suits the application in question. After all, you will always want to structure your data to suit the manner in which your application queries and updates it.

      If you model a one-to-many relationship where the amount of related documents is within reasonable bounds and related documents need to be accessed independently, favor storing the related documents separately and embedding child references to connect to them.

      Now that you’ve learned how to use child references to signify relationships between different types of data, this guide will outline an inverse concept: parent references.

      Guideline 5 — Modeling Unbounded One-to-Many Relationships with Parent References

      Using child references works well when there are too many related objects to embed them directly inside the parent document, but the amount is still within known bounds. However, there are cases when the number of associated documents might be unbounded and will continue to grow with time.

      As an example, imagine that the university’s student council has a message board where any student can post whatever messages they want, including questions about courses, travel stories, job postings, study materials, or just a free chat. A sample message in this example consists of a subject and a message body:

      {
          "_id": ObjectId("61741c9cbc9ec583c836174c"),
          "subject": "Books on kinematics and dynamics",
          "message": "Hello! Could you recommend good introductory books covering the topics of kinematics and dynamics? Thanks!",
          "posted_on": ISODate("2021-07-23T16:03:21Z")
      }
      

      You could use either of the two approaches discussed previously — embedding and child references — to model this relationship. If you were to decide on embedding, the student’s document might take a shape like this:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "emails": [
              {
                  "email": "sammy@digitalocean.com",
                  "type": "work"
              },
              {
                  "email": "sammy@example.com",
                  "type": "home"
              }
          ],
          "courses": [
              ObjectId("61741c9cbc9ec583c836170a"),
              ObjectId("61741c9cbc9ec583c836170b")
          ],
          "message_board_messages": [
              {
                  "subject": "Books on kinematics and dynamics",
                  "message": "Hello! Could you recommend good introductory books covering the topics of kinematics and dynamics? Thanks!",
                  "posted_on": ISODate("2021-07-23T16:03:21Z")
              },
              . . .
          ]
      }
      

      However, if a student is prolific with writing messages their document will quickly become incredibly long and could easily exceed the 16MB size limit, so the cardinality of this relation suggests against embedding. Additionally, the messages might need to be accessed separately from the student, as could be the case if the message board page is designed to show the latest messages posted by students. This also suggests that embedding is not the best choice for this scenario.

      Note: You should also consider whether the message board messages are frequently accessed when retrieving the student’s document. If not, having them all embedded inside that document would incur a performance penalty when retrieving and manipulating this document, even when the list of messages would not be used often. Infrequent access of related data is often another clue that you shouldn’t embed documents.

      Now consider using child references instead of embedding full documents as in the previous example. The individual messages would be stored in a separate collection, and the student’s document could then have the following structure:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "emails": [
              {
                  "email": "sammy@digitalocean.com",
                  "type": "work"
              },
              {
                  "email": "sammy@example.com",
                  "type": "home"
              }
          ],
          "courses": [
              ObjectId("61741c9cbc9ec583c836170a"),
              ObjectId("61741c9cbc9ec583c836170b")
          ],
          "message_board_messages": [
              ObjectId("61741c9cbc9ec583c836174c"),
              . . .
          ]
      }
      

      In this example, the message_board_messages field now stores the child references to all messages written by Sammy. However, changing the approach solves only one of the issues mentioned before in that it would now be possible to access the messages independently. But although the student’s document size would grow more slowly using the child references approach, the collection of object identifiers could also become unwieldy given the unbounded cardinality of this relation. A student could easily write thousands of messages during their four years of study, after all.

      In such scenarios, a common way to connect one object to another is through parent references. Unlike the child references described previously, it’s now not the student document referring to individual messages, but rather a reference in the message’s document pointing towards the student that wrote it.

      To use parent references, you would need to modify the message document schema to contain a reference to the student who authored the message:

      {
          "_id": ObjectId("61741c9cbc9ec583c836174c"),
          "subject": "Books on kinematics and dynamics",
          "message": "Hello! Could you recommend a good introductory books covering the topics of kinematics and dynamics? Thanks!",
          "posted_on": ISODate("2021-07-23T16:03:21Z"),
          "posted_by": ObjectId("612d1e835ebee16872a109a4")
      }
      

      Notice the new posted_by field contains the object identifier of the student’s document. Now, the student’s document won’t contain any information about the messages they’ve posted:

      {
          "_id": ObjectId("612d1e835ebee16872a109a4"),
          "first_name": "Sammy",
          "last_name": "Shark",
          "emails": [
              {
                  "email": "sammy@digitalocean.com",
                  "type": "work"
              },
              {
                  "email": "sammy@example.com",
                  "type": "home"
              }
          ],
          "courses": [
              ObjectId("61741c9cbc9ec583c836170a"),
              ObjectId("61741c9cbc9ec583c836170b")
          ]
      }
      

      To retrieve the list of messages written by a student, you would use a query on the messages collection and filter against the posted_by field. Having them in a separate collection makes it safe to let the list of messages grow without affecting any of the student’s documents.

      Note: When using parent references, creating an index on the field referencing the parent document can significantly increase the query performance each time you filter against the parent document identifier.

      If you model a one-to-many relationship where the amount of related documents is unbounded, regardless of whether the documents need to be accessed independently, it’s generally advised that you store related documents separately and use parent references to connect them to the parent document.

      Conclusion

      Thanks to the flexibility of document-oriented databases, determining the best way to model relationships in a document databases is less of a strict science than it is in a relational database. By reading this article, you’ve acquainted yourself with embedding documents and using child and parent references to store related data. You’ve learned about considering the relationship cardinality and avoiding unbounded arrays, as well as taking into account whether the document will be accessed separately or frequently.

      These are just a few guidelines that can help you model typical relationships in MongoDB, but modeling database schema is not a one size fits all. Always take into account your application and how it uses and updates the data when designing the schema.

      To learn more about schema design and common patterns for storing different kinds of data in MongoDB, we encourage you to check the official MongoDB documentation on that topic.



      Source link