MongoDB

Table of Contents

1 Basic

1.1 CRUD

create, read, update, and delete

  • create: insertOne
  • read: find, findOne
  • update: updateOne
  • delete: deleteOne

1.2 Types

null, Boolean, Number, String, Date(int64 since epoch), Regular expression(query only), Array, Embedded document

  • Binary data: Binary data is the only way to save non-UTF-8 strings to the database.
  • Code: MongoDB also makes it possible to store arbitrary JavaScript in queries and documents: {"x" : function() { /* ... */ }}

Object ID

12-byte ID for documents

0 1 2 3 4 5 6 7 8 9 10 11
Timestamp(seconds) Random Counter (random start value)  

1.3 Mongo Shell

conn = new Mongo("host:port")
db = conn.getDB("myDB")
  • help

Useful Helpers

Helper Equivalent
use db db.getSisterDB("db")
show dbs db.getMongo().getDBs()
show collections db.getCollectionNames()
  • pwd():
  • run("ls", "-l", "/home/myUser/my-scripts/")

Custom Helper

  1. define helper function in js
  2. load js
  3. typeof helper_function

Frequently Loaded Scripts

Use .mongorc.js in your home directory, This file is run whenever you start up the shell.

  • common example in .mongorc.js: remove some of the more "dangerous" shell helpers
var no = function() {
    print("Not on my watch.");
};

// Prevent dropping databases
db.dropDatabase = DB.prototype.dropDatabase = no;

// Prevent dropping collections
DBCollection.prototype.drop = no;

// Prevent dropping an index
DBCollection.prototype.dropIndex = no;

// Prevent dropping indexes
DBCollection.prototype.dropIndexes = no;
  • use --norc option to disable .mongorc.js

Customizing Your Prompt

Just set the prompt variable

prompt = function() {
    if (typeof db == 'undefined') {
	return '(nodb)> ';
    }

    // Check the last db operation
    try {
	db.runCommand({getLastError:1});
    }
    catch (e) {
	print(e);
    }

    return db+"> ";
};

1.4 Tools

  • mongoimport: importing raw data from a data feed or MySQL

2 CRUD

2.1 query

  • some query condition: $nin, $or, $exists, $regex
  • Mongo uses the Perl Compatible Regular Expression

querying arrays

  • contains
db.food.insertOne({"fruit" : ["apple", "banana", "peach"]})
db.food.find({"fruit" : "banana"}) // successfully match the document
  • contains all: db.food.find({fruit : {$all : ["apple", "banana"]}})
  • exact match: db.food.find({"fruit" : ["apple", "banana", "peach"]})
  • index match: db.food.find({"fruit.2" : "peach"})
  • $size: not working with $gt
  • $slice: db.blog.posts.findOne(criteria, {"comments" : {"$slice" : [23, 10]}}), db.blog.posts.findOne(criteria, {"comments" : {"$slice" : -1}})
  • db.blog.posts.find({"comments.name" : "bob"}, {"comments.$" : 1}): return the matching element with the $
  • range query on array elements: db.test.find({"x" : {"$elemMatch" : {"$gt" : 10, "$lt" : 20}}})

query embedded documents

  • db.people.find({"name.first" : "Joe", "name.last" : "Schmoe"})

$where

  • Allows you to execute arbitrary JavaScript as part of your query.(Insecure)

"$where" queries should not be used unless strictly necessary: they are much slower than regular queries. Each document has to be converted from BSON to a JavaScript object and then run through the "$where" expression.

query options

limit, skip, sort

  • Avoiding Large Skips: large skips are not very efficient

immortal cursors

Tells the database not to time out the cursor.

2.2 insert

insertMany

  • Batch inserts limit: 48 MB. Drivers will split up the batch insert into multiple 48 MB batch inserts
  • ordererd vs unordered
    • Executing an ordered list of operations on a sharded collection will generally be slower than executing an unordered list since with an ordered list, each operation must wait for the previous operation to finish.
    • For unordered inserts, MongoDB will attempt to insert all documents, regardless of whether some insertions produce errors. execute the operations in parallel

insert

Old method before mongo 3.0. Use insertOne and insertMany instead

2.3 delete

  • deleteOne
  • deleteMany
  • Clear an entire collection: drop

2.4 update

Updating a document is atomic

  • updateOne
  • updateMany
  • replaceOne: takes a filter as the first parameter, but as the second parameter replaceOne expects a document with which it will replace the document matching the filter.

This can be useful to do a dramatic schema migration.

update operaters

  • {"$inc": {"field": 1}}: create or increment 1
  • {"$set": {"new/existed field": "newValue"}}, {"$unset": {"field": 1}}
  • change embedded documents: {"$set": {"field.sub": "newValue"}}
  • {"$setOnInsert" : {"createdAt" : new Date()}}: can be useful for creating padding, initializing counters, and for collections that do not use ObjectIds.
  • findOneAndUpdate (after 4.2): can return the item and update it in a single operation
  • findOneAndReplace
  • findOneAndDelete
  • array operaters
    • {"$push": {"field": document}} adds elements to the end of an array if the array exists and creates a new array if it does not.
    • $each: e.g. {"$push" : {"hourly" : {"$each" : [562.776, 562.790, 559.123]}}})~
    • $slice can be used to create a queue in a document. e.g. {"$push" : {"last5" : {"$each" : ["Nightmare on Elm Street", "Saw"], "$slice" : -5}}})
    • $sort If you only want the array to grow to a certain length, you can use the $slice modifier effectively making a “top N” list of items.
    db.movies.updateOne(
        {"genre" : "horror"},
        {"$push" : {"top10" : {"$each" : [{"name" : "Nightmare on Elm Street",
    				       "rating" : 6.6},
    				      {"name" : "Saw", "rating" : 4.3}],
    			   "$slice" : -10,
    			   "$sort" : {"rating" : -1}}}}
    )
    
    • $addToSet: used to prevent duplicates
    • {"$pop" : {"key" : 1}} removes an element from the end of the array. {"$pop" : {"key" : -1}} removes it from the beginning.
    • $pull is used to remove elements of an array that match the given criteria.
    • update first match: e.g. set in comments array {"$set" : {"comments.$.author" : "Jim"}}
    • $arrayFilters
    db.blog.updateOne(
        {"post" : post_id },
        { $set: { "comments.$[elem].hidden" : true } },
        {
    	arrayFilters: [ { "elem.votes": { $lte: -5 } } ]
        }
    )
    // This command defines elem as the identifier for each matching element in the "comments" array.
    // If the votes value for the comment
    // identified by elem is less than or equal to -5, we will add a field called "hidden" to the
    // "comments" document and set its value to true.
    

upsert

Eliminating race condition: query->if exists->update

3 Server Administration

3.1 mongod options

3.2 Troubleshoot

  • getLastError

4 Indexing

4.1 Test Dataset

for (i=0; i<1000000; i++) {
    db.users.insertOne(
	{
	    "i" : i,
	    "username" : "user"+i,
	    "age" : Math.floor(Math.random()*120),
	    "created" : new Date()
	}
    );
}

Profiling

cursor.explain("executionStats")

Stages

  • FETCH: "FETCH" stage will retrieve the documents themselves and return them in batches as the client requests them.
  • SORT: means MongoDB would have been unable to sort the result set in the database using an index and instead would have had to do an in-memory sort.

Query Types

  • equality filter: {"age": 25}
  • multivalue filter: {"age": {"$gte": 40}}
  • sort component

4.2 Designing Indexes

  • selectivity: minimize the number of records scanned. "nReturned" is very close to "totalKeysExamined".

Cardinality

Cardinality refers to how many distinct values there are for a field in a collection. e.g. "gender" low cardinality. "username" high cardinality In general, the greater the cardinality of a field, the more helpful an index on that field can be

Designing a Compound Index

  1. Keys for equality filters should appear first.
  2. Keys used for sorting should appear before multivalue fields.
  3. Keys for multivalue filters should appear last.

4.3 Specifying an Index to Use

  • hint([indexname]) or hint([indexshape])

4.4 Other Tips

Inefficient Operators

  • $ne: they basically have to scan the entire index.
  • $not: will fall back to doing a table scan
  • $nin: always uses a table scan

$or

Performs two queries and then merges the results. In general, doing two queries and merging the results is much less efficient than doing a single query; thus, whenever possible, prefer $in to $or

4.5 Special Indexes

Index Options

  • unique. notice: unique index count null as a value.
  • partial

    Partial indexes in MongoDB are only created on a subset of the data(the key existed)

    db.users.ensureIndex({"email" : 1},
    		     {"partialFilterExpression": {email: {$exists: true}}})
    

    It is unlike sparse indexes on relational databases, which create fewer index entries pointing to a block of data.

  • Sparse

    "partial" index is the super set of "sparse" index. always prefer "partial" to "sparse"

Geospatial Indexes

Full Text Search

  • poorer write performance on text-indexed collections than on others
  • Create Text Indexes
    db.articles.createIndex({"title": "text",
    			 "body" : "text"},
    			{"weights" : {
    			    "title" : 3,
    			    "body" : 2}},
    			{"default_language" : "chinese"})
    
    // create not only indexes all top-level string fields,
    // but also searches embedded documents and arrays for string fields
    db.articles.createIndex({"$**" : "text"})
    
  • Text Search
    db.articles.find({"$text": {"$search": "impact crater lunar"}},
    		 {title: 1}
    		).limit(10)
    
    • $text: will tokenize the search string using whitespace and most punctuation as delimiters.

TTL Indexes

TTL indexes expire and remove data from normal collections based on the value of a date-typed field and a TTL value for the index.

db.sessions.createIndex({"lastUpdated" : 1}, {"expireAfterSeconds" : 60*60*24})

// change the expireAfterSeconds
db.runCommand({"collMod" : "someapp.cache",
	       "index" : {"keyPattern" : {"lastUpdated" : 1},
			  "expireAfterSeconds" : 3600 }});
  • MongoDB sweeps the TTL index once per minute

4.6 Special Collections

Capped Collections

like circular queues

  • TTL indexes are recommended over capped collections because they perform better with the WiredTiger storage engine
db.createCollection("my_collection",
		    {"capped" : true,
		     "size" : 100000, // collection has a fixed size of 100,000 bytes
		     "max": 100}); // max number of documents

// convert the test collection to a capped collection of 10,000 bytes
db.runCommand({"convertToCapped" : "test", "size" : 10000});

4.7 Tailable Cursors

Tailable cursors are a special type of cursor that are not closed when their results are exhausted. They were inspired by the tail -f command

4.8 Changing Indexes

  • Background indexing is much slower than foreground indexing.
  • 4.2 introduced a hybrid index build. It only holds the exclusive lock at the beginning and end of the index build.

4.9 GridFS

  • mongofiles: upload(put), download(get), list, search for, or delete files in GridFS.

PyMongo API

import pymongo
import gridfs
client = pymongo.MongoClient()
db = client.test
fs = gridfs.GridFS(db)
file_id = fs.put(b"Hello, world", filename="foo.txt")
fs.list() # ['foo.txt']
fs.get(file_id).read() # b'Hello, world'

Chunk

// chunk in fs.chunks
{
    "_id" : ObjectId("..."),
    "n" : 0,
    "data" : BinData("..."),
    "files_id" : ObjectId("...")
}
  • "files_id": The "_id" of the file document that contains the metadata for the file this chunk is from
  • "n": The chunk's position in the file, relative to the other chunks
  • "data": The bytes in this chunk of the file

File Metadata

  • "length": The total number of bytes making up the content of the file.
  • "chunkSize": The size of each chunk comprising the file, in bytes. The default is 255 KB, but this can be adjusted if needed.
  • "uploadDate": A timestamp representing when this file was stored in GridFS.
  • "md5": An MD5 checksum of this file’s contents, generated on the server side.

5 Aggregation

5.1 Reference

5.2 Expressions

  • Match: $match, $exists

    db.companies.aggregate([
      {$match: {founded_year: 2004}},
      {$limit: 5},
      {$project: {
        _id: 0,
        name: 1,
        founded_year: 1
      }}
    ])
    
  • Projection: $project
  • Array: $arrayElemAt, $slice, $filter, $unwind

    // bars is a list of 1440 records
    db.getCollection("minbar").aggregate([{$match: {"symbol": "btc.usdt/okex"}},
    				      {$limit: 50},
    				      {$project: {"_id": 0, "symbol": 1, "date": 1, "bars": 1,
    						  "open": {"$arrayElemAt": ["$bars.open", 0]},
    						  "high": {"$max": "$bars.high"},
    						  "low": {"$min": "$bars.low"},
    						  "close": {"$arrayElemAt": ["$bars.close", -1]},
    						  "volume": {"$sum": "$bars.volume"}}}])
    

5.3 Accumulators

$max, $min, $sum, $avg, $first, $last, $push, $addToSet, $mergeObjects

Project Stage

db.companies.aggregate([
    { $match: { "funding_rounds": { $exists: true, $ne: [ ]} } },
    { $project: {
	_id: 0,
	name: 1,
	largest_round: { $max: "$funding_rounds.raised_amount" }
    } }
])

Group Stage

Fundamental to the group stage is the "_id" field that we specify as part of the document. This is the value of the $group operator itself, using a very strict interpretation.

db.companies.aggregate([
    { $group: {
	_id: { founded_year: "$founded_year" },
	average_number_of_employees: { $avg: "$number_of_employees" }
    } },
    { $sort: { average_number_of_employees: -1 } }

])

db.companies.aggregate( [
    { $match: { "relationships.person": { $ne: null } } },
    { $project: { relationships: 1, _id: 0 } },
    { $unwind: "$relationships" },
    { $group: {
	_id: "$relationships.person",
	count: { $sum: 1 }
    } },
    { $sort: { count: -1 } }
]).pretty()

Complex Example

db.companies.aggregate([
    { $match: { funding_rounds: { $exists: true, $ne: [ ] } } },
    { $unwind: "$funding_rounds" },
    { $sort: { "funding_rounds.funded_year": 1,
	       "funding_rounds.funded_month": 1,
	       "funding_rounds.funded_day": 1 } },
    { $group: {
	_id: { company: "$name" },
	first_round: { $first: "$funding_rounds" },
	last_round: { $last: "$funding_rounds" },
	num_rounds: { $sum: 1 },
	total_raised: { $sum: "$funding_rounds.raised_amount" }
    } },
    { $project: {
	_id: 0,
	company: "$_id.company",
	first_round: {
	    amount: "$first_round.raised_amount",
	    article: "$first_round.source_url",
	    year: "$first_round.funded_year"
	},
	last_round: {
	    amount: "$last_round.raised_amount",
	    article: "$last_round.source_url",
	    year: "$last_round.funded_year"
	},
	num_rounds: 1,
	total_raised: 1,
    } },
    { $sort: { total_raised: -1 } }
] ).pretty()

5.4 Writing Results to a Collection

$out, $merge

  • $merge was introduced in MongoDB version 4.2 and is the preferred stage for writing to a collection

6 Transactions

6.1 Core API VS. Callback API

Core API Callback API
Requires explicit call to start the transaction and commit the transaction. Starts a transaction, executes the specified operations, and commits (or aborts on error).
Does not incorporate error-handling logic for TransientTransactionError and UnknownTransactionCommitResult, and instead provides the flexibility to incorporate custom error handling for these errors. Automatically incorporates error-handling logic for TransientTransactionError and UnknownTransactionCommitResult.
Requires explicit logical session to be passed to API for the specific transaction. Requires explicit logical session to be passed to API for the specific transaction.

6.2 Core API Example

uri = 'mongodb+srv://server.example.com/'
client = MongoClient(uriString)

my_wc_majority = WriteConcern('majority', wtimeout=1000)


client.get_database( "webshop",
		     write_concern=my_wc_majority).orders.insert_one({"sku":
		     "abc123", "qty":0})
client.get_database( "webshop",
		     write_concern=my_wc_majority).inventory.insert_one(
		     {"sku": "abc123", "qty": 1000})

def update_orders_and_inventory(my_session):
    orders = session.client.webshop.orders
    inventory = session.client.webshop.inventory


    with session.start_transaction(
	    read_concern=ReadConcern("snapshot"),
	    write_concern=WriteConcern(w="majority"),
	    read_preference=ReadPreference.PRIMARY):

	orders.insert_one({"sku": "abc123", "qty": 100}, session=my_session)
	inventory.update_one({"sku": "abc123", "qty": {"$gte": 100}},
			     {"$inc": {"qty": -100}}, session=my_session)
	commit_with_retry(my_session)

def commit_with_retry(session):
    while True:
	try:
	    # Commit uses write concern set at transaction start.
	    session.commit_transaction()
	    print("Transaction committed.")
	    break
	except (ConnectionFailure, OperationFailure) as exc:
	    # Can retry commit
	    if exc.has_error_label("UnknownTransactionCommitResult"):
		print("UnknownTransactionCommitResult, retrying "
		      "commit operation ...")
		continue
	    else:
		print("Error during commit ...")
		raise

def run_transaction_with_retry(txn_func, session):
    while True:
	try:
	    txn_func(session)  # performs transaction
	    break
	except (ConnectionFailure, OperationFailure) as exc:
	    # If transient error, retry the whole transaction
	    if exc.has_error_label("TransientTransactionError"):
		print("TransientTransactionError, retrying transaction ...")
		continue
	    else:
		raise

with client.start_session() as my_session:
    try:
	run_transaction_with_retry(update_orders_and_inventory, my_session)
    except Exception as exc:
	# Do something with error. The error handling code is not
	# implemented for you with the Core API.
	raise

6.3 Callback API Example

uriString = 'mongodb+srv://server.example.com/'
client = MongoClient(uriString)
def callback(my_session):
    orders = my_session.client.webshop.orders
    inventory = my_session.client.webshop.inventory

    # Important:: You must pass the session variable 'my_session' to
    # the operations.

    orders.insert_one({"sku": "abc123", "qty": 100}, session=my_session)
    inventory.update_one({"sku": "abc123", "qty": {"$gte": 100}},
			 {"$inc": {"qty": -100}}, session=my_session)

with client.start_session() as session:
    session.with_transaction(callback,
			     read_concern=ReadConcern('local'),
			     write_concern=my_write_concern_majority,
			     read_preference=ReadPreference.PRIMARY)

7 Design Patterns

see Summary

7.1 Embedding vs References

Embedding is better for… References are better for…
Small subdocuments Large subdocuments
Data that does not change regularly Volatile data
When eventual consistency is acceptable When immediate consistency is necessary
Documents that grow by a small amount Documents that grow by a large amount
Data that you'll often need to perform a second query to fetch Data that you'll often exclude from the results
Fast reads Fast writes

8 Replication

8.1 rs

rs is a global variable that contains replication helper functions

rsconf = {
    _id: "replSetName",
    members: [
	{_id: 0, host: "localhost:27017"},
	{_id: 1, host: "localhost:27018"},
	{_id: 2, host: "localhost:27019"}
    ]
}
rs.initiate(rsconf)
rs.status()
  • rs.add
  • rs.remove
  • rs.config
// change hostname
var config = rs.config()
config.members[0].host = "localhost:27017"
rs.reconfig(config)

9 Cluster

9.1 Tools

  • MongoDB Ops Manager