MongoDB
Table of Contents
1 Basic
1.1 CRUD
create, read, update, and delete
- create: insertOne
- read: find, findOne
- update: updateOne
- delete: deleteOne
1.2 Types
null, Boolean, Number, String, Date (a 64-bit integer of milliseconds since the epoch), Regular expression (query only), Array, Embedded document
- Binary data: Binary data is the only way to save non-UTF-8 strings to the database.
- Code: MongoDB also makes it possible to store arbitrary JavaScript in queries and documents:
{"x" : function() { /* ... */ }}
Object ID
12-byte ID for documents
| Bytes 0–3 | Bytes 4–8 | Bytes 9–11 |
|---|---|---|
| Timestamp (seconds since epoch) | Random value | Counter (random start value) |
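The layout above can be sketched by decoding a raw 12-byte ObjectId with only the standard library; the hex string below is a made-up example value, not one from a real database:

```python
import struct
from datetime import datetime, timezone

def decode_object_id(hex_id):
    """Split a 12-byte ObjectId into its three parts."""
    raw = bytes.fromhex(hex_id)
    assert len(raw) == 12
    timestamp = struct.unpack(">I", raw[0:4])[0]   # bytes 0-3: seconds since epoch
    random_part = raw[4:9]                         # bytes 4-8: random value
    counter = int.from_bytes(raw[9:12], "big")     # bytes 9-11: counter
    return datetime.fromtimestamp(timestamp, tz=timezone.utc), random_part, counter

created, rand, counter = decode_object_id("5f3c6f2b9d1e8a4b7c2d0e1f")
```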
1.3 Mongo Shell
conn = new Mongo("host:port")
db = conn.getDB("myDB")
- help
Useful Helpers
| Helper | Equivalent |
|---|---|
| use db | db.getSisterDB("db") |
| show dbs | db.getMongo().getDBs() |
| show collections | db.getCollectionNames() |
pwd()
run("ls", "-l", "/home/myUser/my-scripts/")
Custom Helper
- define helper function in js
- load js
typeof helper_function
Frequently Loaded Scripts
Use .mongorc.js in your home directory. This file is run whenever you start up the shell.
- A common use of .mongorc.js: remove some of the more "dangerous" shell helpers
var no = function() { print("Not on my watch."); };
// Prevent dropping databases
db.dropDatabase = DB.prototype.dropDatabase = no;
// Prevent dropping collections
DBCollection.prototype.drop = no;
// Prevent dropping an index
DBCollection.prototype.dropIndex = no;
// Prevent dropping indexes
DBCollection.prototype.dropIndexes = no;
- use the --norc option to disable .mongorc.js
Customizing Your Prompt
Just set the prompt variable
prompt = function() {
    if (typeof db == 'undefined') {
        return '(nodb)> ';
    }
    // Check the last db operation
    try {
        db.runCommand({getLastError: 1});
    } catch (e) {
        print(e);
    }
    return db + "> ";
};
1.4 Tools
mongoimport: importing raw data from a data feed or MySQL
2 CRUD
2.1 query
- some query condition:
$nin, $or, $exists, $regex (MongoDB uses Perl Compatible Regular Expressions)
querying arrays
- contains
db.food.insertOne({"fruit" : ["apple", "banana", "peach"]})
db.food.find({"fruit" : "banana"}) // successfully matches the document
- contains all:
  db.food.find({"fruit" : {"$all" : ["apple", "banana"]}})
- exact match:
  db.food.find({"fruit" : ["apple", "banana", "peach"]})
- index match (match a specific position):
  db.food.find({"fruit.2" : "peach"})
- $size: cannot be combined with conditionals such as $gt
- $slice: return a subset of an array, e.g.
  db.blog.posts.findOne(criteria, {"comments" : {"$slice" : [23, 10]}})
  db.blog.posts.findOne(criteria, {"comments" : {"$slice" : -1}})
- positional projection:
  db.blog.posts.find({"comments.name" : "bob"}, {"comments.$" : 1})
  returns only the first matching element, via the $ operator
- range query on array elements:
db.test.find({"x" : {"$elemMatch" : {"$gt" : 10, "$lt" : 20}}})
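For intuition, the "$slice" projection values above map onto ordinary list slicing; a sketch using a plain Python list as a stand-in for a comments array (an analogy only, not a MongoDB API):

```python
comments = list(range(100))   # stand-in for a 100-element comments array

# {"$slice": [23, 10]}: skip 23 elements, return the next 10
skip, limit = 23, 10
page = comments[skip:skip + limit]

# {"$slice": -1}: return the last element
last = comments[-1:]
```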
query embedded documents
db.people.find({"name.first" : "Joe", "name.last" : "Schmoe"})
$where
- Allows you to execute arbitrary JavaScript as part of your query (insecure).
"$where" queries should not be used unless strictly necessary: they are much slower than regular queries. Each document has to be converted from BSON to a JavaScript object and then run through the "$where" expression.
query options
limit, skip, sort
- Avoiding large skips: large skips force the server to scan and discard all skipped results, so they are not very efficient; prefer range queries on an indexed field using the last result of the previous page.
immortal cursors
Tells the database not to time out the cursor.
2.2 insert
insertMany
- Batch inserts limit: 48 MB. Drivers will split up the batch insert into multiple 48 MB batch inserts
- ordered vs. unordered
- Executing an ordered list of operations on a sharded collection will generally be slower than executing an unordered list since with an ordered list, each operation must wait for the previous operation to finish.
- For unordered inserts, MongoDB attempts to insert all documents, regardless of whether some insertions produce errors, and can execute the operations in parallel.
insert
Old method from before MongoDB 3.0. Use insertOne and insertMany instead.
2.3 delete
- deleteOne
- deleteMany
- Clear an entire collection: drop
2.4 update
Updating a document is atomic
- updateOne
- updateMany
- replaceOne: takes a filter as the first parameter, but as the second parameter expects a full document with which it will replace the document matching the filter.
This can be useful to do a dramatic schema migration.
update operators
- {"$inc": {"field": 1}}: creates the field or increments it by 1
- {"$set": {"field": "newValue"}} (new or existing field), {"$unset": {"field": 1}}
- change embedded documents:
  {"$set": {"field.sub": "newValue"}}
- {"$setOnInsert" : {"createdAt" : new Date()}}: useful for creating padding, initializing counters, and for collections that do not use ObjectIds
- findOneAndUpdate (since 4.2): can return the item and update it in a single operation
- findOneAndReplace
- findOneAndDelete
- array operators
- {"$push": {"field": value}}: adds an element to the end of an array if the array exists, and creates the array if it does not
- $each: push multiple values at once, e.g.
  {"$push" : {"hourly" : {"$each" : [562.776, 562.790, 559.123]}}}
- $slice: combined with $push, caps the array length, which can be used to create a queue in a document, e.g.
  {"$push" : {"last5" : {"$each" : ["Nightmare on Elm Street", "Saw"], "$slice" : -5}}}
- $sort: if you only want the array to grow to a certain length, use $sort with the $slice modifier, effectively making a "top N" list of items:
db.movies.updateOne(
    {"genre" : "horror"},
    {"$push" : {"top10" : {
        "$each" : [{"name" : "Nightmare on Elm Street", "rating" : 6.6},
                   {"name" : "Saw", "rating" : 4.3}],
        "$slice" : -10,
        "$sort" : {"rating" : -1}}}}
)
- $addToSet: used to prevent duplicates
- {"$pop" : {"key" : 1}} removes an element from the end of the array; {"$pop" : {"key" : -1}} removes it from the beginning
- $pull: removes all elements of an array that match the given criteria
- update the first match via the positional operator, e.g. set a field in the matching element of the comments array:
  {"$set" : {"comments.$.author" : "Jim"}}
- $arrayFilters
db.blog.updateOne(
    {"post" : post_id},
    {$set: {"comments.$[elem].hidden" : true}},
    {arrayFilters: [{"elem.votes": {$lte: -5}}]}
)
// This defines elem as the identifier for each matching element in the
// "comments" array. If the votes value for the comment identified by
// elem is less than or equal to -5, we add a field called "hidden" to
// that "comments" document and set its value to true.
upsert
An upsert eliminates the race condition in the query -> if exists -> update -> else insert pattern by making the check and the write a single atomic operation.
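The race lives in the gap between the check and the write; a toy Python sketch of the difference, with a lock standing in for server-side atomicity (a simplified model, not MongoDB code):

```python
import threading

store = {}
lock = threading.Lock()

def racy_increment(key):
    # query -> if exists -> update: another thread can interleave
    # between the membership check and the write, losing an update.
    if key in store:
        store[key] += 1
    else:
        store[key] = 1

def upsert_increment(key):
    # upsert: check-and-write happens as one atomic step
    with lock:
        store[key] = store.get(key, 0) + 1
```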
3 Server Administration
3.1 mongod options
3.2 Troubleshoot
getLastError
4 Indexing
4.1 Test Dataset
for (i = 0; i < 1000000; i++) {
    db.users.insertOne(
        {
            "i" : i,
            "username" : "user" + i,
            "age" : Math.floor(Math.random()*120),
            "created" : new Date()
        }
    );
}
Profiling
cursor.explain("executionStats")
Stages
- FETCH: "FETCH" stage will retrieve the documents themselves and return them in batches as the client requests them.
- SORT: means MongoDB would have been unable to sort the result set in the database using an index and instead would have had to do an in-memory sort.
Query Types
- equality filter: {"age": 25}
- multivalue filter: {"age": {"$gte": 40}}
- sort component
4.2 Designing Indexes
- selectivity: minimize the number of records scanned; an index is selective when "nReturned" is very close to "totalKeysExamined".
Cardinality
Cardinality refers to how many distinct values there are for a field in a collection. For example, "gender" has low cardinality, while "username" has high cardinality. In general, the greater the cardinality of a field, the more helpful an index on that field can be.
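A quick way to see cardinality is to count distinct values over a sample of documents; the toy data below is hypothetical:

```python
docs = [{"username": "user%d" % i, "gender": "F" if i % 2 else "M"}
        for i in range(1000)]

def cardinality(docs, field):
    """Number of distinct values of a field across the documents."""
    return len({d[field] for d in docs})

low = cardinality(docs, "gender")      # low cardinality: a poor index key
high = cardinality(docs, "username")   # high cardinality: a good index key
```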
Designing a Compound Index
- Keys for equality filters should appear first.
- Keys used for sorting should appear before multivalue fields.
- Keys for multivalue filters should appear last.
4.3 Specifying an Index to Use
hint(indexName) or hint(indexShape)
4.4 Other Tips
Inefficient Operators
- $ne: basically has to scan the entire index
- $not: will often fall back to doing a table scan
- $nin: always uses a table scan
$or
Performs two queries and then merges the results. In general, doing two queries and merging the results
is much less efficient than doing a single query; thus, whenever possible, prefer $in to $or
4.5 Special Indexes
Index Options
- unique. Note: a unique index counts null as a value, so you cannot insert multiple documents that are missing the indexed key.
- partial
Partial indexes in MongoDB are created only on a subset of the data (e.g. documents where the key exists):
db.users.createIndex({"email" : 1}, {"partialFilterExpression": {email: {$exists: true}}})
It is unlike sparse indexes on relational databases, which create fewer index entries pointing to a block of data.
- Sparse
A "partial" index is a superset of a "sparse" index; always prefer "partial" to "sparse".
Geospatial Indexes
Full Text Search
- poorer write performance on text-indexed collections than on others
- Create Text Indexes
db.articles.createIndex({"title": "text", "body" : "text"},
                        {"weights" : {"title" : 3, "body" : 2},
                         "default_language" : "chinese"})
// A wildcard text index not only indexes all top-level string fields,
// but also searches embedded documents and arrays for string fields:
db.articles.createIndex({"$**" : "text"})
- Text Search
db.articles.find({"$text": {"$search": "impact crater lunar"}},
                 {title: 1}).limit(10)
$text: will tokenize the search string using whitespace and most punctuation as delimiters.
TTL Indexes
TTL indexes expire and remove data from normal collections based on the value of a date-typed field and a TTL value for the index.
db.sessions.createIndex({"lastUpdated" : 1}, {"expireAfterSeconds" : 60*60*24})
// change expireAfterSeconds later:
db.runCommand({"collMod" : "someapp.cache",
               "index" : {"keyPattern" : {"lastUpdated" : 1},
                          "expireAfterSeconds" : 3600}});
- MongoDB sweeps the TTL index once per minute
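A simplified model of what one sweep does (an illustration, not MongoDB's implementation): drop every document whose indexed date field is older than expireAfterSeconds.

```python
import time

def ttl_sweep(docs, field, expire_after_seconds, now=None):
    """Keep only documents whose date field is still within the TTL."""
    now = time.time() if now is None else now
    return [d for d in docs if now - d[field] <= expire_after_seconds]

# With now=100 and a 60-second TTL, the document dated 0 has expired.
alive = ttl_sweep([{"lastUpdated": 0}, {"lastUpdated": 90}],
                  "lastUpdated", 60, now=100)
```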
4.6 Special Collections
Capped Collections
like circular queues
- TTL indexes are recommended over capped collections because they perform better with the WiredTiger storage engine
db.createCollection("my_collection",
                    {"capped" : true,
                     "size" : 100000,  // collection has a fixed size of 100,000 bytes
                     "max" : 100});    // max number of documents
// convert the test collection to a capped collection of 10,000 bytes
db.runCommand({"convertToCapped" : "test", "size" : 10000});
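The circular-queue behavior of a "max"-bounded capped collection can be sketched with the standard library's deque (an analogy only; real capped collections are also bounded by "size" in bytes):

```python
from collections import deque

capped = deque(maxlen=3)          # stand-in for {"capped": true, "max": 3}
for i in range(5):
    capped.append({"_id": i})     # inserting past the cap evicts the oldest

remaining = list(capped)          # only the three most recent documents survive
```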
4.7 Tailable Cursors
Tailable cursors are a special type of cursor that are not closed when their results are exhausted.
They were inspired by the tail -f command
4.8 Changing Indexes
- Background indexing is much slower than foreground indexing.
- 4.2 introduced a hybrid index build. It only holds the exclusive lock at the beginning and end of the index build.
4.9 GridFS
- mongofiles: upload(put), download(get), list, search for, or delete files in GridFS.
PyMongo API
import pymongo
import gridfs

client = pymongo.MongoClient()
db = client.test
fs = gridfs.GridFS(db)
file_id = fs.put(b"Hello, world", filename="foo.txt")
fs.list()               # ['foo.txt']
fs.get(file_id).read()  # b'Hello, world'
Chunk
// a chunk in fs.chunks
{
    "_id" : ObjectId("..."),
    "files_id" : ObjectId("..."),
    "n" : 0,
    "data" : BinData("...")
}
- "files_id": The "_id" of the file document that contains the metadata for the file this chunk is from
- "n": The chunk's position in the file, relative to the other chunks
- "data": The bytes in this chunk of the file
File Metadata
- "length": The total number of bytes making up the content of the file.
- "chunkSize": The size of each chunk comprising the file, in bytes. The default is 255 KB, but this can be adjusted if needed.
- "uploadDate": A timestamp representing when this file was stored in GridFS.
- "md5": An MD5 checksum of this file’s contents, generated on the server side.
5 Aggregation
5.1 Reference
5.2 Expressions
Match: $match, $exists
db.companies.aggregate([
    {$match: {founded_year: 2004}},
    {$limit: 5},
    {$project: {_id: 0, name: 1, founded_year: 1}}
])
- Projection: $project
Array: $arrayElemAt, $slice, $filter, $unwind
// bars is a list of 1440 records
db.getCollection("minbar").aggregate([
    {$match: {"symbol": "btc.usdt/okex"}},
    {$limit: 50},
    {$project: {"_id": 0, "symbol": 1, "date": 1, "bars": 1,
                "open": {"$arrayElemAt": ["$bars.open", 0]},
                "high": {"$max": "$bars.high"},
                "low": {"$min": "$bars.low"},
                "close": {"$arrayElemAt": ["$bars.close", -1]},
                "volume": {"$sum": "$bars.volume"}}}
])
5.3 Accumulators
$max, $min, $sum, $avg, $first, $last, $push, $addToSet, $mergeObjects
Project Stage
db.companies.aggregate([
    {$match: {"funding_rounds": {$exists: true, $ne: []}}},
    {$project: {
        _id: 0,
        name: 1,
        largest_round: {$max: "$funding_rounds.raised_amount"}
    }}
])
Group Stage
Fundamental to the group stage is the "_id" field we specify as part of the document: it defines the grouping key, and all documents sharing the same "_id" value are aggregated together.
db.companies.aggregate([
    {$group: {
        _id: {founded_year: "$founded_year"},
        average_number_of_employees: {$avg: "$number_of_employees"}
    }},
    {$sort: {average_number_of_employees: -1}}
])

db.companies.aggregate([
    {$match: {"relationships.person": {$ne: null}}},
    {$project: {relationships: 1, _id: 0}},
    {$unwind: "$relationships"},
    {$group: {_id: "$relationships.person", count: {$sum: 1}}},
    {$sort: {count: -1}}
]).pretty()
Complex Example
db.companies.aggregate([
    {$match: {funding_rounds: {$exists: true, $ne: []}}},
    {$unwind: "$funding_rounds"},
    {$sort: {"funding_rounds.funded_year": 1,
             "funding_rounds.funded_month": 1,
             "funding_rounds.funded_day": 1}},
    {$group: {
        _id: {company: "$name"},
        first_round: {$first: "$funding_rounds"},
        last_round: {$last: "$funding_rounds"},
        num_rounds: {$sum: 1},
        total_raised: {$sum: "$funding_rounds.raised_amount"}
    }},
    {$project: {
        _id: 0,
        company: "$_id.company",
        first_round: {
            amount: "$first_round.raised_amount",
            article: "$first_round.source_url",
            year: "$first_round.funded_year"
        },
        last_round: {
            amount: "$last_round.raised_amount",
            article: "$last_round.source_url",
            year: "$last_round.funded_year"
        },
        num_rounds: 1,
        total_raised: 1
    }},
    {$sort: {total_raised: -1}}
]).pretty()
5.4 Writing Results to a Collection
$out, $merge
- $merge was introduced in MongoDB version 4.2 and is the preferred stage for writing to a collection
6 Transactions
6.1 Core API VS. Callback API
| Core API | Callback API |
|---|---|
| Requires explicit call to start the transaction and commit the transaction. | Starts a transaction, executes the specified operations, and commits (or aborts on error). |
| Does not incorporate error-handling logic for TransientTransactionError and UnknownTransactionCommitResult, and instead provides the flexibility to incorporate custom error handling for these errors. | Automatically incorporates error-handling logic for TransientTransactionError and UnknownTransactionCommitResult. |
| Requires explicit logical session to be passed to API for the specific transaction. | Requires explicit logical session to be passed to API for the specific transaction. |
6.2 Core API Example
from pymongo import MongoClient, ReadPreference
from pymongo.errors import ConnectionFailure, OperationFailure
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

uriString = 'mongodb+srv://server.example.com/'
client = MongoClient(uriString)

my_wc_majority = WriteConcern('majority', wtimeout=1000)
client.get_database("webshop", write_concern=my_wc_majority).orders.insert_one(
    {"sku": "abc123", "qty": 0})
client.get_database("webshop", write_concern=my_wc_majority).inventory.insert_one(
    {"sku": "abc123", "qty": 1000})

def update_orders_and_inventory(my_session):
    orders = my_session.client.webshop.orders
    inventory = my_session.client.webshop.inventory
    with my_session.start_transaction(
            read_concern=ReadConcern("snapshot"),
            write_concern=WriteConcern(w="majority"),
            read_preference=ReadPreference.PRIMARY):
        orders.insert_one({"sku": "abc123", "qty": 100}, session=my_session)
        inventory.update_one({"sku": "abc123", "qty": {"$gte": 100}},
                             {"$inc": {"qty": -100}},
                             session=my_session)
        commit_with_retry(my_session)

def commit_with_retry(session):
    while True:
        try:
            # Commit uses the write concern set at transaction start.
            session.commit_transaction()
            print("Transaction committed.")
            break
        except (ConnectionFailure, OperationFailure) as exc:
            # Can retry commit
            if exc.has_error_label("UnknownTransactionCommitResult"):
                print("UnknownTransactionCommitResult, retrying "
                      "commit operation ...")
                continue
            else:
                print("Error during commit ...")
                raise

def run_transaction_with_retry(txn_func, session):
    while True:
        try:
            txn_func(session)  # performs transaction
            break
        except (ConnectionFailure, OperationFailure) as exc:
            # If transient error, retry the whole transaction
            if exc.has_error_label("TransientTransactionError"):
                print("TransientTransactionError, retrying transaction ...")
                continue
            else:
                raise

with client.start_session() as my_session:
    try:
        run_transaction_with_retry(update_orders_and_inventory, my_session)
    except Exception as exc:
        # Do something with error. The error handling code is not
        # implemented for you with the Core API.
        raise
6.3 Callback API Example
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

uriString = 'mongodb+srv://server.example.com/'
client = MongoClient(uriString)
my_write_concern_majority = WriteConcern('majority', wtimeout=1000)

def callback(my_session):
    orders = my_session.client.webshop.orders
    inventory = my_session.client.webshop.inventory
    # Important: you must pass the session variable 'my_session' to
    # the operations.
    orders.insert_one({"sku": "abc123", "qty": 100}, session=my_session)
    inventory.update_one({"sku": "abc123", "qty": {"$gte": 100}},
                         {"$inc": {"qty": -100}},
                         session=my_session)

with client.start_session() as session:
    session.with_transaction(callback,
                             read_concern=ReadConcern('local'),
                             write_concern=my_write_concern_majority,
                             read_preference=ReadPreference.PRIMARY)
7 Design Patterns
see Summary
7.1 Embedding vs References
| Embedding is better for… | References are better for… |
|---|---|
| Small subdocuments | Large subdocuments |
| Data that does not change regularly | Volatile data |
| When eventual consistency is acceptable | When immediate consistency is necessary |
| Documents that grow by a small amount | Documents that grow by a large amount |
| Data that you'll often need to perform a second query to fetch | Data that you'll often exclude from the results |
| Fast reads | Fast writes |
8 Replication
8.1 rs
rs is a global variable that contains replication helper functions
rsconf = {
    _id: "replSetName",
    members: [
        {_id: 0, host: "localhost:27017"},
        {_id: 1, host: "localhost:27018"},
        {_id: 2, host: "localhost:27019"}
    ]
}
rs.initiate(rsconf)
rs.status()
- rs.add
- rs.remove
- rs.config
// change a member's hostname
var config = rs.config()
config.members[0].host = "localhost:27017"
rs.reconfig(config)
9 Cluster
9.1 Tools
- MongoDB Ops Manager