Version 2.3.6
NetGuardians SA <info@netguardians.ch>
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Shards & Replicas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3. Exploring Your Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1. Cluster Health. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2. List All Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3. Create an Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4. Index and Query a Document. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.5. Delete an Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4. Modifying Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1. Updating Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2. Deleting Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3. Batch Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5. Exploring Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.1. The Search API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2. Introducing the Query Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.3. Executing Searches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.4. Executing Filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.5. Executing Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7. NG|Screener Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.2. Index Naming Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.3. Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.4. Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.5. System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Glossary of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8. Structuring Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
9. Values Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10. Bucket Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
10.1. Children Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
10.2. Date Histogram Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
10.3. Date Range Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
10.4. Diversified Sampler Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
10.5. Filter Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
10.6. Filters Aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.7. Geo Distance Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.8. GeoHash Grid Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
10.9. Global Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
10.10. Histogram Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.11. IP Range Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
10.12. Missing Aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
10.13. Nested Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
10.14. Range Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
10.15. Reverse Nested Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.16. Sampler Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.17. Significant Terms Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
10.18. Terms Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11. Matrix Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.1. Matrix Stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
12. Metrics Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
12.1. Average Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
12.2. Cardinality Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
12.3. Extended Stats Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
12.4. Geo Bounds Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
12.5. Geo Centroid Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
12.6. Max Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
12.7. Min Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
12.8. Percentiles Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
12.9. Percentile Ranks Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
12.10. Percentile Ranks Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
12.11. Scripted Metric Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
12.12. Stats Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.13. Sum Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.14. Top Hits Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
12.15. Value Count Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
13. Pipeline Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
13.1. Average Bucket Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.2. Bucket Script Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
13.3. Bucket Selector Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
13.4. Cumulative Sum Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
13.5. Derivative Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
13.6. Extended Stats Bucket Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
13.7. Maximum Bucket Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
13.8. Minimum Bucket Aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
13.9. Moving Average Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
13.10. Percentiles Bucket Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
13.11. Serial Differencing Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
13.12. Stats Bucket Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
13.13. Sum Bucket Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
14. Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
14.1. Caching Heavy Aggregations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
14.2. Returning Only Aggregation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
14.3. Aggregation Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
15. Index Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
16. Search Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
17. Anatomy of an Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
17.1. Character Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
17.2. Tokenizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
17.3. Token Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
18. Analyzers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
18.1. Configuring Built-in Analyzers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
18.2. Custom Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
18.3. Fingerprint Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
18.4. Keyword Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
18.5. Language Analyzers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
18.6. Pattern Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
18.7. Simple Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
18.8. Standard Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
18.9. Stop Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
18.10. Whitespace Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
19. Character Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
19.1. HTML Strip Char Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
19.2. Mapping Char Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
19.3. Pattern Replace Char Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
20. Token Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
21. Tokenizers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
22. Testing Analyzers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Advanced Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
23. Catalog APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
23.1. Cat Aliases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
23.2. Cat Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
23.3. Cat Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
23.4. Cat Fielddata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
23.5. Cat Health . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
23.6. Cat Indices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
23.7. Cat Master . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
23.8. Cat Nodeattrs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
23.9. Cat Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
23.10. Cat Pending Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
23.11. Cat Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
23.12. Cat Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
23.13. Cat Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
23.14. Cat Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
23.15. Cat Shards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
23.16. Cat Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
23.17. Cat Thread Pool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
24. Cluster APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
24.1. Node Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
24.2. Cluster Allocation Explain API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
24.3. Cluster Health. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
24.4. Nodes Info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
24.5. Nodes Stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
24.6. Pending Cluster Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
24.7. Cluster Reroute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
24.8. Cluster State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
24.9. Cluster Stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
24.10. Task Management API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
24.11. Cluster Update Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
25. Document APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
25.1. Bulk API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
25.2. Delete API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
25.3. Delete By Query API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
25.4. Get API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
25.5. Index API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
25.6. Multi Get API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
25.7. Multi Termvectors API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
25.8. Refresh API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
25.9. Reindex API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
25.10. Term Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
25.11. Update API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
25.12. Update By Query API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
26. Index Modules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
26.1. Index Shard Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
26.2. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
26.3. Mapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
26.4. Merge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
26.5. Similarity Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
26.6. Slow Log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
26.7. Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
26.8. Translog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
27. Indices APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
27.1. Index Aliases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
27.2. Analyze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
27.3. Clear Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
27.4. Create Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
27.5. Delete Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
27.6. Flush . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
27.7. Force Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
27.8. Get Field Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
27.9. Get Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
27.10. Get Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
27.11. Get Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
27.12. Indices Exists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
27.13. Open / Close Index API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
27.14. Put Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
27.15. Indices Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
27.16. Refresh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
27.17. Rollover Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
27.18. Indices Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
27.19. Shadow Replica Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
27.20. Indices Shard Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
27.21. Shrink Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
27.22. Indices Stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
27.23. Index Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
27.24. Types Exists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
27.25. Update Indices Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
27.26. Upgrade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
Ingest Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
28. Pipeline Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
29. Ingest APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
29.1. Put Pipeline API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
29.2. Get Pipeline API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
29.3. Delete Pipeline API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
29.4. Simulate Pipeline API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
30. Accessing Data in Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
31. Handling Failures in Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
32. Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
33. Mapping Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
34. Dynamic Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
34.1. Default Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
34.2. Dynamic Field Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
34.3. Dynamic Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
35. Meta-Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
35.1. All Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
35.2. ID Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
35.3. Index Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
35.4. Meta Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
35.5. Parent Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
35.6. Routing field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
35.7. Source Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
35.8. Type Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
35.9. UID Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
36. Mapping Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
36.1. Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
36.2. Boost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
36.3. Coerce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
36.4. Copy-To . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
36.5. Doc Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
36.6. Dynamic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
36.7. Enabled . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
36.8. Field-Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
36.9. Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
36.10. Geohash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
36.11. Ignore Above . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
36.12. Ignore Malformed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
36.13. Include In All . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
36.14. Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
36.15. Index Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
36.16. Multi-Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
36.17. Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
36.18. Null Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
36.19. Position Increment Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
36.20. Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
36.21. Search Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
36.22. Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
36.23. Term Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
37. Field Data-types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
38. Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
38.1. Shard Allocation Awareness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
38.2. Shard Allocation Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
38.3. Disk-Based Shard Allocation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
38.4. Miscellaneous Cluster Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
38.5. Cluster Level Shard Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
39. Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
39.1. Azure Classic Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
39.2. EC2 Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
39.3. Google Compute Engine Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
39.4. Zen Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
40. Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
40.1. Circuit Breaker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
40.2. Fielddata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
40.3. Indexing Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
40.4. Node Query Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
40.5. Indices Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
40.6. Shard Request Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
41. Scripting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
41.1. Advanced Text Scoring in Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
41.2. Lucene Expressions Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
41.3. Accessing Document Fields and Special Variables . . . . . . . . . . . . . . . . . . . . . . . 532
41.4. Groovy Scripting Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
41.5. Native (Java) Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
41.6. Painless Scripting Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
41.7. Painless Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
41.8. Scripting and Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
41.9. How to Use Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
42. Advanced Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
42.1. Local Gateway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
42.2. HTTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
42.3. Memcached . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
42.4. Network Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
42.5. Node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
42.6. Thread Pool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
42.7. Transport. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
42.8. Tribe Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
Query DSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
43. Query DSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
43.1. Bool Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
43.2. Boosting Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
43.3. Common Terms Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
43.4. Compound Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
43.5. Constant Score Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
43.6. Dis Max Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
43.7. Exists Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
43.8. Full Text Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
43.9. Function Score Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
43.10. Geo Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
43.11. Has Child Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
43.12. Has Parent Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
43.13. IDs Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
43.14. Joining Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
43.15. Match All Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
43.16. Match Phrase Prefix Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
43.17. Match Phrase Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
43.18. Match Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
43.19. Minimum Should Match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
43.20. More Like This Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
43.21. Multi Match Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
43.22. Multi Term Query Rewrite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
43.23. Nested Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
43.24. Parent ID Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
43.25. Percolate Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
43.26. Prefix Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
43.27. Query and Filter Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
43.28. Query String Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
43.29. Query String Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
43.30. Range Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
43.31. Regexp Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
43.32. Regular Expression Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
43.33. Script Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
43.34. Simple Query String Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
43.35. Span Containing Query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
43.36. Span First Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
43.37. Span Multi Term Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
43.38. Span Near Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
43.39. Span Not Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
43.40. Span Or Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
43.41. Span Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
43.42. Span Term Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
43.43. Span Within Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
43.44. Specialized Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
43.45. Template Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
43.46. Term Level Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
43.47. Term Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
43.48. Terms Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
43.49. Type Query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
43.50. Wildcard Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Search APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
44. Search APIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
44.1. Request Body Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
44.2. Suggesters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
44.3. Count API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697
44.4. Explain API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
44.5. Multi Search API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
44.6. Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
44.7. Search Shards API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
44.8. Search Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
44.9. URI Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
44.10. Validate API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
Setup NG|Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
45. Installing NG|Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
45.1. Checking that NG|Storage is running . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
45.2. Install NG|Storage with Debian Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 742
45.3. SysV init vs systemd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
45.4. Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
45.5. Running NG|Storage with systemd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
45.6. Install NG|Storage on Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
45.7. Install NG|Storage with .zip or .tar.gz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
46. Important System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
46.1. Configuring System Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
46.2. File Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
46.3. Set JVM heap Size via jvm.options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
46.4. Disable Swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
46.5. Number of Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
46.6. Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
47. Bootstrap Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
48. Configuring NG|Storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
49. Important NG|Storage Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
NG|Storage Admin Guide
You can download a printable version of this guide: NGStorage_Admin_Guide-2.3.6.pdf
Disclaimer
All material in these pages, including text, layout, presentation, logos, icons,
photos, and all other artwork is the Intellectual Property of NetGuardians SA,
unless otherwise stated, and subject to NetGuardians SA copyright. No
commercial use of any material is authorized without the express permission of
NetGuardians SA. Information contained in, or derived from these pages must not
be used for development, production, marketing or any other act, which infringes
copyright. This document is for informational purposes only. NetGuardians SA
makes no warranties, expressed or implied, in this document.
Preface | 1
Chapter 1. Introduction
NG|Storage is a highly scalable open-source full-text search and analytics engine. It allows
you to store, search, and analyze big volumes of data quickly and in near real time. It is
generally used as the underlying engine/technology that powers applications that have
complex search features and requirements.
Here are a few sample use-cases that NG|Storage could be used for:
• You run an online web store where you allow your customers to search for products that
you sell. In this case, you can use NG|Storage to store your entire product catalog and
inventory and provide search and autocomplete suggestions for them.
• You want to collect log or transaction data and you want to analyze and mine this data to
look for trends, statistics, summarizations, or anomalies. In this case, you can collect,
aggregate, and parse your data, and then feed this data into NG|Storage. Once the data
is in NG|Storage, you can run searches and aggregations to mine any information that is
of interest to you.
• You run a price alerting platform which allows price-savvy customers to specify a rule
like "I am interested in buying a specific electronic gadget and I want to be notified if the
price of gadget falls below $X from any vendor within the next month". In this case you
can scrape vendor prices, push them into NG|Storage and use its reverse-search
capability to match price movements against customer queries and eventually push the
alerts out to the customer once matches are found.
For the rest of this tutorial, I will guide you through the process of getting NG|Storage up
and running, taking a peek inside it, and performing basic operations like indexing,
searching, and modifying your data. At the end of this guide, you should have a good idea of
what NG|Storage is, how it works, and hopefully be inspired to see how you can use it to
either build sophisticated search applications or to mine intelligence from your data.
Basic Concepts
There are a few concepts that are core to NG|Storage. Understanding these concepts from
the outset will tremendously help ease the learning process.
NG|Storage is a near real time search platform. What this means is there is a slight latency
(normally one second) from the time you index a document until the time it becomes
searchable.
Cluster
A cluster is a collection of one or more nodes (servers) that together hold your entire data
and provides federated indexing and search capabilities across all nodes. A cluster is
identified by a unique name which by default is "NGELK". This name is important because a
node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster names in different environments,
otherwise you might end up with nodes joining the wrong cluster. For instance you could
use logging-dev, logging-stage, and logging-prod for the development, staging,
and production clusters.
Note that it is valid and perfectly fine to have a cluster with only a single node in it.
Furthermore, you may also have multiple independent clusters each with its own unique
cluster name.
Node
A node is a single server that is part of your cluster, stores your data, and participates in
the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a
name which by default is a random Marvel character name that is assigned to the node at
startup. You can define any node name you want if you do not want the default. This name
is important for administration purposes where you want to identify which servers in your
network correspond to which nodes in your NG|Storage cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each
node is set up to join a cluster named NGELK, which means that if you start up a number of
nodes on your network and, assuming they can discover each other, they will all
automatically form and join a single cluster named NGELK.
In a single cluster, you can have as many nodes as you want. Furthermore, if there are no
other NG|Storage nodes currently running on your network, starting a single node will by
default form a new single-node cluster named NGELK.
Index
An index is a collection of documents that have somewhat similar characteristics. For
example, you can have an index for customer data, another index for a product catalog, and
yet another index for order data. An index is identified by a name, and this name is used to
refer to the index when performing indexing, search, update, and delete operations against
the documents in it.
Type
Within an index, you can define one or more types. A type is a logical category/partition of
your index whose semantics is completely up to you. In general, a type is defined for
documents that have a set of common fields. For example, let’s assume you run a blogging
platform and store all your data in a single index. In this index, you may define a type for
user data, another type for blog data, and yet another type for comments data.
Document
A document is a basic unit of information that can be indexed. For example, you can have a
document for a single customer, another document for a single product, and yet another
for a single order. This document is expressed in JSON (JavaScript Object Notation) which
is a ubiquitous internet data interchange format.
Within an index/type, you can store as many documents as you want. Note that although a
document physically resides in an index, a document actually must be indexed/assigned to
a type inside an index.
Chapter 2. Shards & Replicas
An index can potentially store a large amount of data that can exceed the hardware limits of
a single node. For example, a single index of a billion documents taking up 1TB of disk
space may not fit on the disk of a single node or may be too slow to serve search requests
from a single node alone.
To solve this problem, NG|Storage provides the ability to subdivide your index into multiple
pieces called shards. When you create an index, you can simply define the number of
shards that you want. Each shard is in itself a fully-functional and independent "index" that
can be hosted on any node in the cluster.
The mechanics of how a shard is distributed and also how its documents are aggregated
back into search requests are completely managed by NG|Storage and are transparent to
you as the user.
In an environment where failures can be expected at any time, it is very useful to have a
failover mechanism: NG|Storage allows you to make one or more copies of your index’s
shards into what are called replica shards, or replicas for short. Replication is important
for two primary reasons:
• It provides high availability in case a shard/node fails. For this reason, it is important to
note that a replica shard is never allocated on the same node as the original/primary
shard that it was copied from.
• It allows you to scale out your search volume/throughput since searches can be
executed on all replicas in parallel.
To summarize, each index can be split into multiple shards. An index can also be replicated
zero (meaning no replicas) or more times. Once replicated, each index will have primary
shards (the original shards that were replicated from) and replica shards (the copies of the
primary shards). The number of shards and replicas can be defined per index at the time
Chapter 2. Shards & Replicas | 5
the index is created. After the index is created, you may change the number of replicas
dynamically at any time, but you cannot change the number of shards after the fact.
By default, each index in NG|Storage is allocated 5 primary shards and 1 replica which
means that if you have at least two nodes in your cluster, your index will have 5 primary
shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
With that out of the way, let’s get started with the fun part…
Also note the line marked http with information about the HTTP address (192.168.8.112)
and port (9200) that our node is reachable from. By default, NG|Storage uses port 9200 to
provide access to its REST API. This port is configurable if necessary.
NG|Screener Standards
The default name of the cluster in this version of NG|Screener is NGELK. As for the default
name of the first node, it is automatically set to NGELK1.
Now that we have our node (and cluster) up and running, the next step is to understand how
to communicate with it. Fortunately, NG|Storage provides a very comprehensive and
powerful REST API that you can use to interact with your cluster. Among the few things that
can be done with the API are as follows:
• Check your cluster, node, and index health, status, and statistics
• Perform CRUD (Create, Read, Update, and Delete) and search operations against your
indexes
Let’s start with a basic health check, which we can use to see how our cluster is doing.
We’ll be using curl to do this but you can use any tool that allows you to make HTTP/REST
calls. Let’s assume that we are still on the same node where we started NG|Storage on and
open another command shell window.
To check the cluster health, we will be using the 'cat' API. Remember previously that our
node HTTP endpoint is available at port 9200:
curl 'localhost:9200/_cat/health?v'
We can see that our cluster named "NGELK" is up with a green status.
Whenever we ask for the cluster health, we either get green, yellow, or red. Green means
everything is good (cluster is fully functional), yellow means all data is available but some
replicas are not yet allocated (the cluster is still fully functional), and red means some data
is not available for whatever reason.
Also from the above response, we can see a total of 1 node and that we have 0 shards
since we have no data in it yet. Note that since we are using the default cluster name
(NGELK) and since NG|Storage uses unicast network discovery by default to find other
nodes on the same machine, it is possible that you could accidentally start up more than
one node on your computer and have them all join a single cluster. In this scenario, you
may see more than 1 node in the above response.
curl 'localhost:9200/_cat/nodes?v'
host      ip        heap.percent ram.percent load node.role master name
mwubuntu1 127.0.1.1            8           4 0.00 d         *      New Goblin
Here, we can see our one node named "New Goblin", which is the single node that is
currently in our cluster.
curl 'localhost:9200/_cat/indices?v'
health index pri rep docs.count docs.deleted store.size pri.store.size
Now let’s create an index named "customer" and then list all the indexes again:
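Reconstructed here for illustration (assuming the default endpoint at localhost:9200; your host and port may differ), those two calls look like this:

```shell
curl -XPUT 'localhost:9200/customer?pretty'
curl 'localhost:9200/_cat/indices?v'
```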
The first command creates the index named "customer" using the PUT verb. We simply
append pretty to the end of the call to tell it to pretty-print the JSON response (if any).
curl 'localhost:9200/_cat/indices?v'
health index pri rep docs.count docs.deleted store.size pri.store.size
yellow customer 5 1 0 0 495b 495b
The results of the second command tell us that we now have 1 index named customer and
it has 5 primary shards and 1 replica (the defaults) and it contains 0 documents in it.
You might also notice that the customer index has a yellow health tagged to it. Recall from
our previous discussion that yellow means that some replicas are not (yet) allocated. The
reason this happens for this index is because NG|Storage by default created one replica for
this index. Since we only have one node running at the moment, that one replica cannot yet
be allocated (for high availability) until a later point in time when another node joins the
cluster. Once that replica gets allocated onto a second node, the health status for this index
will turn to green.
Let’s now put something into our customer index. Remember previously that in order to
index a document, we must tell NG|Storage which type in the index it should go to.
Let’s index a simple customer document into the customer index, "external" type, with an
ID of 1 as follows:
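Sketched here for illustration, the call looks like this (the host and port assume a default local setup):

```shell
curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "John Doe"
}'
```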
From the above, we can see that a new customer document was successfully created inside
the customer index and the external type. The document also has an internal id of 1 which
we specified at index time.
It is important to note that NG|Storage does not require you to explicitly create an index
first before you can index documents into it. In the previous example, NG|Storage will
automatically create the customer index if it didn’t already exist beforehand.
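Retrieving the document we just indexed can be sketched as follows (default local endpoint assumed):

```shell
curl -XGET 'localhost:9200/customer/external/1?pretty'
```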
Nothing out of the ordinary here other than a field, found, stating that we found a
document with the requested ID 1 and another field, _source, which returns the full JSON
document that we indexed from the previous step.
Now let’s delete the index that we just created and then list all the indexes again:
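For illustration, those two calls might look like this (default local endpoint assumed):

```shell
curl -XDELETE 'localhost:9200/customer?pretty'
curl 'localhost:9200/_cat/indices?v'
```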
Which means that the index was deleted successfully and we are now back to where we
started with nothing in our cluster.
Before we move on, let’s take a closer look again at some of the API commands that we
have learned so far:
If we study the above commands carefully, we can actually see a pattern of how we access
data in NG|Storage. That pattern can be summarized as follows:
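A sketch of that pattern, with angle brackets marking the placeholders:

```
curl -X<REST Verb> <Node>:<Port>/<Index>/<Type>/<ID>
```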
This REST access pattern is pervasive throughout all the API commands that if you can
simply remember it, you will have a good head start at mastering NG|Storage.
NG|Storage provides data manipulation and search capabilities in near real time. By
default, you can expect a one second delay (refresh interval) from the time you
index/update/delete your data until the time that it appears in your search results. This is
an important distinction from other platforms like SQL wherein data is immediately
available after a transaction is completed.
Indexing/Replacing Documents
We’ve previously seen how we can index a single document. Let’s recall that command
again:
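For reference, a sketch of that call (default local endpoint assumed):

```shell
curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "John Doe"
}'
```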
Again, the above will index the specified document into the customer index, external type,
with the ID of 1. If we then executed the above command again with a different (or same)
document, NG|Storage will replace (i.e. reindex) a new document on top of the existing one
with the ID of 1:
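Sketched for illustration:

```shell
curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "Jane Doe"
}'
```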
The above changes the name of the document with the ID of 1 from "John Doe" to "Jane
Doe". If, on the other hand, we use a different ID, a new document will be indexed and the
existing document(s) already in the index remains untouched.
When indexing, the ID part is optional. If not specified, NG|Storage will generate a random
ID and then use it to index the document. The actual ID NG|Storage generates (or whatever
we specified explicitly in the previous examples) is returned as part of the index API call.
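A sketch of indexing without an explicit ID (default local endpoint assumed):

```shell
curl -XPOST 'localhost:9200/customer/external?pretty' -d '
{
  "name": "Jane Doe"
}'
```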
Note that in the above case, we are using the POST verb instead of PUT since we didn’t
specify an ID.
In addition to being able to index and replace documents, we can also update documents.
Note though that NG|Storage does not actually do in-place updates under the hood.
Whenever we do an update, NG|Storage deletes the old document and then indexes a new
document with the update applied to it in one shot.
This example shows how to update our previous document (ID of 1) by changing the name
field to "Jane Doe":
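Sketched for illustration:

```shell
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "doc": { "name": "Jane Doe" }
}'
```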
This example shows how to update our previous document (ID of 1) by changing the name
field to "Jane Doe" and at the same time add an age field to it:
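Sketched for illustration (the age value of 20 is an arbitrary example):

```shell
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "doc": { "name": "Jane Doe", "age": 20 }
}'
```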
Updates can also be performed by using simple scripts. Note that dynamic scripts like the
following are disabled by default as of 1.4.3, have a look at the Scripting documents for
more details. This example uses a script to increment the age by 5:
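A sketch of the scripted update (inline script syntax assumed; dynamic scripting must be enabled for this to run):

```shell
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "script": "ctx._source.age += 5"
}'
```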
In the above example, ctx._source refers to the current source document that is about to be updated.
Note that as of this writing, updates can only be performed on a single document at a time.
In the future, NG|Storage might provide the ability to update multiple documents given a
query condition (like an SQL UPDATE-WHERE statement).
Deleting a document is fairly straightforward. This example shows how to delete our
previous customer with the ID of 2:
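Sketched for illustration:

```shell
curl -XDELETE 'localhost:9200/customer/external/2?pretty'
```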
The delete-by-query plugin can delete all documents matching a specific query.
In addition to being able to index, update, and delete individual documents, NG|Storage also
provides the ability to perform any of the above operations in batches using the `bulk` API.
This functionality is important in that it provides a very efficient mechanism to do multiple
operations as fast as possible with as little network roundtrips as possible.
As a quick example, the following call indexes two documents (ID 1 - John Doe and ID 2 -
Jane Doe) in one bulk operation:
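A sketch of that bulk call: each action line is followed by its source document on the next line (default local endpoint assumed):

```shell
curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe"}
{"index":{"_id":"2"}}
{"name": "Jane Doe"}
'
```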
This example updates the first document (ID of 1) and then deletes the second document
(ID of 2) in one bulk operation:
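Sketched for illustration (the new name value is an arbitrary example):

```shell
curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'
```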
Note above that for the delete action, there is no corresponding source document after it
since deletes only require the ID of the document to be deleted.
Sample Dataset
Now that we’ve gotten a glimpse of the basics, let’s try to work on a more realistic dataset.
I’ve prepared a sample of fictitious JSON documents of customer bank account information.
Each document has the following schema:
{
"account_number": 0,
"balance": 16623,
"firstname": "Bradshaw",
"lastname": "Mckenzie",
"age": 29,
"gender": "F",
"address": "244 Columbus Place",
"employer": "Euron",
"email": "bradshawmckenzie@euron.com",
"city": "Hobucken",
"state": "CO"
}
For the curious, I generated this data from www.json-generator.com/ so please ignore
the actual values and semantics of the data as these are all randomly generated.
You can download the sample dataset (accounts.json) from here. Extract it to our current
directory and let’s load it into our cluster as follows:
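The load itself can be sketched with a _bulk call like the following (assuming accounts.json sits in the current directory and the cluster listens on localhost:9200; the bank index and account type match the verification output shown next):

```shell
curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
```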
curl 'localhost:9200/_cat/indices?v'
health index pri rep docs.count docs.deleted store.size pri.store.size
yellow bank 5 1 1000 0 424.4kb 424.4kb
This means that we just successfully bulk indexed 1000 documents into the bank index
(under the account type).
Now let’s start with some simple searches. There are two basic ways to run searches: one
is by sending search parameters through the REST request URI and the other by sending
them through the REST request body. The request body method allows you to be more
expressive and also to define your searches in a more readable JSON format. We’ll try one
example of the request URI method but for the remainder of this tutorial, we will
exclusively be using the request body method.
The REST API for search is accessible from the _search endpoint. This example returns
all documents in the bank index:
curl 'localhost:9200/bank/_search?q=*&pretty'
Let’s first dissect the search call. We are searching (_search endpoint) in the bank index,
and the q=* parameter instructs NG|Storage to match all documents in the index. The
pretty parameter, again, just tells NG|Storage to return pretty-printed JSON results.
• _shards - tells us how many shards were searched, as well as a count of the
successful/failed searched shards
Here is the same exact search above using the alternative request body method:
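A sketch of the request-body variant of the same search:

```shell
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": { "match_all": {} }
}'
```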
The difference here is that instead of passing q=* in the URI, we POST a JSON-style query
request body to the _search API. We’ll discuss this JSON query in the next section.
It is important to understand that once you get your search results back, NG|Storage is
completely done with the request and does not maintain any kind of server-side resources
or open cursors into your results. This is in stark contrast to many other platforms such as
SQL wherein you may initially get a partial subset of your query results up-front and then
you have to continuously go back to the server if you want to fetch (or page through) the rest of the results.
NG|Storage provides a JSON-style domain-specific language that you can use to execute
queries. This is referred to as the Query DSL. The query language is quite comprehensive
and can be intimidating at first glance but the best way to actually learn it is to start with a
few basic examples.
{
"query": { "match_all": {} }
}
Dissecting the above, the query part tells us what our query definition is and the
match_all part is simply the type of query that we want to run. The match_all query is
simply a search for all documents in the specified index.
In addition to the query parameter, we also can pass other parameters to influence the
search results. For example, the following does a match_all and returns only the first
document:
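A sketch of the request body with the size parameter:

```json
{
  "query": { "match_all": {} },
  "size": 1
}
```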
The from parameter (0-based) specifies which document index to start from and the size
parameter specifies how many documents to return starting at the from parameter. This
feature is useful when implementing paging of search results. Note that if from is not
specified, it defaults to 0.
20 | Chapter 5. Exploring Your Data
NG|Storage Admin Guide
This example does a match_all and sorts the results by account balance in descending
order and returns the top 10 (default size) documents.
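A sketch of the request body, using the balance field from the sample dataset:

```json
{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}
```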
Now that we have seen a few of the basic search parameters, let’s dig in some more into
the Query DSL. Let’s first take a look at the returned document fields. By default, the full
JSON document is returned as part of all searches. This is referred to as the source
(_source field in the search hits). If we don’t want the entire source document returned,
we have the ability to request only a few fields from within source to be returned.
This example shows how to return two fields, account_number and balance (inside of
_source), from the search:
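A sketch of the request body with source filtering:

```json
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
```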
Note that the above example simply reduces the _source field. It will still only return one
field named _source but within it, only the fields account_number and balance are
included.
If you come from a SQL background, the above is somewhat similar in concept to the SQL
SELECT FROM field list.
Now let’s move on to the query part. Previously, we’ve seen how the match_all query is
used to match all documents. Let’s now introduce a new query called the match query,
which can be thought of as a basic fielded search query (i.e. a search done against a
specific field or set of fields).
This example returns all accounts containing the term "mill" in the address:
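A sketch of the match query against the address field:

```json
{
  "query": { "match": { "address": "mill" } }
}
```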
This example returns all accounts containing the term "mill" or "lane" in the address:
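A sketch of the request body; a multi-term match is an OR by default:

```json
{
  "query": { "match": { "address": "mill lane" } }
}
```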
This example is a variant of match (match_phrase) that returns all accounts containing
the phrase "mill lane" in the address:
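A sketch of the match_phrase variant:

```json
{
  "query": { "match_phrase": { "address": "mill lane" } }
}
```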
Let’s now introduce the bool(ean) query. The bool query allows us to compose smaller
queries into bigger queries using boolean logic.
This example composes two match queries and returns all accounts containing "mill" and
"lane" in the address:
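A sketch of the bool query with a must clause:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
```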
In the above example, the bool must clause specifies all the queries that must be true for
a document to be considered a match.
In contrast, this example composes two match queries and returns all accounts containing
"mill" or "lane" in the address:
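A sketch of the same composition using a should clause:

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
```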
In the above example, the bool should clause specifies a list of queries either of which
must be true for a document to be considered a match.
This example composes two match queries and returns all accounts that contain neither
"mill" nor "lane" in the address:
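A sketch of the same composition using a must_not clause:

```json
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
```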
In the above example, the bool must_not clause specifies a list of queries none of which
must be true for a document to be considered a match.
We can combine must, should, and must_not clauses simultaneously inside a bool
query. Furthermore, we can compose bool queries inside any of these bool clauses to
mimic any complex multi-level boolean logic.
This example returns all accounts of anybody who is 40 years old but doesn’t live in ID(aho):
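A sketch of the request body combining must and must_not inside one bool query:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
```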
In the previous section, we skipped over a little detail called the document score (_score
field in the search results). The score is a numeric value that is a relative measure of how
well the document matches the search query that we specified. The higher the score, the
more relevant the document is, the lower the score, the less relevant the document is.
But queries do not always need to produce scores, in particular when they are only used for
"filtering" the document set. NG|Storage detects these situations and automatically
optimizes query execution in order not to compute useless scores.
The bool query that we introduced in the previous section also supports filter clauses,
which allow you to use a query to restrict the documents that will be matched by other clauses,
without changing how scores are computed. As an example, let’s introduce the range
query, which allows us to filter documents by a range of values. This is generally used for
numeric or date filtering.
This example uses a bool query to return all accounts with balances between 20000 and
30000, inclusive. In other words, we want to find accounts with a balance that is greater
than or equal to 20000 and less than or equal to 30000.
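A sketch of the request body, with the range query inside the bool filter clause:

```json
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
```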
Dissecting the above, the bool query contains a match_all query (the query part) and a
range query (the filter part). We can substitute any other queries into the query and the
filter parts. In the above case, the range query makes perfect sense since documents
falling into the range all match "equally", i.e., no document is more relevant than another.
In addition to the match_all, match, bool, and range queries, there are a lot of other
query types that are available and we won’t go into them here. Since we already have a
basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge in
learning and experimenting with the other query types.
Aggregations provide the ability to group and extract statistics from your data. The easiest
way to think about aggregations is by roughly equating them to the SQL GROUP BY and the SQL
aggregate functions. In NG|Storage, you have the ability to execute searches returning hits
and at the same time return aggregated results separate from the hits, all in one response.
This is very powerful and efficient in the sense that you can run queries and multiple
aggregations and get the results of both (or either) operations back in one shot, avoiding
network roundtrips, using a concise and simplified API.
To start with, this example groups all the accounts by state, and then returns the top 10
(default) states sorted by count descending (also default):
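A sketch of the request body, using a terms aggregation on the state field (the aggregation name group_by_state is an arbitrary label):

```json
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state" }
    }
  }
}
```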
In SQL, the above is conceptually similar to: SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
We can see that there are 21 accounts in AL(abama), followed by 17 accounts in TX,
followed by 15 accounts in ID(aho), and so forth.
Note that we set size=0 to not show search hits because we only want to see the
aggregation results in the response.
Building on the previous aggregation, this example calculates the average account balance
by state (again only for the top 10 states sorted by count in descending order):
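A sketch of the request body, nesting an avg aggregation inside the terms aggregation (aggregation names are arbitrary labels):

```json
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state" },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
```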
Building on the previous aggregation, let’s now sort on the average balance in descending
order:
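A sketch of the request body: the outer terms aggregation orders its buckets by the nested average_balance aggregation:

```json
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state",
        "order": { "average_balance": "desc" }
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
```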
This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-
49), then by gender, and then finally get the average account balance, per age bracket, per
gender:
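A sketch of the request body, using a range aggregation for the age brackets with terms and avg aggregations nested inside (aggregation names are arbitrary labels):

```json
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          { "from": 20, "to": 30 },
          { "from": 30, "to": 40 },
          { "from": 40, "to": 50 }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": { "field": "gender" },
          "aggs": {
            "average_balance": {
              "avg": { "field": "balance" }
            }
          }
        }
      }
    }
  }
}
```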
There are many other aggregation capabilities that we won’t go into in detail here. The
aggregations reference guide is a great starting point if you want to do further
experimentation.
Chapter 6. Conclusion
NG|Storage is both a simple and complex product. We’ve so far learned the basics of what
it is, how to look inside of it, and how to work with it using some of the REST APIs. I hope
that this tutorial has given you a better understanding of what NG|Storage is and more
importantly, inspired you to further experiment with the rest of its great features!
7.1. Introduction
NG|Screener UI
• 2 part application:
NG|Screener Daemon
• Can be shut down and restarted without any impact on user experience, aside from
controls not being executed anymore.
NG|Discover
• 2 part application
NG|Storage
PostgreSQL
• The RDBMS
NG|Storage, just as NRT, has a sliding window of data since we cannot keep data in it
indefinitely.
• Data kept in NG|Storage takes 8 times the space it would take compressed in log-collector.
The REALTIME Loader maintains the data corresponding to the NRT period in both NRT and
NG|Storage.
The INITIAL Loader focuses only on the additional (past) data that should be in NG|Storage
since that window is usually way bigger.
• Contrary to the REALTIME Loader, the INITIAL Loader handles the most recent log files
first and then moves back into the past.
• Some fields can be set as "not indexed", in which case they are only stored.
• In our case, using our default NG template, all fields are indexed.
• Analyzed fields are tokenized and stored in a way that enables full text search.
• In our case, aside from a few exceptions (refer to index template), fields are not
analyzed.
Underlying Lucene indexes have to be "not too big, not too small":
• In our case, we have exactly 4 different index prefixes that match the business
model:
• ngc: Data / event related to Business users actions on the Information System
as a whole
Business Model
Make our physical Log Model (Log File Model) enter the Business Data Model:
• The different actual indexes where data coming from services are put in
The Business Data Model covers most of our needs in terms of Forensic Navigation.
• Since:
• The data in "Channels", i.e. the protocol entries related to human activities around a
financial transaction re-references that transaction under
business_reference=FT123456789
• The data in "Transactions", i.e. the accounting entries around a financial transaction
references that transaction as well under business_reference=FT123456789
• It’s pretty easy to support navigation to view all the data around a business entity:
This is the core principle of our forensic analysis user experience. It’s called forensic
navigation! The data model is at the root of it!
7.4. Approach
On NG|Storage, we never use index mapping directly but favor index templates.
• Indexes are created automagically by the daemon whenever some new data appear in a
log file for some new service or some new day.
• Since index creation is automatic, we have no opportunity to inject a mapping at creation time.
• Instead, we use index templates! We provide 2 default index templates: ng* and nm-search
• They are reapplied whenever the daemon restarts. Changes applied at runtime
would be overwritten whenever the daemon restarts
• Should one want to change a dynamic mapping or add some specificity to an index or set
of indexes, one needs to create a new index template applied to the subset of indexes,
for instance ngt-, or ngt-avaloq-, etc.
Index Refreshing
• The caches are not necessarily kept in sync with the new data NG|Storage puts in the underlying
Lucene index.
The daemon indexes all the new data for a specific index in a bulk way.
• It starts by setting the refresh period of the index to -1, thus completely disabling
refreshing of the cache.
• At the end of the processing, it resets the cache refresh period to its default value: 30
seconds.
• This is a slight point of failure that the consultants should be aware of!
• If the daemon crashes while processing the data, the cache refresh
setting of the index will remain at -1.
• The data that may have been partially processed so far will not appear in the
index as long as refreshing stays disabled.
• Of course when the daemon restarts, it will recover its processing and reset the
refresh period to its default value.
7.5. System
NG|Storage consists of a single daemon started on the Platform. It’s a Java Process:
Start NG|Storage:
Stop NG|Storage:
Never ever manipulate the data folder, e.g. delete an index folder, without
stopping NG|Storage first. The index would be corrupted.
Glossary of Terms
¬ analysis
Analysis is the process of converting full text to terms. Depending on which analyzer
is used, these phrases: FOO BAR, Foo-Bar, foo,bar will probably all result in the
terms foo and bar. These terms are what is actually stored in the index. A full text
query (not a term query) for FoO:bAR will also be analyzed to the terms foo,bar and
will thus match the terms stored in the index. It is this process of analysis (both at
index time and at search time) that allows NG|Storage to perform full text queries.
Also see text and term.
A cluster consists of one or more nodes which share the same cluster name. Each
cluster has a single master node which is chosen automatically by the cluster and
which can be replaced if the current master node fails.
¬ document
¬ id
¬ field
A document contains a list of fields, or key-value pairs. The value can be a simple
(scalar) value (e.g. a string, integer, date), or a nested structure like an array or an
object. A field is similar to a column in a table in a relational database. The mapping
for each field has a field type (not to be confused with document type) which indicates
the type of data that can be stored in that field, e.g. integer, string, object. The
mapping also allows you to define (amongst other things) how the value for a field
should be analyzed.
¬ index
An index is like a table in a relational database. It has a mapping which defines the
fields in the index, which are grouped into multiple types. An index is a logical
namespace which maps to one or more primary shards and can have zero or more
replica shards.
¬ mapping
¬ primary shard
Each document is stored in a single primary shard. When you index a document, it is
indexed first on the primary shard, then on all replicas of the primary shard. By
default, an index has 5 primary shards. You can specify fewer or more primary shards
to scale the number of documents that your index can handle. You cannot change
the number of primary shards in an index, once the index is created. See also
routing.
¬ replica shard
Each primary shard can have zero or more replicas. A replica is a copy of the primary
shard, and has two purposes:
¬ routing
When you index a document, it is stored on a single primary shard. That shard is
chosen by hashing the routing value. By default, the routing value is derived from
the ID of the document or, if the document has a specified parent document, from the
ID of the parent document (to ensure that child and parent documents are stored on
the same shard). This value can be overridden by specifying a routing value at
index time, or a routing in the mapping.
¬ shard
¬ source field
By default, the JSON document that you index will be stored in the _source field and
will be returned by all get and search requests. This allows you access to the original
object directly from search results, rather than requiring a second step to retrieve the
object from an ID.
¬ term
A term is an exact value that is indexed in NG|Storage. The terms foo, Foo, FOO are
NOT equivalent. Terms (i.e. exact values) can be searched for using term queries.
See also text and analysis.
¬ text
Text (or full text) is ordinary unstructured text, such as this paragraph. By default, text
will be analyzed into terms, which is what is actually stored in the index. Text fields
need to be analyzed at index time in order to be searchable as full text, and keywords
in full text queries must be analyzed at search time to produce (and search for) the
same terms that were generated at index time. See also term and analysis.
¬ type
A type represents the type of document, e.g. an email, a user, or a tweet. The
search API can filter documents by type. An index can contain multiple types, and
each type has a list of fields that can be specified for documents of that type. Fields
with the same name in different types in the same index must have the same mapping
(which defines how each field in the document is indexed and made searchable).
Aggregations
The aggregations framework helps provide aggregated data based on a search query. It is
based on simple building blocks called aggregations, that can be composed in order to
build complex summaries of the data.
An aggregation can be seen as a unit-of-work that builds analytic information over a set of
documents. The context of the execution defines what this document set is (e.g. a top-level
aggregation executes within the context of the executed query/filters of the search request).
There are many different types of aggregations, each with its own purpose and output. To
better understand these types, it is often easier to break them into three main families:
Bucketing
A family of aggregations that build buckets, where each bucket is associated with a
key and a document criterion. When the aggregation is executed, all the buckets
criteria are evaluated on every document in the context and when a criterion matches,
the document is considered to "fall in" the relevant bucket. By the end of the
aggregation process, we’ll end up with a list of buckets - each one with a set of
documents that "belong" to it.
Metric
Aggregations that keep track and compute metrics over a set of documents.
Matrix
A family of aggregations that operate on multiple fields and produce a matrix result
based on the values extracted from the requested document fields. Unlike metric and
bucket aggregations, this aggregation family does not yet support scripting.
Pipeline
Aggregations that aggregate the output of other aggregations and their associated
metrics
The interesting part comes next. Since each bucket effectively defines a document set (all
documents belonging to the bucket), one can potentially associate aggregations on the
bucket level, and those will execute within the context of that bucket. This is where the real
power of aggregations kicks in: aggregations can be nested!
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>]+ } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
The aggregations object (the key aggs can also be used) in the JSON holds the
aggregations to be computed. Each aggregation is associated with a logical name that the
user defines (e.g. if the aggregation computes the average price, then it would make sense
to name it avg_price). These logical names will also be used to uniquely identify the
aggregations in the response. Each aggregation has a specific type
(<aggregation_type> in the above snippet) and is typically the first key within the
named aggregation body. Each type of aggregation defines its own body, depending on the
nature of the aggregation (e.g. an avg aggregation on a specific field will define the field on
which the average will be calculated). At the same level of the aggregation type definition,
one can optionally define a set of additional aggregations, though this only makes sense if
the aggregation you defined is of a bucketing nature. In this scenario, the sub-aggregations
you define on the bucketing aggregation level will be computed for all the buckets built by
the bucketing aggregation. For example, if you define a set of aggregations under the
range aggregation, the sub-aggregations will be computed for the range buckets that are
defined.
Some aggregations work on values extracted from the aggregated documents. Typically,
the values will be extracted from a specific document field which is set using the field key
for the aggregations. It is also possible to define a script which will generate the values
(per document).
When both field and script settings are configured for the aggregation, the script will
be treated as a value script. While normal scripts are evaluated on a document level
(i.e. the script has access to all the data associated with the document), value scripts are
evaluated on the value level. In this mode, the values are extracted from the configured
field and the script is used to apply a "transformation" over these value/s.
When working with scripts, the lang and params settings can also be
defined. The former defines the scripting language which is used
(assuming the proper language is available in NG|Storage, either by
default or as a plugin).
Scripts can generate a single value or multiple values per document. When generating
multiple values, one can use the script_values_sorted settings to indicate whether
these values are sorted or not. Internally, NG|Storage can perform optimizations when
dealing with sorted values (for example, with the min aggregations, knowing the values are
sorted, NG|Storage will skip the iterations over all the values and rely on the first value in
the list to be the minimum value among all other values associated with the same
document).
Bucket aggregations don’t calculate metrics over fields like the metrics aggregations do,
but instead, they create buckets of documents. Each bucket is associated with a criterion
(depending on the aggregation type) which determines whether or not a document in the
current context "falls" into it. In other words, the buckets effectively define document sets.
In addition to the buckets themselves, the bucket aggregations also compute and return
the number of documents that "fell into" each bucket.
There are different bucket aggregators, each with a different "bucketing" strategy. Some
define a single bucket, some define fixed number of multiple buckets, and others
dynamically create the buckets during the aggregation process.
A special single bucket aggregation that enables aggregating from buckets on parent
document types to buckets on child documents.
This aggregation relies on the _parent field in the mapping. This aggregation has a single
option:
• type - The child type that the buckets in the parent space should be mapped to.
For example, let’s say we have an index of questions and answers. The answer type has the
following _parent field in the mapping:
{
"answer" : {
"_parent" : {
"type" : "question"
}
}
}
For more information, please refer to the source ElasticSearch reference documentation
chapter.
A multi-bucket aggregation similar to the histogram except it can only be applied on date
values. Since dates are represented in NG|Storage internally as long values, it is possible to
use the normal histogram on dates as well, though accuracy will be compromised. The
reason for this is that time-based intervals are not fixed (think of leap years and the
varying number of days in a month). For this reason, we need special support for time-based
data. From a functionality perspective, this histogram supports the same features as the
normal histogram. The main difference is that the interval can be specified by date/time
expressions.
{
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
}
}
}
}
Available expressions for interval: year, quarter, month, week, day, hour, minute,
second
Time values can also be specified via abbreviations supported by time units parsing. Note
that fractional time values are not supported, but you can address this by shifting to
another time unit (e.g., 1.5h could instead be specified as 90m).
{
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "90m"
}
}
}
}
Keys
If no format is specified, then it will use the first date format specified in
the field mapping.
{
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "1M",
"format" : "yyyy-MM-dd" 1
}
}
}
}
Response:
{
"aggregations": {
"articles_over_time": {
"buckets": [
{
"key_as_string": "2013-02-02",
"key": 1328140800000,
"doc_count": 1
},
{
"key_as_string": "2013-03-02",
"key": 1330646400000,
"doc_count": 2
},
...
]
}
}
}
Time Zone
Date-times are stored in NG|Storage in UTC. By default, all bucketing and rounding is also
done in UTC. The time_zone parameter can be used to indicate that bucketing should use
a different time zone.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as
one of the time zone ids from the TZ database. Consider the following example:
PUT my_index/log/1
{
"date": "2015-10-01T00:30:00Z"
}
PUT my_index/log/2
{
"date": "2015-10-01T01:30:00Z"
}
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day"
}
}
}
}
UTC is used if no time zone is specified, which would result in both of these documents
being placed into the same day bucket, which starts at midnight UTC on 1 October 2015:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-10-01T00:00:00.000Z",
"key": 1443657600000,
"doc_count": 2
}
]
}
}
If a time_zone of -01:00 is specified, then midnight starts at one hour before midnight
UTC:
Now the first document falls into the bucket for 30 September 2015, while the second
document falls into the bucket for 1 October 2015:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-09-30T00:00:00.000-01:00", 1
"key": 1443571200000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T00:00:00.000-01:00", 1
"key": 1443657600000,
"doc_count": 1
}
]
}
}
1 - The key_as_string value represents midnight on each day in the specified time
zone.
Offset
The offset parameter is used to change the start value of each bucket by the specified
positive (+) or negative (-) offset duration, such as 1h for an hour, or 1M for a month.
For instance, when using an interval of day, each bucket runs from midnight to midnight.
Setting the offset parameter to +6h would change each bucket to run from 6am to 6am:
PUT my_index/log/2
{
"date": "2015-10-01T06:30:00Z"
}
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"offset": "+6h"
}
}
}
}
Instead of a single bucket starting at midnight, the above request groups the documents
into buckets starting at 6am:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-09-30T06:00:00.000Z",
"key": 1443592800000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T06:00:00.000Z",
"key": 1443679200000,
"doc_count": 1
}
]
}
}
Scripts
Like with the normal histogram, both document level scripts and value level scripts are
supported. It is also possible to control the order of the returned buckets using the order
settings and filter the returned buckets based on a min_doc_count setting (by default all
buckets between the first bucket that matches documents and the last one are returned).
This histogram also supports the extended_bounds setting, which enables extending the
bounds of the histogram beyond the data itself (to read more on why you’d want to do that
please refer to the explanation here).
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"publish_date" : {
"date_histogram" : {
"field" : "publish_date",
"interval": "year",
"missing": "2000-01-01" 1
}
}
}
}
1 - Documents without a value in the publish_date field will fall into the same bucket as
documents that have the value 2000-01-01.
A range aggregation that is dedicated for date values. The main difference between this
aggregation and the normal range aggregation is that the from and to values can be
expressed in Date Math expressions, and it is also possible to specify a date format by
which the from and to response fields will be returned. Note that this aggregation
includes the from value and excludes the to value for each range.
Example:
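A sketch of the request body; the now-10M/M date-math expressions and the MM-yyyy format are assumptions chosen to match the footnotes and the sample response below:

```json
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyyy",
        "ranges": [
          { "to": "now-10M/M" }, 1
          { "from": "now-10M/M" } 2
        ]
      }
    }
  }
}
```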
1 - < now minus 10 months, rounded down to the start of the month.
2 - >= now minus 10 months, rounded down to the start of the month.
In the example above, we created two range buckets, the first will "bucket" all documents
dated prior to 10 months ago and the second will "bucket" all documents dated since 10
months ago
Response:
{
...
"aggregations": {
"range": {
"buckets": [
{
"to": 1.3437792E+12,
"to_as_string": "08-2012",
"doc_count": 7
},
{
"from": 1.3437792E+12,
"from_as_string": "08-2012",
"doc_count": 2
}
]
}
}
}
Date Format/Pattern
All ASCII letters are reserved as format pattern letters, which are defined as follows:
Text
If the number of pattern letters is 4 or more, the full form is used; otherwise a short
or abbreviated form is used if available.
Number
The minimum number of digits. Shorter numbers are zero-padded to this amount.
Year
Numeric presentation for year and weekyear fields are handled specially. For
example, if the count of 'y' is 2, the year will be displayed as the zero-based year of
the century, which is two digits.
Month
Zone
'Z' outputs offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more
outputs the zone id.
Zone names
Any characters in the pattern that are not in the ranges of ['a'..'z'] and ['A'..'Z'] will be
treated as quoted text. For instance, characters like ':', '.', ' ', '#' and '?' will appear in the
resulting time text even if they are not enclosed in single quotes.
Dates can be converted from another time-zone to UTC by specifying the time_zone
parameter.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as
one of the time zone ids from the TZ database.
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"time_zone": "CET",
"ranges": [
{ "to": "2016-02-15/d" }, 1
{ "from": "2016-02-15/d", "to" : "now/d" }, 2
{ "from": "now/d" }
]
}
}
}
}
A filtering aggregation used to limit any sub aggregations' processing to a sample of the
top-scoring documents. Diversity settings are used to limit the number of matches that
share a common value such as an "author".
• Tightening the focus of analytics to high-relevance matches rather than the potentially
very long tail of low-quality matches
• Removing bias from analytics by ensuring fair representation of content from different
sources
• Reducing the running cost of aggregations that can produce useful results using only
samples e.g. significant_terms
Example:
{
"query": {
"match": {
"text": "iphone"
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 200,
"field" : "user.id"
},
"aggs": {
"keywords": {
"significant_terms": {
"field": "text"
}
}
}
}
}
}
Response:
1 - 1000 documents were sampled in total because we asked for a maximum of 200 from an
index with 5 shards. The cost of performing the nested significant_terms aggregation was
therefore limited rather than unbounded. 2 - The results of the significant_terms
aggregation are not skewed by any single over-active Twitter user because we asked for a
maximum of one tweet from any one user in our sample.
Shard Size
The shard_size parameter limits how many top-scoring documents are collected in the
sample processed on each shard. The default value is 100.
The aggregation will throw an error if the choice of field or script produces multiple values
for a document. It is currently not possible to offer this form of de-duplication using many
values, primarily due to concerns over efficiency.
Any good market researcher will tell you that when working with samples
of data it is important that the sample represents a healthy variety of
opinions rather than being skewed by any single voice. The same is true
with aggregations and sampling with these diversify settings can offer a
way to remove the bias in your content (an over-populated geography, a
large spike in a timeline or an over-active forum spammer).
{
"aggs" : {
"sample" : {
"diversified_sampler" : {
"field" : "author",
"max_docs_per_value" : 3
}
}
}
}
Note that the max_docs_per_value setting applies on a per-shard basis only for the
purposes of shard-local sampling. It is not intended as a way of providing a global de-
duplication feature on search results.
Script
{
"aggs" : {
"sample" : {
"diversified_sampler" : {
"script" : {
"lang" : "painless",
"inline" : "doc['author'].value + '/' +
doc['genre'].value"
}
}
}
}
}
Note in the above example we chose to use the default max_docs_per_value setting of 1
and combine author and genre fields to ensure each shard sample has, at most, one match
for an author/genre pair.
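The shard-local de-duplication described above can be sketched in Python; the documents and field names below are hypothetical, and the real logic runs independently on each shard:

```python
def diversified_sample(docs, key_fn, shard_size=100, max_docs_per_value=1):
    """Keep at most `shard_size` top-scoring docs, with at most
    `max_docs_per_value` docs sharing the same de-duplication key."""
    sample, per_key = [], {}
    # docs are assumed to be sorted by descending relevance score
    for doc in docs:
        key = key_fn(doc)
        if per_key.get(key, 0) < max_docs_per_value:
            sample.append(doc)
            per_key[key] = per_key.get(key, 0) + 1
        if len(sample) == shard_size:
            break
    return sample

docs = [
    {"author": "kimchy", "genre": "tech"},
    {"author": "kimchy", "genre": "tech"},
    {"author": "s1monw", "genre": "tech"},
]
# Combine author and genre, mirroring the painless script example above.
sample = diversified_sample(docs, lambda d: (d["author"], d["genre"]))
print(len(sample))  # 2: at most one match per author/genre pair is kept
```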
Execution Hint
When using the settings to control diversity, the optional execution_hint setting can
influence the management of the values used for de-duplication. Each option will hold up to
shard_size values in memory while performing de-duplication but the type of value held
can be controlled as follows:
• hold field values directly (map)
• hold ordinals of the field as determined by the Lucene index (global_ordinals)
• hold hashes of the field values - with potential for hash collisions (bytes_hash)
The default setting is to use global_ordinals if this information is available from the
Lucene index, reverting to map if not. The bytes_hash setting may prove faster in
some cases but introduces the possibility of false positives in de-duplication logic due to
the possibility of hash collisions. Please note that NG|Storage will ignore the choice of
execution hint if it is not applicable and that there is no backward compatibility guarantee
on these hints.
Limitations
• Limited de-dup logic. The de-duplication logic in the diversify settings applies only at a
shard level so will not apply across shards.
• No specialized syntax for geo/date fields Currently the syntax for defining the
diversifying values is defined by a choice of field or script - there is no added
syntactical sugar for expressing geo or date units such as "7d" (7 days). This support
may be added in a later release and users will currently have to create these sorts of
values using a script.
Defines a single bucket of all the documents in the current document set context that
match a specified filter. Often this will be used to narrow down the current aggregation
context to a specific set of documents.
Example:
In the above example, we calculate the average price of all the products that are red.
Response:
{
...
"aggs" : {
"red_products" : {
"doc_count" : 100,
"avg_price" : { "value" : 56.3 }
}
}
}
Defines a multi bucket aggregation where each bucket is associated with a filter. Each
bucket will collect all documents that match its associated filter.
Example:
In the above example, we analyze log messages. The aggregation will build two collections
(buckets) of log messages - one for all those containing an error, and another for all those
containing a warning. For each of these buckets it will break the messages down by month.
Response:
...
"aggs" : {
"messages" : {
"buckets" : {
"errors" : {
"doc_count" : 34,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
},
"warnings" : {
"doc_count" : 439,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
}
}
}
}
...
Anonymous filters
64 | Chapter 10. Bucket Aggregations
NG|Storage Admin Guide
The filters field can also be provided as an array of filters, as in the following request:
{
"aggs" : {
"messages" : {
"filters" : {
"filters" : [
{ "term" : { "body" : "error" }},
{ "term" : { "body" : "warning" }}
]
},
"aggs" : {
"monthly" : {
"histogram" : {
"field" : "timestamp",
"interval" : "1M"
}
}
}
}
}
}
The filtered buckets are returned in the same order as provided in the request. The
response for this example would be:
...
"aggs" : {
"messages" : {
"buckets" : [
{
"doc_count" : 34,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
},
{
"doc_count" : 439,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
}
]
}
}
...
Other Bucket
false
Does not compute the other bucket
true
Returns the other bucket either as an additional named bucket (named other by
default) if named filters are being used, or as the last bucket if anonymous filters
are being used
The other_bucket_key parameter can be used to set the key for the other bucket to a
value other than the default other. Setting this parameter will implicitly set the
other_bucket parameter to true.
The following snippet shows a response where the other bucket is requested to be named
other_messages.
{
"aggs" : {
"messages" : {
"filters" : {
"other_bucket_key": "other_messages",
"filters" : {
"errors" : { "term" : { "body" : "error" }},
"warnings" : { "term" : { "body" : "warning" }}
}
},
"aggs" : {
"monthly" : {
"histogram" : {
"field" : "timestamp",
"interval" : "1M"
}
}
}
}
}
}
A multi-bucket aggregation that works on geo_point fields and is conceptually very
similar to the range aggregation. The user can define a point of origin and a set of distance
range buckets. The aggregation evaluates the distance of each document value from the
origin point and determines the bucket it belongs to based on the ranges (a document
belongs to a bucket if the distance between the document and the origin falls within the
distance range of the bucket).
Response:
{
"aggregations": {
"rings" : {
"buckets": [
{
"key": "*-100.0",
"from": 0,
"to": 100.0,
"doc_count": 3
},
{
"key": "100.0-300.0",
"from": 100.0,
"to": 300.0,
"doc_count": 1
},
{
"key": "300.0-*",
"from": 300.0,
"doc_count": 7
}
]
}
}
}
The specified field must be of type geo_point (which can only be set explicitly in the
mappings). It can also hold an array of geo_point values, in which case all of them will be
taken into account during the aggregation. The origin point can accept all formats supported
by the geo_point type:
• Object format: { "lat" : 52.3760, "lon" : 4.894 } - this is the safest format
as it is the most explicit about the lat & lon values
• Array format: [4.894, 52.3760] - which is based on the GeoJson standard and
where the first number is the lon and the second one is the lat
By default, the distance unit is m (metres) but it can also accept: mi (miles), in (inches), yd
(yards), km (kilometers), cm (centimeters), mm (millimeters).
{
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"unit" : "mi",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
}
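The way a document's distance from the origin maps onto a range bucket can be sketched in Python with a simple haversine (arc-style) distance; the document coordinates below are hypothetical:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (arc-style calculation)."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def bucket_for(distance, ranges):
    """Assign a distance to the first matching range; `from` is
    inclusive and `to` is exclusive, as in the range aggregation."""
    for rng in ranges:
        if rng.get("from", 0) <= distance < rng.get("to", float("inf")):
            return rng
    return None

origin = (52.3760, 4.894)   # Amsterdam, as in the examples above
location = (52.374, 4.912)  # a hypothetical document location
ranges = [{"to": 100}, {"from": 100, "to": 300}, {"from": 300}]

d = haversine_m(*origin, *location)
# Roughly 1.2 km from the origin in metres (the default unit), so this
# document lands in the open-ended "300.0-*" bucket.
print(bucket_for(d, ranges))
```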
There are three distance calculation modes: sloppy_arc (the default), arc (most
accurate) and plane (fastest). The arc calculation is the most accurate but also the
most expensive in terms of performance. The sloppy_arc is faster but less accurate.
The plane is the fastest but least accurate distance function. Consider using plane when
your search context is "narrow" and spans smaller geographical areas (like cities or even
countries). plane may return higher error margins for searches across very large areas
(e.g. cross-continent search). The distance calculation type can be set using the
distance_type parameter:
A multi-bucket aggregation that works on geo_point fields and groups points into
buckets that represent cells in a grid. The resulting grid can be sparse and only contains
cells that have matching data. Each cell is labeled using a geohash which is of user-
definable precision.
• High precision geohashes have a long string length and represent cells that cover only a
small area.
• Low precision geohashes have a short string length and represent cells that each cover
a large area.
Geohashes used in this aggregation can have a choice of precision between 1 and 12.
For more information please refer to the source ElasticSearch reference documentation
chapter.
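The relationship between geohash string length and cell size can be illustrated with a sketch of standard geohash base-32 encoding (not NG|Storage's internal implementation):

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=5):
    """Encode a point as a geohash; longer strings mean smaller cells."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, even, result = 0, 0, True, []
    while len(result) < precision:
        # Alternate between halving the longitude and latitude ranges.
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            result.append(_BASE32[ch])
            bits, ch = 0, 0
    return "".join(result)

# Higher precision -> longer hash -> smaller cell; a low-precision hash
# is a prefix of the high-precision hash for the same point.
print(geohash(57.64911, 10.40744, precision=11))  # u4pruydqqvj
print(geohash(57.64911, 10.40744, precision=3))   # u4p
```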
Defines a single bucket of all the documents within the search execution context. This
context is defined by the indices and the document types you’re searching on, but is not
influenced by the search query itself.
Global aggregators can only be placed as top level aggregators (it makes no sense to
embed a global aggregator within another bucket aggregator).
{
"query" : {
"match" : { "title" : "shirt" }
},
"aggs" : {
"all_products" : {
"global" : {}, 1
"aggs" : { 2
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
}
1 - The global aggregation has an empty body 2 - The sub-aggregations that are
registered for this global aggregation
The above aggregation demonstrates how one would compute aggregations (avg_price in
this example) on all the documents in the search context, regardless of the query (in our
example, it will compute the average price over all products in our catalog, not just on the
"shirts").
{
...
"aggregations" : {
"all_products" : {
"doc_count" : 100, 1
"avg_price" : {
"value" : 56.3
}
}
}
}
1 - The number of documents that were aggregated (in our case, all documents within the
search context)
A multi-bucket values source based aggregation that can be applied on numeric values
extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over
the values. For example, if the documents have a field that holds a price (numeric), we can
configure this aggregation to dynamically build buckets over the prices with an interval of
50. When the aggregation executes, the value of each document is rounded down to its
closest bucket, roughly as bucket_key = floor((value - offset) / interval) * interval + offset.
From this rounding function it can be seen that the intervals themselves must be integers.
Currently, values are cast to integers before being bucketed, which might
cause negative floating-point values to fall into the wrong bucket. For
instance, -4.5 with an interval of 2 would be cast to -4, and so would end
up in the -4 <= val < -2 bucket instead of the -6 <= val < -4 bucket.
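The difference between a truncating cast and proper floor-based rounding can be checked in Python; this is a sketch of the bucketing arithmetic, not NG|Storage code:

```python
import math

def bucket_key(value, interval):
    """Floor-based bucketing: rounds towards negative infinity,
    so -4.5 belongs to the -6 <= val < -4 bucket."""
    return math.floor(value / interval) * interval

def bucket_key_after_cast(value, interval):
    """The current behaviour: the value is cast to an integer first.
    int() truncates towards zero, so int(-4.5) == -4 and the document
    lands in the -4 <= val < -2 bucket instead."""
    return math.floor(int(value) / interval) * interval

print(bucket_key(-4.5, 2))             # -6 (the mathematically correct bucket)
print(bucket_key_after_cast(-4.5, 2))  # -4 (the bucket actually used)
```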
The following snippet "buckets" the products based on their price by interval of 50:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50
}
}
}
}
The response above shows that no document has a price that falls within the range of [100
- 150). By default the response will fill gaps in the histogram with empty buckets. It is
possible to change that and request buckets with a higher minimum count using the
min_doc_count setting:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"min_doc_count" : 1
}
}
}
}
Response:
By default the histogram returns all the buckets within the range of the data itself, that is,
the documents with the smallest values (on which the histogram is computed) will determine
the min bucket (the bucket with the smallest key) and the documents with the highest values
will determine the max bucket (the bucket with the highest key). Often, when requesting
empty buckets, this causes confusion, specifically when the data is also filtered.
Let's say you're filtering your request to get all docs with values between 0 and 500, and in
addition you'd like to slice the data per price using a histogram with an interval of 50. You
also specify "min_doc_count" : 0 as you'd like to get all buckets, even the empty ones.
If it happens that all products (documents) have prices higher than 100, the first bucket
you'll get will be the one with 100 as its key. This is confusing, as many times you'd also
like to get the buckets between 0 - 100.
With the extended_bounds setting, you can "force" the histogram aggregation to start
building buckets at a specific min value and also keep on building buckets up to a max
value (even if there are no documents anymore). Using extended_bounds only makes
sense when min_doc_count is 0 (the empty buckets will never be returned if
min_doc_count is greater than 0).
Note that (as the name suggests) extended_bounds is not filtering buckets. Meaning, if
the extended_bounds.min is higher than the values extracted from the documents, the
documents will still dictate what the first bucket will be (and the same goes for the
extended_bounds.max and the last bucket). For filtering buckets, one should nest the
histogram aggregation under a range filter aggregation with the appropriate from/to
settings.
Example:
{
"query" : {
"constant_score" : { "filter": { "range" : { "price" : { "to" :
"500" } } } }
},
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"extended_bounds" : {
"min" : 0,
"max" : 500
}
}
}
}
}
Order
By default the returned buckets are sorted by their key ascending, though the order
behaviour can be controlled using the order setting.
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"order" : { "_key" : "desc" }
}
}
}
}
If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine
the order of the buckets:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"order" : { "price_stats.min" : "asc" } 1
},
"aggs" : {
"price_stats" : { "stats" : {} } 2
}
}
}
}
1 - The { "price_stats.min" : "asc" } will sort the buckets based on the min value of
their price_stats sub-aggregation.
2 - There is no need to configure the price field for the price_stats aggregation as it
will inherit it by default from its parent histogram aggregation.
It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy.
This is supported as long as the aggregations in the path are of a single-bucket type, where
the last aggregation in the path may either be a single-bucket one or a metrics one. If it's a
single-bucket type, the order will be defined by the number of docs in the bucket (i.e.
doc_count); in case it's a metrics one, the same rules as above apply (the path must
indicate the metric name to sort by in case of a multi-value metrics aggregation, and in
case of a single-value metrics aggregation the sort will be applied on that value).
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"order" : { "promoted_products>rating_stats.avg" : "desc" }
},
"aggs" : {
"promoted_products" : {
"filter" : { "term" : { "promoted" : true }},
"aggs" : {
"rating_stats" : { "stats" : { "field" : "rating"
}}
}
}
}
}
}
}
The above will sort the buckets based on the avg rating among the promoted products.
Offset
By default the bucket keys start at 0 and then continue in evenly spaced steps of
interval, e.g. if the interval is 10 the first buckets (assuming there is data inside them)
will be [0 - 9], [10 - 19], [20 - 29]. The bucket boundaries can be shifted by using the
offset option.
This can be best illustrated with an example. If there are 10 documents with values ranging
from 5 to 14, using interval 10 will result in two buckets with 5 documents each. If an
additional offset 5 is used, there will be only one single bucket [5-14] containing all the 10
documents.
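The example above can be verified with the same rounding function sketched earlier, extended with the offset term (the values are the hypothetical documents from the paragraph above):

```python
import math

def bucket_key(value, interval, offset=0):
    """Histogram bucket key with an optional offset shift."""
    return math.floor((value - offset) / interval) * interval + offset

values = list(range(5, 15))  # 10 documents with values 5..14

# Without an offset, interval 10 splits them into two buckets keyed 0 and 10.
keys_plain = {bucket_key(v, 10) for v in values}
print(keys_plain)  # {0, 10}

# With offset 5 the boundaries shift, and a single [5-14] bucket holds all 10 docs.
keys_offset = {bucket_key(v, 10, offset=5) for v in values}
print(keys_offset)  # {5}
```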
Response Format
By default, the buckets are returned as an ordered array. It is also possible to request the
response as a hash instead, keyed by the buckets' keys:
Response:
{
"aggregations": {
"prices": {
"buckets": {
"0": {
"key": 0,
"doc_count": 2
},
"50": {
"key": 50,
"doc_count": 4
},
"150": {
"key": 150,
"doc_count": 3
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"quantity" : {
"histogram" : {
"field" : "quantity",
"interval": 10,
"missing": 0
}
}
}
}
Just like the dedicated date range aggregation, there is also a dedicated range aggregation
for IP typed fields:
Example:
{
"aggs" : {
"ip_ranges" : {
"ip_range" : {
"field" : "ip",
"ranges" : [
{ "to" : "10.0.0.5" },
{ "from" : "10.0.0.5" }
]
}
}
}
}
Response:
{
...
"aggregations": {
"ip_ranges": {
"buckets" : [
{
"to": "10.0.0.5",
"doc_count": 4
},
{
"from": "10.0.0.5",
"doc_count": 6
}
]
}
}
}
Response:
{
"aggregations": {
"ip_ranges": {
"buckets": [
{
"key": "10.0.0.0/25",
"from": "10.0.0.0",
"to": "10.0.0.127",
"doc_count": 127
},
{
"key": "10.0.0.127/25",
"from": "10.0.0.127",
"to": "10.0.0.255",
"doc_count": 127
}
]
}
}
}
A field data based single bucket aggregation, that creates a bucket of all documents in the
current document set context that are missing a field value (effectively, missing a field or
having the configured NULL value set). This aggregator will often be used in conjunction
with other field data bucket aggregators (such as ranges) to return information for all the
documents that could not be placed in any of the other buckets due to missing field data
values.
Example:
In the above example, we get the total number of products that do not have a price.
Response:
{
...
"aggs" : {
"products_without_a_price" : {
"doc_count" : 10
}
}
}
For example, let's say we have an index of products, and each product holds the list of
resellers - each having its own price for the product. The mapping could look like:
{
...
"product" : {
"properties" : {
"resellers" : { 1
"type" : "nested",
"properties" : {
"name" : { "type" : "text" },
"price" : { "type" : "double" }
}
}
}
}
}
1 - The resellers is an array that holds nested documents under the product object.
The following aggregation will return the minimum price for which the products can be purchased:
As you can see above, the nested aggregation requires the path of the nested documents
within the top level documents. Then one can define any type of aggregation over these
nested documents.
Response:
{
"aggregations": {
"resellers": {
"min_price": {
"value" : 350
}
}
}
}
A multi-bucket value source based aggregation that enables the user to define a set of
ranges - each representing a bucket. During the aggregation process, the values extracted
from each document will be checked against each bucket range, and the matching
documents will be "bucketed" accordingly. Note that this aggregation includes the from
value and excludes the to value for each range.
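The membership rule (from inclusive, to exclusive) can be sketched in Python; the prices below are hypothetical:

```python
def range_buckets(values, ranges):
    """Count values per range. `from` defaults to -inf and `to` to +inf;
    a value matches when from <= value < to, so a price of exactly 50
    falls into the "50-100" bucket and not the "*-50" one."""
    counts = [0] * len(ranges)
    for v in values:
        for i, r in enumerate(ranges):
            lo = r.get("from", float("-inf"))
            hi = r.get("to", float("inf"))
            if lo <= v < hi:
                counts[i] += 1
    return counts

prices = [15, 45, 50, 60, 75, 99, 100, 120, 150, 200]  # hypothetical documents
ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]
print(range_buckets(prices, ranges))  # [2, 4, 4] - one doc_count per range
```

Note that, as in the real aggregation, overlapping ranges are allowed: a value is counted in every range it falls into.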
Example:
Response:
{
...
"aggregations": {
"price_ranges" : {
"buckets": [
{
"to": 50,
"doc_count": 2
},
{
"from": 50,
"to": 100,
"doc_count": 4
},
{
"from": 100,
"doc_count": 4
}
]
}
}
}
Keyed Response
Setting the keyed flag to true will associate a unique string key with each bucket and
return the ranges as a hash rather than an array:
Response:
{
...
"aggregations": {
"price_ranges" : {
"buckets": {
"*-50.0": {
"to": 50,
"doc_count": 2
},
"50.0-100.0": {
"from": 50,
"to": 100,
"doc_count": 4
},
"100.0-*": {
"from": 100,
"doc_count": 4
}
}
}
}
}
Script
{
"aggs" : {
"price_ranges" : {
"range" : {
"script" : {
"lang": "painless",
"inline": "doc['price'].value"
},
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
Value Script
Let's say the product prices are in USD but we would like to get the price ranges in EUR.
We can use a value script to convert the prices prior to aggregation (assuming a conversion
rate of 0.8):
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"script" : {
"lang": "painless",
"inline": "_value * params.conversion_rate",
"params" : {
"conversion_rate" : 0.8
}
},
"ranges" : [
{ "to" : 35 },
{ "from" : 35, "to" : 70 },
{ "from" : 70 }
]
}
}
}
}
Sub Aggregations
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
},
"aggs" : {
"price_stats" : {
"stats" : { "field" : "price" }
}
}
}
}
}
Response:
If a sub aggregation is also based on the same value source as the range aggregation (like
the stats aggregation in the example above) it is possible to leave out the value source
definition for it. The following will return the same response as above:
1 - We don’t need to specify the price as we "inherit" it by default from the parent range
aggregation
A special single bucket aggregation that enables aggregating on parent docs from nested
documents. Effectively this aggregation can break out of the nested block structure and link
to other nested structures or the root document, which allows nesting other aggregations
that aren’t part of the nested object in a nested aggregation.
For more information please refer to the source ElasticSearch reference documentation
chapter.
A filtering aggregation used to limit any sub aggregations' processing to a sample of the
top-scoring documents.
• Tightening the focus of analytics to high-relevance matches rather than the potentially
very long tail of low-quality matches
• Reducing the running cost of aggregations that can produce useful results using only
samples e.g. significant_terms
Example:
Response:
{
...
"aggregations": {
"sample": {
"doc_count": 1000, 1
"keywords": {
"doc_count": 1000,
"buckets": [
...
{
"key": "bend",
"doc_count": 58,
"score": 37.982536582524276,
"bg_count": 103
},
....
]
}
}
}
}
1 - 1000 documents were sampled in total because we asked for a maximum of 200 from an
index with 5 shards. The cost of performing the nested significant_terms aggregation was
therefore limited rather than unbounded.
Shard Size
The shard_size parameter limits how many top-scoring documents are collected in the
sample processed on each shard. The default value is 100.
• Identifying the merchant that is the "common point of compromise" from the
transaction history of credit card owners reporting loss
• Suggesting keywords relating to stock symbol $ATI for an automated news classifier
• Spotting the fraudulent doctor who is diagnosing more than his fair share of whiplash
injuries
In all these cases the terms being selected are not simply the most popular terms in a set.
They are the terms that have undergone a significant change in popularity measured
between a foreground and background set. If the term "H5N1" only exists in 5 documents in
a 10 million document index and yet is found in 4 of the 100 documents that make up a
user’s search results that is significant and probably very relevant to their search.
5/10,000,000 vs 4/100 is a big swing in frequency.
Single-set analysis
In the simplest case, the foreground set of interest is the search results matched by a
query and the background set used for statistical comparisons is the index or indices from
which the results were gathered.
Example:
Response:
{
...
"aggregations" : {
"significantCrimeTypes" : {
"doc_count": 47347,
"buckets" : [
{
"key": "Bicycle theft",
"doc_count": 3640,
"score": 0.371235374214817,
"bg_count": 66799
}
...
]
}
}
}
When querying an index of all crimes from all police forces, what these results show is that
the British Transport Police force stands out as a force dealing with a disproportionately
large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes
(66799/5064554) but for the British Transport Police, who handle crime on railways and
stations, 7% of crimes (3640/47347) are bicycle thefts. This is a significant seven-fold
increase in frequency and so this anomaly was highlighted as the top crime type.
The problem with using a query to spot anomalies is it only gives us one subset to use for
comparisons. To discover all the other police forces' anomalies we would have to repeat
the query for each of the different forces.
Multi-set analysis
A simpler way to perform analysis across multiple categories is to use a parent-level
aggregation to segment the data ready for analysis.
{
"aggregations": {
"forces": {
"terms": {"field": "force"},
"aggregations": {
"significantCrimeTypes": {
"significant_terms": {"field": "crime_type"}
}
}
}
}
}
Response:
"aggregations": {
"forces": {
"buckets": [
{
"key": "Metropolitan Police Service",
"doc_count": 894038,
"significantCrimeTypes": {
"doc_count": 894038,
"buckets": [
{
"key": "Robbery",
"doc_count": 27617,
"score": 0.0599,
"bg_count": 53182
},
...
]
}
},
{
"key": "British Transport Police",
"doc_count": 47347,
"significantCrimeTypes": {
"doc_count": 47347,
"buckets": [
{
"key": "Bicycle theft",
"doc_count": 3640,
"score": 0.371,
"bg_count": 66799
},
...
]
}
}
]
}
}
Now we have anomaly detection for each of the police forces using a single request.
We can use other forms of top-level aggregations to segment our data, for example
segmenting by geographic area to identify unusual hot-spots of a particular crime type:
This example uses the geohash_grid aggregation to create result buckets that represent
geographic areas, and inside each bucket we can identify anomalous levels of a crime type
in these tightly-focused areas e.g.
At a higher geohash_grid zoom-level with larger coverage areas we would start to see
where an entire police-force may be tackling an unusual volume of a particular crime type.
Obviously a time-based top-level segmentation would help identify current trends for each
point in time where a simple terms aggregation would typically show the very popular
"constants" that persist across all time slots.
The numbers returned for scores are primarily intended for ranking different
suggestions sensibly rather than something easily understood by end users. The
scores are derived from the doc frequencies in foreground and background sets. In
brief, a term is considered significant if there is a noticeable difference in the
frequency in which a term appears in the subset and in the background. The way the
terms are ranked can be configured, see "Parameters" section.
You can spot mis-categorized content by first searching a structured field e.g.
category:adultMovie and use significant_terms on the free-text
"movie_description" field. Take the suggested words (I’ll leave them to your
imagination) and then search for all movies NOT marked as category:adultMovie but
containing these keywords. You now have a ranked list of badly-categorized movies
that you should reclassify or at least remove from the "familyFriendly" category.
The significance score from each term can also provide a useful boost setting to sort
matches. Using the minimum_should_match setting of the terms query with the
keywords will help control the balance of precision/recall in the result set, i.e. a high
setting would have a small number of relevant results packed full of keywords and a
setting of "1" would produce a more exhaustive results set with all documents
containing any keyword.
Ordinarily, the foreground set of documents is "diffed" against a background set of all the
documents in your index. However, sometimes it may prove useful to use a narrower
background set as the basis for comparisons. For example, a query on documents relating
to "Madrid" in an index with content from all over the world might reveal that "Spanish"
was a significant term. This may be true but if you want some more focused terms you
could use a background_filter on the term 'spain' to establish a narrower set of
documents as context. With this as a background "Spanish" would now be seen as
commonplace and therefore not as significant as words like "capital" that relate more
strongly with Madrid. Note that using a background filter will slow things down - each
term’s background frequency must now be derived on-the-fly from filtering posting lists
rather than reading the index’s pre-computed count for a term.
Limitations
• Significant terms must be indexed values Unlike the terms aggregation it is currently
not possible to use script-generated terms for counting purposes. Because of the way
the significant_terms aggregation must consider both foreground and background
frequencies it would be prohibitively expensive to use a script on the entire index to
obtain background frequencies for comparisons. Also DocValues are not supported as
sources of term data for similar reasons.
• No analysis of floating point fields Floating point fields are currently not supported as
the subject of significant_terms analysis. While integer or long fields can be used to
represent concepts like bank account numbers or category numbers which can be
interesting to track, floating point fields are usually used to represent quantities of
something. As such, individual floating point terms are not useful for this form of
frequency analysis.
Approximate counts
The counts of how many documents contain a term provided in results are based on
summing the samples returned from each shard and as such may be:
• high when considering the background frequency as it may count occurrences found in
deleted documents
Like most design decisions, this is the basis of a trade-off in which we have chosen to
provide fast performance at the cost of some (typically small) inaccuracies. However, the
size and shard size settings covered in the next section provide tools to help control
the accuracy levels.
Parameters
• JLH score
The scores are derived from the doc frequencies in foreground and background sets. The
absolute change in popularity (foregroundPercent - backgroundPercent) would favor
common terms whereas the relative change in popularity (foregroundPercent/
backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs
recall balance and so the absolute and relative changes are multiplied to provide a sweet
spot between precision and recall.
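A JLH-style score along those lines can be sketched in Python; this is a simplified illustration of the multiplied absolute and relative changes, not the exact NG|Storage implementation:

```python
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    """Multiply the absolute and relative change in popularity between
    the foreground (subset) and background (superset) sets."""
    fg = subset_freq / subset_size
    bg = superset_freq / superset_size
    if fg <= bg:
        return 0.0  # only terms MORE frequent in the foreground are significant
    absolute_change = fg - bg     # favours common terms
    relative_change = fg / bg     # favours rare terms
    return absolute_change * relative_change

# "H5N1": 4 of 100 foreground docs vs 5 of 10,000,000 background docs.
rare_but_relevant = jlh_score(4, 100, 5, 10_000_000)
# A stop word: common everywhere, barely more frequent in the foreground.
common = jlh_score(60, 100, 5_500_000, 10_000_000)
print(rare_but_relevant > common)  # True
```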
"mutual_information": {
"include_negatives": true
}
Mutual information does not differentiate between terms that are descriptive for the subset
or for documents outside the subset. The significant terms can therefore contain terms that
appear more or less frequently in the subset than outside of it. To filter out terms
that appear less often in the subset than in documents outside the subset,
include_negatives can be set to false.
By default, the assumption is that the documents in the bucket are also contained in the
background. If instead you defined a custom background filter that represents a different
set of documents that you want to compare to, set
"background_is_superset": false
Chi square
Chi square as described in "Information Retrieval", Manning et al., Chapter
13.5.2 can be used as significance score by adding the parameter
"chi_square": {
}
Chi square behaves like mutual information and can be configured with the same
parameters include_negatives and background_is_superset.
"gnd": {
}
The benefit of this heuristic is that the scoring logic is simple to explain to anyone familiar
with a "per capita" statistic. However, for fields with high cardinality there is a tendency for
this heuristic to select the rarest terms such as typos that occur only once because they
score 1/1 = 100%.
It would be hard for a seasoned boxer to win a championship if the prize was awarded
purely on the basis of percentage of fights won - by these rules a newcomer with only one
fight under his belt would be impossible to beat. Multiple observations are typically
required to reinforce a view so it is recommended in these cases to set both
min_doc_count and shard_min_doc_count to a higher value such as 10 in order to
filter out the low-frequency terms that otherwise take precedence.
"percentage": {
}
Roughly, mutual_information prefers terms that occur frequently, even if they also occur
frequently in the background. For example, in an analysis of natural language text this
might lead to the selection of stop words. mutual_information is unlikely to select very
rare terms such as misspellings. gnd prefers terms with a high co-occurrence and avoids
the selection of stop words, so it might be better suited for synonym detection. However,
gnd has a tendency to select very rare terms that are, for example, the result of misspelling.
It is hard to say which one of the different heuristics will be the best choice as it depends on
what the significant terms are used for (see for example Yang and Pedersen, "A
Comparative Study on Feature Selection in Text Categorization", 1997,
http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf, for a
study on using significant terms for feature selection for text classification).
If none of the above measures suits your use case, then another option is to implement a
custom significance measure:
"script_heuristic": {
"script": "_subset_freq/(_superset_freq - _subset_freq + 1)"
}
Scripts can be inline (as in above example), indexed or stored on disk. For details on the
options, see script documentation.
_subset_freq
Number of documents in the subset that contain the term.
_superset_freq
Number of documents in the superset that contain the term.
_subset_size
Number of documents in the subset.
_superset_size
Number of documents in the superset.
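To make the example heuristic concrete, the following sketch replicates the scoring formula of the inline script shown above in Python (the function name is illustrative, not part of NG|Storage):

```python
def script_heuristic_score(subset_freq, superset_freq):
    """Replicates the example inline script:
    _subset_freq / (_superset_freq - _subset_freq + 1)"""
    return subset_freq / (superset_freq - subset_freq + 1)

# A term concentrated in the subset scores much higher than a term
# that is also common in the rest of the superset:
print(script_heuristic_score(10, 12))    # concentrated: 10 / 3
print(script_heuristic_score(10, 1000))  # diluted: 10 / 991
```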
The size parameter can be set to define how many term buckets should be returned out of
the overall terms list. By default, the node coordinating the search process will request
each shard to provide its own top term buckets and once all shards respond, it will reduce
the results to the final list that will then be returned to the client. If the number of unique
terms is greater than size, the returned list can be slightly off and not accurate (it could
To ensure better accuracy a multiple of the final size is used as the number of terms to
request from each shard using a heuristic based on the number of shards. To take manual
control of this setting the shard_size parameter can be used to control the volumes of
candidate terms produced by each shard.
Low-frequency terms can turn out to be the most interesting ones once all results are
combined so the significant_terms aggregation can produce higher-quality results when
the shard_size parameter is set to values significantly higher than the size setting. This
ensures that a bigger volume of promising candidate terms are given a consolidated review
by the reducing node before the final selection. Obviously large candidate term lists will
cause extra network traffic and RAM usage, so this is a quality/cost trade-off that needs to
be balanced. If shard_size is set to -1 (the default) then shard_size will be automatically
estimated based on the number of shards and the size parameter.
It is possible to only return terms that match more than a configured number of hits using
the min_doc_count option:
{
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tag",
"min_doc_count": 10
}
}
}
}
The above aggregation would only return tags which have been found in 10 hits or more.
Default value is 3.
Terms that score highly will be collected on a shard level and merged with the terms
collected from other shards in a second step. However, the shard does not have the
information about the global term frequencies available. The decision if a term is added to a
candidate list depends only on the order computed on the shard using local shard
frequencies.
shard_min_doc_count parameter
The parameter shard_min_doc_count regulates the certainty a shard has if the term
should actually be added to the candidate list or not with respect to the min_doc_count.
Terms will only be considered if their local shard frequency within the set is higher than the
shard_min_doc_count. If your dictionary contains many low frequent words and you are
not interested in these (for example misspellings), then you can set the
shard_min_doc_count parameter to filter out candidate terms on a shard level that will
with a reasonable certainty not reach the required min_doc_count even after merging the
local frequencies. By default, shard_min_doc_count is set to 0 and has no effect unless
you explicitly set it.
The default source of statistical information for background term frequencies is the entire
index and this scope can be narrowed through the use of a background_filter to focus
in on significant terms within a narrower context:
The above filter would help focus in on terms that were peculiar to the city of Madrid rather
than revealing terms like "Spanish" that are unusual in the full index’s worldwide context
but commonplace in the subset of documents containing the word "Spain".
Use of background filters will slow the query, as each term’s postings must
be filtered to determine a frequency.
Filtering Values
It is possible (although rarely required) to filter the values for which buckets will be
created. This can be done using the include and exclude parameters which are based
on a regular expression string or arrays of exact terms. This functionality mirrors the
features described in the terms aggregation documentation.
Execution hint
There are different mechanisms by which terms aggregations can be executed:
• by using field values directly in order to aggregate data per-bucket (map)
• by using ordinals of the field and preemptively allocating one bucket per ordinal value
(global_ordinals)
• by using ordinals of the field and dynamically allocating one bucket per ordinal value
(global_ordinals_hash)
NG|Storage tries to have sensible defaults so this is something that generally doesn’t need
to be configured.
map should only be considered when very few documents match a query. Otherwise the
ordinals-based execution modes are significantly faster. By default, map is only used when
running an aggregation on scripts, since they don’t have ordinals.
global_ordinals is the second fastest option, but the fact that it preemptively allocates
buckets can be memory-intensive, especially if you have one or more sub aggregations. It is
used by default on top-level terms aggregations.
{
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tags",
"execution_hint": "map" 1
}
}
}
}
Please note that NG|Storage will ignore this execution hint if it is not applicable.
A multi-bucket value source based aggregation where buckets are dynamically built - one
per unique value.
Example:
{
"aggs" : {
"genres" : {
"terms" : { "field" : "genre" }
}
}
}
Response:
"aggregations" : {
"genres" : {
"doc_count_error_upper_bound": 0, 1
"sum_other_doc_count": 0, ¬
"buckets" : [ ¬
{
"key" : "jazz",
"doc_count" : 10
},
{
"key" : "rock",
"doc_count" : 10
},
{
"key" : "electronic",
"doc_count" : 10
},
]
}
}
}
1 - an upper bound of the error on the document counts for each term, see below
2 - when there are lots of unique terms, NG|Storage only returns the top terms; this number
is the sum of the document counts for all buckets that are not part of the response
3 - the list of the top buckets, the meaning of top being defined by the order
By default, the terms aggregation will return the buckets for the top ten terms ordered by
the doc_count. One can change this default behaviour by setting the size parameter.
Size
The size parameter can be set to define how many term buckets should be returned out of
the overall terms list. By default, the node coordinating the search process will request
each shard to provide its own top size term buckets and once all shards respond, it will
reduce the results to the final list that will then be returned to the client. This means that if
the number of unique terms is greater than size, the returned list is slightly off and not
accurate (it could be that the term counts are slightly off and it could even be that a term
that should have been in the top size buckets was not returned).
As described above, the document counts (and the results of any sub aggregations) in the
terms aggregation are not always accurate. This is because each shard provides its own
view of what the ordered list of terms should be, and these views are combined to give a
final view.
A request is made to obtain the top 5 terms in the field product, ordered by descending
document count from an index with 3 shards. In this case each shard is asked to give its top
5 terms.
{
"aggs" : {
"products" : {
"terms" : {
"field" : "product",
"size" : 5
}
}
}
}
The terms for each of the three shards are shown below with their respective document
counts in brackets:
The shards will return their top 5 terms so the results from the shards will be:
Taking the top 5 results from each of the shards (as requested) and combining them to
make a final top 5 list produces the following:
Because Product A was returned from all shards we know that its document count value is
accurate. Product C was only returned by shards A and C so its document count is shown as
50 but this is not an accurate count. Product C exists on shard B, but its count of 4 was not
high enough to put Product C into the top 5 list for that shard. Product Z was also returned
only by 2 shards but the third shard does not contain the term. There is no way of knowing,
at the point of combining the results to produce the final list of terms, that there is an error
in the document count for Product C and not for Product Z. Product H has a document
count of 44 across all 3 shards but was not included in the final list of terms because it did
not make it into the top five terms on any of the shards.
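The shard-level reduce described above can be simulated with a short sketch. The per-shard counts below are hypothetical (the document's own shard tables are not reproduced here), but they show the same effect: a term is under-counted when its local count misses a shard's top list.

```python
from collections import Counter

def reduce_top_terms(shard_counts, size):
    """Each shard returns only its local top-`size` terms; the
    coordinating node sums whatever it received and keeps the top
    `size` of the merged list."""
    merged = Counter()
    for counts in shard_counts:
        top = sorted(counts.items(), key=lambda kv: -kv[1])[:size]
        for term, count in top:
            merged[term] += count
    return merged.most_common(size)

# Hypothetical per-shard counts (term -> local document count):
shards = [
    {"A": 25, "C": 20, "H": 15, "Z": 10, "B": 2},
    {"A": 30, "Z": 22, "H": 14, "C": 4,  "D": 1},
    {"A": 45, "C": 30, "H": 15, "E": 2,  "F": 1},
]
# Shard 2's count of 4 for term C never reaches the coordinating node,
# so C's true count of 54 is reported as 50:
print(reduce_top_terms(shards, 3))  # [('A', 100), ('C', 50), ('H', 44)]
```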
Shard Size
The higher the requested size is, the more accurate the results will be, but also, the more
expensive it will be to compute the final results (both due to bigger priority queues that are
managed on a shard level and due to bigger data transfers between the nodes and the
client).
The shard_size parameter can be used to minimize the extra work that comes with
bigger requested size. When defined, it will determine how many terms the coordinating
node will request from each shard. Once all the shards responded, the coordinating node
will then reduce them to a final result which will be based on the size parameter - this
way, one can increase the accuracy of the returned terms and avoid the overhead of
streaming a big list of buckets back to the client.
The default shard_size is a multiple of the size parameter which is dependent on the
number of shards.
There are two error values which can be shown on the terms aggregation. The first gives a
value for the aggregation as a whole, which represents the maximum potential document
count for a term which did not make it into the final list of terms. This is calculated as the
sum of the document count from the last term returned from each shard.
{
...
"aggregations" : {
"products" : {
"doc_count_error_upper_bound" : 46,
"buckets" : [
{
"key" : "Product A",
"doc_count" : 100
},
{
"key" : "Product Z",
"doc_count" : 52
},
...
]
}
}
}
"aggregations" : {
"products" : {
"doc_count_error_upper_bound" : 46,
"buckets" : [
{
"key" : "Product A",
"doc_count" : 100,
"doc_count_error_upper_bound" : 0
},
{
"key" : "Product Z",
"doc_count" : 52,
"doc_count_error_upper_bound" : 2
},
...
]
}
}
}
These errors can only be calculated in this way when the terms are ordered by descending
document count. When the aggregation is ordered by the terms values themselves (either
ascending or descending) there is no error in the document count since if a shard does not
return a particular term which appears in the results from another shard, it must not have
that term in its index. When the aggregation is either sorted by a sub aggregation or in
order of ascending document count, the error in the document counts cannot be
determined and is given a value of -1 to indicate this.
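The worst-case error bounds described above can be computed from the per-shard cut-off counts. This is an illustrative sketch with hypothetical data, not NG|Storage's implementation:

```python
def agg_error_upper_bound(shard_results):
    """Aggregation-wide worst case: every shard may have held back a
    term with up to its smallest returned count, so sum those cut-offs."""
    return sum(min(counts.values()) for counts in shard_results)

def bucket_error_upper_bound(term, shard_results):
    """Per-term worst case: shards that did not return the term may
    still contain it, with at most their cut-off count."""
    return sum(min(counts.values())
               for counts in shard_results if term not in counts)

# Hypothetical per-shard top terms (term -> local document count):
shard_results = [{"A": 10, "B": 5}, {"A": 8, "C": 4}]
print(agg_error_upper_bound(shard_results))          # 5 + 4 = 9
print(bucket_error_upper_bound("B", shard_results))  # 4 (absent from shard 2)
print(bucket_error_upper_bound("A", shard_results))  # 0 (returned by all shards)
```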
Order
The order of the buckets can be customized by setting the order parameter. By default,
the buckets are ordered by their doc_count descending. It is also possible to change this
behaviour as follows. Ordering the buckets by their doc_count in an ascending manner:
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "_count" : "asc" }
}
}
}
}
Ordering the buckets alphabetically by their terms in an ascending manner:
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "_term" : "asc" }
}
}
}
}
Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation
name):
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "avg_play_count" : "desc" }
},
"aggs" : {
"avg_play_count" : { "avg" : { "field" : "play_count" } }
}
}
}
}
Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation
name):
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "playback_stats.avg" : "desc" }
},
"aggs" : {
"playback_stats" : { "stats" : { "field" : "play_count" }
}
}
}
}
}
Sorting by ascending _count or by a sub aggregation is discouraged as it
increases the error on document counts. It is fine when a single shard is
queried, or when the field that is being aggregated was used as a routing
key at index time: in these cases results will be accurate since shards have
disjoint values. However otherwise, errors are unbounded. One particular
case that could still be useful is sorting by min or max aggregation: counts
will not be accurate but at least the top buckets will be correctly picked.
It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy.
This is supported as long as the aggregations path are of a single-bucket type, where the
last aggregation in the path may either be a single-bucket one or a metrics one. If it’s a
single-bucket type, the order will be defined by the number of docs in the bucket (i.e.
doc_count), in case it’s a metrics one, the same rules as above apply (where the path
must indicate the metric name to sort by in case of a multi-value metrics aggregation, and
in case of a single-value metrics aggregation the sort will be applied on that value).
AGG_SEPARATOR := '>'
METRIC_SEPARATOR := '.'
AGG_NAME := <the name of the aggregation>
METRIC := <the name of the metric (in case of multi-value
metrics aggregation)>
PATH :=
<AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
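A small parser makes the path grammar above concrete; the helper below is illustrative only, not part of NG|Storage:

```python
def parse_order_path(path):
    """Split an order path such as "rock>playback_stats.avg" into its
    aggregation names and optional trailing metric name."""
    *agg_names, last = path.split(">")       # AGG_SEPARATOR
    if "." in last:                          # METRIC_SEPARATOR present:
        agg, metric = last.rsplit(".", 1)    # multi-value metrics agg
        return agg_names + [agg], metric
    return agg_names + [last], None

print(parse_order_path("rock>playback_stats.avg"))
# (['rock', 'playback_stats'], 'avg')
print(parse_order_path("avg_play_count"))
# (['avg_play_count'], None)
```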
{
"aggs" : {
"countries" : {
"terms" : {
"field" : "artist.country",
"order" : { "rock>playback_stats.avg" : "desc" }
},
"aggs" : {
"rock" : {
"filter" : { "term" : { "genre" : "rock" }},
"aggs" : {
"playback_stats" : { "stats" : { "field" :
"play_count" }}
}
}
}
}
}
}
Multiple criteria can be used to order the buckets by providing an array of order criteria
such as the following:
{
"aggs" : {
"countries" : {
"terms" : {
"field" : "artist.country",
"order" : [ { "rock>playback_stats.avg" : "desc" }, {
"_count" : "desc" } ]
},
"aggs" : {
"rock" : {
"filter" : { "term" : { "genre" : { "rock" }}},
"aggs" : {
"playback_stats" : { "stats" : { "field" :
"play_count" }}
}
}
}
}
}
}
The above will sort the artist’s countries buckets based on the average play count among
the rock songs and then by their doc_count in descending order.
In the event that two buckets share the same values for all order criteria, the bucket’s term
value is used as a tie-breaker in ascending alphabetical order to prevent non-deterministic
ordering of buckets.
It is possible to only return terms that match more than a configured number of hits using
the min_doc_count option:
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"min_doc_count": 10
}
}
}
}
The above aggregation would only return tags which have been found in 10 hits or more.
Default value is 1.
Terms are collected and ordered on a shard level and merged with the terms collected
from other shards in a second step. However, the shard does not have the information
about the global document count available. The decision if a term is added to a candidate
list depends only on the order computed on the shard using local shard frequencies. The
min_doc_count criterion is only applied after merging local terms statistics of all shards.
In a way the decision to add the term as a candidate is made without being very certain
whether the term will actually reach the required min_doc_count. This might cause many
(globally) high frequent terms to be missing in the final result if low frequent terms
populated the candidate lists. To avoid this, the shard_size parameter can be increased
to allow more candidate terms on the shards. However, this increases memory
consumption and network traffic.
shard_min_doc_count parameter
The parameter shard_min_doc_count regulates the certainty a shard has if the term
should actually be added to the candidate list or not with respect to the min_doc_count.
Terms will only be considered if their local shard frequency within the set is higher than the
shard_min_doc_count. If your dictionary contains many low frequent terms and you are
not interested in those (for example misspellings), then you can set the
shard_min_doc_count parameter to filter out candidate terms on a shard level that will
with a reasonable certainty not reach the required min_doc_count even after merging the
local counts. By default, shard_min_doc_count is set to 0 and has no effect unless you
explicitly set it.
Setting min_doc_count=0 will also return buckets for terms that didn’t
match any hit. However, some of the returned terms which have a
document count of zero might only belong to deleted documents or
documents from other types, so there is no guarantee that a
match_all query would find a positive document count for those
terms.
Script
{
"aggs" : {
"genres" : {
"terms" : {
"script" : {
"inline": "doc['genre'].value"
"lang": "painless"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"genres" : {
"terms" : {
"script" : {
"file": "my_script",
"params": {
"field": "genre"
}
}
}
}
}
}
Value Script
Filtering Values
It is possible to filter the values for which buckets will be created. This can be done using
the include and exclude parameters which are based on regular expression strings or
arrays of exact values.
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"include" : ".*sport.*",
"exclude" : "water_.*"
}
}
}
}
In the above example, buckets will be created for all the tags that have the word sport in
them, except those starting with water_ (so the tag water_sports will not be
aggregated). The include regular expression determines which values are "allowed" to
be aggregated, while the exclude determines the values that should not be aggregated.
When both are defined, the exclude has precedence: the include is evaluated first, and
only then the exclude.
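The same precedence can be sketched in a few lines of Python; filter_terms is an illustrative helper, not NG|Storage code:

```python
import re

def filter_terms(terms, include=None, exclude=None):
    """include is applied first, then exclude, so exclude has
    precedence when both match a term."""
    if include is not None:
        terms = [t for t in terms if re.fullmatch(include, t)]
    if exclude is not None:
        terms = [t for t in terms if not re.fullmatch(exclude, t)]
    return terms

tags = ["sports", "water_sports", "hiking", "motorsport"]
print(filter_terms(tags, include=".*sport.*", exclude="water_.*"))
# ['sports', 'motorsport']
```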
For matching based on exact values the include and exclude parameters can simply
take an array of strings that represent the terms as they are found in the index:
The terms aggregation does not support collecting terms from multiple fields in the same
document. The reason is that the terms agg doesn’t collect the string term values
themselves, but rather uses global ordinals to produce a list of all of the unique values in
the field. Global ordinals results in an important performance boost which would not be
possible across multiple fields.
There are two approaches that you can use to perform a terms agg across multiple fields:
Script
Use a script to retrieve terms from multiple fields. This disables the global ordinals
optimization and will be slower than collecting terms from a single field, but it gives
you the flexibility to implement this option at search time.
copy_to field
If you know ahead of time that you want to collect the terms from two or more fields,
then use copy_to in your mapping to create a new dedicated field at index time
which contains the values from both fields. You can aggregate on this single field,
which will benefit from the global ordinals optimization.
Collect mode
For fields with many unique terms and a small number of required results it can be more
efficient to delay the calculation of child aggregations until the top parent-level aggs have
been pruned. Ordinarily, all branches of the aggregation tree are expanded in one depth-
{
"aggs" : {
"actors" : {
"terms" : {
"field" : "actors",
"size" : 10
},
"aggs" : {
"costars" : {
"terms" : {
"field" : "actors",
"size" : 5
}
}
}
}
}
}
Even though the number of actors may be comparatively small and we want only 50 result
buckets, there is a combinatorial explosion of buckets during calculation - a single actor can
produce n² buckets where n is the number of actors. The sane option would be to first
determine the 10 most popular actors and only then examine the top co-stars for these 10
actors. This alternative strategy is what we call the breadth_first collection mode as
opposed to the depth_first mode.
When using breadth_first mode the set of documents that fall into the uppermost
buckets are cached for subsequent replay so there is a memory overhead in doing this
which is linear with the number of matching documents. Note that the order parameter
can still be used to refer to data from a child aggregation when using the breadth_first
setting - the parent aggregation understands that this child aggregation will need to be
called first before any of the other child aggregations.
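Back-of-the-envelope arithmetic shows why breadth_first helps in the actors example; the corpus size below is hypothetical:

```python
# Hypothetical number of distinct actors in the index:
n_actors = 10_000

# depth_first expands all branches at once: each of up to n parent
# buckets can hold up to n co-star child buckets before any pruning.
depth_first_buckets = n_actors * n_actors

# breadth_first prunes the parents to the requested top 10 first and
# replays the cached documents to build only their child buckets.
breadth_first_buckets = n_actors + 10 * n_actors

print(depth_first_buckets)    # 100,000,000 candidate buckets
print(breadth_first_buckets)  # 110,000 candidate buckets
```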
Execution hint
There are different mechanisms by which terms aggregations can be executed:
• by using field values directly in order to aggregate data per-bucket (map)
• by using ordinals of the field and preemptively allocating one bucket per ordinal value
(global_ordinals)
• by using ordinals of the field and dynamically allocating one bucket per ordinal value
(global_ordinals_hash)
• by using per-segment ordinals to compute counts and remap these counts to global
counts using global ordinals (global_ordinals_low_cardinality)
NG|Storage tries to have sensible defaults so this is something that generally doesn’t need
to be configured.
map should only be considered when very few documents match a query. Otherwise the
ordinals-based execution modes are significantly faster. By default, map is only used when
running an aggregation on scripts, since they don’t have ordinals.
global_ordinals is the second fastest option, but the fact that it preemptively allocates
buckets can be memory-intensive, especially if you have one or more sub aggregations. It is
used by default on top-level terms aggregations.
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"execution_hint": "map" 1
}
}
}
}
Please note that NG|Storage will ignore this execution hint if it is not applicable and that
there is no backward compatibility guarantee on these hints.
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored, but it is also possible to treat them as if they had a
value:
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"missing": "N/A" 1
}
}
}
}
1 - Documents without a value in the tags field will fall into the same bucket as documents
that have the value N/A.
The aggregations in this family operate on multiple fields and produce a matrix result
based on the values extracted from the requested document fields. Unlike metric and
bucket aggregations, this aggregation family does not yet support scripting.
count
Number of per field samples included in the calculation.
mean
The average value for each field.
variance
Per field measurement of how spread out the samples are from the mean.
skewness
Per field measurement quantifying the asymmetric distribution around the mean.
kurtosis
Per field measurement quantifying the shape of the distribution.
covariance
A matrix that quantitatively describes how changes in one field are associated with
another.
correlation
The covariance matrix scaled to a range of -1 to 1, inclusive. Describes the
relationship between field distributions.
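The relationship between the two matrices can be checked directly: each correlation entry is the covariance entry divided by the product of the two fields' standard deviations. The sketch below recomputes it from the covariance values of the response shown in this section:

```python
import math

def correlation_from_covariance(cov):
    """Scale a covariance matrix (nested dicts) to [-1, 1]:
    corr(i, j) = cov(i, j) / (std(i) * std(j))."""
    std = {f: math.sqrt(cov[f][f]) for f in cov}
    return {i: {j: cov[i][j] / (std[i] * std[j]) for j in cov}
            for i in cov}

# Covariance values from the matrix_stats response in this section:
cov = {
    "income":  {"income": 7.383377037755103e7, "poverty": -21093.65836734694},
    "poverty": {"income": -21093.65836734694,  "poverty": 8.637730612244896},
}
corr = correlation_from_covariance(cov)
print(round(corr["income"]["poverty"], 4))  # -0.8353
```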
The following example demonstrates the use of matrix stats to describe the relationship
between income and poverty:
{
"aggs": {
"matrixstats": {
"matrix_stats": {
"fields": ["poverty", "income"]
}
}
}
}
The aggregation type is matrix_stats and the fields setting defines the set of fields (as
an array) for computing the statistics. The above request returns the following response:
{
...
"aggregations": {
"matrixstats": {
"fields": [{
"name": "income",
"count": 50,
"mean": 51985.1,
"variance": 7.383377037755103E7,
"skewness": 0.5595114003506483,
"kurtosis": 2.5692365287787124,
"covariance": {
"income": 7.383377037755103E7,
"poverty": -21093.65836734694
},
"correlation": {
"income": 1.0,
"poverty": -0.8352655256272504
}
}, {
"name": "poverty",
"count": 50,
"mean": 12.732000000000001,
"variance": 8.637730612244896,
"skewness": 0.4516049811903419,
"kurtosis": 2.8615929677997767,
"covariance": {
"income": -21093.65836734694,
"poverty": 8.637730612244896
},
"correlation": {
"income": -0.8352655256272504,
"poverty": 1.0
}
}]
}
}
}
The mode parameter controls what array value the aggregation will use for array or
multi-valued fields. The mode parameter can take the following values:
avg
(default) Use the average of all values.
min
Pick the lowest value.
max
Pick the highest value.
sum
Use the sum of all values.
median
Use the median of all values.
Missing Values
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value. This is done by adding a set of fieldname : value mappings to specify default values
per field.
{
"aggs": {
"matrixstats": {
"matrix_stats": {
"fields": ["poverty", "income"],
"missing": {"income" : 50000} 1
}
}
}
}
1 - Documents without a value in the income field will have the default value 50000.
The aggregations in this family compute metrics based on values extracted in one way or
another from the documents that are being aggregated. The values are typically extracted
from the fields of the document (using the field data), but can also be generated using
scripts.
Numeric metrics aggregations are a special type of metrics aggregation which output
numeric values. Some aggregations output a single numeric metric (e.g. avg) and are
called single-value numeric metrics aggregation, others generate multiple
metrics (e.g. stats) and are called multi-value numeric metrics aggregation.
The distinction between single-value and multi-value numeric metrics aggregations plays a
role when these aggregations serve as direct sub-aggregations of some bucket
aggregations (some bucket aggregations enable you to sort the returned buckets based on
the numeric metrics in each bucket).
A single-value metrics aggregation that computes the average of numeric values that
are extracted from the aggregated documents. These values can be extracted either from
specific numeric fields in the documents, or be generated by a provided script.
Assuming the data consists of documents representing exam grades (between 0 and 100)
of students:
{
"aggs" : {
"avg_grade" : { "avg" : { "field" : "grade" } }
}
}
The above aggregation computes the average grade over all documents. The aggregation
type is avg and the field setting defines the numeric field of the documents the average
will be computed on. The above will return the following:
"aggregations": {
"avg_grade": {
"value": 75
}
}
}
The name of the aggregation (avg_grade above) also serves as the key by which the
aggregation result can be retrieved from the returned response.
Script
{
...,
"aggs" : {
"avg_grade" : {
"avg" : {
"script" : {
"inline" : "doc['grade'].value",
"lang" : "painless"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"avg_grade" : {
"avg" : {
"script" : {
"file": "my_script",
"params": {
"field": "grade"
}
}
}
}
}
}
Value Script
It turned out that the exam was way above the level of the students and a grade correction
needs to be applied. We can use value script to get the new average:
{
"aggs" : {
...
"aggs" : {
"avg_corrected_grade" : {
"avg" : {
"field" : "grade",
"script" : {
"lang": "painless",
"inline": "_value * params.correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"grade_avg" : {
"avg" : {
"field" : "grade",
"missing": 10 1
}
}
}
}
1 - Documents without a value in the grade field will fall into the same bucket as
documents that have the value 10.
A single-value metrics aggregation that calculates an approximate count of distinct
values. Values can be extracted either from specific fields in the document or generated by
a script.
Assume you are indexing books and would like to count the unique authors that match a
query:
{
"aggs" : {
"author_count" : {
"cardinality" : {
"field" : "author"
}
}
}
}
Precision control
{
"aggs" : {
"author_count" : {
"cardinality" : {
"field" : "author_hash",
"precision_threshold": 100 1
}
}
}
}
1 - The precision_threshold option allows you to trade memory for accuracy. It defines
a unique count below which counts are expected to be close to accurate. Above this value,
counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds
above this number will have the same effect as a threshold of 40000. The default value is
3000.
Computing exact counts requires loading values into a hash set and returning its size. This
doesn’t scale when working on high-cardinality sets and/or large values, as the required
memory usage and the need to communicate those per-shard sets between nodes would
use too many resources of the cluster. This cardinality aggregation is based on the
HyperLogLog++ algorithm, which counts based on the hashes of the values with some
interesting properties:
• configurable precision, which decides on how to trade memory for accuracy,
• excellent accuracy on low-cardinality sets,
• fixed memory usage: no matter if there are tens or billions of unique values, memory
usage only depends on the configured precision.
For a precision threshold of c, the implementation that we are using requires about c * 8
bytes.
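That bound is easy to turn into numbers; the helper below is a back-of-the-envelope sketch, not NG|Storage code:

```python
def cardinality_memory_bytes(precision_threshold):
    """Approximate memory used by the counting structure: about
    c * 8 bytes for a precision threshold of c, regardless of how
    many unique values are actually observed."""
    return precision_threshold * 8

print(cardinality_memory_bytes(3000))   # default threshold: 24000 bytes
print(cardinality_memory_bytes(40000))  # maximum threshold: 320000 bytes
```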
Pre-computed hashes
On string fields that have a high cardinality, it might be faster to store the hash of your field
values in your index and then run the cardinality aggregation on this field. This can either
be done by providing hash values from client-side or by letting NG|Storage compute hash
values for you by using the mapper-murmur3 plugin.
However, on numeric fields, hashing is very fast and storing the original
values requires as much or less memory than storing the hashes. This is
also true on low-cardinality string fields, especially given that those have
an optimization in order to make sure that hashes are computed at most
once per unique value per segment.
Script
The cardinality metric supports scripting, with a noticeable performance hit however
since hashes need to be computed on the fly.
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"author_count" : {
"cardinality" : {
"script" : {
"file": "my_script",
"params": {
"first_name_field": "author.first_name",
"last_name_field": "author.last_name"
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored, but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"tag_cardinality" : {
"cardinality" : {
"field" : "tag",
"missing": "N/A" 1
}
}
}
}
1 - Documents without a value in the tag field will fall into the same bucket as documents
that have the value N/A.
A multi-value metrics aggregation that computes stats over numeric values extracted
from the aggregated documents. These values can be extracted either from specific
numeric fields in the documents, or be generated by a provided script.
Assuming the data consists of documents representing exam grades (between 0 and 100)
of students:
{
"aggs" : {
"grades_stats" : { "extended_stats" : { "field" : "grade" } }
}
}
The above aggregation computes the grades statistics over all documents. The aggregation
type is extended_stats and the field setting defines the numeric field of the
documents the stats will be computed on. The above will return the following:
"aggregations": {
"grade_stats": {
"count": 9,
"min": 72,
"max": 99,
"avg": 86,
"sum": 774,
"sum_of_squares": 67028,
"variance": 51.55555555555556,
"std_deviation": 7.180219742846005,
"std_deviation_bounds": {
"upper": 100.36043948569201,
"lower": 71.63956051430799
}
}
}
}
The name of the aggregation (grades_stats above) also serves as the key by which the
aggregation result can be retrieved from the returned response.
{
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"sigma" : 3 1
}
}
}
}
1 - sigma controls how many standard deviations +/- from the mean should be displayed
sigma can be any non-negative double, meaning you can request non-integer values such
as 1.5. A value of 0 is valid, but will simply return the average for both upper and lower
bounds.
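The bounds can be recomputed from the other extended_stats fields. The sketch below reproduces the upper and lower values of the grades response shown earlier (count 9, sum 774, sum_of_squares 67028) with the default sigma of 2:

```python
import math

def std_deviation_bounds(count, total, sum_of_squares, sigma=2.0):
    """std_deviation_bounds = avg +/- sigma * std_deviation, where
    the variance is derived from sum_of_squares and the mean."""
    avg = total / count
    variance = sum_of_squares / count - avg * avg
    std = math.sqrt(variance)
    return {"upper": avg + sigma * std, "lower": avg - sigma * std}

bounds = std_deviation_bounds(9, 774, 67028)
print(bounds)  # upper is roughly 100.36, lower roughly 71.64
```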
The standard deviation and its bounds are displayed by default, but they
are not always applicable to all data-sets. Your data must be normally
distributed for the metrics to make sense. The statistics behind standard
deviation assume normally distributed data, so if your data is skewed
heavily left or right, the values returned will be misleading.
Script
{
...,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"script" : {
"inline" : "doc['grade'].value",
"lang" : "painless"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"script" : {
"file": "my_script",
"params": {
"field": "grade"
}
}
}
}
}
}
It turned out that the exam was way above the level of the students and a grade correction
needs to be applied. We can use value script to get the new stats:
{
"aggs" : {
...
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"script" : {
"lang" : "painless",
"inline": "_value * params.correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"missing": 0 1
}
}
}
}
1 - Documents without a value in the grade field will fall into the same bucket as
documents that have the value 0.
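The effect of missing can be sketched in plain Python (hypothetical grades; None stands for a document without the field):

```python
grades = [90, None, 70, None, 80]  # None = document missing the grade field

# Default behaviour: documents without a value are ignored.
present = [g for g in grades if g is not None]
avg_default = sum(present) / len(present)        # averages 3 documents

# With "missing": 0, absent values are treated as if they were 0.
filled = [0 if g is None else g for g in grades]
avg_with_missing = sum(filled) / len(filled)     # averages all 5 documents
```

Substituting 0 for missing values pulls the average down, which is exactly why the choice of a missing value should match the semantics of your data.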
A metric aggregation that computes the bounding box containing all geo_point values for a field.
Example:
{
"query" : {
"match" : { "business_type" : "shop" }
},
"aggs" : {
"viewport" : {
"geo_bounds" : {
"field" : "location", 1
"wrap_longitude" : true ¬
}
}
}
}
1 - The geo_bounds aggregation specifies the field to use to obtain the bounds
2 - wrap_longitude is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is true.
The above aggregation demonstrates how one would compute the bounding box of the
location field for all documents with a business type of shop.
{
...
"aggregations": {
"viewport": {
"bounds": {
"top_left": {
"lat": 80.45,
"lon": -160.22
},
"bottom_right": {
"lat": 40.65,
"lon": 42.57
}
}
}
}
}
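The underlying computation is a running min/max over the coordinates. A simplified sketch (ignoring the wrap_longitude date-line handling):

```python
def geo_bounds(points):
    """points: list of (lat, lon) tuples; returns the enclosing bounding box."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }

box = geo_bounds([(80.45, -160.22), (40.65, 42.57), (60.0, 0.0)])
```

The top-left corner pairs the highest latitude with the lowest longitude, and the bottom-right corner the opposite, matching the shape of the response above.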
A metric aggregation that computes the weighted centroid from all coordinate values for a
[geo-point] field.
134 | Chapter 12. Metrics Aggregations
NG|Storage Admin Guide
Example:
{
"query" : {
"match" : { "crime" : "burglary" }
},
"aggs" : {
"centroid" : {
"geo_centroid" : {
"field" : "location" 1
}
}
}
}
1 - The geo_centroid aggregation specifies the field to use for computing the centroid.
(NOTE: field must be a [geo-point] type)
The above aggregation demonstrates how one would compute the centroid of the location
field for all documents with a crime type of burglary
{
...
"aggregations": {
"centroid": {
"location": {
"lat": 80.45,
"lon": -160.22
}
}
}
}
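Conceptually, the centroid is the mean of the coordinate values. This is a simplification that ignores the incremental weighting and spherical corrections a real implementation may apply:

```python
def centroid(points):
    """points: list of (lat, lon) tuples; returns their arithmetic mean."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (sum(lats) / len(points), sum(lons) / len(points))

centroid([(0.0, 0.0), (10.0, 20.0)])
```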
Example of a response when geo_centroid is used as a sub-aggregation of a bucket aggregation (one centroid per bucket):
{
...
"buckets": [
{
"key": "Los Altos",
"doc_count": 113,
"centroid": {
"location": {
"lat": 37.3924582824111,
"lon": -122.12104808539152
}
}
},
{
"key": "Mountain View",
"doc_count": 92,
"centroid": {
"location": {
"lat": 37.382152481004596,
"lon": -122.08116559311748
}
}
}
]
}
A single-value metrics aggregation that keeps track of and returns the maximum value
among the numeric values extracted from the aggregated documents. These values can be
extracted either from specific numeric fields in the documents, or be generated by a
provided script.
{
"aggs" : {
"max_price" : { "max" : { "field" : "price" } }
}
}
Response:
{
...
"aggregations": {
"max_price": {
"value": 35
}
}
}
As can be seen, the name of the aggregation (max_price above) also serves as the key by
which the aggregation result can be retrieved from the returned response.
Script
Computing the max price value across all documents, this time using a script:
{
"aggs" : {
"max_price" : {
"max" : {
"script" : {
"inline" : "doc['price'].value",
"lang" : "painless"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"max_price" : {
"max" : {
"script" : {
"file": "my_script",
"params": {
"field": "price"
}
}
}
}
}
}
Value Script
Let’s say that the prices of the documents in our index are in USD, but we would like to
compute the max in EURO (and for the sake of this example, let's say the conversion rate is
1.2). We can use a value script to apply the conversion rate to every value before it is
aggregated:
{
"aggs" : {
"max_price_in_euros" : {
"max" : {
"field" : "price",
"script" : {
"lang": "painless",
"inline": "_value * params.conversion_rate",
"params" : {
"conversion_rate" : 1.2
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"grade_max" : {
"max" : {
"field" : "grade",
"missing": 10 1
}
}
}
}
1 - Documents without a value in the grade field will fall into the same bucket as
documents that have the value 10.
A single-value metrics aggregation that keeps track of and returns the minimum value
among numeric values extracted from the aggregated documents. These values can be
extracted either from specific numeric fields in the documents, or be generated by a
provided script.
{
"aggs" : {
"min_price" : { "min" : { "field" : "price" } }
}
}
Response:
{
...
"aggregations": {
"min_price": {
"value": 10
}
}
}
As can be seen, the name of the aggregation (min_price above) also serves as the key by
which the aggregation result can be retrieved from the returned response.
Script
Computing the min price value across all documents, this time using a script:
{
"aggs" : {
"min_price" : {
"min" : {
"script" : {
"inline" : "doc['price'].value",
"lang" : "painless"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"min_price" : {
"min" : {
"script" : {
"file": "my_script",
"params": {
"field": "price"
}
}
}
}
}
}
Value Script
Let’s say that the prices of the documents in our index are in USD, but we would like to
compute the min in EURO (and for the sake of this example, let's say the conversion rate is
1.2). We can use a value script to apply the conversion rate to every value before it is
aggregated:
{
"aggs" : {
"min_price_in_euros" : {
"min" : {
"field" : "price",
"script" : {
"lang": "painless",
"inline": "_value * params.conversion_rate",
"params" : {
"conversion_rate" : 1.2
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"grade_min" : {
"min" : {
"field" : "grade",
"missing": 10 1
}
}
}
}
1 - Documents without a value in the grade field will fall into the same bucket as
documents that have the value 10.
A multi-value metrics aggregation that calculates one or more percentiles over numeric
values extracted from the aggregated documents. These values can be extracted either
from specific numeric fields in the documents, or be generated by a provided script.
Percentiles show the point at which a certain percentage of observed values occur. For
example, the 95th percentile is the value which is greater than 95% of the observed values.
Percentiles are often used to find outliers. In normal distributions, the 0.13th and 99.87th
percentiles represent three standard deviations from the mean; any data that falls outside
three standard deviations is often considered an anomaly. When a range of percentiles is
retrieved, they can be used to estimate the data distribution and determine if the data is
skewed, bimodal, etc.
Assume your data consists of website load times. The average and median load times are
not overly useful to an administrator. The max may be interesting, but it can be easily
skewed by a single slow response.
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time" 1
}
}
}
}
{
...
"aggregations": {
"load_time_outlier": {
"values" : {
"1.0": 15,
"5.0": 20,
"25.0": 23,
"50.0": 25,
"75.0": 29,
"95.0": 60,
"99.0": 150
}
}
}
}
As you can see, the aggregation will return a calculated value for each percentile in the
default range. If we assume response times are in milliseconds, it is immediately obvious
that the webpage normally loads in 15-30ms, but occasionally spikes to 60-150ms.
Often, administrators are only interested in outliers, i.e. the extreme percentiles. We can
specify just the percentiles we are interested in with the percents parameter:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"percents" : [95, 99, 99.9] 1
}
}
}
}
1 - Use the percents parameter to specify particular percentiles to calculate
Script
The percentile metric supports scripting. For example, if our load times are in milliseconds
but we want percentiles calculated in seconds, we could use a script to convert them on-
the-fly:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"script" : {
"lang": "painless",
"inline": "doc['load_time'].value / params.timeUnit",
1
"params" : {
"timeUnit" : 1000 ¬
}
}
}
}
}
}
1 - The field parameter is replaced with a script parameter, which uses the script to generate values which percentiles are calculated on
2 - Scripting supports parameterized input just like any other script
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"script" : {
"file": "my_script",
"params" : {
"timeUnit" : 1000
}
}
}
}
}
}
There are many different algorithms to calculate percentiles. The naive implementation
simply stores all the values in a sorted array. To find the 50th percentile, you simply find
the value that is at my_array[count(my_array) * 0.5].
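A minimal sketch of this naive approach (hypothetical load times in milliseconds):

```python
def naive_percentile(values, q):
    """Naive percentile: keep every value in a sorted array and index into it.
    Memory grows linearly with the number of values, which is why an
    approximate algorithm is used instead at scale."""
    ordered = sorted(values)
    # Clamp the index so q close to 100 stays within bounds.
    idx = min(int(len(ordered) * q / 100.0), len(ordered) - 1)
    return ordered[idx]

load_times = [15, 20, 23, 25, 29, 60, 150]
median = naive_percentile(load_times, 50)
```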
Clearly, the naive implementation does not scale: the sorted array grows linearly with
the number of values in your dataset. To calculate percentiles across potentially billions of
values in an NG|Storage cluster, approximate percentiles are calculated.
The algorithm used by the percentile metric is called TDigest (introduced by Ted
Dunning in Computing Accurate Quantiles using T-Digests).
When using this metric, there are a few guidelines to keep in mind:
• Accuracy is proportional to q(1-q). This means that extreme percentiles (e.g. 99%) are
more accurate than less extreme percentiles, such as the median
• For small sets of values, percentiles are highly accurate (and potentially 100% accurate
if the data is small enough).
• As the quantity of values in a bucket grows, the algorithm begins to approximate the
percentiles. It is effectively trading accuracy for memory savings. The exact level of
inaccuracy is difficult to generalize, since it depends on your data distribution and
volume of data being aggregated
The reason why error diminishes for large number of values is that the law of large
numbers makes the distribution of values more and more uniform and the t-digest tree can
do a better job at summarizing it. It would not be the case on more skewed distributions.
144 | Chapter 12. Metrics Aggregations
NG|Storage Admin Guide
Compression
Approximate algorithms must balance memory utilization with estimation accuracy. This
balance can be controlled using a compression parameter:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"compression" : 200 1
}
}
}
}
Therefore, by increasing the compression value, you can increase the accuracy of your
percentiles at the cost of more memory. Larger compression values also make the
algorithm slower since the underlying tree data structure grows in size, resulting in more
expensive operations. The default compression value is 100.
A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount of
data which arrives sorted and in-order) the default settings will produce a TDigest roughly
64KB in size. In practice data tends to be more random and the TDigest will use less
memory.
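Back-of-the-envelope, assuming the worst case is roughly 20 × compression nodes at about 32 bytes each (an approximation, not an exact formula):

```python
def tdigest_worst_case_bytes(compression, bytes_per_node=32):
    # Worst case the t-digest holds roughly 20 * compression nodes,
    # each costing about 32 bytes of memory.
    return 20 * compression * bytes_per_node

size = tdigest_worst_case_bytes(100)  # default compression
```

At the default compression of 100 this gives 64,000 bytes, i.e. the roughly 64KB figure quoted above; doubling the compression doubles the worst-case footprint.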
HDR Histogram
HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can
be useful when calculating percentiles for latency measurements as it can be faster than
the t-digest implementation with the trade-off of a larger memory footprint. This
implementation maintains a fixed worst-case percentage error (specified as a number of
significant digits). This means that if data is recorded with values from 1 microsecond up to
1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will
maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds
(or better) for the maximum tracked value (1 hour).
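The resolution guarantee is easiest to see with a quick calculation, assuming the worst-case error is about one part in 10^significant_digits of the recorded value:

```python
def hdr_worst_case_resolution(value, significant_digits):
    # A value is tracked to within roughly value / 10**significant_digits.
    return value / 10 ** significant_digits

# Latencies recorded in microseconds, 3 significant digits:
res_at_1ms = hdr_worst_case_resolution(1_000, 3)           # 1 microsecond
res_at_1h = hdr_worst_case_resolution(3_600_000_000, 3)    # 3.6 seconds, in microseconds
```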
The HDR Histogram can be used by specifying the method parameter in the request:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"percents" : [95, 99, 99.9],
"method" : "hdr", 1
"number_of_significant_value_digits" : 3 ¬
}
}
}
}
1 - The method parameter is set to hdr to indicate that HDR Histogram should be used to calculate the percentiles
2 - number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits
The HDRHistogram only supports positive values and will error if it is passed a negative
value. It is also not a good idea to use the HDRHistogram if the range of values is unknown
as this could lead to high memory usage.
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"grade_percentiles" : {
"percentiles" : {
"field" : "grade",
"missing": 10 1
}
}
}
}
1 - Documents without a value in the grade field will fall into the same bucket as
documents that have the value 10.
A multi-value metrics aggregation that calculates one or more percentile ranks over
numeric values extracted from the aggregated documents. These values can be extracted
either from specific numeric fields in the documents, or be generated by a provided script.
Percentile ranks show the percentage of observed values which are below a certain value. For
example, if a value is greater than or equal to 95% of the observed values, it is said to be at
the 95th percentile rank.
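A percentile rank is just the share of observations at or below a given value. A sketch with hypothetical data:

```python
def percentile_rank(values, threshold):
    """Percentage of observed values less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

samples = [5, 10, 15, 20]
rank = percentile_rank(samples, 15)  # 3 of the 4 values are <= 15
```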
Assume your data consists of website load times. You may have a service agreement that
95% of page loads complete within 15ms and 99% of page loads complete within 30ms.
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"field" : "load_time", 1
"values" : [15, 30]
}
}
}
}
1 - The field load_time must be a numeric field
"aggregations": {
"load_time_outlier": {
"values" : {
"15": 92,
"30": 100
}
}
}
}
From this information you can determine that you are hitting the 99% load time target but
not quite hitting the 95% load time target.
Script
The percentile rank metric supports scripting. For example, if our load times are in
milliseconds but we want to specify values in seconds, we could use a script to convert
them on-the-fly:
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"values" : [3, 5],
"script" : {
"lang": "painless",
"inline": "doc['load_time'].value / params.timeUnit",
1
"params" : {
"timeUnit" : 1000 ¬
}
}
}
}
}
}
1 - The field parameter is replaced with a script parameter, which uses the script to generate values which percentile ranks are calculated on
2 - Scripting supports parameterized input just like any other script
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"values" : [3, 5],
"script" : {
"file": "my_script",
"params" : {
"timeUnit" : 1000
}
}
}
}
}
}
HDR Histogram
HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can
be useful when calculating percentile ranks for latency measurements as it can be faster
than the t-digest implementation with the trade-off of a larger memory footprint. This
implementation maintains a fixed worse-case percentage error (specified as a number of
significant digits). This means that if data is recorded with values from 1 microsecond up to
1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will
maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds
(or better) for the maximum tracked value (1 hour).
The HDR Histogram can be used by specifying the method parameter in the request:
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"field" : "load_time",
"values" : [15, 30],
"method" : "hdr", 1
"number_of_significant_value_digits" : 3 ¬
}
}
}
}
1 - The method parameter is set to hdr to indicate that HDR Histogram should be used to calculate the percentile_ranks
2 - number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits
The HDRHistogram only supports positive values and will error if it is passed a negative
value. It is also not a good idea to use the HDRHistogram if the range of values is unknown
as this could lead to high memory usage.
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"grade_ranks" : {
"percentile_ranks" : {
"field" : "grade",
"missing": 10 1
}
}
}
}
1 - Documents without a value in the grade field will fall into the same bucket as
documents that have the value 10.
A metric aggregation that executes using scripts to provide a metric output.
Example:
{
"query" : {
"match_all" : {}
},
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : "_agg['transactions'] = []",
"map_script" : "if (doc['type'].value == \"sale\") {
_agg.transactions.add(doc['amount'].value) } else {
_agg.transactions.add(-1 * doc['amount'].value) }", 1
"combine_script" : "profit = 0; for (t in
_agg.transactions) { profit += t }; return profit",
"reduce_script" : "profit = 0; for (a in _aggs) { profit
+= a }; return profit"
}
}
}
}
The above aggregation demonstrates how one would use the script aggregation to compute
the total profit from sale and cost transactions.
{
...
"aggregations": {
"profit": {
"value": 170
}
}
}
The above example can also be specified using file scripts as follows:
{
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : {
"file": "my_init_script"
},
"map_script" : {
"file": "my_map_script"
},
"combine_script" : {
"file": "my_combine_script"
},
"params": {
"field": "amount", 1
"_agg": {}
},
"reduce_script" : {
"file": "my_reduce_script"
}
}
}
}
}
1 - script parameters for init, map and combine scripts must be specified in a global
params object so that it can be shared between the scripts
Whilst any valid script object can be used within a single script, the scripts must return or
store in the _agg object only the following types:
• primitive types
• String
• Map (containing only keys and values of the types listed here)
• Array (containing elements of only the types listed here)
Scope of scripts
init_script
Executed prior to any collection of documents. Allows the aggregation to set up any
initial state.
map_script
Executed once per document collected. This is the only required script. If no
combine_script is specified, the resulting state needs to be stored in an object named
_agg.
In the above example, the map_script checks the value of the type field. If the value
is 'sale' the value of the amount field is added to the transactions array. If the value of
the type field is not 'sale' the negated value of the amount field is added to
transactions.
combine_script
Executed once on each shard after document collection is complete. Allows the
aggregation to consolidate the state returned from each shard. If a combine_script is
not provided the combine phase will return the aggregation variable.
In the above example, the combine_script iterates through all the stored
transactions, summing the values in the profit variable and finally returns profit.
reduce_script
Executed once on the coordinating node after all shards have returned their results.
The script is provided with access to a variable _aggs which is an array of the result
of the combine_script on each shard. If a reduce_script is not provided the reduce
phase will return the _aggs variable.
In the above example, the reduce_script iterates through the profit returned by
each shard summing the values before returning the final combined profit which will
be returned in the response of the aggregation.
Worked Example
Imagine a situation where you index the following four documents into an index with 2 shards:
{ "_id": 1, "type": "sale", "amount": 80 }
{ "_id": 2, "type": "cost", "amount": 10 }
{ "_id": 3, "type": "cost", "amount": 30 }
{ "_id": 4, "type": "sale", "amount": 130 }
Let's say that documents 1 and 3 end up on shard A and documents 2 and 4 end up on shard
B. The following is a breakdown of what the aggregation result is at each stage of the
example above.
Before init-script
Before init_script. No params object was specified so the default params object is used:
"params" : {
"_agg" : {}
}
After init-script
This is run once on each shard before any document collection is performed, and so we will
have a copy on each shard:
Shard A
Shard B
"params" : {
"_agg" : {
"transactions" : []
}
}
After map-script
Each shard collects its documents and runs the map_script on each document that is
collected:
Shard A
"params" : {
"_agg" : {
"transactions" : [ 80, -30 ]
}
}
Shard B
"params" : {
"_agg" : {
"transactions" : [ -10, 130 ]
}
}
After combine-script
The combine_script is executed on each shard after document collection is complete and
reduces all the transactions down to a single profit figure for each shard (by summing the
values in the transactions array) which is passed back to the coordinating node:
Shard A
50
Shard B
120
The reduce_script receives an _aggs array containing the result of the combine script for
each shard:
"_aggs" : [
50,
120
]
It reduces the responses for the shards down to a final overall profit figure (by summing the
values) and returns this as the result of the aggregation to produce the response:
{
...
"aggregations": {
"profit": {
"value": 170
}
}
}
Other Parameters
params
Optional. An object whose contents will be passed as variables to the init_script,
map_script and combine_script. This can be useful to allow the user to control the
behavior of the aggregation and for storing state between the scripts. If this is not
specified, the default is the equivalent of providing:
"params" : {
"_agg" : {}
}
reduce_params
Optional. An object whose contents will be passed as variables to the reduce_script.
This can be useful to allow the user to control the behavior of the reduce phase. If this
is not specified the variable will be undefined in the reduce_script execution.
A multi-value metrics aggregation that computes stats over numeric values extracted
from the aggregated documents. These values can be extracted either from specific
numeric fields in the documents, or be generated by a provided script.
The stats that are returned consist of: min, max, sum, count and avg.
Assuming the data consists of documents representing exam grades (between 0 and 100)
of students:
{
"aggs" : {
"grades_stats" : { "stats" : { "field" : "grade" } }
}
}
The above aggregation computes the grades statistics over all documents. The aggregation
type is stats and the field setting defines the numeric field of the documents the stats
will be computed on. The above will return the following:
{
...
"aggregations": {
"grades_stats": {
"count": 6,
"min": 60,
"max": 98,
"avg": 78.5,
"sum": 471
}
}
}
The name of the aggregation (grades_stats above) also serves as the key by which the
aggregation result can be retrieved from the returned response.
Script
"aggs" : {
"grades_stats" : {
"stats" : {
"script" : {
"lang": "painless",
"inline": "doc['grade'].value"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"grades_stats" : {
"stats" : {
"script" : {
"file": "my_script",
"params" : {
"field" : "grade"
}
}
}
}
}
}
Value Script
It turned out that the exam was way above the level of the students and a grade correction
needs to be applied. We can use a value script to get the new stats:
"aggs" : {
"grades_stats" : {
"stats" : {
"field" : "grade",
"script" :
"lang": "painless",
"inline": "_value * params.correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"grades_stats" : {
"stats" : {
"field" : "grade",
"missing": 0 1
}
}
}
}
1 - Documents without a value in the grade field will fall into the same bucket as
documents that have the value 0.
A single-value metrics aggregation that sums up numeric values that are extracted
from the aggregated documents. These values can be extracted either from specific
numeric fields in the documents, or be generated by a provided script.
Assuming the data consists of documents representing stock ticks, where each tick holds
the change in the stock price from the previous tick:
{
"query" : {
"constant_score" : {
"filter" : {
"range" : { "timestamp" : { "from" : "now/1d+9.5h", "to" :
"now/1d+16h" }}
}
}
},
"aggs" : {
"intraday_return" : { "sum" : { "field" : "change" } }
}
}
The above aggregation sums up all changes in today’s trading stock ticks, which
accounts for the intraday return. The aggregation type is sum and the field setting defines
the numeric field of the documents of which values will be summed up. The above will
return the following:
{
...
"aggregations": {
"intraday_return": {
"value": 2.18
}
}
}
The name of the aggregation (intraday_return above) also serves as the key by which
the aggregation result can be retrieved from the returned response.
Script
"aggs" : {
"intraday_return" : {
"sum" : {
"script" : {
"lang": "painless",
"inline": "doc['change'].value"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"intraday_return" : {
"sum" : {
"script" : {
"file": "my_script",
"params" : {
"field" : "change"
}
}
}
}
}
}
Value Script
Computing the sum of squares over all stock tick changes:
{
"aggs" : {
"daytime_return" : {
"sum" : {
"field" : "change",
"script" : {
"lang": "painless",
"inline": "_value * _value"
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them as if they had a
value.
{
"aggs" : {
"total_time" : {
"sum" : {
"field" : "took",
"missing": 100 1
}
}
}
}
1 - Documents without a value in the took field will fall into the same bucket as documents
that have the value 100.
A top_hits metric aggregator keeps track of the most relevant document being
aggregated. This aggregator is intended to be used as a sub aggregator, so that the top
matching documents can be aggregated per bucket.
The top_hits aggregator can effectively be used to group result sets by certain fields via
a bucket aggregator. One or more bucket aggregators determine by which properties the
result set is sliced.
• from - The offset from the first result you want to fetch.
• size - The maximum number of top matching hits to return per bucket. By default the
top three matching hits are returned.
• sort - How the top matching hits should be sorted. By default the hits are sorted by the
score of the main query.
The top_hits aggregation returns regular search hits; because of this, many per-hit
features can be supported:
• Highlighting
• Explain
• Source filtering
• Script fields
• Include versions
Example
In the following example we group the questions by tag and per tag we show the last active
question. For each question only the title field is being included in the source.
{
"aggs": {
"top-tags": {
"terms": {
"field": "tags",
"size": 3
},
"aggs": {
"top_tags_hits": {
"top_hits": {
"sort": [
{
"last_activity_date": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"title"
]
},
"size" : 1
}
}
}
}
}
}
"aggregations": {
"top-tags": {
"buckets": [
{
"key": "windows-7",
"doc_count": 25365,
"top_tags_hits": {
"hits": {
"total": 25365,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "question",
"_id": "602679",
"_score": 1,
"_source": {
"title": "Windows port opening"
},
"sort": [
1370143231177
]
}
]
}
}
},
{
"key": "linux",
"doc_count": 18342,
"top_tags_hits": {
"hits": {
"total": 18342,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "question",
"_id": "602672",
"_score": 1,
"_source": {
"title": "Ubuntu RFID Screensaver lock-unlock"
},
"sort": [
1370143379747
]
}
]
}
}
},
{
"key": "windows",
"doc_count": 18119,
"top_tags_hits": {
"hits": {
"total": 18119,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "question",
"_id": "602678",
"_score": 1,
"_source": {
"title": "If I change my computers date / time,
what could be affected?"
},
"sort": [
1370142868283
]
}
]
}
}
}
]
}
}
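The grouping semantics can be sketched in plain Python: bucket the documents by tag, sort each bucket, and keep the top hit (the documents here are hypothetical):

```python
from collections import defaultdict

questions = [
    {"tags": ["windows-7"], "title": "Windows port opening", "last_activity": 1370143231177},
    {"tags": ["linux"], "title": "Ubuntu RFID Screensaver lock-unlock", "last_activity": 1370143379747},
    {"tags": ["windows-7"], "title": "An older windows-7 question", "last_activity": 1369000000000},
]

def top_hit_per_tag(docs, size=1):
    buckets = defaultdict(list)
    for doc in docs:
        for tag in doc["tags"]:  # one document can land in several buckets
            buckets[tag].append(doc)
    # Sort each bucket by recency and keep only the top `size` hits.
    return {
        tag: sorted(hits, key=lambda d: d["last_activity"], reverse=True)[:size]
        for tag, hits in buckets.items()
    }

top = top_hit_per_tag(questions)
```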
In the example below we search across crawled webpages. For each webpage we store the
body and the domain the webpage belongs to. By defining a terms aggregator on the
domain field we group the result set of webpages by domain. The top_hits aggregator is
then defined as a sub-aggregator, so that the top matching hits are collected per bucket.
Also, a max aggregator is defined, which is used by the terms aggregator’s order feature to
return the buckets by relevancy order of the most relevant document in a bucket.
{
"query": {
"match": {
"body": "elections"
}
},
"aggs": {
"top-sites": {
"terms": {
"field": "domain",
"order": {
"top_hit": "desc"
}
},
"aggs": {
"top_tags_hits": {
"top_hits": {}
},
"top_hit" : {
"max": {
"script": {
"lang": "painless",
"inline": "_score"
}
}
}
}
}
}
}
At the moment the max (or min) aggregator is needed to make sure the buckets from the
terms aggregator are ordered according to the score of the most relevant webpage per
domain. Unfortunately the top_hits aggregator can’t be used in the order option of the
terms aggregator yet.
If the nested type has been configured, a single document is actually indexed as multiple
Lucene documents and they share the same id. In order to determine the identity of a
nested hit, more is needed than just the id, so nested hits also include their nested
identity. The nested identity is kept under the _nested field in the search hit and
includes the array field and the offset in the array field the nested hit belongs to. The
offset is zero based.
Top hits response snippet with a nested hit, which resides in the third slot of array field
nested_field1 in document with id 1:
...
"hits": {
"total": 25365,
"max_score": 1,
"hits": [
{
"_index": "a",
"_type": "b",
"_id": "1",
"_score": 1,
"_nested" : {
"field" : "nested_field1",
"offset" : 2
}
"_source": ...
},
...
]
}
...
If _source is requested then just the part of the source of the nested object is returned,
not the entire source of the document. Also stored fields on the nested inner object level
are accessible via top_hits aggregator residing in a nested or reverse_nested
aggregator.
Only nested hits will have a _nested field in the hit; non-nested (regular) hits will not have
a _nested field.
The information in _nested can also be used to parse the original source somewhere else
if _source isn’t enabled.
If there are multiple levels of nested object types defined in mappings then the _nested
information can also be hierarchical in order to express the identity of nested hits that are
two layers deep or more.
In the example below a nested hit resides in the first slot of the field
nested_grand_child_field, which itself resides in the second slot of the
nested_child_field field:
...
"hits": {
"total": 2565,
"max_score": 1,
"hits": [
{
"_index": "a",
"_type": "b",
"_id": "1",
"_score": 1,
"_nested" : {
"field" : "nested_child_field",
"offset" : 1,
"_nested" : {
"field" : "nested_grand_child_field",
"offset" : 0
}
},
"_source": ...
},
...
]
}
...
A single-value metrics aggregation that counts the number of values that are extracted
from the aggregated documents. These values can be extracted either from specific fields
in the documents, or be generated by a provided script. Typically, this aggregator will be
used in conjunction with other single-value aggregations. For example, when computing
the avg one might be interested in the number of values the average is computed over.
{
"aggs" : {
"grades_count" : {
"value_count" : { "field" : "grade" }
}
}
}
Response:
{
...
"aggregations": {
"grades_count": {
"value": 10
}
}
}
The name of the aggregation (grades_count above) also serves as the key by which the
aggregation result can be retrieved from the returned response.
Script
{
...,
"aggs" : {
"grades_count" : {
"value_count" : {
"script" : {
"inline" : "doc['grade'].value",
"lang" : "painless"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script
language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"grades_count" : {
"value_count" : {
"script" : {
"file": "my_script",
"params" : {
"field" : "grade"
}
}
}
}
}
}
Pipeline aggregations work on the outputs produced from other aggregations rather than
from document sets, adding information to the output tree. There are many different types
of pipeline aggregation, each computing different information from other aggregations, but
these types can be broken down into two families:
Parent
A family of pipeline aggregations that is provided with the output of its parent
aggregation and is able to compute new buckets or new aggregations to add to
existing buckets.
Sibling
Pipeline aggregations that are provided with the output of a sibling aggregation and
are able to compute a new aggregation which will be at the same level as the sibling
aggregation.
Pipeline aggregations can reference the aggregations they need to perform their
computation by using the buckets_path parameter to indicate the paths to the required
metrics. The syntax for defining these paths can be found in the buckets_path Syntax
section below.
Pipeline aggregations cannot have sub-aggregations, but depending on the type they can
reference another pipeline in the buckets_path, allowing pipeline aggregations to be
chained. For example, you can chain together two derivatives to calculate the second
derivative (i.e. a derivative of a derivative).
buckets_path Syntax
Most pipeline aggregations require another aggregation as their input. The input
aggregation is defined via the buckets_path parameter, which follows a specific format:
aggregation names are separated with >, and a metric inside a multi-value aggregation is
addressed with a trailing . and the metric name.
For example, the path "my_bucket>my_stats.avg" refers to the avg value in the
"my_stats" metric, which is contained in the "my_bucket" bucket aggregation.
Paths are relative from the position of the pipeline aggregation; they are not absolute
paths, and the path cannot go back "up" the aggregation tree. For example, this moving
average is embedded inside a date_histogram and refers to a "sibling" metric "the_sum":
{
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"interval":"day"
},
"aggs":{
"the_sum":{
"sum":{ "field": "lemmings" } 1
},
"the_movavg":{
"moving_avg":{ "buckets_path": "the_sum" } 2
}
}
}
}
1 - The metric is called "the_sum"
2 - The buckets_path refers to the metric via a relative path "the_sum"
buckets_path is also used for Sibling pipeline aggregations, where the aggregation is
"next" to a series of buckets instead of embedded "inside" them. For example, the
max_bucket aggregation uses the buckets_path to specify a metric embedded inside a
sibling aggregation.
Special Paths
Instead of pathing to a metric, buckets_path can use a special "_count" path. This
instructs the pipeline aggregation to use the document count as its input. For example, a
moving average can be calculated on the document count of each bucket, instead of a
specific metric:
{
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"interval":"day"
},
"aggs":{
"the_movavg":{
"moving_avg":{ "buckets_path": "_count" } 1
}
}
}
}
1 - By using _count instead of a metric name, we can calculate the moving average of
document counts in the histogram
An alternate syntax is supported to cope with aggregations or metrics which have dots in
the name, such as the 99.9th percentile. This metric may be referred to as:
"buckets_path": "my_percentile[99.9]"
Data in the real world is often noisy and sometimes contains gaps: places where data
simply doesn’t exist. This can occur for a variety of reasons, the most common being:
• There are no documents matching the query for one or more buckets
• The metric being calculated is unable to generate a value, likely because another
dependent bucket is missing a value. Some pipeline aggregations have specific
requirements that must be met (e.g. a derivative cannot calculate a metric for the first
value because there is no previous value, HoltWinters moving average need "warmup"
data to begin calculating, etc)
Gap policies are a mechanism to inform the pipeline aggregation about the desired
behavior when "gappy" or missing data is encountered. All pipeline aggregations accept
the gap_policy parameter. There are currently two gap policies to choose from:
skip
This option treats missing data as if the bucket does not exist. It will skip the bucket
and continue calculating using the next available value.
insert_zeros
This option will replace missing values with a zero (0) and pipeline aggregation
computation will proceed as normal.
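The two policies amount to a simple transformation of the series of bucket values before the pipeline computation runs. A minimal sketch in plain Python (illustrative only, not the NG|Storage implementation), where None marks a gap:

```python
def apply_gap_policy(values, gap_policy="skip"):
    """Resolve gaps (None) in a series of bucket values.

    skip         -> treat the bucket as if it does not exist
    insert_zeros -> replace the missing value with 0 and proceed as normal
    """
    if gap_policy == "skip":
        return [v for v in values if v is not None]
    if gap_policy == "insert_zeros":
        return [0 if v is None else v for v in values]
    raise ValueError("unknown gap_policy: %s" % gap_policy)

series = [550, None, 375]                        # one bucket produced no value
print(apply_gap_policy(series, "skip"))          # [550, 375]
print(apply_gap_policy(series, "insert_zeros"))  # [550, 0, 375]
```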
A sibling pipeline aggregation which calculates the (mean) average value of a specified
metric in a sibling aggregation. The specified metric must be numeric and the sibling
aggregation must be a multi-bucket aggregation.
Syntax
{
"avg_bucket": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the average of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"avg_monthly_sales": {
"avg_bucket": {
"buckets_path": "sales_per_month>sales" 1
}
}
}
}
1 - buckets_path instructs this avg_bucket aggregation that we want the (mean) average
of the sales aggregation in the sales_per_month date histogram
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"avg_monthly_sales": {
"value": 328.33333333333333
}
}
}
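The avg_monthly_sales value above is just the arithmetic mean of the three bucket sums, which is easy to verify by hand (plain Python, values copied from the response):

```python
monthly_sales = [550, 60, 375]  # the "sales" value of each sales_per_month bucket

# avg_bucket equivalent: sum the per-bucket metric values, divide by bucket count
avg_monthly_sales = sum(monthly_sales) / len(monthly_sales)
print(round(avg_monthly_sales, 2))  # 328.33
```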
A parent pipeline aggregation which executes a script which can perform per bucket
computations on specified metrics in the parent multi-bucket aggregation. The specified
metric must be numeric and the script must return a numeric value.
Syntax
{
"bucket_script": {
"buckets_path": {
"my_var1": "the_sum", 1
"my_var2": "the_value_count"
},
"script": "my_var1 / my_var2"
}
}
1 - Here, my_var1 is the name of the variable for this buckets path to use in the script,
the_sum is the path to the metrics to use for that variable.
The following snippet calculates the ratio percentage of t-shirt sales compared to total
sales each month:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 50
},
"t-shirts": {
"doc_count": 2,
"sales": {
"value": 10
}
},
"t-shirt-percentage": {
"value": 20
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"total_sales": {
"value": 60
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 15
}
},
"t-shirt-percentage": {
"value": 25
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 40
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 20
}
},
"t-shirt-percentage": {
"value": 50
}
}
]
}
}
}
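The t-shirt-percentage values in this response follow from per-bucket arithmetic that can be reproduced outside NG|Storage (a sketch; bucket values copied from the response above):

```python
# (total_sales, t-shirt sales) per monthly bucket, taken from the response above
buckets = [(50, 10), (60, 15), (40, 20)]

# bucket_script equivalent: t-shirt sales / total sales * 100 for each bucket
percentages = [t_shirt / total * 100 for total, t_shirt in buckets]
print(percentages)  # [20.0, 25.0, 50.0]
```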
A parent pipeline aggregation which executes a script which determines whether the
current bucket will be retained in the parent multi-bucket aggregation. The specified
metric must be numeric and the script must return a boolean value. If the script language
is expression then a numeric return value is permitted. In this case 0.0 will be evaluated
as false and all other values will evaluate to true.
Note: The bucket_selector aggregation, like all pipeline aggregations, executes after all
other sibling aggregations. This means that using the bucket_selector aggregation to filter
the returned buckets in the response does not save on the execution time of running the
aggregations.
Syntax
{
"bucket_selector": {
"buckets_path": {
"my_var1": "the_sum", 1
"my_var2": "the_value_count"
},
"script": "my_var1 > my_var2"
}
}
1 - Here, my_var1 is the name of the variable for this buckets path to use in the script,
the_sum is the path to the metrics to use for that variable.
The following snippet only retains buckets where the total sales for the month is less than
or equal to 50:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 50
}
}, 1
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 40
}
}
]
}
}
}
1 - Bucket for 2015/02/01 00:00:00 has been removed as its total sales exceeded 50
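The same retention rule can be sketched as a plain filter over the bucket totals (Python, illustrative only):

```python
buckets = {
    "2015/01/01 00:00:00": 50,
    "2015/02/01 00:00:00": 60,
    "2015/03/01 00:00:00": 40,
}

# bucket_selector equivalent of the script "total_sales <= 50":
# a bucket is kept only when the script evaluates to true
kept = {month: total for month, total in buckets.items() if total <= 50}
print(sorted(kept))  # ['2015/01/01 00:00:00', '2015/03/01 00:00:00']
```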
A parent pipeline aggregation which calculates the cumulative sum of a specified metric in
a parent histogram (or date_histogram) aggregation. The specified metric must be numeric
and the enclosing histogram must have min_doc_count set to 0 (default for histogram
aggregations).
Syntax
{
"cumulative_sum": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the cumulative sum of the total monthly sales:
1 - buckets_path instructs this cumulative sum aggregation to use the output of the
sales aggregation for the cumulative sum
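The effect on the example sales series is a plain running total; a sketch in Python using the monthly sums from the earlier examples:

```python
import itertools

monthly_sales = [550, 60, 375]

# cumulative_sum equivalent: each bucket reports its own value
# plus the values of all preceding buckets
cumulative = list(itertools.accumulate(monthly_sales))
print(cumulative)  # [550, 610, 985]
```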
A parent pipeline aggregation which calculates the derivative of a specified metric in a
parent histogram (or date_histogram) aggregation. The specified metric must be numeric
and the enclosing histogram must have min_doc_count set to 0 (default for histogram
aggregations).
Syntax
{
"derivative": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the derivative of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales" 1
}
}
}
}
}
}
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
} 1
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
},
"sales_deriv": {
"value": -490 2
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2, 3
"sales": {
"value": 375
},
"sales_deriv": {
"value": 315
}
}
]
}
}
}
1 - No derivative for the first bucket since we need at least 2 data points to calculate the
derivative
2 - Derivative value units are implicitly defined by the sales aggregation and the parent
histogram, so in this case the units would be $/month assuming the price field has units
of $.
3 - The number of documents in the bucket are represented by the doc_count field
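The sales_deriv values are simply the differences between consecutive bucket values; a quick check in plain Python (values from the response):

```python
monthly_sales = [550, 60, 375]

# derivative equivalent: no value for the first bucket, then current - previous
sales_deriv = [None] + [cur - prev
                        for prev, cur in zip(monthly_sales, monthly_sales[1:])]
print(sales_deriv)  # [None, -490, 315]
```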
A second order derivative can be calculated by chaining the derivative pipeline aggregation
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales"
}
},
"sales_2nd_deriv": {
"derivative": {
"buckets_path": "sales_deriv" 1
}
}
}
}
}
}
1 - buckets_path for the second derivative points to the name of the first derivative
1 - No second derivative for the first two buckets since we need at least 2 data points from
the first derivative to calculate the second derivative
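Chaining two derivative aggregations amounts to differencing the series twice; a sketch in plain Python, using the values from the earlier example:

```python
def first_difference(series):
    """Derivative sketch: None for the first bucket, then current - previous."""
    return [None] + [cur - prev for prev, cur in zip(series, series[1:])]

monthly_sales = [550, 60, 375]
sales_deriv = first_difference(monthly_sales)  # [None, -490, 315]

# The second derivative differences the first derivative, skipping its None
sales_2nd_deriv = [None] + first_difference(
    [v for v in sales_deriv if v is not None])
print(sales_2nd_deriv)  # [None, None, 805]
```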
Units
The derivative aggregation allows the units of the derivative values to be specified. This
returns an extra field in the response normalized_value which reports the derivative
value in the desired x-axis units. In the below example we calculate the derivative of the
total sales per month, but ask for the derivative of the sales in the units of sales per day:
1 - unit specifies what unit to use for the x-axis of the derivative calculation
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a
specified metric in a sibling aggregation. The specified metric must be numeric and the
sibling aggregation must be a multi-bucket aggregation.
This aggregation provides a few more statistics (sum of squares, standard deviation, etc)
compared to the stats_bucket aggregation.
Syntax
{
"extended_stats_bucket": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the extended stats of all the total monthly sales buckets:
A sibling pipeline aggregation which identifies the bucket(s) with the maximum value of a
specified metric in a sibling aggregation and outputs both the value and the key(s) of the
bucket(s). The specified metric must be numeric and the sibling aggregation must be a
multi-bucket aggregation.
Syntax
{
"max_bucket": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the maximum of the total monthly sales:
1 - keys is an array of strings since the maximum value may be present in multiple
buckets
A sibling pipeline aggregation which identifies the bucket(s) with the minimum value of a
specified metric in a sibling aggregation and outputs both the value and the key(s) of the
bucket(s). The specified metric must be numeric and the sibling aggregation must be a
multi-bucket aggregation.
Syntax
{
"min_bucket": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the minimum of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"min_monthly_sales": {
"min_bucket": {
"buckets_path": "sales_per_month>sales" 1
}
}
}
}
1 - buckets_path instructs this min_bucket aggregation that we want the minimum value
of the sales aggregation in the sales_per_month date histogram
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"min_monthly_sales": {
"keys": ["2015/02/01 00:00:00"], 1
"value": 60
}
}
}
1 - keys is an array of strings since the minimum value may be present in multiple buckets
Given an ordered series of data, the Moving Average aggregation will slide a window across
the data and emit the average value of that window. For example, given the data [1, 2,
3, 4, 5, 6, 7, 8, 9, 10], we can calculate a simple moving average with a window
size of 5 as follows:
• (1 + 2 + 3 + 4 + 5) / 5 = 3
• (2 + 3 + 4 + 5 + 6) / 5 = 4
• etc
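The sliding-window arithmetic above can be written out directly; a sketch in plain Python (not the NG|Storage implementation):

```python
def simple_moving_average(data, window):
    """Slide a window across the data and emit the mean of each full window."""
    return [sum(data[i:i + window]) / window
            for i in range(len(data) - window + 1)]

print(simple_moving_average([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], window=5))
# [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```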
Moving averages are a simple method to smooth sequential data. Moving averages are
typically applied to time-based data, such as stock prices or server metrics. The smoothing
can be used to eliminate high frequency fluctuations or random noise, which allows the
lower frequency trends to be more easily visualized, such as seasonality.
Syntax
{
"moving_avg": {
"buckets_path": "the_sum",
"model": "holt",
"window": 5,
"gap_policy": "insert_zeros",
"settings": {
"alpha": 0.8
}
}
}
{
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"interval":"day"
},
"aggs":{
"the_sum":{
"sum":{ "field": "lemmings" }
},
"the_movavg":{
"moving_avg":{ "buckets_path": "the_sum" }
}
}
}
}
Models
The moving_avg aggregation includes five different moving average "models". The main
difference is how the values in the window are weighted. As data-points become "older" in
the window, they may be weighted differently. This will affect the final average for that
window.
Models are specified using the model parameter. Some models may have optional
configurations which are specified inside the settings parameter.
Simple
The simple model calculates the sum of all values in the window, then divides by the size
of the window. It is effectively a simple arithmetic mean of the window. The simple model
does not perform any time-dependent weighting, which means the values from a simple
moving average tend to "lag" behind the real data.
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "simple"
}
}
}
The window size can change the behavior of the moving average. For example, a small
window ("window": 10) will closely track the data and only smooth out small scale
fluctuations.
In contrast, a simple moving average with larger window ("window": 100) will smooth
out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also
tends to "lag" behind the actual data by a substantial amount.
Linear
The linear model assigns a linear weighting to points in the series, such that "older"
datapoints (e.g. those at the beginning of the window) contribute a linearly less amount to
the total average. The linear weighting helps reduce the "lag" behind the data’s mean,
since older points have less influence.
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "linear"
}
}
}
Like the simple model, window size can change the behavior of the moving average. For
example, a small window ("window": 10) will closely track the data and only smooth out
small scale fluctuations.
In contrast, a linear moving average with larger window ("window": 100) will smooth
out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also
tends to "lag" behind the actual data by a substantial amount, although typically less than
the simple model.
EWMA (Exponentially Weighted)
The ewma model (aka "single-exponential") is similar to the linear model, except older
data-points become exponentially less important, rather than linearly less important. The
speed at which the importance decays can be controlled with an alpha setting. Small
values make the weight decay slowly, which provides greater smoothing and takes into
account a larger portion of the window. Larger values make the weight decay quickly,
which reduces the impact of older values on the moving average. This tends to make the
moving average track the data more closely but with less smoothing.
The default value of alpha is 0.3, and the setting accepts any float from 0-1 inclusive.
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "ewma",
"settings" : {
"alpha" : 0.5
}
}
}
}
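The decay corresponds to the recurrence s_t = alpha * x_t + (1 - alpha) * s_(t-1); a minimal sketch in plain Python (the seeding and windowing details of the real implementation differ):

```python
def ewma(data, alpha=0.3):
    """Single exponential smoothing: older points decay by a factor (1 - alpha)."""
    smoothed = [data[0]]  # seed with the first raw value (illustrative choice)
    for x in data[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(ewma([10, 10, 20], alpha=0.5))  # [10, 10.0, 15.0]
```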
Holt-Linear
The holt model (aka "double exponential") incorporates a second exponential term which
tracks the data’s trend. Single exponential does not perform well when the data has an
underlying linear trend. The double exponential model calculates two values internally: a
"level" and a "trend".
The level calculation is similar to ewma, and is an exponentially weighted view of the data.
The difference is that the previously smoothed value is used instead of the raw value, which
allows it to stay close to the original series. The trend calculation looks at the difference
between the current and previous smoothed values, i.e. the slope of the smoothed series.
The default value of alpha is 0.3 and beta is 0.1. The settings accept any float from 0-1
inclusive.
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt",
"settings" : {
"alpha" : 0.5,
"beta" : 0.5
}
}
}
}
In practice, the alpha value behaves very similarly in holt as in ewma: small values produce
more smoothing and more lag, while larger values produce closer tracking and less lag.
The value of beta is often difficult to see. Small values emphasize long-term trends (such
as a constant linear trend in the whole series), while larger values emphasize short-term
trends. This will become more apparent when you are predicting values.
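Holt's level/trend recurrence can be sketched as follows (plain Python; the seeding here is a simple illustrative choice, not the NG|Storage implementation):

```python
def holt(data, alpha=0.5, beta=0.5):
    """Double exponential smoothing: track a level plus a trend component."""
    level, trend = data[0], data[1] - data[0]  # naive seed from the first two points
    smoothed = [level]
    for x in data[1:]:
        last_level = level
        level = alpha * x + (1 - alpha) * (level + trend)         # smoothed level
        trend = beta * (level - last_level) + (1 - beta) * trend  # smoothed slope
        smoothed.append(level)
    return smoothed

# A perfectly linear series is tracked without lag, unlike single exponential
print(holt([1, 2, 3, 4]))  # [1, 2.0, 3.0, 4.0]
```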
Holt-Winters
The holt_winters model (aka "triple exponential") incorporates a third exponential term
which tracks the seasonal aspect of your data. This aggregation therefore smooths based
on three components: "level", "trend" and "seasonality".
The level and trend calculations are identical to holt. The seasonal calculation looks at the
difference between the current point and the point one period earlier.
Holt-Winters requires a little more handholding than the other moving averages. You need
to specify the "periodicity" of your data: e.g. if your data has cyclic trends every 7 days, you
would set period: 7. Similarly if there was a monthly trend, you would set it to 30.
There is currently no periodicity detection, although that is planned for future
enhancements.
The model needs some amount of data to "warm up" before the smoothed values become
meaningful, and this "cold start" obscures what the moving average looks like. Just be
aware that the cold start will always be present at the beginning of your moving averages!
Additive Holt-Winters
Additive seasonality is the default; it can also be specified by setting "type": "add". This
variety is preferred when the seasonal effect is additive to your data, i.e. you could simply
subtract the seasonal effect to "de-seasonalize" your data into a flat trend.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept
any float from 0-1 inclusive. The default value of period is 1.
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt_winters",
"settings" : {
"type" : "add",
"alpha" : 0.5,
"beta" : 0.5,
"gamma" : 0.5,
"period" : 7
}
}
}
}
Multiplicative Holt-Winters
Multiplicative is specified by setting "type": "mult". This variety is preferred when the
seasonal effect is multiplied against your data, e.g. when the seasonal effect multiplies the
data by 5 rather than simply adding to it.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept
any float from 0-1 inclusive. The default value of period is 1.
Note: the mult Holt-Winters pads all values by a very small amount (1*10^-10) so
that all values are non-zero. This affects the result, but only minimally. If
your data is non-zero, or you prefer to see NaN when zeros are
encountered, you can disable this behavior with pad: false
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt_winters",
"settings" : {
"type" : "mult",
"alpha" : 0.5,
"beta" : 0.5,
"gamma" : 0.5,
"period" : 7,
"pad" : true
}
}
}
}
Prediction
All the moving average models support a "prediction" mode, which will attempt to
extrapolate into the future given the current smoothed, moving average. Depending on the
model and parameters, these predictions may or may not be accurate.
The simple, linear and ewma models all produce "flat" predictions: they essentially
converge on the mean of the last value in the series, producing a flat line.
In contrast, the holt model can extrapolate based on local or global constant trends. If we
set a high beta value, we can extrapolate based on local constant trends.
In contrast, if we choose a small beta, the predictions are based on the global constant
trend. In this series, the global trend is slightly positive, so the prediction makes a sharp u-
turn and begins a positive slope.
The holt_winters model has the potential to deliver the best predictions, since it also
incorporates seasonal fluctuations into the model.
Minimization
Some of the models (EWMA, Holt-Linear, Holt-Winters) require one or more parameters to
be configured. Parameter choice can be tricky and sometimes non-intuitive. Furthermore,
small deviations in these parameters can sometimes have a drastic effect on the output
moving average.
For that reason, the three "tunable" models can be algorithmically minimized.
Minimization is a process where parameters are tweaked until the predictions generated by
the model closely match the output data. Minimization is not foolproof and can be
susceptible to overfitting, but it often gives better results than hand-tuning.
When enabled, minimization will find the optimal values for alpha, beta and gamma. The
user should still provide appropriate values for window, period and type.
A sibling pipeline aggregation which calculates percentiles across all buckets of a specified
metric in a sibling aggregation. The specified metric must be numeric and the sibling
aggregation must be a multi-bucket aggregation.
Syntax
{
"percentiles_bucket": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the percentiles of all the total monthly sales buckets:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"percentiles_monthly_sales": {
"percentiles_bucket": {
"buckets_path": "sales_per_month>sales", 1
"percents": [ 25.0, 50.0, 75.0 ] 2
}
}
}
}
1 - buckets_path instructs this percentiles_bucket aggregation that we want the
percentiles of the sales aggregation in the sales_per_month date histogram
2 - percents specifies which percentiles we want to calculate
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"percentiles_monthly_sales": {
"values" : {
"25.0": 60,
"50.0": 375,
"75.0": 550
}
}
}
}
Percentiles_bucket implementation
The Percentile Bucket returns the nearest input data point that is not greater than the
requested percentile; it does not interpolate between data points.
The percentiles are calculated exactly and are not an approximation (unlike the Percentiles
Metric). This means the implementation maintains an in-memory, sorted list of your data to
compute the percentiles, before discarding the data. You may run into memory pressure
issues if you attempt to calculate percentiles over many millions of data-points in a single
percentiles_bucket.
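The nearest-rank rule can be sketched as follows (plain Python; the exact index arithmetic of the implementation may differ slightly):

```python
def percentiles_bucket(values, percents):
    """Exact, non-interpolating percentiles: always return an actual data point."""
    data = sorted(values)
    # nearest-rank sketch: floor(p% of n), clipped to the last index
    return {p: data[min(int(p / 100.0 * len(data)), len(data) - 1)]
            for p in percents}

print(percentiles_bucket([550, 60, 375], [25.0, 50.0, 75.0]))
# {25.0: 60, 50.0: 375, 75.0: 550}
```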
Serial differencing is a technique where values in a time series are subtracted from the
same series at a different time lag or period. For example, the datapoint
f(x) = f(x_t) - f(x_(t-n)), where n is the period being used.
Single periods are also useful for transforming data into a stationary series. In this
example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which
would make it difficult to use with some techniques.
By calculating the first-difference, we de-trend the data (e.g. remove a constant, linear
trend). We can see that the data becomes a stationary series (e.g. the first difference is
randomly distributed around zero, and doesn’t seem to exhibit any pattern/behavior). The
transformation reveals that the dataset is following a random-walk; the value is the
previous value +/- a random amount. This insight allows selection of further tools for
analysis.
Larger periods can be used to remove seasonal / cyclic behavior. In this example, a
population of lemmings was synthetically generated with a sine wave + constant linear
trend + random noise. The sine wave has a period of 30 days.
The first-difference removes the constant trend, leaving just a sine wave. The 30th-
difference is then applied to the first-difference to remove the cyclic behavior, leaving a
stationary series which is amenable to other analysis.
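The lagged subtraction itself is a one-liner; a sketch in plain Python showing a first-difference removing a constant linear trend:

```python
def serial_diff(series, lag=1):
    """f(x_t) - f(x_(t-lag)): subtract each value from the value lag steps later."""
    return [cur - prev for prev, cur in zip(series, series[lag:])]

trend = [3, 5, 7, 9, 11]          # constant linear trend: +2 per step
print(serial_diff(trend, lag=1))  # [2, 2, 2, 2] -- de-trended, stationary
```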
Syntax
{
"serial_diff": {
"buckets_path": "the_sum",
"lag": 7
}
}
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a
specified metric in a sibling aggregation. The specified metric must be numeric and the
sibling aggregation must be a multi-bucket aggregation.
Syntax
{
"stats_bucket": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the stats of all the total monthly sales buckets:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"stats_monthly_sales": {
"stats_bucket": {
"buckets_path": "sales_per_month>sales" 1
}
}
}
}
1 - buckets_path instructs this stats_bucket aggregation that we want the stats of the
sales aggregation in the sales_per_month date histogram
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"stats_monthly_sales": {
"count": 3,
"min": 60,
"max": 550,
"avg": 328.333333333,
"sum": 985
}
}
}
A sibling pipeline aggregation which calculates the sum across all buckets of a specified
metric in a sibling aggregation. The specified metric must be numeric and the sibling
aggregation must be a multi-bucket aggregation.
Syntax
{
"sum_bucket": {
"buckets_path": "the_sum"
}
}
The following snippet calculates the sum of all the total monthly sales buckets:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"sum_monthly_sales": {
"sum_bucket": {
"buckets_path": "sales_per_month>sales" 1
}
}
}
}
1 - buckets_path instructs this sum_bucket aggregation that we want the sum of the
sales aggregation in the sales_per_month date histogram.
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"sum_monthly_sales": {
"value": 985
}
}
}
Frequently used aggregations (e.g. for display on the home page of a website) can be
cached for faster responses. These cached results are the same results that would be
returned by an uncached aggregation; you will never get stale results.
There are many occasions when aggregations are required but search hits are not. For
these cases the hits can be ignored by setting size=0. For example:
GET /twitter/tweet/_search
{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"field": "text"
}
}
}
}
Setting size to 0 avoids executing the fetch phase of the search, making the request more
efficient.
You can associate a piece of metadata with individual aggregations at request time that will
be returned in place at response time.
Consider this example where we want to associate the color blue with our terms
aggregation.
{
"aggs": {
"titles": {
"terms": { "field": "title" },
"meta": {
"color": "blue"
}
}
}
}
Then that piece of metadata will be returned in place for our titles terms aggregation:
{
"aggregations": {
"titles": {
"meta": {
"color" : "blue"
},
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets": [
]
}
},
...
}
Analysis
Analysis is the process of converting text, like the body of an email, into tokens or terms
which are added to the inverted index for searching. Analysis is performed by an analyzer
which can be either a built-in analyzer or a 'custom' analyzer defined per index.
For instance, at index time the built-in 'english' analyzer would convert this sentence:
The QUICK Brown-Foxes jumped over the lazy dog's bone.
into these distinct terms:
[ quick, brown, fox, jump, over, lazi, dog, bone ]
The analyzer to use for a particular field can be specified in the mapping:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
At index time, if no 'analyzer' has been specified, it looks for an analyzer in the index
settings called 'default'. Failing that, it defaults to using the 'standard' analyzer.
This same analysis process is applied to the query string at search time in full text queries
like the 'match' query to convert the text in the query string into terms of the same form as
those that are stored in the inverted index.
Even though the exact words used in the query string don’t appear in the original text
('quick' vs 'QUICK', 'fox' vs 'foxes'), because we have applied the same analyzer to both the
text and the query string, the terms from the query string exactly match the terms from the
text in the inverted index, which means that this query would match our example document.
Usually the same analyzer should be used both at index time and at search time, and full
text queries like the 'match' query will use the mapping to look up the analyzer to use for
each field.
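Why the same analyzer must be applied on both sides can be shown with a toy analyzer (plain Python; real analyzers also strip punctuation, stem, handle stop words, and so on):

```python
def toy_analyze(text):
    """A toy analyzer: split on whitespace, then lowercase each token."""
    return text.lower().split()

# Index time: analyze the document text into terms for the inverted index
indexed_terms = set(toy_analyze("The QUICK brown fox"))

# Search time: the query string goes through the same analyzer
query_terms = toy_analyze("quick FOX")

# Because both sides were analyzed identically, the query terms
# line up exactly with the terms stored in the inverted index
print(all(term in indexed_terms for term in query_terms))  # True
```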
The built-in analyzers pre-package these building blocks into analyzers suitable for
different languages and types of text. NG|Storage also exposes the individual building
blocks so that they can be combined to define new 'custom' analyzers.
17.1. Character Filters
17.1. Character Filter
A character filter receives the original text as a stream of characters and can transform the
stream by adding, removing, or changing characters. For instance, a character filter could
be used to convert Arabic numerals into their Latin equivalents (0123456789), or to strip
HTML elements like '<b>' from the stream.
An analyzer may have zero or more character filters, which are applied in order.
17.2. Tokenizer
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually
individual words), and outputs a stream of tokens. An analyzer has exactly one tokenizer.
The tokenizer is also responsible for recording the order or position of each term and the
start and end character offsets of the original word which the term represents.
17.3. Token Filter
A token filter receives the token stream and may add, remove, or change tokens. For
example, a 'lowercase' token filter converts all tokens to lowercase, a 'stop' token filter
removes common words (stop words) like 'the' from the token stream, and a 'synonym'
token filter introduces synonyms into the token stream.
Token filters are not allowed to change the position or character offsets of each token.
An analyzer may have zero or more token filters, which are applied in order.
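The full chain — character filters, then the tokenizer, then token filters, each applied in order — can be sketched as a small Python pipeline. The filter implementations below are simplified stand-ins for the real building blocks:

```python
import re

def char_filter_html_strip(text):
    # Character filter: operates on the raw character stream.
    return re.sub(r"<[^>]+>", "", text)

def tokenizer_whitespace(text):
    # Tokenizer: splits the filtered character stream into tokens.
    return text.split()

def token_filter_lowercase(tokens):
    # Token filter: changes tokens.
    return [t.lower() for t in tokens]

def token_filter_stop(tokens, stopwords=frozenset({"the", "a", "an"})):
    # Token filter: removes tokens.
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    # An analyzer: zero or more char filters -> one tokenizer -> zero or more token filters.
    text = char_filter_html_strip(text)
    tokens = tokenizer_whitespace(text)
    tokens = token_filter_lowercase(tokens)
    tokens = token_filter_stop(tokens)
    return tokens

print(analyze("The <b>Brown</b> Fox"))  # ['brown', 'fox']
```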
NG|Storage ships with a wide range of built-in analyzers, which can be used in any index
without further configuration:
Standard Analyzer
The 'standard' analyzer divides text into terms on word boundaries, as defined by the
Unicode Text Segmentation algorithm. It removes most punctuation, lowercases
terms, and supports removing stop words.
Simple Analyzer
The 'simple' analyzer divides text into terms whenever it encounters a character
which is not a letter. It lowercases all terms.
Whitespace Analyzer
The 'whitespace' analyzer divides text into terms whenever it encounters any
whitespace character. It does not lowercase terms.
Stop Analyzer
The 'stop' analyzer is like the 'simple' analyzer, but also supports removal of stop
words.
Keyword Analyzer
The 'keyword' analyzer is a 'noop' analyzer that accepts whatever text it is given and
outputs the exact same text as a single term.
Pattern Analyzer
The 'pattern' analyzer uses a regular expression to split the text into terms. It
supports lower-casing and stop words.
Language Analyzers
A set of analyzers aimed at analyzing specific language text, such as 'english' or
'french'.
Fingerprint Analyzer
The 'fingerprint' analyzer is a specialist analyzer which creates a fingerprint that can
be used for duplicate detection.
Custom analyzers
Chapter 18. Analyzers | 225
NG|Storage Admin Guide
If you do not find an analyzer suitable for your needs, you can create a 'custom' analyzer
which combines the appropriate character filters, tokenizer, and token filters.
The built-in analyzers can be used directly without any configuration. Some of them,
however, support configuration options to alter their behaviour. For instance, the
'standard' analyzer can be configured to support a list of stop words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 1
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard", 2
          "fields": {
            "english": {
              "type": "text",
              "analyzer": "std_english" 3
            }
          }
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "field": "my_text", 2
  "text": "The old brown cow"
}
POST my_index/_analyze
{
  "field": "my_text.english", 3
  "text": "The old brown cow"
}
1 - The 'std_english' analyzer is based on the 'standard' analyzer, but configured to
remove the pre-defined list of English stop words.
2 - The 'my_text' field uses the 'standard' analyzer directly, without any configuration.
No stop words will be removed: [ the, old, brown, cow ]
3 - The 'my_text.english' field uses the 'std_english' analyzer, so English stop words
will be removed: [ old, brown, cow ]
When the built-in analyzers do not fulfill your needs, you can create a 'custom' analyzer
which uses the appropriate combination of:
• zero or more character filters
• a tokenizer
• zero or more token filters
Configuration
'tokenizer'
A built-in or customised tokenizer. (Required.)
'char_filter'
An optional array of built-in or customised character filters.
'filter'
An optional array of built-in or customised token filters.
'position_increment_gap'
When indexing an array of text values, NG|Storage inserts a fake "gap" between the
last term of one value and the first term of the next value to ensure that a phrase
query doesn’t match two terms from different array elements. Defaults to '100'. See
Position Increment Gap for more.
Example configuration
Here is an example that combines the following:
Character Filter
• HTML Strip Character Filter
Tokenizer
• Standard Tokenizer
Token Filters
• Lowercase Token Filter
• ASCII Folding Token Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>deja vu</b>?"
}
The above example produces the following terms:
[ is, this, deja, vu ]
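The behaviour of a custom analyzer like the one above (html_strip, standard-like tokenization, lowercase, asciifolding) can be approximated in plain Python. This is a rough sketch: the accented text is used here deliberately to show the folding step, and `unicodedata`-based folding only approximates the real asciifolding filter:

```python
import re
import unicodedata

def my_custom_analyzer(text):
    # char_filter: html_strip
    text = re.sub(r"<[^>]+>", "", text)
    # tokenizer: split on non-word characters (roughly like 'standard')
    tokens = [t for t in re.split(r"[^\w']+", text) if t]
    # filter: lowercase
    tokens = [t.lower() for t in tokens]
    # filter: asciifolding (approximate: decompose and strip combining marks)
    tokens = [
        unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        for t in tokens
    ]
    return tokens

print(my_custom_analyzer("Is this <b>déjà vu</b>?"))  # ['is', 'this', 'deja', 'vu']
```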
The previous example used tokenizer, token filters, and character filters with their default
configurations, but it is possible to create configured versions of each and to use them in a
custom analyzer.
Character Filter
• Mapping Character Filter, configured to replace ':)' with 'happy' and ':(' with 'sad'
Tokenizer
• Pattern Tokenizer, configured to split on punctuation characters
Token Filters
• Lowercase Token Filter
• Stop Token Filter, configured to use the pre-defined list of English stop words
Here is an example:
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
The 'emoticon' character filter, 'punctuation' tokenizer and 'english_stop' token filter
are custom implementations which are defined in the same index settings. The above
example produces the following terms:
[ i'm, happy, person, you ]
The 'fingerprint' analyzer implements a fingerprinting algorithm: the input text is
lowercased, normalized to remove extended characters, sorted, deduplicated and
concatenated into a single token.
Definition
It consists of:
Tokenizer
• Standard Tokenizer
Token Filters (in order)
1. Lowercase Token Filter
2. ASCII Folding Token Filter
3. Stop Token Filter (disabled by default)
4. Fingerprint Token Filter
Example output
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Godel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
Configuration
'separator'
The character to use to concatenate the terms. Defaults to a space.
'max_output_size'
The maximum token size to emit. Defaults to '255'. Tokens larger than this size will
be discarded.
'stopwords'
A pre-defined stop words list like 'english' or an array containing a list of stop words.
Defaults to 'none'.
'stopwords_path'
The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the 'fingerprint' analyzer to use the pre-defined list of English
stop words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Godel said this sentence is consistent and."
}
The above example produces the following single term:
[ consistent godel said sentence yes ]
The 'keyword' analyzer is a 'noop' analyzer which returns the entire input string as a
single token.
Definition
It consists of:
Tokenizer
• Keyword Tokenizer
Example output
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following single term:
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
Configuration
The 'keyword' analyzer is not configurable.
Language Analyzers
A set of analyzers aimed at analyzing specific language text. The following types are
supported: 'arabic', 'armenian', 'basque', 'bulgarian', 'catalan', 'czech', 'dutch', 'english',
'finnish', 'french', 'galician', 'german', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian',
'latvian', 'lithuanian', 'norwegian', 'portuguese', 'romanian', 'russian', 'sorani', 'spanish',
'swedish', 'turkish'.
For more information please refer to the source Elasticsearch reference documentation
chapter.
The 'pattern' analyzer uses a regular expression to split the text into terms. The regular
expression should match the token separators not the tokens themselves. The regular
expression defaults to '\W+' (or all non-word characters).
Definition
It consists of:
Tokenizer
• Pattern Tokenizer
Token Filters
• Lowercase Token Filter
• Stop Token Filter (disabled by default)
Example output
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
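The default behaviour ('\W+' as the separator, plus lowercasing) is easy to reproduce in Python with the `re` module, which is a convenient way to sanity-check a pattern before configuring it. This is an illustrative sketch, not the actual Lucene implementation:

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    # The regular expression matches the token SEPARATORS, not the tokens.
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens

print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```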
Configuration
'pattern'
'flags'
'lowercase'
'max_token_length'
The maximum token length. If a token is seen that exceeds this length then it is split
at 'max_token_length' intervals. Defaults to '255'.
'stopwords'
A pre-defined stop words list like 'english' or an array containing a list of stop words.
Defaults to 'none'.
'stopwords_path'
The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the 'pattern' analyzer to split email addresses on non-word
characters or on underscores ('\W|_'), and to lower-case the result:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_", 1
          "lowercase": true
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
The above example produces the following terms:
[ john, smith, foo, bar, com ]
1 - The backslashes in the pattern need to be escaped when specifying the pattern as a
JSON string.
CamelCase tokenizer
The following more complicated example splits CamelCase text into tokens. It assumes a
'camel' analyzer has been defined in the index settings as a 'pattern' analyzer whose
regular expression splits on non-letters, on digit boundaries, and on lower-to-upper case
transitions:
GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
The above example produces the following terms:
[ moose, x, ftp, class, 2, beta ]
The 'simple' analyzer breaks text into terms whenever it encounters a character which is
not a letter. All terms are lower cased.
Definition
It consists of:
Tokenizer
• Lowercase Tokenizer
Example output
POST _analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Configuration
The 'simple' analyzer is not configurable.
The 'standard' analyzer is the default analyzer which is used if none is specified. It provides
grammar based tokenization (based on the Unicode Text Segmentation algorithm, as
specified in Unicode Standard Annex #29) and works well for most languages.
Definition
It consists of:
Tokenizer
• Standard Tokenizer
Token Filters
• Standard Token Filter
• Lowercase Token Filter
• Stop Token Filter (disabled by default)
Example output
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
Configuration
'max_token_length'
The maximum token length. If a token is seen that exceeds this length then it is split
at 'max_token_length' intervals. Defaults to '255'.
'stopwords'
A pre-defined stop words list like 'english' or an array containing a list of stop words.
Defaults to 'none'.
'stopwords_path'
The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the 'standard' analyzer to have a 'max_token_length' of 5
(for demonstration purposes), and to use the pre-defined list of English stop words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above example produces the following terms:
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
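The splitting behaviour of 'max_token_length' can be illustrated with a short Python sketch. This is a simplified simulation of that one step only; real tokenization and stop-word handling are more involved:

```python
def split_at_intervals(token, max_len=5):
    # Tokens longer than max_len are split at max_len intervals,
    # e.g. "jumped" with max_len 5 becomes "jumpe" and "d".
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]

print(split_at_intervals("jumped"))  # ['jumpe', 'd']
print(split_at_intervals("brown"))   # ['brown']
```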
The 'stop' analyzer is the same as the 'simple' analyzer but adds support for removing stop
words. It defaults to using the 'english' stop words.
Definition
It consists of:
Tokenizer
• Lowercase Tokenizer
Token filters
• Stop Token Filter
Example output
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Configuration
'stopwords'
A pre-defined stop words list like 'english' or an array containing a list of stop words.
Defaults to 'english'.
'stopwords_path'
The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the 'stop' analyzer to use a specified list of words as stop
words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above example produces the following terms:
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
The 'whitespace' analyzer breaks text into terms whenever it encounters a whitespace
character.
Definition
It consists of:
Tokenizer
• Whitespace Tokenizer
Example output
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
Configuration
The 'whitespace' analyzer is not configurable.
Character filters are used to preprocess the stream of characters before it is passed to the
tokenizer.
A character filter receives the original text as a stream of characters and can transform the
stream by adding, removing, or changing characters. For instance, a character filter could
be used to convert Arabic numerals into their Latin equivalents (0123456789), or to strip
HTML elements like <b> from the stream.
NG|Storage has a number of built-in character filters which can be used to build custom
analyzers:
HTML Strip Character Filter
The html_strip character filter strips out HTML elements like <b> and decodes
HTML entities like '&amp;'.
Mapping Character Filter
The mapping character filter replaces any occurrences of the specified strings with
the specified replacements.
Pattern Replace Character Filter
The pattern_replace character filter replaces any characters matching a regular
expression with the specified replacement.
The html_strip character filter strips HTML elements from the text and replaces HTML
entities with their decoded value (e.g. replacing '&amp;' with '&').
Example output
POST _analyze
{
  "tokenizer": "keyword", 1
  "char_filter": [ "html_strip" ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
1 - The 'keyword' tokenizer returns a single term, which makes it easy to see the effect
of the character filter on its own.
The above example returns the term:
[ \nI'm so happy!\n ]
The same example with the standard tokenizer would return the following terms:
[ I'm, so, happy ]
Configuration
escaped_tags
An array of HTML tags which should not be stripped from the original text.
Example configuration
In this example, we configure the html_strip character filter to leave <b> tags in place:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <b>happy</b>!</p>"
}
[ \nI'm so <b>happy</b>!\n ]
The mapping character filter accepts a map of keys and values. Whenever it encounters a
string of characters that is the same as a key, it replaces them with the value associated
with that key.
Matching is greedy; the longest pattern matching at a given point wins. Replacements are
allowed to be the empty string.
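Greedy, longest-match-first replacement can be sketched in Python. The emoticon mapping below is a hypothetical example, and the real filter is implemented far more efficiently:

```python
def mapping_char_filter(text, mappings):
    # Sort keys longest-first so the longest pattern matching at a given point wins.
    keys = sorted(mappings, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(mappings[k])  # the replacement may be the empty string
                i += len(k)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

mappings = {":)": "happy", ":(": "sad", ":-(": "very sad"}
print(mapping_char_filter("I'm :-( about it", mappings))  # "I'm very sad about it"
```

Note that ':-(' wins over ':(' at the same position because matching is greedy.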
Configuration
mappings
An array of mappings, with each element having the form key => value.
mappings_path
A path, either absolute or relative to the config directory, to a UTF-8 encoded text
mappings file containing a key => value mapping per line.
Example configuration
In this example, we configure the mapping character filter to replace Arabic numerals
with their Latin equivalents, and use it in an analyzer named my_analyzer defined in the
index settings:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
The above example returns the term:
[ My license plate is 25015 ]
Keys and values can be strings with multiple characters. The following example replaces
the :) and :( emoticons with a text equivalent:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm delighted about it :("
}
The pattern_replace character filter uses a regular expression to match characters which
should be replaced with the specified replacement string.
Configuration
pattern
A Java regular expression. Required.
replacement
The replacement string, which can refer to capture groups using the $1..$9 syntax.
flags
Java regular expression flags. Flags should be pipe-separated, e.g.
"CASE_INSENSITIVE|COMMENTS".
Example configuration
In this example, we configure the pattern_replace character filter to replace any
embedded dashes in numbers with underscores, i.e. 123-456-789 → 123_456_789:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
The above example produces the following terms:
[ My, credit, card, is, 123_456_789 ]
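The same pattern can be checked in Python before deploying it, since `re` supports the same lookahead syntax (only the replacement reference changes, '$1' becoming '\1'):

```python
import re

# The lookahead (?=\d) matches a dash only when it is followed by another
# digit, so dashes elsewhere in the text are left untouched.
filtered = re.sub(r"(\d+)-(?=\d)", r"\1_", "My credit card is 123-456-789")
print(filtered)  # "My credit card is 123_456_789"
```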
Using a replacement string that changes the length of the original text will
work for search purposes, but will result in incorrect highlighting, as can
be seen in the following example.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}
The above returns the following terms:
[ the, foo, bar, baz, method ]
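The lower-to-upper boundary pattern can be tried out in Python with the equivalent ASCII character classes ((?<=[a-z])(?=[A-Z]) standing in for the Unicode \p{Lower}/\p{Upper} properties used above):

```python
import re

# Insert a space at every lower-to-upper case transition, then
# tokenize and lowercase as the analyzer would.
spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", "The fooBarBaz method")
tokens = [t.lower() for t in spaced.split()]
print(tokens)  # ['the', 'foo', 'bar', 'baz', 'method']
```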
Querying for bar will find the document correctly, but highlighting on the result will
produce incorrect highlights, because our character filter changed the length of the original
text:
PUT my_index/my_doc/1?refresh
{
  "text": "The fooBarBaz method"
}
GET my_index/_search
{
  "query": {
    "match": {
      "text": "bar"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}
Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g.
lowercasing), delete tokens (e.g. remove stopwords) or add tokens (e.g. synonyms).
NG|Storage has a number of built-in token filters which can be used to build custom
analyzers. For more information about the available filters, please refer to the source
Elasticsearch reference documentation chapter.
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually
individual words), and outputs a stream of tokens. The tokenizer is also responsible for
recording the order or position of each term (used for phrase and word proximity queries)
and the start and end character offsets of the original word which the term represents
(used for highlighting search snippets).
NG|Storage has a number of built-in tokenizers which can be used to build custom
analyzers.
The following tokenizers are usually used for tokenizing full text into individual words:
Standard Tokenizer
The standard tokenizer divides text into terms on word boundaries, as defined by
the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is
the best choice for most languages.
Letter Tokenizer
The letter tokenizer divides text into terms whenever it encounters a character
which is not a letter.
Lowercase Tokenizer
The lowercase tokenizer, like the letter tokenizer, divides text into terms
whenever it encounters a character which is not a letter, but it also lowercases all
terms.
Whitespace Tokenizer
The whitespace tokenizer divides text into terms whenever it encounters any
whitespace character.
Classic Tokenizer
The classic tokenizer is a grammar based tokenizer for the English Language.
Thai Tokenizer
The thai tokenizer segments Thai text into words.
These tokenizers break up text or words into small fragments, for partial word matching:
N-Gram Tokenizer
The ngram tokenizer can break up text into words when it encounters any of a list of
specified characters (e.g. whitespace or punctuation), then it returns n-grams of each
word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck].
Edge N-Gram Tokenizer
The edge_ngram tokenizer can break up text into words when it encounters any of a
list of specified characters (e.g. whitespace or punctuation), then it returns n-grams
of each word which are anchored to the start of the word, e.g. quick → [q, qu,
qui, quic, quick].
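The difference between the two can be sketched in Python. The window sizes here are illustrative; the real tokenizers have configurable min_gram/max_gram settings:

```python
def ngrams(word, n=2):
    # Sliding window of n continuous letters across the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, max_gram=5):
    # N-grams anchored to the start of the word, useful for
    # search-as-you-type style prefix matching.
    return [word[:i] for i in range(1, min(len(word), max_gram) + 1)]

print(ngrams("quick"))       # ['qu', 'ui', 'ic', 'ck']
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```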
The following tokenizers are usually used with structured text like identifiers, email
addresses, zip codes, and paths, rather than with full text:
Keyword Tokenizer
The keyword tokenizer is a 'noop' tokenizer that accepts whatever text it is given
and outputs the exact same text as a single term.
Pattern Tokenizer
The pattern tokenizer uses a regular expression to either split text into terms
whenever it matches a word separator, or to capture matching text as terms.
Path Tokenizer
The path_hierarchy tokenizer takes a hierarchical value like a filesystem path,
splits on the path separator, and emits a term for each component in the tree, e.g.
/foo/bar → [/foo, /foo/bar].
The analyze API is an invaluable tool for viewing the terms produced by an analyzer. For
more information please refer to the source Elasticsearch reference documentation
chapter.
Advanced Concepts
aliases shows information about currently configured aliases to indices, including filter
and routing information.
% curl '192.168.56.10:9200/_cat/aliases?v'
alias index filter routing.index routing.search
alias2 test1 * - -
alias4 test1 - 2 1,2
alias1 test1 - - -
alias3 test1 - 1 1
The output shows that alias2 has configured a filter, and that alias3 and alias4 have
specific routing configurations.
If you only want to get information about a single alias, you can specify the alias in the URL,
for example /_cat/aliases/alias1.
allocation provides a snapshot of how many shards are allocated to each data node and
how much disk space they are using.
% curl '192.168.56.10:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host
ip node
1 3.1gb 5.6gb 72.2gb 77.8gb 7.8
192.168.56.10 192.168.56.10 Jarella
1 3.1gb 5.6gb 72.2gb 77.8gb 7.8
192.168.56.30 192.168.56.30 Solarr
1 3.0gb 5.5gb 72.3gb 77.8gb 7.6
192.168.56.20 192.168.56.20 Adam II
Here we can see that each node has been allocated a single shard and that they’re all using
about the same amount of space.
count provides quick access to the document count of the entire cluster, or individual
indices.
% curl 192.168.56.10:9200/_cat/count
1384314124582 19:42:04 10428
% curl 192.168.56.10:9200/_cat/count/wiki2
1384314139815 19:42:19 428
The document count indicates the number of live documents and does not
include deleted documents which have not yet been cleaned up by the
merge process.
fielddata shows how much heap memory is currently being used by fielddata on every
data node in the cluster.
% curl '192.168.56.10:9200/_cat/fielddata?v'
id host ip node field size
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones body 159.8kb
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones text 225.7kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary body 159.8kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary text 275.3kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip body 109.2kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip text 175.3kb
% curl '192.168.56.10:9200/_cat/fielddata?v&fields=body'
id host ip node field size
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones body 159.8kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary body 159.8kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip body 109.2kb
% curl '192.168.56.10:9200/_cat/fielddata/body,text?v'
id host ip node field size
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones body 159.8kb
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones text 225.7kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary body 159.8kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary text 275.3kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip body 109.2kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip text 175.3kb
The output shows the individual fielddata for the body and text fields, one row per field
per node.
% curl localhost:9200/_cat/health
1384308967 18:16:07 foo green 3 3 3 3 0 0 0
% curl 'localhost:9200/_cat/health?v&ts=0'
cluster status nodeTotal nodeData shards pri relo init unassign tasks
foo green 3 3 3 3 0 0 0 0
A common use of this command is to verify the health is consistent across nodes, by
running it against each node in turn.
A less obvious use is to track recovery of a large cluster over time. With enough shards,
starting a cluster, or even recovering after losing a node, can take time (depending on your
network & disk). A way to track its progress is by using this command in a delayed loop:
% while true; do curl 192.168.56.10:9200/_cat/health; sleep 120; done
In this scenario, we can tell that recovery took roughly four minutes. If this were going on
for hours, we would be able to watch the UNASSIGNED shards drop precipitously. If that
number remained static, we would have an idea that there is a problem.
You typically are using the health command when a cluster is malfunctioning. During this
period, it’s extremely important to correlate activities across log files, alerting systems, etc.
There are two outputs. The HH:MM:SS output is simply for quick human consumption. The
epoch time retains more information, including date, and is machine sortable if your
recovery spans days.
The indices command provides a cross-section of each index. This information spans
nodes.
% curl 'localhost:9200/_cat/indices/twi*?v'
health status index pri rep docs.count docs.deleted store.size
pri.store.size
green open twitter 5 1 11434 0 64mb
32mb
green open twitter2 2 0 2030 0 5.8mb
5.8mb
We can tell quickly how many shards make up an index, the number of docs at the Lucene
level, including hidden docs (e.g., from nested types), deleted docs, primary store size, and
total store size (all shards including replicas). All these exposed metrics come directly from
Lucene APIs.
Primaries
The index stats by default will show them for all of an index’s shards, including replicas. A
pri flag can be supplied to enable the view of relevant stats in the context of only the
primaries.
Examples
How much memory is used per index?
% curl 'localhost:9200/_cat/indices?v&h=i,tm'
i tm
wiki 8.1gb
test 30.5kb
user 1.9mb
master doesn’t have any extra options. It simply displays the master’s node ID, bound IP
address, and node name.
% curl 'localhost:9200/_cat/master?v'
id ip node
Ntgn2DcuTjGuXlhKDUD4vA 192.168.56.30 Solarr
This information is also available via the nodes command, but this output is slightly
shorter when all you want to do, for example, is verify that all nodes agree on the master.
The nodeattrs command shows custom node attributes.
% curl 192.168.56.10:9200/_cat/nodeattrs
node host ip attr value
Black Bolt epsilon 192.168.1.8 rack rack314
Black Bolt epsilon 192.168.1.8 azone us-east-1
The first few columns give you basic info per node.
The attr and value columns can give you a picture of custom node attributes.
attr value
rack rack314
azone us-east-1
Columns
Below is an exhaustive list of the existing headers that can be passed to nodeattrs?h= to
retrieve the relevant details in ordered columns. If no headers are specified, then those
marked to Appear by Default will appear. If any header is specified, then the defaults are
not used.
Aliases can be used in place of the full header name for brevity. Columns appear in the
order that they are listed below unless a different order is specified (e.g., h=attr,value
versus h=value,attr).
When specifying headers, the headers are not placed in the output by default. To have the
headers appear in the output, use verbose mode (v). The header name will match the
supplied value (e.g., pid versus p). For example:
% curl '192.168.56.10:9200/_cat/nodeattrs?v&h=name,pid,attr,value'
name pid attr value
Black Bolt 28000 rack rack314
Black Bolt 28000 azone us-east-1
The nodes command shows the cluster topology.
% curl 192.168.56.10:9200/_cat/nodes
SP4H 4727 192.168.56.30 9300 {version} {jdk} 72.1gb 35.4 93.9mb 79 239.1mb
0.45 3.4h mdi - Boneyard
_uhJ 5134 192.168.56.10 9300 {version} {jdk} 72.1gb 33.3 93.9mb 85 239.1mb
0.06 3.4h mdi * Athena
HfDp 4562 192.168.56.20 9300 {version} {jdk} 72.2gb 74.5 93.9mb 83 239.1mb
0.12 3.4h mdi - Zarek
The first few columns tell you where your nodes live. For sanity it also tells you what
version of ES and the JVM each one runs.
The next few give a picture of your heap, memory, and load.
The last columns provide ancillary information that can often be useful when looking at the
cluster as a whole, particularly large ones. How many master-eligible nodes do I have?
How many client nodes? It looks like someone restarted a node recently; which one was it?
Columns
Below is an exhaustive list of the existing headers that can be passed to nodes?h= to
retrieve the relevant details in ordered columns. If no headers are specified, then those
marked to Appear by Default will appear. If any header is specified, then the defaults are
not used.
Chapter 23. Catalog APIs | 263
NG|Storage Admin Guide
Aliases can be used in place of the full header name for brevity. Columns appear in the
order that they are listed below unless a different order is specified (e.g., h=pid,id versus
h=id,pid).
When specifying headers, the headers are not placed in the output by default. To have the
headers appear in the output, use verbose mode (v). The header name will match the
supplied value (e.g., pid versus p). For example:
% curl '192.168.56.10:9200/_cat/nodes?v&h=id,ip,port,v,m'
id ip port v m
pLSN 192.168.56.30 9300 {version} -
k0zy 192.168.56.10 9300 {version} -
6Tyi 192.168.56.20 9300 {version} *
% curl '192.168.56.10:9200/_cat/nodes?h=id,ip,port,v,m'
pLSN 192.168.56.30 9300 {version} -
k0zy 192.168.56.10 9300 {version} -
6Tyi 192.168.56.20 9300 {version} *
The pending_tasks command provides the same information as the pending cluster tasks
API in a convenient tabular format.
% curl 'localhost:9200/_cat/pending_tasks?v'
insertOrder timeInQueue priority source
1685 855ms HIGH update-mapping [foo][t]
1686 843ms HIGH update-mapping [foo][t]
1693 753ms HIGH refresh-mapping [foo][[t]]
1688 816ms HIGH update-mapping [foo][t]
1689 802ms HIGH update-mapping [foo][t]
1690 787ms HIGH update-mapping [foo][t]
1691 773ms HIGH update-mapping [foo][t]
The plugins command provides a view per node of running plugins. This information
spans nodes.
% curl 'localhost:9200/_cat/plugins?v'
name component version description
Abraxas discovery-gce 5.0.0 The Google Compute Engine (GCE)
Discovery plugin allows to use GCE API for the unicast discovery
mechanism.
Abraxas lang-javascript 5.0.0 The JavaScript language plugin
allows to have javascript as the language of scripts to execute.
We can tell quickly how many plugins per node we have and which versions.
The recovery command is a view of index shard recoveries, both on-going and previously
completed. It is a more compact view of the JSON recovery API.
A recovery event occurs anytime an index shard moves to a different node in the cluster.
This can happen during a snapshot recovery, a change in replication level, node failure, or
on node startup. This last type is called a local store recovery and is the normal way for
shards to be loaded from disk when a node starts up.
As an example, consider the recovery state of a cluster when there are no shards in transit
from one node to another: every row shows a completed store recovery, and the source and
target nodes are the same because the shards were read from local storage on node start.
Now let’s see what a live recovery looks like. By increasing the replica count of our index
and bringing another node online to host the replicas, we can watch a live shard recovery:
the initial shards move through various stages of being replicated from one node to
another, the recovery type is shown as replica, and the files and bytes copied are
real-time measurements.
Finally, let’s see what a snapshot recovery looks like. Assuming I have previously made a
backup of my index, I can restore it using the snapshot and restore API.
The repositories command shows the snapshot repositories registered in the cluster.
% curl 'localhost:9200/_cat/repositories?v'
id type
repo1 fs
repo2 s3
We can quickly see which repositories are registered and their type.
The segments command provides low level information about the segments in the shards
of an index.
% curl 'http://localhost:9200/_cat/segments?v'
index shard prirep ip segment generation docs.count [...]
test 4 p 192.168.2.105 _0 0 1
test1 2 p 192.168.2.105 _0 0 1
test1 3 p 192.168.2.105 _2 2 1
The output shows information about index names and shard numbers in the first two
columns.
If you only want to get information about segments in one particular index, you can add the
index name in the URL, for example /_cat/segments/test. Also, several indexes can be
queried like /_cat/segments/test,test1
prirep
Whether this segment belongs to a primary or a replica shard.
ip
The IP address of the node holding the shard this segment belongs to.
segment
A segment name, derived from the segment generation. The name is internally used
to generate the file names in the directory of the shard this segment belongs to.
generation
The generation number is incremented with each segment that is written. The name
of the segment is derived from this generation number.
docs.count
The number of non-deleted documents that are stored in this segment. Note that
these are Lucene documents, so the count will include hidden documents (e.g. from
nested types).
docs.deleted
The number of deleted documents that are stored in this segment. It is perfectly fine
if this number is greater than 0; space is going to be reclaimed when this segment
gets merged.
size
The amount of disk space that this segment uses.
size.memory
Segments store some data into memory in order to be searchable efficiently. This
column shows the number of bytes in memory that are used.
committed
Whether the segment has been sync’ed on disk. Segments that are committed would
survive a hard reboot. No need to worry in case of false, the data from uncommitted
segments is also stored in the transaction log so that NG|Storage is able to replay
changes on the next start.
searchable
True if the segment is searchable. A value of false would most likely mean that the
segment has been written to disk but no refresh occurred since then to make it
searchable.
version
The version of Lucene that has been used to write this segment.
compound
Whether the segment is stored in a compound file. When true, this means that Lucene
merged all files from the segment in a single one in order to save file descriptors.
The shards command is the detailed view of what nodes contain which shards. It will tell
you if it’s a primary or replica, the number of docs, the bytes it takes on disk, and the node
where it’s located.
Here we see a single index, with three primary shards and no replicas:
% curl 192.168.56.20:9200/_cat/shards
wiki1 0 p STARTED 3014 31.1mb 192.168.56.10 Stiletto
wiki1 1 p STARTED 3013 29.6mb 192.168.56.30 Frankie Raye
wiki1 2 p STARTED 3973 38.1mb 192.168.56.20 Commander Kraken
Index Pattern
If you have many shards, you may wish to limit which indices show up in the output. You
can always do this with grep, but you can save some bandwidth by supplying an index
pattern to the end.
% curl 192.168.56.20:9200/_cat/shards/wiki2
wiki2 0 p STARTED 197 3.2mb 192.168.56.10 Stiletto
wiki2 1 p STARTED 205 5.9mb 192.168.56.30 Frankie Raye
wiki2 2 p STARTED 275 7.8mb 192.168.56.20 Commander Kraken
Relocation
Let’s say you’ve checked your health and you see two relocating shards. Where are they
from and where are they going?
Shard states
Before a shard can be used, it goes through an INITIALIZING state. The shards command
can show you which ones.
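For instance, you can filter the shards output for anything that is not yet usable. The sample output below is hypothetical, standing in for what a live `curl 192.168.56.20:9200/_cat/shards` call might return:

```shell
# Hypothetical _cat/shards output; on a live cluster this would come from:
#   curl 192.168.56.20:9200/_cat/shards
shards='wiki1 0 p STARTED      3014 31.1mb 192.168.56.10 Stiletto
wiki1 1 p INITIALIZING    0     0b 192.168.56.30 Frankie Raye
wiki1 2 p UNASSIGNED'

# Keep only the shards that are not yet in the STARTED state
printf '%s\n' "$shards" | grep -v STARTED
```

This prints only the INITIALIZING and UNASSIGNED rows, which are the ones that need attention.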
If a shard cannot be assigned, for example you’ve overallocated the number of replicas for
the number of nodes in the cluster, the shard will remain UNASSIGNED with the reason
code ALLOCATION_FAILED.
INDEX_CREATED
Unassigned as a result of an API creation of an index.
CLUSTER_RECOVERED
Unassigned as a result of a full cluster recovery.
INDEX_REOPENED
Unassigned as a result of opening a closed index.
DANGLING_INDEX_IMPORTED
Unassigned as a result of importing a dangling index.
NEW_INDEX_RESTORED
Unassigned as a result of restoring into a new index.
EXISTING_INDEX_RESTORED
Unassigned as a result of restoring into a closed index.
REPLICA_ADDED
Unassigned as a result of explicit addition of a replica.
ALLOCATION_FAILED
Unassigned as a result of a failed allocation of the shard.
NODE_LEFT
Unassigned as a result of the node hosting it leaving the cluster.
REROUTE_CANCELLED
Unassigned as a result of explicit cancel reroute command.
REINITIALIZED
When a shard moves from started back to initializing, for example, with shadow
replicas.
REALLOCATED_REPLICA
A better replica location is identified and causes the existing replica allocation to be
cancelled.
The snapshots command shows all snapshots that belong to a specific repository. To find
a list of available repositories to query, the command /_cat/repositories can be used.
Querying the snapshots of a repository named repo1 then looks as follows.
Chapter 23. Catalog APIs | 275
NG|Storage Admin Guide
% curl 'localhost:9200/_cat/snapshots/repo1?v'
id    status  start_epoch start_time end_epoch  end_time duration indices successful_shards failed_shards total_shards
snap1 FAILED  1445616705  18:11:45   1445616978 18:16:18 4.6m     1       4                 1             5
snap2 SUCCESS 1445634298  23:04:58   1445634672 23:11:12 6.2m     2       10                0             10
Each snapshot contains information about when it was started and stopped. Start and stop
timestamps are available in two formats. The HH:MM:SS output is simply for quick human
consumption. The epoch time retains more information, including date, and is machine
sortable if the snapshot process spans days.
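Because the epoch columns are plain integers, snapshot rows can be ordered with standard tools. A small sketch, using the two rows from the example above:

```shell
# The two snapshot rows from the example above, one per line
rows='snap2 SUCCESS 1445634298 23:04:58 1445634672 23:11:12 6.2m 2 10 0 10
snap1 FAILED 1445616705 18:11:45 1445616978 18:16:18 4.6m 1 4 1 5'

# Sort by start_epoch (third column), oldest first
printf '%s\n' "$rows" | sort -k3,3n
```

The numeric sort on start_epoch puts snap1 first regardless of the order the rows arrived in, which the HH:MM:SS columns alone could not guarantee across days.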
The thread_pool command shows cluster wide thread pool statistics per node. By default
the active, queue and rejected statistics are returned for the bulk, index and search thread
pools.
% curl 192.168.56.10:9200/_cat/thread_pool
host1 192.168.1.35 0 0 0 0 0 0 0 0 0
host2 192.168.1.36 0 0 0 0 0 0 0 0 0
The first two columns contain the host and ip of a node:
host ip
host1 192.168.1.35
host2 192.168.1.36
The next three columns show the active, queue and rejected statistics for the bulk thread
pool.
The remaining columns show the active, queue and rejected statistics of the index and
search thread pools respectively.
Statistics for other thread pools can also be retrieved by using the h (header)
parameter.
Here the host columns and the active, rejected and completed statistics for the suggest
thread pool are displayed. The suggest thread pool won’t be displayed by default, so you
always need to be specific about which statistics you want to display.
The thread pool name (or alias) must be combined with a thread pool field below to retrieve
the requested information.
For each thread pool, you can load details about it by using the field names in the table
below, either using the full field name (e.g. bulk.active) or its alias (e.g. sa is equivalent
to search.active).
Other Fields
In addition to details about each thread pool, it is also convenient to get an understanding of
where those thread pools reside. As such, you can request other details like the ip of the
responding node(s).
Most cluster level APIs allow to specify which nodes to execute on (for example, getting the
node stats for a node). Nodes can be identified in the APIs either using their internal node
id, the node name, address, custom attributes, or just the _local node receiving the
request. For example, here are some sample executions of nodes info:
# Local
curl localhost:9200/_nodes/_local
# Address
curl localhost:9200/_nodes/10.0.0.3,10.0.0.4
curl localhost:9200/_nodes/10.0.0.*
# Names
curl localhost:9200/_nodes/node_name_goes_here
curl localhost:9200/_nodes/node_name_goes_*
# Attributes (set something like node.rack: 2 in the config)
curl localhost:9200/_nodes/rack:2
curl localhost:9200/_nodes/ra*:2
curl localhost:9200/_nodes/ra*:2*
The cluster allocation explanation API is designed to assist in answering the question "why
is this shard unassigned?". To explain the allocation (on unassigned state) of a shard, issue
a request like:
Specify the index and shard id of the shard you would like an explanation for, as well as
the primary flag to indicate whether to explain a primary or replica shard.
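The request itself was not reproduced above; it presumably takes the index, shard number and primary flag in the body, along these lines (the index name is illustrative):

```shell
$ curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -d '{
  "index": "myindex",
  "shard": 0,
  "primary": false
}'
```

The JSON fragment that follows is the beginning of the response for such an unassigned shard.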
{
"shard" : {
"index" : "myindex",
"index_uuid" : "KnW0-zELRs6PK84l0r38ZA",
"id" : 0,
"primary" : false
The annotations in the response indicate:
1. Whether the shard is assigned or unassigned
2. Whether information about the shard is still being fetched
3. Reason for the shard originally becoming unassigned
4. Configured delay before the shard can be allocated
5. Remaining delay before the shard can be allocated
6. User-added attributes the node has
7. The shard copy information for this node and error (if applicable)
8. Final decision and explanation of whether the shard can be allocated to this node
9. Weight for how much the allocator would like to allocate the shard to this node
10. List of node decisions factoring into the final decision about the shard
For a shard that is already assigned, the output looks similar to:
{
"shard" : {
"index" : "only-foo",
"index_uuid" : "KnW0-zELRs6PK84l0r38ZA",
"id" : 0,
"primary" : true
},
"assigned" : true,
"assigned_node_id" : "Qc6VL8c5RWaw1qXZ0Rg57g", 1
"shard_state_fetch_pending": false,
"allocation_delay_ms" : 0,
"remaining_delay_ms" : 0,
"nodes" : {
"V-Spi0AyRZ6ZvKbaI3691w" : {
"node_name" : "Susan Storm",
"node_attributes" : {
"bar" : "baz"
},
"store" : {
"shard_copy" : "NONE"
},
"final_decision" : "NO",
"final_explanation" : "the shard cannot be assigned because one or
more allocation decider returns a 'NO' decision",
"weight" : 1.4499999,
"decisions" : [ {
"decider" : "filter",
"decision" : "NO",
"explanation" : "node does not match index include filters
[foo:\"bar\"]"
} ]
},
"Qc6VL8c5RWaw1qXZ0Rg57g" : {
"node_name" : "Slipstream",
"node_attributes" : {
"bar" : "baz",
"foo" : "bar"
},
"store" : {
"shard_copy" : "AVAILABLE"
},
"final_decision" : "ALREADY_ASSIGNED", 2
"final_explanation" : "the shard is already assigned to this node",
"weight" : 0.0,
"decisions" : [ {
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated on the same node id
[Qc6VL8c5RWaw1qXZ0Rg57g] on which it already exists"
} ]
},
"PzdyMZGXQdGhqTJHF_hGgA" : {
"node_name" : "The Symbiote",
"node_attributes" : { },
"store" : {
"shard_copy" : "NONE"
},
"final_decision" : "NO",
"final_explanation" : "the shard cannot be assigned because one or
more allocation decider returns a 'NO' decision",
"weight" : 3.6999998,
"decisions" : [ {
"decider" : "filter",
"decision" : "NO",
"explanation" : "node does not match index include filters
[foo:\"bar\"]"
} ]
}
}
}
You can also have NG|Storage explain the allocation of the first unassigned shard it finds by
sending an empty body, such as:
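A sketch of such a call with an empty body:

```shell
$ curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -d '{}'
```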
If you would like to include all decisions that were factored into the final decision, the
include_yes_decisions parameter will return all decisions:
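For example, presumably along these lines:

```shell
$ curl -XGET 'http://localhost:9200/_cluster/allocation/explain?include_yes_decisions=true'
```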
Additionally, you can return information gathered by the cluster info service about disk
usage and shard sizes by setting the include_disk_info parameter to true:
$ curl -XGET 'http://localhost:9200/_cluster/allocation/explain?include_disk_info=true'
The cluster health API allows to get a very simple status on the health of the cluster. For
example, on a quiet single node cluster with a single index with 5 shards and one replica,
this:
GET _cluster/health
Returns this:
{
"cluster_name" : "testcluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 5,
"active_shards" : 5,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 5,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 50.0
}
The API can also be executed against one or more indices to get just the specified indices
health:
GET /_cluster/health/test1,test2
The cluster health status is: green, yellow or red. On the shard level, a red status
indicates that the specific shard is not allocated in the cluster, yellow means that the
primary shard is allocated but replicas are not, and green means that all shards are
allocated. The index level status is controlled by the worst shard status, and the cluster
status is controlled by the worst index status.
One of the main benefits of the API is the ability to wait until the cluster reaches a certain
high water-mark health level. For example, the following will wait for 50 seconds for the
cluster to reach the yellow level (if it reaches the green or yellow status before 50
seconds elapse, it will return at that point):
GET /_cluster/health?wait_for_status=yellow&timeout=50s
Request Parameters
level
Can be one of cluster, indices or shards. Controls the details level of the health
information returned. Defaults to cluster.
wait_for_status
One of green, yellow or red. Will wait (until the timeout provided) until the status of
the cluster changes to the one provided or better, i.e. green > yellow > red. By
default, will not wait for any status.
wait_for_relocating_shards
A number controlling how many relocating shards to wait for. Usually this will be 0,
to indicate waiting until all relocations have finished. Defaults to not waiting.
wait_for_active_shards
A number controlling how many active shards to wait for. Defaults to not waiting.
wait_for_nodes
The request waits until the specified number N of nodes is available. It also accepts
>=N, <=N, >N and <N. Alternatively, it is possible to use ge(N), le(N), gt(N) and
lt(N) notation.
timeout
A time based parameter controlling how long to wait if one of the wait_for_XXX are
provided. Defaults to 30s.
The following is an example of getting the cluster health at the shards level:
GET /_cluster/health/twitter?level=shards
The cluster nodes info API allows to retrieve one or more (or all) of the cluster nodes
information.
The first command retrieves information of all the nodes in the cluster. The second
command selectively retrieves nodes information of only nodeId1 and nodeId2. All the
nodes selective options are explained here.
By default, it just returns all attributes and core settings for a node. For more information
please refer to the source ElasticSearch reference documentation chapter.
The cluster nodes stats API allows to retrieve one or more (or all) of the cluster nodes
statistics.
The first command retrieves stats of all the nodes in the cluster. The second command
selectively retrieves nodes stats of only nodeId1 and nodeId2.
For more information please refer to the source ElasticSearch reference documentation
chapter.
The pending cluster tasks API returns a list of any cluster-level changes (e.g. create index,
update mapping, allocate or fail shard) which have not yet been executed.
This API returns a list of any pending updates to the cluster state. These
are distinct from the tasks reported by the Task Management API, which
include periodic tasks and tasks initiated by the user, such as node stats,
search queries, or create index requests.
Usually this will return an empty list as cluster-level changes are usually fast. However if
there are tasks queued up, the output will look something like this:
{
"tasks": [
{
"insert_order": 101,
"priority": "URGENT",
"source": "create-index [foo_9], cause [api]",
"time_in_queue_millis": 86,
"time_in_queue": "86ms"
},
{
"insert_order": 46,
"priority": "HIGH",
"source": "shard-started ([foo_2][1],
node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after
recovery from shard_store]",
"time_in_queue_millis": 842,
"time_in_queue": "842ms"
},
{
"insert_order": 45,
"priority": "HIGH",
"source": "shard-started ([foo_2][0],
node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after
recovery from shard_store]",
"time_in_queue_millis": 858,
"time_in_queue": "858ms"
}
]
}
The reroute command allows to explicitly execute a cluster reroute allocation command.
An important aspect to remember is that once an allocation occurs, the cluster will aim at
re-balancing its state back to an even state. For example, if the allocation includes moving
a shard from node1 to node2, in an even state, then another shard will be moved from
node2 to node1 to even things out.
The cluster can be set to disable allocations, which means that only the explicitly
executed allocations will be performed. Obviously, only once all commands have been
applied will the cluster aim to re-balance its state.
Another option is to run the commands in dry_run (as a URI flag, or in the request body).
This will cause the commands to apply to the current cluster state, and return the resulting
cluster after the commands (and re-balancing) has been applied.
If the explain parameter is specified, a detailed explanation of why the commands could
or could not be executed is returned.
move
Move a started shard from one node to another node. Accepts index and shard for
index name and shard number, from_node for the node to move the shard from, and
to_node for the node to move the shard to.
allocate_replica
Allocate an unassigned replica shard to a node. Accepts the index and shard for
index name and shard number, and node to allocate the shard to. Takes allocation
deciders into account.
Two more commands are available that allow the allocation of a primary shard to a node.
These commands should however be used with extreme care, as primary shard allocation
is usually fully automatically handled by NG|Storage. Reasons why a primary shard cannot
be automatically allocated include the following:
• A new index was created but there is no node which satisfies the allocation deciders.
• An up-to-date shard copy of the data cannot be found on the current data nodes in the
cluster. To prevent data loss, the system does not automatically promote a stale shard
copy to primary.
As a manual override, two commands to forcefully allocate primary shards are available:
allocate_stale_primary
Allocate a primary shard to a node that holds a stale copy. Accepts the index and
shard for index name and shard number, and node to allocate the shard to. Using
this command may lead to data loss for the provided shard id. If a node which has the
good copy of the data rejoins the cluster later on, that data will be overwritten with the
data of the stale copy that was forcefully allocated with this command. To ensure that
these implications are well-understood, this command requires the special field
accept_data_loss to be explicitly set to true for it to work.
allocate_empty_primary
Allocate an empty primary shard to a node. Accepts the index and shard for index
name and shard number, and node to allocate the shard to. Using this command
leads to a complete loss of all data that was indexed into this shard, if it was
previously started. If a node which has a copy of the data rejoins the cluster later on,
that data will be deleted. To ensure that these implications are well-understood, this
command requires the special field accept_data_loss to be explicitly set to true
for it to work.
Once the problem has been corrected, allocation can be manually retried by calling the
_reroute API with ?retry_failed, which will attempt a single retry round for these
shards.
The cluster state API allows to get comprehensive state information of the whole cluster.
By default, the cluster state request is routed to the master node, to ensure that the latest
cluster state is returned. For debugging purposes, you can retrieve the cluster state local
to a particular node by adding local=true to the query string.
Response Filters
As the cluster state can grow (depending on the number of shards and indices, your
mapping, templates), it is possible to filter the cluster state response specifying the parts in
the URL.
version
Shows the cluster state version.
master_node
Shows the elected master_node part of the response
routing_table
Shows the routing_table part of the response. If you supply a comma separated
list of indices, the returned output will only contain the indices listed.
metadata
Shows the metadata part of the response. If you supply a comma separated list of
indices, the returned output will only contain the indices listed.
blocks
Shows the blocks part of the response
The Cluster Stats API allows to retrieve statistics from a cluster wide perspective. The API
returns basic index metrics (shard numbers, store size, memory usage) and information
about the current nodes that form the cluster (number, roles, os, jvm versions, memory
usage, cpu and installed plugins).
{
"timestamp": 1459427693515,
"cluster_name": "ngStorage",
"status": "green",
"indices": {
"count": 2,
"shards": {
"total": 10,
"primaries": 10,
"replication": 0,
"index": {
"shards": {
"min": 5,
"max": 5,
"avg": 5
},
"primaries": {
"min": 5,
"max": 5,
"avg": 5
},
"replication": {
"min": 0,
"max": 0,
"avg": 0
}
}
},
"docs": {
"count": 10,
"deleted": 0
},
"store": {
"size": "16.2kb",
"size_in_bytes": 16684,
"throttle_time": "0s",
"throttle_time_in_millis": 0
},
"fielddata": {
"memory_size": "0b",
"memory_size_in_bytes": 0,
"evictions": 0
},
"query_cache": {
"memory_size": "0b",
"memory_size_in_bytes": 0,
"total_count": 0,
"hit_count": 0,
"miss_count": 0,
"cache_size": 0,
"cache_count": 0,
"evictions": 0
},
"completion": {
"size": "0b",
"size_in_bytes": 0
},
"segments": {
"count": 4,
"memory": "8.6kb",
"memory_in_bytes": 8898,
"terms_memory": "6.3kb",
"terms_memory_in_bytes": 6522,
"stored_fields_memory": "1.2kb",
"stored_fields_memory_in_bytes": 1248,
"term_vectors_memory": "0b",
The task management API allows to retrieve information about the tasks currently
executing on one or more nodes in the cluster.
GET _tasks 1
GET _tasks?nodes=nodeId1,nodeId2 2
GET _tasks?nodes=nodeId1,nodeId2&actions=cluster:* 3
1. Retrieves all tasks currently running on all nodes in the cluster.
2. Retrieves all tasks running on nodes nodeId1 and nodeId2. See [cluster-nodes] for
more info about how to select individual nodes.
3. Retrieves all cluster-related tasks running on nodes nodeId1 and nodeId2.
{
"nodes" : {
"oTUltX4IQMOUUVeiohTt8A" : {
"name" : "Tamara Rahn",
"transport_address" : "127.0.0.1:9300",
"host" : "127.0.0.1",
"ip" : "127.0.0.1:9300",
"tasks" : {
"oTUltX4IQMOUUVeiohTt8A:124" : {
"node" : "oTUltX4IQMOUUVeiohTt8A",
"id" : 124,
"type" : "direct",
"action" : "cluster:monitor/tasks/lists[n]",
"start_time_in_millis" : 1458585884904,
"running_time_in_nanos" : 47402,
"cancellable" : false,
"parent_task_id" : "oTUltX4IQMOUUVeiohTt8A:123"
},
"oTUltX4IQMOUUVeiohTt8A:123" : {
"node" : "oTUltX4IQMOUUVeiohTt8A",
"id" : 123,
"type" : "transport",
"action" : "cluster:monitor/tasks/lists",
"start_time_in_millis" : 1458585884904,
"running_time_in_nanos" : 236042,
"cancellable" : false
}
}
}
}
}
GET _tasks/taskId:1
GET _tasks?parent_task_id=parentTaskId:1
The task API can be also used to wait for completion of a particular task. The following call
will block for 10 seconds or until the task with id oTUltX4IQMOUUVeiohTt8A:12345 is
completed.
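A sketch of such a call, in the same request notation used above:

```
GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345?wait_for_completion=true&timeout=10s
```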
You can also wait for all tasks for certain action types to finish. This command will wait for
all reindex tasks to finish:
GET _tasks?actions=*reindex&wait_for_completion=true&timeout=10s
Tasks can be also listed using _cat version of the list tasks command, which accepts the
same arguments as the standard list tasks command.
GET _cat/tasks
Task Cancellation
POST _tasks/taskId:1/_cancel
The task cancellation command supports the same task selection parameters as the list
tasks command, so multiple tasks can be cancelled at the same time. For example, the
following command will cancel all reindex tasks running on the nodes nodeId1 and
nodeId2.
POST _tasks/_cancel?node_id=nodeId1,nodeId2&actions=*reindex
Task Grouping
The task lists returned by task API commands can be grouped either by nodes (default) or
by parent tasks using the group_by parameter. The following command will change the
grouping to parent tasks:
GET _tasks?group_by=parents
Allows to update cluster wide specific settings. Settings updated can either be persistent
(applied across restarts) or transient (will not survive a full cluster restart). Here is an
example:
Or:
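The two elided examples presumably set the same value persistently and transiently; a sketch of each, with the transient form matching the response shown below:

```shell
curl -XPUT localhost:9200/_cluster/settings -d '{
    "persistent" : {
        "discovery.zen.minimum_master_nodes" : 2
    }
}'
```

```shell
curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "discovery.zen.minimum_master_nodes" : 2
    }
}'
```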
The cluster responds with the settings updated. So the response for the last example will
be:
{
"persistent" : {},
"transient" : {
"discovery.zen.minimum_master_nodes" : "2"
}
}
Reset settings will not be included in the cluster response. So the response for the last
example will be:
{
"persistent" : {},
"transient" : {}
}
Settings can also be reset using simple wildcards. For instance, to reset all dynamic
discovery.zen settings, a prefix can be used:
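A sketch of such a reset, setting the wildcard key to null:

```shell
curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "discovery.zen.*" : null
    }
}'
```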
Precedence of settings
Transient cluster settings take precedence over persistent cluster settings, which take
precedence over settings configured in the ngStorage.yml config file.
For this reason it is preferable to use the ngStorage.yml file only for local
configurations, and to set all cluster-wide settings with the settings API.
• Index API
• Get API
• Delete API
• Update API
Multi-document APIs
• Bulk API
• Reindex API
All CRUD APIs are single-index APIs. The index parameter accepts a
single index name, or an alias which points to a single index.
The bulk API makes it possible to perform many index/delete operations in a single API
call. This can greatly increase the indexing speed.
The REST API endpoint is /_bulk, and it expects the following JSON structure:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n
NOTE: the final line of data must end with a newline character \n.
The possible actions are index, create, delete and update. index and create expect
a source on the next line, and have the same semantics as the op_type parameter to the
standard index API (i.e. create will fail if a document with the same index and type exists
already, whereas index will add or replace a document as necessary). delete does not
expect a source on the following line, and has the same semantics as the standard delete
API. update expects that the partial doc, upsert and script and its options are specified on
the next line.
If you’re providing text file input to curl, you must use the --data-binary flag instead of
plain -d. The latter doesn’t preserve newlines. Example:
$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary "@requests"; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_
version":1}}]}
Because this format uses literal \n characters as delimiters, please be sure that the JSON
actions and sources are not pretty printed. Here is an example of a correct sequence of
bulk commands:
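The sequence itself was elided; a sketch of what it presumably looked like (index, type and field names are illustrative):

```
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : { "_id" : "1", "_type" : "type1", "_index" : "test" } }
{ "doc" : { "field2" : "value2" } }
```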
In the above example, doc for the update action is a partial document that will be merged
with the already stored document.
A note on the format: the idea here is to make processing as fast as possible. As
some of the actions will be redirected to other shards on other nodes, only
action_and_meta_data is parsed on the receiving node side.
Client libraries using this protocol should try and strive to do something similar on the
client side, and reduce buffering as much as possible.
The response to a bulk action is a large JSON structure with the individual results of each
action that was performed. The failure of a single action does not affect the remaining
actions.
There is no "correct" number of actions to perform in a single bulk call. You should
experiment with different settings to find the optimum size for your particular workload.
If using the HTTP API, make sure that the client does not send HTTP chunks, as this will
slow things down.
Versioning
Each bulk item can include the version value using the _version/version field. It
automatically follows the behavior of the index / delete operation based on the _version
mapping. It also supports the version_type/_version_type field (see versioning).
Routing
Each bulk item can include the routing value using the _routing/routing field. It
automatically follows the behavior of the index / delete operation based on the _routing
mapping.
Parent
Each bulk item can include the parent value using the _parent/parent field. It
automatically follows the behavior of the index / delete operation based on the _parent /
_routing mapping.
Write Consistency
When making bulk calls, you can require a minimum number of active shards in the
partition through the consistency parameter. The values allowed are one, quorum, and
all. It defaults to the node level setting of action.write_consistency, which in turn
defaults to quorum.
For example, in an index with N shards and 2 replicas, there will have to be at least 2
active shards within the relevant partition (quorum) for the operation to succeed. With N
shards and 1 replica, a single active shard is enough (in this case, one and quorum are
the same).
Refresh
Control when the changes made by this request are visible to search. See Refresh API.
Update
When using the update action, _retry_on_conflict can be used as a field in the action
itself (not in the extra payload line) to specify how many times an update should be
retried in the case of a version conflict.
The update action payload supports the following options: doc (partial document),
upsert, doc_as_upsert, script, params (for script), lang (for script) and fields.
See update documentation for details on the options. Curl example with update actions:
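The curl example was elided; a sketch of what it presumably looked like (index, type and field names are illustrative):

```shell
curl -XPOST 'localhost:9200/_bulk' --data-binary '
{ "update" : { "_id" : "1", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3 } }
{ "doc" : { "field" : "value" } }
{ "update" : { "_id" : "2", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3 } }
{ "doc" : { "field" : "value" }, "doc_as_upsert" : true }
'
```

Note the --data-binary flag, which preserves the newlines that delimit the bulk actions.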
Security
See [url-access-control]
The delete API allows to delete a typed JSON document from a specific index based on its
id. The following example deletes the JSON document from an index called twitter, under a
type called tweet, with id valued 1:
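A sketch of the request, which produces the response shown below:

```shell
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1'
```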
{
"_shards" : {
"total" : 10,
"failed" : 0,
"successful" : 10
},
"found" : true,
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 2
}
Each document indexed is versioned. When deleting a document, the version can be
specified to make sure the relevant document we are trying to delete is actually being
deleted and it has not changed in the meantime. Every write operation executed on a
document, deletes included, causes its version to be incremented.
Routing
When indexing using the ability to control the routing, in order to delete a document, the
routing value should also be provided. For example:
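A sketch of such a routed delete (the routing value kimchy is illustrative):

```shell
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1?routing=kimchy'
```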
The above will delete a tweet with id 1, but it will be routed based on the user. Note that
issuing a delete without the correct routing will cause the document not to be deleted.
Many times, the routing value is not known when deleting a document. For those cases,
when specifying the _routing mapping as required, and no routing value is specified,
the delete will be broadcast automatically to all shards.
Parent
The parent parameter can be set, which will basically be the same as setting the routing
parameter.
Note that deleting a parent document does not automatically delete its children. One way of
deleting all child documents given a parent’s id is to perform a delete by query on the
child index using the automatically generated (and indexed) field _parent, which is in
the format parent_type#parent_id.
The delete operation automatically creates an index if it has not been created before (check
out the create index API for manually creating an index), and also automatically creates a
dynamic type mapping for the specific type if it has not been created before (check out the
put mapping API for manually creating type mapping).
Distributed
The delete operation gets hashed into a specific shard id. It then gets redirected into the
primary shard within that id group, and replicated (if needed) to shard replicas within that
id group.
Write Consistency
Control if the operation will be allowed to execute based on the number of active shards
within that partition (replication group). The values allowed are one, quorum, and all. The
parameter to set it is consistency, and it defaults to the node level setting of
action.write_consistency which in turn defaults to quorum.
For example, in an index with N shards and 2 replicas, there will have to be at least 2
active shards within the relevant partition (quorum) for the operation to succeed. With N
shards and 1 replica, a single active shard is enough (in this case, one and quorum are
the same).
Refresh
Control when the changes made by this request are visible to search. See Refresh API.
Timeout
The primary shard assigned to perform the delete operation might not be available when
the delete operation is executed. Some reasons for this might be that the primary shard is
currently recovering from a store or undergoing relocation. By default, the delete operation
will wait on the primary shard to become available for up to 1 minute before failing and
responding with an error. The timeout parameter can be used to explicitly specify how
long it waits. Here is an example of setting it to 5 minutes:
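A sketch of setting the timeout to 5 minutes:

```shell
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1?timeout=5m'
```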
1 - The query must be passed as a value to the query key, in the same way as the Search
API. You can also use the q parameter in the same way as the search api.
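A delete-by-query request of this shape (the index and query are illustrative) produces a response like the one below:

```
POST twitter/_delete_by_query
{
  "query": {
    "match": {
      "message": "some message"
    }
  }
}
```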
{
"took" : 147,
"timed_out": false,
"deleted": 119,
"batches": 1,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until_millis": 0,
"total": 119,
"failures" : [ ]
}
_delete_by_query gets a snapshot of the index when it starts and deletes what it finds
using internal versioning. That means that you’ll get a version conflict if the document
changes between the time when the snapshot was taken and when the delete request is
processed. When the versions match the document is deleted.
Since internal versioning does not support the value 0 as a valid version number,
documents with a version equal to zero cannot be deleted using _delete_by_query and
will fail the request.
If you’d like to count version conflicts rather than cause them to abort then set
conflicts=proceed on the url or "conflicts": "proceed" in the request body.
Back to the API format, you can limit _delete_by_query to a single type. This will only
delete tweet documents from the twitter index:
POST twitter/tweet/_delete_by_query?conflicts=proceed
{
"query": {
"match_all": {}
}
}
It’s also possible to delete documents of multiple indexes and multiple types at once, just
like the search API:
POST twitter,blog/tweet,post/_delete_by_query
{
"query": {
"match_all": {}
}
}
If you provide routing then the routing is copied to the scroll query, limiting the process to
the shards that match that routing value:
POST twitter/_delete_by_query?routing=1
{
"query": {
"range" : {
"age" : {
"gte" : 10
}
}
}
}
By default _delete_by_query uses scroll batches of 1000. You can change the batch size
with the scroll_size URL parameter:
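A sketch of such a request (the query is illustrative):

```
POST twitter/_delete_by_query?scroll_size=5000
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
```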
URL Parameters
In addition to the standard parameters like pretty, the Delete By Query API also supports
refresh, wait_for_completion, consistency, and timeout. For more information
please refer to the source ElasticSearch reference documentation chapter.
The get API allows to get a typed JSON document from the index based on its id. The
following example gets a JSON document from an index called twitter, under a type called
tweet, with id valued 1:
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 1,
"found": true,
"_source" : {
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",
"message" : "trying out ngStorage"
}
}
The above result includes the _index, _type, _id and _version of the document we
wish to retrieve, including the actual _source of the document if it could be found (as
indicated by the found field in the response).
The API also allows to check for the existence of a document using HEAD, for example:
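A sketch of such an existence check; only the status line and headers are returned:

```shell
curl -XHEAD -i 'http://localhost:9200/twitter/tweet/1'
```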
By default, the get API is realtime, and is not affected by the refresh rate of the index (when
data will become visible for search). In order to disable realtime GET, one can set
realtime parameter to false.
When getting a document, one can specify fields to fetch from it. They will, when
possible, be fetched as stored fields (fields mapped as stored in the mapping). When using
realtime GET, there is no notion of stored fields (at least for a period of time, basically, until
the next flush), so they will be extracted from the source itself (note, even if source is not
enabled). It is a good practice to assume that the fields will be loaded from source when
using realtime GET, even if the fields are stored.
Optional Type
The get API allows for _type to be optional. Set it to _all in order to fetch the first
document matching the id across all types.
Source filtering
By default, the get operation returns the contents of the _source field unless you have
used the fields parameter or if the _source field is disabled. You can turn off _source
retrieval by using the _source parameter:
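For example:
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source=false'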
If you only need one or two fields from the complete _source, you can use the
_source_include & _source_exclude parameters to include or filter out the parts
you need. This can be especially helpful with large documents where partial retrieval can
save on network overhead. Both parameters take a comma separated list of fields or
wildcard expressions. Example:
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source_include=*.id&_source_exclude=entities'
If you only want to specify includes, you can use a shorter notation:
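For example (the field list is illustrative):
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source=user,message'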
Fields
For backward compatibility, if the requested fields are not stored, they will be fetched from
the _source (parsed and extracted). This functionality has been replaced by the source
filtering parameter.
Field values fetched from the document itself are always returned as an array. Metadata
fields like _routing and _parent are never returned as an array.
Also only leaf fields can be returned via the field option. So object fields can’t be returned
and such requests will fail.
Generated fields
If no refresh occurred between indexing and refresh, GET will access the
transaction log to fetch the document. However, some fields are generated only when
indexing. If you try to access a field that is only generated when indexing, you will get an
exception (default). You can choose to ignore fields that are generated if the transaction log
is accessed by setting ignore_errors_on_generated_fields=true.
Use the /{index}/{type}/{id}/_source endpoint to get just the _source field of the
document, without any additional content around it. For example:
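For example:
curl -XGET 'http://localhost:9200/twitter/tweet/1/_source'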
You can also use the same source filtering parameters to control which parts of the
_source will be returned:
curl -XGET 'http://localhost:9200/twitter/tweet/1/_source?_source_include=*.id&_source_exclude=entities'
Note, there is also a HEAD variant for the _source endpoint to efficiently test for document
_source existence. An existing document will not have a _source if it is disabled in the
mapping. Curl example:
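For the tweet document used throughout this chapter:
curl -XHEAD -i 'http://localhost:9200/twitter/tweet/1/_source'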
Routing
Chapter 25. Document APIs | 309
NG|Storage Admin Guide
When indexing using the ability to control the routing, the routing value should also be
provided in order to get the document. For example:
The above will get a tweet with id 1, but will be routed based on the user. Note that issuing a
get without the correct routing will cause the document not to be fetched.
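For example, with the routing value set to the user name from the indexing examples:
curl -XGET 'http://localhost:9200/twitter/tweet/1?routing=kimchy'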
Preference
Controls a preference of which shard replicas to execute the get request on. By default,
the operation is randomized between the shard replicas.
_primary
The operation will go and be executed only on the primary shards.
_local
The operation will prefer to be executed on a local allocated shard if possible.
A custom value will be used to guarantee that the same shards will be used for the
same custom value. This can help with "jumping values" when hitting different shards
in different refresh states. A sample value can be something like the web session id,
or the user name.
Refresh
The refresh parameter can be set to true in order to refresh the relevant shard before
the get operation and make it searchable. Setting it to true should be done after careful
thought and verification that this does not cause a heavy load on the system (and slows
down indexing).
Distributed
The get operation gets hashed into a specific shard id. It then gets redirected to one of the
replicas within that shard id and returns the result. The replicas are the primary shard and
its replicas within that shard id group. This means that the more replicas we will have, the
better GET scaling we will have.
Versioning support
You can use the version parameter to retrieve the document only if its current version is
equal to the specified one. This behavior is the same for all version types with the exception
of version type FORCE which always retrieves the document.
Internally, NG|Storage has marked the old document as deleted and added an entirely new
document. The old version of the document doesn’t disappear immediately, although you
won’t be able to access it. NG|Storage cleans up deleted documents in the background as
you continue to index more data.
The index API adds or updates a typed JSON document in a specific index, making it
searchable. The following example inserts the JSON document into the "twitter" index,
under a type called "tweet" with an id of 1:
PUT twitter/tweet/1
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out ngStorage"
}
The result of the above index operation is:
{
"_shards" : {
"total" : 2,
"failed" : 0,
"successful" : 2
},
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 1,
"created" : true,
"forced_refresh": false
}
The _shards header provides information about the replication process of the index
operation.
• total - Indicates how many shard copies (primary and replica shards) the index
operation should be executed on.
• successful - Indicates the number of shard copies the index operation succeeded on.
• failed - Indicates the number of shard copies the index operation failed on.
The index operation automatically creates an index if it has not been created before (check
out the create index API for manually creating an index), and also automatically creates a
dynamic type mapping for the specific type if one has not yet been created (check out the
put mapping API for manually creating a type mapping).
The mapping itself is very flexible and is schema-free. New fields and objects will
automatically be added to the mapping definition of the type specified. Check out the
mapping section for more information on mapping definitions.
Automatic index creation can include a pattern based white/black list, for example, set
action.auto_create_index to +aaa*,-bbb*,+ccc*,-* (+ meaning allowed, and -
meaning disallowed).
Versioning
Each indexed document is given a version number. The associated version number is
returned as part of the response to the index API request. The index API optionally allows
for optimistic concurrency control when the version parameter is specified. This will
control the version of the document the operation is intended to be executed against. A
good example of a use case for versioning is performing a transactional read-then-update.
Specifying a version from the document initially read ensures no changes have happened
in the meantime (when reading in order to update, it is recommended to set preference
to _primary). For example:
PUT twitter/tweet/1?version=2
{
"message" : "ngStorage now has versioning support, double cool!"
}
NOTE: versioning is completely real time, and is not affected by the near real time aspects
of search operations. If no version is provided, then the operation is executed without any
version checks.
By default, internal versioning is used that starts at 1 and increments with each update,
deletes included. Optionally, the version number can be supplemented with an external
value (for example, if maintained in a database). To enable this functionality,
version_type should be set to external. The value provided must be a numeric, long
value greater than or equal to 0, and less than around 9.2e+18. When using the external version
type, instead of checking for a matching version number, the system checks to see if the
version number passed to the index request is greater than the version of the currently
stored document. If true, the document will be indexed and the new version number used. If
the value provided is less than or equal to the stored document’s version number, a version
conflict will occur and the index operation will fail.
A nice side effect is that there is no need to maintain strict ordering of async indexing
operations executed as a result of changes to a source database, as long as version
numbers from the source database are used. Even the simple case of updating the
NG|Storage index using data from a database is simplified if external versioning is used, as
only the latest version will be used if the index operations are out of order for whatever
reason.
Version types
Next to the internal & external version types explained above, NG|Storage also
supports other types for specific use cases. Here is an overview of the different version
types and their semantics.
internal
only index the document if the given version is identical to the version of the stored
document.
external or external_gt
only index the document if the given version is strictly higher than the version of the
stored document or if there is no existing document. The given version will be used as
the new version and will be stored with the new document. The supplied version must
be a non-negative long number.
external_gte
only index the document if the given version is equal or higher than the version of the
stored document. If there is no existing document the operation will succeed as well.
The given version will be used as the new version and will be stored with the new
document. The supplied version must be a non-negative long number.
force
the document will be indexed regardless of the version of the stored document or if
there is no existing document. The given version will be used as the new version and
will be stored with the new document. This version type is typically used for correcting
errors.
NOTE: The external_gte & force version types are meant for special use cases and
should be used with care. If used incorrectly, they can result in loss of data.
Operation Type
The index operation also accepts an op_type that can be used to force a create
operation, allowing for "put-if-absent" behavior. When create is used, the index operation
will fail if a document by that id already exists in the index.
PUT twitter/tweet/1?op_type=create
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out ngStorage"
}
Automatic ID Generation
The index operation can be executed without specifying the id. In such a case, an id will be
generated automatically. In addition, the op_type will automatically be set to create.
Here is an example (note the POST used instead of PUT):
POST twitter/tweet/
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out ngStorage"
}
The result of the above index operation is:
{
"_shards" : {
"total" : 2,
"failed" : 0,
"successful" : 2
},
"_index" : "twitter",
"_type" : "tweet",
"_id" : "6a8ca01c-7896-48e9-81cc-9f70661fcb32",
"_version" : 1,
"created" : true,
"forced_refresh": false
}
Routing
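An index operation with an explicit routing value might look like this (a sketch; the routing
value "kimchy" is the user name from the examples above):
POST twitter/tweet?routing=kimchy
{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out ngStorage"
}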
In the example above, the "tweet" document is routed to a shard based on the routing
parameter provided: "kimchy".
When setting up explicit mapping, the _routing field can be optionally used to direct the
index operation to extract the routing value from the document itself. This does come at the
(very minimal) cost of an additional document parsing pass. If the _routing mapping is
defined and set to be required, the index operation will fail if no routing value is provided
or extracted.
A child document can be indexed by specifying its parent when indexing. For example:
PUT blogs
{
"mappings": {
"tag_parent": {},
"blog_tag": {
"_parent": {
"type": "tag_parent"
}
}
}
}
PUT blogs/blog_tag/1122?parent=1111
{
"tag" : "something"
}
When indexing a child document, the routing value is automatically set to be the same as its
parent, unless the routing value is explicitly specified using the routing parameter.
Distributed
The index operation is directed to the primary shard based on its route (see the Routing
section above) and performed on the actual node containing this shard. After the primary
shard completes the operation, if needed, the update is distributed to applicable replicas.
Write Consistency
Note, for the case where the number of replicas is 1 (total of 2 copies of the data), then the
default behavior is to succeed if 1 copy (the primary) can perform the write.
The index operation only returns after all active shards within the replication group have
indexed the document (sync replication).
Refresh
Control when the changes made by this request are visible to search. See Refresh API.
Noop Updates
When updating a document using the index api, a new version of the document is always
created even if the document hasn’t changed. If this isn’t acceptable use the _update api
with detect_noop set to true. This option isn’t available on the index api because the
index api doesn’t fetch the old source and isn’t able to compare it against the new source.
There isn’t a hard and fast rule about when noop updates aren’t acceptable. It’s a
combination of lots of factors, like how frequently your data source sends updates that are
actually noops and how many queries per second NG|Storage runs on the shard
receiving the updates.
Timeout
The primary shard assigned to perform the index operation might not be available when the
index operation is executed. Some reasons for this might be that the primary shard is
currently recovering from a gateway or undergoing relocation. By default, the index
operation will wait on the primary shard to become available for up to 1 minute before
failing and responding with an error. The timeout parameter can be used to explicitly
specify how long it waits. Here is an example of setting it to 5 minutes:
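For example, reusing the tweet document from above:
PUT twitter/tweet/1?timeout=5m
{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out ngStorage"
}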
Multi GET API allows you to get multiple documents based on an index, type (optional) and id
(and possibly routing). The response includes a docs array with all the fetched documents,
each element similar in structure to a document provided by the get API. Here is an
example:
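For instance, reusing the twitter index from earlier examples (the second id is illustrative):
GET /_mget
{
  "docs" : [
    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1"
    },
    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "2"
    }
  ]
}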
The mget endpoint can also be used against an index (in which case it is not required in the
body):
And type:
In which case, the ids element can directly be used to simplify the request:
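With both the index and the type on the URL, the request above reduces to a sketch like:
GET twitter/tweet/_mget
{
  "ids" : ["1", "2"]
}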
Optional Type
The mget API allows for _type to be optional. Set it to _all or leave it empty in order to
fetch the first document matching the id across all types.
If you don’t set the type and have many documents sharing the same _id, you will end up
getting only the first matching document.
For example, if you have a document 1 within typeA and typeB then the following request
will give you back the same document twice:
GET /test/_mget/
{
"docs" : [
{
"_type":"typeA",
"_id" : "1"
},
{
"_type":"typeB",
"_id" : "1"
}
]
}
By default, the _source field will be returned for every document (if stored). Similar to the
get API, you can retrieve only parts of the _source (or not at all) by using the _source
parameter. You can also use the url parameters _source,_source_include &
_source_exclude to specify defaults, which will be used when there are no per-
document instructions.
For example:
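A sketch (the field names come from the tweet examples earlier in this chapter):
GET /_mget
{
  "docs" : [
    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_source" : false
    },
    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "2",
      "_source" : ["user", "message"]
    }
  ]
}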
Fields
Specific stored fields can be specified to be retrieved per document to get, similar to the
fields parameter of the Get API. For example:
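A sketch, assuming user and message are mapped as stored fields:
GET /_mget
{
  "docs" : [
    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "fields" : ["user", "message"]
    }
  ]
}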
Alternatively, you can specify the fields parameter in the query string as a default to be
applied to all documents.
Generated fields
Routing
Security
See [url-access-control]
Multi termvectors API allows you to get multiple termvectors at once. The documents from
which to retrieve the term vectors are specified by an index, type and id. But the documents
could also be artificially provided. The response includes a docs array with all the fetched
termvectors, each element having the structure provided by the termvectors API. Here is an
example:
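For instance (a sketch; the ids and the term_statistics flag are illustrative):
POST /_mtermvectors
{
  "docs": [
    {
      "_index": "twitter",
      "_type": "tweet",
      "_id": "1",
      "term_statistics": true
    },
    {
      "_index": "twitter",
      "_type": "tweet",
      "_id": "2"
    }
  ]
}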
The _mtermvectors endpoint can also be used against an index (in which case it is not
required in the body):
And type:
If all requested documents are on the same index and have the same type and the parameters
are the same, the request can be simplified:
Additionally, just like for the termvectors API, term vectors could be generated for user
provided documents. The mapping used is determined by _index and _type.
The Index, Update, Delete, and Bulk APIs support setting refresh to control when
changes made by this request are made visible to search. These are the allowed values:
true (or an empty string)
Refresh the relevant primary and replica shards (not the whole index) immediately
after the operation occurs, so that the updated document appears in search results
immediately. This should ONLY be done after careful thought and verification that it
does not lead to poor performance, both from an indexing and a search standpoint.
wait_for
Wait for the changes made by the request to be made visible by a refresh before
replying. This doesn’t force an immediate refresh, rather, it waits for a refresh to
happen. NG|Storage automatically refreshes shards that have changed every
index.refresh_interval which defaults to one second. That setting is dynamic.
Calling the Refresh API or setting refresh to true on any of the APIs that support it
will also cause a refresh, in turn causing already running requests with
refresh=wait_for to return.
false (the default)
Take no refresh related actions. The changes made by this request will be made
visible at some point after the request returns.
Unless you have a good reason to wait for the change to become visible, always use
refresh=false, or, because that is the default, just leave the refresh parameter out of
the URL. That is the simplest and fastest choice.
If you absolutely must have the changes made by a request visible synchronously with the
request then you must pick between putting more load on NG|Storage (true) and waiting
longer for the response (wait_for). Here are a few points that should inform that
decision:
• The more changes being made to the index the more work wait_for saves compared
to true. In the case that the index is only changed once every
index.refresh_interval then it saves no work.
• true creates less efficient index constructs (tiny segments) that must later be
merged into more efficient larger segments.
• Never start multiple refresh=wait_for requests in a row. Instead batch them into a
single bulk request with refresh=wait_for and NG|Storage will start them all in
parallel and return only when they have all finished.
• If the refresh interval is set to -1, disabling the automatic refreshes, then requests with
refresh=wait_for will wait indefinitely until some action causes a refresh.
Conversely, setting index.refresh_interval to something shorter than the default
like 200ms will make refresh=wait_for come back faster, but it’ll still generate
inefficient segments.
• refresh=wait_for only affects the request that it is on, but, by forcing a refresh
immediately, refresh=true will affect other ongoing requests. In general, if you have a
running system you don’t wish to disturb then refresh=wait_for is a smaller
modification.
Bulk requests only take up one slot on each shard that they touch no matter how many
times they modify the shard.
Examples
These will create a document and immediately refresh the index so it is visible:
PUT /test/test/1?refresh
{"test": "test"}
PUT /test/test/2?refresh=true
{"test": "test"}
These will create a document without doing anything to make it visible for search:
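For example (a sketch; the ids continue the sequence above):
PUT /test/test/3
{"test": "test"}
PUT /test/test/5?refresh=false
{"test": "test"}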
This will create a document and wait for it to become visible for search:
PUT /test/test/4?refresh=wait_for
{"test": "test"}
The most basic form of _reindex just copies documents from one index to another. For
more information please refer to the source ElasticSearch reference documentation
chapter.
Returns information and statistics on terms in the fields of a particular document. The
document could be stored in the index or artificially provided by the user. Term vectors are
realtime by default, not near realtime. This can be changed by setting the realtime
parameter to false.
curl -XGET
'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'
Optionally, you can specify the fields for which the information is retrieved either with a
parameter in the url
curl -XGET
'http://localhost:9200/twitter/tweet/1/_termvectors?fields=text,...'
or by adding the requested fields in the request body (see example below). Fields can also
be specified with wildcards, in a similar way to the multi match query.
Return values
Three types of values can be requested: term information, term statistics and field statistics.
Term information
If the requested information wasn’t stored in the index, it will be computed on the fly if
possible. Additionally, term vectors could be computed for documents not even existing in
the index, but instead provided by the user.
Start and end offsets assume UTF-16 encoding is being used. If you want
to use these offsets in order to get the original text that produced this
token, you should make sure that the string you are taking a sub-string of
is also encoded using UTF-16.
Term statistics
By default these values are not returned since term statistics can have a serious
performance impact.
Field statistics
• sum of document frequencies (the sum of document frequencies for all terms in this
field)
• sum of total term frequencies (the sum of total term frequencies of each term in this
field)
Terms Filtering
max_num_terms
Maximum number of terms that must be returned per field. Defaults to 25.
min_term_freq
Ignore words with less than this frequency in the source doc. Defaults to 1.
max_term_freq
Ignore words with more than this frequency in the source doc. Defaults to unbounded.
min_doc_freq
Ignore terms which do not occur in at least this many docs. Defaults to 1.
max_doc_freq
Ignore words which occur in more than this many docs. Defaults to unbounded.
min_word_length
The minimum word length below which words will be ignored. Defaults to 0.
max_word_length
The maximum word length above which words will be ignored. Defaults to unbounded
(0).
Behaviour
The term and field statistics are not accurate. Deleted documents are not taken into
account. The information is only retrieved for the shard the requested document resides in.
The term and field statistics are therefore only useful as relative measures whereas the
absolute numbers have no meaning in this context. By default, when requesting term
vectors of artificial documents, a shard to get the statistics from is randomly selected. Use
routing only to hit a particular shard.
For more information please refer to the source ElasticSearch reference documentation
chapter.
The update API allows you to update a document based on a script provided. The operation gets
the document (collocated with the shard) from the index, runs the script (with optional
script language and parameters), and indexes back the result (it also allows to delete, or ignore
the operation). It uses versioning to make sure no updates have happened during the "get"
and "reindex".
Note, this operation still means full reindex of the document, it just removes some network
roundtrips and reduces chances of version conflicts between the get and the index. The
_source field needs to be enabled for this feature to work.
Scripted updates
We can add a tag to the list of tags (note, if the tag exists, it will still add it, since it's a list):
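A sketch, assuming a document in a hypothetical test index, type type1, with a tags list,
and using the painless script language shown later in this chapter:
POST test/type1/1/_update
{
  "script" : {
    "inline": "ctx._source.tags.add(params.tag)",
    "lang": "painless",
    "params" : {
      "tag" : "blue"
    }
  }
}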
In addition to _source, the following variables are available through the ctx map:
And, we can even change the operation that is executed. This example deletes the doc if
the tags field contains blue, otherwise it does nothing (noop):
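A sketch of such a script, using the same hypothetical test/type1 document (setting
ctx.op to "none" leaves the document untouched):
POST test/type1/1/_update
{
  "script" : {
    "inline": "if (ctx._source.tags.contains(params.tag)) { ctx.op = \"delete\" } else { ctx.op = \"none\" }",
    "lang": "painless",
    "params" : {
      "tag" : "blue"
    }
  }
}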
The update API also supports passing a partial document, which will be merged into the
existing document (simple recursive merge, inner merging of objects, replacing core
"keys/values" and arrays). For example:
If both doc and script are specified, then doc is ignored. Best is to put your field pairs of
the partial document in the script itself.
If doc is specified its value is merged with the existing _source. By default the document
is always reindexed, even if the merged source does not differ from the existing one: if
name was new_name before the request was sent, the document is still reindexed. Set
detect_noop to true to detect such noop updates and skip reindexing.
Upserts
If the document does not already exist, the contents of the upsert element will be inserted
as a new document. If the document does exist, then the script will be executed instead:
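A sketch, assuming a hypothetical counter field: if document 1 does not exist, it is created
with counter set to 1; otherwise the script increments the existing counter:
POST test/type1/1/_update
{
  "script" : {
    "inline": "ctx._source.counter += params.count",
    "lang": "painless",
    "params" : {
      "count" : 4
    }
  },
  "upsert" : {
    "counter" : 1
  }
}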
scripted_upsert
If you would like your script to run regardless of whether the document exists or not, i.e.
the script handles initializing the document instead of the upsert element, then set
scripted_upsert to true:
doc_as_upsert
Instead of sending a partial doc plus an upsert doc, setting doc_as_upsert to true will
use the contents of doc as the upsert value:
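A sketch, reusing the hypothetical name field from the partial-document example:
POST test/type1/1/_update
{
  "doc" : {
    "name" : "new_name"
  },
  "doc_as_upsert" : true
}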
Parameters
retry_on_conflict
In between the get and indexing phases of the update, it is possible that another
process might have already updated the same document. By default, the update will
fail with a version conflict exception. The retry_on_conflict parameter controls
how many times to retry the update before finally throwing an exception.
routing
Routing is used to route the update request to the right shard and sets the routing for
the upsert request if the document being updated doesn’t exist. Can’t be used to
update the routing of an existing document.
parent
Parent is used to route the update request to the right shard and sets the parent for
the upsert request if the document being updated doesn’t exist. Can’t be used to
update the parent of an existing document. If an alias index routing is specified then
it overrides the parent routing and it is used to route the request.
timeout
Timeout waiting for a shard to become available.
consistency
The write consistency of the index/delete operation.
refresh
Control when the changes made by this request are visible to search. See Refresh
API.
fields
Return the relevant fields from the updated document. Specify _source to return the
full updated source.
The update API uses NG|Storage’s versioning support internally to make sure the
document doesn’t change during the update. You can use the version parameter to
specify that the document should only be updated if its version matches the one
specified. By setting version type to force you can force the new version of the
document after update (use with care! with force there is no guarantee the
document didn’t change).
The simplest usage of _update_by_query just performs an update on every document in
the index without changing the source. This is useful to pick up a new property or some
other online mapping change:
POST twitter/_update_by_query?conflicts=proceed
That will return something like this:
{
"took" : 147,
"timed_out": false,
"updated": 120,
"deleted": 0,
"batches": 1,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until_millis": 0,
"total": 120,
"failures" : [ ]
}
_update_by_query gets a snapshot of the index when it starts and indexes what it finds
using internal versioning. That means that you’ll get a version conflict if the document
changes between the time when the snapshot was taken and when the index request is
processed. When the versions match the document is updated and the version number is
incremented.
Since internal versioning does not support the value 0 as a valid version number,
documents with version equal to zero cannot be updated using _update_by_query and
will fail the request.
All update and query failures cause the _update_by_query to abort and are returned in
the failures of the response. The updates that have been performed still stick. In other
words, the process is not rolled back, only aborted. While the first failure causes the abort
all failures that are returned by the failing bulk request are returned in the failures
element so it’s possible for there to be quite a few.
If you want to simply count version conflicts, and not cause the _update_by_query to
abort, you can set conflicts=proceed on the url or "conflicts": "proceed" in the
request body. The first example does this because it is just trying to pick up an online
mapping change and a version conflict simply means that the conflicting document was
updated between the start of the _update_by_query and the time when it attempted to
update the document. This is fine because that update will have picked up the online
mapping update.
Back to the API format, you can limit _update_by_query to a single type. This will only
update tweet documents from the twitter index:
POST twitter/tweet/_update_by_query?conflicts=proceed
You can also limit _update_by_query using the Query DSL. This will update all
documents from the twitter index for the user kimchy:
POST twitter/_update_by_query?conflicts=proceed
{
"query": { 1
"term": {
"user": "kimchy"
}
}
}
1 - The query must be passed as a value to the query key, in the same way as the Search
API. You can also use the q parameter in the same way as the search api.
So far we’ve only been updating documents without changing their source. That is
genuinely useful for things like picking up new properties but it’s only half the fun.
_update_by_query supports a script object to update the document. This will
increment the likes field on all of kimchy’s tweets:
POST twitter/_update_by_query
{
"script": {
"inline": "ctx._source.likes++",
"lang": "painless"
},
"query": {
"term": {
"user": "kimchy"
}
}
}
Just as in the Update API you can set ctx.op to change the operation that is executed:
noop
Set ctx.op = "noop" if your script decides that it doesn’t have to make any
changes. That will cause _update_by_query to omit that document from its
updates. This no operation will be reported in the noop counter in the response body.
Setting ctx.op to anything else is an error. Setting any other field in ctx is an error.
This API doesn’t allow you to move the documents it touches, just modify their source. This
is intentional! We’ve made no provisions for removing the document from its original
location.
It’s also possible to do this whole thing on multiple indexes and multiple types at once, just
like the search API:
POST twitter,blog/tweet,post/_update_by_query
If you provide routing then the routing is copied to the scroll query, limiting the process to
the shards that match that routing value:
POST twitter/_update_by_query?routing=1
By default _update_by_query uses scroll batches of 1000. You can change the batch size
with the scroll_size URL parameter:
POST twitter/_update_by_query?scroll_size=100
_update_by_query can also use the Ingest Node feature by specifying a pipeline like
this:
PUT _ingest/pipeline/set-foo
{
"description" : "sets foo",
"processors" : [ {
"set" : {
"field": "foo",
"value": "bar"
}
} ]
}
POST twitter/_update_by_query?pipeline=set-foo
URL Parameters
In addition to the standard parameters like pretty, the Update By Query API also supports
refresh, wait_for_completion, consistency, and timeout.
Sending the refresh will refresh all shards in the index being updated when the request
completes. This is different from the Index API’s refresh parameter, which causes just the
shard that received the new data to be refreshed.
consistency controls how many copies of a shard must respond to each write request.
timeout controls how long each write request waits for unavailable shards to become
available. Both work exactly how they work in the Bulk API.
requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc)
and throttles the number of requests per second that the update-by-query issues, or it can
be set to -1 to disable throttling. The throttling is done by waiting between bulk batches so
that it can manipulate the scroll timeout. The wait time is the difference between the time it
took the batch to complete and the time requests_per_second *
requests_in_the_batch. Since the batch isn’t broken into multiple bulk requests large
batch sizes will cause NG|Storage to create many requests and then wait for a while before
starting the next set. This is "bursty" instead of "smooth". The default is -1.
Response body
{
"took" : 639,
"updated": 0,
"batches": 1,
"version_conflicts": 2,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"failures" : [ ]
}
updated
The number of documents that were successfully updated.
batches
The number of scroll responses pulled back by the update by query.
version_conflicts
The number of version conflicts that the update by query hit.
retries
The number of retries attempted by update-by-query. bulk is the number of bulk
actions retried and search is the number of search actions retried.
throttled_millis
Number of milliseconds the request slept to conform to requests_per_second.
failures
Array of all indexing failures. If this is non-empty then the request aborted because of
those failures. See conflicts for how to prevent version conflicts from aborting the
operation.
You can fetch the status of all running update-by-query requests with the Task API:
GET _tasks?detailed=true&action=*byquery
1 - this object contains the actual status. It is just like the response JSON with the important
addition of the total field. total is the total number of operations that the update by
query expects to perform. You can estimate the progress by adding the updated, created, and
deleted fields. The request will finish when their sum is equal to the total field.
With the task id you can look up the task directly:
GET /_tasks/taskId:1
Any Update By Query can be canceled using the Task Cancel API:
POST _tasks/taskid:1/_cancel
Cancelation should happen quickly but might take a few seconds. The task status API above
will continue to list the task until it wakes to cancel itself.
Rethrottling
The value of requests_per_second can be changed on a running update by query with
the _rethrottle API:
POST _update_by_query/taskid:1/_rethrottle?requests_per_second=-1
Say you created an index without dynamic mapping, filled it with data, and then added a
mapping value to pick up more fields from the data:
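The create-index request itself is elided here; with dynamic mapping disabled it might look like the following sketch (the exact body referenced by footnote 1 below is an assumption):

```
PUT test
{
  "mappings": {
    "test": {
      "dynamic": false
    }
  }
}
```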
POST test/test?refresh
{
"text": "words words",
"flag": "bar"
}
POST test/test?refresh
{
"text": "words words",
"flag": "foo"
}
PUT test/_mapping/test 2
{
"properties": {
"text": {"type": "text"},
"flag": {"type": "text", "analyzer": "keyword"}
}
}
1 - This means that new fields won’t be indexed, just stored in _source.
2 - This updates the mapping to add the new flag field. To pick up the new field you have to
reindex all documents with it.
POST test/_search?filter_path=hits.total
{
"query": {
"match": {
"flag": "foo"
}
}
}
{
"hits" : {
"total" : 0
}
}
POST test/_update_by_query?refresh&conflicts=proceed
POST test/_search?filter_path=hits.total
{
"query": {
"match": {
"flag": "foo"
}
}
}
{
"hits" : {
"total" : 1
}
}
You can do the exact same thing when adding a field to a multifield.
Index Modules are modules created per index and control all aspects related to an index.
Index Settings
Index level settings can be set per-index. They fall into two groups:
static
Settings that can only be set at index creation time or on a closed index.
dynamic
Settings that can be changed on a live index using the update-index-settings API.
Below is a list of all static index settings that are not associated with any specific index
module:
index.number_of_shards
The number of primary shards that an index should have. Defaults to 5. This setting
can only be set at index creation time. It cannot be changed on a closed index.
index.shard.check_on_startup
(experimental) Whether or not shards should be checked for corruption before
opening. When corruption is detected, it will prevent the shard from being opened.
Accepts:
false
(default) Don’t check for corruption when opening a shard.
checksum
Check for physical corruption.
true
Check for both physical and logical corruption. This is much more expensive in terms
of CPU and memory usage.
index.codec
The default value compresses stored data with LZ4 compression, but this can be set
to best_compression which uses DEFLATE for a higher compression ratio, at the
expense of slower stored fields performance.
Below is a list of all dynamic index settings that are not associated with any specific index
module:
index.number_of_replicas
The number of replicas each primary shard has. Defaults to 1.
index.auto_expand_replicas
Auto-expand the number of replicas based on the number of available nodes. Set to a
dash delimited lower and upper bound (e.g. 0-5) or use all for the upper bound (e.g.
0-all). Defaults to false (i.e. disabled).
index.refresh_interval
How often to perform a refresh operation, which makes recent changes to the index
visible to search. Defaults to 1s. Can be set to -1 to disable refresh.
index.max_result_window
The maximum value of from + size for searches to this index. Defaults to 10000.
Search requests take heap memory and time proportional to from + size and this
limits that memory. See Scroll or Search After for a more efficient alternative to
raising this.
index.max_rescore_window
The maximum value of window_size for rescore requests in searches of this
index. Defaults to index.max_result_window which defaults to 10000.
Search requests take heap memory and time proportional to max(window_size,
from + size) and this limits that memory.
index.blocks.read
Set to true to disable read operations against the index.
index.blocks.write
Set to true to disable write operations against the index.
index.blocks.metadata
Set to true to disable index metadata reads and writes.
index.max_refresh_listeners
Maximum number of refresh listeners available on each shard of the index. These
listeners are used to implement refresh=wait_for.
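As a sketch, a dynamic setting such as index.number_of_replicas can be changed on a live index with the update-index-settings API (the index name my_index is illustrative):

```
PUT /my_index/_settings
{
  "index" : {
    "number_of_replicas" : 2
  }
}
```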
The index modules covered in this chapter are:
Analysis
Settings to define analyzers, tokenizers, token filters and character filters.
Index Shard Allocation
Control over where, when, and how shards are allocated to nodes.
Mapping
Enable or disable dynamic mapping for an index.
Merging
Control over how shards are merged by the background merge process.
Similarities
Configure custom similarity settings to customize how search results are scored.
Slowlog
Control over how slow queries and fetch requests are logged.
Store
Configure the type of filesystem used to access shard data.
Translog
Control over the transaction log and background flush operations.
26.1. Index Shard Allocation
This module provides per-index settings to control the allocation of shards to nodes:
• Shard allocation filtering: Controlling which shards are allocated to which nodes.
• Total shards per node: A hard limit on the number of shards from the same index per
node.
When a node leaves the cluster for whatever reason, intentional or otherwise, the master
reacts by:
• Promoting a replica shard to primary to replace any primaries that were on the node.
• Allocating replica shards to replace the missing replicas (assuming there are enough
nodes).
These actions are intended to protect the cluster against data loss by ensuring that every
shard is fully replicated as soon as possible.
Even though we throttle concurrent recoveries both at the node level and at the cluster
level, this "shard-shuffle" can still put a lot of extra load on the cluster which may not be
necessary if the missing node is likely to return soon. Imagine this scenario:
• Node 5 loses network connectivity.
• The master promotes a replica shard to primary for each primary that was on Node 5.
• The master allocates new replicas to other nodes in the cluster.
• Each new replica makes an entire copy of the primary shard across the network.
The allocation of replica shards which become unassigned because a node has left can be
delayed with the index.unassigned.node_left.delayed_timeout dynamic setting,
which defaults to 1m.
PUT _all/_settings
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "5m"
}
}
With delayed allocation enabled, the above scenario changes to look like this:
• Node 5 loses network connectivity.
• The master promotes a replica shard to primary for each primary that was on Node 5.
• The master logs a message that allocation of unassigned shards has been delayed, and
for how long.
• The cluster remains yellow because there are unassigned replica shards.
• Node 5 returns after a few minutes, before the timeout expires.
• The missing replicas are re-allocated to Node 5 (and sync-flushed shards recover
almost immediately).
This setting will not affect the promotion of replicas to primaries, nor will it
affect the assignment of replicas that have not been assigned previously.
In particular, delayed allocation does not come into effect after a full
cluster restart. Also, in case of a master failover situation, elapsed delay
time is forgotten (i.e. reset to the full initial delay).
If delayed allocation times out, the master assigns the missing shards to another node
which will start recovery. If the missing node rejoins the cluster, and its shards still have
the same sync-id as the primary, shard relocation will be cancelled and the synced shard
will be used for recovery instead. For this reason, the default timeout is set to just one
minute: even if shard relocation begins, cancelling recovery in favour of the synced shard
is cheap.
The number of shards whose allocation has been delayed by this timeout setting can be
viewed with the cluster health API:
GET _cluster/health 1
1 - This request will return a delayed_unassigned_shards value.
If a node is not going to return and you would like NG|Storage to allocate the missing
shards immediately, just update the timeout to zero:
PUT _all/_settings
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "0"
}
}
You can reset the timeout as soon as the missing shards have started to recover.
Shard allocation filtering allows you to specify which nodes are allowed to host the shards
of a particular index.
It is possible to assign arbitrary metadata attributes to each node at startup. For instance,
nodes could be assigned a rack and a size attribute as follows:
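The startup command is elided here; per the footnote below, the same attributes can be set in ngStorage.yml, which might look like this (the attribute key syntax follows Elasticsearch conventions and is an assumption):

```
node.attr.rack: rack1
node.attr.size: big
```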
1 - These attribute settings can also be specified in the ngStorage.yml config file.
These metadata attributes can then be used with the index.routing.allocation.*
settings to allocate an index to a particular group of nodes. For instance, to allocate the
index test to nodes that are either big or medium:
PUT test/_settings
{
"index.routing.allocation.include.size": "big,medium"
}
Alternatively, we can move the index test away from the small nodes with an exclude
rule:
PUT test/_settings
{
"index.routing.allocation.exclude.size": "small"
}
Multiple rules can be specified, in which case all conditions must be satisfied. For instance,
we could move the index test to big nodes in rack1 with the following:
PUT test/_settings
{
"index.routing.allocation.include.size": "big",
"index.routing.allocation.include.rack": "rack1"
}
The following settings are dynamic, allowing live indices to be moved from one set of nodes
to another:
index.routing.allocation.include.{attribute}
Assign the index to a node whose {attribute} has at least one of the comma-
separated values.
index.routing.allocation.require.{attribute}
Assign the index to a node whose {attribute} has all of the comma-separated
values.
index.routing.allocation.exclude.{attribute}
Assign the index to a node whose {attribute} has none of the comma-separated
values.
These special attributes are also supported:
_name
Match nodes by node name
_host_ip
Match nodes by host IP address (IP associated with hostname)
_publish_ip
Match nodes by publish IP address
_ip
Match either _host_ip or _publish_ip
_host
Match nodes by hostname
PUT test/_settings
{
"index.routing.allocation.include._ip": "192.168.2.*"
}
Unallocated shards are recovered in order of priority, whenever possible. Indices are sorted
into priority order as follows:
• the optional index.priority setting (higher priority first)
• the index creation date (newer indices first)
• the index name
This means that, by default, newer indices will be recovered before older indices.
PUT index_2
PUT index_3
{
"settings": {
"index.priority": 10
}
}
PUT index_4
{
"settings": {
"index.priority": 5
}
}
• index_3 will be recovered first because it has the highest priority.
• index_4 will be recovered next because it has the next highest priority.
• index_2 will be recovered last because it has no explicit priority and was created
before the others.
This setting accepts an integer, and can be updated on a live index with the update index
settings API:
PUT index_4/_settings
{
"index.priority": 1
}
The cluster-level shard allocator tries to spread the shards of a single index across as
many nodes as possible. However, depending on how many shards and indices you have,
and how big they are, it may not always be possible to spread shards evenly.
The following dynamic setting allows you to specify a hard limit on the total number of
shards from a single index allowed per node:
index.routing.allocation.total_shards_per_node
The maximum number of shards (replicas and primaries) that will be allocated to a
single node. Defaults to unbounded.
cluster.routing.allocation.total_shards_per_node
The maximum number of shards (replicas and primaries) that will be allocated to a
single node globally. Defaults to unbounded (-1).
These settings impose a hard limit which can result in some shards not
being allocated.
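For example, to cap the test index at one shard (primary or replica) per node, the dynamic per-index setting above could be applied like this:

```
PUT test/_settings
{
  "index.routing.allocation.total_shards_per_node": 1
}
```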
26.2. Analysis
The index analysis module acts as a configurable registry of analyzers that can be used in
order to convert a string field into individual terms which are:
• used by high level queries such as the match query to generate search terms.
26.3. Mapper
The mapper module acts as a registry for the type mapping definitions added to an index
either when creating it or by using the put mapping api. It also handles the dynamic
mapping support for types that have no explicit mappings pre defined. For more
information about mapping definitions, check out the mapping section.
26.4. Merge
A shard in NG|Storage is a Lucene index, and a Lucene index is broken down into
segments. Segments are internal storage elements in the index where the index data is
stored, and are immutable. Smaller segments are periodically merged into larger
segments to keep the index size at bay and to expunge deletes.
The merge process uses auto-throttling to balance the use of hardware resources between
merging and other activities like search.
Merge scheduling
index.merge.scheduler.max_thread_count
The maximum number of threads that may be merging at once. Defaults to
Math.max(1, Math.min(4,
Runtime.getRuntime().availableProcessors() / 2)) which works well
for a good solid-state-disk (SSD). If your index is on spinning platter drives instead,
decrease this to 1.
26.5. Similarity
A similarity (scoring / ranking model) defines how matching documents are scored.
Similarity is per field, meaning that via the mapping one can define a different similarity per
field.
Configuring a custom similarity is considered an expert feature and the builtin similarities
are most likely sufficient, as described in Similarity.
Configuring a similarity
Most existing or custom Similarities have configuration options which can be configured via
the index settings as shown below. The index options can be provided when creating an
index or updating index settings.
"similarity" : {
"my_similarity" : {
"type" : "DFR",
"basic_model" : "g",
"after_effect" : "l",
"normalization" : "h2",
"normalization.h2.c" : "3.0"
}
}
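Once registered, the custom similarity can be referenced by name in a field mapping (the field name title is illustrative):

```
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
```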
Available similarities
BM25 similarity (default)
TF/IDF based similarity that has built-in tf normalization and is supposed to work better for
short fields (like names). See Okapi_BM25 for more details. This similarity has the
following options:
k1
Controls non-linear term frequency normalization (saturation). The default value is
1.2.
b
Controls to what degree document length normalizes tf values. The default value is
0.75.
discount_overlaps
Determines whether overlap tokens (Tokens with 0 position increment) are ignored
when computing norm. By default this is true, meaning overlap tokens do not count
when computing norms.
Classic similarity
The classic similarity that is based on the TF/IDF model. This similarity has the following
option:
discount_overlaps
Determines whether overlap tokens (Tokens with 0 position increment) are ignored
when computing norm. By default this is true, meaning overlap tokens do not count
when computing norms.
DFR similarity
Similarity that implements the divergence from randomness framework. This similarity has
the following options:
basic_model
Possible values: be, d, g, if, in, ine and p.
after_effect
Possible values: no, b and l.
normalization
Possible values: no, h1, h2, h3 and z.
DFI similarity
Similarity that implements the divergence from independence model. This similarity has
the following options:
independence_measure
Possible values standardized, saturated, chisquared.
IB similarity
Information based model. The algorithm is based on the concept that the information
content in any symbolic 'distribution' sequence is primarily determined by the repetitive
usage of its basic elements. For written texts this challenge would correspond to
comparing the writing styles of different authors. This similarity has the following options:
distribution
Possible values: ll and spl.
lambda
Possible values: df and ttf.
normalization
Same as in DFR similarity.
Type name: IB
LM Dirichlet similarity
This similarity has the following option:
mu
Defaults to 2000.
LM Jelinek Mercer similarity
The algorithm attempts to capture important patterns in the text, while leaving out noise.
This similarity has the following option:
lambda
The optimal value depends on both the collection and the query. The optimal value is
around 0.1 for title queries and 0.7 for long queries. Defaults to 0.1. When the value
approaches 0, documents that match more query terms will be ranked higher than
those that match fewer terms.
By default, NG|Storage will use whatever similarity is configured as default. However, the
similarity functions queryNorm() and coord() are not per-field. Consequently, for
expert users wanting to change the implementation used for these two methods, while not
changing the default, it is possible to configure a similarity with the name base. This
similarity will then be used for the two methods.
You can change the default similarity for all fields by putting the following setting into
ngStorage.yml:
index.similarity.default.type: classic
26.6. Slowlog
Search Slow Log
The shard level slow search log allows slow searches (query and fetch phases) to be logged
into a dedicated log file.
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
All of the above settings are dynamic and can be set per-index.
By default, none are enabled (set to -1). The levels (warn, info, debug, trace) let you
control at which logging level the log will be logged. Not all levels need to be
configured (for example, only the warn threshold can be set). The benefit of several levels is
the ability to quickly "grep" for specific thresholds breached.
The logging is done at the shard level scope, meaning the execution of a search request
within a specific shard. It does not encompass the whole search request, which can be
broadcast to several shards in order to execute. One of the benefits of shard level logging
is that it associates the log entry with the actual execution on a specific machine, which
request level logging cannot do.
The logging file is configured by default using the following configuration (found in
logging.yml):
index_search_slow_log_file:
type: dailyRollingFile
file: ${path.logs}/${cluster.name}_index_search_slowlog.log
datePattern: "'.'yyyy-MM-dd"
layout:
type: pattern
conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
Index Slow Log
The indexing slow log is similar in functionality to the search slow log. The log file name
ends with _index_indexing_slowlog.log. The log levels and thresholds are configured
in the ngStorage.yml file in the same way as the search slowlog.
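A typical threshold configuration mirrors the search slowlog shown earlier (the setting names follow Elasticsearch conventions and are an assumption):

```
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
index.indexing.slowlog.source: 1000
```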
All of the above settings are dynamic and can be set per-index.
By default NG|Storage will log the first 1000 characters of the _source in the slowlog. You
can change that with index.indexing.slowlog.source. Setting it to false or 0 will
skip logging the source entirely, and setting it to true will log the entire source regardless
of size.
The index slow log file is configured by default in the logging.yml file:
index_indexing_slow_log_file:
type: dailyRollingFile
file: ${path.logs}/${cluster.name}_index_indexing_slowlog.log
datePattern: "'.'yyyy-MM-dd"
layout:
type: pattern
conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
26.7. Store
The store module allows you to control how index data is stored and accessed on disk.
There are different file system implementations or storage types. By default, NG|Storage
will pick the best implementation based on the operating environment.
This can be overridden for all indices by adding this to the config/ngStorage.yml file:
index.store.type: niofs
It is a static setting that can be set on a per-index basis at index creation time:
PUT /my_index
{
"settings": {
"index.store.type": "niofs"
}
}
The following list shows all the different storage types supported.
fs
Default file system implementation. This will pick the best implementation depending
on the operating environment: simplefs on Windows 32bit, niofs on other 32bit
systems and mmapfs on 64bit systems.
simplefs
The Simple FS type is a straightforward implementation of file system storage (maps to
Lucene SimpleFsDirectory) using a random access file. This implementation has poor
concurrent performance (multiple threads will bottleneck). It is usually better to use
niofs when you need index persistence.
niofs
The NIO FS type stores the shard index on the file system (maps to Lucene
NIOFSDirectory) using NIO. It allows multiple threads to read from the same file
concurrently. It is not recommended on Windows because of a bug in the SUN Java
implementation.
mmapfs
The MMap FS type stores the shard index on the file system (maps to Lucene
MMapDirectory) by mapping a file into memory (mmap). Memory mapping uses up a
portion of the virtual memory address space in your process equal to the size of the
file being mapped. Before using this class, be sure you have allowed plenty of virtual
address space.
By default, NG|Storage completely relies on the operating system file system cache for
caching I/O operations. It is possible to set index.store.preload in order to tell the
operating system to load the content of hot index files into memory upon opening. This
setting accepts a comma-separated list of file extensions: all files whose extension is in the
list will be pre-loaded upon opening. This can be useful to improve search performance of
an index, especially when the host operating system is restarted, since this causes the file
system cache to be trashed. However note that this may slow down the opening of indices,
as they will only become available after data have been loaded into physical memory.
This setting is best-effort only and may not work at all depending on the store type and host
operating system.
PUT /my_index
{
"settings": {
"index.store.preload": ["nvd", "dvd"]
}
}
The default value is the empty array, which means that nothing will be loaded into the file-
system cache eagerly. For indices that are actively searched, you might want to set it to
["nvd", "dvd"], which will cause norms and doc values to be loaded eagerly into
physical memory. These are the two first extensions to look at since NG|Storage performs
random access on them.
A wildcard can be used in order to indicate that all files should be preloaded:
index.store.preload: ["*"]. Note however that it is generally not useful to load all
files into memory, in particular those for stored fields and term vectors, so a better option
might be to set it to ["nvd", "dvd", "tim", "doc", "dim"], which will preload
norms, doc values, terms dictionaries, postings lists and points, which are the most
important parts of the index for search and aggregations.
Note that this setting can be dangerous on indices that are larger than the size of the main
memory of the host, as it would cause the filesystem cache to be trashed upon reopens
after large merges, which would make indexing and searching slower.
26.8. Translog
Changes to Lucene are only persisted to disk during a Lucene commit, which is a relatively
heavy operation and so cannot be performed after every index or delete operation. Changes
that happen after one commit and before another will be lost in the event of process exit or
hardware failure.
To prevent this data loss, each shard has a transaction log or write ahead log associated
with it. Any index or delete operation is written to the translog after being processed by the
internal Lucene index.
In the event of a crash, recent transactions can be replayed from the transaction log when
the shard recovers.
An NG|Storage flush is the process of performing a Lucene commit and starting a new
translog. It is done automatically in the background in order to make sure the transaction
log doesn't grow too large, which would make replaying its operations take a considerable
amount of time during recovery. It is also exposed through an API, though it is rarely
necessary to perform it manually.
Flush settings
The following dynamically updatable settings control how often the in-memory buffer is
flushed to disk:
index.translog.flush_threshold_size
Once the translog hits this size, a flush will happen. Defaults to 512mb.
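Being dynamic, this setting can be changed on a live index, e.g. (the index name my_index is illustrative):

```
PUT /my_index/_settings
{
  "index.translog.flush_threshold_size": "1gb"
}
```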
Translog settings
The data in the transaction log is only persisted to disk when the translog is fsynced and
committed. In the event of hardware failure, any data written since the previous translog
commit will be lost.
The following dynamically updatable per-index settings control the behaviour of the
transaction log:
index.translog.sync_interval
How often the translog is fsynced to disk and committed, regardless of write
operations. Defaults to 5s. Values less than 100ms are not allowed.
index.translog.durability
Whether or not to fsync and commit the translog after every index, delete, update, or
bulk request. This setting accepts the following parameters:
request
(default) fsync and commit after every request. In the event of hardware failure, all
acknowledged writes will already have been committed to disk.
async
fsync and commit in the background every sync_interval. In the event of
hardware failure, all acknowledged writes since the last automatic commit will be
discarded.
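As a sketch, switching an index to asynchronous translog commits (the index.translog.durability setting name follows Elasticsearch conventions and is an assumption):

```
PUT /my_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}
```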
The indices APIs are used to manage individual indices, index settings, aliases, mappings,
and index templates.
APIs in NG|Storage accept an index name when working against a specific index, and
several indices when applicable. The index aliases API allows you to alias an index with a
name, with all APIs automatically converting the alias name to the actual index name. An
alias can also be mapped to more than one index, in which case the alias will automatically
expand to the aliased indices. An alias can also be associated with a filter that will
automatically be applied when searching, and with routing values. An alias cannot have the
same name as an index.
POST /_aliases
{
"actions" : [
{ "add" : { "index" : "test1", "alias" : "alias1" } }
]
}
POST /_aliases
{
"actions" : [
{ "remove" : { "index" : "test1", "alias" : "alias1" } }
]
}
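Multiple actions can also be combined in a single call, in which case they are executed atomically; this is useful for switching an alias from one index to another (index names illustrative):

```
POST /_aliases
{
  "actions" : [
    { "remove" : { "index" : "test1", "alias" : "alias1" } },
    { "add" : { "index" : "test2", "alias" : "alias1" } }
  ]
}
```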
For more information please refer to the source ElasticSearch reference documentation
chapter.
27.2. Analyze
Performs the analysis process on a text and returns the token breakdown of the text.
Can be used without specifying an index against one of the many built in analyzers:
Or by building a custom transient analyzer out of tokenizers, token filters and char filters.
Token filters can use the shorter 'filter' parameter name:
The above will run an analysis on the "this is a test" text, using the default index analyzer
associated with the test index. An analyzer can also be provided to use a different
analyzer:
Also, the analyzer can be derived based on a field mapping, for example:
Will cause the analysis to happen based on the analyzer configured in the mapping for
obj1.field1 (and if not, the default index analyzer).
curl -XGET
'localhost:9200/_analyze?tokenizer=keyword&filter=lowercase&text=this+is+a
+test'
For backwards compatibility, we also accept the text parameter as the body of the request,
provided it doesn’t start with { :
curl -XGET
'localhost:9200/_analyze?tokenizer=keyword&token_filter=lowercase&char_fil
ter=html_strip' -d 'this is a <b>test</b>'
Explain Analyze
If you want to get more advanced details, set explain to true (defaults to false). It will
output all token attributes for each token. You can filter the token attributes you want to
output by setting the attributes option.
27.3. Clear Cache
The clear cache API allows you to clear either all caches or specific caches associated with
one or more indices.
The API, by default, will clear all caches. Specific caches can be cleared explicitly by setting
query, fielddata or request.
All caches relating to specific fields can also be cleared by specifying the fields
parameter with a comma delimited list of the relevant fields.
Multi Index
The clear cache API can be applied to more than one index with a single call, or even on
_all the indices.
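A sketch of these calls (the endpoint path follows Elasticsearch conventions and is an assumption; index names illustrative):

```
POST /twitter/_cache/clear?query=true
POST /twitter,kimchy/_cache/clear
POST /_all/_cache/clear
```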
27.4. Create Index
The create index API allows you to instantiate an index. NG|Storage provides support for
multiple indices, including executing operations across several indices.
Index Settings
Each index created can have specific settings associated with it.
2 - Default for number_of_replicas is 1 (ie one replica for each primary shard)
The second curl example above shows how an index called twitter can be created with
specific settings using YAML. In this case, it creates an index with 3 shards, each with 2
replicas. The index settings can also be defined with JSON:
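The elided example might look like this (a sketch in the document's curl style; host and index name are illustrative):

```
curl -XPUT 'localhost:9200/twitter/' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}'
```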
You do not have to explicitly specify index section inside the settings
section.
For more information regarding all the different index level settings that can be set when
creating an index, please check the index modules section.
Mappings
The create index API allows to provide a set of one or more mappings:
Aliases
Creation Date
When an index is created, a timestamp is stored in the index metadata for the creation date.
By default this is automatically generated but it can also be specified using the
creation_date parameter on the create index API:
27.5. Delete Index
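The delete request itself is elided; it would be a simple DELETE against the index (name from the text):

```
DELETE /twitter
```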
The above example deletes an index called twitter. Specifying an index, alias or wildcard
expression is required.
The delete index API can also be applied to more than one index, by either using a comma
separated list, or on all indices (be careful!) by using _all or * as index.
27.6. Flush
The flush API allows to flush one or more indices through an API. The flush process of an
index basically frees memory from the index by flushing data to the index storage and
clearing the internal transaction log. By default, NG|Storage uses memory heuristics in
order to automatically trigger flush operations as required in order to clear memory.
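A sketch of flushing a single index, and of flushing all indices at once:

```
POST /twitter/_flush
POST /_flush
```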
For more information please refer to the source ElasticSearch reference documentation
chapter.
27.7. Force Merge
The force merge API allows to force merging of one or more indices through an API. The
merge relates to the number of segments a Lucene index holds within each shard. The
force merge operation allows to reduce the number of segments by merging them.
This call will block until the merge is complete. If the http connection is lost, the request
will continue in the background, and any new requests will block until the previous force
merge is complete.
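A minimal call might look like this (the _forcemerge endpoint name follows Elasticsearch conventions and is an assumption):

```
POST /twitter/_forcemerge?max_num_segments=1
```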
Request Parameters
max_num_segments
The number of segments to merge to. To fully merge the index, set it to 1. Defaults to
simply checking if a merge needs to execute, and if so, executes it.
only_expunge_deletes
Should the merge process only expunge segments with deletes in them. In Lucene, a
document is not deleted from a segment, just marked as deleted. During a merge
process of segments, a new segment is created that does not have those deletes. This
flag allows to only merge segments that have deletes. Defaults to false. Note that
this won’t override the index.merge.policy.expunge_deletes_allowed
threshold.
flush
Should a flush be performed after the forced merge. Defaults to true.
Multi Index
The force merge API can be applied to more than one index with a single call, or even on
_all the indices.
The get field mapping API allows you to retrieve mapping definitions for one or more fields.
This is useful when you do not need the complete type mapping returned by the Get
Mapping API.
{
"twitter": {
"tweet": {
"text": {
"full_name": "text",
"mapping": {
"text": { "type": "text" }
}
}
}
}
}
The get field mapping API can be used to get the mapping of multiple fields from more than
one index or type with a single call. General usage of the API follows the following syntax:
host:port/{index}/{type}/_mapping/field/{field} where {index}, {type}
and {field} can stand for comma-separated list of names or wild cards. To get mappings
for all indices you can use _all for {index}. The following are some examples:
curl -XGET
'http://localhost:9200/_all/_mapping/tweet,book/field/message,user.id'
Specifying fields
The get mapping api allows you to specify one or more fields separated by a comma.
You can also use wildcards. The field names can be any of the following:
Full names
the full path, including any parent object name the field is part of (ex. user.id).
Field names
the name of the field without the path to it (ex. id for { "user" : { "id" : 1 }
}).
{
"article": {
"properties": {
"id": { "type": "text" },
"title": { "type": "text"},
"abstract": { "type": "text"},
"author": {
"properties": {
"id": { "type": "text" },
"name": { "type": "text" }
}
}
}
}
}
To select the id of the author field, you can use its full name author.id. name will
return the field author.name:
curl -XGET
"http://localhost:9200/publications/_mapping/article/field/author.id,abstr
act,name"
returns:
Note how the response always uses the same fields specified in the request as keys. The
full_name in every entry contains the full name of the field whose mapping was
returned. This is useful when the request can refer to multiple fields.
Other options
include_defaults
adding include_defaults=true to the query string will cause the response to
include default values, which are normally suppressed.
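For example, reusing the twitter index from above (a sketch):

```
GET /twitter/_mapping/tweet/field/text?include_defaults=true
```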
The get index API allows you to retrieve information about one or more indices.
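A minimal request might look as follows:

```
GET /twitter
```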
The above example gets the information for an index called twitter. Specifying an index,
alias or wildcard expression is required.
The get index API can also be applied to more than one index, or on all indices by using
_all or * as index.
The information returned by the get API can be filtered to include only specific features by
specifying a comma delimited list of features in the URL:
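For example (the twitter index is illustrative):

```
GET /twitter/_settings,_mappings
```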
The above command will only return the settings and mappings for the index called
twitter.
The get mapping API allows you to retrieve mapping definitions for an index or index/type.
The get mapping API can be used to get more than one index or type mapping with a single
call. General usage of the API follows this syntax:
host:port/{index}/_mapping/{type} where both {index} and {type} can accept
a comma-separated list of names. To get mappings for all indices you can use _all for
{index}. The following are some examples:
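Illustrative sketches of the syntax above (the index and type names are hypothetical):

```
GET /twitter/_mapping/tweet

GET /_all/_mapping/tweet,book
```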
If you want to get mappings of all indices and types then the following two examples are
equivalent:
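A sketch of the two equivalent forms:

```
GET /_all/_mapping

GET /_mapping
```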
The get settings API can be used to get settings for more than one index with a single call.
General usage of the API follows the following syntax: host:port/{index}/_settings
where {index} can stand for comma-separated list of index names and aliases. To get
settings for all indices you can use _all for {index}. Wildcard expressions are also
supported. The following are some examples:
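Illustrative sketches (the index names and wildcard pattern are hypothetical):

```
GET /twitter/_settings

GET /twitter,kimchy/_settings

GET /log_2013_*/_settings
```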
The settings that are returned can be filtered with wildcard matching as follows:
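For example, to return only settings whose names start with index.number_of_ (a sketch):

```
GET /2013-*/_settings/index.number_of_*
```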
The HTTP status code indicates if the index exists or not. A 404 means it does not exist, and
200 means it does.
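The existence check is a plain HEAD request, for example:

```
HEAD /twitter
```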
The open and close index APIs allow you to close an index and later open it. A closed
index has almost no overhead on the cluster (except for maintaining its metadata) and is
blocked for read/write operations. A closed index can be opened, at which point it will go
through the normal recovery process.
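A sketch of closing and reopening an index (my_index is illustrative):

```
POST /my_index/_close

POST /my_index/_open
```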
It is possible to open and close multiple indices. An error will be thrown if the request
explicitly refers to a missing index. This behaviour can be disabled using the
ignore_unavailable=true parameter.
All indices can be opened or closed at once using _all as the index name or specifying
patterns that identify them all (e.g. *).
Closed indices still consume a significant amount of disk space, which can cause
problems in managed environments. Closing indices can be disabled via the cluster settings
API by setting cluster.indices.close.enable to false. The default is true.
The PUT mapping API allows you to add a new type to an existing index, or new fields to an
existing type:
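The first step of this example (callout 1 below) is elided here; a minimal sketch of it would create the index with the initial mapping:

```
PUT twitter 1
{
  "mappings": {
    "tweet": {
      "properties": {
        "message": { "type": "text" }
      }
    }
  }
}
```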
PUT twitter/_mapping/user 2
{
"properties": {
"name": {
"type": "text"
}
}
}
PUT twitter/_mapping/tweet 3
{
"properties": {
"user_name": {
"type": "text"
}
}
}
1 - Creates an index called twitter with the message field in the tweet mapping type.
2 - Uses the PUT mapping API to add a new mapping type called user.
3 - Uses the PUT mapping API to add a new field called user_name to the tweet mapping
type.
More information on how to define type mappings can be found in the mapping section.
Multi-index
The PUT mapping API can be applied to multiple indices with a single request. It has the
following format:
PUT /{index}/_mapping/{type}
{ body }
In general, the mapping for existing fields cannot be updated. There are some exceptions
to this rule. For instance:
• new properties can be added to Object datatype fields.
• new multi-fields can be added to existing fields.
• the ignore_above parameter can be updated.
For example:
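The index creation referenced by callout 1 below is elided here; a sketch consistent with the callout descriptions would be:

```
PUT my_index 1
{
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "properties": {
            "first": { "type": "text" }
          }
        },
        "user_id": { "type": "keyword" }
      }
    }
  }
}
```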
PUT my_index/_mapping/user
{
"properties": {
"name": {
"properties": {
"last": { 2
"type": "text"
}
}
},
"user_id": {
"type": "keyword",
"ignore_above": 100 3
}
}
}
1 - Create an index with a first field under the name object field, and a user_id field.
2 - Add a last field under the name object field.
3 - Update the ignore_above setting from its default of 0.
Each mapping parameter specifies whether or not its setting can be updated on an existing
field.
Fields in the same index with the same name in two different types must have the same
mapping, as they are backed by the same field internally. Trying to update a mapping
parameter for a field which exists in more than one type will throw an exception, unless you
specify the update_all_types parameter, in which case it will update that parameter
across all fields with the same name in the same index.
Chapter 27. Indices APIs | 381
NG|Storage Admin Guide
The only parameters which are exempt from this rule (they can be set
to different values on each field) can be found in [field-conflicts].
PUT my_index
{
"mappings": {
"type_one": {
"properties": {
"text": { 1
"type": "text",
"analyzer": "standard"
}
}
},
"type_two": {
"properties": {
"text": { 1
"type": "text",
"analyzer": "standard"
}
}
}
}
}
PUT my_index/_mapping/type_one 2
{
"properties": {
"text": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "whitespace"
}
}
}
1 - Create an index with two types, both of which contain a text field which have the same
mapping.
2 - Trying to update the search_analyzer just for type_one throws an exception like
"Merge failed with failures…".
1 - Adding the update_all_types parameter updates the text field in type_one and
type_two.
The indices recovery API provides insight into on-going index shard recoveries. Recovery
status may be reported for specific indices, or cluster-wide.
For example, the following command would show recovery information for the indices
"index1" and "index2".
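A sketch of such a request:

```
GET /index1,index2/_recovery
```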
To see cluster-wide recovery status simply leave out the index names.
Response:
{
"index1" : {
"shards" : [ {
"id" : 0,
"type" : "SNAPSHOT",
"stage" : "INDEX",
"primary" : true,
"start_time" : "2014-02-24T12:15:59.716",
"start_time_in_millis": 1393244159716,
"total_time" : "2.9m",
"total_time_in_millis" : 175576,
"source" : {
"repository" : "my_repository",
"snapshot" : "my_snapshot",
"index" : "index1"
},
"target" : {
"id" : "ryqJ5lO5S4-lSFbGntkEkg",
The above response shows a single index recovering a single shard. In this case, the source
of the recovery is a snapshot repository and the target of the recovery is the node with
name "my_es_node".
Additionally, the output shows the number and percent of files recovered, as well as the
number and percent of bytes recovered.
In some cases a higher level of detail may be preferable. Setting "detailed=true" will
present a list of physical files in recovery.
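For example (a sketch):

```
GET /_recovery?detailed=true
```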
{
"index1" : {
"shards" : [ {
"id" : 0,
"type" : "STORE",
"stage" : "DONE",
"primary" : true,
"start_time" : "2014-02-24T12:38:06.349",
"start_time_in_millis" : 1393245486349,
"stop_time" : "2014-02-24T12:38:08.464",
"stop_time_in_millis" : 1393245488464,
"total_time" : "2.1s",
"total_time_in_millis" : 2115,
"source" : {
"id" : "RGMdRc-yQWWKIBM4DGvwqQ",
"hostname" : "my.fqdn",
"ip" : "10.0.1.7",
"name" : "my_es_node"
},
"target" : {
"id" : "RGMdRc-yQWWKIBM4DGvwqQ",
"hostname" : "my.fqdn",
"ip" : "10.0.1.7",
"name" : "my_es_node"
},
"index" : {
"size" : {
"total" : "24.7mb",
"total_in_bytes" : 26001617,
"reused" : "24.7mb",
"reused_in_bytes" : 26001617,
"recovered" : "0b",
"recovered_in_bytes" : 0,
"percent" : "100.0%"
},
"files" : {
"total" : 26,
"reused" : 26,
"recovered" : 0,
"percent" : "100.0%",
"details" : [ {
"name" : "segments.gen",
"length" : 20,
"recovered" : 20
}, {
"name" : "_0.cfs",
"length" : 135306,
"recovered" : 135306
}, {
"name" : "segments_2",
"length" : 251,
"recovered" : 251
},
...
]
This response shows a detailed listing (truncated for brevity) of the actual files recovered
and their sizes.
Also shown are the timings in milliseconds of the various stages of recovery: index
retrieval, translog replay, and index start time.
Note that the above listing indicates that the recovery is in stage "done". All recoveries,
whether on-going or complete, are kept in cluster state and may be reported on at any
time. Setting "active_only=true" will cause only on-going recoveries to be reported.
detailed
Display a detailed view. This is primarily useful for viewing the recovery of physical
index files. Default: false.
active_only
Display only those recoveries that are currently on-going. Default: false.
id
Shard ID
type
Recovery type:
• store
• snapshot
• replica
• relocating
stage
Recovery stage:
• init: Recovery has not started
• index: Reading index meta-data and copying bytes from source to destination
• start: Starting the engine; opening the index for use
• translog: Replaying the transaction log
• finalize: Cleanup
• done: Complete
primary
True if shard is primary, false otherwise
start_time
Timestamp of recovery start
stop_time
Timestamp of recovery finish
total_time_in_millis
Total time to recover shard in milliseconds
source
Recovery source:
target
Destination node
index
Statistics about physical index recovery
start
Statistics about time to open and start the index
27.16. Refresh
The refresh API allows you to explicitly refresh one or more indices, making all operations
performed since the last refresh available for search. The (near) real-time capabilities
depend on the index engine used. For example, the internal one requires refresh to be
called, but by default a refresh is scheduled periodically.
Multi Index
The refresh API can be applied to more than one index with a single call, or even on _all
the indices.
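For example (the index names are illustrative):

```
POST /twitter,kimchy/_refresh

POST /_refresh
```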
The rollover index API rolls an alias over to a new index when the existing index is
considered to be too large or too old.
The API accepts a single alias name and a list of conditions. The alias must point to a
single index only. If the index satisfies the specified conditions, then a new index is created
and the alias is switched to point to the new index.
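The rollover example below (callout 2) assumes an index was created with the logs_write alias pointing to it (callout 1), along the lines of:

```
PUT /logs-0001 1
{
  "aliases": {
    "logs_write": {}
  }
}
```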
POST logs_write/_rollover 2
{
"conditions": {
"max_age": "7d",
"max_docs": 1000
}
}
2 - If the index pointed to by logs_write was created 7 or more days ago, or contains
1,000 or more documents, then the logs-0002 index is created and the logs_write
alias is updated to point to logs-0002.
{
"old_index": "logs-0001",
"new_index": "logs-0002",
"rolled_over": true, 1
"dry_run": false, 2
"conditions": { 3
"[max_age: 7d]": false,
"[max_docs: 1000]": true
}
}
If the name of the existing index ends with - and a number, e.g. logs-0001, then
the name of the new index will follow the same pattern, incrementing the number
(logs-0002).
If the old name doesn’t match this pattern then you must specify the name for the new
index as follows:
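For example (the alias and target index name are illustrative):

```
POST /my_alias/_rollover/my_new_index_name
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000
  }
}
```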
The settings, mappings, and aliases for the new index are taken from any matching index
templates. Additionally, you can specify settings, mappings, and aliases in the body of
the request, just like the create index API. Values specified in the request override any
values set in matching index templates. For example, the following rollover request
overrides the index.number_of_shards setting:
PUT /logs-0001
{
"aliases": {
"logs_write": {}
}
}
POST logs_write/_rollover
{
"conditions" : {
"max_age": "7d",
"max_docs": 1000
},
"settings": {
"index.number_of_shards": 2
}
}
Dry run
The rollover API supports dry_run mode, where request conditions can be checked
without performing the actual rollover:
PUT /logs-0001
{
"aliases": {
"logs_write": {}
}
}
POST logs_write/_rollover?dry_run
{
"conditions" : {
"max_age": "7d",
"max_docs": 1000
}
}
Provides low-level information about the segments that a Lucene index (at the shard level)
is built with. This can be used to gain more insight into the state of a shard and an index:
possible optimizations, data "wasted" by deletes, and so on.
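A sketch of the endpoint (the twitter index is illustrative):

```
GET /twitter/_segments

GET /_segments
```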
Response:
{
...
"_3": {
"generation": 3,
"num_docs": 1121,
"deleted_docs": 53,
"size_in_bytes": 228288,
"memory_in_bytes": 3211,
"committed": true,
"search": true,
"version": "4.6",
"compound": true
}
...
}
_0
The key of the JSON document is the name of the segment. This name is used to
generate file names: all files starting with this segment name in the directory of the
shard belong to this segment.
generation
A generation number that is incremented with each commit. The segment name is
derived from this generation number.
num_docs
The number of non-deleted documents that are stored in this segment.
deleted_docs
The number of deleted documents that are stored in this segment. It is perfectly fine
if this number is greater than 0: space is going to be reclaimed when this segment
gets merged.
size_in_bytes
The amount of disk space this segment uses, in bytes.
memory_in_bytes
Segments need to store some data in memory in order to be searchable efficiently.
This number returns the number of bytes that are used for that purpose. A value of -1
indicates that NG|Storage was not able to compute this number.
committed
Whether the segment has been sync’ed on disk. Segments that are committed would
survive a hard reboot. No need to worry in case of false, the data from uncommitted
segments is also stored in the transaction log so that NG|Storage is able to replay
changes on the next start.
search
Whether the segment is searchable. A value of false would most likely mean that the
segment has been written to disk but no refresh occurred since then to make it
searchable.
version
The version of Lucene that has been used to write this segment.
compound
Whether the segment is stored in a compound file. When true, this means that Lucene
merged all files from the segment in a single one in order to save file descriptors.
Verbose mode
To add additional information that can be used for debugging, use the verbose flag.
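For example (a sketch):

```
GET /twitter/_segments?verbose=true
```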
Response:
{
...
}
The shadow replicas functionality is experimental.
If you would like to use a shared filesystem, you can use the shadow replicas settings to
choose where on disk the data for an index should be kept, as well as how NG|Storage
should replay operations on all the replica shards of an index.
node.add_lock_id_to_custom_path: false
You will also need to indicate to the security manager where the custom indices will be, so
that the correct permissions can be applied. You can do this by setting the
path.shared_data setting in ngStorage.yml:
path.shared_data: /opt/data
This means that NG|Storage can read and write to files in any subdirectory of the
path.shared_data setting.
You can then create an index with a custom data path, where each node will use this path
for the data:
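A sketch of such an index creation (the path and settings shown here are illustrative; see the settings list below):

```
PUT /my_index
{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 4,
    "data_path": "/opt/data/my_index",
    "shadow_replicas": true
  }
}
```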
Because shadow replicas do not index the document on replica shards, it’s
possible for the replica’s known mapping to be behind the index’s known
mapping if the latest cluster state has not yet been processed on the node
containing the replica. Because of this, it is highly recommended to use
pre-defined mappings when using shadow replicas.
Also ensure that the NG|Storage process has the correct permissions to
read from and write to the directory used in the index.data_path
setting.
The data_path does not have to contain the index name; in this case "my_index" was
used, but it could just as easily have been "/opt/data/".
An index that has been created with the index.shadow_replicas setting set to "true"
will not replicate document operations to any of the replica shards, instead, it will only
continually refresh. Once segments are available on the filesystem where the shadow
replica resides (after an NG|Storage "flush"), a regular refresh (governed by the
index.refresh_interval) can be used to make the new data searchable.
Since documents are only indexed on the primary shard, realtime GET
requests could fail to return a document if executed on the replica shard.
In order to ensure the data is being synchronized in a fast enough manner, you may need to
tune the flush threshold for the index to a smaller value.
The NG|Storage cluster will still detect the loss of a primary shard, and transform the
replica into a primary in this situation. This transformation will take slightly longer, since
no IndexWriter is maintained for each shadow replica.
Below is the list of settings that can be changed using the update settings API:
index.data_path (string)
Path to use for the index’s data. Note that by default NG|Storage will append the node
ordinal to the path to ensure that multiple instances of NG|Storage on the same
machine do not share a data directory.
index.shadow_replicas
Boolean value indicating this index should use shadow replicas. Defaults to false.
index.shared_filesystem
Boolean value indicating this index uses a shared filesystem. Defaults to true if
index.shadow_replicas is set to true, false otherwise.
index.shared_filesystem.recover_on_any_node
Boolean value indicating whether the primary shards for the index should be allowed
to recover on any node in the cluster. If a node holding a copy of the shard is found,
recovery prefers that node. Defaults to false.
node.add_lock_id_to_custom_path
Boolean setting indicating whether NG|Storage should append the node’s ordinal to
the custom data path. For example, if this is enabled and a path of "/tmp/foo" is used,
the first locally-running node will use "/tmp/foo/0", the second will use "/tmp/foo/1",
the third "/tmp/foo/2", etc. Defaults to true.
Provides store information for shard copies of indices. Store information reports on which
nodes shard copies exist, the shard copy allocation ID, a unique identifier for each shard
copy, and any exceptions encountered while opening the shard index or from an earlier
engine failure.
By default, store information is only listed for shards that have at least one unallocated
copy. When the cluster health status is yellow, this will list store information for shards that
have at least one unassigned replica. When the cluster health status is red, it will list store
information for shards with unassigned primaries.
Endpoints include shard stores information for a specific index, several indices, or all:
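Sketches of the endpoint forms (the index names are illustrative):

```
GET /test/_shard_stores

GET /test1,test2/_shard_stores

GET /_shard_stores
```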
The scope of shards for which store information is listed can be changed through the
status param. It defaults to 'yellow' and 'red': 'yellow' lists store information of shards
with at least one unassigned replica, and 'red' of shards with an unassigned primary. Use
'green' to list store information for shards with all copies assigned.
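For example (a sketch):

```
GET /_shard_stores?status=green
```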
Response:
3 - The node information that hosts a copy of the store, the key is the unique node ID.
5 - The version of the store copy (available only for legacy shard copies that have not yet
been active in a current version of NG|Storage)
6 - The status of the store copy, whether it is used as a primary, replica or not used at all
7 - Any exception encountered while opening the shard index or from earlier engine failure
The shrink index API allows you to shrink an existing index into a new index with fewer
primary shards. The requested number of primary shards in the target index must be a
factor of the number of shards in the source index. For example an index with 8 primary
shards can be shrunk into 4, 2 or 1 primary shards or an index with 15 primary shards can
be shrunk into 5, 3 or 1. If the number of shards in the index is a prime number it can only
be shrunk into a single primary shard. Before shrinking, a (primary or replica) copy of every
shard in the index must be present on the same node.
• First, it creates a new target index with the same definition as the source index, but with
a smaller number of primary shards.
• Then it hard-links segments from the source index into the target index. (If the file
system doesn’t support hard-linking, then all segments are copied into the new index,
which is a much more time consuming process.)
• Finally, it recovers the target index as though it were a closed index which had just been
re-opened.
In order to shrink an index, the index must be marked as read-only, and a (primary or
replica) copy of every shard in the index must be relocated to the same node and have
health green.
PUT /my_source_index/_settings
{
"settings": {
"index.routing.allocation.require._name": "shrink_node_name",
1
"index.blocks.write": true 2
}
}
1 - Forces the relocation of a copy of each shard to the node with name
shrink_node_name. See Shard Allocation Filtering for more options.
2 - Prevents write operations to this index while still allowing metadata changes like
deleting the index.
It can take a while to relocate the source index. Progress can be tracked with the _cat
recovery API, or the cluster health API can be used to wait until all shards have
relocated with the wait_for_relocating_shards parameter.
Shrinking an index
POST my_source_index/_shrink/my_target_index
• The index must have more primary shards than the target index.
• The node handling the shrink process must have sufficient free disk
space to accommodate a second copy of the existing index.
The _shrink API is similar to the create index API and accepts settings and
aliases parameters for the target index:
POST my_source_index/_shrink/my_target_index
{
"settings": {
"index.number_of_replicas": 1,
"index.number_of_shards": 1, 1
"index.codec": "best_compression" 2
},
"aliases": {
"my_search_indices": {}
}
}
1 - The number of shards in the target index. This must be a factor of the number of shards
in the source index.
2 - Best compression will only take effect when new writes are made to the index, such as
when force-merging the shard to a single segment.
The shrink process can be monitored with the _cat recovery API, or the cluster
health API can be used to wait until all primary shards have been allocated by setting the
wait_for_status parameter to yellow.
The _shrink API returns as soon as the target index has been added to the cluster state,
before any shards have been allocated. At this point, all shards are in the state
unassigned. If, for any reason, the target index can’t be allocated on the shrink node, its
primary shard will remain unassigned until it can be allocated on that node.
Once the primary shard is allocated, it moves to state initializing, and the shrink
process begins. When the shrink operation completes, the shard will become active. At
that point, NG|Storage will try to allocate any replicas and may decide to relocate the
primary shard to another node.
Indices level stats provide statistics on different operations happening on an index. The API
provides statistics on the index level scope (though most stats can also be retrieved using
node level scope).
The following returns high level aggregation and index level stats for all indices:
curl localhost:9200/_stats
curl localhost:9200/index1,index2/_stats
By default, all stats are returned. You can also restrict the response to specific stats by
naming them in the URI. Those stats can be any of:
docs
The number of docs / deleted docs (docs not yet merged out). Note, affected by
refreshing the index.
store
The size of the index.
get
Get statistics, including missing stats.
search
Search statistics including suggest statistics. You can include statistics for custom
groups by adding an extra groups parameter (search operations can be associated
with one or more groups). The groups parameter accepts a comma separated list of
group names. Use _all to return statistics for all groups.
segments
Retrieve the memory use of the open segments. Optionally, setting the
include_segment_file_sizes flag reports the aggregated disk usage of each of
the Lucene index files.
completion
Completion suggest statistics.
fielddata
Fielddata statistics.
flush
Flush statistics.
merge
Merge statistics.
request_cache
Shard request cache statistics.
refresh
Refresh statistics.
warmer
Warmer statistics.
translog
Translog statistics.
fields
List of fields to be included in the statistics. This is used as the default list unless a
more specific field list is provided (see below).
completion_fields
List of fields to be included in the Completion Suggest statistics.
fielddata_fields
List of fields to be included in the Fielddata statistics.
# Get back stats for merge and refresh only for all indices
curl 'localhost:9200/_stats/merge,refresh'
# Get back stats for type1 and type2 documents for the my_index index
curl 'localhost:9200/my_index/_stats/indexing?types=type1,type2'
# Get back just search stats for group1 and group2
curl 'localhost:9200/_stats/search?groups=group1,group2'
The stats returned are aggregated at the index level, with primaries and total
aggregations, where primaries are the values for only the primary shards, and total
are the combined values for both primary and replica shards.
In order to get back shard level stats, set the level parameter to shards.
Note, as shards move around the cluster, their stats will be cleared as they are created on
other nodes. On the other hand, even though a shard "left" a node, that node will still retain
the stats that shard contributed to.
Index templates allow you to define templates that will automatically be applied when new
indices are created. The templates include both settings and mappings, and a simple
pattern template that controls whether the template should be applied to the new index.
For example:
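A sketch of such a template (the settings and mappings are illustrative):

```
PUT /_template/template_1
{
  "template": "te*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "type1": {
      "_source": { "enabled": false }
    }
  }
}
```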
Defines a template named template_1, with a template pattern of te*. The settings and
mappings will be applied to any index name that matches the te* template.
1 - the {index} placeholder within the alias name will be replaced with the actual index
name that the template gets applied to during index creation.
Deleting a Template
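A template can be deleted by name, for example:

```
DELETE /_template/template_1
```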
Getting templates
Index templates are identified by a name (in the above case template_1) and can be
retrieved using the following:
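Sketches of the retrieval forms (wildcards and comma-separated lists are also accepted):

```
GET /_template/template_1

GET /_template/temp*

GET /_template
```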
Template exists
The HTTP status code indicates if the template with the given name exists or not. A status
code 200 means it exists, a 404 it does not.
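For example:

```
HEAD /_template/template_1
```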
Multiple index templates can potentially match an index; in this case, both the settings and
mappings are merged into the final configuration of the index. The order of the merging can
be controlled using the order parameter, with lower orders being applied first and higher
orders overriding them. For example:
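A sketch of two overlapping templates, consistent with the description of the te* behaviour below:

```
PUT /_template/template_1
{
  "template": "*",
  "order": 0,
  "mappings": {
    "type1": {
      "_source": { "enabled": false }
    }
  }
}

PUT /_template/template_2
{
  "template": "te*",
  "order": 1,
  "mappings": {
    "type1": {
      "_source": { "enabled": true }
    }
  }
}
```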
The above will disable storing the _source on all type1 types, but for indices that start
with te*, _source will still be enabled. Note that for mappings the merging is "deep",
meaning that specific object/property-based mappings can easily be added/overridden in
higher order templates, with lower order templates providing the basis.
The HTTP status code indicates if the type exists or not. A 404 means it does not exist, and
200 means it does.
{
"index" : {
"number_of_replicas" : 4
}
}
The above will change the number of replicas to 4 from the current number of replicas.
Here is a curl example:
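A sketch of such a request (my_index is illustrative):

```
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
  "index": {
    "number_of_replicas": 4
  }
}'
```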
The list of per-index settings which can be updated dynamically on live indices can be found
in Index Modules.
For example, the update settings API can be used to dynamically tune an index for bulk
indexing performance and later move it back to a more real-time indexing state.
Before the bulk indexing is started, use:
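A sketch, disabling refresh during the bulk load (my_index is illustrative):

```
PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}
```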
(Another optimization option is to start the index without any replicas, and only add them
later, but that really depends on the use case.)
Then, once bulk indexing is done, the settings can be updated (back to the defaults for
example):
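For example, restoring the default refresh interval (a sketch):

```
PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}
```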
It is also possible to define new analyzers for the index, but the index must be closed
first and reopened after the changes are made.
For example if content analyzer hasn’t been defined on myindex yet you can use the
following commands to add it:
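A sketch of the close/update/open sequence (the analyzer definition is illustrative):

```
POST /myindex/_close

PUT /myindex/_settings
{
  "analysis": {
    "analyzer": {
      "content": {
        "type": "custom",
        "tokenizer": "whitespace"
      }
    }
  }
}

POST /myindex/_open
```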
27.26. Upgrade
The upgrade API allows you to upgrade one or more indices to the latest Lucene format.
The upgrade process converts any segments written with older formats.
For more information, please refer to the source Elasticsearch reference documentation
chapter.
Ingest Node
You can use ingest node to pre-process documents before the actual indexing takes place.
This pre-processing is performed by an ingest node that intercepts bulk and index requests,
applies the transformations, and then passes the documents back to the index or bulk APIs.
You can enable ingest on any node or even have dedicated ingest nodes. Ingest is enabled
by default on all nodes. To disable ingest on a node, configure the following setting in the
ngStorage.yml file:
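The relevant setting is:

```
node.ingest: false
```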
To pre-process documents before indexing, you define a pipeline that specifies a series of
processors. Each processor transforms the document in some way. For example, you may
have a pipeline that consists of one processor that removes a field from the document
followed by another processor that renames a field.
To use a pipeline, you simply specify the pipeline parameter on an index or bulk request
to tell the ingest node which pipeline to use. For example:
PUT my-index/my-type/my-id?pipeline=my_pipeline_id
{
"foo": "bar"
}
See Ingest APIs for more information about creating, adding, and deleting pipelines.
A pipeline is a definition of a series of processors that are to be executed in the same order
as they are declared. A pipeline consists of two main fields: a description and a list of
processors:
{
"description" : "...",
"processors" : [ ... ]
}
The description is a special field to store a helpful description of what the pipeline does.
The put pipeline API adds pipelines and updates existing pipelines in the cluster.
PUT _ingest/pipeline/my-pipeline-id
{
"description" : "describe pipeline",
"processors" : [
{
"set" : {
"field": "foo",
"value": "bar"
}
}
]
}
The put pipeline API also instructs all ingest nodes to reload their in-memory
representation of pipelines, so that pipeline changes take effect immediately.
The get pipeline API returns pipelines based on ID. This API always returns a local
reference of the pipeline.
GET _ingest/pipeline/my-pipeline-id
Example response:
For each returned pipeline, the source and the version are returned. The version is useful
for knowing which version of the pipeline the node has. You can specify multiple IDs to
return more than one pipeline. Wildcards are also supported.
DELETE _ingest/pipeline/my-pipeline-id
The simulate pipeline API executes a specific pipeline against the set of documents
provided in the body of the request.
You can either specify an existing pipeline to execute against the provided documents, or
supply a pipeline definition in the body of the request.
Here is the structure of a simulate request with a pipeline definition provided in the body of
the request:
POST _ingest/pipeline/my-pipeline-id/_simulate
{
"docs" : [
{ /** first document **/ },
{ /** second document **/ },
// ...
]
}
Here is an example of a simulate request with a pipeline defined in the request and its
response:
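A sketch of such a request (the processor and documents are illustrative):

```
POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "_description",
    "processors": [
      {
        "set": {
          "field": "field2",
          "value": "_value"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": { "foo": "bar" }
    }
  ]
}
```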
Response:
You can use the simulate pipeline API to see how each processor affects the ingest
document as it passes through the pipeline. To see the intermediate results of each
processor in the simulate request, you can add the verbose parameter to the request.
Response:
{
"docs": [
{
"processor_results": [
{
"tag": "processor[set]-0",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field2": "_value2",
"foo": "bar"
},
"_ingest": {
"timestamp": "2016-01-05T00:02:51.383+0000"
}
}
},
{
"tag": "processor[set]-1",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field3": "_value3",
"field2": "_value2",
"foo": "bar"
},
"_ingest": {
"timestamp": "2016-01-05T00:02:51.383+0000"
}
}
}
]
},
{
"processor_results": [
{
"tag": "processor[set]-0",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field2": "_value2",
"foo": "rab"
},
"_ingest": {
"timestamp": "2016-01-05T00:02:51.384+0000"
}
}
},
{
"tag": "processor[set]-1",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
The processors in a pipeline have read and write access to documents that pass through
the pipeline. The processors can access fields in the source of a document and the
document’s metadata fields.
Accessing a field in the source is straightforward. You simply refer to fields by their name.
For example:
{
"set": {
"field": "my_field",
"value": 582.1
}
}
On top of this, fields from the source are always accessible via the _source prefix:
{
"set": {
"field": "_source.my_field",
"value": 582.1
}
}
You can access metadata fields in the same way that you access fields in the source. This is
possible because NG|Storage doesn’t allow fields in the source that have the same name as
metadata fields.
{
"set": {
"field": "_id",
"value": "1"
}
}
The following metadata fields are accessible by a processor: _index, _type, _id,
_routing, _parent.
418 | Chapter 30. Accessing Data in Pipelines
NG|Storage Admin Guide
Accessing Ingest Metadata Fields
Beyond metadata fields and source fields, ingest also adds ingest metadata to the
documents that it processes. These metadata properties are accessible under the
_ingest key. Currently ingest adds the ingest timestamp under the
_ingest.timestamp key of the ingest metadata. The ingest timestamp is the time when
NG|Storage received the index or bulk request to pre-process the document.
Any processor can add ingest-related metadata during document processing. Ingest
metadata is transient and is lost after a document has been processed by the pipeline.
Therefore, ingest metadata won’t be indexed.
The following example adds a field with the name received. The value is the ingest
timestamp:
{
"set": {
"field": "received"
"value": "{{_ingest.timestamp}}"
}
}
Unlike NG|Storage metadata fields, the ingest metadata field name _ingest can be used
as a valid field name in the source of a document. Use _source._ingest to refer to the
field in the source document. Otherwise, _ingest will be interpreted as an ingest
metadata field.
A number of processor settings also support templating. Settings that support templating
can have zero or more template snippets. A template snippet begins with {{ and ends with
}}. Accessing fields and metafields in templates is exactly the same as via regular
processor field settings.
The following example adds a field named field_c. Its value is a concatenation of the
values of field_a and field_b.
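A sketch of such a snippet, using a set processor with two template snippets (the space separator is illustrative), might be:
{
  "set": {
    "field": "field_c",
    "value": "{{field_a}} {{field_b}}"
  }
}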
The following example uses the value of the geoip.country_iso_code field in the
source to set the index that the document will be indexed into:
{
"set": {
"field": "_index"
"value": "{{geoip.country_iso_code}}"
}
}
In its simplest use case, a pipeline defines a list of processors that are executed
sequentially, and processing halts at the first exception. This behavior may not be desirable
when failures are expected. For example, you may have logs that don’t match the specified
grok expression. Instead of halting execution, you may want to index such documents into a
separate index.
To enable this behavior, you can use the on_failure parameter. The on_failure
parameter defines a list of processors to be executed immediately following the failed
processor. You can specify this parameter at the pipeline level, as well as at the processor
level. If a processor specifies an on_failure configuration, whether it is empty or not, any
exceptions that are thrown by the processor are caught, and the pipeline continues
executing the remaining processors. Because you can define further processors within the
scope of an on_failure statement, you can nest failure handling.
The following example defines a pipeline that renames the foo field in the processed
document to bar. If the document does not contain the foo field, the processor attaches an
error message to the document for later analysis within NG|Storage.
{
"description" : "my first pipeline with handled exceptions",
"processors" : [
{
"rename" : {
"field" : "foo",
"target_field" : "bar",
"on_failure" : [
{
"set" : {
"field" : "error",
"value" : "field \"foo\" does not exist, cannot rename to
\"bar\""
}
}
]
}
}
]
}
The following example defines an on_failure block on a whole pipeline to change the
index to which failed documents get sent.
Alternatively, you can ignore a failure and simply continue with the next processor by using the ignore_failure setting. If, in the example below, the field foo does not exist, the failure is caught and the pipeline continues to execute, which in this case means that the pipeline does nothing.
{
"description" : "my first pipeline with handled exceptions",
"processors" : [
{
"rename" : {
"field" : "foo",
"target_field" : "bar",
"ignore_failure" : true
}
}
]
}
You may want to retrieve the actual error message that was thrown by a failed processor.
To do so you can access metadata fields called on_failure_message,
on_failure_processor_type, and on_failure_processor_tag. These fields are
only accessible from within the context of an on_failure block.
Here is an updated version of the example that you saw earlier. But instead of setting the
error message manually, the example leverages the on_failure_message metadata
field to provide the error message.
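A sketch of that updated pipeline might be:
{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}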
• Append Processor
• Convert Processor
• Date Processor
• Fail Processor
• Foreach Processor
• Grok Processor
• Gsub Processor
• Join Processor
• Lowercase Processor
• Remove Processor
• Rename Processor
• Script Processor
• Set Processor
• Split Processor
• Sort Processor
• Trim Processor
• Uppercase Processor
For more information, please refer to the corresponding chapter of the Elasticsearch reference documentation.
Mapping
Mapping is the process of defining how a document, and the fields it contains, are stored
and indexed. For instance, use mappings to define:
• whether the values of all fields in the document should be indexed into the catch-all
_all field.
Each index has one or more mapping types, which are used to divide the documents in an
index into logical groups. User documents might be stored in a user type, and blog posts in
a blogpost type.
Meta-fields
Fields or properties
Each mapping type contains a list of fields or properties pertinent to that type. A
user type might contain title, name, and age fields, while a blogpost type might
contain title, body, user_id and created fields. Fields with the same name in
different mapping types in the same index must have the same mapping.
Field datatypes
Each field has a datatype, which can be:
• a simple type like text, keyword, date, long, double, boolean or ip.
• a type which supports the hierarchical nature of JSON such as object or nested.
It is often useful to index the same field in different ways for different purposes. For
instance, a string field could be indexed as a text field for full-text search, and as a
keyword field for sorting or aggregations. Alternatively, you could index a string field with
the standard analyzer, the english analyzer, and the french analyzer.
This is the purpose of multi-fields. Most datatypes support multi-fields via the Multi-Fields
parameter.
Dynamic mapping
Fields and mapping types do not need to be defined before being used. Thanks to dynamic
mapping, new mapping types and new field names will be added automatically, just by
indexing a document. New fields can be added both to the top-level mapping type, and to inner object and nested fields.
The dynamic mapping rules can be configured to customise the mapping that is used for
new types and new fields.
Explicit mappings
You know more about your data than NG|Storage can guess, so while dynamic mapping can
be useful to get started, at some point you will want to specify your own explicit mappings.
You can create mapping types and field mappings when you create an index, and you can
add mapping types and fields to an existing index with the PUT mapping API.
Other than where documented, existing type and field mappings cannot be updated.
Changing the mapping would mean invalidating already indexed documents. Instead, you
should create a new index with the correct mappings and reindex your data into that index.
Mapping types are used to group fields, but the fields in each mapping type are not independent of each other. Fields with the same name, in the same index, in different mapping types, map to the same field internally, and must have the same mapping.
If a title field exists in both the user and blogpost mapping types, the title fields
must have exactly the same mapping in each type. The only exceptions to this rule are the
Copy-To, Dynamic, Enabled, Ignore Above, Include In All, and Properties parameters, which
may have different settings per field.
Usually, fields with the same name also contain the same type of data, so having the same
mapping is not a problem. When conflicts do arise, these can be solved by choosing more
descriptive names, such as user_title and blog_title.
Example mapping
A mapping for the example described above could be specified when creating the index, as
follows:
PUT my_index 1
{
"mappings": {
"user": { 2
"_all": { "enabled": false }, 3
"properties": { 4
"title": { "type": "text" }, 5
"name": { "type": "text" }, 5
"age": { "type": "integer" } 5
}
},
"blogpost": { 2
"_all": { "enabled": false }, 3
"properties": { 4
"title": { "type": "text" }, 5
"body": { "type": "text" }, 5
"user_id": {
"type": "keyword" 5
},
"created": {
"type": "date", 5
"format": "strict_date_optional_time||epoch_millis"
}
}
}
}
}
1 - Create an index called my_index.
2 - Add mapping types called user and blogpost.
3 - Disable the _all meta field for the user mapping type.
4 - Specify fields or properties in each mapping type.
5 - Specify the datatype and mapping for each field.
One of the most important features of NG|Storage is that it tries to get out of your way and let you start exploring your data as quickly as possible. To index a document, you don't have to first create an index, define a mapping type, and define your fields: you can just index a document, and the index, type, and fields will spring to life automatically:
PUT data/counters/1 1
{ "count": 5 }
1 - Creates the data index, the counters mapping type, and a field called count with datatype long.
The automatic detection and addition of new types and fields is called dynamic mapping.
The dynamic mapping rules can be customised to suit your purposes with:
• Default mapping
• Dynamic templates
Index templates allow you to configure the default mappings, settings and
aliases for new indices, whether created automatically or explicitly.
Automatic type creation can also be disabled for individual indices with an index-level setting:
PUT data/_settings 1
{
"index.mapper.dynamic":false
}
1 - Disable automatic type creation for the data index.
The default mapping, which will be used as the base mapping for any new mapping types,
can be customised by adding a mapping type with the name default to an index, either
when creating the index or later on with the PUT mapping API.
PUT my_index
{
"mappings": {
"_default_": { 1
"_all": {
"enabled": false
}
},
"user": {}, 2
"blogpost": { 3
"_all": {
"enabled": true
}
}
}
}
1 - The _default_ mapping type disables the _all field by default.
2 - The user type inherits the settings from _default_.
3 - The blogpost type overrides the defaults and enables the _all field.
While the default mapping can be updated after an index has been created, the new
defaults will only affect mapping types that are created afterwards.
The default mapping can be used in conjunction with Index templates to control
dynamically created types within automatically created indices:
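The logging template itself is not shown above; a sketch consistent with the callouts below (the template name and the .raw sub-field are as described there) might be:
PUT _template/logging
{
  "template": "logs-*", 1
  "mappings": {
    "_default_": { 3
      "_all": {
        "enabled": false
      },
      "dynamic_templates": [
        {
          "strings": { 4
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": {
                "raw": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}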
PUT logs-2015.10.01/event/1
{ "message": "error:16" }
1 - The logging template will match any indices beginning with logs-.
3 - The _all field will be disabled by default for new type mappings.
4 - String fields will be created with a text main field, and a keyword .raw field.
By default, when a previously unseen field is found in a document, NG|Storage will add the
new field to the type mapping. This behaviour can be disabled, both at the document and at
the object level, by setting the dynamic parameter to false or to strict.
Assuming dynamic field mapping is enabled, some simple rules are used to determine
which datatype the field should have:
JSON datatype: NG|Storage datatype
null: No field is added.
true or false: boolean field
floating point number: float field
integer: long field
object: object field
array: Depends on the first non-null value in the array.
string: Either a date field (if the value passes date detection), a double or long field (if the value passes numeric detection) or a text field, with a keyword sub-field.
These are the only field datatypes that are dynamically detected. All other datatypes must
be mapped explicitly.
Besides the options listed below, dynamic field mapping rules can be further customised
with dynamic_templates.
Two settings let you limit mapping explosion, for example to prevent adversarial documents from creating huge mappings through dynamic mapping:
index.mapping.total_fields.limit
The maximum number of fields in an index. The default value is 1000.
index.mapping.depth.limit
The maximum depth for a field, which is measured as the number of inner objects. The default value is 20.
Date detection
If date_detection is enabled (default), then new string fields are checked to see whether
their contents match any of the date patterns specified in dynamic_date_formats. If a
match is found, a new date field is added with the corresponding format.
For example:
PUT my_index/my_type/1
{
"create_date": "2015/09/02"
}
GET my_index/_mapping 1
1 - The create_date field has been added as a date field with the format:
"yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z".
PUT my_index
{
"mappings": {
"my_type": {
"date_detection": false
}
}
}
PUT my_index/my_type/1 1
{
"create": "2015/09/02"
}
1 - The create field is added as a text field, not a date field, because date_detection has been disabled.
Alternatively, the date formats used for detection can be customised with the dynamic_date_formats setting:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_date_formats": ["MM/dd/yyyy"]
}
}
}
PUT my_index/my_type/1
{
"create_date": "09/25/2015"
}
Numeric detection
While JSON has support for native floating point and integer datatypes, some applications
or languages may sometimes render numbers as strings. Usually the correct solution is to
map these fields explicitly, but numeric detection (which is disabled by default) can be
enabled to do this automatically:
PUT my_index
{
"mappings": {
"my_type": {
"numeric_detection": true
}
}
}
PUT my_index/my_type/1
{
"my_float": "1.0", 1
"my_integer": "1" 2
}
1 - The my_float field is added as a float field.
2 - The my_integer field is added as a long field.
Dynamic templates allow you to define custom mappings that can be applied to dynamically
added fields based on:
• the datatype detected by NG|Storage, with match_mapping_type.
• the name of the field, with match and unmatch or match_pattern.
• the full dotted path to the field, with path_match and path_unmatch.
The original field name {name} and the detected datatype {dynamic_type} template
variables can be used in the mapping specification as placeholders.
Dynamic field mappings are only added when a field contains a concrete value, not null or an empty array. This means that if the null_value option is used in a dynamic_template, it will only be applied after the first document with a concrete value for the field has been indexed.
"dynamic_templates": [
{
"my_template_name": { 1
... match conditions ... 2
"mapping": { ... } 3
}
},
...
]
1 - The template name can be any string value.
2 - The match conditions can include any of: match_mapping_type, match, match_pattern, unmatch, path_match, path_unmatch.
3 - The mapping that the matched field should use.
Templates are processed in order: the first matching template wins. New templates can be appended to the end of the list with the PUT mapping API. If a new template has the same name as an existing template, it will replace the old version.
For example, if we wanted to map all integer fields as integer instead of long, and all
string fields as both text and keyword, we could use the following template:
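Such a template might be defined as follows (the template names integers and strings are illustrative):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "integers": {
            "match_mapping_type": "long",
            "mapping": {
              "type": "integer"
            }
          }
        },
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": {
                "raw": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}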
PUT my_index/my_type/1
{
"my_integer": 5, 1
"my_string": "Some string" 2
}
1 - The my_integer field is mapped as an integer field.
2 - The my_string field is mapped as a text field, with a keyword raw multi-field.
The match parameter uses a pattern to match on the fieldname, while unmatch uses a
pattern to exclude fields matched by match.
The following example matches all string fields whose name starts with long_ (except
for those which end with _text) and maps them as long fields:
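A mapping definition matching this description might look like the following sketch (the template name longs_as_strings is illustrative):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "longs_as_strings": {
            "match_mapping_type": "string",
            "match": "long_*",
            "unmatch": "*_text",
            "mapping": {
              "type": "long"
            }
          }
        }
      ]
    }
  }
}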
PUT my_index/my_type/1
{
"long_num": "5", 1
"long_text": "foo" 2
}
1 - The long_num field is mapped as a long field.
2 - The long_text field uses the default string mapping because its name ends in _text.
match_pattern
The match_pattern parameter adjusts the behavior of the match parameter such that it
supports full Java regular expression matching on the field name instead of simple
wildcards, for instance:
"match_pattern": "regex",
"match": "^profit_\d+$"
The path_match and path_unmatch parameters work in the same way as match and
unmatch, but operate on the full dotted path to the field, not just the final name, e.g.
some_object.*.some_field.
This example copies the values of any fields in the name object to the top-level full_name
field, except for the middle field:
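A template consistent with this description might be (the template name full_name is illustrative):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "full_name": {
            "path_match": "name.*",
            "path_unmatch": "*.middle",
            "mapping": {
              "type": "text",
              "copy_to": "full_name"
            }
          }
        }
      ]
    }
  }
}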
PUT my_index/my_type/1
{
"name": {
"first": "Alice",
"middle": "Mary",
"last": "White"
}
}
The {name} and {dynamic_type} placeholders are replaced in the mapping with the
field name and detected dynamic type. The following example sets all string fields to use
an analyzer with the same name as the field, and disables doc_values for all non-
string fields:
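A sketch of such a template definition (the template names named_analyzers and no_doc_values are illustrative):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "named_analyzers": {
            "match_mapping_type": "string",
            "match": "*",
            "mapping": {
              "type": "text",
              "analyzer": "{name}"
            }
          }
        },
        {
          "no_doc_values": {
            "match_mapping_type": "*",
            "mapping": {
              "type": "{dynamic_type}",
              "doc_values": false
            }
          }
        }
      ]
    }
  }
}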
PUT my_index/my_type/1
{
"english": "Some English text", 1
"count": 5 2
}
1 - The english field is mapped as a text field with the english analyzer.
2 - The count field is mapped as a long field with doc_values disabled.
Template examples
Structured search
By default NG|Storage will map string fields as a text field with a sub keyword field.
However if you are only indexing structured content and not interested in full text search,
you can make NG|Storage map your fields only as `keyword`s. Note that this means that in
order to search those fields, you will have to search on the exact same value that was
indexed.
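A template achieving this might look as follows (the template name strings_as_keywords is illustrative):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ]
    }
  }
}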
In contrast to the previous example, if the only thing that you care about on your string fields is full-text search, and if you don't plan on running aggregations, sorting or exact search on your string fields, you could tell NG|Storage to map them only as text fields (which was the default behaviour before 5.0):
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"strings_as_text": {
"match_mapping_type": "string",
"mapping": {
"type": "text"
}
}
}
]
}
}
}
Disabled norms
Norms are index-time scoring factors. If you do not care about scoring, which would be the
case for instance if you never sort documents by score, you could disable the storage of
these scoring factors in the index and save some space.
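A template matching this description might be (norms disabled on the main text field; the template name is illustrative):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "norms": false,
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}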
The sub keyword field appears in this template to be consistent with the default rules of
dynamic mappings. Of course if you do not need them because you don’t need to perform
exact search or aggregate on this field, you could remove it as described in the previous
section.
Time-series
When doing time series analysis with NG|Storage, it is common to have many numeric
fields that you will often aggregate on but never filter on. In such a case, you could disable
indexing on those fields to save disk space and also maybe gain some indexing speed:
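A sketch of such a template, consistent with the callout below (the template names are illustrative):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "unindexed_longs": {
            "match_mapping_type": "long",
            "mapping": {
              "type": "long",
              "index": false
            }
          }
        },
        {
          "unindexed_doubles": {
            "match_mapping_type": "double",
            "mapping": {
              "type": "float", 1
              "index": false
            }
          }
        }
      ]
    }
  }
}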
1 - Like the default dynamic mapping rules, doubles are mapped as floats, which are
usually accurate enough, yet require half the disk space.
You can override the default mappings for all indices and all types by specifying a default
type mapping in an index template which matches all indices.
For example, to disable the _all field by default for all types in all new indices, you could
create the following index template:
PUT _template/disable_all_field
{
"order": 0,
"template": "*", 1
"mappings": {
"_default_": { 2
"_all": { 3
"enabled": false
}
}
}
}
1 - Applies the mappings to any index whose name matches the pattern *, in other words, all new indices.
2 - Defines the _default_ type mapping within these indices.
3 - Disables the _all field by default.
Each document has metadata associated with it, such as the _index, mapping _type, and
_id meta-fields. The behaviour of some of these meta-fields can be customised when a
mapping type is created.
Identity meta-fields
_index
The index to which the document belongs.
_uid
A composite field consisting of the _type and the _id.
_type
The document’s mapping type.
_id
The document’s ID.
_source
The original JSON representing the body of the document.
Indexing meta-fields
_all
A catch-all field that indexes the values of all other fields.
_field_names
All fields in the document which contain non-null values.
Routing meta-fields
_parent
Used to create a parent-child relationship between two mapping types.
_routing
A custom routing value which routes a document to a particular shard.
Other meta-field
_meta
Application-specific metadata.
The _all field is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter. This string is then analyzed and indexed, but not stored. This means that it can be searched, but not retrieved.
The _all field allows you to search for values in documents without knowing which field
contains the value. This makes it a useful option when getting started with a new dataset.
For instance:
PUT my_index/user/1 1
{
"first_name": "John",
"last_name": "Smith",
"date_of_birth": "1970-10-24"
}
GET my_index/_search
{
"query": {
"match": {
"_all": "john smith 1970"
}
}
}
1 - The _all field will contain the terms: [ "john", "smith", "1970", "10", "24" ]
The _all field is just a text field, and accepts the same parameters that other string
fields accept, including analyzer, term_vectors, index_options, and store.
The _all field is not free: it requires extra CPU cycles and uses more disk space. If not
needed, it can be completely disabled or customised on a per-field basis.
The query_string and simple_query_string queries query the _all field by default,
unless another field is specified:
GET _search
{
"query": {
"query_string": {
"query": "john smith 1970"
}
}
}
The same goes for the ?q= parameter in URI search requests (which is rewritten to a
query_string query internally):
GET _search?q=john+smith+1970
Other queries, such as the match and term queries require you to specify the _all field
explicitly, as per the first example.
The _all field can be completely disabled per-type by setting enabled to false:
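A sketch of such a mapping:
PUT my_index
{
  "mappings": {
    "my_type": {
      "_all": {
        "enabled": false
      }
    }
  }
}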
If the _all field is disabled, then URI search requests and the query_string and
simple_query_string queries will not be able to use it for queries (see [querying-all-
field]). You can configure them to use a different field with the
index.query.default_field setting:
PUT my_index
{
"mappings": {
"my_type": {
"_all": {
"enabled": false ¬
},
"properties": {
"content": {
"type": "text"
}
}
}
},
"settings": {
"index.query.default_field": "content" ¬
}
}
1 - The _all field is disabled for the my_type type.
2 - The query_string query will default to querying the content field in this index.
Individual fields can be included or excluded from the _all field with the include_in_all parameter.
Individual fields can be boosted at index time, with the boost parameter. The _all field
takes these boosts into account:
PUT myindex
{
"mappings": {
"mytype": {
"properties": {
"title": { 1
"type": "text",
"boost": 2
},
"content": { 1
"type": "text"
}
}
}
}
}
1 - When querying the _all field, words that originated in the title field are twice as
relevant as words that originated in the content field.
Using index-time boosting with the _all field has a significant impact on query performance. It is usually better to apply boosts at query time instead.
While there is only a single _all field per index, the copy_to parameter allows the
creation of multiple custom _all fields. For instance, first_name and last_name fields
can be combined together into the full_name field:
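A sketch of such a mapping (assuming plain text fields), with callout 1 marking the copied fields:
PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 1
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 1
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}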
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET myindex/_search
{
"query": {
"match": {
"full_name": "John Smith"
}
}
}
1 - The first_name and last_name values are copied to the full_name field.
A field can only be used for highlighting if the original string value is available, either from
the _source field or as a stored field.
The _all field is not present in the _source field and it is not stored by default, and so
cannot be highlighted. There are two options. Either store the _all field or highlight the
original fields.
If store is set to true, then the original field value is retrievable and can be highlighted:
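A mapping enabling this might look like:
PUT myindex
{
  "mappings": {
    "mytype": {
      "_all": {
        "store": true
      }
    }
  }
}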
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET _search
{
"query": {
"match": {
"_all": "John Smith"
}
},
"highlight": {
"fields": {
"_all": {}
}
}
}
Of course, storing the _all field will use significantly more disk space and, because it is a
combination of other fields, it may result in odd highlighting results.
The _all field also accepts the term_vector and index_options parameters, allowing
the use of the fast vector highlighter and the postings highlighter.
You can query the _all field, but use the original fields for highlighting as follows:
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET _search
{
"query": {
"match": {
"_all": "John Smith" 1
}
},
"highlight": {
"fields": {
"*_name": { 2
"require_field_match": false 3
}
}
}
}
2 - Highlighting is performed on the two name fields, which are available from the
_source.
3 - The query wasn’t run against the name fields, so set require_field_match to
false.
35.2. ID Field
Each document indexed is associated with a _type (see Mapping Types) and an _id. The
_id field is not indexed as its value can be derived automatically from the _uid field.
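The search below assumes two example documents; the first would presumably be indexed like this:
# Example documents
PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}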
PUT my_index/my_type/2
{
"text": "Document with ID 2"
}
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2" ] 1
}
}
}
1 - Querying on the _id field
When performing queries across multiple indexes, it is sometimes desirable to add query
clauses that are associated with documents of only certain indexes. The _index field
allows matching on the index a document was indexed into. Its value is accessible in term,
or terms queries, aggregations, scripts, and when sorting:
The _index field is exposed virtually: it is not added to the Lucene index as a real field. This means that you can use the _index field in a term or terms query (or any query that is rewritten to a term query, such as the match, query_string or simple_query_string query), but it does not support prefix, wildcard, regexp, or fuzzy queries.
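The search below assumes a document in each of two indices; the first would presumably be indexed like this:
# Example documents
PUT index_1/my_type/1
{
  "text": "Document in index 1"
}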
PUT index_2/my_type/2
{
"text": "Document in index 2"
}
GET index_1,index_2/_search
{
"query": {
"terms": {
"_index": ["index_1", "index_2"] 1
}
},
"aggs": {
"indices": {
"terms": {
"field": "_index", 2
"size": 10
}
}
},
"sort": [
{
"_index": { 3
"order": "asc"
}
}
],
"script_fields": {
"index_name": {
"script": {
"lang": "painless",
"inline": "doc['_index']" 4
}
}
}
}
1 - Querying on the _index field 2 - Aggregating on the _index field 3 - Sorting on the _index field 4 - Accessing the _index field in scripts
Each mapping type can have custom meta data associated with it. These are not used at all
by NG|Storage, but can be used to store application-specific metadata, such as the class
that a document belongs to:
PUT my_index
{
"mappings": {
"user": {
"_meta": { 1
"class": "MyApp::User",
"version": {
"min": "1.0",
"max": "1.3"
}
}
}
}
}
1 - This _meta info can be retrieved with the GET mapping API.
The _meta field can be updated on an existing type using the PUT mapping API.
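The parent-child example below relies on a mapping that declares my_parent as the parent type of my_child; a sketch of that mapping, with its callout:
PUT my_index 1
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent"
      }
    }
  }
}
1 - The my_parent type is parent of the my_child type.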
PUT my_index/my_parent/1 2
{
"text": "This is a parent document"
}
PUT my_index/my_child/2?parent=1 3
{
"text": "This is a child document"
}
PUT my_index/my_child/3?parent=1 4
{
"text": "This is another child document"
}
GET my_index/my_parent/_search
{
"query": {
"has_child": { 4
"type": "my_child",
"query": {
"match": {
"text": "child document"
}
}
}
}
}
4 - Find all parent documents that have children which match the query.
See the has_child and has_parent queries, the children aggregation, and inner hits
for more information.
The value of the _parent field is accessible in queries, aggregations, and scripts:
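A sketch of such a request, consistent with the callout below (aggregation and script-field names are illustrative):
GET my_index/_search
{
  "query": {
    "terms": {
      "_parent": [ "1" ] 1
    }
  },
  "aggs": {
    "parents": {
      "terms": {
        "field": "_parent",
        "size": 10
      }
    }
  },
  "script_fields": {
    "parent": {
      "script": {
        "lang": "painless",
        "inline": "doc['_parent']"
      }
    }
  }
}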
1 - Querying on the _parent field (also see the has_parent query and the has_child
query)
Parent-child restrictions
• The _parent.type setting can only point to a type that doesn’t exist yet. This means that a type cannot become a parent type after it has been created.
• Parent and child documents must be indexed on the same shard. The parent ID is
used as the routing value for the child, to ensure that the child is indexed on the same
shard as the parent. This means that the same parent value needs to be provided
when getting, deleting, or updating a child document.
Global ordinals
Parent-child uses global ordinals to speed up joins. Global ordinals need to be rebuilt after any change to a shard. The more parent id values are stored in a shard, the longer it takes to rebuild the global ordinals for the _parent field.
Global ordinals, by default, are built lazily: the first parent-child query or aggregation after
a refresh will trigger building of global ordinals. This can introduce a significant latency
spike for your users. You can use eager_global_ordinals to shift the cost of building global
ordinals from query time to refresh time, by mapping the _parent field as follows:
PUT my_index
{
"mappings": {
"my_parent": {},
"my_child": {
"_parent": {
"type": "my_parent",
"eager_global_ordinals": true
}
}
}
}
The amount of heap memory used by global ordinals can be checked as follows:
# Per-index
GET _stats/fielddata?human&fields=_parent
# Per-node per-index
GET _nodes/stats/indices/fielddata?human&fields=_parent
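A document is routed to a particular shard in an index using a formula of the following form (as in the upstream Elasticsearch reference):

shard_num = hash(_routing) % num_primary_shards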
The default value used for _routing is the document’s _id or the document’s _parent
ID, if present.
Custom routing patterns can be implemented by specifying a custom routing value per
document. For instance:
PUT my_index/my_type/1?routing=user1 1
{
"title": "This is a document"
}
GET my_index/my_type/1?routing=user1 2
2 - The same routing value needs to be provided when getting, deleting, or updating
the document.
GET my_index/_search
{
"query": {
"terms": {
"_routing": [ "user1" ] 1
}
},
"script_fields": {
"Routing value": {
"script": {
"lang": "painless",
"inline": "doc['_routing']" 2
}
}
}
}
1 - Querying on the _routing field
2 - Accessing the _routing field in scripts
Custom routing can reduce the impact of searches. Instead of having to fan out a search
request to all the shards in an index, the request can be sent to just the shard that matches
the specific routing value (or values):
GET my_index/_search?routing=user1,user2 1
{
"query": {
"match": {
"title": "document"
}
}
}
1 - This search request will only be executed on the shards associated with the user1 and
user2 routing values.
When using custom routing, it is important to provide the routing value whenever indexing,
getting, deleting, or updating a document.
Forgetting the routing value can lead to a document being indexed on more than one shard.
As a safeguard, the _routing field can be configured to make a custom routing value
required for all CRUD operations:
PUT my_index2
{
"mappings": {
"my_type": {
"_routing": {
"required": true 1
}
}
}
}
PUT my_index2/my_type/1 2
{
"text": "No routing value provided"
}
1 - Routing is required for my_type documents.
2 - This index request throws a routing_missing_exception.
When indexing documents specifying a custom _routing, the uniqueness of the _id is not
guaranteed across all of the shards in the index. In fact, documents with the same _id
might end up on different shards if indexed with different _routing values.
It is up to the user to ensure that IDs are unique across the index.
The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.
Though very handy to have around, the _source field does incur storage overhead within the index. For this reason, it can be disabled as follows:
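A sketch, using an illustrative tweets index:
PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}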
Users often disable the _source field without thinking about the
consequences, and then live to regret it. If the _source field isn’t
available then a number of features are not supported:
• The update, update_by_query, and reindex APIs.
• On-the-fly highlighting.
• The ability to reindex from one NG|Storage index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
• The ability to debug queries or aggregations by viewing the original document used at index time.
• Potentially in the future, the ability to repair index corruption automatically.
The metrics use case is distinct from other time-based or logging use cases in that
there are many small documents which consist only of numbers, dates, or keywords.
There are no updates, no highlighting requests, and the data ages quickly so there is
no need to reindex. Search requests typically use simple queries to filter the dataset
by date or tags, and the results are returned as aggregations.
In this case, disabling the _source field will save space and reduce I/O. It is also
advisable to disable the _all field in the metrics case.
An expert-only feature is the ability to prune the contents of the _source field after the document has been indexed, but before the _source field is stored. Removing fields from the _source has similar downsides to disabling _source, especially the fact that you cannot reindex documents from one NG|Storage index to another. Consider using source filtering instead.
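The pruning is configured with the includes and excludes parameters on the _source field; a mapping consistent with the example below might be:
PUT logs
{
  "mappings": {
    "event": {
      "_source": {
        "includes": [
          "*.count",
          "meta.*"
        ],
        "excludes": [
          "meta.description",
          "meta.other.*"
        ]
      }
    }
  }
}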
PUT logs/event/1
{
"requests": {
"count": 10,
"foo": "bar" 1
},
"meta": {
"name": "Some metric",
"description": "Some metric description", 1
"other": {
"foo": "one", 1
"baz": "two" 1
}
}
}
GET logs/event/_search
{
"query": {
"match": {
"meta.other.foo": "one" 2
}
}
}
1 - These fields will be pruned from the stored _source field.
2 - We can still search on this field, even though it is not in the stored _source.
Each document indexed is associated with a _type (see Mapping Types) and an _id. The
_type field is indexed in order to make searching by type name fast.
The value of the _type field is accessible in queries, aggregations, scripts, and when
sorting:
# Example documents
PUT my_index/type_1/1
{
"text": "Document with type 1"
}
PUT my_index/type_2/2
{
"text": "Document with type 2"
}
GET my_index/type_*/_search
{
"query": {
"terms": {
"_type": [ "type_1", "type_2" ] 1
}
},
"aggs": {
"types": {
"terms": {
"field": "_type", 2
"size": 10
}
}
},
"sort": [
{
"_type": { 3
"order": "desc"
}
}
],
"script_fields": {
"type": {
"script": {
"lang": "painless",
"inline": "doc['_type']" 4
}
}
}
}
1 - Querying on the _type field 2 - Aggregating on the _type field 3 - Sorting on the _type
field 4 - Accessing the _type field in scripts
Each document indexed is associated with a _type (see Mapping Types) and an _id.
These values are combined as {type}#{id} and indexed as the _uid field.
# Example documents
PUT my_index/my_type/1
{
"text": "Document with ID 1"
}
PUT my_index/my_type/2
{
"text": "Document with ID 2"
}
GET my_index/_search
{
"query": {
"terms": {
"_uid": [ "my_type#1", "my_type#2" ] 1
}
},
"aggs": {
"UIDs": {
"terms": {
"field": "_uid", 2
"size": 10
}
}
},
"sort": [
{
"_uid": { 3
"order": "desc"
}
}
],
"script_fields": {
"UID": {
"script": {
"lang": "painless",
"inline": "doc['_uid']" 4
}
}
}
}
The following pages provide detailed explanations of the various mapping parameters that
are used by field mappings:
The following mapping parameters are common to some or all field datatypes:
• analyzer
• boost
• coerce
• copy_to
• doc_values
• dynamic
• enabled
• fielddata
• geohash
• geohash_precision
• geohash_prefix
• format
• ignore_above
• ignore_malformed
• include_in_all
• index_options
• lat_lon
• index
• fields
• norms
• null_value
• position_increment_gap
• properties
• search_analyzer
• similarity
• store
36.1. Analyzer
The values of analyzed string fields are passed through an analyzer to convert the string
into a stream of tokens or terms. For instance, the string "The quick Brown Foxes."
may, depending on which analyzer is used, be analyzed to the tokens: quick, brown, fox.
These are the actual terms that are indexed for the field, which makes it possible to search
efficiently for individual words within big blobs of text.
This analysis process needs to happen not just at index time, but also at query time: the
query string needs to be passed through the same (or a similar) analyzer so that the terms
that it tries to find are in the same format as those that exist in the index.
NG|Storage ships with a number of pre-defined analyzers, which can be used without
further configuration. It also ships with many character filters, tokenizers, and Token
Filters which can be combined to configure custom analyzers per index.
Analyzers can be specified per-query, per-field or per-index. At index time, NG|Storage will
look for an analyzer in this order:
The easiest way to specify an analyzer for a particular field is to define it in the field
mapping, as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": { 1
"type": "text",
"fields": {
"english": { 2
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
}
GET my_index/_analyze?field=text 3
{
"text": "The quick Brown Foxes."
}
GET my_index/_analyze?field=text.english 4
{
"text": "The quick Brown Foxes."
}
1 - The text field uses the default standard analyzer. 2 - The text.english multi-
field uses the english analyzer, which removes stop words and applies stemming. 3 - This
returns the tokens: [ the, quick, brown, foxes ]. 4 - This returns the tokens: [ quick,
brown, fox ].
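The difference between the two analyzers can be illustrated with a toy approximation in Python. This is a sketch of the behaviour, not the real Lucene analysis chain; the stop-word list and the plural-stripping "stemmer" are deliberately crude:

```python
import re

STOPWORDS = {"the", "a", "an"}

def standard_like(text):
    # Lowercase and split on non-alphanumerics, roughly what the standard analyzer does.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def english_like(text):
    # Drop stop words, then apply a crude plural-stripping "stemmer".
    tokens = [t for t in standard_like(text) if t not in STOPWORDS]
    return [t[:-2] if t.endswith("es") else t for t in tokens]

print(standard_like("The quick Brown Foxes."))  # ['the', 'quick', 'brown', 'foxes']
print(english_like("The quick Brown Foxes."))   # ['quick', 'brown', 'fox']
```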
search_quote_analyzer
The search_quote_analyzer setting allows you to specify an analyzer for phrases. This
is particularly useful when you want to disable stop words for phrase queries.
To disable stop words for phrases, a field will require three analyzer settings:
1. An analyzer setting for indexing all terms, including stop words
2. A search_analyzer setting for non-phrase queries that will remove stop words
3. A search_quote_analyzer setting for phrase queries that will not remove stop
words
PUT my_index/my_type/2
{
"title":"A Quick Brown Fox"
}
GET my_index/my_type/_search
{
"query":{
"query_string":{
"query":"\"the quick brown fox\"" 6
}
}
}
36.2. Boost
Individual fields can be boosted automatically (count more towards the relevance score) at
query time, with the boost parameter as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "text",
"boost": 2 1
},
"content": {
"type": "text"
}
}
}
}
}
1 - Matches on the title field will have twice the weight as those on the content field,
which has the default boost of 1.0.
The boost is applied only for term queries (prefix, range and fuzzy queries
are not boosted).
You can achieve the same effect by using the boost parameter directly in the query. For
instance, the following query (relying on the field-level boost from the mapping):
POST _search
{
"query": {
"match" : {
"title": {
"query": "quick brown fox"
}
}
}
}
is equivalent to:
POST _search
{
"query": {
"match" : {
"title": {
"query": "quick brown fox",
"boost": 2
}
}
}
}
Deprecated in 5.0.0: index time boost is deprecated. Instead, the field mapping boost is
applied at query time. For indices created before 5.0.0 the boost will still be applied at index
time.
We advise against using index time boosting for the following reasons:
• Index-time boosts are stored as part of the norm, which is only one
byte. This reduces the resolution of the field length normalization
factor which can lead to lower quality relevance calculations.
36.3. Coerce
Data is not always clean. Depending on how it is produced, a number might be rendered in
the JSON body as a true JSON number, e.g. 5, but it might also be rendered as a string, e.g.
"5". Alternatively, a number that should be an integer might instead be rendered as a
floating point, e.g. 5.0, or even "5.0". The coerce setting controls whether NG|Storage
tries to clean up such dirty values to fit the datatype of the field.
For more information please refer to the source ElasticSearch reference documentation
chapter.
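Coercion into an integer field behaves roughly like the following sketch (our own approximation for illustration, not NG|Storage source code): strings are parsed and floating points are truncated.

```python
def coerce_int(value, coerce=True):
    """Approximate coercion of a JSON value into an integer field:
    strings are parsed, floating points are truncated."""
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    if not coerce:
        raise ValueError("malformed value for integer field")
    if isinstance(value, str):
        value = float(value)   # "5" or "5.0" -> 5.0
    return int(value)          # 5.0 -> 5 (truncated)

print(coerce_int("5"), coerce_int(5.0), coerce_int("5.0"))  # 5 5 5
```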
36.4. Copy-To
The copy_to parameter allows you to create custom _all fields. In other words, the
values of multiple fields can be copied into a group field, which can then be queried as a
single field. For instance, the first_name and last_name fields can be copied to the
full_name field as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name" 1
},
"last_name": {
"type": "text",
"copy_to": "full_name" 1
},
"full_name": {
"type": "text"
}
}
}
}
}
PUT my_index/my_type/1
{
"first_name": "John",
"last_name": "Smith"
}
GET my_index/_search
{
"query": {
"match": {
"full_name": { 2
"query": "John Smith",
"operator": "and"
}
}
}
}
1 - The values of the first_name and last_name fields are copied to the full_name
field.
2 - The first_name and last_name fields can still be queried for the first name and
last name respectively, but the full_name field can be queried for both first and last
names.
• It is the field value which is copied, not the terms (which result from the analysis
process).
• The original _source field will not be modified to show the copied values.
36.5. Doc-Values
Most fields are indexed by default, which makes them searchable. The inverted index
allows queries to look up the search term in unique sorted list of terms, and from that
immediately have access to the list of documents that contain the term.
Sorting, aggregations, and access to field values in scripts requires a different data access
pattern. Instead of looking up the term and finding documents, we need to be able to look
up the document and find the terms that it has in a field.
Doc values are the on-disk data structure, built at document index time, which makes this
data access pattern possible. They store the same values as the _source but in a column-
oriented fashion that is way more efficient for sorting and aggregations. Doc values are
supported on almost all field types, with the notable exception of analyzed string fields.
All fields which support doc values have them enabled by default. If you are sure that you
don’t need to sort or aggregate on a field, or access the field value from a script, you can
disable doc values in order to save disk space:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"status_code": { 1
"type": "keyword"
},
"session_id": { 2
"type": "keyword",
"doc_values": false
}
}
}
}
}
1 - The status_code field has doc_values enabled by default. 2 - The session_id has
doc_values disabled, but can still be queried.
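The two access patterns can be contrasted with a small sketch: the inverted index maps each term to the documents containing it, while doc values store the per-document column of values (illustrative data structures only):

```python
docs = {1: "green", 2: "blue", 3: "green"}  # doc id -> status_code value

# Inverted index: term -> sorted doc ids. Answers "which docs contain this term?"
inverted = {}
for doc_id in sorted(docs):
    inverted.setdefault(docs[doc_id], []).append(doc_id)

# Doc values: a column of values in doc-id order. Answers "what is the value of
# this field for this doc?" -- the access pattern for sorting and aggregations.
doc_values = [docs[d] for d in sorted(docs)]

print(inverted)    # {'green': [1, 3], 'blue': [2]}
print(doc_values)  # ['green', 'blue', 'green']
```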
The doc_values setting must have the same value for fields of
the same name in the same index. It can be disabled (set to false) on
existing fields using the PUT mapping API.
36.6. Dynamic
By default, fields can be added dynamically to a document, or to inner objects within a
document, just by indexing a document containing the new field. For instance:
PUT my_index/my_type/1 1
{
"username": "johnsmith",
"name": {
"first": "John",
"last": "Smith"
}
}
GET my_index/_mapping 2
PUT my_index/my_type/2 3
{
"username": "marywhite",
"email": "mary@white.com",
"name": {
"first": "Mary",
"middle": "Alice",
"last": "White"
}
}
GET my_index/_mapping 4
1 - This document introduces the string field username, the object field name, and two
string fields under the name object which can be referred to as name.first and
name.last. 2 - Check the mapping to verify the above. 3 - This document adds two string
fields: email and name.middle. 4 - Check the mapping to verify the changes.
The details of how new fields are detected and added to the mapping is explained in
Dynamic Mapping.
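Dynamic field detection walks the document and derives a field path and a datatype for each new leaf. Roughly like this sketch (a simplification of the real type-detection rules):

```python
def detect_fields(doc, prefix=""):
    """Map each field path in a JSON document to a crude datatype name."""
    fields = {}
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            fields[path] = "object"
            fields.update(detect_fields(value, path + "."))  # recurse with dot notation
        elif isinstance(value, str):
            fields[path] = "string"
        else:
            fields[path] = type(value).__name__
    return fields

doc = {"username": "johnsmith", "name": {"first": "John", "last": "Smith"}}
print(detect_fields(doc))
# {'username': 'string', 'name': 'object', 'name.first': 'string', 'name.last': 'string'}
```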
The dynamic setting controls whether new fields can be added dynamically or not. It
accepts three settings:
true
Newly detected fields are added to the mapping. (default)
false
Newly detected fields are ignored. They will not be indexed or searchable, but will still
appear in the _source field of returned hits. New fields must be added explicitly.
strict
If new fields are detected, an exception is thrown and the document is rejected.
The dynamic setting may be set at the mapping type level, and on each inner object. Inner
objects inherit the setting from their parent object or from the mapping type. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic": false, 1
"properties": {
"user": { 2
"properties": {
"name": {
"type": "text"
},
"social_networks": { 3
"dynamic": true,
"properties": {}
}
}
}
}
}
}
}
1 - Dynamic mapping is disabled at the type level, so no new top-level fields will be added
dynamically. 2 - The user object inherits the type-level setting. 3 - The social_networks
inner object enables dynamic mapping, so new fields may be added to this inner object.
The dynamic setting is allowed to have different settings for fields of the
same name in the same index. Its value can be updated on existing fields
using the PUT mapping API.
36.7. Enabled
NG|Storage tries to index all of the fields you give it, but sometimes you want to just store
a field without indexing it.
The enabled setting, which can be applied only to the mapping type and to object fields,
causes NG|Storage to skip parsing of the contents of the field entirely. The JSON can still
be retrieved from the _source field, but it is not searchable or stored in any other way:
PUT my_index
{
"mappings": {
"session": {
"properties": {
"user_id": {
"type": "keyword"
},
"last_updated": {
"type": "date"
},
"session_data": { 1
"enabled": false
}
}
}
}
}
PUT my_index/session/session_1
{
"user_id": "kimchy",
"session_data": { 2
"arbitrary_object": {
"some_array": [ "foo", "bar", { "baz": 2 } ]
}
},
"last_updated": "2015-12-06T18:20:22"
}
PUT my_index/session/session_2
{
"user_id": "jpountz",
"session_data": "none", 3
"last_updated": "2015-12-06T18:22:13"
}
1 - The session_data field is disabled. 2 - Any arbitrary data can be passed to the
session_data field as it will be entirely ignored. 3 - The session_data will also ignore
values that are not JSON objects.
The entire mapping type may be disabled as well, in which case the document is stored in
the _source field, which means it can be retrieved, but none of its contents are indexed in
any way:
PUT my_index
{
"mappings": {
"session": { 1
"enabled": false
}
}
}
PUT my_index/session/session_1
{
"user_id": "kimchy",
"session_data": {
"arbitrary_object": {
"some_array": [ "foo", "bar", { "baz": 2 } ]
}
},
"last_updated": "2015-12-06T18:20:22"
}
GET my_index/session/session_1 2
GET my_index/_mapping 3
1 - The entire session mapping type is disabled. 2 - The document can be retrieved. 3 -
Checking the mapping reveals that no fields have been added.
The enabled setting is allowed to have different settings for fields of the
same name in the same index. Its value can be updated on existing fields
using the PUT mapping API.
36.8. Field-Data
Most fields are indexed by default, which makes them searchable. Sorting, aggregations,
and accessing field values in scripts, however, requires a different access pattern from
search.
Search needs to answer the question "Which documents contain this term?", while sorting
and aggregations need to answer a different question: "What is the value of this field for
this document?".
Most fields can use index-time, on-disk doc_values for this data access pattern, but
text fields do not support doc_values.
Instead, text fields use a query-time in-memory data structure called fielddata. This
data structure is built on demand the first time that a field is used for aggregations,
sorting, or in a script. It is built by reading the entire inverted index for each segment
from disk, inverting the term-document relationship, and storing the result in memory,
in the JVM heap.
Fielddata can consume a lot of heap space, especially when loading high cardinality text
fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the
segment. Also, loading fielddata is an expensive process which can cause users to
experience latency hits. This is why fielddata is disabled by default.
If you try to sort, aggregate, or access values from a script on a text field, you will see an
exception indicating that fielddata is disabled on text fields by default.
Before you enable fielddata, consider why you are using a text field for aggregations,
sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by
searching for new or for york. A terms aggregation on this field will return a new bucket
and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword
field with doc_values enabled for aggregations, as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"my_field": { 1
"type": "text",
"fields": {
"keyword": { 2
"type": "keyword"
}
}
}
}
}
}
}
1 - Use the my_field field for searches. 2 - Use the my_field.keyword field for
aggregations, sorting, or in scripts.
You can enable fielddata on an existing text field using the PUT mapping API as follows:
PUT my_index/_mapping/my_type
{
"properties": {
"my_field": { 1
"type": "text",
"fielddata": true
}
}
}
1 - The mapping that you specify for my_field should consist of the existing mapping for
that field, plus the fielddata parameter.
The fielddata.* parameter must have the same settings for fields of
the same name in the same index. Its value can be updated on existing
fields using the PUT mapping API.
Global ordinals
Global ordinals is a data-structure on top of fielddata and doc values, that maintains an
incremental numbering for each unique term in a lexicographic order. Each term has a
unique number and the number of term 'A' is lower than the number of term 'B'.
Global ordinals are only supported on text and keyword fields.
Fielddata and doc values also have ordinals, which is a unique numbering for all terms
in a particular segment and field. Global ordinals just build on top of this, by providing a
mapping between the segment ordinals and the global ordinals, the latter being unique
across the entire shard.
Global ordinals are used for features that use segment ordinals, such as sorting and
the terms aggregation, to improve the execution time. A terms aggregation relies
purely on global ordinals to perform the aggregation at the shard level, then converts
global ordinals to the real term only for the final reduce phase, which combines results
from different shards.
Global ordinals for a specified field are tied to all the segments of a shard, while
fielddata and doc values ordinals are tied to a single segment. For this reason, global
ordinals need to be entirely rebuilt whenever a new segment becomes visible.
The loading time of global ordinals depends on the number of terms in a field, but in
general it is low, since the source field data has already been loaded. The memory
overhead of global ordinals is small because they are very efficiently compressed. Eager
loading of global ordinals can move the loading time from the first search request to
the refresh itself.
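The relationship between segment ordinals and global ordinals can be sketched as follows (illustrative data structures, not the actual implementation):

```python
# Each segment keeps its own sorted, deduplicated term dictionary.
segment_terms = [
    ["apple", "pear"],    # segment 0: segment ordinals 0, 1
    ["banana", "pear"],   # segment 1: segment ordinals 0, 1
]

# Global ordinals number every unique term across the shard, in lexicographic order.
all_terms = sorted({t for seg in segment_terms for t in seg})
global_ord = {term: i for i, term in enumerate(all_terms)}

# Per-segment mapping: segment ordinal -> global ordinal. This whole table must be
# rebuilt whenever a new segment becomes visible.
seg_to_global = [[global_ord[t] for t in seg] for seg in segment_terms]

print(global_ord)     # {'apple': 0, 'banana': 1, 'pear': 2}
print(seg_to_global)  # [[0, 2], [1, 2]]
```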
fielddata_frequency_filter
Fielddata filtering can be used to reduce the number of terms loaded into memory, and
thus reduce memory usage. Terms can be filtered by frequency:
The frequency filter allows you to only load terms whose document frequency falls between
a min and max value, which can be expressed as an absolute number (when the number is
bigger than 1.0) or as a percentage (eg 0.01 is 1% and 1.0 is 100%). Frequency is
calculated per segment. Percentages are based on the number of docs which have a value
for the field.
Small segments can be excluded completely by specifying the minimum number of docs
that the segment should contain with min_segment_size:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"tag": {
"type": "text",
"fielddata": true,
"fielddata_frequency_filter": {
"min": 0.001,
"max": 0.1,
"min_segment_size": 500
}
}
}
}
}
}
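The filter above can be paraphrased in code. This sketch applies the min/max document-frequency bounds per segment and skips small segments entirely (our own simplification of the behaviour):

```python
from collections import Counter

def terms_to_load(segment_docs, min_f=0.001, max_f=0.1, min_segment_size=500):
    """segment_docs: one tag value per doc that has a value in this segment."""
    if len(segment_docs) < min_segment_size:
        return set()  # small segments are excluded entirely
    counts = Counter(segment_docs)
    n = len(segment_docs)  # percentages are based on docs that have a value
    return {term for term, c in counts.items() if min_f <= c / n <= max_f}

# A 1000-doc segment: "rare" appears once (0.1%), "common" 900 times (90%).
segment = ["rare"] + ["common"] * 900 + ["mid"] * 99
print(sorted(terms_to_load(segment)))  # ['mid', 'rare']: "common" is filtered out
```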
36.9. Format
Besides the built-in formats, your own custom formats can be specified using the familiar
yyyy/MM/dd syntax:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"date": {
"type": "date",
"format": "yyyy-MM-dd"
}
}
}
}
}
Many APIs which support date values also support date math expressions, such as
now-1m/d (the current time, minus one month, rounded down to the nearest day).
The format setting must have the same value for fields of the same
name in the same index. Its value can be updated on existing fields using
the PUT mapping API.
Completely customizable date formats are supported. The syntax for these is explained in
the Joda docs.
For more information on built-in formats please refer to the source ElasticSearch
reference documentation chapter.
36.10. Geohash
Geohashes are a form of lat/lon encoding which divides the earth up into a grid.
For more information please refer to the source ElasticSearch reference documentation
chapter: https://www.elastic.co/guide/en/elasticsearch/reference/current/geohash.html
36.11. Ignore-Above
Strings longer than the ignore_above setting will not be indexed or stored.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"message": {
"type": "keyword",
"ignore_above": 20 1
}
}
}
}
}
PUT my_index/my_type/1 2
{
"message": "Syntax error"
}
PUT my_index/my_type/2 3
{
"message": "Syntax error with some long stacktrace"
}
GET _search 4
{
"aggs": {
"messages": {
"terms": {
"field": "message"
}
}
}
}
1 - This field will ignore any string longer than 20 characters. 2 - This document is indexed
successfully. 3 - This document will be indexed, but without indexing the message field. 4 -
Search returns both documents, but only the first is present in the terms aggregation.
The ignore_above setting is allowed to have different settings for fields
of the same name in the same index. Its value can be updated on existing
fields using the PUT mapping API.
This option is also useful for protecting against Lucene’s term byte-length limit of 32766.
The value for ignore_above is the character count, but Lucene counts
bytes. If you use UTF-8 text with many non-ASCII characters, you may want
to set the limit to 32766 / 3 = 10922 since UTF-8 characters may
occupy at most 3 bytes.
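The character-vs-byte distinction can be checked programmatically. A small sketch (the 32766 limit is Lucene's; the division by 3 matches the guidance above):

```python
LUCENE_TERM_BYTE_LIMIT = 32766

def fits_lucene_limit(term):
    # Lucene counts UTF-8 bytes, while ignore_above counts characters.
    return len(term.encode("utf-8")) <= LUCENE_TERM_BYTE_LIMIT

ascii_term = "a" * 32766   # 32766 chars == 32766 bytes: fits exactly
cjk_term = "漢" * 10923    # only 10923 chars, but 3 bytes each == 32769 bytes
print(fits_lucene_limit(ascii_term))    # True
print(fits_lucene_limit(cjk_term))      # False
print(fits_lucene_limit("漢" * 10922))  # True: exactly 32766 bytes
```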
36.12. Ignore-Malformed
Sometimes you don’t have much control over the data that you receive. One user may send
a login field that is a date, and another sends a login field that is an email address.
Trying to index the wrong datatype into a field throws an exception by default, and rejects
the whole document. The ignore_malformed parameter, if set to true, allows the
exception to be ignored. The malformed field is not indexed, but other fields in the
document are processed normally.
For example:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"number_one": {
"type": "integer",
"ignore_malformed": true
},
"number_two": {
"type": "integer"
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Some text value",
"number_one": "foo" 1
}
PUT my_index/my_type/2
{
"text": "Some text value",
"number_two": "foo" 2
}
1 - This document will have the text field indexed, but not the number_one field. 2 - This
document will be rejected because number_two does not allow malformed values.
The ignore_malformed setting is allowed to have different settings for
fields of the same name in the same index. Its value can be updated on
existing fields using the PUT mapping API.
The index.mapping.ignore_malformed setting can be set on the index level to ignore
malformed content globally across all mapping types:
PUT my_index
{
"settings": {
"index.mapping.ignore_malformed": true 1
},
"mappings": {
"my_type": {
"properties": {
"number_one": { 1
"type": "byte"
},
"number_two": {
"type": "integer",
"ignore_malformed": false 2
}
}
}
}
}
1 - The number_one field inherits the index-level setting. 2 - The number_two field
overrides the index-level setting to turn off ignore_malformed.
36.13. Include-In-All
The include_in_all parameter provides per-field control over which fields are included
in the _all field. It defaults to true, unless index is set to no.
This example demonstrates how to exclude the date field from the _all field:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": { 1
"type": "text"
},
"content": { 1
"type": "text"
},
"date": { 2
"type": "date",
"include_in_all": false
}
}
}
}
}
1 - The title and content fields will be included in the _all field. 2 - The date field will
not be included in the _all field.
The include_in_all setting is allowed to have different settings for
fields of the same name in the same index. Its value can be updated on
existing fields using the PUT mapping API.
The include_in_all parameter can also be set at the type level and on object or
nested fields, in which case all sub-fields inherit that setting. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"include_in_all": false, 1
"properties": {
"title": { "type": "text" },
"author": {
"include_in_all": true, 2
"properties": {
"first_name": { "type": "text" },
"last_name": { "type": "text" }
}
},
"editor": {
"properties": {
"first_name": { "type": "text" }, 3
"last_name": { "type": "text", "include_in_all": true }
}
}
}
}
}
}
1 - All fields in my_type are excluded from _all. 2 - The author.first_name and
author.last_name fields are included in _all. 3 - Only the editor.last_name field is
included in _all. The editor.first_name inherits the type-level setting and is
excluded.
The original field value is added to the _all field, not the terms produced
by a field’s analyzer. For this reason, it makes no sense to set
include_in_all to true on multi-fields, as each multi-field has exactly
the same value as its parent.
36.14. Index
The index option controls whether field values are indexed. It accepts true or false.
Fields that are not indexed are not queryable.
36.15. Index-Options
The index_options parameter controls what information is added to the inverted index,
for search and highlighting purposes. It accepts the following settings:
docs
Only the doc number is indexed. Can answer the question Does this term exist in this
field?
freqs
Doc number and term frequencies are indexed. Term frequencies are used to score
repeated terms higher than single terms.
positions
Doc number, term frequencies, and term positions (or order) are indexed. Positions
can be used for proximity or phrase queries.
offsets
Doc number, term frequencies, positions, and start and end character offsets (which
map the term back to the original string) are indexed. Offsets are used by the postings
highlighter.
Analyzed string fields use positions as the default, and all other fields use docs as the
default.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"index_options": "offsets"
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Quick brown fox"
}
GET my_index/_search
{
"query": {
"match": {
"text": "brown fox"
}
},
"highlight": {
"fields": {
"text": {} 1
}
}
}
1 - The text field will use the postings highlighter by default because offsets are
indexed.
36.16. Multi-Fields
It is often useful to index the same field in different ways for different purposes. This is the
purpose of multi-fields. For instance, a string field could be mapped as a text field for
full-text search, and as a keyword field for sorting or aggregations:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"city": {
"type": "text",
"fields": {
"raw": { 1
"type": "keyword"
}
}
}
}
}
}
}
PUT my_index/my_type/1
{
"city": "New York"
}
PUT my_index/my_type/2
{
"city": "York"
}
GET my_index/_search
{
"query": {
"match": {
"city": "york" 2
}
},
"sort": {
"city.raw": "asc" 3
},
"aggs": {
"Cities": {
"terms": {
"field": "city.raw" 4
}
}
}
}
1 - The city.raw field is a keyword version of the city field. 2 - The city field can be
used for full text search. 3 - The city.raw field can be used for sorting. 4 - The
city.raw field can be used for aggregations.
The fields setting is allowed to have different settings for fields of the
same name in the same index. New multi-fields can be added to existing
fields using the PUT mapping API.
Another use case of multi-fields is to analyze the same field in different ways for better
relevance. For instance we could index a field with the standard analyzer which breaks
text up into words, and again with the english analyzer which stems words into their root
form:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": { 1
"type": "text",
"fields": {
"english": { 2
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
}
PUT my_index/my_type/1
{ "text": "quick brown fox" } 3
PUT my_index/my_type/2
{ "text": "quick brown foxes" } 3
GET my_index/_search
{
"query": {
"multi_match": {
"query": "quick brown foxes",
"fields": [ 4
"text",
"text.english"
],
"type": "most_fields" 4
}
}
}
1 - The text field uses the standard analyzer. 2 - The text.english field uses the
english analyzer. 3 - Index two documents, one with fox and the other with foxes. 4 -
Query both the text and text.english fields and combine the scores.
The text field contains the term fox in the first document and foxes in the second
document. The text.english field contains fox for both documents, because foxes is
stemmed to fox.
The query string is also analyzed by the standard analyzer for the text field, and by the
english analyzer for the text.english field. The stemmed field allows a query for
foxes to also match the document containing just fox. This allows us to match as many
documents as possible. By also querying the unstemmed text field, we improve the
relevance score of the document which matches foxes exactly.
36.17. Norms
Norms store various normalization factors that are later used at query time in order to
compute the score of a document relatively to a query.
Although useful for scoring, norms also require quite a lot of disk (typically in the order of
one byte per document per field in your index, even for documents that don’t have this
specific field). As a consequence, if you don’t need scoring on a specific field, you should
disable norms on that field. In particular, this is the case for fields that are used solely for
filtering or aggregations.
The norms setting must have the same setting for fields of the same name
in the same index. Norms can be disabled on existing fields using the PUT
mapping API.
Norms can be disabled (but not reenabled) after the fact, using the PUT mapping API like
so:
PUT my_index/_mapping/my_type
{
"properties": {
"title": {
"type": "text",
"norms": false
}
}
}
Any score computation on a field that has had norms removed might
return inconsistent results since some documents won’t have norms
anymore while other documents might still have norms.
36.18. Null-Value
A null value cannot be indexed or searched. When a field is set to null (or an empty
array or an array of null values), it is treated as though that field has no values.
The null_value parameter allows you to replace explicit null values with the specified
value so that it can be indexed and searched. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"status_code": {
"type": "keyword",
"null_value": "NULL" 1
}
}
}
}
}
PUT my_index/my_type/1
{
"status_code": null
}
PUT my_index/my_type/2
{
"status_code": [] 2
}
GET my_index/_search
{
"query": {
"term": {
"status_code": "NULL" 3
}
}
}
1 - Replace explicit null values with the term NULL. 2 - An empty array does not contain
an explicit null, and so won’t be replaced with the null_value. 3 - A query for NULL
returns document 1, but not document 2.
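The replacement behaviour can be sketched as follows (a client-side paraphrase of the rule, not the server logic):

```python
def values_to_index(value, null_value="NULL"):
    """Return the values that would be indexed for a field with a null_value."""
    if value is None:
        return [null_value]  # explicit null is replaced
    if isinstance(value, list):
        # nulls inside an array are replaced; an empty array stays empty
        return [null_value if v is None else v for v in value]
    return [value]

print(values_to_index(None))  # ['NULL'] -> document 1 matches a query for NULL
print(values_to_index([]))    # []       -> document 2 has nothing to match
```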
36.19. Position-Increment-Gap
Analyzed text fields take term positions into account, in order to be able to support
proximity or phrase queries. When indexing text fields with multiple values a "fake" gap is
added between the values to prevent most phrase queries from matching across the
values. The size of this gap is configured using position_increment_gap and defaults
to 100.
For example:
PUT my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith"]
}
GET my_index/groups/_search
{
"query": {
"match_phrase": {
"names": {
"query": "Abraham Lincoln" 1
}
}
}
}
GET my_index/groups/_search
{
"query": {
"match_phrase": {
"names": {
"query": "Abraham Lincoln",
"slop": 101 2
}
}
}
}
1 - This phrase query doesn’t match our document which is totally expected. 2 - This phrase
query matches our document, even though Abraham and Lincoln are in separate
strings, because slop > position_increment_gap.
The position_increment_gap can be changed in the mapping. For instance, the following
example sets it to 0:
PUT my_index
{
"mappings": {
"groups": {
"properties": {
"names": {
"type": "text",
"position_increment_gap": 0 1
}
}
}
}
}
PUT my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith"]
}
GET my_index/groups/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln" 2
}
}
}
1 - The first term in the next array element will be 0 terms apart from the last term in the
previous array element. 2 - The phrase query matches our document, which is weird, but
it’s what we asked for in the mapping.
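The position bookkeeping behind both examples can be sketched like this (an approximation of how positions are assigned across array values, with our own helper name):

```python
def term_positions(values, gap=100):
    """Assign token positions across a multi-valued text field,
    inserting `gap` between consecutive array values."""
    pos, out = 0, []
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # the "fake" gap configured by position_increment_gap
        for term in value.lower().split():
            out.append((term, pos))
            pos += 1
    return out

names = ["John Abraham", "Lincoln Smith"]
print(term_positions(names))          # gap=100: 'lincoln' lands at position 102
print(term_positions(names, gap=0))   # gap=0: 'lincoln' is adjacent to 'abraham'
```

With the default gap, the phrase "abraham lincoln" needs a large slop to bridge positions 1 and 102; with a gap of 0 the terms are adjacent and the plain phrase query matches.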
The position_increment_gap setting is allowed to have different
settings for fields of the same name in the same index. Its value can be
updated on existing fields using the PUT mapping API.
36.20. Properties
Type mappings, object fields and nested fields contain sub-fields, called properties.
These properties may be of any datatype, including object and nested. Properties can
be added:
• explicitly by defining them when creating an index.
• explicitly by defining them when adding or updating a mapping type with the PUT
mapping API.
• dynamically, just by indexing documents containing new fields.
PUT my_index
{
"mappings": {
"my_type": { 1
"properties": {
"manager": { 2
"properties": {
"age": { "type": "integer" },
"name": { "type": "text" }
}
},
"employees": { 3
"type": "nested",
"properties": {
"age": { "type": "integer" },
"name": { "type": "text" }
}
}
}
}
}
}
PUT my_index/my_type/1 4
{
"region": "US",
"manager": {
"name": "Alice White",
"age": 30
},
"employees": [
{
"name": "John Smith",
"age": 34
},
{
"name": "Peter Brown",
"age": 26
}
]
}
1 - Properties under the my_type mapping type. 2 - Properties under the manager object
field. 3 - Properties under the employees nested field. 4 - An example document which
corresponds to the above mapping.
The properties setting is allowed to have different settings for fields of
the same name in the same index. New properties can be added to
existing fields using the PUT mapping API.
Inner fields can be referred to in queries, aggregations, etc., using dot notation:
GET my_index/_search
{
"query": {
"match": {
"manager.name": "Alice White" 1
}
},
"aggs": {
"Employees": {
"nested": {
"path": "employees"
},
"aggs": {
"Employee Ages": {
"histogram": {
"field": "employees.age", 2
"interval": 5
}
}
}
}
}
}
36.21. Search-Analyzer
Usually, the same analyzer should be applied at index time and at search time, to ensure
that the terms in the query are in the same format as the terms in the inverted index.
Sometimes, though, it can make sense to use a different analyzer at search time, such as
when using the edge_ngram tokenizer for autocomplete.
By default, queries will use the analyzer defined in the field mapping, but this can be
overridden with the search_analyzer setting:
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": { 1
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete", 2
"search_analyzer": "standard" 2
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Quick Brown Fox" 3
}
GET my_index/_search
{
"query": {
"match": {
"text": {
"query": "Quick Br", 4
"operator": "and"
}
}
}
}
2 - The text field uses the autocomplete analyzer at index time, but the standard
analyzer at search time.
3 - This field is indexed as the terms: [ q, qu, qui, quic, quick, b, br, bro, brow, brown,
f, fo, fox ]
4 - The query searches for the terms: [ quick, br ]
The search_analyzer setting must have the same setting for fields of
the same name in the same index. Its value can be updated on existing
fields using the PUT mapping API.
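The edge_ngram behaviour referenced above can be sketched in a few lines (a simplification of the real tokenizer chain, assuming the input has already been lowercased and split):

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Emit the leading substrings of a term, as an edge_ngram filter would."""
    top = min(len(term), max_gram)
    return [term[:i] for i in range(min_gram, top + 1)]

def index_terms(text):
    # Index-time analysis: every term is expanded into its edge n-grams.
    return [g for t in text.lower().split() for g in edge_ngrams(t)]

print(index_terms("Quick Brown Fox"))
# ['q', 'qu', 'qui', 'quic', 'quick', 'b', 'br', 'bro', 'brow', 'brown', 'f', 'fo', 'fox']

# Search-time analysis uses the plain standard analyzer, so "Quick Br" becomes
# just ['quick', 'br'] -- both of which exist among the indexed terms above.
```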
36.22. Similarity
NG|Storage allows you to configure a scoring algorithm or similarity per field. The
similarity setting provides a simple way of choosing a similarity algorithm other than
the default TF/IDF, such as BM25.
Similarities are mostly useful for text fields, but can also apply to other field types.
Custom similarities can be configured by tuning the parameters of the built-in similarities.
For more details about these expert options, see the similarity module.
The only similarities which can be used out of the box, without any further configuration
are:
classic
The Default TF/IDF algorithm used by NG|Storage and Lucene. See
{defguide}/practical-scoring-function.html[Lucene’s Practical Scoring Function] for
more information.
BM25
The Okapi BM25 algorithm. See {defguide}/pluggable-similarites.html[Pluggable
Similarity Algorithms] for more information.
The similarity can be set on the field level when a field is first created, as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"default_field": {
"type": "text"
},
"bm25_field": {
"type": "text",
"similarity": "BM25"
}
}
}
}
}
36.23. Term-Vector
Term vectors contain information about the terms produced by the analysis process,
including:
• a list of terms.
• the position (or order) of each term.
• the start and end character offsets mapping the term to its origin in the original string.
These term vectors can be stored so that they can be retrieved for a particular document.
The term_vector setting accepts the following values:
no
No term vectors are stored. (default)
yes
Just the terms in the field are stored.
with_positions
Terms and positions are stored.
with_offsets
Terms and character offsets are stored.
with_positions_offsets
Terms, positions, and start and end character offsets are stored.
The fast vector highlighter requires with_positions_offsets. The term vectors API
can retrieve whatever is stored.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Quick brown fox"
}
GET my_index/_search
{
"query": {
"match": {
"text": "brown fox"
}
},
"highlight": {
"fields": {
"text": {} 1
}
}
}
1 - The fast vector highlighter will be used by default for the text field because term
vectors are enabled.
Core datatypes
string
number
date
boolean
binary
Complex datatypes
array
object
nested
Geo datatypes
geo-point
geo-shape
Specialised datatypes
completion
token-count
percolator
For more information please refer to the source ElasticSearch reference documentation
chapter.
Modules
This section contains modules responsible for various aspects of the functionality in
NG|Storage. Each module has settings which may be:
static
These settings must be set at the node level, either in the ngStorage.yml file, or as
an environment variable or on the command line when starting a node. They must be
set on every relevant node in the cluster.
dynamic
These settings can be dynamically updated on a live cluster with the cluster-update-
settings API.
Cluster
Settings to control where, when, and how shards are allocated to nodes.
Discovery
Gateway
How many nodes need to join the cluster before recovery can start.
Indices
Network
Node client
A Java node client joins the cluster, but doesn’t hold data or act as a master node.
Painless
Plugins
Scripting
Snapshot/Restore
Thread pools
Transport
Tribe nodes
A tribe node joins one or more clusters and acts as a federated client across them.
One of the main roles of the master is to decide which shards to allocate to which nodes,
and when to move shards between nodes in order to rebalance the cluster.
There are a number of settings available to control the shard allocation process:
• Cluster Level Shard Allocation lists the settings to control the allocation and
rebalancing operations.
• Disk-Based Shard Allocation explains how NG|Storage takes available disk space into
account, and the related settings.
• Shard Allocation Filtering allows certain nodes or groups of nodes to be excluded from
allocation so that they can be decommissioned.
All of the settings in this section are dynamic settings which can be updated on a live
cluster with the cluster-update-settings API.
When running nodes on multiple VMs on the same physical server, on multiple racks, or
across multiple awareness zones, it is more likely that two nodes on the same physical
server, in the same rack, or in the same awareness zone will crash at the same time, rather
than two unrelated nodes crashing simultaneously.
If NG|Storage is aware of the physical configuration of your hardware, it can ensure that the
primary shard and its replica shards are spread across different physical servers, racks, or
zones, to minimise the risk of losing all shard copies at the same time.
The shard allocation awareness settings allow you to tell NG|Storage about your hardware
configuration.
As an example, let's assume we have several racks. When we start a node, we can tell it
which rack it is in by assigning it an arbitrary metadata attribute called rack_id (we
could use any attribute name). For example:
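The elided command can be sketched as follows; the binary name and the -E flag syntax are assumptions based on a typical node startup, so adapt them to your installation:

```shell
bin/ngStorage -Enode.attr.rack_id=rack_one
```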
Now, we need to set up shard allocation awareness by telling NG|Storage which attributes
to use. This can be configured in the ngStorage.yml file on all master-eligible nodes, or
it can be set (and changed) with the cluster-update-settings API.
For our example, we’ll set the value in the config file:
cluster.routing.allocation.awareness.attributes: rack_id
With this config in place, let’s say we start two nodes with node.attr.rack_id set to
rack_one, and we create an index with 5 primary shards and 1 replica of each primary. All
primaries and replicas are allocated across the two nodes.
Now, if we start two more nodes with node.attr.rack_id set to rack_two, NG|Storage
will move shards across to the new nodes, ensuring (if possible) that no two copies of the
same shard will be in the same rack. However if rack_two were to fail, taking down both
of its nodes, NG|Storage will still allocate the lost shard copies to nodes in rack_one.
When executing search or GET requests, with shard awareness enabled, NG|Storage
will prefer using local shards (shards in the same awareness group) to execute
the request. This is usually faster than crossing racks or awareness zones.
Multiple awareness attributes can be specified, in which case the combination of values
from each attribute is considered to be a separate value.
cluster.routing.allocation.awareness.attributes: rack_id,zone
Forced Awareness
Imagine that you have two awareness zones and enough hardware across the two zones to
host all of your primary and replica shards. But perhaps the hardware in a single zone,
while sufficient to host half the shards, would be unable to host ALL the shards.
With ordinary awareness, if one zone lost contact with the other zone, NG|Storage would
assign all of the missing replica shards to a single zone. But in this example, this sudden
extra load would cause the hardware in the remaining zone to be overloaded.
Forced awareness solves this problem by NEVER allowing copies of the same shard to be
allocated to the same zone.
For example, let's say we have an awareness attribute called zone, and we know we are
going to have two zones, zone1 and zone2. Here is how we can force awareness on a
node:
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2 1
cluster.routing.allocation.awareness.attributes: zone
1 - We must list all possible values that the zone attribute can have.
Now, if we start 2 nodes with node.attr.zone set to zone1 and create an index with 5
shards and 1 replica, the index will be created, but only the 5 primary shards will be
allocated (with no replicas). Only when we start more nodes with node.attr.zone set to
zone2 will the replicas be allocated.
While Index Shard Allocation provides per-index settings to control the allocation of shards
to nodes, cluster-level shard allocation filtering allows you to allow or disallow the
allocation of shards from any index to particular nodes.
Chapter 38. Cluster | 507
NG|Storage Admin Guide
The typical use case for cluster-wide shard allocation filtering is when you want to
decommission a node, and you would like to move the shards from that node to other nodes
in the cluster before shutting it down.
PUT _cluster/settings
{
"transient" : {
"cluster.routing.allocation.exclude._ip" : "10.0.0.1"
}
}
Cluster-wide shard allocation filtering works in the same way as index-level shard
allocation filtering (see Index Shard Allocation for details).
The available dynamic cluster settings are as follows, where {attribute} refers to an
arbitrary node attribute:
cluster.routing.allocation.include.{attribute}
Assign the index to a node whose {attribute} has at least one of the comma-
separated values.
cluster.routing.allocation.require.{attribute}
Assign the index to a node whose {attribute} has all of the comma-separated
values.
cluster.routing.allocation.exclude.{attribute}
Assign the index to a node whose {attribute} has none of the comma-separated
values.
These special attribute values are also supported:
_name
Match nodes by node name
_ip
Match nodes by IP address (the IP address associated with the hostname)
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.include._ip": "192.168.2.*"
}
}
NG|Storage factors in the available disk space on a node before deciding whether to
allocate new shards to that node or to actively relocate shards away from that node.
Below are the settings that can be configured in the ngStorage.yml config file or updated
dynamically on a live cluster with the cluster-update-settings API:
cluster.routing.allocation.disk.threshold_enabled
Defaults to true. Set to false to disable the disk allocation decider.
cluster.routing.allocation.disk.watermark.low
Controls the low watermark for disk usage. It defaults to 85%, meaning ES will not
allocate new shards to nodes once they have more than 85% disk used. It can also be
set to an absolute byte value (like 500mb) to prevent ES from allocating shards if less
than the configured amount of space is available.
cluster.routing.allocation.disk.watermark.high
Controls the high watermark. It defaults to 90%, meaning ES will attempt to relocate
shards to another node if the node disk usage rises above 90%. It can also be set to an
absolute byte value (similar to the low watermark) to relocate shards once less than
the configured amount of space is available on the node.
Percentage values refer to used disk space, while byte values refer to free
disk space. This can be confusing, since it flips the meaning of high and
low. For example, it makes sense to set the low watermark to 10gb and the
high watermark to 5gb, but not the other way around.
cluster.info.update.interval
How often NG|Storage should check disk usage for each node in the cluster.
Defaults to 30s.
cluster.routing.allocation.disk.include_relocations
Defaults to true, which means that NG|Storage will take into account shards that are
currently being relocated to the target node when computing a node’s disk usage.
Taking relocating shards' sizes into account may, however, mean that the disk usage
for a node is incorrectly estimated on the high side, since the relocation could be 90%
complete and a recently retrieved disk usage would include the total size of the
relocating shard as well as the space already used by the running relocation.
An example of updating the low watermark to no more than 80% of the disk size, a high
watermark of at least 50 gigabytes free, and updating the information about the cluster
every minute:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "80%",
"cluster.routing.allocation.disk.watermark.high": "50gb",
"cluster.info.update.interval": "1m"
}
}
Prior to 2.0.0, when using multiple data paths, the disk threshold decider
only factored in the usage across all data paths (if you had two data paths,
one with 50b out of 100b free (50% used) and another with 40b out of 50b
free (80% used) it would see the node’s disk usage as 90b out of 150b). In
2.0.0, the minimum and maximum disk usages are tracked separately.
Metadata
An entire cluster may be set to read-only with the following dynamic setting:
cluster.blocks.read_only
Make the whole cluster read only (indices do not accept write operations) and disallow
metadata modifications (creating or deleting indices).
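As a sketch, the block can be toggled with the same cluster-update-settings API used elsewhere in this chapter (the curl form mirrors the other examples in this guide):

```shell
# Put the whole cluster into read-only mode; set to false to lift the block again
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.blocks.read_only": true
  }
}'
```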
Index Tombstones
The cluster state maintains index tombstones to explicitly denote indices that have been
deleted. The number of tombstones maintained in the cluster state is controlled by the
following property, which cannot be updated dynamically:
cluster.indices.tombstones.size
Index tombstones prevent nodes that are not part of the cluster when a delete occurs
from joining the cluster and reimporting the index as though the delete was never
issued. To keep the cluster state from growing huge we only keep the last
cluster.indices.tombstones.size deletes, which defaults to 500. You can
increase it if you expect nodes to be absent from the cluster and miss more than 500
deletes. We think that is rare, thus the default. Tombstones don’t take up much
space, but we also think that a number like 50,000 is probably too big.
Logger
The settings which control logging can be updated dynamically with the logger. prefix.
For instance, to increase the logging level of the indices.recovery module to DEBUG,
issue this request:
PUT /_cluster/settings
{
"transient": {
"logger.indices.recovery": "DEBUG"
}
}
Shard allocation is the process of allocating shards to nodes. This can happen during initial
recovery, replica allocation, rebalancing, or when nodes are added or removed.
The following dynamic settings may be used to control shard allocation and recovery:
cluster.routing.allocation.enable
Enable or disable allocation for specific kinds of shards:
• all - (default) Allows shard allocation for all kinds of shards.
• primaries - Allows shard allocation only for primary shards.
• new_primaries - Allows shard allocation only for primary shards for new
indices.
• none - No shard allocations of any kind are allowed for any indices.
This setting does not affect the recovery of local primary shards when restarting a node. A
restarted node that has a copy of an unassigned primary shard will recover that primary
immediately, assuming that its allocation id matches one of the active allocation ids in the
cluster state.
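For example, a common maintenance pattern is to disable allocation before restarting a node and re-enable it afterwards. A sketch using the cluster-update-settings API (the setting name cluster.routing.allocation.enable is assumed from the allocation values listed above):

```shell
# Disable shard allocation before taking the node down
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
# ... restart the node ...
# Re-enable allocation once the node has rejoined the cluster
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```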
cluster.routing.allocation.node_concurrent_incoming_recoveries
How many concurrent incoming shard recoveries are allowed to happen on a node.
Incoming recoveries are the recoveries where the target shard (most likely the replica
unless a shard is relocating) is allocated on the node. Defaults to 2.
cluster.routing.allocation.node_concurrent_outgoing_recoveries
How many concurrent outgoing shard recoveries are allowed to happen on a node.
Outgoing recoveries are the recoveries where the source shard (most likely the
primary unless a shard is relocating) is allocated on the node. Defaults to 2.
cluster.routing.allocation.node_initial_primaries_recoveries
While the recovery of replicas happens over the network, the recovery of an
unassigned primary after node restart uses data from the local disk. These should be
fast so more initial primary recoveries can happen in parallel on the same node.
Defaults to 4.
cluster.routing.allocation.same_shard.host
Performs a check to prevent allocation of multiple instances of the same shard on a
single host, based on host name and host address. Defaults to false, meaning that no
check is performed by default. This setting only applies if multiple nodes are started
on the same machine.
The following dynamic settings may be used to control the rebalancing of shards across the
cluster:
cluster.routing.rebalance.enable
Enable or disable rebalancing for specific kinds of shards:
• all - (default) Allows shard balancing for all kinds of shards.
• primaries - Allows shard balancing only for primary shards.
• replicas - Allows shard balancing only for replica shards.
• none - No shard balancing of any kind is allowed for any indices.
cluster.routing.allocation.allow_rebalance
Specify when shard rebalancing is allowed:
• always - Always allow rebalancing.
• indices_primaries_active - Only when all primaries in the cluster are
allocated.
• indices_all_active - (default) Only when all shards (primaries and replicas)
in the cluster are allocated.
cluster.routing.allocation.cluster_concurrent_rebalance
Controls how many concurrent shard rebalances are allowed cluster wide. Defaults
to 2.
The following settings are used together to determine where to place each shard. The
cluster is balanced when no allowed action can bring the weights of each node closer
together by more than the balance.threshold.
cluster.routing.allocation.balance.shard
Defines the weight factor for shards allocated on a node (float). Defaults to 0.45f.
Raising this raises the tendency to equalize the number of shards across all nodes in
the cluster.
cluster.routing.allocation.balance.index
Defines a factor to the number of shards per index allocated on a specific node
(float). Defaults to 0.55f. Raising this raises the tendency to equalize the number of
shards per index across all nodes in the cluster.
The discovery module is responsible for discovering nodes within a cluster, as well as
electing a master node.
Note that NG|Storage is a peer-to-peer based system in which nodes communicate with
one another directly; operations are delegated or broadcast as required. All the main APIs
(index, delete, search) do not communicate with the master node. The responsibility of the
master node is to maintain the global cluster state and to reassign shards when nodes join
or leave the cluster.
Each time a cluster state is changed, the state is made known to the other nodes in the
cluster (the manner depends on the actual discovery implementation).
Settings
The cluster.name setting allows clusters to be separated from one another. The default
value for the cluster name is NG|Storage, though it is recommended to change this to
reflect the logical group name of the running cluster.
Azure classic discovery allows you to use the Azure Classic APIs to perform automatic
discovery (similar to multicast). It is available as a plugin.
Google Compute Engine (GCE) discovery allows you to use the GCE APIs to perform
automatic discovery (similar to multicast). It is available as a plugin.
The zen discovery is the built-in discovery module for NG|Storage and the default. It
provides unicast discovery, but can be extended to support cloud environments and other
forms of discovery.
The zen discovery is integrated with other modules, for example, all communication
between nodes is done using the transport module.
Chapter 39. Discovery | 515
It is separated into several sub modules, which are explained below:
Ping
This is the process where a node uses the discovery mechanisms to find other nodes.
Unicast
The unicast discovery requires a list of hosts to use that will act as gossip routers. It
provides the following settings with the discovery.zen.ping.unicast prefix:
hosts
Either an array setting or a comma delimited setting. Each value should be in the
form of host:port or host (where port defaults to 9300). Note that IPv6 hosts
must be bracketed. Defaults to 127.0.0.1, [::1].
The unicast discovery uses the transport module to perform the discovery.
Master Election
As part of the ping process a master of the cluster is either elected or joined to. This is
done automatically. The discovery.zen.ping_timeout (which defaults to 3s) allows
for the tweaking of election time to handle cases of slow or congested networks (higher
values assure less chance of failure). Once a node joins, it will send a join request to the
master (discovery.zen.join_timeout) with a timeout defaulting to 20 times the ping
timeout.
When the master node stops or has encountered a problem, the cluster nodes start pinging
again and will elect a new master. This pinging round also serves as a protection against
(partial) network failures where a node may unjustly think that the master has failed. In this
case the node will simply hear from other nodes about the currently active master.
The discovery.zen.minimum_master_nodes setting must be set to a quorum of your
master-eligible nodes. It is recommended to avoid having only two master-eligible nodes,
since a quorum of two is two: the loss of either master-eligible node will result in an
inoperable cluster.
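For example, with three master-eligible nodes the quorum is (3 / 2) + 1 = 2. A minimal sketch of the corresponding entry in ngStorage.yml, assuming the standard zen discovery setting name:

```yaml
# Quorum for a cluster with three master-eligible nodes
discovery.zen.minimum_master_nodes: 2
```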
Fault Detection
There are two fault detection processes running. The first is run by the master, which
pings all the other nodes in the cluster to verify that they are alive. On the other end,
each node pings the master to verify that it is still alive, or whether an election process
needs to be initiated.
The following settings control the fault detection process using the discovery.zen.fd
prefix:
ping_interval
How often a node gets pinged. Defaults to 1s.
ping_timeout
How long to wait for a ping response. Defaults to 30s.
ping_retries
How many ping failures / timeouts cause a node to be considered failed. Defaults to 3.
The master node is the only node in a cluster that can make changes to the cluster state.
The master node processes one cluster state update at a time, applies the required
changes and publishes the updated cluster state to all the other nodes in the cluster. Each
node receives the publish message, acknowledges it, but does not yet apply it. If the master
does not receive acknowledgement from at least
discovery.zen.minimum_master_nodes nodes within a certain time (controlled by
the discovery.zen.commit_timeout setting and defaults to 30 seconds) the cluster
state change is rejected.
Once enough nodes have responded, the cluster state is committed and a message will be
sent to all the nodes. The nodes then proceed to apply the new cluster state to their
internal state. The master node waits for all nodes to respond, up to a timeout, before going
ahead processing the next updates in the queue. The
discovery.zen.publish_timeout is set by default to 30 seconds and is measured
from the moment the publishing started. Both timeout settings can be changed dynamically
through the cluster-update-settings API.
No master block
For the cluster to be fully operational, it must have an active master and the number of
running master eligible nodes must satisfy the
discovery.zen.minimum_master_nodes setting if set. The
discovery.zen.no_master_block setting controls which operations should be
rejected when there is no active master.
all
All operations on the node (i.e. both reads and writes) will be rejected. This also
applies to API cluster state read or write operations, like the get index settings, put
mapping and cluster state APIs.
write
(default) Write operations will be rejected. Read operations will succeed, based on the
last known cluster configuration. This may result in partial reads of stale data as this
node may be isolated from the rest of the cluster.
The indices module controls index-related settings that are globally managed for all
indices, rather than being configurable at a per-index level.
Circuit breaker
Circuit breakers set limits on memory usage to avoid out of memory exceptions.
Fielddata cache
Set limits on the amount of heap used by the in-memory fielddata cache.
Indexing buffer
Control the size of the buffer used to store newly indexed documents.
Recovery
Control the resource limits of the shard recovery process.
NG|Storage contains multiple circuit breakers used to prevent operations from causing an
OutOfMemoryError. Each breaker specifies a limit for how much memory it can use.
Additionally, there is a parent-level breaker that specifies the total amount of memory that
can be used across all breakers.
These settings can be dynamically updated on a live cluster with the cluster-update-
settings API.
indices.breaker.total.limit
Starting limit for overall parent breaker, defaults to 70% of JVM heap.
Chapter 40. Indices | 519
Field data circuit breaker
The field data circuit breaker allows NG|Storage to estimate the amount of memory a field
will require to be loaded into memory. It can then prevent the field data loading by raising
an exception. By default the limit is configured to 60% of the maximum JVM heap. It can be
configured with the following parameters:
indices.breaker.fielddata.limit
Limit for fielddata breaker, defaults to 60% of JVM heap
indices.breaker.fielddata.overhead
A constant that all field data estimations are multiplied with to determine a final
estimation. Defaults to 1.03
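As a sketch, the fielddata breaker limit can also be changed dynamically with the cluster-update-settings API (the 40% value is illustrative only):

```shell
# Lower the fielddata breaker limit from the 60% default
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": { "indices.breaker.fielddata.limit": "40%" }
}'
```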
The request circuit breaker allows NG|Storage to prevent per-request data structures (for
example, memory used for calculating aggregations during a request) from exceeding a
certain amount of memory.
indices.breaker.request.limit
Limit for request breaker, defaults to 40% of JVM heap
indices.breaker.request.overhead
A constant that all request estimations are multiplied with to determine a final
estimation. Defaults to 1
The in flight requests circuit breaker allows NG|Storage to limit the memory usage of all
currently active incoming requests on transport or HTTP level from exceeding a certain
amount of memory on a node. The memory usage is based on the content length of the
request itself.
network.breaker.inflight_requests.limit
Limit for in flight requests breaker, defaults to 100% of JVM heap. This means that it
is bound by the limit configured for the parent circuit breaker.
network.breaker.inflight_requests.overhead
A constant that all in flight requests estimations are multiplied with to determine a
final estimation. Defaults to 1
The field data cache is used mainly when sorting on or computing aggregations on a field. It
loads all the field values to memory in order to provide fast document based access to
those values. The field data cache can be expensive to build for a field, so it's recommended
to have enough memory to allocate it, and to keep it loaded.
The amount of memory used for the field data cache can be controlled using
indices.fielddata.cache.size. Note: reloading the field data which does not fit into
your cache will be expensive and perform poorly.
indices.fielddata.cache.size
The max size of the field data cache, e.g. 30% of node heap space, or an absolute value,
e.g. 12GB. Defaults to unbounded. Also see the field data circuit breaker described above.
These are static settings which must be configured on every data node in
the cluster.
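A minimal sketch of the corresponding ngStorage.yml entry on each data node (the 20% value is illustrative):

```yaml
# Cap the field data cache at 20% of the node heap
indices.fielddata.cache.size: 20%
```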
You can monitor memory usage for field data as well as the field data circuit breaker using
the Nodes Stats API.
The indexing buffer is used to store newly indexed documents. When it fills up, the
documents in the buffer are written to a segment on disk. It is divided between all shards
on the node.
The following settings are static and must be configured on every data node in the cluster:
indices.memory.index_buffer_size
Accepts either a percentage or a byte size value. It defaults to 10%, meaning that 10%
of the total heap allocated to a node will be used as the indexing buffer size shared
across all shards.
indices.memory.min_index_buffer_size
If the index_buffer_size is specified as a percentage, then this setting can be
used to specify an absolute minimum. Defaults to 48mb.
The query cache is responsible for caching the results of queries. There is one query
cache per node that is shared by all shards. The cache implements an LRU eviction policy:
when a cache becomes full, the least recently used data is evicted to make way for new
data.
The query cache only caches queries which are being used in a filter context.
The following setting is static and must be configured on every data node in the cluster:
indices.queries.cache.size
Controls the memory size for the filter cache, defaults to 10%. Accepts either a
percentage value, like 5%, or an exact value, like 512mb.
The following setting is an index setting that can be configured on a per-index basis:
index.queries.cache.enabled
Controls whether to enable query caching. Accepts true (default) or false.
The following expert settings can be set to manage the recovery policy.
indices.recovery.file_chunk_size
Defaults to 512kb.
indices.recovery.translog_ops
Defaults to 1000.
indices.recovery.translog_size
Defaults to 512kb.
indices.recovery.compress
Defaults to true.
indices.recovery.max_bytes_per_sec
Defaults to 40mb.
These settings can be dynamically updated on a live cluster with the cluster-update-
settings API.
When a search request is run against an index or against many indices, each involved shard
executes the search locally and returns its local results to the coordinating node, which
combines these shard-level results into a "global" result set.
The shard-level request cache module caches the local results on each shard. This allows
frequently used (and potentially heavy) search requests to return results almost instantly.
The requests cache is a very good fit for the logging use case, where only the most recent
index is being actively updated; results from older indices will be served directly from
the cache.
By default, the requests cache will only cache the results of search
requests where size=0, so it will not cache hits, but it will cache
hits.total, aggregations, and suggestions.
Cache invalidation
The cache is smart: it keeps the same near real-time promise as uncached search.
Cached results are invalidated automatically whenever the shard refreshes, but only if the
data in the shard has actually changed. In other words, you will always get the same
results from the cache as you would for an uncached search request.
The longer the refresh interval, the longer that cached entries will remain valid. If the
cache is full, the least recently used cache keys will be evicted.
curl -XPOST
'localhost:9200/kimchy,ngStorage/_cache/clear?request_cache=true'
The cache is not enabled by default, but can be enabled when creating a new index. It can
also be enabled or disabled dynamically on an existing index with the update-settings API.
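The two elided examples can be sketched as follows; the setting name index.requests.cache.enable is an assumption based on the underlying engine's API, and my_index is a hypothetical index name:

```shell
# Enable the request cache at index creation time
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": { "index.requests.cache.enable": true }
}'

# Enable or disable it dynamically on an existing index
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.requests.cache.enable": true
}'
```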
If your query uses a script whose result is not deterministic (e.g. it uses a
random function or references the current time) you should set the
request_cache flag to false to disable caching for that request.
Requests where size is greater than 0 will not be cached even if the request cache is
enabled in the index settings. To cache these requests you will need to use the query-string
parameter detailed here.
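A sketch of that query-string parameter, assumed to be request_cache to match the clear-cache example above (the index, field, and aggregation names are hypothetical):

```shell
# Force caching of a non-size=0 request via the query string
curl -XGET 'localhost:9200/my_index/_search?request_cache=true' -d '{
  "size": 0,
  "aggs": { "popular_colors": { "terms": { "field": "colors" } } }
}'
```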
Cache key
The whole JSON body is used as the cache key. This means that if the JSON changes, for
instance if keys are output in a different order, the cache key will not be recognised.
Most JSON libraries support a canonical mode which ensures that JSON
keys are always emitted in the same order. This canonical mode can be
used in the application to ensure that a request is always serialized in the
same way.
Cache settings
The cache is managed at the node level, and has a default maximum size of 1% of the heap.
This can be changed in the config/ngStorage.yml file with:
indices.requests.cache.size: 2%
Also, you can use the indices.requests.cache.expire setting to specify a TTL for cached
results, but there should be no reason to do so. Remember that stale results are
automatically invalidated when the index is refreshed. This setting is provided for
completeness' sake only.
The size of the cache (in bytes) and the number of evictions can be viewed by index, with the
indices-stats API:
curl 'localhost:9200/_stats/request_cache?pretty&human'
curl 'localhost:9200/_nodes/stats/indices/request_cache?pretty&human'
The scripting module enables you to use scripts to evaluate custom expressions. For
example, you could use a script to return "script fields" as part of a search request or
evaluate a custom score for a query.
The default scripting language is groovy. Additional lang plugins enable you to run scripts
written in other languages. Everywhere a script can be used, you can include a lang
parameter to specify the language of the script.
For more information please refer to the source ElasticSearch reference documentation
chapter.
Text features, such as term or document frequency for a specific term, can be accessed in
scripts with the _index variable. This can be useful if, for example, you want to implement
your own scoring model using a script inside a function score query. Statistics over the
document collection are computed per shard, not per index.
Nomenclature:
df
document frequency. The number of documents a term appears in. Computed per
field.
tf
term frequency. The number of times a term appears in a field in one specific document.
ttf
total term frequency. The number of times this term appears in all documents, that is,
the sum of tf over all documents. Computed per field.
Shard statistics:
_index.numDocs()
Number of documents in shard.
_index.maxDoc()
Maximal document number in shard.
_index.numDeletedDocs()
Number of deleted documents in shard.
Field statistics:
Field statistics can be accessed with a subscript operator like this: _index['FIELD'].
_index['FIELD'].docCount()
Number of documents containing the field FIELD. Does not take deleted documents
into account.
_index['FIELD'].sumttf()
Sum of ttf over all terms that appear in field FIELD in all documents.
_index['FIELD'].sumdf()
The sum of df values over all terms that appear in field FIELD in all documents.
Field statistics are computed per shard and therefore these numbers can vary depending
on the shard the current document resides in. The number of terms in a field cannot be
accessed using the _index variable. See [token-count] for how to do that.
Term statistics:
Term statistics for a field can be accessed with a subscript operator like this:
_index['FIELD']['TERM']. This will never return null, even if the term or field does
not exist. If you do not need the term frequency, call _index['FIELD'].get('TERM', 0)
to avoid unnecessary initialization of the frequencies. The flag will only have an effect if
you set the index_options to docs.
_index['FIELD']['TERM'].df()
df of term TERM in field FIELD. Will be returned, even if the term is not present in the
current document.
_index['FIELD']['TERM'].ttf()
The sum of term frequencies of term TERM in field FIELD over all documents. Will be
returned, even if the term is not present in the current document.
_index['FIELD']['TERM'].tf()
tf of term TERM in field FIELD. Will be 0 if the term is not present in the current
document.
_POSITIONS
if you need the positions of the term
_OFFSETS
if you need the offsets of the term
_PAYLOADS
if you need the payloads of the term
_CACHE
if you need to iterate over all positions several times
The iterator uses the underlying lucene classes to iterate over positions. For efficiency
reasons, you can only iterate over positions once. If you need to iterate over the positions
several times, set the _CACHE flag.
You can combine the operators with a | if you need more than one piece of information.
For example, the following will return an object holding the positions and payloads, as well
as all statistics:
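The elided example can be sketched as follows (the field and term names are hypothetical):

```groovy
// Return positions and payloads, plus all statistics, in a single lookup
termInfo = _index['my_field'].get('my_term', _POSITIONS | _PAYLOADS | _CACHE)
```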
Positions can be accessed with an iterator that returns an object (POS_OBJECT) holding
position, offsets and payload for each term position.
POS_OBJECT.position
The position of the term.
POS_OBJECT.endOffset
The end offset of the term.
POS_OBJECT.payload
The payload of the term.
POS_OBJECT.payloadAsInt(missingValue)
The payload of the term converted to integer. If the current position has no payload,
the missingValue will be returned. Call this only if you know that your payloads are
integers.
POS_OBJECT.payloadAsFloat(missingValue)
The payload of the term converted to float. If the current position has no payload, the
missingValue will be returned. Call this only if you know that your payloads are
floats.
POS_OBJECT.payloadAsString()
The payload of the term converted to string. If the current position has no payload,
null will be returned. Call this only if you know that your payloads are strings.
termInfo = _index['my_field'].get('foo', _PAYLOADS);
score = 0;
for (pos in termInfo) {
    score = score + pos.payloadAsInt(0);
}
return score;
Term vectors:
The _index variable can only be used to gather statistics for single terms. If you want to
use information on all terms in a field, you must store the term vectors (see Term Vector).
To access them, call _index.termVectors() to get a Fields instance. This object can
then be used as described in lucene doc to iterate over fields and then for each field iterate
over each term in the field. The method will return null if the term vectors were not stored.
Performance
Expressions were designed to have competitive performance with custom Lucene code.
This performance is due to having low per-document overhead as opposed to other
scripting engines: expressions do more work "up-front".
This allows for very fast execution, even faster than if you had written a native script.
Syntax
See the expressions module documentation for details on what operators and functions are
available.
You can use Expressions scripts for script_score, script_fields, sort scripts, and
numeric aggregation scripts; simply set the lang parameter to expression. Expression
scripts can access:
• The current document’s score, _score (only available when used in a script_score)
Expression Description
doc['field_name'].value    The value of the field, as a double.
doc['field_name'].empty    A boolean indicating if the field has no values within the doc.
doc['field_name'].length    The number of values in this document.
doc['field_name'].min()    The minimum value of the field in this document.
doc['field_name'].max()    The maximum value of the field in this document.
When a document is missing the field completely, by default the value will be treated as 0.
You can treat it as another value instead, e.g. doc['myfield'].empty ? 100 :
doc['myfield'].value
When a document has multiple values for the field, by default the minimum value is
returned. You can choose a different value instead, e.g. doc['myfield'].sum().
Boolean fields are exposed as numerics, with true mapped to 1 and false mapped to 0.
For example: doc['on_sale'].value ? doc['price'].value * 0.5 :
doc['price'].value
Date fields are treated as the number of milliseconds since January 1, 1970 and support the
Numeric Fields API above, plus access to some date-specific fields:
Expression Description
doc['field_name'].date.centuryOfEra    Century (1-2920000).
doc['field_name'].date.dayOfMonth    Day (1-31), e.g. 1 for the first of the month.
doc['field_name'].date.dayOfWeek    Day of the week (1-7), e.g. 1 for Monday.
doc['field_name'].date.dayOfYear    Day of the year, e.g. 1 for January 1.
doc['field_name'].date.era    Era: 0 for BC, 1 for AD.
doc['field_name'].date.hourOfDay    Hour (0-23).
doc['field_name'].date.millisOfDay    Milliseconds within the day (0-86399999).
doc['field_name'].date.millisOfSecond    Milliseconds within the second (0-999).
doc['field_name'].date.minuteOfDay    Minute within the day (0-1439).
doc['field_name'].date.minuteOfHour    Minute within the hour (0-59).
doc['field_name'].date.monthOfYear    Month within the year (1-12), e.g. 1 for January.
doc['field_name'].date.secondOfDay    Second within the day (0-86399).
The following example shows the difference in years between the date fields date0 and
date1:
doc['date1'].date.year - doc['date0'].date.year
Geo point fields support the following expressions:
Expression Description
doc['field_name'].empty    A boolean indicating if the field has no values within the doc.
doc['field_name'].lat    The latitude of the geo point.
doc['field_name'].lon    The longitude of the geo point.
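For instance, assuming the expressions language exposes Lucene's haversin(lat1, lon1, lat2, lon2) function and an indexed geo point field named location (both are assumptions here, not taken from the surrounding text), a distance calculation might be sketched as:

```console
GET my_index/_search
{
  "script_fields": {
    "distance": {
      "script": {
        "lang": "expression",
        "inline": "haversin(38.9, -77.03, doc['location'].lat, doc['location'].lon)"
      }
    }
  }
}
```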
In a distance calculation like this, the coordinates could be passed as parameters to the
script, e.g. based on the geolocation of the user.
Limitations
Relative to other scripting languages, expressions are limited; notably, only numeric
fields may be accessed and stored fields are not available.
Accessing Document Fields and Special Variables
Depending on where a script is used, it will have access to certain special variables and
document fields.
Update Scripts
A script used in the update, update-by-query, or reindex API will have access to the ctx
variable, which exposes:
ctx._source
Access to the document _source field.
ctx.op
The operation that should be applied to the document: index or delete.
ctx._index etc
Access to document metadata fields, some of which may be read-only.
Search and Aggregation Scripts
With the exception of script fields, which are executed once per search hit, scripts used in
search and aggregations will be executed once for every document which might match a
query or an aggregation. Depending on how many documents you have, this could mean
millions or billions of executions: these scripts need to be fast!
Field values can be accessed from a script using doc-values, stored fields, or the _source
field, which are explained below.
Scripts may also have access to the document’s relevance _score and, via the
experimental _index variable, to term statistics for advanced text scoring.
PUT my_index/my_type/2
{
"text": "quick fox",
"popularity": 5
}
GET my_index/_search
{
"query": {
"function_score": {
"query": {
"match": {
"text": "quick brown fox"
}
},
"script_score": {
"script": {
"lang": "expression",
"inline": "_score * doc['popularity']"
}
}
}
}
}
Doc Values
By far the fastest and most efficient way to access a field value from a script is to use the
doc['field_name'] syntax, which retrieves the field value from doc values. Doc values
are a columnar field value store, enabled by default on all fields except for analyzed text
fields.
GET my_index/_search
{
"script_fields": {
"sales_price": {
"script": {
"lang": "expression",
"inline": "doc['cost_price'] * markup",
"params": {
"markup": 0.2
}
}
}
}
}
Doc-values can only return "simple" field values like numbers, dates, geo-points, terms,
etc, or arrays of these values if the field is multi-valued. They cannot return JSON objects.
The doc['field'] syntax can also be used for analyzed text fields if
fielddata is enabled, but BEWARE: enabling fielddata on a text field
requires loading all of the terms into the JVM heap, which can be very
expensive both in terms of memory and CPU. It seldom makes sense to
access text fields from scripts.
The document _source, which is really just a special stored field, can be accessed using
the _source.field_name syntax. The _source is loaded as a map-of-maps, so
properties within object fields can be accessed as, for example, _source.name.first.
Stored fields (which includes the stored _source field) are much slower
than doc-values. They are optimised for returning several fields per
result, while doc values are optimised for accessing the value of a specific
field in many documents.
For instance:
PUT my_index/my_type/1
{
"title": "Mr",
"first_name": "Barry",
"last_name": "White"
}
GET my_index/_search
{
"script_fields": {
"source": {
"script": {
"lang": "groovy",
"inline": "_source.title + ' ' + _source.first_name + ' ' +
_source.last_name" 2
}
},
"stored_fields": {
"script": {
"lang": "groovy",
"inline": "_fields['first_name'].value + ' ' +
_fields['last_name'].value"
}
}
}
}
1 - The title field is not stored and so cannot be used with the _fields[] syntax.
2 - The title field can still be accessed from the _source.
distinguish null values from empty fields, single-value arrays from plain
scalars, etc).
The only time it really makes sense to use stored fields instead of the
_source field is when the _source is very large and it is less costly to
access a few small stored fields instead of the entire _source.
Groovy is the default scripting language available in NG|Storage. Although limited by the
Java Security Manager, it is not a sandboxed language and only file scripts may be used
by default.
Enabling inline or stored Groovy scripting is a security risk and should only be
considered if your NG|Storage cluster is protected from the outside world. Even a simple
while (true) { } loop could behave as a denial-of-service attack on your cluster.
See Scripting and Security for details on security issues with scripts, including how to
customize class whitelisting.
Doc values in Groovy support the following properties and methods (depending on the
underlying field type):
doc['field_name'].value
The native value of the field. For example, if it's a short type, it will be short.
doc['field_name'].values
The native array values of the field. For example, if it's a short type, it will be short[].
Remember, a field can have several values within a single doc. Returns an empty
array if the field has no values.
doc['field_name'].empty
A boolean indicating if the field has no values within the doc.
doc['field_name'].lat
The latitude of a geo point type, or null.
doc['field_name'].lon
The longitude of a geo point type, or null.
doc['field_name'].lats
The latitudes of a geo point type, or an empty array.
doc['field_name'].lons
The longitudes of a geo point type, or an empty array.
doc['field_name'].distance(lat, lon)
The plane distance (in meters) of this geo point field from the provided lat/lon.
doc['field_name'].distanceInMiles(lat, lon)
The plane distance (in miles) of this geo point field from the provided lat/lon.
doc['field_name'].distanceInKm(lat, lon)
The plane distance (in km) of this geo point field from the provided lat/lon.
doc['field_name'].arcDistance(lat, lon)
The arc distance (in meters) of this geo point field from the provided lat/lon.
doc['field_name'].arcDistanceInMiles(lat, lon)
The arc distance (in miles) of this geo point field from the provided lat/lon.
doc['field_name'].arcDistanceInKm(lat, lon)
The arc distance (in km) of this geo point field from the provided lat/lon.
doc['field_name'].factorDistance(lat, lon)
The distance factor of this geo point field from the provided lat/lon.
doc['field_name'].geohashDistance(geohash)
The arc distance (in meters) of this geo point field from the provided geohash.
doc['field_name'].geohashDistanceInKm(geohash)
The arc distance (in km) of this geo point field from the provided geohash.
doc['field_name'].geohashDistanceInMiles(geohash)
The arc distance (in miles) of this geo point field from the provided geohash.
There are several built-in functions that can be used within scripts. They include:
Function Description
sin(a)    Returns the trigonometric sine of an angle.
cos(a)    Returns the trigonometric cosine of an angle.
tan(a)    Returns the trigonometric tangent of an angle.
asin(a)    Returns the arc sine of a value.
acos(a)    Returns the arc cosine of a value.
atan(a)    Returns the arc tangent of a value.
Sometimes groovy and expression aren’t enough. For those times you can implement a
native script.
The best way to implement a native script is to write a plugin and install it.
If you squashed the whole thing into one class it’d look like:
You can execute the script by specifying its lang as native, and the name of the script as
the id:
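A request invoking such a script might be sketched as follows; my_script stands in for whatever name your plugin registers:

```console
GET _search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "native",
          "id": "my_script"
        }
      }
    }
  }
}
```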
You can use Painless anywhere a script can be used in NG|Storage; simply set the lang
parameter to painless.
Painless Features
• Optional typing: Variables and parameters can use explicit types or the dynamic def
type.
• Syntax: Extends Java’s syntax with a subset of Groovy for ease of use. See the Syntax
Overview.
Painless Examples
To illustrate how Painless works, let’s load some hockey stats into an NG|Storage index:
PUT hockey/player/_bulk?refresh
{"index":{"_id":1}}
{"first":"johnny","last":"gaudreau","goals":[9,27,1],"assists":[17,46,0],"gp":[26,82,1]}
{"index":{"_id":2}}
{"first":"sean","last":"monohan","goals":[7,54,26],"assists":[11,26,13],"gp":[26,82,82]}
{"index":{"_id":3}}
{"first":"jiri","last":"hudler","goals":[5,34,36],"assists":[11,62,42],"gp":[24,80,79]}
{"index":{"_id":4}}
{"first":"micheal","last":"frolik","goals":[4,6,15],"assists":[8,23,15],"gp":[26,82,82]}
{"index":{"_id":5}}
{"first":"sam","last":"bennett","goals":[5,0,0],"assists":[8,1,0],"gp":[26,1,0]}
{"index":{"_id":6}}
{"first":"dennis","last":"wideman","goals":[0,26,15],"assists":[11,30,24],"gp":[26,81,82]}
{"index":{"_id":7}}
{"first":"david","last":"jones","goals":[7,19,5],"assists":[3,17,4],"gp":[26,45,34]}
{"index":{"_id":8}}
{"first":"tj","last":"brodie","goals":[2,14,7],"assists":[8,42,30],"gp":[26,82,82]}
{"index":{"_id":39}}
{"first":"mark","last":"giordano","goals":[6,30,15],"assists":[3,30,24],"gp":[26,60,63]}
{"index":{"_id":10}}
{"first":"mikael","last":"backlund","goals":[3,15,13],"assists":[6,24,18],"gp":[26,82,82]}
{"index":{"_id":11}}
{"first":"joe","last":"colborne","goals":[3,18,13],"assists":[6,20,24],"gp":[26,67,82]}
For example, the following script calculates a player’s total goals. This example uses a
strongly typed int and a for loop.
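A function_score version of this calculation might be sketched as follows (the request wrapper here mirrors the earlier function_score example; only the inline script is taken from the surrounding text):

```console
GET hockey/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "inline": "int total = 0; for (int i = 0; i < doc['goals'].length; ++i) { total += doc['goals'][i]; } return total;"
        }
      }
    }
  }
}
```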
Alternatively, you could do the same thing using a script field instead of a function score:
GET hockey/_search
{
"query": {
"match_all": {}
},
"script_fields": {
"total_goals": {
"script": {
"lang": "painless",
"inline": "int total = 0; for (int i = 0; i < doc['goals'].length;
++i) { total += doc['goals'][i]; } return total;"
}
}
}
}
The following example uses a Painless script to sort the players by their combined first and
last names. The names are accessed using doc['first'].value and
doc['last'].value.
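Such a sort request might be sketched as (the _script sort structure is assumed, not shown elsewhere in this section):

```console
GET hockey/_search
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "string",
      "order": "asc",
      "script": {
        "lang": "painless",
        "inline": "doc['first'].value + ' ' + doc['last'].value"
      }
    }
  }
}
```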
You can also easily update fields. You access the original source for a field as
ctx._source.<field-name>.
First, let’s look at the source data for a player by submitting the following request:
GET hockey/_search
{
"stored_fields": [
"_id",
"_source"
],
"query": {
"term": {
"_id": 1
}
}
}
To change player 1’s last name to hockey, simply set ctx._source.last to the new
value:
POST hockey/player/1/_update
{
"script": {
"lang": "painless",
"inline": "ctx._source.last = params.last",
"params": {
"last": "hockey"
}
}
}
You can also add a new field to the document in the same way; here the last name is
restored and a nick field is added:
POST hockey/player/1/_update
{
"script": {
"lang": "painless",
"inline": "ctx._source.last = params.last; ctx._source.nick =
params.nick",
"params": {
"last": "gaudreau",
"nick": "hockey"
}
}
}
Regular expressions
• /pattern/: Pattern literals create patterns. This is the only way to create a pattern in
Painless. The pattern inside the `/`s is just a Java regular expression. See [modules-scripting-painless-regex-flags] for more.
• =~: The find operator returns a boolean: true if a subsequence of the text matches,
false otherwise.
• ==~: The match operator returns a boolean, true if the text matches, false if it
doesn’t.
Using the find operator (=~) you can update all hockey players with "b" in their last name:
POST hockey/player/_update_by_query
{
"script": {
"lang": "painless",
"inline": "if (ctx._source.last =~ /b/) {ctx._source.last +=
\"matched\"} else {ctx.op = 'noop'}"
}
}
Using the match operator (==~) you can update all the hockey players whose names start
with a consonant and end with a vowel:
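One way to express this (the particular regular expression is illustrative):

```console
POST hockey/player/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "if (ctx._source.last ==~ /[^aeiou].*[aeiou]/) {ctx._source.last += \"matched\"} else {ctx.op = 'noop'}"
  }
}
```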
You can use the Pattern.matcher directly to get a Matcher instance and remove all of
the vowels in all of their last names:
POST hockey/player/_update_by_query
{
"script": {
"lang": "painless",
"inline": "ctx._source.last =
/[aeiou]/.matcher(ctx._source.last).replaceAll('')"
}
}
Because this is a plain Matcher.replaceAll call, $1 and \1 backreferences are supported:
POST hockey/player/_update_by_query
{
"script": {
"lang": "painless",
"inline": "ctx._source.last =
/n([aeiou])/.matcher(ctx._source.last).replaceAll('$1')"
}
}
If you need more control over replacements you can call replaceAll on a
CharSequence with a Function<Matcher, String> that builds the replacement. This
does not support $1 or \1 to access replacements because you already have a reference to
the matcher and can get them with m.group(1).
This will make all of the vowels in the hockey player’s last names upper case:
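A sketch of that request, using replaceAll with a Function<Matcher, String> as described above:

```console
POST hockey/player/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.last = ctx._source.last.replaceAll(/[aeiou]/, m -> m.group().toUpperCase(Locale.ROOT))"
  }
}
```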
Or you can use the CharSequence.replaceFirst to make the first vowel in their last
names upper case:
POST hockey/player/_update_by_query
{
"script": {
"lang": "painless",
"inline": "ctx._source.last = ctx._source.last.replaceFirst(/[aeiou]/,
m -> m.group().toUpperCase(Locale.ROOT))"
}
}
Note: all of the _update_by_query examples above could really do with a query to limit
the data that they pull back. While you could use a Script Query it wouldn’t be as efficient as
using any other query because script queries aren’t able to use the inverted index to limit
the documents that they have to check.
Painless API
The following Java packages are available for use in the Painless language:
• java.lang
• java.math
• java.text
• java.time
• java.time.chrono
• java.time.format
• java.time.temporal
• java.time.zone
• java.util
• java.util.function
• java.util.stream
Note that unsafe classes and methods are not included; there is no support for:
• Input/Output
• Reflection
Variable types
Painless supports all of Java’s types, including array types, but adds some additional built-
in types.
Def
The dynamic type def serves as a placeholder for any other type. It adopts the behavior of
whatever runtime type it represents.
String
String constants can be declared with single quotes, to avoid escaping horrors with JSON:
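For example:

```painless
def mystring = 'foo';
```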
List
Lists can be created explicitly (e.g. new ArrayList()) or initialized similar to Groovy:
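For example, a list literal:

```painless
def list = [1, 2, 3];
```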
Lists can also be accessed similar to arrays: they support subscript and .length:
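For example, building on a list literal:

```painless
def list = [1, 2, 3];
list[0] = 5;        // subscript write
def x = list[1];    // subscript read
return list.length; // number of elements, as with arrays
```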
Map
Maps can be created explicitly (e.g. new HashMap()) or initialized similar to Groovy:
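For example, a map literal:

```painless
def map = ['foo': 'bar', 'count': 42];
```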
Map keys can also be accessed via subscript (for keys containing special characters):
return map['something-absurd!']
Pattern
Pattern p = /[aeiou]/
Patterns can only be created via this mechanism. This ensures fast performance: regular
expressions are always constants and are compiled efficiently a single time.
Pattern flags
You can define flags on patterns in Painless by adding characters after the trailing / like
/foo/i or /foo \w #comment/iUx. Painless exposes all the flags from Java’s Pattern
class using these characters:
All of Java’s operators are supported with the same precedence, promotion, and semantics.
• == behaves as Java’s for numeric types, but for non-numeric types acts as
Object.equals()
• ==~ true if the entire text matches a pattern (e.g. x ==~ /[Bb]ob/)
Control flow
Java’s control flow statements are supported, with the exception of the switch statement.
In addition to Java’s enhanced for loop, the for in syntax from Groovy can also be
used:
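A sketch of the for in form, where list stands for any iterable value:

```painless
for (item : list) {
  // item takes each value in list in turn
}
```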
Functions
Lambda expressions
Method references to functions within the script can be accomplished using this, e.g.
You should never run NG|Storage as the root user, as this would allow a script to access
or do anything on your server, without limitations.
You should not expose NG|Storage directly to users; instead, have a proxy application
in between. If you do intend to expose NG|Storage directly to your users, then you have to
decide whether you trust them enough to run scripts on your box or not, and apply the
appropriate safety measures.
The script.* settings allow for fine-grained control of which script languages (e.g.
groovy, painless) are allowed to run in which context (e.g. search, aggs, update),
and where the script source is allowed to come from (i.e. inline, stored, file).
For instance, the following setting enables inline update scripts for groovy:
script.engine.groovy.inline.update: true
Less fine-grained settings exist which allow you to enable or disable scripts for all sources,
all languages, or all contexts. The following settings enable inline and stored scripts
for all languages in all contexts:
script.inline: true
script.stored: true
The above settings mean that anybody who can send requests to your
NG|Storage instance can run whatever scripts they choose! This is a
security risk and may well lead to your NG|Storage cluster being
compromised.
Scripts may be enabled or disabled depending on their source: inline, stored in the
cluster state, or from a file on each node in the cluster. Each of these settings takes one
of these values:
true
Scripting is enabled.
false
Scripting is disabled.
The default values are the following:
script.inline: false
script.stored: false
script.file: true
Scripting may also be enabled or disabled in different contexts in the NG|Storage API. The
supported contexts are:
aggs
Aggregations
search
Search API, Percolator API and Suggester API
update
Update API
plugin
Any plugin that makes use of scripts under the generic plugin category
Plugins can also define custom operations that they use scripts for instead of using the
generic plugin category. Those operations can be referred to in the following form:
${pluginName}_${operation}.
Chapter 41. Scripting | 555
NG|Storage Admin Guide
The following example disables scripting for update and plugin operations, regardless of
the script source or language. Scripts can still be executed as part of aggregations and
search, as the above defaults still get applied.
script.update: false
script.plugin: false
First, the high-level script settings described above are applied in order (context settings
have precedence over source settings). Then, fine-grained settings which include the
script language take precedence over any high-level settings.
script.engine.{lang}.{source}.{context}: true|false
And
script.engine.{lang}.{inline|file|stored}: true|false
For example:
script.inline: false 1
script.stored: false 1
script.file: false 1
script.engine.groovy.inline: true 2
script.engine.groovy.stored.search: true 3
script.engine.groovy.stored.aggs: true 3
script.engine.mustache.stored.search: true 4
1 - Disable all scripting from any source.
2 - Allow inline Groovy scripts for all operations.
3 - Allow stored Groovy scripts to be used for search and aggregations.
4 - Allow stored Mustache templates to be used for search.
NG|Storage runs with the Java Security Manager enabled by default. The security policy in
NG|Storage locks down the permissions granted to each class to the bare minimum
required to operate. The benefit of doing this is that it severely limits the attack vectors
available to a hacker.
Restricting permissions is particularly important with scripting languages like Groovy and
JavaScript, which are designed to do anything that can be done in Java itself.
Scripting languages are only allowed to load classes which appear in a hardcoded whitelist
that can be found in org.ngStorage.script.ClassPermission.
In a script, attempting to load a class that does not appear in the whitelist may result in a
ClassNotFoundException. For instance, this script:
GET _search
{
"script_fields": {
"the_hour": {
"script": "use(java.math.BigInteger); new BigInteger(1)"
}
}
}
will return the following exception:
{
"reason": {
"type": "script_exception",
"reason": "failed to run inline script [use(java.math.BigInteger); new
BigInteger(1)] using lang [groovy]",
"caused_by": {
"type": "no_class_def_found_error",
"reason": "java/math/BigInteger",
"caused_by": {
"type": "class_not_found_exception",
"reason": "java.math.BigInteger"
}
}
}
}
However, classloader issues may also result in exceptions that are more difficult to
interpret.
If you encounter issues with the Java Security Manager, you have two options for resolving
these issues:
The safest and most secure long term solution is to change the code causing the security
issue. We recognise that this may take time to do correctly and so we provide the following
two alternatives.
The classloader whitelist can be customised by tweaking the local Java Security Policy.
Permissions may be granted at the class, package, or global level. For instance:
grant {
permission org.ngStorage.script.ClassPermission "java.lang.Class";
permission org.ngStorage.script.ClassPermission
"groovy.time.TimeCategory";
};
Before adding classes to the whitelist, consider the security impact that it
will have on NG|Storage. Do you really need an extra class or can your
code be rewritten in a more secure way?
Wherever scripting is supported in the NG|Storage API, the syntax follows the same
pattern:
"script": {
"lang": "...", 1
"inline" | "id" | "file": "...", 2
"params": { ... } 3
}
1 - The language the script is written in, which defaults to groovy.
2 - The script itself, which may be specified as inline, id, or file.
3 - Any named parameters that should be passed into the script.
PUT my_index/my_type/1
{
"my_field": 5
}
GET my_index/_search
{
"script_fields": {
"my_doubled_field": {
"script": {
"lang": "expression",
"inline": "doc['my_field'] * multiplier",
"params": {
"multiplier": 2
}
}
}
}
}
Script Parameters
lang
Specifies the language the script is written in. Defaults to groovy but may be set to
any of the languages listed in Scripting. The default language may be changed in the
ngStorage.yml config file by setting script.default_lang to the appropriate
language.
inline, id, file
Specifies the source of the script. An inline script is specified inline as in the
example above, a stored script with the specified id is retrieved from the cluster
state (see Stored Scripts), and a file script is retrieved from a file in the
config/scripts directory (see File Scripts).
While languages like expression and painless can be used out of the box as
inline or stored scripts, other languages like groovy can only be specified as file
unless you first adjust the default scripting security settings.
params
Specifies any named parameters that are passed into the script as variables.
The first time NG|Storage sees a new script, it compiles it and stores the
compiled version in a cache. Compilation can be a heavy process.
If you need to pass variables into the script, you should pass them in as
named params instead of hard-coding values into the script itself. For
example, if you want to be able to multiply a field value by different
multipliers, don’t hard-code the multiplier into the script:
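The two versions might be sketched as follows. The first hard-codes the multiplier and has to be recompiled for every new value:

```console
"script": {
  "lang": "expression",
  "inline": "doc['my_field'] * 2"
}
```

The second passes the multiplier as a named parameter, so the script is compiled only once:

```console
"script": {
  "lang": "expression",
  "inline": "doc['my_field'] * multiplier",
  "params": {
    "multiplier": 2
  }
}
```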
The first version has to be recompiled every time the multiplier changes.
The second version is only compiled once.
File-based Scripts
To increase security, non-sandboxed languages can only be specified in script files stored
on every node in the cluster. File scripts must be saved in the scripts directory whose
default location depends on whether you use the zip/tar.gz
($ES_HOME/config/scripts/), RPM, or Debian package. The default may be changed
with the path.scripts setting.
The languages which are assumed to be safe by default are: painless, expression, and
mustache (used for search and query templates).
Any files placed in the scripts directory will be compiled automatically when the node
starts up and then every 60 seconds thereafter.
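For example, assuming a file config/scripts/calculate-score.groovy containing the script log(_score * 2) + my_modifier (both the file name and the script body are illustrative), a request using it might look like:

```console
GET _search
{
  "query": {
    "script": {
      "script": {
        "lang": "groovy", 1
        "file": "calculate-score", 2
        "params": {
          "my_modifier": 2
        }
      }
    }
  }
}
```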
1 - The language of the script, which should correspond with the script file suffix. 2 - The
name of the script, which should be the name of the file.
The script directory may contain sub-directories, in which case the hierarchy of
directories is flattened and concatenated with underscores. A script in
group1/group2/my_script.groovy should use group1_group2_my_script as the
script name.
The scripts directory will be rescanned every 60s (configurable with the
resource.reload.interval setting) and new, changed, or removed scripts will be
compiled, updated, or deleted from the script cache.
Stored Scripts
Scripts may be stored in and retrieved from the cluster state using the _scripts end-
point:
/_scripts/{lang}/{id} 1,2
1 - The lang represents the script language. 2 - The id is a unique identifier or script
name.
This example stores a Groovy script called calculate-score in the cluster state:
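The store request itself might be sketched as follows (the script body here is illustrative):

```console
POST _scripts/groovy/calculate-score
{
  "script": "log(_score * 2) + my_modifier"
}
```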
The stored script can then be retrieved with:
GET _scripts/groovy/calculate-score
Stored scripts can be used by specifying the lang and id parameters as follows:
GET _search
{
"query": {
"script": {
"script": {
"lang": "groovy",
"id": "calculate-score",
"params": {
"my_modifier": 2
}
}
}
}
}
The stored script can be deleted with:
DELETE _scripts/groovy/calculate-score
Script Caching
All scripts are cached by default so that they only need to be recompiled when updates
occur. File scripts keep a static cache and will always reside in memory. Both inline and
stored scripts are stored in a cache that can evict residing scripts. By default, scripts do not
have a time-based expiration, but you can change this behavior by using the
script.cache.expire setting. You can configure the size of this cache by using the
script.cache.max_size setting. By default, the cache size is 100.
The size of stored scripts is limited to 65,535 bytes. This soft limit can be increased
with the script.max_size_in_bytes setting, but if scripts are really large then
alternatives like native scripts should be considered instead.
42.1. Local Gateway
The local gateway module stores the cluster state and shard data across full cluster
restarts.
The following static settings, which must be set on every data node in the cluster, control
how long nodes should wait before they try to recover any shards which are stored locally:
gateway.expected_nodes
The number of (data or master) nodes that are expected to be in the cluster. Recovery
of local shards will start as soon as the expected number of nodes have joined the
cluster. Defaults to 0
gateway.expected_master_nodes
The number of master nodes that are expected to be in the cluster. Recovery of local
shards will start as soon as the expected number of master nodes have joined the
cluster. Defaults to 0
gateway.expected_data_nodes
The number of data nodes that are expected to be in the cluster. Recovery of local
shards will start as soon as the expected number of data nodes have joined the
cluster. Defaults to 0
gateway.recover_after_time
If the expected number of nodes is not achieved, the recovery process waits for the
configured amount of time before trying to recover regardless. Defaults to 5m if one of
the expected_nodes settings is configured.
Once the recover_after_time duration has timed out, recovery will start as long as the
following conditions are met:
gateway.recover_after_nodes
Recover as long as this many data or master nodes have joined the cluster.
gateway.recover_after_master_nodes
Recover as long as this many master nodes have joined the cluster.
42.2. HTTP
When possible, consider using HTTP keep alive when connecting for better performance
and try to get your favorite client not to do HTTP chunking.
Settings
The settings in the table below can be configured for HTTP. Note that none of them are
dynamically updatable, so for them to take effect they must be set in ngStorage.yml.
Setting Description
http.port A bind port range. Defaults to 9200-9300.
http.publish_port The port that HTTP clients should use when
communicating with this node. Useful when
a cluster node is behind a proxy or firewall
and the http.port is not directly
addressable from the outside. Defaults to the
actual port assigned via http.port.
http.bind_host The host address to bind the HTTP service to.
Defaults to http.host (if set) or
network.bind_host.
http.publish_host The host address to publish for HTTP clients
to connect to. Defaults to http.host (if set)
or network.publish_host.
http.host Used to set the http.bind_host and the
http.publish_host Defaults to
http.host or network.host.
http.max_content_length The max content of an HTTP request.
Defaults to 100mb. If set to greater than
Integer.MAX_VALUE, it will be reset to
100mb.
http.max_initial_line_length The max length of an HTTP URL. Defaults to
4kb
http.max_header_size The max size of allowed headers. Defaults to
8kB
http.compression Support for compression when possible (with
Accept-Encoding). Defaults to true.
http.compression_level Defines the compression level to use for
HTTP responses. Valid values are in the
range of 1 (minimum compression) and 9
(maximum compression). Defaults to 3.
http.cors.enabled Enable or disable cross-origin resource
sharing, i.e. whether a browser on another
origin can do requests to NG|Storage.
Defaults to false.
http.cors.allow-origin Which origins to allow. Defaults to no origins
allowed. If you prepend and append a / to
the value, this will be treated as a regular
expression, allowing you to support HTTP
and HTTPS. For example, using
/https?:\/\/localhost(:[0-9]+)?/
would return the request header
appropriately in both cases. * is a valid
value but is considered a security risk, as
your NG|Storage instance is open to cross
origin requests from anywhere.
http.cors.max-age Browsers send a "preflight" OPTIONS
request to determine CORS settings. max-
age defines how long the result should be
cached for. Defaults to 1728000 (20 days).
http.cors.allow-methods Which methods to allow. Defaults to
OPTIONS, HEAD, GET, POST, PUT,
DELETE.
http.cors.allow-headers Which headers to allow. Defaults to
X-Requested-With, Content-Type,
Content-Length.
http.cors.allow-credentials Whether the
Access-Control-Allow-Credentials
header should be returned. Note: This header
is only returned when the setting is set to
true. Defaults to false.
http.detailed_errors.enabled Enables or disables the output of detailed
error messages and stack traces in response
output. Note: When set to false and the
error_trace request parameter is
specified, an error will be returned; when
error_trace is not specified, a simple
message will be returned. Defaults to true
http.pipelining Enable or disable HTTP pipelining, defaults
to true.
Disable HTTP
The http module can be completely disabled and not started by setting http.enabled to
false. NG|Storage nodes (and Java clients) communicate internally using the transport
interface, not HTTP. It might make sense to disable the http layer entirely on nodes which
are not meant to serve REST requests directly. For instance, you could disable HTTP on
data-only nodes if you also have client nodes which are intended to serve all REST
requests. Be aware, however, that you will not be able to send any REST requests (eg to
retrieve node stats) directly to nodes which have HTTP disabled.
42.3. Memcached
The memcached module allows to expose NG|Storage APIs over the memcached protocol
(as closely as possible).
The memcached protocol supports both the binary and the text protocol, automatically
detecting the correct one to use.
Memcached commands are mapped to REST and handled by the same generic REST layer
in NG|Storage. Here is a list of the memcached commands supported:
GET
The memcached GET command maps to a REST GET. The key used is the URI (with
parameters). The main downside is the fact that the memcached GET does not allow a body
in the request (and SET does not allow returning a result…). For this reason, most REST
APIs (like search) also accept the "source" as a URI parameter.
SET
The memcached SET command maps to a REST POST. The key used is the URI (with
parameters), and the body of the SET request is used as the body of the REST request.
DELETE
The memcached DELETE command maps to a REST DELETE. The key used is the URI (with
parameters).
QUIT
The memcached QUIT command is supported and disconnects the client.
Settings
The following are the settings that can be configured for memcached:
Setting Description
memcached.port A bind port range. Defaults to
11211-11311.
Disable memcached
The memcached module can be completely disabled and not started by setting
memcached.enabled to false. By default it is enabled once it is detected as a plugin.
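For example, in ngStorage.yml:

```yaml
memcached.enabled: false
```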
42.4. Network
NG|Storage binds to localhost only by default. This is sufficient for you to run a local
development server (or even a development cluster, if you start multiple nodes on the same
machine), but you will need to configure some basic network settings in order to run a real
production cluster across multiple servers.
network.host
The node will bind to this hostname or IP address and publish (advertise) this host to
other nodes in the cluster. Accepts an IP address, hostname, a special value, or an
array of any combination of these.
568 | Chapter 42. Advanced Modules
NG|Storage Admin Guide
Defaults to local.
discovery.zen.ping.unicast.hosts
In order to join a cluster, a node needs to know the hostname or IP address of at least
some of the other nodes in the cluster. This setting provides the initial list of other
nodes that this node will try to contact. Accepts IP addresses or hostnames.
http.port
Port to bind to for incoming HTTP requests. Accepts a single value or a range. If a
range is specified, the node will bind to the first available port in the range.
Defaults to 9200-9300.
transport.tcp.port
Port to bind for communication between nodes. Accepts a single value or a range. If a
range is specified, the node will bind to the first available port in the range.
Defaults to 9300-9400.
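As a sketch, a production node's ngStorage.yml might combine these settings as follows (the addresses, hostnames, and ports are illustrative):

```yaml
network.host: 192.168.1.10
discovery.zen.ping.unicast.hosts:
  - 192.168.1.10
  - 192.168.1.11
  - node3.example.com
http.port: 9200
transport.tcp.port: 9300
```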
The value of network.host may also be one of the following special values:
[networkInterface]
Addresses of a network interface, for example en0.
local
Any loopback addresses on the system, for example 127.0.0.1.
site
Any site-local addresses on the system, for example 192.168.0.1.
global
Any globally-scoped addresses on the system, for example 8.8.8.8.
IPv4 vs IPv6
These special values will work over both IPv4 and IPv6 by default, but you can also limit this
with the use of :ipv4 or :ipv6 specifiers. For example, en0:ipv4 would only bind to the
IPv4 addresses of interface en0.
More special settings are available when running in the cloud with either
the EC2 discovery plugin or the Google Compute Engine discovery plugin
installed.
network.bind_host
This specifies which network interface(s) a node should bind to in order to listen for
incoming requests. A node can bind to multiple interfaces, e.g. two network cards, or
a site-local address and a local address. Defaults to network.host.
network.publish_host
The publish host is the single interface that the node advertises to other nodes in the
cluster, so that those nodes can connect to it. Currently an NG|Storage node may be
bound to multiple addresses, but only publishes one. If not specified, this defaults to
the "best" address from network.host, sorted by IPv4/IPv6 stack
preference, then by reachability.
Both of the above settings can be configured just like network.host: they accept IP
addresses, host names, and special values.
Any component that uses TCP (like the HTTP and Transport modules) shares the following
settings:
network.tcp.no_delay
Enable or disable the TCP no delay setting. Defaults to true.
network.tcp.keep_alive
Enable or disable TCP keep alive. Defaults to true.
network.tcp.send_buffer_size
The size of the TCP send buffer (specified with size units). By default not explicitly set.
network.tcp.receive_buffer_size
The size of the TCP receive buffer (specified with size units). By default not explicitly
set.
An NG|Storage node exposes two network protocols which inherit the above settings, but
may be further configured independently:
TCP Transport
Used for communication between nodes in the cluster and by the Java TransportClient.
See the Transport module for more information.
HTTP
Exposes the JSON-over-HTTP interface used by all clients other than the Java clients.
See the HTTP module for more information.
42.5. Node
Any time that you start an instance of NG|Storage, you are starting a node. A collection of
connected nodes is called a cluster. If you are running a single node of NG|Storage, then
you have a cluster of one node.
Every node in the cluster can handle HTTP and Transport traffic by default. The transport
layer is used exclusively for communication between nodes and between nodes and the
{javaclient}/transport-client.html[Java TransportClient]; the HTTP layer is used only by
external REST clients.
All nodes know about all the other nodes in the cluster and can forward client requests to
the appropriate node. Besides that, each node serves one or more purposes:
Master-eligible node
A node that has node.master set to true (default), which makes it eligible to be
elected as the master node, which controls the cluster.
Data node
A node that has node.data set to true (default). Data nodes hold data and perform
data related operations such as CRUD, search, and aggregations.
Ingest node
A node that has node.ingest set to true (default). Ingest nodes are able to apply
an ingest pipeline to a document in order to transform and enrich the document
before indexing. With a heavy ingest load, it makes sense to use dedicated ingest
nodes and to mark the master and data nodes as node.ingest: false.
Tribe node
A tribe node, configured via the tribe.* settings, is a special type of coordinating
only node that can connect to multiple clusters and perform search and other
operations across all connected clusters.
By default a node is a master-eligible node and a data node, plus it can pre-process
documents through ingest pipelines. This is very convenient for small clusters but, as the
cluster grows, it becomes important to consider separating dedicated master-eligible
nodes from dedicated data nodes.
Coordinating node
Requests like search requests or bulk-indexing requests may involve data
held on different data nodes. A search request, for example, is executed in
two phases coordinated by the node which receives the client request: the
coordinating node. In the scatter phase, the coordinating node forwards the
request to the data nodes which hold the data. Each data node executes the
request locally and returns its results to the coordinating node. In the gather
phase, the coordinating node reduces each data node’s results into a
single global resultset.
Every node is implicitly a coordinating node. This means that a node that
has all three of node.master, node.data and node.ingest set to false
will act only as a coordinating node; the coordinating role itself cannot be
disabled. As a result, such a node needs to have enough memory and CPU in
order to deal with the gather phase.
Master-eligible Node
The master node is responsible for lightweight cluster-wide actions such as creating or
deleting an index, tracking which nodes are part of the cluster, and deciding which shards
to allocate to which nodes. It is important for cluster health to have a stable master node.
Any master-eligible node (all nodes by default) may be elected to become the master node
by the master election process.
Master nodes must have access to the data/ directory (just like data nodes), as this is
where the cluster state is persisted between node restarts.
Indexing and searching your data is CPU-, memory-, and I/O-intensive work which can put
pressure on a node’s resources. To ensure that your master node is stable and not under
pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-
eligible nodes and dedicated data nodes.
While master nodes can also behave as coordinating nodes and route search and indexing
requests from clients to data nodes, it is better not to use dedicated master nodes for this
purpose. It is important for the stability of the cluster that master-eligible nodes do as little
work as possible.
To create a dedicated master-eligible node, set:
node.master: true 1
node.data: false 2
node.ingest: false 3
1 - The node.master role is enabled by default. 2 - Disable the node.data role (enabled
by default). 3 - Disable the node.ingest role (enabled by default).
To explain, imagine that you have a cluster consisting of two master-eligible nodes. A
network failure breaks communication between these two nodes. Each node sees one
master-eligible node… itself. With minimum_master_nodes set to the default of 1, each
side can elect its own master, leaving you with two independent clusters: a split brain.
Now imagine that you have a cluster with three master-eligible nodes, and
minimum_master_nodes set to 2. If a network split separates one node from the other
two nodes, the side with one node cannot see enough master-eligible nodes and will realise
that it cannot elect itself as master. The side with two nodes will elect a new master (if
needed) and continue functioning correctly. As soon as the network split is resolved, the
single node will rejoin the cluster and start serving requests again.
This setting should be set to a quorum of master-eligible nodes:
(master_eligible_nodes / 2) + 1
In other words, if there are three master-eligible nodes, then minimum master nodes
should be set to (3 / 2) + 1 or 2:
discovery.zen.minimum_master_nodes: 2 1
1 - Defaults to 1.
This setting can also be changed dynamically on a live cluster with the cluster update
settings API:
PUT _cluster/settings
{
"transient": {
"discovery.zen.minimum_master_nodes": 2
}
}
Data Node
Data nodes hold the shards that contain the documents you have indexed. Data nodes
handle data related operations like CRUD, search, and aggregations. These operations are
I/O-, memory-, and CPU-intensive.
The main benefit of having dedicated data nodes is the separation of the master and data
roles.
To create a dedicated data node, set:
node.master: false 1
node.data: true 2
node.ingest: false 3
1 - Disable the node.master role (enabled by default). 2 - The node.data role is enabled
by default. 3 - Disable the node.ingest role (enabled by default).
Ingest Node
Ingest nodes can execute pre-processing pipelines, composed of one or more ingest
processors. Depending on the type of operations performed by the ingest processors and
the required resources, it may make sense to have dedicated ingest nodes, that will only
perform this specific task.
To create a dedicated ingest node, set:
node.master: false 1
node.data: false 2
node.ingest: true 3
1 - Disable the node.master role (enabled by default). 2 - Disable the node.data role
(enabled by default). 3 - The node.ingest role is enabled by default.
Coordinating Only Node
If you take away the ability to handle master duties, to hold data, and to pre-process
documents, then you are left with a coordinating node that can only route requests, handle
the search reduce phase, and distribute bulk indexing. Essentially, coordinating only nodes
behave as smart load balancers.
Coordinating only nodes can benefit large clusters by offloading the coordinating node role
from data and master-eligible nodes. They join the cluster and receive the full cluster
state, like every other node, and they use the cluster state to route requests directly to the
appropriate place(s).
To create a dedicated coordinating only node, set:
node.master: false 1
node.data: false 2
node.ingest: false 3
1 - Disable the node.master role (enabled by default). 2 - Disable the node.data role
(enabled by default). 3 - Disable the node.ingest role (enabled by default).
path.data
Every data and master-eligible node requires access to a data directory where shards and
index and cluster metadata will be stored. The path.data defaults to $ES_HOME/data
but can be configured in the ngStorage.yml config file as an absolute path or a path
relative to $ES_HOME as follows:
path.data: /var/ngStorage/data
Like all node settings, it can also be specified on the command line as:
./bin/ngStorage -Epath.data=/var/ngStorage/data
node.max_local_storage_nodes
The data path can be shared by multiple nodes, even by nodes from different clusters. This
is very useful for testing failover and different configurations on your development
machine. In production, however, it is recommended to run only one node of NG|Storage
per server.
To prevent more than one node from sharing the same data path, add this setting to the
ngStorage.yml config file:
node.max_local_storage_nodes: 1
Never run different node types (i.e. master, data) from the same data
directory. This can lead to unexpected data loss.
More node settings can be found in Modules. Of particular note are the cluster.name,
the node.name and the network settings.
42.6. Thread Pool
A node holds several thread pools in order to improve how thread memory consumption is
managed within a node. Many of these pools also have queues associated with them,
which allow pending requests to be held instead of discarded.
There are several thread pools, but the important ones include:
generic
For generic operations (e.g., background node discovery). Thread pool type is
scaling.
index
For index/delete operations. Thread pool type is fixed with a size of # of
available processors, queue_size of 200. The maximum size for this pool is 1
+ # of available processors.
search
For count/search/suggest operations. Thread pool type is fixed with a size of
int((# of available_processors * 3) / 2) + 1, queue_size of 1000.
get
For get operations. Thread pool type is fixed with a size of # of available
processors, queue_size of 1000.
bulk
For bulk operations. Thread pool type is fixed with a size of # of available
processors, queue_size of 50.
percolate
For percolate operations. Thread pool type is fixed with a size of # of available
processors, queue_size of 1000.
snapshot
For snapshot/restore operations. Thread pool type is scaling with a keep-alive of 5m
and a max of min(5, (# of available processors)/2).
warmer
For segment warm-up operations. Thread pool type is scaling with a keep-alive of
5m and a max of min(5, (# of available processors)/2).
refresh
For refresh operations. Thread pool type is scaling with a keep-alive of 5m and a
max of min(10, (# of available processors)/2).
listener
Mainly for Java client action execution when listener threading is enabled (threaded
set to true). Thread pool type is scaling with a default max of min(10, (# of
available processors)/2).
Changing a specific thread pool can be done by setting its type-specific parameters; for
example, changing the index thread pool to have more threads:
thread_pool:
index:
size: 30
The following are the types of thread pools and their respective parameters:
fixed
The fixed thread pool holds a fixed size of threads to handle the requests with a queue
(optionally bounded) for pending requests that have no threads to service them.
The size parameter controls the number of threads, and defaults to the number of cores
times 5.
thread_pool:
index:
size: 30
queue_size: 1000
scaling
The scaling thread pool holds a dynamic number of threads. This number is proportional
to the workload and varies between the value of the core and max parameters.
The keep_alive parameter determines how long a thread should be kept around in the
thread pool without it doing any work.
thread_pool:
warmer:
core: 1
max: 8
keep_alive: 2m
Processors setting
The number of processors is automatically detected, and the thread pool settings are
automatically set based on it. Sometimes the number of processors is wrongly detected;
in such cases, the number of processors can be explicitly set using the processors
setting.
In order to check the number of processors detected, use the nodes info API with the os
flag.
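For example, to override the detected value in ngStorage.yml (the value 2 is illustrative):

```yaml
processors: 2
```

The detected value can then be verified with the nodes info API, e.g. GET /_nodes/os.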
42.7. Transport
The transport module is used for internal communication between nodes within the cluster.
Each call that goes from one node to the other uses the transport module (for example,
when an HTTP GET request is processed by one node, and should actually be processed by
another node that holds the data).
TCP Transport
The TCP transport is an implementation of the transport module using TCP. It allows for
the following settings:
Setting Description
transport.tcp.port A bind port range. Defaults to 9300-9400.
transport.publish_port The port that other nodes in the cluster
should use when communicating with this
node. Useful when a cluster node is behind a
proxy or firewall and the
transport.tcp.port is not directly
addressable from the outside. Defaults to the
actual port assigned via
transport.tcp.port.
transport.bind_host The host address to bind the transport
service to. Defaults to transport.host (if
set) or network.bind_host.
transport.publish_host The host address to publish for nodes in the
cluster to connect to. Defaults to
transport.host (if set) or
network.publish_host.
transport.host Used to set the transport.bind_host
and the transport.publish_host.
Defaults to network.host.
transport.tcp.connect_timeout The socket connect timeout setting (in time
setting format). Defaults to 30s.
transport.tcp.compress Set to true to enable compression (LZF)
between all nodes. Defaults to false.
transport.ping_schedule Schedule a regular ping message to ensure
that connections are kept alive. Defaults to
5s in the transport client and -1 (disabled)
elsewhere.
NG|Storage allows you to bind to multiple ports on different interfaces by the use of
transport profiles.
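A sketch of such a profile configuration in ngStorage.yml (the profile names, port ranges, and addresses are illustrative):

```yaml
transport.profiles.default.port: 9300-9400
transport.profiles.default.bind_host: 10.0.0.1
transport.profiles.client.port: 9500-9600
transport.profiles.client.bind_host: 192.168.0.1
```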
The default profile is special: it is used as a fallback for any other profile that does not
have a specific configuration setting set. Note that the default profile is also how other
nodes in the cluster will usually connect to this node. In the future this feature will allow
enabling node-to-node communication via multiple interfaces.
Local Transport
This is a handy transport to use when running integration tests within the JVM. It is
automatically enabled when using NodeBuilder#local(true).
Transport Tracer
The transport module has a dedicated tracer logger which, when activated, logs incoming
and outgoing requests. The log can be dynamically activated by setting the level of the
transport.tracer logger to TRACE.
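Assuming the logger level can be changed through the cluster update settings API shown earlier (the exact logger setting key is an assumption based on the transport.tracer name), activation might look like:

```
PUT _cluster/settings
{
  "transient": {
    "logger.transport.tracer": "TRACE"
  }
}
```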
You can also control which actions will be traced, using a set of include and exclude
wildcard patterns.
42.8. Tribe
The tribes feature allows a tribe node to act as a federated client across multiple clusters.
The tribe node works by retrieving the cluster state from all connected clusters and
merging them into a global cluster state. With this information at hand, it is able to perform
read and write operations against the nodes in all clusters as if they were local. Note that a
tribe node needs to be able to connect to each single node in every configured cluster.
The ngStorage.yml config file for a tribe node just needs to list the clusters that should
be joined, for instance:
tribe:
  t1:
    cluster.name: cluster_one
  t2:
    cluster.name: cluster_two
The example above configures connections to two clusters, named t1 and t2 respectively.
The tribe node will create a node client to connect each cluster using unicast discovery by
default. Any other settings for the connection can be configured under tribe.{name}, just
like the cluster.name in the example.
The merged global cluster state means that almost all operations work in the same way as
a single cluster: distributed search, suggest, percolation, indexing, etc.
• The merged view cannot handle indices with the same name in multiple clusters. By
default it will pick one of them, see later for on_conflict options.
• Master level read operations (eg Cluster State, Cluster Health) will automatically
execute with a local flag set to true since there is no master.
The tribe node can be configured to block all write operations and all metadata operations
with:
tribe:
blocks:
write: true
metadata: true
Write and metadata blocks can also be limited to specific indices:
tribe:
blocks:
write.indices: hk*,ldn*
metadata.indices: hk*,ldn*
When there is a conflict and multiple clusters hold the same index, by default the tribe node
will pick one of them. This can be configured using the tribe.on_conflict setting. It
defaults to any, but can be set to drop (drop indices that have a conflict), or
prefer_[tribeName] to prefer the index from a specific tribe.
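For example, to prefer indices from the tribe named t1 (as configured above) whenever a conflict occurs:

```yaml
tribe:
  on_conflict: prefer_t1
```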
The tribe node starts a node client for each listed cluster. The following configuration
options are passed down from the tribe node to each node client:
• network.host
• network.bind_host
• network.publish_host
• transport.host
• transport.bind_host
• transport.publish_host
• path.home
• path.conf
• path.logs
• path.scripts
Almost any setting (except for path.*) may be configured at the node client level itself, in
which case it will override any passed through setting from the tribe node. Settings you
may want to set at the node client level include:
• network.host
• network.bind_host
• network.publish_host
• transport.host
• transport.bind_host
• transport.publish_host
• cluster.name
• discovery.zen.ping.unicast.hosts
path.scripts: some/path/to/config 1
network.host: 192.168.1.5 2
tribe:
t1:
cluster.name: cluster_one
t2:
cluster.name: cluster_two
network.host: 10.1.2.3 3
1 - Passed down from the tribe node to each node client. 2 - The network.host value
inherited by both node clients. 3 - Overrides the inherited network.host for the t2 node
client only.
Query DSL
NG|Storage provides a full Query DSL based on JSON to define queries. Think of the Query
DSL as an AST of queries, consisting of two types of clauses:
Leaf query clauses look for a particular value in a particular field, such as the match,
term or range queries. These queries can be used by themselves.
Compound query clauses wrap other leaf or compound queries and are used to
combine multiple queries in a logical fashion (such as the bool or dis_max query),
or to alter their behaviour (such as the constant_score query).
Query clauses behave differently depending on whether they are used in query context or
filter context.
bool query
A query that matches documents matching boolean combinations of other queries. The bool
query maps to Lucene BooleanQuery. It is built using one or more boolean clauses, each
clause with a typed occurrence. The occurrence types are:
Occur Description
must The clause (query) must appear in matching
documents and will contribute to the score.
filter The clause (query) must appear in matching
documents. However unlike must the score
of the query will be ignored.
should The clause (query) should appear in the
matching document. In a boolean query with
no must or filter clauses, one or more
should clauses must match a document.
The minimum number of should clauses to
match can be set using the
minimum_should_match parameter.
must_not The clause (query) must not appear in the
matching documents.
If this query is used in a filter context and it has should clauses then at
least one should clause is required to match.
The bool query also supports the disable_coord parameter (defaults to false). Basically
the coord similarity computes a score factor based on the fraction of all query terms that a
document contains. See Lucene BooleanQuery for more details.
The bool query takes a more-matches-is-better approach, so the score from each
matching must or should clause will be added together to provide the final _score for
each document.
Queries specified under the filter element have no effect on scoring: scores are
returned as 0. Scores are only affected by the query that has been specified. For instance,
all three of the following queries return all documents where the status field contains the
term active.
This first query assigns a score of 0 to all documents, as no scoring query has been
specified:
GET _search
{
"query": {
"bool": {
"filter": {
"term": {
"status": "active"
}
}
}
}
}
This bool query has a match_all query, which assigns a score of 1.0 to all documents.
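That second query, in which match_all provides the score and the term query acts as a filter, can be written as:

```
GET _search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "status": "active"
        }
      }
    }
  }
}
```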
This constant_score query behaves in exactly the same way as the second example
above. The constant_score query assigns a score of 1.0 to all documents matched by
the filter.
GET _search
{
"query": {
"constant_score": {
"filter": {
"term": {
"status": "active"
}
}
}
}
}
If you need to know which of the clauses in the bool query matched the documents returned
from the query, you can use named queries to assign a name to each clause.
The boosting query can be used to effectively demote results that match a given query.
Unlike the "NOT" clause in bool query, this still selects documents that contain undesirable
terms, but reduces their overall score.
The common terms query is a modern alternative to stopwords which improves the
precision and recall of search results (by taking stopwords into account), without sacrificing
performance.
The problem
Every term in a query has a cost. A search for "The brown fox" requires three term
queries, one for each of "the", "brown" and "fox", all of which are executed against all
documents in the index. The query for "the" is likely to match many documents and thus
has a much smaller impact on relevance than the other two terms.
Previously, the solution to this problem was to ignore terms with high frequency. By
treating "the" as a stopword, we reduce the index size and reduce the number of term
queries that need to be executed.
The problem with this approach is that, while stopwords have a small impact on relevance,
they are still important. If we remove stopwords, we lose precision, (eg we are unable to
distinguish between "happy" and "not happy") and we lose recall (eg text like "The
The" or "To be or not to be" would simply not exist in the index).
The solution
The common terms query divides the query terms into two groups: more important (ie low
frequency terms) and less important (ie high frequency terms which would previously have
been stopwords).
First it searches for documents which match the more important terms. These are the
terms which appear in fewer documents and have a greater impact on relevance.
terms which appear in fewer documents and have a greater impact on relevance.
Then, it executes a second query for the less important terms: terms which appear
frequently and have a low impact on relevance. But instead of calculating the relevance
score for all matching documents, it only calculates the _score for documents already
matched by the first query. In this way the high frequency terms can improve the relevance
calculation without paying the cost of poor performance.
If a query consists only of high frequency terms, then a single query is executed as an AND
(conjunction) query, in other words all terms are required. Even though each individual
term will match many documents, the combination of terms narrows down the resultset to
only the most relevant. The single query can also be executed as an OR with a specific
minimum_should_match, in this case a high enough value should probably be used.
Terms are allocated to the high or low frequency groups based on the
cutoff_frequency, which can be specified as an absolute frequency (>=1) or as a
relative frequency (0.0 .. 1.0). (Remember that document frequencies are computed on
a per-shard level, as explained in the blog post Relevance is broken.)
Perhaps the most interesting property of this query is that it adapts to domain specific
stopwords automatically. For example, on a video hosting site, common terms like "clip"
or "video" will automatically behave as stopwords without the need to maintain a manual
list.
Examples
In this example, words that have a document frequency greater than 0.1% (eg "this" and
"is") will be treated as common terms.
GET /_search
{
"query": {
"common": {
"body": {
"query": "this is bonsai cool",
"cutoff_frequency": 0.001
}
}
}
}
For low frequency terms, set the low_freq_operator to "and" to make all terms
required:
GET /_search
{
"query": {
"common": {
"body": {
"query": "nelly the elephant as a cartoon",
"cutoff_frequency": 0.001,
"low_freq_operator": "and"
}
}
}
}
This is roughly equivalent to:
GET /_search
{
"query": {
"bool": {
"must": [
{ "term": { "body": "nelly"}},
{ "term": { "body": "elephant"}},
{ "term": { "body": "cartoon"}}
],
"should": [
{ "term": { "body": "the"}},
{ "term": { "body": "as"}},
{ "term": { "body": "a"}}
]
}
}
}
Alternatively, minimum_should_match can be used to require a minimum number of low
frequency terms, which is roughly equivalent to:
GET /_search
{
"query": {
"bool": {
"must": {
"bool": {
"should": [
{ "term": { "body": "nelly"}},
{ "term": { "body": "elephant"}},
{ "term": { "body": "cartoon"}}
],
"minimum_should_match": 2
}
},
"should": [
{ "term": { "body": "the"}},
{ "term": { "body": "as"}},
{ "term": { "body": "a"}}
]
}
}
}
minimum_should_match
A different minimum_should_match can be applied for low and high frequency terms with
the additional low_freq and high_freq parameters. Here is an example when providing
additional parameters (note the change in structure):
GET /_search
{
"query": {
"bool": {
"must": {
"bool": {
"should": [
{ "term": { "body": "nelly"}},
{ "term": { "body": "elephant"}},
{ "term": { "body": "cartoon"}}
],
"minimum_should_match": 2
}
},
"should": {
"bool": {
"should": [
{ "term": { "body": "the"}},
{ "term": { "body": "not"}},
{ "term": { "body": "as"}},
{ "term": { "body": "a"}}
],
"minimum_should_match": 3
}
}
}
}
}
In this case it means the high frequency terms have only an impact on relevance when
there are at least three of them. But the most interesting use of the
minimum_should_match for high frequency terms is when there are only high frequency
terms:
GET /_search
{
"query": {
"bool": {
"should": [
{ "term": { "body": "how"}},
{ "term": { "body": "not"}},
{ "term": { "body": "to"}},
{ "term": { "body": "be"}}
],
"minimum_should_match": "3<50%"
}
}
}
The high frequency generated query is then slightly less restrictive than with an AND.
The common terms query also supports boost, analyzer and disable_coord as
parameters.
Compound queries wrap other compound or leaf queries, either to combine their results
and scores, to change their behaviour, or to switch from query to filter context.
constant_score query
A query which wraps another query, but executes it in filter context. All matching
documents are given the same "constant" _score.
bool query
The default query for combining multiple leaf or compound query clauses, as must,
should, must_not, or filter clauses. The must and should clauses have their
scores combined (the more matching clauses, the better), while the must_not
and filter clauses are executed in filter context.
dis_max query
A query which accepts multiple queries, and returns any documents which match any
of the query clauses. While the bool query combines the scores from all matching
queries, the dis_max query uses the score of the single best-matching query clause.
function_score query
Modify the scores returned by the main query with functions to take into account
factors like popularity, recency, distance, or custom algorithms implemented with
scripting.
boosting query
Return documents which match a positive query, but reduce the score of
documents which also match a negative query.
indices query
Execute one query for the specified indices, and another for other indices.
A query that wraps another query and simply returns a constant score equal to the query
boost for every document in the filter. Maps to Lucene ConstantScoreQuery.
GET /_search
{
"query": {
"constant_score" : {
"filter" : {
"term" : { "user" : "kimchy"}
},
"boost" : 1.2
}
}
}
A query that generates the union of documents produced by its subqueries, and that scores
each document with the maximum score for that document as produced by any subquery,
plus a tie breaking increment for any additional matching subqueries.
This is useful when searching for a word in multiple fields with different boost factors (so
that the fields cannot be combined equivalently into a single search field). We want the
primary score to be the one associated with the highest boost, not the sum of the field
scores (as Boolean Query would give). If the query is "albino elephant" this ensures that
"albino" matching one field and "elephant" matching another gets a higher score than
"albino" matching both fields. To get this result, use both Boolean Query and
DisjunctionMax Query: for each term a DisjunctionMaxQuery searches for it in each field,
while the set of these DisjunctionMaxQuery’s is combined into a BooleanQuery.
The tie breaker capability allows results that include the same term in multiple fields to be
judged better than results that include this term in only the best of those multiple fields,
without confusing this with the better case of two different terms in the multiple fields.
The default tie_breaker is 0.0.
GET /_search
{
"query": {
"dis_max" : {
"tie_breaker" : 0.7,
"boost" : 1.2,
"queries" : [
{
"term" : { "age" : 34 }
},
{
"term" : { "age" : 35 }
}
]
}
}
}
Returns documents that have at least one non-null value in the original field:
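The query itself takes the name of the field to check, for example:
GET /_search
{
  "query": {
    "exists" : { "field" : "user" }
  }
}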
For instance, these documents would all match the above query:
{ "user": "jane" }
{ "user": "" } 1
{ "user": "-" } 2
{ "user": ["jane"] }
{ "user": ["jane", null ] } 3
1 - An empty string is a non-null value. 2 - Even though the standard analyzer would
emit zero tokens, the original field is non-null. 3 - At least one non-null value is required.
These documents would not match the above query:
{ "user": null }
{ "user": [] } 1
{ "user": [null] } 2
{ "foo": "bar" } 3
1 - This field has no values. 2 - At least one non-null value is required. 3 - The user
field is missing completely.
null_value mapping
If the field mapping includes the null_value setting then explicit null values are
replaced with the specified null_value. For instance, if the user field were mapped as
follows:
"user": {
"type": "text",
"null_value": "_null_"
}
then explicit null values would be indexed as the string _null_, and the following docs
would match the exists filter:
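For example, both of these documents would then contain the indexed value _null_:
{ "user": null }
{ "user": [null] }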
However, these docs, without explicit null values, would still have no values in the
user field and thus would not match the exists filter:
{ "user": [] }
{ "foo": "bar" }
missing query
GET /_search
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "user"
}
}
}
}
}
This query returns documents that have no value in the user field.
The high-level full text queries are usually used for running full text queries on full text
fields like the body of an email. They understand how the field being queried is analyzed
and will apply each field’s analyzer (or search_analyzer) to the query string before
executing.
match query
The standard query for performing full text queries, including fuzzy matching and
phrase or proximity queries.
match_phrase query
Like the match query but used for matching exact phrases or word proximity
match_phrase_prefix query
The poor man’s search-as-you-type. Like the match_phrase query, but does a
wildcard search on the final word.
multi_match query
common_terms query
query_string query
Supports the compact Lucene query string syntax, allowing you to specify
AND|OR|NOT conditions and multi-field search within a single query string. For
expert users only.
simple_query_string
A simpler, more robust version of the query_string syntax suitable for exposing
directly to users.
The function_score allows you to modify the score of documents that are retrieved by a
query. This can be useful if, for example, a score function is computationally expensive and
it is sufficient to compute the score on a filtered set of documents.
To use function_score, the user has to define a query and one or more functions that
compute a new score for each document returned by the query.
GET /_search
{
"query": {
"function_score": {
"query": {},
"boost": "5",
"random_score": {}, 1
"boost_mode":"multiply"
}
}
}
Furthermore, several functions can be combined. In this case one can optionally choose to
apply the function only if a document matches a given filtering query.
GET /_search
{
"query": {
"function_score": {
"query": {},
"boost": "5", 1
"functions": [
{
"filter": {},
"random_score": {}, 2
"weight": 23
},
{
"filter": {},
"weight": 42
}
],
"max_boost": 42,
"score_mode": "max",
"boost_mode": "multiply",
"min_score" : 42
}
}
}
The scores produced by the filtering query of each function do not matter.
For more information please refer to the source ElasticSearch reference documentation
chapter.
NG|Storage supports two types of geo data: geo_point fields which support lat/lon pairs,
and geo_shape fields, which support points, lines, circles, polygons, multi-polygons etc.
The queries in this group, which filter documents based on mapping and location, are:
• Geo Queries
• GeoShape Query
For more information please refer to the source ElasticSearch reference documentation
chapter.
The has_child filter accepts a query and the child type to run against, and results in
parent documents that have child docs matching the query. Here is an example:
GET /_search
{
"query": {
"has_child" : {
"type" : "blog_tag",
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
}
Scoring capabilities
The has_child query also has scoring support. The supported score modes are min, max, sum,
avg or none. The default is none and yields the same behaviour as in previous versions. If
the score mode is set to another value than none, the scores of all the matching child
documents are aggregated into the associated parent documents. The score type can be
specified with the score_mode field inside the has_child query:
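For example, to aggregate child scores using the min mode:
GET /_search
{
  "query": {
    "has_child" : {
      "type" : "blog_tag",
      "score_mode" : "min",
      "query" : {
        "term" : {
          "tag" : "something"
        }
      }
    }
  }
}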
Min/Max Children
The has_child query allows you to specify that a minimum and/or maximum number of
children are required to match for the parent doc to be considered a match:
GET /_search
{
"query": {
"has_child" : {
"type" : "blog_tag",
"score_mode" : "min",
"min_children": 2,
"max_children": 10,
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
}
Ignore Unmapped
When set to true the ignore_unmapped option will ignore an unmapped type and will
not match any documents for this query. This can be useful when querying multiple indexes
which might have different mappings. When set to false (the default value) the query will
throw an exception if the type is not mapped.
The has_parent query accepts a query and a parent type. The query is executed in the
parent document space, which is specified by the parent type. This query returns child
documents whose associated parents have matched. Otherwise the has_parent query has
the same options and works in the same manner as the has_child query.
GET /_search
{
"query": {
"has_parent" : {
"parent_type" : "blog",
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
}
Scoring capabilities
The has_parent query also has scoring support. The default is false, which ignores the
score from the parent document. In this case the score is equal to the boost on the
has_parent query (defaults to 1). If the score is set to true, then the score of the matching parent
document is aggregated into the child documents belonging to the matching parent
document. The score mode can be specified with the score field inside the has_parent
query:
GET /_search
{
"query": {
"has_parent" : {
"parent_type" : "blog",
"score" : true,
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
}
Ignore Unmapped
When set to true the ignore_unmapped option will ignore an unmapped type and will
Chapter 43. Query DSL | 603
NG|Storage Admin Guide
not match any documents for this query. This can be useful when querying multiple indexes
which might have different mappings. When set to false (the default value) the query will
throw an exception if the type is not mapped.
Filters documents that only have the provided ids. Note, this query uses the _uid field.
GET /_search
{
"query": {
"ids" : {
"type" : "my_type",
"values" : ["1", "4", "100"]
}
}
}
The type is optional and can be omitted, and can also accept an array of values. If no type
is specified, all types defined in the index mapping are tried.
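For example, omitting the type:
GET /_search
{
  "query": {
    "ids" : {
      "values" : ["1", "4", "100"]
    }
  }
}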
Indices Query
The indices query is useful in cases where a search is executed across multiple indices.
It allows to specify a list of index names and an inner query that is only executed for indices
matching names on that list. For other indices that are searched, but that don’t match
entries on the list, the alternative no_match_query is executed.
GET /_search
{
"query": {
"indices" : {
"indices" : ["index1", "index2"],
"query" : { "term" : { "tag" : "wow" } },
"no_match_query" : { "term" : { "tag" : "kow" } }
}
}
}
no_match_query can also accept the string values none (to match no documents) and
all (to match all documents). Defaults to all.
nested query
Documents may contains fields of type nested. These fields are used to index arrays
of objects, where each object can be queried (with the nested query) as an
independent document.
A parent-child relationship can exist between two document types within a single
index. The has_child query returns parent documents whose child documents
match the specified query, while the has_parent query returns child documents
whose parent document matches the specified query.
Also see the terms-lookup mechanism in the terms query, which allows you to build a
terms query from values contained in another document.
The simplest query; it matches all documents, giving them all a _score of 1.0.
GET /_search
{
"query": {
"match_all": {}
}
}
GET /_search
{
"query": {
"match_all": { "boost" : 1.2 }
}
}
GET /_search
{
"query": {
"match_phrase_prefix" : {
"message" : "quick brown f"
}
}
}
It accepts the same parameters as the phrase type. In addition, it also accepts a
max_expansions parameter (default 50) that can control to how many prefixes the last
term will be expanded. It is highly recommended to set it to an acceptable value to control
the execution time of the query. For example:
GET /_search
{
"query": {
"match_phrase_prefix" : {
"message" : {
"query" : "quick brown f",
"max_expansions" : 10
}
}
}
}
Consider the query string quick brown f. This query works by creating
a phrase query out of quick and brown (i.e. the term quick must exist
and must be followed by the term brown). Then it looks at the sorted term
dictionary to find the first 50 terms that begin with f, and adds these terms
to the phrase query.
The problem is that the first 50 terms may not include the term fox so the
phrase quick brown fox will not be found. This usually isn’t a problem
as the user will continue to type more letters until the word they are
looking for appears.
The match_phrase query analyzes the text and creates a phrase query out of the
analyzed text. For example:
GET /_search
{
"query": {
"match_phrase" : {
"message" : "this is a test"
}
}
}
A phrase query matches terms up to a configurable slop (which defaults to 0) in any order.
Transposed terms have a slop of 2.
The analyzer can be set to control which analyzer will perform the analysis process on
the text. It defaults to the field explicit mapping definition, or the default search analyzer,
for example:
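A sketch of setting the analyzer explicitly (my_analyzer is a placeholder for an analyzer
defined in your index):
GET /_search
{
  "query": {
    "match_phrase" : {
      "message" : {
        "query" : "this is a test",
        "analyzer" : "my_analyzer"
      }
    }
  }
}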
match queries accept text/numerics/dates, analyzes them, and constructs a query. For
example:
GET /_search
{
"query": {
"match" : {
"message" : "this is a test"
}
}
}
Note, message is the name of a field, you can substitute the name of any field (including
_all) instead.
match
The match query is of type boolean. It means that the text provided is analyzed and the
analysis process constructs a boolean query from the provided text. The operator flag
can be set to or or and to control the boolean clauses (defaults to or). The minimum
number of optional should clauses to match can be set using the
minimum_should_match parameter.
The analyzer can be set to control which analyzer will perform the analysis process on
the text. It defaults to the field explicit mapping definition, or the default search analyzer.
The lenient parameter can be set to true to ignore exceptions caused by data-type
mismatches, such as trying to query a numeric field with a text query string. Defaults to
false.
Fuzziness
The prefix_length and max_expansions can be set in this case to control the fuzzy
process. If the fuzzy option is set, the query will use
top_terms_blended_freqs_${max_expansions} as its rewrite method; the
fuzzy_rewrite parameter allows you to control how the query will get rewritten.
Fuzzy transpositions (ab → ba) are allowed by default but can be disabled by setting
fuzzy_transpositions to false.
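For example, enabling fuzzy matching on a match query:
GET /_search
{
  "query": {
    "match" : {
      "message" : {
        "query" : "this is a testt",
        "fuzziness" : "AUTO"
      }
    }
  }
}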
Here is an example when providing additional parameters (note the slight change in
structure, message is the field name):
GET /_search
{
"query": {
"match" : {
"message" : {
"query" : "this is a test",
"operator" : "and"
}
}
}
}
If the analyzer used removes all tokens in a query like a stop filter does, the default
behavior is to match no documents at all. In order to change that the zero_terms_query
option can be used, which accepts none (default) and all which corresponds to a
match_all query.
GET /_search
{
"query": {
"match" : {
"message" : {
"query" : "to be or not to be",
"operator" : "and",
"zero_terms_query": "all"
}
}
}
}
Cutoff frequency
This query allows handling stopwords dynamically at runtime, is domain independent and
doesn’t require a stopword file. It prevents scoring / iterating high frequency terms and
only takes the terms into account if a more significant / lower frequency term matches a
document. Yet, if all of the query terms are above the given cutoff_frequency the query
is automatically transformed into a pure conjunction (and) query to ensure fast execution.
The cutoff_frequency can either be relative to the total number of documents if in the
range [0..1) or absolute if greater or equal to 1.0.
GET /_search
{
"query": {
"match" : {
"message" : {
"query" : "to be or not to be",
"cutoff_frequency" : 0.001
}
}
}
}
The match family of queries does not go through a "query parsing" process. It does not
support field name prefixes, wildcard characters, or other "advanced" features. For
this reason, the chances of it failing are very small / non-existent, and it provides
excellent behavior when it comes to simply analyzing and running text as a query
(which is usually what a text search box does). Also, the phrase_prefix type can
provide a great "as you type" behavior to automatically load search results.
NOTE:
When dealing with percentages, negative values can be used to get different behavior in
edge cases. 75% and -25% mean the same thing when dealing with 4 clauses, but when
dealing with 5 clauses 75% means 3 are required, but -25% means 4 are required.
If the calculations based on the specification determine that no optional clauses are
needed, the usual rules about BooleanQueries still apply at search time (a BooleanQuery
containing no required clauses must still match at least one optional clause)
No matter what number the calculation arrives at, a value greater than the number of
optional clauses, or a value less than 1, will never be used. (i.e. no matter how low or how
high the result of the calculation is, the minimum number of required matches will
never be lower than 1 or greater than the number of clauses.)
The More Like This Query (MLT Query) finds documents that are "like" a given set of
documents. In order to do so, MLT selects a set of representative terms of these input
documents, forms a query using these terms, executes the query and returns the results.
The user controls the input documents, how the terms should be selected and how the
query is formed. The shortened name mlt is deprecated since 5.0.0; use
more_like_this instead.
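As a brief sketch, a typical more_like_this request takes free text and/or documents as
input (the field names here are illustrative):
GET /_search
{
  "query": {
    "more_like_this" : {
      "fields" : ["title", "description"],
      "like" : "Once upon a time",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}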
For more information please refer to the source ElasticSearch reference documentation
chapter.
The multi_match query builds on the match query to allow multi-field queries:
GET /_search
{
"query": {
"multi_match" : {
"query": "this is a test", 1
"fields": [ "subject", "message" ] 2
}
}
}
1 - The query string. 2 - The fields to be queried.
GET /_search
{
"query": {
"multi_match" : {
"query": "Will Smith",
"fields": [ "title", "*_name" ] 1
}
}
}
1 - The query will run against the title field and any fields matching *_name.
GET /_search
{
"query": {
"multi_match" : {
"query" : "this is a test",
"fields" : [ "subject^3", "message" ] 1
}
}
}
1 - The subject field is three times as important as the message field.
best_fields
(default) Finds documents which match any field, but uses the _score from the best
field. See [type-best-fields].
most_fields
Finds documents which match any field and combines the _score from each field.
See [type-most-fields].
cross_fields
Treats fields with the same analyzer as though they were one big field. Looks for
each word in any field. See [type-cross-fields].
phrase
Runs a match_phrase query on each field and combines the _score from each
field. See [type-phrase].
phrase_prefix
Runs a match_phrase_prefix query on each field and combines the _score from
each field. See [type-phrase].
tie_breaker
By default, each per-term blended query will use the best score returned by any field in a
group, then these scores are added together to give the final score. The tie_breaker
parameter can change the default behaviour of the per-term blended queries. It accepts:
0.0
Take the single best score out of (eg) first_name:will and last_name:will
(default)
1.0
Add together the scores for (eg) first_name:will and last_name:will
0.0 < n < 1.0
Take the single best score plus tie_breaker multiplied by each of the scores from
other matching fields
For more information please refer to the source ElasticSearch reference documentation
chapter.
Queries such as wildcard and prefix are called multi term queries and end up going
through a process of rewrite. This also happens with the query_string query. All of these
queries allow you to control how they will get rewritten using the rewrite parameter:
• scoring_boolean: A rewrite method that first translates each term into a should
clause in a boolean query, and keeps the scores as computed by the query. Note that
typically such scores are meaningless to the user, and require non-trivial CPU to
compute, so it’s almost always better to use constant_score_auto. This rewrite
method will hit too many clauses failure if it exceeds the boolean query limit (defaults to
1024).
• top_terms_N: A rewrite method that first translates each term into should clause in
boolean query, and keeps the scores as computed by the query. This rewrite method
only uses the top scoring terms so it will not overflow boolean max clause count. The N
controls the size of the top scoring terms to use.
• top_terms_boost_N: A rewrite method that first translates each term into should
clause in boolean query, but the scores are only computed as the boost. This rewrite
method only uses the top scoring terms so it will not overflow the boolean max clause
count. The N controls the size of the top scoring terms to use.
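As an illustrative sketch, a wildcard query using the top_terms_10 rewrite method (the
field name and pattern are examples):
GET /_search
{
  "query": {
    "wildcard" : {
      "user" : {
        "value" : "ki*",
        "rewrite" : "top_terms_10"
      }
    }
  }
}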
Nested query allows to query nested objects / docs (see nested mapping). The query is
executed against the nested objects / docs as if they were indexed as separate docs (they
are, internally) and resulting in the root parent doc (or parent nested mapping). Here is a
sample mapping we will work with:
PUT /my_index
{
"mappings": {
"type1" : {
"properties" : {
"obj1" : {
"type" : "nested"
}
}
}
}
}
GET /_search
{
"query": {
"nested" : {
"path" : "obj1",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : [
{ "match" : {"obj1.name" : "blue"} },
{ "range" : {"obj1.count" : {"gt" : 5}} }
]
}
}
}
}
}
The query path points to the nested object path, and the query includes the query that will
run on the nested docs matching the direct path, and joining with the root parent docs. Note
that any fields referenced inside the query must use the complete path (fully qualified).
The score_mode allows to set how inner children matching affects scoring of parent. It
defaults to avg, but can be sum, min, max and none.
There is also an ignore_unmapped option which, when set to true will ignore an
unmapped path and will not match any documents for this query. This can be useful when
querying multiple indexes which might have different mappings. When set to false (the
default value) the query will throw an exception if the path is not mapped.
Multi level nesting is automatically supported, and detected, resulting in an inner nested
query to automatically match the relevant nesting level (and not root) if it exists within
another nested query.
Added in 5.0.0.
The parent_id query can be used to find child documents which belong to a particular
parent. Given the following mapping definition:
PUT /my_index
{
"mappings": {
"blog_post": {
"properties": {
"name": {
"type": "keyword"
}
}
},
"blog_tag": {
"_parent": {
"type": "blog_post"
},
"_routing": {
"required": true
}
}
}
}
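With this mapping in place, child blog_tag documents of the blog_post with id 1 can be
found as follows:
GET /my_index/_search
{
  "query": {
    "parent_id" : {
      "type" : "blog_tag",
      "id" : "1"
    }
  }
}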
The above is functionally equivalent to using the following has_parent query, but
performs better as it does not need to do a join:
GET /my_index/_search
{
"query": {
"has_parent": {
"type": "blog_post",
"query": {
"term": {
"_id": "1"
}
}
}
}
}
Parameters
type
The child type. This must be a type with _parent field.
id
The required parent id that selected child documents must refer to.
ignore_unmapped
When set to true this will ignore an unmapped type and will not match any
documents for this query. This can be useful when querying multiple indexes which
might have different mappings. When set to false (the default value) the query will
throw an exception if the type is not mapped.
The percolate query can be used to match queries stored in an index. The percolate
query itself contains the document that will be used as a query to match against the
stored queries.
Sample Usage
PUT /my-index
{
"mappings": {
"doctype": {
"properties": {
"message": {
"type": "string"
}
}
},
"queries": {
"properties": {
"query": {
"type": "percolator"
}
}
}
}
}
The doctype mapping is the mapping used to preprocess the document defined in the
percolator query before it gets indexed into a temporary index.
The queries mapping is the mapping used for indexing the query documents. The query
field will hold a json object that represents an actual NG|Storage query. The query field
has been configured to use the percolator field type. This field type understands the query
dsl and stores the query in such a way that it can be used later on to match documents
defined on the percolate query.
PUT /my-index/queries/1
{
"query" : {
"match" : {
"message" : "bonsai tree"
}
}
}
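A document can now be percolated against the stored query. The following request (the
sample message text is illustrative) produces the response shown below:
GET /my-index/_search
{
  "query" : {
    "percolate" : {
      "field" : "query",
      "document_type" : "doctype",
      "document" : {
        "message" : "A new bonsai tree in the office"
      }
    }
  }
}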
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5716521,
"hits": [
{ 1
"_index": "my-index",
"_type": "queries",
"_id": "1",
"_score": 0.5716521,
"_source": {
"query": {
"match": {
"message": "bonsai tree"
}
}
}
}
]
}
}
Parameters
field
The field of type percolator that holds the indexed queries. This is a required
parameter.
document_type
The type / mapping of the document being percolated. This is a required parameter.
document
The source of the document being percolated.
Instead of specifying the source of the document being percolated, the source can also be
retrieved from an already stored document. The percolate query will then internally
execute a get request to fetch that document.
In that case the document parameter can be substituted with the following parameters:
index
The index the document resides in. This is a required parameter.
type
The type of the document to fetch. This is a required parameter.
id
The id of the document to fetch. This is a required parameter.
routing
Optionally, routing to be used to fetch document to percolate.
preference
Optionally, preference to be used to fetch document to percolate.
version
Optionally, the expected version of the document to be fetched.
In order to percolate a newly indexed document, the percolate query can be used. Based
on the response from an index request, the _id and other meta information can be used to
immediately percolate the newly added document.
Example
Index response:
{
"_index": "my-index",
"_type": "message",
"_id": "1",
"_version": 1,
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": true
}
Percolating an existing document, using the index response as basis to build to new search
request:
GET /my-index/_search
{
"query" : {
"percolate" : {
"field": "query",
"document_type" : "doctype",
"index" : "my-index",
"type" : "message",
"id" : "1",
"version" : 1 1
}
}
}
1 - The version is optional, but useful in certain cases. We can then ensure that we are
trying to percolate the document we have just indexed. If a change has been made after
indexing, the search request will fail with a version conflict error.
The percolate query is handled in a special way when it comes to highlighting. The
queries hits are used to highlight the document that is provided in the percolate query.
Example
Save a query:
PUT /my-index/queries/1
{
"query" : {
"match" : {
"message" : "brown fox"
}
}
}
PUT /my-index/queries/2
{
"query" : {
"match" : {
"message" : "lazy dog"
}
}
}
Execute a search request with the percolate query and highlighting enabled:
GET /my-index/_search
{
"query" : {
"percolate" : {
"field": "query",
"document_type" : "doctype",
"document" : {
"message" : "The quick brown fox jumps over the lazy dog"
}
}
},
"highlight": {
"fields": {
"message": {}
}
}
}
Instead of the query in the search request highlighting the percolator hits, the percolator
queries are highlighting the document defined in the percolate query.
How it Works Under the Hood
When indexing a document into an index that has the percolator field type mapping
configured, the query part of the documents gets parsed into a Lucene query and are stored
into the Lucene index. A binary representation of the query gets stored, but also the query’s
terms are analyzed and stored into an indexed field.
At search time, the document specified in the request gets parsed into a Lucene document
and is stored in an in-memory temporary Lucene index. This in-memory index can hold just
this one document and is optimized for that. After this, a special query is built based on
the terms in the in-memory index to select candidate percolator queries based on their
indexed query terms. These candidate queries are then evaluated against the in-memory
index to verify whether they actually match.
GET /_search
{
"query": {
"term" : {
"query.extraction_result" : "failed"
}
}
}
Matches documents that have fields containing terms with a specified prefix (not analyzed).
GET /_search
{
  "query": {
    "prefix" : { "user" : "ki" }
  }
}
GET /_search
{
  "query": {
    "prefix" : { "user" : { "value" : "ki", "boost" : 2.0 } }
  }
}
Or:
GET /_search
{
  "query": {
    "prefix" : { "user" : { "prefix" : "ki", "boost" : 2.0 } }
  }
}
This multi term query allows you to control how it gets rewritten using the rewrite
parameter.
The behaviour of a query clause depends on whether it is used in query context or in filter
context:
Query context
A query clause used in query context answers the question "How well does this
document match this query clause?" Besides deciding whether or not the document
matches, the query clause also calculates a _score representing how well the document
matches, relative to other documents.
Query context is in effect whenever a query clause is passed to a query parameter, such as
the query parameter in the search API.
Filter context
In filter context, a query clause answers the question "Does this document match this
query clause?" The answer is a simple Yes or No; no scores are calculated.
Filter context is in effect whenever a query clause is passed to a filter parameter, such
as the filter or must_not parameters in the bool query, the filter parameter in the
constant_score query, or the filter aggregation.
Below is an example of query clauses being used in query and filter context in the search
API. This query will match documents where all of the following conditions are met:
• The title field contains the word Search.
• The content field contains the word ngStorage.
• The status field contains the exact word published.
• The publish_date field contains a date from 1 Jan 2015 onwards.
GET /_search
{
"query": { 1
"bool": { 2
"must": [
{ "match": { "title": "Search" }}, 2
{ "match": { "content": "ngStorage" }} 2
],
"filter": [ 3
{ "term": { "status": "published" }}, 4
{ "range": { "publish_date": { "gte": "2015-01-01" }}} 4
]
}
}
}
1 - The query parameter indicates query context.
2 - The bool and two match clauses are used in query context, which means that they
are used to score how well each document matches.
3 - The filter parameter indicates filter context.
4 - The term and range clauses are used in filter context. They will filter out documents
which do not match, but they will not affect the score of matching documents.
• Use query clauses in query context for conditions which should affect the score of
matching documents (i.e. how well does the document match), and use all other query
clauses in filter context.
A query that uses a query parser in order to parse its content. Here is an example:
GET /_search
{
"query": {
"query_string" : {
"default_field" : "content",
"query" : "this AND that OR thus"
}
}
}
Parameter Description
query The actual query to be parsed. See Query
String Syntax.
default_field The default field for query terms if no prefix
field is specified. Defaults to the
index.query.default_field index
settings, which in turn defaults to _all.
default_operator The default operator used if no explicit
operator is specified. For example, with a
default operator of OR, the query capital
of Hungary is translated to capital OR
of OR Hungary, and with default operator
of AND, the same query is translated to
capital AND of AND Hungary. The
default value is OR.
analyzer The analyzer name used to analyze the query
string.
allow_leading_wildcard When set, * or ? are allowed as the first
character. Defaults to true.
lowercase_expanded_terms Whether terms of wildcard, prefix, fuzzy, and
range queries are to be automatically lower-
cased or not (since they are not analyzed).
Defaults to true.
When a multi term query is being generated, one can control how it gets rewritten using the
rewrite parameter.
Default Field
When not explicitly specifying the field to search on in the query string syntax, the
index.query.default_field will be used to derive which field to search on. It defaults
to _all field.
Multi Field
The query_string query can also run against multiple fields. Fields can be provided via
the "fields" parameter (example below).
The idea of running the query_string query against multiple fields is to expand each
query term to an OR clause like this:
GET /_search
{
"query": {
"query_string" : {
"fields" : ["content", "name"],
"query" : "this AND that"
}
}
}
GET /_search
{
"query": {
"query_string": {
"query": "(content:this OR name:this) AND (content:that OR name:that)"
}
}
}
Since several queries are generated from the individual search terms, combining them can
be automatically done using either a dis_max query or a simple bool query. For example
(the name is boosted by 5 using ^5 notation):
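A sketch of such a multi-field query (assuming content and name fields):
GET /_search
{
  "query": {
    "query_string" : {
      "fields" : ["content", "name^5"],
      "query" : "this AND that OR thus",
      "use_dis_max" : true
    }
  }
}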
Simple wildcard can also be used to search "within" specific inner elements of the
document. For example, if we have a city object with several fields (or inner object with
fields) in it, we can automatically search on all "city" fields:
GET /_search
{
"query": {
"query_string" : {
"fields" : ["city.*"],
"query" : "this AND that OR thus",
"use_dis_max" : true
}
}
}
Another option is to provide the wildcard fields search in the query string itself (properly
escaping the * sign), for example: city.\*:something.
When running the query_string query against multiple fields, the following additional
parameters are allowed:
Parameter Description
use_dis_max Should the queries be combined using
dis_max (set it to true), or a bool query
(set it to false). Defaults to true.
tie_breaker When using dis_max, the disjunction max
tie breaker. Defaults to 0.
The fields parameter can also include pattern based field names, allowing to automatically
expand to the relevant fields (dynamically introduced fields included). For example:
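For instance, boosting all fields matching name.* by 5 (the field names and boost are
illustrative):
GET /_search
{
  "query": {
    "query_string" : {
      "fields" : ["content", "name.*^5"],
      "query" : "this AND that OR thus",
      "use_dis_max" : true
    }
  }
}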
The query string "mini-language" is used by the Query String Query and by the q
query string parameter in the search API.
The query string is parsed into a series of terms and operators. A term can be a single
word (quick or brown) or a phrase, surrounded by double quotes ("quick brown"),
which searches for all the words in the phrase, in the same order.
Operators allow you to customize the search; the available options are explained below.
Field names
As mentioned in Query String Query, the default_field is searched for the search
terms, but it is possible to specify other fields in the query syntax:
• where the status field contains active
status:active
• where the title field contains quick or brown. If you omit the OR operator the default
operator will be used
title:(quick OR brown)
title:(quick brown)
• where the author field contains the exact phrase "john smith"
author:"John Smith"
• where the field title has no value (or is missing)
_missing_:title
• where the field title has any non-null value
_exists_:title
Wildcards
Wildcard searches can be run on individual terms, using ? to replace a single character,
and * to replace zero or more characters:
qu?ck bro*
Be aware that wildcard queries can use an enormous amount of memory and perform very
badly; just think how many terms need to be queried to match the query string "a* b*
c*".
Regular expressions
Regular expression patterns can be embedded in the query string by wrapping them in
forward-slashes ("/"):
name:/joh?n(ath[oa]n)/
Fuzziness
We can search for terms that are similar to, but not exactly like, our search terms, using
the "fuzzy" operator:
quikc~ brwn~ foks~
This uses the Damerau-Levenshtein distance to find all terms with a maximum of two
changes, where a change is the insertion, deletion or substitution of a single character, or
transposition of two adjacent characters.
The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of
all human misspellings. It can be specified as:
quikc~1
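The edit-distance arithmetic can be sketched in Python. This is an illustrative optimal-string-alignment variant of Damerau-Levenshtein (insertion, deletion, substitution, adjacent transposition, each cost 1), not the engine's internal implementation:

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("quikc", "quick"))  # 1: one transposition of adjacent 'kc'
```

So quikc~1 matches quick, because a single transposition is one change.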
Proximity searches
While a phrase query (e.g. "john smith") expects all of the terms in exactly the same
order, a proximity query allows the specified words to be further apart or in a different
order. In the same way that fuzzy queries can specify a maximum edit distance for
characters in a word, a proximity search allows us to specify a maximum edit distance of
words in a phrase:
"fox quick"~5
The closer the text in a field is to the original order specified in the query string, the more
relevant that document is considered to be. When compared to the above example query,
the phrase "quick fox" would be considered more relevant than "quick brown fox".
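A rough Python sketch of the idea follows. Note this is only an approximation: Lucene's real slop counting is based on the number of position moves needed to line the terms up, which this simple position-difference check does not fully reproduce:

```python
def within_slop(tokens, w1, w2, slop):
    """Rough check: are some occurrences of w1 and w2 at most `slop`
    intervening positions apart (order ignored in this sketch)?"""
    p1 = [i for i, t in enumerate(tokens) if t == w1]
    p2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) - 1 <= slop for i in p1 for j in p2)

# "fox quick"~5 against the document text "quick brown fox":
print(within_slop(["quick", "brown", "fox"], "fox", "quick", 5))  # True
```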
Ranges
Ranges can be specified for date, numeric or string fields. Inclusive ranges are specified
634 | Chapter 43. Query DSL
NG|Storage Admin Guide
with square brackets [min TO max] and exclusive ranges with curly brackets {min TO
max}.
• All days in 2012
date:[2012-01-01 TO 2012-12-31]
• Numbers 1..5
count:[1 TO 5]
• Tags between alpha and omega, excluding alpha and omega
tag:{alpha TO omega}
• Numbers from 10 upwards
count:[10 TO *]
• Dates before 2012
date:{* TO 2012-01-01}
Curly and square brackets can be combined. The following includes 1 but excludes 5:
count:[1 TO 5}
Ranges with one side unbounded can use the following syntax:
age:>10
age:>=10
age:<10
age:<=10
To combine an upper and lower bound with the simplified syntax, you
would need to join two clauses with an AND operator:
age:(>=10 AND <20)
age:(+>=10 +<20)
Boosting
Use the boost operator ^ to make one term more relevant than another. For instance, if we
want to find all documents about foxes, but we are especially interested in quick foxes:
quick^2 fox
The default boost value is 1, but can be any positive floating point number. Boosts between
0 and 1 reduce relevance.
Boolean operators
By default, all terms are optional, as long as one term matches. A search for foo bar
baz will find any document that contains one or more of foo or bar or baz. We have
already discussed the default_operator above which allows you to force all terms to be
required, but there are also boolean operators which can be used in the query string itself
to provide more control.
The preferred operators are + (this term must be present) and - (this term must not be
present). All other terms are optional. For example, this query:
quick brown +fox -news
states that:
• fox must be present
• news must not be present
• quick and brown are optional; their presence increases the relevance
The familiar operators AND, OR and NOT (also written &&, || and !) are also supported.
However, the effects of these operators can be more complicated than is obvious at first
glance. NOT takes precedence over AND, which takes precedence over OR. While the + and
- only affect the term to the right of the operator, AND and OR can affect the terms to the
left and right.
Rewriting the above query using AND, OR and NOT demonstrates the complexity:
((quick AND fox) OR (brown AND fox) OR fox) AND NOT news
This form now replicates the logic from the original query correctly, but the
relevance scoring bears little resemblance to the original.
In contrast, the same query rewritten using the match query would look like this:
{
"bool": {
"must": { "match": "fox" },
"should": { "match": "quick brown" },
"must_not": { "match": "news" }
}
}
Grouping
Multiple terms or clauses can be grouped together with parentheses, to form sub-queries:
(quick OR brown) AND fox
Groups can be used to target a particular field, or to boost the result of a sub-query:
status:(active OR pending) title:(full text search)^2
Reserved characters
If you need to use any of the characters which function as operators in your query itself (and
not as operators), then you should escape them with a leading backslash. For instance, to
search for (1+1)=2, you would need to write your query as \(1\+1\)\=2.
A space may also be a reserved character. For instance, if you have a synonym list
which converts "wi fi" to "wifi", a query_string search for "wi fi" would fail.
The query string parser would interpret your query as a search for "wi OR fi", while
the token stored in your index is actually "wifi". Escaping the space will protect it
from being touched by the query string parser: "wi\ fi".
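A small escaping helper can be sketched in Python. The reserved-character set below is an assumption compiled from the operators described in this chapter; adjust it to match your exact parser version:

```python
# Assumed reserved characters of the query-string syntax (verify against
# your parser version before relying on this list).
RESERVED = set('+-=&|><!(){}[]^"~*?:\\/')

def escape_query_string(text: str) -> str:
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

print(escape_query_string("(1+1)=2"))  # \(1\+1\)\=2, as in the example above
```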
Empty Query
If the query string is empty or only contains whitespaces the query will yield an empty
result set.
Matches documents with fields that have terms within a certain range. The type of the
Lucene query depends on the field type, for string fields, the TermRangeQuery, while for
number/date fields, the query is a NumericRangeQuery. The following example returns
all documents where age is between 10 and 20:
GET _search
{
"query": {
"range" : {
"age" : {
"gte" : 10,
"lte" : 20,
"boost" : 2.0
}
}
}
}
gte
Greater-than or equal to
gt
Greater-than
lte
Less-than or equal to
lt
Less-than
boost
Sets the boost value of the query, defaults to 1.0
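The way the four bounds combine can be sketched as a simple predicate in Python (an illustration of the semantics, not the engine's implementation):

```python
def in_range(value, gte=None, gt=None, lte=None, lt=None):
    """Evaluate the range bounds the way the range query combines them:
    every bound that is set must hold."""
    if gte is not None and not value >= gte:
        return False
    if gt is not None and not value > gt:
        return False
    if lte is not None and not value <= lte:
        return False
    if lt is not None and not value < lt:
        return False
    return True

print(in_range(15, gte=10, lte=20))  # True: matches the example query above
print(in_range(20, gte=10, lt=20))   # False: lt is exclusive
```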
When running range queries on fields of type date, ranges can be specified using [date-
math]:
GET _search
{
"query": {
"range" : {
"date" : {
"gte" : "now-1d/d",
"lt" : "now/d"
}
}
}
}
When using date math to round dates to the nearest day, month, hour, etc, the rounded
dates depend on whether the ends of the ranges are inclusive or exclusive.
Rounding up moves to the last millisecond of the rounding scope, and rounding down to the
first millisecond of the rounding scope. For example:
gt
Greater than the date rounded up: 2014-11-18||/M becomes 2014-11-30T23:59:59.999, ie excluding the entire month.
gte
Greater than or equal to the date rounded down: 2014-11-18||/M becomes 2014-11-01, ie including the entire month.
lt
Less than the date rounded down: 2014-11-18||/M becomes 2014-11-01, ie excluding the entire month.
lte
Less than or equal to the date rounded up: 2014-11-18||/M becomes 2014-11-30T23:59:59.999, ie including the entire month.
Formatted dates will be parsed using the format specified on the date field by default, but
it can be overridden by passing the format parameter to the range query:
GET _search
{
"query": {
"range" : {
"born" : {
"gte": "01/01/2012",
"lte": "2013",
"format": "dd/MM/yyyy||yyyy"
}
}
}
}
Dates can be converted from another timezone to UTC either by specifying the time zone in
the date value itself (if the format accepts it), or it can be specified as the time_zone
parameter:
GET _search
{
"query": {
"range" : {
"timestamp" : {
"gte": "2015-01-01 00:00:00", 1
"lte": "now", 2
"time_zone": "+01:00"
}
}
}
}
1 - This date will be converted to 2014-12-31T23:00:00 UTC.
2 - now is not affected by the time_zone parameter (dates must be stored as UTC).
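The UTC conversion for the gte bound in the example above can be checked with Python's standard library:

```python
from datetime import datetime, timedelta, timezone

# "2015-01-01 00:00:00" interpreted in the +01:00 zone, as time_zone requests.
local = datetime(2015, 1, 1, 0, 0, 0, tzinfo=timezone(timedelta(hours=1)))
utc = local.astimezone(timezone.utc)
print(utc.isoformat())  # 2014-12-31T23:00:00+00:00
```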
The regexp query allows you to use regular expression term queries. See Regular
Expression Syntax for details of the supported regular expression language. The "term
queries" in that first sentence means that NG|Storage will apply the regexp to the terms
produced by the tokenizer for that field, and not to the original text of the field.
Note: The performance of a regexp query heavily depends on the regular expression
chosen. Matching everything like .* is very slow, as is using lookaround
regular expressions. If possible, you should try to use a long
prefix before your regular expression starts. Wildcard matchers
like .*?+ will mostly lower performance.
GET /_search
{
"query": {
"regexp":{
"name.first": "s.*y"
}
}
}
GET /_search
{
"query": {
"regexp":{
"name.first":{
"value":"s.*y",
"boost":1.2
}
}
}
}
Regular expressions are dangerous because it’s easy to accidentally create an innocuous
looking one that requires an exponential number of internal determinized automaton states
(and corresponding RAM and CPU) for Lucene to execute. Lucene prevents these using the
max_determinized_states setting (defaults to 10000). You can raise this limit to allow
more complex regular expressions to execute.
GET /_search
{
"query": {
"regexp":{
"name.first": {
"value": "s.*y",
"flags" : "INTERSECTION|COMPLEMENT|EMPTY",
"max_determinized_states": 20000
}
}
}
}
Regular expression queries are supported by the regexp and the query_string queries.
The Lucene regular expression engine is not Perl-compatible but supports a smaller range
of operators.
We will not attempt to explain regular expressions, but just explain the
supported operators.
Standard operators
Most regular expression engines allow you to match any part of a string. If you want
the regexp pattern to start at the beginning of the string or finish at the end of the
string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to
indicate the end.
Lucene’s patterns are always anchored. The pattern provided must match the entire
string. For string "abcde":
ab.* # match
abcd # no match
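Python's re.fullmatch gives the same whole-string behaviour, which makes the anchoring rule easy to verify (Lucene's regexp dialect differs from Python's in other respects, so this only illustrates the anchoring):

```python
import re

def lucene_like_match(pattern: str, term: str) -> bool:
    """Lucene patterns are implicitly anchored: they must cover the
    entire term, which is what re.fullmatch checks."""
    return re.fullmatch(pattern, term) is not None

print(lucene_like_match(r"ab.*", "abcde"))  # True
print(lucene_like_match(r"abcd", "abcde"))  # False: does not cover the whole term
```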
Allowed characters
Any Unicode characters may be used in the pattern, but certain characters are
reserved and must be escaped. The standard reserved characters are:
. ? + * | { } [ ] ( ) " \
If you enable optional features (see below) then these characters may also be reserved:
# @ & < > ~
Any reserved character can be escaped with a backslash "\*" including a literal backslash
character: "\\"
Additionally, any characters (except double quotes) are interpreted literally when
surrounded by double quotes:
john"@smith.com"
The period "." can be used to represent any character. For string "abcde":
ab... # match
a.c.e # match
One-or-more
The plus sign "+" can be used to repeat the preceding shortest pattern once or more
times. For string "aaabbb":
a+b+ # match
aa+bbb+ # match
a+.+ # match
Zero-or-more
The asterisk "*" can be used to match the preceding shortest pattern zero-or-more
times. For string "aaabbb":
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
Zero-or-one
The question mark "?" makes the preceding shortest pattern optional. It matches
zero or one times. For string "aaabbb":
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
Min-to-max
Curly brackets "{}" can be used to specify a minimum and (optionally) a maximum
number of times the preceding shortest pattern can repeat. The allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
Grouping
Parentheses "()" can be used to form sub-patterns. The quantity operators listed
above operate on the shortest previous pattern, which can be a group. For string
"ababab":
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(...)+ # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
Alternation
The pipe symbol "|" acts as an OR operator. The match will succeed if the pattern on
either the left-hand side OR the right-hand side matches. The alternation applies to
the longest pattern, not the shortest. For string "aabb":
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
Character classes
Ranges of potential characters may be represented as character classes by enclosing
them in square brackets "[]". A leading ^ negates the character class.
Note that the dash "-" indicates a range of characters, unless it is the first character or if
it is escaped with a backslash.
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
These operators are available by default as the flags parameter defaults to ALL. Different
flag combinations (concatenated with "|") can be used to enable/disable specific operators:
{
"regexp": {
"username": {
"value": "john~athon<1-5>",
"flags": "COMPLEMENT|INTERVAL"
}
}
}
Complement
The complement is probably the most useful option. The shortest pattern that follows
a tilde "~" is negated. For instance, "ab~cd" means:
• Starts with a
• Followed by b
• Followed by a string of any length, except c
• Ends with d
For string "abcdef":
ab~df # match
ab~cf # match
ab~cdef # no match
a~(cb)def # match
a~(bc)def # no match
Interval
The interval option enables the use of numeric ranges, enclosed by angle brackets
"<>". For string: "foo80":
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Intersection
The ampersand "&" joins two patterns in a way that both of them have to match. For
string "aaabbb":
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular expression.
Any string
The at sign "@" matches any string in its entirety. This could be combined with the
intersection and complement above to express "everything except". For instance:
@&~(foo.+) # anything except string beginning with "foo"
A query that allows a script to act as a query. Script queries are typically used in a filter
context, for example:
GET /_search
{
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"inline": "doc['num1'].value > 1",
"lang": "painless"
}
}
}
}
}
}
Custom Parameters
Scripts are compiled and cached for faster execution. If the same script can be reused, just
with different parameters provided, it is preferable to pass the parameters to
the script itself, for example:
GET /_search
{
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"inline" : "doc['num1'].value > params.param1",
"lang" : "painless",
"params" : {
"param1" : 5
}
}
}
}
}
}
}
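Why parameterization helps can be sketched with a toy compile cache in Python. The "compilation" here is a stand-in for the engine's script compilation, purely to show that one source string compiles once regardless of how many parameter sets are used:

```python
from functools import lru_cache

COMPILATIONS = []  # records every (expensive) compilation that happens

@lru_cache(maxsize=None)
def compile_script(source: str):
    """Toy stand-in for script compilation, cached by source text."""
    COMPILATIONS.append(source)
    return eval("lambda doc, params: " + source)  # toy 'painless' stand-in

script = "doc['num1'] > params['param1']"
f1 = compile_script(script)  # compiled here
f2 = compile_script(script)  # cache hit: no second compilation

print(f1({"num1": 7}, {"param1": 5}))  # True
print(len(COMPILATIONS))               # 1: same source, different params, one compile
```

Hard-coding the value into the source string instead would produce a new cache entry (and a new compilation) for every distinct value.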
A query that uses the SimpleQueryParser to parse its context. Unlike the regular
query_string query, the simple_query_string query will never throw an exception,
and discards invalid parts of the query. Here is an example:
GET /_search
{
"query": {
"simple_query_string" : {
"query": "\"fried eggs\" +(eggplant | potato) -frittata",
"analyzer": "snowball",
"fields": ["body^5","_all"],
"default_operator": "and"
}
}
}
Parameter Description
query The actual query to be parsed. See below for
syntax.
fields The fields to perform the parsed query
against. Defaults to the
index.query.default_field index
setting, which in turn defaults to _all.
simple_query_string supports the following special characters:
• + signifies AND operation
• | signifies OR operation
• - negates a single token
• " wraps a number of tokens to signify a phrase for searching
• * at the end of a term signifies a prefix query
• ( and ) signify precedence
• ~N after a word signifies edit distance (fuzziness)
• ~N after a phrase signifies slop amount
In order to search for any of these special characters, they will need to be escaped with \.
Default Field
When not explicitly specifying the field to search on in the query string syntax, the
index.query.default_field will be used to derive which field to search on. It defaults
to _all field.
So, if _all field is disabled, it might make sense to change it to set a different default field.
Multi Field
The fields parameter can also include pattern based field names, allowing to automatically
expand to the relevant fields (dynamically introduced fields included). For example:
GET /_search
{
"query": {
"simple_query_string" : {
"fields" : ["content", "name.*^5"],
"query" : "foo bar baz"
}
}
}
Flags
simple_query_string supports multiple flags to specify which parsing features should
be enabled. It is specified as a |-delimited string. For example:
GET /_search
{
"query": {
"simple_query_string" : {
"query" : "foo | bar + baz*",
"flags" : "OR|AND|PREFIX"
}
}
}
The available flags are: ALL, NONE, AND, OR, NOT, PREFIX, PHRASE, PRECEDENCE,
ESCAPE, WHITESPACE, FUZZY, NEAR, and SLOP.
Returns matches which enclose another span query. The span containing query maps to
Lucene SpanContainingQuery. Here is an example:
GET /_search
{
"query": {
"span_containing" : {
"little" : {
"span_term" : { "field1" : "foo" }
},
"big" : {
"span_near" : {
"clauses" : [
{ "span_term" : { "field1" : "bar" } },
{ "span_term" : { "field1" : "baz" } }
],
"slop" : 5,
"in_order" : true
}
}
}
}
}
The big and little clauses can be any span type query. Matching spans from big that
contain matches from little are returned.
Matches spans near the beginning of a field. The span first query maps to Lucene
SpanFirstQuery. Here is an example:
GET /_search
{
"query": {
"span_first" : {
"match" : {
"span_term" : { "user" : "kimchy" }
},
"end" : 3
}
}
}
The match clause can be any other span type query. The end controls the maximum end
position permitted in a match.
The span_multi query allows you to wrap a multi term query (one of wildcard, fuzzy,
prefix, term, range or regexp query) as a span query, so it can be nested. Example:
GET /_search
{
"query": {
"span_multi":{
"match":{
"prefix" : { "user" : { "value" : "ki" } }
}
}
}
}
GET /_search
{
"query": {
"span_multi":{
"match":{
"prefix" : { "user" : { "value" : "ki", "boost" : 1.08 }
}
}
}
}
}
Matches spans which are near one another. One can specify slop, the maximum number of
intervening unmatched positions, as well as whether matches are required to be in-order.
The span near query maps to Lucene SpanNearQuery. Here is an example:
GET /_search
{
"query": {
"span_near" : {
"clauses" : [
{ "span_term" : { "field" : "value1" } },
{ "span_term" : { "field" : "value2" } },
{ "span_term" : { "field" : "value3" } }
],
"slop" : 12,
"in_order" : false
}
}
}
The clauses element is a list of one or more other span type queries and the slop
controls the maximum number of intervening unmatched positions permitted.
Removes matches which overlap with another span query. The span not query maps to
Lucene SpanNotQuery. Here is an example:
GET /_search
{
"query": {
"span_not" : {
"include" : {
"span_term" : { "field1" : "hoya" }
},
"exclude" : {
"span_near" : {
"clauses" : [
{ "span_term" : { "field1" : "la" } },
{ "span_term" : { "field1" : "hoya" } }
],
"slop" : 0,
"in_order" : true
}
}
}
}
}
The include and exclude clauses can be any span type query. The include clause is
the span query whose matches are filtered, and the exclude clause is the span query
whose matches must not overlap those returned.
In the above example all documents with the term hoya are filtered except the ones that
also contain la preceding it.
Other top level options:
pre
If set, the amount of tokens before the include span that can't have overlap with the
exclude span.
post
If set, the amount of tokens after the include span that can't have overlap with the
exclude span.
dist
If set, the amount of tokens from within the include span that can't have overlap with the
exclude span. Equivalent of setting both pre and post.
Matches the union of its span clauses. The span or query maps to Lucene SpanOrQuery.
Here is an example:
GET /_search
{
"query": {
"span_or" : {
"clauses" : [
{ "span_term" : { "field" : "value1" } },
{ "span_term" : { "field" : "value2" } },
{ "span_term" : { "field" : "value3" } }
]
}
}
}
The clauses element is a list of one or more other span type queries.
Span queries are low-level positional queries which provide expert control over the order
and proximity of the specified terms. These are typically used to implement very specific
queries on legal documents or patents.
Span queries cannot be mixed with non-span queries (with the exception of the
span_multi query).
span_term query
The equivalent of the term query but for use with other span queries.
span_multi query
span_first query
Accepts another span query whose matches must appear within the first N positions
of the field.
span_near query
Accepts multiple span queries whose matches must be within the specified distance
of each other, and possibly in the same order.
span_or query
span_not query
Wraps another span query, and excludes any documents which match that query.
span_containing query
Accepts a list of span queries, but only returns those spans which also match a
second span query.
span_within query
The result from a single span query is returned as long as its span falls within the
spans returned by a list of other span queries.
Matches spans containing a term. The span term query maps to Lucene SpanTermQuery.
Here is an example:
GET /_search
{
"query": {
"span_term" : { "user" : { "value" : "kimchy", "boost" : 2.0 } }
}
}
Or :
GET /_search
{
"query": {
"span_term" : { "user" : { "term" : "kimchy", "boost" : 2.0 } }
}
}
Returns matches which are enclosed inside another span query. The span within query
maps to Lucene SpanWithinQuery. Here is an example:
GET /_search
{
"query": {
"span_within" : {
"little" : {
"span_term" : { "field1" : "foo" }
},
"big" : {
"span_near" : {
"clauses" : [
{ "span_term" : { "field1" : "bar" } },
{ "span_term" : { "field1" : "baz" } }
],
"slop" : 5,
"in_order" : true
}
}
}
}
}
The big and little clauses can be any span type query. Matching spans from little
that are enclosed within big are returned.
This group contains queries which do not fit into the other groups:
more_like_this query
This query finds documents which are similar to the specified text, document, or
collection of documents.
template query
The template query accepts a Mustache template (either inline, indexed, or from a
file), and a map of parameters, and combines the two to generate the final query to
execute.
script query
This query allows a script to act as a filter. Also see the function_score query.
percolate query
This query finds queries that are stored as documents that match with the specified
document.
A query that accepts a query template and a map of key/value pairs to fill in template
parameters. Templating is based on Mustache. For simple token substitution all you
provide is a query containing some variable that you want to substitute and the actual
values:
GET /_search
{
"query": {
"template": {
"inline": { "match": { "text": "{{query_string}}" }},
"params" : {
"query_string" : "all about search"
}
}
}
}
which is then turned into:
GET /_search
{
"query": {
"match": {
"text": "all about search"
}
}
}
Alternatively, the template can be passed as an escaped string:
GET /_search
{
"query": {
"template": {
"inline": "{ \"match\": { \"text\": \"{{query_string}}\" }}", 1
"params" : {
"query_string" : "all about search"
}
}
}
}
1 - New line characters (\n) should be escaped as \\n or removed, and quotes (") should
be escaped as \\".
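The token-substitution step itself can be sketched in a few lines of Python. This is a minimal {{name}} replacement only; real Mustache (which the template query uses) supports sections, escaping and much more:

```python
import json
import re

def render_template(template: str, params: dict) -> str:
    """Minimal {{name}} substitution; a sketch, not full Mustache."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(params[m.group(1)]),
                  template)

body = render_template('{ "match": { "text": "{{query_string}}" }}',
                       {"query_string": "all about search"})
print(json.loads(body))  # {'match': {'text': 'all about search'}}
```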
Stored templates
You can register a template by storing it in the config/scripts directory, in a file using
the .mustache extension. In order to execute the stored template, reference it by name in
the file parameter:
GET /_search
{
"query": {
"template": {
"file": "my_template", 1
"params" : {
"query_string" : "all about search"
}
}
}
}
Alternatively, you can register a query template in the cluster state with:
PUT /_search/template/my_template
{
"template": { "match": { "text": "{{query_string}}" }}
}
GET /_search
{
"query": {
"template": {
"id": "my_template", 1
"params" : {
"query_string" : "all about search"
}
}
}
}
There is also a dedicated template endpoint that allows you to template an entire search
request. Please see Search Template for more details.
While the full text queries will analyze the query string before executing, the term-level
queries operate on the exact terms that are stored in the inverted index.
These queries are usually used for structured data like numbers, dates, and enums, rather
than full text fields. Alternatively, they allow you to craft low-level queries, foregoing the
analysis process.
term query
Find documents which contain the exact term specified in the field specified.
terms query
Find documents which contain any of the exact terms specified in the field specified.
range query
Find documents where the field specified contains values (dates, numbers, or strings)
in the range specified.
exists query
Find documents where the field specified contains any non-null value.
prefix query
Find documents where the field specified contains terms which begin with the exact
prefix specified.
wildcard query
Find documents where the field specified contains terms which match the pattern
specified, where the pattern supports single character wildcards (?) and multi-
character wildcards (*)
regexp query
Find documents where the field specified contains terms which match the regular
expression specified.
fuzzy query
Find documents where the field specified contains terms which are fuzzily similar to
the specified term. Fuzziness is measured as a Levenshtein edit distance of 1 or 2.
type query
Find documents of the specified type.
The term query finds documents that contain the exact term specified in the inverted
index. For instance:
POST _search
{
"query": {
"term" : { "user" : "Kimchy" } 1
}
}
1 - Finds documents which contain the exact term Kimchy in the inverted index of the user
field.
A boost parameter can be specified to give this term query a higher relevance score than
another query, for instance:
GET _search
{
"query": {
"bool": {
"should": [
{
"term": {
"status": {
"value": "urgent",
"boost": 2.0 1
}
}
},
{
"term": {
"status": "normal" 2
}
}
]
}
}
}
1 - The urgent query clause has a boost of 2.0, meaning it is twice as important as the
query clause for normal.
String fields can be of type text (treated as full text, like the body of an email), or
keyword (treated as exact values, like an email address or a zip code). Exact values
(like numbers, dates, and keywords) have the exact value specified in the field added to
the inverted index in order to make them searchable.
However, text fields are analyzed. This means that their values are first passed
through an analyzer to produce a list of terms, which are then added to the inverted
index.
There are many ways to analyze text: the default standard analyzer drops most
punctuation, breaks up text into individual words, and lower cases them. For
instance, the standard analyzer would turn the string "Quick Brown Fox!"
into the terms [quick, brown, fox].
This analysis process makes it possible to search for individual words within a big
block of full text.
The term query looks for the exact term in the field’s inverted index¬—¬it doesn’t
know anything about the field’s analyzer. This makes it useful for looking up values in
keyword fields, or in numeric or date fields. When querying full text fields, use the
match query instead, which understands how the field has been analyzed.
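The difference can be sketched with a toy analyzer and inverted index in Python. The "analyzer" below is a very rough stand-in for the standard analyzer (lowercase, split on non-alphanumerics); the real one is considerably smarter:

```python
import re

def standard_analyze(text: str):
    """Rough sketch of the standard analyzer: tokenize and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

# A toy inverted index keyed by term, as the text above describes.
inverted = {}
for term in standard_analyze("Quick Foxes!"):
    inverted.setdefault(term, set()).add(1)  # doc id 1

print(standard_analyze("Quick Foxes!"))  # ['quick', 'foxes']
print("quick" in inverted)               # True: a term query for 'quick' hits
print("Quick Foxes!" in inverted)        # False: the exact string is not a term
```

This is exactly why a term query for "Quick Foxes!" misses an analyzed field, while a match query (which analyzes the query string the same way) matches.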
To demonstrate, try out the example below. First, create an index, specifying the field
mappings, and index a document:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"full_text": { "type": "text" }, 1
"exact_value": { "type": "keyword" } 2
}
}
}
}
PUT my_index/my_type/1
{
"full_text": "Quick Foxes!", 3
"exact_value": "Quick Foxes!" 4
}
1 - The full_text field is of type text and will be analyzed.
2 - The exact_value field is of type keyword and will NOT be analyzed.
3 - The full_text inverted index will contain the terms: [quick, foxes].
4 - The exact_value inverted index will contain the exact term: [Quick Foxes!].
Now, compare the results for the term query and the match query:
GET my_index/my_type/_search
{
"query": {
"term": {
"exact_value": "Quick Foxes!" 1
}
}
}
GET my_index/my_type/_search
{
"query": {
"term": {
"full_text": "Quick Foxes!" 2
}
}
}
GET my_index/my_type/_search
{
"query": {
"term": {
"full_text": "foxes" 3
}
}
}
GET my_index/my_type/_search
{
"query": {
"match": {
"full_text": "Quick Foxes!" 4
}
}
}
1 - This query matches because the exact_value field contains the exact term
Quick Foxes!.
2 - This query does not match, because the full_text field only contains the terms
quick and foxes. It does not contain the exact term Quick Foxes!.
3 - A term query for the term foxes matches the full_text field.
4 - This match query on the full_text field first analyzes the query string, then
looks for documents containing quick or foxes or both.
Filters documents that have fields that match any of the provided terms (not analyzed). For
example:
GET /_search
{
"query": {
"constant_score" : {
"filter" : {
"terms" : { "user" : ["kimchy", "ngStorage"]}
}
}
}
}
The terms query is also aliased with in as the filter name for simpler usage. The in alias
is deprecated as of 5.0.0; use terms instead.
When it’s needed to specify a terms filter with a lot of terms it can be beneficial to fetch
those term values from a document in an index. A concrete example would be to filter
tweets tweeted by your followers. Potentially the amount of user ids specified in the terms
filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup
mechanism.
index
The index to fetch the term values from. Defaults to the current index.
type
The type to fetch the term values from.
id
The id of the document to fetch the term values from.
path
The field specified as path to fetch the actual values for the terms filter.
routing
A custom routing value to be used when retrieving the external terms doc.
The values for the terms filter will be fetched from a field in a document with the specified
id in the specified type and index. Internally a get request is executed to fetch the values
from the specified path. At the moment, for this feature to work, the _source needs to be
stored.
Also, consider using an index with a single shard and fully replicated across all nodes if the
"reference" terms data is not large. The lookup terms filter will prefer to execute the get
request on a local node if possible, reducing the need for networking.
At first we index the information for the user with id 2, specifically its followers, then index
a tweet from the user with id 1. Finally we search for all the tweets that match the followers
of user 2.
PUT /users/user/2
{
"followers" : ["1", "3"]
}
PUT /tweets/tweet/1
{
"user" : "1"
}
GET /tweets/_search
{
"query" : {
"terms" : {
"user" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "followers"
}
}
}
}
The structure of the external terms document can also include an array of inner objects; in
that case the path parameter uses dot notation to point at the field inside those objects.
The type query filters documents matching the provided document / mapping type. For
example:
GET /_search
{
"query": {
"type" : {
"value" : "my_type"
}
}
}
Matches documents that have fields matching a wildcard expression (not analyzed).
Supported wildcards are *, which matches any character sequence (including
the empty one), and ?, which matches any single character. Note
this query can be slow, as it needs to iterate over many terms. In
order to prevent extremely slow wildcard queries, a wildcard term
should not start with one of the wildcards * or ?. The wildcard query maps
to Lucene WildcardQuery.
GET /_search
{
"query": {
"wildcard" : { "user" : "ki*y" }
}
}
GET /_search
{
"query": {
"wildcard" : { "user" : { "value" : "ki*y", "boost" : 2.0 } }
}
}
This multi term query allows to control how it gets rewritten using the rewrite parameter.
Search APIs
Most search APIs are multi-index, multi-type, with the exception of the Explain API
endpoints.
Routing
When executing a search, it will be broadcast to all the index/indices shards (round robin
between replicas). Which shards will be searched on can be controlled by providing the
routing parameter. For example, when indexing tweets, the routing value can be the user
name.
In such a case, if we want to search only on the tweets for a specific user, we can specify it
as the routing, resulting in the search hitting only the relevant shard.
The routing parameter can be multi valued, represented as a comma separated string. This
will result in hitting the relevant shards where the routing values match.
Stats Groups
A search can be associated with stats groups, which maintains a statistics aggregation per
group. It can later be retrieved using the indices stats API specifically. For example, here is
a search body request that associates the request with two different groups:
{
"query" : {
"match_all" : {}
},
"stats" : ["group1", "group2"]
}
Individual searches can have a timeout as part of the Request Body Search. Since search
requests can originate from many sources, NG|Storage has a dynamic cluster-level setting
for a global search timeout that applies to all search requests that do not set a timeout in
the Request Body Search. The default value is no global timeout. The setting key is
search.default_search_timeout and can be set using the Cluster Update Settings
endpoints. Setting this value to -1 resets the global search timeout to no timeout.
The search request can be executed with a search DSL, which includes the Query DSL,
within its body. Here is an example:
GET /twitter/tweet/_search
{
"query" : {
"term" : { "user" : "kimchy" }
}
}
And here is a sample response:
{
"_shards":{
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits":{
"total" : 1,
"hits" : [
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_source" : {
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",
"message" : "trying out ngStorage"
}
}
]
}
}
Parameters
timeout
A search timeout, bounding the search request to be executed within the specified
time value and bail with the hits accumulated up to that point when expired. Defaults
to no timeout. See [time-units].
from
The starting from index of the hits to return. Defaults to 0.
size
The number of hits to return. Defaults to 10. If you do not care about getting some
hits back but only about the number of matches and/or aggregations, setting the
value to 0 will help performance.
search_type
The type of the search operation to perform. Can be dfs_query_then_fetch or
query_then_fetch. Defaults to query_then_fetch. See Search Type for more.
request_cache
Set to true or false to enable or disable the caching of search results for requests
where size is 0, ie aggregations and suggestions (no top hits returned). See Shard
Request Cache.
terminate_after
The maximum number of documents to collect for each shard, upon reaching which
the query execution will terminate early. If set, the response will have a boolean field
terminated_early to indicate whether the query execution has actually
terminated_early. Defaults to no terminate_after.
Out of the above, the search_type and the request_cache must be passed as query-
string parameters. The rest of the search request should be passed within the body itself.
The body content can also be passed as a REST parameter named source.
Both HTTP GET and HTTP POST can be used to execute search with body. Since not all
clients support GET with body, POST is allowed as well.
In case we only want to know if there are any documents matching a specific query, we can
set the size to 0 to indicate that we are not interested in the search results. Also we can
set terminate_after to 1 to indicate that the query execution can be terminated
whenever the first matching document was found (per shard).
$ curl -XGET
'http://localhost:9200/_search?q=tag:wow&size=0&terminate_after=1'
The response will not contain any hits as the size was set to 0. The hits.total will be
either equal to 0, indicating that there were no matching documents, or greater than 0,
meaning that there were at least as many documents matching the query when it was
early terminated:
{
"took": 3,
"timed_out": false,
"terminated_early": true,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
}
}
44.1.1. Doc Value Fields
Allows returning the doc value representation of a field for each hit, for example:
GET /_search
{
"query" : {
"match_all": {}
},
"docvalue_fields" : ["test1", "test2"]
}
Doc value fields can work on fields that are not stored.
Note that if the fields parameter specifies fields without doc values, it will try to load the
value from the fielddata cache, causing the terms for that field to be loaded into memory
(cached), which will result in more memory consumption.
44.1.2. Explain
Enables explanation for each hit on how its score was computed.
44.1.3. From / Size
Pagination of results can be done by using the from and size parameters. The from
parameter defines the offset from the first result you want to fetch. The size parameter
allows you to configure the maximum number of hits to be returned.
Though from and size can be set as request parameters, they can also be set within the
search body. from defaults to 0, and size defaults to 10.
GET /_search
{
"from" : 0, "size" : 10,
"query" : {
"term" : { "user" : "kimchy" }
}
}
Note that from + size can not be more than the index.max_result_window index
setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways
to do deep scrolling.
44.1.4. Highlighting
Allows highlighting of search results on one or more fields. The implementation uses either
the Lucene plain highlighter, the fast vector highlighter (fvh), or the postings highlighter.
For more information please refer to the source ElasticSearch reference documentation
chapter.
44.1.5. Index Boost
Allows configuring different boost levels per index when searching across more than one
index. This is very handy when hits coming from one index matter more than hits coming
from another index (think of a social graph where each user has an index).
44.1.6. Inner Hits
The parent/child and nested features allow the return of documents that have matches in a
different scope. In the parent/child case, parent documents are returned based on matches
in child documents, or child documents are returned based on matches in parent documents.
In the nested case, documents are returned based on matches in nested inner objects.
In both cases, the actual matches in the different scopes that caused a document to be
returned are hidden. In many cases, it’s very useful to know which inner nested objects (in
the case of nested) or children/parent documents (in the case of parent/child) caused
certain information to be returned. The inner hits feature can be used for this. This feature
returns, per search hit in the search response, additional nested hits that caused a search
hit to match in a different scope.
"<query>" : {
"inner_hits" : {
<inner_hits_options>
}
}
If inner_hits is defined on a query that supports it, then each search hit will contain an
inner_hits JSON object with the following structure:
Options
from
The offset from which to fetch the first hit for each inner_hits in the returned
regular search hits.
size
The maximum number of hits to return per inner_hits. By default the top three
matching hits are returned.
sort
How the inner hits should be sorted per inner_hits. By default the hits are sorted
by the score.
name
The name to be used for the particular inner hit definition in the response. Useful
when multiple inner hits have been defined in a single search request. The default
depends on which query the inner hit is defined in: for the has_child query and filter
this is the child type, for the has_parent query and filter this is the parent type, and
for the nested query and filter this is the nested path.
Inner hits also supports the following per-document features:
• Highlighting
• Explain
• Source filtering
• Script fields
• Include versions
For more information please refer to the source ElasticSearch reference documentation
chapter.
For more information please refer to the source ElasticSearch reference documentation
chapter.
For more information please refer to the source ElasticSearch reference documentation
chapter.
44.1.7. min_score
Exclude documents which have a _score less than the minimum specified in min_score:
GET /_search
{
"min_score": 0.5,
"query" : {
"term" : { "user" : "kimchy" }
}
}
Note that most of the time this does not make much sense, but it is provided for advanced use cases.
44.1.8. Named Queries
Each filter and query can accept a _name in its top-level definition.
The search response will include, for each hit, the matched_queries it matched on. The
tagging of queries and filters only makes sense for the bool query.
44.1.9. Post filter
The post_filter is applied to the search hits at the very end of a search request, after
aggregations have already been calculated. Its purpose is best explained by example:
Imagine that you are selling shirts that have the following properties:
PUT /shirts
{
"mappings": {
"item": {
"properties": {
"brand": { "type": "keyword"},
"color": { "type": "keyword"},
"model": { "type": "keyword"}
}
}
}
}
PUT /shirts/item/1?refresh
{
"brand": "gucci",
"color": "red",
"model": "slim"
}
GET /shirts/_search
{
"query": {
"bool": {
"filter": [
{ "term": { "color": "red" }},
{ "term": { "brand": "gucci" }}
]
}
}
}
However, you would also like to use faceted navigation to display a list of other options that
the user could click on. Perhaps you have a model field that would allow the user to limit
their search results to red Gucci t-shirts or dress-shirts.
GET /shirts/_search
{
"query": {
"bool": {
"filter": [
{ "term": { "color": "red" }},
{ "term": { "brand": "gucci" }}
]
}
},
"aggs": {
"models": {
"terms": { "field": "model" } 1
}
}
}
1 - Returns the most popular models of red shirts by Gucci.
But perhaps you would also like to tell the user how many Gucci shirts are available in
other colors. If you just add a terms aggregation on the color field, you will only get back
the color red, because your query returns only red shirts by Gucci.
Instead, you want to include shirts of all colors during aggregation, then apply the colors
filter only to the search results. This is the purpose of the post_filter:
1 - The main query now finds all shirts by Gucci, regardless of color.
2 - The colors agg returns popular colors for shirts by Gucci.
3 - The color_red agg limits the models sub-aggregation to red Gucci shirts.
4 - Finally, the post_filter removes colors other than red from the search hits.
44.1.10. Preference
Controls a preference of which shard replicas to execute the search request on. By
default, the operation is randomized between the shard replicas.
_primary
The operation will be executed only on primary shards.
_primary_first
The operation will be executed on primary shards, and if not available
(failover), will execute on other shards.
_replica
The operation will be executed only on a replica shard.
_replica_first
The operation will be executed only on a replica shard, and if not available
(failover), will execute on other shards.
_local
The operation will prefer to be executed on a local allocated shard if possible.
_prefer_nodes:abc,xyz
Prefers execution on the nodes with the provided node ids (abc or xyz in this case) if
applicable.
_shards:2,3
Restricts the operation to the specified shards. (2 and 3 in this case). This preference
can be combined with other preferences but it has to appear first:
_shards:2,3;_primary
_only_nodes
Restricts the operation to nodes specified in node specification
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster.html
Custom (string) value
A custom value will be used to guarantee that the same shards will be used for the
same custom value. This can help with "jumping values" when hitting different shards
in different refresh states. A sample value can be something like the web session ID
or the user name.
For instance, use the user’s session ID to ensure consistent ordering of results for the
user:
GET /_search?preference=xyzabc123
{
"query": {
"match": {
"title": "ngStorage"
}
}
}
44.1.11. Query
The query element within the search request body allows defining a query using the Query
DSL.
GET /_search
{
"query" : {
"term" : { "user" : "kimchy" }
}
}
44.1.12. Rescoring
Rescoring can help to improve precision by reordering just the top documents (e.g. the top
100-500) returned by the query and post_filter phases, using a secondary (usually
more costly) algorithm, instead of applying the costly algorithm to all documents in the
index.
A rescore request is executed on each shard before it returns its results to be sorted by
the node handling the overall search request.
Currently the rescore API has only one implementation: the query rescorer, which uses a
query to tweak the scoring. In the future, alternative rescorers may be made available, for
example, a pair-wise rescorer.
Query rescorer
The query rescorer executes a second query only on the Top-K results returned by the
query and post_filter phases. The number of docs which will be examined on each
shard can be controlled by the window_size parameter, which defaults to from and
size.
By default the scores from the original query and the rescore query are combined linearly
to produce the final _score for each document. The relative importance of the original
query and of the rescore query can be controlled with the query_weight and
rescore_query_weight respectively. Both default to 1.
Chapter 44. Search APIs | 681
NG|Storage Admin Guide
For example:
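The example itself is missing from this extract; the following Python sketch is a hypothetical reconstruction of a typical query-rescorer request, with illustrative field names and weights:

```python
import json

# Hypothetical sketch of a query rescorer: rerank the top window_size
# docs per shard with a (typically more expensive) match_phrase query.
# Field names and weights are illustrative assumptions.
request = {
    "query": {
        "match": {"message": {"operator": "or", "query": "the quick brown"}}
    },
    "rescore": {
        "window_size": 50,  # number of top docs to rescore on each shard
        "query": {
            "rescore_query": {
                "match_phrase": {"message": {"query": "the quick brown", "slop": 2}}
            },
            "query_weight": 0.7,           # weight of the original score
            "rescore_query_weight": 1.2    # weight of the rescore score
        }
    }
}

body = json.dumps(request)
```

With the default linear combination, the final _score of a rescored document is query_weight * original_score + rescore_query_weight * rescore_score.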
The way the scores are combined can be controlled with the score_mode:
Multiple Rescores
It is also possible to execute multiple rescores in sequence.
The first one gets the results of the query then the second one gets the results of the first,
etc. The second rescore will "see" the sorting done by the first rescore so it is possible to
use a large window on the first rescore to pull documents into a smaller window for the
second rescore.
44.1.13. Script Fields
Allows returning a script evaluation (based on different fields) for each hit, for example:
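The example referenced here is missing from this extract; the following Python sketch is a hypothetical reconstruction of its likely shape, using the my_field_name field mentioned in the prose that follows:

```python
import json

# Hypothetical reconstruction of the missing script_fields example:
# compute values from doc values of my_field_name for each hit.
request = {
    "query": {"match_all": {}},
    "script_fields": {
        "test1": {"script": "doc['my_field_name'].value * 2"},
        "test2": {
            "script": {
                "inline": "doc['my_field_name'].value * factor",
                "params": {"factor": 2.0}  # parameterized script value
            }
        }
    }
}

body = json.dumps(request)
```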
Script fields can work on fields that are not stored (my_field_name in the above case),
and allow custom values to be returned (the evaluated value of the script).
Script fields can also access the actual _source document indexed and extract specific
elements to be returned from it (the result can be an "object" type). Here is an example:
GET /_search
{
"query" : {
"match_all": {}
},
"script_fields" : {
"test1" : {
"script" : "_source.obj1.obj2"
}
}
}
44.1.14. Scroll
While a search request returns a single "page" of results, the scroll API can
be used to retrieve large numbers of results (or even all results) from a single search
request, in much the same way as you would use a cursor on a traditional database.
Scrolling is not intended for real time user requests, but rather for processing large
amounts of data, e.g. in order to reindex the contents of one index into a new index with a
different configuration.
The results that are returned from a scroll request reflect the state of the
index at the time that the initial search request was made, like a
snapshot in time. Subsequent changes to documents (index, update or
delete) will only affect later search requests.
In order to use scrolling, the initial search request should specify the scroll parameter in
the query string, which tells NG|Storage how long it should keep the "search context"
alive (see [scroll-search-context]), e.g. ?scroll=1m.
The result from the above request includes a _scroll_id, which should be passed to the
scroll API in order to retrieve the next batch of results.
3 - The scroll parameter tells NG|Storage to keep the search context open for another
1m.
Each call to the scroll API returns the next batch of results until there are no more
results left to return, i.e. the hits array is empty.
For backwards compatibility, scroll_id and scroll can be passed in the query string.
And the scroll_id can be passed in the request body
The initial search request and each subsequent scroll request returns a
new _scroll_id; only the most recent _scroll_id should be used.
If the request specifies aggregations, only the initial search response will
contain the aggregations results.
Scroll requests have optimizations that make them faster when the sort
order is _doc. If you want to iterate over all documents regardless of the
order, this is the most efficient option:
For more information please refer to the source ElasticSearch reference documentation
chapter.
For more information please refer to the source ElasticSearch reference documentation
chapter.
Sliced Scroll
For more information please refer to the source ElasticSearch reference documentation
chapter.
44.1.15. Search After
Pagination of results can be done by using from and size, but the cost becomes
prohibitive when deep pagination is reached. The index.max_result_window setting,
which defaults to 10,000, is a safeguard: search requests take heap memory and time
proportional to from + size. The Scroll API is recommended for efficient deep scrolling,
but scroll contexts are costly and it is not recommended to use them for real-time user
requests. The search_after parameter circumvents this problem by providing a live
cursor. The idea is to use the results from the previous page to help the retrieval of the
next page.
Suppose that the query to retrieve the first page looks like this:
GET twitter/tweet/_search
{
"size": 10,
"query": {
"match" : {
"title" : "ngStorage"
}
},
"sort": [
{"date": "asc"},
{"_uid": "desc"}
]
}
A field with one unique value per document should be used as the
tiebreaker of the sort specification. Otherwise the sort order for
documents that have the same sort values would be undefined. The
recommended way is to use the field _uid which is certain to contain one
unique value for each document.
The result from the above request includes an array of sort values for each document.
These sort values can be used in conjunction with the search_after parameter to
start returning results "after" any document in the result list. For instance we can use the
sort values of the last document and pass it to search_after to retrieve the next page
of results:
GET twitter/tweet/_search
{
"size": 10,
"query": {
"match" : {
"title" : "ngStorage"
}
},
"search_after": [1463538857, "tweet#654323"],
"sort": [
{"date": "asc"},
{"_uid": "desc"}
]
}
search_after is not a solution to jump freely to a random page but rather to scroll many
queries in parallel. It is very similar to the scroll API, but unlike it, the search_after
parameter is stateless; it is always resolved against the latest version of the searcher. For
this reason the sort order may change during a walk depending on the updates and deletes
of your index.
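The page-by-page walk described above can be sketched as follows; walk() is a hypothetical helper and send() stands in for an HTTP client, with illustrative index and query values:

```python
# Hypothetical sketch of a search_after walk: feed the sort values of
# the last hit of each page into the next request. send(method, path,
# body) stands in for an HTTP client returning the parsed JSON response.
def walk(send, page_size=10):
    body = {
        "size": page_size,
        "query": {"match": {"title": "ngStorage"}},
        "sort": [{"date": "asc"}, {"_uid": "desc"}],  # _uid is the unique tiebreaker
    }
    while True:
        hits = send("GET", "/twitter/tweet/_search", body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # resume after the last document of this page
        body["search_after"] = hits[-1]["sort"]
```

Because each request is resolved against the latest searcher state, concurrent updates and deletes can change the order mid-walk, exactly as the text above warns.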
44.1.16. Search Type
There are different execution paths that can be taken when executing a distributed search.
The distributed search operation needs to be scattered to all the relevant shards and then
all the results are gathered back. When doing scatter/gather type execution, there are
several ways to do that, specifically with search engines.
One of the questions when executing a distributed search is how many results to retrieve
from each shard. For example, if we have 10 shards, the 1st shard might hold the most
relevant results from 0 till 10, with the results of other shards ranking below it. For this
reason, when executing a request, we will need to get results from 0 till 10 from all shards,
sort them, and then return the results if we want to ensure correct results.
Another question, which relates to the search engine, is the fact that each shard stands on
its own. When a query is executed on a specific shard, it does not take into account term
frequencies and other search engine information from the other shards. If we want to
support accurate ranking, we would need to first gather the term frequencies from all
shards to calculate global term frequencies, then execute the query on each shard using
these global frequencies.
NG|Storage is very flexible and allows controlling the type of search to execute on a per-
search-request basis. The type can be configured by setting the search_type parameter
in the query string. The types are:
query_then_fetch
The request is processed in two phases. In the first phase, the query is forwarded to all
involved shards. Each shard executes the search and generates a sorted list of results,
local to that shard. Each shard returns just enough information to the coordinating node to
allow it merge and re-sort the shard level results into a globally sorted set of results, of
maximum length size.
During the second phase, the coordinating node requests the document content (and
highlighted snippets, if any) from only the relevant shards.
dfs_query_then_fetch
Same as "Query Then Fetch", except for an initial scatter phase which goes and computes
the distributed term frequencies for more accurate scoring.
44.1.17. Sort
Allows adding one or more sorts on specific fields. Each sort can be reversed as well. The
sort is defined on a per-field level, with the special field name _score to sort by score, and
_doc to sort by index order.
For more information please refer to the source ElasticSearch reference documentation
chapter.
44.1.18. Source Filtering
Allows controlling how the _source field is returned with every hit.
By default operations return the contents of the _source field unless you have used the
fields parameter or the _source field is disabled.
You can turn off _source retrieval by using the _source parameter:
GET /_search
{
"_source": false,
"query" : {
"term" : { "user" : "kimchy" }
}
}
The _source also accepts one or more wildcard patterns to control what parts of the
_source should be returned:
For example:
GET /_search
{
"_source": "obj.*",
"query" : {
"term" : { "user" : "kimchy" }
}
}
Or
GET /_search
{
"_source": [ "obj1.*", "obj2.*" ],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Finally, for complete control, you can specify both include and exclude patterns:
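The include/exclude example is missing from this extract; the following Python sketch reconstructs its likely shape, with the excludes pattern as an illustrative assumption:

```python
import json

# Hypothetical reconstruction of the missing example: the _source
# parameter as an object with "includes" and "excludes" wildcard
# patterns. The excludes pattern is an illustrative assumption.
request = {
    "_source": {
        "includes": ["obj1.*", "obj2.*"],   # keep only these subtrees
        "excludes": ["*.description"]       # but drop any description leaf
    },
    "query": {"term": {"user": "kimchy"}}
}

body = json.dumps(request)
```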
44.1.19. Fields
Allows selectively loading specific stored fields for each document represented by a search
hit.
GET /_search
{
"stored_fields" : ["user", "postDate"],
"query" : {
"term" : { "user" : "kimchy" }
}
}
An empty array will cause only the _id and _type for each hit to be returned, for example:
GET /_search
{
"stored_fields" : [],
"query" : {
"term" : { "user" : "kimchy" }
}
}
For backwards compatibility, if the fields parameter specifies fields which are not stored
(store mapping set to false), it will load the _source and extract the values from it. This
functionality has been replaced by the source filtering parameter.
Field values fetched from the document itself are always returned as an array. Metadata
fields, by contrast, are never returned as an array.
Also, only leaf fields can be returned via the field option. Object fields can’t be returned,
and such requests will fail.
Script fields can also be automatically detected and used as fields, so things like
_source.obj1.field1 can be used, though this is not recommended, as obj1.field1
will work as well.
44.1.20. Version
Returns a version for each search hit.
GET /_search
{
"version": true,
"query" : {
"term" : { "user" : "kimchy" }
}
}
44.2. Suggesters
The suggest feature suggests similar looking terms based on a provided text by using a
suggester. Parts of the suggest feature are still under development.
The suggest request part is either defined alongside the query part in a _search request
or via the REST _suggest endpoint.
Suggest requests executed against the _suggest endpoint should omit the surrounding
suggest element which is only used if the suggest request is part of a search.
Several suggestions can be specified per request. Each suggestion is identified with an
arbitrary name. In the example below two suggestions are requested. Both my-suggest-1
and my-suggest-2 suggestions use the term suggester, but have a different text.
"suggest" : {
"my-suggest-1" : {
"text" : "the amsterdma meetpu",
"term" : {
"field" : "body"
}
},
"my-suggest-2" : {
"text" : "the rottredam meetpu",
"term" : {
"field" : "title"
}
}
}
The below suggest response example includes the suggestion responses for my-suggest-1
and my-suggest-2. Each suggestion part contains entries. Each entry is effectively a
token from the suggest text and contains the suggestion entry text, the original start offset
and length in the suggest text and if found an arbitrary number of options.
Each options array contains an option object that includes the suggested text, its document
frequency and score compared to the suggest entry text. The meaning of the score depends
on the used suggester. The term suggester’s score is based on the edit distance.
"options": [
{
"text": "amsterdam",
"freq": 77,
"score": 0.8888889
},
...
]
To avoid repetition of the suggest text, it is possible to define a global text. In the example
below the suggest text is defined globally and applies to the my-suggest-1 and
my-suggest-2 suggestions.
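The global-text example itself is missing from this extract; the following Python sketch reconstructs its likely shape from the earlier request in this section (the field assignments are an assumption):

```python
import json

# Hypothetical reconstruction of the missing example: a global "text"
# shared by both suggestions. The misspelled text is the input being
# corrected; field assignments are an assumption.
request = {
    "suggest": {
        "text": "the amsterdma meetpu",   # applies to every suggestion below
        "my-suggest-1": {"term": {"field": "title"}},
        "my-suggest-2": {"term": {"field": "body"}}
    }
}

body = json.dumps(request)
```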
The suggest text can, as in the above example, also be specified as a suggestion-specific
option. The suggest text specified at the suggestion level overrides the suggest text at the
global level.
In the below example we request suggestions for the following suggest text: devloping
distibutd saerch engies on the title field with a maximum of 3 suggestions per
term inside the suggest text. Note that in this example we set size to 0. This isn’t required,
but a nice optimization. The suggestions are gathered in the query phase and in the case
that we only care about suggestions (so no hits) we don’t need to execute the fetch phase.
The above request could yield the response as stated in the code example below. As you
can see if we take the first suggested options of each suggestion entry we get developing
distributed search engines as result.
{
...
"suggest": {
"my-title-suggestions-1": [
{
"text": "devloping",
For more information please refer to the source ElasticSearch reference documentation
chapter.
44.3. Count API
The count API allows you to execute a query and get the number of matches for that
query. It can be executed across one or more indices and across one or more types. The
query can either be provided using a simple query string as a parameter, or using the Query
DSL defined within the request body. Here is an example:
GET /twitter/tweet/_count?q=user:kimchy
GET /twitter/tweet/_count
{
"query" : {
"term" : { "user" : "kimchy" }
}
}
The query being sent in the body must be nested in a query key, the same as
how the search API works
Both examples above do the same thing, which is count the number of tweets from the
twitter index for a certain user. The result is:
{
"count" : 1,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
The query is optional, and when not provided, it will use match_all to count all the docs.
Request Parameters
When executing count using the query parameter q, the query passed is a query string
using Lucene query parser. There are additional parameters that can be passed:
Name Description
df The default field to use when no field prefix
is defined within the query.
analyzer The analyzer name to be used when
analyzing the query string.
default_operator The default operator to be used, can be AND
or OR. Defaults to OR.
Request Body
The count can use the Query DSL within its body in order to express the query that should
be executed. The body content can also be passed as a REST parameter named source.
Both HTTP GET and HTTP POST can be used to execute count with body. Since not all
clients support GET with body, POST is allowed as well.
Distributed
The count operation is broadcast across all shards. For each shard id group, a replica is
chosen and executed against it. This means that replicas increase the scalability of count.
Routing
The routing value (a comma separated list of the routing values) can be specified to control
which shards the count request will be executed on.
44.4. Explain API
The explain API computes a score explanation for a query and a specific document. This can
give useful feedback on whether a document matches or didn’t match a specific query.
The index and type parameters expect a single index and a single type respectively.
Usage
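The usage example itself is missing from this extract; the following Python sketch is a hypothetical request whose shape matches the response shown below (index, type, document ID, and field are illustrative assumptions):

```python
import json

# Hypothetical explain request: ask why document 0 of type tweet in
# index twitter does or does not match the query. All identifiers here
# are illustrative assumptions.
path = "/twitter/tweet/0/_explain"
request = {"query": {"term": {"message": "search"}}}

body = json.dumps(request)
```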
{
"matches" : true,
"explanation" : {
"value" : 0.15342641,
"description" : "fieldWeight(message:search in 0), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(message:search)=1)"
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.5,
"description" : "fieldNorm(field=message, doc=0)"
} ]
}
}
There is also a simpler way of specifying the query via the q parameter. The specified q
parameter value is then parsed as if the query_string query was used. Example usage of
the q parameter in the explain api:
All parameters:
_source
Set to true to retrieve the _source of the document explained. You can also retrieve
part of the document by using _source_include & _source_exclude (see Get
API for more details)
fields
Allows to control which stored fields to return as part of the document explained.
routing
Controls the routing in the case the routing was used during indexing.
preference
Controls on which shard the explain is executed.
source
Allows the data of the request to be put in the query string of the url.
q
The query string (maps to the query_string query).
df
The default field to use when no field prefix is defined within the query. Defaults to
_all field.
analyzer
The analyzer name to be used when analyzing the query string. Defaults to the
analyzer of the _all field.
analyze_wildcard
Should wildcard and prefix queries be analyzed or not. Defaults to false.
lowercase_expanded_terms
Should terms be automatically lowercased or not. Defaults to true.
lenient
If set to true will cause format based failures (like providing text to a numeric field) to
be ignored. Defaults to false.
default_operator
The default operator to be used, can be AND or OR. Defaults to OR.
44.5. Multi Search API
The multi search API allows executing several search requests within the same API call.
The endpoint for it is _msearch.
The format of the request is similar to the bulk API format, and the structure is as follows
(the structure is specifically optimized to reduce parsing if a specific search ends up
redirected to another node):
The header part includes which index / indices to search on, optional (mapping) types to
search on, the search_type, preference, and routing. The body includes the typical
search body request (including the query, aggregations, from, size, and so on). Here
is an example:
$ cat requests
{"index" : "test"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10}
{"index" : "test", "search_type" : "dfs_query_then_fetch"}
{"query" : {"match_all" : {}}}
{}
{"query" : {"match_all" : {}}}
Note, the above includes an example of an empty header (can also be just without any
content) which is supported as well.
The response returns a responses array, which includes the search response and status
code for each search request matching its order in the original multi search request. If
there was a complete failure for that specific search request, an object with error
message and corresponding status code will be returned in place of the actual search
response.
The endpoint also allows searching against an index/indices and type/types in the URI itself,
in which case they will be used as the default unless explicitly defined otherwise in the header.
For example:
$ cat requests
{}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10}
{}
{"query" : {"match_all" : {}}}
{"index" : "test2"}
{"query" : {"match_all" : {}}}
The search_type can be set in a similar manner to globally apply to all search requests.
Security
See [url-access-control]
44.6. Profile API
This functionality is experimental.
The Profile API provides detailed timing information about the execution of individual
components in a search request. It gives the user insight into how search requests are
executed at a low level so that the user can understand why certain requests are slow, and
take steps to improve them.
The output from the Profile API is very verbose, especially for complicated requests
executed across many shards. Pretty-printing the response is recommended to help
understand the output.
Usage
1 - Setting the top-level profile parameter to true will enable profiling for the search
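The profiled request itself is missing from this extract; the following Python sketch shows the likely shape of a search with profiling enabled, matching callout 1 above (the query body is an illustrative assumption):

```python
import json

# Hypothetical sketch of enabling the profiler: the top-level "profile"
# flag is set alongside a normal query. The query body is illustrative.
request = {
    "profile": True,   # 1: enable profiling for this search
    "query": {"match": {"message": "search test"}}
}

body = json.dumps(request)
```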
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.078072,
"hits": [ ... ] 1
},
"profile": {
"shards": [
{
"id": "[2aE02wS1R8q_QFnYu6vDVQ][test][1]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "message:search message:test",
"time": "15.52889800ms",
"breakdown": {
"score": 6352,
"score_count": 1,
"build_scorer": 1800776,
"build_scorer_count": 1,
"match": 0,
"match_count": 0,
"create_weight": 667400,
"create_weight_count": 1,
"next_doc": 10563,
"next_doc_count": 2,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "message:search",
"time": "4.938855000ms",
"breakdown": {
"score": 0,
"score_count": 0,
"build_scorer": 3230,
"build_scorer_count": 1,
"match": 0,
"match_count": 0,
"create_weight": 415612,
"create_weight_count": 1,
"next_doc": 0,
"next_doc_count": 0,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "message:test",
"time": "0.5016660000ms",
"breakdown": {
"score": 5014,
"score_count": 1,
"build_scorer": 1689333,
1 - Search results are returned, but were omitted here for brevity
Even for a simple query, the response is relatively complicated. Let’s break it down
piece-by-piece before moving to more complex examples.
{
"profile": {
"shards": [
{
"id": "[2aE02wS1R8q_QFnYu6vDVQ][test][1]", 1
"searches": [
{
"query": [...], 2
"rewrite_time": 870954, 3
"collector": [...] 4
}
],
"aggregations": [...] 5
}
]
}
}
1 - A profile is returned for each shard that participated in the response, identified by a unique ID
2 - Each profile contains a section which holds details about the query execution
3 - Each profile has a single time representing the cumulative rewrite time
4 - Each profile also contains a section about the Lucene Collectors which run the search
5 - Each profile contains a section which holds the details about the aggregation execution
Because a search request may be executed against one or more shards in an index, and a
search may cover one or more indices, the top level element in the profile response is an
array of shard objects. Each shard object lists its id, which uniquely identifies the shard.
The ID’s format is [nodeID][indexName][shardID].
The profile itself may consist of one or more "searches", where a search is a query
executed against the underlying Lucene index. Most Search Requests submitted by the
user will only execute a single search against the Lucene index. But occasionally multiple
searches will be executed, such as including a global aggregation (which needs to execute
a secondary "match_all" query for the global context).
Inside each search object there will be two arrays of profiled information: a query array
and a collector array. Alongside the search object is an aggregations object that
contains the profile information for the aggregations. In the future, more sections may be
added, such as suggest, highlight, etc.
There will also be a rewrite metric showing the total time spent rewriting the query (in
nanoseconds).
Profiling Queries
With that said, a complete understanding is often not required to fix a slow
query. It is usually sufficient to see that a particular component of a query
is slow, and not necessarily understand why the advance phase of that
query is the cause, for example.
query Section
The query section contains detailed timing of the query tree executed by Lucene on a
particular shard. The overall structure of this query tree will resemble your original
NG|Storage query, but may be slightly (or sometimes very) different. It will also use similar
but not always identical naming. Using our previous term query example, let’s analyze the
query section:
"query": [
{
"type": "BooleanQuery",
"description": "message:search message:test",
"time": "15.52889800ms",
"breakdown": {...}, 1
"children": [
{
"type": "TermQuery",
"description": "message:search",
"time": "4.938855000ms",
"breakdown": {...}
},
{
"type": "TermQuery",
"description": "message:test",
"time": "0.5016660000ms",
"breakdown": {...}
}
]
}
]
The "time" field shows that this query took ~15ms for the entire BooleanQuery to execute.
The recorded time is inclusive of all children.
The "breakdown" field will give detailed stats about how the time was spent; we’ll look at
that in a moment. Finally, the "children" array lists any sub-queries that may be
present. Because we searched for two values ("search test"), our BooleanQuery holds two
children TermQueries. They have identical information (type, time, breakdown, etc).
Children are allowed to have their own children.
Timing Breakdown
The "breakdown" component lists detailed timing statistics about low-level Lucene
execution:
"breakdown": {
"score": 5014,
"score_count": 1,
"build_scorer": 1689333,
"build_scorer_count": 1,
"match": 0,
"match_count": 0,
"create_weight": 166587,
"create_weight_count": 1,
"next_doc": 5542,
"next_doc_count": 2,
"advance": 0,
"advance_count": 0
}
Timings are listed in wall-clock nanoseconds and are not normalized at all. All caveats
about the overall "time" apply here. The intention of the breakdown is to give you a feel
for A) what machinery in Lucene is actually eating time, and B) the magnitude of differences
in times between the various components. Like the overall time, the breakdown is inclusive
of all children times.
create_weight
A Query in Lucene must be capable of reuse across multiple IndexSearchers (think of
it as the engine that executes a search against a specific Lucene Index). This puts
Lucene in a tricky spot, since many queries need to accumulate temporary
state/statistics associated with the index it is being used against, but the Query
contract mandates that it must be immutable.
To get around this, Lucene asks each query to generate a Weight object which acts as
a temporary context object to hold state associated with this particular
(IndexSearcher, Query) tuple. The weight metric shows how long this process takes.
build_scorer
This parameter shows how long it takes to build a Scorer for the query. A Scorer is
the mechanism that iterates over matching documents and generates a score per
document (e.g. how well does "foo" match the document?). Note, this records the
time required to generate the Scorer object, not actually score the documents. Some
queries have faster or slower initialization of the Scorer, depending on optimizations,
complexity, etc.
This may also show timing associated with caching, if enabled and/or applicable
for the query.
next_doc
The Lucene method next_doc returns the doc ID of the next document matching the
query. This statistic shows the time it takes to determine which document is the next
match, a process that varies considerably depending on the nature of the query.
next_doc is a specialized form of advance() which is more convenient for many
queries in Lucene. It is equivalent to advance(docId() + 1).
advance
advance is the "lower level" version of next_doc: it serves the same purpose of
finding the next matching doc, but requires the calling query to perform extra tasks
such as identifying and moving past skips, etc. However, not all queries can use
next_doc, so advance is also timed for those queries.
match
Some queries match documents using a two-phase process: first a document is
checked approximately, and only if it matches approximately is a second, more
rigorous (and more expensive) verification executed. For example, a phrase query
first checks a document approximately by ensuring all terms in the phrase are
present in the doc. If all the terms are present, it then executes the second phase
verification to ensure the terms are in order to form the phrase, which is relatively
more expensive than just checking for presence of the terms.
Because this two-phase process is only used by a handful of queries, the match
statistic will often be zero.
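The two-phase idea can be illustrated with a small sketch. This is an illustration of the concept only, not Lucene's implementation; the document layout and term positions are invented:

```python
# Two-phase matching, sketched: a cheap approximation (are all terms
# present?) followed by a costlier verification (are the terms adjacent
# and in order?), as a phrase query does.

def approximate_match(doc_positions, terms):
    # Phase 1: every term must occur somewhere in the document.
    return all(t in doc_positions for t in terms)

def verify_phrase(doc_positions, terms):
    # Phase 2: some occurrence of terms[0] must be followed at +1, +2, ...
    # by the remaining terms, in order.
    for start in doc_positions[terms[0]]:
        if all(start + i in doc_positions[t]
               for i, t in enumerate(terms[1:], 1)):
            return True
    return False

# term -> positions of that term within one document
doc = {"quick": [1], "brown": [2, 7], "fox": [3]}

assert approximate_match(doc, ["quick", "brown", "fox"])
assert verify_phrase(doc, ["quick", "brown", "fox"])
# terms present but out of order: phase 1 passes, phase 2 fails
assert approximate_match(doc, ["brown", "quick"])
assert not verify_phrase(doc, ["brown", "quick"])
```

Only documents that survive the cheap first phase pay for the expensive second phase, which is why the statistic is zero for queries that never need verification.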
score
This records the time taken to score a particular document via its Scorer.
*_count
Records the number of invocations of the particular method. For example,
"next_doc_count": 2, means the nextDoc() method was called on two
different documents. This can be used to help judge how selective queries are, by
comparing counts between different query components.
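To get a feel for which machinery is eating time, the breakdown can be post-processed on the client side; this is a minimal analysis sketch, not part of the API (the numbers are the sample breakdown shown above):

```python
# Summarize a profile "breakdown" object: each phase's share of the total
# and the average nanoseconds per invocation.
breakdown = {
    "score": 5014, "score_count": 1,
    "build_scorer": 1689333, "build_scorer_count": 1,
    "match": 0, "match_count": 0,
    "create_weight": 166587, "create_weight_count": 1,
    "next_doc": 5542, "next_doc_count": 2,
    "advance": 0, "advance_count": 0,
}

phases = [k for k in breakdown if not k.endswith("_count")]
total = sum(breakdown[p] for p in phases)

for phase in sorted(phases, key=lambda p: -breakdown[p]):
    time_ns = breakdown[phase]
    count = breakdown[phase + "_count"]
    per_call = time_ns / count if count else 0.0
    print(f"{phase:>13}: {time_ns:>9} ns "
          f"({100 * time_ns / total:5.1f}%), {per_call:.0f} ns/call")
```

Run against the sample above, this shows that build_scorer dominates (roughly 90% of the total), which is a more useful signal than the raw nanosecond values.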
collectors Section
The Collectors portion of the response shows high-level execution details. Lucene works by
defining a "Collector" which is responsible for coordinating the traversal, scoring and
collection of matching documents. Collectors are also how a single query can record
aggregation results, execute unscoped "global" queries, execute post-query filters, etc.
"collector": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "2.206529000ms"
}
]
It should be noted that Collector times are independent from the Query times. They are
calculated, combined and normalized independently! Due to the nature of Lucene’s
execution, it is impossible to "merge" the times from the Collectors into the Query section,
so they are displayed in separate portions.
search_sorted
A collector that scores and sorts documents. This is the most common collector and
will be seen in most simple searches
search_count
A collector that only counts the number of documents that match the query, but does
not fetch the source. This is seen when size: 0 is specified
search_terminate_after_count
A collector that terminates search execution after n matching documents have been
found. This is seen when the terminate_after_count query parameter has been
specified
search_min_score
A collector that only returns matching documents that have a score greater than n.
This is seen when the top-level parameter min_score has been specified.
search_multi
A collector that wraps several other collectors. This is seen when combinations of
search, aggregations, global aggs and post_filters are combined in a single search.
search_timeout
A collector that halts execution after a specified period of time. This is seen when a
timeout top-level parameter has been specified.
aggregation
A collector that NG|Storage uses to run aggregations against the query scope. A
single aggregation collector is used to collect documents for all aggregations, so
you will see a list of aggregations in the name rather than a single aggregation.
Chapter 44. Search APIs | 711
NG|Storage Admin Guide
global_aggregation
A collector that executes an aggregation against the global query scope, rather than
the specified query. Because the global scope is necessarily different from the
executed query, it must execute it’s own match_all query (which you will see added to
the Query section) to collect your entire dataset
rewrite Section
All queries in Lucene undergo a "rewriting" process. A query (and its sub-queries) may be
rewritten one or more times, and the process continues until the query stops changing.
This process allows Lucene to perform optimizations, such as removing redundant
clauses or replacing one query with a more efficient execution path. For example, a
Boolean → Boolean → TermQuery can be rewritten to a TermQuery, because all the
Booleans are unnecessary in this case.
The rewriting process is complex and difficult to display, since queries can change
drastically. Rather than showing the intermediate results, the total rewrite time is simply
displayed as a value (in nanoseconds). This value is cumulative and contains the total time
for all queries being rewritten.
To demonstrate a slightly more complex query and the associated results, we can profile
the following query:
• A query
• A scoped aggregation
• A global aggregation
• A post_filter
{
"profile": {
"shards": [
{
"id": "[P6-vulHtQRWuD4YnubWb7A][test][0]",
"searches": [
{
"query": [
{
"type": "TermQuery",
"description": "my_field:foo",
"time": "0.4094560000ms",
"breakdown": {
"score": 0,
"score_count": 1,
"next_doc": 0,
"next_doc_count": 2,
"match": 0,
"match_count": 0,
"create_weight": 31584,
"create_weight_count": 1,
"build_scorer": 377872,
"build_scorer_count": 1,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "message:search",
"time": "0.3037020000ms",
"breakdown": {
"score": 0,
"score_count": 1,
"next_doc": 5936,
"next_doc_count": 2,
"match": 0,
"match_count": 0,
"create_weight": 185215,
"create_weight_count": 1,
"build_scorer": 112551,
"build_scorer_count": 1,
"advance": 0,
"advance_count": 0
}
}
],
"rewrite_time": 7208,
"collector": [
{
"name": "MultiCollector",
"reason": "search_multi",
"time": "1.378943000ms",
"children": [...]
...
The Collector tree is fairly straightforward, showing how a single MultiCollector wraps a
FilteredCollector to execute the post_filter (which in turn wraps the normal scoring
SimpleCollector) and a BucketCollector to run all scoped aggregations. In the MatchAll
search, there is a single GlobalAggregator to run the global aggregation.
A special note needs to be made about the MultiTermQuery class of queries. This
includes wildcards, regex and fuzzy queries. These queries emit very verbose responses,
and are not overly structured.
Essentially, these queries rewrite themselves on a per-segment basis. If you imagine the
wildcard query b*, it technically can match any token that begins with the letter "b". It
would be impossible to enumerate all possible combinations, so Lucene rewrites the query
in context of the segment being evaluated. E.g. one segment may contain the tokens
[bar, baz], so the query rewrites to a BooleanQuery combination of "bar" and "baz".
Another segment may only have the token [bakery], so the query rewrites to a single
TermQuery for "bakery".
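This per-segment rewriting can be sketched as follows; the segment contents and return shapes are invented for illustration:

```python
# Sketch of per-segment rewriting of a wildcard query: each segment
# enumerates only its own matching terms, so the same query rewrites to
# a different concrete query in each segment.
def rewrite_wildcard(prefix, segment_terms):
    matches = sorted(t for t in segment_terms if t.startswith(prefix))
    if len(matches) == 1:
        return ("TermQuery", matches[0])
    return ("BooleanQuery", matches)

segment_1 = {"bar", "baz", "apple"}
segment_2 = {"bakery", "zebra"}

print(rewrite_wildcard("b", segment_1))  # ('BooleanQuery', ['bar', 'baz'])
print(rewrite_wildcard("b", segment_2))  # ('TermQuery', 'bakery')
```

Because each segment produces its own rewritten form, the profiled tree contains several sibling structures for what was logically one query, which is what distorts the "lineage" in the response.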
Due to this dynamic, per-segment rewriting, the clean tree structure becomes distorted
and no longer follows a clean "lineage" showing how one query rewrites into the next. At
present, all we can do is apologize and suggest you collapse the details for that query's
children if they are too confusing. Luckily, all the timing statistics are correct, just not the
physical layout in the response, so it is sufficient to analyze the top-level
MultiTermQuery and ignore its children if you find the details too tricky to interpret.
Hopefully this will be fixed in future iterations, but it is a tricky problem to solve and still
in progress.
Profiling Aggregations
The aggregations section contains detailed timing of the aggregation tree executed by a
particular shard. The overall structure of this aggregation tree will resemble your original
NG|Storage request. Let’s consider the following example aggregations request:
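The request itself was omitted here; it presumably had the following shape. The aggregation names property_type and avg_price come from the discussion below, while the averaged field (price) is an assumption for illustration:

```json
{
  "profile": true,
  "size": 0,
  "aggs": {
    "property_type": {
      "terms": { "field": "property_type" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
```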
From the profile structure we can see our property_type terms aggregation which is
internally represented by the GlobalOrdinalsStringTermsAggregator class and the
sub aggregator avg_price which is internally represented by the AvgAggregator class.
The type field displays the class used internally to represent the aggregation. The
description field displays the name of the aggregation.
The "time" field shows that it took ~4 seconds for the entire aggregation to execute. The
recorded time is inclusive of all children.
The "breakdown" field will give detailed stats about how the time was spent; we'll look at
that in a moment. Finally, the "children" array lists any sub-aggregations that may be
present. Because we have an avg_price aggregation as a sub-aggregation to the
property_type aggregation we see it listed as a child of the property_type
aggregation. The two aggregation outputs have identical information (type, time,
breakdown, etc.). Children are allowed to have their own children.
Timing Breakdown
The "breakdown" component lists detailed timing statistics about low-level Lucene
execution:
"breakdown": {
"reduce": 0,
"reduce_count": 0,
"build_aggregation": 49765,
"build_aggregation_count": 300,
"initialise": 52785,
"initialize_count": 300,
"collect": 3155490036,
"collect_count": 1800
}
Timings are listed in wall-clock nanoseconds and are not normalized at all. All caveats
about the overall time apply here. The intention of the breakdown is to give you a feel for
A) what machinery in NG|Storage is actually eating time, and B) the magnitude of
differences in times between the various components. Like the overall time, the
breakdown is inclusive of all children times.
All parameters:
initialise
This times how long it takes to create and initialise the aggregation before starting to
collect documents.
collect
This represents the cumulative time spent in the collect phase of the aggregation.
This is where matching documents are passed to the aggregation and the state of the
aggregator is updated based on the information contained in the documents.
build_aggregation
This represents the time spent creating the shard level results of the aggregation
ready to pass back to the reducing node after the collection of documents is finished.
reduce
This is not currently used and will always report 0. Currently aggregation profiling
only times the shard-level portion of execution; timing of the reduce phase may be
added in the future.
*_count
Records the number of invocations of the particular method. For example,
"collect_count": 2, means the collect() method was called on two different
documents.
Profiling Considerations
Performance Notes
Like any profiler, the Profile API introduces a non-negligible overhead to search execution.
The act of instrumenting low-level method calls such as collect, advance and
next_doc can be fairly expensive, since these methods are called in tight loops.
Therefore, profiling should not be enabled in production settings by default, and should not
be compared against non-profiled query times. Profiling is just a diagnostic tool.
There are also cases where special Lucene optimizations are disabled, since they are not
amenable to profiling. This could cause some queries to report larger relative times than
their non-profiled counterparts, but in general should not have a drastic effect compared to
other components in the profiled query.
Limitations
• The Profiler is still highly experimental. The Profiler is instrumenting parts of Lucene
that were never designed to be exposed in this manner, and so all results should be
viewed as a best effort to provide detailed diagnostics. We hope to improve this over
time. If you find obviously wrong numbers, strange query structures or other bugs,
please report them!
44.6. Search
The search API allows you to execute a search query and get back search hits that match
the query. The query can either be provided using a simple query string as a parameter, or
using a request body.
Multi-Index, Multi-Type
All search APIs can be applied across multiple types within an index, and across multiple
indices with support for the multi index syntax. For example, we can search on all
documents across all types within the twitter index:
$ curl -XGET
'http://localhost:9200/twitter/tweet,user/_search?q=user:kimchy'
We can also search all tweets with a certain tag across several indices (for example, when
each user has his own index):
$ curl -XGET
'http://localhost:9200/kimchy,ngStorage/tweet/_search?q=tag:wow'
Or we can search all tweets across all available indices using the _all placeholder:
$ curl -XGET 'http://localhost:9200/_all/tweet/_search?q=tag:wow'
By default NG|Storage rejects search requests that would query more than 1000 shards.
The reason is that such large numbers of shards make the job of the coordinating node very
CPU and memory intensive. It is usually a better idea to organize data in such a way that
there are fewer larger shards. In case you would like to bypass this limit, which is
discouraged, you can update the action.search.shard_count.limit cluster setting
to a greater value.
The search shards api returns the indices and shards that a search request would be
executed against. This can give useful feedback for working out issues or planning
optimizations with routing and shard preferences.
Usage
Full example:
{
"nodes": {
"JklnKbD7Tyqi9TP3_Q_tBg": {
"name": "Rl'nnd",
"transport_address": "inet[/192.168.1.113:9300]"
}
},
"shards": [
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 3,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 4,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 0,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 2,
"state": "STARTED"
}
],
[ ... ]
]
}
And specifying the same request, this time with a routing value:
{
"nodes": {
"JklnKbD7Tyqi9TP3_Q_tBg": {
"name": "Rl'nnd",
"transport_address": "inet[/192.168.1.113:9300]"
}
},
"shards": [
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 2,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 4,
"state": "STARTED"
}
]
]
}
This time the search will only be executed against two of the shards, because routing
values have been specified.
All parameters:
routing
A comma-separated list of routing values to take into account when determining
which shards a request would be executed against.
preference
Controls a preference of which shard replicas to execute the search request on. By
default, the operation is randomized between the shard replicas. See the preference
documentation for a list of all acceptable values.
local
A boolean value indicating whether to read the cluster state locally in order to
determine where shards are allocated, instead of using the master node's cluster state.
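The effect of a routing value can be sketched numerically. NG|Storage actually uses a murmur3 hash of the routing value; the sketch below substitutes md5 as a stand-in so it stays self-contained, which means the exact shard numbers are illustrative only:

```python
import hashlib

# Sketch of how a routing value selects a shard:
#   shard = hash(routing) % number_of_primary_shards
# The real system uses murmur3; md5 here is a stand-in.
def shard_for(routing, num_primary_shards):
    digest = hashlib.md5(routing.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# The same routing value always lands on the same shard, which is why a
# routed request hits only a subset of the five shards.
shards = {shard_for(r, 5) for r in ["kimchy", "ngStorage"]}
print(shards)
```

With two routing values there can be at most two distinct target shards, matching the routed search-shards response above.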
The /_search/template endpoint allows you to use the mustache language to pre-render
search requests, filling existing templates with template parameters before they are
executed.
GET /_search/template
{
"inline" : {
"query": { "match" : { "{{my_field}}" : "{{my_value}}" } },
"size" : "{{my_size}}"
},
"params" : {
"my_field" : "foo",
"my_value" : "bar",
"my_size" : 5
}
}
For more information on Mustache templating and what kind of templating you can do
with it, check out the online documentation of the mustache project.
GET /_search/template
{
"inline": {
"query": {
"terms": {
"status": [
"{{#status}}",
"{{.}}",
"{{/status}}"
]
}
}
},
"params": {
"status": [ "pending", "published" ]
}
}
which is rendered as:
{
"query": {
"terms": {
"status": [ "pending", "published" ]
}
}
}
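The next snippet is the rendered output of an array-concatenation template whose request was omitted here; it presumably used the {{#join}} function (shown later with a delimiter argument) along these lines:

```json
GET /_search/template
{
  "inline": {
    "query": {
      "match": {
        "emails": "{{#join}}emails{{/join}}"
      }
    }
  },
  "params": {
    "emails": [ "username@email.com", "lastname@email.com" ]
  }
}
```

which is rendered as: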
{
"query" : {
"match" : {
"emails" : "username@email.com,lastname@email.com"
}
}
}
GET /_search/template
{
"inline": {
"query": {
"range": {
"born": {
"gte" : "{{date.min}}",
"lte" : "{{date.max}}",
"format": "{{#join delimiter='||'}}date.formats{{/join
delimiter='||'}}"
}
}
}
},
"params": {
"date": {
"min": "2016",
"max": "31/12/2017",
"formats": ["dd/MM/yyyy", "yyyy"]
}
}
}
Default values
A default value is written as {{var}}{{^var}}default{{/var}}. For example:
{
"inline": {
"query": {
"range": {
"line_no": {
"gte": "{{start}}",
"lte": "{{end}}{{^end}}20{{/end}}"
}
}
}
},
"params": { ... }
}
When params is { "start": 10, "end": 15 } this query would be rendered as:
{
"range": {
"line_no": {
"gte": "10",
"lte": "15"
}
}
}
But when params is { "start": 10 } this query would use the default value for end:
{
"range": {
"line_no": {
"gte": "10",
"lte": "20"
}
}
}
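The default works because an inverted section ({{^var}}…{{/var}}) renders its body only when the variable is absent or falsy. The mechanics can be sketched with a minimal renderer that handles only plain variables and inverted sections; this is a sketch, not a full Mustache implementation:

```python
import re

# Minimal sketch of Mustache rendering for two constructs:
#   {{var}}              -> substitute the value (empty if missing)
#   {{^var}}...{{/var}}  -> render body only when var is missing/falsy
def render(template, params):
    # Inverted sections first: keep the body if the var is absent/falsy.
    def inverted(m):
        name, body = m.group(1), m.group(2)
        return body if not params.get(name) else ""
    out = re.sub(r"\{\{\^(\w+)\}\}(.*?)\{\{/\1\}\}", inverted, template)
    # Then plain variables; missing vars render as the empty string.
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(params.get(m.group(1), "")), out)

template = '{"gte": "{{start}}", "lte": "{{end}}{{^end}}20{{/end}}"}'
print(render(template, {"start": 10, "end": 15}))
# {"gte": "10", "lte": "15"}
print(render(template, {"start": 10}))
# {"gte": "10", "lte": "20"}
```

When end is present the inverted section disappears and {{end}} is substituted; when end is absent, {{end}} renders empty and the inverted section supplies the default 20.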
The {{#toJson}}parameter{{/toJson}} function can be used to convert parameters such
as maps and arrays to their JSON representation:
{
"inline": "{\"query\":{\"bool\":{\"must\":
{{#toJson}}clauses{{/toJson}} }}}",
"params": {
"clauses": [
{ "term": "foo" },
{ "term": "bar" }
]
}
}
which is rendered as:
{
"query" : {
"bool" : {
"must" : [
{
"term" : "foo"
},
{
"term" : "bar"
}
]
}
}
}
Conditional clauses
Conditional clauses cannot be expressed using the JSON form of the template. Instead, the
template must be passed as a string. For instance, let’s say we wanted to run a match
query on the line field, and optionally wanted to filter by line numbers, where start and
end are optional.
{
"query": {
"bool": {
"must": {
"match": {
"line": "{{text}}" 1
}
},
"filter": {
{{#line_no}} 2
"range": {
"line_no": {
{{#start}} 3
"gte": "{{start}}" 4
{{#end}},{{/end}} 5
{{/start}} 3
{{#end}} 6
"lte": "{{end}}" 7
{{/end}} 6
}
}
{{/line_no}} 2
}
}
}
}
5 - Add a comma after the gte clause only if line_no.start AND line_no.end are
specified
As written above, this template is not valid JSON because it includes the
section markers like {{#line_no}}. For this reason, the template
should either be stored in a file (see [pre-registered-templates]) or, when
used via the REST API, should be written as a string:
"inline":
"{\"query\":{\"bool\":{\"must\":{\"match\":{\"line\":\"{{tex
t}}\"}},\"filter\":{{{#line_no}}\"range\":{\"line_no\":{{{#s
tart}}\"gte\":\"{{start}}\"{{#end}},{{/end}}{{/start}}{{#end
}}\"lte\":\"{{end}}\"{{/end}}}}{{/line_no}}}}}}"
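Producing that escaped string by hand is error-prone. One option (a sketch, not an official tool) is to let a JSON library perform the escaping:

```python
import json

# Sketch: turn a readable mustache template (not itself valid JSON once
# section markers are added) into the escaped string form required by
# the "inline" field of the request body.
template = '{"query": {"match": {"line": "{{text}}"}}}'
body = {"inline": template, "params": {"text": "hello"}}

# json.dumps escapes the embedded quotes for us:
print(json.dumps(body))
```

The printed payload can be sent as-is; decoding it yields the original template string unchanged.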
Pre-registered template
You can register search templates by storing them in the config/scripts directory, in a file
using the .mustache extension. In order to execute the stored template, reference it by
its name under the template key:
GET /_search/template
{
"file": "storedTemplate", 1
"params": {
"query_string": "search for these words"
}
}
You can also register search templates by storing them in the cluster state. There are
REST APIs to manage these indexed templates.
POST /_search/template/<templatename>
{
"template": {
"query": {
"match": {
"title": "{{query_string}}"
}
}
}
}
The stored template can be retrieved with GET /_search/template/<templatename>,
which returns:
{
"template": {
"query": {
"match": {
"title": "{{query_string}}"
}
}
}
}
DELETE /_search/template/<templatename>
GET /_search/template
{
"id": "templateName", 1
"params": {
"query_string": "search for these words"
}
}
Validating templates
A template can be rendered with given parameters using the /_render/template
endpoint; the response contains the rendered template:
{
"template_output": {
"query": {
"terms": {
"status": [ 1
"pending",
"published"
]
}
}
}
}
1 - status array has been populated with values from the params object.
File and indexed templates can also be rendered by replacing inline with file or id
respectively. For example, to render a file template:
GET /_render/template
{
"file": "my_template",
"params": {
"status": [ "pending", "published" ]
}
}
The multi search template API allows you to execute several search template requests
within the same API call using the _msearch/template endpoint.
The format of the request is similar to the Multi Search API format:
header\n
body\n
header\n
body\n
The header part supports the same index, types, search_type, preference, and
routing options as the usual Multi Search API.
The body includes a search template body request and supports inline, stored and file
templates. Here is an example:
$ cat requests
{"index": "test"}
{"inline": {"query": {"match": {"user" : "{{username}}" }}}, "params":
{"username": "john"}} 1
{"index": "_all", "types": "accounts"}
{"inline": {"query": {"{{query_type}}": {"name": "{{name}}" }}}, "params":
{"query_type": "match_phrase_prefix", "name": "Smith"}}
{"index": "_all"}
{"id": "template_1", "params": {"query_string": "search for these words"
}} 2
{"types": "users"}
{"file": "template_2", "params": {"field_name": "fullname", "field_value":
"john smith" }} 3
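The newline-delimited payload can be assembled programmatically. The sketch below (the helper name is illustrative) builds the header/body pairs in the format shown above:

```python
import json

# Sketch: build the header\nbody\n... payload for _msearch/template.
# Each request contributes one header line and one body line, and the
# payload must end with a trailing newline.
def build_msearch_template(requests):
    lines = []
    for header, body in requests:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"

payload = build_msearch_template([
    ({"index": "test"},
     {"inline": {"query": {"match": {"user": "{{username}}"}}},
      "params": {"username": "john"}}),
    ({"index": "_all"},
     {"id": "template_1",
      "params": {"query_string": "search for these words"}}),
])
print(payload)
```

Because each line is an independent JSON document, the whole payload is deliberately not one JSON object; it must be sent as a raw body to _msearch/template.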
A search request can be executed purely using a URI by providing request parameters. Not
all search options are exposed when executing a search using this mode, but it can be
handy for quick "curl tests". Here is an example:
GET twitter/tweet/_search?q=user:kimchy
{
"timed_out": false,
"took": 62,
"_shards":{
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits":{
"total" : 1,
"max_score": 0.2876821,
"hits" : [
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "0",
"_score": 0.2876821,
"_source" : {
"user" : "kimchy",
"date" : "2009-11-15T14:12:12",
"message" : "trying out ngStorage",
"likes": 0
}
}
]
}
}
Parameters
The validate API allows a user to validate a potentially expensive query without executing it.
We’ll use the following test data to explain _validate:
PUT twitter/tweet/_bulk?refresh
{"index":{"_id":1}}
{"user" : "kimchy", "post_date" : "2009-11-15T14:12:12", "message" :
"trying out ngStorage"}
{"index":{"_id":2}}
{"user" : "kimchi", "post_date" : "2009-11-15T14:12:13", "message" : "My
username is similar to @kimchy!"}
GET twitter/_validate/query?q=user:foo
{"valid":true,"_shards":{"total":1,"successful":1,"failed":0}}
Request Parameters
When executing a validation using the query parameter q, the query passed is a query
string using the Lucene query parser. There are additional parameters that can be passed:
GET twitter/tweet/_validate/query
{
"query" : {
"bool" : {
"must" : {
"query_string" : {
"query" : "*:*"
}
},
"filter" : {
"term" : { "user" : "kimchy" }
}
}
}
}
The query being sent in the body must be nested in a query key, the same way the
search API works.
If the query is invalid, valid will be false. Here the query is invalid because NG|Storage
knows the post_date field should be a date due to dynamic mapping, and 'foo' does not
correctly parse into a date:
GET twitter/tweet/_validate/query?q=post_date:foo
{"valid":false,"_shards":{"total":1,"successful":1,"failed":0}}
An explain parameter can be specified to get more detailed information about why a
query failed:
GET twitter/tweet/_validate/query?q=post_date:foo&explain=true
responds with:
{
"valid" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"explanations" : [ {
"index" : "twitter",
"valid" : false,
"error" : "[twitter/IAEc2nIXSSunQA_suI0MLw] QueryShardException[failed
to create query:...failed to parse date field [foo]"
} ]
}
When the query is valid, the explanation defaults to the string representation of that query.
With rewrite set to true, the explanation is more detailed showing the actual Lucene
query that will be executed.
GET twitter/tweet/_validate/query?rewrite=true
{
"query": {
"match": {
"user": {
"query": "kimchy",
"fuzziness": "auto"
}
}
}
}
Response:
GET twitter/tweet/_validate/query?rewrite=true
{
"query": {
"more_like_this": {
"like": {
"_id": "2"
},
"boost_terms": 1
}
}
}
Response:
{
"valid": true,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"explanations": [
{
"index": "twitter",
"valid": true,
"explanation": "((user:terminator^3.71334 plot:future^2.763601
plot:human^2.8415773 plot:sarah^3.4193945 plot:kyle^3.8244398
plot:cyborg^3.9177752 plot:connor^4.040236 plot:reese^4.7133346 ... )~6)
-ConstantScore(_uid:tweet#2)) #(ConstantScore(_type:tweet))^0.0"
}
]
}
The request is executed on a single shard only, which is randomly
selected. The detailed explanation of the query may depend on which shard
is being hit, and therefore may vary from one request to another.
Setup NG|Storage
This section includes information on how to set up NG|Storage and get it running, including:
• Downloading
• Installing
• Starting
• Configuring
Supported Platforms
The matrix of officially supported operating systems and JVMs is available here: Support
Matrix. NG|Storage is tested on the listed platforms, but it is possible that it will work on
other platforms too.
NG|Storage is built using Java, and requires at least Java 8 in order to run. Only Oracle’s
Java and the OpenJDK are supported. The same JVM version should be used on all
NG|Storage nodes and clients.
We recommend installing Java version {jdk} or later. NG|Storage will refuse to start if a
known-bad version of Java is used.
The version of Java that NG|Storage will use can be configured by setting the JAVA_HOME
environment variable.
zip/tar.gz
The zip and tar.gz packages are suitable for installation on any system and are the
easiest choice for getting started with NG|Storage.
deb
The deb package is suitable for Debian, Ubuntu, and other Debian-based systems.
Debian packages may be downloaded from the NG|Storage website or from our
Debian repository.
rpm
The rpm package is suitable for installation on Red Hat, Centos, SLES, OpenSuSE and
other RPM-based systems. RPMs may be downloaded from the NG|Storage website
or from our RPM repository.
[rpm]
Puppet
puppet-NG|Storage
Chef
cookbook-NG|Storage
You can test that your NG|Storage node is running by sending an HTTP request to port
9200 on localhost:
curl localhost:9200
{
"name" : "Harry Leland",
"cluster_name" : "ngStorage",
"version" : {
"number" : "5.0.0-alpha1",
"build_hash" : "f27399d",
"build_date" : "2016-03-30T09:51:41.449Z",
"build_snapshot" : false,
"lucene_version" : "6.0.0"
},
"tagline" : "You Know, for Search"
}
The Debian package for NG|Storage can be downloaded from our website or from our APT
repository. It can be used to install NG|Storage on any Debian-based system such as
Debian and Ubuntu.
If two entries exist for the same NG|Storage repository, you will see an
error like this during apt-get update:
The Debian package for Elasticsearch v{version} can be downloaded from the website and
installed as follows:
wget
https://download.elastic.co/elasticsearch/release/org/elasticsearch/distri
bution/deb/elasticsearch/{version}/elasticsearch-{version}.deb
sha1sum elasticsearch-{version}.deb 1
sudo dpkg -i elasticsearch-{version}.deb
1 - Compare the SHA produced by sha1sum or shasum with the published SHA.
Use the update-rc.d command to configure NG|Storage to start automatically when the
system boots up:
If NG|Storage fails to start for any reason, it will print the reason for failure to STDOUT. Log
files can be found in /var/log/ngStorage/.
Configuring NG|Storage
The Debian package places config files, logs, and the data directory in the appropriate
locations for a Debian-based system:
NG|Storage is not started automatically after installation. How to start and stop NG|Storage
depends on whether your system uses SysV init or systemd (used by newer distributions).
You can tell which is being used by running this command:
ps -p 1
You now have a test NG|Storage environment set up. Before you start serious development
or go into production with NG|Storage, you will need to do some additional setup:
ES_USER
The user to run as, defaults to NG|Storage.
ES_GROUP
The group to run as, defaults to NG|Storage.
JAVA_HOME
Set a custom Java path to be used.
MAX_OPEN_FILES
Maximum number of open files, defaults to 65536.
MAX_LOCKED_MEMORY
Maximum locked memory size. Set to unlimited if you use the
bootstrap.memory_lock option in ngStorage.yml.
MAX_MAP_COUNT
Maximum number of memory map areas a process may have. If you use mmapfs as
index store type, make sure this is set to a high value. For more information, check
the linux kernel documentation about max_map_count. This is set via sysctl before
starting NG|Storage. Defaults to 262144.
LOG_DIR
Log directory, defaults to /var/log/ngStorage.
DATA_DIR
Data directory, defaults to /var/lib/ngStorage.
CONF_DIR
Configuration file directory (which needs to include ngStorage.yml and
logging.yml files), defaults to /etc/ngStorage.
ES_JAVA_OPTS
Any additional JVM system properties you may want to apply.
RESTART_ON_UPGRADE
Configure restart on package upgrade, defaults to false. This means you will have to
restart your NG|Storage instance after installing a package manually. The reason for
this is to ensure that upgrades in a cluster do not result in continuous shard
reallocation, which would cause high network traffic and reduce the response times of
your cluster.
To configure NG|Storage to start automatically when the system boots up, run the following
commands:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable ngStorage.service
NG|Storage logs can be followed by tailing the journal:
sudo journalctl -f
NG|Storage can be installed on Windows using the .zip package. This comes with a
service.bat command which will set up NG|Storage to run as a service.
Unzip it with your favourite unzip tool. This will create a folder called NG|Storage-{version},
which we will refer to as %ES_HOME%. In a terminal window, CD to the %ES_HOME%
directory, for instance:
CD c:\NG|Storage-{version}
.\bin\NG|Storage
By default, NG|Storage runs in the foreground, prints its logs to STDOUT, and can be
stopped by pressing Ctrl-C.
Any settings that can be specified in the config file can also be specified on the command
line, using the -E syntax as follows:
.\bin\NG|Storage -Ecluster.name=my_cluster -Enode.name=node_1
Values that contain spaces must be surrounded with quotes. For instance
-Epath.logs="C:\My Logs\logs".
c:\NG|Storage-{version}\bin>service
The script requires one parameter (the command to execute) followed by an optional one
indicating the service id (useful when installing multiple NG|Storage services).
install
Install NG|Storage as a service
remove
Remove the installed NG|Storage service (and stop the service if started)
start
Start the NG|Storage service (if installed)
stop
Stop the NG|Storage service (if started)
manager
Start a GUI for managing the installed service
Based on the architecture of the available JDK/JRE (set through JAVA_HOME), the
appropriate 64-bit(x64) or 32-bit(x86) service will be installed. This information is made
available during install:
c:\NG|Storage-{version}\bin>service install
Installing service : "NG|Storage-service-x64"
Using JAVA_HOME (64-bit): "c:\jvm\jdk1.8"
The service 'NG|Storage-service-x64' has been installed.
The service installer requires that the thread stack size setting be
configured in jvm.options before you install the service. On 32-bit
Windows, you should add -Xss320k to the jvm.options file, and on 64-bit
Windows you should add -Xss1m to the jvm.options file.
While a JRE can be used for the NG|Storage service, due to its use of a
client VM (as opposed to a server JVM which offers better performance for
long-running applications) its usage is discouraged and a warning will be
issued.
Upgrading (or downgrading) JVM versions does not require the service to
be reinstalled. However, upgrading across JVM types (e.g. JRE versus SE)
is not supported, and does require the service to be reinstalled.
The NG|Storage service can be configured prior to installation by setting the following
environment variables (either using the set command from the command line, or through
the System Properties → Environment Variables GUI).
SERVICE_ID
A unique identifier for the service. Useful if installing multiple instances on the same
machine. Defaults to NG|Storage-service-x86 (on 32-bit Windows) or
NG|Storage-service-x64 (on 64-bit Windows).
SERVICE_USERNAME
The user to run as, defaults to the local system account.
SERVICE_PASSWORD
The password for the user specified in %SERVICE_USERNAME%.
SERVICE_DISPLAY_NAME
The name of the service. Defaults to NG|Storage <version> %SERVICE_ID%.
SERVICE_DESCRIPTION
The description of the service. Defaults to NG|Storage <version> Windows
Service - https://elastic.co.
JAVA_HOME
The installation directory of the desired JVM to run the service under.
LOG_DIR
Log directory, defaults to %ES_HOME%\logs.
DATA_DIR
Data directory, defaults to %ES_HOME%\data.
CONF_DIR
Configuration file directory (which needs to include ngStorage.yml and
logging.yml files), defaults to %ES_HOME%\conf.
ES_JAVA_OPTS
Any additional JVM system properties you may want to apply.
ES_START_TYPE
Startup mode for the service. Can be either auto or manual (default).
ES_STOP_TIMEOUT
The timeout in seconds that procrun waits for service to exit gracefully. Defaults to 0.
At its core, service.bat relies on Apache Commons Daemon project to
install the service. Environment variables set prior to the service
installation are copied and will be used during the service lifecycle. This
means any changes made to them after the installation will not be picked
up unless the service is reinstalled.
On Windows, the heap size can be configured as for any other NG|Storage
installation when running NG|Storage from the command line, or when
installing NG|Storage as a service for the first time. To adjust the heap size
for an already installed service, use the service manager:
bin\service.bat manager.
It is also possible to configure the service after it’s been installed using the manager
GUI (NG|Storage-service-mgr.exe), which offers insight into the installed
service, including its status, startup type, JVM, start and stop settings amongst other
things. Simply invoking service.bat manager from the command-line will open
up the manager window.
Most changes (like JVM settings) made through the manager GUI will require a restart of
the service in order to take effect.
The .zip package is entirely self-contained. All files and directories are, by default,
contained within %ES_HOME%, the directory created when unpacking the archive.
This is very convenient because you don’t have to create any directories to start using
NG|Storage, and uninstalling NG|Storage is as easy as removing the %ES_HOME% directory.
However, it is advisable to change the default locations of the config directory, the data
directory, and the logs directory so that you do not delete important data later on.
The .zip archive for Elasticsearch v{version} can be downloaded and installed as follows:
wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/{version}/elasticsearch-{version}.zip
sha1sum elasticsearch-{version}.zip 1
unzip elasticsearch-{version}.zip
cd elasticsearch-{version}/ 2
1 - Compare the SHA produced by sha1sum or shasum with the published SHA.
2 - This directory is known as $ES_HOME.
The .tar.gz archive can be downloaded and installed in the same way:
wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/{version}/elasticsearch-{version}.tar.gz
sha1sum elasticsearch-{version}.tar.gz 1
tar -xzf elasticsearch-{version}.tar.gz
cd elasticsearch-{version}/ 2
1 - Compare the SHA produced by sha1sum or shasum with the published SHA.
2 - This directory is known as $ES_HOME.
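The checksum comparison in callout 1 can be sketched end to end. This is a self-contained sketch: the file is a stand-in rather than a real download, and the "published" SHA is derived from the same file so the example runs anywhere.

```shell
# Stand-in for a downloaded archive (a real run would use the wget output).
printf 'archive contents' > elasticsearch-sample.zip

# The published SHA would normally be copied from the download page;
# here we derive it from the same file so the sketch is self-contained.
published_sha=$(sha1sum elasticsearch-sample.zip | awk '{print $1}')

# Compute the SHA of the local file and compare.
actual_sha=$(sha1sum elasticsearch-sample.zip | awk '{print $1}')
if [ "$actual_sha" = "$published_sha" ]; then
  echo "checksum OK"
else
  echo "checksum MISMATCH" >&2
fi
```

In a real installation, a mismatch means the download is corrupt or tampered with and the archive should be discarded.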
./bin/ngStorage
By default, NG|Storage runs in the foreground, prints its logs to STDOUT, and can be
stopped by pressing Ctrl-C.
Running as a daemon
To run NG|Storage as a daemon, specify -d on the command line, and record the process
ID in a file using the -p option:
./bin/ngStorage -d -p pid
To shut down NG|Storage, kill the process ID recorded in the pid file:
kill `cat pid`
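The pid-file pattern can be exercised with a stand-in process in place of the NG|Storage daemon:

```shell
# Start a stand-in long-running process in the background,
# recording its process ID in a file named "pid" (as -p pid would).
sleep 300 &
echo $! > pid

# Shut it down by killing the recorded PID, then reap it.
kill "$(cat pid)"
wait "$(cat pid)" 2>/dev/null || true
```

After the kill, the pid file remains on disk and should be removed or overwritten on the next start.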
The startup scripts provided in the RPM and Debian packages take care of
starting and stopping the NG|Storage process for you.
Any settings that can be specified in the config file can also be specified on the command
line, using the -E syntax as follows:
./bin/ngStorage -d -Ecluster.name=my_cluster -Enode.name=node_1
The .zip and .tar.gz packages are entirely self-contained. All files and directories are,
by default, contained within $ES_HOME, the directory created when unpacking the
archive.
This is very convenient because you don’t have to create any directories to start using
NG|Storage, and uninstalling NG|Storage is as easy as removing the $ES_HOME directory.
However, it is advisable to change the default locations of the config directory, the data
directory, and the logs directory so that you do not delete important data later on.
Ideally, NG|Storage should run alone on a server and use all of the resources available to it.
In order to do so, you need to configure your operating system to allow the user running
NG|Storage to access more resources than allowed by default.
• Increase the number of open file descriptors
• Ensure a sufficient number of threads
• Ensure sufficient virtual memory (mmap counts)
• Disable swapping
By default, NG|Storage assumes that you are working in development mode. If any of the
above settings are not configured correctly, a warning will be written to the log file, but you
will be able to start and run your NG|Storage node.
As soon as you configure a network setting like network.host, NG|Storage assumes that
you are moving to production and will upgrade the above warnings to exceptions. These
exceptions will prevent your NG|Storage node from starting. This is an important safety
measure to ensure that you will not lose data because of a misconfigured server.
Where to configure systems settings depends on which package you have used to install
NG|Storage, and which operating system you are using.
When using the .zip or .tar.gz packages, system settings can be configured:
• temporarily with ulimit, or
• permanently in /etc/security/limits.conf.
When using the RPM or Debian packages, most system settings are set in the system
configuration file. However, systems which use systemd require that system limits are
specified in a systemd configuration file.
On Linux systems, ulimit can be used to change resource limits on a temporary basis.
Limits usually need to be set as root before switching to the user that will run NG|Storage.
For example, to set the number of open file handles (ulimit -n) to 65,536, you can do the
following:
sudo su 1
ulimit -n 65536 2
su ngStorage 3
1 - Become root.
2 - Change the maximum number of open files.
3 - Become the ngStorage user in order to start NG|Storage.
You can consult all currently applied limits with ulimit -a.
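A quick sanity check of the open-files limit for the current shell can be sketched as follows; the 65,536 threshold matches the recommendation given later in this chapter:

```shell
# Read the current soft limit on open file descriptors.
nofile=$(ulimit -n)
echo "open files limit: $nofile"

# Warn when the limit is below the recommended 65,536.
if [ "$nofile" != "unlimited" ] && [ "$nofile" -lt 65536 ]; then
  echo "warning: open files limit is below 65536"
fi
```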
/etc/security/limits.conf
On Linux systems, persistent limits can be set for a particular user by editing the
/etc/security/limits.conf file. To set the maximum number of open files for the
NG|Storage user to 65,536, add the following line to the limits.conf file:
ngStorage - nofile 65536
This change will only take effect the next time the NG|Storage user opens a new session.
Sysconfig file
When using the RPM or Debian packages, system settings and environment variables can
be specified in the system configuration file, which is located in:
Debian
/etc/default/ngStorage
RPM
/etc/sysconfig/ngStorage
However, for systems which use systemd, system limits need to be specified via systemd.
Systemd configuration
When using the RPM or Debian packages on systems that use systemd, system limits must
be specified via systemd. For example, to remove the limit on locked memory, add the
following line to the systemd unit configuration for the NG|Storage service:
LimitMEMLOCK=infinity
The preferred method of setting Java Virtual Machine options (including system properties
and JVM flags) is via the jvm.options configuration file. The default location of this file is
config/jvm.options (when installing from the tar or zip distributions) and
/etc/ngStorage/jvm.options (when installing from the Debian or RPM packages).
This file contains a line-delimited list of JVM arguments, which must begin with -. You can
add custom JVM flags to this file and check this configuration into your version control
system.
An alternative mechanism for setting Java Virtual Machine options is via the
ES_JAVA_OPTS environment variable. For instance:
ES_JAVA_OPTS="-Xms2g -Xmx2g" ./bin/ngStorage
When using the RPM or Debian packages, ES_JAVA_OPTS can be specified in the system
configuration file.
NG|Storage uses a lot of file descriptors or file handles. Running out of file descriptors can
be disastrous and will most probably lead to data loss. Make sure to increase the limit on
the number of open file descriptors for the user running NG|Storage to 65,536 or higher.
For the .zip and .tar.gz packages, set ulimit -n 65536 as root before starting
NG|Storage, or set nofile to 65536 in /etc/security/limits.conf.
RPM and Debian packages already default the maximum number of file descriptors to
65536 and do not require further configuration.
You can check the max_file_descriptors configured for each node using the Nodes
Stats API, with:
curl 'localhost:9200/_nodes/stats/process?pretty&filter_path=**.max_file_descriptors'
In development mode, NG|Storage tells the JVM to use a heap with a minimum size of 256
MB and a maximum size of 1 GB. When moving to production, it is important to configure
heap size to ensure that NG|Storage has enough heap available.
NG|Storage will assign the entire heap specified in jvm.options via the Xms (minimum
heap size) and Xmx (maximum heap size) settings.
The values for these settings depend on the amount of RAM available on your server. Good
rules of thumb are:
• Set the minimum heap size (Xms) and maximum heap size (Xmx) to be equal to each
other.
• The more heap available to NG|Storage, the more memory it can use for caching. But
note that too much heap can subject you to long garbage collection pauses.
• Set Xmx to no more than 50% of your physical RAM, to ensure that there is enough
physical RAM left for kernel file system caches.
• Don’t set Xmx to above the cutoff that the JVM uses for compressed object pointers
(compressed oops); the exact cutoff varies but is near 32 GB. You can verify that you are
under the limit by looking for a line in the logs like the following:
heap size [1.9gb], compressed ordinary object pointers [true]
• Even better, try to stay below the threshold for zero-based compressed oops; the exact
cutoff varies, but 26 GB is safe on most systems and can be as large as 30 GB on some
systems. You can verify that you are under the limit by starting NG|Storage with the JVM
options -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode
and looking for a line like the following:
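The sizing rules above can be combined into a small sketch that suggests a heap size from the machine's physical RAM, assuming a Linux system with /proc/meminfo available:

```shell
# Half of physical RAM, in MB.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
half_mb=$(( total_kb / 2 / 1024 ))

# Cap at 26 GB to stay safely below zero-based compressed oops.
cap_mb=$(( 26 * 1024 ))
if [ "$half_mb" -gt "$cap_mb" ]; then
  heap_mb=$cap_mb
else
  heap_mb=$half_mb
fi
echo "suggested settings: -Xms${heap_mb}m -Xmx${heap_mb}m"
```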
Here are examples of how to set the heap size via the jvm.options file:
-Xms2g 1
-Xmx2g 2
1 - Set the minimum heap size to 2g.
2 - Set the maximum heap size to 2g.
It is also possible to set the heap size via an environment variable. This can be done by
commenting out the Xms and Xmx settings in the jvm.options file and setting these values
via ES_JAVA_OPTS:
ES_JAVA_OPTS="-Xms2g -Xmx2g" ./bin/ngStorage 1
ES_JAVA_OPTS="-Xms4000m -Xmx4000m" ./bin/ngStorage 2
1 - Set the minimum and maximum heap size to 2 GB.
2 - Set the minimum and maximum heap size to 4000 MB.
Configuring the heap for the Windows service is different than the above.
The values initially populated for the Windows service can be configured
as above but are different after the service has been installed. Consult the
Windows service documentation for additional details.
Most operating systems try to use as much memory as possible for file system caches and
eagerly swap out unused application memory. This can result in parts of the JVM heap
being swapped out to disk.
Swapping is very bad for performance and for node stability and should be avoided at all
costs. It can cause garbage collections to last for minutes instead of milliseconds and can
cause nodes to respond slowly or even to disconnect from the cluster.
Enable bootstrap.memory_lock
The first option is to use mlockall on Linux/Unix systems, or VirtualLock on Windows, to try
to lock the process address space into RAM, preventing any NG|Storage memory from
being swapped out. This can be done by adding this line to the config/ngStorage.yml
file:
bootstrap.memory_lock: true
After starting NG|Storage, you can see whether this setting was applied successfully by
checking the value of mlockall in the output from this request:
curl 'http://localhost:9200/_nodes?pretty&filter_path=**.mlockall'
If you see that mlockall is false, then it means that the mlockall request has failed.
You will also see a line with more information in the logs with the words Unable to lock
JVM Memory.
The most probable reason, on Linux/Unix systems, is that the user running NG|Storage
doesn’t have permission to lock memory. This can be granted by setting memlock to
unlimited for the NG|Storage user in /etc/security/limits.conf, or by running
ulimit -l unlimited as root before starting NG|Storage.
Another possible reason why mlockall can fail is that the temporary directory (usually
/tmp) is mounted with the noexec option. This can be solved by specifying a new temp
directory, by starting NG|Storage with:
./bin/ngStorage -Djava.io.tmpdir=/path/to/temp/dir
The second option is to completely disable swap. Usually NG|Storage is the only service
running on a box, and its memory usage is controlled by the JVM options. There should be
no need to have swap enabled.
On Linux systems, you can disable swap temporarily by running: sudo swapoff -a. To
disable it permanently, you will need to edit the /etc/fstab file and comment out any
lines that contain the word swap.
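The fstab edit can be sketched on a sample copy; editing the real /etc/fstab requires root and should be done with care, so this operates only on a stand-in file:

```shell
# A sample fstab with one regular mount and one swap entry.
cat > fstab.sample <<'EOF'
UUID=abcd-1234  /     ext4  defaults  0 1
UUID=ef56-7890  none  swap  sw        0 0
EOF

# Prefix any uncommented line containing a swap field with '#'.
sed 's/^\([^#].* swap .*\)$/#\1/' fstab.sample > fstab.fixed
cat fstab.fixed
```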
On Windows, the equivalent can be achieved by disabling the paging file entirely via System
Properties → Advanced → Performance → Advanced → Virtual memory.
Configure swappiness
The third option, available on Linux systems, is to ensure that the sysctl value
vm.swappiness is set to 1. This reduces the kernel’s tendency to swap and should not
lead to swapping under normal circumstances, while still allowing the whole system to
swap in emergency conditions.
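The current value can be inspected directly from /proc on Linux; changing it requires root, so that step is shown only as a comment:

```shell
# Read the kernel's current swappiness setting.
swappiness=$(cat /proc/sys/vm/swappiness)
echo "vm.swappiness = $swappiness"

# To set it to 1 (as root):
#   sysctl -w vm.swappiness=1
# and persistently, add "vm.swappiness=1" to /etc/sysctl.conf.
```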
NG|Storage uses a number of thread pools for different types of operations. It is important
that it is able to create new threads whenever needed. Make sure that the number of
threads that the NG|Storage user can create is at least 2048.
This can be done by setting ulimit -u 2048 as root before starting NG|Storage, or by
setting nproc to 2048 for the NG|Storage user in /etc/security/limits.conf.
NG|Storage uses a hybrid mmapfs / niofs directory by default to store its indices.
The default operating system limits on mmap counts are likely to be too low, which may
result in out of memory exceptions.
On Linux, you can increase the limits by running the following command as root:
sysctl -w vm.max_map_count=262144
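A small check of the current limit against the recommended value, assuming a Linux system:

```shell
# Compare the running kernel's mmap limit with the recommended minimum.
current=$(cat /proc/sys/vm/max_map_count)
required=262144
if [ "$current" -lt "$required" ]; then
  echo "vm.max_map_count=$current is too low; run: sysctl -w vm.max_map_count=$required"
else
  echo "vm.max_map_count=$current is sufficient"
fi
```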
The RPM and Debian packages will configure this setting automatically. No further
configuration is required.
For more information, please refer to the corresponding Elasticsearch reference
documentation chapter.
NG|Storage ships with good defaults and requires very little configuration. Most settings
can be changed on a running cluster using the Cluster Update Settings API.
The configuration files should contain settings which are node-specific (such as
node.name and paths), or settings which a node requires in order to be able to join a
cluster, such as cluster.name and network.host.
These files are located in the config directory, whose location defaults to
$ES_HOME/config/. The Debian and RPM packages set the config directory location to
/etc/ngStorage/.
The location of the config directory can be changed with the path.conf setting, as follows:
./bin/ngStorage -Epath.conf=/path/to/my/config/
The configuration format is YAML. Here is an example of changing the path of the data and
logs directories:
path:
  data: /var/lib/ngStorage
  logs: /var/log/ngStorage
Settings can also be flattened as follows:
path.data: /var/lib/ngStorage
path.logs: /var/log/ngStorage
Environment variables referenced with the ${…} notation within the configuration file will
be replaced with the value of the environment variable, for instance:
node.name: ${HOSTNAME}
For settings that you do not wish to store in the configuration file, you can use the value
${prompt.text} or ${prompt.secret} and start NG|Storage in the foreground.
${prompt.secret} has echoing disabled so that the value entered will not be shown in
your terminal; ${prompt.text} will allow you to see the value as you type it in. For
example:
node:
  name: ${prompt.text}
When starting NG|Storage, you will be prompted to enter the actual value like so:
Enter value for [node.name]:
New default settings may be specified on the command line using the default. prefix.
This will specify a value that will be used by default unless another value is specified in the
config file.
./bin/ngStorage -Edefault.node.name=My_Node
In this example, the value for node.name will be My_Node, unless it is overridden on the
command line with -Enode.name or in the config file with node.name.
Logging configuration
NG|Storage uses an internal logging abstraction and comes, out of the box, with log4j. It
tries to simplify log4j configuration by using YAML to configure it, and the logging
configuration file is config/logging.yml. The JSON and properties formats are also
supported. Multiple configuration files can be loaded, in which case they will be merged.
Additional appenders and other logging classes provided by log4j-extras are also available
out of the box.
Deprecation logging
Deprecation logging is enabled by default. It will create a daily rolling deprecation log
file in your log directory. Check this file
regularly, especially when you intend to upgrade to a new major version.
While NG|Storage requires very little configuration, there are a number of settings which
need to be configured manually and should definitely be configured before going into
production.
• cluster.name
• node.name
• bootstrap.memory_lock
• network.host
• discovery.zen.ping.unicast.hosts
• discovery.zen.minimum_master_nodes
• node.max_local_storage_nodes
If you are using the .zip or .tar.gz archives, the data and logs directories are
sub-folders of $ES_HOME. If these important folders are left in their default locations, there is a
high risk of them being deleted while upgrading NG|Storage to a new version.
In production use, you will almost certainly want to change the locations of the data and log
folder:
path:
  logs: /var/log/ngStorage
  data: /var/data/ngStorage
The RPM and Debian distributions already use custom paths for data and logs.
The path.data settings can be set to multiple paths, in which case all paths will be used
to store data (although the files belonging to a single shard will all be stored on the same
data path):
path:
  data:
    - /mnt/ngStorage_1
    - /mnt/ngStorage_2
    - /mnt/ngStorage_3
cluster.name
A node can only join a cluster when it shares its cluster.name with all the other nodes in
the cluster. The default name is NG|Storage, but you should change it to an appropriate
name which describes the purpose of the cluster.
cluster.name: logging-prod
Make sure that you don’t reuse the same cluster names in different environments,
otherwise you might end up with nodes joining the wrong cluster.
node.name
By default, NG|Storage will randomly pick a descriptive node.name from a list of around
3000 Marvel characters when your node starts up, but this also means that the node.name
will change the next time the node restarts.
It is worth configuring a more meaningful name which will also have the advantage of
persisting after restarting the node:
node.name: prod-data-2
node.name: ${HOSTNAME}
bootstrap.memory_lock
It is vitally important to the health of your node that none of the JVM is ever swapped out to
disk. One way of achieving that is to set the bootstrap.memory_lock setting to true.
For this setting to have effect, other system settings need to be configured first. See
[mlockall] for more details about how to set up memory locking correctly.
network.host
Note that more than one node can be started from the same $ES_HOME
location on a single server. This can be useful for testing NG|Storage’s
ability to form clusters, but it is not a configuration recommended for
production.
To communicate and form a cluster with nodes on other servers, your node will need to
bind to a non-loopback address, for example:
network.host: 192.168.1.10
The network.host setting also understands some special values such as local, site,
global and modifiers like :ip4 and :ip6, details of which can be found in [network-
interface-values].
discovery.zen.ping.unicast.hosts
Out of the box, without any network configuration, NG|Storage will bind to the available
loopback addresses and will scan ports 9300 to 9305 to try to connect to other nodes
running on the same server. This provides an auto-clustering experience without having to
do any configuration.
When the moment comes to form a cluster with nodes on other servers, you have to provide
a seed list of other nodes in the cluster that are likely to be live and contactable. This can
be specified as follows:
discovery.zen.ping.unicast.hosts:
- 192.168.1.10:9300
- 192.168.1.11 1
- seeds.mydomain.com 2
1 - The port will default to 9300 if not specified. 2 - A hostname that resolves to multiple IP
addresses will try all resolved addresses.
discovery.zen.minimum_master_nodes
To avoid a split brain, this setting should be set to a quorum of master-eligible nodes:
(master_eligible_nodes / 2) + 1
In other words, if there are three master-eligible nodes, then minimum master nodes
should be set to (3 / 2) + 1 or 2:
discovery.zen.minimum_master_nodes: 2
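The quorum arithmetic can be sketched as a tiny helper, using integer division exactly as in the formula above:

```shell
# Quorum of master-eligible nodes: (n / 2) + 1, using integer division.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

quorum 3   # 2
quorum 5   # 3
```

Note that for an even number of master-eligible nodes, e.g. 4, the quorum is 3, so the cluster can tolerate only one master-eligible node being lost.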
node.max_local_storage_nodes
It is possible to start more than one node on the same server from the same $ES_HOME,
just by doing the following:
./bin/ngStorage -d
./bin/ngStorage -d
This works just fine: the data directory structure is designed to let multiple nodes coexist.
However, a single instance of NG|Storage is able to use all of the resources of a single
server and it seldom makes sense to run multiple nodes on the same server in production.
It is, however, possible to start more than one node on the same server by mistake and to
be completely unaware that this problem exists. To prevent more than one node from
sharing the same data directory, it is advisable to add the following setting:
node.max_local_storage_nodes: 1