
ARCHIVE, INDEX & SEARCH

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Welcome to the Commvault® Archive, Index and Search solution architect module.

LEARNING GOALS

Understand
▪ Technical deep dive into the architecture and sizing methodologies of
various features provided by Commvault® Software, including:
– Commvault Search, Index & Analytics architecture
– Email Archiving
– File-system Archiving
▪ Network Data Flows for solutions based on Commvault search and access
technology

Design
▪ Learn the step-by-step process to design search and archive solutions

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Narrative:

In this module we will perform a technical deep-dive into the underlying architecture
and sizing methodologies of the various archive, search and index features provided by
Commvault® software.

In addition, we will learn a step-by-step process to help you design archive and search
solutions using Commvault® software.

COMMVAULT® INDEX TIERS

▪ Moving up the index stack increases value
▪ Find objects most relevant to a specific goal
▪ Search on-premises, in the cloud or hybrid environments
▪ Extends to external data sources

[Diagram: index tiers of increasing functional data value — Job-Level Index (CommServe® Database), Object-Level Index (MediaAgent Index Cache), Content-Level Index (Content Index)]

{NEW CONTENT FOR CVSA18}

Indexing remains, and has historically always been, an integral part of the Commvault®
platform. Content Indexing is simply the next tier of indexing, providing object-level
content search. This object-level content search sits on top of the object-level
metadata in the MediaAgent Index Cache and the job-level metadata in the
CommServe® Database. {CLICK} As one moves up the indexing tiers, the value provided
to customers also increases.

Regardless of a customer's preference for on-premises, hybrid, or cloud-oriented
architectures, Commvault search fits all deployment models. Commvault natively
allows content searching against data resident in the cloud, as well as running search
infrastructure within the cloud. Search unlocks the value of information hidden not
only within unstructured data, but also within applications such as Exchange and
SharePoint, and with Data Cube, extends to structured database repositories, websites,
and other data sources not necessarily contained within the ContentStore. This on-
demand, multi-tier indexing engine drives simplified, agile workload operations,
providing metadata indexing, content indexing, advanced query building and
visualization in on-premises, hybrid, or cloud environments.

Commvault® Search Architecture, Features &
Components

First, we will talk about the Commvault® Search architecture, features, and components.

SEARCH, INDEX AND ANALYTICS ARCHITECTURE

[Diagram: an Index Server (Index Store) supports Web Analytics, Data Analytics / Data Cube, Download Center, Edge Drive®, Exchange Mailbox, Log Monitoring, Compliance review sets, and NFS ObjectStore; third-party connectors and data sources feed data extraction; a Search Engine comprises Index & Search and Search-Only nodes; clients and user access consoles sit on top of the stack]

{NEW CONTENT FOR CVSA18}

The Commvault® Search architecture relies on several core components, many of
which you will already be familiar with, but let's discuss how it all fits together.

{CLICK1} At the heart of most deployments is the ContentStore where secondary copies
of data are stored and managed. Commvault already has the associated job and
metadata index information recorded from the various data protection tasks.

{CLICK2} As discussed in an earlier module, when you require any type of Content
Indexing feature, you will require the Web Server and Web Console components.

There are two primary configurations for Content Indexing, but they are not mutually
exclusive.

{CLICK3} The first is the so-called Search Engine, which is required for offline content
indexing of unstructured data stored within the ContentStore; this includes file, NAS,
and a number of other semi-structured data sources. A Search Engine is a highly
distributed indexing and searching configuration that enables you to index
simultaneously on multiple servers and search across those search servers. Indexing
large volumes of data and searching instantly across them is easy to manage. As your
data grows, you can add more servers to meet your needs without having to move your
data manually.

{CLICK4} The second more recent configuration is known as an Index Server. An Index
Server is a logical CommCell® entity that you can configure to support indexing, search,
and analytics functions for different Commvault products and features, {CLICK5} including
Web Analytics, Data Analytics and Data Cube, Download Center, Edge Drive®, Exchange
Mailbox, Log and system monitoring, Compliance Search Review Sets, and NFS
ObjectStore. Different features require one or more roles to be configured on the Index
Server. Some Commvault products and features can support multiple roles configured
on the same Index Server, while others require a standalone Index Server. For more
information, refer to the Commvault documentation website for the products and
features that use Index Server.

{CLICK6} The Index Store Package is the software that must be installed on a client
within the CommCell environment. Clients with Index Store are configured as Index
Server entities and together support the indexing, search and analytics operations of
the aforementioned products and features.

{CLICK7} Several of the features supported by an Index Server, such as DataCube for
example, utilize third party data sources through various native Data Connectors and
custom REST APIs, in addition to client-based data housed within the ContentStore.

{CLICK8} Content Analyzer and entity extraction can identify the data in an environment
that contains one or more specific types of information. This knowledge can then be
used to create more targeted strategies and increase the efficiency of other data
management operations.

{CLICK9} For example, you can add entity information to your content indexed data to
support searching for specific entities from Compliance Search. Additionally, you can
also configure rules to archive data based on matching entity types.

{CLICK10} Some Data Cube connectors support Entity Extraction. With entity extraction
for Data Cube, you can search for and identify different types of data across many
different types of data repositories in your environment, regardless of whether or not
they are protected by a Commvault agent.

{CLICK11} The Admin Console and Web Console provide the various user access
consoles; again, we have already discussed these in an earlier module.

Finally, do not be overwhelmed by all of the components shown here; most designs will
be much simpler than this, as they will not require every use case and feature we have
discussed.

SEARCH ENGINE ARCHITECTURE – V1 INDEX

▪ Search Nodes & Engines
▪ Content Indexing Files from Backup
▪ Web Server
▪ Web Console
  – Compliance / End User Search
▪ Sizing per node
  – *30 million objects
  – 30% of source data (up to 6TB)

[Diagram: Search Engine with Index & Search and Search-Only nodes]

*To get higher scalability for content indexing files from backups, you can turn off the generate
previews option on the Search Engine node. When generate previews is off, each search node can
accommodate approximately 60 million objects / 15% of source data (up to 6TB)

{UPDATED CONTENT & NARRATIVE FOR CVSA18}


Narrative:

Let's take a closer look at the Search Engine configuration used for offline content
indexing. It is worth mentioning that the Search Engine targets clients that utilize the v1
index cache mechanism in Commvault® v11. We will not cover the full scope of v1 and
v2 indexing in this course; however, by way of comparison, the v1 index utilizes the
C-Tree index cache mechanism for metadata, with the Content Indexes built separately
inside the Search Engine. Agents that support the newer v2 index, such as the Exchange
Mailbox Agent, utilize the SOLR engine for both metadata and content indexes. This
saves having to configure a separate Content Indexing engine and duplicating the
metadata inside it.

A Content Indexing search engine is composed of one or more content indexing nodes.
Each node falls into one of two categories: either an index and search node or a search-
only node. {CLICK} Each search engine must contain at least one index node. {CLICK}
The index node receives data from a MediaAgent during a content indexing job and
writes the data to the search engine. The ingestion of data into the search engine is
highly IO and CPU intensive and therefore should ideally run on a physical server,
although a VM can also be used for index and search nodes. Once the ingest node
reaches either 30M objects or 6TB in size, {CLICK} the index created thus far may be
moved to a search-only node in the search engine. Because the search-only node only
services search requests and occasional object deletions, the CPU and IO requirements
drop significantly. Therefore the search-only node may be run either on a physical or
virtual server. However, careful consideration should be given to the location of the
search-only guest. Placement on a hypervisor with other IO-intensive guests is not
recommended.

{CLICK} The Web Server component provides authentication and data services for
various search functions. Clients such as {CLICK} Compliance Search and the
Web Console rely upon the services provided by the Web Server. For example, when a
user performs a search, the Web Server receives the request and queries the search
engine. The search engine returns the search hits to the Web Server, which in turn
sends the data to the Search client, which displays the results. In a multi-node search
engine, the query is executed on every node in the search engine. All nodes must be
online for a query to succeed.

EXCHANGE MAILBOX INDEXING

▪ Exchange Mailbox Agent based on v2 Index
▪ Formerly known as the Analytics Engine
▪ Further enhancements and support planned for additional agents

[Diagram: Good — metadata in the C-Tree Index Cache, duplicated with content and previews in a SOLR Search Engine; Better — metadata in the Exchange Index on an Index Server, still duplicated with content and previews in a SOLR Search Engine; Best — metadata and content in the Exchange Index on the Index Server, with on-demand previews]

{NEW CONTENT FOR CVSA18}

Narrative:

You should already be familiar with the new Exchange Mailbox Agent from previous
training. Along with the new Exchange Mailbox agent comes a new Index Server role,
formerly known as the Analytics Engine. The Exchange Index Server is used to hold the
MediaAgent Index Cache information for the new Exchange Mailbox agent. Prior to the
introduction of the new Exchange Mailbox agent and V2 Indexing, our architecture
looked like the "GOOD" side of this illustration.
The email metadata was stored in the C-Tree MediaAgent Index Cache. When the
email was Content Indexed, we duplicated that metadata into the Search Engine, which
also contained the Content Index and item previews for Search purposes.

{CLICK} With the introduction of the new Exchange Mailbox agent, we shifted to the
“BETTER” side of the illustration. Here, we used V2 Indexing and stored the metadata
in the Exchange Index Server, which under the hood resides within an instance of Solr.

{CLICK} Today we also support Content Indexing for the new Exchange Mailbox agent.
This shifts the picture to the "BEST" side, where instead of duplicating the Exchange
Mailbox metadata into a separate search engine, we add the index of the content to
the Index Server, and provide on-demand previews instead of pre-generating the
previews and storing them in the search engine.

Further enhancements are planned both for this mechanism and the index server
function in general, with further support to be added for additional agents. Remember
to check the Commvault® documentation website for the latest information.

CONTENT ANALYZER & ENTITY EXTRACTION

▪ Add entity information to content indexed data (Search Engine only)
▪ Configure rules to archive data based on matching entities
▪ Data Cube connectors support Entity Extraction*
▪ Requires Windows Server with Content Analyzer Package

[Diagram: a Search Engine (Search & Index plus Search-Only nodes) feeding Compliance Search and Archiving Rules]

*Search & supported connectors only

{NEW CONTENT FOR CVSA18}

Content Analyzer and entity extraction helps customers identify the data in their
environment that contains one or more specific types of information. They can then
use that knowledge to create more targeted data management strategies and increase
efficiency.

The Content Analyzer can be used to add semantic identifiers to the data that has been
content indexed within a Search Engine. Remember this is typically offline file data that
has been indexed from the backup copy.

You can use these additional semantic identifiers to create more targeted rules for
archiving based on specific types of information present in the content of the data. You
can also use Compliance Search to query data based on entities that are present in the
data.

The Content Analyzer package reads the information in the Search Engine and looks for
the entities that you want to identify, such as telephone numbers, social security
numbers, and other meaningful types.
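
To make the idea concrete, here is a minimal, hypothetical sketch of regex-based entity extraction. This is illustrative only — it is not Commvault's actual Content Analyzer implementation, and the patterns and function names are assumptions:

```python
import re

# Hypothetical entity patterns, illustrating the kinds of identifiers an
# entity extractor looks for; not Commvault's actual rules.
ENTITY_PATTERNS = {
    "phone_number": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def extract_entities(text: str) -> dict:
    """Return every entity match found in a document's indexed content."""
    return {name: pattern.findall(text) for name, pattern in ENTITY_PATTERNS.items()}

doc = "Call 555-867-5309 regarding claim SSN 123-45-6789."
print(extract_entities(doc))
# {'phone_number': ['555-867-5309'], 'ssn': ['123-45-6789']}
```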

With entity extraction for Data Cube, you can search for and identify different types of
data across many different types of data repositories in your environment, regardless of
whether or not they are protected by a Commvault® agent.

SIZING FOR ALL INDEXING

{NEW SLIDE FOR CVSA18}

The following table provided on the Commvault® Documentation Website contains the
system and hardware requirements for deploying search and analytics solutions
according to the type of data that you are targeting. Each column in the table
represents a set of requirements for a different search and analytics use-case. For
example, if you have file system data that you want to analyze with Data Analytics, refer
to the Indexing File Metadata Only column in the File Data table.

Sizing: http://documentation.commvault.com/commvault/v11/article?p=44381.htm

Search Engine Solution Design

{UPDATED FOR CVSA18}

In this section we will focus on the design of a traditional Search Engine deployment.

SEARCH ENGINE SIZING

End User / Compliance Search Engine: optimized for high search performance.
Maximum of 4 nodes or 120M objects.
Use cases: File-system, Endpoint search, Compliance Search

High Capacity Search Engine (on request): optimized for higher capacity search.
Maximum of 8 nodes or 240M objects.
Use case: Larger search requirements

[Diagram: each Search Node is either Index and Search or Search-Only; a standard engine pairs one Search & Index node with up to three Search-Only nodes; a high capacity engine pairs one Search & Index node with up to seven Search-Only nodes]

▪ Additional nodes can be added as needed
▪ Search & Index Nodes can be virtualized
▪ Design for 50k objects per hour ingest rate
▪ Engine size limit is hard-coded
▪ Not all data has to be available in index at all times
{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Narrative:

{CLICK} The search engine can be deployed with a maximum limit of 4 nodes or 120
million objects.
{CLICK} Solutions requiring a higher capacity may increase this limit up to 8 nodes or
240 million objects. This requires an exception ticket to be raised, and you should
engage with Commvault® product management, or for partners, your local
Commvault® representative, to validate sizing over the aforementioned 4-node design.

{CLICK} Additional nodes can be added to an engine as needed, up to the maximum
limit supported.
{CLICK} Search and Index Nodes can be deployed as virtual machines, but remember to
caveat to the customer that index rates cannot be guaranteed; it is normal to see a
~40% reduction in performance.
{CLICK} Each search engine should be designed for an ingestion speed of 50,000
objects/hour. Often the engines run up to three times faster than this, but we want to
design conservatively to ensure our deployments consistently perform. Multiple nodes
can also handle ingest; while this is a rare occurrence, ensure in very large environments
that you have enough nodes to handle the daily ingest of new objects.
{CLICK} The engine size limits are hard-coded in the software; however, there are
situations where an exception might be granted. For example, if the average object size
was very small and the node had exceeded its object limit while only using a small
proportion of the allocated 6TB disk space, then an exception might allow the object
count to be increased for each node. However, all exceptions would require review and
authorization from the Commvault Engineering department, and your design should
always be based on the numbers documented here and on the Commvault
documentation website.
{CLICK} One important competitive differentiator you definitely need to be aware of is
that you do not have to have all data available in the index at all times. With
Commvault® software you can content index jobs from a Storage Policy at any time,
regardless of when the data was collected! For example, leave only the most recent 2
years in the content index. If a requirement arises to search data from January to March
five years ago, simply re-index those jobs when required!
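
As a back-of-the-envelope aid, the conservative ingest figure above translates into a simple time estimate. A minimal sketch, assuming the 50,000 objects/hour design rate from this slide:

```python
import math

DESIGN_INGEST_RATE = 50_000  # objects/hour per ingest node (conservative design figure)

def initial_ingest_hours(total_objects: int, ingest_nodes: int = 1) -> float:
    """Estimate the initial content indexing window at the design rate."""
    return total_objects / (DESIGN_INGEST_RATE * ingest_nodes)

hours = initial_ingest_hours(20_000_000)  # e.g. 20 million objects, one ingest node
print(f"~{hours:.0f} hours (~{math.ceil(hours / 24)} days) of initial ingest")
# ~400 hours (~17 days)
```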

SEARCH ENGINE SIZING

[Diagram: multi-instance — a 24TB LUN presented to a physical multi-instanced server; a maximum of two search and index node instances on a single physical server, and up to a maximum of four instances total; each of Instance001 through Instance004 receives a 6TB index partition]

{UPDATED CONTENT & NARRATIVE FOR CVSA18}

Narrative:

One item to keep in mind when architecting a Commvault® search engine solution is
multi-instancing. Commvault has long had the capability to install multiple instances of
software onto a single system. This is useful in many situations, such as if you needed
to connect a client to more than one CommCell®, for example. We can take advantage
of the multi-instancing capabilities to stack multiple Content Indexing instances on a
single physical server (not really recommended for VMs in most circumstances)

One common design pattern you will see for multi-instancing with CI is the aptly named
“four pack”. {CLICK}

Here, we present a 24TB LUN to a multi-instanced server on which we plan to run four
instances. {CLICK}

You can run a maximum of two search and index node instances on a single physical
server. By running two search and index instances, we can increase the overall
throughput for content indexing jobs. However, running more than two actively
indexing instances is not recommended, as the CPU and IO requirements may lead to
performance issues. {CLICK}

12
For each of those two index and search instances, we need a 6TB partition. {CLICK}

We can run a maximum of four instances total on a single physical server, so the
remaining two will be search-only.

Streams matter! You must have multiple streams to feed multiple ingest nodes.
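
A small sketch of the "four pack" layout arithmetic, using the 24TB LUN, 6TB-per-instance, and two-active-ingest-instance figures above (the instance naming is illustrative):

```python
LUN_TB = 24
TOTAL_INSTANCES = 4        # maximum instances on one physical server
INGEST_INSTANCES = 2       # maximum actively indexing (search & index) instances
PARTITION_TB = LUN_TB // TOTAL_INSTANCES  # 6TB index partition per instance

roles = ["index & search"] * INGEST_INSTANCES + \
        ["search-only"] * (TOTAL_INSTANCES - INGEST_INSTANCES)
for n, role in enumerate(roles, start=1):
    print(f"Instance{n:03d}: {PARTITION_TB}TB partition, {role}")
```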

SEARCH NODE SIZING EXAMPLE #1

File Indexing Example


▪ 15TB of files to index
▪ 20 million total files
▪ 15TB / 20 million = .79MB average file size
▪ Index size per object is ~30% of the file size
▪ .79MB * 30% = .24MB average index size per file
▪ 6TB of index capacity / .24MB = ~25 million objects
▪ 20M files / 25M per node = 1 search node required

Index per node will be 6TB while active and support 25 million objects

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Now that we understand the basics of search node sizing, let’s look at some examples
of how you determine how many objects a single search node can index.

{CLICK} In example number one we’ll be looking to index a total of 15TB of files,
consisting of 20 million objects.

{CLICK} To determine the average file size, we need to divide the 15TB by 20 million.
This shows that the average file size in this environment is .79MB.

{CLICK} When a search server indexes an object, the resulting index is approximately
30% of the original file size.

{CLICK} By multiplying the average file size of .79MB by 30%, we can see that the
average index size per file will be 0.24MB.

{CLICK} Now we need to determine how many objects a single node can index before
the 6TB limit is hit. By dividing 6TB by the average index size per file of .24MB, we get
approximately 25 million objects per node.

{CLICK} Finally we need to determine how many nodes are needed in this solution. Take
the 20 million total files that will be indexed, and divide them by 25 million objects per
node. The result rounds up to 1 search node.

{CLICK} In summary, you will need to provision one search node to index this data. It
will require a full 6TB of disk space for the index and will support 25 million objects.
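
The arithmetic above can be captured in a short helper. A minimal sketch, assuming the per-node limits documented in this module (6TB of index disk and ~30M objects with previews, or ~60M objects at a 15% index ratio with previews off); the slides round loosely between binary and decimal units, so results are approximate:

```python
import math

def size_search_nodes(source_tb: float, total_files: int,
                      generate_previews: bool = True):
    """Sketch of the per-node sizing steps walked through above."""
    index_ratio = 0.30 if generate_previews else 0.15
    object_cap = 30_000_000 if generate_previews else 60_000_000

    avg_file_mb = source_tb * 1024 * 1024 / total_files  # average file size (binary units)
    avg_index_mb = avg_file_mb * index_ratio             # index is ~30% of file size
    # A node fills at whichever limit is hit first: 6TB of disk or the object cap.
    objects_per_node = min(6 * 1024 * 1024 / avg_index_mb, object_cap)
    nodes = math.ceil(total_files / objects_per_node)
    index_tb_per_node = objects_per_node * avg_index_mb / (1024 * 1024)
    return nodes, round(objects_per_node / 1e6), round(index_tb_per_node, 2)

# Example #1: 15TB, 20 million files
print(size_search_nodes(15, 20_000_000))  # (1, 27, 6.0) -- the slide rounds to ~25M objects
```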

SEARCH NODE SIZING EXAMPLE #2

File Indexing Example


▪ 10TB of files to index
▪ 70 million total files
▪ 10TB / 70 million = 150KB average file size
▪ Index size per object is ~30% of the file size
▪ 150KB * 30% = 45KB average index size per file
▪ 6TB of index capacity / 45KB > 30 million objects
▪ 70M files / 30M per node = 3 search nodes required

Index per node will be 1.35TB and support 30 million objects

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

In the next example we will look at another file indexing scenario.

{CLICK} For this customer we have 10TB of total file data to index. Inside there are a
total of 70 million files.

{CLICK} By dividing these two numbers we get an average file size of approximately
150KB. Since this is mostly machine generated data this is fairly common.

{CLICK} Then by using the same 30% rule we see that we have an average 45KB index
size per file.

{CLICK} By dividing the 6TB of available disk capacity by 45KB, we get a result much
larger than 30 million objects. So we need to default to the maximum value of 30
million objects per node.

{CLICK} By dividing the total 70 million files by 30 million per node, we see that three
search nodes will be required.

{CLICK} In this solution, each of the three nodes will require 1.35TB of disk space for the
index, and can support 30 million objects.
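
Reusing the sketch from example #1 on this profile reproduces the three-node answer, with the object cap rather than the disk cap as the binding limit:

```python
# Example #2: 10TB, 70 million files -> the 30M object cap binds first
nodes, objects_m, index_tb = size_search_nodes(10, 70_000_000)
print(nodes, objects_m, index_tb)  # 3 nodes, 30M objects, ~1.3TB index per node
# (the slide's 1.35TB figure uses decimal rather than binary units)
```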

DATA FLOW

[Diagram: user (Internet) → Compliance Search (DMZ) → Web Server and Search Engine (Intranet); steps 1-3 flow inward, steps 4-6 return]

1. User queries Compliance Search
2. Search client request sent to Web Server
3. Web Server queries Search Engine
4. Search Engine returns query results and preview to Web Server
5. Web Server returns query results and preview to Compliance Search
6. Compliance Search displays query results and preview to user

Now let’s talk about the data flow for a search solution. {CLICK}
Here we see a typical configuration of components with an end-user in the internet,
Compliance Search installed in the DMZ, and the Web Server and Search Engine in the
intranet. In this use case, if the compliance officers do not require external access for
searches, all components can reside in the intranet. {CLICK}
Our user accesses Compliance Search and submits a query. {CLICK}
Next, the Compliance Search client sends the request to the Web Server {CLICK}
The Web Server queries the search engine {CLICK}
The search engine returns the query results and previews to the Web Server {CLICK}
The Web Server returns the query results and previews to the Compliance Search
interface {CLICK}
And the Compliance Search interface displays the results and previews to the end user.

DATA FLOW

[Diagram: user (Internet) → Web Console (DMZ) → Web Server and Search Engine (Intranet); steps 1-3 flow inward, steps 4-6 return]

1. User searches in Web Console
2. Web Console client request sent to Web Server
3. Web Server queries Search Engine
4. Search Engine returns query results and preview to Web Server
5. Web Server returns query results and preview to Web Console
6. Web Console displays query results and preview to user

Looking at the same configuration for Web Console, things are much the same, simply
replacing Compliance Search with Web Console.

REVIEW: SEARCH ENGINE SIZING STEPS

1. Size Search Engines and Search Nodes
2. Size Web Console(s)
3. Size Web Server(s)
4. Optimize Design

[Diagram: a Search Engine (Search & Index plus Search-Only nodes); Web Consoles per domain serving End User Search and Compliance Search; a multi-instance option]
{UPDATED CONTENT AND NARRATIVE}

Narrative:
Let's look at four basic steps to help size a search engine environment.

{CLICK}
One: Using your data profiles, look at the average object size for each data type
and the number of objects. Use the sizing guidelines available on
documentation.commvault.com to size the search engine. Ensure there is a
search engine for each Storage Policy; any exceptions to these parameters
would require prior approval by Commvault® Engineering.

Two: Size the Web Console. It is recommended to start with a single server and add
more per Commvault documented guidance, based upon the number of
concurrent users and their access profile. If SSO is required, a separate Web
Console per domain is required at a minimum. The Web Console can also be
virtualized.

Three: Size the Web Server. Like the Web Console, start with a single server and add
more per Commvault documented guidance, based upon the number of concurrent
users and their access profile. It is recommended to deploy separate servers for
end user and compliance use in heavy usage scenarios. Again, the server can
be virtualized in low-load environments.

Four: Consider how your proposal can be optimized, especially in competitive
situations! For example, when an index node hits capacity, move the index to a
virtual server and re-use the physical index server. Consider multi-instancing
with a physical server that meets the aggregate requirements. You can also
combine the Web Console and Web Server roles in very small environments.

Email Archiving

{UPDATED NARRATIVE FOR CVSA18}

Narrative:

This next section covers email archiving, predominantly focused on the newer v11
Exchange Mailbox Agent. If you are still interested in learning about the Commvault
OnePass™ Mailbox agent for Exchange, or the so-called Mailbox Classic including PST
archiving, then please visit documentation.commvault.com

EXCHANGE MAILBOX AGENT

Benefits
▪ Easy Configuration using one Exchange Mailbox Client
▪ Discrete Exchange Policies provide User-Level Control
▪ Leverages Index Server to Index Data
▪ Multi-streamed backups, restores & stubbing
▪ Scale Out architecture using Proxies
▪ Mail previews without content indexing
▪ Received time based (No Synthetic Fulls)

[Diagram: lifecycle — Protect, Recover, Index, Control, Preview, Retain, Browse, Archive]

{VERY MINOR MOD TO NARRATIVE FOR CVSA18}

Narrative:
The Commvault® Exchange Mailbox Agent is a comprehensive solution that
incorporates separate archiving and clean up operations.

For clarity, you can think of archiving as a collection that, depending on the type of
policy configured will also incorporate data protection.

You can move mailbox messages to secondary storage and use these messages as an
archive copy. The archived messages are available for quick and easy retrieval by
administrators and end users.
For the User Mailbox, clean-up operations create stubs on the production storage. The
stubs point to the messages that were moved as part of the archiving and clean-up
operations.
For the Journal Mailbox, the cleanup operation removes journaled messages from the
Exchange server.
For the ContentStore Mail Server, clean-up operations delete the contents of the
journal mailbox on the ContentStore Mail Server (SMTP).
The Exchange Mailbox Agent utilizes index-driven retention based on message
received time; therefore no Synthetic Fulls are required.

Now let's look at the architecture of the Exchange Mailbox Agent.

EXCHANGE MAILBOX AGENT ARCHITECTURE
USER MAILBOX & JOURNAL MAILBOX

[Diagram: Exchange Servers send data to MediaAgents / Exchange Mailbox Agents (proxies), which write the index to the Index Server; Microsoft Outlook, browsers, and IMAP clients access data through the perimeter network (DMZ)]

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Narrative:

The Exchange Mailbox client is a single entity that stores the configuration information
for all the proxy clients that are associated with it, and coordinates data protection
operations for different proxy clients. The Exchange Mailbox client is a logical entity,
also referred to as a pseudo client.

The Exchange Mailbox Agent, or so-called Proxy, is installed on a Windows server and
provides the mailbox extraction logic and transport pipeline for mailbox data to flow
from the Exchange servers to the MediaAgents. To help overcome the performance
limitations of MAPI, your design can incorporate additional proxies; each is capable
of handling multiple data streams, and adding more proxies yields greater performance.
In some cases it may be beneficial to stand up a larger number of proxies to perform
the initial ingest of data. This number could then be reduced once day-to-day
operations begin. While Exchange on-premises still utilizes MAPI, Exchange Online now
fully utilizes EWS.

To provide LAN-free transport of data to the ContentStore, it is advisable to place the
MediaAgents on the same servers as the Proxies if possible.

As previously discussed, the indexing mechanism is based on the v2 indexing
architecture and replays action logs containing metadata into the Index Server. Make
sure that the Index Server has enough space allocated to it. Use the guidelines
documented in Commvault® Documentation Online to select the appropriate hardware
for your Mailbox Proxies and Index Servers. These guidelines are based on the number
of messages that currently reside on your Exchange servers and whether you are using
Content Indexing or not.

For Exchange on-premises, O365 and Hybrid architectures, please refer to the Microsoft
Applications and Databases module within this course.

OUTLOOK ADD-IN & CONTENTSTORE EMAIL VIEWER

Outlook add-in: End User Access to archived emails directly from Outlook
▪ Stub based message preview and download
▪ Self-Service Data Access
▪ Efficient Restores
▪ Requires Outlook 2007 SP2 >

ContentStore Email Viewer
▪ Access Content Indexed email & attachments
▪ No Stubs, ContentStore appears in Outlook folder view
▪ Keyword and field based queries
▪ Requires Outlook 2010 >
▪ Requires Web Server, Web Console with Proxy and Compliance Search

[Diagram: Exchange and ContentStore Mail]

Narrative:

The Outlook add-in lets users preview and download messages from their Exchange
mailbox profile using Microsoft Outlook providing that Stubs have been configured by
the Administrator. Please note that the Commvault® Outlook add-in requires Outlook
2007 SP2 or higher.

{CLICK} The Outlook add-in with ContentStore Email Viewer additionally lets users
access all of their collected mailbox content in a separate sub-directory within the
Outlook folder view. This has the added benefit of not requiring stubs to be created,
and provides full keyword and field based search queries.

To use the ContentStore Email Viewer, Outlook 2010 or higher must be used. A Web
Server must be installed, along with a Web Console that includes the Proxy Server for
ContentStore. This Proxy Server communicates with the Web Server and Search facility
to retrieve end users' archived and content indexed emails. Finally, the ContentStore
Email Viewer also requires that Compliance Search be installed to provide the
administrative functions for configuring preferences and settings.

The Outlook add-in can be installed through GPO or via third-party deployment
software.

CONTENTSTORE AND EDGE ATTACHMENT STORE

ContentStore Email Viewer and Edge Attachment Store
▪ Automatically move attachments – does not consume space on Exchange Server
▪ Send large files as Email attachments
▪ Target attachments greater than a specific size
▪ Attachments are protected and can be restored at any time
▪ Requires Outlook 2010 >

[Diagram: on send, the Outlook add-in scans the message with attachments, skips inline items, and adds links in place of qualifying attachments]

{NEW CONTENT FOR CVSA18}

Narrative:

One additional feature to be aware of is the Outlook Add-in with ContentStore and
Edge Attachment Store.

The Edge Attachment Store is an add-in to Microsoft Outlook for Windows that sends
the file attachments in your Outlook email message to Commvault® ObjectStore
storage. When you send an email from Outlook, the add-in uploads the attachments in
the email to the ObjectStore. After upload, the add-in replaces the attachment in the
email message with links to preview or download the file from the ObjectStore.

{CLICK} When the user sends an email, the add-in is triggered and scans the attachment
list to see if there are any files qualified to be uploaded. {CLICK} The add-in skips the
inline attachments in the email message. After scanning the email, the qualified
attachments are moved to the ObjectStore and the email message is sent with links to
the attachments in the ObjectStore.

You might also be wondering, what is the Commvault ObjectStore? ObjectStore is the
Commvault data storage repository used by third-party applications to store, manage,
and retrieve application data as objects by using ObjectStore REST APIs. Unlike Edge
Drive®, the ObjectStore does not require a user interface (UI).

EXCHANGE MAILBOX CONTENT INDEXING

Features:
▪ Content attributes added to existing Metadata Index
▪ Leverages (SOLR) Index Server
▪ Does not require separate Commvault® Search Engine
▪ More granular Indexing selection (Subclient/Users)
▪ Previews of messages are available whether or not content indexing is configured

Requirements:
▪ Each access node that is used to run content
indexing jobs must have the Exchange Mailbox
Agent and the Web Server installed.
▪ Allow disk space for 20% of the data size

{UPDATED CONTENT AND NARRATIVE}

Narrative:

To search the entire contents of archived messages in Commvault® search
environments such as the Web Console, Compliance Search interface, and the
ContentStore Email Viewer, the archived messages must be content indexed.

The Exchange Mailbox agent is the first Commvault agent where the Content Indexing
process is abstracted from the Storage policy layer and does not require a separate,
dedicated Search Engine.

Note the additional requirements for Content Indexing of mailbox data: each access
node that is used to run content indexing jobs must have the Exchange Mailbox Agent
and the Web Server installed. Allow disk space on the index server for at least 20
percent of the application size.

ADDITIONAL CAPABILITIES

Journal Mailbox
▪ Collects and deletes email from Exchange journal mailboxes
▪ Data indexed for End User / Compliance Search

ContentStore Mail Server (SMTP)


▪ Journaling support for OnPrem & O365 (No OnPrem Exchange required)
▪ Exchange CAL license not required
▪ Support for multiple Gateways to provide HA and load balancing

Lotus Notes
▪ Traditional agent (not Commvault OnePass™) provides full support for mailbox archiving and
compliance/journal archives
▪ Supports stub based recall, NSF archiving, and search
▪ Recall link support (data does not get restored back to the server)

{VERY MINOR UPDATE TO CONTENT & NARRATIVE FOR CVSA18}

Narrative:

The Exchange Mailbox Agent Journal Mailbox works in conjunction with the message
journaling feature of Microsoft Exchange Server software to archive all incoming and
outgoing messages and attachments. A copy of all incoming and outgoing messages are
captured in the Exchange Journal Mailbox, which is then archived with the Exchange
Mailbox Agent according to the schedules that you create. Incoming messages are
written to the journal mailbox as they come into the Exchange server. Therefore, all
messages are recorded, regardless of whether the message recipients delete the
message in their mailboxes.

ContentStore Mailbox replaces the traditional Exchange Journal Mailbox. The
ContentStore Mailbox is an SMTP Gateway that directly receives emails from Exchange
and stages them to a specified disk location. Commvault® then archives the messages
directly from there into the ContentStore on a regular basis before cleaning up the
messages in the staging area. Therefore there is no requirement for a staging area on
Exchange, or in the case of O365, no requirement to have an additional on-premises
Exchange server.

For Lotus Notes archiving, Commvault uses a traditional agent which is not based on
Commvault OnePass™. This provides full support for mailbox archiving as well as
compliance/journal archives. Stub based recall, NSF archiving, and compliance search
are all supported. In addition, recall link support is now provided, where archived items
can be opened without triggering a recall back to the server.

/presenter notes, not for recording

The Gateway component receives email and stages it to disk; a scheduled archive job
(e.g. hourly) reads email from disk and protects it; a clean-up job removes protected
mails from disk. IP addresses are added on the dashboard to allow traffic (trusted list).
Configure the staging area, create a mail contact on Exchange, create a send connector,
configure a smart host (or HW load balancer) with the IP addresses of the ContentStore
mail servers for load balancing, and create the journal configuration. It is recommended
to use a single mailbox to remove duplicate emails being journaled.

DATA FLOW – OUTLOOK ADD-IN WITH CONTENTSTORE MAIL

[Diagram: user (Internet) → Proxy (DMZ) → Web Server and Index Server (Intranet); steps 1-3 flow inward, steps 4-6 return]

1. User accesses ContentStore Mail (Online Mode, External Access)
2. Proxy request sent to Web Server
3. Web Server queries the Index Server
4. Index Server returns query results and preview to Web Server
5. Web Server returns query results and preview to Proxy
6. Proxy returns query results and preview to ContentStore Mail

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Like the previous data flows we looked at, when ContentStore Mail is the client, the
Web Console is replaced with the Proxy component when accessing content externally
in Online mode.

DATA FLOW – OUTLOOK ADD-IN WITH CONTENTSTORE OFFLINE MODE

[Diagram: user (Internet) with a local cache → Proxy (DMZ) → Web Server and Index Server (Intranet); steps 1-3 are served locally, steps 4-9 synchronize]

1. User accesses ContentStore Mail (Offline Mode, External Access)
2. Local cache queried
3. Cached results returned to ContentStore Mail
4. Periodic synchronization initialized
5. Proxy update request sent to Web Server
6. Web Server queries the Index Server
7. Index Server returns new items to Web Server
8. Web Server returns new items to Proxy
9. Results from Proxy stored in local cache

{UPDATED CONTENT FOR CVSA18}

In this case, we have a user accessing ContentStore Mail externally in Offline Mode. In
this scenario, items are cached on the local device. When the user accesses
ContentStore Mail, the local cache is queried and the cached results are returned.

On a periodic interval, a synchronization request is initiated. A path similar to the
Online mode is used to query the Index Server and cache the results.
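
Schematically, the offline-mode behavior boils down to a cache-first lookup plus a periodic delta sync. A minimal sketch — the class and method names here are illustrative assumptions, not Commvault APIs:

```python
import time

class OfflineMailCache:
    """Illustrative local cache for ContentStore Mail offline mode."""

    def __init__(self, proxy, sync_interval_s: int = 900):
        self.proxy = proxy      # assumed path to the Web Server / Index Server
        self.items = {}         # message id -> cached item
        self.sync_interval_s = sync_interval_s
        self.last_sync = 0.0

    def query(self, keyword: str) -> list:
        # Steps 2-3: the local cache is queried and cached results returned.
        return [m for m in self.items.values() if keyword in m["subject"]]

    def maybe_sync(self) -> None:
        # Steps 4-9: on a periodic interval, fetch new items via the Proxy
        # (which queries the Web Server and Index Server) and cache them.
        if time.time() - self.last_sync < self.sync_interval_s:
            return
        for item in self.proxy.fetch_new_items(since=self.last_sync):
            self.items[item["id"]] = item
        self.last_sync = time.time()
```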

EXCHANGE MAILBOX INDEX SIZING

Mailbox Indexing Requirements

Type                             | Size              | Messages
Archive only (user mailboxes)    | 3-5% of data size | 500 million per index node
Archive only (journal mailboxes) | 5-7% of data size | 500 million per index node
Archive + Content Indexing       | 20% of data size  | 250 million per index node

{NEW CONTENT FOR CVSA18}


Narrative:

For reference, here are the sizing metrics for the Exchange mailbox index. These are
correct at the time of writing but please refer to documentation.commvault.com for
the most up-to-date information, along with system requirements for the index server,
mailbox proxy and ContentStore mail server.
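
The table translates directly into simple arithmetic. A minimal sketch, using the worst-case ratios and per-node message caps from the table (the workload figures in the example are hypothetical):

```python
import math

# (index size as a fraction of data size, messages per index node), per the table above
MAILBOX_INDEX_SIZING = {
    "archive_user":    (0.05, 500_000_000),  # archive only, user mailboxes (3-5%)
    "archive_journal": (0.07, 500_000_000),  # archive only, journal mailboxes (5-7%)
    "archive_plus_ci": (0.20, 250_000_000),  # archive + content indexing
}

def mailbox_index_requirements(workload: str, data_tb: float, messages: int):
    """Return (index disk in TB, index nodes required) for an Exchange workload."""
    ratio, per_node = MAILBOX_INDEX_SIZING[workload]
    return data_tb * ratio, math.ceil(messages / per_node)

# e.g. 40TB of mailbox data, 600 million messages, with content indexing
print(mailbox_index_requirements("archive_plus_ci", 40, 600_000_000))  # (8.0, 3)
```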

EMAIL ARCHIVING DESIGN GUIDELINES

1. Determine scope of solution from data profiles (no. of: mailbox servers, mailboxes, data volume, avg. msg size)

2. Determine solution requirements (archiving, journaling, PST archiving). Is Compliance Search, eDiscovery or Case Manager required?

3. For Exchange, size no. of Mailbox Agents (proxies), Mailbox Index Servers and ContentStore Mail Servers (SMTP), if required.

4. Scope Web Server(s), and Web Console(s). Non Exchange Mailbox solutions require a dedicated Search Engine for Content Indexing.

{VERY MINOR UPDATE TO CONTENT & NARRATIVE FOR CVSA18}

Narrative:

Let’s discuss a four step process to help you design an email archiving solution.

One: First, determine the scope of your overall solution. Your data profiling exercise
should help you discover what the environment looks like. However, you may need
to speak to an Exchange admin to get all the nitty-gritty detail.

Two: Ask yourself the question: what is the customer ultimately trying to achieve? Are
you proposing the correct software features to address the requirements of the
customer? Remember you may have multiple requirements to meet and different
individuals to engage with, for example the Compliance officer and the Exchange
admin.

Steps one and two are tightly integrated, and it is a good idea to sketch the solution out
on a whiteboard or notepad as it will help you visualize the environment.

Three: For Exchange, remember to specify at least one Mailbox Proxy, which can also be
a MediaAgent. You should also design a separate Mailbox Index Server as per the
guidance on Commvault® Documentation Online. If you are deploying the ContentStore
Mail Server (SMTP) solution, remember to include more than one SMTP Server for fault
tolerance and load balancing.

Four: Scope the appropriate number of Web Servers and Web Consoles as per the
previous guidance in this module. You may only need one of each in small
environments. Finally, remember that any solution not leveraging the Exchange Mailbox
agent (for example, Lotus Notes) will require a dedicated Search Engine for Content
Indexing.

File System Archive

In this next section we will discuss designing for Commvault® file system archiving.

COMMVAULT ONEPASS™ OVERVIEW

Synthetic Full Backups

[Diagram: data flows from primary storage to secondary storage; clean-up on primary storage is based on archiving rules, clean-up on secondary storage is based on retention parameters]

Rule-based OnePass Archiving

Narrative:

As you may recall from a previous module, the Commvault OnePass™ feature is a
comprehensive solution incorporating the traditional backup and archiving processes
in a single operation. The data gets backed up only once as part of the backup
operation, and the files that meet the archiving rules are stubbed in place. Stubs point
to the data that was already moved as part of the backup. This integrated agent is able
to selectively age items based on data and stubs that are deleted on the primary
storage, allowing you to reclaim space in your secondary storage. It also provides
data statistics and reporting functions without having to revisit the primary storage
again.

ARCHIVE PREDICTION

System Discovery Tool


▪ Any environment
▪ Lightweight tool (no installer)
▪ System information
▪ File Level Analytics

File Level Analytics Report


▪ Existing Commvault environments
▪ File modified times
▪ Access or creation times
▪ Files based on size
▪ Granular File Categories & File Extensions

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Narrative:
Before finalizing your Commvault OnePass™ design, it might be worthwhile running
some form of data analytics to see what impact archiving will have on your customer's
environment.

First, the System Discovery tool will help find specific machines and file-level details
such as file sizes, file types based on extensions, and file ages. You can then upload the
data to the cloud network, where the contents of the ZIP file are used to analyze the
discovered data and generate reports based on the data.

{CLICK} If the customer is already running Commvault®, then you can use File Level
Analytics to predict how much space would be saved if files meeting specific criteria
are archived, e.g. files that have not been accessed for a long time, files that exceed
a certain size, or even files that were accessed recently. File Level Analytics can help in
predicting how much space is consumed on backup disk based on certain granular
filters.

The report sample here shows a list of files that were modified at a certain time and
have a specific size.

FILE DISCOVERY, ARCHIVING AND MIGRATING DATA TO CLOUD

[Diagram: an Index Server (Index Store, Data Analytics Role) and shared access nodes scanning CIFS and NFS shares; dashboard reports]

▪ Web-based UI through Admin Console
▪ Analyze contents and structure of file servers
▪ View statistical information – data volume, type, age, size etc.
{NEW CONTENT FOR CVSA18}

The File Discovery and Archiving solution provides a simple web-based user interface to
discover and analyze the contents and structure of an organization's file servers and
anticipate the need for saving storage space. The file discovery solution runs a data
discovery job and collects the data. Based on the data collected, the file discovery
solution displays a dashboard in the Admin Console that allows you to view statistical
information about your data, such as total data, data types, data age, and data size.

{CLICK} A backup operation is not required to analyze content with the archiving
solution. After you run the data discovery job, the Admin Console dashboard will be
updated to show the following: the total size of data available on the disk or file server;
the different types of data available based on file extensions, such as documents,
multimedia, and binaries; and the modified time and last access time of files.

{CLICK} The Rules pane lets you customize the archiving rules based on file size and file
modified or access time. The dashboard updates automatically as you drag the slider
bars horizontally, giving the user instant feedback on the impact such archiving
conditions would have if implemented. You can then also archive data based on the
information you see in the dashboard.
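
Conceptually, each slider position is just a predicate evaluated against every discovered file. A minimal sketch of that rule logic — the thresholds and names are illustrative, not the product's implementation:

```python
import os
import time

DAY_SECONDS = 86_400

def qualifies_for_archiving(path: str, min_size_mb: float = 10.0,
                            min_age_days: int = 365) -> bool:
    """Archive files over the size threshold that have not been modified
    or accessed within the chosen window (mirroring the dashboard sliders)."""
    st = os.stat(path)
    stale = max(st.st_mtime, st.st_atime) < time.time() - min_age_days * DAY_SECONDS
    large = st.st_size >= min_size_mb * 1024 * 1024
    return stale and large
```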

{CLICK} The file details pane displays a list of individual files that will either remain on
disk, be archived on the next job, or have already been archived. Additionally, you can
apply several filters to each list to help you identify specific files or data sets.

{CLICK}
The File discovery solution requires an index server to be deployed with the data
analytics engine enabled.

{CLICK} The shared access node is a system that has access to the network shares and is
used for live scan and archive operations. The shared access node requires the File
System Agent and MediaAgent packages to be installed. {CLICK} For NFS, you must have
a UNIX MediaAgent that has the NFS share mounted on it.

The Archiving solution allows you to apply the rules from the dashboard to back up and
archive data in a single operation. Additionally, you can also archive data to cloud
storage and retain the data for a longer period of time.

COMMVAULT ONEPASS™ FOR NAS

[Diagram: network filers with CIFS shares and NFS shares; the Commvault server registers as an archive provider using the filer's native API (e.g. NetApp FPolicy, Celerra FileMover)]

The Commvault OnePass™ Agent for NAS allows you to protect and archive network
filers from NetApp, EMC, BlueArc and Isilon. Though it fundamentally functions in the
same way as the OnePass Agent for file systems, there are some key differences to
illustrate.

{CLICK}
Often network filers present their shares using the CIFS or SMB protocol. This is the
primary network sharing protocol for Windows clients.

{CLICK}
Before we can archive data, the Commvault server must register itself as an archive
provider. We use the vendor's native API, such as NetApp FPolicy or Celerra FileMover.
This allows for seamless recall of stub data from clients accessing the shares.

{CLICK}
Next you must identify a proxy machine. This can be any Windows client with the
filesystem agent installed. Often the MediaAgent is used for this role. When the backup
occurs, the data will be moved from the filer to the proxy computer via the CIFS
protocol. One important point to note is that self-service requires CIFS.

{CLICK}
In other cases, shares are presented via NFS. This is the primary file sharing protocol for
Linux or Unix clients.

{CLICK}
The backup follows the same process, however, you must identify a Linux or Unix proxy
client in order to perform NFS based backups. Again, only the standard filesystem agent
must be installed on this machine.

COMMVAULT ONEPASS™ FOR NETWORK SHARES

Network Share (CIFS)

{NEW CONTENT FOR CVSA18}

In addition to NAS Filers, the Commvault OnePass™ feature also supports network
shares hosted on Windows servers. In this configuration a Windows File System Agent
is installed on a computer that acts as a proxy computer. The proxy computer initiates
the archive and recall operations with the file server.

Commvault uses the CIFS protocol to access the data that is located on the file server.
After OnePass is enabled on a client, users can create subclients with the content that
they want to archive.

When a job starts, the proxy computer scans the file server for data that meets the
archiving rules that you defined, and then creates the stubs. Data that does not meet
the archiving rules is backed up. The archived data is then moved to the MediaAgent.
This frees up space on the file server, but the stub file acts as a place holder for the
archived data.

EXPLORER PLUG-IN

Integrated Solution for Microsoft Windows Explorer


▪ Seamless user access to protected and archived data
▪ Supports single, multiple, folder and recursive recalls
▪ Requires Windows File System Agent
▪ Finder Plugin available for Mac OS
▪ Required for NAS Recalls in most cases

{VERY MINOR UPDATE TO SLIDE CONTENT}

Narrative:
The Commvault® Windows Explorer Plug-in allows users to access the data protected
and archived by Commvault. It provides browsable views of backed up data and stubs
on Windows Explorer by utilizing custom Windows libraries and overlay icons. The
Explorer plug-in needs to be installed on all clients accessing an archived file system
and it avoids having to install the Commvault OnePass™ driver on the server itself,
therefore the target server does not need to be rebooted, which is important for some
customers.

{CLICK} There is also a Finder plugin for Mac OS.

{CLICK} In most cases the Explorer plugin is also utilized for end user recalls from NAS;
however this is specific to the vendor and configuration type, please refer to the
Commvault documentation when planning your design.

FILE ARCHIVING DESIGN GUIDELINES

1. Determine scope of solution from data profiles (no. of File Servers, NAS, Files, Data Volume). Use Archive Prediction tools if appropriate.

2. Determine solution requirements. Is Compliance Search, eDiscovery, or Content Analyzer required? Determine how end users will recall data if required.

3. Size core infrastructure components such as MediaAgents and Storage. Plan appropriate Proxy server(s) if archiving from NAS / File Shares.

4. Scope Web Server(s), and Web Console(s). File Archiving solutions require a dedicated Search Engine for Content Indexing.

{UPDATED CONTENT AND NARRATIVE FOR CVSA18}

Narrative:

Let’s discuss a simple four step process to help you design a file archiving solution.

One: First, determine the scope of your overall solution. Your data profiling exercise
should help you discover what the environment looks like. How many file or NAS
servers are you dealing with? Are there a large number of files to contend with, or large
file sizes perhaps? Remember, you can utilize archive prediction tools if appropriate
for your opportunity.

Two: Determine the customer requirements. Is this more than just a space saving
exercise? Is Compliance Search, eDiscovery, or Content Analyzer functionality required?
Also, remember to consider how end users will access or recall their data, if this is a
requirement.

Steps one and two are tightly integrated, and it is a good idea to sketch the solution out
on a whiteboard or notepad as it will help you visualize the environment.

Three: For file archiving you could potentially be dealing with very large data volumes.
Take care to size the core infrastructure components, such as MediaAgents and storage
repositories, appropriately, allowing for future growth and placement of such
components to help facilitate data access in a timely fashion. If you are proposing file
archiving on NAS, then remember to plan for Proxy servers with the appropriate OS,
depending on the protocol.

Four: Scope the appropriate number of Web Servers and Web Consoles as per the
previous guidance in this module. You may only need one of each in small
environments. Finally, remember that any file system archiving solution will require a
dedicated Search Engine for Content Indexing.

WRAP-UP
Commvault® Search Architecture, Features & Components
▪ Architecture
▪ Sizing the Search Engine
▪ Web Components
▪ Data Flows

Email Archiving
▪ Exchange Mailbox Agent Architecture
▪ Outlook Add-in and ContentStore Email Viewer
▪ Exchange Mailbox Content Indexing
▪ Additional Capabilities
▪ Design Guidelines

File Archiving
▪ Commvault OnePass™ Architecture
▪ Archive Prediction
▪ OnePass for File-System, NAS, and File Shares
▪ Explorer Plug-In
▪ Design Guidelines

Narrative:

Thank you for watching.

In this module you first learned about the Commvault® architecture, features, and
components. You then learned how to size a search engine and heard about the
required web components and the data flows associated with those components.

We then discussed how to design Commvault solutions for email archiving and learned
about the additional associated components and capabilities.

Finally, we discussed file-system archiving for file servers, NAS, and file shares. You
also heard how end users can access their data via the Explorer plug-in, and the design
guidelines to follow when architecting a file archiving solution.

Questions?
Suggestions?

