tMongoDBLookupInput - 6.3

Talend Components Reference Guide

EnrichVersion: 6.3
EnrichProdName: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Open Studio for Big Data, Talend Open Studio for Data Integration, Talend Open Studio for Data Quality, Talend Open Studio for ESB, Talend Open Studio for MDM, Talend Real-Time Big Data Platform
task: Data Governance, Data Quality and Preparation, Design and Development
EnrichPlatform: Talend Studio

Function

tMongoDBLookupInput reads a MongoDB database and extracts fields based on a query.

Purpose

tMongoDBLookupInput executes a database query with a strictly defined order which must correspond to the schema definition.

It passes on the extracted data to tMap in order to provide the lookup data to the main flow. It must be directly connected to a tMap component and requires this tMap to use Reload at each row or Reload at each row (cache) for the lookup flow.

tMongoDBLookupInput properties in Spark Streaming Jobs

Component family

Databases/MongoDB

 

Basic settings

Property type

Either Built-In or Repository.

Built-In: No property data stored centrally.

Repository: Select the repository file where the properties are stored.

 

MongoDB configuration

Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined.

 

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

If a column in the database is a JSON document and you need to read the entire document, put an asterisk (*) in the DB column column, without quotation marks around it.

 

Collection

Enter the name of the collection to be used.

A MongoDB collection is the equivalent of an RDBMS table and contains documents.

 

Set read preference

Select this check box and from the Read preference drop-down list that is displayed, select the member to which you need to direct the read operations.

If you leave this check box clear, the Job uses the default Read preference, that is to say, uses the primary member in a replica set.

For further information, see MongoDB's documentation about Replication and its Read preferences.

 

Query

Specify the query statement to select documents from the collection specified in the Collection field.

For example:

"{'customer_id':" + row1.customer_id +"}"

In this code, row1 is not the label of the link to tMongoDBLookupInput, but represents the main row entering tMap.

The result of the query must contain only records that match the join key you need to use in tMap. In other words, you must use the schema of the main flow to tMap to construct the query statement here in order to load only the matched records into the lookup flow.

This approach ensures that no redundant records are loaded into memory and outputted to the component that follows.
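
For instance, if the main row entering tMap carries a customer_id value of 1001 (a purely hypothetical value), the expression above resolves at runtime to the following selector:

    {'customer_id':1001}

Because the query is re-evaluated for every incoming main row (hence the Reload at each row or Reload at each row (cache) requirement), only the documents matching that row are loaded into the lookup flow.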

 

Mapping

Each column of the schema defined for this component represents a field of the documents to be read. In this table, you need to specify the parent nodes of these fields, if any.

For example, in the following document:

{
    _id: ObjectId("5099803df3f4948bd2f98391"),
    person: { first: "Joe", last: "Walker" }
}

The first and the last fields have person as their parent node but the _id field does not have any parent node. So once completed, this Mapping table should read as follows:

Column    Parent node path
_id
first     "person"
last      "person"
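
To picture the result: with this mapping, the document above is read as one record in which the first column contains "Joe", the last column contains "Walker", and the _id column contains the document's ObjectId value (how that value is rendered depends on the type you define for the _id column in the schema).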
 

Limit

Enter the maximum number of records to be retrieved.

Advanced settings

No query timeout

Select this check box to prevent MongoDB servers from stopping idle cursors after 10 minutes of inactivity. In this situation, an idle cursor stays open until either its results are exhausted or you manually close it using the cursor.close() method.

A cursor for MongoDB is a pointer to the result set of a query. By default, that is to say, with this check box being clear, a MongoDB server automatically stops idle cursors after a given inactivity period to avoid excess memory use. For further information about MongoDB cursors, see https://docs.mongodb.org/manual/core/cursors/.
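
For illustration only, the following standalone sketch shows the driver-level counterpart of this option in the MongoDB Java driver (3.x API). The connection string, database and collection names are placeholders; the Studio generates its own code, so this snippet only illustrates what keeping a no-timeout cursor open looks like:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import org.bson.Document;

    public class NoTimeoutCursorSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> collection =
                        client.getDatabase("mydb").getCollection("mycollection");
                // noCursorTimeout(true) is the driver-level counterpart of the check box:
                // the server does not discard this cursor after its idle-timeout period.
                try (MongoCursor<Document> cursor =
                             collection.find().noCursorTimeout(true).iterator()) {
                    while (cursor.hasNext()) {
                        System.out.println(cursor.next().toJson());
                    }
                } // closing the cursor here is the manual equivalent of cursor.close()
            }
        }
    }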

Usage in Spark Streaming Jobs

This component is used as a start component and requires an output link.

This component should use a tMongoDBConfiguration component present in the same Job to connect to a MongoDB database. You need to drop a tMongoDBConfiguration component alongside this component and configure the Basic settings of this component to use tMongoDBConfiguration.

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Spark Connection

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, one and only one file system related component from the Storage family is required in the same Job so that Spark can use this component to connect to the file system to which the jar files dependent on the Job are transferred.

This connection is effective on a per-Job basis.

Reading and writing data in MongoDB using a Spark Streaming Job

In this scenario, you create a Spark Streaming Job to extract data about given movie directors from MongoDB, use this data to filter and complete movie information and then write the result into a MongoDB collection.

The sample data about movie directors reads as follows:

1;Gregg Araki	
2;P.J. Hogan 
3;Alan Rudolph 
4;Alex Proyas
5;Alex Sichel

This data contains the names of these directors and the ID numbers distributed to them.

The structure of this data in MongoDB reads as follows:

{ "_id" : ObjectId("575546da3b1c7e22bc7b2189"), "person" : { "id" : 3, "name" : "Alan Rudolph" } }
{ "_id" : ObjectId("575546da3b1c7e22bc7b218b"), "person" : { "id" : 4, "name" : "Alex Proyas" } }
{ "_id" : ObjectId("575546da3b1c7e22bc7b218c"), "person" : { "id" : 5, "name" : "Alex Sichel" } }
{ "_id" : ObjectId("575546da3b1c7e22bc7b2188"), "person" : { "id" : 1, "name" : "Gregg Arakit" } }
{ "_id" : ObjectId("575546da3b1c7e22bc7b218a"), "person" : { "id" : 2, "name" : "P.J. Hogan" } }

Note that the sample data is created for demonstration purposes only.

Prerequisites:

  • The Spark cluster and the MongoDB database to be used have been properly installed and are running.

  • The above-mentioned data has been loaded in the MongoDB collection to be used.

To replicate this scenario, proceed as follows:

Linking the components

  1. In the Integration perspective of the Studio, create an empty Spark Streaming Job from the Job Designs node in the Repository tree view.

    For further information about how to create a Spark Streaming Job, see Talend Big Data Getting Started Guide.

  2. In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tHDFSConfiguration, tMongoDBConfiguration, tFixedFlowInput, tMongoDBOutput, tMongoDBLookupInput, tMap and tLogRow.

    The tFixedFlowInput component is used to load the movie data into the data flow. In real-world practice, you could use other components such as tFileInputDelimited instead to design a more sophisticated process to prepare your data.

  3. Connect tFixedFlowInput to tMap using the Row > Main link.

    This way, the main flow to tMap is created. The movie information is sent via this flow.

  4. Connect tMongoDBLookupInput to tMap using the Row > Main link.

    This way, the lookup flow to tMap is created. The movie director information is sent via this flow.

  5. Connect tMap to tMongoDBOutput using the Row > Main link and name this connection in the dialog box that is displayed. For example, name it out1.

  6. Do the same to connect tMap to tLogRow and name this connection reject.

  7. Leave tHDFSConfiguration and tMongoDBConfiguration alone without any connection.

Setting up Spark connection

  1. Click Run to open its view and then click the Spark Configuration tab to display its view for configuring the Spark connection.

  2. Select the type of the Spark cluster you need to connect to.

    • Local: the Studio builds the Spark environment in itself at runtime to run the Job locally within the Studio. With this mode, each processor of the local machine is used as a Spark worker to perform the computations. This mode requires minimum parameters to be set in this configuration view.

      Note this local machine is the machine in which the Job is actually run. The Local mode is the default mode and you need to clear its check box to display the drop-down list for you to select the other modes.

    • Standalone: the Studio connects to a Spark-enabled cluster to run the Job from this cluster.

    • Yarn client: the Studio runs the Spark driver to orchestrate how the Job should be performed and then sends the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.

  3. If you are using the Yarn client mode, the Property type list is displayed to allow you to select an established Hadoop connection from the Repository, on the condition that you have created this connection in the Repository. Then the Studio will reuse that set of connection information for this Job.

    For further information about how to create a Hadoop connection in the Repository, see the chapter describing the Hadoop cluster node of the Talend Studio User Guide.

  4. Select the version of the Hadoop distribution you are using along with Spark.

    • If you select Microsoft HD Insight 3.4, you need to configure the connections to the Livy service, the HD Insight service and the Windows Azure Storage service of that cluster in the areas that are displayed. A demonstration video about how to configure a connection to Microsoft HD Insight cluster is available in the following link: https://www.youtube.com/watch?v=A3QTT6VsNoM.

      The configuration of Livy is not presented in this video. The Hostname of Livy uses the following syntax: your_spark_cluster_name.azurehdinsight.net. For further information about the Livy service used by HD Insight, see Submit Spark jobs using Livy.

    • If you select Amazon EMR, see the article Amazon EMR - Getting Started on Talend Help Center (https://help.talend.com) for how to configure the connection.

      It is recommended to install your Talend Jobserver in the EMR cluster. For further information about this Jobserver, see Talend Installation Guide.

    If you cannot find the distribution corresponding to yours from this drop-down list, this means the distribution you want to connect to is not officially supported by Talend. In this situation, you can select Custom, then select the Spark version of the cluster to be connected and click the button to display the dialog box in which you can alternatively:

    1. Select Import from existing version to import an officially supported distribution as base and then add other required jar files which the base distribution does not provide.

    2. Select Import from zip to import the configuration zip for the custom distribution to be used. This zip file should contain the libraries of the different Hadoop/Spark elements and the index file of these libraries.

      Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy. As such, you should only attempt to set up such a connection if you have sufficient Hadoop and Spark experience to handle any issues on your own.

  5. Configure the connection information to the principal services of the cluster to be used.

    If you are using the Yarn client mode, you need to enter the addresses of the following different services in their corresponding fields (if you leave the check box of a service clear, then at runtime, the configuration of this parameter in the Hadoop cluster to be used will be ignored):

    • In the Resource manager field, enter the address of the ResourceManager service of the Hadoop cluster to be used.

    • Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.

    • Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

    • Select the Set staging directory check box and enter the directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

    • If you are accessing the Hadoop cluster running with Kerberos security, select this check box, then, enter the Kerberos principal names for the ResourceManager service and the JobHistory service in the displayed fields. This enables you to use your user name to authenticate against the credentials stored in Kerberos. These principals can be found in the configuration files of your distribution. For example, in a CDH4 distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

      • If this cluster is a MapR cluster of the version 4.0.1 or later, you can set the MapR ticket authentication configuration in addition or as an alternative by following the explanation in Connecting to a security-enabled MapR.

        Keep in mind that this configuration generates a new MapR security ticket for the username defined in the Job in each execution. If you need to reuse an existing ticket issued for the same username, leave both the Force MapR ticket authentication check box and the Use Kerberos authentication check box clear, and then MapR should be able to automatically find that ticket on the fly.

      If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field.

      Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

    • The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login user name for your distribution. If you leave it empty, the user name of the machine hosting the Studio will be used.

      Since the Job needs to upload jar files to the HDFS of the cluster to be used, you must ensure that this user name is the same as the one you have put in tHDFSConfiguration, the component used to provide HDFS connection information to Spark.

    If you are using the Standalone mode, you need to set the following parameters:

    • In the Spark host field, enter the URI of the Spark Master of the Hadoop cluster to be used.

    • In the Spark home field, enter the location of the Spark executable installed in the Hadoop cluster to be used.

  6. If you need to run the current Job on Windows, it is recommended to specify where the winutils.exe program to be used is stored.

    • If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.

    • Otherwise, leave this check box clear, the Studio generates one by itself and automatically uses it for this Job.

  7. If the Spark cluster cannot recognize the machine in which the Job is launched, select the Define the driver hostname or IP address check box and enter the host name or the IP address of this machine. This allows the Spark master and its workers to recognize this machine to find the Job and thus its driver.

    Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file.

  8. In the Batch size field, enter the time interval at the end of which the Job reviews the source data to identify changes and processes the new micro batches.

  9. If needs be, select the Define a streaming timeout check box and in the field that is displayed, enter the time frame at the end of which the streaming Job automatically stops running.

  10. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in which Spark stores, in the file system of the cluster, the context data of the streaming computation such as the metadata and the generated RDDs of this computation.

    For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#checkpointing.

  11. Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by any resource issue in the cluster such as CPU, bandwidth or memory (a property-level sketch follows this list):

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.

      • Auto: you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by defining the initial number of executors to run in the Initial executors field, the lowest number of executors in the Min executors field and the largest number of executors in the Max executors field.

      This feature is available to the Yarn client mode only.

    • Activate backpressure: select this check box to enable the backpressure feature of Spark. The backpressure feature is available in Spark version 1.5 and onwards. With backpressure enabled, Spark automatically finds the optimal receiving rate and dynamically adapts the rate based on current batch scheduling delays and processing times, in order to receive data only as fast as it can process it.
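
    For reference, the following standalone sketch maps the fields above to Spark property names taken from the Spark documentation. Except for spark.yarn.executor.memoryOverhead, which the Set executor memory description names explicitly, the field-to-property mapping is an assumption, and the values are placeholders rather than recommendations:

      import org.apache.spark.SparkConf;

      public class TuningPropertiesSketch {
          public static void main(String[] args) {
              SparkConf conf = new SparkConf()
                      .set("spark.driver.memory", "2g")                      // Driver memory
                      .set("spark.driver.cores", "1")                        // Driver core
                      .set("spark.executor.memory", "4g")                    // Executor memory
                      .set("spark.yarn.executor.memoryOverhead", "512")      // Set executor memory (off-heap, in MB)
                      .set("spark.executor.cores", "2")                      // Core per executor
                      .set("spark.ui.port", "4041")                          // Set Web UI port
                      .set("spark.streaming.backpressure.enabled", "true");  // Activate backpressure
              System.out.println(conf.toDebugString());
          }
      }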

  12. In the Spark "scratch" directory field, enter the directory in which the Studio stores in the local system the temporary files such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.

  13. In the Yarn client mode, you can enable the Spark application logs of this Job to be persistent in the file system. To do this, select the Enable Spark event logging check box (a property-level sketch follows this step).

    The parameters relevant to Spark logs are displayed:

    • Spark event logs directory: enter the directory in which Spark events are logged. This is actually the spark.eventLog.dir property.

    • Spark history server address: enter the location of the history server. This is actually the spark.yarn.historyServer.address property.

    • Compress Spark event logs: if needs be, select this check box to compress the logs. This is actually the spark.eventLog.compress property.

    Since the administrator of your cluster could have defined these properties in the cluster configuration files, it is recommended to contact the administrator for the exact values.
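
    As a point of reference, the following standalone sketch assembles these logging settings as Spark properties: spark.eventLog.dir, spark.yarn.historyServer.address and spark.eventLog.compress are the properties named above, spark.eventLog.enabled is assumed here to be what the Enable Spark event logging check box controls, and the directory and history server address are placeholder values to be replaced by those defined by your administrator:

      import org.apache.spark.SparkConf;

      public class EventLogPropertiesSketch {
          public static void main(String[] args) {
              SparkConf conf = new SparkConf()
                      .set("spark.eventLog.enabled", "true")                                       // Enable Spark event logging
                      .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-logs")                // Spark event logs directory
                      .set("spark.yarn.historyServer.address", "historyserver.example.com:18080")  // Spark history server address
                      .set("spark.eventLog.compress", "true");                                     // Compress Spark event logs
              System.out.println(conf.toDebugString());
          }
      }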

  14. In the Advanced properties table, add any Spark properties you need to use to override their default counterparts used by the Studio.

    The advanced properties required by different Hadoop distributions and their values are listed below:

    • Hortonworks Data Platform V2.4:

      • spark.yarn.am.extraJavaOptions: -Dhdp.version=2.4.0.0-169

      • spark.driver.extraJavaOptions: -Dhdp.version=2.4.0.0-169

      In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM settings area either in the Advanced settings tab of the Run view or in the Talend > Run/Debug view of the [Preferences] window. Setting this argument in the [Preferences] window applies it on all the Jobs that are designed in the same Studio.

    • MapR V5.1 and V5.2 when the cluster is used with the HBase or the MapRDB components:

      • spark.hadoop.yarn.application.classpath: enter the value of this parameter specific to your cluster and add, if missing, the classpath to HBase to ensure that the Job to be used can find the required classes and packages in the cluster.

        For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster and then add opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). These added paths are where HBase is usually installed in a MapR cluster. If your HBase is installed elsewhere, contact the administrator of your cluster for details and adapt these paths accordingly.

        For a step-by-step explanation about how to add this parameter, see the documentation HBase/MapR-DB Job cannot successfully run with MapR 5.1 or 5.2 on Talend Help Center.

    For further information about the valid Spark properties, see Spark documentation at https://spark.apache.org/docs/latest/configuration.

Configuring the connection to the file system to be used by Spark

  1. Double-click tHDFSConfiguration to open its Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution you need to connect to and its version.

  3. In the NameNode URI field, enter the location of the machine hosting the NameNode service of the cluster.

  4. In the Username field, enter the authentication information used to connect to the HDFS system to be used. Note that this user name must be the same as the one you entered in the Spark configuration tab.

Configuring the connection to the MongoDB database to be used by Spark

  1. Double-click tMongoDBConfiguration to open its Component view.

  2. From the DB Version list, select the version of the MongoDB database to be used.

  3. In the Server field and the Port field, enter the corresponding connection information for the MongoDB database.

  4. In the Database field, enter the name of the database. This database must already exist.

Loading the movie data

  1. Double-click the tFixedFlowInput component to open its Component view.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. Click the [+] button to add the schema columns: movieID, title, release, url and directorID.

  4. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

  5. In the Input repetition interval field, enter the time interval at the end of which tFixedFlowInput sends the movie data again. This allows you to generate a stream of data.

  6. In the Mode area, select the Use Inline Content radio button and paste the following data into the Content field that is displayed.

    691;Dark City;1998;http://us.imdb.com/M/title-exact?imdb-title-118929;4
    1654;Chairman of the Board;1998;http://us.imdb.com/Title?Chairman+of+the+Board+(1998);6
    903;Afterglow;1997;http://us.imdb.com/M/title-exact?imdb-title-118566;3
    255;My Best Friend's Wedding;1997;http://us.imdb.com/M/title-exact?My+Best+Friend%27s+Wedding+(1997);2
    1538;All Over Me;1997;http://us.imdb.com/M/title-exact?All%20Over%20Me%20%281997%29;5
  7. In the Field separator field, enter a semicolon (;).

Extracting director data from MongoDB

  1. Double-click tMongoDBLookupInput to open its Component view.

  2. Click the [...] button next to Edit schema to open the schema editor.

  3. Click the [+] button to add the schema columns: id and name.

  4. In the Collection field, enter the name of the collection from which tMongoDBLookupInput extracts data.

  5. In the Query field, enter the following query.

    "{'person.id':" + row2.directorID +"}"

    In this statement, row2 represents the main flow to tMap and row2.directorID the directorID column of this flow. You need to adapt this row2 to the label of the main flow link in your Job.

    The whole statement selects every record in which the id field within the person field has the same value as this directorID column (see the example after this list).

    This query shows how to use the schema of the main flow to construct the query statement that loads only the matched records into the lookup flow. This approach ensures that no redundant records are stored in memory before being sent to tMap.

  6. In the Mapping table, the id and the name columns have been automatically added. Enter, within double quotation marks, person in the Parent node path column for each row.

    This table defines how the hierarchical construct of the data from MongoDB should be interpreted in order to fit the schema of tMongoDBLookupInput.
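
As an illustration of the query entered in step 5: when the incoming main row is the one for Dark City (directorID 4 in the sample movie data), the query string resolves at runtime to

    {'person.id':4}

which returns only the document describing Alex Proyas from the director collection.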

Configuring the transformation in tMap

  • Double-click tMap to open its Map Editor view.

Creating the output schema

  1. On the input side (left side) of the Map Editor, each of the two tables represents one of the input flows, the upper one for the main flow and the lower one for the lookup flow.

    On the output side (right side), the two tables represent the output flows that you named as out1 and reject previously.

    From the main flow table, drop the movieID, title, release and url columns onto each of the output flow tables.

  2. Also drop the directorID column from the main flow table onto the reject output table.

  3. From the lookup flow, drop the name column onto each of the output flow tables.

    Then, in the Schema editor view, you can see that the schemas of both sides have been completed.

Setting the mapping conditions

  1. From the main flow table, drop the directorID column onto the lookup table, in the Expr. key column of the id row.

    This defines the column used to provide join keys.

  2. On the lookup flow table, click the button to open the setting panel in this table.

  3. Click the Value column of the Lookup model row to display the [...] button and click this button to open the [Options] window.

  4. Select Reload at each row and click OK to validate this choice.

  5. Do the same in the Join model row to display the corresponding [Options] window.

  6. Select Inner Join to ensure that only the matched records between the main flow and the lookup flow are outputted.

  7. On the reject output flow table, click the button to open the setting panel.

  8. In the Catch lookup inner join reject row, click the Value column to display the [...] button and click this button to open the [Options] window.

  9. Select true to send the records filtered out by the inner join into the reject flow and click OK to validate this choice.

  10. Click Apply, then click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

Writing processed data to MongoDB

  1. Double-click tMongoDBOutput to open its Component view.

  2. If this component does not have the same schema as the preceding component, a warning icon appears. In this situation, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears.

  3. In the Collection field, enter the name of the collection to which you need to write data. If this collection does not exist, it will be automatically created at runtime.

  4. From the Action on data list, select the operation to be performed on the data. In this example, select Insert, which creates documents in MongoDB whether or not these documents already exist, and in either case generates a new technical ID for each new document.

  5. Leave the Mapping table as is. This adds each record to the root of each document.

Writing rejected data to tLogRow

  1. Double-click tLogRow to open its Component view.

  2. If this component does not have the same schema as the preceding component, a warning icon appears. In this situation, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears.

  3. Select the Table radio button to present the result in a table.

Executing the Job

Then you can press F6 to run this Job.

Once done, in the console of the Run view, you can see the data rejected by the inner join.

This data is displayed several times because tFixedFlowInput has created a data stream by regularly sending out the same records.

Note that you can manage the level of the execution information to be outputted in this console by selecting the log4jLevel check box in the Advanced settings tab and then selecting the level of the information you want to display.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

In the default MongoDB database, you can check the documents that have been created in the movie collection.

{ "_id" : ObjectId("57559a613b1c7e2e6497b2bb"), "movieID" : 691, "title" : "Dark City", "release" : "1998", "url" : "http://us.imdb.com/M/title-exact?imdb-title-118929", "director_name" : "Alex Proyas" }
{ "_id" : ObjectId("57559a613b1c7e2e6497b2bc"), "movieID" : 903, "title" : "Afterglow", "release" : "1997", "url" : "http://us.imdb.com/M/title-exact?imdb-title-118566", "director_name" : "Alan Rudolph " }
{ "_id" : ObjectId("57559a613b1c7e2e6497b2be"), "movieID" : 255, "title" : "My Best Friend's Wedding", "release" : "1997", "url" : "http://us.imdb.com/M/title-exact?My+Best+Friend%27s+Wedding+(1997)", "director_name" : "P.J. Hogan " }
{ "_id" : ObjectId("57559a613b1c7e2e6497b2c0"), "movieID" : 1538, "title" : "All Over Me", "release" : "1997", "url" : "http://us.imdb.com/M/title-exact?All%20Over%20Me%20%281997%29", "director_name" : "Alex Sichel" }
{ "_id" : ObjectId("57559a613b1c7e2e6497b2ba"), "movieID" : 691, "title" : "Dark City", "release" : "1998", "url" : "http://us.imdb.com/M/title-exact?imdb-title-118929", "director_name" : "Alex Proyas" }
{ "_id" : ObjectId("57559a613b1c7e2e6497b2bd"), "movieID" : 903, "title" : "Afterglow", "release" : "1997", "url" : "http://us.imdb.com/M/title-exact?imdb-title-118566", "director_name" : "Alan Rudolph " }
{ "_id" : ObjectId("57559a613b1c7e2e6497b2bf"), "movieID" : 255, "title" : "My Best Friend's Wedding", "release" : "1997", "url" : "http://us.imdb.com/M/title-exact?My+Best+Friend%27s+Wedding+(1997)", "director_name" : "P.J. Hogan " }
{ "_id" : ObjectId("57559a613b1c7e2e6497b2c1"), "movieID" : 1538, "title" : "All Over Me", "release" : "1997", "url" : "http://us.imdb.com/M/title-exact?All%20Over%20Me%20%281997%29", "director_name" : "Alex Sichel" }

The movie information now contains the names of the directors instead of their IDs. The same records have been written several times to the collection, but their technical IDs (the _id field) are all distinct.