Web Wash: Indexing PDF Docs using Search API and Solr in Drupal

To add the view as a tab on the content admin page:

Before adding facet filters, create the taxonomy structure to categorize documents.

Facets enable users to filter search results by categories, authors, or other field values.

Click the Fields tab and add:

Views default to unrestricted access. For admin pages, configure permissions:

Setting Up Media Thumbnails

Install the DDEV Solr add-on from the terminal:

Search API Attachments requires a configured private file system. In your settings.php file, add:

Installing Media Thumbnail Modules

Set the Document field type to Fulltext to enable keyword searching within document content.

Solr’s built-in extractor provides the simplest setup because it requires no additional libraries beyond Solr itself.

Configure facet display options:

Configuring Word Document Thumbnails

$settings['file_private_path'] = '/var/www/html/private';
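The directory named in that setting has to exist and be writable by the web server. Here is a minimal sketch, assuming the commands run from the DDEV project root on the host, which DDEV mounts at /var/www/html inside the web container. The PRIVATE_DIR variable is introduced here only for illustration.

```shell
# Assumption: the current directory is the DDEV project root, mounted at
# /var/www/html in the web container, so ./private maps to /var/www/html/private.
PRIVATE_DIR="${PRIVATE_DIR:-./private}"

# Create the directory if it does not exist yet.
mkdir -p "$PRIVATE_DIR"

# Report whether the directory is present and writable.
if [ -d "$PRIVATE_DIR" ] && [ -w "$PRIVATE_DIR" ]; then
  echo "Private file directory ready: $PRIVATE_DIR"
else
  echo "Private file directory is missing or not writable: $PRIVATE_DIR" >&2
fi
```

Rebuild Drupal's caches afterwards so the new path is picked up.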

Drupal automatically renders radio buttons for single-value fields and checkboxes for multi-value fields.

Installing Search API and Search API Attachments

composer require 'drupal/search_api_solr'
composer require 'drupal/search_api_attachments'
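Composer only downloads the code, so the modules still need to be enabled. The following is a sketch of doing that from the command line, assuming Drush is installed in the project and run through DDEV. The machine names are assumed to match the project names, with search_api_solr_admin being the submodule shipped with Search API Solr.

```shell
# Assumed machine names for the Search API Solr and Attachments modules.
MODULES="search_api_solr search_api_solr_admin search_api_attachments"

# Enable the modules through Drush inside the DDEV web container.
# The guard only keeps the snippet harmless when ddev is not on the PATH.
if command -v ddev >/dev/null 2>&1; then
  # $MODULES is intentionally unquoted so each name becomes its own argument.
  ddev drush en $MODULES -y
else
  echo "ddev not found; from a DDEV project run: drush en $MODULES -y"
fi
```

The same result can be reached through the Extend page in the Drupal admin interface.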

Navigate to Configuration > Search and metadata > Search API Attachments settings:

Both modules require ImageMagick on the server. DDEV includes ImageMagick by default, making local development straightforward.

Module Installation

Two modules generate preview thumbnails from document files:

Configure the view with search-specific fields:

  • Search API Solr Admin: A submodule of Search API Solr that adds the ability to upload configuration sets to Solr from the Drupal interface.
  • Search API Attachments: Enables document content extraction and indexing.

Navigate to Configuration > Search and metadata > Search API and click Add server:

Setting Up Apache Solr with DDEV

Create a search server in Drupal to connect to the Solr instance.

Installing the Solr Add-on

NOTE: The host must be “solr” rather than “localhost” because Drupal communicates with Solr through the internal Docker network.

Navigate to Structure > Views > Add view:

Install the Facets (3.0.2) module and Better Exposed Filters (7.1.1):

  • Username: solr
  • Password: SolrRocks

Configuring the Solr Server

Search API Solr generates a configuration package specific to your Solr version. The Solr admin interface displays the new collection after successful upload.
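The upload can also be checked from the command line. This is a sketch, assuming the default DDEV Solr add-on credentials and the internal "solr" hostname used elsewhere in this article. It calls Solr's Collections API with the LIST action, and the response should include the collection name you configured.

```shell
# Default DDEV Solr add-on credentials from this article; change if yours differ.
SOLR_AUTH="solr:SolrRocks"
# The "solr" hostname resolves only inside the DDEV Docker network.
COLLECTIONS_URL="http://solr:8983/solr/admin/collections?action=LIST"

if command -v ddev >/dev/null 2>&1; then
  # Run curl inside the web container so the "solr" hostname resolves.
  ddev exec "curl -s -u $SOLR_AUTH '$COLLECTIONS_URL'"
else
  echo "ddev not found; inside the web container run: curl -s -u $SOLR_AUTH '$COLLECTIONS_URL'"
fi
```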

Creating the Server Connection

Install the required modules using Composer:

  1. Enter a server name (e.g., “Media Index”).
  2. Select Solr as the backend.
  3. Select Solr Cloud with Basic Auth as the Solr connector.
  4. Configure connection settings:
    • Solr host: solr (not localhost when using DDEV).
    • Solr port: 8983.
    • Solr path: /.
    • Default Solr collection: media_index.
  5. Enter the authentication credentials (solr / SolrRocks).
  6. Click Save.
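If Drupal reports that the server cannot be reached, it can help to test the same host, port, and credentials outside Drupal. This is a rough sketch, assuming the DDEV defaults listed above. It requests Solr's system info endpoint from inside the web container and prints only the HTTP status code.

```shell
# Connection values taken from the settings above; adjust if yours differ.
SOLR_AUTH="solr:SolrRocks"
SOLR_INFO_URL="http://solr:8983/solr/admin/info/system"

if command -v ddev >/dev/null 2>&1; then
  # -o /dev/null discards the body; -w prints just the HTTP status code.
  ddev exec "curl -s -o /dev/null -w '%{http_code}\n' -u $SOLR_AUTH '$SOLR_INFO_URL'"
else
  echo "ddev not found; inside the web container run: curl -s -u $SOLR_AUTH '$SOLR_INFO_URL'"
fi
```

A 200 status suggests the host, port, and credentials are correct, while a 401 usually points to a username or password mismatch.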

Indexing document content in Drupal enables users to search within PDF and Word files, making document libraries more accessible and discoverable. Search API Attachments combined with Apache Solr provides a powerful solution for extracting and indexing text from uploaded documents.

Uploading the Configuration Set

Multi-select dropdowns provide poor usability. Convert facets to checkboxes:

  1. Open the View page of your Solr server in Drupal.
  2. Click Upload Configset.
  3. Click Upload and create collection.

After creating the server, upload the Solr configuration:

composer require 'drupal/media_thumbnails_pdf'
composer require 'drupal/media_thumbnails_word'

New to Search API? Check out our Getting Started with Search API in Drupal video.

Index Configuration

Install the modules via Composer:

  1. Enter an index name (e.g., “Media Docs”).
  2. Select Media as the data source.
  3. Under Bundles, select Document.
  4. Choose your Solr server.
  5. Click Save.

Configuring Search API Attachments

When working with private files, verify that generated thumbnails are saved to the private file system. Publicly stored thumbnails could expose the first page of a private document to anonymous users.

Setting the Extraction Method

After enabling the processor, the document field becomes available in the Fields configuration.

  1. Select Solr extractor as the extraction method.
  2. Choose your search index from the dropdown.
  3. Click Save configuration.

Configuring the Private File System

To index document content, enable the File Attachments processor:

Search API Attachments requires configuration to specify the extraction method for document content.

Enabling File Attachments

Word document thumbnails require configuring the mPDF library path:

  1. Click the Processors tab on your index.
  2. Enable File attachments.
  3. Click Save.

Navigate to Configuration > Search and metadata > Search API and click Add index:

Adding Index Fields

Search API (8.x-1.40) abstracts the search backend from your search configuration. This architecture enables switching between database search, Apache Solr, or Elasticsearch without rebuilding search interfaces.

  • Search api attachments: Document: Contains extracted document text.
  • Name: The media item title.
  • Authored by: The user who uploaded the document.
  • Authored on: The upload date.

Enable processors to improve search quality:

Configuring Processors

  • Highlight: Highlights search keywords in result excerpts.
  • HTML filter: Strips HTML tags from indexed content.
  • Ignore case: Makes searches case-insensitive.

For local development, DDEV’s Solr add-on simplifies the setup process. Production deployments require hosting that supports Apache Solr or a managed Solr service.

Creating the Search View

After configuring fields and processors, click Index now to populate the index.
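Indexing can also be triggered from the command line with the Drush commands that Search API provides. This is a sketch, assuming Drush runs through DDEV. The index machine name media_docs is a hypothetical value derived from the "Media Docs" example name, so check the actual machine name on your index page.

```shell
# Hypothetical index machine name; replace with the one shown in Drupal.
INDEX_ID="media_docs"

if command -v ddev >/dev/null 2>&1; then
  # Index all items that are still pending for this index.
  ddev drush search-api:index "$INDEX_ID"
  # Show how many items are indexed out of the total.
  ddev drush search-api:status "$INDEX_ID"
else
  echo "ddev not found; run: drush search-api:index $INDEX_ID && drush search-api:status $INDEX_ID"
fi
```

Extracting text from large documents can be slow, so the first run may take noticeably longer than indexing regular content.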

Building the Views Page

composer require 'drupal/facets:^3.0'
composer require 'drupal/better_exposed_filters'
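After downloading, the modules need to be enabled. This is a sketch, assuming Drush runs through DDEV. It also assumes that facets_exposed_filters, the Facets 3.x submodule understood to expose facets as Views filters, is the piece that makes facet filters appear in the view, so verify the module list on the Extend page if the names differ in your release.

```shell
# Assumed machine names; facets_exposed_filters is assumed to ship with Facets 3.x.
FACET_MODULES="facets facets_exposed_filters better_exposed_filters"

if command -v ddev >/dev/null 2>&1; then
  # $FACET_MODULES is intentionally unquoted so each name is a separate argument.
  ddev drush en $FACET_MODULES -y
else
  echo "ddev not found; from a DDEV project run: drush en $FACET_MODULES -y"
fi
```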

Add relevance-based sorting:

  1. Click No menu under Page Settings.
  2. Select Menu tab.
  3. Enter the tab title.
  4. Set parent to Administration.
  5. Adjust the weight to control tab order.

Configuring Access Permissions

ddev add-on get ddev/ddev-solr
ddev restart

Access the Solr admin interface at the URL provided by ddev status. The default credentials are:

  1. Add Thumbnail with an appropriate image style.
  2. Add Document to display the file link.
  3. Add Search: Excerpt to show matched text with highlighting.
  4. Add Operations for edit/delete links.

In the video above, you’ll learn how to set up document thumbnails for PDFs and Word files, install and configure Search API with Search API Attachments, set up Apache Solr locally using DDEV, configure Solr’s built-in extractor for document content, and create search views with faceted filtering.

  1. Click Add in Filter criteria.
  2. Select Fulltext search.
  3. Check Expose this filter to visitors.
  4. Click Apply.

Add an exposed filter for searching:

  1. Click Add in Sort criteria.
  2. Select Relevance (descending).
  3. Click Apply.

Adding Faceted Filtering

Search API integrates with Views for building search interfaces.

Installing Facets

Apache Solr runs as a separate application from Drupal. DDEV provides an add-on for local Solr development.

Create an index to specify which content and fields to index.

Configuring Taxonomy for Facets

Edit your search view and add facet filters:

Creating the Vocabulary

  1. Navigate to Structure > Taxonomy > Add vocabulary.
  2. Create a vocabulary named “Category”.

Adding Terms

  1. Add terms to the Category vocabulary such as:
    • Drupal
    • WordPress
    • Cakes

Adding the Field to Media Type

  1. Navigate to Structure > Media types > Document > Manage fields.
  2. Add a new field referencing the Category vocabulary.

Adding the Field to the Search Index

  1. Return to your search index under Configuration > Search and metadata > Search API.
  2. Click the Fields tab and add the Category field.
  3. Click Index now to reindex content with the new field.

Adding Facet Filters to Views

The Search API Solr Admin module provides the ability to upload configuration sets directly to Solr from the Drupal interface.

  1. Click Add in Filter criteria.
  2. Select fields from the Facets category (e.g., Category).
  3. Click Apply.

Search API Solr (4.3.10) provides an Apache Solr backend for Search API with support for facets, multi-index searches, and multilingual content.

  1. Click Settings on the facet filter.
  2. Enable Transform entity ID to label.
  3. Enable Show the amount of results.
  4. Click Apply.

Converting to Checkboxes

Indexing PDF and Word documents in Drupal involves several components working together:

  1. Expand Advanced in your view.
  2. Change exposed form style to Better Exposed Filters.
  3. Click Settings.
  4. For each facet, set widget to Checkboxes/Radio Buttons.
  5. Click Apply.

NOTE: If you can’t see Upload Configset, make sure the Search API Solr Admin submodule is enabled.

Summary

Before configuring search, it helps to set up proper document thumbnails, which improve the content management experience. By default, Drupal displays generic icons for document media types.

  • Search API provides the framework for indexing and searching content.
  • Search API Solr connects Drupal to the Apache Solr search engine.
  • Search API Attachments extracts text content from document files.
  • Solr’s built-in extractor parses document content without additional server configuration.
  • Facets and Better Exposed Filters enhance the search interface with filtering options.

Search API Attachments (10.0.5) is an add-on module that enables indexing and searching of file attachments by extracting text content from documents using various methods including Apache Tika, Solr’s built-in extractor, pdftotext, or other extraction tools.
