68 Commits

Author SHA1 Message Date
Jürgen Mummert 02e5d4099b Rename composer package to mummert/meilisearch-bundle 2026-03-18 18:41:06 +01:00
Jürgen Mummert 0c18b268ae Fix composer namespace references 2026-03-09 10:50:17 +01:00
Jürgen Mummert 188de3d03f Rename vendor and namespace to mummert 2026-03-09 10:10:13 +01:00
Jürgen Mummert d874fe4274 Add resilient CDN loader for Meilisearch browser client 2026-02-24 12:58:34 +01:00
Jürgen Mummert c790a1c312 Fix browser import for Meilisearch frontend client 2026-02-24 12:52:47 +01:00
Jürgen Mummert 7e757bbb6a Fix: remove search-active body class on reset button click 2026-01-15 09:21:41 +01:00
Jürgen Mummert 9f86a5240d add files uuid 2026-01-12 11:28:22 +01:00
Jürgen Mummert 59b261c333 add files uuid 2026-01-12 11:20:02 +01:00
Jürgen Mummert 0912738528 add files uuid 2026-01-12 11:19:09 +01:00
Jürgen Mummert 8b2c6e6b92 add files uuid 2026-01-12 11:18:02 +01:00
Jürgen Mummert ca1305c9c6 add files uuid 2026-01-12 11:11:35 +01:00
Jürgen Mummert e6e3e9339a add files uuid 2026-01-12 11:07:08 +01:00
Jürgen Mummert b7a5e95c7d add files uuid 2026-01-12 10:59:36 +01:00
Jürgen Mummert bc35527f3f add files uuid 2026-01-12 10:39:09 +01:00
Jürgen Mummert e04dfb2bd4 add files uuid 2026-01-12 10:32:45 +01:00
Jürgen Mummert 6d8d0938f1 add files uuid 2026-01-12 10:26:02 +01:00
Jürgen Mummert 8f9c9cea72 add files uuid 2026-01-12 10:23:08 +01:00
Jürgen Mummert ad532e7b4c add files uuid 2026-01-12 10:19:48 +01:00
Jürgen Mummert 2257178cb6 add files uuid 2026-01-12 10:05:46 +01:00
Jürgen Mummert 2f8eddda36 add files uuid 2026-01-12 10:01:53 +01:00
Jürgen Mummert 5026f615f2 add files uuid 2026-01-12 09:54:45 +01:00
Jürgen Mummert f402e6546a add files uuid 2026-01-12 09:41:54 +01:00
Jürgen Mummert 579f58b614 add files uuid 2026-01-12 09:35:46 +01:00
Jürgen Mummert 17188537bc Debug File Indexing 2026-01-11 19:31:31 +01:00
Jürgen Mummert 3427f6b60b Tika Title encoding 2026-01-11 18:51:38 +01:00
Jürgen Mummert 6e41df002e Tika Title encoding 2026-01-11 18:49:15 +01:00
Jürgen Mummert 838f574574 Tika Title encoding 2026-01-11 18:41:16 +01:00
Jürgen Mummert 8549e4e9da Tika Title encoding 2026-01-11 18:35:00 +01:00
Jürgen Mummert 29f7920cb5 Tika Title encoding 2026-01-11 18:29:25 +01:00
Jürgen Mummert 0c637c2f92 Fix file indexing in Contao 5.6 (inject DBAL connection, add debug logs) 2026-01-11 18:19:40 +01:00
Jürgen Mummert 86b81affdc Tika Title encoding 2026-01-10 19:06:58 +01:00
Jürgen Mummert 2d3ddac945 Tika Title encoding 2026-01-10 18:57:13 +01:00
Jürgen Mummert 17da2a8434 Tika Title encoding 2026-01-10 18:31:00 +01:00
Jürgen Mummert c085911877 Tika Title encoding 2026-01-10 18:26:00 +01:00
Jürgen Mummert 40792870bd Tika Title encoding 2026-01-10 12:30:20 +01:00
Jürgen Mummert 38372539c2 Tika Title encoding 2026-01-10 12:05:15 +01:00
Jürgen Mummert 2bd52f77e0 new Twig 2026-01-09 22:04:52 +01:00
Jürgen Mummert 99ef883da5 new Twig 2026-01-09 22:01:16 +01:00
Jürgen Mummert 56d806c579 Prepare release 0.3.0 (Apache Tika integration) 2026-01-09 16:46:13 +01:00
Jürgen Mummert 2989d205d7 update Index Command 2026-01-09 16:41:23 +01:00
Jürgen Mummert 02b1657f19 update Index Command 2026-01-09 16:36:24 +01:00
Jürgen Mummert f1c864dfca add Parse Command 2026-01-09 16:28:43 +01:00
Jürgen Mummert 5cd8286286 add Parse Command 2026-01-09 16:16:44 +01:00
Jürgen Mummert 0fa0642618 add Parse Command 2026-01-09 16:01:23 +01:00
Jürgen Mummert 874ed0e656 add Parse Command 2026-01-09 15:57:12 +01:00
Jürgen Mummert b91281614b add Parse Command 2026-01-09 15:54:14 +01:00
Jürgen Mummert 4c1b4ac4b7 optimize cleaner 2026-01-09 15:36:03 +01:00
Jürgen Mummert 1f75418d9b add file indexing 2026-01-09 15:25:01 +01:00
Jürgen Mummert 278ae9f36f add tl_search_files 2026-01-09 12:14:17 +01:00
Jürgen Mummert 17ecdaec17 change IndexPage 2026-01-09 12:07:09 +01:00
Jürgen Mummert 8b22467799 Add conditional Tika URL setting 2026-01-09 11:58:52 +01:00
Jürgen Mummert cd9b918aff Add conditional Tika URL setting 2026-01-09 11:53:51 +01:00
Jürgen Mummert 8d4af1f61d Add conditional Tika URL setting 2026-01-09 11:47:35 +01:00
Jürgen Mummert f16e7a98d1 Fix duplicate Meilisearch marker injection 2026-01-09 11:03:41 +01:00
Jürgen Mummert c223ae692f add logging 2026-01-09 10:32:05 +01:00
Jürgen Mummert b4cd9199c8 add logging 2026-01-09 10:17:57 +01:00
Jürgen Mummert 6329c9e790 remove cron 2026-01-09 09:52:22 +01:00
Jürgen Mummert d2c9263755 add logging to cron 2026-01-09 09:40:08 +01:00
Jürgen Mummert e9f06f7cc9 services.yml change 2026-01-06 09:07:13 +01:00
Jürgen Mummert 6d2f4458bc add cron 2026-01-05 11:28:02 +01:00
Jürgen Mummert 9adad9ca8d add cron 2026-01-05 11:19:09 +01:00
Jürgen Mummert 356b18c8c8 add cron 2026-01-05 11:13:11 +01:00
Jürgen Mummert 7dc30c435f add cron 2026-01-05 11:05:35 +01:00
Jürgen Mummert ac001fb53c change Grace period zu 24h 2026-01-05 10:43:16 +01:00
Jürgen Mummert 6ea558bbca remove table reset 2026-01-05 10:37:21 +01:00
Jürgen Mummert cf0a84b85e add last_seen 2026-01-05 10:29:09 +01:00
Jürgen Mummert d9b8646835 Change Delete Command 2026-01-05 10:25:58 +01:00
Jürgen Mummert b684267541 Add cleanup command for stale indexed files 2026-01-05 10:21:37 +01:00
23 changed files with 932 additions and 784 deletions
+46 -5
@@ -2,6 +2,7 @@
A lean interface between **Contao CMS (4.13 / 5.6 / 5.7 ready) running on PHP 8.4** and a **self-hosted Meilisearch instance**.
The bundle extends the Contao search index with structured data and enables fast, modern full-text search.
Files are parsed by an Apache Tika instance, which must be provided externally.
---
@@ -20,13 +21,53 @@ The bundle extends the Contao search index with structured data and enables
- Compatible with:
- Contao **4.13**, **5.6** and **5.7**
- PHP **8.4**
- Developed as a **standalone Contao bundle**
---
## 📦 Installation
## ⏱️ Scheduled Indexing (Cron setup)
Installation via Composer:
The bundle provides its own commands for cleaning up files and rebuilding the Meilisearch index.
For production use, running these commands regularly via the **system crontab** is recommended.
```bash
composer require mummertmedia/contao-meilisearch-bundle:^0.1
The bundle does **not use its own Contao cron**; it relies on system cron jobs.
## Available commands
### File cleanup
```
/vendor/bin/contao-console meilisearch:files:cleanup
```
### File parsing
```
/vendor/bin/contao-console meilisearch:files:parse
```
### Meilisearch index
```
/vendor/bin/contao-console meilisearch:index
```
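Both file commands define options in their `configure()` methods: `meilisearch:files:cleanup` accepts `--grace` (seconds, default 86400) and `--dry-run`, and `meilisearch:files:parse` accepts `--limit` and `--dry-run`. A minimal sketch of a manual trial run before the cron jobs are enabled, assuming the same `/usr/bin/php8.4` binary and `/path/to/project` placeholder as the crontab example:

```bash
# Report how many entries are older than one hour without deleting anything
/usr/bin/php8.4 /path/to/project/vendor/bin/contao-console meilisearch:files:cleanup --grace=3600 --dry-run

# Print the files that would be sent to Tika, checking at most 20 rows
/usr/bin/php8.4 /path/to/project/vendor/bin/contao-console meilisearch:files:parse --limit=20 --dry-run
```

The parse command still requires a configured Tika URL in dry-run mode, because it checks that setting before it reads any rows.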
## Example crontab
```
0 5 * * * /usr/bin/php8.4 /path/to/project/vendor/bin/contao-console meilisearch:files:cleanup
1 5 * * * /usr/bin/php8.4 /path/to/project/vendor/bin/contao-console contao:crawl
10 5 * * * /usr/bin/php8.4 /path/to/project/vendor/bin/contao-console meilisearch:files:parse
20 5 * * * /usr/bin/php8.4 /path/to/project/vendor/bin/contao-console meilisearch:index
```
## Logging
```
>> var/logs/meilisearch_cron.log 2>&1
```
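The redirection above is meant to be appended to each crontab line. As an alternative, here is a minimal sketch of a wrapper that runs the four steps from the example crontab strictly one after another and appends all output to the same log file. The function name `run_meilisearch_jobs` and the `PHP_BIN` / `PROJECT_DIR` variables are assumptions for illustration and are not part of the bundle:

```bash
# Hypothetical helper, not shipped with the bundle.
run_meilisearch_jobs() {
    php_bin="${PHP_BIN:-/usr/bin/php8.4}"
    project_dir="${PROJECT_DIR:-/path/to/project}"
    console="$project_dir/vendor/bin/contao-console"
    log_file="$project_dir/var/logs/meilisearch_cron.log"

    # Same order as the example crontab; stop at the first failing step.
    for step in meilisearch:files:cleanup contao:crawl meilisearch:files:parse meilisearch:index; do
        "$php_bin" "$console" "$step" >> "$log_file" 2>&1 || return 1
    done
}
```

Saved in a script that calls `run_meilisearch_jobs` at the end and scheduled as a single crontab entry, this keeps a later step from starting while an earlier one is still running, which fixed minute offsets cannot guarantee.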
## License
MIT
+4 -8
@@ -1,5 +1,5 @@
{
"name": "mummert-media/contao-meilisearch-bundle",
"name": "mummert/meilisearch-bundle",
"description": "Contao Meilisearch integration bundle",
"type": "contao-bundle",
"license": "MIT",
@@ -8,18 +8,14 @@
"contao/core-bundle": "^4.13 || ^5.6 || ^5.7",
"contao/calendar-bundle": "^4.13 || ^5.6 || ^5.7",
"contao/news-bundle": "^4.13 || ^5.6 || ^5.7",
"meilisearch/meilisearch-php": "^1.16",
"smalot/pdfparser": "^2.12",
"phpoffice/phpword": "^1.4",
"phpoffice/phpspreadsheet": "^3.0",
"phpoffice/phppresentation": "^1.2"
"meilisearch/meilisearch-php": "^1.16"
},
"autoload": {
"psr-4": {
"MummertMedia\\ContaoMeilisearchBundle\\": "src/"
"Mummert\\ContaoMeilisearchBundle\\": "src/"
}
},
"extra": {
"contao-manager-plugin": "MummertMedia\\ContaoMeilisearchBundle\\ContaoManager\\Plugin"
"contao-manager-plugin": "Mummert\\ContaoMeilisearchBundle\\ContaoManager\\Plugin"
}
}
@@ -0,0 +1,100 @@
<?php
namespace Mummert\ContaoMeilisearchBundle\Command;
use Contao\CoreBundle\Framework\ContaoFramework;
use Doctrine\DBAL\Connection;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
class MeilisearchFilesCleanupCommand extends Command
{
public function __construct(
private readonly ContaoFramework $framework,
private readonly Connection $connection,
) {
parent::__construct();
}
protected function configure(): void
{
$this
->setName('meilisearch:files:cleanup')
->setDescription('Remove stale indexed files from tl_search_files')
->addOption(
'grace',
null,
InputOption::VALUE_OPTIONAL,
'Grace period in seconds (files newer than now-grace are kept)',
86400
)
->addOption(
'dry-run',
null,
InputOption::VALUE_NONE,
'Show how many entries would be removed without deleting them'
);
}
protected function execute(InputInterface $input, OutputInterface $output): int
{
$this->framework->initialize();
$this->log('Cleaner gestartet');
try {
$grace = max(0, (int) $input->getOption('grace'));
$dryRun = (bool) $input->getOption('dry-run');
$cutoff = time() - $grace;
if ($dryRun) {
$count = $this->connection->fetchOne(
'SELECT COUNT(*) FROM tl_search_files WHERE last_seen < ?',
[$cutoff]
);
$message = sprintf(
'[DRY-RUN] %d stale file(s) would be removed (last_seen < %s)',
$count,
date('Y-m-d H:i:s', $cutoff)
);
$output->writeln('<comment>' . $message . '</comment>');
$this->log($message);
$this->log('Cleaner stopped (dry-run)');
return Command::SUCCESS;
}
$affected = $this->connection->executeStatement(
'DELETE FROM tl_search_files WHERE last_seen < ?',
[$cutoff]
);
$message = sprintf(
'Removed %d stale file(s) (last_seen < %s)',
$affected,
date('Y-m-d H:i:s', $cutoff)
);
$output->writeln('<info>' . $message . '</info>');
$this->log($message);
$this->log('Cleaner successfully stopped');
return Command::SUCCESS;
} catch (\Throwable $e) {
$this->log('Cleaner ERROR: ' . $e->getMessage());
$output->writeln('<error>' . $e->getMessage() . '</error>');
return Command::FAILURE;
}
}
private function log(string $message): void
{
error_log(sprintf('[%s] %s', date('Y-m-d H:i:s'), $message));
}
}
@@ -0,0 +1,277 @@
<?php
namespace Mummert\ContaoMeilisearchBundle\Command;
use Contao\CoreBundle\Framework\ContaoFramework;
use Contao\Database;
use Contao\System;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\HttpClient\HttpClient;
class MeilisearchFilesParseCommand extends Command
{
public function __construct(
private readonly ContaoFramework $framework,
) {
parent::__construct();
}
protected function configure(): void
{
$this
->setName('meilisearch:files:parse')
->setDescription('Parse indexed files via Apache Tika and enrich tl_search_files')
->addOption(
'limit',
null,
InputOption::VALUE_OPTIONAL,
'Maximum number of files to check per run'
)
->addOption(
'dry-run',
null,
InputOption::VALUE_NONE,
'Do not send files to Tika'
);
}
protected function execute(InputInterface $input, OutputInterface $output): int
{
$this->framework->initialize();
$this->log('Parser gestartet');
$dryRun = (bool) $input->getOption('dry-run');
$limitOption = $input->getOption('limit');
$limit = $limitOption !== null ? max(1, (int) $limitOption) : null;
$tikaUrl = rtrim((string) ($GLOBALS['TL_CONFIG']['meilisearch_tika_url'] ?? ''), '/');
if ($tikaUrl === '') {
$output->writeln('<error>Tika URL not configured</error>');
return Command::FAILURE;
}
$db = Database::getInstance();
$sql = "SELECT * FROM tl_search_files ORDER BY tstamp ASC";
if ($limit !== null) {
$sql .= " LIMIT " . (int) $limit;
}
$files = $db->query($sql)->fetchAllAssoc();
if (!$files) {
$this->log('No files to parse');
return Command::SUCCESS;
}
$client = HttpClient::create([
'timeout' => 180,
]);
foreach ($files as $file) {
$originalUrl = (string) $file['url'];
$existingTitle = trim((string) ($file['title'] ?? ''));
$normalized = $originalUrl;
// -------------------------------------------------
// Normalize URL
// -------------------------------------------------
if (str_contains($normalized, '?')) {
$parts = parse_url($normalized);
if (!empty($parts['query'])) {
parse_str($parts['query'], $query);
if (!empty($query['file'])) {
$normalized = (string) $query['file'];
} else {
$this->log('Not a direct file url, skip', ['url' => $originalUrl]);
continue;
}
}
}
$normalized = strtok($normalized, '#');
$normalized = rawurldecode($normalized);
$normalized = ltrim($normalized, '/');
if (!str_starts_with($normalized, 'files/')) {
$this->log('Not in files/, skip', ['url' => $originalUrl]);
continue;
}
$root = defined('TL_ROOT')
? TL_ROOT
: System::getContainer()->getParameter('kernel.project_dir') . '/public';
$absolutePath = $root . '/' . $normalized;
if (!is_file($absolutePath)) {
$this->log('File missing, skip', [
'url' => $originalUrl,
'path' => $absolutePath,
]);
continue;
}
$mtime = filemtime($absolutePath) ?: 0;
$checksum = md5($normalized . '|' . $mtime);
// -------------------------------------------------
// Skip unchanged
// -------------------------------------------------
if ($file['checksum'] === $checksum && !empty($file['text'])) {
continue;
}
if ($dryRun) {
$output->writeln('[DRY-RUN] Would parse: ' . $normalized);
continue;
}
// -------------------------------------------------
// MIME-Type
// -------------------------------------------------
$ext = strtolower(pathinfo($normalized, PATHINFO_EXTENSION));
$mimeType = match ($ext) {
'pdf' => 'application/pdf',
'docx' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'xlsx' => 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'pptx' => 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
default => null,
};
if ($mimeType === null) {
$this->log('Unsupported file type, skip', ['url' => $normalized]);
continue;
}
// -------------------------------------------------
// Tika BODY (raw plain text)
// -------------------------------------------------
try {
$this->log('Parsing file', ['url' => $normalized]);
$bodyResponse = $client->request(
'PUT',
$tikaUrl . '/tika/main',
[
'headers' => [
'Accept' => 'text/plain',
'Content-Type' => $mimeType,
],
'body' => fopen($absolutePath, 'rb'),
]
);
$text = trim((string) $bodyResponse->getContent(false));
} catch (\Throwable $e) {
$this->log('Body parse failed', [
'url' => $normalized,
'error' => $e->getMessage(),
]);
continue;
}
// -------------------------------------------------
// TITLE: keep existing editor-defined title
// -------------------------------------------------
$title = $existingTitle !== '' ? $existingTitle : null;
// -------------------------------------------------
// Tika METADATA (Title) only if no existing title
// -------------------------------------------------
if ($title === null) {
try {
$metaResponse = $client->request(
'PUT',
$tikaUrl . '/meta',
[
'headers' => [
'Accept' => 'application/json',
'Content-Type' => $mimeType,
],
'body' => fopen($absolutePath, 'rb'),
]
);
$meta = json_decode($metaResponse->getContent(false), true);
$rawTitle =
$meta['dc:title'][0]
?? $meta['pdf:docinfo:title'][0]
?? null;
if ($rawTitle) {
$title = html_entity_decode(
$rawTitle,
ENT_QUOTES | ENT_HTML5,
'UTF-8'
);
}
} catch (\Throwable) {
// Metadata optional
}
}
// -------------------------------------------------
// TITLE → ASCII SAFE (only if newly generated)
// -------------------------------------------------
if ($existingTitle === '' && $title) {
$title = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $title);
$title = preg_replace('/\s+/', ' ', $title);
$title = trim($title);
}
// -------------------------------------------------
// FALLBACK: file name (only if still empty)
// -------------------------------------------------
if (!$title || strlen($title) < 5) {
$title = pathinfo($normalized, PATHINFO_FILENAME);
$title = str_replace(['_', '-'], ' ', $title);
$title = preg_replace('/\s+/', ' ', $title);
$title = trim($title);
}
// -------------------------------------------------
// Store result
// -------------------------------------------------
$db->prepare(
"UPDATE tl_search_files
SET text = ?, title = ?, checksum = ?, file_mtime = ?, tstamp = ?
WHERE id = ?"
)->execute(
$text,
$title,
$checksum,
$mtime,
time(),
$file['id']
);
$this->log('File parsed', [
'url' => $normalized,
'chars' => mb_strlen($text),
'title' => $title,
]);
}
$this->log('Parser finished');
return Command::SUCCESS;
}
private function log(string $message, array $context = []): void
{
$ctx = $context
? ' | ' . json_encode($context, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE)
: '';
error_log('[MeilisearchFilesParse] ' . $message . $ctx);
}
}
+24 -2
@@ -1,8 +1,8 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\Command;
namespace Mummert\ContaoMeilisearchBundle\Command;
use MummertMedia\ContaoMeilisearchBundle\Service\MeilisearchIndexService;
use Mummert\ContaoMeilisearchBundle\Service\MeilisearchIndexService;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
@@ -24,12 +24,34 @@ class MeilisearchIndexCommand extends Command
protected function execute(InputInterface $input, OutputInterface $output): int
{
$this->log('Meilisearch index gestartet');
$output->writeln('<info>Meilisearch index started</info>');
try {
$this->indexService->run();
$this->log('Meilisearch index successfully stopped');
$output->writeln('<info>Meilisearch index finished</info>');
return Command::SUCCESS;
} catch (\Throwable $e) {
$this->log('Meilisearch index ERROR: ' . $e->getMessage());
$output->writeln('<error>' . $e->getMessage() . '</error>');
return Command::FAILURE;
}
}
/**
* Uniform logging with timestamp
*/
private function log(string $message): void
{
error_log(sprintf(
'[%s] %s',
date('Y-m-d H:i:s'),
$message
));
}
}
+2 -2
@@ -1,6 +1,6 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\ContaoManager;
namespace Mummert\ContaoMeilisearchBundle\ContaoManager;
use Contao\CalendarBundle\ContaoCalendarBundle;
use Contao\CoreBundle\ContaoCoreBundle;
@@ -8,7 +8,7 @@ use Contao\ManagerPlugin\Bundle\BundlePluginInterface;
use Contao\ManagerPlugin\Bundle\Config\BundleConfig;
use Contao\ManagerPlugin\Bundle\Parser\ParserInterface;
use Contao\NewsBundle\ContaoNewsBundle;
use MummertMedia\ContaoMeilisearchBundle\ContaoMeilisearchBundle;
use Mummert\ContaoMeilisearchBundle\ContaoMeilisearchBundle;
class Plugin implements BundlePluginInterface
{
+1 -1
@@ -1,6 +1,6 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle;
namespace Mummert\ContaoMeilisearchBundle;
use Symfony\Component\HttpKernel\Bundle\Bundle;
@@ -1,6 +1,6 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\Controller\FrontendModule;
namespace Mummert\ContaoMeilisearchBundle\Controller\FrontendModule;
use Contao\Config;
use Contao\CoreBundle\Controller\FrontendModule\AbstractFrontendModuleController;
-22
@@ -1,22 +0,0 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\Cron;
use Contao\CoreBundle\Framework\ContaoFramework;
use MummertMedia\ContaoMeilisearchBundle\Service\MeilisearchIndexService;
class MeilisearchIndexCron
{
public function __construct(
private readonly MeilisearchIndexService $indexService,
private readonly ContaoFramework $framework,
) {}
public function __invoke(): void
{
// Initialize Contao (important!)
$this->framework->initialize();
// index once a day
$this->indexService->run();
}
}
@@ -1,6 +1,6 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\DependencyInjection;
namespace Mummert\ContaoMeilisearchBundle\DependencyInjection;
use Symfony\Component\Config\FileLocator;
use Symfony\Component\DependencyInjection\ContainerBuilder;
+28 -138
@@ -1,22 +1,20 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\EventListener;
namespace Mummert\ContaoMeilisearchBundle\EventListener;
use Contao\Config;
use MummertMedia\ContaoMeilisearchBundle\Service\PdfIndexService;
use MummertMedia\ContaoMeilisearchBundle\Service\OfficeIndexService;
use Contao\System;
use Mummert\ContaoMeilisearchBundle\Service\MeilisearchFileHelper;
class IndexPageListener
{
public function __construct(
private readonly PdfIndexService $pdfIndexService,
private readonly OfficeIndexService $officeIndexService,
) {}
private readonly MeilisearchFileHelper $fileHelper,
) {
}
private function debug(string $message, array $context = []): void
{
// Debug deliberately always active (until you remove it again)
// Keep the context short so the logs do not explode
$ctx = $context ? ' | ' . json_encode($context, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE) : '';
error_log('[ContaoMeilisearch][IndexPageListener] ' . $message . $ctx);
}
@@ -30,22 +28,6 @@ class IndexPageListener
'set_keys' => array_keys($set),
]);
/*
* =====================
* PDF: reset exactly once per crawl
* =====================
*/
try {
$this->debug('PDF resetTableOnce(): call');
$this->pdfIndexService->resetTableOnce();
$this->debug('PDF resetTableOnce(): ok');
} catch (\Throwable $e) {
$this->debug('PDF resetTableOnce(): failed', [
'error' => $e->getMessage(),
'class' => $e::class,
]);
}
/*
* =====================
* PAGE METADATA
@@ -94,8 +76,6 @@ class IndexPageListener
$parsed['page']['keywords'] ?? null,
];
$this->debug('Meta: keyword sources', ['sources' => $keywordSources]);
$keywords = [];
foreach ($keywordSources as $src) {
if (!is_string($src) || trim($src) === '') {
@@ -110,33 +90,17 @@ class IndexPageListener
$set['keywords'] = implode(' ', array_unique($keywords));
}
$this->debug('Meta: keywords result', [
'keywords' => $set['keywords'] ?? null,
]);
// IMAGEPATH (UUID)
$searchImage = $parsed['page']['searchimage'] ?? null;
$this->debug('Meta: searchimage candidate', ['searchimage' => $searchImage]);
if (!empty($searchImage)) {
// >>> HINWEIS: falls dein tl_search-Feld "image" heißt, hier auf $set['image'] ändern!
$set['imagepath'] = trim((string) $searchImage);
// IMAGEPATH
if (!empty($parsed['page']['searchimage'] ?? null)) {
$set['imagepath'] = trim((string) $parsed['page']['searchimage']);
}
// STARTDATE
$startDate =
$parsed['event']['startDate']
?? $parsed['news']['startDate']
?? null;
$this->debug('Meta: startDate candidate', ['startDate' => $startDate]);
if (is_numeric($startDate) && (int) $startDate > 0) {
$set['startDate'] = (int) $startDate;
if (is_numeric($parsed['event']['startDate'] ?? null)) {
$set['startDate'] = (int) $parsed['event']['startDate'];
}
// CHECKSUM
try {
$checksumSeed = (string) ($data['checksum'] ?? '');
$checksumSeed .= '|' . ($set['keywords'] ?? '');
$checksumSeed .= '|' . ($set['priority'] ?? '');
@@ -144,109 +108,53 @@ class IndexPageListener
$checksumSeed .= '|' . ($set['startDate'] ?? '');
$set['checksum'] = md5($checksumSeed);
$this->debug('Checksum generated', [
'seed_preview' => substr($checksumSeed, 0, 120) . (strlen($checksumSeed) > 120 ? '…' : ''),
'checksum' => $set['checksum'],
]);
} catch (\Throwable $e) {
$this->debug('Failed to generate checksum', [
'error' => $e->getMessage(),
'class' => $e::class,
]);
}
$this->debug('Meta: final set snapshot', [
'priority' => $set['priority'] ?? null,
'keywords' => $set['keywords'] ?? null,
'imagepath' => $set['imagepath'] ?? null,
'startDate' => $set['startDate'] ?? null,
'checksum' => $set['checksum'] ?? null,
]);
}
}
/*
* =====================
* FILE INDEXING (PDF / OFFICE)
* FILE DETECTION (DETECTION ONLY!)
* =====================
*/
if ((int) ($data['protected'] ?? 0) !== 0) {
$this->debug('Abort: protected page', ['protected' => $data['protected'] ?? null]);
return;
}
$indexPdfs = (bool) Config::get('meilisearch_index_pdfs');
$indexOffice = (bool) Config::get('meilisearch_index_office');
$this->debug('File indexing settings', [
'meilisearch_index_pdfs' => $indexPdfs,
'meilisearch_index_office' => $indexOffice,
]);
if (!$indexPdfs && !$indexOffice) {
$this->debug('Abort: file indexing disabled');
if (!Config::get('meilisearch_index_files')) {
return;
}
$links = $this->findAllLinks($content);
$this->debug('Links found', ['count' => count($links)]);
$pdfLinks = [];
$officeLinks = [];
$fileLinks = [];
foreach ($links as $link) {
$type = $this->detectIndexableFileType($link['url']);
if ($type === 'pdf' && $indexPdfs) {
$pdfLinks[] = $link;
continue;
}
if (in_array($type, ['docx', 'xlsx', 'pptx'], true) && $indexOffice) {
$officeLinks[] = $link;
if ($type !== null) {
$fileLinks[] = $link + ['type' => $type];
}
}
$this->debug('Indexable file links', [
'pdf' => count($pdfLinks),
'office' => count($officeLinks),
$this->debug('Indexable file links found', [
'count' => count($fileLinks),
]);
try {
if ($pdfLinks !== []) {
$this->debug('PDF handlePdfLinks(): call', ['count' => count($pdfLinks)]);
$this->pdfIndexService->handlePdfLinks($pdfLinks);
$this->debug('PDF handlePdfLinks(): ok');
if ($fileLinks) {
foreach ($fileLinks as $file) {
$this->fileHelper->collect(
$file['url'],
$file['type'],
(int) ($data['pid'] ?? 0)
);
}
if ($officeLinks !== []) {
$this->debug('Office handleOfficeLinks(): call', ['count' => count($officeLinks)]);
$this->officeIndexService->handleOfficeLinks($officeLinks);
$this->debug('Office handleOfficeLinks(): ok');
}
} catch (\Throwable $e) {
$this->debug('File indexing failed', [
'error' => $e->getMessage(),
'class' => $e::class,
]);
}
$this->debug('Hook end', [
'final_set_keys' => array_keys($set),
'final_set' => [
'priority' => $set['priority'] ?? null,
'keywords' => $set['keywords'] ?? null,
'imagepath' => $set['imagepath'] ?? null,
'startDate' => $set['startDate'] ?? null,
'checksum' => $set['checksum'] ?? null,
],
]);
}
/**
* Extracts MEILISEARCH_JSON from the HTML comment
*/
/* === Helper methods unchanged === */
private function extractMeilisearchJson(string $content): ?array
{
if (!preg_match('/<!--\s*MEILISEARCH_JSON\s*(\{.*?\})\s*-->/s', $content, $m)) {
@@ -261,9 +169,6 @@ class IndexPageListener
: null;
}
/**
* Collect all <a href="…"> links
*/
private function findAllLinks(string $content): array
{
if (!preg_match_all(
@@ -286,20 +191,11 @@ class IndexPageListener
return $result;
}
/**
* Determines the indexable file type (pdf|docx|xlsx|pptx) or null
*/
private function detectIndexableFileType(string $url): ?string
{
// Remove the fragment (hash)
$url = strtok($url, '#');
$parts = parse_url($url);
if (!$parts) {
return null;
}
// direct path (/files/…)
if (!empty($parts['path'])) {
$ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
if (in_array($ext, ['pdf', 'docx', 'xlsx', 'pptx'], true)) {
@@ -307,18 +203,12 @@ class IndexPageListener
}
}
// Query-Parameter (Contao 4 + 5)
if (!empty($parts['query'])) {
parse_str($parts['query'], $query);
foreach (['file', 'p', 'f'] as $param) {
if (!empty($query[$param])) {
$candidate = (string) $query[$param];
// decode safely (Contao 4 + 5)
$candidate = html_entity_decode($candidate, ENT_QUOTES);
$candidate = rawurldecode($candidate);
$candidate = rawurldecode(html_entity_decode((string) $query[$param], ENT_QUOTES));
$ext = strtolower(pathinfo($candidate, PATHINFO_EXTENSION));
if (in_array($ext, ['pdf', 'docx', 'xlsx', 'pptx'], true)) {
@@ -1,6 +1,6 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\EventListener;
namespace Mummert\ContaoMeilisearchBundle\EventListener;
use Contao\CalendarEventsModel;
use Contao\Config;
@@ -16,6 +16,13 @@ class MeilisearchPageMarkerListener
return $buffer;
}
// ⛔ Marker already present → nothing more to do
if (str_contains($buffer, '⟦MEILISEARCH_META⟧')
|| str_contains($buffer, 'MEILISEARCH_JSON')
) {
return $buffer;
}
$data = [];
/*
+10 -10
@@ -2,24 +2,24 @@ services:
# Alias MUST be present (placed correctly)
Psr\Container\ContainerInterface: '@service_container'
MummertMedia\ContaoMeilisearchBundle\:
resource: '../../{Command,Cron,EventListener,Service}'
Mummert\ContaoMeilisearchBundle\:
resource: '../../{Command,EventListener,Service}'
autowire: true
autoconfigure: true
MummertMedia\ContaoMeilisearchBundle\EventListener\IndexPageListener:
Mummert\ContaoMeilisearchBundle\EventListener\MeilisearchPageMarkerListener:
autowire: true
autoconfigure: false
tags:
- { name: contao.hook, hook: outputFrontendTemplate, method: onOutputFrontendTemplate }
Mummert\ContaoMeilisearchBundle\EventListener\IndexPageListener:
autowire: true
autoconfigure: false
tags:
- { name: contao.hook, hook: indexPage, method: onIndexPage }
MummertMedia\ContaoMeilisearchBundle\Cron\MeilisearchIndexCron:
autowire: true
autoconfigure: false
tags:
- { name: contao.cron, interval: daily, method: __invoke }
MummertMedia\ContaoMeilisearchBundle\Controller\FrontendModule\MeilisearchSearchController:
Mummert\ContaoMeilisearchBundle\Controller\FrontendModule\MeilisearchSearchController:
autowire: true
autoconfigure: false
tags:
+1 -1
@@ -1,6 +1,6 @@
<?php
use MummertMedia\ContaoMeilisearchBundle\EventListener\MeilisearchPageMarkerListener;
use Mummert\ContaoMeilisearchBundle\EventListener\MeilisearchPageMarkerListener;
$GLOBALS['TL_HOOKS']['outputFrontendTemplate'][] = [
@@ -2,16 +2,18 @@
use Contao\DC_Table;
$GLOBALS['TL_DCA']['tl_search_pdf'] = [
$GLOBALS['TL_DCA']['tl_search_files'] = [
'config' => [
'dataContainer' => DC_Table::class,
'sql' => [
'keys' => [
'id' => 'primary',
'checksum' => 'unique',
'page_id' => 'index',
'url' => 'index',
'type' => 'index', // ⬅️ NEW
'url' => 'unique',
'type' => 'index',
'checksum' => 'index',
'uuid' => 'index',
'last_seen' => 'index',
],
],
],
@@ -25,10 +27,18 @@ $GLOBALS['TL_DCA']['tl_search_pdf'] = [
'sql' => "int(10) unsigned NOT NULL default 0",
],
/*
* Time at which the file was last seen during a crawl
* Basis for cleanup
*/
'last_seen' => [ // ⬅️ NEW
'sql' => "int(10) unsigned NOT NULL default 0",
],
/*
* File type: pdf | docx | xlsx | pptx
*/
'type' => [ // ⬅️ NEW
'type' => [
'sql' => "varchar(16) NOT NULL default 'pdf'",
],
@@ -54,6 +64,10 @@ $GLOBALS['TL_DCA']['tl_search_pdf'] = [
'sql' => "mediumtext NULL",
],
'uuid' => [
'sql' => "binary(16) NULL",
],
/*
* md5(url + filemtime)
* detects changes reliably
@@ -64,7 +78,7 @@ $GLOBALS['TL_DCA']['tl_search_pdf'] = [
/*
* Source page (tl_page.id)
* Cleanup / reference
* optional, debug / reference
*/
'page_id' => [
'sql' => "int(10) unsigned NOT NULL default 0",
+45 -22
@@ -4,8 +4,11 @@ use Contao\CoreBundle\DataContainer\PaletteManipulator;
use Contao\System;
/**
* -------------------------------------------------
* Fields
* -------------------------------------------------
*/
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_host'] = [
'inputType' => 'text',
'eval' => [
@@ -59,17 +62,9 @@ $GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_imagesize'] = [
'chosen' => true,
'includeBlankOption' => true,
],
// 🔥 THIS WAS MISSING
'sql' => "int(10) unsigned NOT NULL default 0",
];
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_index_past_events'] = [
'inputType' => 'checkbox',
'eval' => [
'tl_class' => 'w50 clr',
],
];
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_fallback_image'] = [
'inputType' => 'fileTree',
'eval' => [
@@ -80,25 +75,54 @@ $GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_fallback_image'] = [
'sql' => "varbinary(16) NULL",
];
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_index_pdfs'] = [
'label' => &$GLOBALS['TL_LANG']['tl_settings']['meilisearch_index_pdfs'],
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_index_past_events'] = [
'inputType' => 'checkbox',
'eval' => [
'tl_class' => 'w50',
'tl_class' => 'w50 clr',
],
'sql' => "char(1) NOT NULL default '1'",
];
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_index_office'] = [
'label' => &$GLOBALS['TL_LANG']['tl_settings']['meilisearch_index_office'],
'inputType' => 'checkbox',
'eval' => ['tl_class' => 'w50'],
'sql' => "char(1) NOT NULL default '0'",
];
/**
* Palette
* -------------------------------------------------
* File indexing (Tika)
* -------------------------------------------------
*/
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_index_files'] = [
'inputType' => 'checkbox',
'eval' => [
'tl_class' => 'w50',
'submitOnChange' => true,
],
'sql' => "char(1) NOT NULL default '0'",
];
$GLOBALS['TL_DCA']['tl_settings']['fields']['meilisearch_tika_url'] = [
'inputType' => 'text',
'eval' => [
'rgxp' => 'url',
'mandatory' => true,
'tl_class' => 'w50 clr',
],
];
/**
* -------------------------------------------------
* Selector / Subpalette
* -------------------------------------------------
*/
$GLOBALS['TL_DCA']['tl_settings']['palettes']['__selector__'][] = 'meilisearch_index_files';
$GLOBALS['TL_DCA']['tl_settings']['subpalettes']['meilisearch_index_files']
= 'meilisearch_tika_url';
/**
* -------------------------------------------------
* Palette
* -------------------------------------------------
*/
PaletteManipulator::create()
->addLegend('meilisearch_legend', null, PaletteManipulator::POSITION_AFTER, true)
->addField('meilisearch_host', 'meilisearch_legend')
@@ -108,6 +132,5 @@ PaletteManipulator::create()
->addField('meilisearch_imagesize', 'meilisearch_legend')
->addField('meilisearch_fallback_image', 'meilisearch_legend')
->addField('meilisearch_index_past_events', 'meilisearch_legend')
->addField('meilisearch_index_pdfs', 'meilisearch_legend')
->addField('meilisearch_index_office', 'meilisearch_legend')
->addField('meilisearch_index_files', 'meilisearch_legend')
->applyToPalette('default', 'tl_settings');
@@ -28,10 +28,10 @@ $GLOBALS['TL_LANG']['tl_settings']['meilisearch_index_past_events'][0]
$GLOBALS['TL_LANG']['tl_settings']['meilisearch_index_past_events'][1]
= 'Vergangene Kalender-Events werden ebenfalls in Meilisearch indexiert.';
$GLOBALS['TL_LANG']['tl_settings']['meilisearch_index_pdfs'] = [
'PDFs indexieren',
'Aktiviert die Indexierung von PDF-Dateien für die Suche.',
$GLOBALS['TL_LANG']['tl_settings']['meilisearch_index_files'] = [
'Dateien indexieren',
'Aktiviert die Indexierung von PDF-Dateien sowie DOCX, XLSX und PPTX.',
];
$GLOBALS['TL_LANG']['tl_settings']['meilisearch_index_office']
= ['Office-Dateien indexieren', 'DOCX, XLSX und PPTX in die Suche aufnehmen.'];
$GLOBALS['TL_LANG']['tl_settings']['meilisearch_tika_url']
= ['Apache Tika URL', 'URL der Apache Tika Instanz (z. B. https://tika.domain.tld).'];
@@ -4,6 +4,7 @@ Contao 5 Frontend Module Template
#}
<!-- indexer::stop -->
{% block meilisearch %}
<div
id="topsearch"
class="meilisearch-search"
@@ -59,10 +60,44 @@ Contao 5 Frontend Module Template
</div>
</div>
<script type="module">
import MeiliSearch from 'https://cdn.jsdelivr.net/npm/meilisearch@latest/dist/bundles/meilisearch.esm.js';
<script>
(function () {
const CDN_URLS = [
'https://cdn.jsdelivr.net/npm/meilisearch@0.39.0/dist/bundles/meilisearch.umd.min.js',
'https://unpkg.com/meilisearch@0.39.0/dist/bundles/meilisearch.umd.min.js'
];
document.addEventListener('DOMContentLoaded', () => {
function loadClient(urls, onDone) {
if (typeof MeiliSearch !== 'undefined') {
onDone(true, null);
return;
}
if (!urls.length) {
onDone(false, 'Alle CDN-Quellen fehlgeschlagen (mögliche CSP-Blockierung von script-src).');
return;
}
const url = urls.shift();
const script = document.createElement('script');
script.src = url;
script.async = true;
script.crossOrigin = 'anonymous';
script.onload = () => {
if (typeof MeiliSearch !== 'undefined') {
onDone(true, null);
} else {
loadClient(urls, onDone);
}
};
script.onerror = () => loadClient(urls, onDone);
document.head.appendChild(script);
}
function initSearch() {
const wrapper = document.querySelector('.meilisearch-search');
if (!wrapper) return;
@@ -95,6 +130,10 @@ Contao 5 Frontend Module Template
input.value = '';
results.innerHTML = '';
clear.classList.add('is-hidden');
// Important: leave search mode
document.body.classList.remove('search-active');
input.focus();
});
@@ -216,6 +255,19 @@ Contao 5 Frontend Module Template
results.appendChild(node);
}
}
}
document.addEventListener('DOMContentLoaded', () => {
loadClient([...CDN_URLS], (ok, reason) => {
if (!ok) {
console.error('[Meilisearch] Browser client konnte nicht geladen werden. ' + reason);
return;
}
initSearch();
});
});
})();
</script>
{% endblock %}
<!-- indexer::continue -->
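
The loader in this template tries each CDN URL in order and calls `initSearch()` only once a global `MeiliSearch` exists. The same sequential-fallback pattern can be sketched in isolation. `loadWithFallback` and the injected `tryLoad` function are hypothetical names used for illustration and are not part of the bundle; the URLs are placeholders.

```javascript
// Hypothetical sketch of the sequential-fallback pattern; not the bundle's actual code.
// tryLoad(url, cb) is an injected loader that calls cb(true) on success and cb(false) on failure.
function loadWithFallback(urls, tryLoad, onDone) {
  const queue = [...urls]; // copy so the caller's array is not mutated

  function next() {
    if (queue.length === 0) {
      onDone(false, 'all sources failed');
      return;
    }
    const url = queue.shift();
    tryLoad(url, (ok) => {
      if (ok) {
        onDone(true, url);
      } else {
        next(); // fall through to the next source
      }
    });
  }

  next();
}

// Usage with a fake loader that only "succeeds" for the second URL.
const attempts = [];
const fakeLoad = (url, cb) => {
  attempts.push(url);
  cb(url.includes('unpkg'));
};

let outcome = null;
loadWithFallback(
  ['https://cdn.example/a.js', 'https://unpkg.example/a.js'],
  fakeLoad,
  (ok, detail) => { outcome = { ok, detail }; }
);
console.log(outcome, attempts);
```

Injecting the loader keeps the fallback logic testable without a DOM; in the template the equivalent of `tryLoad` is the `<script>` element with its `onload` and `onerror` handlers.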
+259
@@ -0,0 +1,259 @@
<?php
namespace Mummert\ContaoMeilisearchBundle\Service;
use Contao\FilesModel;
use Contao\StringUtil;
use Contao\System;
use Doctrine\DBAL\Connection;
class MeilisearchFileHelper
{
public function __construct(
private readonly Connection $connection,
) {
}
/**
* Central file processing
*/
public function collect(string $url, string $type, int $pageId): void
{
$this->log('collect() start', [
'url' => $url,
'type' => $type,
'pageId' => $pageId,
]);
// -------------------------------------------------
// 1. Normalize URL
// -------------------------------------------------
$cleanUrl = strtok($url, '#');
$parts = parse_url($cleanUrl);
if (!$parts) {
$this->log('Invalid URL, skip');
return;
}
// -------------------------------------------------
// 2. External file? → skip
// -------------------------------------------------
if (!empty($parts['host'])) {
$currentRequest = System::getContainer()
->get('request_stack')
->getCurrentRequest();
$pageHost = $currentRequest
? parse_url($currentRequest->getSchemeAndHttpHost(), PHP_URL_HOST)
: null;
if ($pageHost && $parts['host'] !== $pageHost) {
$this->log('External file detected, skip', [
'host' => $parts['host'],
]);
return;
}
}
// -------------------------------------------------
// 3. Collect path candidates (no assumptions!)
// -------------------------------------------------
$query = [];
if (!empty($parts['query'])) {
parse_str($parts['query'], $query);
}
$pathCandidates = [];
// direct path
if (!empty($parts['path'])) {
$pathCandidates[] = $parts['path'];
}
// download parameters
foreach (['file', 'f', 'p'] as $param) {
if (!empty($query[$param])) {
$pathCandidates[] = $query[$param];
}
}
// normalize
$pathCandidates = array_values(array_unique(array_filter(array_map(
static function ($candidate) {
$candidate = rawurldecode(html_entity_decode((string) $candidate, ENT_QUOTES));
return ltrim($candidate, '/') ?: null;
},
$pathCandidates
))));
$this->log('Path candidates (normalized)', [
'candidates' => $pathCandidates,
]);
// -------------------------------------------------
// 4. Resolve FilesModel (DBAFS) → UUID
// -------------------------------------------------
$fileModel = null;
foreach ($pathCandidates as $candidate) {
// 1) direct
$model = FilesModel::findByPath($candidate);
if ($model && $model->uuid) {
$fileModel = $model;
$this->log('Resolved via FilesModel (direct)', [
'candidate' => $candidate,
'path' => $model->path,
]);
break;
}
// 2) fallback: prepend files/
if (!str_starts_with($candidate, 'files/')) {
$model = FilesModel::findByPath('files/' . $candidate);
if ($model && $model->uuid) {
$fileModel = $model;
$this->log('Resolved via FilesModel (files/ prefix)', [
'candidate' => $candidate,
'path' => $model->path,
]);
break;
}
}
}
if (!$fileModel) {
$this->log('No Contao file model found, skip', [
'candidates' => $pathCandidates,
]);
return;
}
$normalizedPath = (string) $fileModel->path;
$uuidBin = $fileModel->uuid;
$uuid = StringUtil::binToUuid($uuidBin);
$canonicalUrl = '/' . ltrim($normalizedPath, '/');
$this->log('UUID resolved', [
'path' => $canonicalUrl,
'uuid' => $uuid,
]);
// -------------------------------------------------
// 5. Check that the file exists on the filesystem
// -------------------------------------------------
$projectDir = System::getContainer()->getParameter('kernel.project_dir');
$abs = $projectDir . '/public/' . $normalizedPath;
if (!is_file($abs)) {
$this->log('Resolved model but file missing on filesystem, skip', [
'path' => $normalizedPath,
'abs' => $abs,
]);
return;
}
// -------------------------------------------------
// 6. Editorial title from tl_files.meta
// -------------------------------------------------
$title = null;
$meta = StringUtil::deserialize($fileModel->meta, true);
// 1) preferred language (if available)
$lang = $GLOBALS['TL_LANGUAGE'] ?? null;
if ($lang && !empty($meta[$lang]['title'])) {
$title = trim((string) $meta[$lang]['title']);
}
// 2) fallback: first available language
if ($title === null && is_array($meta)) {
foreach ($meta as $langKey => $langMeta) {
if (!empty($langMeta['title'])) {
$title = trim((string) $langMeta['title']);
break;
}
}
}
if ($title) {
$this->log('Title resolved from tl_files', [
'title' => $title,
]);
}
// -------------------------------------------------
// 7. File info
// -------------------------------------------------
$mtime = filemtime($abs) ?: 0;
$checksum = md5($normalizedPath . '|' . $mtime);
$now = time();
// -------------------------------------------------
// 8. Upsert by UUID
// -------------------------------------------------
$existing = $this->connection->fetchAssociative(
'SELECT id FROM tl_search_files WHERE uuid = ?',
[$uuidBin]
);
if ($existing) {
$data = [
'tstamp' => $now,
'last_seen' => $now,
'type' => $type,
'url' => $canonicalUrl,
'page_id' => $pageId,
'file_mtime' => $mtime,
'checksum' => $checksum,
];
if ($title !== null) {
$data['title'] = $title;
}
$this->connection->update(
'tl_search_files',
$data,
['id' => $existing['id']]
);
$this->log('File updated by UUID', [
'uuid' => $uuid,
]);
} else {
$this->connection->insert(
'tl_search_files',
[
'tstamp' => $now,
'last_seen' => $now,
'type' => $type,
'url' => $canonicalUrl,
'title' => $title ?? basename($normalizedPath),
'page_id' => $pageId,
'file_mtime' => $mtime,
'checksum' => $checksum,
'uuid' => $uuidBin,
]
);
$this->log('File inserted by UUID', [
'uuid' => $uuid,
]);
}
$this->log('collect() end');
}
// -------------------------------------------------
// Logging
// -------------------------------------------------
private function log(string $message, array $context = []): void
{
$ctx = $context
? ' | ' . json_encode($context, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE)
: '';
error_log('[ContaoMeilisearch][MeilisearchFileHelper] ' . $message . $ctx);
}
}
+1 -1
@@ -1,6 +1,6 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\Service;
namespace Mummert\ContaoMeilisearchBundle\Service;
use Contao\Config;
use Contao\CoreBundle\Framework\ContaoFramework;
+16 -30
@@ -1,6 +1,6 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\Service;
namespace Mummert\ContaoMeilisearchBundle\Service;
use Contao\Config;
use Contao\CoreBundle\Framework\ContaoFramework;
@@ -72,7 +72,7 @@ class MeilisearchIndexService
}
$this->indexTlSearch($index);
$this->indexTlSearchPdf($index);
$this->indexTlSearchFiles($index);
}
private function ensureIndexSettings(Indexes $index): void
@@ -80,6 +80,7 @@ class MeilisearchIndexService
$index->updateSettings([
'searchableAttributes' => ['title', 'keywords', 'text'],
'sortableAttributes' => ['priority', 'startDate'],
'filterableAttributes' => ['type', 'filetype'],
]);
}
@@ -132,7 +133,7 @@ class MeilisearchIndexService
}
/**
* Index tl_search
* Index tl_search (pages / news / events)
*/
private function indexTlSearch(Indexes $index): void
{
@@ -164,13 +165,11 @@ class MeilisearchIndexService
}
}
$cleanText = $this->stripMeilisearchMeta((string) $row['text']);
$doc = [
'id' => $type . '_' . $row['id'],
'type' => $type,
'title' => $row['title'],
'text' => $cleanText,
'text' => $this->stripMeilisearchMeta((string) $row['text']),
'url' => $row['url'],
'protected' => (bool) $row['protected'],
'checksum' => $row['checksum'],
@@ -192,31 +191,24 @@ class MeilisearchIndexService
$documents[] = $doc;
} catch (\Throwable $e) {
error_log(
'[ContaoMeilisearch] Failed to build document for tl_search ID '
. ($row['id'] ?? '?') . ': ' . $e->getMessage()
);
error_log('[ContaoMeilisearch] Failed to build tl_search document: ' . $e->getMessage());
}
}
if ($documents !== []) {
try {
$index->addDocuments($documents);
} catch (\Throwable $e) {
error_log('[ContaoMeilisearch] Failed to add tl_search documents: ' . $e->getMessage());
}
}
}
/**
* Index tl_search_pdf
* Index tl_search_files (PDF / Office)
*/
private function indexTlSearchPdf(Indexes $index): void
private function indexTlSearchFiles(Indexes $index): void
{
try {
$rows = $this->connection->fetchAllAssociative('SELECT * FROM tl_search_pdf');
$rows = $this->connection->fetchAllAssociative('SELECT * FROM tl_search_files');
} catch (\Throwable $e) {
error_log('[ContaoMeilisearch] Failed to read tl_search_pdf: ' . $e->getMessage());
error_log('[ContaoMeilisearch] Failed to read tl_search_files: ' . $e->getMessage());
return;
}
@@ -233,10 +225,11 @@ class MeilisearchIndexService
: 'pdf';
$documents[] = [
'id' => $fileType . '_' . $row['id'],
'type' => $fileType,
'title' => $row['title'],
'text' => $this->stripMeilisearchMeta((string) $row['text']),
'id' => 'file_' . $row['id'],
'type' => 'file',
'filetype' => $fileType,
'title' => $row['title'] ?: basename($row['url']),
'text' => (string) $row['text'],
'url' => $row['url'],
'checksum' => $row['checksum'],
'poster' => self::FILETYPE_ICON_MAP[$fileType]
@@ -244,19 +237,12 @@ class MeilisearchIndexService
];
} catch (\Throwable $e) {
error_log(
'[ContaoMeilisearch] Failed to build PDF document for ID '
. ($row['id'] ?? '?') . ': ' . $e->getMessage()
);
error_log('[ContaoMeilisearch] Failed to build file document: ' . $e->getMessage());
}
}
if ($documents !== []) {
try {
$index->addDocuments($documents);
} catch (\Throwable $e) {
error_log('[ContaoMeilisearch] Failed to add tl_search_pdf documents: ' . $e->getMessage());
}
}
}
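
With `type` and `filetype` declared as `filterableAttributes`, a frontend could restrict results to file documents of one kind. The sketch below only builds the filter expression. `buildFilter` is a hypothetical helper that is not part of the bundle, and passing its result as the `filter` search parameter is an assumption about how a caller might use it.

```javascript
// Hypothetical helper: turns { attribute: value } pairs into a Meilisearch filter expression.
function buildFilter(criteria) {
  return Object.entries(criteria)
    // Skip criteria that were left unset.
    .filter(([, value]) => value !== undefined && value !== null && value !== '')
    .map(([attr, value]) => {
      // Escape backslashes and double quotes so the value stays one quoted string.
      const escaped = String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
      return `${attr} = "${escaped}"`;
    })
    .join(' AND ');
}

// Restrict to indexed files that are PDFs.
const pdfOnly = buildFilter({ type: 'file', filetype: 'pdf' });
console.log(pdfOnly);
```

Filtering on `type` separately from `filetype` matches the new document shape in this change, where every file document has `type` set to `file` and the concrete format is stored in `filetype`.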
-273
@@ -1,273 +0,0 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\Service;
use Contao\Database;
use PhpOffice\PhpWord\IOFactory as WordIOFactory;
use PhpOffice\PhpSpreadsheet\IOFactory as SpreadsheetIOFactory;
use PhpOffice\PhpPresentation\IOFactory as PresentationIOFactory;
use Symfony\Component\DependencyInjection\ParameterBag\ParameterBagInterface;
class OfficeIndexService
{
private string $projectDir;
// per crawl run: avoid duplicate processing
private array $seenThisCrawl = [];
public function __construct(ParameterBagInterface $params)
{
$this->projectDir = rtrim((string) $params->get('kernel.project_dir'), '/');
}
/**
* @param array<int,array{url:string,linkText:?string}> $officeLinks
*/
public function handleOfficeLinks(array $officeLinks): void
{
foreach ($officeLinks as $row) {
$url = (string) ($row['url'] ?? '');
$linkText = $row['linkText'] ?? null;
if ($url === '') {
continue;
}
try {
// within a crawl, do not parse the same URL more than once
$seenKey = md5($url);
if (isset($this->seenThisCrawl[$seenKey])) {
continue;
}
$this->seenThisCrawl[$seenKey] = true;
$normalized = $this->normalizeOfficeUrl($url);
if ($normalized === null) {
continue;
}
[$relativePath, $type] = $normalized;
$absolutePath = $this->getAbsolutePath($relativePath);
if (!is_file($absolutePath)) {
continue;
}
$mtime = (int) (filemtime($absolutePath) ?: 0);
$checksum = md5($relativePath . '|' . $mtime);
$title = $linkText ?: basename($absolutePath);
$text = $this->parseOfficeFile($absolutePath, $type);
if ($text === '') {
continue;
}
$this->upsertOffice(
$relativePath,
$title,
$text,
$checksum,
$mtime,
$type
);
} catch (\Throwable $e) {
error_log(
'[ContaoMeilisearch] Office indexing failed for "' . $url . '": ' . $e->getMessage()
);
}
}
}
/**
* @return array{string,string}|null [relativePath, type]
*/
private function normalizeOfficeUrl(string $url): ?array
{
$decoded = html_entity_decode($url);
$parts = parse_url($decoded);
// 1) files/... (without leading slash)
if (!empty($parts['path']) && str_starts_with($parts['path'], 'files/')) {
$ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
if (in_array($ext, ['docx', 'xlsx', 'pptx'], true)) {
return ['/' . $parts['path'], $ext];
}
}
// 2) /files/...
if (!empty($parts['path']) && str_starts_with($parts['path'], '/files/')) {
$ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
if (in_array($ext, ['docx', 'xlsx', 'pptx'], true)) {
return [$parts['path'], $ext];
}
}
if (empty($parts['query'])) {
return null;
}
parse_str($parts['query'], $query);
// 3) Contao 4: ?file=files/...
if (!empty($query['file'])) {
$file = urldecode((string) $query['file']);
$file = ltrim($file, '/');
$ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
if (
str_starts_with($file, 'files/')
&& in_array($ext, ['docx', 'xlsx', 'pptx'], true)
) {
return ['/' . $file, $ext];
}
}
// 4) Contao 5: ?p=...
if (!empty($query['p'])) {
$p = urldecode((string) $query['p']);
$ext = strtolower(pathinfo($p, PATHINFO_EXTENSION));
if (in_array($ext, ['docx', 'xlsx', 'pptx'], true)) {
return ['/files/' . ltrim($p, '/'), $ext];
}
}
return null;
}
private function getAbsolutePath(string $relativePath): string
{
return $this->projectDir . '/' . ltrim($relativePath, '/');
}
private function upsertOffice(
string $url,
string $title,
string $text,
string $checksum,
int $mtime,
string $type
): void {
try {
Database::getInstance()
->prepare('
INSERT INTO tl_search_pdf
(tstamp, type, url, title, text, checksum, file_mtime)
VALUES
(?, ?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
tstamp=VALUES(tstamp),
type=VALUES(type),
url=VALUES(url),
title=VALUES(title),
text=VALUES(text),
file_mtime=VALUES(file_mtime)
')
->execute(
time(),
$type,
$url,
$title,
$text,
$checksum,
$mtime
);
} catch (\Throwable $e) {
error_log(
'[ContaoMeilisearch] Failed to write Office index entry (' . $url . '): ' . $e->getMessage()
);
}
}
private function parseOfficeFile(string $absolutePath, string $type): string
{
return match ($type) {
'docx' => $this->parseDocx($absolutePath),
'xlsx' => $this->parseXlsx($absolutePath),
'pptx' => $this->parsePptx($absolutePath),
default => '',
};
}
private function parseDocx(string $absolutePath): string
{
try {
$phpWord = WordIOFactory::load($absolutePath);
$text = '';
foreach ($phpWord->getSections() as $section) {
foreach ($section->getElements() as $element) {
if (method_exists($element, 'getText')) {
$text .= ' ' . $element->getText();
}
}
}
return $this->cleanText($text);
} catch (\Throwable $e) {
error_log(
'[ContaoMeilisearch] Failed to parse DOCX "' . $absolutePath . '": ' . $e->getMessage()
);
return '';
}
}
private function parseXlsx(string $absolutePath): string
{
try {
$spreadsheet = SpreadsheetIOFactory::load($absolutePath);
$text = '';
foreach ($spreadsheet->getAllSheets() as $sheet) {
foreach ($sheet->toArray() as $row) {
$text .= ' ' . implode(' ', array_filter($row, 'is_scalar'));
}
}
return $this->cleanText($text);
} catch (\Throwable $e) {
error_log(
'[ContaoMeilisearch] Failed to parse XLSX "' . $absolutePath . '": ' . $e->getMessage()
);
return '';
}
}
private function parsePptx(string $absolutePath): string
{
try {
$presentation = PresentationIOFactory::load($absolutePath);
$text = '';
foreach ($presentation->getAllSlides() as $slide) {
foreach ($slide->getShapeCollection() as $shape) {
if (method_exists($shape, 'getPlainText')) {
$text .= ' ' . $shape->getPlainText();
}
}
}
return $this->cleanText($text);
} catch (\Throwable $e) {
error_log(
'[ContaoMeilisearch] Failed to parse PPTX "' . $absolutePath . '": ' . $e->getMessage()
);
return '';
}
}
private function cleanText(string $text): string
{
if (class_exists(\Normalizer::class)) {
$text = \Normalizer::normalize($text, \Normalizer::FORM_C) ?? $text;
}
$text = str_replace(["\r\n", "\r"], "\n", $text);
$text = preg_replace('/[^\p{L}\p{N}\p{P}\p{Z}\n]/u', ' ', $text);
$text = preg_replace('/\s+/u', ' ', $text);
return trim(mb_substr($text, 0, 20000));
}
}
-224
@@ -1,224 +0,0 @@
<?php
namespace MummertMedia\ContaoMeilisearchBundle\Service;
use Contao\Database;
use Smalot\PdfParser\Parser;
use Symfony\Component\DependencyInjection\ParameterBag\ParameterBagInterface;
class PdfIndexService
{
private string $projectDir;
private bool $didReset = false;
private array $seenThisCrawl = [];
public function __construct(ParameterBagInterface $params)
{
$this->projectDir = rtrim((string) $params->get('kernel.project_dir'), '/');
}
/**
* Called from the listener on the first hook call per crawl.
*/
public function resetTableOnce(): void
{
if ($this->didReset) {
return;
}
$this->didReset = true;
$this->seenThisCrawl = [];
Database::getInstance()->execute('TRUNCATE tl_search_pdf');
}
/**
* @param array<int,array{url:string,linkText:?string}> $pdfLinks
*/
public function handlePdfLinks(array $pdfLinks): void
{
foreach ($pdfLinks as $row) {
$url = (string) ($row['url'] ?? '');
$linkText = $row['linkText'] ?? null;
if ($url === '') {
continue;
}
// avoid duplicate URLs within a crawl
$seenKey = md5($url);
if (isset($this->seenThisCrawl[$seenKey])) {
continue;
}
$this->seenThisCrawl[$seenKey] = true;
$normalizedPath = $this->normalizePdfUrl($url);
if ($normalizedPath === null) {
continue;
}
$absolutePath = $this->getAbsolutePath($normalizedPath);
if (!is_file($absolutePath)) {
continue;
}
$mtime = (int) (filemtime($absolutePath) ?: 0);
$checksum = md5($normalizedPath . '|' . $mtime);
// Title priority:
// 1) link text
// 2) PDF metadata
// 3) file name
$pdfMetaTitle = $this->readPdfMetaTitle($absolutePath);
$title = $linkText ?: ($pdfMetaTitle ?: basename($absolutePath));
$text = $this->parsePdf($absolutePath);
if ($text === '') {
continue;
}
$this->upsertPdf(
$normalizedPath,
$title,
$text,
$checksum,
$mtime
);
}
}
private function normalizePdfUrl(string $url): ?string
{
$decoded = html_entity_decode($url);
$parts = parse_url($decoded);
// 1) files/...pdf (without leading slash)
if (
!empty($parts['path'])
&& str_starts_with($parts['path'], 'files/')
&& str_ends_with(strtolower($parts['path']), '.pdf')
) {
return '/' . $parts['path'];
}
// 2) /files/...pdf
if (
!empty($parts['path'])
&& str_starts_with($parts['path'], '/files/')
&& str_ends_with(strtolower($parts['path']), '.pdf')
) {
return $parts['path'];
}
if (empty($parts['query'])) {
return null;
}
parse_str($parts['query'], $query);
// 3) Contao 4: ?file=files/...
if (!empty($query['file'])) {
$file = urldecode((string) $query['file']);
$file = ltrim($file, '/');
if (
str_starts_with($file, 'files/')
&& str_ends_with(strtolower($file), '.pdf')
) {
return '/' . $file;
}
}
// 4) Contao 5: ?p=...
if (!empty($query['p'])) {
$p = urldecode((string) $query['p']);
return '/files/' . ltrim($p, '/');
}
return null;
}
private function getAbsolutePath(string $relativePath): string
{
return $this->projectDir . '/' . ltrim($relativePath, '/');
}
private function upsertPdf(
string $url,
string $title,
string $text,
string $checksum,
int $mtime
): void {
Database::getInstance()
->prepare('
INSERT INTO tl_search_pdf
(tstamp, url, title, text, checksum, file_mtime)
VALUES
(?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
tstamp=VALUES(tstamp),
url=VALUES(url),
title=VALUES(title),
text=VALUES(text),
file_mtime=VALUES(file_mtime)
')
->execute(
time(),
$url,
$title,
$text,
$checksum,
$mtime
);
}
private function parsePdf(string $absolutePath): string
{
try {
$parser = new Parser();
$pdf = $parser->parseFile($absolutePath);
$text = $this->cleanPdfContent($pdf->getText());
return mb_substr($text, 0, 20000);
} catch (\Throwable) {
return '';
}
}
private function readPdfMetaTitle(string $absolutePath): ?string
{
try {
$parser = new Parser();
$pdf = $parser->parseFile($absolutePath);
$details = $pdf->getDetails();
foreach (['Title', 'title'] as $key) {
if (!empty($details[$key]) && is_string($details[$key])) {
$t = trim($details[$key]);
if ($t !== '') {
return $t;
}
}
}
} catch (\Throwable) {
}
return null;
}
private function cleanPdfContent(string $text): string
{
if (class_exists(\Normalizer::class)) {
$text = \Normalizer::normalize($text, \Normalizer::FORM_C) ?? $text;
}
$text = str_replace(["\r\n", "\r"], "\n", $text);
$text = preg_replace('/[^\p{L}\p{N}\p{P}\p{Z}\n]/u', ' ', $text);
$text = preg_replace('/(?<=\p{L})\s+(?=\p{L})/u', ' ', $text);
$text = preg_replace('/\s+/u', ' ', $text);
return trim($text);
}
}