Help me make a parser

Bymas - Developer Forum

Жека 25.04.2010 / 12:09 Author
Пацак

I have 10,000 .htm files and need to extract the text from them.
I wrote a parser script in PHP, but I can't figure out what's going wrong.
If I process 10-20 files, everything is fine, but when I run it on the whole batch, all sorts of garbage creeps in and it creates files of 20-30 MB each.

<?php
@set_time_limit(0);
$mask = '*.htm'; // file name mask
$files = glob($mask);
foreach ($files as $file) {
    $file = file_get_contents($file);
    preg_match('|<b>Название:</b>.*htm">(.*)</a>|Uis', $file, $out);
    preg_match('|<b>Категория:</b>.*htm">(.*)</a>|Uis', $file, $out1);
    preg_match('|<div style="text-align: justify;">(.*)</div>|Uis', $file, $out2);
    $text = preg_replace('/<script.*<\/script>/si', '', $out2[1]);
    $text = str_replace('.', '. ', $text);
    $text = str_replace(',', ', ', $text);
    $text = str_replace('&nbsp;', '', $text);
    $txt .= $out[1]."<br/>\r\n";
    $txt .= 'Категория: '.$out1[1]."<br/>\r\n";
    $txt .= $text."<br/>\r\n";
    mkdir($out1[1]);
    $fp = fopen($out1[1].'/'.$out[1].'.txt', "w+");
    fputs($fp, $txt);
    fclose($fp);
}
?>

Edited by: Жека (25.04.2010 / 12:22)
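
A likely cause of the oversized files, judging from the script above: `$txt` is only ever appended to with `.=` and is never reset inside the loop, so each output file also contains the text of every file processed before it. Below is a minimal sketch of that effect. The function names and page titles are made up for illustration and are not part of the original script.

```php
<?php
// Hypothetical stand-in for the parsing step: returns the text for one page.
function page_text($title) {
    return $title."<br/>\r\n";
}

// Mirrors the loop above: $txt is created once and never reset, so it keeps growing.
function build_accumulating(array $titles) {
    $txt = '';
    $outputs = array();
    foreach ($titles as $title) {
        $txt .= page_text($title);
        $outputs[$title] = $txt; // what would be written to "$title.txt"
    }
    return $outputs;
}

// Same loop, but $txt starts fresh for every page.
function build_reset(array $titles) {
    $outputs = array();
    foreach ($titles as $title) {
        $txt = page_text($title);
        $outputs[$title] = $txt;
    }
    return $outputs;
}

$titles = array('a', 'b', 'c'); // made-up page titles
$bad = build_accumulating($titles);
$good = build_reset($titles);
echo strlen($bad['c']), ' vs ', strlen($good['c']), "\n"; // prints "24 vs 8"
```

With 10,000 pages the last file would hold the text of all 10,000, which is consistent with the 20-30 MB outputs described in the post.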

Жека 25.04.2010 / 12:22 Author
Пацак

That's it, I figured it out. Silly mistake on my part.

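<?php
@set_time_limit(0); // no time limit: the batch is large

// Extract the title, category and body text from one HTML page.
function un($html) {
    preg_match('|<b>Название:</b>.*htm">(.*)</a>|Uis', $html, $out);
    preg_match('|<b>Категория:</b>.*htm">(.*)</a>|Uis', $html, $out1);
    preg_match('|<div style="text-align: justify;">(.*)</div>|Uis', $html, $out2);
    if (!isset($out[1], $out1[1], $out2[1])) {
        return null; // the page does not match the expected layout
    }
    // Lazy match, so each <script> block is removed separately.
    $text = preg_replace('/<script.*?<\/script>/si', '', $out2[1]);
    $text = str_replace('.', '. ', $text);
    $text = str_replace(',', ', ', $text);
    $text = str_replace('&nbsp;', '', $text);
    // $txt is local to the function, so it starts empty for every file.
    $txt  = $out[1]."<br/>\r\n";
    $txt .= 'Категория: '.$out1[1]."<br/>\r\n";
    $txt .= $text."<br/>\r\n";
    return array('txt' => $txt, 'name' => $out[1], 'local' => $out1[1]);
}

$mask = '*.htm'; // file name mask
$files = glob($mask);
foreach ($files as $file) {
    $dan = un(file_get_contents($file));
    if ($dan === null) {
        continue; // skip pages that could not be parsed
    }
    $dir = 'new/'.$dan['local'];
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // also creates "new/" when it is missing
    }
    $fp = fopen($dir.'/'.$dan['name'].'.txt', "w");
    fputs($fp, $dan['txt']);
    fclose($fp);
}
?>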
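
One more detail worth checking in the script-stripping step: a greedy pattern such as `/<script.*<\/script>/si` removes everything from the first `<script` to the last `</script>`, including any text that sits between two script blocks, while a lazy `.*?` removes each block on its own. The sketch below shows the difference. The helper names and the sample HTML are hypothetical; only `preg_replace()` itself comes from the script.

```php
<?php
// Hypothetical helper: greedy match, spans from the first <script to the last </script>.
function strip_scripts_greedy($html) {
    return preg_replace('/<script.*<\/script>/si', '', $html);
}

// Hypothetical helper: lazy match, removes each <script>...</script> block separately.
function strip_scripts_lazy($html) {
    return preg_replace('/<script.*?<\/script>/si', '', $html);
}

// Made-up sample: body text sits between two script blocks.
$html = 'Intro <script>a()</script>Body text<script>b()</script> End';

echo strip_scripts_greedy($html), "\n"; // "Body text" is removed along with the scripts
echo strip_scripts_lazy($html), "\n";   // "Body text" is kept
```

Whether this matters depends on the pages: if the justified `<div>` never holds more than one script block, both patterns give the same result.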

Tony V 25.04.2010 / 12:34
Транклюкаторщик

This topic is closed for discussion!
